Stereophile Benchmark Ahb2

We offer two splits for each dataset: Dev and Test. The multi-turn interaction requires an LLM to generate around 4k and 13k times, respectively. Here are the test set (standard) results of ...
