Measuring Model Performance

4don MSN

OpenAI says GPT-5 stacks up to humans in a wide range of jobs

A new test from OpenAI aims to understand how close AI is to outperforming humans at economically valuable work.

10d

Evaluations As A North Star For AI Companies

Sebastian Crossa is the Co-founder of ZeroEval (YC S25), a platform to measure and optimize the quality of AI agents.

21don MSN

Popular AI model performance benchmark may be flawed, Meta researchers warn

A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions on the veracity of ...

NextBigFuture

OpenAI Releases O3 Model With High Performance and High Cost

OpenaI o3 sets new records in several key areas, particularly in reasoning, coding and mathematical problem-solving. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task in ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results