Did xAI lie about Grok 3’s benchmarks?


Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.

This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babuschkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI's blog, the company published a chart showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.

xAI's chart showed two Grok 3 variants, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's chart omitted o3-mini-high's AIME 2025 score at "cons@64."

What is cons@64, you might ask? It's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark, then takes the most frequently generated answers as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores considerably, and omitting it from a chart can make it look as though one model surpasses another when in reality that isn't the case.
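To make the difference between the two scoring rules concrete, here is a minimal sketch in Python. The `model` callable, problem list, and answer key are placeholder assumptions for illustration, not anything taken from xAI's or OpenAI's actual evaluation code:

```python
from collections import Counter

# Hypothetical setup: `model` is any callable that returns an answer
# string for a problem; `problems` and `answers` stand in for a
# benchmark like AIME 2025.

def score_at_1(model, problems, answers):
    """@1: grade only the first answer the model produces per problem."""
    correct = sum(model(p) == a for p, a in zip(problems, answers))
    return correct / len(problems)

def score_cons_at_k(model, problems, answers, k=64):
    """cons@k: sample k answers per problem and grade the majority vote."""
    correct = 0
    for p, a in zip(problems, answers):
        samples = [model(p) for _ in range(k)]            # k independent tries
        majority, _ = Counter(samples).most_common(1)[0]  # most frequent answer
        correct += (majority == a)
    return correct / len(problems)
```

Because majority voting filters out a model's occasional wrong answers, cons@64 will almost always score at or above @1 on the same benchmark, which is why plotting one model's cons@64 against another's @1 is misleading.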

Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's AIME 2025 scores at "@1" (meaning the first answer the models produced on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as "the smartest AI in the world."

Babuschkin argued on X that OpenAI has published similarly misleading charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a "more accurate" chart showing nearly every model's performance at cons@64:

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost each model incurred to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.
