AI has a stupid secret: we’re still not sure how to test for human levels of intelligence
Two of San Francisco’s leading players in artificial intelligence have challenged the public to come up with questions capable of testing the capabilities of large language models (LLMs) like Google Gemini and OpenAI’s o1. Scale AI, which specialises in preparing the vast tracts of data on which the LLMs are trained, teamed up with the Center for AI Safety (CAIS) to launch the initiative, Humanity’s Last Exam.
Featuring prizes of US$5,000 (£3,800) for those who come up with the top 50 questions selected for the test, Scale and CAIS say the goal is to test how close we are to achieving “expert-level AI systems” using the “largest, broadest coalition of experts in history”.
Why do this? The leading LLMs are already acing many established tests in intelligence, mathematics and law, but it’s hard to be sure how meaningful this is. In many cases, they may have pre-learned the answers due to the gargantuan quantities of data on which they are trained, including a significant percentage of everything on the internet.
Data is fundamental to this whole area. It is behind the paradigm shift from conventional computing to AI, from “telling” to “showing” these machines what to do. This requires good training datasets, but also good tests. Developers typically test their models using data that hasn’t already been used for training, known in the jargon as “test datasets”.
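To make the idea of a held-out test dataset concrete, here is a minimal Python sketch. The function name, the 80/20 split and the toy list of questions are illustrative assumptions, not any developer's actual pipeline.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold back a fraction that is never used for training."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    # Everything after the first n_test items is for training; the rest is held out.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data standing in for exam questions.
questions = [f"question {i}" for i in range(10)]
train, test = split_train_test(questions)

# The test only means something if no test item leaked into the training data.
assert set(train).isdisjoint(test)
print(len(train), "training examples;", len(test), "held-out test examples")
```

The worry described in this article is that, for models trained on much of the internet, the disjointness check in the last lines can no longer be taken for granted.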
If LLMs are not already able to pre-learn the answer to established tests like bar exams, they probably will soon. The AI analytics site Epoch estimates that 2028 will mark the point at which the AIs will effectively have read everything ever written by humans. An equally important challenge is how to keep assessing AIs once that Rubicon has been crossed.
Of course, the internet is expanding all the time, with millions of new items being added daily. Could that take care of these problems?
Perhaps, but this bleeds into another insidious difficulty, referred to as “model collapse”. As the internet becomes increasingly flooded by AI-generated material which recirculates into future AI training sets, this may cause AIs to perform increasingly poorly. To overcome this problem, many developers are already collecting data from their AIs’ human interactions, adding fresh data for training and testing.
Some specialists argue that AIs also need to become “embodied”: moving around in the real world and acquiring their own experiences, as humans do. This might sound far-fetched until you realise that Tesla has been doing it for years with its cars. Another opportunity is human wearables, such as Meta’s popular smart glasses by Ray-Ban. These are equipped with cameras and microphones, and can be used to collect vast quantities of human-centric video and audio data.
Narrow tests
Yet even if such products guarantee enough training data in future, there is still the conundrum of how to define and measure intelligence – particularly artificial general intelligence (AGI), meaning an AI that equals or surpasses human intelligence.
Traditional human IQ tests have long been controversial for failing to capture the multifaceted nature of intelligence, encompassing everything from language to mathematics to empathy to sense of direction.
There’s an analogous problem with the tests used on AIs. There are many well-established tests covering such tasks as summarising text, understanding it, drawing correct inferences from information, recognising human poses and gestures, and machine vision.
Some tests are being retired, usually because the AIs are doing so well at them, but these tests are so task-specific as to be very narrow measures of intelligence. For instance, the chess-playing AI Stockfish is way ahead of Magnus Carlsen, the highest-scoring human player of all time, on the Elo rating system. Yet Stockfish is incapable of doing other tasks such as understanding language. Clearly it would be wrong to conflate its chess capabilities with broader intelligence.
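For readers unfamiliar with Elo, the system turns a rating gap into an expected score using a standard logistic formula. The short Python sketch below shows that formula. The two ratings are hypothetical round numbers chosen only to illustrate a large gap, not the actual ratings of Stockfish or Carlsen.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score for player A against player B (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical ratings, for illustration only.
human_rating = 2800
engine_rating = 3500

share = elo_expected_score(human_rating, engine_rating)
print(f"Expected score for the lower-rated player: {share:.3f}")
```

With a gap of this size, the lower-rated player would be expected to take under 2% of the available points. That is a very precise measure of chess strength, and of nothing else.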
But with AIs now demonstrating broader intelligent behaviour, the challenge is to devise new benchmarks for comparing and measuring their progress. One notable approach has come from French Google engineer François Chollet. He argues that true intelligence lies in the ability to adapt and generalise learning to new, unseen situations. In 2019, he came up with the “abstraction and reasoning corpus” (ARC), a collection of puzzles in the form of simple visual grids designed to test an AI’s ability to infer and apply abstract rules.
Unlike previous benchmarks that test visual object recognition by training an AI on millions of images, each with information about the objects contained, ARC gives it minimal examples in advance. The AI has to figure out the puzzle logic and can’t just learn all the possible answers.
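The flavour of such a puzzle can be sketched in a few lines of Python. This is a toy illustration only: the grids, the four candidate rules and the function names are assumptions made up for this example, and it is neither the real ARC data format nor any competitor's solver. Real ARC puzzles draw on a far larger and open-ended space of rules, which is what makes them hard for machines.

```python
# Toy illustration only: a tiny, hand-picked set of rules a solver might consider.
CANDIDATE_RULES = {
    "identity": lambda g: [row[:] for row in g],
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def infer_rule(demonstrations):
    """Return the name of the first rule consistent with every (input, output) pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in demonstrations):
            return name
    return None  # none of the known rules explains the demonstrations

# Two worked examples are all the solver gets to see in advance.
demos = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]
rule_name = infer_rule(demos)

# Apply the inferred rule to a grid the solver has never seen.
test_grid = [[5, 0, 0], [0, 7, 0]]
prediction = CANDIDATE_RULES[rule_name](test_grid)
print(rule_name, prediction)
```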
Though the ARC tests aren’t particularly difficult for humans to solve, there’s a prize of US$600,000 for the first AI system to reach a score of 85%. At the time of writing, we’re a long way from that point. Two recent leading LLMs, OpenAI’s o1 preview and Anthropic’s Sonnet 3.5, both score 21% on the ARC public leaderboard (known as the ARC-AGI-Pub).
Another recent attempt using OpenAI’s GPT-4o scored 50%, but somewhat controversially because the approach generated thousands of possible solutions before choosing the one that gave the best answer for the test. Even then, this was still reassuringly far from triggering the prize – or matching human performance of over 90%.
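The general "generate many, keep the best" pattern can be sketched as follows. This is a heavily simplified Python illustration and not the method used in that attempt. The candidates here are random guesses at a simple arithmetic rule rather than programs written by an LLM, and scoring them against worked examples is an assumption about how such a selection might be done.

```python
import random

def sample_and_select(examples, n_samples=2000, seed=0):
    """Sample many candidate rules y = a*x + b and keep the one fitting most examples."""
    rng = random.Random(seed)
    candidates = [(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n_samples)]

    def score(candidate):
        a, b = candidate
        # Count how many worked examples this candidate reproduces exactly.
        return sum(a * x + b == y for x, y in examples)

    return max(candidates, key=score)

# Hypothetical worked examples generated by the hidden rule y = 2x + 3.
examples = [(1, 5), (2, 7), (3, 9)]
best = sample_and_select(examples)
print("Selected rule: y = %d*x + %d" % best)
```

The controversy is easy to see in this form: most of the work is done by brute-force sampling and filtering, which says little about whether any single guess reflected understanding.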
While ARC remains one of the most credible attempts to test for genuine intelligence in AI today, the Scale/CAIS initiative shows that the search continues for compelling alternatives. (Fascinatingly, we may never see some of the prize-winning questions. They won’t be published on the internet, to ensure the AIs don’t get a peek at the exam papers.)
We need to know when machines are getting close to human-level reasoning, with all the safety, ethical and moral questions this raises. At that point, we’ll presumably be left with an even harder exam question: how to test for a superintelligence. That’s an even more mind-bending task that we need to figure out.