ChatGPT exploded onto the scene late last year, dazzling people with its human-like conversational abilities, and the release of its latest version prompted a crypto rally and calls for a pause in development. But according to a new study, the leading AI bot’s skills may actually be on the decline.
Researchers at Stanford and UC Berkeley systematically analyzed different versions of ChatGPT from March and June 2023. They developed rigorous benchmarks to evaluate the model’s competency in math, coding, and visual reasoning tasks. The results showed ChatGPT’s performance slipping over time.
The tests revealed a startling drop-off in performance between versions. On a math challenge of determining prime numbers, ChatGPT solved 488 out of 500 questions correctly in March, an accuracy of 97.6%. However, in June, ChatGPT only managed to get 12 questions right, plunging to 2.4% accuracy.
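The percentages follow directly from the counts. As a minimal sketch of how a benchmark like this could be scored (the helper names `is_prime`, `accuracy_pct`, and `score` are illustrative and not taken from the study), each yes/no answer is compared with a ground-truth primality check:

```python
import math

def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division (fine for small inputs)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    # Only odd divisors up to the integer square root need testing.
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

def accuracy_pct(correct: int, total: int) -> float:
    """Share of correctly answered questions, as a percentage."""
    return 100 * correct / total

def score(answers: dict[int, bool]) -> float:
    """Score hypothetical model answers (number -> "is it prime?") against ground truth."""
    correct = sum(1 for n, said_prime in answers.items() if said_prime == is_prime(n))
    return accuracy_pct(correct, len(answers))

# Counts reported in the article: 488 of 500 in March, 12 of 500 in June.
print(accuracy_pct(488, 500))
print(accuracy_pct(12, 500))
```

The `answers` mapping stands in for whatever format the real evaluation used, which the article does not describe.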
The decline was also steep in the chatbot’s software coding abilities.
“For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June,” the study found. These results were obtained using the pure version of the models, meaning no code interpreter plugins were involved.
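The article does not spell out how the study judged whether a generation was "directly executable." As a loose proxy, and assuming the generated code is Python, one could check whether the raw output parses without any cleanup. The function names and the sample outputs below are hypothetical:

```python
def is_directly_executable(generation: str) -> bool:
    """Return True if the raw model output parses as Python with no cleanup.

    Assumption: parsing is only a proxy; real evaluations may also run the code.
    """
    try:
        compile(generation, "<generation>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False

def executable_pct(generations: list[str]) -> float:
    """Percentage of generations that pass the parse check."""
    ok = sum(is_directly_executable(g) for g in generations)
    return 100 * ok / len(generations)

# Hypothetical outputs: the second wraps the code in extra prose,
# so it cannot be run as-is even though the function itself is fine.
samples = [
    "def add(a, b):\n    return a + b\n",
    "Here is the function:\ndef add(a, b):\n    return a + b\n",
]
print(executable_pct(samples))
```

A check like this counts any surrounding commentary as a failure, which is one way verbose answers can lower an executability score without the underlying code being wrong.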
To assess reasoning, the researchers used visual prompts from the Abstract Reasoning Corpus (ARC) dataset. Here too a decline was observable, though a less steep one. “GPT-4 in June made mistakes on queries on which it was correct for in March,” the study reads.
What could explain ChatGPT’s apparent downgrade after just a few months? Researchers hypothesize it may be a side effect of optimizations being made by OpenAI, its creator.
One possible cause is changes introduced to prevent ChatGPT from answering dangerous questions. This safety alignment could impair ChatGPT’s usefulness for other tasks, though. The researchers found the model now tends to give verbose, indirect responses instead of clear answers.
“GPT-4 is getting worse over time, not better,” said AI expert Santiago Valderrama on Twitter. Valderrama also raised the possibility that a “cheaper and faster” mixture of models may have replaced the original ChatGPT architecture.
“Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run,” he hypothesized, adding that such a setup could speed up responses for users but reduce competency.
There are hundreds (maybe thousands already?) of replies from people saying they have noticed the degradation in quality.
Browse the comments, and you’ll read about many situations where GPT-4 is not working as before.
— Santiago (@svpino) July 19, 2023
Another expert, Dr. Jim Fan, also shared his insights in a Twitter thread.
“Unfortunately, more safety typically comes at the cost of less usefulness,” he wrote, saying he was trying to make sense of the results by linking them to the way OpenAI fine-tunes its models. “My guess (no evidence, just speculation) is that OpenAI spent the majority of efforts doing lobotomy from March to June, and didn’t have time to fully recover the other capabilities that matter.”
Fan argues that other factors may have come into play, namely cost-cutting efforts, the introduction of warnings and disclaimers that may “dumb down” the model, and the lack of broader feedback from the community.
While more comprehensive testing is warranted, the findings align with users’ expressed frustrations over declining coherence in ChatGPT’s once eloquent outputs.
How can we prevent further deterioration? Some enthusiasts advocate for open-source models like Meta’s LLaMA (which has just been updated) that enable community debugging. Continuous benchmarking is also crucial for catching regressions early.
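As a rough illustration of what catching a regression could look like, the sketch below compares per-task scores from two snapshots and flags large drops. The function name, the task labels, and the 5-point threshold are assumptions made for this example; the scores are the March and June figures quoted above.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 5.0,
) -> dict[str, float]:
    """Return tasks whose score fell by more than max_drop percentage points."""
    drops = {}
    for task, old_score in baseline.items():
        if task not in current:
            continue  # task was not re-run; nothing to compare
        drop = old_score - current[task]
        if drop > max_drop:
            drops[task] = round(drop, 1)
    return drops

# Figures reported in the article (percent).
march = {"prime_identification": 97.6, "directly_executable_code": 52.0}
june = {"prime_identification": 2.4, "directly_executable_code": 10.0}

print(find_regressions(march, june))
```

Run on a schedule against a fixed question set, a comparison like this would surface a drop soon after a model update instead of months later.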
For now, ChatGPT fans may need to temper their expectations. The wild idea-generating machine many first encountered appears tamer—and perhaps less brilliant. But age-related decline appears to be inevitable, even for AI celebrities.