
Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

Mix It Up

In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," currently available as a preprint, the six Apple researchers start with GSM8K's standardized set of more than 8,000 grade-school-level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then modify those problems by swapping out names and numbers for new values, changes that leave the underlying arithmetic intact but prevent models from simply recalling the original questions. Across these GSM-Symbolic variants, measured accuracy dropped and fluctuated from run to run even though the math itself had not changed. The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but instead "attempt to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data."

Don't Get Distracted

Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. The picture changed when the researchers went a step further with a variant the paper calls GSM-NoOp, which appends seemingly relevant but ultimately inconsequential statements (red herrings) to each problem. Adding in these red herrings led to what the researchers termed "catastrophic performance drops" in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested.
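The underlying idea is simple to illustrate. Below is a minimal sketch, assuming a hypothetical problem template and helper names of my own (TEMPLATE, RED_HERRING, make_variant are not from the paper): it generates surface-level variants of a single word problem, recomputing the ground-truth answer from the template so scoring stays exact while names, numbers, and an optional irrelevant clause change.

```python
import random

# Illustrative sketch (not the paper's actual code): perturb a GSM8K-style
# word problem by swapping names and numbers while keeping the arithmetic
# structure fixed, optionally appending an irrelevant "red herring" clause.

TEMPLATE = (
    "{name} picks {base} apples on Monday and {extra} more on Tuesday. "
    "How many apples does {name} have in total?"
)
RED_HERRING = " {name}'s basket was bought five years ago."  # irrelevant detail

NAMES = ["Sophie", "Liam", "Mara", "Kenji"]


def make_variant(seed: int, add_red_herring: bool = False) -> tuple[str, int]:
    """Return a perturbed problem and its recomputed ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    base, extra = rng.randint(5, 50), rng.randint(5, 50)
    question = TEMPLATE.format(name=name, base=base, extra=extra)
    if add_red_herring:
        question += RED_HERRING.format(name=name)
    # The answer follows from the template's structure, not from any one wording.
    return question, base + extra


if __name__ == "__main__":
    for seed in range(3):
        q, a = make_variant(seed, add_red_herring=(seed == 2))
        print(f"{q}  [answer: {a}]")
```

A model that has genuinely learned the reasoning should answer every variant correctly; a model that is pattern-matching on familiar phrasing is far more likely to stumble when the surface details, or the distracting extra clause, change.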

Wired
