
Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

Mix It Up

In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," currently available as a preprint, the six Apple researchers start with GSM8K's standardized set of more than 8,000 grade-school-level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then modify those problems by swapping out names and numbers for new values, changes that leave the underlying arithmetic intact but prevent models from simply recalling the original questions. Across these GSM-Symbolic variants, measured accuracy dropped and fluctuated from run to run even though the math itself had not changed. The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but instead "attempt to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data."

Don't Get Distracted

Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. The picture changed when the researchers went a step further with a variant the paper calls GSM-NoOp, which appends seemingly relevant but ultimately inconsequential statements (red herrings) to each problem. Adding in these red herrings led to what the researchers termed "catastrophic performance drops" in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested.
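The underlying idea is simple to illustrate. Below is a minimal sketch, assuming a hypothetical problem template and helper names of my own (TEMPLATE, RED_HERRING, make_variant are not from the paper): it generates surface-level variants of a single word problem, recomputing the ground-truth answer from the template so scoring stays exact while names, numbers, and an optional irrelevant clause change.

```python
import random

# Illustrative sketch (not the paper's actual code): perturb a GSM8K-style
# word problem by swapping names and numbers while keeping the arithmetic
# structure fixed, optionally appending an irrelevant "red herring" clause.

TEMPLATE = (
    "{name} picks {base} apples on Monday and {extra} more on Tuesday. "
    "How many apples does {name} have in total?"
)
RED_HERRING = " {name}'s basket was bought five years ago."  # irrelevant detail

NAMES = ["Sophie", "Liam", "Mara", "Kenji"]


def make_variant(seed: int, add_red_herring: bool = False) -> tuple[str, int]:
    """Return a perturbed problem and its recomputed ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    base, extra = rng.randint(5, 50), rng.randint(5, 50)
    question = TEMPLATE.format(name=name, base=base, extra=extra)
    if add_red_herring:
        question += RED_HERRING.format(name=name)
    # The answer follows from the template's structure, not from any one wording.
    return question, base + extra


if __name__ == "__main__":
    for seed in range(3):
        q, a = make_variant(seed, add_red_herring=(seed == 2))
        print(f"{q}  [answer: {a}]")
```

A model that has genuinely learned the reasoning should answer every variant correctly; a model that is pattern-matching on familiar phrasing is far more likely to stumble when the surface details, or the distracting extra clause, change.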

Wired
