Agent Benchmarking - Search News

Morning Overview on MSN

The newest Anthropic model just took the top spot on the Super-Agent benchmark — the only AI to finish every test case end-to-end and beat OpenAI’s GPT-5.5

Anthropic’s latest AI model has reportedly reached the top of the Super-Agent benchmark, a grueling test of whether an AI system can take a real-world code repository and run it from scratch without ...

Decrypt

Huawei's New Benchmark Gives AI Agents Months of Your Life—Then Watches Them Fail

Claw-Anything simulates a real digital existence and asks AI assistants to handle it. GPT-5.5, the best model available, scored 34.5%.

Tech Times

AI Agent Safety: Benchmark Finds None of 13 Agents Cleared 40% Safe Completion

AI agent safety benchmark BeSafe-Bench tested 13 production-grade agents and found none could complete 40% of tasks while ...

MiniMax-M3 debuts, eclipsing GPT-5.5 and Gemini 3.1 Pro on key benchmark performance for just 5-10% of the cost

M3 demonstrates that the next phase of agent development will not just be driven by larger datasets, but by efficient ...

25d

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing ...

12d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and ...

Tech Times

MiniMax M3 Open-Weight Coding Model: Frontier Claims, Unverified Benchmarks

MiniMax M3 launched June 1, 2026 with a 1-million-token context window and company-reported SWE-Bench Pro scores that edge ...

InfoWorld

Researchers reveal flaws in AI agent benchmarking

As agents using artificial intelligence have wormed their way into the mainstream for everything from customer service to fixing software code, it’s increasingly important to determine which are the ...

Yahoo Finance

UiPath Screen Agent Powered by Claude Opus 4.5 Receives Top Ranking on OSWorld-Verified Benchmark for Agentic Automation

8h on MSN

'We may be flying blind': AWS wants to fix the problem of AI agents straying off task

A paper from Amazon Web Services warns that unsupervised agents tend to reason themselves into trouble.
