Video Coding Benchmarks

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and benchmark leakage.

i-SCOOP

Kimi K2.7 Code, the open-weight coding model that thinks 30% less

A deep dive into Kimi K2.7 Code from Moonshot AI: architecture, benchmarks, pricing, and how to put its 256K context and ...

GIGAZINE

DeepSWE is a benchmark designed to prevent coding AIs from cheating, allowing more accurate measurement of their programming performance.

In recent years, it has become common for developers to use coding AI in software development, and various benchmarks exist to measure the performance of coding AI. Now, a new benchmark called ...

Morning Overview on MSN

Microsoft’s new MAI-Code tool turns plain-English descriptions into working app code

Microsoft has introduced MAI-Code, a tool designed to convert plain-English descriptions into functional application code.

Geeky Gadgets

Anthropic Claude Opus 4.5 Tops Coding Benchmarks While Slashing Token Use

What if the future of coding wasn’t human, but instead powered by an AI so advanced it could outpace even the most skilled developers? Enter Claude Opus 4.5, a model that doesn’t just assist with ...

Bleeping Computer

Grok 4 benchmark results: Tops math, ranks second in coding

Grok 4 is a huge leap from Grok 3, but how good is it compared to other models in the market, such as Gemini 2.5 Pro? We now have answers, thanks to new independent benchmarks. LMArena.ai, which is an ...
