Large Language Models Benchmarks

MiniMax-M3 debuts, eclipsing GPT-5.5 and Gemini 3.1 Pro on key benchmark performance for just 5-10% of the cost

M3 demonstrates that the next phase of agent development will not just be driven by larger datasets, but by efficient ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

ITWeb on MSN

The 400ms benchmark: Why infrastructure is the real hurdle for SA AI bots to overcome

By Bruce von Maltitz, CEO of 1Stream. Issued by 1stream. Johannesburg, 05 Jun 2026. One of the most critical technical ...

ascopubs.org

RadOncRAG: A Novel Retrieval-Augmented Generation Framework Improves Large Language Model Benchmark Performance in Radiation Oncology

Large language models (LLMs) show promise in assisting knowledge-intensive fields such as oncology, where up-to-date information and multidisciplinary expertise are critical. Traditional LLMs risk ...

SiliconANGLE

Elon Musk’s xAI sets AI benchmark records with new reasoning-optimized Grok 4 model

Elon Musk’s xAI Holdings Corp. has debuted a new large language model, Grok 4, that’s optimized for reasoning tasks such as generating code. The LLM’s late Wednesday launch followed a turbulent week ...

AI's Web3 Reality Check: New Benchmark Finds Leading Models Fall Short in Blockchain's Most Critical Use Cases

SINGAPORE, SG / ACCESS Newswire / June 1, 2026 / Artificial intelligence has rapidly become the technology industry's ...

Geeky Gadgets

How to Build Custom LLM Benchmarks for Your AI Applications

Have you ever wondered why off-the-shelf large language models (LLMs) sometimes fall short of delivering the precision or context you need for your specific application? Whether you’re working in a ...

Yahoo Finance

MITRE and FAA Introduce Novel Aerospace Large Language Model Evaluation Benchmark

MCLEAN, Va., September 17, 2025--(BUSINESS WIRE)--The Federal Aviation Administration (FAA) and MITRE are introducing a new benchmark to enable the evaluation and assessment of large language models ...
