Benchmarks

Breach Protocol

Jul 2

Turn the camera away, and the AI's world freezes

#worldmodels #videogeneration #robotics #benchmarks

3 min read

Breach Protocol

Jul 1

Reliable, and still wrong

#evaluation #llmasjudge #benchmarks

3 min read

Breach Protocol

Jul 1

Put AI agents in charge of a Civilization game and they reach for the nukes

#agents #alignment #safety #benchmarks

3 min read

CopperSunDev

Jun 25

We Benchmarked BrassCoders Against a Frontier Model

#benchmarks #opensource #codereview

5 min read

Peremptory

Jun 12

Claude Fable 5 Scores 95% on SWE-bench, Then Hands Off to Opus 4.8

#anthropic #claude #benchmarks #safety

3 min read

Arthur

Jun 11

An LLM benchmark is only useful for as long as it's hard

#llm #evaluation #benchmarks #humaneval

10 min read

Rob

Jun 2

An AMD GPU Beat My Mac on Llama 8B. The Same GPU Lost on Phi-3.

#performance #benchmarks #machinelearning #gpu

5 min read

Milliseconds.dev

Jun 2

NLTK vs Compiled Regex: Tokenizing 100 MB of Text in .NET

#dotnet #csharp #performance #benchmarks

3 min read

cucoleadan

Jun 3

Why AI Benchmarks Fail Real Hermes Agent Workflows

#agents #benchmarks #workflows #routing

10 min read

Milliseconds.dev

May 31

pypdf vs PdfPig: Text Extraction at Scale

#dotnet #csharp #performance #benchmarks

2 min read

Milliseconds.dev

May 31

NetworkX vs CSR + TensorPrimitives: PageRank on 28M Edges

#dotnet #csharp #performance #benchmarks

3 min read

Rob

Jun 3

Cross-Machine Memory Query: About 20 Milliseconds, Most Days

#performance #benchmarks #machinelearning #wireguard

9 min read

Peremptory

May 29

Single-Prompt Safety Scores Are Measuring the Wrong Thing

#safety #benchmarks #redteaming #security

3 min read

Milliseconds.dev

May 30

textdistance vs ArrayPool: Edit Distance Without the Allocations

#dotnet #csharp #performance #benchmarks

3 min read

Peremptory

May 22

Four Chinese Labs Rewrote the Open-Weights Leaderboard in 18 Days

#openweights #chineseai #benchmarks #codingmodels

3 min read

DEV Community
