DEV Community
#benchmarks
Posts
Turn the camera away, and the AI's world freezes
Breach Protocol
Jul 2
#worldmodels #videogeneration #robotics #benchmarks
3 min read
Reliable, and still wrong
Breach Protocol
Jul 1
#evaluation #llmasjudge #benchmarks
3 min read
Put AI agents in charge of a Civilization game and they reach for the nukes
Breach Protocol
Jul 1
#agents #alignment #safety #benchmarks
3 min read
We Benchmarked BrassCoders Against a Frontier Model
CopperSunDev
Jun 25
#benchmarks #opensource #codereview
5 min read
Claude Fable 5 Scores 95% on SWE-bench, Then Hands Off to Opus 4.8
Peremptory
Jun 12
#anthropic #claude #benchmarks #safety
3 min read
An LLM benchmark is only useful for as long as it's hard
Arthur
Jun 11
#llm #evaluation #benchmarks #humaneval
2 reactions
10 min read
An AMD GPU Beat My Mac on Llama 8B. The Same GPU Lost on Phi-3.
Rob
Jun 2
#performance #benchmarks #machinelearning #gpu
5 min read
NLTK vs Compiled Regex: Tokenizing 100 MB of Text in .NET
Milliseconds.dev
Jun 2
#dotnet #csharp #performance #benchmarks
3 min read
Why AI Benchmarks Fail Real Hermes Agent Workflows
cucoleadan
Jun 3
#agents #benchmarks #workflows #routing
10 min read
pypdf vs PdfPig: Text Extraction at Scale
Milliseconds.dev
May 31
#dotnet #csharp #performance #benchmarks
2 min read
NetworkX vs CSR + TensorPrimitives: PageRank on 28M Edges
Milliseconds.dev
May 31
#dotnet #csharp #performance #benchmarks
3 min read
Cross-Machine Memory Query: About 20 Milliseconds, Most Days
Rob
Jun 3
#performance #benchmarks #machinelearning #wireguard
1 reaction
9 min read
Single-Prompt Safety Scores Are Measuring the Wrong Thing
Peremptory
May 29
#safety #benchmarks #redteaming #security
3 min read
textdistance vs ArrayPool: Edit Distance Without the Allocations
Milliseconds.dev
May 30
#dotnet #csharp #performance #benchmarks
1 reaction
1 comment
3 min read
Four Chinese Labs Rewrote the Open-Weights Leaderboard in 18 Days
Peremptory
May 22
#openweights #chineseai #benchmarks #codingmodels
3 min read
We're a place where coders share, stay up-to-date and grow their careers.