How Good Are New GPT-OSS Models? We Put Them to the Test.

OpenAI hasn’t released an open-weight language model since GPT-2 back in 2019. Six years later, they surprised everyone with two: gpt-oss-120b and the smaller gpt-oss-20b.

Naturally, we wanted to know: how do they actually perform?

To find out, we ran both models through our open-source workflow optimization framework, syftr. It evaluates models across different configurations (fast vs. cheap, high vs. low accuracy) and includes support for OpenAI’s new “thinking effort” setting.

In theory, more thinking should mean better answers. In practice? Not always.

We also use syftr to explore questions like “is LLM-as-a-Judge actually working?” and “what workflows perform well across many datasets?”.

Our first results with GPT-OSS might surprise you: the best performer wasn’t the biggest model or the deepest thinker.

Instead, the 20b model with low thinking effort consistently landed on the Pareto frontier, even rivaling the 120b medium configuration on benchmarks like FinanceBench, HotpotQA, and MultihopRAG. Meanwhile, high thinking effort rarely mattered at all.

How we set up our experiments

We didn’t just pit GPT-OSS against itself. Instead, we wanted to see how it stacked up against other strong open-weight models. So we compared gpt-oss-20b and gpt-oss-120b with:

qwen3-235b-a22b
glm-4.5-air
nemotron-super-49b
qwen3-30b-a3b
gemma3-27b-it
phi-4-multimodal-instruct

To test OpenAI’s new “thinking effort” feature, we ran each GPT-OSS model in three modes: low, medium, and high thinking effort. That gave us six configurations in total:

gpt-oss-120b-low / -medium / -high
gpt-oss-20b-low / -medium / -high
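
As a quick illustration of how the six configurations arise, the sketch below crosses the two model names with the three effort levels. The naming scheme follows the list above; the function name is our own and is not part of syftr.

```python
from itertools import product

# Model names and effort levels as listed in the article.
MODELS = ["gpt-oss-120b", "gpt-oss-20b"]
EFFORTS = ["low", "medium", "high"]


def gpt_oss_configurations():
    """Hypothetical helper: pair every model with every thinking-effort level."""
    return [f"{model}-{effort}" for model, effort in product(MODELS, EFFORTS)]


configs = gpt_oss_configurations()
for name in configs:
    print(name)
```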

For evaluation, we cast a wide net: 5 RAG and agent modes, 16 embedding models, and a range of flow configuration options. To judge model responses, we used GPT-4o-mini and compared answers against known ground truth.

Finally, we tested across four datasets:

FinanceBench (financial reasoning)
HotpotQA (multi-hop QA)
MultihopRAG (retrieval-augmented reasoning)
PhantomWiki (synthetic Q&A pairs)

We optimized workflows twice: once for accuracy + latency, and once for accuracy + cost, capturing the tradeoffs that matter most in real-world deployments.
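
For readers unfamiliar with the term, a workflow is on the Pareto frontier when no other workflow is at least as accurate and at least as cheap (or fast) while being strictly better on one of the two. The sketch below shows that dominance check in plain Python. It is not syftr's implementation, and the workflow names and numbers are made up purely for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better (could equally be latency)


def pareto_frontier(results):
    """Return the results that no other result dominates, sorted by cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)


# Illustrative, made-up numbers; not results from the experiments.
results = [
    Result("flow-a", 0.50, 1.0),
    Result("flow-b", 0.55, 3.0),
    Result("flow-c", 0.45, 2.0),  # dominated by flow-a: less accurate and more expensive
    Result("flow-d", 0.60, 9.0),
]
frontier = pareto_frontier(results)
print([r.name for r in frontier])
```

Running the optimization once with latency and once with cost as the second objective simply means building this frontier twice with a different value in the `cost` field.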

Optimizing for latency, cost, and accuracy

When we optimized the GPT-OSS models, we looked at two tradeoffs: accuracy vs. latency and accuracy vs. cost. The results were more surprising than we expected:

GPT-OSS 20b (low thinking effort):
Fast, inexpensive, and consistently accurate. This setup appeared on the Pareto frontier repeatedly, making it the best default choice for most non-scientific tasks. In practice, that means quicker responses and lower bills compared to higher thinking efforts.
GPT-OSS 120b (medium thinking effort):
Best suited for tasks that demand deeper reasoning, like financial benchmarks. Use this when accuracy on complex problems matters more than cost.
GPT-OSS 120b (high thinking effort):
Expensive and usually unnecessary. Keep it in your back pocket for edge cases where other models fall short. For our benchmarks, it didn’t add value.

Figure 1: Accuracy-latency optimization with syftr

Figure 2: Accuracy-cost optimization with syftr

Reading the results more carefully

At first glance, the results look straightforward. But there’s an important nuance: an LLM’s top accuracy score depends not just on the model itself, but on how the optimizer weighs it against other models in the mix. To illustrate, let’s look at FinanceBench.

When optimizing for latency, all GPT-OSS models (except high thinking effort) landed with similar Pareto frontiers. In this case, the optimizer had little reason to focus on the 20b low thinking configuration; its top accuracy was only 51%.

Figure 3: Per-LLM Pareto frontiers for latency optimization on FinanceBench

When optimizing for cost, the picture shifts dramatically. The same 20b low thinking configuration jumps to 57% accuracy, while the 120b medium configuration actually drops 22%. Why? Because the 20b model is far cheaper, so the optimizer shifts more weight toward it.

Figure 4: Per-LLM Pareto frontiers for cost optimization on FinanceBench

The takeaway: Performance depends on context. Optimizers will favor different models depending on whether you’re prioritizing speed, cost, or accuracy. And given the huge search space of possible configurations, there may be even better setups beyond the ones we tested.

Finding agentic workflows that work well for your setup

The new GPT-OSS models performed strongly in our tests, especially the 20b with low thinking effort, which often outpaced more expensive competitors. The bigger lesson? More model and more effort doesn’t always mean more accuracy. Sometimes, paying more just gets you less.

This is exactly why we built syftr and made it open-source. Every use case is different, and the best workflow for you depends on the tradeoffs you care about most. Want lower costs? Faster responses? Maximum accuracy?

Run your own experiments and find the Pareto sweet spot that balances those priorities for your setup.
