⬡ Measured · Hash-Sealed · AIIT-DISC-0001

Same answer.
Forty times less.

We asked Buddy and DeepSeek-R1 the same simple question — on the identical 14B base model, the same machine, the same 50 prompts. Both got it right. One carried far less waste.

"What is the capital of France?"

Buddy 7 tokens · 2.0s → "Paris."

R1 279 tokens · 7.3s → "Paris."

Both got "Paris." — same answer. R1 spent ~40× the tokens and ~3.6× the time to say it. Each = one token.

The scoreboard · 50 frozen prompts

Not theory. Measurement.

Measured	Buddy-14B	R1-Distill-14B	What it means
Mean latency	3.8 s	17.4 s	~4.6× faster
Worst-case (p95)	7.4 s	38.3 s	~5.2× faster
Tokens / answer	59	621	~10× fewer (mean); ~40× on a simple answer
Accuracy*	95%	85%	equal-or-better (*small n — see note)
Sycophancy (folds under pressure)	~8%	~6%	tie — tiny n, both hold
Reasoning depth	not optimized	deeper, by design	likely R1's edge — not benchmarked
Base model	Qwen2.5-14B	Qwen2.5-14B	identical — every gap is what each team added

*Accuracy: first-try on keyed science MC — Buddy 38/40, R1 17/20. R1's run was capped at n=20 because its per-answer latency (5–40 s) made longer runs costly — not a sample-bias choice. Sycophancy denominators are tiny (3/38 vs 1/17) → "tie within noise," not a robust win.

Three lenses · one fact

The same efficiency, three ways to see it

Brevity is a resource advantage — and on real hardware it compounds.

Tokens

~10× / ~40×

aggregate / per simple answer — for engineers

Water-drops 💧

7 vs 279

if each token were a drop — intuitive

Joules & °C 🌡️

14 J vs 2,567 J

+10°C vs +27°C, measured on the GPU

GPU energy is one measurement, board-watts only (not whole-datacenter cooling/water); consumer single-inference. Directional magnitude, not a precise constant.

The asymmetry

One person. One consumer GPU.

Buddy was built by one person over 57 active development days — ~355 wall-clock hours, ~142 verified GPU-training hours, a consumer RTX 3090-class workstation, ~20 TB of Starlink-routed data, limited cloud bursts. Capital basis ~$5,525–5,770; direct operating cost ~$784–884.

The model we benchmarked — R1-Distill-Qwen-14B — is the distilled student of a far larger teacher. By DeepSeek's own disclosures, R1's final reasoning run cost $294,000 (512× H800, 80 hours, peer-reviewed in Nature), built atop the ~$5.6M DeepSeek-V3 base (V3 technical report) — and even that excludes prior research. Independent analysis (SemiAnalysis — an estimate, not a DeepSeek figure) puts the parent's GPU infrastructure near $1.6 billion in servers. Those are the teacher's costs, not the 14B distill's — we cite the disclosed numbers and label the estimate as one.

Even measured against the smallest honest figure — R1's $294K final run alone — Buddy's ~$800 operating cost is ~350× less. That is the point: frontier results don't require frontier budgets — they require the right architecture.

⬡ Provenance

Every number here is tied to a measured run or the build log; the frontier-lab side is reported-only. The full report and all raw data are hash-sealed — entry AIIT-DISC-0001 in the AIIT Discovery Ledger, evidence sha256 c76adc8c….

Here's the technical details → Download the sealed PDF →

buddy's in the shop — in-house training right now.

peek at the shop →

Same answer.Forty times less.

Not theory. Measurement.

The same efficiency, three ways to see it

One person. One consumer GPU.

Same answer.
Forty times less.