Same answer.
Forty times less.
We asked Buddy and DeepSeek-R1 the same simple question — on the identical 14B base model, the same machine, the same 50 prompts. Both got it right. One carried far less waste.
Not theory. Measurement.
| Measured | Buddy-14B | R1-Distill-14B | What it means |
|---|---|---|---|
| Mean latency | 3.8 s | 17.4 s | ~4.6× faster |
| Worst-case (p95) | 7.4 s | 38.3 s | ~5.2× faster |
| Tokens / answer | 59 | 621 | ~10× fewer (mean); ~40× on a simple answer |
| Accuracy* | 95% | 85% | equal-or-better (*small n — see note) |
| Sycophancy (folds under pressure) | ~8% | ~6% | tie — tiny n, both hold |
| Reasoning depth | not optimized | deeper, by design | likely R1's edge — not benchmarked |
| Base model | Qwen2.5-14B | Qwen2.5-14B | identical — every gap is what each team added |
*Accuracy: first-try on keyed science MC — Buddy 38/40, R1 17/20. R1's run was capped at n=20 because its per-answer latency (5–40 s) made longer runs costly — not a sample-bias choice. Sycophancy denominators are tiny (3/38 vs 1/17) → "tie within noise," not a robust win.
The same efficiency, three ways to see it
Brevity is a resource advantage — and on real hardware it compounds.
GPU energy is one measurement, board-watts only (not whole-datacenter cooling/water); consumer single-inference. Directional magnitude, not a precise constant.
One person. One consumer GPU.
Buddy was built by one person over 57 active development days — ~355 wall-clock hours, ~142 verified GPU-training hours, a consumer RTX 3090-class workstation, ~20 TB of Starlink-routed data, limited cloud bursts. Capital basis ~$5,525–5,770; direct operating cost ~$784–884.
The model we benchmarked — R1-Distill-Qwen-14B — is the distilled student of a far larger teacher. By DeepSeek's own disclosures, R1's final reasoning run cost $294,000 (512× H800, 80 hours, peer-reviewed in Nature), built atop the ~$5.6M DeepSeek-V3 base (V3 technical report) — and even that excludes prior research. Independent analysis (SemiAnalysis — an estimate, not a DeepSeek figure) puts the parent's GPU infrastructure near $1.6 billion in servers. Those are the teacher's costs, not the 14B distill's — we cite the disclosed numbers and label the estimate as one.
Even measured against the smallest honest figure — R1's $294K final run alone — Buddy's ~$800 operating cost is ~350× less. That is the point: frontier results don't require frontier budgets — they require the right architecture.
Every number here is tied to a measured run or the build log; the frontier-lab side is reported-only. The full report and all raw data are hash-sealed — entry AIIT-DISC-0001 in the AIIT Discovery Ledger, evidence sha256 c76adc8c….
buddy's in the shop — in-house training right now.
peek at the shop →