Buddy's report card.
A 14-billion-parameter model, fine-tuned and run on one RTX 3090 in Council Hill, Oklahoma. No cloud. No oracle. Nothing leaves the house. Every number below is recomputed from raw rows before it's printed — and every report is one click away.
Buddy is currently down for in-house training; these are his standing numbers.
TruthfulQA MC1: 72.1% · 817 questions · zero-shot
GSM8K: 80% · 20 problems · seed 42
MMLU: 67.5% · 57 subjects · zero-shot
LongMemEval-S: 55.2% · 500 questions · ICLR’25
GPQA Diamond: 31.8% · 198 questions · CoT
Frontier-band honesty (72.1% on TruthfulQA MC1) and 80% on a 20-problem GSM8K sample, all on a local 14B, plus an honest 31.8% on GPQA Diamond, the benchmark GPT-4 itself clears at only ~38%. Buddy shows the hard ones too.
Every line, with its references.
| Benchmark | What it measures | Notes | Buddy (14B) | Reference points | Date | Report |
|---|---|---|---|---|---|---|
| TruthfulQA MC1 | Honesty: refusing common falsehoods | 15 unparsed answers counted as misses, so the true score is ≥ 72.1%. | 72.1% (589 / 817) | GPT-4 class ~60–80% | Jun 12, 2026 | PDF → |
| GSM8K | Grade-school math word problems | Misses were digit-transcription drift, not reasoning; coaching made him worse. The cure is a calculator/fold, not a drill. | 80% (16 / 20) | natural reasoning | Jun 12, 2026 | PDF → |
| MMLU | General knowledge across 57 subjects | Strong on verbal/historical, softer on heavy-symbolic (same fault line as GSM8K). | 67.5% (135 / 200) | stratified 200 of 14,042 | Jun 12, 2026 | PDF → |
| LongMemEval-S | Long-term recall over his own memory | Strict grader; 51.2% under the official gpt-4o judge. No cloud oracle: his own memory architecture. | 55.2% (276 / 500) | GPT-4o full-context oracle ~60% | Jun 11, 2026 | PDF → |
| GPQA Diamond | Graduate-level science, closed-book | The hard one frontier labs cite. Bio 47.4% / Physics 38.4% / Chemistry 22.6% (multi-step arithmetic is his fault line). 8 “unparsed” answers were token-cap truncations mid-calculation, so the true score is ≥ 31.8%. | 31.8% (63 / 198) | random 25% · human+web ~34% · GPT-4 ~38% · PhD ~65% | Jun 12, 2026 | PDF → |
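The "true score ≥" notes in the table follow from counting every unparsed answer as a miss. A minimal sketch of that arithmetic, using the counts printed above: the lower bound is the table's own convention, while the upper bound (every unparsed answer treated as correct) is an assumption added here only to show how wide the gap could be. The function name `score_bounds` is hypothetical, not part of Buddy's tooling.

```python
def score_bounds(correct: int, unparsed: int, total: int) -> tuple[float, float]:
    """Return (lower, upper) accuracy in percent.

    lower: every unparsed answer is counted as a miss (the table's convention).
    upper: every unparsed answer is counted as correct (an assumption made
           here purely to bound the score from above).
    """
    if correct < 0 or unparsed < 0 or correct + unparsed > total:
        raise ValueError("counts are inconsistent with total")
    lower = 100.0 * correct / total
    upper = 100.0 * (correct + unparsed) / total
    return lower, upper


# Counts taken from the table: (benchmark, correct, unparsed, total).
for name, correct, unparsed, total in [
    ("TruthfulQA MC1", 589, 15, 817),
    ("GPQA Diamond", 63, 8, 198),
]:
    lower, upper = score_bounds(correct, unparsed, total)
    print(f"{name}: {lower:.1f}% <= true score <= {upper:.1f}%")
```

Under those assumptions the printed 72.1% and 31.8% are floors: the ceilings would be 73.9% and 35.9% if every unparsed answer had in fact been right.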
And the one that's hash-sealed.
Same answer on the identical Qwen2.5-14B base, same machine, same 50 frozen prompts — ~4.6× faster, ~10× fewer tokens, 95% accuracy. Full report + raw data sealed as a PDF →
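For readers who want to check a seal themselves, here is a minimal sketch of how a hash-sealed file could be verified. It assumes the seal is a SHA-256 digest published alongside the PDF; the page does not say which hash is used, and the helper names below are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large reports need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seal_matches(path: Path, published_hex: str) -> bool:
    """Compare the file's digest with the published one (assumed SHA-256 hex)."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

If even one byte of the report or its raw data changes after sealing, the recomputed digest no longer matches the published one.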
Buddy is a Qwen2.5-14B fine-tune with his own memory architecture, served from one RTX 3090. Every benchmark was run locally; every printed number is recomputed from the raw rows and judge verdicts before it ships. The reports above are the receipts — open any of them.
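What "recomputed from the raw rows" can look like in practice, as a hedged sketch: the row schema (`benchmark`, `correct`) and the function names are assumptions for illustration, and the rows below are synthetic placeholders shaped to match two counts from the table, not Buddy's actual logs.

```python
from collections import defaultdict


def recompute(rows):
    """Aggregate raw rows into {benchmark: (correct, total, percent)}."""
    tally = defaultdict(lambda: [0, 0])  # benchmark -> [correct, total]
    for row in rows:
        counts = tally[row["benchmark"]]
        counts[0] += 1 if row["correct"] else 0
        counts[1] += 1
    return {
        name: (correct, total, round(100.0 * correct / total, 1))
        for name, (correct, total) in tally.items()
    }


def mismatches(rows, printed):
    """Return {benchmark: (printed, recomputed)} wherever the two disagree."""
    fresh = recompute(rows)
    out = {}
    for name, pct in printed.items():
        recomputed = fresh[name][2] if name in fresh else None
        if recomputed != pct:
            out[name] = (pct, recomputed)
    return out


# Synthetic placeholder rows (hypothetical schema), sized like the table's counts.
rows = (
    [{"benchmark": "GSM8K", "correct": True}] * 16
    + [{"benchmark": "GSM8K", "correct": False}] * 4
    + [{"benchmark": "MMLU", "correct": True}] * 135
    + [{"benchmark": "MMLU", "correct": False}] * 65
)
printed = {"GSM8K": 80.0, "MMLU": 67.5}

# An empty result means every printed number agrees with the raw rows.
print(mismatches(rows, printed))
```

A check like this can gate publication: a number ships only if it matches what the rows produce.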