The report card · five standard benchmarks · June 2026

Buddy's report card.

A 14-billion-parameter model, fine-tuned and run on one RTX 3090 in Council Hill, Oklahoma. No cloud. No oracle. Nothing leaves the house. Every number below is recomputed from raw rows before it's printed — and every report is one click away.

Buddy is currently down for in-house training — these are his standing numbers. see what he's up to →

72.1%

Honesty

TruthfulQA MC1
817 questions · zero-shot

Jun 12, 2026Report →

80%

Math

GSM8K
20 problems · seed 42

Jun 12, 2026Report →

67.5%

Knowledge

MMLU
57 subjects · zero-shot

Jun 12, 2026Report →

55.2%

Memory

LongMemEval-S
500 questions · ICLR’25

Jun 11, 2026Report →

31.8%

Grad science

GPQA Diamond
198 questions · CoT

Jun 12, 2026Report →

Frontier-band honesty and 80% grade-school math on a local 14B — and an honest 31.8% on the benchmark even GPT-4 only clears at ~38%. Buddy shows the hard ones too.

Every line, with its references.

Benchmark	What it measures	Buddy (14B)	Reference points	Date	Report
TruthfulQA MC1	Honesty — refusing common falsehoods 15 unparsed counted as misses → true score ≥ 72.1%.	72.1%589 / 817	GPT-4 class ~60–80%	Jun 12, 2026	PDF →
GSM8K	Grade-school math word problems Misses were digit-transcription drift, not reasoning; coaching made him worse — the cure is a calculator/fold, not a drill.	80%16 / 20	natural reasoning	Jun 12, 2026	PDF →
MMLU	General knowledge across 57 subjects Strong on verbal/historical, softer on heavy-symbolic (same fault line as GSM8K).	67.5%135 / 200	stratified 200 of 14,042	Jun 12, 2026	PDF →
LongMemEval-S	Long-term recall over his own memory Strict grader; 51.2% under the official gpt-4o judge. No cloud oracle — his own memory architecture.	55.2%276 / 500	GPT-4o full-context oracle ~60%	Jun 11, 2026	PDF →
GPQA Diamond	Graduate-level science, closed-book The hard one frontier labs cite. Bio 47.4% / Physics 38.4% / Chemistry 22.6% (multi-step arithmetic is his fault line). 8 “unparsed” were token-cap truncations mid-calculation → true ≥ 31.8%.	31.8%63 / 198	random 25% · human+web ~34% · GPT-4 ~38% · PhD ~65%	Jun 12, 2026	PDF →

And the one that's hash-sealed.

Buddy vs DeepSeek-R1 — efficiency

⬡ Hash-sealed · AIIT-DISC-0001 · sha256 c76adc8c… · Jun 1, 2026

Full comparison →

Same answer on the identical Qwen2.5-14B base, same machine, same 50 frozen prompts — ~4.6× faster, ~10× fewer tokens, 95% accuracy. Full report + raw data sealed as a PDF →

How to trust these numbers

Buddy is a Qwen2.5-14B fine-tune with his own memory architecture, served from one RTX 3090. Every benchmark was run locally; every printed number is recomputed from the raw rows and judge verdicts before it ships. The reports above are the receipts — open any of them.

A 14B model. One RTX 3090. Nothing leaves the house — and we're raising to scale it. Investors — partner with us →