The report card · five standard benchmarks · June 2026

Buddy's report card.

A 14-billion-parameter model, fine-tuned and run on one RTX 3090 in Council Hill, Oklahoma. No cloud. No oracle. Nothing leaves the house. Every number below is recomputed from raw rows before it's printed — and every report is one click away.

Buddy is currently down for in-house training; these are his standing numbers.

Frontier-band honesty (72.1% on TruthfulQA MC1) and 80% on grade-school math from a local 14B, plus an honest 31.8% on GPQA Diamond, where even GPT-4 only reaches ~38%. Buddy shows the hard ones too.

Every line, with its references.

| Benchmark | What it measures | Buddy (14B) | Reference points | Date | Report |
| --- | --- | --- | --- | --- | --- |
| TruthfulQA MC1 | Honesty: refusing common falsehoods. 15 unparsed counted as misses → true score ≥ 72.1%. | 72.1% (589 / 817) | GPT-4 class ~60–80% | Jun 12, 2026 | PDF |
| GSM8K | Grade-school math word problems. Misses were digit-transcription drift, not reasoning; coaching made him worse. The cure is a calculator/fold, not a drill. | 80% (16 / 20) | natural reasoning | Jun 12, 2026 | PDF |
| MMLU | General knowledge across 57 subjects. Strong on verbal/historical, softer on heavy-symbolic (same fault line as GSM8K). | 67.5% (135 / 200) | stratified 200 of 14,042 | Jun 12, 2026 | PDF |
| LongMemEval-S | Long-term recall over his own memory. Strict grader; 51.2% under the official gpt-4o judge. No cloud oracle: his own memory architecture. | 55.2% (276 / 500) | GPT-4o full-context oracle ~60% | Jun 11, 2026 | PDF |
| GPQA Diamond | Graduate-level science, closed-book. The hard one frontier labs cite. Bio 47.4% / Physics 38.4% / Chemistry 22.6% (multi-step arithmetic is his fault line). 8 "unparsed" were token-cap truncations mid-calculation → true ≥ 31.8%. | 31.8% (63 / 198) | random 25% · human+web ~34% · GPT-4 ~38% · PhD ~65% | Jun 12, 2026 | PDF |
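The "true score ≥" notes in the table follow from counting every unparsed answer as a miss: the printed figure is then a floor, and the ceiling is what the score would be if every unparsed answer had been right. A minimal sketch of that arithmetic, using only the counts printed above (the `Tally` class and `score_bounds` function are illustrative names, not part of Buddy's tooling):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    correct: int   # answers graded correct
    unparsed: int  # answers the grader could not parse (scored as misses)
    total: int     # questions asked


def score_bounds(t: Tally) -> tuple[float, float]:
    """Return (lower, upper) accuracy in percent.

    lower: every unparsed answer counted as a miss (the printed score).
    upper: every unparsed answer assumed correct (a best case, not a claim).
    """
    if t.total <= 0 or t.correct + t.unparsed > t.total:
        raise ValueError("inconsistent tally")
    lower = 100.0 * t.correct / t.total
    upper = 100.0 * (t.correct + t.unparsed) / t.total
    return lower, upper


# Counts copied from the table; only these two rows report unparsed answers.
tallies = {
    "TruthfulQA MC1": Tally(correct=589, unparsed=15, total=817),
    "GPQA Diamond": Tally(correct=63, unparsed=8, total=198),
}

for name, t in tallies.items():
    lo, hi = score_bounds(t)
    print(f"{name}: {lo:.1f}% <= true score <= {hi:.1f}%")
# TruthfulQA MC1: 72.1% <= true score <= 73.9%
# GPQA Diamond: 31.8% <= true score <= 35.9%
```

The upper figures are a ceiling implied by the counts, not a measured result; the page itself only claims the lower bound.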

And the one that's hash-sealed.

Buddy vs DeepSeek-R1 — efficiency
⬡ Hash-sealed · AIIT-DISC-0001 · sha256 c76adc8c… · Jun 1, 2026

Same answer on the identical Qwen2.5-14B base, same machine, same 50 frozen prompts: ~4.6× faster, ~10× fewer tokens, 95% accuracy. The full report and raw data are sealed as a PDF.
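A hash seal is only useful if a reader can check it. Here is a minimal sketch of how someone might verify a downloaded report against a published SHA-256 digest, using Python's standard `hashlib`. The function names are illustrative. The page shows only the first eight hex characters of the digest (`c76adc8c…`), so the full 64-character value is assumed to come from the sealed report itself.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_seal(path: str, published_hex: str) -> bool:
    """Compare a file's SHA-256 against the full published digest.

    A truncated prefix is rejected on purpose: matching eight hex
    characters is far weaker evidence than matching all 64.
    """
    published = published_hex.strip().lower()
    if len(published) != 64:
        raise ValueError("need the full 64-character SHA-256 digest")
    return hmac.compare_digest(sha256_of_file(path), published)
```

Calling `matches_seal("report.pdf", full_digest)` returns `True` only when the file is byte-for-byte identical to the one that was sealed; `report.pdf` and `full_digest` are placeholders here.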

How to trust these numbers

Buddy is a Qwen2.5-14B fine-tune with his own memory architecture, served from one RTX 3090. Every benchmark was run locally; every printed number is recomputed from the raw rows and judge verdicts before it ships. The reports above are the receipts — open any of them.
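"Recomputed from the raw rows" can be pictured as a small tally step that runs before anything is printed. The sketch below is one way such a check might look, not Buddy's actual pipeline: it assumes a hypothetical JSONL file with one graded answer per line and a `verdict` field of `correct`, `wrong`, or `unparsed`, and it uses toy rows rather than real benchmark data.

```python
import json
from collections import Counter


def recompute(jsonl_text: str) -> tuple[int, int, float]:
    """Tally graded rows and return (correct, total, percent to one decimal).

    Anything that is not 'correct' (including 'unparsed') counts as a miss.
    """
    verdicts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            verdicts[json.loads(line)["verdict"]] += 1
    total = sum(verdicts.values())
    if total == 0:
        raise ValueError("no rows to score")
    correct = verdicts["correct"]
    return correct, total, round(100.0 * correct / total, 1)


def check_printed(jsonl_text: str, printed_percent: float) -> bool:
    """True only if the number about to be printed matches the raw rows."""
    _, _, percent = recompute(jsonl_text)
    return percent == round(printed_percent, 1)


# Toy rows in the assumed schema (illustration only, not Buddy's data).
toy_rows = "\n".join([
    '{"id": 1, "verdict": "correct"}',
    '{"id": 2, "verdict": "correct"}',
    '{"id": 3, "verdict": "wrong"}',
    '{"id": 4, "verdict": "unparsed"}',
])

print(recompute(toy_rows))            # (2, 4, 50.0)
print(check_printed(toy_rows, 50.0))  # True
print(check_printed(toy_rows, 75.0))  # False
```

Refusing to publish when `check_printed` returns `False` keeps a hand-edited or stale percentage from ever reaching the page.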
