Cheaper where you'd already be right.
Correct where you'd be wrong.

Memory does two different jobs, so we measure them separately and never blend the numbers. Every claim on this page is dated, model-pinned, and shipped with its caveats — runs are pre-registered before we know the result.

The two regimes

Execution-bound tasks (the answer is reachable by exploring the repo) and knowledge-bound tasks (the answer lives in org-local knowledge) are different claims with different baselines. Frontier models solve the first kind for everyone equally — the durable edge lives in the second.

Execution-bound · efficiency

~27% cheaper

~22–26% fewer tool calls, identical correctness

Same agent, same model, same tasks — the only variable is whether the substrate is warm. Memory removes re-exploration; correctness is unchanged by design of the regime.

Tier-1 auto-prime A/B + write-back cold/warm A/B · 2026-06-04 · claude-sonnet-4-6 both arms · 20 runs, 0 failed

Knowledge-bound · correctness

Claude Code ✗ ✗ ✗ 0/3
lakecode, cold ✗ ✗ ✗ 0/3
lakecode + substrate ✓ ✓ ✓ 3/3

A causal matrix: when the answer depends on knowledge that isn't in the repo, every cold arm is confidently wrong — and one substrate claim flips the result from 0% to 100%. No model upgrade closes this gap, because the knowledge isn't in any training set.

Grounded-coding bench causal matrix · 2026-06-08 · claude-sonnet-4-6 everywhere · n=3 per arm · single knowledge-trap domain

Why two regimes? Because blending them overstates both claims. The efficiency number is measured at identical correctness; the correctness number is measured where baselines fail. They are never the same number, and we never present one as the other.

The full proof table

Everything we claim publicly, in one place — with the run, the date, and the caveats attached. If a number isn't on this table, we don't ship it.

Claim Result Run · date Caveats
Coding parity with Claude Code 6/6 = 6/6; bug-fixes byte-identical (same model both arms) coding-bench v1 · 2026-06 small generic fixture; n=1 per task
Efficiency moat (warm vs cold substrate) ~22–26% fewer tool calls, ~27% cheaper, identical correctness auto-prime + write-back A/Bs · 2026-06-04 · claude-sonnet-4-6 execution-bound tasks; correctness unchanged by design of the regime
Correctness on knowledge-bound tasks CC 0/3 · lakecode cold 0/3 · lakecode seeded 3/3 — one claim flips 0→100% grounded-coding bench · 2026-06-08 · claude-sonnet-4-6 n=3; single knowledge-trap domain
Cross-session code-change persistence cold 0/2 vs warm 2/2 on a seeded convention; cold fabricated nothing flywheel causal validation · 2026-06-09 · claude-sonnet-4-6 n=2
Substrate self-heals failing production query fixed by one ingested clarifying finding — zero code change, same day cp-07 timestamped record · 2026-05 single observed instance
Single-turn pipeline economics $0.033 vs $0.113 per question (~3× cheaper), 5–8× fewer tokens, equal quality on covered questions single-turn pipeline vs CC · May 2026 · Sonnet 4.6 both sides coverage-limited on repo internals; the CC baseline itself moved 30% in 17 days — hence absolute costs
Memory pollution personal-note corpus never leaked into codebase answers (0/2; notes ranked 75–100) memory-as-findings probe · 2026-06-02 relevance-based; no hard filter yet

Pipeline economics in detail

Stage 1 — 11 Databricks SDK questions × 3 replicates (33 graded answers), blind LLM judge, answers anonymized and order-shuffled. Both systems answer with the same model (Claude Sonnet 4.6), so the gap is the compiled context, not the model.

$0.033
lakecode pipeline, mean cost / question · 32/33 correct + 1 partial, 0 wrong
$0.113
Claude Code, mean cost / question · 33/33 correct, 0 wrong
3.4×
cheaper at matched accuracy — 2.6× common APIs, 3.4× internals, 5.4× cross-cutting

A real cross-cutting question from the workload. Both systems answered correctly in all 3 replicates — lakecode compiled the answer from its substrate at $0.034 per question; the baseline searched, read, and reconstructed at $0.198. Same answer, 5.8× the cost.

This is the regime where compiled context shines: questions whose answer spans the codebase. It is an economics claim, not a correctness claim — the baseline got there too, it just paid full price for the trip.

Methodology & culture

  • Pre-registered runs. Hypotheses and pass bars are written down before the result exists; when a run misses the target, we publish the measured number, not the target.
  • Model pinned everywhere. Same answering model on both sides of every comparison (claude-sonnet-4-6) — differences are architecture, not a cheaper model.
  • Causal designs. The correctness and flywheel results hold everything constant except one variable: what the substrate contains. Cold arms verify the token is absent before seeding.
  • Blind grading. LLM judge with a 4-way verdict (correct / partial / incorrect / abstained); answers anonymized and order-shuffled.
  • Caveats travel with claims. Sample sizes are small and stated; baselines move (the CC baseline shifted 30% in 17 days), so ratios always ship with absolute costs.

Full method docs — pre-registrations, judge prompts, raw run records — are shared with beta partners.

See it on your own repo

The flywheel demo — cold fail, teach once, warm pass — runs live on your code during beta onboarding.