Benchmarks — lakecode

The two regimes

Execution-bound tasks (the answer is reachable by exploring the repo) and knowledge-bound tasks (the answer lives in org-local knowledge) are different claims with different baselines. Frontier models solve the first kind for everyone equally — the durable edge lives in the second.

Execution-bound · efficiency

~27% cheaper

~22–26% fewer tool calls, identical correctness

Same agent, same model, same tasks — the only variable is whether the substrate is warm. Memory removes re-exploration; correctness is unchanged by design of the regime.

Tier-1 auto-prime A/B + write-back cold/warm A/B · 2026-06-04 · claude-sonnet-4-6 both arms · 20 runs, 0 failed

Knowledge-bound · correctness

              Claude Code
              ✗ ✗ ✗
              0/3
            

              lakecode, cold
              ✗ ✗ ✗
              0/3
            

              lakecode + substrate
              ✓ ✓ ✓
              3/3
            

A causal matrix: when the answer depends on knowledge that isn't in the repo, every cold arm is confidently wrong — and one substrate claim flips the result from 0% to 100%. No model upgrade closes this gap, because the knowledge isn't in any training set.

Grounded-coding bench causal matrix · 2026-06-08 · claude-sonnet-4-6 everywhere · n=3 per arm · single knowledge-trap domain

Why two regimes? Because blending them overstates both claims. The efficiency number is measured at identical correctness; the correctness number is measured where baselines fail. They are never the same number, and we never present one as the other.

The full proof table

Everything we claim publicly, in one place — with the run, the date, and the caveats attached. If a number isn't on this table, we don't ship it.

Claim	Result	Run · date	Caveats
Coding parity with Claude Code	6/6 = 6/6; bug-fixes byte-identical (same model both arms)	coding-bench v1 · 2026-06	small generic fixture; n=1 per task
Efficiency moat (warm vs cold substrate)	~22–26% fewer tool calls, ~27% cheaper, identical correctness	auto-prime + write-back A/Bs · 2026-06-04 · claude-sonnet-4-6	execution-bound tasks; correctness unchanged by design of the regime
Correctness on knowledge-bound tasks	CC 0/3 · lakecode cold 0/3 · lakecode seeded 3/3 — one claim flips 0→100%	grounded-coding bench · 2026-06-08 · claude-sonnet-4-6	n=3; single knowledge-trap domain
Cross-session code-change persistence	cold 0/2 vs warm 2/2 on a seeded convention; cold fabricated nothing	flywheel causal validation · 2026-06-09 · claude-sonnet-4-6	n=2
Substrate self-heals	failing production query fixed by one ingested clarifying finding — zero code change, same day	cp-07 timestamped record · 2026-05	single observed instance
Single-turn pipeline economics	$0.033 vs $0.113 per question (~3× cheaper), 5–8× fewer tokens, equal quality on covered questions	single-turn pipeline vs CC · May 2026 · Sonnet 4.6 both sides	coverage-limited on repo internals; the CC baseline itself moved 30% in 17 days — hence absolute costs
Memory pollution	personal-note corpus never leaked into codebase answers (0/2; notes ranked 75–100)	memory-as-findings probe · 2026-06-02	relevance-based; no hard filter yet

Pipeline economics in detail

Stage 1 — 11 Databricks SDK questions × 3 replicates (33 graded answers), blind LLM judge, answers anonymized and order-shuffled. Both systems answer with the same model (Claude Sonnet 4.6), so the gap is the compiled context, not the model.

$0.033

lakecode pipeline, mean cost / question · 32/33 correct + 1 partial, 0 wrong

$0.113

Claude Code, mean cost / question · 33/33 correct, 0 wrong

3.4×

cheaper at matched accuracy — 2.6× common APIs, 3.4× internals, 5.4× cross-cutting

Worked example · "How does pagination work in the SDK?"

A real cross-cutting question from the workload. Both systems answered correctly in all 3 replicates — lakecode compiled the answer from its substrate at $0.034 per question; the baseline searched, read, and reconstructed at $0.198. Same answer, 5.8× the cost.

This is the regime where compiled context shines: questions whose answer spans the codebase. It is an economics claim, not a correctness claim — the baseline got there too, it just paid full price for the trip.

Methodology & culture

Pre-registered runs. Hypotheses and pass bars are written down before the result exists; when a run misses the target, we publish the measured number, not the target.
Model pinned everywhere. Same answering model on both sides of every comparison (claude-sonnet-4-6) — differences are architecture, not a cheaper model.
Causal designs. The correctness and flywheel results hold everything constant except one variable: what the substrate contains. Cold arms verify the token is absent before seeding.
Blind grading. LLM judge with a 4-way verdict (correct / partial / incorrect / abstained); answers anonymized and order-shuffled.
Caveats travel with claims. Sample sizes are small and stated; baselines move (the CC baseline shifted 30% in 17 days), so ratios always ship with absolute costs.

Full method docs — pre-registrations, judge prompts, raw run records — are shared with beta partners.

Cheaper where you'd already be right.
Correct where you'd be wrong.

The two regimes

The full proof table

Pipeline economics in detail

Methodology & culture

See it on your own repo

Cheaper where you'd already be right.Correct where you'd be wrong.

The two regimes

The full proof table

Pipeline economics in detail

Methodology & culture

See it on your own repo

Cheaper where you'd already be right.
Correct where you'd be wrong.