Memory does two different jobs, so we measure them separately and never blend the numbers. Every claim on this page is dated, model-pinned, and shipped with its caveats — runs are pre-registered before we know the result.
Execution-bound tasks (the answer is reachable by exploring the repo) and knowledge-bound tasks (the answer lives in org-local knowledge) are different claims with different baselines. Frontier models solve the first kind for everyone equally — the durable edge lives in the second.
Execution-bound · efficiency
~27% cheaper
~22–26% fewer tool calls, identical correctness
Same agent, same model, same tasks — the only variable is whether the substrate is warm. Memory removes re-exploration; correctness is unchanged by design of the regime.
Tier-1 auto-prime A/B + write-back cold/warm A/B · 2026-06-04 · claude-sonnet-4-6 both arms · 20 runs, 0 failed
Knowledge-bound · correctness
A causal matrix: when the answer depends on knowledge that isn't in the repo, every cold arm is confidently wrong — and one substrate claim flips the result from 0% to 100%. No model upgrade closes this gap, because the knowledge isn't in any training set.
Grounded-coding bench causal matrix · 2026-06-08 · claude-sonnet-4-6 everywhere · n=3 per arm · single knowledge-trap domain
Why two regimes? Because blending them overstates both claims. The efficiency number is measured at identical correctness; the correctness number is measured where baselines fail. They are never the same number, and we never present one as the other.
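The rule that efficiency is only reported at identical correctness can be written as a guard in the comparison itself. This is a minimal sketch, not the harness behind the numbers on this page: the `ArmResult` type, its fields, and the sample figures are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    """Aggregate outcome of one arm of an A/B run (hypothetical shape)."""
    tasks_passed: int
    tasks_total: int
    tool_calls: int
    cost_usd: float


def relative_reduction(cold: float, warm: float) -> float:
    """Fraction by which the warm arm undercuts the cold arm."""
    if cold <= 0:
        raise ValueError("cold baseline must be positive")
    return (cold - warm) / cold


def efficiency_delta(cold: ArmResult, warm: ArmResult) -> dict:
    """Report savings only when both arms have identical correctness."""
    same_correctness = (
        (cold.tasks_passed, cold.tasks_total)
        == (warm.tasks_passed, warm.tasks_total)
    )
    if not same_correctness:
        # A correctness gap belongs to the other regime; refuse to blend.
        raise ValueError("correctness differs; not an efficiency comparison")
    return {
        "tool_calls": relative_reduction(cold.tool_calls, warm.tool_calls),
        "cost": relative_reduction(cold.cost_usd, warm.cost_usd),
    }


# Made-up numbers for illustration only; not taken from any run on this page.
cold = ArmResult(tasks_passed=10, tasks_total=10, tool_calls=200, cost_usd=4.00)
warm = ArmResult(tasks_passed=10, tasks_total=10, tool_calls=150, cost_usd=3.00)
delta = efficiency_delta(cold, warm)
print(f"{delta['tool_calls']:.0%} fewer tool calls, {delta['cost']:.0%} cheaper")
```

Raising an error on a correctness mismatch, rather than returning a blended score, keeps the two regimes from ever collapsing into one number.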
Everything we claim publicly, in one place, with the run, the date, and the caveats attached. If a number isn't in this table, we don't ship it.
| Claim | Result | Run · date | Caveats |
|---|---|---|---|
| Coding parity with Claude Code | 6/6 = 6/6; bug-fixes byte-identical (same model both arms) | coding-bench v1 · 2026-06 | small generic fixture; n=1 per task |
| Efficiency moat (warm vs cold substrate) | ~22–26% fewer tool calls, ~27% cheaper, identical correctness | auto-prime + write-back A/Bs · 2026-06-04 · claude-sonnet-4-6 | execution-bound tasks; correctness unchanged by design of the regime |
| Correctness on knowledge-bound tasks | CC 0/3 · lakecode cold 0/3 · lakecode seeded 3/3 — one claim flips 0→100% | grounded-coding bench · 2026-06-08 · claude-sonnet-4-6 | n=3; single knowledge-trap domain |
| Cross-session code-change persistence | cold 0/2 vs warm 2/2 on a seeded convention; cold fabricated nothing | flywheel causal validation · 2026-06-09 · claude-sonnet-4-6 | n=2 |
| Substrate self-heals | failing production query fixed by one ingested clarifying finding — zero code change, same day | cp-07 timestamped record · 2026-05 | single observed instance |
| Single-turn pipeline economics | $0.033 vs $0.113 per question (~3.4× cheaper), 5–8× fewer tokens, equal quality on covered questions | single-turn pipeline vs CC · 2026-05 · claude-sonnet-4-6 both sides | coverage-limited on repo internals; the CC baseline itself moved 30% in 17 days, which is why costs are reported in absolute terms |
| Memory pollution | personal-note corpus never leaked into codebase answers (0/2; notes ranked 75–100) | memory-as-findings probe · 2026-06-02 | relevance-based; no hard filter yet |
Stage 1 — 11 Databricks SDK questions × 3 replicates (33 graded answers), blind LLM judge, answers anonymized and order-shuffled. Both systems answer with the same model (Claude Sonnet 4.6), so the gap is the compiled context, not the model.
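One way to anonymize and order-shuffle a pair of answers before they reach a judge is sketched below. The page does not describe how the blinding is actually implemented, so the `blind_pair` helper, the `Answer A`/`Answer B` labels, and the placeholder answer strings are assumptions made for illustration.

```python
import random


def blind_pair(
    answers: dict[str, str], rng: random.Random
) -> tuple[dict[str, str], dict[str, str]]:
    """Strip system names and shuffle order before grading.

    Returns (labelled answers for the judge, key mapping label -> system).
    """
    systems = list(answers)
    rng.shuffle(systems)  # presentation order no longer reveals the system
    labels = [f"Answer {chr(ord('A') + i)}" for i in range(len(systems))]
    key = dict(zip(labels, systems))
    blinded = {label: answers[system] for label, system in key.items()}
    return blinded, key


rng = random.Random(0)  # seeded only so the example is reproducible
originals = {
    "lakecode": "placeholder answer 1",
    "baseline": "placeholder answer 2",
}
blinded, key = blind_pair(originals, rng)
# The judge sees only `blinded`; `key` is held back until scores are recorded.
```

Keeping the label-to-system key separate from what the judge sees means scores can be joined back to systems only after grading is complete.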
A real cross-cutting question from the workload. Both systems answered correctly in all 3 replicates — lakecode compiled the answer from its substrate at $0.034 per question; the baseline searched, read, and reconstructed at $0.198. Same answer, 5.8× the cost.
This is the regime where compiled context shines: questions whose answer spans the codebase. It is an economics claim, not a correctness claim — the baseline got there too, it just paid full price for the trip.
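The cost multiples follow directly from the per-question dollar figures quoted on this page. The short check below recomputes them; the `cost_ratio` helper is a hypothetical name introduced only for this illustration.

```python
def cost_ratio(baseline_usd: float, compiled_usd: float) -> float:
    """How many times more the baseline paid per question."""
    if compiled_usd <= 0:
        raise ValueError("compiled cost must be positive")
    return baseline_usd / compiled_usd


# Cross-cutting example question: $0.198 (baseline) vs $0.034 (lakecode).
cross_cutting = cost_ratio(0.198, 0.034)

# Single-turn pipeline row of the claims table: $0.113 vs $0.033.
table_row = cost_ratio(0.113, 0.033)

print(f"cross-cutting question: {cross_cutting:.1f}x")  # 5.8x
print(f"claims-table row: {table_row:.1f}x")            # 3.4x
```

Both ratios are quotients of absolute costs, so they can be rechecked against the dollar figures even if the baseline's own pricing shifts.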
Full method docs — pre-registrations, judge prompts, raw run records — are shared with beta partners.
The flywheel demo — cold fail, teach once, warm pass — runs live on your code during beta onboarding.