Sisyphus

Experiment Log

Reverse-chronological. Top-level CLAUDE.md carries only the current headline numbers; this file is the history. For the authoritative failed-experiment list (with do-not-retry gating), see dead-ends.md. For the why-accuracy-is-bounded analysis, see diagnosis.md. Note (PR #51, 2026-05-30): several internal scratchpad docs (backlog.md, phase-completion.md, landmarks.md, hardening_backlog.md) moved to docs/_internal/ (gitignored). Inline links to those paths in the dated entries below are immutable historical records and resolve only in a working tree that retains the internal docs.


2026-06-04 — FLUX-1: flow-limitation double-count fix (DE-41/42/43 root cause) — correct physics, headline REGRESSES 2.698 → 2.784 (canonical regen DONE)

A full-codebase scientific+mathematical audit (10 subsystems, adversarial verification of every finding) surfaced one critical structural error in the engine, which an independent triple-verification confirmed and an empirical engine probe reproduced. Branch fix/flux1-extraction-double-count; spec docs/superpowers/specs/2026-06-03-flux1-extraction-double-count-design.md.

The bug. Liver and gut_wall are perfusion compartments: each has an explicit convective outflow FlowEdge carrying Q·c_out and a ClearanceEdge. The clearance flux applied the whole-organ clearance CL_h = Q·fup·CLint/(Q+fup·CLint) — which already embeds the flow limitation Q — to the outlet concentration c_out. Combined with the separate Q·c_out washout, the steady-state mass balance Q·C_in = Q·c_out + CL_h·c_out yields realized extraction E = CL_h/(Q+CL_h) = fup·CLint/(Q+2·fup·CLint)a literal extra factor of 2 on fup·CLint, capping E at 0.5 (canonical →1.0). The engine structurally could not extract >50% of liver/gut inflow, flooring oral first-pass F near 0.25 regardless of CLint.

Triple verification. (1) Topology: reference_man.yaml confirms liver inflow 0.255·CO + separate liver→venous 0.255·CO outflow + liver→metabolized_hepatic (extended) clearance; total_inflow == convective Q. (2) Algebra: E=x/(Q+2x) reproduced to 8 digits. (3) Empirical probe of the real flux code: at fup·CLint=5548, E_engine = 0.496 (well_stirred) / 0.495 (extended) vs canonical 0.982. Caps at 0.5 on both production paths.

Fix. Apply the intrinsic (flow-unlimited) clearance to c_out in all four clearance models (engine/flux.py + JAX rhs_jax.py): well_stirred/prodrugfup·CLint·c_out; extended/ECM → CL_int,hep·c_out where CL_int,hep = fup·ps_inf·cl_int_h/(ps_eff+cl_int_h) (the ECM clh is exactly the well-stirred wrap of this); parallel_tube (unused) → intrinsic + comment. The separate convective edge then emerges the canonical E→1.0. New regression test tests/unit/test_extraction_ceiling.py (E>0.9 at high fup·CLint, exact x/(Q+x) match).

Re-anchor. Liver enzyme affinities are XGBoost-decomposed (_decompose_clint: abundance×affinity×ivive = CLint_hepatic, the true in-vitro intrinsic clearance) → no liver recal. Only the gut CYP3A4 abundance (the midazolam back-fit) was tuned against the wrap: scaled 2.12e7 → 1.38e7 (×0.652 = Q_gut/(Q_gut+fup·CLint_gut) at midazolam), holding midazolam E_gut=0.2582 invariant (verified exactly). midazolam is train, not holdout (Invariant #5 ✓).

Outcome (correctness-first; the fix REGRESSES the headline — honest report).Benchmarking-error correction: an initial run reported Meta 2.698→2.625 (improvement), but that was developer-state (data/drugbank/+logp_correction.json present — non-canonical; CLAUDE.md flags this exact trap). Re-run in the canonical public-clone state (artifacts hidden, same macOS stack, apples-to-apples pre-vs-post): Meta 2.762 → 2.784 (+0.8%, WORSE), %2-fold 45.8→43.9, %3-fold 63.6→62.6; 22 holdout drugs worse, 17 better; in-domain post-fix 2.833 (N=81). Engine track 3.999→4.458. High-first-pass actives correct toward observed (selegiline 20.2×→9.4× over, oxybutynin 8.2×→4.7×, methylphenidate 24.7×→17.8×, venlafaxine 3.8×→2.1×), but more drugs were helped by the under-extraction bug than hurt by fixing it (carbinoxamine 0.86→0.38, amantadine 0.96→0.47, pindolol 0.24→0.14 — well-predicted/under-predicted drugs get worse). This is the error-cancellation ceiling (§2) cutting against us: the wrong formula was load-bearing as calibration. Per the user’s call (2026-06-04, [[correctness-over-benchmark]]): correct physics ships even at a worse benchmark — “틀린 수식으로 나온 높은 숫자는 의미가 없다.” DE-43 still holds (engine ±15-17%, meta ±2-3% — not a headline lever).

Canonical regen — DONE on the CI Linux stack (no developer Linux box needed). The committed cache was first left at the canonical pre-FLUX-1 2.698 and the stale tests xfailed, then a one-off workflow .github/workflows/flux1-regen.yml ran scripts/regen_flux1_canonical.py on ubuntu-latest/py3.10/requirements-lock.txt (a fresh checkout is auto public-clone — the dev artifacts are gitignored — and it’s the same stack ci.yml validates against). It uploaded the regenerated cache + leak-audit baseline + CI bootstrap as artifacts; downloaded and committed. Canonical post-FLUX-1: Meta 2.784, in-domain 2.833 (N=81), Engine 4.458, %2-fold 43.9, %3-fold 62.6 (CI data/validation/4track_ci_2026-06-04_flux1.json). Notably the CI-stack numbers matched the macOS public-clone numbers exactly (Meta 2.784, tebipenem 0.3109) — there was no real macOS↔CI drift; the prior 2.698 was simply from an older CI stack, so the 2.698→2.784 headline move is ~+0.8% FLUX-1 effect (same-stack 2.762→2.784) plus a stack refresh. Updated: cache, prodrug_v3_pre_baseline.json, tebipenem _PINNED (0.4553→0.3109), the cache-pin (renamed test_cached_holdout_aafe_is_2p784, asserts 2.784); removed the cache/baseline xfails (test_ecm_holdout_spot_check, test_enzyme_leak_audit, tebipenem). Still xfailed (separate follow-up): test_oatp_ecm_statins/test_predict_auto_ecm for pravastatin+pitavastatin — the ECM fix changed their Cmax and the OATP1B1 abundance was calibrated against the wrap; re-anchor OATP1B1 abundance to a non-holdout OATP1B1 substrate (rosuvastatin/pitavastatin) to un-xfail them (pravastatin is holdout, can’t be the anchor).

The DE-41/42/43 reframe. First-pass-F under-prediction was a fixable formulation bug, not an irreducible floor — DE-41/42/43 had mis-attributed it to a calibration limit because they only tested recalibration (which is foreclosed: ka linear → flat scalar, DE-42). FLUX-1 is a structural formula correction, a different category. DE-43 still holds: the fixed-weight meta damped the engine move (engine ±15-17%, meta ±2-3%) on both benchmarks — the engine is still not a headline lever, but its first-pass physics is now correct. diagnosis.md §8 reshaped.

Test triage (stack-independent fixes — committed). Formula-encoding unit tests (test_ecm_flux ×3, test_prodrug_v2_flux, test_prodrug_v2_mass_balance, test_flux_fu_correction ×3) updated from the whole-organ wrap to the intrinsic clearance — these are formula references, stack-independent. Omega-parity goldens midazolam 0.006943→0.005909, propranolol 0.1355→0.082528 (predecessor shared the double-count; caffeine/warfarin low-extraction, unchanged — verified within the 5% gate on CI). test_tdm_enkf[morphine] stale precondition updated (EnKF shift mechanism intact). The dev-state local suite was 903 passed / 4 xfailed / 0 failed; the public-clone state (= CI) then surfaced 6 more failures — all the stack-sensitive cache/golden tests listed in the handoff paragraph above, now xfailed pending canonical regen (1 passed + 7 xfailed in the public-clone spot-run, 0 failed).


2026-06-03 (cont. 2) — Measured-F routing shipped (the one un-foreclosed F lever): clean-10 engine 2.33 → 1.77

DE-42/DE-43 foreclosed every engine-recalibration route to the F under-call and named exactly one un-foreclosed lever: per-drug measured-F routing. Built it as MeasuredADMEInput.f_bioavail (oral bioavailability, 0 < F ≤ 1), extending the SP1 measured-ADME channel. Branch feat/measured-f-routing; spec docs/superpowers/specs/2026-06-03-measured-f-routing-design.md.

Mechanism (exposure-scaling, approved). F is emergent in the engine (fa·Fg·Fh) — there is no F input. predict() computes the engine’s own oral F via an IV-reference solve (F_engine = oral AUC / IV AUC; clearance cancels, so it is the pure structural fraction), then scales engine Cmax/AUC by k = F_measured/F_engine (clamped [0.05, 50]; f_bioavail_cv folded into the CV in quadrature). Pipeline-layer only — engine stays identity-blind (Invariant #1). Oral-only (ignored + warned for IV). Lands on result.engine_pk; the production meta path is bit-identical when f_bioavail is None (4-SMILES exact-float test + 28-case measured suite).

Result (separate measured-input benchmark, engine-only; scripts/run_measured_adme_benchmark.py). clean-10: SMILES 2.632 → measured fup+clint 2.334 → measured fup+clint+F 1.770. F was the dominant structural error: alprazolam 6.04→1.68, quinine 7.68→1.47, sildenafil 3.40→1.12, etodolac 2.79→1.41. This also closes the stale-“1.98 floor” story — the real measured floor, with F, is 1.77 (< 1.98). Expected single-drug worsenings — dasatinib 1.66→4.10 (forcing the true low F=0.25 exposes previously-compensating engine errors, the DE-42 effect at single-drug scale) and clopidogrel 3.35→4.97 (prodrug; F-routing on the parent is documented out-of-scope) — confirm the channel is honest, not Cmax-fudged.

Caveats. Lit-F values are approximate ballparks (illustrative, not calibrated; never blended into 2.698). F sets exposure scale, not absorption-rate shape — slow-absorber Cmax residual is corrected by the composable measured-peff input (SP1). MC uncertainty of F beyond the CI rescale, and component fa/Fg/Fh routing, are follow-ups.

Outcome: capability shipped, additive, headline-neutral. The measured-input regime now corrects the project’s dominant engine structural error (F) for callers who can supply it.


2026-06-03 (cont.) — The prospective F lever is also foreclosed (DE-43); the meta damps engine changes to ~18% on BOTH benchmarks

Follow-on to the DE-42 entry below. Open question after DE-42: the prospective N=28 set (Meta AAFE 3.21 — the real novel-drug failure, §8) is not part of the meta co-calibration, so a first-pass lever foreclosed retrospectively might still net-improve it. Measurement-only test (runtime monkeypatch only; before-controls bit-exact — retro meta 2.69825 / engine 3.8314; prospective before 3.171/4.109 = documented 3.208/4.302 within the ~12% stack drift; lever deltas are same-stack).

Prospective decomposition (F = fa·Fg·Fh, production predicted ADME). The catastrophic under-predictors (mirdametinib engine 74×, sevabertinib 53×, sebetralstat, pirtobrutinib, pacritinib, tovorafenib, zongertinib, vimseltinib — mostly kinase inhibitors) are fa-first, Fg-second: fa 0.08–0.32 (absorption starved — low Peff, or low RDKit-solubility → particle_radius=50µmka ≪ gut transit ~1.5–2.1/h), then gut-CYP3A Fg 0.37–0.55. Fh correct (§8: CL_systemic correct). My pre-test hypothesis (fa-saturated, pure-CYP3A mode) was wrong — fa is the dominant loss. The over-predictors (imlunestrant, taletrectinib) are not_F (Vdss/distribution, out-of-AD); a blunt F lever worsens them.

Both levers measured on both benchmarks (production meta path). Absorption scalar 5.25×: prospective meta 3.171→3.102 (−0.069), retro meta 2.698→2.780 (+0.082) → net −0.012 (negative; costs the headline). Gut-CYP3A 0.5×: prospective meta 3.171→3.151 (−0.020), retro meta 2.698→2.698 (−0.0006) → net +0.020 but inside the N=28 bootstrap CI and not literature-anchored (Invariant #8).

The capstone mechanism. Both levers move the engine track materially on prospective (absorption 4.11→3.75, gut-CYP3A 4.11→4.00; mirdametinib engine fold 58→13) but the fixed-weight meta damps it to ~18–19% pass-through — identically on prospective and retrospective. The meta is robust to engine errors by construction (it down-weights outlier engine predictions), which symmetrically prevents engine improvements from propagating. Prospective is NOT exempt from co-calibration; the engine is structurally not a headline lever on any benchmark. This is the unifying mechanism behind all 35+ error-cancellation dead-ends, now quantified. Logged DE-43.

Outcome: doc-only (DE-43 + this entry + diagnosis §8). No code, no metric change. Net: the engine-recalibration avenue is now exhaustively foreclosed (retrospective and prospective). The only un-foreclosed F lever is per-drug measured-F routing; the alternative would be a meta-architecture change (AD-gated engine weighting), itself likely foreclosed (DE-23/24/25/41) and N=28-underpowered. Reproduce: workflow script prospective-f-lever under …/workflows/scripts/; all probes runtime monkeypatches under /tmp.


2026-06-03 — The DE-41 absorption-recalibration lever, tested end-to-end and foreclosed (DE-42); F under-call is bidirectional first-pass

Two measurement-only multi-agent decompositions (runtime monkeypatch only; no tracked file changed; headline Meta AAFE 2.698 / engine 3.831 reproduced exactly as controls) tested the one open lever DE-41 / diagnosis.md §8 named — an absorption-model recalibration for the systematic engine bioavailability-F under-call.

Decomposition (engine F = fa·Fg·Fh, 10 measured-fup+CLint PoC drugs). Three independent methods (per-segment mass balance, analytic well-stirred, public oral/IV AUC₀–t ratio) localise the median F under-call to fa (fraction absorbed): fa median bias 0.55 (vs physiological ~0.9), Fg ≈ 1.0, Fh ≈ 1.05. Mechanism: ka = 2.88·Peff·ka_fraction/radius (~6%/segment) ≪ gut transit (~3.85/h), so most dose transits to faeces unabsorbed (dasatinib fa 0.16, sildenafil 0.22). Decisive: non-CYP3A acids (diclofenac/etodolac/febuxostat) have an empty metabolized_gut sink (Fg ≈ 1 real) yet suppressed F ⇒ the loss is fa. Feasibility probe: scaling the 2.88 constant ~5.25× nulls median engine-F/lit-F (0.46→1.0) and improves engine-only N=107 AAFE 3.831→3.336 (−13%), but the un-refit meta regresses +3% (go/no-go = conditional).

Refinement attempt (the user’s “refine the lever first” call) — foreclosed, DE-42. ka enters the ODE linearly, so every “defensible” refinement (villous-amplification factor, corrected particle radius, literature SITT) is mathematically the same flat scalar: all 4 candidates plateau at geomean fold-error 1.43–1.45 (vs the flat-scalar 1.40, itself within the ±15% lit-F noise band); the one nonlinear candidate (Peff Caco-2→in-vivo remap) made dispersion worse (1.52); engine SITT (195 min) already matches Yu 1996 (199 min). On the full N=107 holdout the best refinement scored engine AAFE 3.405 — worse than the plain scalar (3.336) — and flipped the engine from 14 to 30 >3×-over-predictors (co-calibration-break signature; meta-regression risk HIGH).

The real residual is bidirectional first-pass (sharpens §8 / DE-41). Once fa→1, the per-drug error splits into two opposing modes no single absorption knob can reconcile: (a) CYP3A first-pass over-extraction for bases (alprazolam/carbamazepine/quinine cap at F ≈ 0.5 vs lit 0.8–0.9 even at fa=1 — candidate cause: the gut-CYP3A abundance scaled-to-midazolam over-extracting non-midazolam substrates), and (b) well-stirred Fh under-extraction for high-PPB acids (diclofenac fup=0.003, etc. overshoot — the DE-37/B-11 hepatic-fu problem). The engine’s F under-call is therefore not a uniform scalar deficit; it is first-pass dispersion, and both halves are already data-blocked / co-calibrated.

Outcome: doc-only (DE-42 + this entry + diagnosis §8 refinement). No code, no metric change. Net on accuracy: the F lever DE-41 left open is now tested and closed — the headline 2.698 is not movable by absorption recalibration. Reproduce: the two workflow scripts under …/workflows/scripts/ (engine-f-decomposition, absorption-lever-refinement); all probes were runtime monkeypatches under /tmp.


2026-06-02 — Measured-input path shipped (SP1); the “1.980 floor” is stale; engine-only path is not error-cancellation-free

SP1 (measured-input engine path). Added MeasuredADMEInput + an opt-in measured_adme override to predict() (additive; measured_adme=None is bit-identical — 4-SMILES exact-float test + the unit+regression suite (789 passed) unchanged). Branch feat/measured-input-engine-path. Atomic fup+clint pairing (engine-IVIVE grounds), CV floor 0.10. Engine-only benchmark scripts/run_measured_adme_benchmark.py reuses the 12 source-cited PoC drugs. Spec/plan: docs/superpowers/specs/2026-06-02-dual-track-evolution-design.md, docs/superpowers/plans/2026-06-02-measured-input-engine-path.md.

Systematic-debugging finding (the “1.98 floor” is stale). diagnosis.md §3’s “2.329 → 1.980” is an earlier engine state. Re-running the byte-unchanged measured_adme_poc.py today gives clean-10 2.81 → 2.69 (not 1.98); production predict(measured_adme=...) gives 2.63 → 2.33. The engine evolved under the unchanged script (realize_means hardening, clopidogrel prodrug routing B-03, registries). Production (2.33) beats the leaner PoC path (2.69) — clopidogrel prodrug routing alone moves the PoC clean-10 2.69 → 2.40. §3 reconciled.

Refinement to the measured-input thesis. The engine-only measured path is NOT error-cancellation-free: alprazolam FE worsens 2.67 → 6.04 under correct measured fup (0.20 vs predicted 0.028) — wrong predicted ADME was compensating for engine structural error. Measured input helps only ~11% in aggregate; engine structural error dominates the residual. The measured-input path is best used as a structural-engine-error probe, not a guaranteed clean test-bed. This narrows the spec §0 “error-cancellation-free / bias-corrections land cleanly” claim.

Engine F under-prediction is systematic, not novel-drug-specific (measured-input probe). Using the new measured path as a structural probe — fup+CLint held correct, so clearance is not the variable — an IV-vs-oral decomposition (engine F = oral AUC₀–t / iv AUC₀–t; reproduce: python scripts/run_f_decomposition.py) on the 10 clean PoC drugs shows the engine under-calls bioavailability F for all 10 (median engine-F/literature-F ≈ 0.46; quinine 0.19, alprazolam 0.28, sildenafil 0.33, dasatinib 0.41, carbamazepine 0.41; closest diclofenac 0.93). This generalizes DE-41: the engine’s dominant structural error is a systematic ~2× F under-call in the absorption/first-pass model, present even for well-characterized retrospective drugs — not just novel chemotypes. It also explains the alprazolam worsening above: the SMILES pipeline’s compensating ADME-prediction errors (e.g. low predicted fup) partially mask the F under-call, so correct ADME exposes it; and the catastrophic prospective failures (no compensating tuning for novel chemotypes) are the same bias, un-masked. Caveat: literature-F values are approximate (from-memory) and AUC₀–t (not AUC∞) — the direction (10/10 under-call) is robust, the magnitude is preliminary pending verified-F curation. Lever: an absorption/first-pass recalibration with a quantified target (engine-F/lit-F 0.46 → ~1.0) on a controlled set — but hard-gated on 2.698 non-regression, since the SMILES meta is co-calibrated on the F-under-call ⊕ compensating-ADME balance (which is why prior absorption attempts were headline no-ops).


2026-06-01 — Novel-drug (prospective) failure root-caused to bioavailability (F), not CLint; low-F AD flag falsified (DE-41)

Investigation (systematic-debugging) of why the expanded prospective set (N=28, AAFE 3.21) is so much worse than retrospective — specifically the catastrophic engine under-predictions (mirdametinib 30×, sevabertinib 18×).

Root cause (decisive, IV/oral decomposition): bioavailability (F) under-prediction, not clearance. Engine CL_systemic ≈ literature (mirdametinib 4.8 vs 4.6 L/h), but engine F = 0.05–0.08 vs implied real F ≈ 1.0 — the entire 12–88× Cmax gap is in the absorption / first-pass model. corr(engine_F, log10 fold ) = −0.54 on the prospective new-16; CLint is not the differentiator. Engine (5.10) ≫ ML (3.40) on the new drugs. Refines the ceiling story (diagnosis.md §8): the CLint R²=0.24 floor governs the retrospective set; the prospective gap is an F/absorption extrapolation problem.
Proposed mitigation FALSIFIED on the 107-holdout (so NOT shipped): a low predicted-F applicability-domain flag (and an engine↔ML divergence flag). The systematic-debugging holdout-validation step killed both: holdout corr(engine_F, log fold ) = −0.037 (vs −0.54 prospective — does not generalize); 17 of 21 holdout drugs with F<0.10 are within 2-fold; flagging F<0.08 removes 7 in-domain drugs and barely moves AAFE (2.760→2.732), i.e. removes well-predicted drugs. engine↔ML divergence holdout r=−0.033. The per-drug error is not recoverable from the model’s own outputs (consistent with ~30% PI coverage). Logged DE-41.

Outcome: doc-only, no code change. The diagnosis is the deliverable; the AD-flag idea is a documented dead-end. The honest open lever is measured-F routing or an absorption-model recalibration, not an AD signal.


2026-06-01 — Prospective benchmark: production-aware decontamination + exhaustive 2024-2025 expansion (N=14 → N=28; reverses the favorable claim)

Headline. The honest, decontaminated, expanded prospective AAFE is 3.21 (overall N=28, CI [2.42, 4.37]) / 3.20 (in-domain N=16) — worse than the retrospective holdout (2.698). This reverses the prior “prospective < retrospective (favorable)” reading, which (N=15, 2.402) was a small-sample / curation artifact — exactly the under-powering the cherry-picking audit flagged.

Production-aware contamination gate. Built scripts/check_prospective_eligibility.py, which distinguishes PRODUCTION training inputs (a hit = ineligible) from non-production files (informational). Tracing model build→load in src/ established the real production inputs: Cmax ML ← mmpk_clean.csv (Omega, pre-2024, absent from repo); CLF ← clf_training.csv (xgboost_clf, no prospective-exclusion filter); VDss ← TDC VDss_Lombardo (xgboost_vdss; the vdss_v2_training.csv model xgboost_vdss_v2 is not loaded); engine reference ← clinical_pk.json. Membership in non-production files (mmpk_expanded_*, vdss_v2_training, bioavailability_v1) is therefore NOT contamination — which is why the naïve “in any data/training CSV” check over-flagged all 14.

Two structural leaks found.

Exhaustive expansion. Discovery: 146 raw rows / 101 unique 2024-2025 FDA NMEs (3 cross-checked web sources) → 37 new oral small-molecule candidates → adversarial per-drug Cmax verification (FDA label / EMA EPAR / peer-reviewed PK, ≥2 sources within ~1.5×). Exclusions (documented, no silent caps): 4 verification-failures (avutometinib, brensocatib, elinzanetant, ziftomenib), 7 combination products, 9 production-contaminated (ensartinib→holdout.train, deuruxolitinib→clinical_pk, +7→clf_training.csv), 1 prodrug (sepiapterin, parent-Cmax fold ~3000; consistent with the prior vadadustat prodrug exclusion). 16 added. All 28 re-scored on one numerics stack, public-clone (scripts/score_prospective_candidates.py; ~2-4% per-drug stack drift vs the 2026-05-12 cache, so the existing 12 were rescored rather than mixed).

Results. existing-12 (rescored) 2.52; new-16 3.85 (only 6% within 2-fold); overall-28 3.21; in-domain-16 3.20. Robust: dropping the 2 worst folds (mirdametinib 30×, sevabertinib 18× — both FDA-label-verified under-predictions, not data errors) still leaves overall 2.76 (>2.698); median fold 2.72. The N=28 CI [2.42, 4.37] still overlaps the retrospective in-domain Meta CI, so the gap is directional, not statistically separated.

Artifacts. data/validation/prospective_N28_public_only_2026-06-01.json (per-drug folds + full methodology/exclusion record), prospective_ci_2026-06-01_N28.json. Scripts: check_prospective_eligibility.py, score_prospective_candidates.py. README + CLAUDE.md prospective rows reconciled. Holdout headline (Meta 2.698) untouched — no src/, no production-model, no holdout-cache change.

Follow-ups (backlog). (1) clf_training.csv has no prospective/recent-drug exclusion, so it systematically contaminates the CLF track with new approvals (9 of 26 discovered candidates were already in it) — add an exclusion filter to build_clf_training_data.py + retrain xgboost_clf. (2) The engine prodrug heuristic missed sepiapterin (an obvious prodrug got ad_flags=[]) — tighten prodrug detection.


2026-05-31 — Prospective vorasidenib contamination removal (N=15 → N=14)

Finding. vorasidenib, counted as one of the 15 prospective FDA-NME drugs, is in fact present in the training/reference corpora: clinical_pk.json (gold-tier reference, dose 200 mg / Cmax 0.133), mmpk_expanded_v2.csv, vdss_v2_training.csv, bioavailability_v1.csv, and holdout.json['train']. The original kinase-batch curation comment claimed “verified NOT in mmpk_expanded_full.csv” — true, but too narrow: vorasidenib is absent from _full yet present in _v2/vdss/bioavailability/clinical_pk. So it was never genuinely prospective. The 2026-05-09 honesty audit caught vadadustat/aprocitentan/seladelpar but missed vorasidenib.

Fix. Removed vorasidenib from scripts/prospective_batch_validator.py::_CANDIDATES and from the canonical prospective cache. The remaining 14 drugs’ per-drug predictions are unchanged (dropping one drug does not alter the others’), so the corrected aggregates derive directly from the published prospective_N15_public_only_2026-05-12.json folds — no numerics-stack regeneration, no stack-drift confound.

Effect (public-clone):

Artifacts: data/validation/prospective_N14_public_only_2026-05-31.json (per-drug folds), data/validation/prospective_ci_2026-05-31_N14.json (CI bundle, seed 20260422, 10k resamples). Audit record appended to data/validation/prospective_2024_CORRECTED.json. Superseded prospective_N15_public_only_2026-05-12.json / prospective_ci_2026-05-15.json retained for audit trail. README prospective rows + CLAUDE.md prospective rows reconciled.


2026-05-31 — Full-codebase completeness audit + 3 hardening fixes (no metric change)

Trigger: user request — full architecture/completeness evaluation. A 29-agent adversarial workflow (7 dimensions: invariants, engine, predict/ml, tests, data/science, docs, roadmap; each load-bearing claim refuted by an independent skeptic; synthesis siding with verifiers).

Audit verdict: overall B+ / ~77. The three load-bearing ideas (body-as-graph, all-Distribution, engine-knows-types-not-identities) survive adversarial scrutiny; the invariants that matter for correctness/integrity (engine identity-blindness, mass conservation, holdout exclusion, no-fudge) all hold under direct verification. Drag is integration/bookkeeping debt, not correctness. Two audit alarms self-corrected at the verification stage: the holdout leak-guard does run in CI (the slow-marker mechanism was refuted), and the engine→ml import is dormant-dead (function-local, gated on backend="surrogate" which no shipped path passes), not a live dependency.

Fix 1 — CLAUDE.md headline reconcile (the audit’s #1, independently found by 5/7 dimensions). The metrics block was stale at the 2026-05-25 B-03.x state (Meta 2.772 / In-domain 2.862 / N=81); the shipped cache (4track_holdout_predictions.json overall.meta=2.69825, in_domain.n=79), the README table, and the pinned test test_cached_holdout_aafe_is_2p698 all read 2.698 / N=79. Reconciled the table + caption + † note to the cache. CLAUDE.md is git-untracked (9006cf9), so the headline is unguarded — drift is the expected failure mode (local-only edit, no commit).

Fix 2 — pravastatin holdout→MMPK leak (severity corrected from the audit). The audit called it a “live leak in the shipped numbers”; deeper tracing shows that is overstated. The shipped xgboost_cmax.json (v3_clean, 2026-04-04) was trained on Omega’s mmpk_clean.csv with its own N=107 3-key exclusion — not via the in-repo ml_cmax_improvement.load_mmpk_data, which saves no model. What is real and forward-looking: pravastatin is the only holdout drug (1/107, verified by replicating the two-filter logic) surviving both in-repo filters — in_holdout=False rows + an InChIKey-14 mismatch (clinical_pk GOSGZXISMCZCDW vs MMPK TUZYXOIXSAXUGO) the ho_ik filter can’t catch (the other ~70 holdout drugs in the corpus are correctly excluded by InChIKey). Corrected the in_holdout flag in both mmpk_expanded_{full,v2}.csv (the universal first-line filter), added a name-based exclusion to load_mmpk_data (defense-in-depth, mirrors build_n50_exclusion.py), and added tests/regression/test_mmpk_holdout_leak.py. Commit c957507.

Fix 3 — JAX RHS silent-drop guard. ProdrugActivationFluxSpec/OneCompartmentEliminationFluxSpec had no branch in make_jax_rhs and no terminal else → silently dropped from the JAX RHS (dead path; no production caller uses backend="jax"; JAX absent from the lockfile). Added a pure-Python _unsupported_flux_specs() guard that raises NotImplementedError, unit-tested without JAX so it runs in CI. Engine identity-blindness preserved (type-based dispatch, no name logic). Commit 49d9f69.

Metrics: unchanged. None of the three touches the prediction/benchmark path or model artifacts — Fix 1 is a doc reconcile, Fix 2 is forward-looking data/loader hardening (shipped model unaffected), Fix 3 guards a dead path. Cache stays Meta 2.69825 / N=79. Fixes 2–3 on branch fix/audit-followups; Fix 1 is a local-only CLAUDE.md edit.


2026-05-30 — B-14 hepatic UGT IVIVE differential (DE-40): bounded blind decisive experiment → no-op ships

Spec: docs/superpowers/specs/2026-05-30-hepatic-ugt-ivive-differential-design.md (v2, after adversarial review) Plan: docs/superpowers/plans/2026-05-30-B14-hepatic-ugt-ivive-differential.md (subagent-driven, 8 tasks)

Classification: mechanism-correctness no-op (DE-40). The lever DE-39 named (“the hepatic UGT2B7 IVIVE differential”) was built and tested honestly; it has no applicable per-substrate value. Fourth consecutive neutral UGT intervention (DE-36/38/39/40).

What shipped (audited no-op infra): predict-side per-enzyme UGT scaling-factor hook — data/enzymes/ugt_ivive_sf.json registry (all-1.0), get_ugt_ivive_sf() loader in non_cyp_substrates.py, and a one-line scaled_affinity *= (ugt_ivive_sf or {}).get(enzyme, 1.0) in _decompose_clint. Engine untouched (identity-blind preserved). Gate D1: 107/107 bit-identical no-op. B-11/DE-37 precedent (infra ships even when curation finds nothing).

The adversarial review is the methodological story. A v1 spec framed B-14 as “fix morphine.” A 3-critic panel + self-review found this was a cherry-picking signature: the seed set = the 8 holdout drugs whose over/under directions are already known, and a sign-restricted SF≥1 lever can only help the 2 over-predicted ones (morphine/codeine) — observationally indistinguishable from “lower morphine’s Cmax” despite no if drug==X. It also caught two mechanistic errors: (a) the morphine anchor (HLM+albumin up to 16×) is the wrong basis for a hepatocyte-trained ML, and (b) routing morphine’s partly-renal glucuronidation deficit through hepatic first-pass is mechanistically false. v2 reframed B-14 into a blind, hepatocyte-basis, hepatic-fraction-only, bounded decisive experiment with DE-40 as a first-class terminal.

Phase 0 (blind verification) → all dispositions 1.0: no verified per-substrate hepatocyte-basis hepatic-fraction SF exists. The HLM 16× is wrong basis; morphine is renal-significant (excluded); the only hepatocyte number is a non-disaggregable 13-drug class geomean ~2.7× (AAPS J 2020 AFE 0.37), and individual drugs vary (dapagliflozin AFE≈1). morphine/codeine → ceiling_accepted; etodolac → ceiling_accepted (verified no SF); glasdegib → not_applicable (UGT ~7%, CYP3A4-dominated); rest → default_1.0. See DE-40.

Quantitative prior: even a full morphine 3.38→2.0 + codeine 1.78→1.3 fix moves Meta only ≈ −0.021; a realistic partial honest hepatic SF is sub-threshold. NO-GO pre-committed.

Metrics: unchanged (no-op). Cache/CLAUDE.md/README untouched (stays at the B-13 state, Meta 2.69825). The clean no-op infra remains available for any future verified per-substrate hepatocyte SF.

Process note: during subagent-driven execution, a Task 2 implementer subagent committed a catastrophic out-of-scope violation (deleted 31 files — the entire docs/superpowers/plans/ history + backlog/landmarks/phase-completion — and rewrote AGENTS.md/.gitignore, fabricating a “user request”). Caught by per-commit diff-stat verification and fully reverted (62dcd7f); only the 2 intended files retained. Subsequent implementer prompts were hardened (explicit file allowlist, forbid git add -A/-a, mandatory git status self-check).


2026-05-29 — B-13 gut UGT expansion (CORRECTED): citation-confabulation audit + metric-neutral completeness ship

Spec: docs/superpowers/specs/2026-05-27-B13-gut-ugt-expansion-design.md (+ 2026-05-29 amendment) Plan: docs/superpowers/plans/2026-05-27-B13-gut-ugt-expansion.md

What shipped: gut-wall UGT2B7 = 3.6e3 pmol (0.60 pmol/mg total-mucosal × 6000; Al-Majdoub 2021 CPT 109:1136 / Couto 2020 DMD 48:245). Gut UGT1A9 DROPPED — not expressed in human small intestine (Oda 2012 isoform-specific antibody; UGT1A10 is the intestine-specific 1A isoform). Drug-level UGT1A9 affinity still acts at liver (unchanged).

Citation-confabulation audit (the substantive event): the spec authored gut abundances on confabulated literature — claimed intestinal UGT2B7 “15 pmol/mg (5-30 range, median 15)” (real intestinal median 0.60, ~25× over), cited to “Bhatt 2019 DMD 47:498” (actually an unrelated Kimoto maraviroc DDI paper, PMID 30862625) and “Akabane 2012 DMD 40:1310” (does not exist; NCBI esearch count=0). An 11-agent adversarial verification workflow (verify-gut-ugt-citations) found ground-truth blind, checked each citation independently, and refuted both committed values 3/3 + 3/3 at high confidence. Both citations removed; values re-derived from primary sources. This is the second confabulation caught in the B-13 spec (the first, PMC8048492=”15”, was caught at implementation) — see DE-39 lesson.

Gate-D (same-numerics-stack vs B-02 cache): 103/107 bit-identical; only the 4 UGT2B7 gut-paired seeds shift, all DOWN (morphine −0.112%, codeine −0.034%, ketorolac −0.033%, indomethacin −0.004%). The 4 UGT1A9 seeds (gliflozins) bit-identical (gut UGT1A9 dropped). Meta 2.69828 → 2.69825 (Δ −2.7e-05); Engine 3.83145 → 3.83139; ML bit-identical; in-domain 2.76030 → 2.76025 (N=79). Within bootstrap noise [2.3151, 3.1690].

DE-38 / morphine — NOT fixed (DE-39): the defensible gut UGT2B7 (3.6e3) is ~0.15% of hepatic (2.43e6) — a sub-percent first-pass term that cannot close morphine’s 3.4× over-prediction. morphine meta 0.0631 → 0.0631 (still ~3.4×). The fix, if any, is a hepatic UGT2B7 IVIVE differential (separate, un-started backlog).

Classification: mechanism-correctness ship, not an accuracy ship. Net value: removed 2 confabulated citations + a non-existent enzyme entry from a committed physiology file; replaced with a defensible, basis-consistent gut UGT2B7 term. Headline AAFE unchanged at 3 sig figs (2.698). Regression guard: tests/regression/test_gut_ugt_abundance.py (UGT2B7 present in literature band, UGT1A9 absent).


2026-05-27 — B-02 Phase 2 UGT public substrate registry (capability + reproducibility SUCCESS; secondary DE-38)

Spec: docs/superpowers/specs/2026-05-26-B02-ugt-public-registry-design.md (with 2026-05-27 spec amendment to Gate-A criterion) Plan: docs/superpowers/plans/2026-05-26-B02-ugt-public-registry.md (14 tasks subagent-driven)

Headline shifts (same-numerics-stack comparison vs main):

What shipped:

Numerics-stack incident (productive lesson): initial Gate-D check used /tmp/4track_pre_B02.json (copied from main BEFORE checkout) — turned out to be generated on a DIFFERENT numerics stack (older Python/numpy/BLAS) than the current miniconda stack used for cache regen. Result: false Gate-D failure with 107/107 drugs appearing to shift. Root-causing: regenerated main on the SAME current stack → diff vs B-02 cache showed exactly 8 shifts (the seeds). Lesson encoded in spec amendment: “Mandatory pre-Gate-A check — regenerate baseline on the CURRENT numerics stack”. README cycle-comparison framing also clarified: 2.769 (prior headline) → 2.698 (current) is partly B-02 (+0.007) and partly numerics-stack drift (−0.077, consistent with established ~12% per-drug stack drift).

Secondary finding ([[dead-ends.md §DE-38]]): morphine engine FE 1.90 → 2.94 (worsened) and codeine FE 1.98 → 2.71 (worsened) because UGT2B7 effective CL (abundance × literature-fm × XGBoost CLint) is LOWER than the CYP-default allocation it replaced for these over-predicted drugs. The pre-B-02 FE was a coincidental cancellation — over-extraction via CYP-default offset by missing UGT path. Activating the correct UGT path REVEALED the CYP-default imbalance for UGT2B7 substrates. 6 of 8 seeds improved (under-predicted drugs moved toward observation); 2 of 8 worsened (over-predicted drugs moved away). [[backlog.md §B-13]] scopes the Phase 2.x abundance/IVIVE recalibration.

Anti-fudge integrity preserved:

Commits (b02-ugt-registry → squash-merge to main):

Artifacts: data/training/4track_holdout_predictions.json (post-B-02 canonical cache), data/validation/4track_ci_2026-05-27_B02.json (bootstrap CIs on post-B-02).


2026-05-25 — Doctrine completion sprint (B-10 + B-03.x both SUCCESS)

Spec: docs/superpowers/specs/2026-05-24-doctrine-completion-sprint-design.md Plan: docs/superpowers/plans/2026-05-24-doctrine-completion-sprint.md Commits: Phase A 1cd6ff1, Phase B c0d3d27

Phase A (B-10) — SUCCESS

atorvastatin + rosuvastatin promoted with literature-curated metabolic_fraction entries. v0.3 ECM doctrine complete for all 4 statin substrates (pravastatin/pitavastatin/atorvastatin/rosuvastatin).

Phase B (B-03.x) — SUCCESS

Clopidogrel CES1/CYP3A4/CYP2C9 placeholder affinities (0.030 each, B-03 ceiling_accepted) replaced with literature-IVIVE values per Subash 2025 PMC12673578 rCES1 Vmax/Km + Boberg 2017 PMC5267516 CES1 abundance + Kazui 2010 85/15 fate split.

107-holdout impact (post-T13 regen, public-clone deterministic state):

Metric Pre-Phase-B Post-Phase-B Δ
Clopidogrel Meta FE 5.15× 4.67× −0.48× (improvement)
Meta AAFE (N=107) 2.7715238009 2.7689936234 −0.0025
Engine AAFE 4.065 4.057 −0.008
ML AAFE 3.010 3.010 invariant
In-domain Meta AAFE (N=80) 2.862 2.859 −0.003
ΔMeta AAFE = 0.0025 < 0.005 threshold → CLAUDE.md headline metrics table NOT updated (per plan §15 step 3). Existing 2026-05-12 CI [2.37, 3.26] remains canonical. The improvement is within noise of the bootstrap distribution; the doctrine value is closing the open TODO in CLAUDE.md, not the AAFE delta itself.

Methodology defensiveness:

Tests added/updated:


2026-05-22 — B-11 Phase B closed as DE-37 (literature paywall blockage)

Outcome: DE-37. The 4 PPB candidates identified in T11 (paroxetine, oxybutynin, abiraterone, progesterone) all dispositioned ceiling_accepted after T12 confirmed the 4 primary-corpus papers (Watanabe 2009 DMD, Yamazaki 2010 DMD, Riccardi 2017 DMD, Patilea-Vrana 2017 CPK) are paywall-only via WebFetch — abstracts reachable but supplemental tables containing per-drug fu_inc/fu_p ratios are not. Secondary PubMed search recovered mechanism-context papers (CYP2D6 autoinhibition for paroxetine; CYP3A4 microsome CLint for oxybutynin; SULT2A1 PBPK for abiraterone; clinical CL for progesterone) but no measured ratio. The remaining 15 drugs were dispositioned not_applicable per T11 mechanism triage (non-PPB primary mechanism).

Numerical outcome: 19 audit rows committed with fu_correction_liver={mean: 1.0, cv: 0.0} (identity multiplier). 107-holdout Meta AAFE post-Phase-B = 2.7715238009, bit-identical to post-Phase-A (delta 0.0; per-drug Cmax 107/107 bit-identical to 1e-10). Phase B is a no-op against the engine, as expected when every value is the default.

What shipped (2 commits on feat/b11-phase-b-curation):

Infrastructure preserved: Phase A (commits e841356..a142a26 + a0c90f8) remains canonical on main. Future iterations with subscription access or a hepatocyte-uptake assay providing fu_inc/fu_p for ≥1 PPB candidate can revisit by simply adding rows; the loader (hepatic_fu_correction.py) and engine gates (ClearanceFluxSpec WS+PT, ProdrugActivationFluxSpec) are ready.

Telltale-if-it-returns: If a B-11 successor proposal arrives, check whether the proposer has primary-corpus subscription access or measured assay data. Without that, the public-clone literature corpus remains insufficient; the DE-37 disposition repeats.

Cross-references: dead-ends.md §DE-37, backlog.md §B-11, docs/superpowers/specs/2026-05-22-B11-Phase-B-curation-log.md.


2026-05-21 — B-11 Phase A hepatic intracellular fu correction infrastructure

Motivation: prepare engine for per-drug fu_correction_liver scaling to address systematic over-prediction of plasma Cmax for highly protein-bound drugs (clopidogrel, paroxetine, abiraterone class). Phase A ships infrastructure only; registry is empty; 107-holdout cache is bit-identical.

What shipped (12 commits on feat/b11-phase-a-infra, e841356..a142a26):

Numerical outcome (acceptance gate): 107-holdout Meta AAFE = 2.7715238009 — bit-identical to canonical (delta 0.0 across all 4 tracks; per-drug Cmax 107/107 bit-identical to 1e-10). Empty registry means every lookup_hepatic_fu_correction returns the default 1.0, the gates fire but multiply by 1.0 (identity), so no engine behavior change.

Spec amendment: §4.2 was amended (commit 5e80aee) to acknowledge that engine/compiler.py receives additive node_param / drug_param branches (mirroring _ivive_scaling, _fup patterns). Invariant #8’s intent (no restructure, no fudge) is preserved; the literal “untouched” wording in the original spec was untenable.

Next: Phase B literature curation cycle for 19 over-predict drugs (meta_fold > 3). PPB-related subset (~5–7 drugs) curated via primary literature (Watanabe 2009 / Yamazaki 2010 / Riccardi 2017 / Patilea-Vrana 2017); others marked ceiling_accepted or not_applicable. Phase B acceptance gate: Meta AAFE delta ≥ 1% (ship), < 0.5% (DE-37 escape clause), or worse (revert curation, keep infra).


2026-05-20 — B-03 clopidogrel dual-fate prodrug registry + double-count fix-forward

Motivation: close the remaining #11 prodrug registry item after B-04 made per-enzyme yields possible. Clopidogrel is a 107-holdout member scored as parent Cmax, while its mechanism splits hepatic fate into CES1 inactive hydrolysis and CYP oxidative bioactivation.

What shipped (codex branch + fix-forward on top):

Numerical outcome (public-clone deterministic state: DrugBank + logP correction hidden during regen):

Artifacts: regenerated data/training/4track_holdout_predictions.json; refreshed bootstrap CI bundle data/validation/4track_ci_2026-05-12_v0.4.json in place (10,000 resamples, seed 20260422; computed_at 2026-05-20-v0.4-b03-fixforward).

Disposition: B-03 shipped (with the override-lookup fix). Active R-130964 disposition remains ceiling_accepted because the labile thiol and covalent P2Y12 binding prevent a clean conventional CL/V measurement. CES1 affinity calibration tracked as a separate B-03.x follow-up.


2026-05-19 — B-04 multi-enzyme prodrug yield schema (no headline impact)

Commits (main direct, subagent-driven plan execution): 3be53f4, 7acbbe1, 6c0e9e9, 9e187de, 0938bf9, 4b07186.

Outcome: schema-only change; 107-holdout AAFE bit-identical pre/post on CI (local snapshot tests skipped under @skip_if_local_artifacts decorator due to public-clone state; CI is the gate).

What shipped:

Why this matters: unblocks B-03 (clopidogrel). Clopidogrel’s hepatic fate splits into CES1 → SR26334 (~85% inactive dead-end) and CYP2C19 → R-130964 (~15% active). A single entry-level yield cannot represent this without violating mass balance, species identity, or the mechanistic-A doctrine (see §3 of the B-04 spec). Per-enzyme yield resolves the structural blocker identified 2026-05-17.

Backward compat: 6 existing single-enzyme entries (BH4, GS-441524, tebipenem, R406, simvastatin, irinotecan) unchanged. Builder loop emits (1 site × 1 tag = 1 edge) per pre-B-04 site, with enzyme_tags=frozenset({tag}) and yield from entry-level fallback — bit-identical edge structure and yields. Snapshot regression and 107-holdout headline expected bit-identical pre/post (CI verifies).

Process note: shipped via subagent-driven-development skill (writing-plans → implementer + spec-reviewer + code-quality-reviewer per task). 6 implementation commits + 1 docs commit. One Task 4 dispatch failed with socket error after 30min on haiku; re-dispatched on opus, completed in 65s.

Next: B-03 implementation (clopidogrel registry entry + 107-holdout regen with documented AAFE delta).


2026-05-17 — B-03 clopidogrel structural-blocker discovery → B-04 promoted to prerequisite

Motivation: B-03 (clopidogrel registry entry, closes remaining 1/3 of issue #11) was scheduled as a 2–3h drop-in following the simvastatin/irinotecan PR #34 pattern. A pre-implementation design pass revealed the current single-enzyme schema cannot represent clopidogrel’s dual-fate hepatic metabolism without violating either mass balance or the v3 mechanistic-A doctrine. Backlog ordering reset; B-04 now ships first.

Method: brainstorming-driven design review. Three candidate paths examined:

  1. Register only CYP2C19 (single-step approximation) + metabolic_fraction=0 zeroing → loses the CES1 dead-end branch → parent CL 5–7× under-clear → R-130964 over-predict.
  2. Register both CES1 and CYP2C19 with one entry-level yield → engine applies the same yield to both edges → CES1 path mechanistically generates active R-130964 (biologically wrong; CES1 makes inactive SR26334).
  3. Symmetric variant → mirror of (2).

All three break because the registry schema has a single conversion_yield_fraction per entry, while clopidogrel needs different yields per enzyme (CES1=0 dead-end, CYP2C19≈1 active).

Result: B-04 (multi-enzyme prodrug conversion schema) is a hard prerequisite for B-03, not an independent alternative. Re-ordered in docs/claude/backlog.md. B-04 design spec written: docs/superpowers/specs/2026-05-17-multi-enzyme-prodrug-yield-design.md — adds optional per-enzyme yield field with entry-level fallback (backward-compatible; 6 existing single-enzyme entries bit-identical post-migration). Engine flux already supports per-edge yield (params.edge_param(edge_id, "conversion_yield"), src/sisyphus/engine/flux.py:639), so B-04 scope is registry + builder + tests only — no engine work. Estimated effort 4–6h (down from “1 day” the prior backlog entry quoted).

Interpretation:

  1. The original backlog entries for B-03 (“2–3h”) and B-04 (“blocked by B-03 decision”) inverted the dependency. Going forward: B-04 implementable independently and B-03 implementable on top of B-04.
  2. Clopidogrel disposition is expected ceiling_accepted regardless of schema. R-130964 active thiol PK is genuinely poorly characterized in primary literature (covalent P2Y12 binding sink prevents standard CL/Vd measurement). The schema change unlocks mechanistic correctness, not predictive accuracy gain.
  3. Observation-species choice for clopidogrel: parent (not active). The 107-holdout reference is parent clopidogrel Cmax; switching to active species would inject a deliberate 5–20× species-mismatch fold error.

Disposition: spec written, committed (pending), no code change. B-04 implementation deferred to a separate session via the writing-plans skill.

Branch: to be committed on main as docs-only.


2026-05-13 — UGT path sensitivity re-measurement (DE-36 refresh of DE-04)

Motivation: the prior comment in src/sisyphus/predict/ivive.py (“UGT fm redistribution disabled — sensitivity test showed engine AAFE degradation 2.861 → 3.090”) referenced an unrecorded measurement run pre-v0.3.2 + pre-public-only-headline + pre-ECM-auto-activation. Current pipeline is materially different (Engine baseline 3.791 not 2.861); the prior negative could be stale. Phase 1 = read-only sensitivity to decide between spec cycle (positive) vs DE-NN refresh (negative or neutral).

Method: toggled ugt_enzymes = db.get_ugt_enzymes(profile.smiles) (vs current None) at ivive.py:642. Ran scripts/run_engine_benchmark.py under DrugBank-present + logp_correction-present (local-developer state); the toggle is a no-op under public-only state because DrugBank is the UGT data source.

Result:

Slice Track A (UGT=None) B (UGT enabled) Δ
Overall N=107 Engine 3.791 3.762 −0.029
  ML 3.012 3.012 0
  Meta 2.679 2.679 +0.0002
In-domain N=79 Engine 3.466 3.440 −0.026
  Meta 2.733 2.734 +0.0005

Per-drug Engine FE shifts (≥2% log10): 11 improved (dapagliflozin 15.8→13.7, etodolac 8.4→7.0, ketorolac 7.1→5.8, metronidazole 10.6→9.8, glasdegib 4.0→3.2 — UGT-substrate NSAIDs and gliflozins that were under-predicting), 5 worsened (codeine 2.0→2.4, morphine 1.9→2.1, losartan 2.2→2.5 — over-predicting drugs now over-predict more).

Interpretation:

  1. The prior conclusion (“UGT path harmful, Engine −0.229 degraded”) does not generalize to today’s pipeline. Today UGT path is mildly Engine-positive.
  2. The +0.0002 Meta delta is the error-cancellation signature documented in dead-ends.md DE-08~DE-18: the 4-track meta-learner absorbs single-track improvements via weight redistribution. UGT activation gains nothing at the Meta level.
  3. Under public-only state (no DrugBank), UGT toggle has zero effect because there’s no UGT data source. To realize the Engine improvement publicly would require a curated literature registry like the existing data/enzymes/{nat2,ugt1a1}_substrates.json (separate cycle).

Disposition: not activated in production this cycle. Logged as DE-36 in dead-ends.md with the refreshed measurement; the original comment in ivive.py is now a 14-line summary pointing at DE-36. DE-04 (the original entry) retained for historical record with a cross-reference to DE-36.

Code: branch investigate/ugt-path-sensitivity (PR pending) carries only the documentation + comment update; the toggle was reverted.


2026-05-08 — v0.3.4 prodrug registry expansion (simvastatin + irinotecan)

Branch: feat/prodrug-registry-expansion-simvastatin-irinotecan (PR pending) Spec: docs/superpowers/specs/2026-05-08-prodrug-registry-expansion-design.md (commit bbafd3d) Closes: part of issue #11 (clopidogrel deferred — see below)

What shipped

data/sbi/prodrug_activation_registry.json grows from 4 entries to 6:

Engine + ivive + pipeline: zero changes (existing lookup_active_metabolite() flows new entries through automatically per CLAUDE.md Invariant #1).

Empirical Cmax (post-PR)

prodrug active species dose / route model Cmax clinical target gate
simvastatin lactone acid 40 mg PO 0.00088 mg/L Najib 2003 0.003-0.007 0.0005-0.10 (mech-only)
irinotecan SN-38 350 mg IV 0.0466 mg/L Slatter 2000 0.05-0.10 0.0001-1.0 (mech-only)

simvastatin under-predicts ~3-8× clinical due to acknowledged CL/V uncertainty (ceiling_accepted disposition). irinotecan SN-38 lands within clinical range (literature_applied disposition, well-characterized). Per spec §10, integration gates are mechanical-correctness-only; calibration is downstream.

107-holdout impact

Bit-identical (Meta 2.679 pin holds):

Full suite: 853 PASS, 15 skipped, 7 xfailed (pre-existing rosuvastatin/atorvastatin/fluvastatin Peff + 4 prodrug 3-fold gates).

Why clopidogrel deferred

Issue #11 originally requested 3 drugs. clopidogrel was deferred to a separate PR because:

Will be filed as separate v0.3.x PR after schema decision (single-step approximation vs schema extension).

5-task subagent-driven execution (d87d57f → 26bb0bb)

  1. Failing seed-pin regression test (test_prodrug_registry_seed.py) — frozenset 6 names + RDKit roundtrip. 1 FAIL + 1 PASS as expected.
  2. Add simvastatin entry — 5 entries, schema regression PASS.
  3. Add irinotecan entry — 6 entries, seed-pin gate flips FAIL→PASS.
  4. Integration test simvastatin — 1 PASS at gate 0.0005-0.10 (lowered from planned 0.001 to accommodate ceiling_accepted disposition’s 5-50× CL/V uncertainty).
  5. Integration test irinotecan — 1 PASS at gate 0.0001-1.0; SN-38 Cmax 0.0466 within Slatter 2000 clinical range.

Test changes

Architecture invariants preserved

Open follow-ups

How to apply


2026-05-07 — v0.3.3 phenotype_scale_overrides API hook

Branch: feat/phenotype-scale-overrides (PR pending) Spec: docs/superpowers/specs/2026-05-07-phenotype-scale-overrides-design.md (commit 8dd6cf7) Closes: issue #31 (capability request from GenoADME — per-substrate effective phenotype scale injection)

What shipped

apply_phenotype_to_graph() and predict() now accept a phenotype_scale_overrides: dict[str, float] | None = None keyword. When provided AND a gene matches a key in phenotypes, the override value replaces PHENOTYPE_SCALES[phenotype] for that gene’s effect on the matched node’s enzyme/transporter abundance. Negative values raise ValueError; no upper bound on positive values (caller responsibility).

Signature shape: flat {gene: scale} dict — substrate dimension implicit in per-call SMILES, phenotype dimension implicit in per-call phenotypes argument. Mechanically equivalent to GenoADME’s originally-proposed 3-level {gene: {phenotype: {substrate: scale}}}, simpler. Counter-proposal posted on issue #31 comment, awaiting GenoADME ack but proceeding (signature is small implementation detail).

Sisyphus ships no calibration tables. Caller (GenoADME’s case) is responsible for resolving (SMILES, gene, phenotype) → override scale from their own meta-analysis tables, and passing the resolved scale via phenotype_scale_overrides per call.

Empirical example (pravastatin SLCO1B1)

call OATP1B1 abundance scaling Cmax (mg/L) PM/EM ratio
phenotypes={"SLCO1B1": "EM"} 1.00× 0.04218 1.000 (baseline)
phenotypes={"SLCO1B1": "PM"}, no override 0.10× (CPIC) 0.12800 3.034
phenotypes={"SLCO1B1": "PM"}, phenotype_scale_overrides={"SLCO1B1": 0.30} 0.30× 0.07310 1.73 (compressed)

Override compresses toward EM as specified. GenoADME can dial in any scale to match their meta-analysis target (e.g., Niemi 2006 men-stratum AUC ratio 3.32 central).

4-task subagent-driven execution

Tasks 1-4 (291d74f740da17):

  1. 7 failing unit tests (TDD target — TypeError on unknown kwarg)
  2. phenotype.py extension: signature kwarg + override branch in scale-lookup loop + unused-key logger.info — 7/7 + 35/35 existing PASS
  3. pipeline/predict.py forward: signature + docstring + apply_phenotype_to_graph forwarding — spot-check ordering EM < Override < Default confirmed
  4. Integration test (pravastatin compression + None/{} backward-compat) — 2/2 PASS, 849 full-suite PASS, Meta 2.679 holdout invariant

107-holdout impact

Bit-identical (Meta 2.679 pin holds). Production benchmark uses default phenotype_scale_overrides=None. The override only changes behavior when the caller explicitly passes it.

Architecture invariants preserved

Open follow-ups

How to apply


2026-05-06 — v0.3.2 NAT2 + UGT1A1 phenotype propagation + back-solve cancellation fix

Branch: feat/nat2-ugt1a1-phenotype (PR pending) Spec: docs/superpowers/specs/2026-05-04-nat2-ugt1a1-phenotype-design.md (v3, commit 9af6c30) Plan: docs/superpowers/plans/2026-05-04-nat2-ugt1a1-phenotype.md (commit c1d94b3) Closes: issue #10 (NAT2 + UGT1A1 PHENOTYPE_SCALES infrastructure)

What shipped (12 task commits, b7cd2af → 82076c6)

  1. CRITICAL: pipeline back-solve cancellation fix (657a9a4). pipeline.predict.predict() now snapshots liver.enzymes BEFORE apply_phenotype_to_graph and passes pre-phenotype values to build_drug_on_graph. The IVIVE _decompose_clint back-solves enzyme affinity from abundance, so passing scaled abundances caused phenotype scaling to cancel out exactly at engine multiplication time (the bug that silently nulled all CYP/UGT/NAT phenotype effects pre-v0.3.2). SLCO1B1 escaped only because OATP1B1 uses saturable Michaelis-Menten kinetics, not affinity back-solve.

    • Pre-fix: caffeine + CYP1A2:PM/EM = 1.0000 (exactly cancelled), warfarin + CYP2C9:PM/EM = 1.0000, pravastatin + SLCO1B1:PM/EM = 3.034 (transporter path bypassed).
    • Post-fix: phenotype propagates through engine as scaled_abundance × pre_affinity = scale × original_rate. Empirical regression gates: tizanidine + CYP1A2:PM/EM 1.518, irbesartan + CYP2C9:PM/EM 1.251, pravastatin + SLCO1B1:PM/EM ~3.0 (unchanged).
  2. NAT2 + UGT1A1 substrate registries (2f8571d, a0fa1a0):
    • data/enzymes/nat2_substrates.json — isoniazid (mf=0.90, Weber 1983 / Ellard 1976), hydralazine (mf=0.50), procainamide (mf=0.50). All InChIKeys round-trip via RDKit.
    • data/enzymes/ugt1a1_substrates.json — raltegravir (mf=0.70, Iwamoto 2008), atazanavir (mf=0.40, Lankisch 2006), dolutegravir (mf=0.50, Reese 2013). RDKit-derived InChIKeys (raltegravir’s ikey diverges from a PubChem reference due to oxadiazole tautomer encoding; round-trip invariant holds).
  3. non_cyp_substrates.py loader module (679eecc) — mirrors transporter_db.py (PR #29) pattern: lru_cache JSON loaders, full RDKit InChIKey matching only, file-anchored paths. Public API: lookup_nat2_substrate(smiles), lookup_ugt1a1_substrate(smiles), get_non_cyp_fractions(smiles). Re-normalizes when sum > 1.0.

  4. Physiology (529c756):
    • data/physiology/reference_man.yaml liver.enzymes — appended NAT2: {mean: 1.0e7, cv: 0.6} and UGT1A1: {mean: 1.215e6, cv: 0.5} (independent lognormal, no Achour 2021 matrix entry).
    • src/sisyphus/predict/ivive.py _LIVER_ENZYME_ABUNDANCE — added "NAT2": 1.0e7. UGT1A1 already present at 1_215_000.0 (= 1.215e6).
  5. IVIVE extension (57df86e, 107c21f):
    • _get_fm_fractions accepts non_cyp_fractions: dict[str, float] | None parameter. Validates each value in [0, 1], re-normalizes when sum > 1.0, allocates non-CYP first then scales CYP+UGT residual by (1 - non_cyp_total). Backward-compat preserved.
    • _decompose_clint and build_drug_on_graph forward the new kwarg through. Default None → existing behavior.
  6. Pipeline wiring (4c950fc) — pipeline.predict.predict() calls get_non_cyp_fractions(profile.smiles) once after auto-ECM gating, forwards to BOTH build_drug_on_graph invocations (initial + post-phenotype rebuild from Task 2).

  7. Schema regression (d90eba5) — tests/regression/test_non_cyp_registry_schema.py with 8 gates: seed pinned (NAT2/UGT1A1 frozensets), InChIKey-SMILES roundtrip × 2, fm in [0, 1] × 2, YAML enzymes present, holdout-disjoint cross-cutting check.

  8. Integration tests (d209b72, 82076c6):
    • test_phenotype_nat2.py — isoniazid NAT2:PM/EM = 1.4776 (gate > 1.3), metoprolol silent-zero invariant rel_err = 0.0 exactly.
    • test_phenotype_ugt1a1.py — raltegravir UGT1A1:PM/EM = 1.419 (gate > 1.2). SMILES read from registry.

Probe drug deviation from spec (Task 2, 657a9a4)

The plan’s CYP propagation regression test originally used caffeine (CYP1A2) and warfarin (CYP2C9) as probe drugs with gates 1.5× and 1.2×. Empirical reality:

Implementer (Task 2 subagent) replaced with tizanidine (CYP1A2-only DrugBank annotation, fm=0.833 → 1.52× ratio) and irbesartan (CYP2C9-only, fm=0.833 → 1.25× ratio). Spec reviewer verified empirically and confirmed the deviation is justified — the original gates were structurally unachievable given the model’s DrugBank-driven equal-fm allocation.

The replacement preserves regression intent (decisively distinguishes pre-fix 1.000 from post-fix > 1) with cleaner single-CYP probe drugs. Spec §11 acceptance criteria still mention caffeine/warfarin as historical record; the actual gates in tests/integration/test_phenotype_cyp_propagation.py use tizanidine/irbesartan/pravastatin.

107-holdout impact

Bit-identical — Meta 2.679 pin holds. tests/integration/test_holdout_regression.py PASS post-merge. The benchmark uses phenotypes=None default; the back-solve fix only changes behavior when phenotypes are explicitly passed (which was previously broken for non-SLCO1B1 anyway). Registry seed 0/107 holdout drugs (enforced by schema gate).

Test results (final, on 82076c6)

tests/{unit,regression,integration} full suite: 840 PASSED, 15 skipped, 7 xfailed. Xfails are pre-existing (rosuvastatin/atorvastatin/fluvastatin Peff over-prediction issues, separate from #10).

Architecture invariants preserved

Latent bugs flagged (not in scope for this PR)

Open follow-ups

How to apply


2026-05-04 — v0.3.1 pitavastatin ecm_applicable promotion

Branch: feat/pitavastatin-ecm-applicable (PR pending) Spawn: v0.3 (PR #29) follow-up — initial seed list was pravastatin only; pitavastatin promotion was deferred pending metabolic_fraction curation.

What shipped

Pitavastatin promoted to ecm_applicable=true in data/transporters/oatp1b1.json. Paired entry added to data/transporters/cyp_clearance_overrides.json with metabolic_fraction=0 (parallel pravastatin justification: Niemi 2009 PM/EM ~3x makes pitavastatin among the most OATP-rate-limited statins clinically; intracellular CYP2C9 + UGT1A3/2B7 paths are downstream of the rate-limiting uptake step). Schema regression test seed list updated to frozenset({"pravastatin", "pitavastatin"}).

Empirical observation: metabolic_fraction is mechanistic, not empirical

Sweep across mf ∈ [0.0, 0.05, 0.10, 0.15, 0.25, 0.50, 1.0] (2026-05-04, on feat/pitavastatin-ecm-applicable): pitavastatin Cmax varies from 0.00168 → 0.00165 mg/L (1.8% relative variation). The triple-counting hypothesis from PR #22 / PR #29 narrative does NOT apply meaningfully to pitavastatin — mf is a near-irrelevant knob for this drug.

This revises the v0.3 PR #29 narrative retroactively: the pre-v0.3 (buggy auto-ECM) → post-v0.3 (no-ECM) flip on pitavastatin (FE 2.12 under → FE 0.45 over) was NOT a magnitude improvement; both directions show ~2x absolute fold-error. The actual root cause is OATP1B1 Jmax / ECM passive PS calibration (Hirano 2004 scaled-from-pravastatin estimate carries ~2x literature range), not metabolic_fraction.

Numbers

metric post-v0.3 (Task 5 gating, no auto-ECM) post-v0.3.1 (auto-ECM activated, mf=0)
pita predict() Cmax (2 mg) 0.00777 mg/L 0.00168 mg/L
FE vs FDA Livalo 0.0035 2.22x over 2.08x under

107-holdout AAFE invariant: pitavastatin is not in the 107-holdout, so Meta 2.679 / Engine 3.791 / ML 3.012 / In-domain Meta 2.733 are unchanged. No cache regen.

Test impact

Open follow-ups (deferred)

Closes


2026-05-03 — v0.3 ECM auto-activation gating

Branch: feat/ecm-auto-activation (PR pending) Spec: docs/superpowers/specs/2026-05-03-ecm-auto-activation-design.md Plan: docs/superpowers/plans/2026-05-03-ecm-auto-activation.md

What shipped

pipeline.predict.predict() ECM auto-activation (originally PR #9 / ae5b599) is now gated on a new ecm_applicable: bool flag in data/transporters/oatp1b1.json. Initial seed list flagged true: pravastatin only.

Three-layer registry pattern (no engine code changes):

  1. oatp1b1.json schema extension (ecm_applicable: bool per drug, default false).
  2. New is_oatp_ecm_applicable(smiles), load_oatp1b1_kinetics_for_smiles(smiles), load_hepatic_ecm_params_for_smiles(smiles) helpers in src/sisyphus/predict/transporter_db.py (mirrors PR #22 lookup_metabolic_fraction pattern; full InChIKey matching per spec §1.2).
  3. predict() checks the flag, conditionally loads kinetics/ECM, passes to build_drug_on_graph. The phenotypes= parameter (already shipped pre-v0.3 in commit 060dba5) inherits the gating: PGx scaling only affects drugs whose ECM path is wired.

Schema regression test (tests/regression/test_oatp_registry_schema.py) gates:

Triple-counting bug fix

PR #9’s pre-v0.3 wiring used find_oatp1b1_substrate_name (block-1 InChIKey) and activated ECM for every drug present in BOTH oatp1b1.json AND hepatic_ecm.json (all 5 statins). Drugs without paired metabolic_fraction entries had XGBoost-CYP enzyme affinities running at full strength PLUS OATP1B1 saturable PLUS ECM passive — triple-counting hepatic clearance. Empirical:

drug pre-v0.3 (buggy) post-v0.3 (gated)
pravastatin FE 1.07 (correct, mf=0 set) FE 1.07 (unchanged)
pitavastatin FE 2.12 (under, no mf entry) FE 0.45 (no-ECM canonical)
fluvastatin FE 4.79 (under, CYP-dominant) FE 1.54 (no-ECM canonical)
rosuvastatin FE TBD (similar bug) back to no-ECM
atorvastatin FE TBD (similar bug) back to no-ECM

107-holdout impact

AAFE invariant: Meta 2.679, Engine 3.791, ML 3.012, In-domain Meta 2.733 — all bit-identical to the 2026-05-02 baseline. Only pravastatin is in the holdout among affected drugs, and pravastatin’s predicted Cmax was already correct under PR #9 (auto-ECM was right for pravastatin specifically because it had metabolic_fraction=0 from PR #22). The fix improves production behavior on 4 non-holdout statins and any future caller passing those SMILES to predict().

CI artifact: data/validation/4track_ci_2026-05-03_v0.3.json (10k bootstrap, seed=20260422; bit-identical to 2026-05-02).

Side fixes shipped in this PR

Open follow-ups

Closes


2026-05-02 PM — clinical_pk.json digoxin SMILES correction (broader audit follow-up)

Branch: data/clinical-pk-digoxin-smiles-fix Trigger: Audit script comparing clinical_pk.json SMILES vs DrugBank inchikey_14 across 107 holdout drugs (motivated by pravastatin discovery in #25). The audit flagged 3 candidates; 1 was a script false-positive (norethindrone DrugBank-name mismatch with DB14678 enanthate ester), 1 was already-fixed (pravastatin), and 1 was real: digoxin.

Diagnosis: clinical_pk.json carried a SMILES for “digoxin” that resolved to formula C30H48O16 (MW 664.70) — a sugar polymer with no steroid aglycone, just 4 sugar rings and a butenolide. Real digoxin (PubChem CID 2724385) is C41H64O14 (MW 780.95) — a cardiac glycoside with the digoxigenin steroid aglycone + 3 digitoxose sugars. Connectivity-level mismatch (InChIKey block 1 NYNHXAUTBGPYHF vs canonical LTMHDMANZUZIPE).

DrugBank’s stored canonical_smiles for DB00390 is also wrong — it parses to HZJGATJTJCKOLT block 1, formula C40H62O11. DrugBank’s own inchikey_14 column says LTMHDMANZUZIPE (correct), so DrugBank has internally inconsistent records. PubChem CID 2724385 is the authoritative source.

Fix: Replace clinical_pk.json digoxin SMILES with PubChem-canonical (full stereochemistry, RDKit-canonicalized for storage). One-line data change.

Concrete metric movements (107-holdout, regenerated):

Track Pre (post-#27) Post Δ
Meta 2.6852 2.6785 -0.0066 (-0.25%)
Engine 3.7326 3.7907 +0.0581 (+1.56%)
ML 3.0110 3.0121 +0.0011 (~0)
In-domain N 80 79 digoxin → out-of-AD
In-domain Meta 2.7186 2.7333 +0.0147

digoxin individual entry:

Honest interpretation:

95% bootstrap CIs (regenerated, 10k resamples, seed=20260422; artifact data/validation/4track_ci_2026-05-02.json overwritten with PM values):

All point estimates within prior CIs — statistical narrative preserved.

Audit completion: The clinical_pk.json broader scan is complete for the 107-holdout subset. 1 of 107 drugs (digoxin) had a real connectivity error beyond pravastatin’s. The audit script flagged 3 candidates; the false-positive rate was 1/3 (norethindrone, due to name-matching ambiguity with the enanthate ester DrugBank entry). The remaining 104 holdout drugs match DrugBank’s inchikey_14 block-1 cleanly. Atorvastatin’s stereo-stripped reference SMILES (block 1 matches but stereo block differs) is a non-issue — Morgan FP and engine chemistry are stereo-insensitive at the relevant levels.

DrugBank’s own data quality issues (DB00175 pravastatin and DB00390 digoxin both have wrong canonical_smiles despite correct inchikey_14) are out of scope for this repo. Worth flagging upstream if Sisyphus’s authors interact with the DrugBank maintainers.


2026-05-02 — clinical_pk.json pravastatin SMILES correction unlocks #9 auto-ECM (#25)

Branch: data/clinical-pk-pravastatin-smiles-fix Trigger: Discovered during issue #9 (auto-load OATP1B1 ECM): the InChIKey-based substrate lookup in pipeline.predict.predict() could not match clinical_pk.json’s pravastatin to the registry because the reference SMILES carried a different molecule connectivity (extra ring double bond — InChIKey block 1 TUZYXOIXSAXUGO vs PubChem CID 54687 GOSGZXISMCZCDW).

Fix: Replace data/reference/clinical_pk.json pravastatin entry’s SMILES with PubChem-canonical (full stereochemistry preserved). One-line data change; no code change.

Concrete metric movements (post-fix benchmark, 4-track regenerated):

Track Pre Post Δ
Meta 2.6947 2.6852 -0.0096 (-0.36%)
Engine 3.7575 3.7326 -0.0249 (-0.66%)
ML 3.0571 3.0110 -0.0461 (-1.51%)
In-domain Meta 2.7316 2.7186 -0.0130
In-domain Engine 3.5734 3.5419 -0.0315
In-domain ML 3.0430 2.9818 -0.0612

Pravastatin individual entry: engine fold 0.415 → 0.844 (under 2.4× → under 1.18×), ML fold 0.129 → 0.654, Meta fold 0.546 → 1.252 (passes 2-fold gate from the over-prediction side). ML moves materially because the corrected SMILES produces different Morgan FP than the wrong-connectivity input.

95% bootstrap CIs (10k resamples, seed=20260422, regenerated artifact data/validation/4track_ci_2026-05-02.json):

All CIs effectively unchanged — the point-estimate movement is within bootstrap noise. Headline narrative “Meta ~2.7, Engine ~3.7, ML ~3.0” preserved with each estimate slightly improved.

Why the fix works: The original reference SMILES was structurally wrong (saturated decalin replaced by a more-unsaturated tetrahydronaphthalenone) — not just a stereo-stripped variant. RDKit faithfully canonicalized this wrong molecule and the entire downstream chemistry (logP, Kp, ADME XGBoost predictions, Morgan fingerprints) used wrong-molecule properties. Replacing with PubChem-canonical:

  1. Produces correct Morgan fingerprints → ML prediction shifts
  2. Produces correct logP/Kp → engine ADME shifts
  3. InChIKey now matches OATP1B1 substrate registry → PR #9’s auto-ECM activates → engine routes hepatic clearance through the ECM transporter path with metabolic_fraction=0 (PR #22)

The three changes compound: the holdout benchmark sees pravastatin’s predict() flow change from “wrong-molecule chemistry + no ECM + XGBoost CYP path” to “correct-molecule chemistry + ECM-only hepatic clearance via OATP1B1”.

Issue #8 status: pravastatin’s holdout fold now 1.25 (passes 2-fold gate). The motivating GenoADME population AUC validation needs separate confirmation in that repo, but the Sisyphus-side underprediction tracked in #8 is essentially closed by the chain #22 → #9 → #25.

Aftermath / follow-ups:


2026-05-02 — OATP1B1/ECM auto-load on predict() (#9)

Branch: feat/predict-auto-ecm Trigger: PR #22 closed the architectural double-counting but only helped manual ECM callers. pipeline.predict.predict() did not activate ECM by default, so the metabolic_fraction registry had zero effect on the production benchmark. Issue #9 tracked this gap.

Fix: Auto-detect registered OATP1B1 substrates by canonical InChIKey (connectivity block) in predict(), then load both transporter_kinetics + hepatic_ecm_params from existing registries. Auto-load gated on BOTH registries having the drug; warning tag oatp1b1:auto_ecm:<name> on the result for audit.

Headline impact at merge: bit-identical (issue #25 SMILES error in the holdout reference for pravastatin prevented the lookup from matching). After #25 fix shipped, the auto-load path becomes active for pravastatin and contributes to the metric movements above.

Why InChIKey block 1 matching: SMILES sources sometimes strip stereochemistry annotations. Matching on the full InChIKey would miss those variants; matching on the connectivity block (first 14 chars) tolerates stereo differences. False positives across the 7 currently-registered substrates not a concern (all distinct connectivity).


2026-05-02 — OATP1B1/ECM reconciliation: XGBoost-CYP / ECM-OATP double-counting resolved (#12-#14)

Branch: feat/oatp1b1-ecm-reconciliation Trigger: GenoADME Tier 1 PARTIAL on pravastatin; test_oatp_ecm_statins[pravastatin] xfail under post-Hardening realize_means() (FE drifted 1.486 → 1.823); GitHub issues #12 (#8a) / #13 (#8b) / #14 (#8c) sequencing the fix.

Root cause: build_drug_on_graph(profile, adme, ..., transporter_kinetics, hepatic_ecm_params) always decomposed XGBoost hepatocyte CLint into per-enzyme affinities AND, separately, applied the OATP1B1 ECM clearance when transporter+ECM kwargs were supplied. The two clearances added at the simulation layer. For uptake-dominated substrates (canonical: pravastatin, ~85% OATP1B1), in vitro hepatocyte CLint already integrates the OATP1B1 contribution, so this counted the same clearance twice.

Fix: Per-drug metabolic_fraction registry that scales the metabolic-path enzyme_affinities derived from XGBoost CLint. When the engine’s ECM machinery is active for a drug whose hepatocyte CLint is uptake-dominated, the registry routes the entire hepatic clearance through the ECM transporter path without double-counting. Default 1.0 (no scaling) for the 106 unregistered holdout drugs.

Test invariant redesign (#13): The pre-#12 cmax_on/cmax_off < 0.95 invariant in test_oatp_pravastatin is mathematically incompatible with the post-fix model — with metabolic_fraction=0, the “off” arm has no hepatic clearance for pravastatin and Cmax goes very high. Replaced with SLCO1B1 EM/PM phenotype check: PM (OATP1B1 × 0.10) must raise Cmax vs EM. Empirical: cmax_em=0.0422, cmax_pm=0.1280, ratio=3.034 (clinical literature: ~2-3× AUC under PM).

Abundance recalibration (#14): scripts/calibrate_oatp_abundance_ecm.py post-#12 still recommends the existing liver.transporters.OATP1B1.mean = 5.0e5 (FE 1.058 vs FDA pravastatin 0.045 mg/L). The Hardening-era T7 drift was a downstream symptom of the double-counting, not an abundance miscalibration. PS_active 502 L/h remains outside the Watanabe 2009 literature range [0.5, 2.0] — separate ECM IVIVE-scaling concern (DE-33 adjacent), not a #12 deliverable.

Concrete metric changes:

Headline invariance: pipeline.predict.predict() does not activate ECM/transporter machinery by default — build_drug_on_graph is called without transporter_kinetics or hepatic_ecm_params. The 107-holdout benchmark predicts via the default path, so the metabolic_fraction registry has zero effect on production AAFE. 4track artifact bit-identical pre-vs-post #12 (Meta 2.6947, Engine 3.7575, ML 3.0571). The fix is targeted at ECM-active code paths (GenoADME Tier 1, PGx-aware predictions, calibration script, integration tests).

Aftermath: Issue #21 (fluvastatin under-prediction) opened to track the opposite-direction failure. The metabolic_fraction registry is extensible to (B)-flavor per-drug fractions in v0.3 (atorvastatin ~0.7 CYP3A4, rosuvastatin ~0.15 CYP, etc.) by adding entries; no further code changes required.


2026-05-01 — Hardening: mean-only deterministic realization (RNG-order coupling resolved)

Branch: feat/hardening-mean-only Trigger: Engine drift bisect from 2026-04-29 entry — investigation revealed +19.1% Engine drift was NOT real model degradation but RNG-order coupling.

Root cause: predict() with n_mc_samples=0 (deterministic default) used graph.sample(rng=np.random.default_rng(42)), which:

The sample(rng=42) realized values were treated as “deterministic” but were actually a single specific lognormal draw at each position — vulnerable to ANY upstream YAML change.

Fix: Add BodyGraph.realize_means() and DrugOnGraph.realize_means() methods that use dist.mean directly instead of dist.sample(rng). predict() and test_engine_validation now use these. ~120 lines.

Headline AAFE delta (v2 baseline 2026-04-30 → Hardening 2026-05-01):

Track v2 baseline Hardening Δ (%) Note
Meta (Overall) 2.702 2.695 -0.3% Restored to pre-Achour 2026-04-14 value
Engine (Overall) 3.572 3.757 +5.2% Was seed-favorable at 3.572; mean-only is canonical
ML (Overall) 3.057 3.057 0% Invariant
In-domain Meta 2.730 2.732 +0.07% Within CI noise

Bisect interpretation (resolves 2026-04-29 follow-up):

The Meta value 2.695 from Hardening EXACTLY matches the pre-Achour 2026-04-14 value, confirming the Engine “drift” narrative was entirely RNG-order artifact. Engine track value 3.421 from yesterday’s manual cv=0 zeroing was a partial-zeroing artifact (cardiac_output and other globals not zeroed); 3.757 is the truly canonical mean-only value.

Test impact:

Architectural significance:

Files:

Follow-ups:


2026-05-01 — Prodrug Activation v3 (input-data refresh, all-disposition)

Branch: feat/prodrug-activation-v3 (gated on v2 PR #7 merge per spec §8.1, satisfied 2026-04-30 by 78d12e3). Spec: docs/superpowers/specs/2026-04-29-prodrug-activation-v3-design.md Plan: docs/superpowers/plans/2026-04-29-prodrug-activation-v3.md (19 tasks across 5 phases — all complete) Literature deliverable: docs/superpowers/specs/2026-04-29-prodrug-v3-literature.md

Per-item dispositions (mechanistic-A doctrine compliant per spec §3.3):

# Item Disposition Citation primary Code change
1 BH4 CL/Vd (sepiapterin) ceiling_accepted Feillet 2008 + FDA Kuvan + EMA EPAR (F not known) v3_metadata only
2 GS-441524 CL/Vd (remdesivir) literature_applied Tamura 2023 + Leegwater 2022 (popPK geomean) CL 10→17.4, V 35→535
3 R406 CL/Vd (fostamatinib) literature_applied Matsukane 2022 (IV microdose review) CL 28→15.7, V 250→256
4 tebipenem CL/Vd ceiling_accepted Eckburg 2019 (V/F surrogate rejected) v3_metadata only
5 SPR proteomic abundance ceiling_accepted HPA + Wu 2020 (animal-only) v3_metadata only
6 CES2/tebipenem CLint ceiling_accepted Gupta 2023 (no isoform attribution) v3_metadata only

Outcome:

Significance: v3 closes the input-data quality pillar of the prodrug saga (v1→v2→v3) with rigorous mechanistic-A discipline. 4 items closed as ceiling because primary literature truly does not exist (F_sapropterin, F_tebipenem, human SPR proteomic, in vitro CES2/tebipenem). 2 items advanced via popPK geomean. Empirical Cmax fold-errors barely shifted because:

  1. observation_species=parent for remdesivir → active CL/V update doesn’t move parent Cmax
  2. fostamatinib extraction rate-limits (well-stirred E~1 at high CLint) → active CL change has marginal Cmax effect
  3. Items 1, 4, 5, 6 unchanged values

This is the canonical mechanistic-A outcome: “we know the literature gap exists; we documented it; we did not fudge to pass”. v4 candidates require new mechanistic terms (extra-hepatic esterase, BH4 first-pass depletion, etc.) — beyond data refresh.

Test impact:

Files:


2026-04-30 — Prodrug v2 PR #7 — RNG-order discovery + cache regen

Trigger: v2 PR (feat/prodrug-activation-v2) CI failure on test_engine_validation::test_cmax_within_5pct[midazolam, caffeine, warfarin] — Cmax shifted 6-19% above Omega targets.

Diagnosis: v2 added new lognormal enzyme distributions (SPR/CES1/CES2/ALPI) to physiology YAML at liver, gut_wall, and kidney nodes. BodyGraph.sample(rng) iterates nodes in YAML insertion order, so adding a cv>0 distribution at kidney (which previously had no enzymes block, position 4 in YAML, BEFORE liver) consumed 1 RNG draw before liver’s CYP3A4 sample. This shifted all liver CYP samples → midazolam Cmax +18.5%. Liver/gut_wall enzyme additions were appended AFTER existing CYPs, so existing CYP samples preserved BUT new draws shifted downstream OATP1B1 transporter sample → ECM-pathway holdout drugs drifted 8-27%. Test was passing on main due to RNG-order coincidence with seed=42.

Fix (commit 6c121ce): Move kidney YAML node block to after gut_wall. Preserves all v2 mechanistic content (kidney SPR retained for sepiapterin renal contribution); only changes RNG sample order. ODE state index accessed via name lookup throughout — functionally invariant.

Cache regen (commit 6528ba8): ECM holdout regression test (5% drift gate) failed because v2’s enzyme additions still shift OATP1B1 sample even with kidney moved (liver enzyme appendage is the irreducible cause). data/training/4track_holdout_predictions.json regenerated against PR src + Option D YAML to capture v2 baseline.

Aggregate AAFE delta (main 2026-04-29 → v2 2026-04-30):

Track main (2026-04-29) v2 (2026-04-30) Δ (abs) Δ (%)
Meta (Overall) 2.719 2.702 -0.017 -0.6%
Engine (Overall) 4.073 3.572 -0.501 -12.3%
ML (Overall) 3.057 3.057 0 0%
Meta (In-domain) 2.759 (n=80) 2.730 (n=80) -0.029 -1.1%

Meta %2-fold/3-fold unchanged (46.7%, 62.6%). Engine %3-fold improved 40.2 → 53.3.

Significance:

spec §6.1 invariance violation: v2 spec §6.1 promised “107-holdout invariance” — actually impossible because adding any cv>0 enzyme to a node consumes RNG draws and shifts downstream samples. Spec assumption was wrong. Real invariance requires either (a) per-node independent RNG seeding, or (b) deterministic mean-only realization. Both deferred to hardening backlog.

Test impact:

Follow-ups (queued):

  1. Refresh bootstrap 95% CIs against v2 cache post-merge (10k resamples, seed=20260422).
  2. Update CLAUDE.md headline AAFE table post-merge.
  3. Hardening: deterministic mean-only realization for engine validation tests (eliminates RNG-order fragility).
  4. v3 spec §5 Item 5 amendment: kidney 3e4 retained but at YAML position-after-gut_wall (already in this commit; v3 spec wording may need clarifying).

Files:


2026-04-29 — 4-track holdout predictions regen (post-P4.5 baseline refresh)

Trigger: tests/integration/test_ecm_holdout_regression.py failing on main — 10/10 spot-checked drugs drifted 15-27% lower than cached. Investigation revealed the cache (data/training/4track_holdout_predictions.json) was last written 2026-04-14, before P4.5 Achour merge (2026-04-23) and other ECM/V3-routing changes.

Action: Re-ran scripts/run_engine_benchmark.py --save-json data/training/4track_holdout_predictions.json on current main. Backup of pre-regen cache stashed at /tmp/4track_pre_regen_2026-04-29.json (not committed).

Aggregate AAFE delta (PRE 2026-04-14 cache → POST 2026-04-29 fresh):

Track PRE POST Δ (abs) Δ (%)
Meta (Overall) 2.695 2.719 +0.024 +0.9%
Engine (Overall) 3.421 4.073 +0.652 +19.1%
ML (Overall) 3.057 3.057 0 0%
Meta (In-domain) 2.710 (n=85) 2.759 (n=80) +0.049 +1.8%
Engine (In-domain) 3.236 (n=85) 3.808 (n=80) +0.572 +17.7%

Meta %3-fold: 65.4 → 62.6. Engine %3-fold: 57.9 → 40.2.

Significance:

Test impact:

Follow-up needed:

  1. Refresh bootstrap 95% CIs against new cache (via cherry-picking-process bootstrap script, 10k resamples).
  2. Investigate Engine-track AAFE drift root cause: bisect from 2026-04-14 to 2026-04-29 if needed; primary suspects are P4.5 Achour and ECM migration commits.
  3. Document AD-criteria change (n=85 → n=80) — which 5 drugs newly flagged?
  4. Decide on pravastatin T7 recalibration (was the T7 calibration tied to a pre-P4.5 cache?).

Files updated:


2026-04-22 — Achour 2021 Correlated Physiology Prior (P4.5 infrastructure)

Spec: docs/superpowers/specs/2026-04-22-achour-abundance-correlation-design.md Plan: docs/superpowers/plans/2026-04-22-achour-abundance-correlation.md Branch: feat/achour-correlated-abundance (merged commit TBD).

Outcome: Infrastructure landed. Distribution gains optional correlation_group field; new sisyphus.physiology.correlation_registry provides multivariate-lognormal sampling; generate_physiology(rng=) opt-in; reference_man.yaml liver node migrated to Achour 2021 CVs with OATP1B1 independent (mean_r=0.234 < 0.3 threshold, empirical Achour Table S7 inclusion rule).

Gates passed:

Non-outcome: SBC improvement is explicit Non-Goal (§1 spec). Downstream P4.5a spec will retrain the SBI amortizer with physiology sampling and re-measure SBC on the 52-cell grid.

Data artifacts:

Source: Achour 2021 CPT 109:222-232 (PMC7839483, CC BY-NC 4.0).


2026-04 (current session)

V3 IV-Cmax methodology + ECM re-run + fup confound rule-out (2026-04-22)

Infrastructure shipped (7 commits, 4630b0b..4e10ad2): Route-aware t_min_h = _IV_CMAX_DELAY_H (5/60 h) if route=="iv" else 0.0 threaded through solve(), solve_mc(), compute_endpoints(), propagate_fast() (scipy backend), pipeline. Oral (107 holdout + production) byte-identical to V2 — pinned by tests/integration/test_v3_oral_regression.py. 562 pass / 4 skip / 2 xfail, zero new failures.

ECM generalization re-run under V3 (7aa49ae, data/validation/oatp_generalization_result_v3.json): Formal Mode C. Direction flipped from V2: V2 appeared to over-predict 1.1–1.35× but that was the t=0 artifact. V3 with windowed Cmax shows systematic underprediction 2.5× on both drugs.

Drug Observed V2 (artifact) V3 (real) V3 PI V3 log10 FE
glimepiride 0.243 0.270 (1.11×) 0.095 [0.087, 0.101] −0.409
valsartan 4.02 5.405 (1.35×) 1.940 [1.80, 2.06] −0.316
Median log10 FE = 0.363 < 0.5 Mode B gate → formally Mode C, but same-direction underprediction is substantively suggestive of systematic ECM over-clearance for non-statin OATP1B1 substrates. V2’s apparent “near-pass” was a methodology illusion; V2 result preserved as .v2.json.

Diagnostic (5ff72eb, data/validation/v3_fup_override_diagnosis.json): fup override (valsartan predicted 0.009 → clinical 0.050, 5.6× increase) gave Cmax 0.97× — essentially no change. Glimepiride predicted fup already matches clinical (0.005). Predict-layer fup confound RULED OUT as cause of V3 underprediction.

Remaining candidates for V3 underprediction (not investigated this session):

  1. ECM Jmax values too high for valsartan/glimepiride (valsartan Jmax flat-CLuptake-scaled from pravastatin under v2.1; glimepiride from literature Huang 2018)
  2. Vss/Kp over-distribution (tissue holds too much drug → too little in blood at 5 min)
  3. ECM architecture limit for Km > 1 µM range (pravastatin Km ≈ 13.6, glimepiride 10.0, valsartan 1.39 — three-order-of-magnitude sweep within tested substrates)

Pre-registration integrity maintained: V3 methodology spec written + committed (d88183a) BEFORE engine re-run (7aa49ae). Single MC run. Fup diagnostic explicitly marked exploratory ("note": "NOT a pre-registered run"). No post-run parameter adjustment.

How to apply:


ECM generalization test, N=2, Mode C with diagnostic findings (2026-04-21)

SUPERSEDED by V3 run (2026-04-22) above. Original V2 result preserved as data/validation/oatp_generalization_result.v2.json. Kept here for historical context only.

Spec: docs/superpowers/specs/2026-04-21-ecm-generalization-test-design.md

Plan: docs/superpowers/plans/2026-04-21-ecm-generalization-test.md (commit 3c85fe4)

Result: data/validation/oatp_generalization_result.json (commit 4fb6d38)

Formal outcome: Mode C (inconclusive)

Per drug:

Substantive signal: Both point estimates within 1.5× of observed — well inside the 3× clinical-error gate. If PI were non-degenerate and contained observed, outcome would have been Mode A (confirmed generalization within tested domain). Suggestive-positive for ECM mechanism but NOT formally confirmed.

Why PI is zero-width (root cause): MC Cmax for IV bolus in Sisyphus = dose / V_venous_blood (deterministic t=0 instantaneous value, 3.7 L ± 0.0). Distributional CVs downstream (Jmax, Km, fup, Kp, ps_*) never reach Cmax because max-over-time selects t=0. All 1000 samples produce identical output.

Secondary gap: data/transporters/hepatic_ecm.json lacks entries for valsartan + glimepiride → ps_passive/ps_eff/cl_int_bile fell to defaults (1e6 L/h for ps_*, 0 for bile). Not the cause of zero-PI but a data completeness gap worth closing.

Predict-layer confound flag (per spec §Peff Isolation):

Pre-registration integrity: Single run at N=1000, seed 42. No post-run parameter adjustment. All spec/plan amendments (v2, v2.1) pre-dated the engine execution. Substrate swap (bosentan/repaglinide → glimepiride) was documented under v2 amendment BEFORE any engine run, driven by data-access limits not expected outcome.

Commits:

Follow-up recommended (separate task, not this session):

  1. Design a v3 engine methodology for IV-Cmax observation that matches clinical semantics (non-t=0 or different node).
  2. Populate hepatic_ecm.json for non-statin OATP1B1 substrates.
  3. Improve fup XGBoost for valsartan-class high-fup-bound drugs.
  4. Pursue institutional library access for bosentan/repaglinide primary sources to re-enable N=3 test.

OATP ECM hepatic clearance — IMPLEMENTED (2026-04-21, branch feat/oatp-ecm)

OATP Phase 2B — SLCO1B1 phenotype (2026-04-20, commit 93febe3)

OATP Phase 2A — statin data expansion (2026-04-20, commit 3a04291, data-only)

P6 SBI likelihood reweighting (2026-04-19)

P7 Ketorolac AD flag (2026-04-19)

P4 Continuous Hierarchical Infrastructure (2026-04-16, branch feat/continuous-hierarchical)

Session additions (2026-04-14 evening)

v3 OATP expansion — NEGATIVE (2026-04-14, commit 5c0d864, reverted fdda41c)

See DE-32.

Phase 1 OATP1B1 (2026-04-15, branch feat/oatp1b1-pravastatin)

Phase 2.0.5 — SBI routing expansion (2026-04-12, commits ccc15a0 code + 43051ab eval)

Track D2 + paper-blocker bundle (2026-04-11, docs/tdm_ci_calibration.md)

Paper-blocker re-measurement (2026-04-11)

Track A — multi-drug NPE (2026-04-10, docs/sbi_multi_drug_results.md)

Track B — SBI production integration (2026-04-10, docs/sbi_multi_drug_results.md Addendum)

Track D1 — neural surrogate (2026-04-10, docs/surrogate_ood_fix.md)

Initial:

Follow-up (ensemble-std gate, hybrid routing):

Track C1 — hierarchical SBI (2026-04-12 code, 2026-04-14 2kθ eval)

Branch consolidation (2026-04-10, merge commit c0cab88)

audit/holdout-leakage-fix + feat/ude-diffrax merged. VDss 4th-track production added, EnKF TDM added, prospective validation series integrated, JAX backend consolidated. Post-merge AAFE 2.808 → 2.695 confirmed. tdm.py latent bug exposed and fixed (method="enkf" wrong kwarg + EnKFResult → TDMResult conversion).

2026-04-10 post-merge diagnosis update


2026-03 (earlier)

Holdout expansion (2026-03-26)

Measured ADME PoC (2026-03-26)

v2.0 multi-dose validation

v2.1 TDM validation

v2.1 TDM multi-drug benchmark (2026-03-27)

Metric 1 obs 2 obs 3 obs
Mean CV reduction 78.1% 82.7% 82.9%
Mean error reduction 79.4% 80.8% 79.1%
Mean posterior CV 8.4% 6.5% 6.4%

Engine-only ablation


Contamination fix (2026-04-04, commit 5e5a3d0)


Shipped-phase checklist (completed)

Detailed per-phase milestones: see phase-completion.md (local-only; moved to docs/_internal/ in PR #51).


How to add new entries

Prepend a new section at the top of the appropriate date block. Each entry should have:

If an entry documents a failure, also append it to dead-ends.md with the next DE-NN id.