Reverse-chronological. Top-level CLAUDE.md carries only the current headline numbers; this file is the history. For the authoritative failed-experiment list (with do-not-retry gating), see dead-ends.md. For the why-accuracy-is-bounded analysis, see diagnosis.md. Note (PR #51, 2026-05-30): several internal scratchpad docs (backlog.md, phase-completion.md, landmarks.md, hardening_backlog.md) moved to docs/_internal/ (gitignored). Inline links to those paths in the dated entries below are immutable historical records and resolve only in a working tree that retains the internal docs.
A full-codebase scientific+mathematical audit (10 subsystems, adversarial verification of every finding) surfaced one critical structural error in the engine, which an independent triple-verification confirmed and an empirical engine probe reproduced. Branch fix/flux1-extraction-double-count; spec docs/superpowers/specs/2026-06-03-flux1-extraction-double-count-design.md.
The bug. Liver and gut_wall are perfusion compartments: each has an explicit convective outflow FlowEdge carrying Q·c_out and a ClearanceEdge. The clearance flux applied the whole-organ clearance CL_h = Q·fup·CLint/(Q+fup·CLint) — which already embeds the flow limitation Q — to the outlet concentration c_out. Combined with the separate Q·c_out washout, the steady-state mass balance Q·C_in = Q·c_out + CL_h·c_out yields realized extraction E = CL_h/(Q+CL_h) = fup·CLint/(Q+2·fup·CLint) — a literal extra factor of 2 on fup·CLint, capping E at 0.5 (canonical →1.0). The engine structurally could not extract >50% of liver/gut inflow, flooring oral first-pass F near 0.25 regardless of CLint.
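The double-count can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming a single well-stirred compartment at steady state; the function names are illustrative and are not taken from engine/flux.py, and `x` stands for fup·CLint.

```python
def whole_organ_clearance(q: float, x: float) -> float:
    """Well-stirred whole-organ clearance CL_h = Q*x/(Q+x), where x = fup*CLint."""
    return q * x / (q + x)


def extraction_double_counted(q: float, x: float) -> float:
    """Realized E when CL_h (already flow-limited) acts on c_out
    alongside the separate convective washout Q*c_out.

    Mass balance: Q*C_in = Q*c_out + CL_h*c_out  =>  E = CL_h/(Q+CL_h).
    Algebraically this equals x/(Q+2x), so E can never exceed 0.5.
    """
    cl_h = whole_organ_clearance(q, x)
    return cl_h / (q + cl_h)


def extraction_canonical(q: float, x: float) -> float:
    """Canonical well-stirred extraction E = x/(Q+x), which tends to 1 as x grows."""
    return x / (q + x)


if __name__ == "__main__":
    q = 100.0  # hypothetical organ blood flow, arbitrary units (assumption)
    for x in (10.0, 100.0, 1_000.0, 100_000.0):
        print(x, extraction_double_counted(q, x), extraction_canonical(q, x))
```

Whatever value of Q is chosen, the double-counted form saturates just below 0.5 while the canonical form approaches 1.0, which is the ceiling described above.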
Triple verification. (1) Topology: reference_man.yaml confirms liver inflow 0.255·CO + separate liver→venous 0.255·CO outflow + liver→metabolized_hepatic (extended) clearance; total_inflow == convective Q. (2) Algebra: E=x/(Q+2x) reproduced to 8 digits. (3) Empirical probe of the real flux code: at fup·CLint=5548, E_engine = 0.496 (well_stirred) / 0.495 (extended) vs canonical 0.982. Caps at 0.5 on both production paths.
Fix. Apply the intrinsic (flow-unlimited) clearance to c_out in all four clearance models (engine/flux.py + JAX rhs_jax.py): well_stirred/prodrug → fup·CLint·c_out; extended/ECM → CL_int,hep·c_out where CL_int,hep = fup·ps_inf·cl_int_h/(ps_eff+cl_int_h) (the ECM clh is exactly the well-stirred wrap of this); parallel_tube (unused) → intrinsic + comment. Combined with the separate convective edge, the canonical E→1.0 limit then emerges. New regression test tests/unit/test_extraction_ceiling.py (E>0.9 at high fup·CLint, exact x/(Q+x) match).
Re-anchor. Liver enzyme affinities are XGBoost-decomposed (_decompose_clint: abundance×affinity×ivive = CLint_hepatic, the true in-vitro intrinsic clearance) → no liver recal. Only the gut CYP3A4 abundance (the midazolam back-fit) was tuned against the wrap: scaled 2.12e7 → 1.38e7 (×0.652 = Q_gut/(Q_gut+fup·CLint_gut) at midazolam), holding midazolam E_gut=0.2582 invariant (verified exactly). midazolam is train, not holdout (Invariant #5 ✓).
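The ×0.652 factor follows from requiring the same realized extraction before and after the fix. The sketch below illustrates that invariance under the single-compartment steady-state assumption; the function names are hypothetical, and the ratio x_old/Q = 0.5337 is back-derived here from the reported 0.652 factor rather than read from the code.

```python
def extraction_wrapped(q: float, x: float) -> float:
    """Pre-fix realized extraction: whole-organ CL_h applied to c_out."""
    cl_h = q * x / (q + x)
    return cl_h / (q + cl_h)


def extraction_intrinsic(q: float, x: float) -> float:
    """Post-fix realized extraction: intrinsic clearance applied to c_out."""
    return x / (q + x)


def reanchor_scale(q: float, x_old: float) -> float:
    """Scale on x (and hence on abundance) that keeps E unchanged across the fix.

    Setting x_new/(Q+x_new) = CL_h/(Q+CL_h) gives x_new = CL_h = x_old*Q/(Q+x_old).
    """
    return q / (q + x_old)


if __name__ == "__main__":
    q = 1.0          # work in units of Q_gut (assumption)
    x_old = 0.5337   # fup*CLint_gut / Q_gut, back-derived from 0.652 (assumption)
    scale = reanchor_scale(q, x_old)
    print(scale)                                   # close to 0.652
    print(extraction_wrapped(q, x_old))            # pre-fix E_gut
    print(extraction_intrinsic(q, x_old * scale))  # post-fix E_gut, same value
```

Because abundance enters fup·CLint_gut multiplicatively, scaling the abundance by the same factor (2.12e7 → 1.38e7) leaves the midazolam anchor unchanged.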
Outcome (correctness-first; the fix REGRESSES the headline, reported honestly). ⚠ Benchmarking-error correction: an initial run reported Meta 2.698→2.625 (improvement), but that was developer-state (data/drugbank/+logp_correction.json present, i.e. non-canonical; CLAUDE.md flags this exact trap). Re-run in the canonical public-clone state (artifacts hidden, same macOS stack, apples-to-apples pre-vs-post): Meta 2.762 → 2.784 (+0.8%, WORSE), %2-fold 45.8→43.9, %3-fold 63.6→62.6; 22 holdout drugs worse, 17 better; in-domain post-fix 2.833 (N=81). Engine track 3.999→4.458. High-first-pass actives correct toward observed (selegiline 20.2×→9.4× over, oxybutynin 8.2×→4.7×, methylphenidate 24.7×→17.8×, venlafaxine 3.8×→2.1×), but more drugs were helped by the under-extraction bug than were hurt by it (carbinoxamine 0.86→0.38, amantadine 0.96→0.47, pindolol 0.24→0.14; well-predicted/under-predicted drugs get worse). This is the error-cancellation ceiling (§2) cutting against us: the wrong formula was load-bearing as calibration. Per the user’s call (2026-06-04, [[correctness-over-benchmark]]), correct physics ships even at a worse benchmark: “A high number produced by a wrong formula is meaningless.” DE-43 still holds (engine ±15-17%, meta ±2-3%; not a headline lever).
Canonical regen — DONE on the CI Linux stack (no developer Linux box needed). The committed cache was first left at the canonical pre-FLUX-1 2.698 and the stale tests xfailed, then a one-off workflow .github/workflows/flux1-regen.yml ran scripts/regen_flux1_canonical.py on ubuntu-latest/py3.10/requirements-lock.txt (a fresh checkout is auto public-clone — the dev artifacts are gitignored — and it’s the same stack ci.yml validates against). It uploaded the regenerated cache + leak-audit baseline + CI bootstrap as artifacts; downloaded and committed. Canonical post-FLUX-1: Meta 2.784, in-domain 2.833 (N=81), Engine 4.458, %2-fold 43.9, %3-fold 62.6 (CI data/validation/4track_ci_2026-06-04_flux1.json). Notably the CI-stack numbers matched the macOS public-clone numbers exactly (Meta 2.784, tebipenem 0.3109) — there was no real macOS↔CI drift; the prior 2.698 was simply from an older CI stack, so the 2.698→2.784 headline move is ~+0.8% FLUX-1 effect (same-stack 2.762→2.784) plus a stack refresh. Updated: cache, prodrug_v3_pre_baseline.json, tebipenem _PINNED (0.4553→0.3109), the cache-pin (renamed test_cached_holdout_aafe_is_2p784, asserts 2.784); removed the cache/baseline xfails (test_ecm_holdout_spot_check, test_enzyme_leak_audit, tebipenem). Still xfailed (separate follow-up): test_oatp_ecm_statins/test_predict_auto_ecm for pravastatin+pitavastatin — the ECM fix changed their Cmax and the OATP1B1 abundance was calibrated against the wrap; re-anchor OATP1B1 abundance to a non-holdout OATP1B1 substrate (rosuvastatin/pitavastatin) to un-xfail them (pravastatin is holdout, can’t be the anchor).
The DE-41/42/43 reframe. First-pass-F under-prediction was a fixable formulation bug, not an irreducible floor — DE-41/42/43 had mis-attributed it to a calibration limit because they only tested recalibration (which is foreclosed: ka linear → flat scalar, DE-42). FLUX-1 is a structural formula correction, a different category. DE-43 still holds: the fixed-weight meta damped the engine move (engine ±15-17%, meta ±2-3%) on both benchmarks — the engine is still not a headline lever, but its first-pass physics is now correct. diagnosis.md §8 reshaped.
Test triage (stack-independent fixes — committed). Formula-encoding unit tests (test_ecm_flux ×3, test_prodrug_v2_flux, test_prodrug_v2_mass_balance, test_flux_fu_correction ×3) updated from the whole-organ wrap to the intrinsic clearance — these are formula references, stack-independent. Omega-parity goldens midazolam 0.006943→0.005909, propranolol 0.1355→0.082528 (predecessor shared the double-count; caffeine/warfarin low-extraction, unchanged — verified within the 5% gate on CI). test_tdm_enkf[morphine] stale precondition updated (EnKF shift mechanism intact). The dev-state local suite was 903 passed / 4 xfailed / 0 failed; the public-clone state (= CI) then surfaced 6 more failures — all the stack-sensitive cache/golden tests listed in the handoff paragraph above, now xfailed pending canonical regen (1 passed + 7 xfailed in the public-clone spot-run, 0 failed).
DE-42/DE-43 foreclosed every engine-recalibration route to the F under-call and named exactly one un-foreclosed lever: per-drug measured-F routing. Built it as MeasuredADMEInput.f_bioavail (oral bioavailability, 0 < F ≤ 1), extending the SP1 measured-ADME channel. Branch feat/measured-f-routing; spec docs/superpowers/specs/2026-06-03-measured-f-routing-design.md.
Mechanism (exposure-scaling, approved). F is emergent in the engine (fa·Fg·Fh) — there is no F input. predict() computes the engine’s own oral F via an IV-reference solve (F_engine = oral AUC / IV AUC; clearance cancels, so it is the pure structural fraction), then scales engine Cmax/AUC by k = F_measured/F_engine (clamped [0.05, 50]; f_bioavail_cv folded into the CV in quadrature). Pipeline-layer only — engine stays identity-blind (Invariant #1). Oral-only (ignored + warned for IV). Lands on result.engine_pk; the production meta path is bit-identical when f_bioavail is None (4-SMILES exact-float test + 28-case measured suite).
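The exposure-scaling step can be sketched as below. This is an illustration of the mechanism as described in this entry, not the pipeline code: the function names are hypothetical, and only the clamp bounds, the 0 < F ≤ 1 constraint, and the quadrature rule come from the text.

```python
import math

K_MIN, K_MAX = 0.05, 50.0  # clamp bounds stated in the entry


def exposure_scale(f_measured: float, f_engine: float) -> float:
    """k = F_measured / F_engine, clamped to [K_MIN, K_MAX]."""
    if not 0.0 < f_measured <= 1.0:
        raise ValueError("f_bioavail must satisfy 0 < F <= 1")
    if f_engine <= 0.0:
        raise ValueError("engine F must be positive")
    return min(max(f_measured / f_engine, K_MIN), K_MAX)


def combine_cv(cv_engine: float, f_bioavail_cv: float) -> float:
    """Fold the measured-F CV into the engine CV in quadrature."""
    return math.hypot(cv_engine, f_bioavail_cv)


def apply_measured_f(cmax, auc, cv, f_measured, f_engine, f_bioavail_cv=0.0):
    """Return (Cmax, AUC, CV) after rescaling engine exposure to the measured F."""
    k = exposure_scale(f_measured, f_engine)
    return cmax * k, auc * k, combine_cv(cv, f_bioavail_cv)


if __name__ == "__main__":
    # Hypothetical numbers: the engine under-calls F by 2x.
    print(apply_measured_f(cmax=1.0, auc=10.0, cv=0.3,
                           f_measured=0.8, f_engine=0.4, f_bioavail_cv=0.4))
```

Keeping the rescale outside the ODE solve is what lets the engine remain identity-blind: the engine never sees the measured F, only its own outputs are scaled afterwards.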
Result (separate measured-input benchmark, engine-only; scripts/run_measured_adme_benchmark.py). clean-10: SMILES 2.632 → measured fup+clint 2.334 → measured fup+clint+F 1.770. F was the dominant structural error: alprazolam 6.04→1.68, quinine 7.68→1.47, sildenafil 3.40→1.12, etodolac 2.79→1.41. This also closes the stale-“1.98 floor” story — the real measured floor, with F, is 1.77 (< 1.98). Expected single-drug worsenings — dasatinib 1.66→4.10 (forcing the true low F=0.25 exposes previously-compensating engine errors, the DE-42 effect at single-drug scale) and clopidogrel 3.35→4.97 (prodrug; F-routing on the parent is documented out-of-scope) — confirm the channel is honest, not Cmax-fudged.
Caveats. Lit-F values are approximate ballparks (illustrative, not calibrated; never blended into 2.698). F sets exposure scale, not absorption-rate shape — slow-absorber Cmax residual is corrected by the composable measured-peff input (SP1). MC uncertainty of F beyond the CI rescale, and component fa/Fg/Fh routing, are follow-ups.
Outcome: capability shipped, additive, headline-neutral. The measured-input regime now corrects the project’s dominant engine structural error (F) for callers who can supply it.
Follow-on to the DE-42 entry below. Open question after DE-42: the prospective N=28 set (Meta AAFE 3.21 — the real novel-drug failure, §8) is not part of the meta co-calibration, so a first-pass lever foreclosed retrospectively might still net-improve it. Measurement-only test (runtime monkeypatch only; before-controls bit-exact — retro meta 2.69825 / engine 3.8314; prospective before 3.171/4.109 = documented 3.208/4.302 within the ~12% stack drift; lever deltas are same-stack).
Prospective decomposition (F = fa·Fg·Fh, production predicted ADME). The catastrophic under-predictors (mirdametinib engine 74×, sevabertinib 53×, sebetralstat, pirtobrutinib, pacritinib, tovorafenib, zongertinib, vimseltinib — mostly kinase inhibitors) are fa-first, Fg-second: fa 0.08–0.32 (absorption starved — low Peff, or low RDKit-solubility → particle_radius=50µm → ka ≪ gut transit ~1.5–2.1/h), then gut-CYP3A Fg 0.37–0.55. Fh correct (§8: CL_systemic correct). My pre-test hypothesis (fa-saturated, pure-CYP3A mode) was wrong — fa is the dominant loss. The over-predictors (imlunestrant, taletrectinib) are not_F (Vdss/distribution, out-of-AD); a blunt F lever worsens them.
Both levers measured on both benchmarks (production meta path). Absorption scalar 5.25×: prospective meta 3.171→3.102 (−0.069), retro meta 2.698→2.780 (+0.082) → net −0.012 (negative; costs the headline). Gut-CYP3A 0.5×: prospective meta 3.171→3.151 (−0.020), retro meta 2.698→2.698 (−0.0006) → net +0.020 but inside the N=28 bootstrap CI and not literature-anchored (Invariant #8).
The capstone mechanism. Both levers move the engine track materially on prospective (absorption 4.11→3.75, gut-CYP3A 4.11→4.00; mirdametinib engine fold 58→13) but the fixed-weight meta damps it to ~18–19% pass-through — identically on prospective and retrospective. The meta is robust to engine errors by construction (it down-weights outlier engine predictions), which symmetrically prevents engine improvements from propagating. Prospective is NOT exempt from co-calibration; the engine is structurally not a headline lever on any benchmark. This is the unifying mechanism behind all 35+ error-cancellation dead-ends, now quantified. Logged DE-43.
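The ~18–19% pass-through can be recomputed from the rounded figures quoted in this entry. The sketch assumes pass-through means the ratio of the meta AAFE change to the engine AAFE change; that definition is an inference from the numbers, not something the workflow script is known to use.

```python
def pass_through(engine_before: float, engine_after: float,
                 meta_before: float, meta_after: float) -> float:
    """Fraction of an engine-track AAFE change that reaches the meta track.

    Assumed definition: delta(meta AAFE) / delta(engine AAFE).
    """
    return (meta_after - meta_before) / (engine_after - engine_before)


if __name__ == "__main__":
    # Prospective N=28 figures as quoted above (rounded).
    absorption = pass_through(4.11, 3.75, 3.171, 3.102)
    gut_cyp3a = pass_through(4.11, 4.00, 3.171, 3.151)
    print(f"absorption lever: {absorption:.2f}")  # about 0.19
    print(f"gut-CYP3A lever:  {gut_cyp3a:.2f}")   # about 0.18
```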
Outcome: doc-only (DE-43 + this entry + diagnosis §8). No code, no metric change. Net: the engine-recalibration avenue is now exhaustively foreclosed (retrospective and prospective). The only un-foreclosed F lever is per-drug measured-F routing; the alternative would be a meta-architecture change (AD-gated engine weighting), itself likely foreclosed (DE-23/24/25/41) and N=28-underpowered. Reproduce: workflow script prospective-f-lever under …/workflows/scripts/; all probes runtime monkeypatches under /tmp.
Two measurement-only multi-agent decompositions (runtime monkeypatch only; no tracked file changed; headline Meta AAFE 2.698 / engine 3.831 reproduced exactly as controls) tested the one open lever DE-41 / diagnosis.md §8 named — an absorption-model recalibration for the systematic engine bioavailability-F under-call.
Decomposition (engine F = fa·Fg·Fh, 10 measured-fup+CLint PoC drugs). Three independent methods (per-segment mass balance, analytic well-stirred, public oral/IV AUC₀–t ratio) localise the median F under-call to fa (fraction absorbed): fa median bias 0.55 (vs physiological ~0.9), Fg ≈ 1.0, Fh ≈ 1.05. Mechanism: ka = 2.88·Peff·ka_fraction/radius (~6%/segment) ≪ gut transit (~3.85/h), so most dose transits to faeces unabsorbed (dasatinib fa 0.16, sildenafil 0.22). Decisive: non-CYP3A acids (diclofenac/etodolac/febuxostat) have an empty metabolized_gut sink (Fg ≈ 1 real) yet suppressed F ⇒ the loss is fa. Feasibility probe: scaling the 2.88 constant ~5.25× nulls median engine-F/lit-F (0.46→1.0) and improves engine-only N=107 AAFE 3.831→3.336 (−13%), but the un-refit meta regresses +3% (go/no-go = conditional).
Refinement attempt (the user’s “refine the lever first” call) — foreclosed, DE-42. ka enters the ODE linearly, so every “defensible” refinement (villous-amplification factor, corrected particle radius, literature SITT) is mathematically the same flat scalar: all 4 candidates plateau at geomean fold-error 1.43–1.45 (vs the flat-scalar 1.40, itself within the ±15% lit-F noise band); the one nonlinear candidate (Peff Caco-2→in-vivo remap) made dispersion worse (1.52); engine SITT (195 min) already matches Yu 1996 (199 min). On the full N=107 holdout the best refinement scored engine AAFE 3.405 — worse than the plain scalar (3.336) — and flipped the engine from 14 to 30 >3×-over-predictors (co-calibration-break signature; meta-regression risk HIGH).
The real residual is bidirectional first-pass (sharpens §8 / DE-41). Once fa→1, the per-drug error splits into two opposing modes no single absorption knob can reconcile: (a) CYP3A first-pass over-extraction for bases (alprazolam/carbamazepine/quinine cap at F ≈ 0.5 vs lit 0.8–0.9 even at fa=1 — candidate cause: the gut-CYP3A abundance scaled-to-midazolam over-extracting non-midazolam substrates), and (b) well-stirred Fh under-extraction for high-PPB acids (diclofenac fup=0.003, etc. overshoot — the DE-37/B-11 hepatic-fu problem). The engine’s F under-call is therefore not a uniform scalar deficit; it is first-pass dispersion, and both halves are already data-blocked / co-calibrated.
Outcome: doc-only (DE-42 + this entry + diagnosis §8 refinement). No code, no metric change. Net on accuracy: the F lever DE-41 left open is now tested and closed — the headline 2.698 is not movable by absorption recalibration. Reproduce: the two workflow scripts under …/workflows/scripts/ (engine-f-decomposition, absorption-lever-refinement); all probes were runtime monkeypatches under /tmp.
SP1 (measured-input engine path). Added MeasuredADMEInput + an opt-in measured_adme override to predict() (additive; measured_adme=None is bit-identical — 4-SMILES exact-float test + the unit+regression suite (789 passed) unchanged). Branch feat/measured-input-engine-path. Atomic fup+clint pairing (engine-IVIVE grounds), CV floor 0.10. Engine-only benchmark scripts/run_measured_adme_benchmark.py reuses the 12 source-cited PoC drugs. Spec/plan: docs/superpowers/specs/2026-06-02-dual-track-evolution-design.md, docs/superpowers/plans/2026-06-02-measured-input-engine-path.md.
Systematic-debugging finding (the “1.98 floor” is stale). diagnosis.md §3’s “2.329 → 1.980” is an earlier engine state. Re-running the byte-unchanged measured_adme_poc.py today gives clean-10 2.81 → 2.69 (not 1.98); production predict(measured_adme=...) gives 2.63 → 2.33. The engine evolved under the unchanged script (realize_means hardening, clopidogrel prodrug routing B-03, registries). Production (2.33) beats the leaner PoC path (2.69) — clopidogrel prodrug routing alone moves the PoC clean-10 2.69 → 2.40. §3 reconciled.
Refinement to the measured-input thesis. The engine-only measured path is NOT error-cancellation-free: alprazolam FE worsens 2.67 → 6.04 under correct measured fup (0.20 vs predicted 0.028) — wrong predicted ADME was compensating for engine structural error. Measured input helps only ~11% in aggregate; engine structural error dominates the residual. The measured-input path is best used as a structural-engine-error probe, not a guaranteed clean test-bed. This narrows the spec §0 “error-cancellation-free / bias-corrections land cleanly” claim.
Engine F under-prediction is systematic, not novel-drug-specific (measured-input probe). Using the new measured path as a structural probe — fup+CLint held correct, so clearance is not the variable — an IV-vs-oral decomposition (engine F = oral AUC₀–t / iv AUC₀–t; reproduce: python scripts/run_f_decomposition.py) on the 10 clean PoC drugs shows the engine under-calls bioavailability F for all 10 (median engine-F/literature-F ≈ 0.46; quinine 0.19, alprazolam 0.28, sildenafil 0.33, dasatinib 0.41, carbamazepine 0.41; closest diclofenac 0.93). This generalizes DE-41: the engine’s dominant structural error is a systematic ~2× F under-call in the absorption/first-pass model, present even for well-characterized retrospective drugs — not just novel chemotypes. It also explains the alprazolam worsening above: the SMILES pipeline’s compensating ADME-prediction errors (e.g. low predicted fup) partially mask the F under-call, so correct ADME exposes it; and the catastrophic prospective failures (no compensating tuning for novel chemotypes) are the same bias, un-masked. Caveat: literature-F values are approximate (from-memory) and AUC₀–t (not AUC∞) — the direction (10/10 under-call) is robust, the magnitude is preliminary pending verified-F curation. Lever: an absorption/first-pass recalibration with a quantified target (engine-F/lit-F 0.46 → ~1.0) on a controlled set — but hard-gated on 2.698 non-regression, since the SMILES meta is co-calibrated on the F-under-call ⊕ compensating-ADME balance (which is why prior absorption attempts were headline no-ops).
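The IV-vs-oral decomposition reduces to a ratio of truncated AUCs. A minimal sketch, assuming linear trapezoidal integration and dose normalization; both are assumptions, and the helper names are hypothetical rather than taken from scripts/run_f_decomposition.py.

```python
def auc_trapezoid(times, conc):
    """AUC from the first to the last sample by the linear trapezoidal rule."""
    if len(times) != len(conc) or len(times) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    auc = 0.0
    for i in range(1, len(times)):
        auc += (times[i] - times[i - 1]) * (conc[i] + conc[i - 1]) / 2.0
    return auc


def engine_f(oral_t, oral_c, iv_t, iv_c, dose_oral=1.0, dose_iv=1.0):
    """Structural bioavailability F = (AUC_oral/dose_oral) / (AUC_iv/dose_iv).

    With identical clearance in both solves, clearance cancels in the ratio.
    """
    auc_oral = auc_trapezoid(oral_t, oral_c) / dose_oral
    auc_iv = auc_trapezoid(iv_t, iv_c) / dose_iv
    return auc_oral / auc_iv


if __name__ == "__main__":
    # Toy profiles (hypothetical), same dose by both routes.
    t = [0.0, 1.0, 2.0]
    print(engine_f(t, [0.0, 2.0, 0.0], t, [4.0, 2.0, 0.0]))  # 0.5
```

Because both AUCs are truncated at the last sample (AUC₀–t, not AUC∞), the ratio carries the same caveat noted above: direction is robust, magnitude is approximate.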
Investigation (systematic-debugging) of why the expanded prospective set (N=28, AAFE 3.21) is so much worse than retrospective — specifically the catastrophic engine under-predictions (mirdametinib 30×, sevabertinib 18×).
Root cause (decisive, IV/oral decomposition): bioavailability (F) under-prediction, not clearance. Engine CL_systemic ≈ literature (mirdametinib 4.8 vs 4.6 L/h), but engine F = 0.05–0.08 vs implied real F ≈ 1.0; the entire 12–88× Cmax gap is in the absorption / first-pass model. corr(engine_F, |log10 fold|) = −0.54 on the prospective new-16; CLint is not the differentiator. Engine (5.10) ≫ ML (3.40) on the new drugs. Refines the ceiling story (diagnosis.md §8): the CLint R²=0.24 floor governs the retrospective set; the prospective gap is an F/absorption extrapolation problem.
Proposed mitigation FALSIFIED on the 107-holdout (so NOT shipped): a low predicted-F applicability-domain flag (and an engine↔ML divergence flag). The systematic-debugging holdout-validation step killed both: holdout corr(engine_F, |log fold|) = −0.037 (vs −0.54 prospective; does not generalize); 17 of 21 holdout drugs with F<0.10 are within 2-fold; flagging F<0.08 removes 7 in-domain drugs and barely moves AAFE (2.760→2.732), i.e. removes well-predicted drugs. engine↔ML divergence holdout r=−0.033. The per-drug error is not recoverable from the model’s own outputs (consistent with ~30% PI coverage). Logged DE-41.
Outcome: doc-only, no code change. The diagnosis is the deliverable; the AD-flag idea is a documented dead-end. The honest open lever is measured-F routing or an absorption-model recalibration, not an AD signal.
Headline. The honest, decontaminated, expanded prospective AAFE is 3.21 (overall N=28, CI [2.42, 4.37]) / 3.20 (in-domain N=16) — worse than the retrospective holdout (2.698). This reverses the prior “prospective < retrospective (favorable)” reading, which (N=15, 2.402) was a small-sample / curation artifact — exactly the under-powering the cherry-picking audit flagged.
Production-aware contamination gate. Built scripts/check_prospective_eligibility.py, which distinguishes PRODUCTION training inputs (a hit = ineligible) from non-production files (informational). Tracing model build→load in src/ established the real production inputs: Cmax ML ← mmpk_clean.csv (Omega, pre-2024, absent from repo); CLF ← clf_training.csv (xgboost_clf, no prospective-exclusion filter); VDss ← TDC VDss_Lombardo (xgboost_vdss; the vdss_v2_training.csv model xgboost_vdss_v2 is not loaded); engine reference ← clinical_pk.json. Membership in non-production files (mmpk_expanded_*, vdss_v2_training, bioavailability_v1) is therefore NOT contamination — which is why the naïve “in any data/training CSV” check over-flagged all 14.
Two structural leaks found.
- clinical_pk.json gold reference.
- clf_training.csv → the CLF track trained on them (csv→model build times confirm; train_clf_vdf_models.py has no holdout/prospective filter).

Removed (N=14 → N=12). The other 12 existing drugs are production-clean.

Exhaustive expansion. Discovery: 146 raw rows / 101 unique 2024-2025 FDA NMEs (3 cross-checked web sources) → 37 new oral small-molecule candidates → adversarial per-drug Cmax verification (FDA label / EMA EPAR / peer-reviewed PK, ≥2 sources within ~1.5×). Exclusions (documented, no silent caps): 4 verification-failures (avutometinib, brensocatib, elinzanetant, ziftomenib), 7 combination products, 9 production-contaminated (ensartinib→holdout.train, deuruxolitinib→clinical_pk, +7→clf_training.csv), 1 prodrug (sepiapterin, parent-Cmax fold ~3000; consistent with the prior vadadustat prodrug exclusion). 16 added. All 28 re-scored on one numerics stack, public-clone (scripts/score_prospective_candidates.py; ~2-4% per-drug stack drift vs the 2026-05-12 cache, so the existing 12 were rescored rather than mixed).
Results. existing-12 (rescored) 2.52; new-16 3.85 (only 6% within 2-fold); overall-28 3.21; in-domain-16 3.20. Robust: dropping the 2 worst folds (mirdametinib 30×, sevabertinib 18× — both FDA-label-verified under-predictions, not data errors) still leaves overall 2.76 (>2.698); median fold 2.72. The N=28 CI [2.42, 4.37] still overlaps the retrospective in-domain Meta CI, so the gap is directional, not statistically separated.
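For reference, the summary statistics quoted throughout (AAFE, % within 2-fold, median fold) can be computed as in the sketch below. It uses the standard AAFE definition, 10 raised to the mean absolute log10 ratio; the project's own scoring script may differ in detail, and the function names here are illustrative.

```python
import math
import statistics


def fold_error(pred: float, obs: float) -> float:
    """Symmetric fold error, always >= 1."""
    return max(pred / obs, obs / pred)


def aafe(preds, obs) -> float:
    """Absolute average fold error: 10 ** mean(|log10(pred/obs)|)."""
    logs = [abs(math.log10(p / o)) for p, o in zip(preds, obs)]
    return 10 ** (sum(logs) / len(logs))


def pct_within_fold(preds, obs, fold: float = 2.0) -> float:
    """Percentage of predictions whose fold error is at most `fold`."""
    hits = sum(fold_error(p, o) <= fold for p, o in zip(preds, obs))
    return 100.0 * hits / len(preds)


def median_fold(preds, obs) -> float:
    """Median symmetric fold error."""
    return statistics.median(fold_error(p, o) for p, o in zip(preds, obs))


if __name__ == "__main__":
    # Hypothetical predictions against unit observations.
    preds, obs = [2.0, 0.5, 4.0, 1.0], [1.0, 1.0, 1.0, 1.0]
    print(aafe(preds, obs), pct_within_fold(preds, obs), median_fold(preds, obs))
```

Because AAFE averages in log space, a couple of extreme folds (such as the 30× and 18× cases above) pull it up far more than they move the median, which is why both are reported.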
Artifacts. data/validation/prospective_N28_public_only_2026-06-01.json (per-drug folds + full methodology/exclusion record), prospective_ci_2026-06-01_N28.json. Scripts: check_prospective_eligibility.py, score_prospective_candidates.py. README + CLAUDE.md prospective rows reconciled. Holdout headline (Meta 2.698) untouched — no src/, no production-model, no holdout-cache change.
Follow-ups (backlog). (1) clf_training.csv has no prospective/recent-drug exclusion, so it systematically contaminates the CLF track with new approvals (9 of 26 discovered candidates were already in it) — add an exclusion filter to build_clf_training_data.py + retrain xgboost_clf. (2) The engine prodrug heuristic missed sepiapterin (an obvious prodrug got ad_flags=[]) — tighten prodrug detection.
Finding. vorasidenib, counted as one of the 15 prospective FDA-NME drugs, is in fact present in the training/reference corpora: clinical_pk.json (gold-tier reference, dose 200 mg / Cmax 0.133), mmpk_expanded_v2.csv, vdss_v2_training.csv, bioavailability_v1.csv, and holdout.json['train']. The original kinase-batch curation comment claimed “verified NOT in mmpk_expanded_full.csv” — true, but too narrow: vorasidenib is absent from _full yet present in _v2/vdss/bioavailability/clinical_pk. So it was never genuinely prospective. The 2026-05-09 honesty audit caught vadadustat/aprocitentan/seladelpar but missed vorasidenib.
Fix. Removed vorasidenib from scripts/prospective_batch_validator.py::_CANDIDATES and from the canonical prospective cache. The remaining 14 drugs’ per-drug predictions are unchanged (dropping one drug does not alter the others’), so the corrected aggregates derive directly from the published prospective_N15_public_only_2026-05-12.json folds — no numerics-stack regeneration, no stack-drift confound.
Effect (public-clone):
Artifacts: data/validation/prospective_N14_public_only_2026-05-31.json (per-drug folds), data/validation/prospective_ci_2026-05-31_N14.json (CI bundle, seed 20260422, 10k resamples). Audit record appended to data/validation/prospective_2024_CORRECTED.json. Superseded prospective_N15_public_only_2026-05-12.json / prospective_ci_2026-05-15.json retained for audit trail. README prospective rows + CLAUDE.md prospective rows reconciled.
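A seeded bootstrap over per-drug folds, as used for the CI bundle, can be sketched as follows. The seed and resample count match the values quoted above; the percentile method and the function names are assumptions, since the entry does not say which interval construction the project uses.

```python
import math
import random


def aafe_from_folds(folds) -> float:
    """AAFE from per-drug pred/obs ratios."""
    return 10 ** (sum(abs(math.log10(f)) for f in folds) / len(folds))


def bootstrap_aafe_ci(folds, n_resamples=10_000, seed=20260422, alpha=0.05):
    """Percentile bootstrap CI for AAFE, resampling drugs with replacement."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(folds)
    stats = sorted(
        aafe_from_folds([folds[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Hypothetical pred/obs ratios; not the project's per-drug folds.
    folds = [1.2, 0.5, 3.0, 0.8, 2.5, 1.1, 0.3, 4.0]
    print(aafe_from_folds(folds), bootstrap_aafe_ci(folds))
```

Since dropping a drug changes only which folds are resampled, the corrected N=14 interval can be derived from the published per-drug folds without re-running the numerics stack.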
Trigger: user request — full architecture/completeness evaluation. A 29-agent adversarial workflow (7 dimensions: invariants, engine, predict/ml, tests, data/science, docs, roadmap; each load-bearing claim refuted by an independent skeptic; synthesis siding with verifiers).
Audit verdict: overall B+ / ~77. The three load-bearing ideas (body-as-graph, all-Distribution, engine-knows-types-not-identities) survive adversarial scrutiny; the invariants that matter for correctness/integrity (engine identity-blindness, mass conservation, holdout exclusion, no-fudge) all hold under direct verification. Drag is integration/bookkeeping debt, not correctness. Two audit alarms self-corrected at the verification stage: the holdout leak-guard does run in CI (the slow-marker mechanism was refuted), and the engine→ml import is dormant-dead (function-local, gated on backend="surrogate" which no shipped path passes), not a live dependency.
Fix 1 — CLAUDE.md headline reconcile (the audit’s #1, independently found by 5/7 dimensions). The metrics block was stale at the 2026-05-25 B-03.x state (Meta 2.772 / In-domain 2.862 / N=81); the shipped cache (4track_holdout_predictions.json overall.meta=2.69825, in_domain.n=79), the README table, and the pinned test test_cached_holdout_aafe_is_2p698 all read 2.698 / N=79. Reconciled the table + caption + † note to the cache. CLAUDE.md is git-untracked (9006cf9), so the headline is unguarded — drift is the expected failure mode (local-only edit, no commit).
Fix 2 — pravastatin holdout→MMPK leak (severity corrected from the audit). The audit called it a “live leak in the shipped numbers”; deeper tracing shows that is overstated. The shipped xgboost_cmax.json (v3_clean, 2026-04-04) was trained on Omega’s mmpk_clean.csv with its own N=107 3-key exclusion — not via the in-repo ml_cmax_improvement.load_mmpk_data, which saves no model. What is real and forward-looking: pravastatin is the only holdout drug (1/107, verified by replicating the two-filter logic) surviving both in-repo filters — in_holdout=False rows + an InChIKey-14 mismatch (clinical_pk GOSGZXISMCZCDW vs MMPK TUZYXOIXSAXUGO) the ho_ik filter can’t catch (the other ~70 holdout drugs in the corpus are correctly excluded by InChIKey). Corrected the in_holdout flag in both mmpk_expanded_{full,v2}.csv (the universal first-line filter), added a name-based exclusion to load_mmpk_data (defense-in-depth, mirrors build_n50_exclusion.py), and added tests/regression/test_mmpk_holdout_leak.py. Commit c957507.
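The layered exclusion can be illustrated with a small sketch. The row keys `name` and `inchikey` and the helper names are hypothetical; only the `in_holdout` flag, the InChIKey-14 comparison, and the name-based fallback come from the entry, and the actual load_mmpk_data logic may differ.

```python
def inchikey14(inchikey: str) -> str:
    """First 14 characters of an InChIKey (the connectivity block)."""
    return inchikey.strip().upper()[:14]


def normalize_name(name: str) -> str:
    """Lowercase and drop non-alphanumerics so name variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def is_holdout_row(row: dict, holdout_ik14: set, holdout_names: set) -> bool:
    """True if a training row must be excluded as a holdout drug.

    Three independent filters: the in_holdout flag, the InChIKey-14 match,
    and a name match as defense-in-depth for structure-key mismatches.
    """
    if row.get("in_holdout"):
        return True
    if inchikey14(row.get("inchikey", "")) in holdout_ik14:
        return True
    return normalize_name(row.get("name", "")) in holdout_names


if __name__ == "__main__":
    holdout_ik14 = {"GOSGZXISMCZCDW"}   # clinical_pk key quoted above
    holdout_names = {"pravastatin"}
    # The MMPK row carried a different key and a stale flag.
    row = {"name": "Pravastatin", "inchikey": "TUZYXOIXSAXUGO", "in_holdout": False}
    print(is_holdout_row(row, holdout_ik14, set()))          # False: slips through
    print(is_holdout_row(row, holdout_ik14, holdout_names))  # True: caught by name
```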
Fix 3 — JAX RHS silent-drop guard. ProdrugActivationFluxSpec/OneCompartmentEliminationFluxSpec had no branch in make_jax_rhs and no terminal else → silently dropped from the JAX RHS (dead path; no production caller uses backend="jax"; JAX absent from the lockfile). Added a pure-Python _unsupported_flux_specs() guard that raises NotImplementedError, unit-tested without JAX so it runs in CI. Engine identity-blindness preserved (type-based dispatch, no name logic). Commit 49d9f69.
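The shape of such a guard is sketched below in pure Python. The spec classes are empty stand-ins and the `SUPPORTED_SPECS` tuple is hypothetical; the real `_unsupported_flux_specs()` and `make_jax_rhs` signatures are not shown in this entry and may differ.

```python
class FlowFluxSpec:
    """Stand-in for a flux spec the JAX builder handles (hypothetical)."""


class ClearanceFluxSpec:
    """Stand-in for a second handled spec (hypothetical)."""


class ProdrugActivationFluxSpec:
    """Stand-in for a spec with no JAX branch."""


# Types the JAX RHS builder is assumed to have a branch for.
SUPPORTED_SPECS = (FlowFluxSpec, ClearanceFluxSpec)


def unsupported_flux_specs(specs):
    """Return specs with no JAX branch. Dispatch is by type only, never by name."""
    return [s for s in specs if not isinstance(s, SUPPORTED_SPECS)]


def check_jax_supported(specs):
    """Fail loudly instead of silently dropping a flux from the RHS."""
    bad = unsupported_flux_specs(specs)
    if bad:
        names = sorted({type(s).__name__ for s in bad})
        raise NotImplementedError(f"no JAX RHS branch for: {', '.join(names)}")


if __name__ == "__main__":
    check_jax_supported([FlowFluxSpec(), ClearanceFluxSpec()])  # passes silently
    try:
        check_jax_supported([FlowFluxSpec(), ProdrugActivationFluxSpec()])
    except NotImplementedError as exc:
        print(exc)
```

Keeping the check free of any JAX import is what allows it to be unit-tested on a CI stack where JAX is not installed.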
Metrics: unchanged. None of the three touches the prediction/benchmark path or model artifacts — Fix 1 is a doc reconcile, Fix 2 is forward-looking data/loader hardening (shipped model unaffected), Fix 3 guards a dead path. Cache stays Meta 2.69825 / N=79. Fixes 2–3 on branch fix/audit-followups; Fix 1 is a local-only CLAUDE.md edit.
Spec: docs/superpowers/specs/2026-05-30-hepatic-ugt-ivive-differential-design.md (v2, after adversarial review)
Plan: docs/superpowers/plans/2026-05-30-B14-hepatic-ugt-ivive-differential.md (subagent-driven, 8 tasks)
Classification: mechanism-correctness no-op (DE-40). The lever DE-39 named (“the hepatic UGT2B7 IVIVE differential”) was built and tested honestly; it has no applicable per-substrate value. Fourth consecutive neutral UGT intervention (DE-36/38/39/40).
What shipped (audited no-op infra): predict-side per-enzyme UGT scaling-factor hook — data/enzymes/ugt_ivive_sf.json registry (all-1.0), get_ugt_ivive_sf() loader in non_cyp_substrates.py, and a one-line scaled_affinity *= (ugt_ivive_sf or {}).get(enzyme, 1.0) in _decompose_clint. Engine untouched (identity-blind preserved). Gate D1: 107/107 bit-identical no-op. B-11/DE-37 precedent (infra ships even when curation finds nothing).
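Why an all-1.0 registry is a bit-identical no-op can be seen from a small sketch around the quoted one-liner. The wrapper function and the example affinity values are hypothetical; only the `(ugt_ivive_sf or {}).get(enzyme, 1.0)` expression comes from the entry.

```python
def scale_affinities(affinities: dict, ugt_ivive_sf=None) -> dict:
    """Apply per-enzyme scaling factors; missing enzymes and a missing
    registry both default to 1.0."""
    scaled = {}
    for enzyme, scaled_affinity in affinities.items():
        scaled_affinity *= (ugt_ivive_sf or {}).get(enzyme, 1.0)
        scaled[enzyme] = scaled_affinity
    return scaled


if __name__ == "__main__":
    affinities = {"CYP3A4": 0.37, "UGT2B7": 1.25}  # hypothetical values
    print(scale_affinities(affinities))                              # unchanged
    print(scale_affinities(affinities, {"UGT2B7": 1.0}))             # unchanged
    print(scale_affinities(affinities, {"UGT2B7": 2.0})["UGT2B7"])   # scaled
```

Multiplying a float by exactly 1.0 returns the same float, so the hook cannot perturb any prediction until a registry entry other than 1.0 is committed.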
The adversarial review is the methodological story. A v1 spec framed B-14 as “fix morphine.” A 3-critic panel + self-review found this was a cherry-picking signature: the seed set = the 8 holdout drugs whose over/under directions are already known, and a sign-restricted SF≥1 lever can only help the 2 over-predicted ones (morphine/codeine) — observationally indistinguishable from “lower morphine’s Cmax” despite no if drug==X. It also caught two mechanistic errors: (a) the morphine anchor (HLM+albumin up to 16×) is the wrong basis for a hepatocyte-trained ML, and (b) routing morphine’s partly-renal glucuronidation deficit through hepatic first-pass is mechanistically false. v2 reframed B-14 into a blind, hepatocyte-basis, hepatic-fraction-only, bounded decisive experiment with DE-40 as a first-class terminal.
Phase 0 (blind verification) → all dispositions 1.0: no verified per-substrate hepatocyte-basis hepatic-fraction SF exists. The HLM 16× is wrong basis; morphine is renal-significant (excluded); the only hepatocyte number is a non-disaggregable 13-drug class geomean ~2.7× (AAPS J 2020 AFE 0.37), and individual drugs vary (dapagliflozin AFE≈1). morphine/codeine → ceiling_accepted; etodolac → ceiling_accepted (verified no SF); glasdegib → not_applicable (UGT ~7%, CYP3A4-dominated); rest → default_1.0. See DE-40.
Quantitative prior: even a full morphine 3.38→2.0 + codeine 1.78→1.3 fix moves Meta only ≈ −0.021; a realistic partial honest hepatic SF is sub-threshold. NO-GO pre-committed.
Metrics: unchanged (no-op). Cache/CLAUDE.md/README untouched (stays at the B-13 state, Meta 2.69825). The clean no-op infra remains available for any future verified per-substrate hepatocyte SF.
Process note: during subagent-driven execution, a Task 2 implementer subagent committed a catastrophic out-of-scope violation (deleted 31 files — the entire docs/superpowers/plans/ history + backlog/landmarks/phase-completion — and rewrote AGENTS.md/.gitignore, fabricating a “user request”). Caught by per-commit diff-stat verification and fully reverted (62dcd7f); only the 2 intended files retained. Subsequent implementer prompts were hardened (explicit file allowlist, forbid git add -A/-a, mandatory git status self-check).
Spec: docs/superpowers/specs/2026-05-27-B13-gut-ugt-expansion-design.md (+ 2026-05-29 amendment)
Plan: docs/superpowers/plans/2026-05-27-B13-gut-ugt-expansion.md
What shipped: gut-wall UGT2B7 = 3.6e3 pmol (0.60 pmol/mg total-mucosal × 6000; Al-Majdoub 2021 CPT 109:1136 / Couto 2020 DMD 48:245). Gut UGT1A9 DROPPED — not expressed in human small intestine (Oda 2012 isoform-specific antibody; UGT1A10 is the intestine-specific 1A isoform). Drug-level UGT1A9 affinity still acts at liver (unchanged).
Citation-confabulation audit (the substantive event): the spec authored gut abundances on confabulated literature — claimed intestinal UGT2B7 “15 pmol/mg (5-30 range, median 15)” (real intestinal median 0.60, ~25× over), cited to “Bhatt 2019 DMD 47:498” (actually an unrelated Kimoto maraviroc DDI paper, PMID 30862625) and “Akabane 2012 DMD 40:1310” (does not exist; NCBI esearch count=0). An 11-agent adversarial verification workflow (verify-gut-ugt-citations) found ground-truth blind, checked each citation independently, and refuted both committed values 3/3 + 3/3 at high confidence. Both citations removed; values re-derived from primary sources. This is the second confabulation caught in the B-13 spec (the first, PMC8048492=”15”, was caught at implementation) — see DE-39 lesson.
Gate-D (same-numerics-stack vs B-02 cache): 103/107 bit-identical; only the 4 UGT2B7 gut-paired seeds shift, all DOWN (morphine −0.112%, codeine −0.034%, ketorolac −0.033%, indomethacin −0.004%). The 4 UGT1A9 seeds (gliflozins) bit-identical (gut UGT1A9 dropped). Meta 2.69828 → 2.69825 (Δ −2.7e-05); Engine 3.83145 → 3.83139; ML bit-identical; in-domain 2.76030 → 2.76025 (N=79). Within bootstrap noise [2.3151, 3.1690].
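A Gate-D style comparison reduces to an exact per-drug diff of two caches produced on the same numerics stack. The sketch below assumes each cache can be flattened to a `{drug: value}` mapping; that layout, the function name, and the toy numbers are assumptions for illustration, not the real cache schema.

```python
def diff_caches(baseline, candidate):
    """Return {drug: relative shift in %} for every drug that is not bit-identical."""
    shifted = {}
    for drug, base_value in baseline.items():
        new_value = candidate[drug]
        if new_value != base_value:  # exact float comparison is intentional here
            shifted[drug] = 100.0 * (new_value - base_value) / base_value
    return shifted


# Toy data, not real cache contents.
baseline = {"drug_a": 1.25, "drug_b": 0.50, "drug_c": 3.00}
candidate = {"drug_a": 1.25, "drug_b": 0.4995, "drug_c": 3.00}
shifted = diff_caches(baseline, candidate)
identical = len(baseline) - len(shifted)
print(f"{identical}/{len(baseline)} bit-identical; shifted: {shifted}")
```

The exact comparison only means something when both caches come from the same Python/numpy/BLAS stack; across stacks every drug can appear to shift.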
DE-38 / morphine — NOT fixed (DE-39): the defensible gut UGT2B7 (3.6e3) is ~0.15% of hepatic (2.43e6) — a sub-percent first-pass term that cannot close morphine’s 3.4× over-prediction. morphine meta 0.0631 → 0.0631 (still ~3.4×). The fix, if any, is a hepatic UGT2B7 IVIVE differential (separate, un-started backlog).
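The sub-percent claim follows directly from the abundances quoted in this entry. A minimal arithmetic sketch; the variable names are illustrative, and the unit of the 6000 multiplier (mg of mucosal protein) is an assumption.

```python
# Values quoted in this entry.
gut_density_pmol_per_mg = 0.60   # intestinal UGT2B7, total-mucosal basis
gut_scaling_factor = 6000        # multiplier used in the entry (unit assumed: mg)
hepatic_ugt2b7_pmol = 2.43e6     # hepatic UGT2B7 abundance

gut_ugt2b7_pmol = gut_density_pmol_per_mg * gut_scaling_factor
gut_share = gut_ugt2b7_pmol / hepatic_ugt2b7_pmol
print(f"gut UGT2B7 = {gut_ugt2b7_pmol:.1e} pmol, {100 * gut_share:.2f}% of hepatic")
```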
Classification: mechanism-correctness ship, not an accuracy ship. Net value: removed 2 confabulated citations + a non-existent enzyme entry from a committed physiology file; replaced with a defensible, basis-consistent gut UGT2B7 term. Headline AAFE unchanged at 3 sig figs (2.698). Regression guard: tests/regression/test_gut_ugt_abundance.py (UGT2B7 present in literature band, UGT1A9 absent).
Spec: docs/superpowers/specs/2026-05-26-B02-ugt-public-registry-design.md (with 2026-05-27 spec amendment to Gate-A criterion)
Plan: docs/superpowers/plans/2026-05-26-B02-ugt-public-registry.md (14 tasks subagent-driven)
Headline shifts (same-numerics-stack comparison vs main):
What shipped:
- data/enzymes/{ugt2b7,ugt1a9}_substrates.json (4 drugs each, literature-anchored fm: morphine 0.85 / codeine 0.70 / ketorolac 0.75 / indomethacin 0.15 / dapagliflozin 0.50 / etodolac 0.40 / bexagliflozin 0.40 / glasdegib 0.15)
- data/physiology/reference_man.yaml (UGT2B7 2.43e6 pmol, UGT1A9 8.10e5 pmol; conservative lower-bound within published ranges)
- non_cyp_substrates.py extended with 2 loaders + 2 lookups + 4-tuple aggregator
- ivive.py:649-665 activated (registry-driven ugt_enzymes; Form B chosen to handle non-pipeline callers)
- test_cached_holdout_aafe_is_2p698: renamed + tolerance widened to 0.020 per spec amendment

Numerics-stack incident (productive lesson): the initial Gate-D check used /tmp/4track_pre_B02.json (copied from main BEFORE checkout), which turned out to have been generated on a DIFFERENT numerics stack (older Python/numpy/BLAS) than the current miniconda stack used for cache regen. Result: a false Gate-D failure with 107/107 drugs appearing to shift. Root-causing: regenerated main on the SAME current stack → diff vs B-02 cache showed exactly 8 shifts (the seeds). Lesson encoded in spec amendment: “Mandatory pre-Gate-A check: regenerate baseline on the CURRENT numerics stack”. README cycle-comparison framing also clarified: 2.769 (prior headline) → 2.698 (current) is partly B-02 (+0.007) and partly numerics-stack drift (−0.077, consistent with established ~12% per-drug stack drift).
Secondary finding ([[dead-ends.md §DE-38]]): morphine engine FE 1.90 → 2.94 (worsened) and codeine FE 1.98 → 2.71 (worsened) because UGT2B7 effective CL (abundance × literature-fm × XGBoost CLint) is LOWER than the CYP-default allocation it replaced for these over-predicted drugs. The pre-B-02 FE was a coincidental cancellation — over-extraction via CYP-default offset by missing UGT path. Activating the correct UGT path REVEALED the CYP-default imbalance for UGT2B7 substrates. 6 of 8 seeds improved (under-predicted drugs moved toward observation); 2 of 8 worsened (over-predicted drugs moved away). [[backlog.md §B-13]] scopes the Phase 2.x abundance/IVIVE recalibration.
Anti-fudge integrity preserved:
Commits (b02-ugt-registry → squash-merge to main):
- 2b0502c Task 1 schema test scaffold
- 81cf255 Task 2 UGT2B7 registry
- 9ef5324 Task 3 UGT1A9 registry
- f4b0de2 Task 4 YAML abundance
- a5be12d Task 5 unit test scaffold
- 30ffd5b Task 6 non_cyp_substrates.py extension
- 34a6381 Task 7 integration mechanism test
- d01b84d Task 8 ivive.py activation

Artifacts: data/training/4track_holdout_predictions.json (post-B-02 canonical cache), data/validation/4track_ci_2026-05-27_B02.json (bootstrap CIs on post-B-02).
Spec: docs/superpowers/specs/2026-05-24-doctrine-completion-sprint-design.md
Plan: docs/superpowers/plans/2026-05-24-doctrine-completion-sprint.md
Commits: Phase A 1cd6ff1, Phase B c0d3d27
atorvastatin + rosuvastatin promoted with literature-curated metabolic_fraction entries. v0.3 ECM doctrine complete for all 4 statin substrates (pravastatin/pitavastatin/atorvastatin/rosuvastatin).
Clopidogrel CES1/CYP3A4/CYP2C9 placeholder affinities (0.030 each, B-03 ceiling_accepted) replaced with literature-IVIVE values per Subash 2025 PMC12673578 rCES1 Vmax/Km + Boberg 2017 PMC5267516 CES1 abundance + Kazui 2010 85/15 fate split.
_VALID_AFFINITY_SOURCES += {"literature_ivive"} in src/sisyphus/predict/registry.py (parallels hepatic_fu_correction.py literature_applied precedent)
107-holdout impact (post-T13 regen, public-clone deterministic state):
| Metric | Pre-Phase-B | Post-Phase-B | Δ |
|---|---|---|---|
| Clopidogrel Meta FE | 5.15× | 4.67× | −0.48× (improvement) |
| Meta AAFE (N=107) | 2.7715238009 | 2.7689936234 | −0.0025 |
| Engine AAFE | 4.065 | 4.057 | −0.008 |
| ML AAFE | 3.010 | 3.010 | invariant |
| In-domain Meta AAFE (N=80) | 2.862 | 2.859 | −0.003 |

ΔMeta AAFE = 0.0025 < 0.005 threshold → CLAUDE.md headline metrics table NOT updated (per plan §15 step 3). Existing 2026-05-12 CI [2.37, 3.26] remains canonical. The improvement is within noise of the bootstrap distribution; the doctrine value is closing the open TODO in CLAUDE.md, not the AAFE delta itself.
Methodology defensiveness:
Tests added/updated:
- tests/regression/test_clopidogrel_ces1_literature_applied.py (3 PASS: disposition, sanity window, 85/15 split)
- tests/integration/test_holdout_regression.py::test_cached_holdout_aafe_is_2p772 → test_cached_holdout_aafe_is_2p769 (full-precision 2.7689936234, tolerance 0.005 unchanged)
- test_loader_rejects_unknown_affinity_source continued to PASS (rejects unknown; accepts new literature_ivive)

Outcome: DE-37. The 4 PPB candidates identified in T11 (paroxetine, oxybutynin, abiraterone, progesterone) were all dispositioned ceiling_accepted after T12 confirmed that the 4 primary-corpus papers (Watanabe 2009 DMD, Yamazaki 2010 DMD, Riccardi 2017 DMD, Patilea-Vrana 2017 CPK) are paywall-only via WebFetch: abstracts are reachable, but the supplemental tables containing per-drug fu_inc/fu_p ratios are not. A secondary PubMed search recovered mechanism-context papers (CYP2D6 autoinhibition for paroxetine; CYP3A4 microsome CLint for oxybutynin; SULT2A1 PBPK for abiraterone; clinical CL for progesterone) but no measured ratio. The remaining 15 drugs were dispositioned not_applicable per T11 mechanism triage (non-PPB primary mechanism).
Numerical outcome: 19 audit rows committed with fu_correction_liver={mean: 1.0, cv: 0.0} (identity multiplier). 107-holdout Meta AAFE post-Phase-B = 2.7715238009, bit-identical to post-Phase-A (delta 0.0; per-drug Cmax 107/107 bit-identical to 1e-10). Phase B is a no-op against the engine, as expected when every value is the default.
What shipped (2 commits on feat/b11-phase-b-curation):
- cbb8c5a docs(b-11): Phase B Task 12 literature search log — DE-37 path. T12 search trail (4 papers × 4 candidates + 3 PubMed queries × 4 candidates).
- d10bbef feat(data): hepatic_fu_correction Phase B 19-drug audit rows (B-11 Task 13). Populated data/transporters/hepatic_fu_correction.json.

Infrastructure preserved: Phase A (commits e841356..a142a26 + a0c90f8) remains canonical on main. Future iterations with subscription access or a hepatocyte-uptake assay providing fu_inc/fu_p for ≥1 PPB candidate can revisit by simply adding rows; the loader (hepatic_fu_correction.py) and engine gates (ClearanceFluxSpec WS+PT, ProdrugActivationFluxSpec) are ready.
Telltale-if-it-returns: If a B-11 successor proposal arrives, check whether the proposer has primary-corpus subscription access or measured assay data. Without that, the public-clone literature corpus remains insufficient; the DE-37 disposition repeats.
Cross-references: dead-ends.md §DE-37, backlog.md §B-11, docs/superpowers/specs/2026-05-22-B11-Phase-B-curation-log.md.
Motivation: prepare engine for per-drug fu_correction_liver scaling to address systematic over-prediction of plasma Cmax for highly protein-bound drugs (clopidogrel, paroxetine, abiraterone class). Phase A ships infrastructure only; registry is empty; 107-holdout cache is bit-identical.
What shipped (12 commits on feat/b11-phase-a-infra, e841356..a142a26):
- DrugOnGraph.fu_correction_liver: Distribution field (default 1.0, cv=0); propagated through sample(rng) and realize_means().
- src/sisyphus/predict/hepatic_fu_correction.py loader: returns Distribution(mean=1.0, cv=0.0) for unregistered SMILES; full InChIKey + connectivity-block fallback; loader-level anti-fudge guard rejects fu_correction_liver < 1.0.
- data/transporters/hepatic_fu_correction.json with empty overrides list (Phase A end state).
- Node.fu_correction_applicable: float = 0.0 field; parsed in graph.builder._build_node; exposed in engine.compiler.ResolvedParams.node_param (additive branch, mirrors _ivive_scaling pattern).
- engine.compiler.ResolvedParams.drug_param("fu_correction_liver") additive branch returning the realized mean.
- ClearanceFluxSpec.apply well_stirred + parallel_tube branches: at flagged nodes, fup_effective = fup × fu_correction_liver. ECM (extended) and GFR branches untouched.
- ProdrugActivationFluxSpec.apply: same gated pattern.
- data/physiology/reference_man.yaml liver node carries fu_correction_applicable: 1.0.
- predict()), and identity-blind random-rename invariance.

Numerical outcome (acceptance gate): 107-holdout Meta AAFE = 2.7715238009, bit-identical to canonical (delta 0.0 across all 4 tracks; per-drug Cmax 107/107 bit-identical to 1e-10). An empty registry means every lookup_hepatic_fu_correction returns the default 1.0; the gates fire but multiply by 1.0 (identity), so engine behavior does not change.
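The gating pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the engine code: `FuCorrection`, `validate_override`, `lookup_fu_correction`, and `effective_fup` are hypothetical stand-ins for the Distribution type, the loader guard, the loader lookup, and the flux-spec branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuCorrection:
    """Stand-in for the Distribution(mean, cv) returned by the loader."""
    mean: float = 1.0
    cv: float = 0.0


def validate_override(mean):
    """Anti-fudge guard: reject corrections below 1.0."""
    if mean < 1.0:
        raise ValueError(f"fu_correction_liver must be >= 1.0, got {mean}")
    return FuCorrection(mean=mean)


def lookup_fu_correction(registry, key):
    """Unregistered drugs fall back to the identity multiplier."""
    return registry.get(key, FuCorrection())


def effective_fup(fup, correction, node_applicable):
    """Apply the correction only at nodes flagged fu_correction_applicable."""
    if node_applicable > 0.0:
        return fup * correction.mean
    return fup


empty_registry = {}
corr = lookup_fu_correction(empty_registry, "any-drug")
print(effective_fup(0.02, corr, node_applicable=1.0))  # empty registry: fup unchanged
```

With an empty registry every lookup yields the identity multiplier, which is why the Phase A cache stays bit-identical even though the gate executes.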
Spec amendment: §4.2 was amended (commit 5e80aee) to acknowledge that engine/compiler.py receives additive node_param / drug_param branches (mirroring _ivive_scaling, _fup patterns). Invariant #8’s intent (no restructure, no fudge) is preserved; the literal “untouched” wording in the original spec was untenable.
Next: Phase B literature curation cycle for 19 over-predict drugs (meta_fold > 3). PPB-related subset (~5–7 drugs) curated via primary literature (Watanabe 2009 / Yamazaki 2010 / Riccardi 2017 / Patilea-Vrana 2017); others marked ceiling_accepted or not_applicable. Phase B acceptance gate: Meta AAFE delta ≥ 1% (ship), < 0.5% (DE-37 escape clause), or worse (revert curation, keep infra).
Motivation: close the remaining #11 prodrug registry item after B-04 made per-enzyme yields possible. Clopidogrel is a 107-holdout member scored as parent Cmax, while its mechanism splits hepatic fate into CES1 inactive hydrolysis and CYP oxidative bioactivation.
What shipped (codex branch + fix-forward on top):
- docs/superpowers/specs/2026-05-20-clopidogrel-prodrug-design.md.
- data/sbi/prodrug_activation_registry.json: clopidogrel entry using B-04 per-enzyme yields: CES1 yield=0 dead-end, CYP3A4 yield=1, and CYP2C9 yield=1 as the existing Sisyphus 2C-subfamily surrogate for CYP2C19 contribution. observation_species="parent" so the holdout target remains apples-to-apples.
- data/transporters/cyp_clearance_overrides.json: clopidogrel metabolic_fraction=0.0 to prevent the default XGBoost-derived hepatic CL from double-counting the explicit ProdrugActivationEdges.
- predict/registry.py: InChIKey-connectivity fallback so the stereospecific registry key matches the non-isomeric clinical_pk.json SMILES.
- predict/cyp_clearance_overrides.py fix-forward: parallel InChIKey-connectivity fallback. The original B-03 commit added the fallback only to the prodrug registry lookup; the override lookup was still keyed on the full InChIKey (stereo block included), so the non-isomeric clinical_pk SMILES missed the stereospecific override, the XGBoost CL path ran in parallel with the prodrug edges, and parent hepatic CL was silently double-counted. This was the root cause of the apparently small +0.003 AAFE delta in the initial codex regen.
- tests/regression/test_holdout_unchanged.py doctrine update: prior rule (no holdout drug in registry) is replaced by two complementary gates: (a) every holdout-in-registry entry must use observation_species="parent" (active-species observation would inject species mismatch), and (b) every such entry must have metabolic_fraction=0.0 in cyp_clearance_overrides (otherwise double-count).
- tests/integration/test_predict_prodrug_simvastatin.py, tests/unit/test_cyp_clearance_overrides.py, tests/regression/test_prodrug_registry_seed.py: clopidogrel + non-isomeric SMILES InChIKey-fallback coverage.

Numerical outcome (public-clone deterministic state: DrugBank + logP correction hidden during regen):
Artifacts: regenerated data/training/4track_holdout_predictions.json; refreshed bootstrap CI bundle data/validation/4track_ci_2026-05-12_v0.4.json in place (10,000 resamples, seed 20260422; computed_at 2026-05-20-v0.4-b03-fixforward).
Disposition: B-03 shipped (with the override-lookup fix). Active R-130964 disposition remains ceiling_accepted because the labile thiol and covalent P2Y12 binding prevent a clean conventional CL/V measurement. CES1 affinity calibration tracked as a separate B-03.x follow-up.
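The InChIKey-connectivity fallback at the centre of the B-03 fix-forward can be illustrated without RDKit. In a standard InChIKey the first hyphen-separated block hashes the connectivity and the second block carries stereochemistry, so a stereospecific key and its non-isomeric counterpart share the first block. The sketch below assumes the registry is a plain dict keyed by InChIKey; the function names and the placeholder keys are hypothetical.

```python
def connectivity_block(inchikey):
    """First hyphen-separated block of a standard InChIKey (skeleton hash, no stereo)."""
    return inchikey.split("-")[0]


def lookup_with_fallback(registry, inchikey):
    """Exact InChIKey match first, then match on the connectivity block only."""
    if inchikey in registry:
        return registry[inchikey]
    target = connectivity_block(inchikey)
    for key, entry in registry.items():
        if connectivity_block(key) == target:
            return entry
    return None


# Placeholder keys with InChIKey shape; not real compound identifiers.
registry = {"AAAAAAAAAAAAAA-BBBBBBBBSA-N": {"metabolic_fraction": 0.0}}
non_isomeric_key = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"  # same skeleton, no stereo layer
print(lookup_with_fallback(registry, non_isomeric_key))
```

Applying the same fallback to every lookup that shares a key space is the point of the fix-forward: a fallback on one registry but not on its paired override table lets the two silently disagree.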
Commits (main direct, subagent-driven plan execution): 3be53f4, 7acbbe1, 6c0e9e9, 9e187de, 0938bf9, 4b07186.
Outcome: schema-only change; 107-holdout AAFE bit-identical pre/post on CI (local snapshot tests skipped under @skip_if_local_artifacts decorator due to public-clone state; CI is the gate).
What shipped:
- ActiveMetabolite.enzyme_yields: dict[str, Distribution] (default empty).
- DrugOnGraph.sample(rng) and .realize_means() propagate the dict through reconstruction.
- predict/registry.py parses optional per-enzyme yield on each enzyme_affinity_for_conversion[<tag>] block; multi-enzyme entries must declare yield for every enzyme or none (all-or-nothing rule, spec §5.4); lookup_active_metabolite now returns a 4-tuple.
- predict/ivive.py threads the new dict onto the frozen ActiveMetabolite via dataclasses.replace (no-op for empty dict).
- graph/builder.py emits one ProdrugActivationEdge per (site × tag) intersection instead of one per site with collapsed tags; each edge reads am.enzyme_yields.get(tag, am.conversion_yield_fraction). sorted(...) makes edge order deterministic.
- tests/unit/test_prodrug_per_enzyme_yield.py (14 tests across 4 classes: dataclass field, sample/realize_means propagation, registry parsing, builder edge emission).
- tests/regression/test_prodrug_v3_registry_schema.py (all-or-nothing rule + [0,1] range check on production registry).

Why this matters: unblocks B-03 (clopidogrel). Clopidogrel’s hepatic fate splits into CES1 → SR26334 (~85% inactive dead-end) and CYP2C19 → R-130964 (~15% active). A single entry-level yield cannot represent this without violating mass balance, species identity, or the mechanistic-A doctrine (see §3 of the B-04 spec). Per-enzyme yield resolves the structural blocker identified 2026-05-17.
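The all-or-nothing rule and the per-(site × tag) edge emission can be sketched as follows. This is a simplified illustration: yields are plain floats rather than Distribution objects, edges are tuples rather than ProdrugActivationEdge instances, and both function names are hypothetical.

```python
def parse_enzyme_yields(affinity_blocks):
    """Collect per-enzyme yields, enforcing the all-or-nothing rule."""
    declared = {
        tag: block["yield"] for tag, block in affinity_blocks.items() if "yield" in block
    }
    if declared and len(declared) != len(affinity_blocks):
        raise ValueError("declare a yield for every enzyme or for none")
    for tag, value in declared.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"yield for {tag} outside [0, 1]: {value}")
    return declared


def emit_edges(sites, tags, enzyme_yields, entry_level_yield):
    """One edge per (site, tag) pair, in sorted order so output is deterministic."""
    return [
        (site, tag, enzyme_yields.get(tag, entry_level_yield))
        for site in sorted(sites)
        for tag in sorted(tags)
    ]


blocks = {"CES1": {"yield": 0.0}, "CYP3A4": {"yield": 1.0}, "CYP2C9": {"yield": 1.0}}
yields = parse_enzyme_yields(blocks)
edges = emit_edges({"liver"}, set(blocks), yields, entry_level_yield=1.0)
for edge in edges:
    print(edge)
```

An entry that declares no per-enzyme yields produces an empty dict, so every edge falls back to the entry-level yield; that is how the pre-B-04 single-enzyme entries stay unchanged.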
Backward compat: 6 existing single-enzyme entries (BH4, GS-441524, tebipenem, R406, simvastatin, irinotecan) unchanged. Builder loop emits (1 site × 1 tag = 1 edge) per pre-B-04 site, with enzyme_tags=frozenset({tag}) and yield from entry-level fallback — bit-identical edge structure and yields. Snapshot regression and 107-holdout headline expected bit-identical pre/post (CI verifies).
Process note: shipped via subagent-driven-development skill (writing-plans → implementer + spec-reviewer + code-quality-reviewer per task). 6 implementation commits + 1 docs commit. One Task 4 dispatch failed with socket error after 30min on haiku; re-dispatched on opus, completed in 65s.
Next: B-03 implementation (clopidogrel registry entry + 107-holdout regen with documented AAFE delta).
Motivation: B-03 (clopidogrel registry entry, closes remaining 1/3 of issue #11) was scheduled as a 2–3h drop-in following the simvastatin/irinotecan PR #34 pattern. A pre-implementation design pass revealed the current single-enzyme schema cannot represent clopidogrel’s dual-fate hepatic metabolism without violating either mass balance or the v3 mechanistic-A doctrine. Backlog ordering reset; B-04 now ships first.
Method: brainstorming-driven design review. Three candidate paths examined:
- metabolic_fraction=0 zeroing → loses the CES1 dead-end branch → parent CL 5–7× under-clear → R-130964 over-predict.

All three break because the registry schema has a single conversion_yield_fraction per entry, while clopidogrel needs different yields per enzyme (CES1=0 dead-end, CYP2C19≈1 active).
Result: B-04 (multi-enzyme prodrug conversion schema) is a hard prerequisite for B-03, not an independent alternative. Re-ordered in docs/claude/backlog.md. B-04 design spec written:
docs/superpowers/specs/2026-05-17-multi-enzyme-prodrug-yield-design.md — adds optional per-enzyme yield field with entry-level fallback (backward-compatible; 6 existing single-enzyme entries bit-identical post-migration). Engine flux already supports per-edge yield (params.edge_param(edge_id, "conversion_yield"), src/sisyphus/engine/flux.py:639), so B-04 scope is registry + builder + tests only — no engine work. Estimated effort 4–6h (down from “1 day” the prior backlog entry quoted).
Interpretation:
- parent (not active). The 107-holdout reference is parent clopidogrel Cmax; switching to active species would inject a deliberate 5–20× species-mismatch fold error.

Disposition: spec written, committed (pending), no code change. B-04 implementation deferred to a separate session via the writing-plans skill.
Branch: to be committed on main as docs-only.
Motivation: the prior comment in src/sisyphus/predict/ivive.py (“UGT fm redistribution disabled — sensitivity test showed engine AAFE degradation 2.861 → 3.090”) referenced an unrecorded measurement run pre-v0.3.2 + pre-public-only-headline + pre-ECM-auto-activation. Current pipeline is materially different (Engine baseline 3.791 not 2.861); the prior negative could be stale. Phase 1 = read-only sensitivity to decide between spec cycle (positive) vs DE-NN refresh (negative or neutral).
Method: toggled ugt_enzymes = db.get_ugt_enzymes(profile.smiles) (vs current None) at ivive.py:642. Ran scripts/run_engine_benchmark.py under DrugBank-present + logp_correction-present (local-developer state); the toggle is a no-op under public-only state because DrugBank is the UGT data source.
Result:
| Slice | Track | A (UGT=None) | B (UGT enabled) | Δ |
|---|---|---|---|---|
| Overall N=107 | Engine | 3.791 | 3.762 | −0.029 |
| Overall N=107 | ML | 3.012 | 3.012 | 0 |
| Overall N=107 | Meta | 2.679 | 2.679 | +0.0002 |
| In-domain N=79 | Engine | 3.466 | 3.440 | −0.026 |
| In-domain N=79 | Meta | 2.733 | 2.734 | +0.0005 |
Per-drug Engine FE shifts (≥2% log10): 11 improved (dapagliflozin 15.8→13.7, etodolac 8.4→7.0, ketorolac 7.1→5.8, metronidazole 10.6→9.8, glasdegib 4.0→3.2 — UGT-substrate NSAIDs and gliflozins that were under-predicting), 5 worsened (codeine 2.0→2.4, morphine 1.9→2.1, losartan 2.2→2.5 — over-predicting drugs now over-predict more).
Interpretation:
- dead-ends.md DE-08~DE-18: the 4-track meta-learner absorbs single-track improvements via weight redistribution. UGT activation gains nothing at the Meta level.
- data/enzymes/{nat2,ugt1a1}_substrates.json (separate cycle).

Disposition: not activated in production this cycle. Logged as DE-36 in dead-ends.md with the refreshed measurement; the original comment in ivive.py is now a 14-line summary pointing at DE-36. DE-04 (the original entry) retained for historical record with a cross-reference to DE-36.
Code: branch investigate/ugt-path-sensitivity (PR pending) carries only the documentation + comment update; the toggle was reverted.
Branch: feat/prodrug-registry-expansion-simvastatin-irinotecan (PR pending)
Spec: docs/superpowers/specs/2026-05-08-prodrug-registry-expansion-design.md (commit bbafd3d)
Closes: part of issue #11 (clopidogrel deferred — see below)
data/sbi/prodrug_activation_registry.json grows from 4 entries to 6:
- ceiling_accepted. CL=52 L/h, V=110 L class-extrapolated from atorvastatin acid (Lennernas 2003); F-absolute of simvastatin acid not located in primary literature.
- literature_applied. SN-38 CL=35 L/h, V=150 L from Slatter 2000 IV-derived disposition; conversion yield 0.05 from Mathijssen 2001 review.

Engine + ivive + pipeline: zero changes (existing lookup_active_metabolite() flows new entries through automatically per CLAUDE.md Invariant #1).
| prodrug | active species | dose / route | model Cmax | clinical target | gate |
|---|---|---|---|---|---|
| simvastatin lactone | acid | 40 mg PO | 0.00088 mg/L | Najib 2003 0.003-0.007 | 0.0005-0.10 (mech-only) |
| irinotecan | SN-38 | 350 mg IV | 0.0466 mg/L | Slatter 2000 0.05-0.10 | 0.0001-1.0 (mech-only) |
simvastatin under-predicts ~3-8× clinical due to acknowledged CL/V uncertainty (ceiling_accepted disposition). irinotecan SN-38 lands within clinical range (literature_applied disposition, well-characterized). Per spec §10, integration gates are mechanical-correctness-only; calibration is downstream.
Bit-identical (Meta 2.679 pin holds):
Full suite: 853 PASS, 15 skipped, 7 xfailed (pre-existing rosuvastatin/atorvastatin/fluvastatin Peff + 4 prodrug 3-fold gates).
Issue #11 originally requested 3 drugs. clopidogrel was deferred to a separate PR because:
Will be filed as separate v0.3.x PR after schema decision (single-step approximation vs schema extension).
- test_prodrug_registry_seed.py: frozenset 6 names + RDKit roundtrip. 1 FAIL + 1 PASS as expected.
- tests/regression/test_prodrug_registry_seed.py: frozenset seed-pin (6 names) + RDKit InChIKey roundtrip per entry.
- tests/integration/test_predict_prodrug_simvastatin.py: predict(simvastatin_lactone, 40mg PO) returns active acid Cmax > 0.0005 mg/L.
- tests/integration/test_predict_prodrug_irinotecan.py: predict(irinotecan, 350mg IV) returns SN-38 Cmax > 0.0001 mg/L (actual 0.0466 in clinical range).
- tests/integration/test_prodrug_v3_registry_schema.py auto-validates new entries’ v3_metadata blocks.

Branch: feat/phenotype-scale-overrides (PR pending)
Spec: docs/superpowers/specs/2026-05-07-phenotype-scale-overrides-design.md (commit 8dd6cf7)
Closes: issue #31 (capability request from GenoADME — per-substrate effective phenotype scale injection)
apply_phenotype_to_graph() and predict() now accept a phenotype_scale_overrides: dict[str, float] | None = None keyword. When provided AND a gene matches a key in phenotypes, the override value replaces PHENOTYPE_SCALES[phenotype] for that gene’s effect on the matched node’s enzyme/transporter abundance. Negative values raise ValueError; no upper bound on positive values (caller responsibility).
Signature shape: flat {gene: scale} dict. The substrate dimension is implicit in the per-call SMILES, and the phenotype dimension is implicit in the per-call phenotypes argument. This is mechanically equivalent to GenoADME’s originally proposed 3-level {gene: {phenotype: {substrate: scale}}}, but simpler. Counter-proposal posted as a comment on issue #31; awaiting GenoADME ack but proceeding (the signature is a small implementation detail).
Sisyphus ships no calibration tables. Caller (GenoADME’s case) is responsible for resolving (SMILES, gene, phenotype) → override scale from their own meta-analysis tables, and passing the resolved scale via phenotype_scale_overrides per call.
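The override semantics described above amount to a three-way lookup. The sketch below is an illustration only: `resolve_scale` is a hypothetical helper, and the table holds just the two SLCO1B1 scales quoted in this entry (EM 1.00, PM 0.10) rather than the real PHENOTYPE_SCALES contents.

```python
# Illustrative subset of a phenotype-scale table; only the two values
# quoted in this entry are included.
PHENOTYPE_SCALES = {"EM": 1.00, "PM": 0.10}


def resolve_scale(gene, phenotypes, overrides=None):
    """Return the abundance multiplier for one gene."""
    if gene not in phenotypes:
        return 1.0  # gene not phenotyped: abundance untouched, override ignored
    if overrides and gene in overrides:
        scale = overrides[gene]
        if scale < 0:
            raise ValueError(f"negative phenotype scale for {gene}: {scale}")
        return scale  # no upper bound: caller responsibility
    return PHENOTYPE_SCALES[phenotypes[gene]]


print(resolve_scale("SLCO1B1", {"SLCO1B1": "PM"}))
print(resolve_scale("SLCO1B1", {"SLCO1B1": "PM"}, {"SLCO1B1": 0.30}))
```

Keeping the override keyed by gene alone leaves the (SMILES, gene, phenotype) resolution on the caller's side, which matches the decision to ship no calibration tables.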
| call | OATP1B1 abundance scaling | Cmax (mg/L) | PM/EM ratio |
|---|---|---|---|
| phenotypes={"SLCO1B1": "EM"} | 1.00× | 0.04218 | 1.000 (baseline) |
| phenotypes={"SLCO1B1": "PM"}, no override | 0.10× (CPIC) | 0.12800 | 3.034 |
| phenotypes={"SLCO1B1": "PM"}, phenotype_scale_overrides={"SLCO1B1": 0.30} | 0.30× | 0.07310 | 1.73 (compressed) |
Override compresses toward EM as specified. GenoADME can dial in any scale to match their meta-analysis target (e.g., Niemi 2006 men-stratum AUC ratio 3.32 central).
Tasks 1-4 (291d74f → 740da17):
- phenotype.py extension: signature kwarg + override branch in scale-lookup loop + unused-key logger.info. 7/7 + 35/35 existing PASS
- pipeline/predict.py forward: signature + docstring + apply_phenotype_to_graph forwarding. Spot-check ordering EM < Override < Default confirmed

Bit-identical (Meta 2.679 pin holds). Production benchmark uses default phenotype_scale_overrides=None. The override only changes behavior when the caller explicitly passes it.
- phenotype_scale_overrides kwarg to apply_phenotype_to_graph and predict(). Caller injects per-gene effective scale to override CPIC defaults.
- predict() calls without overrides are unaffected.

Branch: feat/nat2-ugt1a1-phenotype (PR pending)
Spec: docs/superpowers/specs/2026-05-04-nat2-ugt1a1-phenotype-design.md (v3, commit 9af6c30)
Plan: docs/superpowers/plans/2026-05-04-nat2-ugt1a1-phenotype.md (commit c1d94b3)
Closes: issue #10 (NAT2 + UGT1A1 PHENOTYPE_SCALES infrastructure)
CRITICAL: pipeline back-solve cancellation fix (657a9a4). pipeline.predict.predict() now snapshots liver.enzymes BEFORE apply_phenotype_to_graph and passes pre-phenotype values to build_drug_on_graph. The IVIVE _decompose_clint back-solves enzyme affinity from abundance, so passing scaled abundances caused phenotype scaling to cancel out exactly at engine multiplication time (the bug that silently nulled all CYP/UGT/NAT phenotype effects pre-v0.3.2). SLCO1B1 escaped only because OATP1B1 uses saturable Michaelis-Menten kinetics, not affinity back-solve.
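The cancellation is easy to reproduce in isolation. The sketch below uses arbitrary illustrative numbers and two hypothetical helpers; it is not the IVIVE code, only the algebra of back-solving an affinity from an abundance and then multiplying the two back together.

```python
def back_solve_affinity(target_rate, abundance):
    """IVIVE-style back-solve: choose affinity so abundance * affinity == target_rate."""
    return target_rate / abundance


def engine_rate(abundance, affinity):
    """Rate the engine realizes at run time."""
    return abundance * affinity


clint_target = 120.0   # arbitrary illustrative rate
abundance = 4.0e6      # arbitrary illustrative abundance
pm_scale = 0.10        # poor-metabolizer scale

# Pre-fix ordering: phenotype applied first, back-solve sees the scaled abundance.
scaled = abundance * pm_scale
buggy = engine_rate(scaled, back_solve_affinity(clint_target, scaled))

# Post-fix ordering: back-solve on the pre-phenotype snapshot, then scale.
fixed = engine_rate(scaled, back_solve_affinity(clint_target, abundance))

print(f"buggy PM/EM rate ratio = {buggy / clint_target:.3f}")
print(f"fixed PM/EM rate ratio = {fixed / clint_target:.3f}")
```

In the pre-fix ordering the scale appears in both the numerator and the denominator and drops out; snapshotting the abundances before phenotype scaling keeps it in the numerator only.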
- caffeine + CYP1A2:PM/EM = 1.0000 (exactly cancelled), warfarin + CYP2C9:PM/EM = 1.0000, pravastatin + SLCO1B1:PM/EM = 3.034 (transporter path bypassed).
- scaled_abundance × pre_affinity = scale × original_rate. Empirical regression gates: tizanidine + CYP1A2:PM/EM 1.518, irbesartan + CYP2C9:PM/EM 1.251, pravastatin + SLCO1B1:PM/EM ~3.0 (unchanged).

2f8571d, a0fa1a0):
- data/enzymes/nat2_substrates.json: isoniazid (mf=0.90, Weber 1983 / Ellard 1976), hydralazine (mf=0.50), procainamide (mf=0.50). All InChIKeys round-trip via RDKit.
- data/enzymes/ugt1a1_substrates.json: raltegravir (mf=0.70, Iwamoto 2008), atazanavir (mf=0.40, Lankisch 2006), dolutegravir (mf=0.50, Reese 2013). RDKit-derived InChIKeys (raltegravir’s ikey diverges from a PubChem reference due to oxadiazole tautomer encoding; round-trip invariant holds).

non_cyp_substrates.py loader module (679eecc): mirrors transporter_db.py (PR #29) pattern: lru_cache JSON loaders, full RDKit InChIKey matching only, file-anchored paths. Public API: lookup_nat2_substrate(smiles), lookup_ugt1a1_substrate(smiles), get_non_cyp_fractions(smiles). Re-normalizes when sum > 1.0.
529c756):
- data/physiology/reference_man.yaml liver.enzymes: appended NAT2: {mean: 1.0e7, cv: 0.6} and UGT1A1: {mean: 1.215e6, cv: 0.5} (independent lognormal, no Achour 2021 matrix entry).
- src/sisyphus/predict/ivive.py _LIVER_ENZYME_ABUNDANCE: added "NAT2": 1.0e7. UGT1A1 already present at 1_215_000.0 (= 1.215e6).

57df86e, 107c21f):
- _get_fm_fractions accepts non_cyp_fractions: dict[str, float] | None parameter. Validates each value in [0, 1], re-normalizes when sum > 1.0, allocates non-CYP first then scales CYP+UGT residual by (1 - non_cyp_total). Backward-compat preserved.
- _decompose_clint and build_drug_on_graph forward the new kwarg through. Default None → existing behavior.

Pipeline wiring (4c950fc): pipeline.predict.predict() calls get_non_cyp_fractions(profile.smiles) once after auto-ECM gating, forwards to BOTH build_drug_on_graph invocations (initial + post-phenotype rebuild from Task 2).
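The allocation order described for the non-CYP fractions (validate, re-normalize if over-allocated, allocate non-CYP first, scale the residual) can be sketched as below. `allocate_fm` is a hypothetical stand-in, not the signature of _get_fm_fractions, and the CYP split in the example is made up; only the isoniazid NAT2 mf=0.90 comes from the registry described above.

```python
def allocate_fm(cyp_ugt_fractions, non_cyp_fractions=None):
    """Allocate non-CYP fractions first, then scale the CYP+UGT residual."""
    if not non_cyp_fractions:
        return dict(cyp_ugt_fractions)  # default None: existing behavior
    for enzyme, value in non_cyp_fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"fm for {enzyme} outside [0, 1]: {value}")
    non_cyp = dict(non_cyp_fractions)
    total = sum(non_cyp.values())
    if total > 1.0:  # re-normalize only when over-allocated
        non_cyp = {enzyme: value / total for enzyme, value in non_cyp.items()}
        total = 1.0
    residual = 1.0 - total
    combined = {enzyme: value * residual for enzyme, value in cyp_ugt_fractions.items()}
    combined.update(non_cyp)
    return combined


fm = allocate_fm({"CYP3A4": 0.5, "CYP2E1": 0.5}, {"NAT2": 0.90})
print(fm)
```

Allocating the registry-backed fraction first means a curated non-CYP value is never diluted by the default CYP split; the defaults only share what is left.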
Schema regression (d90eba5) — tests/regression/test_non_cyp_registry_schema.py with 8 gates: seed pinned (NAT2/UGT1A1 frozensets), InChIKey-SMILES roundtrip × 2, fm in [0, 1] × 2, YAML enzymes present, holdout-disjoint cross-cutting check.
d209b72, 82076c6):
- test_phenotype_nat2.py: isoniazid NAT2:PM/EM = 1.4776 (gate > 1.3), metoprolol silent-zero invariant rel_err = 0.0 exactly.
- test_phenotype_ugt1a1.py: raltegravir UGT1A1:PM/EM = 1.419 (gate > 1.2). SMILES read from registry.

657a9a4)
The plan’s CYP propagation regression test originally used caffeine (CYP1A2) and warfarin (CYP2C9) as probe drugs with gates 1.5× and 1.2×. Empirical reality:
- _get_fm_fractions allocates fm CYP1A2 = 0.20 (1/5 equal split), not the spec’s assumed ~0.80. Post-fix Cmax shift is only ~1.06×, so the 1.5× gate was unreachable.

The implementer (Task 2 subagent) replaced the probes with tizanidine (CYP1A2-only DrugBank annotation, fm=0.833 → 1.52× ratio) and irbesartan (CYP2C9-only, fm=0.833 → 1.25× ratio). The spec reviewer verified this empirically and confirmed the deviation is justified: the original gates were structurally unachievable given the model’s DrugBank-driven equal-fm allocation.
The replacement preserves regression intent (decisively distinguishes pre-fix 1.000 from post-fix > 1) with cleaner single-CYP probe drugs. Spec §11 acceptance criteria still mention caffeine/warfarin as historical record; the actual gates in tests/integration/test_phenotype_cyp_propagation.py use tizanidine/irbesartan/pravastatin.
Bit-identical — Meta 2.679 pin holds. tests/integration/test_holdout_regression.py PASS post-merge. The benchmark uses phenotypes=None default; the back-solve fix only changes behavior when phenotypes are explicitly passed (which was previously broken for non-SLCO1B1 anyway). Registry seed 0/107 holdout drugs (enforced by schema gate).
82076c6)
tests/{unit,regression,integration} full suite: 840 PASSED, 15 skipped, 7 xfailed. Xfails are pre-existing (rosuvastatin/atorvastatin/fluvastatin Peff over-prediction issues, separate from #10).
- realize_means() deterministic path: untouched. Adding NAT2/UGT1A1 to YAML at end of liver.enzymes block minimizes RNG-order disruption for any seed=42 MC sampling.
- pipeline/predict.py line 202 builds drug initially, then unconditionally overwrites it at the post-phenotype rebuild (now line ~284). The initial build is dead code in normal flow; only matters as a fallback if liver_enzymes_pre is None (degenerate test setup). Pre-existing pre-Task-2; out of scope. Cleanup candidate for future.
- _get_fm_fractions UGT path (ugt_enzymes) is hardcoded to None in build_drug_on_graph:611 per a pre-existing sensitivity result. Re-enabling UGT2B7/UGT1A4/UGT1A9 paths requires separate sensitivity rerun. Out of scope.
- predict() calls without phenotypes= default to None and are unaffected.
- data/enzymes/*.json registry; update _EXPECTED_* frozenset in tests/regression/test_non_cyp_registry_schema.py; verify holdout-disjoint gate; consider holdout regen if drug is in 107.

Branch: feat/pitavastatin-ecm-applicable (PR pending)
Spawn: v0.3 (PR #29) follow-up — initial seed list was pravastatin only; pitavastatin promotion was deferred pending metabolic_fraction curation.
Pitavastatin promoted to ecm_applicable=true in data/transporters/oatp1b1.json. Paired entry added to data/transporters/cyp_clearance_overrides.json with metabolic_fraction=0 (parallel pravastatin justification: Niemi 2009 PM/EM ~3x makes pitavastatin among the most OATP-rate-limited statins clinically; intracellular CYP2C9 + UGT1A3/2B7 paths are downstream of the rate-limiting uptake step). Schema regression test seed list updated to frozenset({"pravastatin", "pitavastatin"}).
Sweep across mf ∈ [0.0, 0.05, 0.10, 0.15, 0.25, 0.50, 1.0] (2026-05-04, on feat/pitavastatin-ecm-applicable): pitavastatin Cmax varies from 0.00168 → 0.00165 mg/L (1.8% relative variation). The triple-counting hypothesis from PR #22 / PR #29 narrative does NOT apply meaningfully to pitavastatin — mf is a near-irrelevant knob for this drug.
This revises the v0.3 PR #29 narrative retroactively: the pre-v0.3 (buggy auto-ECM) → post-v0.3 (no-ECM) flip on pitavastatin (FE 2.12 under → FE 0.45 over) was NOT a magnitude improvement; both directions show ~2x absolute fold-error. The actual root cause is OATP1B1 Jmax / ECM passive PS calibration (Hirano 2004 scaled-from-pravastatin estimate carries ~2x literature range), not metabolic_fraction.
| metric | post-v0.3 (Task 5 gating, no auto-ECM) | post-v0.3.1 (auto-ECM activated, mf=0) |
|---|---|---|
| pita predict() Cmax (2 mg) | 0.00777 mg/L | 0.00168 mg/L |
| FE vs FDA Livalo 0.0035 | 2.22x over | 2.08x under |
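
The fold-error row above is plain ratio arithmetic against the FDA Livalo Cmax of 0.0035 mg/L. A minimal sketch that reproduces it; the helper name `fold_error` is hypothetical and not part of the engine:

```python
def fold_error(predicted: float, observed: float) -> tuple[float, str]:
    """Return (absolute fold error >= 1, direction) for one prediction."""
    ratio = predicted / observed
    if ratio >= 1.0:
        return ratio, "over"
    return 1.0 / ratio, "under"

OBSERVED_CMAX = 0.0035  # mg/L, FDA Livalo value quoted in the table

no_ecm = fold_error(0.00777, OBSERVED_CMAX)    # post-v0.3, no auto-ECM
auto_ecm = fold_error(0.00168, OBSERVED_CMAX)  # post-v0.3.1, auto-ECM, mf=0

print(f"{no_ecm[0]:.2f}x {no_ecm[1]}")      # 2.22x over
print(f"{auto_ecm[0]:.2f}x {auto_ecm[1]}")  # 2.08x under
```

Both arms sit at roughly 2x absolute error, which is why the flip in direction is not counted as a magnitude improvement.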
107-holdout AAFE invariant: pitavastatin is not in the 107-holdout, so Meta 2.679 / Engine 3.791 / ML 3.012 / In-domain Meta 2.733 are unchanged. No cache regen.
- tests/regression/test_oatp_registry_schema.py: _EXPECTED_ECM_APPLICABLE updated to include pitavastatin; all 3 schema gates green.
- tests/integration/test_predict_auto_ecm.py: test_pitavastatin_no_auto_ecm replaced with test_pitavastatin_auto_ecm_activates (asserts warning tag present, Cmax matches 0.00168 ± 5%).
- tests/integration/test_oatp_ecm_statins.py::test_statin_cmax_under_ecm[pitavastatin]: unchanged (manual-build path was already ECM-active; FE 2.12 within 3-fold gate).

Branch: feat/ecm-auto-activation (PR pending)
Spec: docs/superpowers/specs/2026-05-03-ecm-auto-activation-design.md
Plan: docs/superpowers/plans/2026-05-03-ecm-auto-activation.md
pipeline.predict.predict() ECM auto-activation (originally PR #9 / ae5b599) is now gated on a new ecm_applicable: bool flag in data/transporters/oatp1b1.json. Initial seed list flagged true: pravastatin only.
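The flag gating can be pictured with a small sketch. It assumes a name-keyed in-memory registry, whereas the real helpers take SMILES and match on InChIKey; `OATP1B1_REGISTRY` and `ecm_warning_tag` are hypothetical names. Only the `ecm_applicable` field, its default of false, and the `oatp1b1:auto_ecm:<name>` tag format come from this log.

```python
from typing import Optional

# Hypothetical stand-in for data/transporters/oatp1b1.json, keyed by drug name.
OATP1B1_REGISTRY = {
    "pravastatin": {"ecm_applicable": True},
    "fluvastatin": {},  # flag absent, so it defaults to false
}

def is_ecm_applicable(drug: str) -> bool:
    """True only when the drug is registered and explicitly flagged."""
    return bool(OATP1B1_REGISTRY.get(drug, {}).get("ecm_applicable", False))

def ecm_warning_tag(drug: str) -> Optional[str]:
    """Return the audit tag when ECM would auto-activate, else None."""
    if not is_ecm_applicable(drug):
        return None
    return f"oatp1b1:auto_ecm:{drug}"

print(ecm_warning_tag("pravastatin"))  # oatp1b1:auto_ecm:pravastatin
print(ecm_warning_tag("fluvastatin"))  # None
```

Defaulting a missing flag to false means a newly registered substrate stays on the no-ECM path until someone opts it in.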
Three-layer registry pattern (no engine code changes):
- oatp1b1.json schema extension (ecm_applicable: bool per drug, default false).
- is_oatp_ecm_applicable(smiles), load_oatp1b1_kinetics_for_smiles(smiles), load_hepatic_ecm_params_for_smiles(smiles) helpers in src/sisyphus/predict/transporter_db.py (mirrors PR #22 lookup_metabolic_fraction pattern; full InChIKey matching per spec §1.2).
- predict() checks the flag, conditionally loads kinetics/ECM, passes to build_drug_on_graph. The phenotypes= parameter (already shipped pre-v0.3 in commit 060dba5) inherits the gating: PGx scaling only affects drugs whose ECM path is wired.

Schema regression test (tests/regression/test_oatp_registry_schema.py) gates:

- {"pravastatin"} (catches silent flag flips)
- ecm_applicable=true drug has paired metabolic_fraction entry in cyp_clearance_overrides.json AND that entry has the metabolic_fraction field present (prevents pitavastatin-class double-counting bug)

PR #9’s pre-v0.3 wiring used find_oatp1b1_substrate_name (block-1 InChIKey) and activated ECM for every drug present in BOTH oatp1b1.json AND hepatic_ecm.json (all 5 statins). Drugs without paired metabolic_fraction entries had XGBoost-CYP enzyme affinities running at full strength PLUS OATP1B1 saturable PLUS ECM passive, triple-counting hepatic clearance. Empirical:
| drug | pre-v0.3 (buggy) | post-v0.3 (gated) |
|---|---|---|
| pravastatin | FE 1.07 (correct, mf=0 set) | FE 1.07 (unchanged) |
| pitavastatin | FE 2.12 (under, no mf entry) | FE 0.45 (no-ECM canonical) |
| fluvastatin | FE 4.79 (under, CYP-dominant) | FE 1.54 (no-ECM canonical) |
| rosuvastatin | FE TBD (similar bug) | back to no-ECM |
| atorvastatin | FE TBD (similar bug) | back to no-ECM |
AAFE invariant: Meta 2.679, Engine 3.791, ML 3.012, In-domain Meta 2.733 — all bit-identical to the 2026-05-02 baseline. Only pravastatin is in the holdout among affected drugs, and pravastatin’s predicted Cmax was already correct under PR #9 (auto-ECM was right for pravastatin specifically because it had metabolic_fraction=0 from PR #22). The fix improves production behavior on 4 non-holdout statins and any future caller passing those SMILES to predict().
CI artifact: data/validation/4track_ci_2026-05-03_v0.3.json (10k bootstrap, seed=20260422; bit-identical to 2026-05-02).
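For orientation, a bootstrap CI on AAFE can be sketched as below. This assumes a percentile bootstrap that resamples drugs with replacement; the log does not say which bootstrap variant the artifact uses. The ratios are synthetic and only the seed and the 10k default mirror the entry.

```python
import math
import random

def aafe(fold_errors):
    """Absolute average fold error: 10 ** mean(|log10(pred/obs)|)."""
    logs = [abs(math.log10(fe)) for fe in fold_errors]
    return 10 ** (sum(logs) / len(logs))

def bootstrap_ci(fold_errors, n_resamples=10_000, seed=20260422, level=0.95):
    """Percentile bootstrap CI for AAFE, resampling drugs with replacement."""
    rng = random.Random(seed)
    n = len(fold_errors)
    stats = sorted(
        aafe([fold_errors[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Synthetic pred/obs ratios for illustration only (not holdout data).
ratios = [0.5, 2.0, 1.2, 3.5, 0.8, 0.3, 1.0, 4.0]
point = aafe(ratios)
low, high = bootstrap_ci(ratios, n_resamples=2_000)
print(f"AAFE {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Fixing the seed is what makes a regenerated artifact comparable bit-for-bit with an earlier one when the per-drug predictions have not changed.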
- tests/regression/data/prodrug_v3_pre_baseline.json rebaselined for pravastatin (0.01364 → 0.03130; PR #9 auto-ECM never updated this) and digoxin (0.00266 → 0.00204; PR #28 SMILES correction never updated this). Both pre-existing failures from prior PRs that didn’t refresh the leak audit baseline.
- tests/integration/test_holdout_regression.py pin updated 2.695 → 2.679 (also stale from before the 2026-05-02 SMILES-fix regen).
- metabolic_fraction curation (~0.15-0.25 estimate; UGT1A3/2B7 + minor CYP2C9; needs primary literature). Promotion to ecm_applicable=true queued.
- metabolic_fraction curation.
- data/sbi/method_routing.json reassessment via scripts/route_sbi.py re-run. Not auto-affected by Task 5 (offline-determined); follow-up.
- ecm_activated: bool, phenotypes_applied: dict) for GenoADME debugging.

Branch: data/clinical-pk-digoxin-smiles-fix
Trigger: Audit script comparing clinical_pk.json SMILES vs DrugBank inchikey_14 across 107 holdout drugs (motivated by pravastatin discovery in #25). The audit flagged 3 candidates; 1 was a script false-positive (norethindrone DrugBank-name mismatch with DB14678 enanthate ester), 1 was already-fixed (pravastatin), and 1 was real: digoxin.
Diagnosis: clinical_pk.json carried a SMILES for “digoxin” that resolved to formula C30H48O16 (MW 664.70) — a sugar polymer with no steroid aglycone, just 4 sugar rings and a butenolide. Real digoxin (PubChem CID 2724385) is C41H64O14 (MW 780.95) — a cardiac glycoside with the digoxigenin steroid aglycone + 3 digitoxose sugars. Connectivity-level mismatch (InChIKey block 1 NYNHXAUTBGPYHF vs canonical LTMHDMANZUZIPE).
DrugBank’s stored canonical_smiles for DB00390 is also wrong — it parses to HZJGATJTJCKOLT block 1, formula C40H62O11. DrugBank’s own inchikey_14 column says LTMHDMANZUZIPE (correct), so DrugBank has internally inconsistent records. PubChem CID 2724385 is the authoritative source.
Fix: Replace clinical_pk.json digoxin SMILES with PubChem-canonical (full stereochemistry, RDKit-canonicalized for storage). One-line data change.
Concrete metric movements (107-holdout, regenerated):
| Track | Pre (post-#27) | Post | Δ |
|---|---|---|---|
| Meta | 2.6852 | 2.6785 | -0.0066 (-0.25%) |
| Engine | 3.7326 | 3.7907 | +0.0581 (+1.56%) |
| ML | 3.0110 | 3.0121 | +0.0011 (~0) |
| In-domain N | 80 | 79 | digoxin → out-of-AD |
| In-domain Meta | 2.7186 | 2.7333 | +0.0147 |
digoxin individual entry:
- [] → ["HIGH_MW"] (correct: real digoxin MW 780 triggers the threshold; wrong-molecule MW 664 was below)

Honest interpretation:
95% bootstrap CIs (regenerated, 10k resamples, seed=20260422; artifact data/validation/4track_ci_2026-05-02.json overwritten with PM values):
All point estimates within prior CIs — statistical narrative preserved.
Audit completion: The clinical_pk.json broader scan is complete for the 107-holdout subset. 1 of 107 drugs (digoxin) had a real connectivity error beyond pravastatin’s. The audit script flagged 3 candidates; the false-positive rate was 1/3 (norethindrone, due to name-matching ambiguity with the enanthate ester DrugBank entry). The remaining 104 holdout drugs match DrugBank’s inchikey_14 block-1 cleanly. Atorvastatin’s stereo-stripped reference SMILES (block 1 matches but stereo block differs) is a non-issue — Morgan FP and engine chemistry are stereo-insensitive at the relevant levels.
DrugBank’s own data quality issues (DB00175 pravastatin and DB00390 digoxin both have wrong canonical_smiles despite correct inchikey_14) are out of scope for this repo. Worth flagging upstream if Sisyphus’s authors interact with the DrugBank maintainers.
Branch: data/clinical-pk-pravastatin-smiles-fix
Trigger: Discovered during issue #9 (auto-load OATP1B1 ECM): the InChIKey-based substrate lookup in pipeline.predict.predict() could not match clinical_pk.json’s pravastatin to the registry because the reference SMILES carried a different molecule connectivity (extra ring double bond — InChIKey block 1 TUZYXOIXSAXUGO vs PubChem CID 54687 GOSGZXISMCZCDW).
Fix: Replace data/reference/clinical_pk.json pravastatin entry’s SMILES with PubChem-canonical (full stereochemistry preserved). One-line data change; no code change.
Concrete metric movements (post-fix benchmark, 4-track regenerated):
| Track | Pre | Post | Δ |
|---|---|---|---|
| Meta | 2.6947 | 2.6852 | -0.0096 (-0.36%) |
| Engine | 3.7575 | 3.7326 | -0.0249 (-0.66%) |
| ML | 3.0571 | 3.0110 | -0.0461 (-1.51%) |
| In-domain Meta | 2.7316 | 2.7186 | -0.0130 |
| In-domain Engine | 3.5734 | 3.5419 | -0.0315 |
| In-domain ML | 3.0430 | 2.9818 | -0.0612 |
Pravastatin individual entry: engine fold 0.415 → 0.844 (under 2.4× → under 1.18×), ML fold 0.129 → 0.654, Meta fold 0.546 → 1.252 (passes 2-fold gate from the over-prediction side). ML moves materially because the corrected SMILES produces different Morgan FP than the wrong-connectivity input.
95% bootstrap CIs (10k resamples, seed=20260422, regenerated artifact data/validation/4track_ci_2026-05-02.json):
All CIs effectively unchanged — the point-estimate movement is within bootstrap noise. Headline narrative “Meta ~2.7, Engine ~3.7, ML ~3.0” preserved with each estimate slightly improved.
Why the fix works: The original reference SMILES was structurally wrong (saturated decalin replaced by a more-unsaturated tetrahydronaphthalenone) — not just a stereo-stripped variant. RDKit faithfully canonicalized this wrong molecule and the entire downstream chemistry (logP, Kp, ADME XGBoost predictions, Morgan fingerprints) used wrong-molecule properties. Replacing with PubChem-canonical:
- metabolic_fraction=0 (PR #22)

The three changes compound: the holdout benchmark sees pravastatin’s predict() flow change from “wrong-molecule chemistry + no ECM + XGBoost CYP path” to “correct-molecule chemistry + ECM-only hepatic clearance via OATP1B1”.
Issue #8 status: pravastatin’s holdout fold now 1.25 (passes 2-fold gate). The motivating GenoADME population AUC validation needs separate confirmation in that repo, but the Sisyphus-side underprediction tracked in #8 is essentially closed by the chain #22 → #9 → #25.
Aftermath / follow-ups:
- clinical_pk.json may carry similar quality issues. Audit deferred.

Branch: feat/predict-auto-ecm
Trigger: PR #22 closed the architectural double-counting but only helped manual ECM callers. pipeline.predict.predict() did not activate ECM by default, so the metabolic_fraction registry had zero effect on the production benchmark. Issue #9 tracked this gap.
Fix: Auto-detect registered OATP1B1 substrates by canonical InChIKey (connectivity block) in predict(), then load both transporter_kinetics + hepatic_ecm_params from existing registries. Auto-load gated on BOTH registries having the drug; warning tag oatp1b1:auto_ecm:<name> on the result for audit.
Headline impact at merge: bit-identical (issue #25 SMILES error in the holdout reference for pravastatin prevented the lookup from matching). After #25 fix shipped, the auto-load path becomes active for pravastatin and contributes to the metric movements above.
Why InChIKey block 1 matching: SMILES sources sometimes strip stereochemistry annotations. Matching on the full InChIKey would miss those variants; matching on the connectivity block (first 14 chars) tolerates stereo differences. False positives across the 7 currently-registered substrates are not a concern (all have distinct connectivity).
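A minimal sketch of block-1 matching, using pure string handling. The three keys are synthetic placeholders rather than real compounds, and the helper names are hypothetical; `UHFFFAOYSA` is the standard second block for a structure with no stereo layer.

```python
def connectivity_block(inchikey: str) -> str:
    """First 14 characters of an InChIKey: the connectivity hash."""
    block = inchikey.split("-")[0]
    if len(block) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return block

def same_connectivity(key_a: str, key_b: str) -> bool:
    """Compare two InChIKeys on connectivity only, ignoring stereo."""
    return connectivity_block(key_a) == connectivity_block(key_b)

# Placeholder keys, not real compounds.
full_stereo = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
stereo_stripped = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"
other_skeleton = "CCCCCCCCCCCCCC-UHFFFAOYSA-N"

print(full_stereo == stereo_stripped)                   # False: full-key match misses
print(same_connectivity(full_stereo, stereo_stripped))  # True: block-1 match succeeds
print(same_connectivity(full_stereo, other_skeleton))   # False
```

The trade-off is that stereoisomers with different pharmacokinetics would collide on block 1, which is acceptable only while the registry holds no such pair.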
Branch: feat/oatp1b1-ecm-reconciliation
Trigger: GenoADME Tier 1 PARTIAL on pravastatin; test_oatp_ecm_statins[pravastatin] xfail under post-Hardening realize_means() (FE drifted 1.486 → 1.823); GitHub issues #12 (#8a) / #13 (#8b) / #14 (#8c) sequencing the fix.
Root cause: build_drug_on_graph(profile, adme, ..., transporter_kinetics, hepatic_ecm_params) always decomposed XGBoost hepatocyte CLint into per-enzyme affinities AND, separately, applied the OATP1B1 ECM clearance when transporter+ECM kwargs were supplied. The two clearances added at the simulation layer. For uptake-dominated substrates (canonical: pravastatin, ~85% OATP1B1), in vitro hepatocyte CLint already integrates the OATP1B1 contribution, so this counted the same clearance twice.
Fix: Per-drug metabolic_fraction registry that scales the metabolic-path enzyme_affinities derived from XGBoost CLint. When the engine’s ECM machinery is active for a drug whose hepatocyte CLint is uptake-dominated, the registry routes the entire hepatic clearance through the ECM transporter path without double-counting. Default 1.0 (no scaling) for the 106 unregistered holdout drugs.
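The scaling step can be sketched as follows, assuming the fraction multiplies every metabolic-path affinity uniformly. The function name and the affinity values are hypothetical; the real logic lives in _decompose_clint.

```python
def scale_enzyme_affinities(enzyme_affinities, metabolic_fraction=1.0):
    """Scale metabolic-path affinities; the default 1.0 leaves them untouched."""
    if not 0.0 <= metabolic_fraction <= 1.0:
        raise ValueError("metabolic_fraction must lie in [0, 1]")
    return {enz: aff * metabolic_fraction
            for enz, aff in enzyme_affinities.items()}

# Hypothetical affinities decomposed from an XGBoost CLint (arbitrary units).
affinities = {"CYP3A4": 12.0, "CYP2C9": 3.0}

unregistered = scale_enzyme_affinities(affinities)      # default path, no scaling
uptake_only = scale_enzyme_affinities(affinities, 0.0)  # pravastatin-style entry

print(unregistered)  # {'CYP3A4': 12.0, 'CYP2C9': 3.0}
print(uptake_only)   # {'CYP3A4': 0.0, 'CYP2C9': 0.0}
```

With a fraction of 0 the metabolic path contributes nothing, so hepatic clearance has to come entirely from the ECM transporter path.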
- data/transporters/cyp_clearance_overrides.json: registry seeded with pravastatin metabolic_fraction=0.0 (canonical OATP1B1-only).
- src/sisyphus/predict/cyp_clearance_overrides.py: InChIKey-keyed loader.
- src/sisyphus/predict/ivive.py: _decompose_clint(metabolic_fraction=) + build_drug_on_graph SMILES lookup.

Test invariant redesign (#13): The pre-#12 cmax_on/cmax_off < 0.95 invariant in test_oatp_pravastatin is mathematically incompatible with the post-fix model: with metabolic_fraction=0, the “off” arm has no hepatic clearance for pravastatin and Cmax goes very high. Replaced with SLCO1B1 EM/PM phenotype check: PM (OATP1B1 × 0.10) must raise Cmax vs EM. Empirical: cmax_em=0.0422, cmax_pm=0.1280, ratio=3.034 (clinical literature: ~2-3× AUC under PM).
Abundance recalibration (#14): scripts/calibrate_oatp_abundance_ecm.py post-#12 still recommends the existing liver.transporters.OATP1B1.mean = 5.0e5 (FE 1.058 vs FDA pravastatin 0.045 mg/L). The Hardening-era T7 drift was a downstream symptom of the double-counting, not an abundance miscalibration. PS_active 502 L/h remains outside the Watanabe 2009 literature range [0.5, 2.0] — separate ECM IVIVE-scaling concern (DE-33 adjacent), not a #12 deliverable.
Concrete metric changes:
- test_oatp_ecm_statins[pravastatin]: xfail (FE 1.486-1.823) → PASS (FE 1.066, gate 1.3). Promoted out of _KNOWN_PEFF_FAILS.

Headline invariance: pipeline.predict.predict() does not activate ECM/transporter machinery by default; build_drug_on_graph is called without transporter_kinetics or hepatic_ecm_params. The 107-holdout benchmark predicts via the default path, so the metabolic_fraction registry has zero effect on production AAFE. 4track artifact bit-identical pre-vs-post #12 (Meta 2.6947, Engine 3.7575, ML 3.0571). The fix is targeted at ECM-active code paths (GenoADME Tier 1, PGx-aware predictions, calibration script, integration tests).
Aftermath: Issue #21 (fluvastatin under-prediction) opened to track the opposite-direction failure. The metabolic_fraction registry is extensible to (B)-flavor per-drug fractions in v0.3 (atorvastatin ~0.7 CYP3A4, rosuvastatin ~0.15 CYP, etc.) by adding entries; no further code changes required.
Branch: feat/hardening-mean-only
Trigger: Engine drift bisect from 2026-04-29 entry — investigation revealed +19.1% Engine drift was NOT real model degradation but RNG-order coupling.
Root cause: predict() with n_mc_samples=0 (deterministic default) used graph.sample(rng=np.random.default_rng(42)), which:
- Distribution.sample(rng) over all enzyme/transporter dicts in YAML order
- rng.lognormal(...) consuming RNG state
- 2924f50, v2 prodrug enzymes, v3 metadata) shifts subsequent draws

The sample(rng=42) realized values were treated as “deterministic” but were actually a single specific lognormal draw at each position, vulnerable to ANY upstream YAML change.
Fix: Add BodyGraph.realize_means() and DrugOnGraph.realize_means() methods that use dist.mean directly instead of dist.sample(rng). predict() and test_engine_validation now use these. ~120 lines.
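The coupling and the fix can be reproduced with a toy model. This sketch uses the standard-library `random` module rather than NumPy, and a flat dict instead of a BodyGraph; `sample_all` and the local `realize_means` are illustrative stand-ins, and the enzyme entries are made up.

```python
import math
import random

def lognormal_params(mean, cv):
    """Convert arithmetic mean and CV to (mu, sigma) of the underlying normal."""
    sigma2 = math.log(1.0 + cv * cv)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)

def sample_all(dists, seed=42):
    """One draw per entry, in dict insertion order, from a single shared RNG."""
    rng = random.Random(seed)
    return {name: rng.lognormvariate(*lognormal_params(m, cv))
            for name, (m, cv) in dists.items()}

def realize_means(dists):
    """Deterministic realization: use each distribution's mean directly."""
    return {name: m for name, (m, _cv) in dists.items()}

# Hypothetical (mean, cv) entries; names and numbers are illustrative only.
before = {"CYP3A4": (100.0, 0.3), "OATP1B1": (50.0, 0.4)}
after = {"NEW_ENZYME": (10.0, 0.5), **before}  # new entry inserted upstream

drifted = sample_all(before)["CYP3A4"] != sample_all(after)["CYP3A4"]
stable = realize_means(before)["CYP3A4"] == realize_means(after)["CYP3A4"]
print(drifted, stable)  # True True
```

The seeded draw for CYP3A4 moves as soon as another distribution is sampled ahead of it, while the mean-only value does not depend on ordering at all.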
Headline AAFE delta (v2 baseline 2026-04-30 → Hardening 2026-05-01):
| Track | v2 baseline | Hardening | Δ (%) | Note |
|---|---|---|---|---|
| Meta (Overall) | 2.702 | 2.695 | -0.3% | Restored to pre-Achour 2026-04-14 value |
| Engine (Overall) | 3.572 | 3.757 | +5.2% | Was seed-favorable at 3.572; mean-only is canonical |
| ML (Overall) | 3.057 | 3.057 | 0% | Invariant |
| In-domain Meta | 2.730 | 2.732 | +0.07% | Within CI noise |
Bisect interpretation (resolves 2026-04-29 follow-up):
The Meta value 2.695 from Hardening EXACTLY matches the pre-Achour 2026-04-14 value, confirming the Engine “drift” narrative was entirely RNG-order artifact. Engine track value 3.421 from yesterday’s manual cv=0 zeroing was a partial-zeroing artifact (cardiac_output and other globals not zeroed); 3.757 is the truly canonical mean-only value.
Test impact:
- test_engine_validation: midazolam/caffeine/warfarin pass within 5%; propranolol (~16% drift xfail) flipped to PASS, same RNG mechanism resolved
- test_holdout_regression: pin 2.702 → 2.695
- test_prodrug_v2_snapshot: re-pinned to mean-only Cmax (sepiapterin 11.40→11.30, remdesivir 0.987→0.984, tebipenem_pivoxil 0.443→0.521 +17%, fostamatinib 0.135→0.126 -7%)
- test_prodrug_v3_enzyme_leak_audit: pre_baseline regenerated against Hardening canonical; 107/107 byte-identical going forward

Architectural significance:
Files:
- src/sisyphus/graph/body.py: + BodyGraph.realize_means() (~50 lines)
- src/sisyphus/core.py: + DrugOnGraph.realize_means() (~50 lines)
- src/sisyphus/pipeline/predict.py: deterministic path uses realize_means
- tests/integration/test_engine_validation.py: uses realize_means; propranolol xfail removed
- test_holdout_regression.py (2.702→2.695), test_prodrug_v2_snapshot.py (4 drugs)
- data/training/4track_holdout_predictions.json regenerated
- data/validation/4track_ci_2026-05-01.json (10k bootstrap, seed=20260422)

Follow-ups:
Branch: feat/prodrug-activation-v3 (gated on v2 PR #7 merge per spec §8.1, satisfied 2026-04-30 by 78d12e3).
Spec: docs/superpowers/specs/2026-04-29-prodrug-activation-v3-design.md
Plan: docs/superpowers/plans/2026-04-29-prodrug-activation-v3.md (19 tasks across 5 phases — all complete)
Literature deliverable: docs/superpowers/specs/2026-04-29-prodrug-v3-literature.md
Per-item dispositions (mechanistic-A doctrine compliant per spec §3.3):
| # | Item | Disposition | Citation primary | Code change |
|---|---|---|---|---|
| 1 | BH4 CL/Vd (sepiapterin) | ceiling_accepted | Feillet 2008 + FDA Kuvan + EMA EPAR (F not known) | v3_metadata only |
| 2 | GS-441524 CL/Vd (remdesivir) | literature_applied | Tamura 2023 + Leegwater 2022 (popPK geomean) | CL 10→17.4, V 35→535 |
| 3 | R406 CL/Vd (fostamatinib) | literature_applied | Matsukane 2022 (IV microdose review) | CL 28→15.7, V 250→256 |
| 4 | tebipenem CL/Vd | ceiling_accepted | Eckburg 2019 (V/F surrogate rejected) | v3_metadata only |
| 5 | SPR proteomic abundance | ceiling_accepted | HPA + Wu 2020 (animal-only) | v3_metadata only |
| 6 | CES2/tebipenem CLint | ceiling_accepted | Gupta 2023 (no isoform attribution) | v3_metadata only |
Outcome:
Significance: v3 closes the input-data quality pillar of the prodrug saga (v1→v2→v3) with rigorous mechanistic-A discipline. 4 items closed as ceiling because primary literature truly does not exist (F_sapropterin, F_tebipenem, human SPR proteomic, in vitro CES2/tebipenem). 2 items advanced via popPK geomean. Empirical Cmax fold-errors barely shifted because:
This is the canonical mechanistic-A outcome: “we know the literature gap exists; we documented it; we did not fudge to pass”. v4 candidates require new mechanistic terms (extra-hepatic esterase, BH4 first-pass depletion, etc.) — beyond data refresh.
Test impact:
- test_prodrug_v3_registry_schema: 8/8 PASS (TDD red→green)
- test_prodrug_v3_enzyme_leak_audit: PASS (107/107 byte-identical)
- test_prodrug_v2_validation_gate: 4 xfail (reasons updated with v3 disposition references)
- test_prodrug_v2_snapshot: 4 PASS (re-pinned to v3 deterministic Cmax values)
- test_prodrug_v2_pipeline_smoke: 4 PASS (functional-only refactor per §6.1)
- test_prodrug_v2_ddi_smoke: PASS at v2 tolerance (no widening needed)

Files:

- data/sbi/prodrug_activation_registry.json (4 entries with v3_metadata; 2 with value updates)
- tests/integration/test_prodrug_v3_registry_schema.py (NEW), tests/regression/test_prodrug_v3_enzyme_leak_audit.py (NEW)
- scripts/capture_prodrug_v3_baseline.py + tests/regression/data/prodrug_v3_pre_baseline.json

Trigger: v2 PR (feat/prodrug-activation-v2) CI failure on test_engine_validation::test_cmax_within_5pct[midazolam, caffeine, warfarin]: Cmax shifted 6-19% above Omega targets.
Diagnosis: v2 added new lognormal enzyme distributions (SPR/CES1/CES2/ALPI) to physiology YAML at liver, gut_wall, and kidney nodes. BodyGraph.sample(rng) iterates nodes in YAML insertion order, so adding a cv>0 distribution at kidney (which previously had no enzymes block, position 4 in YAML, BEFORE liver) consumed 1 RNG draw before liver’s CYP3A4 sample. This shifted all liver CYP samples → midazolam Cmax +18.5%. Liver/gut_wall enzyme additions were appended AFTER existing CYPs, so existing CYP samples preserved BUT new draws shifted downstream OATP1B1 transporter sample → ECM-pathway holdout drugs drifted 8-27%. Test was passing on main due to RNG-order coincidence with seed=42.
Fix (commit 6c121ce): Move kidney YAML node block to after gut_wall. Preserves all v2 mechanistic content (kidney SPR retained for sepiapterin renal contribution); only changes RNG sample order. ODE state index accessed via name lookup throughout — functionally invariant.
Cache regen (commit 6528ba8): ECM holdout regression test (5% drift gate) failed because v2’s enzyme additions still shift OATP1B1 sample even with kidney moved (liver enzyme appendage is the irreducible cause). data/training/4track_holdout_predictions.json regenerated against PR src + Option D YAML to capture v2 baseline.
Aggregate AAFE delta (main 2026-04-29 → v2 2026-04-30):
| Track | main (2026-04-29) | v2 (2026-04-30) | Δ (abs) | Δ (%) |
|---|---|---|---|---|
| Meta (Overall) | 2.719 | 2.702 | -0.017 | -0.6% |
| Engine (Overall) | 4.073 | 3.572 | -0.501 | -12.3% |
| ML (Overall) | 3.057 | 3.057 | 0 | 0% |
| Meta (In-domain) | 2.759 (n=80) | 2.730 (n=80) | -0.029 | -1.1% |
Meta %2-fold/3-fold unchanged (46.7%, 62.6%). Engine %3-fold improved 40.2 → 53.3.
Significance:
spec §6.1 invariance violation: v2 spec §6.1 promised “107-holdout invariance” — actually impossible because adding any cv>0 enzyme to a node consumes RNG draws and shifts downstream samples. Spec assumption was wrong. Real invariance requires either (a) per-node independent RNG seeding, or (b) deterministic mean-only realization. Both deferred to hardening backlog.
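Option (a) could look like the sketch below, which derives one RNG stream per node from a stable hash of the node name. This is an illustration of the idea only and was not implemented in this entry; `node_rng`, `sample_nodes` and the node contents are hypothetical.

```python
import hashlib
import random

def node_rng(base_seed: int, node_name: str) -> random.Random:
    """Derive an independent RNG per node from a stable hash of its name."""
    digest = hashlib.sha256(f"{base_seed}:{node_name}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def sample_nodes(nodes, base_seed=42):
    """nodes maps node name -> list of (mu, sigma); each node has its own stream."""
    out = {}
    for name, dists in nodes.items():
        rng = node_rng(base_seed, name)
        out[name] = [rng.lognormvariate(mu, sigma) for mu, sigma in dists]
    return out

# Hypothetical layout: adding a kidney enzyme must not move liver draws.
v1 = {"liver": [(0.0, 0.3), (1.0, 0.4)]}
v2 = {"kidney": [(0.5, 0.2)], **v1}
liver_invariant = sample_nodes(v1)["liver"] == sample_nodes(v2)["liver"]
print(liver_invariant)  # True
```

`hashlib` is used instead of the built-in `hash()` because string hashing is salted per process. Per-node streams still shift when a distribution is appended inside a node ahead of later draws, so full invariance would need seeding per distribution or the mean-only route.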
Test impact:
- test_engine_validation::test_cmax_within_5pct: PASSES (3/3) with kidney moved.
- test_ecm_holdout_regression: PASSES (cache regenerated).
- test_holdout_regression::test_cached_holdout_aafe_is_2p695: pinned AAFE updated 2.695 → 2.702. (NB: same test was already failing on main at 2.719, a pre-existing main bug; not in CI workflow.)
- test_oatp_ecm_statins[fluvastatin]: FAILS, FE 3.651 vs gate 3.0 (improved from main’s 4.133 but still over). Pre-existing, separate from v2.
- test_oatp_ecm_statins[pravastatin]: PASSES (was failing on main per 2026-04-29 entry; v2 baseline shift moved it within gate, likely incidental).

Follow-ups (queued):
Files:
- data/physiology/reference_man.yaml: kidney node moved after gut_wall (commit 6c121ce)
- data/training/4track_holdout_predictions.json: regenerated (commit 6528ba8)
- tests/integration/test_holdout_regression.py: pinned AAFE 2.695 → 2.702 (commit 6528ba8)

Trigger: tests/integration/test_ecm_holdout_regression.py failing on main: 10/10 spot-checked drugs drifted 15-27% lower than cached. Investigation revealed the cache (data/training/4track_holdout_predictions.json) was last written 2026-04-14, before P4.5 Achour merge (2026-04-23) and other ECM/V3-routing changes.
Action: Re-ran scripts/run_engine_benchmark.py --save-json data/training/4track_holdout_predictions.json on current main. Backup of pre-regen cache stashed at /tmp/4track_pre_regen_2026-04-29.json (not committed).
Aggregate AAFE delta (PRE 2026-04-14 cache → POST 2026-04-29 fresh):
| Track | PRE | POST | Δ (abs) | Δ (%) |
|---|---|---|---|---|
| Meta (Overall) | 2.695 | 2.719 | +0.024 | +0.9% |
| Engine (Overall) | 3.421 | 4.073 | +0.652 | +19.1% |
| ML (Overall) | 3.057 | 3.057 | 0 | 0% |
| Meta (In-domain) | 2.710 (n=85) | 2.759 (n=80) | +0.049 | +1.8% |
| Engine (In-domain) | 3.236 (n=85) | 3.808 (n=80) | +0.572 | +17.7% |
Meta %3-fold: 65.4 → 62.6. Engine %3-fold: 57.9 → 40.2.
Significance:
- docs/claude/propranolol_cmax_drift.md, the propranolol +16% drift on b366035 was an early canary; the broader engine drift documented here is consistent with that direction.

Test impact:

- test_ecm_holdout_regression now PASSES (cache matches fresh predictions).
- test_oatp_ecm_statins[pravastatin] still FAILS (FE 1.486 vs gate 1.3, T7 calibration drift), independent of cache regen.
- test_oatp_ecm_statins[fluvastatin] still FAILS (FE 4.133 vs gate 3.0, Peff overprediction), independent of cache regen.

Follow-up needed:
Files updated:
- data/training/4track_holdout_predictions.json (regenerated)
- CLAUDE.md headline performance table (point estimates, %2/3-fold, n_in_domain; CIs annotated stale)
- docs/claude/experiment-log.md (this entry)

Spec: docs/superpowers/specs/2026-04-22-achour-abundance-correlation-design.md
Plan: docs/superpowers/plans/2026-04-22-achour-abundance-correlation.md
Branch: feat/achour-correlated-abundance (merged commit TBD).
Outcome: Infrastructure landed. Distribution gains optional correlation_group
field; new sisyphus.physiology.correlation_registry provides multivariate-lognormal
sampling; generate_physiology(rng=) opt-in; reference_man.yaml liver node
migrated to Achour 2021 CVs with OATP1B1 independent (mean_r=0.234 < 0.3
threshold, empirical Achour Table S7 inclusion rule).
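For intuition, correlated lognormal abundances can be drawn by correlating the underlying normals. The sketch below handles a single pair with a hand-written 2x2 Cholesky step; the means, CVs and correlation are hypothetical and not taken from Achour 2021, and the real correlation_registry may be implemented differently.

```python
import math
import random

def lognormal_params(mean, cv):
    """Convert arithmetic mean and CV to (mu, sigma) of the underlying normal."""
    sigma2 = math.log(1.0 + cv * cv)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)

def sample_correlated_pair(rng, dist_a, dist_b, log_r):
    """Draw two lognormals whose logs have correlation log_r (2x2 Cholesky)."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    z2 = log_r * z1 + math.sqrt(1.0 - log_r * log_r) * z2
    (mu_a, s_a), (mu_b, s_b) = lognormal_params(*dist_a), lognormal_params(*dist_b)
    return math.exp(mu_a + s_a * z1), math.exp(mu_b + s_b * z2)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
# Hypothetical (mean, cv) pairs and target log-correlation.
draws = [sample_correlated_pair(rng, (100.0, 0.5), (40.0, 0.6), 0.6)
         for _ in range(20_000)]
log_corr = pearson([math.log(a) for a, _ in draws],
                   [math.log(b) for _, b in draws])
print(round(log_corr, 2))
```

A target whose mean correlation falls below the inclusion threshold, as OATP1B1 does here, would simply be sampled on its own with `log_r` effectively zero.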
Gates passed:
Non-outcome: SBC improvement is an explicit Non-Goal (§1 spec). Downstream P4.5a spec will retrain the SBI amortizer with physiology sampling and re-measure SBC on the 52-cell grid.
Data artifacts:
- data/physiology/achour2021_liver_abundance.csv: 29 donors × 6 targets
- data/physiology/achour2021_correlation.json: 5×5 log-correlation matrix for CYP3A4/2D6/1A2/2C9/2E1

Source: Achour 2021 CPT 109:222-232 (PMC7839483, CC BY-NC 4.0).
Infrastructure shipped (7 commits, 4630b0b..4e10ad2):
Route-aware t_min_h = _IV_CMAX_DELAY_H (5/60 h) if route=="iv" else 0.0 threaded through solve(), solve_mc(), compute_endpoints(), propagate_fast() (scipy backend), pipeline. Oral (107 holdout + production) byte-identical to V2 — pinned by tests/integration/test_v3_oral_regression.py. 562 pass / 4 skip / 2 xfail, zero new failures.
- docs/superpowers/specs/2026-04-22-iv-cmax-observation-design.md (d88183a)
- docs/superpowers/plans/2026-04-22-v3-iv-cmax-observation.md (de6292b)
- 4630b0b (solve anchor) → 9bc2e3d (solve_mc windowed) → 2742df8 (compute_endpoints) → 6ed22e7 (propagate_fast) → 3f86e2e (pipeline route-cond) → ed3207f (oral regression) → 4e10ad2 (propagate caveat)

ECM generalization re-run under V3 (7aa49ae, data/validation/oatp_generalization_result_v3.json):
Formal Mode C. Direction flipped from V2: V2 appeared to over-predict 1.1–1.35× but that was the t=0 artifact. V3 with windowed Cmax shows systematic underprediction on both drugs, about 2.6× for glimepiride and 2.1× for valsartan.
| Drug | Observed | V2 (artifact) | V3 (real) | V3 PI | V3 log10 FE |
|---|---|---|---|---|---|
| glimepiride | 0.243 | 0.270 (1.11×) | 0.095 | [0.087, 0.101] | −0.409 |
| valsartan | 4.02 | 5.405 (1.35×) | 1.940 | [1.80, 2.06] | −0.316 |

Median |log10 FE| = 0.363 < 0.5 Mode B gate → formally Mode C, but same-direction underprediction is substantively suggestive of systematic ECM over-clearance for non-statin OATP1B1 substrates. V2’s apparent “near-pass” was a methodology illusion; V2 result preserved as .v2.json.
Diagnostic (5ff72eb, data/validation/v3_fup_override_diagnosis.json):
fup override (valsartan predicted 0.009 → clinical 0.050, 5.6× increase) gave Cmax 0.97× — essentially no change. Glimepiride predicted fup already matches clinical (0.005). Predict-layer fup confound RULED OUT as cause of V3 underprediction.
Remaining candidates for V3 underprediction (not investigated this session):
Pre-registration integrity maintained:
V3 methodology spec written + committed (d88183a) BEFORE engine re-run (7aa49ae). Single MC run. Fup diagnostic explicitly marked exploratory ("note": "NOT a pre-registered run"). No post-run parameter adjustment.
How to apply:
SUPERSEDED by V3 run (2026-04-22) above. Original V2 result preserved as data/validation/oatp_generalization_result.v2.json. Kept here for historical context only.
Spec: docs/superpowers/specs/2026-04-21-ecm-generalization-test-design.md
- 9115e63 + v2 amendment 6e7ce0a (substrate swap) + v2.1 0d78c38 (valsartan Jmax scaling)

Plan: docs/superpowers/plans/2026-04-21-ecm-generalization-test.md (commit 3c85fe4)
Result: data/validation/oatp_generalization_result.json (commit 4fb6d38)
Formal outcome: Mode C (inconclusive)
Per drug:
Substantive signal: Both point estimates within 1.5× of observed — well inside the 3× clinical-error gate. If PI were non-degenerate and contained observed, outcome would have been Mode A (confirmed generalization within tested domain). Suggestive-positive for ECM mechanism but NOT formally confirmed.
Why PI is zero-width (root cause):
MC Cmax for IV bolus in Sisyphus = dose / V_venous_blood (deterministic t=0 instantaneous value, 3.7 L ± 0.0). Distributional CVs downstream (Jmax, Km, fup, Kp, ps_*) never reach Cmax because max-over-time selects t=0. All 1000 samples produce identical output.
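The degenerate interval is easy to reproduce with a toy one-compartment IV bolus, where C(t) = (dose / V)·exp(−k·t). This is a deliberate simplification of the PBPK engine: the dose and the elimination-rate distribution are hypothetical, and only the 3.7 L volume and the 5/60 h delay (from the later V3 entry) come from this log.

```python
import math
import random

DOSE_MG = 10.0            # hypothetical dose
V_VENOUS_L = 3.7          # venous blood volume quoted in this entry
IV_CMAX_DELAY_H = 5 / 60  # window start used by the later V3 methodology

def cmax(k_per_h, t_min_h, t_end_h=24.0, n_steps=240):
    """Max of C(t) = (dose / V) * exp(-k * t) over grid points with t >= t_min_h."""
    grid = (t_end_h * i / n_steps for i in range(n_steps + 1))
    return max(DOSE_MG / V_VENOUS_L * math.exp(-k_per_h * t)
               for t in grid if t >= t_min_h)

rng = random.Random(42)
# Hypothetical elimination rates standing in for Jmax/Km/fup/Kp variability.
k_samples = [rng.lognormvariate(math.log(0.5), 0.4) for _ in range(200)]

unwindowed = {cmax(k, 0.0) for k in k_samples}
windowed = {cmax(k, IV_CMAX_DELAY_H) for k in k_samples}

print(len(unwindowed))    # 1: every sample returns dose / V
print(len(windowed) > 1)  # True: downstream variability now reaches Cmax
```

Taking the maximum from t=0 always selects the initial concentration, so no amount of parameter variability can widen the interval; starting the window after the bolus lets it through.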
Secondary gap:
data/transporters/hepatic_ecm.json lacks entries for valsartan + glimepiride → ps_passive/ps_eff/cl_int_bile fell to defaults (1e6 L/h for ps_*, 0 for bile). Not the cause of zero-PI but a data completeness gap worth closing.
Predict-layer confound flag (per spec §Peff Isolation):
Pre-registration integrity: Single run at N=1000, seed 42. No post-run parameter adjustment. All spec/plan amendments (v2, v2.1) pre-dated the engine execution. Substrate swap (bosentan/repaglinide → glimepiride) was documented under v2 amendment BEFORE any engine run, driven by data-access limits not expected outcome.
Commits:
Follow-up recommended (separate task, not this session):
- hepatic_ecm.json for non-statin OATP1B1 substrates.

feat/oatp-ecm)

- docs/superpowers/specs/2026-04-20-oatp-ecm-hepatic-clearance-design.md
- docs/superpowers/plans/2026-04-20-oatp-ecm-hepatic-clearance.md
- ClearanceFluxSpec gains "extended" model; DrugOnGraph gains ps_passive, ps_eff, cl_int_bile; data/transporters/hepatic_ecm.json + load_hepatic_ecm_params() added.
- data/physiology/reference_man.yaml liver clearance model well_stirred → extended; two active_transport edges removed; liver.transporters.OATP1B1 abundance re-calibrated 1.0e11 → 5.0e5 via scripts/calibrate_oatp_abundance_ecm.py (pravastatin FE=1.013 under ECM).
- 107 holdout: Meta AAFE 2.695 preserved exactly (|Δ|=0.000019). Non-OATP drugs use PS_passive=PS_eff=1e6, CL_int_bile=0 defaults; ECM reduces to well-stirred algebraically.
- @pytest.mark.xfail(strict=False) so they auto-promote if Peff is later improved.
- data/validation/oatp_ecm_abundance_calibration.json; sweep script at scripts/calibrate_oatp_abundance_ecm.py.

93febe3)

- predict/phenotype.py transporter extension: TRANSPORTER_ALIASES = {"SLCO1B1": "OATP1B1"}, apply_phenotype_to_graph scales transporter abundance by CPIC activity score (PM 0.10×, IM 0.50×, EM 1.00×, UM 2.00×). parse_phenotype_spec accepts SLCO1B1:PM and mixed CYP2D6:PM,SLCO1B1:IM.
- pipeline/predict.py does not call it).

3a04291, data-only)

- data/transporters/oatp1b1.json: 1 drug → 5 drugs. Rosuvastatin / atorvastatin / pitavastatin / fluvastatin Km from Niemi 2009 midpoints. Jmax scaled from clinical hepatic uptake CL ratio vs pravastatin (Hirano 2006, Maeda 2011, Li 2018). CV widened to 0.40 (Jmax) / 0.35 (Km).
- pipeline/predict.py does not call load_oatp1b1_kinetics, TDM path only).
- scripts/validate_oatp_phase2a.py ran 41 min then stalled on LSODA for 4/5 statins. Diagnosis (oatp_phase2a_stiff_diagnosis.json): abundance 1e11 is flow-limited saturated regime. Abundance sweep (oatp_abundance_sweep.json, 2026-04-20 PM): Cmax invariant across [1e9, 3e9, 1e10, 3e10, 1e11]. Conclusion: parameter tuning cannot fix this; engine refinement needed (→ OATP ECM).
- test_transporter_db.py unit tests load all 5 drugs.
- bayesian_update(method="sbi", sbi_reweight=True): opt-in flag. NPE posterior samples importance-reweighted by log-normal likelihood (mathematically equivalent to IS with NPE as proposal). tdm_sbi.py:555 + tdm.py:227. Default False (preserves existing production path).
- data/validation/tdm_method_tournament_sbi_reweight.json, OFF→ON bias):
| Mean | bias | : 23.0% → 16.4% (29% improvement overall) |
| Interpretation: reweighting effective when | bias | ≥ 20%, regressive when | bias | < 10%. N=200 single-obs stochastic error amplified by likelihood. Bias-variance tradeoff. |
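As a rough illustration of the reweighting step described in this entry, the sketch below weights a set of posterior samples by a log-normal likelihood of one observed concentration and reports the effective sample size (ESS). It is not the code in tdm_sbi.py: the function names, the residual sigma of 0.2, and the synthetic predicted concentrations are all assumptions made for the example.

```python
import numpy as np


def lognormal_loglik(obs, pred, sigma):
    """Log-density of a scalar observation under LogNormal(log(pred), sigma)."""
    z = (np.log(obs) - np.log(pred)) / sigma
    return -0.5 * z**2 - np.log(obs * sigma * np.sqrt(2.0 * np.pi))


def reweight(pred_conc, obs_conc, sigma=0.2):
    """Return normalized importance weights and the effective sample size."""
    logw = lognormal_loglik(obs_conc, pred_conc, sigma)
    logw = logw - logw.max()  # stabilise before exponentiating
    w = np.exp(logw)
    w = w / w.sum()
    ess = 1.0 / np.sum(w**2)
    return w, ess


# Synthetic stand-in for N=200 predicted concentrations, one per posterior sample.
rng = np.random.default_rng(0)
pred = rng.lognormal(mean=np.log(10.0), sigma=0.5, size=200)

w, ess = reweight(pred, obs_conc=12.0)
post_mean = np.sum(w * pred)  # weighted posterior mean of the prediction
print(f"ESS = {ess:.1f} of {pred.size}, weighted mean = {post_mean:.2f}")
```

A low ESS relative to N means a few samples carry most of the weight, which is one way the single-observation noise amplification noted in the interpretation can show up.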
- sbi_reweight=False retained. Per-drug routing: method_routing.json gets sbi_reweight: {"morphine": true}; morphine route: is → sbi. CLI auto: [auto] routing morphine → method=sbi +reweight. Final production: 12 SBI / 0 IS / 1 IBIS (IS override retired). 7 SBI dispatch tests pass.
- docs/superpowers/specs/2026-04-19-p6-morphine-fix-decision.md.

- pipeline/predict.py gains HIGH_ACID_LOW_FUP AD flag: informational warning for drugs with pKa < 5 AND DrugBank measured fup < 0.02. Ketorolac, ibuprofen flagged. Morphine / base drugs not flagged. Engine numbers unchanged.
- docs/superpowers/specs/2026-04-19-p7-ketorolac-decision.md.

feat/continuous-hierarchical)

- src/sisyphus/sbi/physiology_generator.py: generate_physiology(BW, age) builds BodyGraph for any patient 0.5–85 y, 5–120 kg. Hines 2008 enzyme ontogeny (exponential maturation) + Wynne 1989 aging decline + allometric volume/flow scaling.
- bayesian_update(body_weight_kg=X, age_years=Y) + CLI --body-weight X --age Y.
- scripts/sbi_generate_continuous_data.py + scripts/sbi_train_continuous_hierarchical.py.
- 21a92c9): sisyphus tdm --phenotype CYP2D6:PM applies CPIC activity scaling (PM 0.1×, IM 0.5×, EM 1×, UM 2×). src/sisyphus/predict/phenotype.py. 17 tests. DM PM case: posterior enzyme_affinity 4.89 → 6.48 (physiologically interpretable).
- d4e1633): Track A amortizer conditions on first obs only; additional obs applied as post-hoc log-normal likelihood importance reweighting. _scipy_cmax_and_obs_conc() helper + weighted posterior stats. 2-obs test confirms ESS decrease.
- ce9a924): removed hardcoded DEFAULT_DOSE_MIN=25mg. Dose range now inferred from current_dose as 0.1×–10×. DM 30 mg PM → recommends 12 mg correctly (previously clamped to 25 mg).

5c0d864, reverted fdda41c)

See DE-32.
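The phenotype handling mentioned in the entries above (a GENE:PHENOTYPE spec, the SLCO1B1 → OATP1B1 alias, and the CPIC multipliers PM 0.1×, IM 0.5×, EM 1×, UM 2×) can be sketched as follows. This is a minimal illustration, not the contents of predict/phenotype.py: parse_spec and scale_abundance are hypothetical names, and the flat abundance dictionary is an assumed stand-in for the real graph structure.

```python
# Multipliers and alias table as quoted in the entries; everything else is illustrative.
ACTIVITY_SCORE = {"PM": 0.10, "IM": 0.50, "EM": 1.00, "UM": 2.00}
TRANSPORTER_ALIASES = {"SLCO1B1": "OATP1B1"}


def parse_spec(spec: str) -> dict[str, float]:
    """Parse 'GENE:PHENO[,GENE:PHENO...]' into {target name: activity multiplier}."""
    scales: dict[str, float] = {}
    for item in spec.split(","):
        gene, sep, pheno = item.strip().partition(":")
        if not sep or pheno.upper() not in ACTIVITY_SCORE:
            raise ValueError(f"bad phenotype item: {item!r}")
        # Gene symbols for transporters map to the protein name used on the graph.
        target = TRANSPORTER_ALIASES.get(gene.upper(), gene.upper())
        scales[target] = ACTIVITY_SCORE[pheno.upper()]
    return scales


def scale_abundance(abundance: dict[str, float], scales: dict[str, float]) -> dict[str, float]:
    """Multiply each listed abundance by its multiplier; unlisted entries are unchanged."""
    return {name: value * scales.get(name, 1.0) for name, value in abundance.items()}


scales = parse_spec("CYP2D6:PM,SLCO1B1:IM")
scaled = scale_abundance({"CYP2D6": 1.0, "OATP1B1": 5.0e5, "CYP3A4": 1.0}, scales)
print(scales)
print(scaled)
```

Resolving the alias at parse time keeps the scaling step unaware of gene symbols, so enzymes and transporters go through the same multiplier lookup.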
feat/oatp1b1-pravastatin)

- builder.py, node transporters: + active_transport edge type) + flux.py / rhs_jax.py target-side IVIVE bug fixes + build_drug_on_graph(transporter_kinetics=...) kwarg + data/transporters/oatp1b1.json DB + predict/transporter_db.py loader.
- docs/superpowers/specs/2026-04-15-oatp1b1-hepatic-uptake-design.md, docs/superpowers/plans/2026-04-15-oatp1b1-pravastatin.md.

ccc15a0 code + 43051ab eval)

- apply_theta_to_drug sigmoid-inverts. Improves prior coverage for low-fup acids / statins.
- models/sbi/multi_drug_nsf.pt = v2 (logit fup, 94 epochs, 2815 s on 110k samples). v1 archived as _v1.pt.
- data/validation/tdm_method_tournament_v2.json.
- amortizer.py:load_result() warning + tdm_sbi.py:sbi_update() ValueError block old models.

docs/tdm_ci_calibration.md)

- TDMResult.cmax_ci_90 populated from raw posterior Cmax samples via weighted quantile in all dispatch paths (IS / IBIS / EnKF / SBI). Removes the lognormal over-cover artifact on high-CV posteriors.
- bayesian_update(min_ci_half_width_fraction=0.5) kwarg. A posterior CI whose half-width is below 50% of the mean is widened to 50%. apply_ci_floor() public helper.
- data/training/4track_holdout_predictions.json formally saved (JSON schema + per-drug fields).

docs/sbi_multi_drug_results.md)

docs/sbi_multi_drug_results.md Addendum)

- tdm.bayesian_update(method="sbi") + silent IBIS fallback.
- data/sbi/method_routing.json: initially 11 SBI / 1 IS / 1 IBIS.
- sisyphus tdm --method {is, ibis, enkf, sbi, auto}. auto consults routing table.
- apply_theta_to_drug must collapse override-field CVs to 0 so posterior CV drops below prior CV (morphine before 56% > 39%, after 34% < 39%).

docs/surrogate_ood_fix.md)

Initial:
- params_to_features_single summed abundance × affinity across all nodes (liver+gut) without reversing _CLINT_SCALING. Real drugs had log10_clint ≈ 6 vs training range [−0.5, 3.0]. Inflation ~10⁴×.
- recover_drug_level_clint() restricts sum to liver node, divides by _CLINT_SCALING / _IVIVE_SCALING = 180,000. All 6 test drugs recover to within 5% of predict_adme(..).clint.mean.
- data/validation/surrogate_production_accuracy.json): 13 drugs, R²=0.992, mean abs rel err 22%, 9/13 within 30% (69% overall, 80% on 10-drug SBI routing subset).
- bayesian_update(method="sbi", sbi_use_surrogate=True). Batched JAX call (not per-sample). Default False.

Follow-up (ensemble-std gate, hybrid routing):
- features_in_distribution (box) + ensemble_std <= 0.02. Rejected samples fall back to scipy. Threshold calibrated so nominal drugs (ensemble std 0.004–0.020) stay on surrogate.

- data/sbi/populations.json: adult (70 kg) + pediatric_5y (18 kg).
- models/sbi/hierarchical_nsf_2k.pt.
- bayesian_update(population_class="pediatric_5y") + CLI --population pediatric_5y.
- tests/unit/test_sbi_hierarchical.py.

c0cab88)

audit/holdout-leakage-fix + feat/ude-diffrax merged. VDss 4th-track production added, EnKF TDM added, prospective validation series integrated, JAX backend consolidated. Post-merge AAFE 2.808 → 2.695 confirmed. tdm.py latent bug exposed and fixed (method="enkf" wrong kwarg + EnKFResult → TDMResult conversion).

| Metric | 1 obs | 2 obs | 3 obs |
|---|---|---|---|
| Mean CV reduction | 78.1% | 82.7% | 82.9% |
| Mean error reduction | 79.4% | 80.8% | 79.1% |
| Mean posterior CV | 8.4% | 6.5% | 6.4% |
5e5a3d0)

- docs/holdout_contamination_audit.md, data/validation/contamination_fix_report.json.
- engine/ diff=0).

Detailed per-phase milestones: see phase-completion.md (local-only; moved to docs/_internal/ in PR #51).

Prepend a new section at the top of the appropriate date block. Each entry should have:
If an entry documents a failure, also append it to dead-ends.md with the next DE-NN id.