Every experiment here was run, reverted, and documented. Before proposing any accuracy improvement, open this file and search for the approach. New track proposals must first pass the error-decorrelation gate described in diagnosis.md §4.
Canonical count: 43 enumerated experiments below. Narrative references in commit messages or prose (e.g. “#35 error cancellation”, “14번째 시도” (“14th attempt”), “누적 33 methods” (“cumulative 33 methods”)) use informal numbering that counts early exploration attempts separately; those narrative numbers are not authoritative and do not match the table count below. When in doubt, cite the table entry (DE-NN).
| Category | Representative entries | Headline outcome |
|---|---|---|
| Post-hoc meta-learner variants (33 method-combinations total) | DE-23, DE-24, DE-30 | error correlation r > 0.986 with baseline; provably near-optimal already |
| CLint R² improvement (14 attempts) | DE-08, DE-11, DE-13, DE-16, DE-17, DE-18 | R² gains 0.02–0.12 achievable, but all destroy pipeline error cancellation |
| Foundation models (MoLFormer / ChemBERTa / Uni-Mol) | DE-14 | Morgan FP + XGB dominates every frozen-embedding combination |
| Docking features for CLint | DE-13 | ΔR² = +0.005 (noise); binding affinity ≠ metabolic rate |
| UDE / gradient-through-solver | DE-29 | residual CV R² < 0 (falsified); Phase 2/3 unexecuted |
| E2E Neural PK (Pharos, E2E MLP) | DE-05, DE-17 | data scale insufficient; GNN needs ≫5k Cmax samples |
| Training-data expansion (ChEMBL / DrugBank / Biogen) | DE-09, DE-11 | breaks error cancellation |
| ADME replacement (partial or full) | DE-18, DE-31, DE-15 | 18+ error-cancellation regressions |
| Class-aware / batch-specific meta weighting | DE-25 | kinase under-prediction diagnosed; no weight combo beats baseline |
| F% bioavailability predictor | DE-26 | trained, negative; VDss-style unlock does not apply |
| Direct CL/F + t½ predictors | DE-27, DE-28 | CL/F R²=0.232 + t½ variants all negative; falsifies “IVIVE bypass” as the reason VDss worked |
| Hepatic intracellular fu correction (PPB-targeted) | DE-37 | Phase A infra shipped; primary literature corpus paywall-locked, 4 PPB candidates dispositioned ceiling_accepted, Meta AAFE shift 0.0% |
| UGT path / abundance / IVIVE interventions | DE-36, DE-38, DE-39, DE-40 | four consecutive metric-neutral UGT cycles; no per-substrate hepatocyte-basis scaling factor exists; ΔMeta AAFE ≤ 0.003 |
| Absorption / first-pass bioavailability-F recalibration | DE-41, DE-42, DE-43 | engine F under-call is bidirectional first-pass dispersion (not absorption); the absorption knob is linear = a flat scalar; and the fixed-weight meta damps any engine recalibration to ~18% pass-through on both the retrospective and prospective sets — the engine is not a headline lever on any benchmark |
Root cause (shared across categories): Cmax residuals are not learnable from molecular structure (CV R² < 0). Remaining error ≈ experimental variability + formulation + inter-patient variability. SMILES → Cmax carries a fundamental information-channel ceiling.
Each entry: ID, date, what was tried, outcome (numeric), why it failed, telltale sign if the idea returns under a new name.
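Most numeric outcomes below are AAFE deltas. A minimal sketch of the metric, assuming the usual definition (absolute average fold error, i.e. 10 raised to the mean absolute log10 of predicted/observed); the function name and the sample values are illustrative and not taken from the pipeline:

```python
import math

def aafe(predicted, observed):
    """Absolute average fold error: 10 ** mean(|log10(pred / obs)|).

    Assumed definition; the production metric code is not shown in this file.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two equal-length, non-empty sequences")
    abs_logs = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(abs_logs) / len(abs_logs))

# Hypothetical values: one 2x over-prediction, one 2x under-prediction, one exact hit.
example = aafe([2.0, 0.5, 10.0], [1.0, 1.0, 10.0])
print(round(example, 3))  # 1.587
```

Because the metric averages in log space, a 2× over-prediction and a 2× under-prediction cost the same, and a perfect model scores 1.0.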
AAFE ± 0.02, noise level. fup XGBoost alternatives do not move the pipeline.
AAFE ± 0.02, noise level.
Negative result.
Engine AAFE 2.861 → 3.090. Revert. Re-measured 2026-05-13 under current pipeline as DE-36 — prior conclusion does NOT generalize; today UGT path is mildly Engine-positive and Meta-neutral. See DE-36 for the refreshed measurement; DE-04 retained for historical record only.
Holdout 3.265. N=65 insufficient to learn a full SMILES→Cmax map.
R²=0.166. Apparent CLint not learnable from molecular features alone.
Quantitative kinetics absent at the time; zero drugs active. Superseded by Phase 1 OATP1B1 (2026-04-15).
Engine AAFE +0.005 (noise), meta AAFE 2.058 → 2.153 worse. Error cancellation destroyed. Revert.
Engine AAFE +0.021, meta 2.058 → 2.067 worse. Revert.
Engine AAFE +0.021 (noise). Kp is not the engine-error driver.
CV R² 0.229 → 0.273 (+0.044). Engine AAFE 2.945 → 2.930 (−0.015), meta AAFE 2.058 → 2.110 (+0.052 worse). Error cancellation broken. Revert.
Engine AAFE +0.072, meta +0.077. Individual harms sum. Simultaneous improvement cannot establish a new balance. Revert.
DiffDock CYP3A4 1,114 drugs: CLint CV R² 0.190 → 0.196 (ΔR²=+0.005, noise). Vina: ΔR² = −0.026 (worse). Docking feature importance 0.2–0.4%, top-30 has zero docking features. Binding affinity ≠ metabolic rate. Do not retry.
Frozen embedding + Ridge / MLP / XGBoost, every combination. Morgan FP + XGB R²=0.205 dominates every alternative (MoLFormer mean 0.184, ChemBERTa 0.170, Uni-Mol 0.083). Ensembling also worsens. CLint R²≈0.20 is a target-noise ceiling, not a representation ceiling.
MMPK AUC → CL/F direct prediction (N=1,014), Vd/F inverse (N=940). CL/F XGB CV R²=0.232, Vd/F R²=0.332. Analytical 1-cpt Cmax. 3-track LOOCV: w_clf=0.00 (base / other both). Standalone AAFE=3.133. Meta Δ=−0.005 (noise). Oracle 1.788 across 28/107 drugs but not unlockable with fixed weights. Benet hypothesis (“IVIVE bypass → accuracy gain”) not verified. Infrastructure retained, w_clf=0.00.
ChEMBL 36 all-extract: 539 unique compounds (534 net new). TDC Hep 978 + ChEMBL 517 = 1,910 compounds. Scaffold CV R² 0.279 → 0.333 (+0.054). Engine AAFE 3.416 → 3.515 (+0.099 worse), meta AAFE 2.277 → 2.316 (+0.038 worse). LOOCV w_base 0.45 → 0.25 (meta-learner loses confidence in engine). Revert. Data archived under data/chembl/ and data/training/clint_expanded_v2.csv.
Low / Med / High (10/50 cutoff), XGB classifier accuracy 53.5% (kappa 0.299, scaffold CV). Probability-weighted MC mixture. Engine AAFE +0.108 worse; Meta AAFE 2.277 → 2.255 (Δ=−0.023, marginal noise-level improvement). w_base=0.45 retained. Coarser prediction destroys less error cancellation, but the effect is within noise.
ALFABET BDE on 978 compounds. BDE_min vs log10(CLint): r=+0.033 (no correlation). CYP subset: r=+0.043. Gate |r|<0.15 failed. Hepatocyte CLint integrates kcat + Km + enzyme complement; C-H BDE (CYP kcat-only) cannot explain it.
GNN encoder + MoE(K=3) + 1-comp PK backbone. 3,551 compounds, 1,074 with Cmax. Best AAFE=3.006 (GNN+MoE), worse than Sisyphus ML-only 2.336. 465K parameters vs 1,074 samples (ratio 433:1). Data scale, not architecture, is the bottleneck. GNN needs ≫5,000 Cmax samples. Branch: pharos-prototype.
Feature selection top-300 + Optuna: CLint scaffold CV R² 0.279 → 0.399 (+0.120). Holdout Meta AAFE +0.012 (17th error-cancellation regression). Regularization is not the ceiling; data quality is.
All ADME models re-optimized simultaneously. CLint +0.033, fup +0.042, VDss +0.057 in R². Engine AAFE +0.165, Meta AAFE +0.023 worse. Partial OR whole replacement fails under the current pipeline.
Mordred 1,613 descriptors + ensemble (XGB + LGB + Ridge). CV AAFE 3.410 < Morgan 3.750, but Holdout AAFE 2.848 > Morgan 2.336. At N≈1,100, dense features CV-overfit.
log10(Cmax) = log10(Engine) + Delta(features). Delta variance 46% of Cmax variance. Holdout: Delta-only 3.528, Delta+ADME 8.450 (catastrophic overfit). Engine error is non-systematic → ML correction impossible.
Morgan FP Tanimoto (median 0.464), k=20 similarity-weighted: AAFE 3.049. 3-way blend w_knn=0.00. r(ML, kNN) = 0.690 (correlated errors). Oracle 3-track 1.689 (28/107 drugs kNN best) but no fixed weight exploits it.
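For readers who meet this idea under a new name, a minimal sketch of a similarity-weighted kNN track. Assumptions: fingerprints are represented as plain sets of on-bit indices (no RDKit dependency), the weighting is done in log10(Cmax) space, and the names `tanimoto` and `knn_cmax` are hypothetical, not the production code:

```python
import math

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def knn_cmax(query_fp, train, k=20):
    """Similarity-weighted mean of log10(Cmax) over the k most similar training drugs.

    train: list of (fingerprint_set, cmax) pairs. Returns a Cmax estimate.
    """
    if not train:
        raise ValueError("training set is empty")
    ranked = sorted(((tanimoto(query_fp, fp), cmax) for fp, cmax in train),
                    key=lambda pair: pair[0], reverse=True)[:k]
    total_w = sum(sim for sim, _ in ranked)
    if total_w == 0:
        # No structural overlap at all: fall back to an unweighted geometric mean.
        return 10 ** (sum(math.log10(c) for _, c in ranked) / len(ranked))
    return 10 ** (sum(sim * math.log10(c) for sim, c in ranked) / total_w)

# Toy data, for illustration only.
train = [({1, 2, 3, 4}, 100.0), ({1, 2, 5, 6}, 10.0), ({7, 8}, 1.0)]
pred = knn_cmax({1, 2, 3, 4}, train, k=2)
```

With the toy data the two neighbours carry weights 1 and 1/3, so the estimate sits at 10^1.75. A track like this draws on the same structural neighbourhood as the ML track, which is one plausible reading of the correlated errors reported above.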
OOF Stacking (Ridge), ACF (Analog Correction Factor), Winsorized — 6 variants. None beat baseline meta 2.277. Stacking V1: 2.420 (OOF-Full gap r=0.81 destroys transfer), ACF k=5: 3.005 (neighbor fold-error std 0.67, noisy), Winsorized cap=0.5: 2.300 (same). Stacking+ACF combined — no effect. 23rd negative.
5 PK-domain + 5 cross-domain. Every entry: Isotonic Engine Cal. (+0.325), ER-Proxy Routing (tie), Error Direction Clf (+0.055, 64.2% acc), CLint-Stratified (+0.006), AAFE-Direct Optim (+0.082), Quantile XGB (+0.602), Local BMA (+0.081), Caruana Ensemble (+0.090), Disagree-Sigmoid (+0.014), Trimmed AAFE (+0.097). All 10 have error correlation r>0.986 with baseline. Compound-type-adaptive geometric blend is provably near-optimal. 24th negative (cumulative 33 methods).
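The error-correlation figure quoted here can be checked with a few lines. This is a sketch under assumptions: errors are taken as signed log10(pred/obs) and compared with a Pearson coefficient; the exact definition used by the gate in diagnosis.md §4 is not reproduced in this file, and the function names are hypothetical:

```python
import math

def log_errors(pred, obs):
    """Signed log10 fold errors, one per drug."""
    return [math.log10(p / o) for p, o in zip(pred, obs)]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def error_correlation(candidate, baseline, observed):
    """Correlation between a candidate track's errors and the baseline's errors."""
    return pearson(log_errors(candidate, observed), log_errors(baseline, observed))

# Toy data: a candidate that is just the baseline scaled by 1.1 has identical error shape.
observed = [1.0, 1.0, 1.0, 1.0]
baseline = [2.0, 0.5, 4.0, 1.0]
candidate = [1.1 * b for b in baseline]
r = error_correlation(candidate, baseline, observed)
```

A method whose errors correlate near 1 with the baseline has nothing to offer a blend: no fixed weight can cancel errors that point the same way on the same drugs.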
After diagnosing kinase batch under-prediction, scripts/class_aware_meta_benchmark.py swept kinase-class weights. Meta AAFE tied at 2.277; a 1,765-cell weight cache was generated, and no combination beats baseline. Results: data/validation/class_aware_meta_results.json.
DrugBank 527 drugs, XGB (scripts/train_bioavailability.py). Standalone and meta integration both negative. Results: data/validation/f_predictor_negative_result.json. F% does not unlock error cancellation the way VDss did.
xgboost_clearance_v1.json + xgboost_thalf_v1.json. All 6 combinations negative. Results: data/validation/post_vdss_negative_results.json. Falsifies the interpretation that “VDss’s IVIVE bypass” is what made VDss work; the real reason is clearance-orthogonality (see diagnosis.md §4).
Residual learning. data/validation/phase1_ude_prototype_result.json records the falsification. Residual not learnable from molecular structure (CV R² < 0). Phase 2 (amortized SBI) and Phase 3 (flow matching) unexecuted.
DrugBank measured fup always preferred over XGBoost prediction (inverting the >5× disagreement fallback). Principled, empirically harmful: Engine AAFE 3.421 → 3.726 (+0.306 — the 34+ error-cancellation failure pattern reproduced), Meta AAFE 2.695 → 2.728 (+0.033 noise). Revert. Narrative “35th error cancellation failure” entry.
To fix pravastatin SBC (cov_dev 0.223), OATP1B1 substrates added: atorvastatin, fluvastatin, pitavastatin, valsartan, bosentan. 55 → 60 drugs. Result: pravastatin 0.223 → 0.237 (worse), posaconazole 0.073 → 0.173 (much worse, SBI → IBIS regression). SBI 12 → 11. Pravastatin failure is engine-level (OATP1B1 not modeled), not training-data. Revert. Narrative “36th failure” entry.
Valsartan fup predicted 0.009 vs DrugBank/clinical 0.050. Hypothesized this 5.6× deficit drove V3 Cmax underprediction (FE 0.48× on valsartan, 0.39× on glimepiride under V3 windowed IV-Cmax). Override: Cmax changed by 0.97× — essentially unchanged. Glimepiride predicted fup already matches clinical. fup RULED OUT as cause of V3 OATP non-statin underprediction. scripts/diagnose_v3_underpredict.py, result 5ff72eb. Candidates remaining (not tested): Jmax calibration, Vss/Kp over-distribution, ECM architecture limit outside statin Km range. Do not re-test fup override for this class.
2026-05-03 root cause confirmed: Vss over-prediction. Post-Hardening diagnostic with realize_means() (issue #8/#21 closure session): XGBoost VDss predictor over-distributes both drugs by 3-6× (valsartan pred 107.8 L vs FDA 17 L; glimepiride 28.5 L vs FDA 8.8 L). Direct mechanism for Cmax under-prediction: IV Cmax ~ dose/Vss → 6× over-Vss → ~2× under-Cmax (residual absorbed by distribution kinetics). Jmax calibration ruled out as primary contributor (would not affect C0 IV behavior). Engine uses Kp (Rodgers-Rowland), not Vss directly, so a registry-only Vss override would be architecturally inconsistent. Wholesale Kp recalibration for high-fup acid class is engine-layer work, deferred. Re-running fup override post-Hardening would still hit the DE-31 error-cancellation trap. Structural limitation acknowledged; not actionable without coordinated Kp method retuning.
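The over-distribution folds can be reproduced from the quoted volumes. A minimal sketch, assuming an instantaneous IV-bolus C0 = dose/Vss so that the dose cancels in the ratio; the helper name is hypothetical:

```python
# (predicted Vss in L, FDA reference Vss in L), as quoted in the entry above.
VSS_L = {"valsartan": (107.8, 17.0), "glimepiride": (28.5, 8.8)}

def c0_underprediction_fold(vss_pred, vss_ref):
    """C0 = dose / Vss, so C0_ref / C0_pred reduces to vss_pred / vss_ref."""
    return vss_pred / vss_ref

folds = {drug: c0_underprediction_fold(pred, ref) for drug, (pred, ref) in VSS_L.items()}
for drug, fold in folds.items():
    print(f"{drug}: Vss over-predicted {fold:.2f}x")
```

This gives the bolus-limit bound only (about 6.3× and 3.2×). The roughly 2× Cmax under-prediction reported above is smaller because, per the entry, distribution kinetics absorb part of the difference.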
20 RDKit 3D descriptors (asphericity, NPR1/2, PMI1/2/3, WHIM, etc.) on ETKDG-generated conformers (99.8% success). Two evaluations, both falsified:
- feature/3d-cyp-multidist (Experiment A): Morgan+3D ML AAFE 2.930 vs Morgan-only 3.030 (Δ=−0.100, first gate PASS); meta blend 2.818 vs then-production meta 2.277. Orthogonality r=0.655 with Morgan, insufficient to clear ensemble error cancellation. Archive tag archive/3d-cyp-multidist-2026-04-01.
- feature/morgan3d-retrain (Phase 12c, 2026-04-01, 324 hyperparameter configs): Morgan+3D ML AAFE 2.402 vs Morgan-only 2.341 (Δ=+0.061, WORSE); meta 2.370 vs production 2.277 (Δ=+0.093). Error correlation r=0.952; at production training scale, the 3D features lose their pseudo-orthogonality. 3D feature share = 6.9% importance, only 3 in top-20 (whim_5/6/7). Archive tag archive/morgan3d-retrain-2026-04-01.

The two evaluations together rule out “but it might work at production scale” as an escape: the orthogonality observed at N=1029 was a small-data artifact, not a feature-engineering signal.

Telltale if it returns: “asphericity / NPR / PMI / WHIM / 3D shape descriptors / conformer features” added to ML feature set.
Distinct from DE-30 (UDE / gradient-through-solver): pre-train an MLP operator to mimic the engine, then fine-tune SMILES→Encoder→Operator→Cmax end-to-end with frozen operator. Two-gate evaluation:
Mechanism: the operator captures engine physics perfectly, but the engine’s systematic bias (AAFE 3.42) transfers through unchanged — fine-tuning with N=1,239 cannot correct it via the encoder. Same SMILES information ceiling as DE-05/DE-17/DE-30 reached from a different architectural angle. Branch feature/neural-operator-surrogate (commit b85b18d); archive tag archive/neural-operator-surrogate-2026-04-02. Telltale if it returns: “differentiable surrogate / amortized engine / pre-trained operator + encoder fine-tuning” with N < 5K Cmax training data.
Refresh of a prior unrecorded sensitivity test that had concluded “UGT fm redistribution degrades Engine AAFE 2.861 → 3.090” (cited as a comment in src/sisyphus/predict/ivive.py pre-2026-05-13). That measurement was pre-v0.3.2 + pre-public-only-headline + pre-ECM-auto-activation; current pipeline is materially different. Re-measured under current main + DrugBank-present:
Old narrative (“UGT path is harmful”) does not generalize to the current pipeline. New finding: UGT path is mildly Engine-positive and headline-neutral. The Engine improvement gets re-absorbed by the 4-track meta-learner’s track weights (DE-08~DE-18 error-cancellation family).
Not activated in production because: (a) zero Meta benefit, (b) UGT annotations sourced only from DrugBank → public-clone reproducibility would require a curated data/enzymes/ugt2b7_substrates.json-style registry (separate cycle, parallel to the v0.3.2 NAT2/UGT1A1 pattern). Comment in ivive.py updated to point here. Telltale if it returns: “UGT path / UGT fm redistribution / DrugBank-driven UGT enrichment” without a public-clone-reproducible UGT substrate registry AND without re-running the error-decorrelation gate at the meta-learner level.
Artifacts: /tmp/4track_state_A_ugt_off.json and /tmp/4track_state_B_ugt_on.json (not committed; comparison summary in ivive.py:640-655 comment).
Date: 2026-05-22
Hypothesis: Per-drug fu_correction_liver from primary literature (Watanabe 2009 DMD 37:1471 / Yamazaki 2010 DMD 38:998 / Riccardi 2017 DMD 45:781 / Patilea-Vrana 2017 Clin Pharmacokinet) would reduce systematic over-prediction for highly bound drugs in the 107-holdout by gating hepatic CLint on fu_inc instead of fup at well-stirred and parallel-tube extraction sites.
What was measured: 19 holdout drugs with meta_fold > 3 were mechanism-triaged (T11). 4 identified as PPB-related candidates (paroxetine, oxybutynin, abiraterone, progesterone); the other 15 dispositioned not_applicable (P-gp, renal, CES1, extreme first-pass, formulation, prodrug-entity mismatch, autoinhibition, lysosomal trapping, gastric-pH absorption, MAO-A, cytidine deaminase, high-fup, UGT2B7 — see data/transporters/hepatic_fu_correction.json). T12 literature search across the 4 primary corpus papers + secondary PubMed queries (hepatic uptake, albumin-facilitated, Kp,uu,liver) returned 0 usable fu_inc/fu_p ratios for any of the 4 PPB candidates — every primary paper is paywalled subscription-only and WebFetch retrieves only abstracts, not the supplemental tables where per-drug ratios live. Final registry: 19 audit rows, all fu_correction_liver = {mean: 1.0, cv: 0.0} (identity); 0 literature_applied, 0 class_extrapolated, 4 ceiling_accepted, 15 not_applicable.
Outcome: Meta AAFE shift = 0.0% (2.7715238009 → 2.7715238009, bit-identical 107/107 per-drug, T14 verification). Phase A infrastructure (DrugOnGraph.fu_correction_liver, loader, ClearanceFluxSpec + ProdrugActivationFluxSpec gating, liver-node applicability flag, identity-blind random-rename invariance test) is shipped to main (commit a0c90f8). Curation rows retained as audit trail.
Why it failed: The primary literature corpus that publishes hepatocyte uptake fu_inc/fu_p ratios (Watanabe / Yamazaki / Riccardi / Patilea-Vrana et al.) is paywalled, and the per-drug ratios live in supplemental tables that WebFetch cannot retrieve from abstracts. Secondary PubMed sources for the 4 PPB candidates returned mechanism-context papers (autoinhibition PBPK, transdermal PBPK, SULT2A1-class steroid PBPK) without the specific fu_inc measurement. Without literature support, no defensible non-identity multiplier could be curated, so the infra runs on an all-identity registry.
What this implies: Either (a) fu_inc/fu_p ratios are not the dominant over-prediction mechanism for the 4 audited PPB candidates, (b) the accessible literature does not measure these ratios for the specific drugs of interest, or (c) both. Future iterations may revisit per spec §6.2 with: subscription access to the four primary DMD/CPK papers, in-house hepatocyte uptake assay data for the PPB candidates, or transporter-mediated alternatives (OATP / NTCP uptake clearance instead of fu_inc gating). Telltale if it returns: “hepatic intracellular fu / fu_inc / Kp,uu,liver / albumin-facilitated uptake” applied to highly-bound holdout drugs without a paired public-corpus or experimentally-measured ratio per drug.
Artifacts: data/transporters/hepatic_fu_correction.json (19 audit rows), curation log docs/superpowers/specs/2026-05-22-B11-Phase-B-curation-log.md, spec docs/superpowers/specs/2026-05-21-B11-hepatic-fu-correction-design.md, plan docs/superpowers/plans/2026-05-21-B11-hepatic-fu-correction.md. Phase A infra shipped at a0c90f8; Phase B curation at d10bbef.
Date: 2026-05-27
Hypothesis (not the primary B-02 hypothesis): activating literature-anchored UGT2B7 path for canonical substrates would improve per-drug FE for all 4 UGT2B7 entries (morphine, codeine, ketorolac, indomethacin).
What was measured (same-numerics-stack regen):
Outcome: Net Meta AAFE Δ = +0.0067 (within bootstrap noise [2.3151, 3.1690], 1.6% of CI half-width). 6 of 8 seeds improved; 2 of 8 (morphine + codeine) worsened. The Δ direction is determined by the 2 worsening drugs offsetting the 6 improvements.
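The "1.6% of CI half-width" figure follows directly from the quoted interval. A small check using only the numbers in the outcome line (variable names are illustrative):

```python
ci_low, ci_high = 2.3151, 3.1690   # bootstrap CI for Meta AAFE, as quoted
delta = 0.0067                     # net Meta AAFE shift, as quoted

half_width = (ci_high - ci_low) / 2
fraction = delta / half_width

print(f"half-width {half_width:.4f}, shift is {fraction:.1%} of it")
```

A shift this small relative to the interval cannot be distinguished from resampling noise, which is why the entry treats the direction of Δ as uninformative.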
Why morphine + codeine worsened: activating UGT2B7 redirects 70-85% of XGBoost CLint from the default CYP route to the UGT2B7 route. The UGT2B7 effective clearance (abundance × literature-fm × XGBoost CLint) is LOWER than the default CYP allocation, so total hepatic CL drops → Cmax rises. Pre-B-02 morphine engine FE 1.90 was a coincidental cancellation — over-extraction via CYP-default + missing UGT path summed to a moderate FE. Activating the correct UGT path REVEALED that the CYP-default routing was over-extracting these drugs.
What this implies: B-02 ships as designed (literature-anchored, anti-fudge preserved). The morphine/codeine worsening is a secondary diagnostic finding about the engine’s CYP-default routing balance for UGT-dominant substrates — orthogonal to B-02’s capability + reproducibility mandate. Phase 2.x (backlog B-13) will address UGT2B7 abundance + IVIVE recalibration to reconcile.
Telltale if it returns under a new label: “morphine over-prediction” or “UGT abundance recalibration” applied to UGT2B7 substrates without a CYP-route IVIVE rebalancing alongside.
Artifacts: data/training/4track_holdout_predictions.json (post-B-02 cache), data/validation/4track_ci_2026-05-27_B02.json (bootstrap CI), spec docs/superpowers/specs/2026-05-26-B02-ugt-public-registry-design.md, plan docs/superpowers/plans/2026-05-26-B02-ugt-public-registry.md.
Date: 2026-05-29
Hypothesis (B-13 original framing): adding literature-anchored gut-wall UGT2B7 + UGT1A9 abundance would supply the first-pass clearance B-02 left missing (DE-38), pulling morphine/codeine over-prediction back down.
What was measured (corrected B-13, same-numerics-stack regen): gut UGT2B7 added at the defensible literature value 3.6e3 pmol (0.60 pmol/mg total-mucosal × 6000; Al-Majdoub 2021 CPT 109:1136 / PMC8048492, corroborated Couto 2020 DMD 48:245). Only the 4 UGT2B7 gut-paired seeds shift, all DOWN but trivially: morphine −0.112%, codeine −0.034%, ketorolac −0.033%, indomethacin −0.004%. Meta 2.69828 → 2.69825 (Δ −2.7e-05). Morphine stays ~3.4× over-predicted.
Why it cannot work: the defensible gut UGT2B7 abundance (3.6e3) is ~0.15% of hepatic UGT2B7 (2.43e6). Gut first-pass via UGT2B7 is a sub-percent clearance term — it cannot close a 3.4× over-prediction. The morphine/codeine fix, if any, must come from a hepatic UGT2B7 IVIVE/extraction differential, not the gut node.
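The abundance arithmetic, spelled out. All three inputs are the values quoted in this entry; the variable names are illustrative and the unit of the ×6000 factor is left as stated in the source:

```python
gut_pmol_per_mg = 0.60    # intestinal UGT2B7, total-mucosal basis (Al-Majdoub 2021)
scale_factor = 6000       # multiplier quoted in the entry
hepatic_ugt2b7 = 2.43e6   # hepatic UGT2B7 abundance quoted in the entry

gut_ugt2b7 = gut_pmol_per_mg * scale_factor
gut_to_hepatic = gut_ugt2b7 / hepatic_ugt2b7

print(f"gut UGT2B7 = {gut_ugt2b7:.1e}, {gut_to_hepatic:.2%} of hepatic")
```

At roughly 0.15% of the hepatic pool, the gut term is three orders of magnitude short of what a 3.4× correction would require, consistent with the sub-percent per-drug shifts measured above.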
Citation-confabulation sub-finding (process, important): the B-13 spec’s gut abundances rested on confabulated literature. The claimed intestinal UGT2B7 “15 pmol/mg (5-30 range)” is ~25× the real median (0.60); “Bhatt 2019 DMD 47:498” resolves to an unrelated Kimoto maraviroc DDI paper (PMID 30862625); “Akabane 2012 DMD 40:1310” does not exist (NCBI esearch count=0). Gut UGT1A9 was DROPPED — not expressed in human small intestine (Oda 2012 isoform-specific antibody finds it in kidney+liver only; UGT1A10 is the intestine-specific 1A isoform; absent from Couto 2020 >5000-protein global proteomics). Caught by an 11-agent adversarial verification workflow (verify-gut-ugt-citations, 2026-05-29; both committed values refuted 3/3 + 3/3, high confidence). Lesson: spec-stage abundance/citation values must be verified against primary sources before reaching a committed YAML — “fallback citation” lists authored from memory are a confabulation risk.
Outcome: B-13 ships as an enzyme-level gut-wall correctness term (proper literature-grounded gut UGT2B7; UGT1A9 correctly absent), NOT a morphine fix. Metric-neutral within bootstrap noise.
Telltale if it returns under a new label: “gut UGT abundance,” “extra-hepatic UGT first-pass,” or “intestinal UGT2B7” proposed as a fix for morphine/codeine/UGT2B7-substrate over-prediction. The hepatic UGT2B7 IVIVE differential remains the only plausible lever and is a separate (un-started) backlog item.
Artifacts: data/physiology/reference_man.yaml (gut_wall UGT2B7), data/training/4track_holdout_predictions.json (corrected B-13 cache), spec docs/superpowers/specs/2026-05-27-B13-gut-ugt-expansion-design.md (+ 2026-05-29 amendment), tests/regression/test_gut_ugt_abundance.py.
Date: 2026-05-30
Hypothesis (the lever DE-39 named): correct the hepatic UGT in-vitro→in-vivo under-prediction with a per-substrate scaling factor (SF) on the UGT-routed affinity, pulling morphine/codeine over-prediction down. Designed as a bounded blind decisive experiment (spec 2026-05-30-hepatic-ugt-ivive-differential-design.md v2) after a 3-critic adversarial review demolished a v1 “fix morphine” framing as cherry-picking.
What was built (ships, audited no-op): a predict-side, per-enzyme SF hook — data/enzymes/ugt_ivive_sf.json registry + get_ugt_ivive_sf() loader + a one-line multiply in _decompose_clint (engine untouched; identity-blind preserved). With an all-1.0 registry it is a 107/107 bit-identical no-op (Gate D1). Infrastructure shipped per the B-11/DE-37 precedent.
What the decisive blind verification found — no applicable SF exists (all dispositions → 1.0):
Quantitative prior (pre-registered): even a full morphine 3.38→2.0 + codeine 1.78→1.3 fix moves Meta AAFE only ≈ −0.021, so any realistic partial, honest hepatic SF is sub-threshold. NO-GO.
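The ≈ −0.021 ceiling can be reproduced by hand. A sketch assuming AAFE is the geometric mean of per-drug fold errors over N=107 and taking the baseline Meta AAFE of 2.69825 from the DE-39 entry; the function name is hypothetical, and the shortcut is only valid while each fixed drug stays over-predicted (fold ≥ 1):

```python
import math

def aafe_after_fixes(baseline_aafe, n_drugs, fixes):
    """Shift a geometric-mean fold error when a few drugs change fold.

    fixes: list of (old_fold, new_fold), both >= 1, so |log10 fold| = log10 fold.
    """
    shift = sum(math.log10(new) - math.log10(old) for old, new in fixes) / n_drugs
    return baseline_aafe * 10 ** shift

baseline = 2.69825                      # Meta AAFE, from the DE-39 entry
fixes = [(3.38, 2.0), (1.78, 1.3)]      # morphine, codeine (best case quoted above)
after = aafe_after_fixes(baseline, 107, fixes)
delta = after - baseline

print(f"Meta AAFE {baseline:.3f} -> {after:.3f} (delta {delta:+.3f})")
```

Two drugs out of 107 contribute less than 2% of the log-error mass, so even a complete fix of both is sub-threshold before any SF is looked up.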
Outcome: predicted/realized Meta Δ = 0 (no-op). This is the fourth UGT intervention to land neutral (DE-36/38/39/40) and it closes DE-39’s “only remaining lever”: the hepatic UGT IVIVE differential, evaluated honestly (hepatocyte-basis, hepatic-fraction-only, blind, per-substrate-verified), has no applicable literature value. The clean no-op infra remains for any future verified per-substrate hepatocyte SF.
Telltale if it returns: “hepatic UGT IVIVE / UGT under-prediction correction / albumin-effect scaling” proposed for morphine/codeine without a verified, hepatocyte-basis, hepatic-fraction-only, per-substrate number. The HLM albumin fold and the renal contribution are the two traps. The DE-38-complete idea (UGT IVIVE plus a CYP-route rebalance) remains theoretically open but is a different cycle — B-14 shows the UGT-IVIVE half alone has no honest large value.
Artifacts: data/enzymes/ugt_ivive_sf.json (all-1.0 audited registry), src/sisyphus/predict/non_cyp_substrates.py (get_ugt_ivive_sf), src/sisyphus/predict/ivive.py (_decompose_clint hook), tests/unit/test_ugt_ivive_sf.py, tests/regression/test_ugt_ivive_sf_registry_schema.py, spec docs/superpowers/specs/2026-05-30-hepatic-ugt-ivive-differential-design.md (v2), plan docs/superpowers/plans/2026-05-30-B14-hepatic-ugt-ivive-differential.md.
Date: 2026-06-01
Context: the 2026-06-01 prospective expansion (N=28, Meta AAFE 3.21 > retrospective 2.698) showed the engine catastrophically under-predicts some 2025 NMEs (mirdametinib 30×, sevabertinib 18×). Root-caused via IV/oral decomposition to bioavailability (F) under-prediction, not clearance — engine F = 0.05–0.08 vs implied real F ≈ 1.0, while engine CL_systemic ≈ literature (mirdametinib 4.8 vs 4.6 L/h). See diagnosis.md §8.
Hypothesis: the engine’s own low predicted-F (or engine↔ML track disagreement) is a predict-time signal of an OOD / unreliable Cmax that the applicability-domain detector could flag, excluding the catastrophic cases from in-domain.
Result (falsified on the 107-holdout): corr(engine_F, |log10 fold|) = −0.037 on the holdout (vs −0.54 on the prospective new-16; does not generalize). Of 21 holdout drugs with engine F<0.10, 17 are within 2-fold (the engine predicts low F for nearly everything, median 0.18, and it is co-calibrated). Flagging F<0.08 removes 7 in-domain drugs but only moves in-domain AAFE 2.760→2.732, i.e. it removes well-predicted drugs. Engine↔ML divergence is also flat (holdout r=−0.033; top-20 vs bottom-20 divergence AAFE 2.48 vs 2.57).
Why it failed: the per-drug Cmax error is not recoverable from the model’s own outputs — consistent with the structural-error ceiling (~30% PI coverage). The F under-prediction is real but near-uniform, so it carries no discriminative OOD signal. The honest lever is measured-F routing or an absorption-model recalibration, not an AD flag.
Telltale if it returns: “flag low predicted-F / high track-disagreement as out-of-domain.” Re-check the holdout correlation (≈0) before building — it looks predictive on a prospective slice but does not generalize.
Date: 2026-06-03
Context: DE-41 / diagnosis.md §8 left “an absorption-model recalibration” as the one un-tested honest lever for the systematic engine bioavailability-F under-call (median engine-F/lit-F ≈ 0.46, 10/10 measured-fup+CLint PoC drugs). Two measurement-only multi-agent decompositions tested it (engine F = fa·Fg·Fh; runtime monkeypatch only, no tracked file changed; headline Meta 2.698 / engine 3.831 reproduced exactly as controls).
Confirmed diagnostic: the median under-call localises to fa (fraction absorbed) — fa median bias 0.55 (vs physiological ~0.9), Fg ≈ 1.0, Fh ≈ 1.05 — because ka = 2.88·Peff·ka_fraction/radius (~6%/segment) loses the race to gut transit (~3.85/h), so most dose transits to faeces unabsorbed (dasatinib fa 0.16, sildenafil 0.22). Decisive: the non-CYP3A acids (diclofenac/etodolac/febuxostat) have an empty metabolized_gut sink (Fg ≈ 1 real) yet suppressed F ⇒ the loss is fa, not first-pass.
Why the lever fails: ka enters the ODE linearly (rate = ka·y), so any uniform multiplier — the 2.88 constant, a villous-amplification factor, a corrected particle radius, or a literature transit-window — is mathematically the same flat scalar. It nulls the median (5.25× → engine-F/lit-F 1.0; engine N=107 3.831→3.336) but cannot reduce per-drug dispersion: all 4 candidates plateau at geomean fold-error 1.43–1.45 (flat-scalar 1.40, itself inside the ±15% lit-F noise band); the one nonlinear candidate (Peff Caco-2→in-vivo remap) made it worse (1.52); engine SITT (195 min) already matches literature (Yu 1996, 199 min). On the full N=107 holdout the best refinement scored engine AAFE 3.405 — worse than the plain scalar (3.336) — and flipped the engine from 14 to 30 >3×-over-predictors (the co-calibration-break signature; un-refit Meta regresses +3%, meta-regression risk HIGH).
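The "flat scalar in disguise" point can be seen directly from the ka expression quoted in this entry. A minimal sketch; the input values are hypothetical, and only the functional form ka = 2.88·Peff·ka_fraction/radius comes from the text:

```python
import math

def ka(peff, ka_fraction, radius, const=2.88):
    """Absorption rate constant in the form quoted in this entry."""
    return const * peff * ka_fraction / radius

s = 5.25                                 # the uniform multiplier discussed above
peff, frac, radius = 1.0, 0.5, 25.0      # hypothetical drug inputs

base = ka(peff, frac, radius)
via_constant = ka(peff, frac, radius, const=2.88 * s)   # retune the 2.88 constant
via_radius = ka(peff, frac, radius / s)                 # "corrected" particle radius
via_amplification = s * base                            # villous-amplification factor

same = all(math.isclose(v, s * base) for v in (via_constant, via_radius, via_amplification))
print(same)  # True
```

All three knobs land on the same ka for every drug, so they can shift the median but cannot reorder drugs relative to one another.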
The real residual is bidirectional first-pass, not absorption: once fa→1, the per-drug error splits into two opposing modes one knob cannot reconcile — (a) CYP3A first-pass over-extraction for bases (alprazolam/carbamazepine/quinine cap at F ≈ 0.5 vs lit 0.8–0.9 even at fa=1; candidate cause: gut-CYP3A abundance scaled-to-midazolam over-extracting non-midazolam substrates) and (b) well-stirred Fh under-extraction for high-PPB acids (diclofenac fup=0.003, febuxostat, etodolac overshoot — the DE-37/B-11 hepatic-fu problem). Fixing the bases worsens the acids. Both halves are already data-blocked / co-calibrated.
Telltale if it returns: “recalibrate the absorption constant / villous amplification / particle radius / transit time to fix the engine’s low bioavailability F.” It nulls the median F on a PoC set but is a flat scalar in disguise (ka is linear), worsens the holdout vs the simpler scalar, and breaks meta co-calibration. The only un-foreclosed F lever is measured-F routing; the recoverable structural residual is first-pass (gut/hepatic CYP3A IVIVE ⊕ hepatic-fu for high-PPB acids), not absorption.
Date: 2026-06-03
Context: DE-42 foreclosed absorption recalibration for the retrospective headline. Open question: the prospective N=28 set (Meta AAFE 3.21 — the real novel-drug failure, §8) is not part of the meta co-calibration, so a first-pass lever foreclosed retrospectively might still net-improve it. A measurement-only test decomposed the prospective catastrophes and measured two levers on both benchmarks via the production meta path (runtime monkeypatch only; before-controls bit-exact: retro meta 2.69825 / engine 3.8314).
Decomposition (production predicted-ADME, F = fa·Fg·Fh): the catastrophic under-predictors (mirdametinib engine 74×, sevabertinib 53×, pirtobrutinib, pacritinib, tovorafenib … mostly kinase inhibitors) are fa-first, Fg-second — fa 0.08–0.32 (absorption starved: low Peff, or low RDKit-solubility → particle_radius=50µm → ka ≪ gut transit), then gut-CYP3A Fg 0.37–0.55 (the midazolam-calibrated gut_wall CYP3A4 over-extracting). Fh is correct (consistent with §8: CL_systemic correct). The over-predictors (imlunestrant, taletrectinib) are not_F (Vdss/distribution, out-of-AD) — a blunt F lever worsens them.
Result (both levers fail at the meta):
- Absorption scalar (5.25×): prospective meta 3.171→3.102 (−0.069) but retro meta 2.698→2.780 (+0.082), net −0.012 (costs the headline more than it gains).
- Gut-CYP3A 0.5×: prospective meta 3.171→3.151 (−0.020), retro meta neutral (−0.0006), net +0.020, but inside the N=28 bootstrap CI (statistically zero) and not literature-anchored (halving a midazolam-calibrated abundance = tuning to Cmax, Invariant #8).

Net is the combined AAFE reduction across both benchmarks; positive means improvement.
Why it failed (the unifying mechanism): both levers move the engine track materially on prospective (absorption 4.11→3.75; gut-CYP3A 4.11→4.00; mirdametinib engine fold 58→13 / 58→51) — but the fixed-weight meta-learner damps this to ~18–19% pass-through, the SAME on prospective as on retrospective. The meta is robust to engine errors by construction (down-weights outlier engine predictions), which symmetrically prevents engine improvements from propagating. Prospective is NOT exempt from co-calibration — the engine is structurally not a headline lever on any benchmark. Plus the DE-42 bidirectional tension: relieving the catastrophic unders blows up the not_F over-predictors (imlunestrant 17×→62× under the absorption scalar).
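The pass-through figure can be recomputed from the prospective numbers quoted in this entry. A sketch assuming pass-through is defined as the ratio of the meta AAFE change to the engine AAFE change; the function name is hypothetical:

```python
def pass_through(meta_before, meta_after, engine_before, engine_after):
    """Fraction of an engine-track AAFE change that reaches the meta output."""
    return (meta_after - meta_before) / (engine_after - engine_before)

# Prospective N=28 values quoted in this entry.
absorption = pass_through(3.171, 3.102, 4.11, 3.75)
gut_cyp3a = pass_through(3.171, 3.151, 4.11, 4.00)

print(f"absorption scalar: {absorption:.1%}, gut-CYP3A 0.5x: {gut_cyp3a:.1%}")
```

Both levers come out at roughly 18–19%, matching the entry: the meta-learner passes through under a fifth of any engine movement, in either direction.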
Telltale if it returns: “the prospective / novel-drug set isn’t co-calibrated, so an engine F / first-pass / gut-CYP3A recalibration will fix it.” It improves the engine track on both sets but the fixed-weight meta mutes it to ~18%; net is neutral-to-negative and within N=28 noise. The only un-foreclosed F lever is per-drug measured-F routing, not an engine recalibration.
When a new experiment concludes negative, append as the next DE-NN with: ID, date, one-sentence description, numeric outcome (ΔR² or ΔAAFE or ratio), why it failed (1 sentence), telltale sign if it returns under a new label. Keep entries under 5 lines.