wormuse-analytics

Improving the worm→Chopin pipeline through Applied Statistics

Goal: the best model we can build that teaches a C. elegans to play Chopin's Nocturne in C♯ minor.

Polimi · AppStat 2026 · Companion to PyANNOW (NAML) · v0.8.0
Engineering doc: docs/STATISTICAL_DIAGNOSTICS.md
Notebook: notebooks/01_appstat_lecture_audit.ipynb
21 PyANNOW issues tagged [@appstat-audit] · 11 resolved as of v0.8.0

The goal — and why metrics aren't enough

Teach a worm to play Chopin. That means two things:

A pipeline

ion channels → spikes → muscles → MIDI → audio

PyANNOW already implements 8 NAML steps along this chain.

A scoring function

Quantify "Chopin-likeness" rigorously enough that optimising it produces music.

PyANNOW reports `onset_loss` and `musical_f1`. Both are too coarse.

PyANNOW ISSUE-016 and ISSUE-017 patched the metric — but the underlying pipeline has 10 structural logic problems that no metric correction can fix. Each AppStat lecture maps to a tool that addresses one of them.

The current pipeline

    ion-channel HH params (DEFAULT_PARAMS)
      ↓ run_forward_fast
    V_muscles (T, 96) ──► generate_neural_activity_302() → X_neural (302, T)   k≥4 ✅ v0.7.0
      ↓                     ↓
    onsets_base           Step 1a    RSVD (k=4)
    (Step 0 baseline)       ↓
                          Step 1b    Procrustes(standardize=True) → Z_aligned   ✅ v0.8.0
                            ↓
                          Step 2     PCA + KMeans
                            ↓
                          Step 3     RidgeCV → C_pred
                            ↓
                          Steps 4-6  MLP+Adam+L-BFGS → C_pred   ← beats baseline ✅
                            ↓
                          Step 8a/b  ODE/PDE PINN
                            ↓
                          find_peaks(distance=280ms, height=mean)   ← OPEN ISSUE-033
                            ↓
                          pitch ← MUSCLE_PITCHES_96[k] (all 12 classes)   ✅ v0.7.0
                            ↓
                          audio · score (pitch_aware_f1)   ✅ v0.8.0

Fixed as of v0.8.0 ✅: (a) 96-cell Boyle model, k≥4 PCs; (b) Procrustes standardized; (c) 100% pitch coverage. Still open: ISSUE-033 (magic threshold).

10 pipeline logic problems

#	Problem	Fix by AppStat	Issue
1	302-neuron matrix is synthetic (8 blocks + noise)	L01 PCA biplot reveals rank ≈ 8	029
2	Step 0 mislabelled "random"	L00 IOI distribution shows it's deterministic body-wave	030
3	8-muscle pitch bottleneck	L06 pitch-aware F1 ceiling	031
4	Procrustes between unstandardized incommensurate spaces	L04 standardize-first; L01 biplot	032
5	Shared `find_peaks(height=mean)` across all steps	L05 calibrated logistic + Youden	033
6	Chopin features lossily compressed to k=8	L01 cumvar audit	034
7	Onset-only metric ignores pitch & velocity	L06 pitch-aware F1 + velocity Pearson r	035
8	Step 8 PINN is oscillator-PINN, not ion-channel PINN	L07 RF baseline reveals what physics buys	036
9	Biological ceiling assumes any muscle = any pitch	L06 reachable-pitch confusion	037
10	No cross-validation anywhere	L06 StratifiedKFold + bootstrap CIs	038

L00 / Lab I · Descriptive statistics

The metric paradox is invisible from scalar numbers but obvious from distributions.

Closes logic #2

Step 0's IOI distribution = sharp peak at one body-wave period (~220 ms).

Chopin's IOI = broad, high kurtosis.

Step 0 is not "random" — it's deterministic and structurally incompatible with Chopin.

What we add

descriptive.collect_step_stats() → DataFrame of `{n, ioi_mean, ioi_std, ioi_skew, ioi_kurtosis, ...}`

plot_ioi_distributions() — KDE overlay
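The per-step summary can be sketched in a few lines of NumPy. This is only an illustration of the statistics named above, not the code behind `descriptive.collect_step_stats()`: the function name `ioi_stats`, the choice of population moments, and the meaning of `n` (number of onsets) are assumptions.

```python
import numpy as np

def ioi_stats(onsets_s):
    """Summarise the inter-onset intervals (IOIs) of one step's onsets, in seconds."""
    onsets = np.sort(np.asarray(onsets_s, dtype=float))
    ioi = np.diff(onsets)
    if ioi.size < 2:
        raise ValueError("need at least three onsets to describe an IOI distribution")
    mean, std = ioi.mean(), ioi.std()            # population std (ddof=0)
    if std < 1e-12:
        # Perfectly periodic onsets: skewness and kurtosis are undefined.
        skew = kurt = float("nan")
    else:
        z = (ioi - mean) / std
        skew = float(np.mean(z ** 3))
        kurt = float(np.mean(z ** 4) - 3.0)      # excess kurtosis (0 for a Gaussian)
    return {"n": int(onsets.size), "ioi_mean": float(mean), "ioi_std": float(std),
            "ioi_skew": skew, "ioi_kurtosis": kurt}

# Hypothetical inputs: a strictly periodic train vs. one with a single long gap.
periodic = ioi_stats(np.arange(0.0, 5.0, 0.22))
varied = ioi_stats(np.cumsum([0.1] * 9 + [1.0]))
```

A strictly periodic train has essentially zero IOI spread, while a single long gap already produces positive skew and excess kurtosis. That is the kind of contrast a scalar F1 hides.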

ISSUE-022

L01-L02 / Lab II · Dimensionality reduction: linear & nonlinear

L01 — PCA biplot

Run PCA on the 302-neuron matrix; report cumvar at k=8.

Expected: cumvar @ k=8 ≈ 100% → the matrix has effective rank ≤ 8 (8 blocks plus noise), not 302.

This closes logic #1 visually.
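A minimal sketch of the cumulative-variance audit, assuming a synthetic stand-in for the 302-neuron matrix (8 shared latent signals plus small noise). The generator here is invented for illustration and is not `generate_neural_activity_302()`; the noise level and `T` are arbitrary.

```python
import numpy as np

def cumulative_variance(X):
    """Cumulative explained-variance ratio of PCA on X with shape (neurons, time)."""
    Xc = X - X.mean(axis=1, keepdims=True)        # centre each neuron over time
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    var = s ** 2
    return np.cumsum(var) / var.sum()

# Synthetic stand-in: every neuron copies one of 8 latent signals, plus noise.
rng = np.random.default_rng(0)
T, n_neurons, n_blocks = 500, 302, 8
latents = rng.standard_normal((n_blocks, T))
block_of = np.arange(n_neurons) % n_blocks        # block membership per neuron
X = latents[block_of] + 0.01 * rng.standard_normal((n_neurons, T))

cumvar = cumulative_variance(X)
print(f"cumvar @ k=7: {cumvar[6]:.3f}   cumvar @ k=8: {cumvar[7]:.3f}")
```

On a block-structured matrix like this one, the curve saturates at exactly the number of blocks, which is the signature the audit looks for.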

L02 — t-SNE / UMAP

2-D nonlinear projection coloured by KMeans label.

If the colours mix, the KMeans motor primitives do not correspond to separable structure in the data and are effectively arbitrary.

Manifold sanity check for Step 2.

ISSUE-023 — biplot · ISSUE-024 — t-SNE/UMAP · ISSUE-034 — Chopin k=8 audit

L03 / Lab III-IV · Clustering: four methods compared

Method	PyANNOW	What it adds
KMeans	✅	(in Step 2)
Ward + dendrogram + cophenet	🔴	Is the k=4 hierarchy natural?
DBSCAN	🔴	Outlier worm-behavior detection
GMM (soft, BIC)	🔴	Smooth biological transitions

clustering.compare_methods() returns silhouettes + pairwise ARI for all four. GMM is the most biologically defensible (motor primitives transition gradually, not abruptly).
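The pairwise agreement part can be illustrated with a small NumPy implementation of the adjusted Rand index. This is a sketch of the measure itself, not the body of `clustering.compare_methods()`; the helper names and the example labelings are hypothetical.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    if a.shape != b.shape or a.size < 2:
        raise ValueError("labelings must have the same length (at least 2)")
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    cont = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(cont, (ai, bi), 1)                  # contingency table

    def comb2(x):
        return x * (x - 1) / 2.0                  # "n choose 2", elementwise

    sum_ij = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()
    sum_b = comb2(cont.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(a.size)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                     # both labelings are trivial
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))

def pairwise_ari(labelings):
    """ARI for every unordered pair of methods in a {name: labels} dict."""
    names = list(labelings)
    return {(p, q): adjusted_rand_index(labelings[p], labelings[q])
            for i, p in enumerate(names) for q in names[i + 1:]}

# Hypothetical labelings; DBSCAN-style noise (-1) is treated as its own label.
pairs = pairwise_ari({
    "kmeans": [0, 0, 1, 1, 2, 2],
    "ward":   [1, 1, 0, 0, 2, 2],                 # same partition, renamed labels
    "dbscan": [0, 0, 1, 1, -1, 2],
})
```

ARI is invariant to label renaming, so two methods that find the same partition score 1.0 even when their cluster ids differ.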

ISSUE-025

L04 / Lab V · Linear models + diagnostics

PyANNOW Step 3

✅ Ridge + RidgeCV; R² vs α curve

🔴 No diagnostics — residuals, QQ, BP, DW, VIF, Cook

🔴 No Lasso — we never check whether k=4 PCs are needed

What we add

regression.diagnose_ridge() — Lab V table per Chopin feature dim

regression.lasso_path_selection() — true k_effective from LassoCV

Standardize before fitting (closes logic #4)
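Two of the diagnostics listed above (DW and VIF) are short enough to sketch directly in NumPy. This is an illustration of the formulas under the usual textbook definitions, not the contents of `regression.diagnose_ridge()`; the function names and the synthetic design matrix are assumptions.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 for uncorrelated residuals, range 0 to 4."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def vif(X):
    """Variance inflation factor per column of X, shape (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

# Synthetic design: three independent predictors, then a near-copy of the first.
rng = np.random.default_rng(0)
X_ok = rng.standard_normal((200, 3))
X_bad = np.column_stack([X_ok, X_ok[:, 0] + 0.01 * rng.standard_normal(200)])
vif_ok, vif_bad = vif(X_ok), vif(X_bad)
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
```

A VIF near 1 means a predictor is not explained by the others; the near-duplicate column pushes it into the thousands, which is the kind of redundancy a Lasso path would then prune.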

ISSUE-026 · ISSUE-032

L05 · Logistic onset detector (closes logic #5)

PyANNOW's onset detection at every step is one line:

    peaks, _ = find_peaks(activ, distance=int(0.28/0.5e-3), height=activ.mean())
  

That is a 1-feature classifier with a hardcoded threshold. Different step activations → different ideal thresholds, but the same `mean()` rule is used everywhere.

We swap it for a calibrated LogisticRegression(class_weight='balanced') whose decision threshold is tuned by Youden's J or by best F1. The gap between the calibrated F1 and PyANNOW's reported F1 measures how much of each step's poor score was caused by the hardcoded threshold rather than by the activation itself.
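The threshold-tuning half of that idea can be sketched without the classifier. The snippet below assumes the scores are probability-like outputs (for example from a logistic model's `predict_proba`) and compares Youden's J against the `mean()` rule; the function name and the toy scores are invented for illustration.

```python
import numpy as np

def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = TPR - FPR for the rule score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):                   # candidate thresholds, ascending
        pred = scores >= t
        tpr = (pred & labels).sum() / n_pos
        fpr = (pred & ~labels).sum() / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return float(best_t), float(best_j)

# Toy, imbalanced example: 10 non-onset bins, 2 onset bins.
scores = np.array([0.0] * 8 + [0.5, 0.6, 0.9, 1.0])
labels = np.array([0] * 10 + [1, 1])

t_youden, j_youden = youden_threshold(scores, labels)
t_mean = scores.mean()                            # the hardcoded rule
false_alarms_mean = int(((scores >= t_mean) & (labels == 0)).sum())
```

On this toy input the mean rule fires on two non-onset bins that the tuned threshold rejects, which is the effect the calibrated detector is meant to isolate.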

ISSUE-027 · ISSUE-033

L06 / Lab VI · Classification metrics: full Lab VI

Curves & CIs

F1 vs tolerance {10, 25, 50, 100, 200, 400} ms

PR overlay (preferred under heavy imbalance)

ROC overlay + AUC

Bootstrap 95% CI + paired-bootstrap H₀ test

StratifiedKFold over Chopin sub-windows
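A paired bootstrap can be sketched as follows. It assumes each model has one score per Chopin sub-window and that the windows are the resampling unit; both are assumptions made for the sketch, as are the function name and the example numbers.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap CI for mean(scores_a - scores_b), resampling windows in pairs."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("need two 1-D score arrays of equal length")
    rng = np.random.default_rng(seed)
    # The same resampled window indices are used for both models (pairing).
    idx = rng.integers(0, a.size, size=(n_boot, a.size))
    diffs = (a[idx] - b[idx]).mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)

# Hypothetical per-window scores: model A is uniformly 0.1 above model B.
base = np.linspace(0.1, 0.3, 20)
mean_diff, lo, hi = paired_bootstrap(base + 0.1, base)
same_mean, same_lo, same_hi = paired_bootstrap(base, base)
```

H₀ (no difference between the two steps) is rejected at the chosen level when the interval `[lo, hi]` excludes 0.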

Pitch-aware F1 — the wormuse metric

Onset matched in BOTH time AND pitch

Plain F1 ignores wrong notes; pitch-aware F1 doesn't.

Matching is one-to-one: greedy by time, and a matched pair must share a pitch class.

Closes logic #7
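The matching rule can be sketched in plain Python. This is one reading of the description above (each reference note is used at most once, nearest in time wins, pitch class is MIDI pitch modulo 12) and not necessarily the code behind `pitch_aware_f1`; the function signature and the example notes are assumptions.

```python
def pitch_aware_f1(pred, ref, tol_s=0.05, match_pitch=True):
    """F1 over notes given as (onset_seconds, midi_pitch) pairs.

    A predicted note counts as a hit if an unused reference note lies within
    tol_s seconds and, when match_pitch is True, has the same pitch class.
    """
    ref_sorted = sorted(ref)
    used = [False] * len(ref_sorted)
    tp = 0
    for t, p in sorted(pred):                     # greedy, in time order
        best, best_dt = None, None
        for i, (rt, rp) in enumerate(ref_sorted):
            if used[i]:
                continue
            if match_pitch and rp % 12 != p % 12:
                continue
            dt = abs(rt - t)
            if dt <= tol_s and (best is None or dt < best_dt):
                best, best_dt = i, dt
        if best is not None:
            used[best] = True
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical notes: C#4, E4, G#4 in the reference; the second prediction is a wrong note.
ref = [(0.00, 61), (0.50, 64), (1.00, 68)]
pred = [(0.01, 61), (0.52, 65), (1.03, 56)]

plain_f1 = pitch_aware_f1(pred, ref, match_pitch=False)
aware_f1 = pitch_aware_f1(pred, ref, match_pitch=True)
```

All three predictions are on time, so the time-only score is perfect; the pitch-aware score drops because one of them is the wrong pitch class.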

ISSUE-018, 019, 020, 021, 035

L07 · Random Forest: model-agnostic ceiling

PyANNOW jumps from Linear (Ridge) to MLP (Adam) to L-BFGS to PINN — a deep-model escalation. AppStat L07 inserts a basic question before any of that:

Does a stock Random Forest already match the MLP?

If RF F1 ≥ MLP F1

Steps 4-6 are buying nothing.

The deep model is over-engineering; ship the RF.

If RF F1 ≪ MLP F1

The MLP captures genuine nonlinear structure that RF can't.

Justifies the extra complexity.

ISSUE-028 · ISSUE-036 (Step 8 PINN: which physics is it?)

The composite score — the wormuse goal made measurable

A worm-Chopin model is "Chopin-like" iff every floor below is passed:

Component	What it measures	Floor
pitch_aware_f1 @ 50 ms	onsets matched in BOTH time AND pitch	≥ 0.20
AUC-PR (binned onset classification)	threshold-free quality of the activation	≥ 0.30
ioi_similarity (already in pyannow)	rhythmic distribution overlap	≥ 0.30
velocity_correlation	Pearson r over matched-note velocities (dynamics)	≥ 0.20
bootstrap 95 % CI	uncertainty around every claim	CI excludes Step 0

Reported by wormuse_analytics.pipeline.ImprovedPipeline.score_all(). None of these is currently in PyANNOW's `losses` dict.
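The floor check itself is simple enough to sketch. The dictionary keys, the function name, and the reading of "CI excludes Step 0" (Step 0's score lies outside the model's 95% interval for pitch-aware F1) are assumptions made for illustration, not the interface of `score_all()`.

```python
# Floors copied from the table above; key names are hypothetical.
FLOORS = {
    "pitch_aware_f1": 0.20,
    "auc_pr": 0.30,
    "ioi_similarity": 0.30,
    "velocity_correlation": 0.20,
}

def is_chopin_like(metrics, f1_ci, step0_f1):
    """Return (passed, failed_components) for one model's composite score."""
    failed = [name for name, floor in FLOORS.items() if metrics[name] < floor]
    ci_lo, ci_hi = f1_ci
    if ci_lo <= step0_f1 <= ci_hi:                # interval does not exclude Step 0
        failed.append("bootstrap_ci")
    return (not failed), failed

# F1 values are the v0.8.0 figures for Steps 4-6 and Step 0;
# every other number here is invented to exercise the check.
passed, failed = is_chopin_like(
    {"pitch_aware_f1": 0.193, "auc_pr": 0.35,
     "ioi_similarity": 0.40, "velocity_correlation": 0.25},
    f1_ci=(0.15, 0.24),
    step0_f1=0.186,
)
```

Because the criterion is a conjunction, a model that clears three floors but misses the fourth is still not "Chopin-like", and the returned list says which components to work on.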

21 PyANNOW issues — closing each gap

ID	Title	Lec	Status
✅ 018	Builder/notebook desync	L06 (infra)	v0.7.0
019	Cell 20 left panel still misleads	L06	open
✅ 020	Multi-tol F1 + PR + ROC sweep	L06	v0.8.0
✅ 021	Bootstrap CIs for per-step F1	L06	v0.8.0
022	Descriptive stats of outputs	L00	open
023	PCA biplot + standardization	L01	open
024	t-SNE / UMAP manifold view	L02	open
025	Compare four clustering methods	L03	open
026	Lab V diagnostics on Ridge	L04	open
027	Logistic onset detector	L05	open
028	RF baseline + permutation importance	L07	open
✅ 029	Synthetic 302-neuron matrix	L01 (logic #1)	v0.7.0
030	Step 0 mislabelled "random"	L00 (logic #2)	open
✅ 031	8-muscle pitch bottleneck	L06 (logic #3)	v0.7.0
✅ 032	Procrustes feature mismatch	L04 (logic #4)	v0.8.0
033	Magic-threshold peak detector	L05 (logic #5)	open — P1
✅ 034	Chopin lossy k=8 compression	L01 (logic #6)	v0.8.0
✅ 035	Pitch-aware F1 missing	L06 (logic #7)	v0.8.0
✅ 036	Step 8 PINN ≠ ion-channel PINN	L07 (logic #8)	v0.8.0
✅ 037	Biological ceiling overstated	L06 (logic #9)	v0.8.0
✅ 038	No cross-validation anywhere	L06 (logic #10)	v0.8.0

11 resolved ✅ (3 in v0.7.0, 8 in v0.8.0) · 10 open · one md per issue under docs/proposed_pyannow_issues/

v0.8.0 measured results — what the fixes achieved

Step	Method	F1 (pitch-aware)	vs baseline
0	Rule-based baseline	0.186	—
1	SVD + Procrustes (standardize=True)	0.000	residual 89.8→4.06 ✅; ISSUE-033 open
2	K-means	0.110	−0.076
3	Ridge	0.000	pending investigation
4-6	MLP + Adam + L-BFGS	0.193	+0.007 (first step to beat the baseline ↑)

Key fix: ISSUE-032 Procrustes

procrustes_align(standardize=True) — z-score W_k before SVD.

PC scale imbalance was 6.9×. Residual: 89.807 → 4.059.
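Why standardizing matters can be shown with a small orthogonal-Procrustes sketch in NumPy. It assumes the alignment is a pure rotation found by SVD and uses synthetic data with one column inflated by 6.9×; it is not PyANNOW's `procrustes_align`, and the function name and data are hypothetical.

```python
import numpy as np

def procrustes_residual(A, B, standardize=True):
    """Rotate A onto B (both shaped (n_samples, k)); return (residual, rotation)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    if standardize:
        A = A / A.std(axis=0)                     # z-score every column
        B = B / B.std(axis=0)
    # Orthogonal Procrustes: R = U V^T minimises ||A R - B||_F over rotations.
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    return float(np.linalg.norm(A @ R - B)), R

# Synthetic data: B is a rotated copy of Z; A is Z with one column scaled 6.9x.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix
A = Z * np.array([6.9, 1.0, 1.0, 1.0])
B = Z @ Q

res_raw, _ = procrustes_residual(A, B, standardize=False)
res_std, R = procrustes_residual(A, B, standardize=True)
print(f"residual without standardizing: {res_raw:.2f}, with: {res_std:.2f}")
```

A rotation cannot undo a per-column scale mismatch, so the unstandardized residual is dominated by the inflated column; z-scoring first removes that component before the rotation is fitted.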

Key fix: ISSUE-034 Auto-k

build_chopin_features(k_chopin=None) — 90% cumvar auto-selection.

For a 10 s clip: k=5 captures 97.7% of the variance (previously fixed at k=8).
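The selection rule amounts to taking the smallest k whose cumulative explained variance reaches the target. A minimal sketch, assuming the per-component variance ratios are already available; the function name is hypothetical and the spectrum below is invented, chosen only so the outcome resembles the k=5 case.

```python
import numpy as np

def auto_k(explained_variance_ratio, target=0.90):
    """Smallest k whose cumulative explained variance reaches target."""
    cum = np.cumsum(np.asarray(explained_variance_ratio, dtype=float))
    # First index with cum >= target, converted to a component count.
    k = int(np.searchsorted(cum, target, side="left")) + 1
    k = min(k, cum.size)                          # target never reached: keep all
    return k, float(cum[k - 1])

# Invented spectrum (not measured): four components stay below 90%, five exceed it.
ratios = [0.50, 0.20, 0.10, 0.08, 0.097, 0.023]
k, captured = auto_k(ratios)
```

The captured variance can overshoot the 90% target noticeably, because the component that crosses the threshold is added whole.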

Next priority: ISSUE-033 — logistic onset detector to fix Step 1 regression.

What this gives wormuse

A pipeline that measurably improves toward Chopin, with uncertainty quantified, logic problems explicit, and deep-model claims falsifiable.

The 21 [@appstat-audit] issues form a roadmap: close them in priority order (P1 first) and PyANNOW's notebook 03 ends up reporting a scalar pitch-aware F1 that actually correlates with how Chopin-like the audio sounds.

For full details: docs/STATISTICAL_DIAGNOSTICS.md.
For executable demonstration: notebooks/01_appstat_lecture_audit.ipynb.
For drop-in PyANNOW issues: docs/proposed_pyannow_issues/ISSUE-{018..038}.md.