wormuse-analytics

Improving the worm→Chopin pipeline through Applied Statistics

Goal: the best model we can build that teaches a C. elegans to play Chopin's Nocturne in C♯ minor.

Polimi · AppStat 2026 · Companion to PyANNOW (NAML) · v0.8.0
Engineering doc: docs/STATISTICAL_DIAGNOSTICS.md
Notebook: notebooks/01_appstat_lecture_audit.ipynb
21 PyANNOW issues tagged [@appstat-audit], 11 resolved as of v0.8.0

The goal — and why metrics aren't enough

Teach a worm to play Chopin. That means two things:

A pipeline

ion channels → spikes → muscles → MIDI → audio

PyANNOW already implements 8 NAML steps along this chain.

A scoring function

Quantify "Chopin-likeness" rigorously enough that optimising it produces music.

PyANNOW reports `onset_loss` and `musical_f1`. Both are too coarse.

PyANNOW ISSUE-016 and ISSUE-017 patched the metric — but the underlying pipeline has 10 structural logic problems that no metric correction can fix. Each AppStat lecture maps to a tool that addresses one of them.

The current pipeline

ion-channel HH params (DEFAULT_PARAMS)
    ↓ run_forward_fast
V_muscles (T, 96) ──► generate_neural_activity_302() → X_neural (302, T)   k≥4 ✅ v0.7.0
    ↓                     ↓
onsets_base           Step 1a    RSVD (k=4)
(Step 0 baseline)         ↓
                      Step 1b    Procrustes(standardize=True) → Z_aligned   ✅ v0.8.0
                          ↓
                      Step 2     PCA + KMeans
                          ↓
                      Step 3     RidgeCV → C_pred
                          ↓
                      Steps 4-6  MLP+Adam+L-BFGS → C_pred   ← beats baseline ✅
                          ↓
                      Step 8a/b  ODE/PDE PINN
                          ↓
                      find_peaks(distance=280ms, height=mean)   ← OPEN ISSUE-033
                          ↓
                      pitch ← MUSCLE_PITCHES_96[k] (all 12 classes)   ✅ v0.7.0
                          ↓
                      audio · score (pitch_aware_f1)   ✅ v0.8.0

Fixed as of v0.8.0 ✅: (a) 96-cell Boyle model, k≥4 PCs; (b) standardized Procrustes; (c) 100% pitch-class coverage. Still open: the magic threshold (ISSUE-033).

10 pipeline logic problems

| # | Problem | Fix by AppStat | Issue |
| --- | --- | --- | --- |
| 1 | 302-neuron matrix is synthetic (8 blocks + noise) | L01 PCA biplot reveals rank ≈ 8 | 029 |
| 2 | Step 0 mislabelled "random" | L00 IOI distribution shows it's deterministic body-wave | 030 |
| 3 | 8-muscle pitch bottleneck | L06 pitch-aware F1 ceiling | 031 |
| 4 | Procrustes between unstandardized incommensurate spaces | L04 standardize-first; L01 biplot | 032 |
| 5 | Shared `find_peaks(height=mean)` across all steps | L05 calibrated logistic + Youden | 033 |
| 6 | Chopin features lossily compressed to k=8 | L01 cumvar audit | 034 |
| 7 | Onset-only metric ignores pitch & velocity | L06 pitch-aware F1 + velocity Pearson r | 035 |
| 8 | Step 8 PINN is oscillator-PINN, not ion-channel PINN | L07 RF baseline reveals what physics buys | 036 |
| 9 | Biological ceiling assumes any muscle = any pitch | L06 reachable-pitch confusion | 037 |
| 10 | No cross-validation anywhere | L06 StratifiedKFold + bootstrap CIs | 038 |

L00 / Lab I · Descriptive statistics

The metric paradox is invisible from scalar numbers but obvious from distributions.

Closes logic #2

Step 0's inter-onset interval (IOI) distribution has a single sharp peak at one body-wave period (~220 ms).

Chopin's IOI distribution is broad, with high kurtosis.

Step 0 is not "random" — it's deterministic and structurally incompatible with Chopin.

What we add

descriptive.collect_step_stats() → DataFrame of `{n, ioi_mean, ioi_std, ioi_skew, ioi_kurtosis, ...}`

plot_ioi_distributions() — KDE overlay
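The kind of summary that collect_step_stats() is described as returning can be sketched with NumPy alone. This is a minimal illustration, not the package's implementation: the helper name `ioi_stats` and both onset sequences are hypothetical stand-ins (a near-periodic train at ~220 ms and a heavy-tailed train), and kurtosis is reported as excess kurtosis.

```python
import numpy as np

def ioi_stats(onsets_s):
    """Summary statistics of inter-onset intervals (IOIs), in milliseconds."""
    onsets = np.sort(np.asarray(onsets_s, dtype=float))
    ioi = np.diff(onsets) * 1e3
    mean, std = ioi.mean(), ioi.std()
    # Standardized IOIs; all zeros when the train is perfectly periodic.
    z = (ioi - mean) / std if std > 0 else np.zeros_like(ioi)
    return {
        "n": int(ioi.size),
        "ioi_mean": float(mean),
        "ioi_std": float(std),
        "ioi_skew": float(np.mean(z**3)),
        "ioi_kurtosis": float(np.mean(z**4) - 3.0),  # excess kurtosis
    }

rng = np.random.default_rng(0)
# Hypothetical data: near-periodic "body-wave" onsets vs. heavy-tailed IOIs.
periodic = np.cumsum(0.220 + rng.normal(0, 0.002, 200))
heavy = np.cumsum(rng.lognormal(mean=np.log(0.2), sigma=0.8, size=200))
stats_periodic = ioi_stats(periodic)
stats_heavy = ioi_stats(heavy)
print(stats_periodic)
print(stats_heavy)
```

Two trains with similar mean IOI can differ sharply in spread and tail weight, which is the difference a scalar onset score hides.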

ISSUE-022

L01-L02 / Lab II · Dimensionality reduction: linear & nonlinear

L01 — PCA biplot

Run PCA on the 302-neuron matrix; report cumvar at k=8.

Expected: cumvar @ k=8 ≈ 100% → the matrix has effective rank ≤ 8, not 302.

This closes logic #1 visually.
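A minimal sketch of the cumulative-variance audit, assuming a synthetic matrix built the way logic #1 describes it (8 block signals plus small noise). The matrix here is generated on the spot and is not PyANNOW's X_neural; it is also stored as (T, 302), the transpose of the (302, T) layout in the pipeline diagram, so that rows are time samples.

```python
import numpy as np

def cumulative_variance(X):
    """Cumulative explained-variance ratio; rows are samples, columns features."""
    Xc = X - X.mean(axis=0)                      # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s**2
    return np.cumsum(var) / var.sum()

rng = np.random.default_rng(0)
T, n_neurons, n_blocks = 1000, 302, 8
latent = rng.normal(size=(T, n_blocks))                  # one signal per block
block_of = np.arange(n_neurons) * n_blocks // n_neurons  # neuron -> block index
# Hypothetical stand-in for the synthetic neural matrix: block signal + noise.
X = latent[:, block_of] + 0.01 * rng.normal(size=(T, n_neurons))
cumvar = cumulative_variance(X)
print(f"cumvar @ k=4: {cumvar[3]:.4f}")
print(f"cumvar @ k=8: {cumvar[7]:.4f}")
```

If the real 302-neuron matrix behaves like this one, nearly all variance sits in the first 8 components and the remaining 294 carry only noise.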

L02 — t-SNE / UMAP

2-D nonlinear projection coloured by KMeans label.

If colours mix, the motor primitives are arbitrary.

Manifold sanity check for Step 2.

ISSUE-023 — biplot · ISSUE-024 — t-SNE/UMAP · ISSUE-034 — Chopin k=8 audit

L03 / Lab III-IV · Clustering: four methods compared

| Method | PyANNOW | What it adds |
| --- | --- | --- |
| KMeans | (in Step 2) | |
| Ward + dendrogram + cophenet | 🔴 | Is the k=4 hierarchy natural? |
| DBSCAN | 🔴 | Outlier worm-behavior detection |
| GMM (soft, BIC) | 🔴 | Smooth biological transitions |

clustering.compare_methods() returns silhouettes + pairwise ARI for all four. GMM is the most biologically defensible (motor primitives transition gradually, not abruptly).
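A comparison of this shape could be sketched with scikit-learn as follows. The function name `compare_methods_sketch`, the DBSCAN parameters, and the four-blob data are assumptions made for illustration; the actual clustering.compare_methods() may differ in signature and defaults.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture

def compare_methods_sketch(X, k=4, seed=0):
    """Fit four clusterers; return per-method silhouette and pairwise ARI."""
    labels = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X),
        "ward": AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X),
        "dbscan": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),  # -1 marks outliers
        "gmm": GaussianMixture(n_components=k, random_state=seed).fit(X).predict(X),
    }
    sil = {}
    for name, lab in labels.items():
        # Silhouette is undefined unless at least two distinct labels exist.
        sil[name] = silhouette_score(X, lab) if len(set(lab)) > 1 else float("nan")
    names = list(labels)
    ari = {(a, b): adjusted_rand_score(labels[a], labels[b])
           for i, a in enumerate(names) for b in names[i + 1:]}
    return sil, ari

# Hypothetical stand-in for the Step 2 feature matrix: four separated blobs.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)
sil, ari = compare_methods_sketch(X)
print(sil)
print(ari)
```

High pairwise ARI means the methods agree on the partition; low ARI between KMeans and the others would suggest the k=4 primitives are an artifact of the method.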

ISSUE-025

L04 / Lab V · Linear models + diagnostics

PyANNOW Step 3

✅ Ridge + RidgeCV; R² vs α curve

🔴 No diagnostics: residual plots, QQ plot, Breusch-Pagan (BP), Durbin-Watson (DW), VIF, Cook's distance

🔴 No Lasso — we never check whether k=4 PCs are needed

What we add

regression.diagnose_ridge() — Lab V table per Chopin feature dim

regression.lasso_path_selection() — true k_effective from LassoCV

Standardize before fitting (closes logic #4)
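One way to read "k_effective from LassoCV" is to count the predictors that keep a non-zero coefficient at the cross-validated alpha. The sketch below assumes that reading; `lasso_k_effective` is a hypothetical helper, not regression.lasso_path_selection(), and the PC scores and target are simulated so that only two of four PCs carry signal.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_k_effective(Z, y, seed=0):
    """Count predictors that keep a non-zero LassoCV coefficient."""
    Zs = StandardScaler().fit_transform(Z)   # standardize before fitting
    model = LassoCV(cv=5, random_state=seed).fit(Zs, y)
    k_eff = int(np.sum(np.abs(model.coef_) > 1e-8))
    return k_eff, float(model.alpha_)

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 4))   # hypothetical PC scores, k=4
# Simulated target: only the first two PCs matter.
y = 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(scale=0.1, size=300)
k_eff, alpha = lasso_k_effective(Z, y)
print(f"k_effective={k_eff} at alpha={alpha:.4g}")
```

The count depends on the selected alpha: the CV-minimising alpha can leave small non-zero weights on noise predictors, so a stricter rule (for example the one-standard-error alpha) may be preferable when the goal is selection rather than prediction.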

ISSUE-026 · ISSUE-032

L05 · Logistic onset detector (closes logic #5)

PyANNOW's onset detection at every step is one line:

peaks, _ = find_peaks(activ, distance=int(0.28/0.5e-3), height=activ.mean())

That is a 1-feature classifier with a hardcoded threshold. Different step activations → different ideal thresholds, but the same `mean()` rule is used everywhere.

We replace it with a calibrated LogisticRegression(class_weight='balanced') whose decision threshold is tuned by Youden's J or best F1. The gap between the calibrated F1 and PyANNOW's reported F1 measures how much of each step's poor score was caused by the hardcoded threshold rather than by the activation itself.
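A minimal sketch of threshold tuning by Youden's J on a one-feature logistic detector. The activations and onset labels are simulated, `fit_onset_detector` is a hypothetical name, and no separate calibration step (such as scikit-learn's CalibratedClassifierCV) is applied here; the threshold is also tuned in-sample, which a real audit would replace with held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def fit_onset_detector(activ, is_onset):
    """Fit a 1-feature logistic detector and pick the Youden's-J threshold."""
    X = np.asarray(activ, dtype=float).reshape(-1, 1)
    y = np.asarray(is_onset, dtype=int)
    clf = LogisticRegression(class_weight="balanced").fit(X, y)
    prob = clf.predict_proba(X)[:, 1]
    fpr, tpr, thr = roc_curve(y, prob)
    best = np.argmax(tpr - fpr)          # Youden's J = TPR - FPR
    return clf, float(thr[best])

rng = np.random.default_rng(0)
# Hypothetical data: ~5% of bins are onsets, with higher activation at onsets.
y = (rng.random(2000) < 0.05).astype(int)
activ = rng.normal(loc=2.0 * y, scale=1.0)
clf, threshold = fit_onset_detector(activ, y)
pred = clf.predict_proba(activ.reshape(-1, 1))[:, 1] >= threshold
print(f"threshold={threshold:.3f}, "
      f"TPR={pred[y == 1].mean():.3f}, FPR={pred[y == 0].mean():.3f}")
```

Fitting one such detector per step gives each activation its own threshold, so differences in F1 across steps reflect the activation rather than the shared `mean()` rule.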

ISSUE-027 · ISSUE-033

L06 / Lab VI · Classification metrics: full Lab VI

Curves & CIs

F1 vs tolerance {10, 25, 50, 100, 200, 400} ms

PR overlay (preferred under heavy imbalance)

ROC overlay + AUC

Bootstrap 95% CI + paired-bootstrap H₀ test

StratifiedKFold over Chopin sub-windows
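The bootstrap CI and the paired-bootstrap test can be sketched over per-window scores. The assumption here is that each Chopin sub-window yields one F1 per step, so windows are the resampling unit; the scores are simulated and both function names are hypothetical.

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-window scores."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def paired_bootstrap_p(a, b, n_boot=2000, seed=0):
    """One-sided paired bootstrap: share of resamples with mean(a - b) <= 0."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    return float(np.mean(diff[idx].mean(axis=1) <= 0.0))

# Hypothetical per-window F1 for a candidate step and the Step 0 baseline.
rng = np.random.default_rng(1)
f1_base = rng.uniform(0.10, 0.25, size=40)
f1_step = f1_base + rng.normal(0.03, 0.01, size=40)
ci = bootstrap_ci(f1_step)
p = paired_bootstrap_p(f1_step, f1_base)
print(f"95% CI for mean F1: [{ci[0]:.3f}, {ci[1]:.3f}], paired p={p:.4f}")
```

Pairing by window matters: both steps are scored on the same excerpts, so resampling the differences removes window-to-window difficulty from the comparison.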

Pitch-aware F1 — the wormuse metric

Onset matched in BOTH time AND pitch

Plain F1 ignores wrong notes; pitch-aware F1 doesn't.

Bipartite matching: greedy by time, must share pitch-class.

Closes logic #7
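The matching rule above (greedy by time, shared pitch class) can be written in a few lines of plain Python. This is a sketch under stated assumptions, not the wormuse implementation: notes are (onset seconds, MIDI pitch) pairs, pitch class is taken as pitch mod 12, the tolerance is 50 ms, and the example notes are invented.

```python
def pitch_aware_f1(pred, ref, tol_s=0.050):
    """F1 where a predicted note matches a reference note only if the onsets
    are within tol_s seconds AND the MIDI pitches share a pitch class."""
    pred = sorted(pred)            # lists of (onset_seconds, midi_pitch)
    ref = sorted(ref)
    used = [False] * len(ref)      # each reference note is matched at most once
    tp = 0
    for t_p, p_p in pred:
        best, best_dt = None, tol_s
        for j, (t_r, p_r) in enumerate(ref):
            dt = abs(t_p - t_r)
            if not used[j] and dt <= best_dt and p_p % 12 == p_r % 12:
                best, best_dt = j, dt      # closest unused note so far
        if best is not None:
            used[best] = True
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = [(0.00, 61), (0.50, 64), (1.00, 68)]    # C#4, E4, G#4
pred = [(0.02, 73), (0.51, 65), (1.20, 68)]   # C#5 on time, F4 (wrong), late G#4
score = pitch_aware_f1(pred, ref)
print(f"pitch-aware F1 = {score:.3f}")        # one of three notes matches: 0.333
```

An onset-only F1 at the same tolerance would count the second note as a hit; requiring a shared pitch class is what makes wrong notes cost something.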

ISSUE-018, 019, 020, 021, 035

L07 · Random Forest: model-agnostic ceiling

PyANNOW jumps from Linear (Ridge) to MLP (Adam) to L-BFGS to PINN — a deep-model escalation. AppStat L07 inserts a basic question before any of that:

Does a stock Random Forest already match the MLP?

If RF F1 ≥ MLP F1

Steps 4-6 are buying nothing.

The deep model is over-engineering; ship the RF.

If RF F1 ≪ MLP F1

The MLP captures genuine nonlinear structure that RF can't.

Justifies the extra complexity.
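The comparison can be set up so that both models see identical cross-validation folds. The sketch below uses scikit-learn with simulated per-bin features and a simulated nonlinear onset rule; the hyperparameters and the `cv_f1` helper are illustrative assumptions, not PyANNOW's Steps 4-6 configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neural_network import MLPClassifier

def cv_f1(model, X, y, seed=0):
    """F1 of out-of-fold predictions from 5-fold stratified CV."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return f1_score(y, cross_val_predict(model, X, y, cv=cv))

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))                  # hypothetical per-bin features
y = (X[:, 0] * X[:, 1] > 0.5).astype(int)      # simulated nonlinear onset rule

f1_rf = cv_f1(RandomForestClassifier(n_estimators=200, random_state=0), X, y)
f1_mlp = cv_f1(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0), X, y
)
print(f"RF F1={f1_rf:.3f}  MLP F1={f1_mlp:.3f}")
```

Using the same seed for the splitter gives both models the same folds, so the two F1 values are directly comparable and can feed a paired bootstrap.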

ISSUE-028 · ISSUE-036 (Step 8 PINN: which physics is it?)

The composite score — the wormuse goal made measurable

A worm-Chopin model is "Chopin-like" iff every floor below is passed:

| Component | What it measures | Floor |
| --- | --- | --- |
| pitch_aware_f1 @ 50 ms | onsets matched in BOTH time AND pitch | ≥ 0.20 |
| AUC-PR (binned onset classification) | threshold-free quality of the activation | ≥ 0.30 |
| ioi_similarity (already in pyannow) | rhythmic distribution overlap | ≥ 0.30 |
| velocity_correlation | Pearson r over matched-note velocities (dynamics) | ≥ 0.20 |
| bootstrap 95 % CI | uncertainty around every claim | CI excludes Step 0 |

Reported by wormuse_analytics.pipeline.ImprovedPipeline.score_all(). None of these is currently in PyANNOW's `losses` dict.
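The "every floor passed" rule can be expressed as a small check. This is a sketch, not score_all(): the names `FLOORS` and `is_chopin_like` are hypothetical, "CI excludes Step 0" is interpreted as the lower CI bound lying above the Step 0 F1, and only the two F1 values (0.193 and 0.186) come from the measured results; the other inputs are made up.

```python
# Floors copied from the composite-score table.
FLOORS = {
    "pitch_aware_f1": 0.20,
    "auc_pr": 0.30,
    "ioi_similarity": 0.30,
    "velocity_correlation": 0.20,
}

def is_chopin_like(scores, f1_ci_low, step0_f1):
    """Return (passed, failed_components) for the composite score."""
    failed = [name for name, floor in FLOORS.items()
              if scores.get(name, float("-inf")) < floor]
    # Assumed reading of "CI excludes Step 0": lower bound above the baseline.
    if f1_ci_low <= step0_f1:
        failed.append("bootstrap_ci")
    return not failed, failed

scores = {
    "pitch_aware_f1": 0.193,        # Steps 4-6, measured in v0.8.0
    "auc_pr": 0.35,                 # hypothetical
    "ioi_similarity": 0.40,         # hypothetical
    "velocity_correlation": 0.25,   # hypothetical
}
ok, failed = is_chopin_like(scores, f1_ci_low=0.17, step0_f1=0.186)
print(ok, failed)   # False ['pitch_aware_f1', 'bootstrap_ci']
```

Returning the list of failed components, rather than a single boolean, keeps the report actionable: it names which floor a given step still has to clear.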

21 PyANNOW issues — closing each gap

| ID | Title | Lec | Status |
| --- | --- | --- | --- |
| ✅ 018 | Builder/notebook desync | L06 (infra) | v0.7.0 |
| 019 | Cell 20 left panel still misleads | L06 | open |
| ✅ 020 | Multi-tol F1 + PR + ROC sweep | L06 | v0.8.0 |
| ✅ 021 | Bootstrap CIs for per-step F1 | L06 | v0.8.0 |
| 022 | Descriptive stats of outputs | L00 | open |
| 023 | PCA biplot + standardization | L01 | open |
| 024 | t-SNE / UMAP manifold view | L02 | open |
| 025 | Compare four clustering methods | L03 | open |
| 026 | Lab V diagnostics on Ridge | L04 | open |
| 027 | Logistic onset detector | L05 | open |
| 028 | RF baseline + permutation importance | L07 | open |
| ✅ 029 | Synthetic 302-neuron matrix | L01 (logic #1) | v0.7.0 |
| 030 | Step 0 mislabelled "random" | L00 (logic #2) | open |
| ✅ 031 | 8-muscle pitch bottleneck | L06 (logic #3) | v0.7.0 |
| ✅ 032 | Procrustes feature mismatch | L04 (logic #4) | v0.8.0 |
| 033 | Magic-threshold peak detector | L05 (logic #5) | open (P1) |
| ✅ 034 | Chopin lossy k=8 compression | L01 (logic #6) | v0.8.0 |
| ✅ 035 | Pitch-aware F1 missing | L06 (logic #7) | v0.8.0 |
| ✅ 036 | Step 8 PINN ≠ ion-channel PINN | L07 (logic #8) | v0.8.0 |
| ✅ 037 | Biological ceiling overstated | L06 (logic #9) | v0.8.0 |
| ✅ 038 | No cross-validation anywhere | L06 (logic #10) | v0.8.0 |

11 resolved ✅ (3 in v0.7.0, 8 in v0.8.0) · 10 open · one md per issue under docs/proposed_pyannow_issues/

v0.8.0 measured results — what the fixes achieved

| Step | Method | F1 (pitch-aware) | vs baseline |
| --- | --- | --- | --- |
| 0 | Rule-based baseline | 0.186 | |
| 1 | SVD + Procrustes (standardize=True) | 0.000 | residual 89.8→4.06 ✅; ISSUE-033 open |
| 2 | K-means | 0.110 | −0.076 |
| 3 | Ridge | 0.000 | pending investigation |
| 4-6 | MLP + Adam + L-BFGS | 0.193 | +0.007 first beat ↑ |

Key fix: ISSUE-032 Procrustes

procrustes_align(standardize=True) — z-score W_k before SVD.

PC scale imbalance was 6.9×. Residual: 89.807 → 4.059.
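Why standardizing first lowers the residual can be seen on a toy case. The sketch below is an orthogonal Procrustes alignment in NumPy, not PyANNOW's procrustes_align(): the function name, the data, and the choice to inflate one column by 6.9× are illustrative assumptions, and the residuals it prints are not the measured 89.807 and 4.059.

```python
import numpy as np

def procrustes_align_sketch(A, B, standardize=True):
    """Rotate A onto B with an orthogonal matrix; return aligned A and residual."""
    if standardize:
        A = (A - A.mean(axis=0)) / A.std(axis=0)   # z-score each column
        B = (B - B.mean(axis=0)) / B.std(axis=0)
    U, _, Vt = np.linalg.svd(A.T @ B)              # solves min ||A R - B||_F
    R = U @ Vt                                     # optimal orthogonal map
    Z = A @ R
    return Z, float(np.linalg.norm(Z - B))

rng = np.random.default_rng(0)
B = rng.normal(size=(500, 4))                      # hypothetical target scores
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))       # random orthogonal matrix
# Rotated copy of B with one column inflated, mimicking a PC scale imbalance.
A = (B @ Q.T) * np.array([6.9, 1.0, 1.0, 1.0])

Z, res_std = procrustes_align_sketch(A, B, standardize=True)
_, res_raw = procrustes_align_sketch(A, B, standardize=False)
print(f"residual raw={res_raw:.2f}  standardized={res_std:.2f}")
```

An orthogonal map can rotate but not rescale, so any per-column scale mismatch is left in the residual unless it is removed before the SVD.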

Key fix: ISSUE-034 Auto-k

build_chopin_features(k_chopin=None) — 90% cumvar auto-selection.

For a 10 s clip: k=5 captures 97.7% of the variance (previously fixed at k=8).
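The 90% rule amounts to picking the smallest k whose cumulative explained variance reaches the target. A minimal sketch, assuming the explained-variance ratios are already available; `auto_k` is a hypothetical helper, not the code inside build_chopin_features(), and the ratios below are invented.

```python
import numpy as np

def auto_k(explained_variance_ratio, target=0.90):
    """Smallest k whose cumulative explained variance reaches `target`."""
    cumvar = np.cumsum(explained_variance_ratio)
    k = int(np.searchsorted(cumvar, target)) + 1   # first index with cumvar >= target
    return min(k, len(cumvar))                     # guard against float round-off

k = auto_k([0.5, 0.3, 0.15, 0.05])   # cumulative: 0.50, 0.80, 0.95, 1.00
print(k)                             # 3
```

Because k is the first component count to cross the target, the variance actually captured usually overshoots it, which is consistent with a 90% rule reporting a figure above 90%.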

Next priority: ISSUE-033 — logistic onset detector to fix Step 1 regression.

What this gives wormuse

A pipeline that measurably improves toward Chopin, with uncertainty quantified, logic problems explicit, and deep-model claims falsifiable.

The 21 [@appstat-audit] issues form a roadmap: close them in priority order (P1 first) and PyANNOW's notebook 03 ends up reporting a scalar pitch-aware F1 that actually correlates with how Chopin-like the audio sounds.

For full details: docs/STATISTICAL_DIAGNOSTICS.md.
For executable demonstration: notebooks/01_appstat_lecture_audit.ipynb.
For drop-in PyANNOW issues: docs/proposed_pyannow_issues/ISSUE-{018..038}.md.