# MoverNet / `ad` learned-detector — decision & exploration log

The narrative companion to the **[decision tree](decision_tree.svg)** ([PNG](decision_tree.png) ·
[source](decision_tree.dot)). The tree is the map; this is the legend in prose. One section per node,
grouped by axis, chronological within each axis. Each entry states **(a)** the decision/avenue,
**(b)** why we tried it, **(c)** how deep we went, **(d)** the outcome + the evidence/number, **(e)**
status. Honest about depth: shallow tries and retracted over-claims are marked as such.

Sources mined: `learned_detector_roadmap.md` (the spine — D1–D10 / O1–O6 + dated session blocks),
`learned_detector.md`, `learned_detector_rationale.md`, `methodology_learned_detector.md`,
`movernet_writeup.md`, `research_multiblob_tracking.md`, `datasets.md`, the git log, and the project
auto-memory.

**Outcome legend:** WON (live path) · PLATEAUED/neutral (kept but not a win) · KILLED (abandoned) ·
OPEN (frontier) · CORRECTED (an over-claim later found to be a metric artifact and retracted).

---

## The root goal

**Decision.** Build a **class-agnostic, motion-only** detector (D10) that fires on *any* coherent
moving aerial object — quadcopter, fixed-wing Shahed, crewed aircraft, airport bird — **not** a trained
"drone" class, and that generalizes to unseen outdoor footage. It feeds per-camera 2-D landmarks into
the existing 3-D multi-view triangulation + tracking pipeline.

**Why.** A class detector (YOLO/DETR) only fires on classes it was taught, so it misses the novel/odd
mover and (per TrackNet) collapses on tiny-fast targets. The design thesis is an **appearance/identity
bottleneck**: the net sees motion only, has no class head (~435k params, ~2800 fps), and "is it the
target" is deferred to 3-D consensus downstream. Generalizability is a hard requirement and is the
whole reason to take the motion-only route.

**Status.** The organizing goal; everything below is an axis under it.

---

## Axis 1 — Input representation

### 1.1 Optical flow (Farnebäck) — the original input (D1)
- **Avenue.** Dense optical flow `(u, v, |flow|)`, 3-channel, as the motion field; frame-diff was only
  "the cheap floor."
- **Why.** A coherent mover is a patch of consistent flow vectors; that consistency *is* the grouping
  signal. Flow carries direction where frame-diff (a scalar magnitude) loses it.
- **Depth.** Full — it was the v0 MoverNet input, trained on UA-DETRAC, plus a model-free premise check.
- **Outcome / evidence.** **KILLED.** Flow-grouping on real DPJAIT drone footage gave **~3.5% recall**
  vs 72–92% on surveillance: Farnebäck's displacement window blurs tiny fast targets, and a low-texture
  sky gives the flow little to lock onto. The motion *input itself* collapsed on sub-10px aerial
  targets, pre-empting any "does
  pretraining transfer" question (O2).
- **Status.** Killed (overturns D1).

### 1.2 Change map — MOG2 foreground + temporal-diff (the pivot)
- **Avenue.** A 2-channel **change** stack: MOG2 background-subtraction foreground + `|gray_t −
  gray_{t−1}|`, resized to 256×256.
- **Why.** Recover the tiny aerial targets dense flow lost, at a fraction of RAFT's cost; and it is
  **modality-agnostic** (never sees colour → works on thermal).
- **Depth.** Full — this is the deployed input on the live path.
- **Outcome / evidence.** **WON.** DPJAIT recall **0.04 → 0.78** (with GroupNorm + in-domain drone
  data). Resolves O2, overturns D1. Winning ckpt `movernet_variety_drone_gn.pt`.
- **Status.** Kept (the live input).

### 1.3 `with_flow` 4-channel — change(2) ⊕ flow(2)
- **Avenue.** Re-introduce flow as an *additive* continuity channel alongside change (multi-blob lever E):
  flow encodes velocity, so a blob should persist through a faint foreground frame.
- **Why.** A per-pixel "it should still be here, moving this way" prior to stop faint-frame drops.
- **Depth.** Trained A/B against change-only on the same sim+real mix.
- **Outcome / evidence.** **KILLED.** OOD recall **0.344 → 0.300** (worse), precision 0.413 → 0.337,
  R12 neutral, and ~7× dataset-materialization cost (Farnebäck CPU bottleneck). Flow is redundant with
  the tempdiff channel and adds noise. Lesson: continuity belongs at the *track* level, not per-pixel.
  Code kept (`with_flow=`) but off by default.
- **Status.** Killed (off the live path).

### 1.4 Thermal as in-domain motion variety
- **Avenue.** Use thermal-IR drone corpora (Anti-UAV410, RGBT infrared) as valid in-domain training data.
- **Why.** The change input is colour-blind, so a hot drone on cold sky yields an *equally valid* — often
  cleaner — foreground, despite deployment cameras being visible-light.
- **Depth.** Adapters built + verified through `change_samples`.
- **Outcome.** **WON** (a corollary of 1.2). Keep thermal's mix share modest; guard against
  automatic-gain-control (AGC) steps, which can show up as frame-wide change.
- **Status.** Kept.

### 1.5 Color temporal-diff (motion-in-RGB) — proposed
- **Avenue.** Compute the temporal difference *per colour channel* — `|frame_t − frame_{t−1}|` on R, G, B
  separately — instead of the current grayscale temporal-diff, as an additive channel alongside the
  existing change stack.
- **Why.** It is still **MOTION** (a difference), **not appearance**, so it preserves the
  spoof-resistant / class-agnostic bottleneck (unlike raw-RGB-crop in 1.6, which is killed by design). It
  catches **isoluminant movers** the grayscale diff misses — e.g. a red drone over green foliage, where
  the two map to nearly the same luma and the grayscale difference is near-zero but the per-channel
  difference is large.
- **Depth.** Proposed (not yet trained).
- **Outcome / evidence.** No number yet — **OPEN**.
- **Caveats.** (a) It is a **RECALL** lever, **not** a fix for the current false-**POSITIVE** bottleneck
  (the B gate verdict, 5.6); (b) it **costs the thermal modality-agnostic property** — thermal is
  single-channel — unless the input is made modality-adaptive (grayscale-diff for IR, colour-diff for
  RGB).
- **Status.** **OPEN, queued AFTER the precision ladder** — recall is not the current binding
  constraint, precision (ghosts) is.
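
The isoluminant case can be checked with plain arithmetic. The sketch below assumes the Rec. 601 luma weights (the ones OpenCV's RGB-to-gray conversion uses) and two illustrative pixel colours picked for this example, not taken from any dataset. The function names are hypothetical.

```python
# Rec. 601 luma weights (assumed; they match OpenCV's RGB -> gray conversion).
LUMA = (0.299, 0.587, 0.114)


def luma(rgb):
    """Grayscale value of one RGB pixel."""
    return sum(w * c for w, c in zip(LUMA, rgb))


def gray_diff(prev_rgb, cur_rgb):
    """Grayscale temporal-diff for one pixel (the current channel)."""
    return abs(luma(cur_rgb) - luma(prev_rgb))


def colour_diff(prev_rgb, cur_rgb):
    """Per-channel temporal-diff for one pixel: (|dR|, |dG|, |dB|)."""
    return tuple(abs(c - p) for p, c in zip(prev_rgb, cur_rgb))


# Illustrative colours (hypothetical): foliage at t-1, a red drone at t.
foliage = (0, 130, 0)
red_drone = (255, 0, 0)

print(gray_diff(foliage, red_drone))    # well under one grey level
print(colour_diff(foliage, red_drone))  # (255, 130, 0)
```

The two colours share almost the same luma, so the grayscale channel sees next to nothing while the R and G channels see a large change. The result is still a difference image, so it carries no static appearance.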

### 1.6 Raw RGB cropped to motion regions — KILLED by design
- **Avenue.** Feed the raw RGB *appearance* inside each motion box (crop the colour image to where motion
  fired) as an extra input.
- **Why considered.** A natural "use the pixels you already have" extension once motion has localized a box.
- **Depth.** Rejected at design time — never built.
- **Outcome / evidence.** **KILLED-BY-DESIGN.** It reintroduces **appearance** and breaks **D2/D10**
  (spoof-resistance / generalize-to-unseen-mover): a colour/texture prior is exactly the class bias the
  motion-only thesis exists to avoid, and it would let an adversary spoof or an unseen mover slip the net.
  The bottleneck-safe way to use appearance is a **LATE, NON-learned structural verifier** downstream of
  the motion detector (e.g. a cheap geometric/shape sanity check), never an input channel feeding the
  learned net.
- **Status.** Killed by design (kept here as the explicit boundary of the appearance bottleneck).

---

## Axis 2 — Normalization & training stability

### 2.1 BatchNorm (default)
- **Avenue.** Standard BatchNorm in the encoder.
- **Why.** Default; not initially questioned.
- **Depth.** Full — surfaced the bug.
- **Outcome / evidence.** **UNSTABLE / KILLED.** The net trains one frame at a time (effective batch
  size 1) over a *blocked* multi-source order; BN's running stats and the last grad steps drift toward
  whichever source trained last. Held-out collapses epoch-to-epoch (0.029 → 0.0002 → 0.022) and the
  in-training eval disagrees with the reloaded checkpoint. An earlier "good" 0.811 run worked only by
  luck (crowd-dominated → BN calibrated to drone-ish backgrounds).
- **Status.** Abandoned for the multi-source trainer.

### 2.2 Sequence-block shuffle
- **Avenue.** Keep frames *within* a sequence in order (the GRU needs it), but shuffle the *order of
  sequence blocks* each epoch so sources interleave. Now the default.
- **Why.** Blocked streaming biased the last gradient steps and BN stats toward the last source.
- **Depth.** Full — ablation runner (`order_ablation.py`).
- **Outcome / evidence.** **WON (decisive).** Fixed the **0.001 → 0.865** hit-recall cliff on held-out
  DPJAIT — a real bug fix, not noise.
- **Status.** Kept (default).
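
A minimal sketch of the sequence-block shuffle, assuming sequences are held as a mapping from id to an ordered frame list. The function name, the seeding scheme, and the sequence ids are illustrative and are not the trainer's actual API.

```python
import random


def block_shuffled(sequences, epoch, base_seed=0):
    """Yield (seq_id, frame) with blocks in shuffled order and frames in original order.

    `sequences` maps a sequence id to its ordered list of frames.
    """
    order = sorted(sequences)                          # deterministic starting order
    random.Random(base_seed + epoch).shuffle(order)    # new block order each epoch
    for seq_id in order:
        for frame in sequences[seq_id]:                # frame order untouched for the GRU
            yield seq_id, frame


sequences = {
    "dpjait_a": [0, 1, 2],
    "antiuav_b": [0, 1, 2, 3],
    "detrac_c": [0, 1],
}
for epoch in range(2):
    print(epoch, list(block_shuffled(sequences, epoch)))
```

Only the block order changes between epochs, so no single source is guaranteed to supply the last gradient steps.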

### 2.3 GroupNorm
- **Avenue.** Replace BatchNorm with GroupNorm (`norm="group"`) — batch- and order-invariant, no
  running statistics.
- **Why.** The architecturally safer cure for the batch-size-1 / heterogeneous-stream instability.
- **Depth.** Full — production setting.
- **Outcome / evidence.** **WON.** Converges smoothly to a plateau (last 3 epochs 0.033 → 0.044 → 0.041,
  no collapse). Production setting for the multi-source change trainer.
- **Status.** Kept.
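
To make "batch- and order-invariant" concrete, here is the GroupNorm arithmetic for a single sample in pure Python. This is an illustration of the standard operation, not the project's layer; the learned scale and shift are omitted and the data layout is an assumption.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalise one sample given as a list of channels, each a flat list of floats.

    Channels are split into `num_groups` contiguous groups and each group is
    normalised with its own mean and variance. There is no batch axis and no
    running statistic, so the result depends on this sample alone.
    """
    n_ch = len(sample)
    assert n_ch % num_groups == 0, "channels must divide evenly into groups"
    per_group = n_ch // num_groups
    out = []
    for g in range(num_groups):
        chans = sample[g * per_group:(g + 1) * per_group]
        values = [v for ch in chans for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in chans])
    return out


# Four channels with very different ranges, normalised as two groups.
frame = [[0.0, 0.1, 0.0, 0.9], [0.2, 0.0, 0.1, 0.8],
         [5.0, 6.0, 7.0, 8.0], [4.0, 4.5, 5.5, 6.0]]
normed = group_norm(frame, num_groups=2)
for g in range(2):
    values = [v for ch in normed[2 * g:2 * g + 2] for v in ch]
    print(g, round(sum(values) / len(values), 6))
```

Because nothing is accumulated across frames, the source that happened to train last cannot shift the normalisation applied at evaluation time.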

---

## Axis 3 — Class-agnostic data engine (D10)

### 3.1 Skew the mix toward small-fast — CORRECTED
- **Avenue.** An early D10 sub-clause: weight the pretraining mix toward small-fast targets
  (tennis/shuttlecock/birds) since the operating regime is small fast drones.
- **Why.** Intuition that the mix should resemble the deployment target.
- **Depth.** Shallow — a draft clause, retracted before it was committed as a training recipe.
- **Outcome / evidence.** **CORRECTED (2026-06-16).** The myhal zero-shot run locked onto *drone-scale*
  blobs (a person's hands, head) and **missed a clearly visible tennis ball**. Any size bias — small or
  large — just relocates that single scale/speed failure to a different band. A real target spans the
  whole scale range *within one flight* (a dot at altitude, large on approach).
- **Status.** Retracted; replaced by 3.2.

### 3.2 Variety pretrain (all scales / speeds / types)
- **Avenue.** Pretrain on the full range of scales, speeds, and mover types — surveillance + balls +
  birds + drones — to make the detector scale/speed-**invariant**. Narrow toward the operating regime
  only at Stage-2 fine-tuning, never by starving the pretrain mix.
- **Why.** Scale/speed invariance is the only thing that avoids re-creating the myhal failure.
- **Depth.** Full — this is the spine of the data engine.
- **Outcome / evidence.** **WON.** Zero-shot variety pretrain ~matches the in-domain MVP for *detection*
  (recall 0.70 vs 0.80 on the honest product score; see the correction in 3.6). Box *precision* still
  lags — that is the Stage-2 gap, not a pretrain failure.
- **Status.** Kept.

### 3.3 In-domain aerial drone data — the decisive lever
- **Avenue.** Fold real in-domain drone corpora into the variety mix: DUT Anti-UAV (RGB, static cam),
  Anti-UAV-RGBT (visible+infrared), Anti-UAV410 (thermal). All native-labelled, static-cam, contiguous.
- **Why.** Directly counters the small/fast-drone miss; "broaden" should mean in-domain aerial, not more
  surveillance (the red-team demoted surveillance for the drone goal).
- **Depth.** Full — adapters verified, retrained, the central win.
- **Outcome / evidence.** **WON — the decisive lever.** Held-out DPJAIT recall **0.04 → 0.78** (a rig
  never trained on), ~matching the in-domain-only MVP (0.80) while *also* staying alive on Anti-UAV410
  (where the MVP is dead at 0.0). Ckpt `movernet_variety_drone_gn.pt`.
- **Status.** Kept.

### 3.4 Procedural multi-blob simulator (change-level)
- **Avenue.** Composite N moving real-crop sprites onto real static backgrounds with scripted
  trajectories — crossings, occlusions, wide log-scale (tiny speck → frame-filling), variable count and
  speed — then run the *same* MOG2+tempdiff to get exact multi-blob change maps with perfect boxes +
  track-ids. Extends `synthetic.py` + `change_samples`.
- **Why.** Real multi-drone labelled data is scarce; the model eats change maps, not appearance, so no
  photoreal rendering is needed. Unlimited, perfectly-labelled hard cases at near-zero cost.
- **Depth.** Full — built (`learning/datasets/sim.py`), trained, evaluated on a held-out OOD sim set.
- **Outcome / evidence.** **WON for generalization (~19×).** Out-of-domain held-out recall (balls / cars
  / birds / big blobs) **0.015 (real-only) → 0.284 (sim+real)**. Real-only is *blind* off-distribution;
  sim+real generalizes. The only clear per-frame-recall lever found.
- **Status.** Kept.
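
One way to get the "tiny speck to frame-filling" spread is to sample sizes log-uniformly, so each octave of scale is equally likely. The sketch below is an assumption about how such sampling could look; the ranges, field names, and `sample_scene` itself are hypothetical and are not taken from `sim.py`.

```python
import math
import random


def sample_log_uniform(rng, lo, hi):
    """Draw a value whose logarithm is uniform on [log(lo), log(hi)]."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))


def sample_scene(rng, frame_px=256, max_blobs=6):
    """Draw sprite count, size and speed for one synthetic scene (illustrative ranges)."""
    blobs = []
    for track_id in range(rng.randint(1, max_blobs)):
        blobs.append({
            "track_id": track_id,
            "size_px": sample_log_uniform(rng, 2.0, float(frame_px)),
            "speed_px_per_frame": sample_log_uniform(rng, 0.25, 32.0),
        })
    return blobs


rng = random.Random(0)
for blob in sample_scene(rng):
    print(blob)
```

A linear draw over the same range would spend almost all of its samples on large sprites, which is the size bias 3.1 retracted.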

### 3.5 COCO 80-class mask cutouts as crop sprites
- **Avenue.** Add COCO 80-class mask cutouts to the sim's crop library for appearance variety.
- **Why.** More diverse sprite shapes might broaden the OOD generalization further.
- **Depth.** Ablated.
- **Outcome / evidence.** **PLATEAUED.** Only **+1.6 pts OOD**. Crop-*appearance* variety is saturated;
  the binding constraint is now motion/environment realism + real footage, not sprite diversity.
- **Status.** Kept but a plateau (no further appearance-variety investment).

### 3.6 In-domain fine-tune (warm-start → DPJAIT)
- **Avenue.** Warm-start the general GroupNorm model and fine-tune at low lr on the deployment corpus
  (DPJAIT). The Stage-2 "WHERE" half of the curriculum.
- **Why.** Variety pretrain generalizes but localizes loosely; fine-tuning recovers rig-specific
  precision without discarding the broad prior.
- **Depth.** Full — held-out probe R03.
- **Outcome / evidence.** **WON.** Held-out R03: recall **0.94** / precision **0.85** / centre-jitter
  **1.67 px** (with post-processing), stable across all 8 epochs. Ckpt `movernet_dpjait_ft_gn.pt` — the
  best single-target DPJAIT detector.
- **Status.** Kept.
- **Honest note (the corrected over-claim).** An early "variety matches MVP" claim was a **metric
  artifact**: `hit_recall` (GT-centre-in-any-box) is gameable by over-firing / large boxes (a 1-epoch
  model scored 0.94, while training to convergence regressed it 0.865 → 0.34). It was retracted and
  replaced by
  the honest product metric `recall·precision·exp(−rmse/3)·exp(−jitter/2)`, on which variety ~matches MVP
  on *detection recall* but lags ~9× on precision+centroid — the real Stage-2 gap.
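
The difference between the two metrics can be sketched in a few lines. `product_score` follows the formula quoted above; the `hit_recall` implementation, the box format `(x0, y0, x1, y1)`, and all numbers in the demo are assumptions made for illustration.

```python
import math


def hit_recall(gt_centres, boxes_per_frame):
    """Fraction of frames whose GT centre falls inside ANY predicted box."""
    hits = 0
    for (cx, cy), boxes in zip(gt_centres, boxes_per_frame):
        if any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in boxes):
            hits += 1
    return hits / len(gt_centres)


def product_score(recall, precision, rmse_px, jitter_px):
    """recall * precision * exp(-rmse/3) * exp(-jitter/2)."""
    return recall * precision * math.exp(-rmse_px / 3.0) * math.exp(-jitter_px / 2.0)


# A detector that emits one frame-filling box per frame gets a perfect hit_recall.
gt = [(40, 60), (120, 30), (200, 220)]
giant = [[(0, 0, 255, 255)]] * len(gt)
print(hit_recall(gt, giant))  # 1.0

# Its centroid error is huge, so the product score collapses.
# The rmse and jitter values here are made up for the comparison.
print(product_score(recall=1.0, precision=1.0, rmse_px=90.0, jitter_px=0.0))
print(product_score(recall=0.8, precision=0.8, rmse_px=2.0, jitter_px=1.0))
```

The exponential terms are what remove the incentive to over-fire: a box that covers everything scores well on containment but pays for its centroid error.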

### 3.7 Per-source label provenance + Track-Anything teacher
- **Avenue.** Use native labels where a corpus ships them; bootstrap genuinely-unlabelled clips with a
  per-source detector dispatch or a **DEVA/SAM Track-Anything** class-agnostic video teacher, plus a
  track-displacement **moving-gate** that drops static blobs a motion-only learner can never see.
- **Why.** Almost every corpus is partially labelled; the teacher is the disposable RGB stand-in for a
  human annotator, the deployed net stays motion-only.
- **Depth.** Wired + running (not scaffold); SAM over-segments crowded scenes (the moving-gate is the
  filter; cleaner on drone-vs-sky than on streets).
- **Status.** Kept.
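
A minimal sketch of a track-displacement moving-gate, assuming teacher tracks arrive as ordered lists of box centres. The threshold value and the "maximum displacement from the first centre" rule are assumptions for this example, not the project's actual gate.

```python
import math


def is_moving(track, min_disp_px=4.0):
    """True if the track's centre ever moves at least `min_disp_px` from its start.

    `track` is an ordered list of (cx, cy) box centres from the teacher.
    """
    x0, y0 = track[0]
    return any(math.hypot(x - x0, y - y0) >= min_disp_px for x, y in track[1:])


def moving_gate(tracks, min_disp_px=4.0):
    """Keep only teacher tracks that actually move; static blobs are dropped."""
    return {tid: t for tid, t in tracks.items() if is_moving(t, min_disp_px)}


tracks = {
    "parked_car": [(50.0, 80.0), (50.2, 80.1), (49.9, 80.0)],
    "drone": [(10.0, 10.0), (13.0, 12.0), (17.0, 15.0)],
}
print(sorted(moving_gate(tracks)))  # ['drone']
```

Dropping the static track matters because a label on an object that produces no change signal would teach the motion-only net to fire on nothing.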

### 3.8 Change-signature sim (the root-cause data fix)
- **Avenue.** Replace the OLD sim's signature generator. The old sim slid a **frozen RGB crop rigidly**
  across the background, which produces an unrealistic smooth *sliding-blob* change signature — the
  MOG2+tempdiff fires only at the crop's **edges**, with no rotor flicker, motion blur, or interior
  shimmer. The fix (`learning/datasets/sim_change.py`) extracts **real 2-channel change-signature
  SEQUENCES** (the exact MOG2+tempdiff pipeline) from real GT boxes across dpjait / dut_antiuav /
  anti_uav410 (**10,586 patches**), models the real idle-background change statistics, and composites the
  real signatures over that background with per-instance pixel perturbation + change-domain env-realism.
- **Why.** Diagnosed by the user from a prediction montage: the sim's change signature did **not** match a
  real mover's, so the model never learned real interior temporal flicker. This is the *upstream* fix the
  precision ladder's takeaway points to (backend levers — threshold, geometry — are exhausted; cf. 5.7).
- **Depth.** Built + validated; **retrain done** (head-to-head vs `env_nodance_gn` on R12).
- **Outcome / evidence.** **VALIDATION PASSED**: interior temporal flicker **REAL 0.072 / OLD 0.045 / NEW
  0.065**, and a visual filmstrip confirms **NEW ≈ REAL** (ragged / internal flicker) vs **OLD** (smooth
  sliding). The head-to-head retrain resolved **MIXED — a precision win but a jitter regression** (3.9):
  - **R12 single-target recall/precision FLAT** — simchange **0.549 / 0.652** vs env_nodance **0.555 /
    0.645**. The headline single-target metric did **not** move; the lever acts on the multi-blob (ghost)
    and outdoor-robustness (jitter) axes, not on single-target recall.
  - **GHOSTS — WON.** Multi-blob ghosts @0.30 on R12_D4: **19 → 12 (−37%)** with **coverage held / up
    0.699 → 0.715** — a *real precision win*, and exactly the data-quality lever the backend gates
    (threshold, geometry) could not touch. The red-team's MAX-compositing caveat **did not bite** (ghosts
    fell rather than rose).
  - **JITTER — REGRESSED.** Frozen-scene jitter probe: **0.13 → 0.40 (dpjait)**, **0.38 → 2.33
    (dancetrack)**. Change-domain jitter is a **weaker approximation** than RGB-domain jitter, so
    simchange is **NOT outdoor-deployable**.
- **Honest caveats.** Single-target signatures; **MAX-compositing** (a blend, not a hard occlusion);
  background change-stats averaged over only a few clips.
- **Status.** **GREY / PARTIAL** — a confirmed ghost (precision) win, but a jitter (robustness)
  regression; not deployable as-is. See 3.9–3.11.

### 3.9 Combined-mix retrain (sim_change + old env-sim RGB + dpjait) — KILLED
- **Avenue.** Train one model on a *combined mix* — the new `sim_change` data **plus** the old env-sim RGB
  jitter data **plus** dpjait — to get the ghost win **and** jitter-robustness in a **single** model.
- **Why.** simchange wins on ghosts but breaks jitter; env_nodance wins on jitter; the obvious move is to
  combine their training data and keep both wins at once.
- **Depth.** Full — retrained, evaluated on both axes (ghosts @0.30 R12_D4 + frozen-scene jitter probe).
- **Outcome / evidence.** **KILLED — WORSE on BOTH axes.** Ghosts = **15** (worse than simchange's 12),
  coverage dropped to **0.587**, jitter **1.93 / 2.24** (worse than **both** singles — env_nodance's
  0.13/0.38 and simchange's 0.40/2.33). (Best single-target top-1 precision **0.684**, but that axis does
  not matter here.)
- **Diagnosis.** **Not dilution but a CONFLICT.** sim_change's change-domain jitter does **not** reproduce
  real RGB-jitter's *frame-wide MOG2 breakdown*, so it taught the net that "frame-wide change is fine" and
  **OVERRODE** the env-sim jitter-suppression that env_nodance had learned. Mixing the two jitter
  representations is actively destructive, not merely averaging.
- **Status.** Killed. Naive mixing is the wrong tool; the fix is to correct the *modeling* (3.10).

### 3.10 Fix sim_change's jitter modeling (the principled next step) — OPEN
- **Avenue.** Fix sim_change's **jitter modeling** so it reproduces **real MOG2-under-jitter**: extract
  signatures from / composite over **REAL jittered footage**, **or** apply jitter in **RGB before** the
  change transform — so that `sim_change` **alone** carries *both* realistic change signatures **and**
  realistic jitter. **NOT** naive mixing of two datasets (which the 3.9 conflict showed is destructive).
- **Why.** The 3.9 diagnosis: the conflict is that change-domain jitter ≠ real RGB-jitter's frame-wide
  MOG2 breakdown. Correcting the jitter *model* inside sim_change resolves the conflict at the source so a
  single model can carry the ghost win and jitter-robustness without the two datasets fighting.
- **Depth.** Proposed (not yet built).
- **Outcome / evidence.** No number yet — **OPEN**.
- **Status.** **YELLOW / proposed** — the principled successor to both simchange and the killed combined
  mix; the gate for retiring env_nodance as the keeper (3.11).

### 3.11 Current best DEPLOYABLE — `movernet_env_nodance_gn.pt`
- **Decision.** `movernet_env_nodance_gn.pt` remains the **CURRENT BEST DEPLOYABLE** Stage-1 model — it is
  the **only jitter-robust** model (19 ghosts). **Keep it deployed** until the jitter-fixed sim_change
  (3.10) beats it on **both** ghosts **and** jitter.
- **Why.** simchange cuts ghosts to 12 but is **not outdoor-deployable** (jitter 0.40/2.33); the combined
  mix is worse on both. Outdoor jitter-robustness is a deployment requirement, so the ghost win alone does
  not justify shipping simchange.
- **Status.** **Keeper.** simchange (12 ghosts, jitter-broken) and the combined mix are **EXPERIMENTAL**,
  not deployment candidates. **Update (leakage control, 3.12):** after the a2neg jitter 0.00 was shown to
  be dpjait-library leakage, `env_nodance` (`rgbcropsim__rgbjitter`) is **REINSTATED as the more general
  OUTDOOR-jitter keeper** — its RGB-domain jitter generalizes (0.13 without the test domain in its lib)
  where a2neg's edge-jitter degrades to 0.87. Pair it with a2neg's precision/ghost data fix.

### 3.12 Leakage control — RAN (2026-06-18) — SPLIT VERDICT
- **Avenue.** A **double control**: rebuild the **signature + jitter-bg + bg-stats** libraries with **NO
  dpjait** (keeping the common dpjait *fine-tune* source), retrain the **a2neg recipe**
  (`realchangesig__edgejitter__neg`), and re-test on **dpjait R12** — to separate genuine generalization
  from dpjait test-leakage. The control model is `realchangesig__edgejitter__neg__nodpjaitlibs`.
- **Why.** The `a2neg` signatures **include dpjait** and were **tested on dpjait** (the same exposure as
  3.8): both the ghost win and the perfect-jitter claim could be dpjait-in-library leaking into the R12
  evaluation rather than a real, generalizing improvement.
- **Depth.** Full — retrained the control recipe, re-ran single-target (recall/precision), multi-blob
  ghosts, and the frozen-scene jitter probe on dpjait R12.
- **Outcome / evidence.** **SPLIT — part REAL, part LEAKAGE.**
  - **REAL / generalizes (the change-signature data fix + negatives).** Single-target **PRECISION
    SURVIVES** removing the dpjait libs: **0.697 → 0.678** (still **> rgbcropsim__rgbjitter 0.645**);
    recall **0.563 → 0.541**; multi-blob **GHOSTS 15 → 18** (**≈ env_nodance's 19**). So the change-signature
    + negatives improvements are **GENUINE, generalizing** precision/ghost gains — **not** leakage.
  - **LEAKAGE / CORRECTED (the perfect jitter).** The headline **jitter 0.00 / 0.00 was substantially
    dpjait-LIBRARY leakage.** Removing dpjait from the jitter-bg library degrades dpjait jitter
    **0.00 → 0.87** (frozen-scene probe) — **WORSE** than `rgbcropsim__rgbjitter`'s RGB-domain jitter
    **0.13**. So **change-domain edge-jitter is DOMAIN-SPECIFIC** (it suppresses only the
    jitter-backgrounds it trained on), whereas **env_nodance's RGB-domain jitter GENERALIZES** better
    (MOG2 manufactures edge structure for *any* background). The A2 "jitter SOLVED 0.04 / 0.00" claims
    (3.14, 3.15) are accordingly **re-marked**: jitter robustness is **real but DOMAIN-DEPENDENT**, and the
    **0.00-on-dpjait was inflated by dpjait-in-library**.
- **Consequence.** The "current best deployable = a2neg + length filter" keeper (3.16) is **DOWNGRADED**:
  `env_nodance` (`rgbcropsim__rgbjitter`) **remains the more general jitter model**, and the
  `realchangesig__edgejitter__neg` model's **honest value is the precision/ghost gain, not jitter**.
- **Actionable lesson.** For outdoor jitter robustness, **either** use **RGB-domain jitter** (it
  generalizes) **or** include the **deployment domain's backgrounds** in the edge-jitter library. Never
  trust an edge-jitter 0.00 whose library shares its domain with the test.
- **Naming (per `model_naming.md`).** `a2neg` = `realchangesig__edgejitter__neg`; `env_nodance` =
  `rgbcropsim__rgbjitter`; the control = `realchangesig__edgejitter__neg__nodpjaitlibs`.
- **Status.** **GREEN for the precision/ghost fix (real); ORANGE/CORRECTED for the 0.00-jitter
  over-claim (retracted as dpjait-library leakage).**

### 3.13 Jitter-fix attempt A1 (sim_change with_jitter=OFF + RGB env-sim) — KILLED
- **Avenue.** Execute the 3.10 jitter fix by **outsourcing** jitter to the OLD RGB env-sim: run sim_change
  with its own `with_jitter=OFF` and let the env-sim carry the jitter signal alongside it.
- **Why.** env_nodance proved env-sim is jitter-robust; the lazy realization of 3.10 is to let env-sim
  supply jitter while sim_change supplies the change-signature ghost fix.
- **Depth.** Full — retrained, evaluated on jitter + ghosts + recall.
- **Outcome / evidence.** **KILLED.** Jitter **~recovered (0.16 / 0.82)** *but* ghosts **EXPLODED to 28**
  and recall dropped to **0.497**. The env-sim's **frozen rigid crops RE-POISON the ghosts** — exactly the
  unrealistic sliding-blob signature 3.8 was built to eliminate.
- **Diagnosis.** **env-sim is INCOMPATIBLE with the ghost fix.** sim_change must **self-contain** its
  jitter; jitter cannot be outsourced to the old env-sim without re-importing the ghost-causing signature.
- **Status.** Killed. Routes to A2 (3.14) — a self-contained REAL jitter background.

### 3.14 Jitter-fix attempt A2 (sim_change + REAL edge-structured jitter bg, alone) — SPLIT
- **Avenue.** Self-contain jitter inside sim_change: composite the real change-signatures over **REAL
  edge-structured (textured) jittered backgrounds**, with **no** env-sim.
- **Why.** A1's lesson — the jitter background must be realistic and self-contained, not a frozen-crop
  outsource.
- **Depth.** Full — retrained, evaluated on jitter + ghosts.
- **Outcome / evidence.** **SPLIT.** Jitter **LOW — 0.04 / 0.36**: the realistic textured jitter-bg drove
  the dpjait number down. **BUT ghosts blew up to 30.**
- **Diagnosis.** The model **over-fires on real content** because **every scene had a target** — there was
  no "fire nothing" training. Realistic backgrounds without negatives teach the net that real-looking
  content always contains a mover.
- **Correction (leakage control, 3.12).** The low **0.04** is **DOMAIN-DEPENDENT** and **partly
  dpjait-in-library**: `edgejitter` suppresses only the jitter-backgrounds it trained on, and removing
  dpjait from the library degrades the number sharply (0.00 → 0.87 for `a2neg`). The headline "jitter
  SOLVED" reading is **re-marked** — jitter robustness here is real but not general.
- **Status.** Split (jitter LOW but domain-dependent, ghosts red). Routes to negatives (3.15).

### 3.15 Pure-negative (no-target) scenes — `a2neg` — WON (jitter+precision) / PARTIAL (coverage)
- **Avenue.** Add **pure-negative scenes** (no target present) to A2: `a2neg` = A2 + **30% negatives**
  (`neg_prob=0.3`).
- **Why.** The **X-thread lesson** ("zero negatives → hallucinate"): A2 over-fired because it never saw a
  scene where the correct answer is *nothing*. Negatives teach "fire nothing."
- **Depth.** Full — retrained, evaluated on jitter / single-target / ghosts / coverage.
- **Outcome / evidence.** **WON for PRECISION (real / generalizes):** **BEST single-target** (recall
  **0.563** / precision **0.697** / **fp-per-frame 0.97**), ghosts **30 → 15**. The reported **jitter
  0.00 / 0.00** was **substantially dpjait-LIBRARY leakage** (see 3.12: rebuilding the libs without dpjait
  degrades dpjait jitter **0.00 → 0.87**, worse than `env_nodance`'s **0.13**). **PARTIAL on coverage:**
  coverage dropped to **0.569** — `neg_prob=0.3` is **over-conservative**.
- **Finding.** The **DATA side hit a ~12–15 ghost FRONTIER** — negatives + realistic jitter could not push
  ghosts below ~15 without bleeding coverage. Finishing the job requires a backend filter (see the backend
  ghost-suppression section). The leakage control later confirmed the **precision/ghost gain is the honest
  value** of this recipe (it survives no-dpjait); the jitter gain does **not** generalize.
- **Status.** **GREEN for precision (real); the 0.00-jitter claim RETRACTED as dpjait-library leakage
  (3.12); GREY/PARTIAL for the coverage cost.** The DATA half of the (now downgraded) deployable verdict
  (3.16).
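
How `neg_prob` could enter scene sampling, as a hedged sketch: with probability `neg_prob` the scene gets no target at all. Only `neg_prob=0.3` comes from the entry above; `sample_target_count` and `max_targets` are hypothetical names.

```python
import random


def sample_target_count(rng, neg_prob=0.3, max_targets=4):
    """Return how many targets to composite; 0 means a pure-negative scene."""
    if rng.random() < neg_prob:
        return 0                        # the correct output is "fire nothing"
    return rng.randint(1, max_targets)


rng = random.Random(0)
counts = [sample_target_count(rng) for _ in range(10_000)]
neg_share = counts.count(0) / len(counts)
print(round(neg_share, 2))
```

Lowering `neg_prob` is the obvious knob for trading ghosts back for coverage, given the over-conservative reading at 0.3.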

### 3.16 Deployable verdict (DOWNGRADED by the leakage control, 3.12) — `a2neg` for precision, `env_nodance` for jitter
- **Decision.** The **`a2neg` + `min_track_len ~500`** keeper is **DOWNGRADED** from a
  "supersedes-env_nodance, perfect-jitter" doubleoctagon. Its **honest value** is the
  **PRECISION / GHOST gain** (real, generalizes — 3.12) plus **ZERO multi-blob ghosts** after the length
  filter — **not** jitter. For **outdoor jitter robustness**, **`env_nodance` =
  `rgbcropsim__rgbjitter`** is **REINSTATED as the more general model** (RGB-domain jitter generalizes;
  a2neg's change-domain edge-jitter is domain-specific).
- **Why.** The leakage control showed the reported "**PERFECT jitter 0.00**" was substantially
  **dpjait-LIBRARY leakage** (0.00 → 0.87 without dpjait libs, vs env_nodance's 0.13). So a2neg's data fix
  earns its keep on precision/ghosts, but its jitter advantage does **not** survive the control. The
  backend track-LENGTH filter (`min_track_len ~500`) still finishes the ghosts to **zero** without touching
  coverage (see backend ghost-suppression).
- **Actionable lesson (outdoor jitter).** Either use **RGB-domain jitter** (`rgbjitter`, which
  generalizes) **or** include the **deployment domain's backgrounds** in the **edge-jitter** library —
  to get a jitter-robust model whose robustness is not just leakage from a shared library domain.
- **Honest caveat (length filter).** Works for **PERSISTENT / loitering** targets; a **brief
  entering/exiting** real target would be **dropped**. Online, this is handled by the **tracker's M-of-N
  confirmation** — and the priority is **consistency > recall** anyway, so dropping a brief unconfirmed
  blip is the correct trade. The pipeline **default `min_track_len=40` is too LOW**; the operating point is
  **~500**, **not yet committed** (a **user decision**).
- **Status.** **a2neg = the precision/ghost data fix (kept); `env_nodance` reinstated as the
  outdoor-jitter keeper.** The doubleoctagon "perfect-jitter supersedes" claim is retracted. The
  experimental simchange / combined-mix / A1 lines stay retired.

### 3.x Real multi-object data (DanceTrack / SportsMOT / MOT17 / MOT20)
- **Avenue.** Acquire real multi-object tracking corpora; trial DanceTrack as a motion-only *detection*
  source.
- **Why.** Real multi-mover footage with track-ids for both detection variety and association supervision.
- **Depth.** Full — a clean 3-way detection ablation.
- **Outcome / evidence.** DanceTrack-as-detection **KILLED**: its large person-blobs (~130×190px, ~7/frame)
  re-imposed the myhal scale prior and tanked drone recall **0.566 → 0.339**. The clean 3-way
  (coco_gn 0.566 / env-no-dance 0.555 / env+dance 0.339) pins it on DanceTrack, not env-realism.
  **Re-scoped:** keep DanceTrack/SportsMOT/MOT only for **association / track-id supervision**
  (SportsMOT pans/zooms → association only, D4); never a raw detection source. Best Stage-1 outdoor model:
  `movernet_env_nodance_gn.pt`.
- **Status.** Killed for detection; kept for association.

---

## Axis 4 — Outdoor robustness (real-footage false-fires)

### 4.1 Env-realism in the sim (camera jitter + illumination drift)
- **Avenue.** Add camera micro-jitter (tripod vibration — *within* D4, not a pan) and illumination /
  cloud-shadow drift to the sim, so the detector learns that *coherent global* change is background and
  only *local* blob motion is a target.
- **Why.** The real-footage failure mode is firing on global change (shake, light); the cure is to train
  against it.
- **Depth.** Full — proven by a deterministic controlled test (freeze a real frame → zero true movers →
  apply known jitter → every detection is a ghost).
- **Outcome / evidence.** **WON.** Env-realism cuts jitter false-fires **2.4–9×** (env_nodance 0.13/0.38
  det/frame on dpjait/dancetrack vs coco_gn 1.02/3.56) at **~zero drone-recall cost** (R12 0.566 → 0.555).
- **Status.** Kept.
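
The controlled jitter test can be sketched without any imaging library. Everything here is a toy stand-in: frames are 2-D lists, `detect(prev, cur)` is a hypothetical detector interface, and `naive_detector` is a deliberately bad thresholded frame-diff used only to show the probe firing.

```python
import random


def shift(frame, dx, dy):
    """Translate a 2-D list image by (dx, dy) pixels, clamping at the borders."""
    h, w = len(frame), len(frame[0])
    return [[frame[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)]
             for x in range(w)] for y in range(h)]


def jitter_probe(frozen_frame, detect, n_frames=50, max_shift=1, seed=0):
    """Mean detections per frame on a frozen scene under known jitter.

    The scene has zero true movers, so every detection counts as a ghost.
    """
    rng = random.Random(seed)
    prev, ghosts = frozen_frame, 0
    for _ in range(n_frames):
        cur = shift(frozen_frame, rng.randint(-max_shift, max_shift),
                    rng.randint(-max_shift, max_shift))
        ghosts += len(detect(prev, cur))
        prev = cur
    return ghosts / n_frames


def naive_detector(prev, cur, thresh=50):
    """Toy stand-in: one 1x1 box per pixel whose value changed by more than `thresh`."""
    return [(x, y, x, y) for y, row in enumerate(cur) for x, v in enumerate(row)
            if abs(v - prev[y][x]) > thresh]


# A static scene with one hard vertical edge.
scene = [[0] * 4 + [255] * 4 for _ in range(8)]
print(jitter_probe(scene, naive_detector))         # toy detector fires along the edge
print(jitter_probe(scene, lambda prev, cur: []))   # 0.0 for a detector that stays silent
```

Because the ground truth is known to be empty, the probe needs no labels and no matching step, which is what makes it deterministic where the 5-scene metric was noisy.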

### 4.2 The 5-scene OOD-sim precision metric — CORRECTED
- **Avenue.** Use a 5-scene OOD-sim precision metric as the robustness eval.
- **Why.** First attempt at quantifying the robustness gain.
- **Depth.** Shallow.
- **Outcome / evidence.** **CORRECTED.** Too noisy to show the env-realism win — *retracted* as the
  robustness signal and replaced by the deterministic controlled jitter test (the reliable eval). The
  same noisy metric also surfaced the seed bug below.
- **Status.** Retracted.

### 4.3 `_seed_of` salted-hash bug
- **Avenue / bug.** `_seed_of` used Python's per-process-salted `hash()`, making eval scenes
  non-reproducible across runs.
- **Depth.** Found and fixed.
- **Outcome.** **FIXED**: switched to `crc32`, which yields the same seed in every process and on every run.
- **Status.** Fixed.
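
A minimal sketch of the fix, assuming the seed is derived from a scene name string. `seed_of` is a hypothetical stand-in for the project's `_seed_of`, whose real signature is not shown here. Python's built-in `hash()` on strings is salted per interpreter process unless `PYTHONHASHSEED` is pinned, whereas `zlib.crc32` depends only on the bytes.

```python
import zlib


def seed_of(name: str) -> int:
    """Stable 32-bit seed for a scene name: identical in every process and on every run."""
    return zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF


# The same scene name always maps to the same seed, so eval scenes are reproducible.
seed_a = seed_of("ood_scene_03")
seed_b = seed_of("ood_scene_03")
```

The result can be passed straight to `random.Random(seed)` or any other seeded generator.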

---

## Axis 5 — Multi-target backend (track several movers without dropping them)

### 5.1 Ray-exclusive correspondence
- **Avenue.** Replace permissive 2-camera-coincidence correspondence with a **ray-exclusive** consensus:
  take candidates best-first (most cameras, lowest residual), each *claims* its rays, no 2-D detection may
  be reused.
- **Why.** 4 drones × 4 cams gave **254 tracks** (250 ghosts), each drone fragmented across 8–27 ids, and
  `_decide_target` picked a ghost. `score_multi` hid it (per-frame Hungarian always finds one near each GT).
- **Depth.** Full + tested (`k_min=3` consensus trialled — worse, real drones often seen by only 2 cams;
  `k_min=2` kept).
- **Outcome / evidence.** **WON. 254 → 16 tracks**, fragmentation 1–5, `is_target` now real.
- **Status.** Kept (default).
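
The best-first, claim-your-rays rule can be sketched as below. This is an illustration under assumptions, not the project's implementation: the `Candidate` type, its fields, and `ray_exclusive` are hypothetical names, and a "ray" is modelled as a `(camera, detection index)` pair.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Ray = Tuple[int, int]  # (camera index, index of the 2-D detection in that camera)


@dataclass(frozen=True)
class Candidate:
    rays: FrozenSet[Ray]   # the 2-D detections this 3-D candidate was triangulated from
    residual_mm: float     # triangulation residual


def n_cams(c: Candidate) -> int:
    """Number of distinct cameras supporting a candidate."""
    return len({cam for cam, _ in c.rays})


def ray_exclusive(cands: List[Candidate], k_min: int = 2) -> List[Candidate]:
    """Greedy consensus: most cameras first, then lowest residual; no detection is reused."""
    claimed: Set[Ray] = set()
    kept: List[Candidate] = []
    for c in sorted(cands, key=lambda c: (-n_cams(c), c.residual_mm)):
        if n_cams(c) < k_min:
            continue            # not enough distinct cameras
        if c.rays & claimed:
            continue            # at least one of its detections is already explained
        claimed |= c.rays
        kept.append(c)
    return kept


real_a = Candidate(frozenset({(0, 0), (1, 0), (2, 0)}), residual_mm=5.0)
real_b = Candidate(frozenset({(0, 1), (1, 1)}), residual_mm=8.0)
ghost = Candidate(frozenset({(0, 0), (1, 1)}), residual_mm=3.0)  # crosses rays of A and B
kept = ray_exclusive([ghost, real_b, real_a])
```

In this toy case the ghost has the lowest residual of the two-camera candidates, yet it is rejected because one of its rays was already claimed by the three-camera candidate. That is the mechanism that stops one detection from seeding several tracks.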

### 5.2 `score_tracks` per-blob metric
- **Avenue.** One stable whole-sequence track→target assignment reporting per-target coverage / mean-err
  + overlapping-track (fragmentation) + ghost count + id-switches.
- **Why.** `score_multi` pools per-frame and *hides* ghosts; the honest multi-target gate was needed.
- **Status.** Kept (the multi-target eval gate; id_switches is the drop-and-re-pick KPI).
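
To make the three per-target KPIs concrete, here is a small sketch that assumes the whole-sequence assignment has already been reduced to "which track id covers this target at each frame". The function name and input shape are assumptions for illustration; `score_tracks` itself may compute these differently.

```python
from typing import Dict, List, Optional, Union


def track_continuity(assigned: List[Optional[int]]) -> Dict[str, Union[int, float]]:
    """Per-target continuity KPIs.

    assigned[t] is the id of the track covering the target at frame t, or None if uncovered.
    """
    covered = [a for a in assigned if a is not None]
    # An id-switch is a change of covering track id, including across an uncovered gap
    # (the drop-and-re-pick case).
    switches = sum(1 for a, b in zip(covered, covered[1:]) if a != b)
    return {
        "coverage": len(covered) / len(assigned) if assigned else 0.0,
        "fragmentation": len(set(covered)),  # number of distinct track ids on this target
        "id_switches": switches,
    }


kpis = track_continuity([1, 1, None, 1, 2, 2, None, 1])
```

A per-frame pooled metric would score this target as well covered; the id-switch count is what exposes that it was dropped and re-picked.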

### 5.3 MultiTargetTracker (M-of-N confirm, birth-guard, merge)
- **Avenue.** M-of-N birth confirmation, a birth-guard second-chance association (the main
  de-fragmenter), coast-widened velocity-aware gate, near-duplicate merge.
- **Why.** Continuity — kill transient ghosts, recapture drifted predictions onto the same id.
- **Depth.** Full.
- **Outcome / evidence.** **WON.** R12_D4: 16 → 7 tracks, ghosts 12 → 3, mean coverage 0.77 → 0.94.
- **Status.** Kept.
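
M-of-N birth confirmation in its generic form: a tentative track is confirmed once it has been detected in at least M of the last N frames. The class name and the default `m`/`n` values below are illustrative assumptions, not the `MultiTargetTracker` settings.

```python
from collections import deque


class MofNConfirmer:
    """Confirm a tentative track after >= m detections within a sliding window of n frames."""

    def __init__(self, m: int = 3, n: int = 5) -> None:
        if not 0 < m <= n:
            raise ValueError("need 0 < m <= n")
        self.m = m
        self.hits = deque(maxlen=n)  # the oldest frame falls out automatically
        self.confirmed = False

    def update(self, detected: bool) -> bool:
        """Feed one frame's hit/miss; returns True once the track is (and stays) confirmed."""
        self.hits.append(bool(detected))
        if sum(self.hits) >= self.m:
            self.confirmed = True
        return self.confirmed


transient = MofNConfirmer()
transient_states = [transient.update(d) for d in (True, False, False, False, False, True)]

persistent = MofNConfirmer()
persistent_states = [persistent.update(d) for d in (True, False, True, True)]
```

A one-or-two-frame ghost never reaches M hits inside the window, while a real mover with an occasional missed detection still confirms.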

### 5.4 `learned.detector(multi=True)`
- **Avenue.** Emit *all* decode peaks per frame instead of one smoothed single-target track per camera;
  let the 3-D `MultiTargetTracker` do the association.
- **Why.** The model already emits 2–3 peaks/frame (up to 5) — it sees multiple drones — but
  `ad/detect/learned.py` was single-target *by construction*. The integration, not the model, was the
  bottleneck.
- **Depth.** Full; plus a multi-drone fine-tune (`movernet_dpjait_multi_gn.pt`).
- **Outcome / evidence.** **WON.** Multi-drone fine-tune → held-out R12_D4 recall 0.65 / prec 0.77 /
  **3.32 preds/frame** (vs the single-drone model's ~1).
- **Status.** Kept (keep both single- and multi-target ckpts).

### 5.5 Post-hoc levers (NMS kernel / persistence gate / threshold)
- **Avenue.** A tunable NMS kernel + persistence gate + threshold on the multi-target front-end to
  suppress ghosts by tuning.
- **Why.** Cheapest possible ghost fix if it worked.
- **Depth.** Ablated.
- **Outcome / evidence.** **KILLED for ghosts.** NMS/threshold ablations did not move it. **Finding:
  multi-blob is DATA / precision-limited, not arch/tuning-limited** — the R12 consistency limiter is the
  learned detector's precision (noisier than YOLO: 15/11 vs dataset 7/3) + correspondence ghosts, not
  anything a post-hoc lever reaches.
- **Status.** Killed as a ghost fix.

### 5.6 B GATE — id-switch attribution probe (detection-vs-association)
- **Avenue.** Hold the multi-target backend fixed and **swap only the detections** on R12_D4: clean YOLO
  dataset boxes vs the learned detector's peaks. This attributes the multi-target failure to either the
  detection front-end or the association/maintenance layer.
- **Why.** Before investing in a smarter tracker / learned maintenance policy ("B"), confirm *what* is
  actually limiting the backend — prediction/association continuity, or upstream detection quality.
- **Depth.** Full A/B (one variable: the detection source).
- **Outcome / evidence.** **DECIDED.** Same backend:
  - clean-YOLO detections → **7 tracks / 3 ghosts / 3 id-switches / coverage 0.938**
  - learned detections → **62 tracks / 58 ghosts / 15 id-switches / coverage 0.690**

  The backend is fine when fed clean detections; the dominant failure mode is **false POSITIVES (ghosts)**,
  not track continuity. **Conclusion: the system is DETECTION-precision-limited, NOT association-limited.**
  A smarter tracker or learned track-maintenance POLICY (B) is therefore the **WRONG fix** — it cannot
  remove ghosts it is handed.
- **Status.** Decided. **Confirms the standing "data/precision-limited, not arch/tuning-limited"
  throughline** (cf. 5.5, 6.2). Routes the live plan to the precision ladder (frontier), not to a policy.

### 5.7 Backend ghost suppression — the "try all" ablation (real R12_D4, a2neg candidates)
The DATA side (3.15) drove ghosts down to a **~12–15 frontier** but no further without bleeding coverage.
This is the backend half: a "try every separator" ablation on real R12_D4 to finish the remaining ghosts.

- **Diagnostic (answers "how long / where are the ghosts?").** Ghosts are **SHORT** (track-length median
  **64**) and **SCATTERED** (only **4%** within **700 mm** of a real track); real tracks are **LONG**
  (length median **2771**). So ghosts are separable by **temporal length**, not by spatial proximity or
  residual mass.
- **blackhole (absorb ghosts into nearby real tracks)** — **KILLED (45 → 41).** Ghosts are **not near**
  real tracks to absorb (only 4% within 700 mm), so there is nothing to merge them into.
- **mass_gate (gate on per-track residual "mass")** — **KILLED.** **No gap:** ghost mass **3697 ≈** real
  **4044** → gating on mass crushes coverage before it cuts ghosts.
- **occupancy (coarse bbox-frustum occupancy)** — **KILLED (45 → 45).** The coarse bbox-frustum **can't
  separate** the low-residual phantoms.
- **track-LENGTH filter (`min_track_len 500`)** — **WON.** **0 ghosts**, **all 4 targets covered**,
  **coverage UNCHANGED (0.795)**. A **clean chasm**: real tracks **≥ 2207** vs the longest ghost **442**.
- **Takeaway.** The **clever** separators (blackhole / mass / occupancy) all **failed**; the **trivial
  temporal length filter won**. The ghosts' defining property is that they are short and scattered, and a
  length threshold is exactly the cut that property admits.
- **Status.** **Length filter = WON** (the backend half of the new keeper, 3.16). Operating point
  `min_track_len ~500` (pipeline default 40 is too low) — **not yet committed**, a user decision.
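
The length cut itself is a one-line filter. The sketch below plugs in the lengths quoted above (ghost median 64, longest ghost 442, shortest real 2207, real median 2771); the track ids and the `filter_by_length` helper are hypothetical.

```python
from typing import Dict


def filter_by_length(track_lengths: Dict[str, int], min_track_len: int) -> Dict[str, int]:
    """Keep only tracks that lasted at least min_track_len frames."""
    return {tid: n for tid, n in track_lengths.items() if n >= min_track_len}


# Illustrative ids; the lengths are the figures reported in the diagnostic above.
lengths = {"real_shortest": 2207, "real_median": 2771, "ghost_median": 64, "ghost_longest": 442}

at_500 = filter_by_length(lengths, min_track_len=500)  # the proposed operating point
at_40 = filter_by_length(lengths, min_track_len=40)    # the current pipeline default
```

Any threshold above 442 and at or below 2207 gives the same split on R12_D4, which is why the operating point is robust on this clip and also why it says nothing yet about other clips.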

### 5.8 Dense point-cloud backend (virtual-LiDAR, your-architecture)

A separate, larger swing on the multi-target axis: **replace centroid-pairwise correspondence — the
ghost source — entirely** with a dense voxel point-cloud built from *every* moving pixel.

- **The build (green / real).** `ad/voxel_cloud.py`: every MOG2 foreground pixel casts a **calibrated
  ray** which is **DDA-accumulated** into a voxel grid, recording a per-voxel **intensity** plus a
  **distinct-camera bitmask**. Ported faithfully from `pixeltovoxelprojector`'s `ray_voxel.cpp` +
  `localeyes`'s `triple_take`. A **decoupled** `ad/cloud_to_targets.py` then applies a
  **≥k-distinct-camera threshold → cluster → `MultiTargetTracker`**. This replaces the centroid-pairwise
  correspondence that was the ghost source (cf. 5.1) — instead of pairing 2-D centroids, every pixel
  votes a ray and targets are where enough *distinct cameras* agree in 3-D.

- **Audit Review A — geometry / DDA / reference-faithfulness: PASS (green).** Independently verified, **no
  corner cuts**:
  - ray geometry correct to **7.4e-4 mm**;
  - DDA traversal **bit-identical over 7000+ adversarial rays** with **zero gaps**;
  - **faithful to `ray_voxel.cpp`** — it even **fixes a reference boundary bug**;
  - **every pixel, full-res, no subsample**;
  - the **k-distinct-camera bitmask is correct**.

  The math is sound. *(Nit: the non-default `'march'` knob misses ~20%, not the ~15% claimed.)*

- **Audit Review C — generalization / overfit: headline OVERSTATED (orange octagon / CORRECTED).**
  **Retract the "0 ghosts, 4/4" claim.** The per-frame **best-case** looked like 0 ghosts, but the
  honest **FULL-SEQUENCE TRACK-LEVEL** number (k=4, vs=100) is **11 ghosts / coverage 0.751 / 13
  id-switches** — still a **real ~4× gain** over correspondence's **45** ghosts, but **NOT 0**. The
  "0 ghosts" came from a **post-hoc `min_track_len ≥ 500` filter** (the same R12-overfit length cut as
  5.7 / 3.16), **not** from the k-consensus itself. Confirmed limits:
  - **(a) operating-point overfit.** Ghosts/frame **0.06 (vs=100) → 0.90 (vs=50) → 2.54 (vs=25)** — only a
    **narrow vs=50–100** voxel-size band works.
  - **(b) severe motion-dependence.** Hovering-target detection is only **20–31%**; a **26 s gap exceeds
    the coast budget → drops + re-acquires = FRAGMENTATION** — the worst failure mode under the standing
    **consistency > recall** priority.
  - **(c) 4-camera-contingent.** Drop one camera → **k4 collapses to 0 targets**; **k3 → 53 ghosts**.
  - **(d) offline-only.** **~3 fps** (not real-time).

- **The fix / open frontier (yellow) — [BUILDING].** Two pieces: a **consistency-tracker** with a
  **velocity-adaptive LONG coast** to **bridge the hover gaps** into one stable label, plus a **RANKER**
  that ranks tracks by **3-D-path plausibility** and **defers** the real/ghost decision (rather than a
  hard length cut). And feed the cloud from **MoverNet's RECURRENT output** (roadmap **D3**: the GRU
  carries the target through a stationary moment) so **hovering targets keep casting rays** — directly
  attacking limit (b).

- **Net.** The dense cloud is a **sound, real ~4× ghost improvement** over centroid-pairwise
  correspondence — but a **DEMONSTRATION-on-R12 with real constraints** (operating-point overfit,
  motion-dependence, 4-camera-contingent, offline), **not yet a general / real-time solution.**
- **Status.** **GREEN for the build + Audit A (geometry / DDA / faithfulness, no corner cuts); ORANGE /
  CORRECTED for the "0 ghosts 4/4" headline (Audit C, retracted as a post-hoc-filter / per-frame
  artifact); YELLOW for the consistency-tracker + ranker + recurrent-ray fix.**
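
For readers unfamiliar with the build step, the sketch below shows the general technique: a textbook Amanatides-Woo DDA voxel traversal, accumulation of per-voxel intensity plus a distinct-camera bitmask, and a k-distinct-camera threshold. It is not a port of `ray_voxel.cpp` or `ad/voxel_cloud.py`; all names are hypothetical, the grid is a sparse dict rather than a dense array, and rays are given directly instead of being derived from calibrated pixels.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple

Voxel = Tuple[int, int, int]
Vec = Tuple[float, float, float]


def dda_voxels(origin: Vec, direction: Vec, voxel_size: float, max_dist: float) -> Iterator[Voxel]:
    """Yield, in order, every voxel a ray passes through (direction must be non-zero)."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]
    idx = [math.floor(o / voxel_size) for o in origin]
    step, t_max, t_delta = [], [], []
    for axis in range(3):
        if d[axis] > 0:
            step.append(1)
            t_max.append(((idx[axis] + 1) * voxel_size - origin[axis]) / d[axis])
            t_delta.append(voxel_size / d[axis])
        elif d[axis] < 0:
            step.append(-1)
            t_max.append((idx[axis] * voxel_size - origin[axis]) / d[axis])
            t_delta.append(-voxel_size / d[axis])
        else:  # ray is parallel to this axis' voxel faces and never crosses them
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)
    t = 0.0  # distance along the ray at which the current voxel was entered
    while t <= max_dist:
        yield (idx[0], idx[1], idx[2])
        axis = t_max.index(min(t_max))  # nearest face crossing decides the next voxel
        t = t_max[axis]
        idx[axis] += step[axis]
        t_max[axis] += t_delta[axis]


def accumulate(rays: Iterable[Tuple[int, Vec, Vec, float]], voxel_size: float, max_dist: float):
    """rays: (camera index, origin, direction, intensity). Returns (intensity, camera bitmask)."""
    intensity: Dict[Voxel, float] = defaultdict(float)
    cam_mask: Dict[Voxel, int] = defaultdict(int)
    for cam, origin, direction, value in rays:
        for v in dda_voxels(origin, direction, voxel_size, max_dist):
            intensity[v] += value
            cam_mask[v] |= 1 << cam  # one bit per camera: repeated rays from one camera add nothing
    return intensity, cam_mask


def consensus_voxels(cam_mask: Dict[Voxel, int], k: int) -> Set[Voxel]:
    """Voxels seen by at least k distinct cameras."""
    return {v for v, mask in cam_mask.items() if bin(mask).count("1") >= k}


ray_cam0 = (0, (0.5, 0.5, 0.5), (1.0, 0.0, 0.0), 1.0)
ray_cam1 = (1, (2.5, -1.5, 0.5), (0.0, 1.0, 0.0), 1.0)
_, mask_two_cams = accumulate([ray_cam0, ray_cam1], voxel_size=1.0, max_dist=2.6)
_, mask_one_cam = accumulate([ray_cam0, ray_cam0], voxel_size=1.0, max_dist=2.6)
```

The bitmask is what distinguishes "k distinct cameras agree" from "k rays passed through": two rays from the same camera raise the intensity but never satisfy a `k=2` consensus.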

---

### 5.9 Integrated chain + CROSS-CLIP validation — resolves 5.8's yellow/open items (numbers)

The 5.8 fix pieces are now **built, integrated end-to-end, and cross-clip-validated**: limit (d) is closed,
limit (b) is partially recovered, and limits (a) and (c) are honestly bounded as still open.

- **The chain.** cloud (`voxel_cloud[_gpu]`, vs=100/k=4/mv=3) → **consistency-tracker** (velocity-adaptive
  long coast, `coast_hover=250`) → **principled LLR ranker** (`ad/track_llr.py`, recursive MHT
  log-likelihood ratio; P_D=0.9, radar λ_c, NOT R12-tuned). One script:
  `outputs/ghost_ablation/integrated_eval.py` (+ `cross_clip_eval.py`, `cross_clip_summary.py`).
- **Limit (d) offline — FIXED.** `ad/voxel_cloud_gpu.py`: bit-equivalent to the CPU DDA (0.0 mm cluster
  diff, 400/400 fuzz), **31.5 fps real-time** at vs=100 (was ~3 fps). Audit A's only nit closed.
- **Consistency — WON, generalizes.** Same config on 3 clips: **R12_D4 frag [1,1,1,1] / cov 0.857 / 12
  id-sw; R14_D3 frag [1,1,1] / cov 0.926 / 3 id-sw.** Every *sustained* target is **ONE stable label**.
  (R15_D3 frag [1,3,1] / cov 0.145 is a **500-frame-clip artifact**: the k=4 cloud fires on only 125/495
  frames there — short-clip cloud-recall, not a tracker fault; fg IS present, median 866 px/frame.)
- **Ranker — WON for sustained targets, cross-clip.** Under the operationally meaningful "real = a track
  covering a target ≥10% of its GT frames": **AUC 1.000 on ALL three clips**, every sustained real above
  every ghost, separation **+12 114 (R12) / +35 370 (R14) / +204 (R15) nats**. The principled MHT score is
  **NOT R12-overfit** — the AUC-1.0 generalizes. Ghosts are KEPT as stable tracks (13 / 34 / 3) and ranked
  cleanly below — exactly the consistency-first / rank-don't-suppress thesis.
- **HONEST RESIDUAL (the genuine limit, OPEN).** Under the *generous* precision-arm label, R14 AUC dips to
  **0.90**: a single **45-frame / 0.8 %-coverage** real re-acquisition fragment (id 29) scores **below**
  ~13 long **structured ghosts**. Verified the would-be fix **fails**: an MHT amplitude/support term does
  NOT separate them — the densest structured ghosts (id 36: 103 vox, id 11: 90 vox) carry **more** voxels
  than the reals (42–81). These dense ghosts are **ray-crossing PHANTOMS** — rays from *different* real
  targets intersecting: seen by 4 cameras, dense, smoothly moving ⇒ target-like in **both** kinematics and
  amplitude. The LLR already separates the SUSTAINED reals 32× (≈5 400 obs ⇒ ~37 000 nats vs phantoms'
  ~60 obs ⇒ 300–1 164 nats); the irreducible overlap is **brief-real-glimpse vs brief-phantom**, both
  short. This is precisely the multi-camera phantom the reference algo killed with **ray-exclusive
  correspondence** (5.1) — which the dense cloud *deliberately trades away* for completeness. The cloud
  buys completeness + robustness; the ranker recovers sustained targets but **cannot** disambiguate a
  brief real glimpse from a brief phantom on track-features alone. Frontier: phantom-aware geometry
  (ray-exclusivity / explaining-away at cloud level) or the D3 recurrent front-end carrying brief reals to
  sustained tracks.
- **Limit (b) motion-dependence — PARTIAL (TBD, `ad/tbd.py`).** Energy-integration TBD cuts blind-coast
  drift **41.6 % → 33.6 %** with **no new ghosts**, but costs a dead-parked target (cov 0.857 → 0.814,
  refragments) ⇒ kept **opt-in**, NOT in the headline config. Limits (a) operating-point and (c) 4-camera
  remain (POM = the documented camera-degradation contingency).
- **Status.** **GREEN: real-time GPU cloud (limit d), consistency cross-clip, LLR ranker cross-clip
  (AUC 1.0 sustained). YELLOW: TBD motion-recovery (partial). OPEN: brief-real-vs-brief-phantom (the
  ray-crossing residual) + limits (a)/(c).**
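
The ranker's behaviour (sustained reals far above ghosts, brief reals and brief phantoms overlapping) follows from the shape of a recursive track score. Below is a sketch of the textbook log-likelihood-ratio recursion with an isotropic Gaussian residual model, plus a pairwise AUC. It is not `ad/track_llr.py`: only `P_D = 0.9` comes from the text above, while `clutter_density`, `sigma_mm`, and all function names are assumed values for illustration.

```python
import math
from typing import Iterable, Optional, Sequence


def track_llr(
    residuals_mm: Iterable[Optional[float]],
    p_d: float = 0.9,                # detection probability (value quoted in the chain above)
    clutter_density: float = 1e-9,   # ASSUMED false-alarm density per mm^3 per frame
    sigma_mm: float = 50.0,          # ASSUMED isotropic measurement std
    dim: int = 3,
) -> float:
    """Recursive track score in nats: one additive term per frame.

    hit  -> ln(P_D * N(r; 0, sigma^2 I) / clutter_density)
    miss -> ln(1 - P_D)
    """
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * sigma_mm ** 2)
    llr = 0.0
    for r in residuals_mm:
        if r is None:  # missed detection this frame
            llr += math.log(1.0 - p_d)
        else:
            log_like = log_norm - 0.5 * (r / sigma_mm) ** 2
            llr += math.log(p_d) + log_like - math.log(clutter_density)
    return llr


def pairwise_auc(real_scores: Sequence[float], ghost_scores: Sequence[float]) -> float:
    """Fraction of (real, ghost) pairs in which the real track scores higher (ties count half)."""
    pairs = [(r, g) for r in real_scores for g in ghost_scores]
    return sum(1.0 if r > g else 0.5 if r == g else 0.0 for r, g in pairs) / len(pairs)


sustained = track_llr([20.0] * 100)  # long track with well-fitting observations
brief = track_llr([20.0] * 5)        # short track with the same per-frame quality
auc = pairwise_auc([sustained], [brief])
```

Because the score is a sum over observations, two tracks with the same per-frame quality differ mainly by length. That is consistent with the residual described above: a brief real glimpse and a brief phantom accumulate similar totals, so track-level features alone do not separate them.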

---

## Axis 6 — Stage-2 / motion head (priority: consistency > recall)

### 6.1 Constant-offset lever (`ad/offset.py`)
- **Avenue.** Fit the systematic 3-D bias vector (visual box-centre vs body-mounted marker + residual
  calibration) on one flight (R02), apply it causally to a held-out flight (R03).
- **Why.** The gap between raw and de-biased median error is a near-constant transferable bias — recover
  de-biased precision online without peeking at validation GT.
- **Depth.** Full — validated on the only two clean flights (R02/R03).
- **Outcome / evidence.** **WON.** Raw median **119.1 → 49.8 mm** (a 58% cut), hitting R03's own
  de-biased floor (49.1 mm). Bias ≈ `[2.1, −20.7, −104.7] mm` (mostly vertical). Pinned 115.5/41.5
  invariants intact.
- **Status.** Kept.
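
The fit-on-one-flight, apply-to-another pattern can be sketched as follows. The estimator is an assumption: the text above does not say how the bias vector is fitted, so this uses a per-axis median of the residuals. Function names and the toy coordinates are hypothetical and unrelated to the R02/R03 data.

```python
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def fit_offset(estimates: Sequence[Point], ground_truth: Sequence[Point]) -> Point:
    """Per-axis median of (estimate - ground truth) on the fitting flight (ASSUMED estimator)."""
    return tuple(
        median(e[i] - g[i] for e, g in zip(estimates, ground_truth)) for i in range(3)
    )


def apply_offset(estimates: Sequence[Point], bias: Point) -> List[Point]:
    """Subtract the fitted bias; uses no ground truth from the flight being corrected."""
    return [tuple(p[i] - bias[i] for i in range(3)) for p in estimates]


# Toy fitting flight: a constant bias plus small per-sample noise.
gt_fit = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 100.0, 0.0)]
est_fit = [(3.0, -20.0, -101.0), (102.0, -20.0, -100.0), (1.0, 80.0, -99.0)]
bias = fit_offset(est_fit, gt_fit)

# Held-out flight: apply the bias fitted elsewhere.
corrected = apply_offset([(502.0, 480.0, 900.0)], bias)
```

The key property is that `apply_offset` needs only the stored bias vector, so the correction can run online on a flight whose ground truth is never consulted.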

### 6.2 Deeper / FPN multi-scale backbone
- **Avenue.** Grow the backbone (FPN multi-scale heads) — one stride-4 head can't serve a 12px speck and
  a frame-filling blob; ~100× real-time headroom to spend.
- **Why.** Targeted capacity fix for the small/faint-secondary-drone drops.
- **Depth.** Trained ablation (tiny vs deep on the same sim+real mix).
- **Outcome / evidence.** **NOT A WIN (2026-06-18).** OOD recall **0.53 → 0.49** (worse), prec 0.43 →
  0.35, R12 0.54 → 0.50, and unstable per-epoch. Both arch/input levers (flow, FPN) failed to move OOD
  recall → confirms **data/task-limited, not capacity-limited** (~0.5 OOD ceiling is inherent sim
  difficulty). TINY stays the backbone; `arch="deep"` code kept, off by default.
- **Status.** Plateaued / killed as the live arch.

### 6.3 Cooperative motion + persistence head (4th head, two-stage)
- **Avenue.** Add a 4th CenterNet head off the GRU predicting per-cell displacement `(dx, dy)` +
  persistence; train two-stage with the backbone *frozen*. Supervision = sim track-ids (exact next-frame
  displacement + disappear events), then real single-target clips.
- **Why.** Priority is **consistency > recall** — don't drop-and-re-acquire a track under a new id. A
  learned next-position predictor + a *per-blob* (multi-blob-safe) presence gate, not a global presence
  classifier (which would be a single-target leak).
- **Depth.** Full — built (`MoverNet(motion_head=True)` + `train_motion.py`), sim- *and* real-supervised.
- **Outcome / evidence.** **BUILT + validated.** Sim held-out: displacement err **1.4 px**, persistence
  acc **0.983**. After the sim→real fix (real single-target clips), held-out **real Anti-UAV410:
  displacement err 0.4 px, persistence acc 0.995**, so the head transfers to real drones. Ckpt
  `movernet_motion_real_gn.pt`.
- **Status.** Built; the displacement payoff is real but not yet integrated (see 6.5).

### 6.4 Persistence as a detection gate
- **Avenue.** Wire the persistence head in as a detection gate (`persist_thresh` on `learned.detector`
  and `multi_target`).
- **Why.** A learned hold-vs-drop gate to cut transient ghosts.
- **Depth.** Ablated on real R12_D4 (gate off / 0.5 / 0.8).
- **Outcome / evidence.** **NO-OP on real.** Identical (15 tracks / 11 ghosts / 9 id-switches / 0.66 cov)
  — persistence saturates at ~0.99 on real detected peaks (real detected drones genuinely persist; the
  head never flags a vanish), and R12 id-switches come from the correspondence layer + noisy peaks, not
  transient detections. The gate is therefore **validated on sim only.**
- **Status.** Neutral / no-op on real.

### 6.5 Displacement → tracker predict/gate integration
- **Avenue.** Wire the (real-validated, 0.4px) displacement prediction into the `MultiTargetTracker`
  predict/gate — a learned non-linear next-position predictor in place of constant-velocity Kalman.
- **Why.** Displacement transfers to real where persistence didn't; it's the head's real payoff and
  directly targets the consistency (id-switch) axis.
- **Depth.** Not yet started — needs tracker surgery.
- **Status.** **OPEN** (the de-risked next step; to do *with* the user, not unattended).

---

## Current frontier (still open)

The system is now understood to be **data/ambiguity-limited, not capacity-limited** — arch and input
levers (FPN, flow) did not move OOD recall, and the B gate (5.6) pins the multi-target limiter on
**detection precision (false positives / ghosts)**, not association. So the open work is precision +
data + track-level continuity, not a bigger net.

### The precision ladder (corrected-B — the live plan, ordered)

The B gate (5.6) reframed the next move: the measured bottleneck is **false POSITIVES (58 ghosts)**, so
the live plan is a **precision ladder** aimed at killing ghosts, *not* a smarter tracker. Ordered
cheapest/safest → costliest/bottleneck-breaking; **steps 1–3 are cheap, need no retrain, and are
bottleneck-safe (no appearance).** Rungs 1–2 are now done: **threshold was a partial win, the geometric
gate was killed**, and the takeaway is that **the backend levers are exhausted → the fix is UPSTREAM**
(detection / data), realized as the change-signature sim (3.8).

1. **Threshold / calibration sweep** — **PARTIAL WIN.** `multi_thresh` **0.15 → 0.30** cut R12_D4 ghosts
   **58 → 19** with **coverage HELD** (0.690 → 0.699); the operating point is now **0.30**. But **19 ≫**
   clean-YOLO's **3**, so threshold *alone* is insufficient. *Grey / partial.*
2. **Multi-view geometric consistency gate** — **KILLED.** The (red-team-proposed) soft joint multi-view
   reprojection-residual at `k=2` was tested via a `max_resid_mm` sweep (80 / 50 / 35 / 25 / 15 at
   `thresh=0.30`). It **failed**: ghosts stuck at **18–20** (80 = 50 = 19, 35 → 20, 25 → 18) and at the
   tightest **15** they **EXPLODE to 31** by fragmenting real tracks; coverage never improves. The ghosts
   are **low-residual, geometrically-consistent phantoms** — exactly the red-team's coplanar / degenerate
   warning — so **backend geometry cannot separate them.** (Note: `k=3` was already killed in 5.1; this
   soft-residual `k=2` version is now killed too.) *Red / killed.*
   - **Takeaway:** backend levers are **EXHAUSTED** (threshold helped a little, geometry can't) → **the
     fix is UPSTREAM** in detection / data, not the backend. This routes the live plan to the
     **change-signature sim (3.8)**.
3. **Mask / MOG2 cleanup + temporal-persistence filter** — the "mask isn't working well" hypothesis: the
   foreground fires on clouds / foliage / sensor noise. Clean the MOG2 mask and require temporal
   persistence before a peak is admitted.
4. **Hard-negative training (real clutter)** — extends env-realism (Axis 4) with real clutter as
   negatives. **Needs a retrain** (no longer free).
5. **Appearance verifier** — **LAST, and it breaks the bottleneck.** Only as a late, non-learned
   structural check (cf. 1.6) if 1–4 leave residual ghosts; never an input channel to the learned net.

### Other open items (track-level continuity / data)

- **`neg_prob` sweep (recover coverage)** — `a2neg` at `neg_prob=0.3` is **over-conservative** (coverage
  0.569; 3.15). Sweep `neg_prob` to find the point that keeps the perfect jitter + precision wins while
  recovering coverage.
- **Leakage control (a2neg) — DONE (3.12), SPLIT VERDICT.** The double control
  (`realchangesig__edgejitter__neg__nodpjaitlibs`, libs rebuilt without dpjait) confirmed the
  **precision/ghost gain is REAL** (precision 0.697 → 0.678, still > 0.645; ghosts 15 → 18 ≈ 19) but the
  **perfect jitter 0.00 was dpjait-LIBRARY leakage** (→ 0.87 without dpjait libs, worse than env_nodance's
  0.13). a2neg is downgraded to a precision/ghost fix; env_nodance reinstated as the outdoor-jitter keeper.
- **Displacement → tracker integration (6.5)** — wire the real-validated 0.4px displacement head into
  the tracker predict/gate. The de-risked continuity step; addresses false NEGATIVES, not ghosts.
- **Closed-loop 3-D-reprojected prior heatmap** — render confirmed 3-D tracks back into every camera as
  an extra input channel (track-level continuity, trained with prior-dropout so it isn't a crutch). The
  "don't drop them" fix that flow at the input level failed to deliver.
- **Epipolar-guided cross-camera score boost** — let a strong peak in one view lower the threshold along
  the epipolar line in another (light multi-view fusion, no BEV retraining cost).
- **Color temporal-diff (1.5)** — a proposed RECALL lever; queued *after* the precision ladder.
- **SAM3 / EdgeTAM identity-tracking teacher** — **SCAFFOLDED, not yet validated.** EdgeTAM and SAM3.1 are
  implemented as optional Track-Anything teacher backends (`track_anything.py`): EdgeTAM (Apache-2.0 —
  throughput / clean license, the **default bulk** labeller) and SAM3.1 (SAM-License + **HF-gated** —
  quality, for hard / mid-entry clips, **blocked on gate acceptance**). The label-quality comparison was
  attempted but **all backends failed/skipped** (sam2 API drift `init_state_from_frames`, DEVA
  `state_dict` mismatch, EdgeTAM config missing, SAM3 gated) → **deferred, needs integration debugging.**
- **Learned CHANGE-frame generator (Phase 2)** — a diffusion / GAN / transformer generator of realistic
  2-ch change frames (starter kit at `etc/motion_bg_generator/`). The **scale play after** the
  hand-crafted change-signature fix (3.8). **Gated on:** the Phase-1 change-signature dataset existing (it
  now does) **AND** Phase-1 not already closing the gap. *Yellow / proposed.*
- **Scale sim data / epochs** — push the per-frame OOD recall via more sim, since arch/input levers are
  exhausted.
- **Full multi-view BEV fusion** — the heavy option; gate behind a multi-rig / simulated-placement
  validation only if everything above plateaus (rig-overfit risk).
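
The epipolar-guided score boost listed above is still only a proposal, so the following is a sketch of one possible shape rather than a description of existing code. It assumes a fundamental matrix `F` with the convention `x2ᵀ F x1 = 0`, and the `band_px` and `relief` values are made up; only the 0.30 base threshold echoes the operating point from the precision ladder.

```python
import math
from typing import Sequence, Tuple

Pixel = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]


def epipolar_distance(F: Matrix3, x1: Pixel, x2: Pixel) -> float:
    """Pixel distance from x2 (view 2) to the epipolar line F @ [x1, 1] of x1 (view 1)."""
    u, v = x1
    a, b, c = (F[r][0] * u + F[r][1] * v + F[r][2] for r in range(3))
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)


def boosted_threshold(
    base: float,
    strong_peak: Pixel,     # confident peak in view 1
    candidate: Pixel,       # weaker peak in view 2
    F: Matrix3,
    band_px: float = 3.0,   # ASSUMED half-width of the band around the epipolar line
    relief: float = 0.1,    # ASSUMED amount by which the threshold is lowered
) -> float:
    """Lower the detection threshold for candidates lying near the strong peak's epipolar line."""
    if epipolar_distance(F, strong_peak, candidate) <= band_px:
        return base - relief
    return base


# Toy geometry: identity intrinsics and pure sideways translation, so epipolar lines are
# image rows (y2 = y1).
F_sideways = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
near = boosted_threshold(0.30, (10.0, 20.0), (50.0, 22.0), F_sideways)
far = boosted_threshold(0.30, (10.0, 20.0), (50.0, 40.0), F_sideways)
```

Note that a phantom formed by crossing rays also lies on an epipolar line by construction, so a boost of this kind would need to be weighed against the ghost findings in 5.9 before adoption.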

### Learned track-management POLICY (imitation → RL) — DEFERRED

- **Avenue.** Birth/confirm/coast/drop + id-assignment as a sequential decision policy (imitation first,
  using GT track-ids from MOT/DanceTrack; RL only for long-horizon trade-offs imitation can't express).
- **Why deferred.** The 2-D track-**maintenance** refinement addresses false **NEGATIVES**
  (dropped/coasted tracks). But the **measured** bottleneck (B gate, 5.6) is false **POSITIVES**
  (58 ghosts), which a maintenance policy **cannot remove** — it can only decide how to handle the
  detections it is given. → **RL/policy is DEFERRED** until detection precision is fixed (the ladder above)
  *and* any residual problem is shown to be association/continuity rather than precision.
- **Status.** Deferred. **The one unconditional fit for RL remains turret control** (a genuine sequential
  control problem, unlike track maintenance, which is currently precision-bound).

Also still nominally open in the registry: **O1** (box+point vs box-only head — built as box + offset,
the offset *is* the point head) and **O3** (a thin appearance channel — default still no, the bottleneck
holds).