sinew R2S2R Research Report — Force-from-Vision on FMB

1. Executive summary

TL;DR (5 bullets)

v2f predictor is the ship artifact. RGBD → end-effector wrench. RL trains a policy in sim purely so we can record diverse (image, force) pairs; the policy itself is not deployed. This reframe inverts how the original plan described the project.
Sim F/T is a direct training signal. Isaac contact reporter via the locked read_eef_wrench_ee API. Never vision-predicted during training. Real Franka noise / bias / lag is modelled on the recorded label, not on inputs — the two-gap separation (visual gap closed by DR, F/T gap closed by label noise) is load-bearing throughout.
BC warmstart is dropped. The χ²=1.88 sim-to-real visual gap (with edge density +39% and 35–62% brightness skew on real) means BC trained on FMB-real frames cannot transfer onto sim-camera obs. Curriculum + sub-expert mixture compensates; ~10–20% wall-clock penalty traded for clean scope.
Stage-2 real fine-tune is non-optional. Wrist-cam content (hand, cables, peg layer-lines, lab clutter) cannot be simulated to χ² ≤ 1.0 without scope creep that's larger than the fine-tune itself. The scene plan brings sim to χ² ≈ 1.0–1.2; real fine-tune of the direction and gate heads closes the rest.
Research epic closed; implementation is next. sinew-1 (env setup, 5 children) and sinew-5 (research, 14 v1 children plus 8 reopened-wave children plus the synthesis ticket) are all closed. The next phase is five impl waves (substage + F/T surface → PPO + corrupted-action mapper → data-gen pass → v2f training → real-robot eval).

Binding cross-cutting locks

These are the deliverables from the research epic. Every row is an irreversible choice that downstream impl tickets honor without re-litigating.

Lock	Locked value	Origin
End deliverable	v2f predictor (RGBD → wrench). The RL policy is not shipped.	reframe; `project-v2f-is-end-goal`
Sim F/T provenance during training	Isaac contact reporter via `read_eef_wrench_ee`. Never vision-predicted.	sinew-5.21
Two-gap separation	Visual gap → heavy visual DR (no dynamics DR). F/T gap → noise/bias/lag on the recorded label, not the input.	sinew-5.13, sinew-5.16, sinew-5.21
Noisy vs clean wrench usage	Noisy → policy obs + v2f wrench-head label. Clean → reward + substage detector + direction-head label.	sinew-5.21, sinew-5.22, sinew-5.23
BC warmstart	DROPPED. χ²=1.88 visual gap invalidates BC-from-real-demos.	sinew-5.22 §2 (supersedes sinew-5.12)
Stage-2 real fine-tune	Non-optional. Direction + gate heads only; backbone, wrench head, contact-point head frozen.	sinew-5.16, sinew-5.23 §3.2
Disturbance recipe	SimDist action-only burst noise applied at data-gen time only (not during PPO training). Per-DOF σ ∈ [0.02, 0.30] m / [0.02, 0.40] rad; gripper bit excluded; 2.5% never-noised envs.	sinew-5.18, adopted by sinew-5.22 §3
Reward Φ_insert v2	Adds `-0.2 · \|\|f_clean\|\|` gated on `d_xy < r_align AND d_z < z_align`. No "degenerate-when-zero" branch; force signal is reliable.	sinew-5.22 §1 (supersedes sinew-5.1 §3.5)
Direction is the load-bearing head	L1 on unit-vec coords, λ=1.0. Wrench-magnitude head λ=0.1. Direction Matters recipe.	sinew-5.9, sinew-5.23 §1
Force / quat / image conventions	F/T in EE frame everywhere. Quat `(qx, qy, qz, qw)`. Images BGR on disk → RGB at parse time. `side_{left,right}`, `wrist_{left,right}` camera names.	sinew-1.5, sinew-5.7
Action contract	7-vec EE-delta normalized [-1, 1], scaled ±0.06 m / ±0.25 rad / gripper bit, @ 10 Hz, base frame, Plan A DifferentialIK at `panda_hand`.	sinew-2 + sinew-3
RL evaluation	IQM + 95% stratified bootstrap CIs (never mean/median); P(A>B) > 0.7 to claim improvement; N=5 seeds default at `{0, 7, 42, 314, 2718}`.	sinew-5.2 (rliable)
Recording resolution	224² (DINOv2-S patch match) is the v1 default; 256² is opt-in ablation.	sinew-5.6
Camera subset (v2f)	`2mixed_rgbd` minimum (`side_left + wrist_left`); `3cam_rgbd` for data leverage.	sinew-5.5 winner
FMB trajectory replay reliability	Not reliable in current sim. Substage verification falls back to a state-only adapter (calibration-only, never in the training loop).	sinew-5.20 (honest negative)

State of the world

closed sinew-1 env setup — Franka articulates, FmbInsertionEnv runs end-to-end on a 7-vec EE-delta action layer (Plan A DifferentialIK), gripper drives are live, 5/5 tests green.
closed sinew-5 research epic, both waves — 14 v1 children + 8 reopened-wave children (5.16–5.23) + 5.14 synthesis. 21 design memos under docs/research/; 3 reference code repos under isaac_twins/references/.
next Implementation phase — five waves enumerated in §7. None of the impl tickets are filed yet; by convention they get filed when the research epic closes and the team transitions to impl.

Goal	Role under the reframe	Validation criterion
Goal 1 — RL data factory	Train a PPO policy in sim that solves the FMB insertion task well enough to produce diverse trajectories. Curriculum: grasp → +place → +rotate → +regrasp → +insert. Sub-expert checkpoints from every 50 iters are kept for data-gen mixture. The policy is not the deliverable.	Data-gen yield: ≥ 1.6M (image, force) pairs, ≥ 5% contact transient frames, force coverage [0, 30] N, per-cam diversity χ² ≥ 0.3.
Goal 2 — Sim dataset	Record sim rollouts with full FMB-RLDS-parity observations plus sim ground truth: clean and noisy EE wrench, per-pair contact info, peg/board world poses, per-camera intrinsics/extrinsics. Per-episode DR. Substage labels written from the canonical detector. NAS pipeline to ferry zarr/HDF5 to the DL_A6000 for training.	5–10k `insert`-primitive rollouts (~243 GB at 224²), schema 1:1 with FMB RLDS + namespaced `obs/sim_*` extras, validator clean.
Goal 3 — v2f predictor	Train an RGBD → wrench predictor with three task heads (contact-point on wrist cam, force direction unit-vec, 6D wrench) plus an in-contact gate. Frozen DINOv2-S backbone, multi-view fusion. Two-stage training: sim pretrain (all heads) → real fine-tune (direction + gate only).	Direction cos-sim on real, tiered: aspirational ≥ 0.70 global / ≥ 0.60 per-shape min; acceptable ≥ 0.60 / ≥ 0.45 with +0.05 fine-tune lift; soft-fail → per-shape ensembles; hard-fail → halt and audit. Primary metric is monotonic improvement until χ² is re-measured post-DR.

Gap	What it is	How sinew closes it
Visual gap	Sim renders too clean: χ²=1.88, edge density +39% in real, brightness skew 35–62%. Wrist cams worst (hand, cables, peg layer-lines).	Heavy visual DR (sinew-5.13) plus Tier-1/Tier-2 scene authoring fixes (sinew-5.16 + addenda Q3). Stage-2 real fine-tune absorbs what can't be simulated.
F/T gap	Real Franka `K_F_ext_hat_K` is noisy + biased + lagged. Sim contact reporter is pristine.	Noise/bias/lag model on the recorded label (sinew-5.21). Predictor sees clean sim images and learns to match noisy real-Franka F/T. No domain adaptation for force.

Issue	Owner	Deliverable
sinew-1.1	Researcher	`docs/researcher.md` + `docs/fmb_reference.md`: FMB benchmark deep-dive + force-from-vision landscape v1.
sinew-1.2	SimWorker	`docs/sim_worker.md`: `isaac_twins` audit (API surface, known limitations, perf benchmarks).
sinew-1.3	RLWorker	`isaaclab_sinew/` bootstrap (pixi + IsaacLab v3.0.0-beta + `FmbInsertionEnv` skeleton).
sinew-1.4	SimWorker	Re-baked three USDs for Isaac Sim 6.0 asset paths after the `based_robotics` repo rename. Gripper-unpin landed as a side effect.
sinew-1.5	Researcher	`fmb_reference.md §3.1`: pin EE frame as canonical wrench convention; 6×6 EE→base adjoint applied only at the FMB-checkpoint boundary.

Issue	Owner	Deliverable
sinew-2	RLWorker	`docs/rl_action_layer_sketch.md`: 7-vec EE-delta action layer design (Plan A DifferentialIK, Plan B serl-impedance contingency).
sinew-3	RLWorker	`EEDeltaActionMapper` class wired into `FmbInsertionEnv`. Jacobian sourced via `art._articulation_view.get_jacobians()` at `panda_hand`. 5/5 tests green.
sinew-4	RLWorker	Env `_get_observations` bug fix: `frames[cam]` for single-env is ndarray, not `list[ndarray]`. Dropped stale `[0]` index; documented the multi-env migration path.
isaac_twins-36	SimWorker (backlog P3)	Pin production asset root explicitly — staging-S3 is currently 200-OK but transient.
isaac_twins-37	SimWorker (backlog P3)	Bake `peg_tip_local_offset` attribute on each peg USD at author time. Required by the `insert` predicate.

4. Goal 1 — RL as a data factory

Six research deliverables shape the RL workstream: evaluation protocol, substage detection (with gate verification per the addenda), reward design v2 with clean wrench, the historical BC record, the SimDist disturbance recipe, and the four trajectory-label classes the recorder writes.

4.1 Evaluation protocol

Owner: RLWorker · sinew-5.2 · docs/research/rl_eval_protocol.md · Refs: references/eval/rliable, Andrychowicz 2020.

The R2S2R compute budget forces a permanent few-run regime (3–10 seeds per config). Per rliable §4, point estimates at this seed count carry only 50–70% probability of being real improvements. The protocol stays defensible by reporting interquartile mean (IQM) with stratified bootstrap CIs.

Choice	Value	Why
Aggregate metric	IQM + 95% stratified bootstrap CIs	Mean is outlier-dominated; median is insensitive (it can stay unchanged even when half the tasks score zero); IQM trims the top and bottom 25% and averages the rest.
Improvement claim	P(A > B) > 0.7 from rliable	P ∈ (0.5, 0.7] inconclusive at N=5; P ≤ 0.5 no-difference or regression.
Seed budget	N=5 default at `{0, 7, 42, 314, 2718}`; bump to N=10 only on decision-gating overlap	Selection-bias safe; disaster-stopped runs still count as seeds.
Per-seed score	AUC of per-step eval-return curve (Andrychowicz §2)	Rewards data-efficient policies, not just final return.
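
The aggregation in the table can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the rliable implementation: the function names are hypothetical, the trimming rule (drop `int(0.25 * n)` scores from each tail) is an assumption chosen to mimic a 25% trimmed mean, and P(A > B) is computed as the pairwise win fraction with ties counted as half.

```python
import random
from statistics import mean

SEEDS = (0, 7, 42, 314, 2718)  # default seed set from the table


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25%, average the rest."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of scores trimmed from each tail
    return mean(ordered[cut:len(ordered) - cut])


def stratified_bootstrap_ci(scores_by_task, reps=2000, alpha=0.05, seed=0):
    """Percentile CI for the pooled IQM; seeds are resampled within each task."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        pooled = []
        for task_scores in scores_by_task.values():
            # Stratified: resample with replacement inside each task separately.
            pooled.extend(rng.choices(task_scores, k=len(task_scores)))
        stats.append(iqm(pooled))
    stats.sort()
    lo = stats[int((alpha / 2) * reps)]
    hi = stats[int((1 - alpha / 2) * reps) - 1]
    return lo, hi


def prob_of_improvement(a, b):
    """P(A > B) over all seed pairs, ties counted as half."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return wins / (len(a) * len(b))


def claim_improvement(a, b, threshold=0.7):
    """Decision rule from the table: claim only when P(A > B) > 0.7."""
    return prob_of_improvement(a, b) > threshold
```

Each entry of `scores_by_task` would hold one per-seed score (for example the AUC of the eval-return curve) per seed in `SEEDS`.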

Logging cadence

Every	Metrics
10K control steps	`train/return_mean`, `policy_loss`, `value_loss`, `entropy`, `grad_norm`, `kl_div`, `explained_variance`, `env/safety_clip_count`, `sys/sps`
100K control steps	`eval/return_mean` (100 episodes, stochastic policy), `eval/success_rate`, `eval/tcp_to_peg_dist_mean`

PPO defaults (Plan A, Andrychowicz "competitive base")

Choice	Value	Choice	Value
clip ε	0.25	γ	0.99
GAE λ	0.9	activation	tanh
policy MLP	2 × 64	action transform	tanh (NOT clip)
value MLP	2 × 256	initial action std	0.5
obs norm	YES (crucial)	advantage norm	per-minibatch
value loss clip	NO (hurts)	optimizer	Adam, lr=3e-4
Top "surprising" finding from Andrychowicz: initialize the last policy layer with weights 100× smaller.

SAC is the fallback only if PPO plateaus below 50% success after the 5-seed pass.

Two-stream evaluation under the reframe

Per sinew-5.22 §5 the eval surface forks into two streams:

PPO training eval — unchanged. The sinew-5.2 protocol gates curriculum advancement.
Data-gen pass eval — new. The product is (image, force) pairs, not return. Yield gates: ≥ 1.6M total pairs, ≥ 5% contact-transient frames, F/T magnitude coverage [0, 30] N, per-cam diversity χ² ≥ 0.3.
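
The data-gen yield gates can be expressed as a single check over summary statistics. The sketch below is hypothetical: the function name and argument names are not from the memos, and "force coverage [0, 30] N" is interpreted here as the observed magnitude range reaching both ends of the interval (within a small tolerance at the low end), which the source does not define precisely.

```python
def data_gen_yield_gates(total_pairs, contact_transient_frac,
                         force_min_n, force_max_n, per_cam_chi2,
                         force_floor_tol_n=0.5):
    """Return gate name -> bool for one data-gen pass.

    per_cam_chi2 maps camera name to its diversity chi-squared.
    force_floor_tol_n is an assumed tolerance, not a project value.
    """
    return {
        "pairs": total_pairs >= 1_600_000,
        "contact_transients": contact_transient_frac >= 0.05,
        # Assumed reading of "coverage": samples near 0 N and at or above 30 N.
        "force_coverage": force_min_n <= force_floor_tol_n and force_max_n >= 30.0,
        "per_cam_diversity": all(c >= 0.3 for c in per_cam_chi2.values()),
    }


def data_gen_pass_ok(**stats):
    """The pass is accepted only when every gate holds."""
    return all(data_gen_yield_gates(**stats).values())
```

Returning the per-gate dict rather than one boolean keeps the failing gate visible in logs.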

4.2 Substage detection and gate verification

Owner: SimWorker · sinew-5.3 · sinew-5.19 · addenda Q1+Q2 (2026-05-21) · sim_substage_detection.md + substage_verification.md

FMB upstream has no automated substage detection: the operator hits Enter to advance primitives at rollout time (sequential_rollout.py:250). Sim has full ground truth, so deterministic detectors are tighter than the real procedure; the resulting deltas from FMB are documented as intentional.

Sensor surface

Three Isaac Lab ContactSensor instances using filter_prim_paths_expr cover all five FMB primitives plus transitions:

finger_contact — sense = panda_(left|right)finger, filter = peg. Antagonistic finger-force pattern → grasp closed.
peg_contact — sense = peg, filter = [board, fixture, both fingers, Bin]. This is the substage-defining signal: per-partner zero/non-zero distinguishes "peg on floor" / "peg on fixture" / "peg in hole".
board_contact — sense = board, filter = Bin. Sanity heartbeat — board shouldn't tip during insertion.

Per-primitive predicates

Primitive	Success predicate
`grasp`	antagonistic finger contact (force dot < -0.7), `F_grip` > 1.0 N, peg z > 5 cm above bin, no slip ≥ 83 ms (10 physics ticks @ 120 Hz)
`place_on_fixture`	peg↔fixture force > 0.3 N, peg↔fingers force = 0 (released), peg vel < 5 mm/s, peg z in fixture-height window
`rotate`	peg long-axis rotation > 90° from entry, peg-on-fixture maintained, ±15° verticality. Axisymmetric pegs auto-pass.
`regrasp`	grasp.success + peg long-axis vertical ±15°
`insert`	peg↔board force > 0.3 N, peg-tip z within ±3 mm of hole bottom, peg xy within 5 mm of hole center, verticality ±20°, stable 10 steps

The SubstageDetector class lives at isaac_twins/src/isaac_twins/fmb/substage.py. It exposes {p}_success(), {p}_failure(), transition_ready(p) (success AND TCP at z_safe ≥ 0.20 m), and diagnostics(). Reward authors never recompute distances: the detector is the single source of truth, which eliminates reward-vs-eval drift by construction.
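
As an illustration of how one row of the predicate table turns into code, here is a minimal sketch of the `insert` predicate with its 10-step stability requirement. It is a hypothetical stand-in, not the real SubstageDetector: the class name, argument names, and the assumption that `update` is called once per detector step are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class InsertPredicate:
    """Hypothetical stand-in for the `insert` row of the predicate table."""
    force_min_n: float = 0.3      # peg-board force must exceed this
    tip_z_tol_m: float = 0.003    # peg-tip z within +/- 3 mm of hole bottom
    xy_tol_m: float = 0.005       # peg xy within 5 mm of hole center
    tilt_max_deg: float = 20.0    # verticality window
    stable_steps: int = 10        # consecutive passing steps required
    _streak: int = 0

    def update(self, peg_board_force_n, tip_z_err_m, xy_err_m, tilt_deg):
        """Feed one step of measurements; return True once the pose is stable."""
        ok = (peg_board_force_n > self.force_min_n
              and abs(tip_z_err_m) <= self.tip_z_tol_m
              and xy_err_m <= self.xy_tol_m
              and abs(tilt_deg) <= self.tilt_max_deg)
        # Any failing step resets the streak, so success needs an unbroken run.
        self._streak = self._streak + 1 if ok else 0
        return self._streak >= self.stable_steps
```

Keeping the streak counter inside the predicate is one way to make the stability condition deterministic and replayable from recorded state.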

Q1 (addenda): are the gates meaningful before execution?

Beyond the three core tests already specced in sinew-5.3 §6 (offline unit, runtime smoke, negative case), five additional checks are ranked by ROI. The recommendation lands three of them before RL kickoff (~2 days total):

#	Check	Cost	What it catches	Verdict
1.2	Threshold-envelope startup assertion	1 hr	Silent misconfig (e.g. `F_grip_min` tuned to 100 N during debug and forgotten)	land pre-RL
1.5	Inverted-physics sanity probes (5 per primitive, "should say False")	1 day	FP catches: zero-force fake grasps, one-sided pegs, jammed-not-lifted, halfway-not-seated	land pre-RL
1.4	Temporal smoothness flag (predicate flips > 3× per primitive window)	2 hr	Sensor-noise artifacts masquerading as substage transitions	land pre-RL
1.3	Detector ↔ recorded-label cross-check (two implementations compared)	0.5 day	Hidden state / non-determinism in detector across calls	defer unless non-determinism observed
1.6	Per-shape threshold ablation	1 day	Per-peg-size FP/FN drift	defer until FMB raw arrives

Landing the threshold-envelope + inverted-physics + temporal-smoothness checks lifts predicate confidence high enough to trust the recorded obs/sim_substage_predicate as a v2f gate-head training label.
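
Two of the pre-RL checks are small enough to sketch directly. The code below is a hypothetical illustration: the envelope bounds are made-up placeholders (only the name `F_grip_min` and the "more than 3 flips" rule come from the table), and the function names are not from the memos.

```python
# Hypothetical sane ranges per threshold; placeholder values, not project settings.
THRESHOLD_ENVELOPE = {
    "F_grip_min": (0.1, 10.0),  # N
}


def assert_thresholds_in_envelope(thresholds, envelope=THRESHOLD_ENVELOPE):
    """Startup assertion (check 1.2): fail fast on silently mis-tuned thresholds."""
    for name, value in thresholds.items():
        lo, hi = envelope[name]
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside envelope [{lo}, {hi}]")


def count_flips(predicate_trace):
    """Number of False<->True transitions in a per-step predicate trace."""
    return sum(1 for prev, cur in zip(predicate_trace, predicate_trace[1:])
               if prev != cur)


def is_temporally_noisy(predicate_trace, max_flips=3):
    """Temporal smoothness flag (check 1.4): more than 3 flips in one window."""
    return count_flips(predicate_trace) > max_flips
```

A trace that flips more than three times inside one primitive window would be flagged for audit rather than trusted as a gate-head label.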

Q2 (addenda): FP/FN rates and how to combine state-only with full-state

Primitive	Full-state (canonical) FP	Full-state (canonical) FN	State-only fallback precision	State-only fallback recall
`grasp`	1–3%	2–5%	0.90–0.95	0.85–0.95
`place_on_fixture`	2–5%	5–10%	0.75–0.85	0.80–0.90
`rotate`	5–10%	10–20%	0.40–0.60	0.50–0.70
`regrasp`	3–7%	5–10%	0.80–0.90	0.75–0.90
`insert`	5–10%	10–15%	0.60–0.75	0.65–0.80

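The gap is large: full-state FP/FN rates span 1–20%, while the state-only fallback misses 5–50% of true events (recall 0.50–0.95) and its precision falls as low as 0.40. rotate and insert are worst because peg orientation isn't observable from state alone.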

Recommendation: option (d) for the training loop, (b) for offline audit. Full-state detector is canonical for reward + recorded labels. State-only adapter is offline calibration only — never enters training. Mixing them as a confidence-weighted auxiliary loss (option a) would inject the state-only 5–50% error into the reward gradient, correlated with the actual physics on the weak primitives (rotate, insert) — exactly the wrong shape for reward shaping. State-only's only role is flagging audit-worthy disagreements (option b) and measuring sim-bias when state-only is the proxy for FMB-real labels at stage 2 (option c).

4.3 Reward design v2 (Φ_insert with clean wrench)

Owner: RLWorker · sinew-5.1 superseded in part by sinew-5.22 §1 · rl_reward_design.md + rl_revised_plan.md

Composition: dense potential-based shaping + sparse substage bonus + small action regularizer. Per-primitive activation — only the current primitive's reward contributes per tick.

R_p(s, a, s') = λ_success · 1[p_success(s')]       # sparse terminal
              - λ_failure · 1[p_failure(s')]       # sparse anti-terminal
              + γ · Φ_p(s') - Φ_p(s)               # dense potential-based (Ng 1999)
              - λ_action  · ||a[:6]||²             # small action regularizer
              - λ_clip    · Δsafety_clip_count    # safety-box pressure

PBRS form preserves the optimal policy (Ng 1999): cumulative dense return ≈ Φ(final) − Φ(initial), so the policy cannot bank arbitrary shaping reward.
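
The telescoping behind this claim can be checked numerically. The sketch below sums the discounted shaping term γ·Φ(s') − Φ(s) over a trajectory and compares it with the closed form γ^T·Φ(s_T) − Φ(s_0); the function names are illustrative and the potentials are random placeholders.

```python
import random


def shaped_return(phis, gamma):
    """Discounted sum of F_t = gamma * Phi(s_{t+1}) - Phi(s_t) along a trajectory."""
    return sum(gamma ** t * (gamma * phis[t + 1] - phis[t])
               for t in range(len(phis) - 1))


def telescoped(phis, gamma):
    """Closed form of the same sum: gamma^T * Phi(s_T) - Phi(s_0)."""
    horizon = len(phis) - 1
    return gamma ** horizon * phis[-1] - phis[0]


# Placeholder potentials for a 50-step episode (51 states).
rng = random.Random(0)
phis = [rng.uniform(-5.0, 0.0) for _ in range(51)]
gap = abs(shaped_return(phis, 0.99) - telescoped(phis, 0.99))
print(f"discounted sum vs closed form differ by {gap:.2e}")
```

With γ = 1 the closed form is exactly Φ(final) − Φ(initial). With γ = 0.99 over a 50-step episode the final potential is weighted by 0.99^50 ≈ 0.605, which is why the relation in the text is approximate rather than exact.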

Φ_insert v2 — the load-bearing change post-reframe

Under the reframe sim F/T is a real, reliable signal via read_eef_wrench_ee(art, contact_sensor, noisy=False). The v1 "degenerate-when-F/T-zero" branch is dropped: clean wrench is non-zero only in real contact, so the force term vanishes on its own whenever there is no contact.

d_xy      = ||peg_tip_xy - hole_xy||
d_z       = max(0, peg_tip_z - hole_z_bottom)
align     = 1 - cos²(peg_long_axis, hole_axis)
f_clean   = read_eef_wrench_ee(art, sensor, noisy=False)[:3]   # (3,) clean N, EE frame
f_mag     = ||f_clean||

Φ_insert(s) = -d_xy
            - 1.5 · d_z · 1[d_xy < r_align_xy]
            - 1.0 · align · 1[d_z < z_align_thresh]
            - 0.2 · f_mag · 1[d_xy < r_align_xy AND d_z < z_align_thresh]

Coefficient lifted 0.1 → 0.2 because the signal is now reliable. The force-term gate ensures we only penalize contact force when we should be making controlled contact (inside the alignment and seat-depth window); outside that window force is exploration cost and is not penalized.
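
For concreteness, the gated potential can be written as a pure function of the quantities defined above. This is a sketch under stated assumptions: the function name is hypothetical, `r_align_xy` and `z_align_thresh` are passed in because their values are not given here, and the numbers in the usage lines are illustrative only.

```python
def phi_insert_v2(d_xy, d_z, align, f_mag, r_align_xy, z_align_thresh):
    """Phi_insert v2 as a pure function (distances in m, f_mag in N)."""
    in_xy = d_xy < r_align_xy        # laterally aligned over the hole
    in_z = d_z < z_align_thresh      # within the seat-depth window
    phi = -d_xy
    if in_xy:
        phi -= 1.5 * d_z
    if in_z:
        phi -= 1.0 * align
    if in_xy and in_z:
        phi -= 0.2 * f_mag           # force penalized only inside both gates
    return phi


# Illustrative gate values; not project constants.
R_ALIGN_XY, Z_ALIGN = 0.01, 0.005
far = phi_insert_v2(0.02, 0.05, 0.5, 10.0, R_ALIGN_XY, Z_ALIGN)
seated = phi_insert_v2(0.0, 0.0, 0.0, 5.0, R_ALIGN_XY, Z_ALIGN)
print(far, seated)
```

In the `far` case neither gate is open, so the 10 N contact force does not enter the potential at all; in the `seated` case only the force term remains.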

Constants (starting values, all tunable)

Constant	Value	Rationale
λ_success	20.0	Dominates dense return (≤5 per primitive) by 4×
λ_failure	5.0	Smaller than success; eval cares about success rate
γ (PBRS)	0.99	Matches PPO discount; Ng's theorem requires the same γ
λ_action	0.001	Tiny — just enough to break "wave arm around" ties
λ_clip	0.5	A 50-tick all-clipped trajectory costs 50 × 0.5 = 25, which exceeds λ_success (20)

Curriculum

Phase 1: grasp-only. Pass: N=5 seeds, IQM > 0.7, 95% CIs not crossing 0.5.
Phase 2: + place_on_fixture.
Phase 3+: + rotate, + regrasp, + insert one at a time.

4.4 BC dropped — historical record

Owner: RLWorker · superseded by sinew-5.22 §2 · originals: il_bc_warmstart.md + rl_revised_plan.md §2.

The original plan (sinew-5.12) recommended Option C: BC warmstart → PPO/SAC fine-tune using the 22,550 FMB demos. The χ²=1.88 visual gap (sinew-5.16) invalidated the precondition that FMB-real images are drop-in compatible with sim-camera obs. Three options were considered post-reframe:

Option	Trade-off	Verdict
(a) State-conditioned-only BC (drop image obs)	No visual gap; throws away ~95% of FMB demo signal (images dominate the input dimension). State-only policy can't represent visual feature dependencies.	reject
(b) BC on sim-rendered FMB-replay images	BC sees sim-distribution images directly. Requires sinew-5.20 replay to be production-ready — it is not.	reject
(c) Drop BC entirely	PPO from scratch is harder. Under the reframe (RL = data factory, policy not deployed) "task hard for pure-RL" is tolerable; ~10–20% wall-clock lost, scope clarity gained.	decision

Curriculum + sub-expert mixture compensate for losing BC. The 22,550 FMB demos are kept on the project shelf for future real-robot work but do not enter the sinew RL training loop. The four data-side issues from the original sinew-5.12 spec (frame convention, quat order, F/T mismatch, BGR-on-disk) remain documented as a reference for any future BC revival.

4.5 SimDist disturbance recipe

Owner: Researcher · sinew-5.18, adopted by sinew-5.22 §3 · sim_dist_review.md + cloned code at isaac_twins/references/sim_dist/ (CLeARoboticsLab/simdist).

The only piece of SimDist sinew adopts is the action-only burst noise recipe. The latent world model + MPC planning machinery is out of v2f scope.

Noise scope	Action only. No push/wrench, no observation noise at data-gen.
Per-env σ draw	Once per env at run start (fixed for that env's entire run), sampled from `U[σ_min, σ_max]`.
Burst pattern	On for 1–50 control steps, off for 25–500, alternating. Net ≈ 9% noised time fraction.
Never-noised fraction	2.5% of envs run completely clean — produces the `clean_expert` trajectory label.
Policy mixture	50% expert + 50% from 11 sub-expert PPO checkpoints (iter 0, 50, …, 2000), re-rolled per env at reset.
Per-DOF σ_max	Translation 0.30 (5× per-step limit → saturating perturbation that generates contact transients), rotation 0.40.
Gripper bit	Excluded. Flipping mid-burst drops the peg.
When applied	Data-gen pass only. Not during PPO training — cleaner reward attribution.
Recording	Every step recorded regardless of noise state. Per-env noise flag becomes an HDF5 column.

Adoption cost is ~2 days: an EEDeltaCorruptedActionMapper subclass that wraps the existing mapper (~0.5 d), the burst-state machine, and the HDF5 schema additions (sim_action_noised, sim_policy_iter, sim_never_noised).
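
The burst-state machine can be sketched as a small per-env object. This is an illustration, not the planned EEDeltaCorruptedActionMapper: the class name is hypothetical, and it assumes the action is already in physical units (m / rad), that σ is drawn independently per DOF, and that each env starts in an "off" segment. None of these details are fixed by the recipe table.

```python
import random


class BurstNoise:
    """Per-env action-only burst noise (hypothetical sketch)."""

    def __init__(self, rng, sigma_min=0.02, sigma_max_trans=0.30,
                 sigma_max_rot=0.40, never_noised_frac=0.025):
        self.rng = rng
        self.never_noised = rng.random() < never_noised_frac
        # Sigma is drawn once per env and held for the env's entire run.
        self.sigma = ([rng.uniform(sigma_min, sigma_max_trans) for _ in range(3)]
                      + [rng.uniform(sigma_min, sigma_max_rot) for _ in range(3)])
        self.noised = False
        self.steps_left = rng.randint(25, 500)  # assumed: start in an off segment

    def step(self, action):
        """Return (7-vec action, sim_action_noised flag) for one control step."""
        if self.steps_left == 0:
            self.noised = not self.noised
            self.steps_left = (self.rng.randint(1, 50) if self.noised
                               else self.rng.randint(25, 500))
        self.steps_left -= 1
        out = list(action)
        if self.never_noised or not self.noised:
            return out, 0
        for i in range(6):  # index 6 is the gripper bit and is never touched
            out[i] += self.rng.gauss(0.0, self.sigma[i])
        return out, 1
```

With uniform segment lengths the expected noised fraction is 25.5 / (25.5 + 262.5) ≈ 8.9%, consistent with the ≈ 9% figure in the table. The returned flag is what would populate the `sim_action_noised` column.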

4.6 Trajectory labels for v2f filtering

Owner: RLWorker · sinew-5.22 §4

The recorder writes four orthogonal per-episode labels so the v2f trainer can stratify by data quality at HDF5 load time. Predicted fractions per a 1.6M-pair data-gen pass:

Label	Definition	Predicted fraction	v2f use
`successful`	terminal substage `insert.success() == True`	~60%	high-quality direction labels
`disturbed`	any tick with `sim_action_noised == 1`	~55%	off-policy + contact-transient diversity
`failed`	¬successful	~40%	off-manifold (image, force) coverage; negative gate samples
`clean_expert`	successful AND ¬disturbed AND policy==expert	~1.25%	held-out "nominal-regime" eval slice

Cross-tabulated: ~30% successful + clean, ~30% successful + disturbed, ~25% failed + disturbed, ~15% failed + clean. The trainer's default behaviour is to ignore the flags and train on the full pool (the distribution is already mixed); gate-head loss optionally upweights noised steps by 1.5× because contact-transient frames carry the cleanest gate signal.
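
The label definitions in the table reduce to a few boolean expressions per episode. The helper below is a hypothetical sketch (function and argument names are illustrative); it applies the definitions exactly as tabulated and makes no claim about the predicted fractions.

```python
def episode_labels(insert_success, action_noised_flags, policy_is_expert):
    """Derive the four trajectory labels for one episode.

    insert_success: terminal value of insert.success()
    action_noised_flags: per-step sim_action_noised values (0 or 1)
    policy_is_expert: True when the env rolled the expert checkpoint
    """
    successful = bool(insert_success)
    disturbed = any(flag == 1 for flag in action_noised_flags)
    return {
        "successful": successful,
        "disturbed": disturbed,
        "failed": not successful,
        "clean_expert": successful and not disturbed and bool(policy_is_expert),
    }


def gate_loss_weight(step_noised, noised_upweight=1.5):
    """Optional per-step gate-head weight: 1.5x on noised steps, else 1.0."""
    return noised_upweight if step_noised else 1.0
```

Since `failed` is just the negation of `successful`, only three of the four labels need to be stored per episode.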

5. Goal 2 — sim dataset recording

Seven research deliverables shape Goal 2: the camera subset bench, the unified GT recording spec with the sim F/T sensor, FMB↔sim data matching, the FMB-replay honest negative, the NAS pipeline, the scene visual-gap mitigation plan, and the parallel-development architecture.

5.1 Camera subset bench

Owner: SimWorker · sinew-5.5 · camera_subset_benchmark.md · 48-row JSON at isaac_twins/docs/research/camera_subset_benchmark.json

48-cell sweep: 8 camera subsets × {RGB, RGBD} × N ∈ {1, 4, 8}. Mixed-render steps/s:

Config	N=1	N=4	N=8	Note
phys-only baseline	~1700	~1200	~860	invariant of cam subset — physics cost dominates above N=4
1cam_rgb (any)	~1450	~880	~540	1side ≈ 1wrist at every N — informational not perf choice
2cam_rgb (any pair)	~1000	~550	~370	2side/2wrist/2mixed all equal cost
3cam_rgb	~750	~360	~225	linear-ish drop
4cam_rgb	~580	~250	~146	3 → 4 cam is the perf cliff (1.5× drop at N=8)
5cam_rgbd (4× D405 + overview)	~480	~155	~75	0.62× realtime at N=8 — A6000 needed for N=16+

Winner by consumer

Consumer	Choice	Why
RL training bootstrap	`1wrist_rgb`	Cheapest with contact view; ~1450 / ~880 / ~540 steps/s @ N=1/4/8
v2f data-gen (locked)	`2mixed_rgbd` minimum (`side_left + wrist_left`), `3cam_rgbd` for data leverage	Direction-from-vision wants depth; cross-view fusion needs ≥ 2 cams
Multi-env RL training	`2mixed_rgb` or `3cam_rgb`	Stay below the 4-cam cliff
Recording / replay videos	5cam	Offline replay only; never for training

RGBD costs +5–15% over RGB at fixed cam count — cheaper than the cliff above. The bench appendix documents the nohup + resumable-driver pattern that the NAS recorder inherits.

5.2 GT recording spec + sim F/T sensor (unified data surface)

Owner: SimWorker · sinew-5.6 + sinew-5.21 · sim_recording_spec.md + sim_ft_sensor.md

Sim F/T sensor pipeline (sinew-5.21)

Replicates Panda's K_F_ext_hat_K in four stages. New public API: read_eef_wrench_ee(art, contact_sensor, *, noisy, state, rng) → dict.

contact sensor             coord transform         DR noise model            output
    │                          │                       │                       │
    │ world-frame net force    │ rotate by R_world_EE  │ + bias + Gauss + lag  │
┌───▼───────────┐    EE-frame  ┌▼─────────────────┐    ┌▼──────────────────┐   │
│ clean F_w (3,)│ ── world→EE ─│ F_clean_ee (3,)  │ ──▶│ noisy_lagged_ee   │──▶│ obs/eef_force
│ clean τ_w (3,)│   adjoint    │ τ_clean_ee (3,)  │    │ (3,) + (3,)       │   │ obs/eef_torque
└───────────────┘              └──────────────────┘    └───────────────────┘   │
                                       │                                       │
                                       ├──▶ obs/sim_eef_force_clean (3,)       │
                                       ├──▶ obs/sim_eef_torque_clean (3,)      │
                                       │                                       │
                                       ▼                                       │
                               ‖F_clean‖ > 0.1 N ───▶ obs/sim_in_contact (bool)

Integration discipline

Caller	Wrench used	Why
Reward Φ_insert	`noisy=False`	Deterministic gradient; gate must be exact
SubstageDetector	`noisy=False`	Sim-internal gates; predicates must be sharp
Env policy obs (`tcp_force`, `tcp_torque`)	`noisy=True`	Match real Franka deployment distribution
v2f wrench-head label	`noisy=True`	Predictor learns to match real noisy F/T
v2f direction-head label	`noisy=False` → `f/\|\|f\|\|`	Noising rotates the unit vector — corrupts geometry
Gate label `sim_in_contact`	`noisy=False`, threshold 0.1 N	Deterministic, never noised; shifts to 8 N at real stage-2
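
The last two rows can be illustrated together: both the gate label and the direction label are derived from the clean force only. The sketch below is hypothetical (the function name is illustrative) and assumes that frames below the contact threshold get a zero direction vector, which the table does not specify.

```python
import math


def direction_and_gate_labels(f_clean_ee, contact_thresh_n=0.1):
    """Return (in_contact, unit direction) from the clean EE-frame force.

    The noisy wrench is never used here: noise would rotate the unit vector.
    """
    mag = math.sqrt(sum(c * c for c in f_clean_ee))
    in_contact = mag > contact_thresh_n
    if not in_contact:
        # Assumption: out-of-contact frames carry a zero direction target.
        return False, (0.0, 0.0, 0.0)
    return True, tuple(c / mag for c in f_clean_ee)


in_contact, direction = direction_and_gate_labels((3.0, 0.0, 4.0))
print(in_contact, direction)
```

For the real stage-2 data the threshold argument would be raised to the 8 N value noted in the table.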

Noise model parameters

Force additive Gaussian (per axis)	σ_f = 0.025 N (Franka 0.05 N resolution / 2)
Torque additive Gaussian (per axis)	σ_τ = 0.01 Nm (Franka 0.02 Nm resolution / 2)
Per-episode bias drift	σ_bias_f = 0.05 N; σ_bias_τ = 0.02 Nm (constant within episode)
1st-order low-pass lag	τ_lag = U(20, 80) ms per episode, discrete IIR @ 10 Hz
Scaling mode	`{constant, scaled}` opt-in; default `constant` for v1 corpus
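
A minimal sketch of the `constant` scaling mode with these parameters follows. It is not the sinew-5.21 implementation: the class name is hypothetical, the order (bias and Gaussian noise first, then lag) follows the pipeline diagram, and the discretization α = 1 − exp(−dt/τ_lag) is an assumed choice for the first-order IIR.

```python
import math
import random


class WrenchNoiseModel:
    """Per-episode bias + per-frame Gaussian + first-order lag (hypothetical)."""

    def __init__(self, rng, dt_s=0.1, sigma_f=0.025, sigma_tau=0.01,
                 sigma_bias_f=0.05, sigma_bias_tau=0.02):
        self.rng = rng
        self.sigma = [sigma_f] * 3 + [sigma_tau] * 3
        # Per-episode draws: constant bias and lag time constant.
        self.bias = ([rng.gauss(0.0, sigma_bias_f) for _ in range(3)]
                     + [rng.gauss(0.0, sigma_bias_tau) for _ in range(3)])
        tau_lag_s = rng.uniform(0.020, 0.080)
        self.alpha = 1.0 - math.exp(-dt_s / tau_lag_s)  # assumed IIR gain
        self.state = None

    def __call__(self, wrench_clean):
        """Map a clean 6-vec wrench (Fx, Fy, Fz, Tx, Ty, Tz) to the noisy one."""
        noisy = [w + b + self.rng.gauss(0.0, s)
                 for w, b, s in zip(wrench_clean, self.bias, self.sigma)]
        if self.state is None:
            self.state = noisy  # initialize the filter on the first sample
        else:
            self.state = [y + self.alpha * (x - y)
                          for x, y in zip(noisy, self.state)]
        return list(self.state)
```

One instance would be created per episode so that the bias and lag stay constant within it. Under this discretization at 10 Hz, τ_lag ≤ 80 ms gives α of roughly 0.71 or higher, so the lag is mild at the control rate.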

Recording schema (FMB-parity + sim extras)

The recorder writes FMB-RLDS-parity keys (images, joint_pos/vel, eef_pose/vel/force/torque, action, primitive, language) plus namespaced obs/sim_* extras for v2f training:

obs/sim_eef_force_clean, obs/sim_eef_torque_clean — pre-noise labels paired with the noised obs/eef_force/obs/eef_torque.
obs/sim_in_contact — gate label from contact reporter.
obs/sim_contact_point_local, obs/sim_peg_local_axis — contact-point head targets in peg-local frame.
obs/sim_force_dir_ee — direction-head target, unit-vec EE frame.
obs/sim_substage_predicate — per-primitive boolean vector from the detector.
obs/sim_dr_profile_blob — JSON of per-episode DR knobs (replay reproducibility).
obs/sim_action_noised, obs/sim_policy_iter, obs/sim_never_noised — SimDist columns.
episode_metadata/{successful, disturbed, clean_expert} — trajectory labels.

Per-frame loop ordering (validator invariants)

F/T capture, clean-then-noised: read sim wrench → write obs/sim_eef_force_clean; apply DR noise (per-frame Gaussian + per-episode bias + per-episode lag) → write obs/eef_force.
Depth pipeline order: dropout (low-texture mask) BEFORE Gaussian noise BEFORE range-clip. Different order changes the noise distribution.
F/T bias and lag-τ sampled per-episode (constant within episode); frame noise per-frame.

Format and storage

Format	Role	Why
`zarr`	live recording intermediate	Append-fast, concurrent-writer friendly, atomic `.partial → .zarr` rename, schema evolution = `mkdir`
`sinew_fmb_strict` TFDS builder	FMB-canonical, sim extras stripped	Mixed sim+FMB training without schema fork
`sinew_fmb_v2f` TFDS builder	Everything (sim extras + labels)	Predictor training
HDF5	rejected	Concurrent-writer fragility, NAS-unfriendly file locking

Config	per 50-ts ep	22k-FMB-equivalent corpus
2mixed_rgbd @ 224² (v1 default)	~6 MB	~243 GB
2mixed_rgbd @ 256² (FMB-faithful, opt-in)	~8 MB	~316 GB
3cam_rgbd @ 224²	~9 MB	~310 GB
4cam_rgbd @ 224²	~12 MB	~400 GB

224² matches the DINOv2-S patch grid (16 patches × 14 px = 224), saves a train-time resize, and saves ~23% disk versus 256². The v1 default (~243 GB) is ~45% of FMB upstream's 545 GB single-object zip.

5.3 FMB ↔ sim data matching

Owner: Researcher · sinew-5.7 · fmb_sim_data_match.md + validator isaac_twins/scripts/validate_fmb_schema.py

Three FMB schemas exist (raw .npy, RLDS, live gym env). The schema work locks one canonical sim record that's drop-in compatible with FMB's RLDS while adding sinew ground truth via the obs/sim_* prefix.

Key	Shape	Purpose
`obs/sim_t_ns`, `obs/sim_ctrl_step_idx`, `obs/sim_cam_capture_t_ns/<view>`	scalars	Timestamps — FMB stores none; sim emits so sim↔real can be cross-checked
`obs/sim_contact_wrench_ee`	(6,) float32	GT wrench in EE frame for the wrench head
`obs/sim_cartesian_contact`	(6,) bool	Per-Cartesian-dim contact — the FoAR-style gate label
`obs/sim_peg_pose_world`, `obs/sim_board_pose_world`	(7,) each	Geometric GT for contact-point head (project contact line into wrist-cam pixel)
`obs/sim_cam_intrinsics/<view>`, `obs/sim_cam_extrinsics/<view>`	(3,3) + (4,4) × N	K matrix + T_world_cam for the projection above
`obs/sim_jacobian`, `obs/sim_gripper_dist`, `obs/sim_seed`, `obs/sim_randomization_id`	various	Diagnostics + replay reproducibility
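
The projection that the intrinsics and extrinsics rows enable can be sketched as a plain pinhole model. Several things here are assumptions to verify against the recorder: that `T_world_cam` maps camera-frame points into the world frame, that the camera looks along its +z axis (OpenCV-style, which differs from the USD camera convention), and that skew is zero. The function name is hypothetical.

```python
def project_world_point(p_world, K, T_world_cam):
    """Project a 3D world point to pixel (u, v); None if behind the camera.

    K is a 3x3 intrinsics matrix and T_world_cam a 4x4 rigid transform,
    both given as nested lists.
    """
    R = [row[:3] for row in T_world_cam[:3]]
    t = [row[3] for row in T_world_cam[:3]]
    d = [p_world[i] - t[i] for i in range(3)]
    # Inverse rigid transform: p_cam = R^T (p_world - t).
    p_cam = [sum(R[r][c] * d[r] for r in range(3)) for c in range(3)]
    if p_cam[2] <= 0.0:
        return None
    u = K[0][0] * p_cam[0] / p_cam[2] + K[0][2]
    v = K[1][1] * p_cam[1] / p_cam[2] + K[1][2]
    return u, v


# Illustrative values: 224x224 image, camera at the world origin.
K = [[200.0, 0.0, 112.0], [0.0, 200.0, 112.0], [0.0, 0.0, 1.0]]
T = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
print(project_world_point((0.0, 0.0, 1.0), K, T))
```

Projecting the two endpoints of the contact line this way would give the wrist-cam pixel segment used as the contact-point head target.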

libfranka channels FMB drops (sinew picks them up)

libfranka's franka::RobotState exposes ~30 channels FMB ignores. High-value ones recorded under obs/sim_*:

tau_J — direct link-side torque, better SNR than the external wrench estimator
tau_ext_hat_filtered — low-pass filtered external torque
cartesian_contact — per-Cartesian-dim contact bit (the load-bearing FoAR gate label)
O_T_EE_d — last commanded EE pose (reveals controller tracking lag)
time — strictly monotonic libfranka clock

Camera name mapping (sinew canonical)

FMB upstream	sinew canonical	Physical mount
`side_1`	`side_left`	workspace −X edge (robot's left)
`side_2`	`side_right`	workspace +X edge
`wrist_1`	`wrist_left`	wrist-mount slot L
`wrist_2`	`wrist_right`	wrist-mount slot R

Caveat: the wrist L/R mapping is a sinew choice; FMB upstream binds arbitrary serials. If real-Franka eval shows mirrored-wrist artifacts (policy reaches the wrong way), the fix is to flip the wrist mapping and retrain rather than to look for a bug elsewhere.

5.4 FMB trajectory replay in sim — honest negative

Owner: SimWorker · sinew-5.20 (honest negative) · fmb_replay_feasibility.md + isaac_twins/scripts/tests/test_replay_mechanism.py

Verdict: cannot reliably prove end-to-end FMB grasp+insert replay in current sim. This is the load-bearing reason substage verification falls back to a state-only adapter (§4.2 Q2), and the reason BC option (b) "BC on sim-rendered FMB-replay images" was rejected (§4.4).

Four structural blockers, none individually trivial:

No IK layer in the replay path — joint-replay drifts in EE space because the Plan A DifferentialIK that sinew-3 wired into the env isn't reused at replay.
Peg spawn is randomized; it doesn't match the FMB-recorded peg pose at the grasp moment, so the contact geometry doesn't line up.
STEP→USD tessellation deflection (~0.5 mm) is comparable to the medium/small board clearances, so tight-tolerance insertions are geometrically infeasible in sim.
FMB raw .npy (545 GB) not downloaded; only 5-frame cached smokes exist locally.

Mechanism smoke confirmed sim infra works end-to-end: scene builds, physics settles, gripper actuates 0.08 → 0.0002 m closed. Predicted yield shape-by-shape: ~30–50% best case, < 10% for asymmetric shapes. There's a ~4-day path to reliable replay if needed (proper IK in the replay loop, peg-pose forcing, board re-mesh, FMB raw pull), but it's not required for the v2f-end-goal pipeline because substage verification falls back to the state-only adapter for offline calibration.

5.5 NAS + DL_A6000 pipeline

Owner: SimWorker · sinew-5.8 · data_pipeline.md + draft isaac_twins/scripts/episode_uploader.py

local PC (4070 Ti SUPER, 16 GB)        NAS (143.248.121.169:7002, ftp)          DL_A6000 (24 GB+)
┌─────────────────────┐                 ┌──────────────────────────┐            ┌────────────────────┐
│ FmbRecorder         │                 │ /IntelligentManipulation │            │  pixi env          │
│   → zarr/episode_*  │  FTP (curl)     │  Team/DomrachevIvan/     │   FTP      │  → TFDS reader     │
│   → episode.zarr.   │ ─push────────►  │  sinew/recordings/       │ ◄─pull──── │  → train_v2f.py    │
│       tar           │                 │    2026-05-21/seed_07/   │            │                    │
│ episode_uploader.py │                 │    sinew/tfds/           │            │                    │
└─────────────────────┘                 └──────────────────────────┘            └────────────────────┘

Protocol locks (per user global CLAUDE.md)

Choice	Value	Why
Protocol	plain FTP, `curl --ftp-method nocwd`	FTPS data channel fails from this PC; `nocwd` is stateless
Endpoint	`ftp://143.248.121.169:7002`	DNS fallback for `irislab.asuscomm.com`
Base path	`/IntelligentManipulationTeam/DomrachevIvan/sinew/`	Per user CLAUDE.md folder + sinew subtree
Wire format	tar-of-zarr per episode (uncompressed)	Zarr's many-small-files layout is FTP-unfriendly; arrays already compressed
Auth	`~/.netrc` (chmod 600), `curl --netrc-file`	Never put password in command line
rsync / rclone / sftp / FTPS	rejected	NAS has no SSH; plain FTP per user CLAUDE.md

Two-process model: FmbRecorder writes zarr atomically; episode_uploader.py watches and pushes per-episode opportunistically. Idempotent — mid-upload crash → next pass overwrites. Recording rate ~60 MB/min = 1.0 MB/s at 224², vs ~12 MB/s home-link ceiling — network is never the bottleneck. Even four parallel collectors stay well under.
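
A minimal sketch of one uploader pass under the protocol locks above, assuming hypothetical function names, a hypothetical `.uploaded` marker file, and a remote layout of `recordings/<date>/<seed>/` taken from the diagram. It is not the draft `episode_uploader.py`; it only illustrates the tar-of-zarr packing, the `nocwd` plain-FTP push, and why a mid-upload crash is harmless.

```python
import subprocess
import tarfile
from pathlib import Path

# Assumption: remote layout mirrors the diagram (recordings/<date>/<seed>/).
FTP_BASE = ("ftp://143.248.121.169:7002/IntelligentManipulationTeam/"
            "DomrachevIvan/sinew/recordings")


def pack_episode(zarr_dir: Path, out_dir: Path) -> Path:
    """Bundle one zarr episode directory into a single uncompressed tar."""
    tar_path = out_dir / f"{zarr_dir.name}.tar"
    with tarfile.open(tar_path, "w") as tar:  # mode "w" means no compression
        tar.add(zarr_dir, arcname=zarr_dir.name)
    return tar_path


def curl_push_cmd(tar_path: Path, remote_subdir: str, netrc: Path) -> list[str]:
    """Build the plain-FTP upload command; the password stays in the netrc file."""
    return [
        "curl", "--fail", "--ftp-method", "nocwd", "--ftp-create-dirs",
        "--netrc-file", str(netrc),
        "-T", str(tar_path),
        f"{FTP_BASE}/{remote_subdir}/{tar_path.name}",
    ]


def upload_pending(zarr_root: Path, out_dir: Path, remote_subdir: str,
                   netrc: Path, run=subprocess.run) -> list[str]:
    """One opportunistic pass: push every episode without a success marker."""
    pushed = []
    for ep in sorted(zarr_root.glob("episode_*")):
        marker = out_dir / f"{ep.name}.uploaded"  # hypothetical marker file
        if not ep.is_dir() or marker.exists():
            continue
        tar_path = pack_episode(ep, out_dir)
        result = run(curl_push_cmd(tar_path, remote_subdir, netrc))
        if result.returncode == 0:
            # Marker is written only after success, so a crash mid-upload
            # simply causes the next pass to overwrite the remote file.
            marker.touch()
            pushed.append(ep.name)
    return pushed
```

The `run` parameter exists only so the pass can be exercised without a network; in normal use the default `subprocess.run` is what executes `curl`.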

Pool	Episodes	Size @ 224²	Wall @ 1	Wall @ 4 parallel
single-object multi-stage	15,350	~94 GB	~76 h	~19 h
single-object insertion-only	4,050	~49 GB	~20 h	~5 h
long-horizon (300 ts/ep)	2,700	~100 GB	~67 h	~17 h
total mirror-FMB @ 224²	22,100	~243 GB	~164 h (~7 days)	~41 h (~1.7 days)

5.6 Scene visual-gap mitigation (Q3 answer from the addenda)

Owner: SimWorker (addenda) over Researcher (sinew-5.16) · substage_and_scene_addenda.md §3 + sim2real_visual_gap.md

Researcher's measured baseline (sinew-5.16): χ² = 1.88 mean across 4 cams, wrist cams worst at χ² ~ 2.0 with real-edge-density 51–67% denser. Wrist cams see hand, cables, fingers, peg-print lines — none of which the current sim USD models. The addenda enumerates 10 candidate fixes and ranks them by leverage × inverse-cost. Two tiers land before the v2f stage-1 pretrain.

Tier 1 — do first (~2 days total)

#	Fix	Time	Δχ² (mean)	Δχ² (wrist)	Why this rank
1	Procedural cable mesh in wrist FOV (1–2 swept curves, random color, random routing)	1 day	-0.05 to -0.1	-0.3 to -0.5	Biggest single hit on wrist χ² — addresses the 39% edge-density gap directly
2	Lab-clutter distractor spawning (3–5 small meshes per ep outside the action region)	0.5 day	-0.15 to -0.25	-0.05	Biggest global χ² hit per cost; already in DR spec row 32, just impl
3	Background plane workshop texture (tiled real-workshop photo on the ground plane)	0.5 day	-0.1 to -0.2	-0.05	Cheap, fixes side-cam background uniformity

Tier 2 — next ~2 days if budget

#	Fix	Time	Δχ² (mean)	Δχ² (wrist)
4	FDM layer-line normal map on peg surfaces (~0.4 mm period)	0.5 day	-0.05 to -0.1	-0.1 to -0.15
5	Wrist-mount visual upgrade (real FMB STEP mesh + bevels + screws)	1 day	-0.05	-0.1 to -0.15
6	Domain-aware per-frame exposure jitter (random render exposure)	0.25 day	-0.1 to -0.15	-0.1 to -0.15

Defer or reject

Fix	Why deferred
PathTracing render (raytraced → pathtraced)	3–5× render-cost penalty — throughput killer. Run only if Tier 1+2 leaves a visible gap.
Hand approximation (stub human hand USD near gripper)	2–3 day scope; Researcher explicitly flagged "stage 2 real fine-tune carries this." Don't simulate a human.
Photographed FMB bin texture	Needs a real photograph; not on the critical path.

Expected χ² trajectory (SimWorker estimate, not measured)

State	Mean χ²	Wrist χ²	Interpretation
Today	1.88	~2.0	Solidly "visibly distinct" per Force Map
After Tier 1	~1.4	~1.4	Distinguishable but training-tolerant
After Tier 1+2	~1.0–1.2	~1.1–1.3	Approaching Force Map threshold; sufficient for stage-1 sim pretrain

Honesty note: these Δχ² estimates are SimWorker fix-table predictions per the leverage analysis in sim2real_visual_gap.md §3. The original measurement is N=1 (one real episode, one timestep, one sim env, 4 cams). Re-measuring χ² post-DR + post-Tier-1 is the highest-priority sinew-5.16 follow-up. What scene authoring cannot fix — unmodeled lab clutter, lighting hardware noise, sensor-level rolling-shutter / chromatic-aberration artifacts — is absorbed by the stage-2 real fine-tune.

5.7 Parallel-development architecture

Owner: SimWorker · sinew-5.4 · parallel_dev_architecture.md

Two-repo separation

Repo	Owns	Communicates via
`isaac_twins/`	Scenes, USDs, Franka control, recording driver, substage detector, F/T sensor.	8 published symbols (`sim_worker.md §3.2`)
`isaaclab_sinew/`	RL env wrapper, training scripts, eval harness, parser (no BC loader after the BC drop).	Imports only the 8 published symbols
`sinew/` workspace	Docs, beads, references	Read-only on both repos

num_envs scaling roadmap

Multi-env RL gated on three SimWorker follow-ups (carried into Wave 1):

Batched articulation handle — grab_franka_view(num_envs) → Articulation wrapping /World/envs/env_.*/Scene/Robot regex.
Per-env reset — SceneConfigurator.reset_episode(env_ids).
Observation packager — isaac_twins.fmb.obs.get_obs(cfgr, art_view, sub_detector).

Until those land, single-env is the only contract. DirectRLEnv migration is then a rename-only ~200-line PR.

The hot rules (workspace CLAUDE.md proposed)

One author per docs/research/*.md.
No USD re-bake in a PR that also changes Python code.
isaaclab_sinew imports from isaac_twins only via the 8 published symbols.
Paired tickets for cross-repo work; never one bundled.
Topic branches per ticket; no long-lived feature branches.
Asset additions are additive — never rename or remove existing.
DR variants don't need new assets.
nohup + resumable driver for any Kit-loop sweep > 5 min.
bd ready is the conflict-avoidance gate — claim before editing.

6. Goal 3 — video-to-force predictor

This is the ship artifact. Seven deliverables: the visual gap quantification, the literature review v2, the FMB-only feasibility spec, the data leverage analysis, the 3-head architecture, the DR spec, and the revised pipeline that branches the training plan on the FMB-only outcome.

6.1 Visual sim2real gap quantified

Owner: Researcher · sinew-5.16 · sim2real_visual_gap.md

Metric	Sim	Real	Gap
Color hist χ² (sim vs real)	—	1.88	Above Force Map χ² = 1.0 "visibly distinct" threshold
Edge density fraction	0.056	0.079	Real +39% denser edges
Per-channel brightness (R, G, B)	(152, 159, 152)	(94, 117, 104)	Sim 1.35–1.62× brighter
Per-channel std (R, G, B)	(40, 37, 38)	(51, 56, 55)	Real 30–52% wider tonal range
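
For orientation, the two image statistics in the table can be sketched as below. The memo's exact definitions are not reproduced in this report, so the histogram bin count, the symmetric χ² form, and the gradient threshold are all assumptions; absolute values from this sketch should not be expected to match the 1.88 figure, only to move in the same direction.

```python
import numpy as np


def color_hist_chi2(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    """Chi-square distance between per-channel colour histograms.

    Assumed form: sum((p - q)^2 / (p + q)) on normalized histograms,
    averaged over the three channels. Inputs are uint8 arrays of shape (H, W, 3).
    """
    total = 0.0
    for c in range(3):
        p, _ = np.histogram(img_a[..., c], bins=bins, range=(0, 256))
        q, _ = np.histogram(img_b[..., c], bins=bins, range=(0, 256))
        p = p / p.sum()
        q = q / q.sum()
        denom = p + q
        mask = denom > 0  # skip bins that are empty in both images
        total += float(np.sum((p[mask] - q[mask]) ** 2 / denom[mask]))
    return total / 3.0


def edge_density(img: np.ndarray, thresh: float = 30.0) -> float:
    """Fraction of pixels whose grey-level gradient magnitude exceeds thresh.

    The threshold is a placeholder, not the value used in the memo.
    """
    gray = img.astype(np.float64).mean(axis=-1)
    gy, gx = np.gradient(gray)
    return float(np.mean(np.hypot(gx, gy) > thresh))
```

Under this particular form the distance is 0 for identical images and 2 for images with fully disjoint histograms, which is one reason to treat the scale as definition-dependent.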

Per-camera breakdown

Cam	Hist χ²	Sim edge	Real edge	Sim mean RGB	Real mean RGB
`side_left`	2.10	0.059	0.069	(143, 156, 146)	(93, 112, 83)
`side_right`	1.35	0.049	0.057	(149, 154, 147)	(108, 116, 112)
`wrist_left`	2.00	0.055	0.083	(161, 162, 160)	(96, 123, 100)
`wrist_right`	2.06	0.063	0.105	(156, 161, 157)	(78, 118, 119)

Wrist cams have the largest gap (real-edge-density 51–67% denser) because wrist cams see hand, fingers, board screws, peg layer-lines — sim doesn't model the foreground. Side cams cover the larger workspace that the sim USD captures more faithfully.

Sim vs real 4-camera grid — **Figure 6.1.** Side-by-side sim (top row) vs real (bottom row) at four cameras, 256². Sim is uniformly bright, monochrome peg silhouettes, no foreground hand or cables. Real has darker shading, hand visible in wrist cams, ambient lab clutter, peg surface texture. Source: `docs/research/figures/sim2real_visual_gap_grid.png`.

FMB real frame variability across 5 timesteps — **Figure 6.2.** Five timesteps × four cameras from a single real FMB `insert` episode. Lighting and viewpoint vary substantially within one episode — the v2f predictor must learn over a multi-modal real distribution, not a single fixed pose. Source: `docs/research/figures/fmb_real_frame_variability.png`.

Implications (load-bearing for the rest of Goal 3)

Stage-1 sim pretrain cannot ship without the full DR knob set. A predictor trained on sim-only without DR sees training inputs that are ~1.5× brighter, ~0.7× as edge-dense, and ~0.7× the tonal spread of the real frames it is deployed on.
Stage-2 real fine-tune is non-optional. The hand-in-wrist-cam slice cannot be simulated away.
The "70% direction-acc bar" needs caveats. Until χ² is re-measured post-DR + post-Tier-1, the primary v2f metric is "monotonic improvement post-fine-tune," not absolute ≥ 0.70.
The N=1 study is sufficient for the design decision: a 20-episode × 5-timestep study would tighten the mean by maybe 0.3–0.5, not 10×. Re-measurement is deferred to a sinew-5.16 follow-up.

6.2 Force-from-vision lit review v2

Owner: VisionWorker · sinew-5.9 · v2f_lit_v2.md · cloned code at references/v2f_lit_v2/code/{FoAR, reactive_diffusion_policy, forcesight}

Backbone consensus (2025)

Frozen DINOv2 ViT-S per cam, 4-channel RGBD patch embed (ForceSight pattern at references/v2f_lit_v2/code/forcesight/prediction/models.py:RGBDDinov2). Depth channel trainable; init from RGB conv1 mean.
ResNet still works for purely-RGB or PCD inputs (Force Map ResNet50; FoAR MinkowskiEngine ResNet14 for PCD).
No paper uses a video-native backbone (VideoMAE, TimeSformer) for force prediction yet.

Strongest impl-detail finding

Finding	Source
Force direction transfers sim→real. Magnitude does not.	Direction Matters (Yang 2026) — L1 on unit-vec coords (NOT cosine, NOT angle); magnitude dropped.
Voxel grid only helps top-down clutter; per-pixel heatmap is better for peg-board contact.	Force Map (Hanai 2023)
Future-contact gate (binary) gates magnitude loss when ¬contact.	FoAR (He 2024)
Magnitude head as scalar regression with ~0.1× direction weight.	ForceSight + Direction Matters consensus
Heavy visual DR with NO dynamics DR. Dynamics randomization hurts direction supervision.	Force Map appendix + Direction Matters

Training defaults to crib

Optimizer	AdamW, lr=3e-4
Schedule	cosine + 2000-step warmup
Batch size	128 per A6000 (FoAR uses 240 on 2× A100 → halve)
Epochs	300 (FoAR default for similar scale)
Precision	bf16 on Ampere+
Data scale	5–10k sim rollouts with full F/T labels (Force Map's 5,400-scene recipe)
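
The schedule row above (cosine decay with a 2000-step linear warmup at a peak of 3e-4) can be written as a plain function of the step index. This is a sketch: `total_steps` depends on dataset size and epoch count and is a free parameter here, and decaying to `min_lr=0.0` is an assumption rather than a stated default.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          warmup_steps: int = 2000, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr if training runs long
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With a PyTorch optimizer, the same function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step, total_steps) / base_lr` as the multiplier.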

6.3 v2f FMB-only feasibility

Owner: VisionWorker · sinew-5.17 · v2f_fmb_only_feasibility.md + isaaclab_sinew/scripts/train_v2f_fmb_only.py (480 lines, AST-valid)

The feasibility check answers a single load-bearing question before committing to the staged-pretrain plan: can v2f be learned from FMB-real labels alone? The spec is closed; the A6000 run is a separate impl ticket per the user CLAUDE.md "training runs on DL_A6000 not local PC" rule.

Data	100–500 FMB `insert` episodes, 2cam RGB-only
Architecture	Direction + gate heads only, frozen DINOv2-S
Budget	≤ 1 A6000-day
Output	Five-outcome matrix A–E that branches the pipeline downstream (see §6.7)

6.4 Data leverage analysis

Owner: Researcher (coord w/ VisionWorker) · sinew-5.10 · v2f_data_leverage.md

Stage	Data	Trained	Frozen
1: sim pretrain	5–10k FMB `insert` sim rollouts, 2–4 cam RGB(+D), clean+noisy GT wrench, heavy visual DR	backbone (optional unfreeze) + all 3 heads + gate	nothing
2: real fine-tune	~3–4k FMB `insert` real episodes × ~100 steps × 2cam RGB-only, EE-frame F/T zero-bias-subtracted, `gripper_pose==1` only	direction head + in-contact-gate head	backbone, magnitude head, contact-point head

Why this partition

Direction transfers — geometric constraint normals are sim/real-identical (Direction Matters confirmed on Franka peg-in-hole).
Magnitude doesn't transfer — real F/T includes payload-model error, gravity-comp residual, thermal drift; sim contact stiffness is calibrated arbitrarily. Cross-distribution training corrupts both.
Contact-point doesn't transfer — no real-world per-pixel "contact happened here" label exists. Sim derives from peg/board poses + cam extrinsics.
Gate transfers if masked correctly — threshold at 8 N (FoAR default, 3–5× DR σ) gives a clean binary signal.
Backbone frozen at stage 2 — catastrophic-forgetting risk on visual-DR features.

Hard exclusions

Multi-object assemblies (7,200 episodes) — wrong physics.
Non-insert primitives (grasp/place/rotate/regrasp) — gripper-peg contact direction is uninformative for peg-board contact direction.
FMB real depth (D405 passive stereo on textureless 3D-printed pegs is the worst real distribution shift). RGB-only stage 2.
FMB language_embedding — no language conditioning.
FMB action labels — were for BC (now dropped); not for v2f.

Filter chain (parser-side)

Primitive filter: keep only primitive == 'insert'.
Gripper filter: keep only state_gripper_pose == 1 (peg held).
Episode validity: drop episodes with < 10 post-filter steps.

Steps 1+2 collapse 22,550 episodes → ~3,000–4,000 insert-only-with-peg episodes, ~20–40 GB RGB-only.
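
The three filters above compose as follows. This sketch assumes each episode is a list of per-step dicts and that `primitive` is a per-step label; the real parser's data structures may differ, and only the field names `primitive` and `state_gripper_pose` come from the filter chain itself.

```python
from typing import Iterable, Iterator

MIN_STEPS = 10  # episode-validity floor from the filter chain


def filter_episode(steps: list[dict]) -> list[dict]:
    """Apply the primitive and gripper filters, then the validity check.

    Returns the kept steps, or an empty list if the episode is dropped.
    """
    kept = [
        s for s in steps
        if s["primitive"] == "insert" and s["state_gripper_pose"] == 1
    ]
    return kept if len(kept) >= MIN_STEPS else []


def filter_dataset(episodes: Iterable[list[dict]]) -> Iterator[list[dict]]:
    """Yield only the episodes that survive all three filters."""
    for steps in episodes:
        kept = filter_episode(steps)
        if kept:
            yield kept
```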

F/T zero-bias subtraction (per-episode)

Franka external wrench carries a per-episode baseline drift (payload model + thermal). The first 5 pre-grip-close timesteps (state_gripper_pose == 0) provide the bias estimate. If fewer than 5 pre-grip-close steps exist (peg already grasped), skip subtraction — falling back to in-contact bias estimation would bake contact force into the "bias" and contaminate all downstream direction labels.
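
A sketch of the per-episode rule, assuming the wrench is a `(T, 6)` array, `state_gripper_pose` is a `(T,)` array of 0/1 values, and "pre-grip-close" means every step before the first step with value 1. The function name and the returned flag are illustrative.

```python
import numpy as np

N_BIAS = 5  # number of pre-grip-close steps used for the bias estimate


def subtract_zero_bias(wrench, gripper_pose):
    """Return (wrench_out, applied).

    If at least N_BIAS steps precede the first grip-close, the mean wrench of
    the first N_BIAS steps is subtracted from the whole episode. Otherwise the
    wrench is returned unchanged and applied is False, so contact force is
    never folded into the bias estimate.
    """
    wrench = np.asarray(wrench, dtype=np.float64)
    gripper_pose = np.asarray(gripper_pose)
    closed = np.flatnonzero(gripper_pose == 1)
    first_closed = int(closed[0]) if closed.size else len(gripper_pose)
    if first_closed < N_BIAS:
        return wrench, False
    bias = wrench[:N_BIAS].mean(axis=0)
    return wrench - bias, True
```

Returning the flag lets the parser record which episodes carry unsubtracted labels, which may be useful when auditing direction accuracy later.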

6.5 3-head architecture

Owner: VisionWorker · sinew-5.11 · v2f_arch.md (~43M params, 22M trainable + 21M frozen)

Component	Value
Backbone	DINOv2-S ViT-S/14, 384-d features, frozen (depth channel trainable)
Input	`side_left + wrist_left` RGBD 224×224 (letterboxed from 1280×720 sim or 256² FMB-real)
Patch embed	4-ch RGBD (ForceSight pattern); depth conv1 init from RGB mean clone
Fusion	2-layer transformer encoder, 4 heads, GELU, per-cam positional embeddings

Four heads (three prediction heads + in-contact gate)

Head	Output	Loss	Weight	Stage
contact-point	per-pixel 64×64 heatmap on wrist cam	BCE + soft-L2 hybrid	0.5	sim only; frozen at stage 2
force-direction	unit-vec 3D in EE frame	L1 on coords (Direction Matters)	1.0 (load-bearing)	sim pretrain + real fine-tune
6D wrench	EE-frame `[f; τ]`, trained on noisy-lagged label	MSE, gate-gated	0.1	sim only; frozen at stage 2
in-contact gate	binary probability	BCE (FoAR pattern)	0.1	sim + real fine-tune

Key locks across memos

F/T noise on labels not inputs — predictor sees clean sim vision and learns to match noisy real F/T.
NaN-mask direction loss at samples where ||F|| < 8 N — direction at the noise floor is uninformative.
Gate-gated wrench loss — per-sample MSE × gate label suppresses magnitude gradient on no-contact frames.
No F/T input to predictor. RGBD in, wrench out. Single-direction flow. Eliminates train-test divergence at deploy.
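
The masking and gating locks can be illustrated with a NumPy sketch of the three vector-valued loss terms, using the weights from the head table (direction 1.0, wrench 0.1, gate 0.1). The contact-point heatmap loss is omitted. Assumptions: the gate label is derived from the label force magnitude at the 8 N threshold, the direction L1 is summed over the three coordinates, and the gated wrench MSE is averaged over the full batch. The function name is hypothetical and this is not the training code.

```python
import numpy as np

CONTACT_N = 8.0  # force-magnitude threshold shared by the gate and the mask


def v2f_losses(pred_dir, pred_wrench, pred_gate, label_wrench):
    """pred_dir (B,3), pred_wrench (B,6), pred_gate (B,) probabilities,
    label_wrench (B,6) noisy-lagged labels in the EE frame."""
    force = label_wrench[:, :3]
    mag = np.linalg.norm(force, axis=1)
    gate = (mag >= CONTACT_N).astype(np.float64)

    # Direction: L1 on unit-vector coordinates, NaN below the threshold so
    # that noise-floor samples drop out of the mean.
    with np.errstate(invalid="ignore", divide="ignore"):
        unit = force / mag[:, None]
    target_dir = np.where(gate[:, None] > 0, unit, np.nan)
    per_sample = np.abs(pred_dir - target_dir).sum(axis=1)
    dir_loss = float(np.nanmean(per_sample)) if gate.any() else 0.0

    # Wrench: per-sample MSE multiplied by the gate label.
    mse = ((pred_wrench - label_wrench) ** 2).mean(axis=1)
    wrench_loss = float((mse * gate).mean())

    # Gate: binary cross-entropy on the contact probability.
    p = np.clip(pred_gate, 1e-7, 1.0 - 1e-7)
    gate_loss = float(-(gate * np.log(p) + (1.0 - gate) * np.log(1.0 - p)).mean())

    total = 1.0 * dir_loss + 0.1 * wrench_loss + 0.1 * gate_loss
    return {"dir": dir_loss, "wrench": wrench_loss, "gate": gate_loss, "total": total}
```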

Training budget

Stage	Epochs	lr	bs	Wall
Stage 1 (sim pretrain)	300	3e-4 cosine + 2k warmup	128	~1.0–1.3 A6000-days, bf16
Stage 2 (real fine-tune, dir + gate only)	30–50	3e-5, no warmup	128	< 1 A6000-day

6.6 Domain randomization spec

Owner: VisionWorker · sinew-5.13 · v2f_dr_spec.md

Heavy visual DR, NO dynamics DR. F/T noise applied to labels, not inputs. 46-row master knob table organized in three schedule buckets:

Bucket	Knobs (count)	Where applied
per-episode	lighting (5), cam intrinsics (3), cam extrinsics (3), materials (8), placement (3), F/T bias + lag (2), MODE split (1) — ~26 knobs	`SceneConfigurator.randomize(step)`. Re-author USD attributes / lights / materials in place. USD writes > runtime `set_focal_length` calls.
per-frame (sim)	depth dropout + Gauss noise + clamp + RGBD jitter (4), F/T additive noise (2) — 6 knobs	Recording loop, before writing the labeled tuple. Noise becomes part of the recorded label.
per-frame (train aug)	RGB brightness/contrast/hue/saturation/gauss/gamma/JPEG (7), chromatic aberration (1) — 8 knobs	Training dataloader (`torchvision.transforms.v2`). Cheap; expands effective dataset.

Critical knob: depth dropout on low-texture pixels (knob #20). D405 is passive stereo, so textureless 3D-printed pegs give sparse / noisy depth in real while the sim depth is clean. Approximate by masking pixels with texture_grad < τ and setting their depth to inf. Without this knob, any RGBD-using head will be tuned to clean sim depth that it never sees on real.
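
A sketch of the low-texture depth dropout, assuming `texture_grad` is the grey-level gradient magnitude of the paired RGB frame and that τ is a scalar threshold. The default τ here is a placeholder, not a value from the DR spec.

```python
import numpy as np


def depth_dropout_low_texture(rgb: np.ndarray, depth: np.ndarray,
                              tau: float = 4.0) -> np.ndarray:
    """Set depth to inf wherever the RGB image has little local texture.

    rgb is (H, W, 3), depth is (H, W). A copy of depth is returned.
    """
    gray = rgb.astype(np.float64).mean(axis=-1)
    gy, gx = np.gradient(gray)
    texture_grad = np.hypot(gx, gy)
    out = depth.astype(np.float32, copy=True)
    out[texture_grad < tau] = np.inf  # mimic missing passive-stereo returns
    return out
```

Any downstream consumer has to treat `inf` as "no return", for example by clamping or by feeding a validity mask alongside the depth channel.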

Five categories elevated to hard-required (post sinew-5.16)

Per the visual-gap quantification: lighting + color jitter + material BRDF + camera intrinsic jitter + background clutter must all ship in the stage-1 DR set. None of these is "optional ablation."

6.7 Revised pipeline — branches A–E

Owner: VisionWorker · sinew-5.23 · supersedes sinew-5.10 + 5.11 + 5.13 · v2f_pipeline_revised.md

One pipeline, five outcome branches set by the sinew-5.17 FMB-only A6000 run result.

                     ┌─ A. Strong validation (real-only ≥0.85) ── drop sim stage 1; ship real-only
                     │
                     ├─ B. Validated, expected (≥0.70, <0.85)── staged pretrain-sim + fine-tune-real (canonical recipe)
[sinew-5.17 result]──┤
                     ├─ C. Marginal (≥0.70 global, <0.60 worst)─ per-shape ensembles OR more real data
                     │
                     ├─ D. Below-bar (0.55-0.70)─────────────── triage: unfreeze backbone, swap to ViT-B, more real data
                     │
                     └─ E. Failed (<0.55)────────────────────── HALT — frame audit, zero-bias check, linear probe

Outcome B (the canonical, most-likely branch)

Stage 1 sim pretrain. 1.6M (image, force) pairs from the RL data-gen pass; DR per sinew-5.13 calibrated to sinew-5.16 deltas; all 4 heads trained; 300 epochs, AdamW lr=3e-4 cosine + 2k warmup, bs=128, bf16. SimDist disturbance integration via gate-head loss upweight ×1.5 on noised steps. Output: predictor_sim_pretrain.pt.
Stage 2 real fine-tune. Filtered FMB insert subset (~3–4k eps × ~100 steps × 2cam RGB). Direction + gate heads only; backbone, wrench, contact-point heads frozen. 30–50 epochs, AdamW lr=3e-5, no warmup, bs=128. Direction (L1 NaN-masked) + gate (BCE) weighted 1.0 : 0.1. Output: predictor_real_finetune.pt — the ship artifact.
Eval. Stratified per-shape direction cos-sim, magnitude-stratified direction cos-sim, gate F1 on FMB-real test.

Why no F/T input or disturbance-conditioning input

Two related decisions that both prevent train-test divergence at deployment:

No F/T input to predictor. Real Franka reports K_F_ext_hat_K — that's a separate channel the policy may read; the predictor's job is to produce wrench from RGBD alone.
No sim_action_noised conditioning input. The flag doesn't exist at deployment time; training a conditioning input the trainer can't reproduce at test time is exactly the kind of subtle divergence we should refuse to introduce. The disturbance shows up implicitly in RGBD (object position, contact geometry); the model can learn the regime from images.

The 70% bar caveat

Direction Matters' 70% bar was derived under their own sim2real distribution. With our χ² = 1.88, the achievable absolute score is likely closer to 0.55–0.65 in the worst case. Therefore:

Tier	Global cos-sim	Per-shape min	Action
Aspirational	≥ 0.70	≥ 0.60	Ship outcome B as-is
Acceptable	≥ 0.60	≥ 0.45 with +0.05 fine-tune lift	Ship outcome B; flag the gap
Soft-fail	0.55–0.60	varies	Outcome C — per-shape ensembles
Hard-fail	< 0.55	—	Outcome E — halt + audit (frame + zero-bias + quat + linear-probe)

Primary metric becomes monotonic improvement post-fine-tune until χ² is re-measured post-DR. If real fine-tune doesn't lift over stage-1 sim-pretrain, the visual gap absorption isn't working.
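
The tier table can be read as a small decision function. This is a sketch: the table does not say what happens when the global score clears 0.60 but the per-shape or fine-tune-lift condition fails, so routing those cases to soft-fail is an assumption, as is the function name.

```python
def classify_tier(global_cos: float, per_shape_min: float,
                  finetune_lift: float = 0.0) -> str:
    """Map eval numbers to a tier from the 70%-bar caveat table.

    finetune_lift is the gain in global cos-sim from stage 1 to stage 2.
    """
    if global_cos < 0.55:
        return "hard-fail"      # Outcome E: halt + audit
    if global_cos >= 0.70 and per_shape_min >= 0.60:
        return "aspirational"   # ship outcome B as-is
    if global_cos >= 0.60 and per_shape_min >= 0.45 and finetune_lift >= 0.05:
        return "acceptable"     # ship outcome B; flag the gap
    # Assumption: everything else at or above 0.55 is treated as soft-fail.
    return "soft-fail"          # Outcome C: per-shape ensembles
```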

7. Implementation sequencing — five waves

The research epic produces design memos. This section lays out the impl tickets that follow, sequenced by dependency. None are filed yet; by convention they get filed when the research epic closes and the team transitions to impl.

Dependency graph (sinew-5.X memos → impl deliverables)

Wave-by-wave details

Wave 1 — sim surface (~3–5 person-days, no training)

SubstageDetector class at isaac_twins/src/isaac_twins/fmb/substage.py per sinew-5.3 §4. Single-env. Offline unit tests + 2 runtime tests + the three pre-RL Q1 checks (threshold envelope, inverted-physics sanity, temporal smoothness flag).
read_eef_wrench_ee API per sinew-5.21. Stateful when noisy=True (LP filter state owned by caller); stateless when noisy=False. Smoke test: push peg into board, confirm wrench magnitude grows monotonically with depth.
Bake peg_tip_local_offset on each peg USD at author time (isaac_twins-37). Required by the insert peg-tip-z check.
grab_franka_view(num_envs), SceneConfigurator.reset_episode(env_ids), obs packager — unblock multi-env scaling and DirectRLEnv migration.
Φ_insert v2 reward fn per sinew-5.22 §1. Pure function (diagnostics, prev_diagnostics, primitive, action, safety_clip_delta) → float. Offline tests against synthetic histories.

Wave 2 — RL data factory (~6 GPU-days at N=1, ~1.5 days at N=4)

PPO training per sinew-5.2 logging spec, curriculum per sinew-5.1 §5.
EEDeltaCorruptedActionMapper subclass per sinew-5.22 §3 (~0.5 d). Burst-state machine for the SimDist recipe.
Sub-expert checkpoint preservation every 50 PPO iters → 11 checkpoints (0, 50, …, 2000).
Curriculum advance gates per sinew-5.2 (IQM > 0.7, CIs not crossing 0.5 at N=5).
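
The curriculum gate in the last item can be sketched with a simplified IQM and a percentile bootstrap. The binding definitions live in sinew-5.2 and the rliable library; this version trims a whole number of samples from each end, uses a plain (not stratified) bootstrap, and reads "CIs not crossing 0.5" as "the lower CI bound stays above 0.5", all of which are assumptions.

```python
import numpy as np


def iqm(scores) -> float:
    """Interquartile mean: mean of the middle 50% (floor-trimmed)."""
    x = np.sort(np.asarray(scores, dtype=np.float64))
    k = int(np.floor(0.25 * x.size))
    return float(x[k: x.size - k].mean())


def bootstrap_ci(scores, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI of the IQM over per-seed success rates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=np.float64)
    stats = [iqm(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)


def may_advance(scores, iqm_bar: float = 0.7, ci_floor: float = 0.5) -> bool:
    """Gate: IQM above the bar and the CI lower bound above the floor."""
    lo, _ = bootstrap_ci(scores)
    return bool(iqm(scores) > iqm_bar and lo > ci_floor)
```

With N=5 seeds the bootstrap interval is coarse, so the gate is conservative by construction: a single collapsed seed is usually enough to pull the lower bound under the floor.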

Wave 3 — data-gen pass (~9 GPU-h + ~2 h FMB pull + ~2 h parse)

FmbDataRecorder outer loop (~1.5 d) per sinew-5.22 §3. Burst-noise applied here, NOT during PPO training.
HDF5 schema additions (~0.5 d): 4 obs/sim_* keys (sim_action_noised, sim_policy_iter, sim_never_noised, plus the F/T clean/label pair already in sinew-5.6) + 4 episode-meta labels.
NAS sync driver per sinew-5.8 (nohup, plain FTP, per-episode opportunistic).
FMB real-data filtered subset pull via HTTPS-range from gs://gresearch/robotics/fmb/0.0.1/ — 15–30 shards (~20–35 GB), insert-only-after-filter. ~30 min download + ~2 h parsing.
Data-gen yield eval script (~0.5 d) checking pair count + contact transient fraction + F/T coverage + per-cam diversity.

Wave 4 — v2f training (~1–2 A6000-days stage 1 + ~1 day stage 2)

Stage 1 sim pretrain per sinew-5.23 §3.2 / sinew-5.11. 300 epochs, bs=128, AdamW lr=3e-4 cosine + 2k warmup, bf16. All 3 heads + gate. Gate-head loss upweight ×1.5 on noised steps.
Outcome-matrix branch on the sinew-5.17 FMB-only A6000 result (A–E).
Stage 2 real fine-tune per sinew-5.23 §3.2. 30–50 epochs, lr=3e-5, no warmup. Direction + gate only; backbone, wrench head, contact-point head frozen. Non-optional per sinew-5.16.
Stratified eval per-shape, magnitude-stratified, gate-F1. Held-out clean_expert slice (~1.25%).

Wave 5 — real-robot eval (deferred to next epic)

Deploy predictor_real_finetune.pt on real Franka.
Compare predicted F/T vs real K_F_ext_hat_K.
Per-shape stratified bench.
Real-robot data-collection pipeline + safety envelope + impedance gains tuning are the open chunks.

Compute envelope

Wave	Estimated time	Bottleneck
Wave 1 (impl)	~3–5 person-days	SimWorker bandwidth
Wave 2 (RL training)	~6 GPU-days at N=1, ~1.5 at N=4	per-seed wall time
Wave 3 (data gen + pull + parse)	~9 GPU-h + ~2 h pull + ~2 h parse	I/O on FMB pull
Wave 4 (v2f train)	~1–2 A6000-days stage 1 + ~1 day stage 2	training compute
Wave 5 (real eval)	deferred	real-robot access

Item	Source	When
χ² re-measurement post-DR + post-Tier-1 scene fix	sinew-5.16 §5 follow-up #4	After Wave 3 data-gen lands
Detector ↔ recorded-label cross-check (addenda 1.3)	addenda Q1	Only if non-determinism observed in recordings
Per-shape threshold ablation (addenda 1.6)	addenda Q1	After FMB raw arrives
Soft-success bonus shape	sinew-5.1 §8 #2	First training-run data
Ablations: backbone freeze/unfreeze, RGB vs RGBD, temporal stack T=1 vs T=4, contact-point per-pixel vs voxel	sinew-5.11 §6	Wave 4 (if stage 1 underperforms)
Production asset root pin	`isaac_twins-36`	Before any S3-staging issue resurfaces

Item	Reason
FMB trajectory replay reliability (~4 days to fix)	Not needed for v2f-end-goal; state-only adapter covers substage verification (sinew-5.20)
SimDist latent world model + MPC planning	Out of v2f scope (sinew-5.18)
Force-side domain adaptation	Two-gap separation: F/T gap closed by label noise, not domain adaptation
PathTracing render	3–5× render-cost penalty; throughput killer (addenda Q3)
Hand approximation in sim	Stage-2 real fine-tune absorbs this; modelling a human is scope creep (addenda Q3, sinew-5.16 §3.2)
Multi-object FMB assemblies (stage 2)	Different contact physics; 7,200 demos / 233 GB; defer indefinitely
Real-robot validation epic	Its own future epic; sinew-5.2 protocol carries over once real eval starts

9. References

Internal memos

File	Issue	Owner	Scope
`docs/researcher.md`	sinew-1.1	Researcher	FMB deep-dive + force-from-vision landscape v1
`docs/fmb_reference.md`	sinew-1.1 + 1.5	Researcher	FMB cheat-sheet; §3.1 EE-frame canonical; §11 cam LR mapping
`docs/sim_worker.md`	sinew-1.2 + 1.4	SimWorker	`isaac_twins` audit
`docs/rl_worker.md`	sinew-1.3	RLWorker	`isaaclab_sinew` bootstrap
`docs/rl_action_layer_sketch.md`	sinew-2	RLWorker	EE-delta action layer (Plan A DIK)
`docs/research/rl_eval_protocol.md`	sinew-5.2	RLWorker	IQM + bootstrap CIs, PPO defaults
`docs/research/sim_substage_detection.md`	sinew-5.3	SimWorker	Substage detector spec
`docs/research/rl_reward_design.md`	sinew-5.1	RLWorker	PBRS + sparse + curriculum (Φ_insert §3.5 superseded)
`docs/research/il_bc_warmstart.md`	sinew-5.12	RLWorker	BC plan (dropped per sinew-5.22 §2)
`docs/research/fmb_sim_data_match.md`	sinew-5.7	Researcher	3 FMB schemas, sim canonical, libfranka extras
`docs/research/v2f_lit_v2.md`	sinew-5.9	VisionWorker	11-paper impl-detail review
`docs/research/v2f_data_leverage.md`	sinew-5.10	Researcher	Stage-1/stage-2 partition (revised by sinew-5.23)
`docs/research/v2f_arch.md`	sinew-5.11	VisionWorker	3-head architecture (still binding)
`docs/research/v2f_dr_spec.md`	sinew-5.13	VisionWorker	46-knob DR table
`docs/research/camera_subset_benchmark.md`	sinew-5.5	SimWorker	Camera subset perf bench
`docs/research/sim_recording_spec.md`	sinew-5.6	SimWorker	Recording schema + format choice
`docs/research/data_pipeline.md`	sinew-5.8	SimWorker	NAS + DL_A6000 pipeline
`docs/research/parallel_dev_architecture.md`	sinew-5.4	SimWorker	Two-repo split, hot rules
`docs/research/sim2real_visual_gap.md`	sinew-5.16	Researcher	χ²=1.88 quantification
`docs/research/v2f_fmb_only_feasibility.md`	sinew-5.17	VisionWorker	FMB-only feasibility spec + outcome matrix
`docs/research/sim_dist_review.md`	sinew-5.18	Researcher	SimDist action-burst recipe
`docs/research/substage_verification.md`	sinew-5.19	SimWorker	State-only adapter for offline calibration
`docs/research/fmb_replay_feasibility.md`	sinew-5.20	SimWorker	Honest negative on FMB replay
`docs/research/sim_ft_sensor.md`	sinew-5.21	SimWorker	`read_eef_wrench_ee` API + 4-stage pipeline
`docs/research/rl_revised_plan.md`	sinew-5.22	RLWorker	Reward v2, BC drop, SimDist adoption, labels
`docs/research/v2f_pipeline_revised.md`	sinew-5.23	VisionWorker	5-outcome branches; canonical recipe
`docs/research/substage_and_scene_addenda.md`	addenda	SimWorker	Q1 gate checks, Q2 combining, Q3 scene tiers
`isaac_twins/scripts/validate_fmb_schema.py`	sinew-5.7	Researcher	FMB-schema validator (3-mode autodetect)

Cloned code (in `isaac_twins/references/`)

Repo	Path	Why
FoAR — He et al. 2024	`v2f_lit_v2/code/FoAR/`	In-contact-gate pattern, ResNet18 + F/T fusion
Reactive Diffusion Policy — Xue et al. 2025	`v2f_lit_v2/code/reactive_diffusion_policy/`	2-stage VAE + latent-DDPM, slow/fast hierarchy
ForceSight — Collins et al. 2023	`v2f_lit_v2/code/forcesight/`	RGBDDinov2 backbone, ThreeHeadMLP
SimDist — CLeARoboticsLab	`references/sim_dist/`	Action-burst recipe (kept; latent WM dropped)
rliable — Agarwal et al. 2021	`eval/rliable/`	IQM + stratified bootstrap CI
FMB — Luo et al. 2024	`fmb/`	Authoritative FMB code
realsense-ros	`realsense-ros/`	D405 driver reference

Paper PDFs (in `isaac_twins/references/v2f_lit_v2/papers/`)

Force Map — Hanai et al., IROS 2023 — arXiv:2304.05803
Direction Matters — Yang et al., 2026 — arXiv:2602.14174
FoAR — He et al., RA-L 2024 — arXiv:2411.15753
Reactive Diffusion Policy — Xue et al., RSS 2025 — arXiv:2503.02881
ForceMimic — Liu et al., 2024 — arXiv:2410.07554
ForceSight — Collins et al., 2023 — arXiv:2309.12312
Feel the Force — Adeniji et al., 2025 — arXiv:2506.01944
Visuo-Tactile Transformers (VTT) — Chen et al., CoRL 2022 — arXiv:2210.00121
ViTaMIn — 2025 — arXiv:2504.06156
DaFoEs — Reyzabal et al., 2024 — arXiv:2401.09239
VTLA — 2025 — arXiv:2505.09577
Forces for Free — Zhu et al., Science Robotics 2025 — DOI:10.1126/scirobotics.adq5046
rliable — Agarwal et al., NeurIPS 2021 — eval/agarwal_2021_rliable.pdf
What Matters in On-Policy RL — Andrychowicz et al., 2020 — eval/andrychowicz_2021_on_policy.pdf

External anchors

FMB project — functional-manipulation-benchmark.github.io
FMB paper (arXiv v2) — arXiv:2401.08553
FMB IJRR 2025 — SAGE
libfranka RobotState reference — frankarobotics.github.io
Isaac Sim 6.0 production S3 — omniverse-content-production
FMB TFDS schema — gs://gresearch/robotics/fmb/0.0.1/ (mirrored at isaac_twins/references/fmb_dataset_schema/)

Generated 2026-05-21 by the sinew agent team. Final state: sinew-1 env setup (5 children) and sinew-5 research epic (14 v1 children + 8 reopened-wave children + 5.14 synthesis) all closed. Implementation phase begins with Wave 1 (sim surface). Source memos under docs/research/; cloned references under isaac_twins/references/; this report lives at docs/r2s2r_research_report.html.

`isaac_twins/`	Isaac Sim digital twin (Python 3.12, isaacsim 6.0). Owned by SimWorker. Scenes, FMB asset spawning, articulation control, cameras, USD baking, replay. Public API is single-env stable; multi-env vectorization is impl follow-up.
`isaaclab_sinew/`	Isaac Lab RL training repo (pixi-managed Python 3.12 + isaacsim 6.0 + Isaac Lab v3.0.0-beta). Owned by RLWorker. Wraps `isaac_twins` in a `gymnasium.Env` today; promotes to `DirectRLEnv` once the batched articulation handle lands.
`sinew/`	Orchestration root. Cross-cutting docs under `docs/`, beads issue tracker under `.beads/`, shared references under `isaac_twins/references/`.