sinew R2S2R — Force-from-Vision on the FMB Testbed

Date: 2026-05-21  |  Authors: sinew agent team (team-lead orchestrator, Researcher, SimWorker, RLWorker, VisionWorker)  |  Scope: sinew-1 env setup (closed) and sinew-5 research epic (two waves, closed)

This report is the unified research synthesis for the sinew project — a real-to-sim-to-real (R2S2R) pipeline whose end deliverable is a video-to-force (v2f) predictor that maps RGBD frames to an end-effector wrench on the FMB benchmark. The framing the project converged to is that v2f is the ship artifact, sim contact reports are the direct training signal (not vision-predicted), and reinforcement learning serves as a data factory that produces diverse trajectories for v2f training. The body of the report covers the locked decisions, per-goal findings from 21 research memos, an implementation sequencing plan over five waves, and the small set of open questions deferred to the next epic.

1. Executive summary

TL;DR (5 bullets)

  1. v2f predictor is the ship artifact. RGBD → end-effector wrench. RL trains a policy in sim purely so we can record diverse (image, force) pairs; the policy itself is not deployed. This reframe inverts how the original plan described the project.
  2. Sim F/T is a direct training signal. Isaac contact reporter via the locked read_eef_wrench_ee API. Never vision-predicted during training. Real Franka noise / bias / lag is modelled on the recorded label, not on inputs — the two-gap separation (visual gap closed by DR, F/T gap closed by label noise) is load-bearing throughout.
  3. BC warmstart is dropped. The χ²=1.88 sim-to-real visual gap (with edge density +39% and 35–62% brightness skew on real) means BC trained on FMB-real frames cannot transfer onto sim-camera obs. Curriculum + sub-expert mixture compensates; ~10–20% wall-clock penalty traded for clean scope.
  4. Stage-2 real fine-tune is non-optional. Wrist-cam content (hand, cables, peg layer-lines, lab clutter) cannot be simulated to χ² ≤ 1.0 without scope creep that's larger than the fine-tune itself. The scene plan brings sim to χ² ≈ 1.0–1.2; real fine-tune of the direction and gate heads closes the rest.
  5. Research epic closed; implementation is next. sinew-1 (env setup, 5 children) and sinew-5 (research, 14 v1 children plus 8 reopened-wave children plus the synthesis ticket) are all closed. The next phase is five impl waves (substage + F/T surface → PPO + corrupted-action mapper → data-gen pass → v2f training → real-robot eval).

Binding cross-cutting locks

These are the deliverable from the research epic. Every row is an irreversible choice that downstream impl tickets honor without re-litigating.

Lock | Locked value | Origin
End deliverable | v2f predictor (RGBD → wrench). The RL policy is not shipped. | reframe; project-v2f-is-end-goal
Sim F/T provenance during training | Isaac contact reporter via read_eef_wrench_ee. Never vision-predicted. | sinew-5.21
Two-gap separation | Visual gap → heavy visual DR (no dynamics DR). F/T gap → noise/bias/lag on the recorded label, not the input. | sinew-5.13, sinew-5.16, sinew-5.21
Noisy vs clean wrench usage | Noisy → policy obs + v2f wrench-head label. Clean → reward + substage detector + direction-head label. | sinew-5.21, sinew-5.22, sinew-5.23
BC warmstart | DROPPED. χ²=1.88 visual gap invalidates BC-from-real-demos. | sinew-5.22 §2 (supersedes sinew-5.12)
Stage-2 real fine-tune | Non-optional. Direction + gate heads only; backbone, wrench head, contact-point head frozen. | sinew-5.16, sinew-5.23 §3.2
Disturbance recipe | SimDist action-only burst noise applied at data-gen time only (not during PPO training). Per-DOF σ ∈ [0.02, 0.30] m / [0.02, 0.40] rad; gripper bit excluded; 2.5% never-noised envs. | sinew-5.18, adopted by sinew-5.22 §3
Reward Φ_insert v2 | Adds -0.2 · ||f_clean|| gated on d_xy < r_align AND d_z < z_align. No "degenerate-when-zero" branch; force signal is reliable. | sinew-5.22 §1 (supersedes sinew-5.1 §3.5)
Direction is the load-bearing head | L1 on unit-vec coords, λ=1.0. Wrench-magnitude head λ=0.1. Direction Matters recipe. | sinew-5.9, sinew-5.23 §1
Force / quat / image conventions | F/T in EE frame everywhere. Quat (qx, qy, qz, qw). Images BGR on disk → RGB at parse time. side_{left,right}, wrist_{left,right} camera names. | sinew-1.5, sinew-5.7
Action contract | 7-vec EE-delta normalized [-1, 1], scaled ±0.06 m / ±0.25 rad / gripper bit, @ 10 Hz, base frame, Plan A DifferentialIK at panda_hand. | sinew-2 + sinew-3
RL evaluation | IQM + 95% stratified bootstrap CIs (never mean/median); P(A>B) > 0.7 to claim improvement; N=5 seeds default at {0, 7, 42, 314, 2718}. | sinew-5.2 (rliable)
Recording resolution | 224² (DINOv2-S patch match) is the v1 default; 256² is opt-in ablation. | sinew-5.6
Camera subset (v2f) | 2mixed_rgbd minimum (side_left + wrist_left); 3cam_rgbd for data leverage. | sinew-5.5 winner
FMB trajectory replay reliability | Not reliable in current sim. Substage verification falls back to a state-only adapter (calibration-only, never in the training loop). | sinew-5.20 (honest negative)

State of the world

2. Project framing

sinew is the Real-to-Sim-to-Real (R2S2R) force-from-vision project anchored on the FMB (Functional Manipulation Benchmark) testbed at IRIS Lab. The hardware target is a real Franka Emika Panda with four Intel RealSense D405 cameras (two side, two wrist) running contact-rich single-object insertion. Three sibling repositories host the work:

isaac_twins/ | Isaac Sim digital twin (Python 3.12, isaacsim 6.0). Owned by SimWorker. Scenes, FMB asset spawning, articulation control, cameras, USD baking, replay. Public API is single-env stable; multi-env vectorization is impl follow-up.
isaaclab_sinew/ | Isaac Lab RL training repo (pixi-managed Python 3.12 + isaacsim 6.0 + Isaac Lab v3.0.0-beta). Owned by RLWorker. Wraps isaac_twins in a gymnasium.Env today; promotes to DirectRLEnv once the batched articulation handle lands.
sinew/ | Orchestration root. Cross-cutting docs under docs/, beads issue tracker under .beads/, shared references under isaac_twins/references/.

The three goals, restated under the reframe

Goal | Role under the reframe | Validation criterion
Goal 1: RL data factory | Train a PPO policy in sim that solves the FMB insertion task well enough to produce diverse trajectories. Curriculum: grasp → +place → +rotate → +regrasp → +insert. Sub-expert checkpoints from every 50 iters are kept for data-gen mixture. The policy is not the deliverable. | Data-gen yield: ≥ 1.6M (image, force) pairs, ≥ 5% contact transient frames, force coverage [0, 30] N, per-cam diversity χ² ≥ 0.3.
Goal 2: Sim dataset | Record sim rollouts with full FMB-RLDS-parity observations plus sim ground truth: clean and noisy EE wrench, per-pair contact info, peg/board world poses, per-camera intrinsics/extrinsics. Per-episode DR. Substage labels written from the canonical detector. NAS pipeline to ferry zarr/HDF5 to the DL_A6000 for training. | 5–10k insert-primitive rollouts (~243 GB at 224²), schema 1:1 with FMB RLDS + namespaced obs/sim_* extras, validator clean.
Goal 3: v2f predictor | Train an RGBD → wrench predictor with three task heads (contact-point on wrist cam, force direction unit-vec, 6D wrench) plus an in-contact gate. Frozen DINOv2-S backbone, multi-view fusion. Two-stage training: sim pretrain (all heads) → real fine-tune (direction + gate only). | Direction cos-sim on real, tiered: aspirational ≥ 0.70 global / ≥ 0.60 per-shape min; acceptable ≥ 0.60 / ≥ 0.45 with +0.05 fine-tune lift; soft-fail → per-shape ensembles; hard-fail → halt and audit. Primary metric is monotonic improvement until χ² is re-measured post-DR.

Two gaps, separately handled

The single most important architectural insight from the reopened-epic wave is that the visual sim2real gap and the F/T sim2real gap are different problems with different fixes. Conflating them was the bug in the original plan.

Gap | What it is | How sinew closes it
Visual gap | Sim renders too clean: χ²=1.88, edge density +39% in real, brightness skew 35–62%. Wrist cams worst (hand, cables, peg layer-lines). | Heavy visual DR (sinew-5.13) plus Tier-1/Tier-2 scene authoring fixes (sinew-5.16 + addenda Q3). Stage-2 real fine-tune absorbs what can't be simulated.
F/T gap | Real Franka K_F_ext_hat_K is noisy + biased + lagged. Sim contact reporter is pristine. | Noise/bias/lag model on the recorded label (sinew-5.21). Predictor sees clean sim images and learns to match noisy real-Franka F/T. No domain adaptation for force.

How the goals interlock

[Figure: goal pipeline diagram. Goal 1, RL data factory (PPO + curriculum, 11 sub-expert checkpoints, SimDist action corruption; the policy is NOT shipped) passes trajectories to Goal 2, Sim dataset (FmbRecorder + heavy DR, read_eef_wrench_ee labels, substage predicate written, ~243 GB @ 224², 22k eps), which passes (image, force) pairs to Goal 3, v2f predictor (RGBD → wrench with 3 heads, DINOv2-S frozen, fused, sim pretrain → real FT; THE SHIP ARTIFACT). Out of scope / downstream: the shipped predictor is deployed on the real Franka, where it reads RGBD and emits F/T; the FMB real subset feeds the stage-2 FT. Legend: Goal 1, Goal 2, Goal 3, out-of-scope / downstream.]

Answers to the lead's original 12-question intake are integrated throughout §§4–6 of this report. A separate Q&A section is not maintained; each question is addressed in the section where its findings live.

3. Phase 0 — environment setup

The setup epic sinew-1 took the project from a sim with no Franka articulation to a closed-loop Isaac Lab environment that ingests a 7-vec EE-delta action and returns FMB-shaped observations. All five children closed.

Issue | Owner | Deliverable
sinew-1.1 | Researcher | docs/researcher.md + docs/fmb_reference.md: FMB benchmark deep-dive + force-from-vision landscape v1.
sinew-1.2 | SimWorker | docs/sim_worker.md: isaac_twins audit (API surface, known limitations, perf benchmarks).
sinew-1.3 | RLWorker | isaaclab_sinew/ bootstrap (pixi + IsaacLab v3.0.0-beta + FmbInsertionEnv skeleton).
sinew-1.4 | SimWorker | Re-baked three USDs for Isaac Sim 6.0 asset paths after the based_robotics repo rename. Gripper-unpin landed as a side effect.
sinew-1.5 | Researcher | fmb_reference.md §3.1: pin EE frame as canonical wrench convention; 6×6 EE→base adjoint applied only at the FMB-checkpoint boundary.
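
The 6×6 EE→base adjoint named in sinew-1.5 can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, (R_base_ee, p_base_ee) is assumed to be the EE pose expressed in the base frame, the wrench is ordered [f; τ], and the torque is assumed to be re-referenced to the base origin.

```python
import numpy as np

def skew(p):
    """Return the 3x3 cross-product matrix [p]x."""
    x, y, z = p
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def wrench_adjoint_ee_to_base(R_base_ee, p_base_ee):
    """Build the 6x6 map taking an EE-frame wrench [f; tau] to the base frame."""
    R = np.asarray(R_base_ee, dtype=float)
    A = np.zeros((6, 6))
    A[:3, :3] = R                      # f_base = R f_ee
    A[3:, :3] = skew(p_base_ee) @ R    # torque picks up p x (R f_ee)
    A[3:, 3:] = R                      # plus the rotated EE torque
    return A

def wrench_ee_to_base(wrench_ee, R_base_ee, p_base_ee):
    """Apply the adjoint to a single 6-vec wrench (hypothetical helper)."""
    A = wrench_adjoint_ee_to_base(R_base_ee, p_base_ee)
    return A @ np.asarray(wrench_ee, dtype=float)
```

Keeping this conversion in one function matches the lock that the adjoint is applied only at the FMB-checkpoint boundary, with everything upstream staying in the EE frame.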

Foundation pieces (top-level companions)

Issue | Owner | Deliverable
sinew-2 | RLWorker | docs/rl_action_layer_sketch.md: 7-vec EE-delta action layer design (Plan A DifferentialIK, Plan B serl-impedance contingency).
sinew-3 | RLWorker | EEDeltaActionMapper class wired into FmbInsertionEnv. Jacobian sourced via art._articulation_view.get_jacobians() at panda_hand. 5/5 tests green.
sinew-4 | RLWorker | Env _get_observations bug fix: frames[cam] for single-env is ndarray, not list[ndarray]. Dropped stale [0] index; documented the multi-env migration path.
isaac_twins-36 | SimWorker (backlog P3) | Pin production asset root explicitly; staging-S3 is currently 200-OK but transient.
isaac_twins-37 | SimWorker (backlog P3) | Bake peg_tip_local_offset attribute on each peg USD at author time. Required by the insert predicate.

What the env can do today

cfg = FmbInsertionEnvCfg()  # defaults: fmb_big_demo + big_long/rect peg + N=1 + side+wrist cams
env = FmbInsertionEnv(cfg)
obs, info = env.reset()
# obs keys: side_left, side_right, wrist_left, wrist_right (RGBA 720x1280),
#           q (7,), dq (7,), tcp_pose (7,), tcp_force (3,), tcp_torque (3,), gripper_pose (1,)
for _ in range(N):
    action = policy(obs)          # 7-vec in [-1, 1]
    obs, rew, term, trunc, info = env.step(action)

Today tcp_force and tcp_torque are zero-tensor placeholders. The Wave-1 impl deliverable read_eef_wrench_ee (sinew-5.21) replaces them; reward and substage detector consume the noisy=False branch, policy obs and v2f labels consume the noisy=True branch.
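
The 7-vec action in the loop above follows the locked action contract (±0.06 m / ±0.25 rad / gripper bit). A minimal sketch of that scaling, assuming a hypothetical scale_action helper rather than the actual EEDeltaActionMapper code, and assuming a positive last component means "close":

```python
import numpy as np

POS_SCALE_M = 0.06     # per-step translation limit from the action contract
ROT_SCALE_RAD = 0.25   # per-step rotation limit from the action contract

def scale_action(action):
    """Map a normalized 7-vec in [-1, 1] to (dpos_m, drot_rad, gripper_closed)."""
    a = np.asarray(action, dtype=float)
    if a.shape != (7,):
        raise ValueError(f"expected a 7-vec, got shape {a.shape}")
    if np.any(np.abs(a) > 1.0):
        # The policy output is tanh-squashed, so this should never trigger.
        raise ValueError("action components must lie in [-1, 1]")
    dpos = a[:3] * POS_SCALE_M
    drot = a[3:6] * ROT_SCALE_RAD
    gripper_closed = bool(a[6] > 0.0)   # assumption: sign convention of the bit
    return dpos, drot, gripper_closed
```

Rejecting out-of-range input instead of clipping it keeps this sketch consistent with the "tanh (NOT clip)" PPO default.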

4. Goal 1 — RL as a data factory

Six research deliverables shape the RL workstream: evaluation protocol, substage detection (with gate verification per the addenda), reward design v2 with clean wrench, the historical BC record, the SimDist disturbance recipe, and the four trajectory-label classes the recorder writes.

4.1 Evaluation protocol

Owner: RLWorker · sinew-5.2 · docs/research/rl_eval_protocol.md · Refs: references/eval/rliable, Andrychowicz 2020.

The R2S2R compute budget forces a permanent few-run regime (3–10 seeds per config). Per rliable §4, point estimates at this seed count carry only 50–70% probability of being real improvements. The protocol stays defensible by reporting interquartile mean (IQM) with stratified bootstrap CIs.

Choice | Value | Why
Aggregate metric | IQM + 95% stratified bootstrap CIs | Mean is outlier-dominated; median does not change even if nearly half the tasks score zero; IQM trims top + bottom 25%.
Improvement claim | P(A > B) > 0.7 from rliable | P ∈ (0.5, 0.7] inconclusive at N=5; P ≤ 0.5 no-difference or regression.
Seed budget | N=5 default at {0, 7, 42, 314, 2718}; bump to N=10 only on decision-gating overlap | Selection-bias safe; disaster-stopped runs still count as seeds.
Per-seed score | AUC of per-step eval-return curve (Andrychowicz §2) | Rewards data-efficient policies, not just final return.
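
The aggregate metric and improvement claim above can be sketched with NumPy alone. This is a simplified illustration, not the rliable API: the function names are hypothetical, scores is assumed to be a (n_seeds, n_tasks) array of per-seed scores, and prob_improvement covers a single task.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of the flattened scores."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * x.size)            # floor, like a 25% trimmed mean
    return float(x[cut:x.size - cut].mean())

def stratified_bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for IQM; scores has shape (n_seeds, n_tasks).

    Seeds are resampled with replacement independently inside each task
    column, so every task keeps the same number of runs per replicate.
    """
    s = np.asarray(scores, dtype=float)
    n_seeds, n_tasks = s.shape
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_seeds, size=(n_seeds, n_tasks))
        stats[b] = iqm(np.take_along_axis(s, idx, axis=0))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def prob_improvement(a, b):
    """P(A > B) on one task: share of seed pairs where A wins, ties count half."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return float((a > b).mean() + 0.5 * (a == b).mean())
```

Under the protocol, an improvement would be claimed only when prob_improvement exceeds 0.7; values in (0.5, 0.7] stay inconclusive at N=5.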

Logging cadence

Every | Metrics
10K control steps | train/return_mean, policy_loss, value_loss, entropy, grad_norm, kl_div, explained_variance, env/safety_clip_count, sys/sps
100K control steps | eval/return_mean (100 episodes, stochastic policy), eval/success_rate, eval/tcp_to_peg_dist_mean

PPO defaults (Plan A, Andrychowicz "competitive base")

Choice | Value | Choice | Value
clip ε | 0.25 | γ | 0.99
GAE λ | 0.9 | activation | tanh
policy MLP | 2 × 64 | action transform | tanh (NOT clip)
value MLP | 2 × 256 | initial action std | 0.5
obs norm | YES (crucial) | advantage norm | per-minibatch
value loss clip | NO (hurts) | optimizer | Adam, lr=3e-4
Top "surprising" finding: last policy-layer init 100× smaller.

SAC is the fallback only if PPO plateaus below 50% success after the 5-seed pass.

Two-stream evaluation under the reframe

Per sinew-5.22 §5 the eval surface forks into two streams:

4.2 Substage detection and gate verification

Owner: SimWorker · sinew-5.3 · sinew-5.19 · addenda Q1+Q2 (2026-05-21) · sim_substage_detection.md + substage_verification.md

FMB upstream has no automated substage detection: the operator hits Enter to advance primitives at rollout time (sequential_rollout.py:250). Sim has full ground truth, so its deterministic detectors are tighter than the real-robot procedure; the resulting deltas are intentional and documented.

Sensor surface

Three Isaac Lab ContactSensor instances using filter_prim_paths_expr cover all five FMB primitives plus transitions:

Per-primitive predicates

Primitive | Success predicate
grasp | antagonistic finger contact (force dot < -0.7), F_grip > 1.0 N, peg z > 5 cm above bin, no slip for ≥ 83 ms (10 physics ticks @ 120 Hz)
place_on_fixture | peg↔fixture force > 0.3 N, peg↔fingers force = 0 (released), peg vel < 5 mm/s, peg z in fixture-height window
rotate | peg long-axis rotation > 90° from entry, peg-on-fixture maintained, ±15° verticality. Axisymmetric pegs auto-pass.
regrasp | grasp.success + peg long-axis vertical ±15°
insert | peg↔board force > 0.3 N, peg-tip z within ±3 mm of hole bottom, peg xy within 5 mm of hole center, verticality ±20°, stable 10 steps

The SubstageDetector class lives at isaac_twins/src/isaac_twins/fmb/substage.py. It exposes {p}_success(), {p}_failure(), transition_ready(p) (success AND TCP at z_safe ≥ 0.20 m), and diagnostics(). Reward authors never recompute distances — the detector is single source of truth, eliminating reward-vs-eval drift by construction.
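
As an illustration of how a predicate like insert combines instantaneous thresholds with the "stable 10 steps" requirement, here is a minimal sketch. The class and argument names are hypothetical and do not mirror the SubstageDetector API; only the threshold values come from the table above.

```python
import math
from dataclasses import dataclass

@dataclass
class InsertThresholds:
    min_contact_force_n: float = 0.3   # peg-board force
    tip_z_tol_m: float = 0.003         # +/- 3 mm of hole bottom
    xy_tol_m: float = 0.005            # within 5 mm of hole center
    max_tilt_deg: float = 20.0         # verticality window
    stable_steps: int = 10

class InsertSuccessLatch:
    """Hypothetical helper: true once the raw predicate held N steps in a row."""

    def __init__(self, thresholds=None):
        self.th = thresholds or InsertThresholds()
        self._streak = 0

    def raw(self, peg_board_force_n, tip_z_m, hole_bottom_z_m,
            peg_xy, hole_xy, tilt_deg):
        """Instantaneous check of all four geometric and force conditions."""
        th = self.th
        d_xy = math.hypot(peg_xy[0] - hole_xy[0], peg_xy[1] - hole_xy[1])
        return (peg_board_force_n > th.min_contact_force_n
                and abs(tip_z_m - hole_bottom_z_m) <= th.tip_z_tol_m
                and d_xy <= th.xy_tol_m
                and abs(tilt_deg) <= th.max_tilt_deg)

    def update(self, **obs):
        """Advance one step; any failing step resets the stability streak."""
        self._streak = self._streak + 1 if self.raw(**obs) else 0
        return self._streak >= self.th.stable_steps
```

Resetting the streak on any failing step is what makes a brief contact spike insufficient to register success.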

Q1 (addenda): are the gates meaningful before execution?

Beyond the three core tests already specced in sinew-5.3 §6 (offline unit, runtime smoke, negative case), five additional checks are ranked by ROI. The recommendation is to land three of them before RL kickoff (~2 days total):

# | Check | Cost | What it catches | Verdict
1.2 | Threshold-envelope startup assertion | 1 hr | Silent misconfig (e.g. F_grip_min tuned to 100 N during debug and forgotten) | land pre-RL
1.5 | Inverted-physics sanity probes (5 per primitive, "should say False") | 1 day | FP catches: zero-force fake grasps, one-sided pegs, jammed-not-lifted, halfway-not-seated | land pre-RL
1.4 | Temporal smoothness flag (predicate flips > 3× per primitive window) | 2 hr | Sensor-noise artifacts masquerading as substage transitions | land pre-RL
1.3 | Detector ↔ recorded-label cross-check (two implementations compared) | 0.5 day | Hidden state / non-determinism in detector across calls | defer unless non-determinism observed
1.6 | Per-shape threshold ablation | 1 day | Per-peg-size FP/FN drift | defer until FMB raw arrives

Landing the threshold-envelope + inverted-physics + temporal-smoothness checks lifts predicate confidence high enough to trust the recorded obs/sim_substage_predicate as a v2f gate-head training label.
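
Two of the pre-RL checks are small enough to sketch directly. The function names and the envelope bounds used in the usage example are hypothetical; only the "more than 3 flips per primitive window" rule comes from the table above.

```python
def count_flips(predicate_trace):
    """Number of False<->True transitions in a per-step boolean trace."""
    trace = [bool(v) for v in predicate_trace]
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)

def is_temporally_noisy(predicate_trace, max_flips=3):
    """Flag a primitive window whose predicate flips more than max_flips times."""
    return count_flips(predicate_trace) > max_flips

def assert_threshold_envelope(thresholds, envelope):
    """Fail fast at startup if any threshold sits outside its allowed range."""
    for name, (lo, hi) in envelope.items():
        value = thresholds[name]
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside envelope [{lo}, {hi}]")
```

For example, assert_threshold_envelope({"F_grip_min": 100.0}, {"F_grip_min": (0.5, 5.0)}) raises, which is the forgotten-debug-value case the startup assertion is meant to catch.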

Q2 (addenda): FP/FN rates and how to combine state-only with full-state

Primitive | Full-state detector (canonical) FP | Full-state detector (canonical) FN | State-only fallback P | State-only fallback R
grasp | 1–3% | 2–5% | 0.90–0.95 | 0.85–0.95
place_on_fixture | 2–5% | 5–10% | 0.75–0.85 | 0.80–0.90
rotate | 5–10% | 10–20% | 0.40–0.60 | 0.50–0.70
regrasp | 3–7% | 5–10% | 0.80–0.90 | 0.75–0.90
insert | 5–10% | 10–15% | 0.60–0.75 | 0.65–0.80

The gap is large: full-state has 1–15% error, state-only has 5–50%, with rotate and insert worst because peg orientation isn't observable from state alone.

Recommendation: option (d) for the training loop, (b) for offline audit. Full-state detector is canonical for reward + recorded labels. State-only adapter is offline calibration only — never enters training. Mixing them as a confidence-weighted auxiliary loss (option a) would inject the state-only 5–50% error into the reward gradient, correlated with the actual physics on the weak primitives (rotate, insert) — exactly the wrong shape for reward shaping. State-only's only role is flagging audit-worthy disagreements (option b) and measuring sim-bias when state-only is the proxy for FMB-real labels at stage 2 (option c).

4.3 Reward design v2 (Φ_insert with clean wrench)

Owner: RLWorker · sinew-5.1 superseded in part by sinew-5.22 §1 · rl_reward_design.md + rl_revised_plan.md

Composition: dense potential-based shaping + sparse substage bonus + small action regularizer. Per-primitive activation — only the current primitive's reward contributes per tick.

R_p(s, a, s') = λ_success · 1[p_success(s')]       # sparse terminal
              - λ_failure · 1[p_failure(s')]       # sparse anti-terminal
              + γ · Φ_p(s') - Φ_p(s)               # dense potential-based (Ng 1999)
              - λ_action  · ||a[:6]||²             # small action regularizer
              - λ_clip    · Δsafety_clip_count    # safety-box pressure

PBRS form preserves the optimal policy (Ng 1999): cumulative dense return ≈ Φ(final) − Φ(initial), so the policy cannot bank arbitrary shaping reward.
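
The "≈" above comes from a telescoping sum: with discount γ the shaping terms add up exactly to γ^T · Φ(final) − Φ(initial), which approaches Φ(final) − Φ(initial) as γ → 1. A small numeric check, with hypothetical helper names and made-up potential values:

```python
def shaping_return(phis, gamma):
    """Discounted sum of F_t = gamma * phi[t+1] - phi[t] over one trajectory."""
    return sum(
        (gamma ** t) * (gamma * phis[t + 1] - phis[t])
        for t in range(len(phis) - 1)
    )

def telescoped(phis, gamma):
    """Closed form of the same sum: gamma^T * phi[T] - phi[0]."""
    horizon = len(phis) - 1
    return (gamma ** horizon) * phis[-1] - phis[0]

# Illustrative potentials along a 4-step trajectory (values are made up).
phis = [-0.30, -0.22, -0.25, -0.05, 0.0]
gap = abs(shaping_return(phis, 0.99) - telescoped(phis, 0.99))
```

Because the total depends only on the endpoints, detours through high-potential states earn nothing extra, which is the sense in which the policy cannot bank shaping reward.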

Φ_insert v2 — the load-bearing change post-reframe

Under the reframe, sim F/T is a real, reliable signal via read_eef_wrench_ee(art, noisy=False). The v1 "degenerate-when-F/T-zero" branch is dropped: the clean wrench is non-zero only in real contact, so the force term vanishes on its own outside contact.

d_xy      = ||peg_tip_xy - hole_xy||
d_z       = max(0, peg_tip_z - hole_z_bottom)
align     = 1 - cos²(peg_long_axis, hole_axis)
f_clean   = read_eef_wrench_ee(art, sensor, noisy=False)[:3]   # (3,) clean N, EE frame
f_mag     = ||f_clean||

Φ_insert(s) = -d_xy
            - 1.5 · d_z · 1[d_xy < r_align_xy]
            - 1.0 · align · 1[d_z < z_align_thresh]
            - 0.2 · f_mag · 1[d_xy < r_align_xy AND d_z < z_align_thresh]

Coefficient lifted 0.1 → 0.2 because the signal is now reliable. The force-term gate ensures we only penalize contact force when we should be making controlled contact (inside the alignment and seat-depth window); outside that window force is exploration cost and is not penalized.
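
A runnable sketch of the Φ_insert v2 formula above. The function name and argument layout are hypothetical, the default values for r_align_xy and z_align_thresh are placeholders (the report does not fix them here), and f_clean is passed in directly rather than read through read_eef_wrench_ee.

```python
import numpy as np

def phi_insert_v2(peg_tip, hole_xy, hole_z_bottom, peg_axis, hole_axis, f_clean,
                  r_align_xy=0.005, z_align_thresh=0.01):
    """Potential for the insert primitive; gate radii are placeholder values."""
    peg_tip = np.asarray(peg_tip, dtype=float)
    d_xy = float(np.linalg.norm(peg_tip[:2] - np.asarray(hole_xy, dtype=float)))
    d_z = max(0.0, float(peg_tip[2]) - hole_z_bottom)
    u = np.asarray(peg_axis, dtype=float)
    v = np.asarray(hole_axis, dtype=float)
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    align = 1.0 - cos ** 2                 # 0 when the axes are parallel
    f_mag = float(np.linalg.norm(f_clean))
    xy_ok = d_xy < r_align_xy
    z_ok = d_z < z_align_thresh
    return (-d_xy
            - 1.5 * d_z * xy_ok            # depth term only once xy-aligned
            - 1.0 * align * z_ok           # tilt term only near seat depth
            - 0.2 * f_mag * (xy_ok and z_ok))  # force term only inside both gates
```

Far from the hole only the −d_xy term is active, so a 10 N contact there costs nothing in Φ; once both gates are satisfied, 5 N of clean force costs 1.0.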

Constants (starting values, all tunable)

Constant | Value | Rationale
λ_success | 20.0 | Dominates dense return (≤5 per primitive) by 4×
λ_failure | 5.0 | Smaller than success; eval cares about success rate
γ (PBRS) | 0.99 | Matches PPO discount; Ng's theorem requires the same γ
λ_action | 0.001 | Tiny, just enough to break "wave arm around" ties
λ_clip | 0.5 | 50-tick all-clipped trajectory costs 25, which exceeds λ_success

Curriculum

  1. Phase 1: grasp-only. Pass: N=5 seeds, IQM > 0.7, 95% CIs not crossing 0.5.
  2. Phase 2: + place_on_fixture.
  3. Phase 3+: + rotate, + regrasp, + insert one at a time.

4.4 BC dropped — historical record

Owner: RLWorker · superseded by sinew-5.22 §2 · originals: il_bc_warmstart.md + rl_revised_plan.md §2.

The original plan (sinew-5.12) recommended Option C: BC warmstart → PPO/SAC fine-tune using the 22,550 FMB demos. The χ²=1.88 visual gap (sinew-5.16) invalidated the precondition that FMB-real images are drop-in compatible with sim-camera obs. Three options were considered post-reframe:

Option | Trade-off | Verdict
(a) State-conditioned-only BC (drop image obs) | No visual gap; throws away ~95% of FMB demo signal (images dominate the input dimension). State-only policy can't represent visual feature dependencies. | reject
(b) BC on sim-rendered FMB-replay images | BC sees sim-distribution images directly. Requires sinew-5.20 replay to be production-ready, which it is not. | reject
(c) Drop BC entirely | PPO from scratch is harder. Under the reframe (RL = data factory, policy not deployed) "task hard for pure-RL" is tolerable; ~10–20% wall-clock lost, scope clarity gained. | decision

Curriculum + sub-expert mixture compensate for losing BC. The 22,550 FMB demos are kept on the project shelf for future real-robot work but do not enter the sinew RL training loop. The four data-side issues from the original sinew-5.12 spec (frame convention, quat order, F/T mismatch, BGR-on-disk) remain documented as a reference for any future BC revival.

4.5 SimDist disturbance recipe

Owner: Researcher · sinew-5.18, adopted by sinew-5.22 §3 · sim_dist_review.md + cloned code at isaac_twins/references/sim_dist/ (CLeARoboticsLab/simdist).

The only piece of SimDist sinew adopts is the action-only burst noise recipe. The latent world model + MPC planning machinery is out of v2f scope.

Noise scope | Action only. No push/wrench, no observation noise at data-gen.
Per-env σ draw | Once per env at run start (fixed for that env's entire run), sampled from U[σ_min, σ_max].
Burst pattern | On for 1–50 control steps, off for 25–500, alternating. Net ≈ 9% noised time fraction.
Never-noised fraction | 2.5% of envs run completely clean, which produces the clean_expert trajectory label.
Policy mixture | 50% expert + 50% from 11 sub-expert PPO checkpoints (iter 0, 50, …, 2000), re-rolled per env at reset.
Per-DOF σ_max | Translation 0.30 (5× per-step limit → saturating perturbation that generates contact transients), rotation 0.40.
Gripper bit | Excluded. Flipping mid-burst drops the peg.
When applied | Data-gen pass only. Not during PPO training, for cleaner reward attribution.
Recording | Every step recorded regardless of noise state. Per-env noise flag becomes an HDF5 column.

Adoption cost is ~2 days: an EEDeltaCorruptedActionMapper subclass that wraps the existing mapper (~0.5 d), the burst-state machine, and the HDF5 schema additions (sim_action_noised, sim_policy_iter, sim_never_noised).
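
The burst-state machine can be sketched as follows. This is not the SimDist code or the planned EEDeltaCorruptedActionMapper: the class name is hypothetical, the noise is assumed Gaussian and added to the six motion components in physical units, uniform phase lengths are assumed, and each env is assumed to start in an off phase.

```python
import random

class BurstNoise:
    """Hypothetical action-only burst noise for one env."""

    def __init__(self, sigma_min, sigma_max, never_noised_frac=0.025, seed=0):
        self.rng = random.Random(seed)
        # Per-DOF sigma drawn once and held for the env's whole run.
        self.sigma = [self.rng.uniform(lo, hi)
                      for lo, hi in zip(sigma_min, sigma_max)]
        self.never_noised = self.rng.random() < never_noised_frac
        self.noised = False
        self.steps_left = self.rng.randint(25, 500)   # assumed initial off phase

    def step(self, action):
        """Return (possibly corrupted action, noised flag) for one control step."""
        if self.never_noised:
            return list(action), False
        if self.steps_left == 0:
            # Phase ended: toggle and draw the next phase length.
            self.noised = not self.noised
            lo, hi = (1, 50) if self.noised else (25, 500)
            self.steps_left = self.rng.randint(lo, hi)
        self.steps_left -= 1
        out = list(action)
        if self.noised:
            # sigma has 6 entries, so the gripper bit at index 6 is untouched.
            for i, s in enumerate(self.sigma):
                out[i] += self.rng.gauss(0.0, s)
        return out, self.noised
```

With uniform phase lengths the expected noised fraction is 25.5 / (25.5 + 262.5) ≈ 8.9%, in line with the ≈ 9% figure in the table; the returned flag is what would populate sim_action_noised.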

4.6 Trajectory labels for v2f filtering

Owner: RLWorker · sinew-5.22 §4

The recorder writes four orthogonal per-episode labels so the v2f trainer can stratify by data quality at HDF5 load time. Predicted fractions per a 1.6M-pair data-gen pass:

Label | Definition | Predicted fraction | v2f use
successful | terminal substage insert.success() == True | ~60% | high-quality direction labels
disturbed | any tick with sim_action_noised == 1 | ~55% | off-policy + contact-transient diversity
failed | ¬successful | ~40% | off-manifold (image, force) coverage; negative gate samples
clean_expert | successful AND ¬disturbed AND policy==expert | ~1.25% | held-out "nominal-regime" eval slice

Cross-tabulated: ~30% successful + clean, ~30% successful + disturbed, ~25% failed + disturbed, ~14% failed + clean. The trainer's default behaviour is to ignore the flags and train on the full pool (the distribution is already mixed); gate-head loss optionally upweights noised steps by 1.5× because contact-transient frames carry the cleanest gate signal.
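
The four label definitions reduce to a few boolean expressions per episode. A minimal sketch, assuming a hypothetical episode_labels helper that receives the terminal insert result, the per-tick sim_action_noised flags, and whether the rollout policy was the expert:

```python
def episode_labels(insert_success, noised_flags, policy_is_expert):
    """Derive the four per-episode labels from recorded episode data."""
    successful = bool(insert_success)
    disturbed = any(bool(f) for f in noised_flags)   # any noised tick counts
    return {
        "successful": successful,
        "disturbed": disturbed,
        "failed": not successful,
        "clean_expert": successful and not disturbed and bool(policy_is_expert),
    }
```

A trainer could then select the held-out nominal-regime slice by filtering on clean_expert, and leave the other flags unused when training on the full pool.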

5. Goal 2 — sim dataset recording

Seven research deliverables shape Goal 2: the camera subset bench, the unified GT recording spec with the sim F/T sensor, FMB↔sim data matching, the FMB-replay honest negative, the NAS pipeline, the scene visual-gap mitigation plan, and the parallel-development architecture.

5.1 Camera subset bench

Owner: SimWorker · sinew-5.5 · camera_subset_benchmark.md · 48-row JSON at isaac_twins/docs/research/camera_subset_benchmark.json

48-cell sweep: 8 camera subsets × {RGB, RGBD} × N ∈ {1, 4, 8}. Mixed-render steps/s:

Config | N=1 | N=4 | N=8 | Note
phys-only baseline | ~1700 | ~1200 | ~860 | invariant of cam subset; physics cost dominates above N=4
1cam_rgb (any) | ~1450 | ~880 | ~540 | 1side ≈ 1wrist at every N, so the choice is informational, not a perf choice
2cam_rgb (any pair) | ~1000 | ~550 | ~370 | 2side/2wrist/2mixed all equal cost
3cam_rgb | ~750 | ~360 | ~225 | linear-ish drop
4cam_rgb | ~580 | ~250 | ~146 | 3 → 4 cam is the perf cliff (1.5× drop at N=8)
5cam_rgbd (4× D405 + overview) | ~480 | ~155 | ~75 | 0.62× realtime at N=8; A6000 needed for N=16+

Winner by consumer

Consumer | Choice | Why
RL training bootstrap | 1wrist_rgb | Cheapest with contact view; ~1450 / ~880 / ~540 steps/s @ N=1/4/8
v2f data-gen (locked) | 2mixed_rgbd minimum (side_left + wrist_left), 3cam_rgbd for data leverage | Direction-from-vision wants depth; cross-view fusion needs ≥ 2 cams
Multi-env RL training | 2mixed_rgb or 3cam_rgb | Stay below the 4-cam cliff
Recording / replay videos | 5cam | Offline replay only; never for training

RGBD costs +5–15% over RGB at fixed cam count — cheaper than the cliff above. The bench appendix documents the nohup + resumable-driver pattern that the NAS recorder inherits.

5.2 GT recording spec + sim F/T sensor (unified data surface)

Owner: SimWorker · sinew-5.6 + sinew-5.21 · sim_recording_spec.md + sim_ft_sensor.md

Sim F/T sensor pipeline (sinew-5.21)

Replicates Panda's K_F_ext_hat_K in four stages. New public API: read_eef_wrench_ee(art, contact_sensor, *, noisy, state, rng) → dict.

contact sensor             coord transform         DR noise model            output
    │                          │                       │                       │
    │ world-frame net force    │ rotate by R_world_EE  │ + bias + Gauss + lag  │
┌───▼───────────┐    EE-frame  ┌▼─────────────────┐    ┌▼──────────────────┐   │
│ clean F_w (3,)│ ── world→EE ─│ F_clean_ee (3,)  │ ──▶│ noisy_lagged_ee   │──▶│ obs/eef_force
│ clean τ_w (3,)│   adjoint    │ τ_clean_ee (3,)  │    │ (3,) + (3,)       │   │ obs/eef_torque
└───────────────┘              └──────────────────┘    └───────────────────┘   │
                                       │                                       │
                                       ├──▶ obs/sim_eef_force_clean (3,)       │
                                       ├──▶ obs/sim_eef_torque_clean (3,)      │
                                       │                                       │
                                       ▼                                       │
                               ‖F_clean‖ > 0.1 N ───▶ obs/sim_in_contact (bool)

Integration discipline

Caller | Wrench used | Why
Reward Φ_insert | noisy=False | Deterministic gradient; gate must be exact
SubstageDetector | noisy=False | Sim-internal gates; predicates must be sharp
Env policy obs (tcp_force, tcp_torque) | noisy=True | Match real Franka deployment distribution
v2f wrench-head label | noisy=True | Predictor learns to match real noisy F/T
v2f direction-head label | noisy=False, f/||f|| | Noising rotates the unit vector, which corrupts geometry
Gate label sim_in_contact | noisy=False, threshold 0.1 N | Deterministic, never noised; shifts to 8 N at real stage-2

Noise model parameters

Force additive Gaussian (per axis) | σ_f = 0.025 N (Franka 0.05 N resolution / 2)
Torque additive Gaussian (per axis) | σ_τ = 0.01 Nm (Franka 0.02 Nm resolution / 2)
Per-episode bias drift | σ_bias_f = 0.05 N; σ_bias_τ = 0.02 Nm (constant within episode)
1st-order low-pass lag | τ_lag = U(20, 80) ms per episode, discrete IIR @ 10 Hz
Scaling mode | {constant, scaled} opt-in; default constant for v1 corpus
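
A minimal sketch of the constant-mode noise model using the parameters above. It is not the read_eef_wrench_ee implementation: the class name is hypothetical, the bias is assumed to be a zero-mean Gaussian draw with the listed σ, the IIR coefficient α = 1 − exp(−dt/τ) is an assumed discretization, and the order (lag, then bias, then per-frame Gaussian) is an assumption.

```python
import math
import random

class WrenchNoiseModel:
    """Hypothetical per-episode F/T corruption for a 6-vec [f; tau]."""

    def __init__(self, dt_s=0.1, seed=0):
        self.rng = random.Random(seed)
        self.dt_s = dt_s                          # 10 Hz control step
        self.sigma = [0.025] * 3 + [0.01] * 3     # per-frame std: N, then Nm
        self.reset()

    def reset(self):
        """Resample per-episode bias and lag; clear the filter state."""
        self.bias = [self.rng.gauss(0.0, s) for s in [0.05] * 3 + [0.02] * 3]
        tau_s = self.rng.uniform(0.020, 0.080)
        self.alpha = 1.0 - math.exp(-self.dt_s / tau_s)   # assumed discretization
        self.state = None

    def apply(self, clean_wrench):
        """Return the noisy, lagged 6-vec for one frame."""
        if self.state is None:
            self.state = list(clean_wrench)       # start filter at first sample
        self.state = [s + self.alpha * (c - s)
                      for s, c in zip(self.state, clean_wrench)]
        return [s + b + self.rng.gauss(0.0, sd)
                for s, b, sd in zip(self.state, self.bias, self.sigma)]
```

Calling reset() at each episode boundary keeps bias and lag constant within an episode, while apply() draws fresh Gaussian noise every frame.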

Recording schema (FMB-parity + sim extras)

The recorder writes FMB-RLDS-parity keys (images, joint_pos/vel, eef_pose/vel/force/torque, action, primitive, language) plus namespaced obs/sim_* extras for v2f training:

Per-frame loop ordering (validator invariants)

  1. F/T capture, clean-then-noised: read sim wrench → write obs/sim_eef_force_clean; apply DR noise (per-frame Gaussian + per-episode bias + per-episode lag) → write obs/eef_force.
  2. Depth pipeline order: dropout (low-texture mask) BEFORE Gaussian noise BEFORE range-clip. Different order changes the noise distribution.
  3. F/T bias and lag-τ sampled per-episode (constant within episode); frame noise per-frame.
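
The depth ordering in item 2 can be made concrete with a short sketch. The function name, the noise σ, the range limits, and the choice to encode dropped pixels as 0 are all assumptions for illustration; the low-texture mask is taken as an input rather than computed.

```python
import numpy as np

def corrupt_depth(depth_m, low_texture_mask, rng,
                  sigma_m=0.002, z_min_m=0.07, z_max_m=0.5):
    """Apply dropout, then Gaussian noise, then range clipping, in that order."""
    out = np.asarray(depth_m, dtype=np.float64).copy()
    out[np.asarray(low_texture_mask, dtype=bool)] = 0.0     # 1. dropout
    out += rng.normal(0.0, sigma_m, size=out.shape)         # 2. Gaussian noise
    return np.clip(out, z_min_m, z_max_m)                   # 3. range clip
```

With this encoding, dropped pixels pick up noise and are then clipped to z_min_m; clipping before adding noise would instead leave values outside the range, which is one way the order changes the resulting distribution.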

Format and storage

Format | Role | Why
zarr | live recording intermediate | Append-fast, concurrent-writer friendly, atomic .partial → .zarr rename, schema evolution = mkdir
sinew_fmb_strict TFDS builder | FMB-canonical, sim extras stripped | Mixed sim+FMB training without schema fork
sinew_fmb_v2f TFDS builder | Everything (sim extras + labels) | Predictor training
HDF5 | rejected | Concurrent-writer fragility, NAS-unfriendly file locking

Config | per 50-ts ep | 22k-FMB-equivalent corpus
2mixed_rgbd @ 224² (v1 default) | ~6 MB | ~243 GB
2mixed_rgbd @ 256² (FMB-faithful, opt-in) | ~8 MB | ~316 GB
3cam_rgbd @ 224² | ~9 MB | ~310 GB
4cam_rgbd @ 224² | ~12 MB | ~400 GB

224² matches the DINOv2-S patch grid (16 patches × 14 px = 224), which saves a train-time resize and ~23% disk relative to 256². The v1 default (~243 GB) is ~45% of FMB upstream's 545 GB single-object zip.

5.3 FMB ↔ sim data matching

Owner: Researcher · sinew-5.7 · fmb_sim_data_match.md + validator isaac_twins/scripts/validate_fmb_schema.py

Three FMB schemas exist (raw .npy, RLDS, live gym env). The schema work locks one canonical sim record that's drop-in compatible with FMB's RLDS while adding sinew ground truth via the obs/sim_* prefix.

Key | Shape | Purpose
obs/sim_t_ns, obs/sim_ctrl_step_idx, obs/sim_cam_capture_t_ns/<view> | scalars | Timestamps. FMB stores none; sim emits them so sim↔real can be cross-checked
obs/sim_contact_wrench_ee | (6,) float32 | GT wrench in EE frame for the wrench head
obs/sim_cartesian_contact | (6,) bool | Per-Cartesian-dim contact, the FoAR-style gate label
obs/sim_peg_pose_world, obs/sim_board_pose_world | (7,) each | Geometric GT for contact-point head (project contact line into wrist-cam pixel)
obs/sim_cam_intrinsics/<view>, obs/sim_cam_extrinsics/<view> | (3,3) + (4,4) × N | K matrix + T_world_cam for the projection above
obs/sim_jacobian, obs/sim_gripper_dist, obs/sim_seed, obs/sim_randomization_id | various | Diagnostics + replay reproducibility
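
The projection that turns world-frame geometry into a wrist-cam pixel label can be sketched with a plain pinhole model. The function name is hypothetical, and two conventions are assumed rather than taken from the recording spec: T_world_cam is treated as the camera pose in the world (camera → world), and the camera is assumed to look along +Z with no lens distortion.

```python
import numpy as np

def project_to_pixel(point_world, K, T_world_cam):
    """Project a 3D world point to (u, v) pixels, or None if behind the camera."""
    p_w = np.append(np.asarray(point_world, dtype=float), 1.0)   # homogeneous
    # Invert the camera pose to express the point in the camera frame.
    p_c = np.linalg.inv(np.asarray(T_world_cam, dtype=float)) @ p_w
    if p_c[2] <= 0.0:
        return None
    uvw = np.asarray(K, dtype=float) @ p_c[:3]
    return float(uvw[0] / uvw[2]), float(uvw[1] / uvw[2])
```

If the stored extrinsics turn out to map world → camera, or the renderer's camera looks along −Z, the inverse or an axis flip has to change accordingly; checking one known point per camera is a cheap way to confirm the convention.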

libfranka channels FMB drops (sinew picks them up)

libfranka's franka::RobotState exposes ~30 channels FMB ignores. High-value ones recorded under obs/sim_*:

Camera name mapping (sinew canonical)

FMB upstream | sinew canonical | Physical mount
side_1 | side_left | workspace −X edge (robot's left)
side_2 | side_right | workspace +X edge
wrist_1 | wrist_left | wrist-mount slot L
wrist_2 | wrist_right | wrist-mount slot R

Caveat: the wrist L/R mapping is a sinew choice — FMB upstream binds arbitrary serials. If real-Franka eval shows mirrored-wrist artifacts (policy reaches the wrong way), the fix is to flip the wrist mapping and retrain, not look for a bug elsewhere.

5.4 FMB trajectory replay in sim — honest negative

Owner: SimWorker · sinew-5.20 (honest negative) · fmb_replay_feasibility.md + isaac_twins/scripts/tests/test_replay_mechanism.py

Verdict: cannot reliably prove end-to-end FMB grasp+insert replay in current sim. This is the load-bearing reason substage verification falls back to a state-only adapter (§4.2 Q2), and the reason BC option (b) "BC on sim-rendered FMB-replay images" was rejected (§4.4).

Four structural blockers, none individually trivial:

  1. No IK layer in the replay path — joint-replay drifts in EE space because the Plan A DifferentialIK that sinew-3 wired into the env isn't reused at replay.
  2. Peg spawn is randomized; it doesn't match the FMB-recorded peg pose at the grasp moment, so the contact geometry doesn't line up.
  3. STEP→USD tessellation deflection ~0.5 mm ≈ medium/small board clearances — tight-tolerance insertions are geometrically infeasible in sim.
  4. FMB raw .npy (545 GB) not downloaded; only 5-frame cached smokes exist locally.

Mechanism smoke confirmed sim infra works end-to-end: scene builds, physics settles, gripper actuates 0.08 → 0.0002 m closed. Predicted yield shape-by-shape: ~30–50% best case, < 10% for asymmetric shapes. There's a ~4-day path to reliable replay if needed (proper IK in the replay loop, peg-pose forcing, board re-mesh, FMB raw pull), but it's not required for the v2f-end-goal pipeline because substage verification falls back to the state-only adapter for offline calibration.

5.5 NAS + DL_A6000 pipeline

Owner: SimWorker · sinew-5.8 · data_pipeline.md + draft isaac_twins/scripts/episode_uploader.py

local PC (4070 Ti SUPER, 16 GB)        NAS (143.248.121.169:7002, ftp)          DL_A6000 (24 GB+)
┌─────────────────────┐                 ┌──────────────────────────┐            ┌────────────────────┐
│ FmbRecorder         │                 │ /IntelligentManipulation │            │  pixi env          │
│   → zarr/episode_*  │  FTP (curl)     │  Team/DomrachevIvan/     │   FTP      │  → TFDS reader     │
│   → episode.zarr.   │ ─push────────►  │  sinew/recordings/       │ ◄─pull──── │  → train_v2f.py    │
│       tar           │                 │    2026-05-21/seed_07/   │            │                    │
│ episode_uploader.py │                 │    sinew/tfds/           │            │                    │
└─────────────────────┘                 └──────────────────────────┘            └────────────────────┘

Protocol locks (per user global CLAUDE.md)

Choice | Value | Why
Protocol | plain FTP, curl --ftp-method nocwd | FTPS data channel fails from this PC; nocwd is stateless
Endpoint | ftp://143.248.121.169:7002 | DNS fallback for irislab.asuscomm.com
Base path | /IntelligentManipulationTeam/DomrachevIvan/sinew/ | Per user CLAUDE.md folder + sinew subtree
Wire format | tar-of-zarr per episode (uncompressed) | Zarr's many-small-files layout is FTP-unfriendly; arrays already compressed
Auth | ~/.netrc (chmod 600), curl --netrc-file | Never put password in command line
rsync / rclone / sftp / FTPS | rejected | NAS has no SSH; plain FTP per user CLAUDE.md

Two-process model: FmbRecorder writes zarr atomically; episode_uploader.py watches and pushes per-episode opportunistically. Idempotent — mid-upload crash → next pass overwrites. Recording rate ~60 MB/min = 1.0 MB/s at 224², vs ~12 MB/s home-link ceiling — network is never the bottleneck. Even four parallel collectors stay well under.
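
A rough sketch of the per-episode push under the protocol locks above. It packs one zarr directory into an uncompressed tar and builds the curl argument list without running it. The function names and the remote sub-directory layout are assumptions for illustration; this is not the draft episode_uploader.py.

```python
import tarfile
from pathlib import Path

NAS_BASE = "ftp://143.248.121.169:7002/IntelligentManipulationTeam/DomrachevIvan/sinew/"


def pack_episode(zarr_dir, out_tar):
    """Write one episode as an uncompressed tar (zarr arrays are already compressed)."""
    with tarfile.open(out_tar, "w") as tar:  # mode "w" means no compression
        tar.add(zarr_dir, arcname=Path(zarr_dir).name)
    return out_tar


def curl_upload_argv(tar_path, remote_subdir, netrc="~/.netrc"):
    """Build the curl argv for one upload; credentials come from the netrc file only."""
    remote = f"{NAS_BASE}{remote_subdir.strip('/')}/{Path(tar_path).name}"
    return [
        "curl", "--fail",
        "--ftp-method", "nocwd",
        "--ftp-create-dirs",
        "--netrc-file", str(Path(netrc).expanduser()),
        "-T", str(tar_path),
        remote,
    ]
```

Because the remote name is derived only from the episode name, re-running the same upload after a crash overwrites the partial file, which matches the idempotent behaviour described above. The argv can be passed to `subprocess.run(argv, check=True)`.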

Pool | Episodes | Size @ 224² | Wall @ 1 | Wall @ 4 parallel
single-object multi-stage | 15,350 | ~94 GB | ~76 h | ~19 h
single-object insertion-only | 4,050 | ~49 GB | ~20 h | ~5 h
long-horizon (300 ts/ep) | 2,700 | ~100 GB | ~67 h | ~17 h
total mirror-FMB @ 224² | 22,100 | ~243 GB | ~164 h (~7 days) | ~41 h (~1.7 days)

5.6 Scene visual-gap mitigation (Q3 answer from the addenda)

Owner: SimWorker (addenda) over Researcher (sinew-5.16) · substage_and_scene_addenda.md §3 + sim2real_visual_gap.md

Researcher's measured baseline (sinew-5.16): χ² = 1.88 mean across 4 cams. The wrist cams sit at χ² ≈ 2.0 and have the worst edge-density gap, with real frames 51–67% denser than sim. Wrist cams see hand, cables, fingers and peg-print lines, none of which the current sim USD models. The addenda enumerates 10 candidate fixes and ranks them by leverage × inverse-cost. Two tiers land before the v2f stage-1 pretrain.

Tier 1 — do first (~2.5 days total)

# | Fix | Time | Δχ² (mean) | Δχ² (wrist) | Why this rank
1 | Procedural cable mesh in wrist FOV (1–2 swept curves, random color, random routing) | 1 day | -0.05 to -0.1 | -0.3 to -0.5 | Biggest single hit on wrist χ²; addresses the 39% edge-density gap directly
2 | Lab-clutter distractor spawning (3–5 small meshes per ep outside the action region) | 0.5 day | -0.15 to -0.25 | -0.05 | Biggest global χ² hit per cost; already in DR spec row 32, just impl
3 | Background plane workshop texture (tiled real-workshop photo on the ground plane) | 0.5 day | -0.1 to -0.2 | -0.05 | Cheap, fixes side-cam background uniformity

Tier 2 — next ~2 days if budget

# | Fix | Time | Δχ² (mean) | Δχ² (wrist)
4 | FDM layer-line normal map on peg surfaces (~0.4 mm period) | 0.5 day | -0.05 to -0.1 | -0.1 to -0.15
5 | Wrist-mount visual upgrade (real FMB STEP mesh + bevels + screws) | 1 day | -0.05 | -0.1 to -0.15
6 | Domain-aware per-frame exposure jitter (random render exposure) | 0.25 day | -0.1 to -0.15 | -0.1 to -0.15

Defer or reject

Fix | Why deferred
PathTracing render (raytraced → pathtraced) | 3–5× render-cost penalty, a throughput killer. Run only if Tier 1+2 leaves a visible gap.
Hand approximation (stub human hand USD near gripper) | 2–3 day scope; Researcher explicitly flagged "stage 2 real fine-tune carries this." Don't simulate a human.
Photographed FMB bin texture | Needs a real photograph; not on the critical path.

Expected χ² trajectory (SimWorker estimate, not measured)

State | Mean χ² | Wrist χ² | Interpretation
Today | 1.88 | ~2.0 | Solidly "visibly distinct" per Force Map
After Tier 1 | ~1.4 | ~1.4 | Distinguishable but training-tolerant
After Tier 1+2 | ~1.0–1.2 | ~1.1–1.3 | Approaching Force Map threshold; sufficient for stage-1 sim pretrain

Honesty note: these Δχ² estimates are SimWorker fix-table predictions per the leverage analysis in sim2real_visual_gap.md §3. The original measurement is N=1 (one real episode, one timestep, one sim env, 4 cams). Re-measuring χ² post-DR + post-Tier-1 is the highest-priority sinew-5.16 follow-up. What scene authoring cannot fix — unmodeled lab clutter, lighting hardware noise, sensor-level rolling-shutter / chromatic-aberration artifacts — is absorbed by the stage-2 real fine-tune.
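
To make the arithmetic behind the expected trajectory explicit, the sketch below simply sums the per-fix Δχ² ranges from the Tier 1 and Tier 2 tables. Treating the fixes as additive and independent is an assumption of this sketch, not something the addenda states, and `apply_fixes` is a hypothetical helper.

```python
# (smallest, largest) expected reduction per fix, copied from the tier tables.
TIER1 = {"mean": [(0.05, 0.10), (0.15, 0.25), (0.10, 0.20)],
         "wrist": [(0.30, 0.50), (0.05, 0.05), (0.05, 0.05)]}
TIER2 = {"mean": [(0.05, 0.10), (0.05, 0.05), (0.10, 0.15)],
         "wrist": [(0.10, 0.15), (0.10, 0.15), (0.10, 0.15)]}


def apply_fixes(chi2_range, reductions):
    """Return the (best, worst) chi-squared range after summing all reductions."""
    best, worst = chi2_range
    smallest = sum(r[0] for r in reductions)
    largest = sum(r[1] for r in reductions)
    return (round(best - largest, 2), round(worst - smallest, 2))


mean_t1 = apply_fixes((1.88, 1.88), TIER1["mean"])
wrist_t1 = apply_fixes((2.0, 2.0), TIER1["wrist"])
mean_t2 = apply_fixes(mean_t1, TIER2["mean"])
wrist_t2 = apply_fixes(wrist_t1, TIER2["wrist"])
print(mean_t1, wrist_t1, mean_t2, wrist_t2)
```

Under that additive assumption the summed ranges are roughly 1.33–1.58 (mean) and 1.4–1.6 (wrist) after Tier 1, and 1.03–1.38 (mean) and 0.95–1.3 (wrist) after Tier 1+2. The table's point estimates fall inside these ranges, toward the optimistic end, which is one more reason to re-measure rather than rely on them.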

5.7 Parallel-development architecture

Owner: SimWorker · sinew-5.4 · parallel_dev_architecture.md

Two-repo separation

Repo | Owns | Communicates via
isaac_twins/ | Scenes, USDs, Franka control, recording driver, substage detector, F/T sensor. | 8 published symbols (sim_worker.md §3.2)
isaaclab_sinew/ | RL env wrapper, training scripts, eval harness, parser (no BC loader after the BC drop). | Imports only the 8 published symbols
sinew/ workspace | Docs, beads, references | Read-only on both repos

num_envs scaling roadmap

Multi-env RL gated on three SimWorker follow-ups (carried into Wave 1):

  1. Batched articulation handle: grab_franka_view(num_envs) → Articulation wrapping the /World/envs/env_.*/Scene/Robot regex.
  2. Per-env reset: SceneConfigurator.reset_episode(env_ids).
  3. Observation packager: isaac_twins.fmb.obs.get_obs(cfgr, art_view, sub_detector).

Until those land, single-env is the only contract. DirectRLEnv migration is then a rename-only ~200-line PR.

The hot rules (workspace CLAUDE.md proposed)

  1. One author per docs/research/*.md.
  2. No USD re-bake in a PR that also changes Python code.
  3. isaaclab_sinew imports from isaac_twins only via the 8 published symbols.
  4. Paired tickets for cross-repo work; never one bundled.
  5. Topic branches per ticket; no long-lived feature branches.
  6. Asset additions are additive — never rename or remove existing.
  7. DR variants don't need new assets.
  8. nohup + resumable driver for any Kit-loop sweep > 5 min.
  9. bd ready is the conflict-avoidance gate — claim before editing.

6. Goal 3 — video-to-force predictor

This is the ship artifact. Seven deliverables: the visual gap quantification, the literature review v2, the FMB-only feasibility spec, the data leverage analysis, the 3-head architecture, the DR spec, and the revised pipeline that branches the training plan on the FMB-only outcome.

6.1 Visual sim2real gap quantified

Owner: Researcher · sinew-5.16 · sim2real_visual_gap.md

Metric | Sim | Real | Gap
Color hist χ² (sim vs real) | 1.88 | | Above FoAR χ² = 1.0 "visibly distinct" threshold
Edge density fraction | 0.056 | 0.079 | Real +39% denser edges
Per-channel brightness (R, G, B) | (152, 159, 152) | (94, 117, 104) | Sim 1.35–1.62× brighter
Per-channel std (R, G, B) | (40, 37, 38) | (51, 56, 55) | Real 30–52% wider tonal range

Per-camera breakdown

Cam | Hist χ² | Sim edge | Real edge | Sim mean RGB | Real mean RGB
side_left | 2.10 | 0.059 | 0.069 | (143, 156, 146) | (93, 112, 83)
side_right | 1.35 | 0.049 | 0.057 | (149, 154, 147) | (108, 116, 112)
wrist_left | 2.00 | 0.055 | 0.083 | (161, 162, 160) | (96, 123, 100)
wrist_right | 2.06 | 0.063 | 0.105 | (156, 161, 157) | (78, 118, 119)

Wrist cams have the largest edge-density gap (real 51–67% denser) because they see the hand, fingers, board screws and peg layer-lines, and sim doesn't model that foreground. By histogram χ² alone, side_left (2.10) is on par with the wrist cams (2.00, 2.06). Side cams cover the larger workspace that the sim USD captures more faithfully.
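
The edge-density percentages quoted in this section follow directly from the per-camera table. A short check, using only the table values (the helper name `pct_denser` is illustrative):

```python
# cam: (sim edge density, real edge density), copied from the per-camera table.
EDGE = {
    "side_left": (0.059, 0.069),
    "side_right": (0.049, 0.057),
    "wrist_left": (0.055, 0.083),
    "wrist_right": (0.063, 0.105),
}


def pct_denser(sim, real):
    """How much denser real edges are than sim edges, in whole percent."""
    return round(100 * (real / sim - 1))


per_cam = {cam: pct_denser(sim, real) for cam, (sim, real) in EDGE.items()}
sim_mean = sum(sim for sim, _ in EDGE.values()) / len(EDGE)
real_mean = sum(real for _, real in EDGE.values()) / len(EDGE)
overall = pct_denser(sim_mean, real_mean)
print(per_cam, overall)
```

This reproduces the 51% and 67% wrist figures and the +39% four-camera mean, while the side cams come out at about 16–17%.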

[Figure: sim vs real 4-camera grid]
Figure 6.1. Side-by-side sim (top row) vs real (bottom row) at four cameras, 256². Sim is uniformly bright, monochrome peg silhouettes, no foreground hand or cables. Real has darker shading, hand visible in wrist cams, ambient lab clutter, peg surface texture. Source: docs/research/figures/sim2real_visual_gap_grid.png.
[Figure: FMB real frame variability across 5 timesteps]
Figure 6.2. Five timesteps × four cameras from a single real FMB insert episode. Lighting and viewpoint vary substantially within one episode, so the v2f predictor must learn over a multi-modal real distribution, not a single fixed pose. Source: docs/research/figures/fmb_real_frame_variability.png.

Implications (load-bearing for the rest of Goal 3)

6.2 Force-from-vision lit review v2

Owner: VisionWorker · sinew-5.9 · v2f_lit_v2.md · cloned code at references/v2f_lit_v2/code/{FoAR, reactive_diffusion_policy, forcesight}

Backbone consensus (2025)

Strongest impl-detail finding

Finding | Source
Force direction transfers sim→real. Magnitude does not. | Direction Matters (Yang 2026): L1 on unit-vec coords (NOT cosine, NOT angle); magnitude dropped.
Voxel grid only helps top-down clutter; per-pixel heatmap is better for peg-board contact. | Force Map (Hanai 2023)
Future-contact gate (binary) gates magnitude loss when ¬contact. | FoAR (He 2024)
Magnitude head as scalar regression with ~0.1× direction weight. | ForceSight + Direction Matters consensus
Heavy visual DR with NO dynamics DR. Dynamics randomization hurts direction supervision. | Force Map appendix + Direction Matters

Training defaults to crib

Optimizer | AdamW, lr=3e-4
Schedule | cosine + 2000-step warmup
Batch size | 128 per A6000 (FoAR uses 240 on 2× A100 → halve)
Epochs | 300 (FoAR default for similar scale)
Precision | bf16 on Ampere+
Data scale | 5–10k sim rollouts with full F/T labels (Force Map's 5,400-scene recipe)
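
For reference, the "cosine + 2000-step warmup" schedule can be written as a pure function of the step index. This sketch assumes a linear warmup from zero and a cosine decay to `min_lr = 0`; neither detail is fixed by the defaults above, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=2000, min_lr=0.0):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # assumption: warmup is linear
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The same function can be wrapped in a framework scheduler by dividing its output by `base_lr` to get a multiplicative factor.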

6.3 v2f FMB-only feasibility

Owner: VisionWorker · sinew-5.17 · v2f_fmb_only_feasibility.md + isaaclab_sinew/scripts/train_v2f_fmb_only.py (480 lines, AST-valid)

The feasibility check answers a single load-bearing question before committing to the staged-pretrain plan: can v2f be learned from FMB-real labels alone? The spec is closed; the A6000 run is a separate impl ticket per the user CLAUDE.md "training runs on DL_A6000 not local PC" rule.

Data | 100–500 FMB insert episodes, 2cam RGB-only
Architecture | Direction + gate heads only, frozen DINOv2-S
Budget | ≤ 1 A6000-day
Output | Five-outcome matrix A–E that branches the pipeline downstream (see §6.7)

6.4 Data leverage analysis

Owner: Researcher (coord w/ VisionWorker) · sinew-5.10 · v2f_data_leverage.md

Stage | Data | Trained | Frozen
1: sim pretrain | 5–10k FMB insert sim rollouts, 2–4 cam RGB(+D), clean+noisy GT wrench, heavy visual DR | backbone (optional unfreeze) + all 3 heads + gate | nothing
2: real fine-tune | ~3–4k FMB insert real episodes × ~100 steps × 2cam RGB-only, EE-frame F/T zero-bias-subtracted, gripper_pose==1 only | direction head + in-contact-gate head | backbone, magnitude head, contact-point head

Why this partition

Hard exclusions

Filter chain (parser-side)

  1. Primitive filter: keep only primitive == 'insert'.
  2. Gripper filter: keep only state_gripper_pose == 1 (peg held).
  3. Episode validity: drop episodes with < 10 post-filter steps.

Steps 1+2 collapse 22,550 episodes → ~3,000–4,000 insert-only-with-peg episodes, ~20–40 GB RGB-only.
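
The three-step filter chain can be sketched as one function per episode. The assumption here is that an episode is a list of per-step dicts carrying the `primitive` and `state_gripper_pose` fields named above; the real parser's data layout may differ, and `filter_episode` is a hypothetical name.

```python
MIN_STEPS = 10  # episode validity threshold from step 3


def filter_episode(steps):
    """Apply the primitive and gripper filters, then the validity check.

    Returns the kept steps, or None if the episode should be dropped.
    """
    kept = [
        s for s in steps
        if s["primitive"] == "insert" and s["state_gripper_pose"] == 1
    ]
    return kept if len(kept) >= MIN_STEPS else None
```

Returning None for dropped episodes keeps the caller's loop simple: `episodes = [e for e in map(filter_episode, raw) if e is not None]`.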

F/T zero-bias subtraction (per-episode)

Franka external wrench carries a per-episode baseline drift (payload model + thermal). The first 5 pre-grip-close timesteps (state_gripper_pose == 0) provide the bias estimate. If fewer than 5 pre-grip-close steps exist (peg already grasped), skip subtraction — falling back to in-contact bias estimation would bake contact force into the "bias" and contaminate all downstream direction labels.
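
A minimal sketch of the per-episode rule, assuming wrenches are 6-element sequences aligned with a per-step gripper-pose list, and that "pre-grip-close" means the leading run of steps with `state_gripper_pose == 0`. The function name and return shape are illustrative.

```python
N_BIAS = 5  # number of pre-grip-close steps used for the bias estimate


def subtract_zero_bias(wrenches, gripper_pose):
    """Return (corrected_wrenches, applied).

    applied is False when fewer than N_BIAS open-gripper steps precede the grasp,
    in which case the wrenches are returned unchanged.
    """
    pre = []
    for w, g in zip(wrenches, gripper_pose):
        if g != 0:
            break  # gripper has closed; stop collecting bias samples
        pre.append(w)
    if len(pre) < N_BIAS:
        return [list(w) for w in wrenches], False
    bias = [sum(w[i] for w in pre[:N_BIAS]) / N_BIAS for i in range(6)]
    return [[w[i] - bias[i] for i in range(6)] for w in wrenches], True
```

The boolean lets the parser record which episodes went uncorrected, so they can be inspected or excluded later.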

6.5 3-head architecture

Owner: VisionWorker · sinew-5.11 · v2f_arch.md (~43M params, 22M trainable + 21M frozen)

Component | Value
Backbone | DINOv2-S ViT-S/14, 384-d features, frozen (depth channel trainable)
Input | side_left + wrist_left RGBD 224×224 (letterboxed from 1280×720 sim or 256² FMB-real)
Patch embed | 4-ch RGBD (ForceSight pattern); depth conv1 init from RGB mean clone
Fusion | 2-layer transformer encoder, 4 heads, GELU, per-cam positional embeddings

Four heads (3 prediction heads + in-contact gate)

Head | Output | Loss | Weight | Stage
contact-point | per-pixel 64×64 heatmap on wrist cam | BCE + soft-L2 hybrid | 0.5 | sim only; frozen at stage 2
force-direction | unit-vec 3D in EE frame | L1 on coords (Direction Matters) | 1.0 (load-bearing) | sim pretrain + real fine-tune
6D wrench | EE-frame [f; τ], trained on noisy-lagged label | MSE, gate-gated | 0.1 | sim only; frozen at stage 2
in-contact gate | binary probability | BCE (FoAR pattern) | 0.1 | sim + real fine-tune
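
As an illustration of how the weights combine, here is a per-sample sketch in plain Python. It takes the contact-point heatmap loss as a precomputed number, since the BCE + soft-L2 hybrid is not specified here, and it skips both force terms when the sample is out of contact. Gating the direction term this way is an assumption of the sketch (the table only states that the wrench MSE is gate-gated). All function names are hypothetical.

```python
import math

# Loss weights from the table above.
WEIGHTS = {"contact_point": 0.5, "direction": 1.0, "wrench": 0.1, "gate": 0.1}


def unit(v):
    """Normalize a 3-vector; assumes a non-zero norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def direction_l1(pred, target):
    """L1 on unit-vector coordinates (not cosine, not angle)."""
    return sum(abs(p - t) for p, t in zip(unit(pred), unit(target))) / 3.0


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def total_loss(pred_dir, tgt_dir, pred_wrench, tgt_wrench,
               gate_prob, in_contact, contact_point_loss=0.0):
    """Weighted sum of the four head losses for a single sample."""
    terms = {"contact_point": contact_point_loss,
             "gate": bce(gate_prob, in_contact),
             "direction": 0.0, "wrench": 0.0}
    if in_contact:  # assumption: both force terms are skipped out of contact
        terms["direction"] = direction_l1(pred_dir, tgt_dir)
        terms["wrench"] = sum((p - t) ** 2 for p, t in zip(pred_wrench, tgt_wrench)) / 6.0
    return sum(WEIGHTS[k] * v for k, v in terms.items())
```

At stage 2 only the direction and gate terms would carry gradient, since the other two heads are frozen.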

Key locks across memos

Training budget

Stage | Epochs | lr | bs | Wall
Stage 1 (sim pretrain) | 300 | 3e-4 cosine + 2k warmup | 128 | ~1.0–1.3 A6000-days, bf16
Stage 2 (real fine-tune, dir + gate only) | 30–50 | 3e-5, no warmup | 128 | < 1 A6000-day

6.6 Domain randomization spec

Owner: VisionWorker · sinew-5.13 · v2f_dr_spec.md

Heavy visual DR, NO dynamics DR. F/T noise applied to labels, not inputs. 46-row master knob table organized in three schedule buckets:

Bucket | Knobs (count) | Where applied
per-episode | lighting (5), cam intrinsics (3), cam extrinsics (3), materials (8), placement (3), F/T bias + lag (2), MODE split (1); ~26 knobs | SceneConfigurator.randomize(step). Re-author USD attributes / lights / materials in place. USD writes > runtime set_focal_length calls.
per-frame (sim) | depth dropout + Gauss noise + clamp + RGBD jitter (4), F/T additive noise (2); 6 knobs | Recording loop, before writing the labeled tuple. Noise becomes part of the recorded label.
per-frame (train aug) | RGB brightness/contrast/hue/saturation/gauss/gamma/JPEG (7), chromatic aberration (1); 8 knobs | Training dataloader (torchvision.transforms.v2). Cheap; expands effective dataset.

Critical knob: depth dropout on low-texture pixels (knob #20). D405 is passive stereo: textureless 3D-printed pegs give sparse / noisy depth in real, but the sim depth is clean. Approximate by masking texture_grad < τ pixels and setting them to inf. Without this knob, any RGBD-using head will be sim-tuned.
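
A small sketch of the depth-dropout knob on plain nested lists. It assumes `texture_grad` is the central-difference gradient magnitude of a grayscale version of the RGB frame; the DR spec may define it differently, and `depth_dropout` is a hypothetical name.

```python
import math


def depth_dropout(depth, gray, tau):
    """Set depth to inf wherever the local grayscale gradient magnitude is below tau.

    depth and gray are equally sized 2D lists (rows of floats).
    """
    h, w = len(gray), len(gray[0])
    out = [row[:] for row in depth]  # copy so the clean depth is preserved
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border.
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            if math.hypot(gx, gy) < tau:
                out[y][x] = math.inf
    return out
```

In the recording loop an array-based version of the same logic would be faster; the loop form is used here only to keep the sketch dependency-free.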

Five categories elevated to hard-required (post sinew-5.16)

Per the visual-gap quantification: lighting + color jitter + material BRDF + camera intrinsic jitter + background clutter must all ship in the stage-1 DR set. None of these is "optional ablation."

6.7 Revised pipeline — branches A–E

Owner: VisionWorker · sinew-5.23 · supersedes sinew-5.10 + 5.11 + 5.13 · v2f_pipeline_revised.md

One pipeline, five outcome branches set by the sinew-5.17 FMB-only A6000 run result.

                     ┌─ A. Strong validation (real-only ≥0.85) ── drop sim stage 1; ship real-only
                     │
                     ├─ B. Validated, expected (≥0.70, <0.85)── staged pretrain-sim + fine-tune-real (canonical recipe)
[sinew-5.17 result]──┤
                     ├─ C. Marginal (≥0.70 global, <0.60 worst)─ per-shape ensembles OR more real data
                     │
                     ├─ D. Below-bar (0.55-0.70)─────────────── triage: unfreeze backbone, swap to ViT-B, more real data
                     │
                     └─ E. Failed (<0.55)────────────────────── HALT — frame audit, zero-bias check, linear probe
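
The branch thresholds in the tree can be read as a small decision function. This sketch assumes the score is the direction cos-sim on held-out real data, and that the worst-shape check for branch C takes priority over branch A when both apply; the tree does not state that ordering. `select_branch` is a hypothetical name.

```python
def select_branch(global_cos, worst_shape_cos):
    """Map the sinew-5.17 result to an outcome branch A-E."""
    if global_cos < 0.55:
        return "E"  # failed: halt and audit
    if global_cos < 0.70:
        return "D"  # below bar: triage
    if worst_shape_cos < 0.60:
        return "C"  # marginal: fine globally, weak on some shape (assumed priority)
    if global_cos >= 0.85:
        return "A"  # strong validation: real-only
    return "B"      # canonical staged recipe
```

For example, `select_branch(0.78, 0.66)` lands on branch B, the canonical recipe.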

Outcome B (the canonical, most-likely branch)

Why no F/T input or disturbance-conditioning input

Two related decisions that both prevent train-test divergence at deployment:

The 70% bar caveat

Direction Matters's 70% bar was derived under their own sim2real distribution. Our χ² = 1.88 makes it likely the absolute bar is closer to 0.55–0.65 in the worst case. Therefore:

Tier | Global cos-sim | Per-shape min | Action
Aspirational | ≥ 0.70 | ≥ 0.60 | Ship outcome B as-is
Acceptable | ≥ 0.60 | ≥ 0.45 with +0.05 fine-tune lift | Ship outcome B; flag the gap
Soft-fail | 0.55–0.60 | varies | Outcome C: per-shape ensembles
Hard-fail | < 0.55 | | Outcome E: halt + audit (frame + zero-bias + quat + linear-probe)

Until χ² is re-measured post-DR, the primary metric is monotonic improvement after fine-tune. If real fine-tune doesn't lift the score over the stage-1 sim-pretrain checkpoint, the visual gap absorption isn't working.

7. Implementation sequencing — five waves

The research epic produces design memos. This section lays out the impl tickets that follow, sequenced by dependency. None are filed yet; by convention they get filed when the research epic closes and the team transitions to impl.

Dependency graph (sinew-5.X memos → impl deliverables)

[Figure: dependency graph. Recoverable node labels are listed below; individual edges are not recoverable.]
Research memos (inputs): 5.3 + 5.19 substage · 5.21 sim F/T sensor · 5.22 §1 reward v2 · 5.4 multi-env handles · 5.16 + 5.6 scene + DR · 5.18 + 5.22 §3 SimDist · 5.2 eval protocol · 5.5 + 5.7 cams + schema
Wave 1, sim surface: SubstageDetector · read_eef_wrench_ee · grab_franka_view · Φ_insert v2 reward
Wave 2, RL data factory: PPO + curriculum · 11 sub-expert ckpts · CorruptedActionMapper · eval protocol
Wave 3, data-gen pass: FmbDataRecorder · NAS sync · HDF5 / TFDS shards · ~1.6M pairs target
Wave 4, v2f training: stage 1 sim pretrain · branch on 5.17 A–E · stage 2 real FT · ship predictor
Wave 5, real-robot eval: predicted F/T vs real K_F_ext_hat_K
Legend: ━━ critical path (Wave 1 → 2 → 3 → 4 → 5) · ── research-memo dependency

Wave-by-wave details

Wave 1 — sim surface (~3–5 person-days, no training)

  1. SubstageDetector class at isaac_twins/src/isaac_twins/fmb/substage.py per sinew-5.3 §4. Single-env. Offline unit tests + 2 runtime tests + the three pre-RL Q1 checks (threshold envelope, inverted-physics sanity, temporal smoothness flag).
  2. read_eef_wrench_ee API per sinew-5.21. Stateful when noisy=True (LP filter state owned by caller); stateless when noisy=False. Smoke test: push peg into board, confirm wrench magnitude grows monotonically with depth.
  3. Bake peg_tip_local_offset on each peg USD at author time (isaac_twins-37). Required by the insert peg-tip-z check.
  4. grab_franka_view(num_envs), SceneConfigurator.reset_episode(env_ids), obs packager — unblock multi-env scaling and DirectRLEnv migration.
  5. Φ_insert v2 reward fn per sinew-5.22 §1. Pure function (diagnostics, prev_diagnostics, primitive, action, safety_clip_delta) → float. Offline tests against synthetic histories.
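
Item 2 separates a stateless clean read from a stateful noisy read whose filter state the caller owns. The sketch below illustrates that ownership pattern for the label-noise step only: a constant bias, a first-order low-pass lag, and additive Gaussian noise. The function name, parameters, and noise model are assumptions for illustration and are not the locked read_eef_wrench_ee signature.

```python
import random


def noisy_wrench_label(clean, lp_state, bias, alpha, sigma, rng):
    """Turn a clean 6D wrench into a noisy, lagged label.

    lp_state is the previous low-pass output (or None on the first call);
    the caller keeps it between calls. Returns (label, new_lp_state).
    """
    if lp_state is None:
        lp_state = list(clean)  # start the filter at the first sample
    # First-order low-pass: alpha=1 means no lag, smaller alpha means more lag.
    new_state = [alpha * c + (1.0 - alpha) * s for c, s in zip(clean, lp_state)]
    label = [s + b + rng.gauss(0.0, sigma) for s, b in zip(new_state, bias)]
    return label, new_state


# Usage: one filter state per episode, reset to None at episode start.
rng = random.Random(0)
state = None
for depth_step in range(3):
    clean = [0.0, 0.0, float(depth_step), 0.0, 0.0, 0.0]
    label, state = noisy_wrench_label(clean, state, [0.0] * 6, 0.5, 0.01, rng)
```

Because the clean wrench is passed in untouched, the recorder can store both the clean value and the label, which keeps the noise on the label side as the two-gap separation requires.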

Wave 2 — RL data factory (~6 GPU-days at N=1, ~1.5 days at N=4)

  1. PPO training per sinew-5.2 logging spec, curriculum per sinew-5.1 §5.
  2. EEDeltaCorruptedActionMapper subclass per sinew-5.22 §3 (~0.5 d). Burst-state machine for the SimDist recipe.
  3. Sub-expert checkpoint preservation every 50 PPO iters → 11 checkpoints (0, 50, …, 2000).
  4. Curriculum advance gates per sinew-5.2 (IQM > 0.7, CIs not crossing 0.5 at N=5).
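
The advance gate in item 4 can be sketched with the standard library alone. This version uses a plain percentile bootstrap over per-seed success rates and reads "CIs not crossing 0.5" as "lower CI bound above 0.5"; both are assumptions of the sketch, and it is a stand-in for, not a copy of, the rliable-based protocol in sinew-5.2.

```python
import random


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25% of scores, average the rest."""
    s = sorted(scores)
    k = int(0.25 * len(s))
    mid = s[k:len(s) - k]
    return sum(mid) / len(mid)


def bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the IQM."""
    rng = random.Random(seed)
    stats = sorted(
        iqm([rng.choice(scores) for _ in scores]) for _ in range(n_boot)
    )
    lo = stats[int((1.0 - level) / 2.0 * n_boot)]
    hi = stats[min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))]
    return lo, hi


def may_advance(scores, iqm_bar=0.7, ci_floor=0.5):
    """Gate: IQM above the bar and the CI's lower bound above the floor."""
    lo, _ = bootstrap_ci(scores)
    return iqm(scores) > iqm_bar and lo > ci_floor
```

With N=5 seeds a single collapsed seed is enough to pull the lower bound under 0.5 even when the IQM itself looks healthy, which is the behaviour the CI condition is there to catch.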

Wave 3 — data-gen pass (~9 GPU-h + ~2 h FMB pull + ~2 h parse)

  1. FmbDataRecorder outer loop (~1.5 d) per sinew-5.22 §3. Burst-noise applied here, NOT during PPO training.
  2. HDF5 schema additions (~0.5 d): 4 obs/sim_* keys (sim_action_noised, sim_policy_iter, sim_never_noised, plus the F/T clean/label pair already in sinew-5.6) + 4 episode-meta labels.
  3. NAS sync driver per sinew-5.8 (nohup, plain FTP, per-episode opportunistic).
  4. FMB real-data filtered subset pull via HTTPS-range from gs://gresearch/robotics/fmb/0.0.1/ — 15–30 shards (~20–35 GB), insert-only-after-filter. ~30 min download + ~2 h parsing.
  5. Data-gen yield eval script (~0.5 d) checking pair count + contact transient fraction + F/T coverage + per-cam diversity.

Wave 4 — v2f training (~1–2 A6000-days stage 1 + ~1 day stage 2)

  1. Stage 1 sim pretrain per sinew-5.23 §3.2 / sinew-5.11. 300 epochs, bs=128, AdamW lr=3e-4 cosine + 2k warmup, bf16. All 3 heads + gate. Gate-head loss upweight ×1.5 on noised steps.
  2. Outcome-matrix branch on the sinew-5.17 FMB-only A6000 result (A–E).
  3. Stage 2 real fine-tune per sinew-5.23 §3.2. 30–50 epochs, lr=3e-5, no warmup. Direction + gate only; backbone, wrench head, contact-point head frozen. Non-optional per sinew-5.16.
  4. Stratified eval per-shape, magnitude-stratified, gate-F1. Held-out clean_expert slice (~1.25%).

Wave 5 — real-robot eval (deferred to next epic)

  1. Deploy predictor_real_finetune.pt on real Franka.
  2. Compare predicted F/T vs real K_F_ext_hat_K.
  3. Per-shape stratified bench.
  4. Real-robot data-collection pipeline + safety envelope + impedance gains tuning are the open chunks.

Compute envelope

Wave | Estimated time | Bottleneck
Wave 1 (impl) | ~3–5 person-days | SimWorker bandwidth
Wave 2 (RL training) | ~6 GPU-days at N=1, ~1.5 at N=4 | per-seed wall time
Wave 3 (data gen + pull + parse) | ~9 GPU-h + ~2 h pull + ~2 h parse | I/O on FMB pull
Wave 4 (v2f train) | ~1–2 A6000-days stage 1 + ~1 day stage 2 | training compute
Wave 5 (real eval) | deferred | real-robot access

8. Open questions and out-of-scope items

Items deliberately deferred or rejected, with the source memo. Kept tight; broad future-direction work belongs to the next epic.

Deferred (will revisit in impl)

Item | Source | When
χ² re-measurement post-DR + post-Tier-1 scene fix | sinew-5.16 §5 follow-up #4 | After Wave 3 data-gen lands
Detector ↔ recorded-label cross-check (addenda 1.3) | addenda Q1 | Only if non-determinism observed in recordings
Per-shape threshold ablation (addenda 1.6) | addenda Q1 | After FMB raw arrives
Soft-success bonus shape | sinew-5.1 §8 #2 | First training-run data
Ablations: backbone freeze/unfreeze, RGB vs RGBD, temporal stack T=1 vs T=4, contact-point per-pixel vs voxel | sinew-5.11 §6 | Wave 4 (if stage 1 underperforms)
Production asset root pin | isaac_twins-36 | Before any S3-staging issue resurfaces

Rejected for sinew scope (next epic or out-of-scope)

Item | Reason
FMB trajectory replay reliability (~4 days to fix) | Not needed for v2f-end-goal; state-only adapter covers substage verification (sinew-5.20)
SimDist latent world model + MPC planning | Out of v2f scope (sinew-5.18)
Force-side domain adaptation | Two-gap separation: F/T gap closed by label noise, not domain adaptation
PathTracing render | 3–5× render-cost penalty; throughput killer (addenda Q3)
Hand approximation in sim | Stage-2 real fine-tune absorbs this; modelling a human is scope creep (addenda Q3, sinew-5.16 §3.2)
Multi-object FMB assemblies (stage 2) | Different contact physics; 7,200 demos / 233 GB; defer indefinitely
Real-robot validation epic | Its own future epic; sinew-5.2 protocol carries over once real eval starts

9. References

Internal memos

File | Issue | Owner | Scope
docs/researcher.md | sinew-1.1 | Researcher | FMB deep-dive + force-from-vision landscape v1
docs/fmb_reference.md | sinew-1.1 + 1.5 | Researcher | FMB cheat-sheet; §3.1 EE-frame canonical; §11 cam LR mapping
docs/sim_worker.md | sinew-1.2 + 1.4 | SimWorker | isaac_twins audit
docs/rl_worker.md | sinew-1.3 | RLWorker | isaaclab_sinew bootstrap
docs/rl_action_layer_sketch.md | sinew-2 | RLWorker | EE-delta action layer (Plan A DIK)
docs/research/rl_eval_protocol.md | sinew-5.2 | RLWorker | IQM + bootstrap CIs, PPO defaults
docs/research/sim_substage_detection.md | sinew-5.3 | SimWorker | Substage detector spec
docs/research/rl_reward_design.md | sinew-5.1 | RLWorker | PBRS + sparse + curriculum (Φ_insert §3.5 superseded)
docs/research/il_bc_warmstart.md | sinew-5.12 | RLWorker | BC plan (dropped per sinew-5.22 §2)
docs/research/fmb_sim_data_match.md | sinew-5.7 | Researcher | 3 FMB schemas, sim canonical, libfranka extras
docs/research/v2f_lit_v2.md | sinew-5.9 | VisionWorker | 11-paper impl-detail review
docs/research/v2f_data_leverage.md | sinew-5.10 | Researcher | Stage-1/stage-2 partition (revised by sinew-5.23)
docs/research/v2f_arch.md | sinew-5.11 | VisionWorker | 3-head architecture (still binding)
docs/research/v2f_dr_spec.md | sinew-5.13 | VisionWorker | 46-knob DR table
docs/research/camera_subset_benchmark.md | sinew-5.5 | SimWorker | Camera subset perf bench
docs/research/sim_recording_spec.md | sinew-5.6 | SimWorker | Recording schema + format choice
docs/research/data_pipeline.md | sinew-5.8 | SimWorker | NAS + DL_A6000 pipeline
docs/research/parallel_dev_architecture.md | sinew-5.4 | SimWorker | Two-repo split, hot rules
docs/research/sim2real_visual_gap.md | sinew-5.16 | Researcher | χ²=1.88 quantification
docs/research/v2f_fmb_only_feasibility.md | sinew-5.17 | VisionWorker | FMB-only feasibility spec + outcome matrix
docs/research/sim_dist_review.md | sinew-5.18 | Researcher | SimDist action-burst recipe
docs/research/substage_verification.md | sinew-5.19 | SimWorker | State-only adapter for offline calibration
docs/research/fmb_replay_feasibility.md | sinew-5.20 | SimWorker | Honest negative on FMB replay
docs/research/sim_ft_sensor.md | sinew-5.21 | SimWorker | read_eef_wrench_ee API + 4-stage pipeline
docs/research/rl_revised_plan.md | sinew-5.22 | RLWorker | Reward v2, BC drop, SimDist adoption, labels
docs/research/v2f_pipeline_revised.md | sinew-5.23 | VisionWorker | 5-outcome branches; canonical recipe
docs/research/substage_and_scene_addenda.md | addenda | SimWorker | Q1 gate checks, Q2 combining, Q3 scene tiers
isaac_twins/scripts/validate_fmb_schema.py | sinew-5.7 | Researcher | FMB-schema validator (3-mode autodetect)

Cloned code (in isaac_twins/references/)

Repo | Path | Why
FoAR (He et al. 2024) | v2f_lit_v2/code/FoAR/ | In-contact-gate pattern, ResNet18 + F/T fusion
Reactive Diffusion Policy (Xue et al. 2025) | v2f_lit_v2/code/reactive_diffusion_policy/ | 2-stage VAE + latent-DDPM, slow/fast hierarchy
ForceSight (Collins et al. 2023) | v2f_lit_v2/code/forcesight/ | RGBDDinov2 backbone, ThreeHeadMLP
SimDist (CLeARoboticsLab) | references/sim_dist/ | Action-burst recipe (kept; latent WM dropped)
rliable (Agarwal et al. 2021) | eval/rliable/ | IQM + stratified bootstrap CI
FMB (Luo et al. 2024) | fmb/ | Authoritative FMB code
realsense-ros | realsense-ros/ | D405 driver reference

Paper PDFs (in isaac_twins/references/v2f_lit_v2/papers/)

External anchors


Generated 2026-05-21 by the sinew agent team. Final state: sinew-1 env setup (5 children) and sinew-5 research epic (14 v1 children + 8 reopened-wave children + 5.14 synthesis) all closed. Implementation phase begins with Wave 1 (sim surface). Source memos under docs/research/; cloned references under isaac_twins/references/; this report lives at docs/r2s2r_research_report.html.