sinew R2S2R

Force-from-Vision on FMB

Executive plan view - 2026-05-21

sinew team - team-lead, Researcher, SimWorker, RLWorker, VisionWorker

Deep-dive companion: r2s2r_research_report.html

The deliverable

A v2f predictor: RGBD → end-effector wrench, deployable on a real Franka Panda.

What ships

predictor_real_finetune.pt - DINOv2-S backbone, RGBD patch embed, 4 heads (force-direction, 6D wrench, contact-point, in-contact gate)
Inference at 10 Hz from 2 D405 cameras (side_left + wrist_left) on real FMB testbed
Trained on a sim corpus generated entirely inside the sinew stack, fine-tuned on the FMB-real insert subset

What the predictor does NOT consume

No F/T input - RGBD in, wrench out. Single-direction flow.
No proprio. No commanded action. No history beyond a per-frame RGBD tensor.

Why force-from-vision

Contact-rich manipulation - peg insertion, assembly, tool-use - depends on force feedback. Real F/T sensors are noisy, slow, and absent on most low-cost hardware.

Approach	What it needs	Limitation
Real F/T sensor	Wrist-mounted load cell or Franka `K_F_ext_hat_K`	Noisy, lagged, biased; not on most arms
Tactile gripper	Custom fingers (GelSight, DIGIT, etc.)	Hardware lock-in; contact-only signal
v2f predictor (sinew)	RGBD cameras (already on the testbed)	Direction transfers sim→real; magnitude needs real anchor

FMB upstream is a benchmark for contact-rich insertion but ships no force-prediction baseline. sinew fills that gap with a predictor trained on sim-generated (image, force) pairs and anchored on the FMB real subset.

Three building blocks, one ship artifact

Goal 1 - RL data factory → Goal 2 - Sim dataset → Goal 3 - v2f predictor → Downstream / real

Two gaps, separately handled

Visual and F/T sim-to-real gaps need different fixes. Conflating them is the easy mistake.

Gap	What it is	How sinew closes it
Visual	Sim renders too clean: chi2=1.88, edges +39%, brightness skew 35-62%	Heavy visual DR + Tier-1 scene authoring. Stage-2 real fine-tune absorbs residual.
F/T	Real Franka `K_F_ext_hat_K` is noisy + biased + lagged. Sim is pristine.	Noise/bias/lag model on the recorded label, not on inputs. No domain adaptation for force.

Visual gap → real fine-tune required

Fix lives in the image pipeline: heavy DR sim-side + stage-2 backbone freeze + direction-head FT (v2f stage 2).

F/T gap → label-side noise only

Fix lives in the recorder: noise/bias/lag injected on label, never on input. No force-side domain adaptation (sim labels).

Predictor sees clean sim images, learns to match noisy real-Franka F/T. The visual gap requires a real fine-tune; the F/T gap is handled entirely inside the sim label pipeline.

Sim environment foundation

Goal 2 - isaac_twins + isaaclab_sinew

cfg = FmbInsertionEnvCfg()  # defaults: fmb_big_demo + big_long/rect peg
                            # + side+wrist cams
env = FmbInsertionEnv(cfg)
obs, info = env.reset()

# obs keys: side_left, side_right, wrist_left, wrist_right (RGBA 720x1280),
#           q (7,), dq (7,), tcp_pose (7,),
#           tcp_force (3,), tcp_torque (3,),
#           gripper_pose (1,)

for _ in range(N):
    action = policy(obs)          # 7-vec in [-1, 1]
    obs, rew, term, trunc, info = env.step(action)

Key API surface

read_eef_wrench_ee(art, sensor, *, noisy, state, rng) - the sim F/T pipeline (clean + noisy paths)
SubstageDetector at isaac_twins/src/isaac_twins/fmb/substage.py - 5 primitive predicates, single source of truth for reward + labels
grab_franka_view(num_envs), SceneConfigurator.reset_episode(env_ids) - multi-env handles

How we detect substages

Goal 2 - sim_substage_detection.md + addenda Q1/Q2

Three ContactSensors cover all five primitives

finger_contact - antagonistic finger-force pattern → grasp closed
peg_contact - per-partner zero/non-zero distinguishes floor / fixture / hole. The substage-defining signal.
board_contact - sanity heartbeat (board should stay seated)

Predicate accuracy bounds (full-state detector)

Primitive	FP	FN
`grasp`	1-3%	2-5%
`place_on_fixture`	2-5%	5-10%
`rotate`	5-10%	10-20%
`regrasp`	3-7%	5-10%
`insert`	5-10%	10-15%

Detector is single-source-of-truth for reward gates AND recorded v2f labels. Threshold-envelope assertion + inverted-physics probes + temporal-smoothness check land pre-RL (~2 days, addenda Q1).

Sim F/T sensor pipeline

Goal 2 - sim_ft_sensor.md - replicates Panda K_F_ext_hat_K

Two callsites, one API

noisy=False → reward, SubstageDetector, v2f direction-head label (geometry-derived)
noisy=True → policy obs, v2f wrench-head label (matches real Franka distribution)

Force sigma=0.025 N/axis, torque sigma=0.01 Nm/axis (Franka resolution /2). Per-episode bias drift sigma=0.05 N / 0.02 Nm. LP lag tau_lag = U(20, 80) ms.
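The noise parameters above can be sketched as a small stateful model. This is a minimal illustration, not the internals of `read_eef_wrench_ee`: the names `NoisyWrenchState`, `new_episode_state`, and `noisy_wrench` are hypothetical, the bias is assumed to be a single draw held constant for the episode, the lag is assumed to be a first-order low-pass, and the order (lag, then bias, then white noise) is an assumption.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Values quoted from the plan.
FORCE_SIGMA_N = 0.025         # white noise per axis
TORQUE_SIGMA_NM = 0.01
FORCE_BIAS_SIGMA_N = 0.05     # per-episode bias
TORQUE_BIAS_SIGMA_NM = 0.02
TAU_LAG_RANGE_S = (0.020, 0.080)


@dataclass
class NoisyWrenchState:
    """Per-episode state (hypothetical): bias draw, lag constant, filter memory."""
    bias: List[float]
    tau_lag: float
    filtered: Optional[List[float]] = None


def new_episode_state(rng: random.Random) -> NoisyWrenchState:
    bias = [rng.gauss(0.0, FORCE_BIAS_SIGMA_N) for _ in range(3)]
    bias += [rng.gauss(0.0, TORQUE_BIAS_SIGMA_NM) for _ in range(3)]
    return NoisyWrenchState(bias=bias, tau_lag=rng.uniform(*TAU_LAG_RANGE_S))


def noisy_wrench(clean, state, rng, dt):
    """Map a clean 6D wrench [f; tau] to a noisy, biased, lagged one.

    dt is the time between calls in seconds (assumed to be the sim tick).
    """
    if state.filtered is None:
        state.filtered = list(clean)
    # Discrete first-order low-pass with time constant tau_lag.
    alpha = dt / (state.tau_lag + dt)
    state.filtered = [y + alpha * (x - y) for x, y in zip(clean, state.filtered)]
    sigmas = [FORCE_SIGMA_N] * 3 + [TORQUE_SIGMA_NM] * 3
    return [y + b + rng.gauss(0.0, s)
            for y, b, s in zip(state.filtered, state.bias, sigmas)]
```

Keeping the bias and lag in an explicit state object mirrors the "stateful when noisy, stateless when clean" split: the clean path needs no memory, while the noisy path must carry the filter output between ticks.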

RL as a data factory

Goal 1 - rl_revised_plan.md + sim_dist_review.md

Element	Locked value
Algorithm	PPO with Andrychowicz 2020 defaults; SAC fallback if PPO plateaus below 50% success
Curriculum	grasp -> +place -> +rotate -> +regrasp -> +insert; phase advance on IQM > 0.7
Checkpoint preservation	Every 50 PPO iters → 11 sub-expert checkpoints (iters 0, 50, ..., 2000)
Data-gen disturbance	Action-only Gaussian burst, ~9% noised fraction, gripper bit excluded, 2.5% never-noised
Policy mixture at data-gen	50% expert + 50% from 11 sub-expert ckpts; per-env assignment persists per episode
Disturbance applied	Data-gen pass only - NOT during PPO training
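The disturbance and policy-mixture rows above can be sketched as follows. This is an illustration under stated assumptions, not the `EEDeltaCorruptedActionMapper` implementation: the burst length, the noise sigma, the clamp to [-1, 1], and the reading of "2.5% never-noised" as a per-episode draw are all assumptions, and the function names are hypothetical.

```python
import random

NOISED_FRACTION = 0.09         # from the plan
NEVER_NOISED_FRACTION = 0.025  # from the plan
NUM_SUB_EXPERTS = 11           # from the plan
BURST_LEN = 5                  # assumption: steps per burst
ACTION_SIGMA = 0.3             # assumption: normalized action units


def burst_start_prob(target_fraction, burst_len):
    """Per-step start probability that yields the target noised fraction.

    With fixed-length bursts the long-run fraction is p*L / (p*L + 1 - p);
    solving for p gives the expression below.
    """
    return target_fraction / (burst_len * (1.0 - target_fraction) + target_fraction)


def assign_episode(rng):
    """Draw the per-episode policy and never-noised flag (persist for the episode)."""
    if rng.random() < 0.5:
        policy = "expert"
    else:
        policy = f"sub_expert_{rng.randrange(NUM_SUB_EXPERTS)}"
    return {"policy": policy,
            "never_noised": rng.random() < NEVER_NOISED_FRACTION,
            "burst_left": 0}


def corrupt_action(action, ep, rng, p_start):
    """Return (action, noised_flag) for a 7-vec action; index 6 is the gripper bit."""
    if ep["never_noised"]:
        return list(action), 0
    if ep["burst_left"] == 0 and rng.random() < p_start:
        ep["burst_left"] = BURST_LEN
    if ep["burst_left"] == 0:
        return list(action), 0
    ep["burst_left"] -= 1
    noisy = [max(-1.0, min(1.0, a + rng.gauss(0.0, ACTION_SIGMA)))
             for a in action[:6]]
    return noisy + [action[6]], 1  # gripper bit passes through untouched
```

The returned flag is what a recorder would write as the per-tick noised marker; the function is meant to be called only in the data-gen pass, never inside the PPO training loop.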

Yield gates (data-gen pass)

≥ 1.6M (image, force) pairs total
≥ 5% contact-transient frames
F/T magnitude coverage [0, 30] N
Per-cam diversity chi2 ≥ 0.3
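A yield check over these four gates might look like the sketch below. The function name and argument layout are hypothetical, and "coverage [0, 30] N" is interpreted here as "every 5 N bin in that range has at least one sample", which is an assumption rather than the plan's definition.

```python
def check_yield_gates(total_pairs, contact_transient_frames,
                      force_mags_n, per_cam_chi2, bin_width_n=5.0):
    """Return a pass/fail flag per yield gate.

    force_mags_n: iterable of force magnitudes in newtons.
    per_cam_chi2: dict of camera name -> diversity chi2.
    """
    n_bins = int(30.0 / bin_width_n)
    hit = [False] * n_bins
    for m in force_mags_n:
        if 0.0 <= m < 30.0:
            hit[int(m // bin_width_n)] = True
    transient_frac = contact_transient_frames / max(1, total_pairs)
    return {
        "total_pairs": total_pairs >= 1_600_000,
        "contact_transient": transient_frac >= 0.05,
        "force_coverage": all(hit),  # assumed coverage definition
        "per_cam_diversity": all(v >= 0.3 for v in per_cam_chi2.values()),
    }
```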

Sim dataset shape

Goal 2 - sim_recording_spec.md + rl_revised_plan.md §4

Knob	Value
Episodes	5-10k `insert`-primitive rollouts (FMB-equivalent at 22k eps)
Camera config	`2mixed_rgbd` (`side_left` + `wrist_left`); `3cam_rgbd` for data leverage
Resolution	224 (DINOv2-S patch match); 256 opt-in ablation
Storage	~243 GB at 224 - ~45% of FMB upstream's 545 GB zip
Schema	FMB-RLDS parity + `obs/sim_*` extras; validator (3-mode autodetect)

Per-episode trajectory labels (4-way)

Label	Definition	Frac	v2f use
`successful`	`insert.success() == True`	~60%	high-quality direction labels
`disturbed`	any tick with `sim_action_noised == 1`	~55%	off-policy + contact-transient diversity
`failed`	not successful	~40%	off-manifold coverage; gate negatives
`clean_expert`	successful AND not disturbed AND policy=expert	~1.25%	held-out nominal-regime eval slice
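The four labels follow directly from the definitions in the table. A minimal sketch, with a hypothetical function name and argument layout:

```python
def episode_labels(insert_success, action_noised_flags, policy_name):
    """Derive the 4-way episode labels from the table definitions.

    insert_success: result of insert.success() for the episode.
    action_noised_flags: per-tick sim_action_noised values (0 or 1).
    policy_name: "expert" or a sub-expert identifier (naming is an assumption).
    """
    successful = bool(insert_success)
    disturbed = any(flag == 1 for flag in action_noised_flags)
    return {
        "successful": successful,
        "disturbed": disturbed,
        "failed": not successful,
        "clean_expert": successful and not disturbed and policy_name == "expert",
    }
```

Note that the labels are not mutually exclusive: `successful` and `failed` partition the corpus, while `disturbed` overlaps both, which is why the fractions in the table sum to more than 100%.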

v2f architecture (~43M params)

Goal 3 - v2f_arch.md + v2f_pipeline_revised.md - 22M trainable + 21M frozen

Component	Value
Backbone	DINOv2-S ViT-S/14, 384-d features, frozen (depth channel trainable)
Input patch embed	4-channel RGBD patch embed; depth conv1 init from RGB mean (ForceSight pattern)
Cross-cam fusion	2-layer transformer encoder, 4 heads, GELU
Cameras	`side_left` + `wrist_left` RGBD 224x224, letterboxed

Four heads

Head	Output	Loss	Weight
force-direction (load-bearing)	unit-vec 3D in EE frame	L1 on coords, NaN-masked when ||F|| < 8 N	1.0
6D wrench	EE-frame [f; tau], noisy-lagged label	MSE, gate-gated	0.1
contact-point	per-pixel 64x64 heatmap on wrist cam	BCE + soft-L2	0.5
in-contact gate	binary probability	BCE (FoAR pattern)	0.1

"Force direction transfers sim→real; magnitude does not" (Direction Matters) - dictates head weights, freeze list, and DR-vs-no-DR split.
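The masking rule for the direction head and the head weighting can be illustrated in plain Python. This is a sketch, not the training code: averaging the L1 over the three coordinates and returning 0 for an all-masked batch are assumptions, and the function names are hypothetical.

```python
import math

FORCE_GATE_N = 8.0  # direction label is undefined below this magnitude
HEAD_WEIGHTS = {"direction": 1.0, "wrench": 0.1, "contact_point": 0.5, "gate": 0.1}


def masked_direction_l1(pred_dirs, true_forces):
    """Mean L1 between predicted unit vectors and force directions.

    Samples with ||F|| < 8 N are skipped, mirroring the NaN mask.
    """
    total, count = 0.0, 0
    for pred, force in zip(pred_dirs, true_forces):
        mag = math.sqrt(sum(c * c for c in force))
        if mag < FORCE_GATE_N:
            continue  # masked: too little force to define a direction
        target = [c / mag for c in force]
        total += sum(abs(p - t) for p, t in zip(pred, target)) / 3.0
        count += 1
    return total / count if count else 0.0


def total_loss(per_head_losses):
    """Weighted sum of the four head losses using the table's weights."""
    return sum(HEAD_WEIGHTS[name] * per_head_losses[name] for name in HEAD_WEIGHTS)
```

The mask matters because a near-zero force vector normalizes to an arbitrary direction; training on those samples would inject label noise into the head the plan treats as load-bearing.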

v2f training schedule

Goal 3 - v2f_pipeline_revised.md §3

Stage	Data	Trained	Frozen
1: sim pretrain	1.6M (image, force) pairs, 2-cam RGBD, clean+noisy GT wrench, heavy visual DR	backbone (depth ch) + all 4 heads + gate	DINOv2 RGB weights
2: real fine-tune	~3-4k FMB `insert` real eps x ~100 steps x 2cam RGB, EE-frame F/T zero-bias-subtracted	direction + gate heads only	backbone, wrench head, contact-point head

Why this partition

Direction transfers - geometric constraint normals are sim/real-identical
Magnitude doesn't - real F/T has payload model error, gravity-comp residual, thermal drift
Contact-point doesn't have a real label - no per-pixel "contact happened here" ground truth on FMB-real
Backbone frozen at stage 2 - protects visual-DR features from catastrophic forgetting

Stage 1: 300 epochs, AdamW lr=3e-4 cosine + 2k warmup, bs=128, bf16. Stage 2: 30-50 epochs, lr=3e-5, no warmup. Output: predictor_real_finetune.pt - the ship artifact.
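The stage-1 learning-rate schedule can be sketched as a pure function of the optimizer step. Assumptions: "2k warmup" means 2000 optimizer steps of linear warmup, and the cosine decays to zero. The plan does not state the stage-2 decay shape, so reusing the cosine there with `warmup_steps=0` is only one possible reading.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=2000, min_lr=0.0):
    """Linear warmup followed by cosine decay.

    step: zero-based optimizer step.
    total_steps: total optimizer steps in the run.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Stage 1: peak 3e-4 after 2000 warmup steps.
stage1_peak = lr_at(1999, total_steps=10_000)
# Stage 2: starts directly at 3e-5 with no warmup.
stage2_start = lr_at(0, total_steps=1_000, base_lr=3e-5, warmup_steps=0)
```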

Visual sim-to-real strategy

Goal 3 - v2f_dr_spec.md + sim2real_visual_gap.md - heavy visual DR, NO dynamics DR

Bucket	Knobs	Where applied
Per-episode (~26)	lighting, cam K, cam extrinsics, materials, placement, F/T bias+lag, mode, DR profile	`SceneConfigurator.randomize(step)`; prefer USD writes over runtime calls
Per-frame (sim, ~6)	depth dropout + Gauss + clamp + RGBD jitter, F/T additive	Recording loop, before writing labeled tuple
Per-frame (train aug, ~8)	brightness/contrast/hue/sat/gauss/gamma/JPEG, chromatic aberration	Training dataloader (`torchvision.transforms.v2`)

Critical knob: depth dropout on low-texture pixels. D405 is passive stereo: textureless 3D-printed pegs yield sparse depth on the real rig but dense, clean depth in sim. Without this knob, any RGBD head overfits to sim-quality depth.

Five hard-required categories: lighting + color jitter + material BRDF + cam intrinsic jitter + background clutter. Tier-1 scene fixes (cable mesh, lab clutter, background plane) land alongside DR to drive chi2 from 1.88 to ~1.4.

Locked decisions (1 of 3)

Lock	Value
End deliverable	v2f predictor (RGBD → wrench). RL policy is not shipped.
Sim F/T provenance	Isaac contact reporter via `read_eef_wrench_ee`. Never vision-predicted.
F/T frame	EE frame end-to-end; 6x6 EE->base adjoint applied only at FMB-checkpoint boundary.
Quat order	`(qx, qy, qz, qw)` everywhere.
Image storage	BGR on disk -> RGB at parse time (FMB convention).

Locks compress design debate into known frames. Re-litigation gate: open a new beads issue, don't rewrite the lock in place.
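For the F/T frame lock, the EE-to-base conversion at the FMB-checkpoint boundary is a standard wrench change of frame. The sketch below assumes the wrench is ordered [f; tau], that `R_base_ee` and `p_base_ee` are the rotation and position of the EE frame expressed in the base frame, and that the reference point moves to the base origin; whether FMB expects that last step is an assumption. All function names are hypothetical.

```python
def cross(a, b):
    """3D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def wrench_ee_to_base(wrench_ee, R_base_ee, p_base_ee):
    """Express an EE-frame wrench [f; tau] in the base frame.

    f_base   = R f_ee
    tau_base = R tau_ee + p x (R f_ee)
    which is the block form of the 6x6 adjoint acting on a wrench.
    """
    f_ee, tau_ee = wrench_ee[:3], wrench_ee[3:]
    f_base = matvec(R_base_ee, f_ee)
    lever = cross(p_base_ee, f_base)
    tau_base = [t + l for t, l in zip(matvec(R_base_ee, tau_ee), lever)]
    return f_base + tau_base
```

Applying this only at the export boundary keeps every internal consumer (reward, recorder, predictor labels) in the EE frame, so there is a single place where a frame mix-up can occur.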

Locked decisions (2 of 3)

Lock	Value
Action contract	7-vec EE-delta normalized [-1, 1]; scaled ±0.06 m / ±0.25 rad / gripper bit; 10 Hz; base frame.
SimDist recipe	Action-only burst, ~9% noised, gripper bit excluded, 2.5% never-noised, data-gen only.
Stage-2 real fine-tune	Non-optional. Direction + gate heads only; backbone, wrench, contact-point frozen.
Recording resolution	224 (DINOv2-S patch match) is v1 default; 256 is opt-in ablation.
Camera subset (v2f)	`2mixed_rgbd` minimum (`side_left + wrist_left`); `3cam_rgbd` for data leverage.
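The action contract can be made concrete with a small scaling helper. The 3 + 3 + 1 layout (translation, rotation, gripper), per-step scaling, and the gripper sign convention are assumptions; the function name is hypothetical.

```python
POS_SCALE_M = 0.06     # from the action contract
ROT_SCALE_RAD = 0.25   # from the action contract


def scale_action(action):
    """Map a normalized 7-vec in [-1, 1] to physical EE deltas in the base frame.

    Returns (dpos_m, drot_rad, gripper_closed).
    """
    if len(action) != 7:
        raise ValueError("expected a 7-vector")
    clipped = [max(-1.0, min(1.0, a)) for a in action]
    dpos_m = [POS_SCALE_M * a for a in clipped[:3]]
    drot_rad = [ROT_SCALE_RAD * a for a in clipped[3:6]]
    gripper_closed = clipped[6] > 0.0  # assumption: positive means close
    return dpos_m, drot_rad, gripper_closed
```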

Locked decisions (3 of 3)

Lock	Value
Noisy / clean wrench discipline	Noisy → policy obs + v2f wrench label. Clean → reward + substage detector + direction-head label.
Reward shaping	PBRS with clean wrench; `Phi_insert` force coefficient 0.2; gates on alignment + seat depth.
Substage detector role	Full-state detector canonical for reward + recorded labels; state-only adapter for offline audit only.
RL evaluation	IQM + 95% stratified bootstrap CIs (never mean/median); P(A > B) > 0.7; N=5 seeds default.
v2f primary metric	Monotonic improvement post-fine-tune; aspirational direction cos-sim ≥ 0.70 global / ≥ 0.60 worst shape.

The two F/T disciplines (noisy/clean + EE-frame everywhere) are the most-touched rules - they thread through reward, recorder, predictor labels, and policy obs.
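The RL evaluation lock (IQM, stratified bootstrap CIs, P(A > B)) can be sketched with the standard library alone. This is an illustrative version under assumptions: strata are taken to be whatever grouping the evaluation uses (for example, one list of per-seed scores per primitive), the CI is a plain percentile bootstrap, and ties in P(A > B) count as one half.

```python
import random


def iqm(scores):
    """Interquartile mean: mean of the middle 50% of the sorted scores."""
    s = sorted(scores)
    k = int(0.25 * len(s))
    middle = s[k:len(s) - k]
    return sum(middle) / len(middle)


def stratified_bootstrap_ci(scores_by_stratum, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the IQM, resampling with replacement within each stratum."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        pooled = []
        for runs in scores_by_stratum.values():
            pooled.extend(rng.choices(runs, k=len(runs)))
        stats.append(iqm(pooled))
    stats.sort()
    lo = stats[int((alpha / 2.0) * n_boot)]
    hi = stats[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return lo, hi


def probability_of_improvement(scores_a, scores_b):
    """Fraction of (a, b) pairs with a > b; ties count as 0.5."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in scores_a for b in scores_b)
    return wins / (len(scores_a) * len(scores_b))
```

With N=5 seeds the bootstrap interval will be wide; that is the point of reporting it instead of a bare mean or median.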

The sim-to-real visual gap

Goal 3 - sim2real_visual_gap.md - chi2=1.88, FoAR threshold for "visibly distinct" is 1.0

Metric	Sim	Real	Gap
Color hist chi2 (mean 4 cams)	-	1.88	Above FoAR chi2=1.0 "visibly distinct"
Edge density	0.056	0.079	Real +39%
Per-channel brightness (RGB)	(152, 159, 152)	(94, 117, 104)	Sim 1.35-1.62x brighter
Per-channel std (RGB)	(40, 37, 38)	(51, 56, 55)	Real 30-52% wider

Per-camera χ² (wrist pair worst on average) - FoAR "visibly distinct" threshold = 1.0

Cam	χ²
`side_left`	2.10
`side_right`	1.35
`wrist_left`	2.00
`wrist_right`	2.06

Cam	Sim edge	Real edge	Δ edge
`side_left`	0.059	0.069	+17%
`side_right`	0.049	0.057	+16%
`wrist_left`	0.055	0.083	+51%
`wrist_right`	0.063	0.105	+67%

Headline: wrist cams have the worst gap because they see hand, fingers, board screws, peg layer-lines - all of which the sim does not model. This is what makes a real fine-tune non-optional and what the contact-point head freezes against.
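The Δ edge column is a plain relative difference, (real - sim) / sim. The short check below recomputes it from the per-camera values in the table; only the helper name is introduced here.

```python
# (sim edge density, real edge density) per camera, copied from the table.
EDGE_DENSITY = {
    "side_left": (0.059, 0.069),
    "side_right": (0.049, 0.057),
    "wrist_left": (0.055, 0.083),
    "wrist_right": (0.063, 0.105),
}


def relative_gap_pct(sim, real):
    """Percentage by which the real value exceeds the sim value."""
    return 100.0 * (real - sim) / sim


per_cam_gap = {cam: round(relative_gap_pct(sim, real))
               for cam, (sim, real) in EDGE_DENSITY.items()}

# Aggregate gap from the per-camera means.
mean_sim = sum(sim for sim, _ in EDGE_DENSITY.values()) / len(EDGE_DENSITY)
mean_real = sum(real for _, real in EDGE_DENSITY.values()) / len(EDGE_DENSITY)
mean_gap = round(relative_gap_pct(mean_sim, mean_real))
```

This reproduces the +17% / +16% / +51% / +67% column, and the four-camera means (0.0565 sim, 0.0785 real) give the +39% aggregate quoted earlier.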

Sim vs real - 4-camera grid

Scene fix plan

Goal 3 - substage_and_scene_addenda.md Q3 - leverage x inverse-cost ranking

Tier 1 (~2.5 days, biggest leverage)

Fix	Time	delta chi2 mean	delta chi2 wrist
Procedural cable mesh in wrist FOV	1 day	-0.05 to -0.1	-0.3 to -0.5
Lab-clutter distractor spawning	0.5 day	-0.15 to -0.25	-0.05
Background plane workshop texture	0.5 day	-0.1 to -0.2	-0.05

Expected χ² trajectory - before → after (target ≤ χ²=1.0)

Stage	Mean χ²	Wrist χ²
Pre-fix	1.88	~2.0
Tier 1	~1.4	~1.4
Tier 1+2	~1.0-1.2	~1.1-1.3

Out of scope: PathTracing (3-5x render cost), hand simulation (stage-2 FT carries this), photographed bin texture (not on critical path).

Six epics, 37 tasks, ~26 wall-days floor

impl_epic_plan.md - one wave per epic; Wave 4 splits into 4a sim pretrain + 4b real fine-tune

Epic	Scope	Owner	Wall-days	GPU-days
A Wave 1	Sim surface foundation	SimWorker (+ RLWorker)	~4	0
B Wave 2	RL data factory	RLWorker (+ SimWorker, Researcher)	~6	~4
C Wave 3	Data-gen pass	SimWorker (+ RLWorker)	~3	<1
D Wave 4a	v2f sim pretrain	VisionWorker	~3	~1-2
E Wave 4b	v2f real fine-tune	VisionWorker (+ Researcher)	~3	~1
F Wave 5	Real-robot eval (hardware-gated)	VisionWorker (+ User)	~7	0

Aggregate: ~7 GPU-days across Epics B + D + E; ~26 wall-days floor if teammates fully available + GPU not contested; ~45 days realistic with contention.

Dependency graph

Epic A - Wave 1 sim surface foundation

~4 wall-days - 0 GPU-days - SimWorker (+ RLWorker)

Deliverables (11 tasks)

SubstageDetector class at isaac_twins/src/isaac_twins/fmb/substage.py per spec - offline unit tests + Kit runtime smokes
Threshold-envelope startup assertion + inverted-physics probes (5 per primitive) + temporal-smoothness check
Bake peg_tip_local_offset attribute on each peg USD via the author script
read_eef_wrench_ee(art, sensor, *, noisy, state, rng) - stateful when noisy=True, stateless when noisy=False; smoke: push peg into board, wrench grows monotonically with depth
Multi-env handles: batched grab_franka_view(num_envs), SceneConfigurator.reset_episode(env_ids), observation packager
Phi_insert reward function (clean wrench, PBRS form) - offline unit tests against synthetic histories
Reward-decomposition logging hook - every 10K steps writes all 8 reward components + per-primitive share dict

Pass criterion: Integration smoke runs FmbInsertionEnv -> reset -> 100 steps -> reward returns correct decomposition AND read_eef_wrench_ee returns non-zero on contact in the same loop. Test suite green.

Epic B - Wave 2 RL data factory

~6 wall-days + ~4 GPU-days - RLWorker (+ SimWorker, Researcher)

Deliverables (10 tasks)

Promote FmbInsertionEnv to DirectRLEnv subclass; multi-env smoke at N=4
EEDeltaCorruptedActionMapper subclass with per-DOF Gaussian burst; gripper bit excluded; per-env sigma fixed at run start
Sub-expert checkpoint preservation every 50 PPO iters -> 11 .pt files (iters 0, 50, ..., 2000)
Curriculum scaffolding - phase advance on IQM > 0.7, eval cadence every 100K steps
PPO training run - phase 1 (grasp-only), 1.5 GPU-day, N=5 seeds
PPO training run - phase 2-5 (full chain), 2-3 GPU-days, full curriculum
Scene Tier 1 fixes #1-3 in parallel (SimWorker): procedural cable mesh + lab clutter + workshop background
Re-measure chi2 post-Tier 1 (Researcher) - target drop from 1.88 toward ~1.4

Pass criterion: 11 sub-expert + 1 expert PPO checkpoint on disk for full grasp->insert curriculum; post-Tier-1 chi2 re-measured and documented; scene fixes committed.

SimWorker scene fixes run alongside RLWorker PPO training (different repos, no contention).

Epic C - Wave 3 data-gen pass

~3 wall-days + <1 GPU-day - SimWorker (+ RLWorker)

Deliverables (7 tasks)

FmbDataRecorder outer loop - burst-noise applied here (NOT during PPO training); 10-episode test corpus validated
HDF5/zarr schema additions: 4 obs/sim_* keys (sim_action_noised, sim_policy_iter, sim_never_noised, sim_noise_std) + 4 episode-meta labels (successful, disturbed, failed, clean_expert)
episode_uploader.py - nohup-resumable, plain FTP via curl --ftp-method nocwd, idempotent on NAS
Data-gen yield eval script - reports total pairs, contact-transient fraction, F/T distribution, per-cam diversity chi2
Small pilot - 100 episodes; label distribution matches expectation
Full sweep - 5k-10k episodes, ~150-300 GB on NAS at 224 2mixed_rgbd
Build TFDS sinew_fmb_v2f shards on A6000 - tfds.load succeeds locally

Storage budget

~6 MB per 50-step episode at 224 2mixed_rgbd; 22k FMB-equivalent corpus ~243 GB (45% of FMB upstream's 545 GB zip)
NAS: ftp://143.248.121.169:7002/IntelligentManipulationTeam/DomrachevIvan/sinew/recordings/

Pass criterion: Sim corpus on NAS (label distribution per spec; yield eval green); TFDS shards verified loadable on A6000.

Epic D - Wave 4a v2f sim pretrain

~3 wall-days + ~1-2 GPU-days - VisionWorker

Deliverables (4 tasks)

V2FPredictor model class: DINOv2-S frozen + 4-ch RGBD patch embed + 2-layer cross-cam fusion + 4 heads (~43M params, 22M trainable)
Training script - AdamW lr=3e-4 cosine + 2k warmup, bs=128, 300 epochs, bf16; checkpoint save every 50 epochs; one-epoch wall-time matches budget (~7 min for 5k episodes)
Stage 1 sim pretrain run on A6000 - all 4 heads trained; gate-head loss upweighted 1.5x on noised steps
Per-shape stratified eval on sim test split - 9-row per-shape table, identifies worst shape, per-shape gate F1

GPU budget

Single A6000, bf16, batch 128
Stage 1: ~1-2 A6000-days for 300 epochs over 5-10k episodes
Held-out validation: clean_expert subset (~1.25% of corpus)

Pass criterion: predictor_sim_pretrain.pt on A6000; sim-test direction-acc > 0.85 (loose sanity bar; if not hit, pipeline is broken).

Epic E - Wave 4b v2f real fine-tune

~3 wall-days + ~1 GPU-day - VisionWorker (+ Researcher)

Deliverables (5 tasks)

fmb_parse.py - filter chain + zero-bias subtract + 8N gate threshold + (qx,qy,qz,qw) + BGR->RGB + cam renames; parses 100 episodes without warning
Pull FMB insert-only filtered subset - ~20-35 GB via GCS HTTPS-range; 15-30 shards; ~3000-4000 episodes; per-shape distribution recorded
Stage 2 real fine-tune run on A6000 - direction + gate heads only; backbone, wrench, contact-point heads frozen; 30-50 epochs, lr=3e-5, no warmup
Per-shape stratified eval on FMB-real test - direction-acc + gate F1 table; worst-shape direction-acc ≥ 0.45 (acceptable tier)
Branch decision per outcome matrix A-E (team-lead + VisionWorker) - written addendum, next-epic plan adjusted

GPU budget

~1 A6000-day for stage-2 fine-tune (30-50 epochs at 1/10 stage-1 lr)
Output: predictor_real_finetune.pt - the ship artifact

Pass criterion: Real-test direction cos-sim ≥ 0.60 global, ≥ 0.45 worst-shape (acceptable tier); aspirational ≥ 0.70 / ≥ 0.60. Outcome decision recorded.
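The two-tier pass criterion can be encoded as a small classifier over per-shape results. The tier thresholds come from the criterion above; the function names, the `"below_bar"` label, and the per-shape dictionary layout are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_direction_result(global_cos, cos_by_shape):
    """Map global and per-shape direction cos-sim to a pass tier.

    global_cos: mean cos-sim over the whole real test set.
    cos_by_shape: dict of shape name -> mean cos-sim for that shape.
    """
    worst = min(cos_by_shape.values())
    if global_cos >= 0.70 and worst >= 0.60:
        return "aspirational"
    if global_cos >= 0.60 and worst >= 0.45:
        return "acceptable"
    return "below_bar"
```

Gating on the worst shape as well as the global mean prevents a strong average from hiding one peg geometry where the direction head fails.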

Epic F - Wave 5 real-robot eval (hardware-gated)

~7 wall-days - 0 GPU-days - VisionWorker (+ User) - DEFERRED until hardware access confirmed

Deliverables (5 tasks)

Hardware setup confirmation - Franka + 4x D405 + FMB workspace match sim layout within ±5 mm
Deploy v2f predictor inference on real D405 streams at 10 Hz; publish to ROS topic / shared mem
Collect 50 real rollouts with predictor running alongside real F/T ground truth
Real eval bench - predicted vs real direction-acc + gate F1 + per-shape stratified report card
Final results memo + decision on next steps - docs/research/final_real_eval.md

Open chunks

Real-robot data-collection pipeline
Safety envelope review (presumed via FMB testbed; safety task added if not)
Impedance gains tuning

Pass criterion: 50-trajectory real eval done; per-shape direction-acc table on real Franka; final memo committed.

RL eval protocol (IQM + bootstrap CIs) carries over for real-eval reporting.

Open questions and unknowns

Item	Source	Trigger
chi2 re-measurement post-DR + post-Tier-1 scene fix	sim2real_visual_gap §5	End of Epic B
Detector vs recorded-label cross-check (non-determinism audit)	addenda Q1.3	Only if non-determinism observed in recordings
Per-shape threshold ablation for substage detector	addenda Q1.6	After FMB raw arrives locally
Soft-success bonus shape (reward shaping refinement)	rl_reward_design §8 #2	First PPO training data
Ablations: backbone freeze depth, RGB vs RGBD, temporal stack, voxel contact-point	v2f_arch §6	Epic D if sim-pretrain underperforms
Production asset root pin (S3 staging)	`isaac_twins-36`	Before any S3 staging issue resurfaces
FMB raw `.npy` download (545 GB) - currently 5-frame smokes only	FMB pull	Required for full per-shape calibration; partial pull (~35 GB) is enough for Epic E

Each open item has a "when" trigger - they enter the plan on a specific event, not on a schedule. The chi2 re-measurement is the most important; it gates how aggressively we interpret v2f real-test numbers.

Compute envelope summary

Epic	Wall-days	GPU-days	Bottleneck
A - sim surface	~4	0	SimWorker bandwidth
B - RL data factory	~6	~4	per-seed wall time
C - data-gen pass	~3	<1	I/O on FMB pull and NAS push
D - v2f sim pretrain	~3	~1-2	A6000 training compute
E - v2f real fine-tune	~3	~1	training compute
F - real eval	~7	0	real-robot access
Total (A-E)	~19 wall-days floor	~7 GPU-days	SimWorker queue (18 tasks)
Total (A-F)	~26 wall-days floor	~7 GPU-days	hardware availability for F

Cross-cutting

Critical path: A -> B -> C -> D -> E -> F. Each wave gates the next.
Intra-wave parallelism: Epic B (SimWorker scene fixes alongside RLWorker PPO), Epic E (Researcher parser alongside VisionWorker training).
Single A6000 covers the entire training budget.
Realistic with GPU contention + integration churn: ~45 days end-to-end.

End - and the deep-dive

This deck is the executive view of the sinew R2S2R plan. For the full analysis - locks, per-memo findings, validator schemas, dependency graph cross-refs, wave deliverable detail - see the long-form report:

docs/r2s2r_research_report.html

Next: Epic A kickoff (sim surface foundation). Source memos: docs/research/. Cloned references: isaac_twins/references/.
