Executive plan view - 2026-05-21
sinew team - team-lead, Researcher, SimWorker, RLWorker, VisionWorker
Deep-dive companion: r2s2r_research_report.html
- predictor_real_finetune.pt - DINOv2-S backbone, RGBD patch embed, 4 heads (force-direction, 6D wrench, contact-point, in-contact gate)
- side_left + wrist_left) on real FMB testbed
- insert subset

Contact-rich manipulation - peg insertion, assembly, tool-use - depends on force feedback. Real F/T sensors are noisy, slow, and absent on most low-cost hardware.
| Approach | What it needs | Limitation |
|---|---|---|
| Real F/T sensor | Wrist-mounted load cell or Franka K_F_ext_hat_K | Noisy, lagged, biased; not on most arms |
| Tactile gripper | Custom fingers (GelSight, DIGIT, etc.) | Hardware lock-in; contact-only signal |
| v2f predictor (sinew) | RGBD cameras (already on the testbed) | Direction transfers sim→real; magnitude needs real anchor |
FMB upstream is a benchmark for contact-rich insertion but ships no force-prediction baseline. sinew fills that gap with a predictor trained on sim-generated (image, force) pairs and anchored on the FMB real subset.
Visual and F/T sim-to-real gaps need different fixes. Conflating them is the easy mistake.
| Gap | What it is | How sinew closes it |
|---|---|---|
| Visual | Sim renders too clean: chi2=1.88, edges +39%, brightness skew 35-62% | Heavy visual DR + Tier-1 scene authoring. Stage-2 real fine-tune absorbs residual. |
| F/T | Real Franka K_F_ext_hat_K is noisy + biased + lagged. Sim is pristine. | Noise/bias/lag model on the recorded label, not on inputs. No domain adaptation for force. |
Visual gap (v2f stage 2): the fix lives in the image pipeline, via heavy DR sim-side + stage-2 backbone freeze + direction-head FT.
F/T gap (sim labels): the fix lives in the recorder. Noise/bias/lag is injected on the label, never on the input. No force-side domain adaptation.
Predictor sees clean sim images, learns to match noisy real-Franka F/T. The visual gap requires a real fine-tune; the F/T gap is handled entirely inside the sim label pipeline.
Goal 2 - isaac_twins + isaaclab_sinew
cfg = FmbInsertionEnvCfg() # defaults: fmb_big_demo + big_long/rect peg
# + side+wrist cams
env = FmbInsertionEnv(cfg)
obs, info = env.reset()
# obs keys: side_left, side_right, wrist_left, wrist_right (RGBA 720x1280),
# q (7,), dq (7,), tcp_pose (7,),
# tcp_force (3,), tcp_torque (3,),
# gripper_pose (1,)
for _ in range(N):
    action = policy(obs)  # 7-vec in [-1, 1]
    obs, rew, term, trunc, info = env.step(action)
- read_eef_wrench_ee(art, sensor, *, noisy, state, rng) - the sim F/T pipeline (clean + noisy paths)
- SubstageDetector at isaac_twins/src/isaac_twins/fmb/substage.py - 5 primitive predicates, single source of truth for reward + labels
- grab_franka_view(num_envs), SceneConfigurator.reset_episode(env_ids) - multi-env handles

Goal 2 - sim_substage_detection.md + addenda Q1/Q2
- finger_contact - antagonistic finger-force pattern → grasp closed
- peg_contact - per-partner zero/non-zero distinguishes floor / fixture / hole. The substage-defining signal.
- board_contact - sanity heartbeat (board should stay seated)

| Primitive | FP | FN |
|---|---|---|
| grasp | 1-3% | 2-5% |
| place_on_fixture | 2-5% | 5-10% |
| rotate | 5-10% | 10-20% |
| regrasp | 3-7% | 5-10% |
| insert | 5-10% | 10-15% |
Detector is single-source-of-truth for reward gates AND recorded v2f labels. Threshold-envelope assertion + inverted-physics probes + temporal-smoothness check land pre-RL (~2 days, addenda Q1).
Goal 2 - sim_ft_sensor.md - replicates Panda K_F_ext_hat_K
- noisy=False → reward, SubstageDetector, v2f direction-head label (geometry-derived)
- noisy=True → policy obs, v2f wrench-head label (matches real Franka distribution)

Force sigma=0.025 N/axis, torque sigma=0.01 Nm/axis (Franka resolution /2). Per-episode bias drift sigma=0.05 N / 0.02 Nm. LP lag tau_lag = U(20, 80) ms.
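The noisy path can be sketched in a few lines. This is a minimal illustration, not the contents of read_eef_wrench_ee: the class name NoisyWrenchState, the function noisy_wrench, the order of operations (lag, then bias, then white noise), and treating the bias as one constant draw per episode are all assumptions. Only the sigma values and the lag range come from the spec above.

```python
import math
import random

FORCE_SIGMA_N = 0.025        # per-axis white noise on force (from the spec)
TORQUE_SIGMA_NM = 0.01       # per-axis white noise on torque (from the spec)
FORCE_BIAS_SIGMA_N = 0.05    # per-episode bias on force (from the spec)
TORQUE_BIAS_SIGMA_NM = 0.02  # per-episode bias on torque (from the spec)
LAG_RANGE_S = (0.020, 0.080) # tau_lag ~ U(20, 80) ms


class NoisyWrenchState:
    """Hypothetical per-episode state: one bias draw, one lag constant, filter memory."""

    def __init__(self, rng):
        self.bias = ([rng.gauss(0.0, FORCE_BIAS_SIGMA_N) for _ in range(3)]
                     + [rng.gauss(0.0, TORQUE_BIAS_SIGMA_NM) for _ in range(3)])
        self.tau_lag = rng.uniform(*LAG_RANGE_S)
        self.filtered = None  # low-pass memory, initialised on the first sample


def noisy_wrench(clean, state, rng, dt):
    """Return a lagged, biased, noisy copy of a clean 6D wrench [f; tau]."""
    if state.filtered is None:
        state.filtered = list(clean)
    else:
        # Discrete first-order low-pass with time constant tau_lag.
        alpha = 1.0 - math.exp(-dt / state.tau_lag)
        state.filtered = [y + alpha * (x - y)
                          for x, y in zip(clean, state.filtered)]
    sigmas = [FORCE_SIGMA_N] * 3 + [TORQUE_SIGMA_NM] * 3
    return [y + b + rng.gauss(0.0, s)
            for y, b, s in zip(state.filtered, state.bias, sigmas)]
```

Under this sketch the clean path is simply the unmodified input, which is why it can stay stateless while the noisy path carries state across ticks.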
Goal 1 - rl_revised_plan.md + sim_dist_review.md
| Element | Locked value |
|---|---|
| Algorithm | PPO with Andrychowicz 2020 defaults; SAC fallback if PPO plateaus below 50% success |
| Curriculum | grasp -> +place -> +rotate -> +regrasp -> +insert; phase advance on IQM > 0.7 |
| Checkpoint preservation | Every 50 PPO iters → 11 sub-expert checkpoints (iters 0, 50, ..., 2000) |
| Data-gen disturbance | Action-only Gaussian burst, ~9% noised fraction, gripper bit excluded, 2.5% never-noised |
| Policy mixture at data-gen | 50% expert + 50% from 11 sub-expert ckpts; per-env assignment persists per episode |
| Disturbance applied | Data-gen pass only - NOT during PPO training |
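The data-gen recipe in the table can be sketched as follows. The function names, the uniform draw over sub-expert checkpoints, the independence of the never-noised flag from the policy draw, and the clipping to [-1, 1] are assumptions for illustration. The burst schedule that produces the ~9% noised fraction is not specified here and is left out.

```python
import random

NUM_SUB_EXPERTS = 11           # sub-expert checkpoints (from the table)
EXPERT_FRACTION = 0.5          # 50% expert / 50% sub-expert mixture
NEVER_NOISED_FRACTION = 0.025  # 2.5% of episodes are never noised


def assign_episode(rng):
    """Draw the policy for one episode and whether it is exempt from noise.

    The assignment is meant to persist for the whole episode.
    """
    if rng.random() < EXPERT_FRACTION:
        policy = "expert"
    else:
        # Assumption: sub-experts are sampled uniformly.
        policy = f"sub_expert_{rng.randrange(NUM_SUB_EXPERTS)}"
    never_noised = rng.random() < NEVER_NOISED_FRACTION
    return policy, never_noised


def corrupt_action(action, sigma, rng):
    """Add Gaussian noise to the 6 motion DOFs; the gripper bit is untouched."""
    noised = [max(-1.0, min(1.0, a + rng.gauss(0.0, sigma)))
              for a in action[:6]]
    return noised + [action[6]]
```

Because the disturbance is a data-gen-only step, a function like corrupt_action would sit in the recorder loop and never in the PPO training loop.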
Goal 2 - sim_recording_spec.md + rl_revised_plan.md §4
| Knob | Value |
|---|---|
| Episodes | 5-10k insert-primitive rollouts (FMB-equivalent at 22k eps) |
| Camera config | 2mixed_rgbd (side_left + wrist_left); 3cam_rgbd for data leverage |
| Resolution | 224 (DINOv2-S patch match); 256 opt-in ablation |
| Storage | ~243 GB at 224 - ~45% of FMB upstream's 545 GB zip |
| Schema | FMB-RLDS parity + obs/sim_* extras; validator (3-mode autodetect) |
| Label | Definition | Frac | v2f use |
|---|---|---|---|
| successful | insert.success() == True | ~60% | high-quality direction labels |
| disturbed | any tick with sim_action_noised == 1 | ~55% | off-policy + contact-transient diversity |
| failed | not successful | ~40% | off-manifold coverage; gate negatives |
| clean_expert | successful AND not disturbed AND policy=expert | ~1.25% | held-out nominal-regime eval slice |
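The four label definitions above translate directly into a small function. The function name and argument layout are hypothetical; the per-tick sim_action_noised flags and the "expert" policy tag follow the table.

```python
def episode_labels(success, action_noised_ticks, policy):
    """Derive the four episode-meta labels from one recorded episode.

    success: result of insert.success() for the episode.
    action_noised_ticks: per-tick sim_action_noised values (0 or 1).
    policy: policy tag for the episode, "expert" for the full expert.
    """
    disturbed = any(flag == 1 for flag in action_noised_ticks)
    return {
        "successful": bool(success),
        "disturbed": disturbed,
        "failed": not success,
        "clean_expert": bool(success) and not disturbed and policy == "expert",
    }
```

Note that the labels overlap by design: an episode can be both successful and disturbed, while successful and failed are complements.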
Goal 3 - v2f_arch.md + v2f_pipeline_revised.md - 22M trainable + 21M frozen
| Component | Value |
|---|---|
| Backbone | DINOv2-S ViT-S/14, 384-d features, frozen (depth channel trainable) |
| Input patch embed | 4-channel RGBD patch embed; depth conv1 init from RGB mean (ForceSight pattern) |
| Cross-cam fusion | 2-layer transformer encoder, 4 heads, GELU |
| Cameras | side_left + wrist_left RGBD 224x224, letterboxed |
| Head | Output | Loss | Weight |
|---|---|---|---|
| load-bearing force-direction | unit-vec 3D in EE frame | L1 on coords, NaN-masked when ‖F‖ < 8 N | 1.0 |
| 6D wrench | EE-frame [f; tau], noisy-lagged label | MSE, gate-gated | 0.1 |
| contact-point | per-pixel 64x64 heatmap on wrist cam | BCE + soft-L2 | 0.5 |
| in-contact gate | binary probability | BCE (FoAR pattern) | 0.1 |
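As a rough sketch of how the head table combines into one objective: the direction loss skips samples below the 8 N gate and the four per-head losses are summed with the listed weights. Averaging the L1 over the three coordinates, returning None for a fully masked batch, and the function names are assumptions. The wrench, contact-point, and gate losses are taken as precomputed scalars.

```python
import math

# Loss weights from the head table.
HEAD_WEIGHTS = {"direction": 1.0, "wrench": 0.1, "contact_point": 0.5, "gate": 0.1}
FORCE_GATE_N = 8.0  # direction label is masked below this force magnitude


def direction_l1(pred_dirs, true_forces):
    """Mean L1 between predicted unit vectors and true force directions.

    Samples with ||F|| < 8 N are skipped. Returns None if all are masked.
    """
    total, count = 0.0, 0
    for pred, force in zip(pred_dirs, true_forces):
        norm = math.sqrt(sum(c * c for c in force))
        if norm < FORCE_GATE_N:
            continue
        total += sum(abs(p - c / norm) for p, c in zip(pred, force)) / 3.0
        count += 1
    return total / count if count else None


def total_loss(per_head):
    """Weighted sum of per-head losses; heads with loss None contribute nothing."""
    return sum(HEAD_WEIGHTS[name] * value
               for name, value in per_head.items() if value is not None)
```

The 1.0 weight on direction against 0.1 on the wrench head reflects the table: direction is the load-bearing output, magnitude is secondary.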
"Force direction transfers sim→real; magnitude does not" (Direction Matters) - dictates head weights, freeze list, and DR-vs-no-DR split.
Goal 3 - v2f_pipeline_revised.md §3
| Stage | Data | Trained | Frozen |
|---|---|---|---|
| 1: sim pretrain | 1.6M (image, force) pairs, 2-cam RGBD, clean+noisy GT wrench, heavy visual DR | backbone (depth ch) + all 4 heads + gate | DINOv2 RGB weights |
| 2: real fine-tune | ~3-4k FMB insert real eps x ~100 steps x 2cam RGB, EE-frame F/T zero-bias-subtracted | direction + gate heads only | backbone, wrench head, contact-point head |
Stage 1: 300 epochs, AdamW lr=3e-4 cosine + 2k warmup, bs=128, bf16. Stage 2: 30-50 epochs, lr=3e-5, no warmup. Output: predictor_real_finetune.pt - the ship artifact.
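The stage-1 schedule (cosine with 2k warmup) can be sketched as a plain function of the step index. Reading "2k warmup" as 2000 optimizer steps, using a linear ramp, and decaying to zero at the final step are assumptions; the base learning rates come from the line above.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=2000):
    """Learning rate at a given optimizer step.

    Linear warmup for warmup_steps, then cosine decay to zero at total_steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

Stage 2 under the same sketch would be `lr_at(step, total_steps, base_lr=3e-5, warmup_steps=0)`, which starts at the full 3e-5 with no ramp.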
Goal 3 - v2f_dr_spec.md + sim2real_visual_gap.md - heavy visual DR, NO dynamics DR
| Bucket | Knobs | Where applied |
|---|---|---|
| Per-episode (~26) | lighting, cam K, cam extrinsics, materials, placement, F/T bias+lag, mode, DR profile | SceneConfigurator.randomize(step); USD writes > runtime calls |
| Per-frame (sim, ~6) | depth dropout + Gauss + clamp + RGBD jitter, F/T additive | Recording loop, before writing labeled tuple |
| Per-frame (train aug, ~8) | brightness/contrast/hue/sat/gauss/gamma/JPEG, chromatic aberration | Training dataloader (torchvision.transforms.v2) |
Critical knob: depth dropout on low-texture pixels. D405 is passive stereo - textureless 3D-printed pegs give sparse depth in real but clean in sim. Without this, any RGBD head will be sim-tuned.
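One way to picture the depth-dropout knob is the sketch below, which zeroes depth at pixels where the grayscale image has little local gradient. It is an illustration under stated assumptions, not the recorder's implementation: the gradient-based texture measure, the texture_thresh and p_drop parameters, and the use of 0.0 to mark missing depth are all hypothetical.

```python
import random


def depth_dropout(gray, depth, texture_thresh, p_drop, rng):
    """Zero out depth at low-texture pixels with probability p_drop.

    gray, depth: equal-sized 2D lists (rows of floats).
    Returns a new depth image; the input is not modified.
    """
    h, w = len(gray), len(gray[0])
    out = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border.
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            low_texture = abs(gx) + abs(gy) < texture_thresh
            if low_texture and rng.random() < p_drop:
                out[y][x] = 0.0
    return out
```

The intent is to make sim depth on textureless peg surfaces as sparse as what a passive-stereo camera returns on the real testbed.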
Five hard-required categories: lighting + color jitter + material BRDF + cam intrinsic jitter + background clutter. Tier-1 scene fixes (cable mesh, lab clutter, background plane) land alongside DR to drive chi2 from 1.88 to ~1.4.
| Lock | Value |
|---|---|
| End deliverable | v2f predictor (RGBD → wrench). RL policy is not shipped. |
| Sim F/T provenance | Isaac contact reporter via read_eef_wrench_ee. Never vision-predicted. |
| F/T frame | EE frame end-to-end; 6x6 EE->base adjoint applied only at FMB-checkpoint boundary. |
| Quat order | (qx, qy, qz, qw) everywhere. |
| Image storage | BGR on disk -> RGB at parse time (FMB convention). |
Locks compress design debate into known frames. Re-litigation gate: open a new beads issue, don't rewrite the lock in place.
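Two of the locks above (EE-frame wrench with an EE->base adjoint at the boundary, and the (qx, qy, qz, qw) quaternion order) meet in one small computation. The sketch below applies the standard rigid-body wrench transform, f_base = R f_ee and tau_base = R tau_ee + p x (R f_ee), where (p, R) is the EE pose in the base frame. The function names are hypothetical, and the quaternion is assumed to be unit-norm.

```python
def quat_to_rot(q):
    """Rotation matrix from a unit quaternion in (qx, qy, qz, qw) order."""
    qx, qy, qz, qw = q
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]


def matvec(m, v):
    """3x3 matrix times 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def cross(a, b):
    """Cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def wrench_ee_to_base(force_ee, torque_ee, p_base_ee, q_base_ee):
    """Re-express an EE-frame wrench in the base frame.

    p_base_ee, q_base_ee: EE position and orientation in the base frame.
    Equivalent to the 6x6 map [[R, 0], [[p]x R, R]] acting on [f; tau].
    """
    rot = quat_to_rot(q_base_ee)
    force_base = matvec(rot, force_ee)
    lever = cross(p_base_ee, force_base)
    torque_base = [t + l for t, l in zip(matvec(rot, torque_ee), lever)]
    return force_base, torque_base
```

Passing a quaternion in (qw, qx, qy, qz) order to this function silently yields a wrong rotation, which is the kind of error the quat-order lock is there to prevent.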
| Lock | Value |
|---|---|
| Action contract | 7-vec EE-delta normalized [-1, 1]; scaled ±0.06 m / ±0.25 rad / gripper bit; 10 Hz; base frame. |
| SimDist recipe | Action-only burst, ~9% noised, gripper bit excluded, 2.5% never-noised, data-gen only. |
| Stage-2 real fine-tune | Non-optional. Direction + gate heads only; backbone, wrench, contact-point frozen. |
| Recording resolution | 224 (DINOv2-S patch match) is v1 default; 256 is opt-in ablation. |
| Camera subset (v2f) | 2mixed_rgbd minimum (side_left + wrist_left); 3cam_rgbd for data leverage. |
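The action contract in this table can be made concrete with a short decoding sketch. The scale factors come from the lock; the element layout [dx, dy, dz, drx, dry, drz, gripper], the clipping of out-of-range inputs, and the sign convention for the gripper bit are assumptions.

```python
POS_SCALE_M = 0.06    # full-scale translation per step (from the lock)
ROT_SCALE_RAD = 0.25  # full-scale rotation per step (from the lock)


def scale_action(action):
    """Decode a normalized 7-vec action into physical EE deltas.

    Returns (delta_pos_m, delta_rot_rad, gripper_closed).
    """
    if len(action) != 7:
        raise ValueError("expected a 7-vector action")
    clipped = [max(-1.0, min(1.0, a)) for a in action]
    delta_pos = [POS_SCALE_M * a for a in clipped[:3]]
    delta_rot = [ROT_SCALE_RAD * a for a in clipped[3:6]]
    gripper_closed = clipped[6] > 0.0  # assumption: positive means close
    return delta_pos, delta_rot, gripper_closed
```

At 10 Hz, a saturated translation command under this decoding moves the EE target by at most 0.06 m per step.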
| Lock | Value |
|---|---|
| Noisy / clean wrench discipline | Noisy → policy obs + v2f wrench label. Clean → reward + substage detector + direction-head label. |
| Reward shaping | PBRS with clean wrench; Phi_insert force coefficient 0.2; gates on alignment + seat depth. |
| Substage detector role | Full-state detector canonical for reward + recorded labels; state-only adapter for offline audit only. |
| RL evaluation | IQM + 95% stratified bootstrap CIs (never mean/median); P(A > B) > 0.7; N=5 seeds default. |
| v2f primary metric | Monotonic improvement post-fine-tune; aspirational direction cos-sim ≥ 0.70 global / ≥ 0.60 worst shape. |
The two F/T disciplines (noisy/clean + EE-frame everywhere) are the most-touched rules - they thread through reward, recorder, predictor labels, and policy obs.
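For the RL evaluation lock, a minimal stdlib sketch of IQM with a stratified bootstrap interval looks like this. Treating each task as a stratum and resampling seeds within it is an assumption about what "stratified" means here, and the percentile-style interval is one common choice rather than a confirmed detail of the protocol.

```python
import random


def iqm(values):
    """Interquartile mean: mean of the values after trimming 25% from each end."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def stratified_bootstrap_ci(scores_by_task, reps=2000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the pooled IQM.

    scores_by_task: dict mapping task name to a list of per-seed scores.
    Seeds are resampled with replacement within each task (the stratum).
    """
    rng = rng or random.Random(0)
    stats = []
    for _ in range(reps):
        pooled = []
        for runs in scores_by_task.values():
            pooled.extend(rng.choice(runs) for _ in runs)
        stats.append(iqm(pooled))
    stats.sort()
    lower = stats[int((alpha / 2) * reps)]
    upper = stats[min(reps - 1, int((1 - alpha / 2) * reps))]
    return lower, upper
```

With N=5 seeds per task the interval will be wide, which is part of why the lock pairs it with a P(A > B) threshold instead of comparing point estimates.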
Goal 3 - sim2real_visual_gap.md - chi2=1.88, FoAR threshold for "visibly distinct" is 1.0
| Metric | Sim | Real | Gap |
|---|---|---|---|
| Color hist chi2 (mean 4 cams) | - | 1.88 | Above FoAR chi2=1.0 "visibly distinct" |
| Edge density | 0.056 | 0.079 | Real +39% |
| Per-channel brightness (RGB) | (152, 159, 152) | (94, 117, 104) | Sim 1.35-1.62x brighter |
| Per-channel std (RGB) | (40, 37, 38) | (51, 56, 55) | Real 30-52% wider |
[Figure: per-camera panels - side_left, side_right, wrist_left, wrist_right]
| Cam | Sim edge | Real edge | Δ edge |
|---|---|---|---|
| side_left | 0.059 | 0.069 | +17% |
| side_right | 0.049 | 0.057 | +16% |
| wrist_left | 0.055 | 0.083 | +51% |
| wrist_right | 0.063 | 0.105 | +67% |
Headline: wrist cams have the worst gap because they see hand, fingers, board screws, peg layer-lines - all of which the sim does not model. This is what makes a real fine-tune non-optional and what the contact-point head freezes against.
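The gap metrics in these tables are simple to reproduce in outline. The sketch below uses the symmetric chi-square histogram distance, sum((p - q)^2 / (p + q)) on normalized histograms, which is one common variant; the exact variant behind the 1.88 figure is not stated here, so treat that choice as an assumption. The relative-gap helper matches how the Δ edge column is expressed.

```python
def chi2_distance(hist_a, hist_b):
    """Symmetric chi-square distance between two histograms.

    Both histograms are normalized to sum to 1 before comparison.
    Ranges from 0 (identical) to 2 (disjoint support) in this variant.
    """
    total_a, total_b = sum(hist_a), sum(hist_b)
    dist = 0.0
    for a, b in zip(hist_a, hist_b):
        p, q = a / total_a, b / total_b
        if p + q > 0:
            dist += (p - q) ** 2 / (p + q)
    return dist


def relative_gap(sim_value, real_value):
    """Fractional gap of real over sim, e.g. 0.67 for +67%."""
    return (real_value - sim_value) / sim_value
```

For example, `relative_gap(0.063, 0.105)` reproduces the wrist_right row of the per-camera edge table.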
Goal 3 - substage_and_scene_addenda.md Q3 - leverage x inverse-cost ranking
| Fix | Time | delta chi2 mean | delta chi2 wrist |
|---|---|---|---|
| Procedural cable mesh in wrist FOV | 1 day | -0.05 to -0.1 | -0.3 to -0.5 |
| Lab-clutter distractor spawning | 0.5 day | -0.15 to -0.25 | -0.05 |
| Background plane workshop texture | 0.5 day | -0.1 to -0.2 | -0.05 |
Out of scope: PathTracing (3-5x render cost), hand simulation (stage-2 FT carries this), photographed bin texture (not on critical path).
impl_epic_plan.md - one wave per epic; Wave 4 splits into 4a sim pretrain + 4b real fine-tune
| Epic | Scope | Owner | Wall-days | GPU-days |
|---|---|---|---|---|
| A Wave 1 | Sim surface foundation | SimWorker (+ RLWorker) | ~4 | 0 |
| B Wave 2 | RL data factory | RLWorker (+ SimWorker, Researcher) | ~6 | ~4 |
| C Wave 3 | Data-gen pass | SimWorker (+ RLWorker) | ~3 | <1 |
| D Wave 4a | v2f sim pretrain | VisionWorker | ~3 | ~1-2 |
| E Wave 4b | v2f real fine-tune | VisionWorker (+ Researcher) | ~3 | ~1 |
| F Wave 5 | Real-robot eval (hardware-gated) | VisionWorker (+ User) | ~7 | 0 |
Aggregate: ~7 GPU-days across Epics B + D + E; ~26 wall-days floor if teammates fully available + GPU not contested; ~45 days realistic with contention.
Epic A (Wave 1) - Sim surface foundation - ~4 wall-days - 0 GPU-days - SimWorker (+ RLWorker)
- SubstageDetector class at isaac_twins/src/isaac_twins/fmb/substage.py per spec - offline unit tests + Kit runtime smokes
- peg_tip_local_offset attribute on each peg USD via the author script
- read_eef_wrench_ee(art, sensor, *, noisy, state, rng) - stateful when noisy=True, stateless when noisy=False; smoke: push peg into board, wrench grows monotonically with depth
- grab_franka_view(num_envs), SceneConfigurator.reset_episode(env_ids), observation packager
- Phi_insert reward function (clean wrench, PBRS form) - offline unit tests against synthetic histories

Pass criterion: Integration smoke runs FmbInsertionEnv -> reset -> 100 steps -> reward returns correct decomposition AND read_eef_wrench_ee returns non-zero on contact in the same loop. Test suite green.
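The "PBRS form" mentioned for Phi_insert refers to potential-based reward shaping, where the shaped reward is r + gamma * Phi(s') - Phi(s). The sketch below shows only that generic form. The actual Phi_insert potential (force coefficient, alignment and seat-depth gates) is not reproduced; the function names, the default gamma, and the zero-potential-at-terminal convention are assumptions.

```python
def shaped_reward(base_reward, phi_s, phi_next, gamma=0.99, terminal=False):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s).

    Assumption: the potential of a terminal state is taken as zero.
    """
    next_potential = 0.0 if terminal else phi_next
    return base_reward + gamma * next_potential - phi_s


def discounted_shaping_sum(potentials, gamma):
    """Discounted sum of shaping terms along a trajectory of potentials.

    Telescopes to gamma**T * Phi(s_T) - Phi(s_0), independent of the path.
    """
    return sum((gamma ** t) * shaped_reward(0.0, potentials[t], potentials[t + 1], gamma)
               for t in range(len(potentials) - 1))
```

The telescoping property is what makes an offline unit test against synthetic histories straightforward: the shaping contribution depends only on the first and last potential.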
Epic B (Wave 2) - RL data factory - ~6 wall-days + ~4 GPU-days - RLWorker (+ SimWorker, Researcher)
- FmbInsertionEnv to DirectRLEnv subclass; multi-env smoke at N=4
- EEDeltaCorruptedActionMapper subclass with per-DOF Gaussian burst; gripper bit excluded; per-env sigma fixed at run start

Pass criterion: 11 sub-expert + 1 expert PPO checkpoint on disk for full grasp->insert curriculum; post-Tier-1 chi2 re-measured and documented; scene fixes committed.
SimWorker scene fixes run alongside RLWorker PPO training (different repos, no contention).
Epic C (Wave 3) - Data-gen pass - ~3 wall-days + <1 GPU-day - SimWorker (+ RLWorker)
- FmbDataRecorder outer loop - burst-noise applied here (NOT during PPO training); 10-episode test corpus validated
- obs/sim_* keys (sim_action_noised, sim_policy_iter, sim_never_noised, sim_noise_std) + 4 episode-meta labels (successful, disturbed, failed, clean_expert)
- episode_uploader.py - nohup-resumable, plain FTP via curl --ftp-method nocwd, idempotent on NAS
- sinew_fmb_v2f shards on A6000 - tfds.load succeeds locally
- ftp://143.248.121.169:7002/IntelligentManipulationTeam/DomrachevIvan/sinew/recordings/

Pass criterion: Sim corpus on NAS (label distribution per spec; yield eval green); TFDS shards verified loadable on A6000.
Epic D (Wave 4a) - v2f sim pretrain - ~3 wall-days + ~1-2 GPU-days - VisionWorker
- V2FPredictor model class: DINOv2-S frozen + 4-ch RGBD patch embed + 2-layer cross-cam fusion + 4 heads (~43M params, 22M trainable)
- clean_expert subset (~1.25% of corpus)

Pass criterion: predictor_sim_pretrain.pt on A6000; sim-test direction-acc > 0.85 (loose sanity bar; if not hit, pipeline is broken).
Epic E (Wave 4b) - v2f real fine-tune - ~3 wall-days + ~1 GPU-day - VisionWorker (+ Researcher)
- fmb_parse.py - filter chain + zero-bias subtract + 8N gate threshold + (qx,qy,qz,qw) + BGR->RGB + cam renames; parses 100 episodes without warning
- insert-only filtered subset - ~20-35 GB via GCS HTTPS-range; 15-30 shards; ~3000-4000 episodes; per-shape distribution recorded
- predictor_real_finetune.pt - the ship artifact

Pass criterion: Real-test direction cos-sim ≥ 0.60 global, ≥ 0.45 worst-shape (acceptable tier); aspirational ≥ 0.70 / ≥ 0.60. Outcome decision recorded.
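The pass criterion above can be checked with a small reporting function. This sketch assumes "global" means the mean cosine similarity over all unmasked samples and "worst-shape" means the lowest per-shape mean; it also applies the 8 N gate to the measured force before scoring. The function names and the sample tuple layout are hypothetical.

```python
import math

FORCE_GATE_N = 8.0  # samples below this measured force magnitude are skipped


def cos_sim(a, b):
    """Cosine similarity of two 3-vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def direction_report(samples):
    """Return (global_mean, worst_shape_mean) of direction cos-sim.

    samples: iterable of (shape_name, predicted_direction, measured_force).
    """
    per_shape = {}
    for shape, pred, force in samples:
        if math.sqrt(sum(c * c for c in force)) < FORCE_GATE_N:
            continue
        per_shape.setdefault(shape, []).append(cos_sim(pred, force))
    all_scores = [s for scores in per_shape.values() for s in scores]
    if not all_scores:
        raise ValueError("no samples above the force gate")
    global_mean = sum(all_scores) / len(all_scores)
    worst_shape = min(sum(v) / len(v) for v in per_shape.values())
    return global_mean, worst_shape
```

Comparing the two returned numbers against 0.60 / 0.45 (acceptable) and 0.70 / 0.60 (aspirational) gives the outcome tier.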
Epic F (Wave 5) - Real-robot eval - ~7 wall-days - 0 GPU-days - VisionWorker (+ User) - DEFERRED until hardware access confirmed
- docs/research/final_real_eval.md

Pass criterion: 50-trajectory real eval done; per-shape direction-acc table on real Franka; final memo committed.
RL eval protocol (IQM + bootstrap CIs) carries over for real-eval reporting.
| Item | Source | Trigger |
|---|---|---|
| chi2 re-measurement post-DR + post-Tier-1 scene fix | sim2real_visual_gap §5 | End of Epic B |
| Detector vs recorded-label cross-check (non-determinism audit) | addenda Q1.3 | Only if non-determinism observed in recordings |
| Per-shape threshold ablation for substage detector | addenda Q1.6 | After FMB raw arrives locally |
| Soft-success bonus shape (reward shaping refinement) | rl_reward_design §8 #2 | First PPO training data |
| Ablations: backbone freeze depth, RGB vs RGBD, temporal stack, voxel contact-point | v2f_arch §6 | Epic D if sim-pretrain underperforms |
| Production asset root pin (S3 staging) | isaac_twins-36 | Before any S3 staging issue resurfaces |
| FMB raw .npy download (545 GB) - currently 5-frame smokes only | FMB pull | Required for full per-shape calibration; partial pull (~35 GB) is enough for Epic E |
Each open item has a "when" trigger - they enter the plan on a specific event, not on a schedule. The chi2 re-measurement is the most important; it gates how aggressively we interpret v2f real-test numbers.
| Epic | Wall-days | GPU-days | Bottleneck |
|---|---|---|---|
| A - sim surface | ~4 | 0 | SimWorker bandwidth |
| B - RL data factory | ~6 | ~4 | per-seed wall time |
| C - data-gen pass | ~3 | <1 | I/O on FMB pull and NAS push |
| D - v2f sim pretrain | ~3 | ~1-2 | A6000 training compute |
| E - v2f real fine-tune | ~3 | ~1 | training compute |
| F - real eval | ~7 | 0 | real-robot access |
| Total (A-E) | ~19 wall-days floor | ~7 GPU-days | SimWorker queue (18 tasks) |
| Total (A-F) | ~26 wall-days floor | ~7 GPU-days | hardware availability for F |
This deck is the executive view of the sinew R2S2R plan. For the full analysis - locks, per-memo findings, validator schemas, dependency graph cross-refs, wave deliverable detail - see the long-form report:
docs/r2s2r_research_report.html
Next: Epic A kickoff (sim surface foundation). Source memos: docs/research/. Cloned references: isaac_twins/references/.