
Vision Capture BETA

AI motion capture in the browser. Webcam or video file in, animation out. Privacy-first — your video never leaves your device.

Tier: Free for all (Free Style and up)
Status: 🔵 BETA
Menu: Capture → Vision (Webcam) or Capture → Video Mocap


What it does

Skips the studio entirely. Point a webcam at yourself, perform the motion you want, and Vision Capture extracts the pose data and applies it to your avatar's skeleton. Or drop a video file (mp4, mov, webm) and capture motion from existing footage.

Powered by Google's MediaPipe Pose model, running in the browser via WebAssembly. It is the same neural network that drives dance apps and fitness trackers, pre-trained on millions of human poses.

This isn't a toy. With good lighting and full-body framing, the captured skeleton is comparable to entry-level professional mocap suits, and it's free, instant, and runs offline once the model loads.


Two capture modes

Webcam mode

Real-time motion capture from your computer's camera.

  1. Click Capture → Vision (Webcam).
  2. Browser asks for camera permission. Grant.
  3. Camera feed appears in a side panel. Stand back so your full body is in frame.
  4. Click Record. Perform your motion.
  5. Click Stop. The captured frames apply to your avatar's skeleton in the editor timeline.

Best for: short improvised motion (a wave, a kick, a dance move), interactive testing, "show, don't tell" animation.

Video mode

Capture from an existing video file.

  1. Click Capture → Video Mocap.
  2. Drop a video file (mp4, mov, webm). The video plays in the side panel.
  3. Vision Capture processes the entire video frame-by-frame.
  4. Result applies to the timeline.

Best for: longer sequences, performance footage, video reference for stylized motion, motion you can't easily re-perform live.


Privacy — fully client-side

This is a real privacy claim, not marketing fluff. No video data ever leaves your device.

  • Mediapipe runs in WebAssembly in your browser. The model weights download once, then all inference is local.
  • No upload step. Your webcam feed or video file never touches our servers.
  • No telemetry of pose data. What you capture is yours alone.
  • Camera/file permission is browser-scoped — revoke any time via browser settings.

This is the same privacy posture as MotionPrint verification — the platform's principle is consistent.


What gets captured

MediaPipe Pose extracts 33 body landmarks per frame: nose, shoulders, elbows, wrists, basic hand points (pinky, index, thumb), hips, knees, ankles, heels, and foot tips, plus secondary face landmarks (eyes, ears, mouth) used for head orientation. These map onto your avatar's skeleton via auto-rig matching:

Landmark            | Maps to
Nose / face center  | Head bone orientation
Shoulders           | Clavicle/shoulder bones
Elbows              | Upper-arm rotation
Wrists              | Forearm rotation + hand position
Hips                | Pelvis position + rotation
Knees               | Thigh rotation
Ankles              | Shin rotation + foot position

For avatars with fewer than 33 mappable bones (Roblox R15 / R6, simplified rigs), unused landmarks are dropped. For avatars with more bones (SL Bento with hand finger detail), Vision Capture provides motion for the matched bones; unmatched bones (individual finger joints, facial bones) stay in their rest pose.
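
As a rough illustration of the matching described above, the sketch below pairs a few landmark indices with bones and reports which rig bones stay in rest pose. The landmark indices follow MediaPipe Pose's published 33-point layout; the bone names, the `POSE_BONE_DRIVERS` table, and `matchRig` are hypothetical and are not Vision Capture's actual auto-rig code.

```typescript
// Hypothetical mapping table: which MediaPipe Pose landmarks drive which bone.
// Indices are MediaPipe's (0 nose, 11/12 shoulders, 13/14 elbows, 15/16 wrists,
// 23/24 hips, 25/26 knees, 27/28 ankles). Bone names are made up for this sketch.
interface BoneDriver {
  bone: string;
  landmarks: number[];
}

const POSE_BONE_DRIVERS: BoneDriver[] = [
  { bone: "Head", landmarks: [0] },
  { bone: "LeftUpperArm", landmarks: [11, 13] },  // shoulder -> elbow
  { bone: "RightUpperArm", landmarks: [12, 14] },
  { bone: "LeftForearm", landmarks: [13, 15] },   // elbow -> wrist
  { bone: "RightForearm", landmarks: [14, 16] },
  { bone: "Pelvis", landmarks: [23, 24] },        // both hips
  { bone: "LeftThigh", landmarks: [23, 25] },     // hip -> knee
  { bone: "RightThigh", landmarks: [24, 26] },
  { bone: "LeftShin", landmarks: [25, 27] },      // knee -> ankle
  { bone: "RightShin", landmarks: [26, 28] },
];

interface RigMatch {
  driven: BoneDriver[];        // bones that receive captured motion
  restPose: string[];          // rig bones with no driver; left at rest pose
  droppedLandmarks: number[];  // landmark indices that no matched bone uses
}

function matchRig(rigBones: string[]): RigMatch {
  const rigSet = new Set(rigBones);
  const driven = POSE_BONE_DRIVERS.filter((d) => rigSet.has(d.bone));
  const drivenNames = new Set(driven.map((d) => d.bone));
  const restPose = rigBones.filter((b) => !drivenNames.has(b));

  const used = new Set<number>();
  for (const d of driven) {
    for (const i of d.landmarks) used.add(i);
  }
  const droppedLandmarks: number[] = [];
  for (let i = 0; i < 33; i++) {
    if (!used.has(i)) droppedLandmarks.push(i);
  }
  return { driven, restPose, droppedLandmarks };
}
```

A rig with finger bones would see them land in `restPose`, while a simplified rig would see more entries in `droppedLandmarks`.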

Per-finger detail needs MediaPipe Hands (a separate model). That integration is on the roadmap but has not shipped yet.


Best practices for clean capture

Lighting

MediaPipe degrades gracefully but works best with:

  • Even lighting — front-lit, no harsh shadows.
  • Contrast against background — wear clothes that aren't the same color as your wall.
  • Daylight or warm white LED — fluorescent flicker can introduce noise.

Framing

  • Full body in frame, head to feet visible. MediaPipe loses precision on occluded landmarks.
  • Camera at chest height — ground-level cameras have foreshortening artifacts.
  • 2-3 meters from camera — too close and the head/feet leave frame; too far and resolution suffers.
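
The framing advice can be checked from the landmarks themselves. The sketch below assumes MediaPipe-style landmarks with `x` and `y` normalized to the image and a `visibility` score between 0 and 1. The 0.5 visibility threshold, the 2% edge margin, and the function name are illustrative assumptions, not values Vision Capture is known to use.

```typescript
interface PoseLandmark {
  x: number;          // normalized to image width; 0..1 when inside the frame
  y: number;          // normalized to image height; 0..1 when inside the frame
  visibility: number; // 0..1 likelihood that the landmark is visible
}

// Indices in MediaPipe Pose's 33-point layout.
const NOSE = 0;
const LEFT_ANKLE = 27;
const RIGHT_ANKLE = 28;

// True when head and both feet are confidently detected and inside the frame.
function isFullBodyInFrame(
  landmarks: PoseLandmark[],
  minVisibility: number = 0.5, // assumed threshold
  margin: number = 0.02        // assumed safety margin from the image edge
): boolean {
  if (landmarks.length < 33) return false;
  return [NOSE, LEFT_ANKLE, RIGHT_ANKLE].every((i) => {
    const p = landmarks[i];
    return (
      p !== undefined &&
      p.visibility >= minVisibility &&
      p.x >= margin && p.x <= 1 - margin &&
      p.y >= margin && p.y <= 1 - margin
    );
  });
}
```

A check like this could gate the Record button or show a "step back" hint while the feet are out of frame.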

Performance

  • Slow, deliberate motion captures more cleanly than fast, jerky motion. Even pro mocap struggles with fast spins; webcam mocap struggles even more.
  • Pause between distinct moves — gives you a clean cut point to trim later.
  • Wear fitted clothing if you can. Loose drape can confuse the body landmark detection.

Hardware

  • Modern webcam, 720p or 1080p, 30 FPS minimum. Built-in laptop cameras work, but external HD webcams give noticeably cleaner results.
  • Decent CPU: MediaPipe runs on the CPU via WASM, so a recent (2020+) laptop is recommended for smooth real-time webcam capture.
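
For anyone wiring up a similar capture page, the webcam recommendation translates into a camera request along these lines. The object shape follows the standard `MediaStreamConstraints` dictionary accepted by `navigator.mediaDevices.getUserMedia`; the local types and the `buildCaptureConstraints` helper are assumptions made so the sketch stands alone, and nothing here is confirmed as what Vision Capture requests internally.

```typescript
// Local structural types so the sketch compiles without DOM typings.
interface IdealValue {
  ideal: number;
}

interface CaptureConstraints {
  audio: boolean;
  video: { width: IdealValue; height: IdealValue; frameRate: IdealValue };
}

// Builds a request for 720p or 1080p at 30 FPS. "ideal" lets the browser fall
// back to the closest mode the camera supports instead of rejecting the request.
function buildCaptureConstraints(prefer1080p: boolean): CaptureConstraints {
  return {
    audio: false, // pose capture does not need a microphone
    video: {
      width: { ideal: prefer1080p ? 1920 : 1280 },
      height: { ideal: prefer1080p ? 1080 : 720 },
      frameRate: { ideal: 30 },
    },
  };
}

// In a browser this would be used as:
//   const stream = await navigator.mediaDevices.getUserMedia(buildCaptureConstraints(false));
```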

Workflow — typical capture session

  1. Load your avatar in the editor. Make sure it's rigged if custom.
  2. Capture → Vision (Webcam). Grant camera permission if prompted.
  3. Stand back, full body in frame.
  4. Click Record. Perform the motion (e.g. wave, sit down, throw a punch).
  5. Click Stop. The captured motion applies to the avatar's timeline immediately.
  6. Watch playback. The avatar performs your motion in 3D.
  7. (Optional) Run Foot Locking to clean up foot sliding (webcam capture often has slight foot drift due to perspective).
  8. (Optional) Run Mirror Animation to fix any left/right confusion in the captured motion.
  9. (Optional) Run Quality Score to validate.
  10. Export GLB / BVH / .anim for use in your platform of choice.
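
How Foot Locking works internally isn't covered here. As a rough picture of what this kind of cleanup does in step 7, the sketch below pins an ankle in place while it is near the ground and barely moving. The coordinate convention (y up, ground at 0), both tolerances, and the `lockFeet` name are assumptions for illustration only.

```typescript
interface FootSample {
  x: number;
  y: number; // height above ground; "y is up" is an assumption of this sketch
  z: number;
}

// While a foot is planted (low and nearly still), hold it at the position where
// it first touched down so small capture drift does not read as sliding.
function lockFeet(
  samples: FootSample[],
  groundY: number = 0,
  heightTol: number = 0.03, // assumed: max height to count as "on the ground"
  speedTol: number = 0.01   // assumed: max horizontal move per frame to count as "still"
): FootSample[] {
  const out: FootSample[] = [];
  let anchor: FootSample | null = null;
  for (let i = 0; i < samples.length; i++) {
    const cur = samples[i];
    const prev = i > 0 ? samples[i - 1] : cur;
    const dx = cur.x - prev.x;
    const dz = cur.z - prev.z;
    const step = Math.sqrt(dx * dx + dz * dz); // horizontal motion this frame
    const planted = cur.y - groundY <= heightTol && step <= speedTol;
    if (planted) {
      if (anchor === null) anchor = { x: cur.x, y: groundY, z: cur.z };
      out.push({ x: anchor.x, y: anchor.y, z: anchor.z });
    } else {
      anchor = null; // foot lifted or moving: pass the sample through unchanged
      out.push({ x: cur.x, y: cur.y, z: cur.z });
    }
  }
  return out;
}
```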

BETA disclosure — known limitations

Vision Capture is BETA quality. It works today, but it has rough edges:

Pose ambiguities

MediaPipe sometimes confuses left and right (especially during back-facing motion or fast spins). The captured skeleton can show a hand on the wrong side for a few frames. Run Mirror Animation in SWAP mode to spot and fix these frames.
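
One simple way to spot such frames yourself is a continuity check: if swapping the left and right labels makes a frame line up better with the previous one, the labels probably flipped. The sketch below applies that idea to wrist positions. It assumes the first frame is labelled correctly, and the types and `findSwappedFrames` are hypothetical; this is not how Mirror Animation is known to work.

```typescript
interface Point2D {
  x: number;
  y: number;
}

interface WristPair {
  left: Point2D;
  right: Point2D;
}

function pointDistance(a: Point2D, b: Point2D): number {
  const dx = a.x - b.x;
  const dy = a.y - b.y;
  return Math.sqrt(dx * dx + dy * dy);
}

// Returns the indices of frames whose left/right labels look swapped,
// judged against the last frame with trusted (or corrected) labels.
function findSwappedFrames(track: WristPair[]): number[] {
  const swapped: number[] = [];
  if (track.length === 0) return swapped;
  let ref = track[0]; // assumption: the first frame is labelled correctly
  for (let i = 1; i < track.length; i++) {
    const cur = track[i];
    const keepCost = pointDistance(cur.left, ref.left) + pointDistance(cur.right, ref.right);
    const swapCost = pointDistance(cur.left, ref.right) + pointDistance(cur.right, ref.left);
    if (swapCost < keepCost) {
      swapped.push(i);
      ref = { left: cur.right, right: cur.left }; // continue from corrected labels
    } else {
      ref = cur;
    }
  }
  return swapped;
}
```

This catches abrupt flips between consecutive frames. It cannot tell a label flip from a performer who genuinely crosses their arms very quickly, so the flagged frames are best treated as candidates to review.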

Depth uncertainty

Webcam is 2D — depth is inferred. Motion that's directly toward or away from camera is imprecise. Side-on motion captures cleanest.

Single-person only

The current pipeline tracks one person. Multi-person scenes capture only the most prominent figure.

Costume artifacts

Capes, long dresses, and very loose hair can confuse the landmark detector. Capture in fitted clothing, then re-apply your character's costume in the rig.

Low-light degradation

Below ~50 lux (dim room), accuracy drops sharply. Capture in well-lit settings.

No face capture (yet)

Body pose only. Facial expression capture needs a separate MediaPipe model (Face Mesh), planned for integration post-launch.


Edge cases

Webcam permission denied

The capture won't work without camera access. The "Vision (Webcam)" mode shows a permission prompt; if denied, switch to "Video Mocap" instead and capture from a recorded video.

Video file unsupported

Browsers support mp4, webm, and (sometimes) mov. AVI, MKV, and other formats may not load — convert to mp4 first.
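
A pre-check along these lines can tell you whether a file needs converting before you drop it in. The sketch maps the extension to a MIME type and asks the browser whether it can play it. `canPlay` has the same contract as the standard `HTMLMediaElement.canPlayType` (an empty string means "no"); the helper names and the "ok"/"convert" result are assumptions for illustration.

```typescript
// Same contract as HTMLMediaElement.canPlayType: "", "maybe", or "probably".
type CanPlay = (mime: string) => string;

function mimeForExtension(ext: string): string | null {
  switch (ext) {
    case "mp4":
      return "video/mp4";
    case "webm":
      return "video/webm";
    case "mov":
      return "video/quicktime"; // playable in some browsers only
    default:
      return null; // avi, mkv, and anything else
  }
}

function checkVideoFile(fileName: string, canPlay: CanPlay): "ok" | "convert" {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  const mime = mimeForExtension(ext);
  if (mime === null) return "convert";
  return canPlay(mime) === "" ? "convert" : "ok";
}

// In a browser:
//   const probe = document.createElement("video");
//   checkVideoFile(file.name, (mime) => probe.canPlayType(mime));
```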

Very long videos

A 30-minute video takes several minutes to process. Vision Capture chunks the work to keep the UI responsive, but be patient on long files. Consider trimming the source video to just the segment you want before importing.
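
To see why long files take a while: a 30-minute clip at 30 FPS is 30 × 60 × 30 = 54,000 frames, each needing one pose inference. The sketch below shows one way such work could be split into chunks. The chunk size and the `planFrameChunks` helper are assumptions; the actual chunking strategy isn't documented here.

```typescript
interface FrameChunk {
  start: number; // first frame index in the chunk
  end: number;   // one past the last frame index (exclusive)
}

// Splits a video into fixed-size runs of frames. A caller would process one
// chunk, yield to the browser so the UI can repaint, then process the next.
function planFrameChunks(durationSeconds: number, fps: number, chunkSize: number): FrameChunk[] {
  const size = Math.floor(chunkSize);
  if (durationSeconds <= 0 || fps <= 0 || size < 1) return [];
  const totalFrames = Math.floor(durationSeconds * fps);
  const chunks: FrameChunk[] = [];
  for (let start = 0; start < totalFrames; start += size) {
    chunks.push({ start, end: Math.min(start + size, totalFrames) });
  }
  return chunks;
}
```

With an assumed chunk size of 500 frames, the 30-minute example yields 108 chunks. Trimming the source to a 20-second segment cuts that to 600 frames, or 2 chunks.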

Performance during real-time webcam

At 30 FPS webcam input, Vision Capture runs at ~25-28 FPS pose estimation on a typical laptop. The ~2-5 FPS gap doesn't affect captured motion quality (the motion is reconstructed at the source frame rate), but the live preview may feel slightly choppy.
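
How the motion gets reconstructed at the source frame rate isn't specified. One plausible approach is to timestamp each pose estimate and interpolate onto the source frame times, so a missed frame is filled from its neighbors. The sketch below does this with linear interpolation for a single landmark coordinate; the method, the types, and `resampleToFrameRate` are assumptions, not the documented pipeline.

```typescript
interface PoseSample {
  t: number;     // capture time in seconds
  value: number; // one landmark coordinate at that time
}

// Resamples irregularly spaced estimates onto evenly spaced frames at `fps`.
// Samples must be sorted by time. Frames outside the sampled range are clamped.
function resampleToFrameRate(samples: PoseSample[], fps: number, frameCount: number): number[] {
  const out: number[] = [];
  if (samples.length === 0 || fps <= 0) return out;
  let j = 0; // index of the sample at or just before the current frame time
  for (let f = 0; f < frameCount; f++) {
    const t = f / fps;
    while (j < samples.length - 2 && samples[j + 1].t <= t) j++;
    const a = samples[j];
    const b = samples[Math.min(j + 1, samples.length - 1)];
    if (t <= a.t || b.t === a.t) {
      out.push(a.value);
    } else if (t >= b.t) {
      out.push(b.value);
    } else {
      const u = (t - a.t) / (b.t - a.t); // 0 at sample a, 1 at sample b
      out.push(a.value + u * (b.value - a.value));
    }
  }
  return out;
}
```

Joint rotations would need spherical rather than linear interpolation, which this sketch leaves out.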


Related

  • Rig Studio — set up your custom rig before capturing onto it
  • Foot Locking — almost always run after Vision Capture to clean foot drift
  • Mirror Animation — fix any L/R confusion in the captured motion
  • Retarget Studio — alternative to Vision Capture if you have existing clips to import
  • Roadmap — face capture, hand detail, multi-person — what's coming next