The control problem: why text isn't enough for VR

Imagine putting on a VR headset and reaching toward a virtual coffee cup. You see your hand approach in the video, but when you try to grab it, your fingers pass through it like a ghost. The cup doesn't react. Your hand doesn't bend around it. The virtual world treats your motion as decoration, not control.

This is the core problem that video world models face today. Current systems accept only coarse control signals, mostly text prompts. You can ask them to generate "someone picking up a cup," and they'll hallucinate a plausible video. But they have no idea what your actual hand is doing. They can't see your joints bending or your wrist rotating in real time. They generate generic human motion that has nothing to do with your specific movements.

Extended reality demands something different. When you move your hand in physical space, the virtual world should respond to that exact motion. Not a similar motion. Not a motion that fits a text description. Your motion. This mismatch between what tracking technology can measure and what generative models can condition on is the fundamental gap this paper addresses.

The issue becomes sharper when you think about the tasks people actually want to do in VR: push a button with precise finger placement, open a jar by rotating your wrist, turn a steering wheel with both hands in coordination. These require dexterous interaction, fine control, responsiveness. A text-conditioned model doesn't have the information to make these interactions feel real. It can't distinguish between your thumb moving and your pinky moving. It can't track whether you're rotating your wrist clockwise or counterclockwise.

How we see you: tracking hands and head

The solution starts with recognizing what information is already available. Every modern VR headset has cameras tracking the user's head position and hand pose in real time. This isn't vague data. It's concrete 3D geometry: the spatial position of your head in the virtual space, the rotation of your view direction, and the position and rotation of each of your 26 hand joints.

Rather than converting this rich spatial information into imprecise descriptions, the system works directly with the numbers. Head pose is represented as six degrees of freedom, describing both where your eyes are in 3D space and which direction you're looking. Hand pose comes from the UmeTrack model, which tracks the wrist translation and rotation along with the position of each finger joint. This is the language the video generator will learn to speak.

The pipeline is straightforward: your headset tracks your body in real time, converts the tracking data into 3D coordinates, feeds those coordinates into a video model, and receives back a video stream of the virtual environment. Before explaining how the model learns to use this signal, it's worth understanding that tracking provides exact spatial information, not fuzzy inference. When you move your index finger, the system knows the position changed, not just that "the hand moved."
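To make the conditioning signal concrete, here is a minimal sketch of what one frame of tracking data might look like as a data structure. The exact layout is an assumption for illustration (the paper describes a 6-DoF head pose and 26 hand joints; the choice of 6 values per joint and the flattening order below are hypothetical):

```python
import numpy as np
from dataclasses import dataclass

N_JOINTS = 26  # per-hand joint count described in the text

@dataclass
class FrameConditioning:
    """One frame of tracked control signal (layout is illustrative)."""
    head_pose: np.ndarray   # (6,)      xyz position + view-direction rotation
    left_hand: np.ndarray   # (26, 6)   per-joint translation + rotation (assumed)
    right_hand: np.ndarray  # (26, 6)

    def to_vector(self) -> np.ndarray:
        """Flatten everything into a single conditioning vector per frame."""
        return np.concatenate([
            self.head_pose,
            self.left_hand.ravel(),
            self.right_hand.ravel(),
        ])

frame = FrameConditioning(
    head_pose=np.zeros(6),
    left_hand=np.zeros((N_JOINTS, 6)),
    right_hand=np.zeros((N_JOINTS, 6)),
)
print(frame.to_vector().shape)  # (6 + 26*6 + 26*6,) = (318,)
```

The point is that this is exact geometry, not a caption: every joint contributes its own coordinates to the vector the generator conditions on.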

Pipeline showing tracked hands and head feeding into the video generation system, with outputs showing responsive virtual environments

Teaching the model to watch hands: the conditioning challenge

Now comes the hard problem. Video diffusion models, the best video generators available, were trained on internet data where hands are usually tiny background details. A hand is three percent of the frame, fingers are barely distinguishable, and precise hand pose is almost never relevant to the caption. When you suddenly demand that the model pay close attention to 26 joint positions and make them central to generation, you're asking it to unlearn what it naturally learned during pretraining.

The challenge is architectural. Video models need to be told explicitly to use hand pose information, and there are many ways to do this poorly. Too compressed, and the signal vanishes before it can influence the model's deeper layers. Too naive, and the model learns to ignore it in favor of patterns it already knows well. The paper tests multiple conditioning strategies and finds that a hybrid approach works best, one that integrates hand pose information at multiple levels of the diffusion process.
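The idea of multi-level injection can be sketched with a toy example. Everything below is a hypothetical stand-in, not the paper's actual mechanism: random projections replace learned networks, and FiLM-style scale/shift modulation stands in for whatever per-block conditioning the paper uses. The sketch only illustrates the structural point that the pose signal enters both early (as extra input channels) and inside every block, so deep layers cannot simply discard it:

```python
import numpy as np

rng = np.random.default_rng(0)

def pose_encoder(pose, out_dim, seed):
    """Toy stand-in for a learned pose encoder: a fixed random projection."""
    r = np.random.default_rng(seed)
    w = r.standard_normal((pose.size, out_dim)) / np.sqrt(pose.size)
    return np.tanh(pose @ w)

def hybrid_condition(latents, pose):
    """Inject pose at every level, not just the input.

    Level 0: concatenate pose features to the input (early injection).
    Each block: scale/shift modulation, so the signal stays present deep
    in the network instead of disappearing after the first layer.
    """
    pose_feats = pose_encoder(pose, 8, seed=1)
    x = np.concatenate(
        [latents, np.broadcast_to(pose_feats, (*latents.shape[:-1], 8))], axis=-1
    )
    for level in range(3):  # toy "blocks" of the generator
        scale = pose_encoder(pose, x.shape[-1], seed=10 + level)
        shift = pose_encoder(pose, x.shape[-1], seed=20 + level)
        x = x * (1 + scale) + shift  # modulation keeps pose in the loop
    return x

latents = rng.standard_normal((4, 16))  # 4 latent tokens, toy width 16
pose = rng.standard_normal(318)         # flattened head + hand pose
out = hybrid_condition(latents, pose)
print(out.shape)  # (4, 24)
```

A purely early-injection variant would drop the in-loop modulation; the failure mode described above, where the model drifts back to generic motion, corresponds to the signal having no influence past the first layers.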

The difference becomes visible when you look at what the model generates. With a poor conditioning strategy, the model produces hands that drift away from the input tracking, or hands that ignore the signal entirely and just do generic human motion. With the hybrid strategy, the generated hands stay aligned with where the tracking says they should be. The hands in the video actually match the hands the user is making in physical space.

Qualitative comparison showing ground-truth hand positions in red, predicted hands in orange, and overlap in green across different conditioning approaches

The hybrid conditioning strategy achieves tighter alignment between tracked hand input and generated output.

This extends to more complex scenarios. When testing on GigaHands, a dataset of diverse hand interactions with objects, the same strategy generalizes across different types of motion and different environments. A conditioning mechanism that works for simple hand tracking also works when hands are interacting with a steering wheel or opening a jar.

The conditioning innovation matters because it makes hand pose a first-class control signal rather than an afterthought. The model doesn't just tolerate hand information; it integrates it into its decision-making at every level of the generation process. This is what allows the subsequent focus on camera control. When both head pose and hand pose are properly conditioned, the system can generate worlds where what you see matches where you're looking, and where your hands interact with objects in physically plausible ways.

Comparison of camera-only, hand-only, and joint control, showing that both modalities together produce coherent results

Camera and hand control work together. Using only one leaves the other dimension of interaction uncontrolled.

Speed matters: from bidirectional teacher to real-time system

A perfect video model that takes 30 seconds to generate one frame is worthless in VR. The moment between moving your hand and seeing the world respond is called latency, and users are exquisitely sensitive to it. More than about 20 milliseconds of latency and the interaction breaks apart psychologically. You stop feeling like you're controlling something and start feeling like you're watching a video that occasionally responds to your input.
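The arithmetic here is stark. Taking the roughly 20 ms threshold mentioned above, a back-of-the-envelope budget (the per-stage splits below are assumptions, purely for illustration) shows how little room generation actually has:

```python
# Hypothetical motion-to-photon budget, assuming the ~20 ms threshold.
budget_ms = 20.0
stages = {
    "hand/head tracking":  3.0,  # assumed
    "conditioning encode": 1.0,  # assumed
    "video generation":   12.0,  # assumed: the dominant cost
    "display scanout":     4.0,  # assumed
}
total = sum(stages.values())
print(f"total {total:.1f} ms, headroom {budget_ms - total:.1f} ms")

# A model that takes 30 seconds per frame misses this budget badly:
slowdown = 30_000 / budget_ms
print(f"a 30 s/frame model is {slowdown:.0f}x over budget")  # 1500x
```

Whatever the true split, the generation step has at most a dozen or so milliseconds to spend, which is why the distillation described next is not optional.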

This creates a tension in generative models. You can achieve high quality by running the diffusion process iteratively, building up the output over many steps. But iteration takes time. The paper resolves this through knowledge distillation, a technique where a high-quality model (the teacher) trains a faster model (the student) to produce similar outputs without the extra computation.

The teacher model is bidirectional, meaning it can see both past and future context when generating a frame. This gives it the information to make confident predictions. The student model is causal, meaning it can only see the past, just like real interaction where the future hasn't happened yet. The student learns from the teacher how to make good predictions with only past context. Through this training process, the student learns to run fast enough for real-time interaction while preserving the quality benefits of careful hand conditioning.
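The teacher-student asymmetry can be shown with a deliberately tiny toy. Both "models" below are stand-ins (a sequence mean and a single scalar weight), not the paper's architecture or loss; the sketch only captures the structure of distillation: the teacher sees the whole sequence, the student sees only the past, and training regresses the student's causal prediction onto the teacher's bidirectional one:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_predict(frames, t):
    """Bidirectional teacher: conditions on past AND future frames."""
    return frames.mean(axis=0)  # toy: full-sequence context

def student_predict(frames, t, w):
    """Causal student: only frames[:t+1] are visible, as at test time."""
    return frames[: t + 1].mean(axis=0) * w  # toy model, one weight

frames = rng.standard_normal((8, 4))  # 8 frames of 4-dim toy latents
w, lr = 1.0, 0.1
for _ in range(200):
    t = int(rng.integers(1, len(frames)))
    target = teacher_predict(frames, t)          # what the teacher says
    pred = student_predict(frames, t, w)         # what the student can see
    past_mean = frames[: t + 1].mean(axis=0)
    # Gradient of the squared distillation loss ||pred - target||^2 w.r.t. w
    grad = 2 * ((pred - target) * past_mean).sum()
    w -= lr * grad
```

After training, the student makes its predictions from past context alone, which is exactly the constraint a real-time system operates under.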

This isn't just about making things faster. It's about making the conditioned hand information actually useful in practice. A sophisticated conditioning strategy for hands doesn't matter if the system can't run on a VR headset in real time. The distillation approach keeps the conditioning benefits while enabling the responsiveness that embodied interaction requires.

Does it actually work: testing with real users

The technical innovations matter only if they improve what users actually experience. To test this, the paper puts the system in front of humans with three concrete tasks: push a green button, open a jar, and turn a steering wheel. These aren't arbitrary choices. Each task requires different types of hand-object interaction. A button requires precise, localized finger placement. A jar requires coordinated hand rotation around an axis. A wheel requires two-handed coordination and sustained rotation. Together, they exercise the system's ability to handle dexterous interaction.

The three user study tasks showing button pushing, jar opening, and steering wheel interaction

User study tasks: button pushing, jar opening, and wheel turning.

Subjects wore a commercial VR headset and completed these tasks in two conditions. In the baseline, they received a text prompt describing the task. In the experimental condition, the system used their actual tracked hand poses. The evaluation measured two things: whether they succeeded at the task, and whether they felt like they had control over what was happening.

The results show clear improvements. Task success increases when the model actually sees the user's hands. But perhaps more interesting is the subjective finding: users report significantly higher perceived control and agency when the system is conditioned on their actual hand tracking. This matters because it shows the system isn't just hitting targets by chance. It's making users feel like the virtual world is responding to their intention.

User evaluation results comparing hand-conditioned and baseline approaches across task success and perceived control

Hand-conditioned video generation improves both task success and users' sense of control over their actions.

The user study is where the narrative becomes concrete. Every technical decision made sense in isolation, but the real question is whether it produces a system that humans want to use. The answer is yes. When the virtual world responds to your actual hands, not generic ones, the interaction feels responsive and intentional. This is the proof that the conditioning work, the speed optimization, and the training strategy together solve a real problem in embodied interaction.

What breaks and what's next

No system is perfect, and clarity about limitations is more useful than false confidence. Hand occlusion, when one hand hides the other, sometimes confuses tracking and generation. Very fast, dexterous movements can occasionally outpace the model's ability to keep up. Generating novel scenarios that differ significantly from the training data remains challenging, as it does for all generative models. Scaling to longer interaction sequences involves tradeoffs between coherence and the computational constraints of real-time generation.

But these limitations point toward natural directions for future work. Occlusion is hard because the model has to imagine unseen hand configurations; as hand tracking improves, this becomes easier. Speed constraints will relax as hardware advances and model efficiency improves. Generalization to novel scenarios is inherently difficult for any generative model, but hand conditioning at least provides a foothold that text-only models lack. A system that tracks user intent through hand pose can learn from a smaller amount of data more effectively than one that relies purely on language descriptions.

The work also connects to a broader trajectory in embodied AI. Other approaches, like hand-object interaction generation for egocentric video, have explored how to generate interactions from hand motion. This paper extends that by building a full world model conditioned on hand and head control, enabling interactive simulation rather than just interaction prediction. Similarly, research on leveraging human videos for world models has shown the value of using human demonstrations as training signal, a principle that strengthens this work's foundation in real user motion.

The broader context is that embodied AI roadmaps increasingly emphasize world models that simulate visual environments as a key component of agent learning and human-AI interaction. This paper makes a specific contribution to that vision: a world model that responds to human body motion. As these models improve, the question shifts from "can we condition on hand pose" to "what can we build now that we can?"

The honest assessment is that this is progress on a real problem, not a complete solution. Hand tracking and video generation will both continue improving independently, and this work benefits from those advances. The conditioning strategy is general enough to work with better video models as they emerge. The distillation approach will become easier as efficiency improves. The gap between this system and fully immersive, lag-free virtual embodiment narrows with each improvement in the underlying technology.


This is a Plain English Papers summary of a research paper called Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.