Introducing AniX, a system that lets users provide a 3DGS scene along with a 3D or multi-view character, then interactively control the character's behaviors and actively explore the environment through natural language commands. The system features:
(1) Consistent Environment and Character Fidelity, ensuring visual and spatial coherence with the user-provided scene and character; (2) a Rich Action Repertoire covering a wide range of behaviors, including locomotion, gestures, and object-centric interactions; (3) Long-Horizon, Temporally Coherent Interaction, enabling iterative user interaction while maintaining continuity across generated clips; and (4) Controllable Camera Behavior, which explicitly incorporates camera control—analogous to navigating 3DGS views—to produce accurate, user-specified viewpoints.
Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment.
In this work, we introduce AniX, which leverages the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors, from basic locomotion to object-centric interactions, while freely exploring the environment. We formulate this task as conditional autoregressive video generation: AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character. Building on a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.
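As a rough illustration of this formulation (the notation here is ours, not taken from the paper), the long-horizon generation can be viewed as an autoregressive factorization over clips, with each clip conditioned on the rendered scene video, the character, the text instruction, and the preceding clip:

```latex
% Sketch of the conditional autoregressive factorization (our notation, not AniX's).
% x_k: k-th generated clip; s_k: rendered 3DGS scene video for round k;
% c: multi-view character condition; t_k: text instruction for round k.
p\bigl(x_{1:K} \mid s_{1:K}, c, t_{1:K}\bigr)
  = \prod_{k=1}^{K} p_\theta\bigl(x_k \mid x_{k-1},\, s_k,\, c,\, t_k\bigr),
  \qquad x_0 \equiv \varnothing .
```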
(a) Each training sample consists of a 3D character and a video. Through segmentation and inpainting, we obtain scene videos and mask sequences.
(b) AniX predicts target video tokens conditioned on scene, mask, text, and multi-view character tokens within a Multi-Modal Diffusion Transformer, trained using Flow Matching.
(c) AniX is extended to an auto-regressive mode by introducing an additional conditioning input, the preceding video tokens. This supports multi-round user interaction and long-horizon generation while maintaining temporal continuity and semantic coherence between adjacent video clips; a minimal training-step sketch covering these conditions follows below.
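The following is a minimal, PyTorch-style sketch of one flow-matching training step with these conditions. The module interface and tensor names (e.g. scene_tokens, char_tokens, prev_tokens) are our own assumptions for illustration and do not reflect the actual AniX implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical multi-modal DiT interface (illustrative only):
#   model(x_t, t, scene=..., mask=..., text=..., char=..., prev=...) -> predicted velocity
def flow_matching_step(model, video_tokens, scene_tokens, mask_tokens,
                       text_tokens, char_tokens, prev_tokens=None):
    """One flow-matching (rectified-flow) training step over target video tokens."""
    x1 = video_tokens                                   # target video latents
    x0 = torch.randn_like(x1)                           # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)       # per-sample time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))            # broadcastable time

    # Linear interpolation path between noise and data.
    x_t = (1.0 - t_) * x0 + t_ * x1
    target_velocity = x1 - x0                           # ground-truth velocity along the path

    pred_velocity = model(
        x_t, t,
        scene=scene_tokens, mask=mask_tokens,
        text=text_tokens, char=char_tokens,
        prev=prev_tokens,                               # preceding clip tokens (auto-regressive mode)
    )
    return F.mse_loss(pred_velocity, target_velocity)
```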
(a) Users first specify the inputs, including the character, 3DGS scene, virtual camera location, and character anchor.
(b) The user-provided text instruction is parsed, and a corresponding camera path is generated. Applying this path to the 3DGS scene produces a rendered scene video.
(c) Conditioned on these inputs, together with the rendered scene video, AniX generates the output video clip.
(d) Steps (b) and (c) can be performed iteratively, enabling temporally consistent, long-horizon interactions; a sketch of this loop is shown below.
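Below is an illustrative sketch of this multi-round loop. parse_instruction, plan_camera_path, render_gaussians, and generate_clip are hypothetical helpers named for clarity, not part of a released AniX API.

```python
# Illustrative multi-round inference loop (our sketch of the pipeline in steps (a)-(d)).
def interactive_session(anix_model, gaussian_scene, character_views,
                        camera_pose, anchor, instructions):
    prev_clip = None
    outputs = []
    for text in instructions:                            # one round per user command
        action, camera_spec = parse_instruction(text)    # parse the text instruction
        camera_path = plan_camera_path(camera_pose, camera_spec, anchor)
        scene_video = render_gaussians(gaussian_scene, camera_path)   # rendered 3DGS scene video

        clip = generate_clip(
            anix_model,
            scene_video=scene_video,
            character=character_views,
            text=action,
            prev_clip=prev_clip,                         # keeps adjacent clips temporally coherent
        )
        outputs.append(clip)
        prev_clip = clip                                 # auto-regressive chaining
        camera_pose = camera_path[-1]                    # continue from the last camera pose
    return outputs
```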
AniX can generalize to various novel actions that are not seen during training. This phenomenon can be interpreted through the lens of post-training in large language models, where fine-tuning typically does not disrupt the pre-trained representation space; rather, it adjusts the response style—for example, to make the outputs more helpful or harmless—while preserving the extensive knowledge acquired during pre-training. In our case, the structurally simple fine-tuning data—composed primarily of fundamental locomotion behaviors—serve to refine motion dynamics and align human embodiment representations, rather than to redefine the model's generative space.
AniX demonstrates strong generalization in controlling previously unseen characters. Leveraging mature 3D generation tools—such as Hunyuan3D, Tripo, Meshy, and Rodin—or sourcing assets from online stores like Sketchfab, diverse 3D characters can be easily acquired and used directly for inference.
AniX supports flexible scene customization. Using state-of-the-art 3DGS scene generators, users can create diverse environments and control any character to explore these worlds. In this work, most 3DGS scenes are sourced from World Labs Marble.
AniX supports auto-regressive generation, enabling the creation of temporally coherent video sequences that build upon previously generated clips. This capability allows for extended, long-horizon user-model interactions.
@article{xxx,
author = {xxx},
title = {xxx},
journal = {xxx},
year = {xxx},
}