# kimodo-api
A REST API wrapper around NVIDIA Kimodo, the state-of-the-art text-to-motion diffusion model trained on 700 hours of commercial mocap data.

This image turns Kimodo into a microservice you can call from any pipeline, with no Python environment needed.
## Quick Start

```shell
docker pull ghcr.io/eyalenav/kimodo-api:latest

docker run --rm --gpus '"device=0"' -p 9551:9551 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```
> ⚠️ First run downloads Llama-3-8B-Instruct (~16 GB) for the text encoder. Requires a Hugging Face token with access to `meta-llama/Meta-Llama-3-8B-Instruct`.
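Because the first run downloads ~16 GB of weights, the container can take a while before it answers requests. A readiness poll against the `/health` endpoint might look like the sketch below (illustrative; only the port and endpoint path come from this README, the function name and timeouts are made up):

```python
import time
import urllib.request
import urllib.error


def wait_for_ready(url: str = "http://localhost:9551/health",
                   timeout: float = 600.0,
                   interval: float = 5.0) -> bool:
    """Poll the health endpoint until it answers 200 OK or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (e.g. still downloading weights)
        time.sleep(interval)
    return False
```

A deploy script can call `wait_for_ready()` once after `docker run` and only start submitting prompts when it returns `True`.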
## API

### POST /generate

Generate a motion clip from a text prompt.

```shell
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person pushing through a crowd aggressively"}'
```

Response: a binary NPZ file in the SOMA 77-joint skeleton format, compatible with BVH export.
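The same call can be made from Python with the standard library and the binary response written to disk. This is an illustrative sketch: the function name and output filename are mine; the endpoint, port, header, and JSON body are taken from the curl example above.

```python
import json
import urllib.request


def generate_motion(prompt: str,
                    host: str = "http://localhost:9551",
                    out_path: str = "motion.npz") -> str:
    """POST a text prompt to /generate and save the binary NPZ response."""
    req = urllib.request.Request(
        f"{host}/generate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
    return out_path
```

The saved file can then be opened with `numpy.load(out_path)`; the array names inside the archive are not documented in this README, so inspect `.files` on the loaded object to see what the model returns.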
### GET /health

```shell
curl http://localhost:9551/health
# {"status": "ok"}
```
## Requirements
| Resource | Minimum |
|---|---|
| GPU | RTX 3090 / A100 / RTX 6000 Ada |
| VRAM | 24 GB |
| RAM | 32 GB |
| Disk | 50 GB (model weights) |
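Since the 24 GB VRAM floor is a hard requirement, a quick preflight check can save a failed container start. A sketch using `nvidia-smi` (the helper name and the 24 GB threshold comparison are mine; the query flags are standard `nvidia-smi` options):

```python
import shutil
import subprocess


def gpu_vram_mib(index: int = 0):
    """Total VRAM of the given GPU in MiB, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits", "-i", str(index)],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip().splitlines()[0])


vram = gpu_vram_mib()
if vram is not None and vram < 24_000:
    print(f"GPU 0 has only {vram} MiB VRAM; Kimodo needs ~24 GB.")
```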
## What's inside

- Kimodo: NVIDIA's kinematic motion diffusion model (77-joint SOMA skeleton)
- LLM2Vec text encoder backed by Llama-3-8B-Instruct
- FastAPI server on port 9551
- Health check + graceful startup
## Part of VisionAI-Flywheel

This service is one component of a full synthetic surveillance data pipeline:

```
[kimodo-api]      → NPZ motion
      ↓
[render-api]      → SOMA mesh render (MP4)
      ↓
[cosmos-transfer] → Sim2Real photorealistic video
      ↓
[NVIDIA VSS]      → VLM annotation → fine-tuning dataset
```

Full pipeline: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)
## License

Apache 2.0; see [LICENSE](LICENSE).

Kimodo model weights are released under the NVIDIA Open Model License and downloaded at runtime; they are not bundled in this image.