# [Tool] locateanything-batch — batched + KV-cached LocateAnything-3B, ~2.7× faster
If you've used nvidia/LocateAnything-3B for
open-vocabulary grounding, you may have hit this: its custom generate hard-asserts batch == 1.
You can only ground one (image, query) pair at a time, which is painful when you're scanning a lot of
frames or running several prompts per image.
The good news: the underlying Qwen2 LM and the vision encoder are perfectly batch-safe — the batch==1
lock lives only in the hand-rolled multi-token-prediction (MTP) decode loop. So I wrote a faithful
batched fork of that loop and packaged it:
👉 GitHub: https://github.com/liuwang97/LocateAnything-3B-batch (MIT)
```python
from locateanything_batch import load, generate_batch, load_pil

load()
img = load_pil("photo.jpg")
[answer] = generate_batch([(img, "a dog")])  # ... <box><312><144><688><902></box> ...

# or many at once: heterogeneous images & prompts in one call
generate_batch([(img1, "a dog"), (img2, "a red car"), (img1, "a bicycle")])
```
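For readers who want coordinates rather than the raw answer string, here is a minimal sketch of pulling boxes out of the `<box><…><…><…><…></box>` format shown in the comment above. The helpers `parse_boxes` and `to_pixels` are hypothetical and not part of the package. The (x1, y1, x2, y2) ordering and the 0-1000 normalized grid are assumptions inferred from the "/1000" wording later in this post, so check them against the model card before relying on them.

```python
import re

# Matches <box><a><b><c><d></box> with four integer fields.
_BOX_RE = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")


def parse_boxes(answer: str) -> list[tuple[int, int, int, int]]:
    """Return every box found in the answer string as a tuple of four ints."""
    return [tuple(int(v) for v in m.groups()) for m in _BOX_RE.finditer(answer)]


def to_pixels(box, width, height, scale=1000):
    """Map a box from the assumed 0..scale grid to pixel coordinates.

    Assumes (x1, y1, x2, y2) ordering, which is a guess, not a documented fact.
    """
    x1, y1, x2, y2 = box
    return (
        x1 * width / scale,
        y1 * height / scale,
        x2 * width / scale,
        y2 * height / scale,
    )


boxes = parse_boxes("... <box><312><144><688><902></box> ...")
print(boxes)  # [(312, 144, 688, 902)]
```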
## What it does
- True batching over an arbitrary list of (image, query) pairs.
- KV-cache + vision reuse: for the same image with multiple prompts, the prefill KV cache and
the [image + instruction] vision encode are computed once and forked across queries; only the
differing query tails + decode are batched. The decode keeps the model's own MTP KV cache per step.
- Numerically exact: under greedy decoding it is token-identical to the stock batch=1 path.
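The reuse idea can be illustrated with plain bookkeeping: group the pairs by image, do the expensive per-image work once, then fan out over that image's queries while keeping the caller's original order. This is only a sketch of the grouping logic, not the package's implementation. `encode_prefix` and `answer_query` are hypothetical stand-ins for the vision encode + prefix prefill and the per-query decode, and the inner loop runs sequentially where the real code batches the query tails.

```python
from collections import OrderedDict


def group_by_image(pairs):
    """Group (image, query) pairs by image identity, remembering original positions."""
    groups = OrderedDict()
    for idx, (image, query) in enumerate(pairs):
        # Key on object identity so the same loaded image is recognized as shared.
        groups.setdefault(id(image), (image, []))[1].append((idx, query))
    return list(groups.values())


def run_with_shared_prefix(pairs, encode_prefix, answer_query):
    """Call encode_prefix once per unique image and answer_query once per pair."""
    results = [None] * len(pairs)
    for image, items in group_by_image(pairs):
        prefix = encode_prefix(image)  # expensive step, done once per unique image
        for idx, query in items:
            results[idx] = answer_query(prefix, query)  # cheap per-query tail
    return results
```

With the three-pair call from the usage example, `encode_prefix` would run twice (once for `img1`, once for `img2`) rather than three times.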
## Is it actually equivalent? Yes, and it's tested
The whole point is to be a drop-in speedup, not a different model. The correctness gate runs 5 tiers
(B=1 vs stock generate; identical rows; same image / different prompts in both orders; different
images & lengths; ragged mixed batches) and checks the batched output is token-for-token identical to
both the stock generate and per-pair single runs:
12 / 12 checks pass, 0 fail at greedy (rp=1.0): bit-exact box decode.
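The shape of such a check is simple to sketch. The function below is a hypothetical illustration, not the repo's actual gate: `batched_fn` and `single_fn` are assumed to be callables that take a list of (image, query) pairs and return one token sequence (or string) per pair.

```python
def check_equivalence(pairs, batched_fn, single_fn):
    """Return the indices where the batched output differs from the per-pair output."""
    batched = batched_fn(pairs)
    if len(batched) != len(pairs):
        raise ValueError("batched_fn must return exactly one output per pair")
    mismatches = []
    for i, pair in enumerate(pairs):
        [single] = single_fn([pair])  # run this pair alone, batch size 1
        if list(batched[i]) != list(single):  # element-by-element comparison
            mismatches.append(i)
    return mismatches
```

An empty return value means every row matched its single-run counterpart exactly.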
(Under repetition_penalty=1.15 a few drop to "box Δ ≤ 8/1000" — that's just bf16 batched-GEMM
non-associativity flipping a tight argmax; greedy is the exact-equivalence guarantee.)
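To see how summation order alone can flip a tight argmax, here is a small illustration in ordinary Python floats (float64, not bf16, and not the model's actual logits). The same three numbers added in two different orders give results that differ in the last bit, which is enough to change the winner when a rival value sits exactly at the boundary.

```python
def argmax(values):
    """Index of the first maximum, so ties resolve to the lowest index."""
    return max(range(len(values)), key=values.__getitem__)


# Same three addends, two association orders.
left_to_right = (0.1 + 0.2) + 0.3  # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)  # 0.6

rival = 0.6  # a competing score sitting right at the boundary

print(left_to_right == right_to_left)    # False
print(argmax([rival, left_to_right]))    # 1: the summed score wins by one ulp
print(argmax([rival, right_to_left]))    # 0: exact tie, first index wins
```

bf16 keeps far fewer mantissa bits than float64, so the same effect shows up at much coarser magnitudes there.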
## Speed (RTX 5070 Ti / sm_120)
- Batched vision encode: 2.6–3×, bit-identical to per-image (max|Δ| = 0.00e+00), flat VRAM.
- Batched shared-prefix prefill: ~3.6× (the tiny image+instruction prefix is GPU-starved at batch=1).
- End-to-end batched pipeline: ~2.7× vs serial.
- Real production scan: 61.3 ms/frame · 16.3 frame/s at batch≈32 over 49,016 frames
(vs ~430 ms/frame single-image baseline).
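As a quick sanity check on the production-scan figures, the arithmetic below converts the quoted per-frame latencies into throughput and total wall-clock time. It uses only the numbers stated in the list above; the variable and function names are made up for this sketch.

```python
FRAMES = 49_016
BATCHED_MS_PER_FRAME = 61.3    # batched scan, as quoted above
BASELINE_MS_PER_FRAME = 430.0  # approximate single-image baseline, as quoted above


def frames_per_second(ms_per_frame):
    return 1000.0 / ms_per_frame


def total_minutes(frames, ms_per_frame):
    return frames * ms_per_frame / 1000.0 / 60.0


print(round(frames_per_second(BATCHED_MS_PER_FRAME), 1))       # 16.3
print(round(total_minutes(FRAMES, BATCHED_MS_PER_FRAME)))      # 50
print(round(total_minutes(FRAMES, BASELINE_MS_PER_FRAME)))     # 351
print(round(BASELINE_MS_PER_FRAME / BATCHED_MS_PER_FRAME, 1))  # 7.0
```

So the full scan works out to roughly 50 minutes batched versus a bit under 6 hours at the single-image rate. The ~7× ratio here is measured against that single-image baseline, which is presumably a different comparison from the ~2.7× end-to-end figure.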
Each optimization is an independent env-var toggle, all on by default. There's a bench_equivalence.py so you can reproduce the gate + throughput on your own GPU.
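For anyone unfamiliar with the pattern, a default-on environment toggle usually looks something like the sketch below. The variable name `DEMO_BATCH_VISION` is purely hypothetical; the real variable names are in the repo's README, and the package's own parsing rules may differ.

```python
import os

# Values treated as "off"; anything else, or an unset variable, keeps the default.
_FALSY = {"0", "false", "off", "no", ""}


def env_flag(name, default=True):
    """Read a boolean toggle from the environment, defaulting to on."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSY


# Hypothetical variable name, used for illustration only.
use_batched_vision = env_flag("DEMO_BATCH_VISION")
```

Keeping each toggle independent makes it possible to switch off one optimization at a time when bisecting a speed or equivalence regression.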
Thank you for this brilliant contribution and to the open-source community for the support! @Liuwang971 The ~2.7x speedup from batched inference and KV-caching is stunning. This is a very important feature, and I plan to review the code soon to consider incorporating it into the official repo. Outstanding work!