Instructions to use nvidia/LocateAnything-3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/LocateAnything-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="nvidia/LocateAnything-3B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/LocateAnything-3B", trust_remote_code=True, dtype="auto")

vLLM

How to use nvidia/LocateAnything-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/LocateAnything-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
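
For readers who would rather call the server from Python than curl, here is a minimal sketch of the same request using only the standard library. It assumes the server started above is listening on localhost:8000; the helper names `build_chat_request` and `send` are made up for this example and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the decoded JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (only works while the server is up):
# req = build_chat_request(
#     "http://localhost:8000",
#     "nvidia/LocateAnything-3B",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# print(send(req)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before anything goes over the network.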

Use Docker

docker model run hf.co/nvidia/LocateAnything-3B

SGLang

How to use nvidia/LocateAnything-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/LocateAnything-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/LocateAnything-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use nvidia/LocateAnything-3B with Docker Model Runner:
```
docker model run hf.co/nvidia/LocateAnything-3B
```

# [Tool] locateanything-batch — batched + KV-cached LocateAnything-3B, ~2.7× faster

#10

by Liuwang971 - opened 1 day ago

If you've used nvidia/LocateAnything-3B for open-vocabulary grounding, you may have hit this: its custom generate hard-asserts batch == 1.
You can only ground one (image, query) pair at a time, which is painful when you're scanning a lot of
frames or running several prompts per image.

The good news: the underlying Qwen2 LM and the vision encoder are perfectly batch-safe — the batch==1
lock lives only in the hand-rolled multi-token-prediction (MTP) decode loop. So I wrote a faithful
batched fork of that loop and packaged it:

👉 GitHub: https://github.com/liuwang97/LocateAnything-3B-batch (MIT)

from locateanything_batch import load, generate_batch, load_pil

load()
img = load_pil("photo.jpg")
[answer] = generate_batch([(img, "a dog")])     # ... <box><312><144><688><902></box> ...

# or many at once — heterogeneous images & prompts in one call:
generate_batch([(img1, "a dog"), (img2, "a red car"), (img1, "a bicycle")])
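
If you want numbers rather than the raw answer string, something like the sketch below can pull the boxes out. It is not part of the package. It assumes the box format shown in the comment above, and it additionally assumes the four values are x1, y1, x2, y2 on a 0 to 1000 grid; check both against the model card before relying on them. `parse_boxes` and `to_pixels` are hypothetical helper names.

```python
import re

# Matches the form shown above: <box><312><144><688><902></box>
BOX_PATTERN = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")


def parse_boxes(answer):
    """Return every box in the answer string as a tuple of four ints."""
    return [tuple(int(v) for v in match.groups()) for match in BOX_PATTERN.finditer(answer)]


def to_pixels(box, width, height, scale=1000):
    """Scale a box to pixel coordinates.

    ASSUMPTION: the order is (x1, y1, x2, y2) and values are normalized to `scale`.
    """
    x1, y1, x2, y2 = box
    return (x1 * width / scale, y1 * height / scale, x2 * width / scale, y2 * height / scale)
```

For example, `parse_boxes("... <box><312><144><688><902></box> ...")` gives `[(312, 144, 688, 902)]`.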

What it does

- True batching over an arbitrary list of (image, query) pairs.
- KV-cache + vision reuse: for the same image with multiple prompts, the prefill KV cache and
  the [image + instruction] vision encode are computed once and forked across queries; only the
  differing query tails + decode are batched. The decode keeps the model's own MTP KV cache per step.
- Numerically exact: under greedy decoding it is token-identical to the stock batch=1 path.
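
To make the reuse idea concrete, here is a toy sketch of the bookkeeping only: group pairs by image, compute the shared prefix once per distinct image, then hand all of that image's queries to one batched call and put the results back in input order. This is not the package's code. `encode_prefix` and `decode_queries` are hypothetical stand-ins for the real prefill and decode, and matching images by object identity (`id`) is an assumption made for this example.

```python
def group_by_image(pairs):
    """Group (image, query) pairs by image object, remembering original positions."""
    groups = {}
    for index, (image, query) in enumerate(pairs):
        # ASSUMPTION: "same image" means the same Python object.
        groups.setdefault(id(image), (image, []))[1].append((index, query))
    return list(groups.values())


def run_grouped(pairs, encode_prefix, decode_queries):
    """Run the shared prefix once per distinct image and restore input order."""
    results = [None] * len(pairs)
    for image, items in group_by_image(pairs):
        prefix_state = encode_prefix(image)  # computed once per distinct image
        queries = [query for _, query in items]
        outputs = decode_queries(prefix_state, queries)  # one batched call per image
        for (index, _), output in zip(items, outputs):
            results[index] = output
    return results
```

With three pairs over two distinct images, `encode_prefix` runs twice instead of three times, which is where the saving comes from.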

Is it actually equivalent? Yes — and it's tested

The whole point is to be a drop-in speedup, not a different model. The correctness gate runs 5 tiers
(B=1 vs stock generate; identical rows; same image / different prompts in both orders; different
images & lengths; ragged mixed batches) and checks the batched output is token-for-token identical to
both the stock generate and per-pair single runs:

12 / 12 checks pass, 0 fail under greedy decoding (repetition_penalty=1.0): bit-exact box decode.
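
The shape of such a check is simple enough to sketch. The snippet below is not the repo's gate, just an illustration of "token-for-token identical": run each pair on its own, run them all together, and report every position where the token lists differ. `generate_single` and `generate_many` are hypothetical callables you would wire to the stock and batched paths.

```python
def check_token_identical(pairs, generate_single, generate_many):
    """Compare batched outputs against per-pair outputs.

    Returns a list of (index, expected_tokens, batched_tokens) for every mismatch;
    an empty list means the batched run matched the single runs exactly.
    """
    batched = generate_many(pairs)
    mismatches = []
    for index, pair in enumerate(pairs):
        expected = generate_single(pair)
        if batched[index] != expected:
            mismatches.append((index, expected, batched[index]))
    return mismatches
```

Comparing token ids rather than decoded strings avoids treating two different token sequences that happen to render the same text as equal.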

(Under repetition_penalty=1.15 a few drop to "box Δ ≤ 8/1000" — that's just bf16 batched-GEMM
non-associativity flipping a tight argmax; greedy is the exact-equivalence guarantee.)
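
If "non-associativity flipping a tight argmax" sounds abstract, here is the same effect in miniature. This uses ordinary Python floats (float64), not bf16 and not a GEMM, so treat it as an analogy: summing the same three numbers in a different order changes the last bit, and that is enough to change which of two near-tied scores wins.

```python
def argmax(scores):
    """Index of the largest score (first index wins on an exact tie)."""
    return max(range(len(scores)), key=scores.__getitem__)


terms = [0.1, 0.2, 0.3]

# Same three terms, two summation orders.
left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

rival = 0.6  # a competing score that is almost tied

pick_a = argmax([rival, left_to_right])
pick_b = argmax([rival, right_to_left])

print(left_to_right == right_to_left)  # False
print(pick_a, pick_b)                  # 1 0
```

Batching changes the order in which a GPU accumulates partial sums, so a near-tie between two tokens can resolve differently even though nothing about the model changed.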

Speed (RTX 5070 Ti / sm_120)

- Batched vision encode: 2.6–3×, bit-identical to per-image (max|Δ| = 0.00e+00), flat VRAM.
- Batched shared-prefix prefill: ~3.6× (the tiny image+instruction prefix is GPU-starved at batch=1).
- End-to-end batched pipeline: ~2.7× vs serial.
- Real production scan: 61.3 ms/frame · 16.3 frames/s at batch≈32 over 49,016 frames
  (vs ~430 ms/frame single-image baseline).
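
For anyone sanity-checking the production-scan numbers, the arithmetic is below. It only uses the figures quoted above (61.3 ms/frame, ~430 ms/frame, 49,016 frames); the derived totals are my own back-of-the-envelope values, not measurements.

```python
ms_per_frame_batched = 61.3   # quoted above, batch of about 32
ms_per_frame_single = 430.0   # quoted single-image baseline (approximate)
frames = 49_016

frames_per_second = 1000.0 / ms_per_frame_batched
ratio_vs_single = ms_per_frame_single / ms_per_frame_batched
minutes_batched = frames * ms_per_frame_batched / 1000.0 / 60.0
minutes_single = frames * ms_per_frame_single / 1000.0 / 60.0

print(round(frames_per_second, 1))  # 16.3
print(round(ratio_vs_single, 1))    # 7.0
print(round(minutes_batched, 1))    # 50.1
print(round(minutes_single, 1))     # 351.3
```

So 61.3 ms/frame does work out to 16.3 frames/s, and the whole scan is roughly 50 minutes instead of nearly 6 hours at the quoted baseline. The ratio against that baseline is about 7×, which is a different comparison from the ~2.7× end-to-end figure; the post does not spell out how the two baselines differ.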

Each optimization is an independent env-var toggle, all on by default. There's a
bench_equivalence.py so you can reproduce the gate + throughput on your own GPU.
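
As a rough picture of what an "on by default" env-var toggle looks like, here is a generic sketch. The variable name `LAB_EXAMPLE_TOGGLE` and the helper `env_flag` are hypothetical and invented for this example; the real variable names are in the repo's README and may be parsed differently.

```python
import os


def env_flag(name, default=True, environ=None):
    """Read a boolean toggle from the environment.

    Unset means `default`; values such as "0", "false", "off" or "no" turn it off.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"0", "false", "off", "no", ""}


# HYPOTHETICAL name, for illustration only.
use_example_optimization = env_flag("LAB_EXAMPLE_TOGGLE")
```

Keeping each optimization behind its own flag makes it easy to switch one off and rerun the equivalence gate to see which one is responsible for a difference.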

ShihaoW

NVIDIA org about 11 hours ago

@Liuwang971 Thank you for this brilliant contribution, and thanks to the open-source community for the support! The ~2.7x speedup from batched inference and KV-caching is stunning. This is a very important feature, and I plan to review the code soon to consider incorporating it into the official repo. Outstanding work!
