# [Tool] locateanything-batch — batched + KV-cached LocateAnything-3B, ~2.7× faster

#10
by Liuwang971 - opened

If you've used nvidia/LocateAnything-3B for open-vocabulary grounding, you may have hit this: its
custom generate hard-asserts batch == 1. You can only ground one (image, query) pair at a time,
which is painful when you're scanning a lot of frames or running several prompts per image.

The good news: the underlying Qwen2 LM and the vision encoder are perfectly batch-safe — the batch==1
lock lives only in the hand-rolled multi-token-prediction (MTP) decode loop. So I wrote a faithful
batched fork of that loop and packaged it:

👉 GitHub: https://github.com/liuwang97/LocateAnything-3B-batch (MIT)

```python
from locateanything_batch import load, generate_batch, load_pil

load()
img = load_pil("photo.jpg")
[answer] = generate_batch([(img, "a dog")])     # ... <box><312><144><688><902></box> ...

# or many at once: heterogeneous images & prompts in one call
# (img1 and img2 are PIL images loaded the same way as img above)
generate_batch([(img1, "a dog"), (img2, "a red car"), (img1, "a bicycle")])
```
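If you want the boxes as numbers rather than a string, here is a minimal sketch of a parser for the format shown in the comment above. Note that parse_boxes and to_pixels are hypothetical helpers, not part of locateanything_batch, and the (x1, y1, x2, y2) order and the 0..1000 coordinate grid are assumptions inferred from the example output, so check them against the model card before relying on them.

```python
import re

# Assumed format, inferred from the example output above: four integers per box.
BOX_RE = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")

def parse_boxes(answer):
    """Return every box found in the answer string as a tuple of four ints."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_RE.finditer(answer)]

def to_pixels(box, width, height, scale=1000):
    """Scale a box to pixel coordinates.

    ASSUMPTION: the order is (x1, y1, x2, y2) and values lie on a 0..scale grid.
    """
    x1, y1, x2, y2 = box
    return (x1 * width / scale, y1 * height / scale,
            x2 * width / scale, y2 * height / scale)

boxes = parse_boxes("... <box><312><144><688><902></box> ...")
print(boxes)                             # [(312, 144, 688, 902)]
print(to_pixels(boxes[0], 2000, 1000))   # (624.0, 144.0, 1376.0, 902.0)
```

An answer with no box tokens simply yields an empty list, which is convenient when scanning many frames.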

What it does

  • True batching over an arbitrary list of (image, query) pairs.
  • KV-cache + vision reuse: for the same image with multiple prompts, the prefill KV cache and
    the [image + instruction] vision encode are computed once and forked across queries — only the
    differing query tails + decode are batched. The decode keeps the model's own MTP KV cache per step.
  • Numerically exact: under greedy decoding it is token-identical to the stock batch=1 path.
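
To make the reuse idea concrete, here is a rough sketch of the bookkeeping only: group the pairs by image so per-image work happens once, then scatter the answers back into the caller's order. This is not the package's actual code. group_by_image, run_grouped, and the two callables encode_image and answer_queries are hypothetical stand-ins, and the toy functions exist only so the sketch runs without a model.

```python
def group_by_image(pairs):
    """Map each distinct image object to the (position, query) entries that use it."""
    groups = {}
    for pos, (image, query) in enumerate(pairs):
        # id() keys on object identity, so the same image object passed twice is shared.
        groups.setdefault(id(image), (image, []))[1].append((pos, query))
    return list(groups.values())

def run_grouped(pairs, encode_image, answer_queries):
    """Encode each distinct image once, then answer all of its queries together."""
    results = [None] * len(pairs)
    for image, entries in group_by_image(pairs):
        prefix = encode_image(image)                  # done once per distinct image
        queries = [q for _, q in entries]
        for (pos, _), ans in zip(entries, answer_queries(prefix, queries)):
            results[pos] = ans                        # restore the caller's order
    return results

# Toy stand-ins (hypothetical) so the sketch runs without a GPU.
calls = []

def fake_encode(image):
    calls.append(image)
    return f"enc({image})"

def fake_answer(prefix, queries):
    return [f"{prefix}:{q}" for q in queries]

img1, img2 = "IMG1", "IMG2"
out = run_grouped([(img1, "a dog"), (img2, "a red car"), (img1, "a bicycle")],
                  fake_encode, fake_answer)
print(out)    # ['enc(IMG1):a dog', 'enc(IMG2):a red car', 'enc(IMG1):a bicycle']
print(calls)  # ['IMG1', 'IMG2']
```

Three requests trigger only two encodes, and the results still come back in request order.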

Is it actually equivalent? Yes — and it's tested

The whole point is to be a drop-in speedup, not a different model. The correctness gate runs 5 tiers
(B=1 vs stock generate; identical rows; same image / different prompts in both orders; different
images & lengths; ragged mixed batches) and checks the batched output is token-for-token identical to
both the stock generate and per-pair single runs:

12 / 12 checks pass, 0 fail under greedy decoding (rp, i.e. repetition_penalty, = 1.0), with bit-exact box decode.

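(Under repetition_penalty=1.15 a few checks drop to "box Δ ≤ 8/1000". That is bf16 rounding at work:
floating-point addition is not associative, so a batched GEMM that accumulates in a different order
can flip an argmax between two nearly tied logits. Greedy decoding is the exact-equivalence guarantee.)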
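
The rounding effect is easy to see without a GPU. The sketch below uses ordinary Python floats (float64, not bf16) and contrived values, so it only illustrates the mechanism: the same terms summed in a different order give a slightly different result, and with a near-tie that is enough to change which index wins the argmax.

```python
# Same three terms, two summation orders. This is float64; bf16 keeps far fewer
# mantissa bits, so the discrepancy on a GPU is much larger.
left = (0.1 + 0.2) + 0.3     # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)    # 0.6

def argmax(xs):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(xs)), key=xs.__getitem__)

# Contrived near-tie: token 0 has a fixed logit, token 1's logit is the sum above.
rival = 0.6

print(left == right)             # False
print(argmax([rival, left]))     # 1: token 1 wins by a rounding error
print(argmax([rival, right]))    # 0: exact tie, so the lower index wins
```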

Speed (RTX 5070 Ti / sm_120)

  • Batched vision encode: 2.6–3×, bit-identical to per-image (max|Δ| = 0.00e+00), flat VRAM.
  • Batched shared-prefix prefill: ~3.6× (the tiny image+instruction prefix is GPU-starved at batch=1).
  • End-to-end batched pipeline: ~2.7× vs serial.
  • Real production scan: 61.3 ms/frame · 16.3 frames/s at batch≈32 over 49,016 frames
    (vs ~430 ms/frame single-image baseline).
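
For anyone sizing their own job, the production-scan figures can be checked with a few lines of arithmetic. This only restates the numbers quoted above. The ratio it prints is relative to the ~430 ms/frame single-image baseline, which is a different comparison from the ~2.7× batched-vs-serial pipeline figure.

```python
ms_per_frame = 61.3     # batched scan, from the figures above
baseline_ms = 430.0     # approximate single-image baseline, from the figures above
frames = 49_016

fps = 1000.0 / ms_per_frame
ratio = baseline_ms / ms_per_frame
batched_minutes = frames * ms_per_frame / 1000.0 / 60.0
baseline_hours = frames * baseline_ms / 1000.0 / 3600.0

print(f"{fps:.1f} frames/s")                   # 16.3 frames/s
print(f"{ratio:.1f}x vs single-image")         # 7.0x vs single-image
print(f"{batched_minutes:.0f} min batched")    # 50 min batched
print(f"{baseline_hours:.1f} h at baseline")   # 5.9 h at baseline
```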

Each optimization is an independent env-var toggle, all on by default. There's a
bench_equivalence.py so you can reproduce the gate + throughput on your own GPU.
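
The script itself isn't reproduced here, but the kind of check it describes can be sketched as follows: run every pair on its own, run them all in one batch, compare the outputs exactly, and time both paths. compare_batched_vs_single is a hypothetical helper, not taken from bench_equivalence.py, exact string comparison stands in for the token-for-token comparison, and the toy functions are there only so the sketch runs without a model.

```python
import time

def compare_batched_vs_single(pairs, single_fn, batch_fn):
    """Return (indices whose outputs differ, serial seconds, batched seconds)."""
    t0 = time.perf_counter()
    serial = [single_fn(pair) for pair in pairs]    # one pair per call
    t1 = time.perf_counter()
    batched = batch_fn(pairs)                       # all pairs in one call
    t2 = time.perf_counter()
    if len(batched) != len(pairs):
        raise ValueError("batch_fn must return one output per input pair")
    mismatches = [i for i, (a, b) in enumerate(zip(serial, batched)) if a != b]
    return mismatches, t1 - t0, t2 - t1

# Toy stand-ins (hypothetical) in place of real model calls.
def toy_single(pair):
    image, query = pair
    return f"{image}|{query}"

def toy_batch(pairs):
    return [toy_single(p) for p in pairs]

pairs = [("IMG1", "a dog"), ("IMG2", "a red car"), ("IMG1", "a bicycle")]
mismatches, serial_s, batched_s = compare_batched_vs_single(pairs, toy_single, toy_batch)
print(mismatches)   # []
```

With the real package, one plausible wiring would be single_fn=lambda p: generate_batch([p])[0] and batch_fn=generate_batch, assuming generate_batch returns one answer per pair in input order, as the usage example suggests.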

Thank you for this brilliant contribution, @Liuwang971, and thanks to the open-source community for the support! The ~2.7x speedup from batched inference and KV-caching is stunning. This is a very important feature, and I plan to review the code soon to consider incorporating it into the official repo. Outstanding work!
