Multi-GPU / Parallel Processing Support

by Iotcv - opened Aug 20, 2025

We are trying to run this model on multiple GPUs, but it currently uses only a single GPU, which leads to out-of-memory (OOM) errors.
Any guidance on best practices for running this model across multiple GPUs would be very helpful.
Looking forward to exploring more with this model.
Thanks!

jingwwu

StepFun org, Aug 20, 2025 (edited Aug 20, 2025)

Thank you for your interest. You can add the following to your script to run inference on multiple GPUs in parallel, with one process per GPU:

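...
import torch.distributed as dist

# torchrun exports the environment variables that init_process_group reads
dist.init_process_group(backend="nccl")
rank = dist.get_rank()  # on a single node this matches the local GPU index

...
# each process moves its own copy of the pipeline to its own GPU
pipeline = NextStepPipeline(tokenizer=tokenizer, model=model).to(device=f"cuda:{rank}")

...
image = pipeline.generate_image(
    ....
    seed=42 + rank,  # a different seed per process, so each GPU generates a different image
)[0]
image.save(f"./assets/output_{rank}.png")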

Then launch inference with torchrun:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc-per-node=8 your_scripts.py
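
With this launch command, each of the 8 processes runs the same script on its own GPU, so the work has to be divided by rank. The following is a minimal sketch of one way to do that, not part of the NextStep code: it reads the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that torchrun sets for every worker and gives each worker a strided slice of a prompt list. The helper names `torchrun_context` and `shard_for_rank` and the prompt strings are hypothetical, and the actual `pipeline.generate_image` call is left as a comment because its arguments are not shown above.

```python
import os


def torchrun_context():
    """Read the rank variables that torchrun exports to each worker.

    The defaults let the script also run as a single plain process.
    """
    rank = int(os.environ.get("RANK", "0"))              # global process index
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # index on this node
    world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total process count
    return rank, local_rank, world_size


def shard_for_rank(items, rank, world_size):
    """Give each worker every world_size-th item, starting at its rank."""
    return items[rank::world_size]


if __name__ == "__main__":
    rank, local_rank, world_size = torchrun_context()
    device = f"cuda:{local_rank}"  # GPU this worker would use on its node

    # Hypothetical workload: replace with your own prompts.
    prompts = [f"hypothetical prompt {i}" for i in range(10)]

    for offset, prompt in enumerate(shard_for_rank(prompts, rank, world_size)):
        index = rank + offset * world_size  # position in the full prompt list
        # The pipeline.generate_image(...) call for this prompt would go here,
        # saving to a file name that includes `index` to avoid collisions.
        print(f"rank {rank} on {device}: prompt #{index}: {prompt}")
```

Using `LOCAL_RANK` rather than the global rank for the device index only matters on multi-node runs; on a single node the two are the same. Each worker still holds a full copy of the model, so this setup spreads the workload across GPUs but does not lower the memory needed on any one GPU.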

jingwwu changed discussion status to closed Aug 26, 2025
