DepthPro Wrapper — Image to Point Cloud

A clean, drop-in Python wrapper around Apple's DepthPro (arXiv:2410.02073) that turns a single RGB image into a metric depth map and, if you want, a 3D point cloud — with zero calibration.

DepthPro is a 952M-parameter model built on ViT-L encoders. It predicts absolute metric depth (in meters, not relative depth) and automatically estimates the camera's focal length and field of view. No camera intrinsics, per-scene training, or LiDAR are required.

🚀 Quick start (10 lines)

from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud

# 1. Load model (~2 GB download on first run)
estimator = DepthProEstimator(device="cuda:0")

# 2. Drop an image in
result = estimator.estimate("photo.jpg")
print(f"Focal length: {result.focal_length:.1f} px")
print(f"Depth range: {result.depth.min():.2f} – {result.depth.max():.2f} m")

# 3. Get a coloured point cloud out
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length
)

# 4. Save as PLY
save_point_cloud("scene.ply", points, colors=colors)

Or use the CLI:

python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals

📦 Installation

# 1. Core dependencies
pip install torch torchvision transformers pillow numpy

# 2. Install this wrapper
pip install -e .

DepthPro is a large ViT-L model (~2 GB). The weights download automatically from HuggingFace the first time you instantiate DepthProEstimator.

GPU strongly recommended. The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.

🔬 How it works (the full pipeline)

Here is exactly what happens under the hood when you call estimate().

Step 1 — Preprocessing (handled automatically)

Your input image is resized to 1536 × 1536 (the fixed operating resolution DepthPro was trained on), rescaled by 1/255, and normalised to [-1, 1] with mean=0.5, std=0.5. This is done by DepthProImageProcessorFast.
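
The rescale and normalise arithmetic can be sketched with NumPy alone. This only illustrates the numbers quoted above and is not the processor's actual code; `normalise` is a hypothetical helper name, and the resize to 1536 × 1536 is left out.

```python
import numpy as np

def normalise(image_uint8: np.ndarray, mean: float = 0.5, std: float = 0.5) -> np.ndarray:
    """Map uint8 pixel values in [0, 255] to float32 values in [-1, 1]."""
    scaled = image_uint8.astype(np.float32) / 255.0  # rescale by 1/255 -> [0, 1]
    return (scaled - mean) / std                     # mean=0.5, std=0.5 -> [-1, 1]

# One RGB pixel covering the darkest, a middle, and the brightest value.
pixels = np.array([[[0, 128, 255]]], dtype=np.uint8)
normalised = normalise(pixels)
```

With `mean=0.5` and `std=0.5`, a pixel value of 0 maps to -1 and 255 maps to +1, which is where the [-1, 1] range comes from.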

Step 2 — Feature extraction (DINOv2 ViT-L + multi-scale patches)

DepthPro's backbone is a DINOv2 ViT-L/16 (24 layers, 1024 hidden dim, 16×16 patch size). It processes the image at three scales simultaneously:

Scale	Resolution	Purpose
0.25×	384×384	Global context, far-away geometry
0.5×	768×768	Mid-range structure
1.0×	1536×1536	Fine detail, edges, thin structures

The three-scale features are fused with a DPT-style decoder (hidden size 256) into a dense feature map.
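
As a rough illustration of the numbers in the table, the sketch below derives the three resolutions from the 1536-pixel operating resolution and counts the 16×16 patches in one 384×384 ViT input. The constant and function names are hypothetical, and treating 384×384 as the ViT input size is an assumption made for this example.

```python
BASE_RESOLUTION = 1536  # operating resolution from Step 1
TILE_SIZE = 384         # assumed ViT input size (matches the 0.25x scale)
PATCH_SIZE = 16         # ViT-L/16 patch size

def scale_resolutions(base: int, ratios=(0.25, 0.5, 1.0)) -> list:
    """Side length of the image at each scale ratio."""
    return [int(base * ratio) for ratio in ratios]

def tokens_per_tile(tile: int, patch: int) -> int:
    """Number of patch tokens the ViT sees for one square tile."""
    per_side = tile // patch
    return per_side * per_side

resolutions = scale_resolutions(BASE_RESOLUTION)  # sides for 0.25x, 0.5x, 1.0x
tokens = tokens_per_tile(TILE_SIZE, PATCH_SIZE)   # tokens in one 384x384 tile
```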

Step 3 — Depth prediction (canonical inverse depth)

The decoder outputs canonical inverse depth C. This is not yet metric depth: it is an inverse depth normalised with respect to the camera's focal length and the image width, which lets the network predict it without knowing the camera. Metric depth is recovered in the post-processing step, once the focal length has been estimated.

Step 4 — FOV / focal-length estimation (no calibration needed)

DepthPro has a dedicated FOV head. It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the horizontal field of view (FOV) in degrees.

From the FOV, the focal length in pixels is derived:

focal_length = (image_width / 2) / tan(FOV / 2)

This is the critical piece that makes the depth metric (in meters) rather than just up-to-scale.
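
The conversion is a one-liner; a minimal sketch using only the standard library follows. The function names are hypothetical, and the inverse is included only as a sanity check on the formula.

```python
import math

def focal_length_from_fov(image_width: int, fov_degrees: float) -> float:
    """Focal length in pixels from a horizontal field of view in degrees."""
    half_fov = math.radians(fov_degrees) / 2.0
    return (image_width / 2.0) / math.tan(half_fov)

def fov_from_focal_length(image_width: int, focal_length_px: float) -> float:
    """Inverse relation: horizontal field of view in degrees."""
    return math.degrees(2.0 * math.atan((image_width / 2.0) / focal_length_px))

# A 90-degree horizontal FOV gives a focal length of half the image width.
focal = focal_length_from_fov(1536, 90.0)
```

A narrower FOV gives a longer focal length in pixels, so the same pixel offset from the image centre corresponds to a smaller angle.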

Step 5 — Post-processing (metric depth + back-projection)

The processor converts canonical inverse depth C to metric depth D_m using the estimated focal length:

D_m = focal_length / (image_width × C)
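
A minimal sketch of this conversion, written in the form the Depth Pro paper gives it, D_m = focal_length / (image_width × C). The function name and the lower clamp `eps` are assumptions added so the example never divides by zero; they are not taken from the wrapper.

```python
import numpy as np

def metric_depth(canonical_inverse_depth: np.ndarray,
                 focal_length_px: float,
                 image_width: int,
                 eps: float = 1e-4) -> np.ndarray:
    """Convert canonical inverse depth C to metric depth in meters."""
    # Undo the focal-length normalisation to get inverse depth in 1/m.
    inverse_depth = canonical_inverse_depth * (image_width / focal_length_px)
    # Clamp (assumed guard) and invert to get depth.
    return 1.0 / np.clip(inverse_depth, eps, None)

c = np.array([[0.5, 0.25]], dtype=np.float32)
depth_m = metric_depth(c, focal_length_px=768.0, image_width=1536)
```

For a fixed C, a longer focal length yields a larger metric depth, which is why the FOV estimate in Step 4 directly sets the scale of the result.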

Then, using the pinhole camera model, every pixel (u, v) is back-projected to a 3D point:

Z = D_m[v, u]
X = (u - cx) * Z / focal_length
Y = (v - cy) * Z / focal_length

where (cx, cy) is the principal point (image centre by default). DepthPro assumes square pixels (fx == fy), which is standard for most modern cameras.
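
The back-projection above can be sketched in a few lines of NumPy. This is an illustrative version, not the wrapper's `depth_to_point_cloud()`; the function name `back_project` and the choice of (W / 2, H / 2) as the image centre are assumptions.

```python
import numpy as np

def back_project(depth: np.ndarray, focal_length: float, principal_point=None) -> np.ndarray:
    """Back-project an (H, W) metric depth map to an (H*W, 3) point array."""
    h, w = depth.shape
    if principal_point is None:
        cx, cy = w / 2.0, h / 2.0  # assumed convention for the image centre
    else:
        cx, cy = principal_point
    # u runs along columns (x), v runs along rows (y).
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    z = depth.astype(np.float32)
    x = (u - cx) * z / focal_length
    y = (v - cy) * z / focal_length
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A tiny 2x2 depth map, 2 m everywhere, with a focal length of 1 px.
points = back_project(np.full((2, 2), 2.0), focal_length=1.0)
```

Points come out in row-major pixel order, so `points[0]` belongs to pixel (u=0, v=0) and the last row to the bottom-right pixel.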

Result

You get:

depth — (H, W) metric depth map in meters
focal_length — estimated focal length in pixels
field_of_view — estimated horizontal FOV in degrees
points — (N, 3) 3D point cloud in camera coordinates (+Z forward)

🧰 API Reference

`DepthProEstimator`

class DepthProEstimator(
    model_name="apple/DepthPro-hf",
    device="cuda:0",
    dtype=torch.float16,
)

model_name — HuggingFace model ID or local path.
device — PyTorch device. CUDA strongly recommended.
dtype — torch.float16 (default, fast) or torch.float32 (slightly higher precision).

`.estimate(image)`

result = estimator.estimate(
    image,                    # str, Path, PIL.Image, or np.ndarray
    return_confidence=False,
)

Returns a DepthResult dataclass:

Attribute	Shape	Description
`depth`	(H, W)	Metric depth in meters (float32)
`focal_length`	scalar	Estimated focal length in pixels
`field_of_view`	scalar	Estimated horizontal FOV in degrees
`image`	(H, W, 3)	Original RGB image (uint8)
`confidence`	(H, W) or None	Per-pixel confidence (if requested)
`height`, `width`	scalars	Convenience properties
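
To make the table concrete, here is a stand-in dataclass with the same fields. It is a sketch of what such a container could look like, not the wrapper's actual `DepthResult` definition; the name `DepthResultSketch` is hypothetical, and deriving `height` and `width` from the depth map is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class DepthResultSketch:
    depth: np.ndarray                        # (H, W) float32, meters
    focal_length: float                      # pixels
    field_of_view: float                     # horizontal, degrees
    image: np.ndarray                        # (H, W, 3) uint8 RGB
    confidence: Optional[np.ndarray] = None  # (H, W) or None

    @property
    def height(self) -> int:
        return self.depth.shape[0]

    @property
    def width(self) -> int:
        return self.depth.shape[1]

example = DepthResultSketch(
    depth=np.zeros((4, 6), dtype=np.float32),
    focal_length=1000.0,
    field_of_view=60.0,
    image=np.zeros((4, 6, 3), dtype=np.uint8),
)
```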

`.estimate_batch(images)`

Process multiple images in a single forward pass for efficiency:

results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
for r in results:
    print(r.depth.shape)
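
For a folder with many images, one plausible pattern is to split the path list into fixed-size batches and call `estimate_batch()` once per batch. The helper below is a hypothetical, standard-library-only sketch of the splitting step; it does not call the model.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

paths = [f"frame_{i:03d}.jpg" for i in range(10)]
batches = list(chunked(paths, batch_size=4))
batch_sizes = [len(batch) for batch in batches]
```

Each yielded list could then be passed to `estimator.estimate_batch(batch)`, with the results appended to a single list.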

`depth_to_point_cloud(depth, focal_length, ...)`

from depthpro_wrapper import depth_to_point_cloud

points = depth_to_point_cloud(
    depth=result.depth,           # (H, W) metric depth
    focal_length=result.focal_length,
    principal_point=None,         # default = image centre
    mask=None,                    # optional boolean mask
    sample_step=1,                # 2 = 1/4 points, 4 = 1/16
)

Returns (N, 3) float32 array of 3D points in camera coordinates.
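
The effect of `mask` and `sample_step` on the number of points N can be sketched as follows. This assumes striding starts at pixel (0, 0) and that the mask is combined with a logical AND; the wrapper's real implementation may differ, and `select_pixels` is a hypothetical name.

```python
import numpy as np

def select_pixels(depth: np.ndarray, mask=None, sample_step: int = 1):
    """Return (v, u) index arrays of the pixels kept after striding and masking."""
    h, w = depth.shape
    keep = np.zeros((h, w), dtype=bool)
    keep[::sample_step, ::sample_step] = True  # every sample_step-th row and column
    if mask is not None:
        keep &= mask                           # drop pixels the caller excluded
    v, u = np.nonzero(keep)
    return v, u

depth = np.ones((4, 6), dtype=np.float32)
v_all, u_all = select_pixels(depth)                  # all 24 pixels
v_sub, u_sub = select_pixels(depth, sample_step=2)   # 2 rows x 3 columns
```

Striding by 2 along both axes keeps roughly a quarter of the pixels, which is where the "2 = 1/4 points" comment comes from.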

`rgbd_to_point_cloud(depth, rgb, focal_length, ...)`

Same as above but also returns per-point RGB colours:

points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length,
    sample_step=2,
)

Returns (N, 3) points and (N, 3) uint8 colours.

`normals_from_depth(depth, focal_length)`

Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):

from depthpro_wrapper import normals_from_depth
normals = normals_from_depth(result.depth, result.focal_length)

Returns (H, W, 3) float32 unit normals (unoriented).
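
One common way to get normals from a depth map is to back-project it to a point map, take finite differences along the image axes, and normalise their cross product. The sketch below follows that approach; it is an assumption about how such a function could work, not the wrapper's actual implementation, and `normals_sketch` is a hypothetical name.

```python
import numpy as np

def normals_sketch(depth: np.ndarray, focal_length: float) -> np.ndarray:
    """Estimate (H, W, 3) unit normals from an (H, W) metric depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    z = depth.astype(np.float32)
    # Point map in camera coordinates (image centre assumed at W/2, H/2).
    points = np.stack([(u - w / 2.0) * z / focal_length,
                       (v - h / 2.0) * z / focal_length,
                       z], axis=-1)
    # Tangent vectors along rows (v) and columns (u).
    d_dv, d_du = np.gradient(points, axis=(0, 1))
    normals = np.cross(d_du, d_dv)
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.maximum(length, 1e-12)  # avoid division by zero

# A fronto-parallel plane 2 m away should give normals along the Z axis.
flat_normals = normals_sketch(np.full((4, 4), 2.0), focal_length=10.0)
```

The sign of the result depends on the order of the cross product, which is consistent with the normals being unoriented.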

`save_point_cloud(path, points, colors=None, normals=None)`

Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):

from depthpro_wrapper import save_point_cloud
save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
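
For readers unfamiliar with the format, an ASCII PLY file is a short text header followed by one line per vertex. The writer below is a standard-library sketch of that layout, not the code behind `save_point_cloud()`; `write_ascii_ply` is a hypothetical name and normals are omitted for brevity.

```python
import io

def write_ascii_ply(stream, points, colors=None) -> None:
    """Write (x, y, z) points, optionally with (r, g, b) colours, as ASCII PLY."""
    points = list(points)
    header = ["ply", "format ascii 1.0", f"element vertex {len(points)}",
              "property float x", "property float y", "property float z"]
    if colors is not None:
        colors = list(colors)
        header += ["property uchar red", "property uchar green",
                   "property uchar blue"]
    header.append("end_header")
    stream.write("\n".join(header) + "\n")
    for i, (x, y, z) in enumerate(points):
        line = f"{x:.6f} {y:.6f} {z:.6f}"
        if colors is not None:
            r, g, b = colors[i]
            line += f" {int(r)} {int(g)} {int(b)}"
        stream.write(line + "\n")

buffer = io.StringIO()
write_ascii_ply(buffer, [(0.0, 0.0, 1.0), (0.5, -0.5, 2.0)],
                colors=[(255, 0, 0), (0, 255, 0)])
ply_text = buffer.getvalue()
```

Writing to a stream keeps the sketch easy to inspect; passing an open file object instead of `io.StringIO()` would produce a file on disk.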

🖥️ CLI Usage

# Basic: image → point cloud
python scripts/image_to_pointcloud.py photo.jpg cloud.ply

# With colours and normals
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals

# Down-sample for faster processing / smaller files
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2

# Save intermediate depth & confidence maps
python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
    --save-depth depth.npy --save-confidence conf.npy

# CPU fallback (very slow)
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32
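
The flags used above map naturally onto an `argparse` parser. The sketch below reconstructs one from the commands shown; the default values and help text are assumptions, and the real script may define its arguments differently.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(description="Image to point cloud (sketch)")
    parser.add_argument("image", help="input image path")
    parser.add_argument("output", help="output PLY path")
    parser.add_argument("--colored", action="store_true")
    parser.add_argument("--normals", action="store_true")
    parser.add_argument("--sample-step", type=int, default=1)   # assumed default
    parser.add_argument("--save-depth", default=None)
    parser.add_argument("--save-confidence", default=None)
    parser.add_argument("--device", default="cuda:0")           # assumed default
    parser.add_argument("--dtype", choices=["float16", "float32"],
                        default="float16")                       # assumed default
    return parser

args = build_parser().parse_args(
    ["photo.jpg", "cloud.ply", "--colored", "--sample-step", "2"]
)
```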

📂 Repository layout

depthpro-wrapper/
├── depthpro_wrapper/
│   ├── __init__.py              # public API
│   ├── depth_estimator.py      # DepthProEstimator + DepthResult
│   ├── point_cloud.py          # back-projection + normal estimation
│   └── io.py                   # image / PLY I/O helpers
├── scripts/
│   └── image_to_pointcloud.py  # CLI entry point
├── examples/
│   ├── quickstart.py           # 10-line minimal example
│   └── batch_processing.py     # folder-of-images batch script
├── setup.py
├── requirements.txt
└── README.md

🎯 Tips & Troubleshooting

Problem	Solution
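Out of memory	Use `dtype=torch.float16` (default). If memory runs out while building the point cloud, use `--sample-step 2` or smaller images. Model memory itself does not shrink with image size, because inputs are resized to 1536 × 1536 before inference.
Depth looks wrong / flat	DepthPro works best on images with perspective (indoor rooms, outdoor scenes). On very flat macro shots, depth may be under-estimated.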
Point cloud is noisy at edges	Depth has uncertainty at object boundaries. Use `sample_step=2` or filter by `confidence` if you saved it.
Focal length seems off	DepthPro estimates FOV from image content. Very unusual aspect ratios or heavy cropping can confuse it. You can override with your own `focal_length` in `depth_to_point_cloud()`.
Want a mesh, not a point cloud	Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet NKSR for neural surface reconstruction.
Batch processing is slow	Use `estimate_batch()` with batch size 4–8 instead of looping over `estimate()`.
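
For the noisy-edges case, one option is to build a boolean mask and pass it as `mask=` to `depth_to_point_cloud()`. The sketch below shows one way to build such a mask; the function name, the 0.5 threshold, and the assumption that higher confidence values mean more reliable pixels are all illustrative choices, not documented behaviour.

```python
import numpy as np

def reliable_pixel_mask(depth: np.ndarray,
                        confidence=None,
                        min_confidence: float = 0.5,
                        max_depth=None) -> np.ndarray:
    """Boolean (H, W) mask of pixels worth keeping in the point cloud."""
    mask = np.isfinite(depth) & (depth > 0)   # drop invalid or non-positive depth
    if max_depth is not None:
        mask &= depth <= max_depth            # optionally drop far-away points
    if confidence is not None:
        mask &= confidence >= min_confidence  # assumed: higher means more reliable
    return mask

depth = np.array([[1.0, np.inf], [0.0, 50.0]], dtype=np.float32)
confidence = np.array([[0.9, 0.9], [0.9, 0.2]], dtype=np.float32)
mask = reliable_pixel_mask(depth, confidence)
```

A suitable threshold depends on the scene, so it is worth checking how many points survive before committing to a value.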

🔗 Citation

If you use DepthPro in your research, please cite the original paper:

@article{depthpro2024,
  title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
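  author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},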
  journal={arXiv preprint arXiv:2410.02073},
  year={2024}
}

Original code: https://github.com/apple/ml-depth-pro
HuggingFace model: https://huggingface.co/apple/DepthPro-hf

📄 License

This wrapper is released under the MIT License. DepthPro itself is under Apple's own license (see the original repository).

Built with ❤️ on top of Apple's DepthPro.
