DepthPro Wrapper: Image to Point Cloud

A clean, drop-in Python wrapper around Apple's DepthPro (arXiv:2410.02073) that turns a single RGB image into a metric depth map and, optionally, a 3D point cloud, with no camera calibration required.

DepthPro is a 952 M-parameter ViT-L model that predicts absolute metric depth (meters, not relative) and estimates the camera focal length and field of view automatically. No camera intrinsics, no per-scene training, no LiDAR required.


🚀 Quick start (10 lines)

from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud

# 1. Load model (~2 GB download on first run)
estimator = DepthProEstimator(device="cuda:0")

# 2. Drop an image in
result = estimator.estimate("photo.jpg")
print(f"Focal length: {result.focal_length:.1f} px")
print(f"Depth range: {result.depth.min():.2f} – {result.depth.max():.2f} m")

# 3. Get a coloured point cloud out
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length
)

# 4. Save as PLY
save_point_cloud("scene.ply", points, colors=colors)

Or use the CLI:

python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals

📦 Installation

# 1. Core dependencies
pip install torch torchvision transformers pillow numpy

# 2. Install this wrapper
pip install -e .

DepthPro is a large ViT-L model (~2 GB). The weights download automatically from HuggingFace the first time you instantiate DepthProEstimator.

GPU strongly recommended. The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.


🔬 How it works (the full pipeline)

Here is exactly what happens under the hood when you call estimate().

Step 1: Preprocessing (handled automatically)

Your input image is resized to 1536 × 1536 (the fixed operating resolution DepthPro was trained on), rescaled by 1/255, and normalised to [-1, 1] with mean=0.5, std=0.5. This is done by DepthProImageProcessorFast.
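
As a rough illustration of the rescale-and-normalise arithmetic described above, here is a minimal NumPy sketch. The function name `normalize_pixels` is hypothetical and not part of this wrapper, and the resize step is left out; in practice the image processor does all of this for you.

```python
import numpy as np

def normalize_pixels(image_uint8: np.ndarray, mean: float = 0.5, std: float = 0.5) -> np.ndarray:
    """Map uint8 pixels in [0, 255] to float32 values in [-1, 1]."""
    scaled = image_uint8.astype(np.float32) / 255.0  # rescale by 1/255 -> [0, 1]
    return (scaled - mean) / std                     # (x - 0.5) / 0.5 -> [-1, 1]

# Example: a black, a mid-grey and a white pixel value
pixels = np.array([0, 128, 255], dtype=np.uint8)
normalized = normalize_pixels(pixels)
```

Black maps to -1, white maps to 1, and mid-grey lands close to 0.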

Step 2: Feature extraction (DINOv2 ViT-L + multi-scale patches)

DepthPro's backbone is a DINOv2 ViT-L/16 (24 layers, 1024 hidden dim, 16×16 patch size). It processes the image at three scales simultaneously:

| Scale | Resolution | Purpose |
|-------|------------|---------|
| 0.25× | 384×384 | Global context, far-away geometry |
| 0.5× | 768×768 | Mid-range structure |
| 1.0× | 1536×1536 | Fine detail, edges, thin structures |

The three-scale features are fused with a DPT-style decoder (hidden size 256) into a dense feature map.

Step 3: Depth prediction (canonical inverse depth)

The decoder outputs canonical inverse depth C. This is not the final metric depth yet: it is a scale-invariant representation that the network learns to predict robustly across scenes. The actual metric depth is recovered in the post-processing step.

Step 4: FOV / focal-length estimation (no calibration needed)

DepthPro has a dedicated FOV head. It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the horizontal field of view (FOV) in degrees.

From the FOV, the focal length in pixels is derived:

focal_length = (image_width / 2) / tan(FOV / 2)

This is the critical piece that makes the depth metric (in meters) rather than just up-to-scale.
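
The FOV-to-focal-length relation can be sketched in a few lines of Python. The function name `focal_length_from_fov` is hypothetical (the wrapper already reports `focal_length` for you); the only subtlety is converting degrees to radians before calling `math.tan`.

```python
import math

def focal_length_from_fov(fov_degrees: float, image_width: int) -> float:
    """Focal length in pixels from a horizontal field of view given in degrees."""
    half_fov = math.radians(fov_degrees) / 2.0  # math.tan expects radians
    return (image_width / 2.0) / math.tan(half_fov)

# A 90 degree horizontal FOV gives a focal length of about half the image width.
focal_px = focal_length_from_fov(90.0, 1000)
narrow_px = focal_length_from_fov(45.0, 1000)  # narrower FOV -> longer focal length
```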

Step 5: Post-processing (metric depth + back-projection)

The processor converts canonical inverse depth C to metric depth D_m using the estimated focal length and the image width:

D_m = focal_length / (image_width × C)
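
The sketch below follows the relation given in the Depth Pro paper, D_m = f_px / (w · C), where f_px is the focal length in pixels and w is the image width. The function name and the `eps` clamp are assumptions made for illustration; the wrapper's processor performs this step internally.

```python
import numpy as np

def canonical_to_metric_depth(canonical_inverse_depth, focal_length, image_width, eps=1e-4):
    """Convert canonical inverse depth C to metric depth in meters: D_m = f / (w * C)."""
    c = np.asarray(canonical_inverse_depth, dtype=np.float32)
    inverse_depth = c * (image_width / focal_length)  # metric inverse depth, in 1/m
    return 1.0 / np.maximum(inverse_depth, eps)       # clamp to avoid division by zero

# With focal_length == image_width, the metric depth is simply 1 / C.
c = np.array([[0.5, 1.0], [2.0, 4.0]], dtype=np.float32)
depth_m = canonical_to_metric_depth(c, focal_length=1000.0, image_width=1000)
```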

Then, using the pinhole camera model, every pixel (u, v) is back-projected to a 3D point:

Z = D_m[v, u]
X = (u - cx) * Z / focal_length
Y = (v - cy) * Z / focal_length

where (cx, cy) is the principal point (image centre by default). DepthPro assumes square pixels (fx == fy), which is standard for most modern cameras.
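
A minimal NumPy sketch of this back-projection is shown below. It is not the wrapper's own implementation: the name `back_project` and the choice of (W / 2, H / 2) as the default principal point are assumptions for illustration.

```python
import numpy as np

def back_project(depth, focal_length, principal_point=None):
    """Back-project an (H, W) metric depth map to an (H*W, 3) array of camera-space points."""
    depth = np.asarray(depth, dtype=np.float32)
    h, w = depth.shape
    # Assumption: the principal point defaults to the image centre (w / 2, h / 2).
    cx, cy = principal_point if principal_point is not None else (w / 2.0, h / 2.0)
    # u indexes columns, v indexes rows; both grids have shape (H, W).
    u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    z = depth
    x = (u - cx) * z / focal_length
    y = (v - cy) * z / focal_length
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Tiny example: a 2x2 depth map where every pixel is 2 m away.
depth = np.full((2, 2), 2.0, dtype=np.float32)
points = back_project(depth, focal_length=2.0)
```

Points come out in row-major pixel order, so `points[0]` corresponds to pixel (u=0, v=0).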

Result

You get:

  • depth: (H, W) metric depth map in meters
  • focal_length: estimated focal length in pixels
  • field_of_view: estimated horizontal FOV in degrees
  • points: (N, 3) 3D point cloud in camera coordinates (+Z forward)

🧰 API Reference

DepthProEstimator

class DepthProEstimator(
    model_name="apple/DepthPro-hf",
    device="cuda:0",
    dtype=torch.float16,
)
  • model_name: HuggingFace model ID or local path.
  • device: PyTorch device. CUDA strongly recommended.
  • dtype: torch.float16 (default, fast) or torch.float32 (slightly higher precision).

.estimate(image)

result = estimator.estimate(
    image,                    # str, Path, PIL.Image, or np.ndarray
    return_confidence=False,
)

Returns a DepthResult dataclass:

| Attribute | Shape | Description |
|-----------|-------|-------------|
| depth | (H, W) | Metric depth in meters (float32) |
| focal_length | scalar | Estimated focal length in pixels |
| field_of_view | scalar | Estimated horizontal FOV in degrees |
| image | (H, W, 3) | Original RGB image (uint8) |
| confidence | (H, W) or None | Per-pixel confidence (if requested) |
| height, width | scalars | Convenience properties |

.estimate_batch(images)

Process multiple images in a single forward pass for efficiency:

results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
for r in results:
    print(r.depth.shape)
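
For a long list of images you may want to feed `estimate_batch()` a few files at a time so that each forward pass fits in memory. The `chunked` helper below is a hypothetical utility, not part of the wrapper, and the file names are placeholders.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

paths = [f"frame_{i:03d}.jpg" for i in range(10)]  # placeholder file names
batches = list(chunked(paths, 4))                  # chunk sizes: 4, 4, 2

# With a loaded estimator, each chunk would go through one forward pass:
# for batch in batches:
#     results = estimator.estimate_batch(batch)
```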

depth_to_point_cloud(depth, focal_length, ...)

from depthpro_wrapper import depth_to_point_cloud

points = depth_to_point_cloud(
    depth=result.depth,           # (H, W) metric depth
    focal_length=result.focal_length,
    principal_point=None,         # default = image centre
    mask=None,                    # optional boolean mask
    sample_step=1,                # 2 = 1/4 points, 4 = 1/16
)

Returns (N, 3) float32 array of 3D points in camera coordinates.
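
One way to build the optional `mask` argument is to keep only finite depths within a chosen range. This is a sketch under two assumptions: that `True` in the mask means "keep this pixel", and that a 50 m cut-off suits your scene. The helper name `valid_depth_mask` is hypothetical.

```python
import numpy as np

def valid_depth_mask(depth, max_depth=50.0):
    """Boolean (H, W) mask that is True for finite depths in (0, max_depth]."""
    depth = np.asarray(depth)
    return np.isfinite(depth) & (depth > 0) & (depth <= max_depth)

depth = np.array([[1.0, 80.0], [np.nan, 3.5]], dtype=np.float32)
mask = valid_depth_mask(depth, max_depth=50.0)

# The mask could then be passed along, for example:
# points = depth_to_point_cloud(depth, focal_length, mask=mask)
```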

rgbd_to_point_cloud(depth, rgb, focal_length, ...)

Same as above but also returns per-point RGB colours:

points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length,
    sample_step=2,
)

Returns (N, 3) points and (N, 3) uint8 colours.

normals_from_depth(depth, focal_length)

Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):

from depthpro_wrapper import normals_from_depth
normals = normals_from_depth(result.depth, result.focal_length)

Returns (H, W, 3) float32 unit normals (unoriented).
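
To give a feel for what such a function might do, here is one common approach: back-project the depth map to a grid of 3D points, take finite differences along image rows and columns, and normalise their cross product. This is a sketch of the general technique and not necessarily how `normals_from_depth` is implemented; the function name and the centred principal point are assumptions.

```python
import numpy as np

def normals_from_depth_sketch(depth, focal_length):
    """Estimate (H, W, 3) unit normals from an (H, W) depth map via finite differences."""
    depth = np.asarray(depth, dtype=np.float32)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    # Back-project every pixel, keeping the image grid layout (principal point at the centre).
    x = (u - w / 2.0) * depth / focal_length
    y = (v - h / 2.0) * depth / focal_length
    points = np.stack([x, y, depth], axis=-1)
    # Tangent vectors along image columns (u) and rows (v).
    d_du = np.gradient(points, axis=1)
    d_dv = np.gradient(points, axis=0)
    normals = np.cross(d_du, d_dv)
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.maximum(norm, 1e-12)  # guard against zero-length vectors

# A fronto-parallel wall 3 m away should have normals along the Z axis.
flat_wall = np.full((4, 5), 3.0, dtype=np.float32)
normals = normals_from_depth_sketch(flat_wall, focal_length=100.0)
```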

save_point_cloud(path, points, colors=None, normals=None)

Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):

from depthpro_wrapper import save_point_cloud
save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
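
For reference, an ASCII PLY file is just a short text header followed by one line per vertex. The sketch below builds that text in pure Python; `format_ascii_ply` is a hypothetical helper shown only to illustrate the file layout, and `save_point_cloud` may differ in details such as number formatting or normal properties.

```python
def format_ascii_ply(points, colors=None):
    """Return ASCII PLY text for (x, y, z) points with optional (r, g, b) uint8 colours."""
    if colors is not None and len(colors) != len(points):
        raise ValueError("points and colors must have the same length")
    header = ["ply", "format ascii 1.0", f"element vertex {len(points)}",
              "property float x", "property float y", "property float z"]
    if colors is not None:
        header += ["property uchar red", "property uchar green", "property uchar blue"]
    header.append("end_header")
    rows = []
    for i, (x, y, z) in enumerate(points):
        row = f"{float(x):.6f} {float(y):.6f} {float(z):.6f}"
        if colors is not None:
            r, g, b = colors[i]
            row += f" {int(r)} {int(g)} {int(b)}"
        rows.append(row)
    return "\n".join(header + rows) + "\n"

ply_text = format_ascii_ply([(0.0, 0.0, 1.5), (0.1, -0.2, 2.0)],
                            colors=[(255, 0, 0), (0, 128, 255)])
```

Writing `ply_text` to a file with a `.ply` extension gives something the viewers listed above should be able to open.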

🖥️ CLI Usage

# Basic: image → point cloud
python scripts/image_to_pointcloud.py photo.jpg cloud.ply

# With colours and normals
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals

# Down-sample for faster processing / smaller files
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2

# Save intermediate depth & confidence maps
python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
    --save-depth depth.npy --save-confidence conf.npy

# CPU fallback (very slow)
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32

📂 Repository layout

depthpro-wrapper/
├── depthpro_wrapper/
│   ├── __init__.py             # public API
│   ├── depth_estimator.py      # DepthProEstimator + DepthResult
│   ├── point_cloud.py          # back-projection + normal estimation
│   └── io.py                   # image / PLY I/O helpers
├── scripts/
│   └── image_to_pointcloud.py  # CLI entry point
├── examples/
│   ├── quickstart.py           # 10-line minimal example
│   └── batch_processing.py     # folder-of-images batch script
├── setup.py
├── requirements.txt
└── README.md

🎯 Tips & Troubleshooting

| Problem | Solution |
|---------|----------|
| Out of memory | Use dtype=torch.float16 (default). If still OOM, use --sample-step 2 or smaller images. |
| Depth looks wrong / flat | DepthPro works best on images with perspective (indoor rooms, outdoor scenes). Very flat macro shots may under-estimate depth. |
| Point cloud is noisy at edges | Depth has uncertainty at object boundaries. Use sample_step=2 or filter by confidence if you saved it. |
| Focal length seems off | DepthPro estimates FOV from image content. Very unusual aspect ratios or heavy cropping can confuse it. You can override with your own focal_length in depth_to_point_cloud(). |
| Want a mesh, not a point cloud | Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet NKSR for neural surface reconstruction. |
| Batch processing is slow | Use estimate_batch() with batch size 4–8 instead of looping over estimate(). |

🔗 Citation

If you use DepthPro in your research, please cite the original paper:

@article{depthpro2024,
  title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
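  author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},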
  journal={arXiv preprint arXiv:2410.02073},
  year={2024}
}

Original code: https://github.com/apple/ml-depth-pro
HuggingFace model: https://huggingface.co/apple/DepthPro-hf


📄 License

This wrapper is released under the MIT License. DepthPro itself is under Apple's own license (see the original repository).


Built with ❤️ on top of Apple's DepthPro.
