DepthPro Wrapper: Image to Point Cloud
A clean, drop-in Python wrapper around Apple's DepthPro (arXiv:2410.02073) that turns a single RGB image into a metric depth map and, if you want, a 3D point cloud, with zero calibration.
DepthPro is a 952 M-parameter ViT-L model that predicts absolute metric depth (meters, not relative) and estimates the camera focal length and field of view automatically. No camera intrinsics, no per-scene training, no LiDAR required.
Quick start (10 lines)
from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud
# 1. Load model (~2 GB download on first run)
estimator = DepthProEstimator(device="cuda:0")
# 2. Drop an image in
result = estimator.estimate("photo.jpg")
print(f"Focal length: {result.focal_length:.1f} px")
print(f"Depth range: {result.depth.min():.2f} to {result.depth.max():.2f} m")
# 3. Get a coloured point cloud out
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length
)
# 4. Save as PLY
save_point_cloud("scene.ply", points, colors=colors)
Or use the CLI:
python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals
Installation
# 1. Core dependencies
pip install torch torchvision transformers pillow numpy
# 2. Install this wrapper
pip install -e .
DepthPro is a large ViT-L model (~2 GB). The weights download automatically from HuggingFace the first time you instantiate DepthProEstimator.
GPU strongly recommended. The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.
How it works (the full pipeline)
Here is exactly what happens under the hood when you call estimate().
Step 1: Preprocessing (handled automatically)
Your input image is resized to 1536 × 1536 (the fixed operating resolution DepthPro was trained on), rescaled by 1/255, and normalised to [-1, 1] with mean=0.5, std=0.5. This is done by DepthProImageProcessorFast.
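The rescale-and-normalise arithmetic can be sketched for a single channel value. This is an illustration of the numbers only, not the code inside DepthProImageProcessorFast; normalise_channel is a hypothetical helper name.

```python
def normalise_channel(value_uint8, mean=0.5, std=0.5):
    """Map an 8-bit channel value (0..255) to the [-1, 1] range."""
    rescaled = value_uint8 / 255.0      # rescale by 1/255 -> [0, 1]
    return (rescaled - mean) / std      # normalise with mean/std -> [-1, 1]


# Black, mid-grey and white map to roughly -1, 0 and +1.
for value in (0, 128, 255):
    print(value, round(normalise_channel(value), 4))
```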
Step 2: Feature extraction (DINOv2 ViT-L + multi-scale patches)
DepthPro's backbone is a DINOv2 ViT-L/16 (24 layers, 1024 hidden dim, 16×16 patch size). It processes the image at three scales simultaneously:
| Scale | Resolution | Purpose |
|---|---|---|
| 0.25× | 384×384 | Global context, far-away geometry |
| 0.5× | 768×768 | Mid-range structure |
| 1.0× | 1536×1536 | Fine detail, edges, thin structures |
The three-scale features are fused with a DPT-style decoder (hidden size 256) into a dense feature map.
Step 3: Depth prediction (canonical inverse depth)
The decoder outputs canonical inverse depth C. This is not the final metric depth yet; it is a scale-invariant representation that the network learns to predict robustly across scenes. The actual metric depth is recovered in the post-processing step.
Step 4: FOV / focal-length estimation (no calibration needed)
DepthPro has a dedicated FOV head. It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the horizontal field of view (FOV) in degrees.
From the FOV, the focal length in pixels is derived:
focal_length = (image_width / 2) / tan(FOV / 2)
This is the critical piece that makes the depth metric (in meters) rather than just up-to-scale.
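The FOV-to-focal-length relation is easy to check numerically. The sketch below implements the formula as written, plus its inverse; the function names are illustrative and are not part of the wrapper's API.

```python
import math


def focal_length_from_fov(fov_degrees, image_width):
    """Focal length in pixels from a horizontal FOV given in degrees."""
    half_fov = math.radians(fov_degrees) / 2.0
    return (image_width / 2.0) / math.tan(half_fov)


def fov_from_focal_length(focal_length_px, image_width):
    """Inverse mapping: horizontal FOV in degrees from a focal length in pixels."""
    return math.degrees(2.0 * math.atan((image_width / 2.0) / focal_length_px))


focal = focal_length_from_fov(60.0, 1920)
print(f"{focal:.1f} px")
print(f"{fov_from_focal_length(focal, 1920):.1f} deg")
```

For a 1920-pixel-wide image and a 60 degree horizontal FOV this gives a focal length of roughly 1663 px; a wider FOV gives a shorter focal length.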
Step 5: Post-processing (metric depth + back-projection)
The processor converts canonical inverse depth C to metric depth D_m using the estimated focal length and the image width:
D_m = focal_length / (image_width × C)
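A minimal per-pixel sketch of this conversion, assuming the relation given in the Depth Pro paper, D_m = f_px / (w · C), where w is the image width in pixels. The function name and the clamping bounds are illustrative assumptions, not the wrapper's actual code.

```python
def metric_depth_from_canonical(canonical_inverse_depth, focal_length_px, image_width,
                                min_inverse=1e-4, max_inverse=1e4):
    """Convert one canonical inverse depth value to metric depth in metres."""
    # Scale canonical inverse depth to metric inverse depth (1/m).
    inverse_depth = canonical_inverse_depth * (image_width / focal_length_px)
    # Clamp so that zero or extreme predictions cannot divide by zero (assumed bounds).
    inverse_depth = min(max(inverse_depth, min_inverse), max_inverse)
    return 1.0 / inverse_depth


# With focal length equal to the image width, depth is simply 1 / C.
print(metric_depth_from_canonical(0.5, focal_length_px=1536.0, image_width=1536))
```

Note the direction of the dependence: for a fixed C, a longer focal length (narrower FOV) yields a larger metric depth.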
Then, using the pinhole camera model, every pixel (u, v) is back-projected to a 3D point:
X = (u - cx) * Z / focal_length
Y = (v - cy) * Z / focal_length
Z = D_m[v, u]
where (cx, cy) is the principal point (image centre by default). DepthPro assumes square pixels (fx == fy), which is standard for most modern cameras.
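The pinhole equations above can be sketched in plain Python for a tiny depth map. This is not the wrapper's implementation (which presumably operates on NumPy arrays); back_project is a hypothetical name, and taking the principal point as (width / 2, height / 2) is one common convention assumed here.

```python
def back_project(depth, focal_length, principal_point=None):
    """Back-project an H x W depth map (nested lists, metres) to a list of (X, Y, Z)."""
    height, width = len(depth), len(depth[0])
    if principal_point is None:
        cx, cy = width / 2.0, height / 2.0   # assumed image-centre convention
    else:
        cx, cy = principal_point
    points = []
    for v in range(height):          # v indexes rows (image y)
        for u in range(width):       # u indexes columns (image x)
            z = depth[v][u]
            x = (u - cx) * z / focal_length
            y = (v - cy) * z / focal_length
            points.append((x, y, z))
    return points


depth = [[2.0, 2.0],
         [2.0, 2.0]]
points = back_project(depth, focal_length=2.0)
print(len(points), points[0])
```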
Result
You get:
- depth: (H, W) metric depth map in meters
- focal_length: estimated focal length in pixels
- field_of_view: estimated horizontal FOV in degrees
- points: (N, 3) 3D point cloud in camera coordinates (+Z forward)
API Reference
DepthProEstimator
class DepthProEstimator(
    model_name="apple/DepthPro-hf",
    device="cuda:0",
    dtype=torch.float16,
)
- model_name: HuggingFace model ID or local path.
- device: PyTorch device. CUDA strongly recommended.
- dtype: torch.float16 (default, fast) or torch.float32 (slightly higher precision).
.estimate(image)
result = estimator.estimate(
    image,                    # str, Path, PIL.Image, or np.ndarray
    return_confidence=False,
)
Returns a DepthResult dataclass:
| Attribute | Shape | Description |
|---|---|---|
| depth | (H, W) | Metric depth in meters (float32) |
| focal_length | scalar | Estimated focal length in pixels |
| field_of_view | scalar | Estimated horizontal FOV in degrees |
| image | (H, W, 3) | Original RGB image (uint8) |
| confidence | (H, W) or None | Per-pixel confidence (if requested) |
| height, width | scalars | Convenience properties |
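To make the table concrete, here is a rough sketch of what such a dataclass could look like. It mirrors the attribute names above, but the class name DepthResultSketch, the type annotations, and the way height and width are derived are assumptions; consult depth_estimator.py for the real definition.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DepthResultSketch:
    depth: Any                         # (H, W) metric depth, array-like
    focal_length: float                # pixels
    field_of_view: float               # horizontal FOV, degrees
    image: Any                         # (H, W, 3) RGB, array-like
    confidence: Optional[Any] = None   # (H, W), only if requested

    @property
    def height(self):
        # Number of rows in the depth map (works for nested lists and arrays).
        return len(self.depth)

    @property
    def width(self):
        # Number of columns in the depth map.
        return len(self.depth[0])


result_sketch = DepthResultSketch(
    depth=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    focal_length=1000.0,
    field_of_view=60.0,
    image=None,
)
print(result_sketch.height, result_sketch.width)
```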
.estimate_batch(images)
Process multiple images in a single forward pass for efficiency:
results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
for r in results:
    print(r.depth.shape)
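For a long list of images you will usually want to feed estimate_batch() in fixed-size chunks rather than all at once. A small helper along these lines would do; chunked is a hypothetical name, and the batch size of 4 is only an example value to tune against your GPU memory.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


paths = [f"frame_{i:03d}.jpg" for i in range(10)]
for batch in chunked(paths, batch_size=4):
    # results = estimator.estimate_batch(batch)   # wrapper call, needs the model loaded
    print(len(batch), batch[0])
```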
depth_to_point_cloud(depth, focal_length, ...)
from depthpro_wrapper import depth_to_point_cloud
points = depth_to_point_cloud(
    depth=result.depth,                 # (H, W) metric depth
    focal_length=result.focal_length,
    principal_point=None,               # default = image centre
    mask=None,                          # optional boolean mask
    sample_step=1,                      # 2 = 1/4 points, 4 = 1/16
)
Returns (N, 3) float32 array of 3D points in camera coordinates.
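To see how sample_step affects output size, the sketch below counts the points that survive if every sample_step-th row and column is kept. That striding behaviour is an assumption about how the wrapper subsamples, and the count ignores any mask.

```python
def sampled_point_count(height, width, sample_step=1):
    """Number of pixels kept when taking every sample_step-th row and column."""
    rows = len(range(0, height, sample_step))
    cols = len(range(0, width, sample_step))
    return rows * cols


for step in (1, 2, 4):
    print(step, sampled_point_count(1536, 1536, step))
```

At 1536 × 1536 this drops from about 2.36 million points at step 1 to about 0.59 million at step 2 and 0.15 million at step 4.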
rgbd_to_point_cloud(depth, rgb, focal_length, ...)
Same as above but also returns per-point RGB colours:
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length,
    sample_step=2,
)
Returns (N, 3) points and (N, 3) uint8 colours.
normals_from_depth(depth, focal_length)
Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):
from depthpro_wrapper import normals_from_depth
normals = normals_from_depth(result.depth, result.focal_length)
Returns (H, W, 3) float32 unit normals (unoriented).
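One common way to get normals from a depth map is to back-project neighbouring pixels and take the cross product of the two local tangent vectors. The sketch below shows that idea in plain Python; it is not necessarily how normals_from_depth is implemented, and for simplicity it uses forward differences and returns an (H-1) × (W-1) grid rather than the full (H, W, 3) array.

```python
import math


def normals_from_depth_sketch(depth, focal_length):
    """Unit normals from forward differences of back-projected points."""
    height, width = len(depth), len(depth[0])
    cx, cy = width / 2.0, height / 2.0   # assumed principal point

    def point(u, v):
        z = depth[v][u]
        return ((u - cx) * z / focal_length, (v - cy) * z / focal_length, z)

    normals = []
    for v in range(height - 1):
        row = []
        for u in range(width - 1):
            p = point(u, v)
            du = [a - b for a, b in zip(point(u + 1, v), p)]   # tangent along image x
            dv = [a - b for a, b in zip(point(u, v + 1), p)]   # tangent along image y
            n = (du[1] * dv[2] - du[2] * dv[1],
                 du[2] * dv[0] - du[0] * dv[2],
                 du[0] * dv[1] - du[1] * dv[0])
            length = math.sqrt(sum(c * c for c in n)) or 1.0   # avoid division by zero
            row.append(tuple(c / length for c in n))
        normals.append(row)
    return normals


# A fronto-parallel plane at 2 m: every normal lies along the Z axis.
flat = [[2.0] * 3 for _ in range(3)]
print(normals_from_depth_sketch(flat, focal_length=2.0)[0][0])
```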
save_point_cloud(path, points, colors=None, normals=None)
Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):
from depthpro_wrapper import save_point_cloud
save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
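For reference, an ASCII PLY file is just a short text header followed by one line per vertex. The sketch below writes points with optional colours using only the standard library; write_ascii_ply is a hypothetical stand-in for illustration and omits the normals that save_point_cloud also supports.

```python
import os
import tempfile


def write_ascii_ply(path, points, colors=None):
    """Write (x, y, z) points, optionally with (r, g, b) uint8 colours, as ASCII PLY."""
    lines = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
    ]
    if colors is not None:
        lines += ["property uchar red", "property uchar green", "property uchar blue"]
    lines.append("end_header")
    for i, (x, y, z) in enumerate(points):
        row = f"{x} {y} {z}"
        if colors is not None:
            r, g, b = colors[i]
            row += f" {int(r)} {int(g)} {int(b)}"
        lines.append(row)
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(lines) + "\n")


demo_path = os.path.join(tempfile.mkdtemp(), "demo.ply")
write_ascii_ply(demo_path, [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)], colors=[(255, 0, 0), (0, 255, 0)])
print(open(demo_path, encoding="ascii").read())
```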
CLI Usage
# Basic: image → point cloud
python scripts/image_to_pointcloud.py photo.jpg cloud.ply
# With colours and normals
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals
# Down-sample for faster processing / smaller files
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2
# Save intermediate depth & confidence maps
python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
    --save-depth depth.npy --save-confidence conf.npy
# CPU fallback (very slow)
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32
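If you want to adapt the CLI, a parser accepting the flags shown above could be declared roughly as follows. This is a sketch, not the contents of scripts/image_to_pointcloud.py: the defaults and help strings are assumptions (the device and dtype defaults simply mirror DepthProEstimator).

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Image to point cloud (sketch)")
    parser.add_argument("image", help="input image path")
    parser.add_argument("output", help="output PLY path")
    parser.add_argument("--colored", action="store_true", help="store per-point RGB")
    parser.add_argument("--normals", action="store_true", help="store per-point normals")
    parser.add_argument("--sample-step", type=int, default=1, help="keep every Nth pixel")
    parser.add_argument("--save-depth", default=None, help="optional .npy depth output")
    parser.add_argument("--save-confidence", default=None, help="optional .npy confidence output")
    parser.add_argument("--device", default="cuda:0")
    parser.add_argument("--dtype", choices=["float16", "float32"], default="float16")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["photo.jpg", "cloud.ply", "--colored", "--sample-step", "2"])
print(args.image, args.output, args.colored, args.sample_step, args.device)
```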
Repository layout
depthpro-wrapper/
├── depthpro_wrapper/
│   ├── __init__.py              # public API
│   ├── depth_estimator.py       # DepthProEstimator + DepthResult
│   ├── point_cloud.py           # back-projection + normal estimation
│   └── io.py                    # image / PLY I/O helpers
├── scripts/
│   └── image_to_pointcloud.py   # CLI entry point
├── examples/
│   ├── quickstart.py            # 10-line minimal example
│   └── batch_processing.py      # folder-of-images batch script
├── setup.py
├── requirements.txt
└── README.md
Tips & Troubleshooting
| Problem | Solution |
|---|---|
| Out of memory | Use dtype=torch.float16 (default). If still OOM, use --sample-step 2 or smaller images. |
| Depth looks wrong / flat | DepthPro works best on images with perspective (indoor rooms, outdoor scenes). Very flat macro shots may under-estimate depth. |
| Point cloud is noisy at edges | Depth has uncertainty at object boundaries. Use sample_step=2 or filter by confidence if you saved it. |
| Focal length seems off | DepthPro estimates FOV from image content. Very unusual aspect ratios or heavy cropping can confuse it. You can override with your own focal_length in depth_to_point_cloud(). |
| Want a mesh, not a point cloud | Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet NKSR for neural surface reconstruction. |
| Batch processing is slow | Use estimate_batch() with batch size 4 to 8 instead of looping over estimate(). |
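For the "Focal length seems off" case, if you know the lens focal length and sensor width (for example from EXIF data or the camera's spec sheet), you can compute a focal length in pixels yourself and pass it as focal_length to depth_to_point_cloud(). The helper below is a hypothetical convenience, and it assumes the image is uncropped so that the full sensor width maps to the full image width.

```python
def focal_length_px_from_mm(focal_length_mm, sensor_width_mm, image_width_px):
    """Convert a physical focal length to pixels for an uncropped image."""
    return focal_length_mm * image_width_px / sensor_width_mm


# A 24 mm lens on a 36 mm-wide (full-frame) sensor, image 3000 px wide.
print(focal_length_px_from_mm(24.0, 36.0, 3000))
```

The same formula works with a 35 mm-equivalent focal length if you use 36 mm as the sensor width.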
Citation
If you use DepthPro in your research, please cite the original paper:
@article{depthpro2024,
title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
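author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},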
journal={arXiv preprint arXiv:2410.02073},
year={2024}
}
Original code: https://github.com/apple/ml-depth-pro
HuggingFace model: https://huggingface.co/apple/DepthPro-hf
License
This wrapper is released under the MIT License. DepthPro itself is under Apple's own license (see the original repository).
Built with ❤️ on top of Apple's DepthPro.