Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. While 2D imagery is abundant, acquiring 3D data typically requires specialized sensors and laborious annotation.

In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations, including point clouds, camera poses, depth maps, and pseudo-RGBD, via integrated depth estimation, camera calibration, and scale calibration.

Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data-collection costs and open new avenues for advancing spatial intelligence.

We release two generated spatial datasets, COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that the generated data benefits a wide range of 3D tasks, from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
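At its core, the 2D-to-3D lifting back-projects each pixel through its estimated metric depth and the calibrated camera intrinsics. Below is a minimal sketch of that step, assuming a pinhole camera model; the function and variable names are illustrative, not the repository's API:

```python
import numpy as np

def lift_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) into an (N, 3) point cloud
    under a pinhole camera model with intrinsics (fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx  # camera-frame X
    y = (v - cy) * depth / fy  # camera-frame Y
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with valid depth

```

The scale calibration described in the abstract is what makes the resulting coordinates metric rather than relative; the released pipeline performs that step before export.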
- Python 3.8+
- PyTorch >= 2.0 with CUDA support
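A quick sanity check of these prerequisites before installing (a minimal sketch; adjust the version bounds if your setup differs):

```python
import sys
import torch

assert sys.version_info >= (3, 8), "Python 3.8+ is required"
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 0), f"PyTorch >= 2.0 required, found {torch.__version__}"
assert torch.cuda.is_available(), "a CUDA-enabled PyTorch build is required"
print(f"OK: PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
```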
```bash
# Update system packages
sudo apt-get update

# Install essential libraries
sudo apt-get install -y \
    libgl1-mesa-dev \
    libglib2.0-0 \
    ffmpeg libsm6 libxext6 \
    aria2 \
    git-lfs \
    vim tmux wget unzip htop rsync

# Configure HuggingFace endpoint (only for China mainland users)
echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
source ~/.bashrc

# Install core dependencies
pip install -r requirements.txt

# Install flash-attention (requires special build configuration)
pip install flash-attn==2.5.8 --no-build-isolation

# Only for H20 platform compatibility (run at the end if needed)
pip install nvidia-cublas-cu12==12.4.5.8

# Install PerspectiveFields from source
cd ./PerspectiveFields
pip install -r requirements.txt
python setup.py install
cd ..
```

Generate spatial data for the COCO training set:

```bash
# Single GPU version
# Note: First run will download pre-trained checkpoints (~10 minutes)
python generate_spatial_img_coco.py \
-i /path/to/Datasets/coco/train2017 \
-a /path/to/Datasets/coco/annotations/instances_train2017.json \
-o ./path/to/output/
```

```bash
# Process COCO validation set
python generate_spatial_img_coco.py \
-i ./data/coco/val2017 \
-a ./data/coco/annotations/instances_val2017.json \
-o ./output/coco_val_3d/
```
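Before validating, it can help to confirm that the output directory is populated. The sketch below counts files per subfolder; the subfolder names (rgb, depth, camera_parameters, point, json) are inferred from the validation commands that follow and may not match the actual layout exactly:

```python
from pathlib import Path

# Subfolder names are inferred from the validation commands below and
# may not match the actual output layout exactly.
out = Path("./output/coco_val_3d")
for sub in ["rgb", "depth", "camera_parameters", "point", "json"]:
    n = len(list((out / sub).glob("*")))
    print(f"{sub:20s} {n:6d} files")
```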
Verify that spatial images are correctly generated by reconstructing the 3D scene (a manual cross-check sketch follows the expected results):

```bash
python validate_spatial_img_coco.py \
--rgb_image_path ./demo_output/rgb/000000000632.png \
--depth_image_path ./demo_output/depth/000000000632_remove_edges.png \
--camera_params_path ./demo_output/camera_parameters/000000000632.json \
--output_ply_path ./validation/output.ply \
--visualize  # Optional: directly visualize the point cloud
```

Expected Result:
- ✅ Semantically meaningful point cloud
- ✅ Correct scale representation
- ✅ Z-axis pointing upward
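The validation script performs this reconstruction for you. For an independent cross-check, the sketch below does the equivalent back-projection with open3d, mirroring the numpy sketch earlier. The JSON field names (fx, fy, cx, cy) and the millimetre depth encoding are assumptions; inspect a generated camera_parameters file for the actual schema:

```python
import json
import numpy as np
import open3d as o3d

# Assumed schema: {"fx": ..., "fy": ..., "cx": ..., "cy": ...} -- verify
# against an actual generated camera_parameters JSON before relying on this.
with open("./demo_output/camera_parameters/000000000632.json") as f:
    params = json.load(f)

color = o3d.io.read_image("./demo_output/rgb/000000000632.png")
depth = o3d.io.read_image("./demo_output/depth/000000000632_remove_edges.png")
h, w = np.asarray(color).shape[:2]

intrinsic = o3d.camera.PinholeCameraIntrinsic(
    w, h, params["fx"], params["fy"], params["cx"], params["cy"])

# depth_scale=1000.0 assumes a 16-bit millimetre depth PNG (an assumption).
rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
    color, depth, depth_scale=1000.0, convert_rgb_to_intensity=False)
pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)

# Note: open3d uses a Y-down camera frame; the repository's script also
# rotates the cloud into the Z-up convention checked above.
o3d.io.write_point_cloud("./validation/manual_check.ply", pcd)
```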
Ensure that point clouds and annotations are correctly aligned (a manual coloring sketch follows the expected results):

```bash
python validate_pointcloud_and_anno_coco.py \
--point_cloud_path ./demo_output/point/000000000632.ply \
--json_path ./demo_output/json/000000000632.json \
--output_ply_path ./validation/annotated.ply \
--visualize  # Optional: directly visualize
```

Expected Result:
- ✅ Instances marked with distinct colors
- ✅ Correct spatial boundaries
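For a manual spot-check of the alignment, the sketch below colors each annotated instance in the point cloud. The keys "instances" and "point_indices" are assumptions about the annotation schema, not the repository's documented format; inspect a generated JSON file first:

```python
import json
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("./demo_output/point/000000000632.ply")
with open("./demo_output/json/000000000632.json") as f:
    anno = json.load(f)

colors = np.tile([0.7, 0.7, 0.7], (len(pcd.points), 1))  # grey background
rng = np.random.default_rng(0)
# "instances" and "point_indices" are assumed keys -- check the actual
# annotation JSON for the real field names before running this.
for inst in anno["instances"]:
    colors[np.asarray(inst["point_indices"])] = rng.random(3)

pcd.colors = o3d.utility.Vector3dVector(colors)
o3d.io.write_point_cloud("./validation/manual_annotated.ply", pcd)
```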
If you find our work useful in your research, please consider citing:
```bibtex
@inproceedings{miao2025towards,
  title     = {Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting},
  author    = {Miao, Xingyu and Duan, Haoran and Qian, Quanhao and
               Wang, Jiuniu and Long, Yang and Shao, Ling and
               Zhao, Deli and Xu, Ran and Zhang, Gongjie},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2025}
}
```