
🧬 Generative Spatial Transformer (GST)

Implementation of GST from Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction (ICLR 2025) in PyTorch.

arXiv (coming soon) | project page | huggingface weights

✨️ News

  • 2025-2: Code is released.

🛠️ Installation

  1. Environment setup
conda create -n gst python=3.8
conda activate gst

pip install -r requirements.txt
  2. Model weight download

We provide the image tokenizer, the camera tokenizer, and the auto-regressive model on huggingface weights. Please download the following three checkpoints and place them in the folder ./ckpts.

image-16.pt # Adapted from LlamaGen
camera-4.pt
gst.pt
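
If you prefer to fetch the checkpoints from Python, a minimal sketch using huggingface_hub is shown below. The repo id is a placeholder (this README only links the weights page), so substitute the actual repository behind the huggingface weights link above.

# Hedged sketch: download the three checkpoints into ./ckpts.
# "<user>/<gst-weights>" is a PLACEHOLDER repo id; replace it with the
# repository linked above under "huggingface weights".
from huggingface_hub import hf_hub_download

for name in ["image-16.pt", "camera-4.pt", "gst.pt"]:
    hf_hub_download(repo_id="<user>/<gst-weights>", filename=name,
                    local_dir="./ckpts")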

🚀 Inference

GST models the joint distribution of images and their corresponding viewpoints. Use the following command to sample --num-sample viewpoints and images conditioned on a given observation --image-path.

python run_sample_camera_image.py \
    --image-ckpt   /path/to/image-16.pt  \
    --gpt-ckpt     /path/to/gst.pt \
    --camera-ckpt  /path/to/camera-4.pt \
    --image-path assets/hydrant.jpg \
    --num-sample 16 

More optional parameters can be found in the script run_sample_camera_image.py. After sampling, the results are saved in the folder sample, with the following structure:

sample
├── camera.ply      # 3D positions and orientations of the sampled viewpoints
├── images.obj      # Images corresponding to each viewpoint
│   
├── material_0.png  # Texture
├── material_1.png 
├── ...
├── material.mtl    # Texture mapping for the OBJ file
│   
├── sample_0.png    # Sampled image
├── sample_0.npy    # Camera matrix converted from the sampled camera
├── sample_1.png 
├── sample_1.npy 
└── ...

GST employs the RDF coordinate system: the positive x-axis points right (R), the positive y-axis points down (D), and the positive z-axis points forward (F). The sampled .ply and .obj files can be opened in MeshLab or other 3D software.
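
To sanity-check a sampled pose programmatically, the sketch below reads one of the sample_*.npy files. It assumes the file stores a 4x4 camera-to-world matrix whose rotation columns follow the RDF convention described above; this layout is an assumption, so confirm it against run_sample_camera_image.py.

# Hedged sketch: inspect a sampled camera matrix.
# ASSUMPTION: sample_0.npy holds a 4x4 camera-to-world matrix whose
# rotation columns are the RDF basis vectors (right, down, forward).
import numpy as np

cam = np.load("sample/sample_0.npy")
assert cam.shape == (4, 4)

R, t = cam[:3, :3], cam[:3, 3]
right, down, forward = R[:, 0], R[:, 1], R[:, 2]
print("camera position:", t)
print("viewing direction (+z, F):", forward)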

📃 License

The majority of this project is licensed under the MIT License. Portions of the project are available under the separate licenses of the referenced projects, as detailed in the corresponding files.

✨ Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{chen2024and,
  title={Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction},
  author={Chen, Junyi and Huang, Di and Ye, Weicai and Ouyang, Wanli and He, Tong},
  journal={arXiv preprint arXiv:2410.18962},
  year={2024}
}

💖 Acknowledgement

We would like to express our gratitude to the contributors of LlamaGen, whose codebase served as the foundation for our work. We also acknowledge the valuable insights drawn from the works of B and C, which significantly influenced the direction of our research. Special thanks go to the pioneering contributions of Zero123, ZeroNVS, and RayDiffusion, which have enriched our understanding and inspired our endeavors.
