Diffusion for Simulation
Transformer-Based Diffusion for Game Generation
We explore transformer-based diffusion models for game environment generation, investigating two approaches for conditioning on past frames and actions: Video Generation (VG), which predicts an entire video sequence conditioned on past actions, and Single Frame Generation (SFG), which predicts a single frame conditioned on past frames and actions via either concatenation or cross-attention. Using the ViZDoom My Way Home environment as our testbed, we show that while SFG models achieve superior performance under teacher forcing, with PSNR values up to 32.21, VG models are more stable under autoregressive generation, revealing an important tradeoff between model architecture and performance.
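To make the two conditioning mechanisms concrete, here is a minimal NumPy sketch of how past frames and actions might enter a diffusion backbone. All shapes, names, and the random projection are illustrative assumptions, not the project's actual implementation: concatenation stacks past frames as extra input channels, while cross-attention lets frame tokens attend to action embeddings.

```python
import numpy as np

# Hypothetical shapes (illustrative, not from the project):
# B batch, T past frames, C channels, H x W spatial resolution, D embedding dim.
B, T, C, H, W, D = 2, 4, 3, 8, 8, 16

noisy_frame = np.random.randn(B, C, H, W)      # frame currently being denoised
past_frames = np.random.randn(B, T, C, H, W)   # conditioning frames
actions = np.random.randn(B, T, D)             # embedded past actions

# (1) Concatenation conditioning: flatten past frames into the channel axis
# so the backbone sees them as additional input channels.
concat_input = np.concatenate(
    [noisy_frame, past_frames.reshape(B, T * C, H, W)], axis=1
)
assert concat_input.shape == (B, (T + 1) * C, H, W)

# (2) Cross-attention conditioning: frame tokens query action embeddings.
def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Flatten the noisy frame into H*W tokens of dim C, then project to D
# (a random projection stands in for a learned one here).
proj = np.random.randn(C, D)
frame_tokens = noisy_frame.reshape(B, C, H * W).transpose(0, 2, 1) @ proj  # (B, H*W, D)
conditioned = cross_attention(frame_tokens, actions, actions)
assert conditioned.shape == (B, H * W, D)
```

Concatenation is cheap and ties conditioning to fixed spatial alignment, whereas cross-attention decouples the conditioning sequence length from the frame resolution, which is one reason the two mechanisms behave differently.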
Details of our findings are in the final report, and you can see our code here.