Computer Vision Fall 2025 project for generating frames of Pong or Tetris in real time from keyboard actions.

Latest DiT generation:

```
python main.py -f 10000 -t -ae 25 -de 25 -b 32
```
Inside the project folder, run the following:

- Windows:

  ```
  python -m venv venv
  .\venv\Scripts\activate
  pip install -r requirements.txt
  python main.py -f 300 -e 0.0
  ```

- Mac/Linux:

  ```
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  python main.py -f 300 -e 0.0
  ```

These commands should open an ALE Atari Pong window with the green paddle playing optimally. The window closes after 5 seconds (300 frames / 60 FPS = 5 s).
| File | Function |
|---|---|
| encoder.py | ViT Encoder - encoding game frames |
| decoder.py | ViT Decoder + DiT - reconstruction and generation |
| train.py | Complete training system |
| OASIS/* | Publicly released OASIS files |

All files in the OASIS/ directory are the publicly released files from the Oasis model, which we tested on our dataset with its pre-trained weights. All other files were created by us.
Train your model
-

- Without preloading:

  ```
  python main.py -t -f FRAMES -ae AE_EPOCHS -de DIT_EPOCHS -b BATCHES
  ```

- With preloading:

  ```
  python main.py -l -t -f FRAMES -ae AE_EPOCHS -de DIT_EPOCHS -b BATCHES
  ```

Note: train only the autoencoder or the DiT by setting the other's epochs to 0.
Visualizing the game loop
-

- Computer policy:

  ```
  python main.py -f FRAMES -e EPSILON
  ```

- Player input:

  ```
  python main.py -f FRAMES -p -e EPSILON
  ```
Run model inference
-

- Generate video:

  ```
  python main.py -l -i INFERENCE_FRAMES
  ```

- Interactive game:

  ```
  python main.py -l -p
  ```
Oasis version
-

- Train:

  ```
  python main_oasis.py --mode train --frames FRAMES --vae-epochs AE_EPOCHS --dit-epochs DIT_EPOCHS --batch-size BATCHES --save-dir DIRECTORY
  ```

- Generate GIF:

  ```
  python main_oasis.py --mode inference --vae-path DIRECTORY/best_vae.pth --dit-path DIRECTORY/dit_final.pth --num-frames FRAMES --output animation.gif
  ```
General Syntax
-

```
python main.py [-f FRAMES] [-e EPSILON] [-v] [-p] [-h] [-t] [-ae AE_AMOUNT] [-de DIT_AMOUNT] [-b BATCHES] [-l]
```
| Flag | Name | Type | Default | Description |
|---|---|---|---|---|
| -f | Frames amount | int | 10 | Number of frames to run the simulation for |
| -v | View in window | bool | false | Run simulation in window view |
| -p | Player input mode | bool | false | Flag to use keyboard input |
| -e | Epsilon probability | float | 0.01 | Probability to take a random action |
| -t | Training mode | bool | false | Run training loop |
| -ae | Autoencoder epochs | int | 20 | How many epochs to train the autoencoder |
| -de | DiT epochs | int | 15 | How many epochs to train the DiT |
| -b | Batch amount | int | 16 | Batch size |
| -l | Load model | bool | false | Loads weights from checkpoint files |
| -h | Help | | | Shows args |
- Clone a copy of this repository to your local machine
- Open a terminal and navigate to the CV_2025 folder
- Once inside the CV_2025 folder, make a new Python environment (this creates a new environment named venv):

  ```
  python -m venv venv
  ```

- Activate your environment:
  - Windows:

    ```
    .\venv\Scripts\activate
    ```

  - Mac:

    ```
    source venv/bin/activate
    ```

- Once your environment is activated, you should see:
  - Windows:

    ```
    (venv) C:\...your folder path...>
    ```

  - Mac:

    ```
    (venv) ... $
    ```

- Now install the required libraries:

  ```
  pip install -r requirements.txt
  ```

- Run commands:
  - Main file:

    ```
    python main.py [-f FRAMES] [-e EPSILON] [-v] [-p] [-h] [-t] [-ae AE_AMOUNT] [-de DIT_AMOUNT] [-b BATCHES] [-l]
    ```

  - Any file:

    ```
    python file_name.py
    ```

- When finished, deactivate your environment:

  ```
  deactivate
  ```

Add any additional required libraries to the requirements.txt file.
- ✅ Basic Pong setup
  - Allows keyboard input for future human playing
  - Has a mode to visualize actions in real time (for our understanding/tests)
  - Uses a computer policy to automatically make the best move, with a probability of doing something random instead
    - When training, this will allow us to generate games with different levels of "expertise"
  - Has a scaffold for using actions from the encoder if implemented in the future
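The epsilon-greedy computer policy described above can be sketched as follows. This is a minimal sketch: the function name, the action-count default, and the source of the optimal action are illustrative, not the project's actual API. The default of 6 actions matches ALE Pong's full action set.

```python
import random

def choose_action(optimal_action, epsilon, n_actions=6, rng=random):
    """Epsilon-greedy policy: usually take the optimal action, but with
    probability epsilon take a uniformly random action instead."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # random "mistake"
    return optimal_action                # expert move
```

Sweeping epsilon from 0.0 toward 1.0 yields progressively less expert play, which is how games of varying "expertise" can be recorded for training.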
- ✅ First-level ViT Encoder, DiT, and ViT Decoder creation
- ✅ Created a `main.py` file:
  - Inside `main.py`, we define command-line arguments to parse the following:
    - f: Frames amount (Default 10)
    - v: View in window (Default true)
    - p: Player keyboard input mode (Default false)
    - e: Epsilon probability to pick any random move (Default 0.01)
    - h: Help
  - By default, no training happens here. This code just calls the Pong interface to view a game of FRAMES length, with Computer or Player control, and with an Epsilon probability for Computer mode.
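A minimal sketch of the argument parsing `main.py` might use, built from the flag table earlier in this README; the real file may structure this differently, and the help strings are paraphrased.

```python
import argparse

def build_parser():
    """Sketch of the CLI mirroring this README's flag table (assumed)."""
    p = argparse.ArgumentParser(description="Pong frame-generation driver")
    p.add_argument("-f", "--frames", type=int, default=10,
                   help="Number of frames to run the simulation for")
    p.add_argument("-e", "--epsilon", type=float, default=0.01,
                   help="Probability of taking a random action")
    p.add_argument("-v", "--view", action="store_true",
                   help="Run simulation in window view")
    p.add_argument("-p", "--player", action="store_true",
                   help="Use keyboard input")
    p.add_argument("-t", "--train", action="store_true",
                   help="Run training loop")
    p.add_argument("-ae", "--ae-epochs", type=int, default=20,
                   help="Autoencoder epochs")
    p.add_argument("-de", "--dit-epochs", type=int, default=15,
                   help="DiT epochs")
    p.add_argument("-b", "--batches", type=int, default=16,
                   help="Batch size")
    p.add_argument("-l", "--load", action="store_true",
                   help="Load weights from checkpoint files")
    return p
```

For example, `build_parser().parse_args(["-f", "300", "-e", "0.0"])` reproduces the quick-start invocation above.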
- ✅ Created a `train.py` file:
  - Inside `train.py` we define the different components of our model:
    - Pong Frame Dataset: stores frames and actions for the model
    - Autoencoder Trainer: trains our autoencoder with a given decoder
    - DiT Trainer: trains our DiT for frame generation
    - train: callable train loop
      - Trains the model using our data and tracks loss over time with a plot
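Both trainers share a simple shape: run an optimization step over every batch, once per epoch, and record a per-epoch loss for plotting. A stripped-down sketch of that pattern (names are illustrative; the real trainers wrap PyTorch models and optimizers):

```python
def train_loop(step_fn, batches, epochs):
    """Run step_fn (one optimization step returning a scalar loss) over
    every batch each epoch; return the mean loss per epoch for plotting."""
    history = []
    for _ in range(epochs):
        losses = [step_fn(batch) for batch in batches]
        history.append(sum(losses) / len(losses))
    return history  # train.py plots this over time
```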
- ✅ Extract frames from the `pong.py` interface:
  - Hooked our Pong simulation into the training loop to generate our data from our computer policy
- ✅ Basic inference in `pipeline.py`:
  - Generates one frame from the trained autoencoder and DiT so we can immediately visualize our model
- ✅ Add logic to `main.py` so that we can continue training a loaded model
  - If you pass both `-l` and `-t`, it will load the weights from the paths in `main.py` and continue training them for the given `-ae` and `-de` epoch amounts. Set either to `0` to freeze its weights.
- ✅ Finish `data_utils:save_animation`
  - Use `matplotlib.animation` to make a function that saves an animation to `generated/` from a list of `frames` generated at inference time.
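One possible shape for `save_animation`, assuming frames arrive as NumPy image arrays; the signature, GIF output format, and writer choice are guesses, not the project's actual code.

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless backend; saving needs no display
import matplotlib.animation as animation
import matplotlib.pyplot as plt

def save_animation(frames, path="generated/animation.gif", fps=30):
    """Save a list of image arrays as a GIF via matplotlib.animation
    (sketch; the real data_utils version may differ)."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    fig, ax = plt.subplots()
    ax.axis("off")
    im = ax.imshow(frames[0])

    def update(i):
        im.set_data(frames[i])  # swap in frame i without redrawing axes
        return (im,)

    ani = animation.FuncAnimation(
        fig, update, frames=len(frames), interval=1000 / fps, blit=True
    )
    ani.save(path, writer=animation.PillowWriter(fps=fps))
    plt.close(fig)
```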
- ✅ Update the `train` loop to occasionally save images generated by the autoencoder and DiT
  - Helpful for us to visualize the differences as we go and see progress.
- ✅ Update `pipeline:inference` so that we generate frames in a loop
  - So far, `pipeline:inference` only generates one frame. For real-time video playing, we'd need to continuously generate frames in a loop. For testing purposes, this would look like:
    - `current latent + action:0 + random timestep -> DiT -> predicted noise -> remove noise -> display frame`
    - `predicted noise + action:0 + random timestep -> DiT -> repeat`
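The loop above can be sketched generically. Here `denoise_step` and `decode` stand in for the DiT denoising step and the ViT decoder; they are hypothetical interfaces, not the real `pipeline.py` API.

```python
def rollout(denoise_step, decode, latent, actions):
    """Autoregressive generation sketch: denoise the current latent
    conditioned on an action, decode it to a displayable frame, then
    feed the resulting latent back in for the next step."""
    frames = []
    for action in actions:
        latent = denoise_step(latent, action)  # DiT: predict noise, remove it
        frames.append(decode(latent))          # ViT decoder: latent -> frame
    return frames
```

With toy stand-ins, e.g. `rollout(lambda l, a: l + a, lambda l: l * 2, 0, [1, 1, 1])` returns `[2, 4, 6]`, showing each step's output becoming the next step's input.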
- ✅ Add seed logic to `pong.py` so that we can generate frames from a new seed
  - It would be helpful for inference and training to optionally pass a seed to start the Pong game from, rather than always starting from (0,0).
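Optional seeding can be as simple as deriving the start state from a seeded RNG. This is a sketch with made-up state variables; the real `pong.py` state differs.

```python
import random

def initial_state(seed=None):
    """Return a reproducible start position instead of always (0, 0).
    The same seed yields the same start; seed=None stays random."""
    rng = random.Random(seed)
    return rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
```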
- ✅ Change the model to train on `(frame_t, action_t, frame_{t+1})` tuples
  - The model should generate the next frame conditioned on the given start frame and action, a three-way dependency
  - If you change this, the training loop and inference loop should be modified too
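A dataset of such tuples can be sketched as follows. This is a minimal stand-in: the class name is made up, and the real Pong Frame Dataset presumably subclasses `torch.utils.data.Dataset` and stores tensors.

```python
class TransitionDataset:
    """Yields (frame_t, action_t, frame_{t+1}) triples from aligned
    frame/action sequences recorded during simulation."""

    def __init__(self, frames, actions):
        assert len(frames) == len(actions)
        self.frames, self.actions = frames, actions

    def __len__(self):
        # One sample per transition; the last frame is only ever a target.
        return len(self.frames) - 1

    def __getitem__(self, i):
        return self.frames[i], self.actions[i], self.frames[i + 1]
```

For example, four recorded frames yield three transitions, and index 1 pairs frame 1 and action 1 with frame 2 as the target.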
- Oasis Model: https://oasis-model.github.io/
- MineWorld: https://github.com/microsoft/mineworld
- Atari World Modeling: https://arxiv.org/pdf/2405.12399

