Sequentially load / unload train datasets to GPU #20676
Unanswered
meilame-tayebjee asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hi,
I have 100 subsamples of a huge dataset, each identified by an index idx.
Let's say I want to use 80 subsamples as the training set (idx 1 to 80), 10 as validation and 10 as test. Each subsample takes approximately 6 GB of GPU memory.
Note that I have two H100 GPUs, each with 95 GB of memory. My model is a GPT-like model with 31 million parameters.
I want to use Lightning to train over several epochs, sequentially loading / unloading the datasets onto the GPUs, without having them all in memory at once. I do not even want to initialize the datasets beforehand (I also need to initialize them sequentially).
Basically, during one epoch, I want to load the first train dataset onto the GPU, train on it, unload it and load the next one, and so on until the last training dataset. Then restart for another epoch.
I started using the `DataModule` class, with something like the following. However, when calling `self.trainer.datamodule.next_train_subsample()`, the dataset is indeed updated as I want, but I am not sure whether the `data_loader` takes that update into account. Happy to have any insights on how to do this the right way! Thank you very much.
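One pattern that matches this description is a streaming-style dataset that loads one subsample at a time and releases it before loading the next. The sketch below shows the idea with plain Python generators; `load_subsample` is a hypothetical placeholder for whatever actually reads one ~6 GB chunk (e.g. `torch.load` on one file), and the indices and chunk contents are made up for illustration. In Lightning, a generator like this would typically back an `IterableDataset` returned from the DataModule's `train_dataloader()`, optionally combined with `Trainer(reload_dataloaders_every_n_epochs=1)` so the stream is rebuilt at the start of each epoch.

```python
# Sketch only: load_subsample and the sample contents are hypothetical
# stand-ins for the real per-chunk loading code.

def load_subsample(idx):
    # Placeholder loader: in practice this would read subsample `idx`
    # from disk and move the tensors to the GPU.
    return [(idx, i) for i in range(4)]  # fake "samples" for illustration

def sequential_samples(indices):
    """Yield samples one subsample at a time, keeping only one in memory."""
    for idx in indices:
        chunk = load_subsample(idx)  # load the current subsample
        yield from chunk             # stream its samples to the training loop
        del chunk                    # drop the reference so it can be freed

if __name__ == "__main__":
    train_indices = range(1, 4)      # stand-in for idx 1..80
    for sample in sequential_samples(train_indices):
        pass                         # training step would go here
```

Because the generator only ever holds one chunk, peak host/GPU memory stays at roughly one subsample plus the model, which is the behavior described above.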