fix: random sampling in ForgetRetainDataset #145
ZeguanXiao wants to merge 6 commits into locuslab:main from
Conversation
src/data/unlearn.py
Outdated
g = torch.Generator()
rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
seed = int(torch.empty((), dtype=torch.int64).random_().item() + rank)
g.manual_seed(seed)
It would be better to use the seed from the experiment config here, rather than
`int(torch.empty((), dtype=torch.int64).random_().item())`, to avoid introducing randomness uncontrolled by the seed.
Can you try to make the experiment's cfg.seed available to this dataset class and then use seed = exp_seed + rank here?
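A minimal sketch of what this suggestion could look like: the dataset receives the experiment's seed and offsets it by the process rank, so sampling is reproducible across runs but differs per rank. The class body and attribute names here are hypothetical illustrations, not the repo's actual implementation.

```python
import torch


class ForgetRetainDataset:
    """Hypothetical sketch: pair each forget example with a randomly
    sampled retain example, seeded from the experiment config."""

    def __init__(self, forget, retain, seed: int = 0):
        self.forget = forget
        self.retain = retain
        # Offset the experiment seed by the rank so each process draws a
        # different (but reproducible) sequence of retain indices.
        rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
        self.generator = torch.Generator()
        self.generator.manual_seed(seed + rank)

    def __getitem__(self, idx):
        # Sample a retain index from this dataset's own generator, not the
        # global RNG, so other randomness in the run is unaffected.
        retain_idx = torch.randint(
            len(self.retain), (1,), generator=self.generator
        ).item()
        return self.forget[idx], self.retain[retain_idx]

    def __len__(self):
        return len(self.forget)
```

With the same experiment seed, two single-process runs produce identical pairings, which is the reproducibility property the comment asks for.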
molereddy
left a comment
Thank you for the PR! Please see comment
Thanks for the feedback! I've updated the PR accordingly. Please let me know if there are any further adjustments required.

Please fix the lint errors!
molereddy
left a comment
It is not ideal to set the seed at the exact example level. This would mean we select the same retain example index sequences even if we are using a different dataset.
Since the point is that each rank must get a different seed, imo it is better to get the rank in the global seed function: https://github.com/locuslab/open-unlearning/blob/main/src/trainer/utils.py#L8
Let me know if you see any issues.
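The alternative proposed here would move the rank offset into the repo's global seed utility instead of the dataset. A rough sketch of what a rank-aware version of such a function might look like (the function name and exact set of RNGs seeded are assumptions, not the actual code at the linked `src/trainer/utils.py`):

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> int:
    """Hypothetical rank-aware global seeding: offset the experiment seed
    by the process rank so every rank draws different random numbers,
    while the run as a whole remains reproducible from one seed."""
    rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
    seed = seed + rank
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    return seed
```

The upside of this placement is that every source of randomness (not just the retain sampling) diverges across ranks, with a single code change.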
@molereddy Simply modifying
@molereddy Currently, my implementation adds a
Could you please check if this approach is feasible/correct?
If the goal is simply to have a different idx for different ranks, there's also a simpler solution: Same for forget_idx. And that's the only code change that would be needed. Or, for even more scrambling, maybe instead of adding the rank, add some hash of
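The "simpler solution" hinted at here (the referenced code snippets are elided in the page) could plausibly be a one-line shift of the sampled index by the rank. A hypothetical sketch, assuming a helper that draws one retain index per forget example:

```python
import torch


def sample_retain_idx(retain_len: int, generator: torch.Generator, rank: int) -> int:
    """Hypothetical sketch: draw an index once from a shared-seed generator,
    then shift it by the process rank so each rank lands on a different
    retain example without needing per-rank seeds."""
    idx = torch.randint(retain_len, (1,), generator=generator).item()
    # Modular shift keeps the result a valid index for any rank.
    return (idx + rank) % retain_len
```

Because all ranks draw the same base index from the same seed, rank r simply sees the rank-0 sequence shifted by r modulo the dataset size; whether that amount of decorrelation is enough is exactly what the "even more scrambling" remark questions.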
Hi, is there any estimate on this?
What does this PR do?
Fixes #139
Before submitting