GradMatch data subset selection method making training slow #78
Can you point out which version of GradMatch you are using? Ideally, subset selection should make training faster unless something is wrong with the experimental setup. Please attach the log files so that I can analyze them and figure out the issue.
@krishnatejakk [06/27 16:40:56] train_sl INFO: DotMap(setting='SL', is_reg=True, dataset=DotMap(name='cifar10', datadir='../storage', feature='dss', type='image'), dataloader=DotMap(shuffle=True, batch_size=256, pin_memory=True, num_workers=8), model=DotMap(architecture='ResNet50_224', type='pre-defined', numclasses=10), ckpt=DotMap(is_load=False, is_save=True, dir='results/', save_every=20), loss=DotMap(type='CrossEntropyLoss', use_sigmoid=False), optimizer=DotMap(type='sgd', momentum=0.9, lr=0.01, weight_decay=0.0005, nesterov=False), scheduler=DotMap(type='cosine_annealing', T_max=300), dss_args=DotMap(type='GradMatch', fraction=0.3, select_every=5, lam=0.5, selection_type='PerClassPerGradient', v1=True, valid=False, kappa=0, eps=1e-100, linear_layer=True), train_args=DotMap(num_epochs=300, device='cuda', print_every=1, results_dir='results/', print_args=['val_loss', 'val_acc', 'tst_loss', 'tst_acc', 'time'], return_args=[]))
@krishnatejakk [Screenshots of training results attached: Full dataset, GradMatch, CRAIG]
@shiyf129 What is the resolution of the images you are using while training? I am using 224x224.
@animesh-007 @shiyf129 I am working on the issue. We recently updated the OMP version in the GradMatch code, which improves its performance further. However, the new OMP version is making it slower in this case, and I will debug why. For faster training, one option is to use GradMatchPB (i.e., the per-batch version) or to revert to the previous OMP version in the GradMatch strategy code as follows:
In the import statement, remove the _V1 suffix to revert to the previous version of the OMP code (see the sketch below).
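A minimal sketch of what that import change might look like; the exact module path and solver names are assumptions and may differ across CORDS versions, so check the strategy file in your installed copy:

```python
# Hypothetical sketch of the import inside the GradMatch strategy code.
# The module path is an assumption; only the _V1 suffix removal is the point.

# Newer OMP solver (slower in this case):
# from cords.selectionstrategies.helpers.omp_solvers import OrthogonalMP_REG_Parallel_V1 as omp_solver

# Reverted (previous) OMP solver: same name without the _V1 suffix
from cords.selectionstrategies.helpers.omp_solvers import OrthogonalMP_REG_Parallel as omp_solver
```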
I use the original CIFAR-10 dataset with 32x32 image size.
@krishnatejakk I tested the GradMatchPB algorithm and set v1=False to use the previous OMP version. I compared the first 10 epochs of training between the GradMatchPB algorithm and full-dataset training; the results show that GradMatchPB takes longer and its average accuracy is relatively low. Do you know the reason for this?
GradMatchPB, beginning 10 epochs of training: [table attached]
Full dataset, beginning 10 epochs of training: [table attached]
@shiyf129 Why is subset selection happening every epoch? We usually set it to 20. Subset selection takes some time, and you don't need to select a subset every time. Furthermore, training with a 10% subset should be 10x faster than full-dataset training; from your logs, it doesn't seem that way. Can you check whether training on a fixed 10% subset of the dataset for one epoch is 10x faster than a full-dataset epoch?
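A minimal sketch of that sanity check, assuming a standard PyTorch setup; `train_set`, `model`, `criterion`, `optimizer`, and `device` are placeholders for objects already defined in your training script, not CORDS APIs:

```python
import time
import numpy as np
from torch.utils.data import Subset, DataLoader

# Assumes train_set, model, criterion, optimizer, and device already exist
# in the surrounding training script (placeholder names, not CORDS APIs).

def time_one_epoch(dataset, batch_size=256):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=8, pin_memory=True)
    start = time.time()
    model.train()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return time.time() - start

# Fixed random 10% subset -- no subset selection strategy involved,
# so this isolates pure training time from selection overhead.
idxs = np.random.choice(len(train_set), size=len(train_set) // 10, replace=False)
subset = Subset(train_set, idxs.tolist())

print("full epoch: %.1f s" % time_one_epoch(train_set))
print("10%% epoch:  %.1f s" % time_one_epoch(subset))
```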
@krishnatejakk I modified the code to select a subset every 20 epochs.
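For reference, a sketch of the relevant dss_args changes, following the field names in the DotMap config above; the 'GradMatchPB' type string and v1=False flag are assumptions about the CORDS config, so verify them against your installed version:

```python
# Sketch of the subset-selection arguments, mirroring the DotMap config above.
# 'GradMatchPB' and v1=False are assumptions about the CORDS config options.
dss_args = dict(
    type='GradMatchPB',             # per-batch variant, cheaper selection
    fraction=0.3,                   # 30% subset, as in the original config
    select_every=20,                # re-select the subset every 20 epochs
    lam=0.5,
    selection_type='PerClassPerGradient',
    v1=False,                       # fall back to the previous OMP solver
    valid=False,
    kappa=0,
    eps=1e-100,
    linear_layer=True,
)
```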
I tried to run some experiments as follows:
I am using scaled-resolution images of CIFAR-10, i.e., 224x224 resolution, and defined the ResNet-50 architecture accordingly.
Can you let me know how to speed up experiments 2 and 3? In general, a subset selection method should speed up the whole training process, right?