This is the Python implementation of the paper "SLSGD: Secure and Efficient Distributed On-device Machine Learning".
The following Python packages need to be installed via pip:
- MXNet (we use an Intel CPU cluster, so mxnet-mkl is preferred)
- Gluon-CV
- Numpy
- MPI4py
- Keras (with the TensorFlow backend; used only for dataset preparation, not for model training)
- PIL (also for dataset preparation)
- Gluon-NLP (only for the experiment of LSTM on Wikitext-2)
Users can install everything with a single command in their own virtualenv:
pip install --no-cache-dir numpy mxnet-mkl gluoncv mpi4py keras pillow gluonnlp
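To quickly verify the environment, the imports below should all succeed. This is a minimal sanity check, not part of the repository; the import names are assumed to match the pip packages above:

```python
# Sanity check: all dependencies import and MXNet reports its version.
# Import names assumed to match the pip packages listed above.
import mxnet, gluoncv, numpy, mpi4py, keras, PIL, gluonnlp
print("mxnet", mxnet.__version__)
```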
The dataset preparation scripts take the following options:

Option | Description |
---|---|
--output DATASET_DIR | the directory where the dataset will be placed |
--nsplit 100 | partition the dataset across 100 devices |
--normalize 1 | normalize the data |
--step 8 | increment of the partition size (unbalanced partition only) |
- CIFAR-10 with balanced partition:
python convert_cifar10_to_np_normalized.py --nsplit 100 --normalize 1 --output DATASET_DIR
- CIFAR-10 with unbalanced partition (see the partition-size sketch below):
python convert_cifar10_to_np_normalized_unbalanced.py --nsplit 100 --normalize 1 --step 8 --output DATASET_DIR
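For intuition, here is a minimal sketch of what an unbalanced partition with a fixed size increment could look like. The arithmetic-progression scheme is an assumption about how --step behaves, not the script's actual code:

```python
# Hypothetical illustration of --step: partition sizes grow linearly,
# so device i receives roughly base + i * step samples.
def unbalanced_partition_sizes(n_samples, nsplit, step):
    # Smallest shard size such that the arithmetic progression sums to n_samples.
    base = (n_samples - step * nsplit * (nsplit - 1) // 2) // nsplit
    sizes = [base + i * step for i in range(nsplit)]
    sizes[-1] += n_samples - sum(sizes)  # absorb the rounding remainder
    return sizes

# CIFAR-10 has 50000 training images; with --nsplit 100 --step 8 this
# yields shard sizes ranging from 104 up to 896.
print(unbalanced_partition_sizes(50000, 100, 8)[:3])
```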
The training script takes the following options:

Option | Description |
---|---|
--dir DATASET_DIR | the directory where the training dataset is placed |
--valdir VAL_DATASET_DIR | the directory where the validation dataset is placed |
--batchsize 50 | batch size of the workers |
--epochs 800 | total number of epochs |
--interval 10 | log interval |
--nsplit 100 | training data is partitioned to 100 devices |
--lr 0.1 | learning rate |
--lr-decay | learning-rate decay factor |
--lr-decay-epoch | epochs at which the learning rate decays |
--alpha 1 | weight of the moving average |
--alpha-decay | alpha decay factor |
--alpha-decay-epoch | epochs at which alpha decays |
--log | path to the log file |
--classes 10 | number of different classes/labels |
--iterations 1 | number of local iterations in each epoch |
--aggregation mean | aggregation method, "mean" or "trim" (trimmed mean; see the sketch below) |
--nbyz 2 | number of malicious workers |
--trim 4 | trimming parameter of the trimmed-mean aggregation |
--model default | name of the model, "default" means the CNN used in the paper experiments |
--seed 337 | random seed |
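The --alpha, --aggregation, and --trim options correspond to SLSGD's server-side update: the new global model is a moving average of the old global model and the aggregate of the workers' locally trained models. Below is a minimal NumPy sketch of that update, an illustration of the algorithm rather than the repository's MXNet code:

```python
import numpy as np

def trimmed_mean(models, b):
    # Coordinate-wise trimmed mean: in each coordinate, drop the b largest
    # and b smallest values across workers, then average the rest.
    stacked = np.sort(np.stack(models), axis=0)
    return stacked[b:len(models) - b].mean(axis=0)

def server_update(global_model, worker_models, alpha, aggregation="mean", b=0):
    # Aggregate the workers' models, then blend the result with the old
    # global model using the moving-average weight alpha.
    if aggregation == "trim":
        agg = trimmed_mean(worker_models, b)
    else:
        agg = np.mean(np.stack(worker_models), axis=0)
    return (1 - alpha) * global_model + alpha * agg
```

With alpha = 1 the server simply adopts the aggregate; smaller alpha keeps more of the previous global model.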
- Train with no malicious users, mean aggregation, $k=10$ workers randomly selected in each epoch:
mpirun -np 10 -machinefile hostfile python ecml_federated/slsgd.py --classes 10 --model default --nsplit 100 --batchsize 50 --lr 0.1 --alpha 1 --alpha-decay 0.9 --alpha-decay-epoch 400 --epochs 800 --iterations 1 --seed 733 --dir $inputdir --valdir $valdir -o $logfile 2>&1 | tee $watchfile
- Train with 2 malicious users, trimmed-mean aggregation, $k=10$ workers randomly selected in each epoch (see the worker-selection sketch below):
mpirun -np 10 -machinefile hostfile python ecml_federated/slsgd.py --classes 10 --model default --nsplit 100 --batchsize 50 --lr 0.1 --alpha 1 --alpha-decay 0.9 --alpha-decay-epoch 400 --epochs 800 --iterations 1 --seed 733 --nbyz 2 --trim 4 --dir $inputdir --valdir $valdir -o $logfile 2>&1 | tee $watchfile
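The random selection of $k$ workers per epoch can be coordinated over MPI without extra communication by seeding every rank identically. A hypothetical mpi4py sketch (the variable names and per-epoch seeding scheme are assumptions, not the repository's code):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()  # world == k, e.g. 10

nsplit, seed, epoch = 100, 733, 0
# Same seed on every rank, so all workers agree on which partitions
# are active this epoch without any message exchange.
rng = np.random.RandomState(seed + epoch)
chosen = rng.choice(nsplit, size=world, replace=False)
print(f"epoch {epoch}: rank {rank} trains on partition {chosen[rank]}")
```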
A demo script, experiment_script_1.sh, is provided.
When executing on the Intel vLab Academic Cluster, the following additional steps might be needed:
- manually install libfabric 1.1.0
- set I_MPI_FABRICS=ofi