# Image Classification

This folder contains TF 2.0 model examples for image classification:

* [ResNet](#resnet)
* [MNIST](#mnist)
* [Classifier Trainer](#classifier-trainer), a framework that uses the Keras
  compile/fit methods for image classification models, including:
  * ResNet
  * EfficientNet[^1]

[^1]: EfficientNet is currently a work in progress; it does not yet match the
"AutoAugment (AA)" results of
[the original version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).

For more information about other types of models, please refer to this
[README file](../../README.md).

## ResNet

Similar to the [estimator implementation](../../r1/resnet), the Keras
implementation has code for the ImageNet dataset. The ImageNet version uses a
ResNet50 model implemented in
[`resnet_model.py`](./resnet/resnet_model.py).
||
## Before you begin | ||
Please make sure that you have the latest version of TensorFlow | ||
installed and | ||
[add the models folder to your Python path](/official/#running-the-models). | ||
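
For example, assuming the repository is cloned to `$HOME/models` (the path is
an assumption for illustration), adding it to your Python path looks like:

```bash
# Adjust the path to wherever you cloned the models repository.
export PYTHONPATH=$PYTHONPATH:$HOME/models
```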

### Pretrained Models

* [ResNet50 Checkpoints](https://storage.googleapis.com/cloud-tpu-checkpoints/resnet/resnet50.tar.gz)
* ResNet50 TFHub: [feature vector](https://tfhub.dev/tensorflow/resnet_50/feature_vector/1)
  and [classification](https://tfhub.dev/tensorflow/resnet_50/classification/1)
  (a loading sketch follows below)
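
As a minimal sketch of using the TFHub feature vector module as a frozen
backbone (the 10-class head and 224x224 input size are assumptions for this
example):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Frozen pretrained ResNet50 backbone with a new classification head.
model = tf.keras.Sequential([
    hub.KerasLayer(
        "https://tfhub.dev/tensorflow/resnet_50/feature_vector/1",
        trainable=False),  # keep the pretrained weights fixed
    tf.keras.layers.Dense(10, activation="softmax"),  # example 10-class head
])
model.build([None, 224, 224, 3])  # batch of 224x224 RGB images
```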

### ImageNet preparation

Download the ImageNet dataset and convert it to TFRecord format.
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.
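
For instance, a local conversion might look roughly like the following (the
flags shown are an assumption based on the linked script; consult its README
for the authoritative usage):

```bash
# Illustrative invocation only -- verify flag names against the linked README.
python imagenet_to_gcs.py \
  --raw_data_dir=/path/to/imagenet-raw \
  --local_scratch_dir=/path/to/tfrecords \
  --nogcs_upload  # keep the TFRecords local instead of uploading to GCS
```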

Once your dataset is ready, you can begin training the model as follows:

```bash
python resnet/resnet_imagenet_main.py
```

Again, if you did not download the data to the default directory, specify the
location with the `--data_dir` flag:

```bash
python resnet/resnet_imagenet_main.py --data_dir=/path/to/imagenet
```

There are more flag options you can specify. Here are some examples:

- `--use_synthetic_data`: when set to true, synthetic data, rather than real
  data, is used;
- `--batch_size`: the batch size used for the model;
- `--model_dir`: the directory to save the model checkpoint;
- `--train_epochs`: the number of epochs to run when training the model;
- `--train_steps`: the number of steps to run when training the model. Only a
  value smaller than the number of batches in an epoch is currently supported;
- `--skip_eval`: when set to true, evaluation as well as validation during
  training is skipped.

For example, this is a typical command line for training on ImageNet with a
batch size of 128 per GPU:

```bash
python resnet/resnet_imagenet_main.py \
    --model_dir=/tmp/model_dir/something \
    --num_gpus=2 \
    --batch_size=128 \
    --train_epochs=90 \
    --train_steps=10 \
    --use_synthetic_data=false
```

See [`common.py`](common.py) for the full list of options.

### Using multiple GPUs

You can train these models on multiple GPUs using the `tf.distribute.Strategy`
API. You can read more about it in this
[guide](https://www.tensorflow.org/guide/distribute_strategy).

In this example, we have made it easy to use with just a command-line flag,
`--num_gpus`. By default this flag is 1 if TensorFlow is compiled with CUDA,
and 0 otherwise.

- `--num_gpus=0`: Uses `tf.distribute.OneDeviceStrategy` with CPU as the
  device.
- `--num_gpus=1`: Uses `tf.distribute.OneDeviceStrategy` with GPU as the
  device.
- `--num_gpus=2+`: Uses `tf.distribute.MirroredStrategy` to run synchronous
  distributed training across the GPUs.

If you wish to run without `tf.distribute.Strategy`, you can do so by setting
`--distribution_strategy=off`.
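
Roughly, `--num_gpus=2+` corresponds to a setup like the following sketch (a
simplified illustration of the strategy scope, not the scripts' actual code):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs
with strategy.scope():
    # Model variables created inside the scope are mirrored across GPUs.
    model = tf.keras.applications.ResNet50(weights=None)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```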

### Running on multiple GPU hosts

You can also train these models on multiple hosts, each with GPUs, using
`tf.distribute.Strategy`.

The easiest way to run multi-host benchmarks is to set the
[`TF_CONFIG`](https://www.tensorflow.org/guide/distributed_training#TF_CONFIG)
environment variable appropriately at each host. For example, to run using
`MultiWorkerMirroredStrategy` on 2 hosts, the `cluster` in `TF_CONFIG` should
have 2 `host:port` entries, and host `i` should have the `task` in `TF_CONFIG`
set to `{"type": "worker", "index": i}`. `MultiWorkerMirroredStrategy` will
automatically use all the available GPUs at each host.
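
Concretely, a two-host setup might look like this (the hostnames and port are
placeholders):

```bash
# On host 0:
export TF_CONFIG='{"cluster": {"worker": ["host0:12345", "host1:12345"]},
                   "task": {"type": "worker", "index": 0}}'
# On host 1:
export TF_CONFIG='{"cluster": {"worker": ["host0:12345", "host1:12345"]},
                   "task": {"type": "worker", "index": 1}}'
```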

### Running on Cloud TPUs

Note: These models will **not** work with TPUs on Colab.

You can train image classification models on Cloud TPUs using
`tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
strongly recommended that you go through the
[quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
create a TPU and GCE VM.

## MNIST

To download the data and run the MNIST sample model locally for the first
time, run the following command:

```bash
python3 mnist_main.py \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=$NUM_GPUS \
  --download
```

To train the model on a Cloud TPU, run the following command:

```bash
python3 mnist_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=tpu \
  --download
```

Note: the `--download` flag is only required the first time you run the model.

## Classifier Trainer

The classifier trainer is a unified framework for running image classification
models using Keras's compile/fit methods. Experiments should be provided in
the form of YAML files; some examples are included within the configs/examples
folder. Please see [configs/examples](./configs/examples) for more example
configurations.

The provided configuration files use a per-replica batch size, which is scaled
by the number of devices. For instance, if `batch_size` = 64, then for 1 GPU
the global batch size would be 64 * 1 = 64. For 8 GPUs, the global batch size
would be 64 * 8 = 512. Similarly, for a v3-8 TPU, the global batch size would
be 64 * 8 = 512, and for a v3-32, the global batch size is 64 * 32 = 2048.
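
As a rough sketch of the shape of such a file (the field names below are
hypothetical, not the real schema; see the files in
[configs/examples](./configs/examples) for the actual format):

```yaml
# Hypothetical excerpt -- field names are illustrative only.
runtime:
  num_gpus: 1
train_dataset:
  batch_size: 64  # per replica; scaled by the number of devices
train:
  epochs: 90
```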

### ResNet50

#### On GPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override='runtime.num_gpus=$NUM_GPUS'
```

#### On TPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/tpu.yaml
```

### EfficientNet

**Note: EfficientNet development is a work in progress.**

#### On GPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
  --params_override='runtime.num_gpus=$NUM_GPUS'
```

#### On TPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-tpu.yaml
```

Note that the number of GPU devices can be overridden on the command line
using `--params_override`, as in the GPU examples above. The TPU does not need
this override, as the device is fixed by providing the TPU address or name
with the `--tpu` flag.