Internal change
PiperOrigin-RevId: 302937425
allenwang28 authored and tensorflower-gardener committed Mar 25, 2020
1 parent 7a25758 commit 54602a6
Showing 27 changed files with 5,010 additions and 134 deletions.
214 changes: 80 additions & 134 deletions official/vision/image_classification/README.md
@@ -1,190 +1,136 @@
# Image Classification

This folder contains TF 2.0 model examples for image classification:

* [ResNet](#resnet)
* [MNIST](#mnist)
* [Classifier Trainer](#classifier-trainer), a framework that uses the Keras
  compile/fit methods for image classification models, including:
  * ResNet
  * EfficientNet[^1]

[^1]: Currently a work in progress. We cannot yet match the accuracy of the
"AutoAugment (AA)" variant reported in
[the original version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).

For more information about other types of models, please refer to this
[README file](../../README.md).

## ResNet

Similar to the [estimator implementation](../../r1/resnet), the Keras
implementation has code for the ImageNet dataset. The ImageNet
version uses a ResNet50 model implemented in
[`resnet_model.py`](./resnet/resnet_model.py).

## Before you begin
Please make sure that you have the latest version of TensorFlow
installed and
[add the models folder to your Python path](/official/#running-the-models).
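
One minimal way to do this, assuming the repository was cloned to `~/models`
(a hypothetical path) and that you want the dependencies listed in
`official/requirements.txt`:

```bash
# Hypothetical clone location; adjust to wherever you cloned tensorflow/models.
export PYTHONPATH="$PYTHONPATH:$HOME/models"
pip3 install --user -r "$HOME/models/official/requirements.txt"
```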

### Pretrained Models

* [ResNet50 Checkpoints](https://storage.googleapis.com/cloud-tpu-checkpoints/resnet/resnet50.tar.gz)

* ResNet50 TFHub: [feature vector](https://tfhub.dev/tensorflow/resnet_50/feature_vector/1)
and [classification](https://tfhub.dev/tensorflow/resnet_50/classification/1)

### ImageNet preparation

Download the ImageNet dataset and convert it to TFRecord format.
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.
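
As an illustration only, an invocation of that script might look like the
following; the flag names reflect the linked README, and the project, bucket,
and ImageNet credentials shown are placeholders, so check the script's own
documentation before running it:

```bash
# Placeholder project, bucket, and credentials; see the linked README for the
# authoritative usage of imagenet_to_gcs.py.
python3 imagenet_to_gcs.py \
--project=my-gcp-project \
--gcs_output_path=gs://my-bucket/imagenet \
--local_scratch_dir=/tmp/imagenet \
--imagenet_username=FILL_ME_IN \
--imagenet_access_key=FILL_ME_IN
```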

### ImageNet training

Once your dataset is ready, you can begin training the model as follows:

```bash
python resnet/resnet_imagenet_main.py
```

If you did not download the data to the default directory, specify its
location with the `--data_dir` flag:

```bash
python resnet/resnet_imagenet_main.py --data_dir=/path/to/imagenet
```

There are more flags you can specify. Here are some examples:

- `--use_synthetic_data`: when set to true, synthetic data, rather than real
  data, is used;
- `--batch_size`: the batch size used for the model;
- `--model_dir`: the directory in which to save the model checkpoint;
- `--train_epochs`: the number of epochs to train the model;
- `--train_steps`: the number of steps to train the model. Currently, only
  values smaller than the number of batches in an epoch are supported;
- `--skip_eval`: when set to true, evaluation as well as validation during
  training is skipped.

For example, this is a typical command line for training with ImageNet data and
a batch size of 128 per GPU:

```bash
python resnet/resnet_imagenet_main.py \
--model_dir=/tmp/model_dir/something \
--num_gpus=2 \
--batch_size=128 \
--train_epochs=90 \
--train_steps=10 \
--use_synthetic_data=false
```

See [`common.py`](common.py) for the full list of options.

### Using multiple GPUs

You can train these models on multiple GPUs using the `tf.distribute.Strategy`
API. You can read more about it in this
[guide](https://www.tensorflow.org/guide/distribute_strategy).

In these examples, we have made it easy to use multiple GPUs with a single
command line flag, `--num_gpus`. By default this flag is 1 if TensorFlow is
compiled with CUDA, and 0 otherwise.

- `--num_gpus=0`: Uses `tf.distribute.OneDeviceStrategy` with CPU as the device.
- `--num_gpus=1`: Uses `tf.distribute.OneDeviceStrategy` with GPU as the device.
- `--num_gpus=2+`: Uses `tf.distribute.MirroredStrategy` to run synchronous
  distributed training across the GPUs.

If you wish to run without `tf.distribute.Strategy`, you can do so by setting
`--distribution_strategy=off`.
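
For example, a run on a hypothetical machine with 8 GPUs, which selects
`MirroredStrategy` automatically because `--num_gpus` is greater than 1, might
look like this (paths, batch size, and epochs are placeholders):

```bash
# Hypothetical 8-GPU run; adjust paths, batch size, and epochs for your setup.
python resnet/resnet_imagenet_main.py \
--data_dir=/path/to/imagenet \
--model_dir=/tmp/resnet_model \
--num_gpus=8 \
--batch_size=1024 \
--train_epochs=90
```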

### Running on multiple GPU hosts

You can also train these models on multiple hosts, each with GPUs, using
`tf.distribute.Strategy`.

The easiest way to run multi-host benchmarks is to set the
[`TF_CONFIG`](https://www.tensorflow.org/guide/distributed_training#TF_CONFIG)
appropriately at each host. e.g., to run using `MultiWorkerMirroredStrategy` on
2 hosts, the `cluster` in `TF_CONFIG` should have 2 `host:port` entries, and
host `i` should have the `task` in `TF_CONFIG` set to `{"type": "worker",
"index": i}`. `MultiWorkerMirroredStrategy` will automatically use all the
available GPUs at each host.
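
For illustration, with two hypothetical hosts `host1` and `host2` both
listening on port 2222, host 0 could export the following before launching
training; host 1 would use `"index": 1` instead:

```bash
# Hypothetical hostnames and port; every host lists the same cluster and sets
# its own index in the task section.
export TF_CONFIG='{
  "cluster": {"worker": ["host1:2222", "host2:2222"]},
  "task": {"type": "worker", "index": 0}
}'
```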

### Running on Cloud TPUs

Note: These models will **not** work with TPUs on Colab.

You can train image classification models on Cloud TPUs using
`tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
strongly recommended that you go through the
[quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
create a TPU and GCE VM.
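
The TPU commands below assume environment variables along these lines (the TPU
name and bucket are placeholders); as noted later, `$MODEL_DIR` and `$DATA_DIR`
must be GCS paths when training on a TPU:

```bash
# Placeholder TPU name and GCS bucket; substitute your own values.
export TPU_NAME=my-tpu
export MODEL_DIR=gs://my-bucket/image-classification/model
export DATA_DIR=gs://my-bucket/imagenet
```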

To run the ResNet model on a TPU, you must set `--distribution_strategy=tpu` and
`--tpu=$TPU_NAME`, where `$TPU_NAME` is the name of your TPU in the Cloud Console.
From a GCE VM, you can run the following command to train ResNet for one epoch
on a v2-8 or v3-8 TPU:

```bash
python resnet/resnet_ctl_imagenet_main.py \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--batch_size=1024 \
--steps_per_loop=500 \
--train_epochs=1 \
--use_synthetic_data=false \
--dtype=fp32 \
--enable_eager=true \
--enable_tensorboard=true \
--distribution_strategy=tpu \
--log_steps=50 \
--single_l2_loss_op=true \
--use_tf_function=true
```

To train the ResNet to convergence, run it for 90 epochs:

```bash
python resnet/resnet_ctl_imagenet_main.py \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--batch_size=1024 \
--steps_per_loop=500 \
--train_epochs=90 \
--use_synthetic_data=false \
--dtype=fp32 \
--enable_eager=true \
--enable_tensorboard=true \
--distribution_strategy=tpu \
--log_steps=50 \
--single_l2_loss_op=true \
--use_tf_function=true
```

Note: `$MODEL_DIR` and `$DATA_DIR` must be GCS paths.

## MNIST

To download the data and run the MNIST sample model locally for the first time,
run the following command:

```bash
python3 mnist_main.py \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--train_epochs=10 \
--distribution_strategy=one_device \
--num_gpus=$NUM_GPUS \
--download
```

To train the model on a Cloud TPU, run the following command:

```bash
python3 mnist_main.py \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--train_epochs=10 \
--distribution_strategy=tpu \
--download
```

Note: the `--download` flag is only required the first time you run the model.


## Classifier Trainer
The classifier trainer is a unified framework for running image classification
models using Keras's compile/fit methods. Experiments are defined by YAML
configuration files; see [configs/examples](./configs/examples) for example
configurations.

The provided configuration files use a per-replica batch size, which is scaled
by the number of devices to obtain the global batch size. For instance, if the
per-replica batch size is 64, then for 1 GPU the global batch size is
64 * 1 = 64. For 8 GPUs, the global batch size is 64 * 8 = 512. Similarly, for
a v3-8 TPU, the global batch size is 64 * 8 = 512, and for a v3-32, the global
batch size is 64 * 32 = 2048.

### ResNet50

#### On GPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--train_epochs=10 \
--distribution_strategy=one_device \
--num_gpus=$NUM_GPUS \
--download
--config_file=configs/examples/resnet/imagenet/gpu.yaml \
--params_override="runtime.num_gpus=$NUM_GPUS"
```

#### On TPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/resnet/imagenet/tpu.yaml
```

### EfficientNet
**Note: EfficientNet development is a work in progress.**
#### On GPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=efficientnet \
--dataset=imagenet \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
--params_override="runtime.num_gpus=$NUM_GPUS"
```


#### On TPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=efficientnet \
--dataset=imagenet \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-tpu.yaml
```

Note that the number of GPU devices can be overridden on the command line using
the `--params_override` flag. The TPU does not need this override, as the device
is determined by the TPU address or name passed with the `--tpu` flag.
