Internal change
PiperOrigin-RevId: 302937425
allenwang28 authored and tensorflower-gardener committed Mar 25, 2020
1 parent 7a25758 commit 54602a6
Showing 27 changed files with 5,010 additions and 134 deletions.
214 changes: 80 additions & 134 deletions official/vision/image_classification/README.md
@@ -1,190 +1,136 @@
# Image Classification

This folder contains TF 2.0 model examples for image classification:

* [ResNet](#resnet)
* [MNIST](#mnist)
* [Classifier Trainer](#classifier-trainer), a framework that uses the Keras
  compile/fit methods for image classification models, including:
  * ResNet
  * EfficientNet[^1]

[^1]: Currently a work in progress. We cannot yet match the accuracy of the
"AutoAugment (AA)" variant reported in
[the original version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).

For more information about other types of models, please refer to this
[README file](../../README.md).

## ResNet

Similar to the [estimator implementation](../../r1/resnet), the Keras
implementation has code for the ImageNet dataset. The ImageNet
version uses a ResNet50 model implemented in
[`resnet_model.py`](./resnet/resnet_model.py).

## Before you begin
Please make sure that you have the latest version of TensorFlow
installed and
[add the models folder to your Python path](/official/#running-the-models).
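
One minimal way to do this, assuming the repository was cloned to `~/models`
(a hypothetical path) and that you want the dependencies listed in
`official/requirements.txt`:

```bash
# Hypothetical clone location; adjust to wherever you cloned tensorflow/models.
export PYTHONPATH="$PYTHONPATH:$HOME/models"
pip3 install --user -r "$HOME/models/official/requirements.txt"
```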

### Pretrained Models

* [ResNet50 Checkpoints](https://storage.googleapis.com/cloud-tpu-checkpoints/resnet/resnet50.tar.gz)

* ResNet50 TFHub: [feature vector](https://tfhub.dev/tensorflow/resnet_50/feature_vector/1)
and [classification](https://tfhub.dev/tensorflow/resnet_50/classification/1)

### ImageNet preparation

Download the ImageNet dataset and convert it to TFRecord format.
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.
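
As an illustration only, an invocation of that script might look like the
following; the flag names reflect the linked README, and the project, bucket,
and ImageNet credentials shown are placeholders, so check the script's own
documentation before running it:

```bash
# Placeholder project, bucket, and credentials; see the linked README for the
# authoritative usage of imagenet_to_gcs.py.
python3 imagenet_to_gcs.py \
--project=my-gcp-project \
--gcs_output_path=gs://my-bucket/imagenet \
--local_scratch_dir=/tmp/imagenet \
--imagenet_username=FILL_ME_IN \
--imagenet_access_key=FILL_ME_IN
```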

### ImageNet training

Once your dataset is ready, you can begin training the model as follows:

```bash
python resnet/resnet_imagenet_main.py
```

If you did not download the data to the default directory, specify its
location with the `--data_dir` flag:

```bash
python resnet/resnet_imagenet_main.py --data_dir=/path/to/imagenet
```

There are more flags you can specify. Here are some examples:

- `--use_synthetic_data`: when set to true, synthetic data, rather than real
  data, is used;
- `--batch_size`: the batch size used for the model;
- `--model_dir`: the directory in which to save the model checkpoint;
- `--train_epochs`: the number of epochs to train the model;
- `--train_steps`: the number of steps to train the model. Currently, only
  values smaller than the number of batches in an epoch are supported;
- `--skip_eval`: when set to true, evaluation as well as validation during
  training is skipped.

For example, this is a typical command line for training with ImageNet data and
a batch size of 128 per GPU:

```bash
python resnet/resnet_imagenet_main.py \
--model_dir=/tmp/model_dir/something \
--num_gpus=2 \
--batch_size=128 \
--train_epochs=90 \
--train_steps=10 \
--use_synthetic_data=false
```

See [`common.py`](common.py) for the full list of options.

### Using multiple GPUs

You can train these models on multiple GPUs using the `tf.distribute.Strategy`
API. You can read more about it in this
[guide](https://www.tensorflow.org/guide/distribute_strategy).

In these examples, we have made it easy to use multiple GPUs with a single
command line flag, `--num_gpus`. By default this flag is 1 if TensorFlow is
compiled with CUDA, and 0 otherwise.

- `--num_gpus=0`: Uses `tf.distribute.OneDeviceStrategy` with CPU as the device.
- `--num_gpus=1`: Uses `tf.distribute.OneDeviceStrategy` with GPU as the device.
- `--num_gpus=2+`: Uses `tf.distribute.MirroredStrategy` to run synchronous
  distributed training across the GPUs.

If you wish to run without `tf.distribute.Strategy`, you can do so by setting
`--distribution_strategy=off`.
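
For example, a run on a hypothetical machine with 8 GPUs, which selects
`MirroredStrategy` automatically because `--num_gpus` is greater than 1, might
look like this (paths, batch size, and epochs are placeholders):

```bash
# Hypothetical 8-GPU run; adjust paths, batch size, and epochs for your setup.
python resnet/resnet_imagenet_main.py \
--data_dir=/path/to/imagenet \
--model_dir=/tmp/resnet_model \
--num_gpus=8 \
--batch_size=1024 \
--train_epochs=90
```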

### Running on multiple GPU hosts

You can also train these models on multiple hosts, each with GPUs, using
`tf.distribute.Strategy`.

The easiest way to run multi-host benchmarks is to set the
[`TF_CONFIG`](https://www.tensorflow.org/guide/distributed_training#TF_CONFIG)
appropriately at each host. e.g., to run using `MultiWorkerMirroredStrategy` on
2 hosts, the `cluster` in `TF_CONFIG` should have 2 `host:port` entries, and
host `i` should have the `task` in `TF_CONFIG` set to `{"type": "worker",
"index": i}`. `MultiWorkerMirroredStrategy` will automatically use all the
available GPUs at each host.
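
For illustration, with two hypothetical hosts `host1` and `host2` both
listening on port 2222, host 0 could export the following before launching
training; host 1 would use `"index": 1` instead:

```bash
# Hypothetical hostnames and port; every host lists the same cluster and sets
# its own index in the task section.
export TF_CONFIG='{
  "cluster": {"worker": ["host1:2222", "host2:2222"]},
  "task": {"type": "worker", "index": 0}
}'
```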

### Running on Cloud TPUs

Note: These models will **not** work with TPUs on Colab.

You can train image classification models on Cloud TPUs using
`tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
strongly recommended that you go through the
[quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
create a TPU and GCE VM.
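
The TPU commands below assume environment variables along these lines (the TPU
name and bucket are placeholders); as noted later, `$MODEL_DIR` and `$DATA_DIR`
must be GCS paths when training on a TPU:

```bash
# Placeholder TPU name and GCS bucket; substitute your own values.
export TPU_NAME=my-tpu
export MODEL_DIR=gs://my-bucket/image-classification/model
export DATA_DIR=gs://my-bucket/imagenet
```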

To run the ResNet model on a TPU, you must set `--distribution_strategy=tpu` and
`--tpu=$TPU_NAME`, where `$TPU_NAME` is the name of your TPU in the Cloud Console.
From a GCE VM, you can run the following command to train ResNet for one epoch
on a v2-8 or v3-8 TPU:

```bash
python resnet/resnet_ctl_imagenet_main.py \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--batch_size=1024 \
--steps_per_loop=500 \
--train_epochs=1 \
--use_synthetic_data=false \
--dtype=fp32 \
--enable_eager=true \
--enable_tensorboard=true \
--distribution_strategy=tpu \
--log_steps=50 \
--single_l2_loss_op=true \
--use_tf_function=true
```

To train the ResNet to convergence, run it for 90 epochs:

```bash
python resnet/resnet_ctl_imagenet_main.py \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--batch_size=1024 \
--steps_per_loop=500 \
--train_epochs=90 \
--use_synthetic_data=false \
--dtype=fp32 \
--enable_eager=true \
--enable_tensorboard=true \
--distribution_strategy=tpu \
--log_steps=50 \
--single_l2_loss_op=true \
--use_tf_function=true
```

Note: `$MODEL_DIR` and `$DATA_DIR` must be GCS paths.

## MNIST

To download the data and run the MNIST sample model locally for the first time,
run the following command:

```bash
python3 mnist_main.py \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--train_epochs=10 \
--distribution_strategy=one_device \
--num_gpus=$NUM_GPUS \
--download
```

To train the model on a Cloud TPU, run the following command:

```bash
python3 mnist_main.py \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--train_epochs=10 \
--distribution_strategy=tpu \
--download
```

Note: the `--download` flag is only required the first time you run the model.


## Classifier Trainer
The classifier trainer is a unified framework for running image classification
models using Keras's compile/fit methods. Experiments are defined by YAML
configuration files; see [configs/examples](./configs/examples) for example
configurations.

The provided configuration files use a per-replica batch size, which is scaled
by the number of devices to obtain the global batch size. For instance, if the
per-replica batch size is 64, then for 1 GPU the global batch size is
64 * 1 = 64. For 8 GPUs, the global batch size is 64 * 8 = 512. Similarly, for
a v3-8 TPU, the global batch size is 64 * 8 = 512, and for a v3-32, the global
batch size is 64 * 32 = 2048.

### ResNet50

#### On GPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--train_epochs=10 \
--distribution_strategy=one_device \
--num_gpus=$NUM_GPUS \
--download
--config_file=configs/examples/resnet/imagenet/gpu.yaml \
--params_override="runtime.num_gpus=$NUM_GPUS"
```

#### On TPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/resnet/imagenet/tpu.yaml
```

### EfficientNet
**Note: EfficientNet development is a work in progress.**
#### On GPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=efficientnet \
--dataset=imagenet \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
--params_override="runtime.num_gpus=$NUM_GPUS"
```


#### On TPU:
```bash
python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=efficientnet \
--dataset=imagenet \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-tpu.yaml
```

Note that the number of GPU devices can be overridden on the command line using
the `--params_override` flag. The TPU does not need this override, as the device
is determined by the TPU address or name passed with the `--tpu` flag.
