
Training ImageNet on a TPU in 12.5 hours with GKE and RiseML


Google's Tensor Processing Unit (TPU), a custom-developed accelerator for deep learning, offers a fast and cost-efficient alternative for training deep learning models in the cloud: it can train a ResNet-50 model on ImageNet in 12.5 hours for the equivalent of ~$81 of TPU compute time.
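The ~$81 figure follows from the training time and the hourly TPU rate. As a rough sanity check, assuming the Cloud TPU beta price of $6.50 per TPU-hour (an assumption; check current pricing):

```python
# Back-of-the-envelope cost check, assuming $6.50 per TPU-hour
# (Cloud TPU beta pricing at the time; subject to change).
tpu_price_per_hour = 6.50
training_hours = 12.5
cost = tpu_price_per_hour * training_hours
print(f"~${cost:.2f}")  # ~$81.25
```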

At RiseML, we believe that machine learning engineers shouldn't have to worry about infrastructure. Recently, Google Kubernetes Engine (GKE), Google's managed Kubernetes offering, started providing alpha-level support for provisioning TPUs. Each TPU's lifetime is automatically bound to the lifetime of its job, so you only pay for what you actually use. The combination of GKE and RiseML offers a hassle-free machine learning infrastructure that is easy to use, highly scalable, and cost-efficient.
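Under the hood, GKE's alpha TPU support exposes TPUs as a Kubernetes resource that a pod can request like CPU or memory. A pod spec requesting a single Cloud TPU might look roughly like this (the annotation and resource names follow the alpha GKE TPU API as we understand it, and may change):

```yaml
# Sketch only: a pod that requests one Cloud TPU (8 cores) on GKE.
# When the pod terminates, the TPU it requested is released with it.
apiVersion: v1
kind: Pod
metadata:
  name: resnet-train
  annotations:
    tf-version.cloud-tpus.google.com: "1.7"  # TensorFlow version on the TPU
spec:
  restartPolicy: Never
  containers:
    - name: train
      image: gcr.io/my-project/resnet:latest  # hypothetical image name
      resources:
        limits:
          cloud-tpus.google.com/v2: 8  # one TPU v2 device = 8 cores
```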

To illustrate how to use TPUs on GKE with RiseML, we show below how to train a ResNet-50 model on ImageNet. Bringing up a GKE cluster with TPU support and installing RiseML on it only takes about 10 minutes. While Cloud TPUs are now available for everyone in public beta, please note that TPUs are still in closed alpha on GKE and RiseML. Contact us if you are interested in giving it a spin.

Preparing the model

To train on ImageNet, we'll use the bfloat16 implementation of ResNet-50 provided by Google in the TPU repository. We can get the training code for the model from GitHub:

$ git clone https://github.com/tensorflow/tpu
$ cd tpu/models/experimental/resnet_bfloat16

Next, let's define the experiment we'd like to run by creating a riseml.yml file:

project: resnet_imagenet
train:
  framework: tensorflow
  tensorflow:
    version: 1.7.0
  resources:
    tpus: 1
    cpus: 2
    mem: 2048
  run: >
    python
    --master=${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS}
    --data_dir=gs://imagenet/tpu
    --model_dir=gs://results/riseml-tpu-support/${HOSTNAME}

For this experiment, we want to run TensorFlow 1.7 and use one TPU for training. We need very little CPU and memory since all of the heavy computation happens on the TPU. We also specify the command to train the model, the endpoint to reach the TPU (provided by the environment), and the locations for training data and model output. Having training data and model output on Google Cloud Storage is currently a requirement for using TPUs.
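To make the endpoint handling concrete, here is a minimal sketch of how a training script might read the TPU address that is injected into the container environment. The variable name comes from the riseml.yml above; the assumption that it may hold a comma-separated list of `grpc://<ip>:8470` endpoints is ours:

```python
import os

def tpu_master(env=None):
    """Return the first TPU endpoint from the environment, or "" if unset.

    Sketch only: assumes the variable holds one or more comma-separated
    grpc endpoints, e.g. "grpc://10.0.0.2:8470".
    """
    env = os.environ if env is None else env
    endpoints = env.get("KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS", "")
    return endpoints.split(",")[0] if endpoints else ""

print(tpu_master({"KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS": "grpc://10.0.0.2:8470"}))
# grpc://10.0.0.2:8470
```

The returned string is what the training script passes as `--master` to connect TensorFlow to the TPU.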

Training the model with RiseML

Starting the training process is as easy as running:

$ riseml train
Syncing project (56.3 KB, 6 files) done
Started experiment 5 in background
TensorBoard: `riseml logs 5` to connect to log stream.

This will:

  1. Copy the code from our workstation to the cluster, where it is versioned
  2. Build a versioned Docker image with TensorFlow 1.7 and our code
  3. Store the versioned Docker image in the RiseML registry
  4. Provision a TPU for our experiment
  5. Start a container with our versioned image that is connected to the TPU

With RiseML, all of these steps are taken care of automatically and offloaded to the cluster.

We could now look at the logs using the riseml CLI, but following the training progress in TensorBoard is more interesting:

Top-5 accuracy of the ResNet-50 model on the validation data of ImageNet

After ~12.5 hours, the model achieves a top-5 accuracy of 93%! Once the experiment finishes training, the TPU is deprovisioned and the training container is stopped. All information about the experiment, e.g., versioned code, Docker image, and logs, is kept in RiseML. Because the TPU is automatically deprovisioned, costs are kept to a minimum.
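For readers unfamiliar with the metric: a prediction counts as a top-5 hit if the true label is among the model's five highest-scoring classes. A toy illustration with NumPy (not the actual ImageNet evaluation code):

```python
import numpy as np

def top5_accuracy(logits, labels):
    """Fraction of examples whose true label is in the 5 highest-scoring classes."""
    top5 = np.argsort(logits, axis=1)[:, -5:]  # indices of the 5 best classes per example
    hits = [label in row for row, label in zip(top5, labels)]
    return float(np.mean(hits))

# Two toy examples over 7 classes: the first label is in its top 5, the second is not.
logits = np.array([[0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.05],
                   [0.7, 0.1, 0.2, 0.6, 0.3, 0.4, 0.5]])
labels = np.array([4, 1])
print(top5_accuracy(logits, labels))  # 0.5
```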


Running RiseML on GKE gives you an easy-to-use, highly scalable, and cost-efficient machine learning infrastructure. You don't need to worry about system administration or DevOps tasks, so you can focus on machine learning itself. Contact us if you are interested in giving RiseML on GKE with TPUs a spin!
