In Getting Started with Deep MNIST and TensorFlow on iOS, I walked through the process of getting TensorFlow set up such that we can perform inference on a pre-trained model directly on an iOS device. Even though we were able to get an iPad Pro to classify 5,000 images of dimensions 28x28 in a little over 5 seconds, we can do even better by leveraging the compute power modern GPUs provide on data-parallel tasks. While TensorFlow offers GPU support for CUDA- and OpenCL-enabled devices, iOS supports neither, so in this article, we’ll implement the inference pipeline ourselves with Metal.
Background
Our CPUs are good at performing long-running tasks such as compiling code or rendering audio on a handful of cores (2-4 on today’s iOS devices). This is known as task parallelism. Meanwhile, our GPUs are optimized for running hundreds or thousands of short operations simultaneously – think applying a transformation matrix to thousands of vertices in a 3D game. This is known as data parallelism. While the exact architectural details of Apple’s AX chips are not available, we know the A9X chip found in the iPad Pro has 12 GPU cores, and each of those likely contains several processing elements.
Since performing inference with deep networks involves repeating lots of short calculations over millions of elements in a data-parallel fashion, we can get a noticeable speed-up by moving this work to the GPU. More concretely, the convolutions and matrix multiplies that are typically performed as data is propagated through a deep network can be parallelized at the data level. For example, a matrix multiply can be thought of as N * N dot products of length N, each of which can be computed independently. Similarly, in a convolution, a small filter (of size 5x5 in the Deep MNIST example) is multiplied by a region of the same size surrounding each pixel in an image. Each of those operations can be performed independently as well.
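To make this concrete, here is a small illustrative sketch (not part of the actual pipeline) showing how each entry of a matrix product is an independent dot product:

```python
import numpy as np

# Illustration only: each of the N * N entries of C is an independent
# dot product of length N, so a GPU can compute them all in parallel.
def matmul_as_dot_products(A, B):
    N = A.shape[0]
    C = np.empty((N, N), dtype=A.dtype)
    for i in range(N):
        for j in range(N):
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C
```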
Metal
Since iOS does not support CUDA or OpenCL, we’ll have to use Metal to perform work on the GPUs found in iOS devices. Prior to iOS 10, we’d have to implement the programs that run on the GPU (known as kernels) ourselves. While the Metal Shading Language is quite similar to OpenCL’s variant of C, writing high-performance kernels is as much an art as a science. For example, understanding GPU memory access patterns, taking advantage of local memory across workgroups, and avoiding operations that are expensive on GPUs (such as modulus) can involve weeks of work and mathematical tricks. Moreover, the ALUs on Apple’s AX chips are only 16 bits wide, so if we implement a naïve 1:1 port of an OpenCL kernel that uses 32-bit floats (instead of 16-bit floats, known as halfs), we’ll see subpar performance.
To make this easier, Apple introduced support for deep network operations in the Metal Performance Shaders framework in iOS 10. This API is optimized to squeeze every drop of power out of the GPUs found in AX chips, and it saves us from having to write Metal kernels ourselves that perform convolutions, matrix multiplications, and more.
Getting Started
We’ll build on top of the Getting Started with Deep MNIST and TensorFlow on iOS article and move inference to the GPU with Metal. To get started, we’ll link the target with the MetalPerformanceShaders framework and import its umbrella header file:
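```objc
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
```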
One of the things we did in our TensorFlow implementation was load a graph of our deep network. This is no longer necessary as we’ll hardcode the structure of our network with Metal APIs (plus, Metal doesn’t know anything about TensorFlow graphs anyway).
Metal also doesn’t know anything about our exported “checkpoint” file containing our learned parameters. Instead of writing a parser for it, we can simply modify our training script to export each of our 8 variables (4 weight tensors + 4 bias vectors) in a binary format. The resulting file will simply be a list of floating-point (IEEE 754) values stored in C (row-major) order. In other words, if we have a 2D matrix of dimensions 4x4, the resulting file will be 64 bytes in size since we have 16 floats that are each 4 bytes in size; further, the data will be laid out row-wise.
One thing we have to watch out for is the order in which Metal expects learned parameters: [outputChannels][{source/kernel}Height][{source/kernel}Width][inputChannels]. This differs from TensorFlow, which stores them in [{source/kernel}Height][{source/kernel}Width][inputChannels][outputChannels] order.
To re-order the dimensions of a tensor before exporting it, we can use the tf.transpose function. Here is how we export W_conv1 at the end of our training script:
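A sketch of what that export can look like (the file name is an assumption; .eval() assumes we’re inside the training session):

```python
# Permute from TensorFlow's [kernelHeight][kernelWidth][inputChannels][outputChannels]
# order to Metal's [outputChannels][kernelHeight][kernelWidth][inputChannels] order,
# then dump the raw float32 values in row-major (C) order.
W_conv1_metal = tf.transpose(W_conv1, perm=[3, 0, 1, 2])
W_conv1_metal.eval().tofile('W_conv1.dat')
```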
We permute the tensor such that the output channels come first. Then, we export it as a binary tensor of floats.
The rest of the data is exported in a similar fashion:
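For example (variable names follow the Deep MNIST tutorial; file names are again assumptions):

```python
b_conv1.eval().tofile('b_conv1.dat')
tf.transpose(W_conv2, perm=[3, 0, 1, 2]).eval().tofile('W_conv2.dat')
b_conv2.eval().tofile('b_conv2.dat')
# The fully-connected weights were flattened to 2D for the matrix multiply;
# reshape them back to 4D before permuting (explained below).
tf.transpose(tf.reshape(W_fc1, [7, 7, 64, 1024]), perm=[3, 0, 1, 2]).eval().tofile('W_fc1.dat')
b_fc1.eval().tofile('b_fc1.dat')
tf.transpose(tf.reshape(W_fc2, [1, 1, 1024, 10]), perm=[3, 0, 1, 2]).eval().tofile('W_fc2.dat')
b_fc2.eval().tofile('b_fc2.dat')
```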
Note that we don’t have to re-order the bias variables, as they’re 1D vectors. Moreover, the original code flattens W_fc1 and W_fc2 into 2D matrices to perform matrix multiplies. To re-order their columns in 4D space, we have to re-shape them back into 4D tensors prior to permuting the dimensions.
Now we can re-run the training script to export our 8 variables as binary tensors.
Loading the Model
Next, we’ll drag the 8 binary files into our Xcode project (be sure to include them as bundle resources as well).
We’ll also define a helper function for loading them. Since the data is already in the correct format, this is pretty straightforward:
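A minimal sketch of such a helper (the name loadTensor matches how it’s used below; the file extension and error handling are assumptions):

```objc
NSData *loadTensor(NSString *name, NSUInteger count) {
    // Each file holds raw little-endian float32 values in row-major order.
    NSString *path = [[NSBundle mainBundle] pathForResource:name ofType:@"dat"];
    NSData *data = [NSData dataWithContentsOfFile:path];
    NSCAssert(data.length == count * sizeof(float), @"Unexpected tensor size");
    return data;
}
```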
Now, we’ll define a new -testGPU: method for running inference on the GPU. We’ll start by loading our weights and biases using the loadTensor function we just defined:
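The element counts follow the network’s dimensions in Metal’s ordering:

```objc
NSData *W_conv1 = loadTensor(@"W_conv1", 32 * 5 * 5 * 1);
NSData *b_conv1 = loadTensor(@"b_conv1", 32);
NSData *W_conv2 = loadTensor(@"W_conv2", 64 * 5 * 5 * 32);
NSData *b_conv2 = loadTensor(@"b_conv2", 64);
NSData *W_fc1 = loadTensor(@"W_fc1", 1024 * 7 * 7 * 64);
NSData *b_fc1 = loadTensor(@"b_fc1", 1024);
NSData *W_fc2 = loadTensor(@"W_fc2", 10 * 1 * 1 * 1024);
NSData *b_fc2 = loadTensor(@"b_fc2", 10);
```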
We’ll also load the test images we’ll be working with, as well as their labels. This is nearly identical to the code in the original article, so I won’t repeat it here.
Metal Pipeline
Now, we’re ready to create the Metal pipeline, starting with the Metal device and the command queue we’ll use to submit work to it:
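Something along these lines:

```objc
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLCommandQueue> queue = [device newCommandQueue];
```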
If you’ve worked with OpenCL before, you’ll find a lot of the terminology familiar.
Convolutional Layers
Next, we’ll have to set up the structure of our deep network in code. If you look back at our training script, you’ll note we start with a convolutional layer that uses a 5x5 filter, unit stride, zero padding, 1 input channel, 32 output channels, and a ReLU activation function:
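From the Deep MNIST tutorial (weight_variable, bias_variable, and conv2d are the tutorial’s helper functions):

```python
W_conv1 = weight_variable([5, 5, 1, 32])  # 5x5 filter, 1 input channel, 32 output channels
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)  # unit stride, zero ('SAME') padding
```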
To translate this to Metal, we set up an MPSCNNConvolution:
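A sketch of that setup (the variable names are my own; the MPS calls are the iOS 10 API):

```objc
MPSCNNNeuronReLU *relu = [[MPSCNNNeuronReLU alloc] initWithDevice:device a:0];

MPSCNNConvolutionDescriptor *conv1descriptor =
    [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:5
                                                            kernelHeight:5
                                                    inputFeatureChannels:1
                                                   outputFeatureChannels:32
                                                            neuronFilter:relu];
MPSCNNConvolution *conv1 =
    [[MPSCNNConvolution alloc] initWithDevice:device
                        convolutionDescriptor:conv1descriptor
                                kernelWeights:(const float *)W_conv1.bytes
                                    biasTerms:(const float *)b_conv1.bytes
                                        flags:MPSCNNConvolutionFlagsNone];

// Describes the [28][28][32] output, stored as float16s.
MPSImageDescriptor *conv1outdescriptor =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
                                                   width:28
                                                  height:28
                                         featureChannels:32];
```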
We directly pass in the kernel width/height and the number of input/output channels; we also specify the ReLU activation function. The default stride is already {1,1} and the default padding is already zero, so we don’t have to explicitly set those. If you’d like to change them, see the edgeMode and strideInPixels{X/Y} properties on MPSCNNConvolution. conv1outdescriptor simply describes the format and dimensions of the output matrix: [28][28][32]. This is also the size of h_conv1 in the training script. Note that we use float16s for storage and computation as the AX GPUs have 16-bit ALUs. Metal will convert our weights and biases from 32-bit floats automatically.
Next, we set up our max pooling layer. In the training script, it looks like this:
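From the Deep MNIST tutorial:

```python
def max_pool_2x2(x):
    # 2x2 kernel with a stride of 2 in each dimension
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

h_pool1 = max_pool_2x2(h_conv1)
```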
Now let’s translate it to Metal:
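A sketch, with the offset and edge mode choices explained below:

```objc
MPSCNNPoolingMax *pool = [[MPSCNNPoolingMax alloc] initWithDevice:device
                                                      kernelWidth:2
                                                     kernelHeight:2
                                                  strideInPixelsX:2
                                                  strideInPixelsY:2];
pool.offset = (MPSOffset){ 1, 1, 0 };
pool.edgeMode = MPSImageEdgeModeClamp;

// [14][14][32] output: the 2x2 max pool halves each spatial dimension.
MPSImageDescriptor *pool1outdescriptor =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
                                                   width:14
                                                  height:14
                                         featureChannels:32];
```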
The kernel size of {2,2} and stride of {2,2} are directly specified in MPSCNNPoolingMax’s initializer. Since the kernel is positioned around its center, we set its starting offset to {1,1} so that it doesn’t run off the top and left edges of the image. In other words, we want to start pooling at {0,0}, not {-1,-1}.
Since we’re dealing with even image sizes, the edge mode doesn’t matter: our 2x2 kernel will never run off the edges of the image. Still, suppose it did run off the right edge and the two in-image values were negative, with -0.412641 being the larger of the two. With zero padding, the two out-of-image values would be 0, so the max of the 2x2 region would be 0, but that’s not really correct since the zeros lie outside of the image. For this reason, it’s better to use clamped padding, which repeats the values closest to the missing ones; the padded column simply duplicates the in-image values. Now if we take the max of the 2x2 region, we get -0.412641, which is more correct.
That’s it for our first convolutional layer with max pooling. The second one is fairly similar; the only things that change are the dimensions (the 2x2 max pool effectively halves the image in both dimensions). To make this easier to follow, I defined two new constants:
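The constant names here are my own; the second layer then mirrors the first:

```objc
static const NSUInteger kImageSide = 28;               // input images are 28x28
static const NSUInteger kImageSide2 = kImageSide / 2;  // 14x14 after the first max pool

MPSCNNConvolutionDescriptor *conv2descriptor =
    [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:5
                                                            kernelHeight:5
                                                    inputFeatureChannels:32
                                                   outputFeatureChannels:64
                                                            neuronFilter:relu];
MPSCNNConvolution *conv2 =
    [[MPSCNNConvolution alloc] initWithDevice:device
                        convolutionDescriptor:conv2descriptor
                                kernelWeights:(const float *)W_conv2.bytes
                                    biasTerms:(const float *)b_conv2.bytes
                                        flags:MPSCNNConvolutionFlagsNone];

MPSImageDescriptor *conv2outdescriptor =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
                                                   width:kImageSide2
                                                  height:kImageSide2
                                         featureChannels:64];
MPSImageDescriptor *pool2outdescriptor =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
                                                   width:kImageSide2 / 2
                                                  height:kImageSide2 / 2
                                         featureChannels:64];
```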
The 32 output channels from the first convolution are now passed in as input channels to the second convolution.
For reference, the training script sets this up as:
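```python
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
```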
Now, we’re ready to move on to our final two fully-connected layers.
Fully-Connected Layers
Our first fully-connected layer maps the output of our second max pooling operation to 1024 hidden units:
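As in the Deep MNIST tutorial:

```python
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
```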
In Metal, the implementation is similar to that of our convolutional layers, but we now use MPSCNNFullyConnected instead:
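A sketch under the same naming assumptions as before:

```objc
// A fully-connected layer is a convolution whose kernel spans the entire
// input: here a 7x7 "filter" over 64 channels, producing 1024 outputs.
MPSCNNConvolutionDescriptor *fc1descriptor =
    [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:7
                                                            kernelHeight:7
                                                    inputFeatureChannels:64
                                                   outputFeatureChannels:1024
                                                            neuronFilter:relu];
MPSCNNFullyConnected *fc1 =
    [[MPSCNNFullyConnected alloc] initWithDevice:device
                           convolutionDescriptor:fc1descriptor
                                   kernelWeights:(const float *)W_fc1.bytes
                                       biasTerms:(const float *)b_fc1.bytes
                                           flags:MPSCNNConvolutionFlagsNone];

MPSImageDescriptor *fc1outdescriptor =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
                                                   width:1
                                                  height:1
                                         featureChannels:1024];
```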
The output is now simply a 1024-unit vector.
Finally, we’ll implement our second fully-connected layer, which maps our 1024-unit vector to a 10-unit vector; we’ll then take the softmax instead of using a ReLU activation function.
In the training script, this looked as follows:
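Roughly as in the Deep MNIST tutorial (dropout, if present in your script, is omitted here since it only matters during training):

```python
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1, W_fc2) + b_fc2)
```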
First, we’ll port the second fully-connected layer:
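```objc
MPSCNNConvolutionDescriptor *fc2descriptor =
    [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:1
                                                            kernelHeight:1
                                                    inputFeatureChannels:1024
                                                   outputFeatureChannels:10
                                                            neuronFilter:nil];  // no ReLU here
MPSCNNFullyConnected *fc2 =
    [[MPSCNNFullyConnected alloc] initWithDevice:device
                           convolutionDescriptor:fc2descriptor
                                   kernelWeights:(const float *)W_fc2.bytes
                                       biasTerms:(const float *)b_fc2.bytes
                                           flags:MPSCNNConvolutionFlagsNone];
```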
This is fairly similar to the first fully-connected layer, but note that we pass in nil for neuronFilter. Instead, we’ll set up a separate softmax layer:
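```objc
MPSCNNSoftMax *softmax = [[MPSCNNSoftMax alloc] initWithDevice:device];

// [1][1][10] output holding the class probabilities.
MPSImageDescriptor *softmaxOutputDescriptor =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
                                                   width:1
                                                  height:1
                                         featureChannels:10];
```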
Before we finish up, let’s also define an image descriptor for our input test images:
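```objc
// The test images come in as 32-bit floats, one channel, 28x28.
MPSImageDescriptor *inputDescriptor =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat32
                                                   width:28
                                                  height:28
                                         featureChannels:1];
```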
Even though they’re 32-bit floats, Metal will convert them to 16-bit halfs for us with no loss in accuracy (they’re just 0s and 1s).
Now we’re ready to iterate through 5,000 test examples and run each one through our pipeline.
Running the Pipeline
Before we do so, we’ll create two arrays to store the command buffer and softmax buffer for each test image. Keeping references to the command buffers will let us track and wait on any work sent to the GPU which runs asynchronously. The softmax buffers will simply store the class probability distribution for each test image.
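The array names here match how they’re used below:

```objc
NSMutableArray<id<MTLCommandBuffer>> *buffers = [NSMutableArray array];
NSMutableArray<MPSImage *> *results = [NSMutableArray array];
```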
Now, we’ll begin timing the code and run through each test image.
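A sketch of the start of that loop, assuming x is the float buffer of 28x28 test images from the original article; the rest of the loop body follows in the snippets below:

```objc
NSDate *start = [NSDate date];

for (NSUInteger i = 0; i < 5000; i++) {
    id<MTLCommandBuffer> commandBuffer = [queue commandBuffer];

    MPSImage *inputImage = [[MPSImage alloc] initWithDevice:device
                                            imageDescriptor:inputDescriptor];
    // Copy the i-th test image into the texture backing the MPSImage.
    [inputImage.texture replaceRegion:MTLRegionMake2D(0, 0, 28, 28)
                          mipmapLevel:0
                            withBytes:x + i * 28 * 28
                          bytesPerRow:28 * sizeof(float)];
```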
We create a new command buffer that will be used to encode GPU compute commands, and a new MPSImage object that will represent our current test image. We load in the image data from x with a call to -replaceRegion:....
Whereas MPSImages can be accessed from both the host and the GPU, MPSTemporaryImages can only be accessed from the GPU but are faster to work with. We’ll use MPSImages for our input images and softmax outputs; we’ll use MPSTemporaryImages for all of the tensors we allocate in between.
The +[MPSTemporaryImage prefetchStorageWithCommandBuffer:...] method can pre-allocate temporary images for us and optimize them for re-use, so we’ll call it now with all of our output image descriptors:
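Still inside the per-image loop:

```objc
    [MPSTemporaryImage prefetchStorageWithCommandBuffer:commandBuffer
                                    imageDescriptorList:@[conv1outdescriptor,
                                                          pool1outdescriptor,
                                                          conv2outdescriptor,
                                                          pool2outdescriptor,
                                                          fc1outdescriptor,
                                                          softmaxOutputDescriptor]];
```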
Now, we’ll simply allocate our temporary images and enqueue each layer onto our command buffer. This is fairly mechanical:
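A sketch of the remainder of the loop body:

```objc
    MPSTemporaryImage *c1o = [MPSTemporaryImage temporaryImageWithCommandBuffer:commandBuffer
                                                                imageDescriptor:conv1outdescriptor];
    [conv1 encodeToCommandBuffer:commandBuffer sourceImage:inputImage destinationImage:c1o];

    MPSTemporaryImage *p1o = [MPSTemporaryImage temporaryImageWithCommandBuffer:commandBuffer
                                                                imageDescriptor:pool1outdescriptor];
    [pool encodeToCommandBuffer:commandBuffer sourceImage:c1o destinationImage:p1o];

    MPSTemporaryImage *c2o = [MPSTemporaryImage temporaryImageWithCommandBuffer:commandBuffer
                                                                imageDescriptor:conv2outdescriptor];
    [conv2 encodeToCommandBuffer:commandBuffer sourceImage:p1o destinationImage:c2o];

    MPSTemporaryImage *p2o = [MPSTemporaryImage temporaryImageWithCommandBuffer:commandBuffer
                                                                imageDescriptor:pool2outdescriptor];
    [pool encodeToCommandBuffer:commandBuffer sourceImage:c2o destinationImage:p2o];

    MPSTemporaryImage *fc1o = [MPSTemporaryImage temporaryImageWithCommandBuffer:commandBuffer
                                                                 imageDescriptor:fc1outdescriptor];
    [fc1 encodeToCommandBuffer:commandBuffer sourceImage:p2o destinationImage:fc1o];

    MPSTemporaryImage *fc2o = [MPSTemporaryImage temporaryImageWithCommandBuffer:commandBuffer
                                                                 imageDescriptor:softmaxOutputDescriptor];
    [fc2 encodeToCommandBuffer:commandBuffer sourceImage:fc1o destinationImage:fc2o];

    // The softmax output is a regular MPSImage so the host can read it back.
    MPSImage *resultImage = [[MPSImage alloc] initWithDevice:device
                                             imageDescriptor:softmaxOutputDescriptor];
    [softmax encodeToCommandBuffer:commandBuffer sourceImage:fc2o destinationImage:resultImage];

    [results addObject:resultImage];
    [buffers addObject:commandBuffer];
    [commandBuffer commit];
}
```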
The final softmax buffer is created as an MPSImage rather than an MPSTemporaryImage so we can access it from the host. We also add it to our array of results, and we add the command buffer to our array of buffers. Finally, we commit the buffer for execution. Note that results will not be available immediately, as the work happens asynchronously.
Once all the work is enqueued, we wait for it to finish and log the time it took:
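```objc
for (id<MTLCommandBuffer> buffer in buffers) {
    [buffer waitUntilCompleted];
}
NSLog(@"Done in %g s", -start.timeIntervalSinceNow);
```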
On my iPad Pro, this took 3.29s, down from 5.4s (a 40% improvement). If you try running the project now, please note that Metal Performance Shaders are not available in the iOS Simulator.
Finally, we can compute the accuracy on our test set:
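A sketch of the read-back, assuming the labels array from the original article and Accelerate linked in via #import <Accelerate/Accelerate.h>:

```objc
int correctExamples = 0;
for (NSUInteger i = 0; i < results.count; i++) {
    // 10 float16 channels are stored across ceil(10/4) = 3 RGBA slices.
    uint16_t halfValues[3 * 4];
    for (NSUInteger slice = 0; slice < 3; slice++) {
        [results[i].texture getBytes:&halfValues[slice * 4]
                         bytesPerRow:4 * sizeof(uint16_t)
                       bytesPerImage:0
                          fromRegion:MTLRegionMake2D(0, 0, 1, 1)
                         mipmapLevel:0
                               slice:slice];
    }

    // Convert the half-precision values to float32 with Accelerate.
    float values[3 * 4];
    vImage_Buffer src = { halfValues, 1, 12, 12 * sizeof(uint16_t) };
    vImage_Buffer dst = { values, 1, 12, 12 * sizeof(float) };
    vImageConvert_Planar16FtoPlanarF(&src, &dst, kvImageNoFlags);

    // argmax over the 10 class probabilities.
    int predicted = 0;
    for (int c = 1; c < 10; c++) {
        if (values[c] > values[predicted]) predicted = c;
    }
    if (predicted == labels[i]) correctExamples++;
}
NSLog(@"Accuracy: %g%%", 100.0 * correctExamples / results.count);
```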
Unfortunately, getting the softmax values is a little complicated since Metal stores them in an “image” with a planar RGBA layout. Fortunately, we have the vImageConvert_Planar16FtoPlanarF function to help us. Once we have the softmax values, we can simply take their argmax and compare it to the expected class label, as we did in the original article. When I ran this, I obtained 98.6% accuracy.
Conclusion
That’s what it takes to implement inference ourselves for a deep network pre-trained with TensorFlow. Using Metal Performance Shaders yields performance that is around 40% better and reduces our dependency on TensorFlow. Since we no longer have to link in the TensorFlow and Protocol Buffers static libraries, the size of our binary drops from 40 MB to 160 KB.
The original project on GitHub has been updated with Metal support.