DeepTesla: End-to-End Learning from Human and Autopilot Driving

End-to-end steering is the driving-related AI task of producing a steering wheel value directly from a camera image.  To this end we developed DeepTesla, a simple demonstration of both end-to-end driving and in-the-browser neural network training.

Site Overview

This page is meant to give students a simple demonstration of using convolutional neural networks for end-to-end steering.  Readers are expected to have a basic understanding of neural networks.  When a user loads the page, an end-to-end steering model begins training: downloading batches of images and steering wheel values, performing forward and backward passes, and visualizing the individual layers of the network.

Below we can see an example of what the demo looks like while training a model.

At the very top of the page you’ll see an area which contains some metrics about our network:

  • Forward/backward pass (ms): the amount of time it took the network to perform a forward or backward pass on a single example.  This metric is important for both training and evaluation.
  • Total examples seen / unique: the total number of examples the network has trained on, as well as the number of unique examples.
  • Network status: the current operation the network is performing, either training or fetching data.

The metrics box also contains the loss graph.  This is a live graph that updates after every 250 training examples.  The X-axis shows the number of examples seen by the currently loaded network, and the Y-axis shows the value of the loss function over the last 250 examples.  Ideally, we should see the loss decrease over time.

Immediately below the metrics box, you’ll see the editor.  This is how the user interacts with the network: by specifying the organization, types, and parameters of the layers, as well as the parameters of the training algorithm.  The input to this editor must be a single valid JSON object containing two keys: “network” and “trainer”.

After editing the parameters, you’ll want to reload the network and begin training it.  To do so, click the “Restart Training” button in the lower-left corner of the editor.  This will send the JSON to our training web worker, and it will be parsed and loaded as a ConvNetJS model.

Below the editor is another area, where the visualization takes place.  Upon first opening the page, you’ll see the layer-by-layer visualization: images for each layer in the network showing its activations for an arbitrary training example.  While the network is training, you’ll notice the activations change, reflecting the new features the network is learning.

The very first layer will always be the input layer (and the last layer will always be a single neuron).  For each input example, we show the actual steering angle for that example as well as the currently predicted steering angle.

Depending on the layer type, you’ll see different kinds of visualization.  For convolutional layers, we create canvas objects containing the actual activations at each neuron; these will look remarkably similar to the input image.  We only visualize the filters that produced the activations when they are larger than 3×3.

There is one more type of visualization: video validation.  To get there, find the button in the lower-right corner of the network editor.

Upon clicking the video visualization button, the layer visualization is replaced by a video clip recorded from a Tesla while driving.  The currently loaded network evaluates each frame of the video and predicts the steering wheel angle.  The network continues training while the video plays, so (if your network is working) you can see it become more and more accurate as more training examples are seen.  When the video finishes, it starts over again.

At the bottom of the video, we visualize some information about the current performance of the model.  On the far left, we plot the actual steering wheel value in blue, the predicted value in white, and the difference between them in red.  To the right, we draw two steering wheels representing the actual and predicted values for the current frame.  To the right of the steering wheels, we show the current frame number, the forward-pass time in milliseconds, and the average error (the total absolute difference between actual and predicted values divided by the number of frames evaluated).

You’ll see a green box around the lower third of the video player.  This box shows the portion of the image being used as input to the network for evaluation (the coordinates and size of this box cannot be changed).

On the far right of the video information box, we see rapidly changing black and white bars.  This is a simple 17-bit sign-magnitude barcode, a hack we use to reliably recover the frame number and wheel value for each video frame; both are encoded into the video itself.

Training an End-to-End Model

Now let’s look at how we can improve our model.  When you first load DeepTesla, your model editor will contain this:
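The exact defaults may differ from what the demo actually ships with, but a configuration along the following lines illustrates the expected format (the layer sizes and trainer settings below are illustrative, not the demo’s actual defaults):

    {
      "network": [
        {"type": "input", "out_sx": 200, "out_sy": 66, "out_depth": 3},
        {"type": "conv", "sx": 5, "filters": 16, "stride": 2, "activation": "relu"},
        {"type": "conv", "sx": 5, "filters": 20, "stride": 2, "activation": "relu"},
        {"type": "conv", "sx": 5, "filters": 20, "stride": 2, "activation": "relu"},
        {"type": "pool", "sx": 2, "stride": 2},
        {"type": "regression", "num_neurons": 1}
      ],
      "trainer": {"method": "adadelta", "batch_size": 16, "l2_decay": 0.001}
    }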

We can see our network has an input layer of size (200, 66, 3), representing a width of 200 pixels, a height of 66 pixels, and 3 channels (red, green, blue).  Following that are three convolutional layers, a pooling layer, and a single output neuron.

For example, we may want to decrease the stride parameter of our convolutional layers; right now it is two, which means each filter moves in two-pixel steps.  If we decrease that parameter to one, our filters will be applied at more positions, passing over more of the image.  However, this also means our convolutional layers will take more time on each pass.  We can offset this by adding additional pooling after each layer, as sketched below.
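For instance, the affected layer entries might be edited as follows (a hypothetical change; the rest of the configuration stays the same):

    {"type": "conv", "sx": 5, "filters": 16, "stride": 1, "activation": "relu"},
    {"type": "pool", "sx": 2, "stride": 2},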

Below our “network” key, we see that we also supply a training algorithm, along with some parameters to that algorithm.  ConvNetJS provides several training algorithms: Adadelta, Adagrad, and standard SGD.  Each of these algorithms takes its own parameters as JSON key/value pairs.
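For example, a standard SGD trainer with momentum might be specified like this (the values here are only illustrative):

    "trainer": {"method": "sgd", "learning_rate": 0.01, "momentum": 0.9, "batch_size": 16, "l2_decay": 0.001}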

More specifics about the available algorithms and parameters can be found in the ConvNetJS documentation ( http://cs.stanford.edu/people/karpathy/convnetjs/docs.html ).

After editing our network/trainer, we can begin training it by pressing the “Restart Training” button.

After allowing our network to train for 5 minutes and evaluating its performance on our test video, we decide that we want to submit our network.  Press the “Submit Network” button to complete the assignment.

Additional Info/How it Works

Training

In ConvNetJS there are two important constructs: the network and trainer objects, and the Vol (volume) object.  Each network in ConvNetJS is specified by a JSON object containing a list of layers.
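Inside the training worker, wiring these constructs together looks roughly like the following sketch (variable names such as config are assumptions, not the demo’s actual code):

    // "config" is the parsed JSON object received from the editor.
    var layer_defs = config.network;

    // Build the network from the list of layer definitions.
    var net = new convnetjs.Net();
    net.makeLayers(layer_defs);

    // The trainer wraps the network and applies the chosen update rule.
    var trainer = new convnetjs.Trainer(net, config.trainer);

    // A Vol is ConvNetJS's basic data container: width x height x depth.
    var x = new convnetjs.Vol(200, 66, 3);
    var stats = trainer.train(x, 0.0);   // 0.0 stands in for the ground-truth wheel value
    // stats.loss, stats.fwd_time and stats.bwd_time feed the metrics box.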

To construct the images used for training, we use OpenCV.  We iterate over each frame of our example videos, extract and crop the frame, pair it with the synchronized wheel value, and push it onto one of two lists: one for training, one for validation.  After we do this for all examples, we shuffle them and pack batches of 250 images into a single image, with one example, flattened, on each row.

For the wheel values, we keep track of the synchronized values and create a JSON object containing the frame ID and the ground-truth wheel value.
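The result looks something like this (the field names here are illustrative; the actual keys may differ):

    [
      {"frame": 0, "wheel": -2.5},
      {"frame": 1, "wheel": -2.0},
      {"frame": 2, "wheel": -1.5}
    ]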

First, we need to load the image into the browser:
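A minimal sketch of that step (the URL pattern and the batch_index variable are placeholders):

    // Fetch one batch image; decoding happens in the onload callback below.
    var dimg = new Image();
    dimg.onload = function () {
      // the pixels can only be read after the image is drawn onto a canvas (next step)
    };
    dimg.src = '/data/batch_' + batch_index + '.png';   // placeholder URL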

When the image has finished loading, we blit it onto a canvas, which is how the browser lets us read back its RGB pixel values.
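A sketch of the blitting step:

    // Draw the loaded batch image onto an off-screen canvas and read back its pixels.
    var canvas = document.createElement('canvas');
    canvas.width = dimg.width;
    canvas.height = dimg.height;
    var ctx = canvas.getContext('2d');
    ctx.drawImage(dimg, 0, 0);
    var image_data = ctx.getImageData(0, 0, canvas.width, canvas.height);   // RGBA bytes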

To keep the page responsive, we use multiple threads (via Web Workers): the main thread handles visualization and user input, while another thread performs training.  Thus, the final call in our dimg.onload callback is to postMessage, which sends the image batch and wheel values to our training thread.
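Roughly like this (the worker script name and the message shape are assumptions):

    // Created once, at page load.
    var train_worker = new Worker('train_worker.js');   // placeholder script name

    // Inside dimg.onload, after reading the pixels:
    train_worker.postMessage({
      pixels: image_data.data,   // RGBA pixel array for the whole batch
      wheels: wheel_values       // synchronized steering values for this batch (assumed name)
    });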

Now that we have our image data, we need to transform it into a ConvNetJS volume, the basic unit of data representation in ConvNetJS.  In the code snippet below, we create a ConvNetJS volume of size (base_input_x, base_input_y, 3) and copy our image data into it.

Image data from a canvas is stored as a 1D array with four values for each pixel: red, green, blue, alpha.  Because our volume only contains 3 channels, we have to transform each RGBA value to RGB and set the appropriate value in our volume.
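A sketch of image_data_to_volume under those assumptions (the exact signature and any normalization in the original helper may differ):

    // rgba is a flat array of RGBA bytes for one example, laid out row by row.
    function image_data_to_volume(rgba, base_input_x, base_input_y) {
      var vol = new convnetjs.Vol(base_input_x, base_input_y, 3, 0.0);
      for (var y = 0; y < base_input_y; y++) {
        for (var x = 0; x < base_input_x; x++) {
          var i = (y * base_input_x + x) * 4;       // 4 bytes per pixel: R, G, B, A
          vol.set(x, y, 0, rgba[i]     / 255.0);    // red (scaled to [0, 1] here)
          vol.set(x, y, 1, rgba[i + 1] / 255.0);    // green
          vol.set(x, y, 2, rgba[i + 2] / 255.0);    // blue; alpha (rgba[i + 3]) is dropped
        }
      }
      return vol;
    }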

Video Evaluation

Video evaluation is more complex and requires a modern, non-mobile browser that supports HTML5 video.

First, we load a hidden video element:
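Something like the following (the file path is a placeholder):

    // A hidden <video> element that supplies the evaluation frames.
    var video = document.createElement('video');
    video.src = '/data/test_drive.mp4';   // placeholder path
    video.muted = true;
    video.loop = true;                    // the clip starts over when it finishes
    video.style.display = 'none';
    document.body.appendChild(video);
    video.play();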

Next, we create a JavaScript function that runs another function on each repaint of the browser:
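In practice this is a requestAnimationFrame loop (the function name here is an assumption):

    // Run video_to_canvas on every repaint while the video plays.
    function on_repaint() {
      video_to_canvas();
      window.requestAnimationFrame(on_repaint);
    }
    window.requestAnimationFrame(on_repaint);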

In our video_to_canvas function we copy the currently shown video frame to a canvas element.
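A sketch of video_to_canvas; the crop rectangle mirrors the green box over the lower third of the frame, but the exact coordinates and the eval_canvas element are assumptions:

    function video_to_canvas() {
      var ctx = eval_canvas.getContext('2d');   // eval_canvas: assumed off-screen canvas
      // Copy only the lower third of the current video frame (the green box region).
      ctx.drawImage(video,
                    0, video.videoHeight * 2 / 3, video.videoWidth, video.videoHeight / 3,
                    0, 0, eval_canvas.width, eval_canvas.height);
      return ctx.getImageData(0, 0, eval_canvas.width, eval_canvas.height);
    }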

Finally, we call the same image_data_to_volume function that we used on our training images.
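The evaluation step then reduces to a single forward pass (a sketch; net refers to the currently loaded ConvNetJS network):

    var image_data = video_to_canvas();
    var vol = image_data_to_volume(image_data.data, base_input_x, base_input_y);
    var predicted_wheel = net.forward(vol).w[0];   // the single regression output neuron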

To decode our barcodes, we iterate over each individual “bar” and calculate the average pixel value inside it.  If the average falls below a threshold, we decode that bar as a “0”; otherwise we decode it as a “1”.
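A sketch of that idea (the bar geometry and threshold values are assumptions, not the demo’s actual constants):

    // Decode a row of black/white bars into bits: dark bar -> 0, light bar -> 1.
    function decode_barcode(image_data, num_bits, bar_width, bar_height) {
      var bits = [];
      for (var b = 0; b < num_bits; b++) {
        var sum = 0, count = 0;
        for (var y = 0; y < bar_height; y++) {
          for (var x = b * bar_width; x < (b + 1) * bar_width; x++) {
            var i = (y * image_data.width + x) * 4;
            sum += image_data.data[i];   // the red channel is enough for black/white bars
            count++;
          }
        }
        bits.push(sum / count < 128 ? 0 : 1);
      }
      return bits;
    }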

Resources

DeepTesla: http://selfdrivingcars.mit.edu/deepteslajs/

ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/