Efficiently Querying the YouTube API with Google Appengine Pipeline

Sometimes a user on your website clicks a button, and you need to do some pretty heavy lifting in the backend to make the magic happen.  But you don’t want to make the user wait for it, and the work required may exceed appengine’s 60-second request deadline.  An easy way to let user-facing requests return quickly and postpone the hard work is to put a task on a queue so that appengine will perform the work at a later time.  You can even configure the queue to execute the task on a backend, where you will not impact the scheduling of user-facing requests and have more freedom in the resources you use.

After some success speeding up your user-facing requests with advanced task queue maneuvers, you may start wondering how you can architect other background processes to utilize this great resource.  You may even read this four-year-old article about a nifty class that automatically reschedules itself on the queue in an intricate dance of deadline dodging.

Take it from me, there is an easier way.

The task queue is great for running small, independent operations outside of user requests, but for large, interconnected tasks you should consider the pipeline library.

In appengine-land, the fundamental unit of parallelization is an HTTP request*.  So, to execute smaller portions of work in parallel, you must fire off one request for each unit of work.  You could do this with tasks on a queue, but there is no easy way for the tasks to communicate amongst themselves.  The pipeline library solves this problem and provides several convenient features for controlling program and data flow between many interconnected tasks.  You can think of the pipeline library as a wrapper around task queues that allows you to control the fan out of your tasks, collect outputs from the tasks, and easily establish dependency chains.  The pipeline docs go into pretty good detail on these fundamentals, so I’m going to spend the rest of this post talking about how we’ve used this library to implement certain features at Khan Academy.

The simplest and most natural use of the pipeline library we have is to download data from external APIs to cache in our own datastore for rapid access later.  For example, we have a pipeline job that downloads data from the YouTube Analytics API for each of the videos on our site.  With 5000 videos and counting, we want to download the data with a lot of parallel connections, but we have to make sure that we fit within the API’s rate limiting restrictions.

To do this, we:

  1. Have a master pipeline that queries YouTube to find all video IDs that a user has uploaded.  (For Sal, this is ~4000.)

  2. For every 25 videos, we spawn a child pipeline to download data about each of those videos and store that data in our datastore.
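The fan-out in step 2 comes down to slicing the list of IDs into batches of 25 and handing each batch to its own child.  Here is a minimal sketch of just that batching step in plain Python; it does not use the pipeline library, and the names chunk, plan_children and the "video-N" IDs are made up for illustration rather than taken from our code.

```python
def chunk(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_children(video_ids, batch_size=25):
    """Return one batch of IDs per child pipeline the master would spawn."""
    return list(chunk(video_ids, batch_size))


# Hypothetical IDs standing in for the result of the YouTube query.
video_ids = ["video-%d" % i for i in range(4000)]
batches = plan_children(video_ids)
```

In a real master pipeline, each of these batches would be passed as the argument to one child pipeline, so roughly 4000 videos turn into 160 children that can run in parallel.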

The control flow when a child pipeline throws an exception due to exceeding the API’s rate limiting goes like this:

  1. SoftRetryPipeline.handle_exception() logs the full exception so that we can debug any unexpected failures.

  2. If the current attempt is less than the maximum minus 1, we simply re-raise the exception.  This causes the pipeline library to reschedule this task after waiting some amount of time, as specified by the backoff parameters.

  3. If this is the final attempt, we do not re-raise the exception.  If we did, the entire pipeline job (including all of this task’s sibling tasks) would be immediately aborted.  Generally speaking, you do not want this to happen because the other tasks may still be doing useful work.
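The three steps above can be simulated in plain Python without the pipeline library.  This is only a sketch of the control flow, not our actual SoftRetryPipeline: RateLimitError, run_with_soft_retry, the 1-based attempt counter, and the backoff formula (backoff_seconds times backoff_factor raised to the number of prior failures) are all assumptions made for the example.

```python
import logging


class RateLimitError(Exception):
    """Hypothetical stand-in for the error raised when the API throttles us."""


def handle_exception(error, current_attempt, max_attempts):
    """Log the failure, then re-raise unless this was the final attempt."""
    logging.error("attempt %d of %d failed: %r",
                  current_attempt, max_attempts, error)
    if current_attempt < max_attempts:
        raise error
    # Final attempt: swallow the error so sibling tasks keep running.


def run_with_soft_retry(work, max_attempts=3,
                        backoff_seconds=15, backoff_factor=2):
    """Simulate the scheduler; return (result, delays it would have waited)."""
    delays = []
    for attempt in range(1, max_attempts + 1):
        try:
            return work(), delays
        except RateLimitError as error:
            try:
                handle_exception(error, attempt, max_attempts)
            except RateLimitError:
                # Re-raised: record the wait before the next attempt
                # (assumed exponential backoff) instead of sleeping.
                delays.append(backoff_seconds * backoff_factor ** (attempt - 1))
    # Every attempt failed and the last error was swallowed.
    return None, delays
```

A task that fails every time ends with a result of None rather than an exception, which is the "soft" part: the job as a whole carries on, and the logged exception is what you use to find out what went wrong.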

Take a look at the code to get a better understanding of how this works.

* Well, there is the BackgroundThread available on backends, but I have not really used it, and it doesn’t fit in with the other tools in appengine, which all assume that requests are the finest grain of parallel work.