Improving Khan Academy’s student knowledge model for better predictions

Recently, I have been working on improving Khan Academy’s user knowledge model to get better predictions of how each student will perform on exercises. We use this model for many things, including assessing a student’s mastery of an exercise and recommending the next piece of content for them to work through. The following is an overview of the model, with a link at the bottom to the full write-up of the work I did to improve and measure it. That write-up was meant for an internal audience, but I thought it might be interesting to others as well. Let me know if you have any questions or ideas for improvements!


Khan Academy models each student’s total knowledge state with a single 100-dimensional vector. This vector is obtained by the artful combination of many other 100-dimensional vectors, depending on how that student has interacted with exercises in the past. In addition, for every exercise a student has interacted with, we model that interaction with a 6-dimensional vector.

These feature vectors allow us to build the following statistical model to predict a student’s ability to correctly answer the next question in an exercise, even if the “next question” is the very first for that exercise.

User Knowledge Model

To make a prediction, we look up that student’s exercise-specific features and their global knowledge state features, and multiply each one by the corresponding theta. So, our job is to find the 107 theta values which will give us the highest likelihood of correctly predicting a student’s success on the next question in an exercise. A different set of theta values is found for each exercise. This allows each exercise to weight aspects of the KnowledgeState differently. The KnowledgeState should only influence predictions for exercises that are highly correlated with the exercises it is composed of.
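As a rough illustration of that multiply-and-sum step, here is a minimal Python sketch. It rests on two assumptions that are not stated in the text above: that the 107 thetas split into one bias term, six exercise-specific weights and 100 knowledge-state weights, and that the weighted sum is squashed into a probability with a logistic function. The function and constant names are hypothetical.

```python
import math

N_EXERCISE_FEATURES = 6      # per-exercise interaction features
N_KNOWLEDGE_FEATURES = 100   # global knowledge state features


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_correct(thetas, exercise_features, knowledge_state):
    """Estimate the probability of a correct answer on the next question.

    Assumption: thetas[0] is a bias term, followed by one weight per
    exercise-specific feature and one per knowledge-state feature.
    """
    if len(exercise_features) != N_EXERCISE_FEATURES:
        raise ValueError("expected 6 exercise-specific features")
    if len(knowledge_state) != N_KNOWLEDGE_FEATURES:
        raise ValueError("expected 100 knowledge-state features")

    # Prepend a constant 1.0 so the bias theta is handled like any other.
    features = [1.0] + list(exercise_features) + list(knowledge_state)
    if len(thetas) != len(features):
        raise ValueError("expected 107 theta values")

    # Multiply each feature by its theta and sum the products.
    z = sum(theta * x for theta, x in zip(thetas, features))
    return _sigmoid(z)
```

Because each exercise has its own thetas, the same student features can produce a different prediction for every exercise.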

If we compute the likelihood that a student will get the next problem correct for all exercises, we can sort the list of exercises by these likelihoods to understand which exercises are more or less difficult for this student and recommend content accordingly. One way we use this list is to offer the student “challenge cards”. Challenge cards allow the student to quickly achieve “mastery” since their history in other exercises shows us that they probably already know this exercise well.
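The sorting step can be sketched in a few lines of Python. This is only an illustration: the exercise names, the `threshold` parameter and its default of 0.9 are hypothetical, and the text does not say how the real cutoff for offering a challenge card is chosen.

```python
def rank_exercises(predicted):
    """Return exercise names ordered from most to least likely correct.

    predicted: dict mapping exercise name -> predicted probability that
    the student answers the next question correctly.
    """
    return sorted(predicted, key=predicted.get, reverse=True)


def challenge_card_candidates(predicted, threshold=0.9):
    """Pick exercises the student probably already knows well.

    The threshold is a made-up value used only for illustration.
    """
    return [name for name in rank_exercises(predicted)
            if predicted[name] >= threshold]


if __name__ == "__main__":
    # Hypothetical predictions for one student.
    predicted = {"addition_1": 0.95, "fractions_2": 0.40, "decimals_1": 0.70}
    print(rank_exercises(predicted))
    print(challenge_card_candidates(predicted))
```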

The 100-dimensional vectors are known as random components. There is one random component vector for each exercise that exists when the theta values are discovered. The vectors are computed deterministically and stored in a database alongside the theta values.

This means that a student’s performance on an exercise that was added to the site after a set of theta values was discovered will not influence any other exercise’s prediction. It cannot be added to the KnowledgeState because the random components for this exercise do not exist. It also means that we cannot predict a student’s success on this new exercise, because theta values for it do not exist either. When a student’s predicted success is null, the exercise is said to be “infinitely difficult”.
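Both limitations can be shown with a small Python sketch. The way the vectors are combined here (a sum of random component vectors, each scaled by a single per-exercise score) is purely an assumption made for illustration; the actual combination is not described in this overview. All names are hypothetical.

```python
def knowledge_state(history, random_components, dim=100):
    """Combine per-exercise random component vectors into one state vector.

    history: exercise name -> scalar summary of the student's performance
        (a stand-in; the real combination is not specified here).
    random_components: exercise name -> list of `dim` floats.
    """
    state = [0.0] * dim
    for exercise, score in history.items():
        component = random_components.get(exercise)
        if component is None:
            # Exercise added after the last discovery run: it has no
            # random component, so it cannot contribute to the state.
            continue
        for i, value in enumerate(component):
            state[i] += score * value
    return state


def predicted_success(exercise, thetas_by_exercise, predict):
    """Return a prediction, or None when no thetas exist for the exercise.

    predict: any callable that turns a theta list into a probability.
    """
    thetas = thetas_by_exercise.get(exercise)
    if thetas is None:
        return None  # the "infinitely difficult" case
    return predict(thetas)
```

With a 3-dimensional toy example, `knowledge_state({"addition_1": 0.5, "new_exercise": 1.0}, {"addition_1": [1.0, 0.0, 2.0]}, dim=3)` ignores `new_exercise` entirely, which is the behavior described above.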

The thetas we are using today were discovered in early 2013, which means that they do not account for all of the new ways students are using the site (e.g. via the Learning Dashboard).

This project sets out to achieve two goals:

  1. Upgrade the KnowledgeState mechanism so that it can understand how newly added exercises influence a student’s total knowledge state. Technically, this means computing new random component vectors, and using them during the discovery process.

  2. Discover new theta values that account for all of the new ways students use the website, along with all of the new exercises that have been added since the thetas were last discovered.

Click here to read the full details on the data collection, verification, performance analysis and conclusions of this project.