Case study: Machine Learning and Deep Learning for Knowledge Tracing in Programming Education


Knowledge tracing of students has always been a challenging task. In this article, we propose our approach to the problem of predicting students’ future performance using their attempts on earlier programming problems. Our method employs multiple Machine Learning and Deep Learning techniques to extract students’ as well as problems’ properties. In the end, we achieved competitive performance on the dataset provided by the 2nd CSEDM Data Challenge, which is held as part of the Educational Data Mining 2021 conference.

Challenge Overview

The 2nd CSEDM Challenge [1] is a part of the 5th CSEDM workshop, held during the Educational Data Mining 2021 conference. The objective of the challenge is to perform knowledge tracing in the context of programming classrooms, where we have access not only to students’ performance but also to the code they submitted.

The challenge is divided into 2 phases, Cross-semester and Within-semester. For the Cross-semester phase, we use the training data of a semester (Spring 2019) to predict the next semester (Fall 2019). In contrast, for the Within-semester phase, both training and test data come from the same semester (Fall 2019).

For each phase, there are 2 separate tasks. Both tasks use, as training data, the students’ attempts on the first 3 assignments (or 30 problems) of the course; we call this the early data. The first task requires predicting their performance on the last 2 assignments (or 20 problems), which we call the late data, while the second task asks for their grade on the paper final exam.

Let us take an example. Suppose we are participating in the first task of the Within-semester phase, and Bob is a student attending the programming course in Fall 2019. Our job is to look at Bob’s attempts on the first 30 problems (or 3 assignments, as each assignment consists of 10 problems) and predict whether he will find the last 20 problems (of the last 2 assignments; there are 5 assignments in total) easy. Since easy is a qualitative word, we need to define it quantitatively: we say that Bob finds a problem easy if he solves that problem and needs no more attempts to do so than 75% of the other students who attempted it. In other words, even if Bob solves a problem, he does not necessarily find it easy.
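The definition above can be sketched in code. A minimal sketch, assuming a hypothetical attempt log with illustrative column names (not the challenge’s actual schema):

```python
import pandas as pd

# Hypothetical attempt log: one row per (student, problem) with the number
# of attempts needed to solve it (NaN if never solved). Column names are
# illustrative, not the challenge's real schema.
attempts = pd.DataFrame({
    "student": ["bob", "ann", "joe", "eve", "bob"],
    "problem": [1, 1, 1, 1, 2],
    "n_attempts_to_solve": [2.0, 5.0, 3.0, 4.0, None],
})

def easy_labels(df, pct=0.75):
    """A student finds a problem 'easy' if she solved it using no more
    attempts than 75% of the students who attempted that problem."""
    # Per-problem 75th-percentile attempt count, broadcast back to each row.
    threshold = df.groupby("problem")["n_attempts_to_solve"].transform(
        lambda s: s.quantile(pct))
    solved = df["n_attempts_to_solve"].notna()
    return solved & (df["n_attempts_to_solve"] <= threshold)

attempts["easy"] = easy_labels(attempts)
```

Note that an unsolved problem (Bob’s second row) is never labeled easy, regardless of the threshold.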

For the second task, we look at the same data (Bob’s attempts on the first 3 assignments) to predict Bob’s grade in his final exam.

This article primarily aims to describe our approach to the 2 tasks in the Within-semester phase.


Dataset

The dataset contains anonymized information on 329 and 490 students during the Spring 2019 and Fall 2019 semesters, respectively. Most students attempted a majority of the 50 problems. Furthermore, for each {student, problem} pair, there are often multiple attempts (since students usually keep submitting until they solve the problem). In sum, there are more than 100k attempts, each accompanied by the Java program the student submitted.

Questions that can be answered with the dataset include:

  • How many attempts did each student make on each problem?
  • Which students successfully solved which problems?
  • Who attempted but could not solve which problems?
  • What errors did each student get when running her programs?
  • What do their programs look like?
  • Etc.
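Most of these questions reduce to simple group-by aggregations. A minimal sketch on a toy attempt log (column names are illustrative assumptions):

```python
import pandas as pd

# Toy attempt log: one row per submission (columns are assumed names).
log = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s2"],
    "problem": ["p1", "p1", "p1", "p2", "p2"],
    "solved":  [False, True, True, False, False],
})

# How many attempts did each student make on each problem?
n_attempts = log.groupby(["student", "problem"]).size()

# Which (student, problem) pairs were eventually solved?
solved = log.groupby(["student", "problem"])["solved"].any()

# Who attempted but never solved a problem?
unsolved = solved[~solved]
```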


Evaluation

The first task requires predicting, for each {student, late problem} pair, whether the student will find the problem easy. The predictions should be probabilistic (instead of binary). The main metric for this task is the Area Under the ROC Curve (AUC); higher is better.

The second task’s target is each student’s grade in the final exam, a value between 0 and 100, inclusive. The predictions are assessed by computing their Mean Squared Error (MSE) against the ground truth; lower is better.
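Both metrics are available in scikit-learn; a small illustration with made-up values:

```python
from sklearn.metrics import roc_auc_score, mean_squared_error

# Task 1: probabilistic "easy" predictions, scored by AUC (higher is better).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.3]
auc = roc_auc_score(y_true, y_prob)  # every positive outranks every negative

# Task 2: exam-grade predictions in [0, 100], scored by MSE (lower is better).
grades_true = [85.0, 60.0, 72.0]
grades_pred = [80.0, 65.0, 70.0]
mse = mean_squared_error(grades_true, grades_pred)
```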

More details can be found on the Challenge’s official website [1].

Our approach

The usual approach for tabular data is to do feature engineering followed by applying a canonical Machine Learning model to predict the output. On the other hand, if the data is unstructured (such as texts or images), the state-of-the-art method is to rely on an end-to-end Deep Neural Network. However, there is no consensus on a one-size-fits-all approach when the dataset contains both tabular and unstructured data. And this is exactly our case.

We need to find a way to incorporate both types of data to achieve the best performance.

Our approach prioritizes the tabular format of the data. We extract features from both the tabular and unstructured data before concatenating them into a feature table. A Machine Learning model is then trained on this table with regularization to predict the outputs.

Feature Extraction

We group the features into 3 categories: student features, problem features, and code features. Student features and problem features mostly consist of students’ and problems’ statistics, respectively. Code features are extracted from the actual programs the students submitted.

Student features

  • The number of attempted problems.
  • The number of solved problems.
  • The number of problems that the student finds easy.
  • The ratio of solved over attempted problems.
  • The ratio of easy over attempted problems.
  • The min, mean, median, max number of attempts to solve a problem.
  • The min, mean, median, max number of attempts on a problem.
  • The number of problems solved on the first try.
  • The time of the first attempt on an assignment.
  • The number of syntax errors made.
  • The mean solving speed.
  • The concepts corresponding to solved problems. For this feature, we take 9 concepts from the list of concepts for these problems [2]. Note that this data sheet is provided by an anonymous researcher, and is not guaranteed to be correct.

All student features, except the 9 features concerning concepts, are extracted separately for each assignment.
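The per-student aggregations behind these statistics can be sketched as follows, again with an illustrative toy log rather than the real schema:

```python
import pandas as pd

# Toy per-attempt log; column names are illustrative, not the official schema.
log = pd.DataFrame({
    "student": ["s1", "s1", "s1", "s2"],
    "problem": ["p1", "p1", "p2", "p1"],
    "solved":  [False, True, True, True],
})

# First collapse attempts into one row per (student, problem).
per_pair = log.groupby(["student", "problem"]).agg(
    n_attempts=("solved", "size"),
    solved=("solved", "any"),
)

# Then aggregate per student into a feature row.
features = per_pair.groupby("student").agg(
    attempted=("solved", "size"),          # number of attempted problems
    n_solved=("solved", "sum"),            # number of solved problems
    mean_attempts=("n_attempts", "mean"),
    max_attempts=("n_attempts", "max"),
)
features["solve_ratio"] = features["n_solved"] / features["attempted"]
```

In our actual pipeline the same kind of aggregation is repeated separately for each of the 3 early assignments.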

Problem features

  • The number of students who solved the problem.
  • The mean number of attempts a student made on the problem.

Code features

  • For each student, we extract the number of distinct Java keywords she has used in her successful submissions. The list of keywords is referenced from [3].
  • We use CodeBERT [4] to train a classification model that inputs a pair of {a successful program that solves an early problem, a late problem} and outputs whether the student who wrote that program will find that late problem easy. This prediction is then used as a feature for the main model.
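Running CodeBERT itself requires downloading a large pretrained model, but the stacking idea (feeding a classifier’s out-of-fold prediction to the main model as an extra feature) can be sketched with a lightweight stand-in text classifier; the tiny corpus and labels below are purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-in for the CodeBERT feature: a simple classifier over raw program
# text whose out-of-fold predicted probability becomes one feature column.
programs = [
    "for (int i = 0; i < n; i++) sum += a[i];",
    "while (true) { break; }",
    "if (x > 0) return x; else return -x;",
    "int f(int n) { return n <= 1 ? 1 : n * f(n - 1); }",
]
finds_late_problem_easy = np.array([1, 0, 1, 0])

texts = CountVectorizer(token_pattern=r"\w+").fit_transform(programs)
clf = LogisticRegression(max_iter=1000)

# Out-of-fold predictions avoid leaking training labels into the feature.
code_feature = cross_val_predict(
    clf, texts, finds_late_problem_easy, cv=2, method="predict_proba")[:, 1]
```

The out-of-fold step matters: predicting on the same rows the classifier was trained on would make the feature look far more informative than it is.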

For task 1, we use all features except the student feature about syntax errors, as it does not improve validation performance. For task 2, since there is no information about the problems in the final exam, the problem features are excluded.

Model training

Since the dataset is quite small (hundreds of students and dozens of problems), we employ regularization techniques to stabilize the models and prevent over-fitting. After extracting all the features, we standardize each of them independently, then balance the labels by duplicating instances and adding normally-distributed noise.
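A minimal sketch of this preprocessing pipeline, with made-up numbers and an assumed noise scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy feature table with imbalanced binary labels (illustrative values only).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 0, 1])

# 1) Standardize each feature independently.
X_std = StandardScaler().fit_transform(X)

# 2) Balance the labels by duplicating minority-class instances.
n_extra = (y == 0).sum() - (y == 1).sum()
extra_idx = rng.choice(np.flatnonzero(y == 1), size=n_extra, replace=True)
X_bal = np.vstack([X_std, X_std[extra_idx]])
y_bal = np.concatenate([y, y[extra_idx]])

# 3) Add small Gaussian noise as a regularizer.
X_aug = X_bal + rng.normal(scale=0.1, size=X_bal.shape)
```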

We tried various types of models and used cross-validation to select the best model for each task. While some state-of-the-art models like boosting trees and neural networks can produce very good results, we observed that they were unstable across different random initializations. In the end, Ridge regression is chosen for the first task and Random Forest for the second. The implementations of both models are taken from scikit-learn [5].
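The selection procedure can be sketched with scikit-learn’s cross-validation utilities on synthetic data (the candidate set and hyperparameters here are illustrative, not our tuned settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature table: a noisy linear target.
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(size=100)

# Compare candidate models by cross-validated MSE and keep the best.
models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {
    name: -cross_val_score(m, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    for name, m in models.items()
}
best = min(scores, key=scores.get)
```

On this linear toy target Ridge wins; on the real tables the winner differed per task, which is why we ran the comparison separately for each.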


Results

For task 1, we obtained an AUC score of 0.7944, which remarkably outperforms the baseline (0.7565) and earns the second runner-up position in the contest. For task 2, our model reports an MSE of 196.5, meaning it can predict students’ paper final exam grades within about 14 points on average, or roughly 1.4 letter grades. Both results are better than those of the winning models of the Cross-semester phase (0.7919 and 234.1, respectively).
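As a quick sanity check of the grade-scale interpretation (assuming one letter grade spans roughly 10 points):

```python
import math

# An MSE of 196.5 corresponds to a root-mean-squared error of ~14 grade
# points, i.e. roughly 1.4 letter grades under the 10-points-per-letter
# assumption above.
mse = 196.5
rmse = math.sqrt(mse)
letter_grades = rmse / 10.0
```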

What did not work?

In this section, we list our unsuccessful efforts at improving model performance.

  • Using data from both semesters (Spring 19 and Fall 19) for training, with an extra feature to separate instances from the 2 semesters. However, experiments showed that this gives lower validation performance than using only the Fall 19 data.
  • Using one-hot encoding to encode the 20 late problems. In fact, the 2 problem features stated above give a similar effect to the 20 one-hot features.
  • Adding the feature of student’s average code length.
  • Adding nonlinearities to the features (by taking their squared or square root values).
  • Using one-hot encoding of the Java keywords.
  • Using GraphCodeBERT [6] in place of CodeBERT, which did not improve cross-validation results.
  • Adding the min, median, max solving speed of each student. In the final model, we only use the mean.
  • Experimenting with other models, such as Logistic Regression, Lasso, MARS, Extremely Randomized Trees, LightGBM, XGBoost, and MLP.
  • Experimenting with Temporal-CodeBERT. Motivated by Temporal-ASTNN [7], we put an LSTM on top of the CodeBERT model described above. Instead of taking only one solution at a time (as CodeBERT does), Temporal-CodeBERT takes a sequence of 3 solutions from the same student on 3 problems (one from each of the 3 early assignments). The LSTM is intended to capture the changes in the student’s knowledge over the 3 assignments before predicting her performance on the last 2 assignments. However, cross-validation did not show encouraging results.


References

  • [1] T. Price and Y. Shi, 2nd CSEDM Data Challenge, website.
  • [2] Concept list for the problems in the 2nd CSEDM Data Challenge, sheet.
  • [3] List of all Java keywords, webpage.
  • [4] Z. Feng et al., CodeBERT: A Pre-Trained Model for Programming and Natural Languages, 2020, arXiv.
  • [5] F. Pedregosa et al., Scikit-learn: Machine Learning in Python.
  • [6] D. Guo et al., GraphCodeBERT: Pre-training Code Representations with Data Flow, 2020, arXiv.
  • [7] Y. Mao et al., Knowing “When” and “Where”: Temporal-ASTNN for Student Learning Progression in Novice Programming Tasks, 2021, paper.
