Deep Learning with Code Data

Connor Shorten
Jul 15, 2021

Programming languages, such as Python or Java, have become a core application area of Deep Learning. OpenAI and GitHub recently unveiled “Copilot”, along with a paper describing the technology and the underlying “Codex” models. Copilot takes the GPT-3 language modeling approach and applies it to datasets made of code. These datasets are typically scraped from open-source repositories on GitHub. Platforms that help prospective Software Engineers prepare for coding interviews have also been used, with Codeforces as a notable source of this data. More specifically, Codex uses 159 GB of filtered GitHub data for generative pre-training, and the models are then fine-tuned on 10,000 competitive programming problems and 40,000 GitHub projects implementing continuous integration (CI). If interested, here is a video explaining the technical details behind Codex:

Coding Problems adapted for Supervised Learning

Codex is an auto-regressive generative model. That means it has been trained to take a sequence as input and predict the token that would continue the sequence at the rightmost position. For example, given a sequence such as “I am going for a walk outside to get some”, the model would predict “air” to continue the sequence.
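To make the idea of next-token prediction concrete, here is a minimal sketch. It is not how Codex works internally: it swaps the Transformer for a toy bigram counter that only looks at the rightmost token, and the names `train_bigram`, `predict_next`, and `generate` are made up for this illustration. What it does share with Codex is the auto-regressive loop: predict one token, append it, and repeat.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, tokens):
    """Return the most frequent continuation of the rightmost token, or None."""
    followers = counts.get(tokens[-1])
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, prompt, max_new_tokens=5):
    """Auto-regressive loop: predict a token, append it, repeat."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = predict_next(counts, tokens)
        if nxt is None:  # the toy model has never seen this token
            break
        tokens.append(nxt)
    return " ".join(tokens)


# Tiny hypothetical training corpus, only for illustration.
corpus = [
    "I am going for a walk outside to get some air",
    "we went outside to get some air",
]
model = train_bigram(corpus)
completion = generate(
    model, "I am going for a walk outside to get some", max_new_tokens=1
)
print(completion)
```

A real model conditions on the whole sequence rather than just the last token, and it outputs a probability for every token in the vocabulary, but the generation loop has the same shape.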

There are generally two ways of thinking about how to use these kinds of models. The first strategy is to unify all supervised learning tasks into this generative framework. This was most famously demonstrated with the T5 models. The T5 framework shows how supervised learning tasks can be unified in this way by prepending a description of the task to the input text. So in the T5 framework, we take as input “text classification: This was the worst movie I have ever seen” or “natural language inference: premise: I really do not like cake hypothesis: I had cake for my birthday”, and the model generates the label as text.
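As a rough sketch of what "prepending the task" means in practice, the helper below turns a task name and its input fields into a single input string. The function name `to_text_to_text` and the field names are assumptions made for this example, and the prefixes follow the wording used in this article rather than the exact strings from the T5 paper.

```python
def to_text_to_text(task, **fields):
    """Serialize a task name and its input fields into one prompt string."""
    if len(fields) == 1:
        # Single-input tasks: just "task: text".
        (value,) = fields.values()
        return f"{task}: {value}"
    # Multi-input tasks: name each field so the model can tell them apart.
    parts = [f"{name}: {value}" for name, value in fields.items()]
    return f"{task}: " + " ".join(parts)


sentiment_prompt = to_text_to_text(
    "text classification",
    text="This was the worst movie I have ever seen",
)
nli_prompt = to_text_to_text(
    "natural language inference",
    premise="I really do not like cake",
    hypothesis="I had cake for my birthday",
)

print(sentiment_prompt)
print(nli_prompt)
```

The appeal of this design is that the model, loss, and decoding procedure stay the same for every task. Only the string changes.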

The second strategy is Transfer Learning. The algorithm is to take the pre-trained neural network, remove the last layer that maps into token predictions over the entire vocabulary (usually somewhere between 30k and 55k tokens), and replace it with a classification layer for a supervised learning task (say 2 outputs for positive/negative sentiment classification). These neural networks usually have 12 or more layers, so with a 12-layer network you would be using the first 11 layers as a pre-trained representation of the data. Transfer Learning is one of the most interesting successes of…
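The head-swapping step described above can be sketched in a few lines of plain Python. This is a toy stand-in under stated assumptions, not a real Transformer: `Linear`, `TinyModel`, and `replace_head` are hypothetical names, the layers are small random matrices, and the vocabulary has 50 entries instead of tens of thousands. The point is only to show which part is kept and which part is replaced.

```python
import random


class Linear:
    """A bias-free dense layer with small random weights (toy stand-in)."""

    def __init__(self, n_in, n_out, rng):
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)
        ]

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


class TinyModel:
    """A stack of body layers followed by a task-specific output head."""

    def __init__(self, body, head):
        self.body = body
        self.head = head

    def __call__(self, x):
        for layer in self.body:
            x = [max(0.0, v) for v in layer(x)]  # ReLU after each body layer
        return self.head(x)


def replace_head(model, n_classes, rng):
    """Keep the pre-trained body, attach a fresh classification head."""
    hidden = len(model.head.weights[0])  # input width of the old head
    return TinyModel(model.body, Linear(hidden, n_classes, rng))


rng = random.Random(0)
HIDDEN, VOCAB = 8, 50  # toy sizes; real vocabularies are far larger

body = [Linear(HIDDEN, HIDDEN, rng) for _ in range(3)]
language_model = TinyModel(body, Linear(HIDDEN, VOCAB, rng))
classifier = replace_head(language_model, n_classes=2, rng=rng)

features = [1.0] * HIDDEN
token_scores = language_model(features)  # one score per vocabulary token
class_scores = classifier(features)      # one score per sentiment class
print(len(token_scores), len(class_scores))
```

In this sketch the new head starts from random weights, so it still has to be trained on the labeled task. Whether the shared body is frozen or updated during that training is a separate design choice.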