Deep Learning with Code Data

Connor Shorten
6 min read · Jul 15, 2021

Programming languages such as Python and Java have become a core application area of Deep Learning. OpenAI and GitHub recently unveiled “Copilot”, along with a paper describing the technology and the underlying “Codex” models. Copilot is powered by taking the GPT-3 language modeling show on the road to datasets made of code. These datasets are typically scraped from open-source repositories hosted on GitHub. Platforms that help prospective Software Engineers prepare for coding interviews have also been used, with Codeforces as a notable source of this data. More specifically, Codex uses 159 GB of filtered GitHub data for generative pre-training, and the models are then fine-tuned on 10,000 competitive programming problems and 40,000 GitHub projects implementing continuous integration (CI).

Coding Problems adapted for Supervised Learning

Codex is an autoregressive generative model. That means it has been trained to take a sequence as input and predict the token that continues the sequence at the rightmost position. For example, given a sequence such as “I am going for a walk outside to get some”, the model would predict “air” to continue the sequence.
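To make the "predict the next token, append it, repeat" loop concrete, here is a minimal toy sketch in Python. It is not how Codex works internally: Codex is a large Transformer that conditions on the whole sequence, while this example swaps in a simple count-based bigram model that only looks at the rightmost token. The function names (`train_bigram`, `predict_next`, `generate`) and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, tokens):
    """Return the most frequent continuation of the rightmost token, or None."""
    followers = counts.get(tokens[-1])
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, prompt, max_new_tokens=5):
    """Autoregressive loop: predict a token, append it, and feed it back in."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = predict_next(counts, tokens)
        if nxt is None:  # the toy model has never seen this token; stop
            break
        tokens.append(nxt)
    return " ".join(tokens)


# Hypothetical toy corpus, standing in for the real training data.
corpus = [
    "I am going for a walk outside to get some air",
    "she went outside to get some air",
    "we need to get some rest",
]
counts = train_bigram(corpus)
completion = generate(
    counts, "I am going for a walk outside to get some", max_new_tokens=1
)
print(completion)
```

In this toy corpus, “air” follows “some” more often than “rest” does, so the sketch continues the prompt with “air”. A real model replaces the lookup table with a neural network that outputs a probability distribution over the whole vocabulary, but the generation loop has the same shape.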

There are generally two ways of thinking about how to use these kinds of models. The first strategy is to unify all supervised learning tasks into this…