Deep Learning with Code Data

Connor Shorten
6 min read · Jul 15, 2021

Programming languages, such as Python and Java, have become a core application area of Deep Learning. OpenAI and GitHub recently unveiled “Copilot”, along with a paper describing the technology and the underlying “Codex” models. Copilot is powered by taking the GPT-3 language modeling show on the road to datasets made of code. These datasets are typically scraped from open-source code hosted on GitHub. Platforms that help prospective Software Engineers prepare for coding interviews have also been used, with Codeforces as a notable source of this data. More particularly Codex…
