AI Weekly Update — December 28th, 2020

Connor Shorten
10 min read · Dec 28, 2020

Computer Vision, taken over by Transformers

Dear Readers,

Thank you for checking out the AI Weekly Update Newsletter from Henry AI Labs! This newsletter tours updates in Deep Learning and Artificial Intelligence, providing quotes and images that tell each story.

I am working on publishing my first experimental paper in Contrastive Learning. More than anything else, this has really tested my ability to manage a large code repository. This inspired a quick video explaining why (in my opinion) you should get away from exclusively writing code in Jupyter notebooks as soon as possible.

I’ve also started a cohort to walk through MIT’s open source “Machine Learning for Healthcare” course. I am extremely grateful to have this group together. Some members are fellow graduate students, and others are professional Machine Learning Engineers. I am extremely excited to report on our experience organizing a study group through Slack and Zoom.

The headline for this weekly update is “Computer Vision, taken over by Transformers.” Researchers from Facebook AI have deployed a new strategy for ConvNet to Transformer Knowledge Distillation that seems to perform even better than EfficientNet, without using additional data (Google ImageNet models generally rely on their private JFT-300M dataset). Additionally, the model was trained on a single 8-GPU computer for 3 days! This is still far from an everyday machine, but quite the improvement nonetheless.

The biggest challenge for me in preparing this weekly update was “Large-scale clinical interpretation of genetic variants using evolutionary data.” I am very interested and excited about the intersection of Deep Learning and Biology, but I often bite off more than I can chew with the technical details on the Biology side. Particularly, this paper describes how they can assess the effects of thousands of mutations in parallel. This was very overwhelming for me, but I was still inspired by the more relatable (to my background) discussion of generative vs. classification models. The authors describe how their Bayesian VAE generative model can generalize better than classification models of genetic variants due to noisy and small datasets of clinical labels. There is an exciting generalization capability achieved through generative models. I was surprised they turned to the VAE instead of everyone’s favorite, the Transformer language model.

The most “head-scratching” update was “Pre-training a Language Model without Human Language.” Huh? This paper shows that pre-training on Amino Acid sequences, JavaScript code, or even artificially constructed data performs better than random init in the pre-train then fine-tune pipeline.

I hope you find this list of quotes and images useful. The video version of this AI Weekly Update is also linked below if that style appeals more to you.

Thank you for Reading!

-Connor Shorten

Video Version of this Summary

Content List

  • Data-Efficient Image Transformers
  • Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning
  • Pre-Training a Language Model Without Human Language
  • Evaluating Agents Without Rewards
  • Few-Shot Text Generation with Pattern-Exploiting Training
  • When BERT Plays the Lottery, All Tickets Are Winning
  • Interfaces for Explaining Transformers
  • Reddit AI in 2020
  • MSR 2020 in Review
  • Ruder Newsletter
  • DeepMind’s annual report: Why it’s hard to run a commercial AI lab
  • HuggingFace Dataset Sprint
  • AutoNLP

Data-Efficient Image Transformers

Content Link

“We produce a competitive convolution-free transformer by training on ImageNet only”

“We train it on a single computer in less than 3 days.”

“We introduce a new distillation procedure based on a distillation token, which plays the same role as the class token, except that it aims at reproducing the label estimated by the teacher. Both tokens interact in the transformer through attention. We show that this transformer-specific strategy outperforms vanilla distillation by a significant margin.”

→ Note: Look at how the distillation token is indexed at the end of the output vector, in a similar style to the [CLS] classification token in BERT. Adding special tokens like this isn’t completely new. We’ve seen Vision-Language architectures try it (e.g. OSCAR, ViLBERT, ImageBERT), but this seems to be the most successful result so far. Relating it to distillation is a very “outside-the-box” extension.
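To make the two-token idea concrete, here is a minimal sketch of a hard-label distillation loss, assuming one linear head on the class token and one on the distillation token. The function names, the toy logits, and the equal 0.5 weighting are assumptions for illustration, not code from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy of one example against an integer class label."""
    return -math.log(softmax(logits)[target])

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Average two cross-entropies: class head vs. the true label,
    distillation head vs. the teacher's predicted (argmax) label.
    The 0.5 / 0.5 weighting is an assumption of this sketch."""
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy 3-class example with made-up logits.
cls_logits = [2.0, 0.5, -1.0]      # head on the class token
dist_logits = [0.1, 1.5, -0.3]     # head on the distillation token
teacher_logits = [0.2, 3.0, 0.1]   # the ConvNet teacher prefers class 1
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label=0)
print(round(loss, 4))
```

Note that the two heads are allowed to disagree: the class head is pulled toward the ground-truth label while the distillation head is pulled toward whatever the teacher predicts.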

Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning

Content Link

“The exponential growth in human genome sequencing has underlined the substantial genetic variation in the human population”

“Ideally, computation could accelerate clinical variant interpretation. However, widely-used computational methods are supervised, training on sparse, imbalanced, and noisy clinical labels, and can be implicitly circular due to data leakage during cross-validation.”

“Unsupervised probabilistic models of evolutionary sequences alone have been remarkably successful at predicting the effects of variants on protein function and stability and are fundamentally generalizable as they avoid learning from labels”

→ Note the use of a Bayesian VAE, rather than a generative Language Model. A VAE essentially solves a reconstruction-style pre-training task similar to BERT’s. The core difference between a VAE and BERT is that a VAE samples a latent variable (via the reparameterization trick) and is trained to maximize the ELBO (Evidence Lower Bound) of the data. BERT is a much simpler model that does not introduce this probabilistic sampling step. In contrast to BERT/VAEs, the GPT models solve a fairly different task. The auto-regressive style (vs. the denoising auto-encoder approach used here) may be another direction to explore in this study.
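For reference, this is the textbook ELBO for a standard VAE with encoder q_φ(z|x), decoder p_θ(x|z), and prior p(z), followed by the reparameterization used to sample z. This is the generic formulation and an assumption on my part about notation; the paper’s Bayesian VAE may differ in its details.

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)

z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```

The first term rewards reconstructing the input sequence from the latent code, and the KL term keeps the encoder’s posterior close to the prior.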

Pre-Training a Language Model Without Human Language

Content Link

“We study how pre-trained data might and might not affect the downstream performance of a transformer-based pre-trained LM”

“We reveal that fine-tuning models pre-trained on unstructured data outperforms models trained from scratch on downstream tasks”

“We discover that pre-training on a simple artificial dataset with hierarchical structure leads to downstream performance comparable to models pre-trained on human language”

4.1 Is Structured Data All You Need for Pre-training?

→ “Our results show that models benefit from pre-training on a certain type of structured corpora, while not every structured corpus leads to a good pre-trained model for NLP downstream tasks.”

4.2 Does Pre-training Data Token Distribution Affect the Performance on Downstream Tasks?

→ “The results […] show that even when the pre-training data is structured, token distribution still has little influence on how well the model can be fine-tuned.”

4.3 Does Token Numbers Mismatch between Pre-training and Fine-tuning Affect Downstream Performance?

→ “We can conclude that the main reason a pre-trained model failed to transfer to human language downstream tasks lies in the intrinsic property of the pre-training data”

4.4 Further Fine-tuning with English MLM before Fine-tuning on GLUE

→ “We find the performance slightly advance mostly, with improvement in JavaScript being the most salient.”

→ Reminder of the standard fine-tuning workflow. The unique thing about this paper is that Stage 1 is pre-training on non-human language. Rather than using, say, Wikipedia or the BooksCorpus, the Transformer models amino acid sequences or JavaScript code.
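To give a feel for what a “simple artificial dataset with hierarchical structure” could look like, here is a hypothetical generator of nested token sequences, where every token is later closed by a matching copy, like well-nested brackets. This construction, the function names, and the parameters are my own assumptions for illustration; the paper’s actual datasets may be built differently.

```python
import random

def nested_sequence(rng, vocab_size, max_len):
    """Sample a token sequence in which every token is later closed by a
    matching copy, like well-nested brackets (e.g. 3 7 7 5 5 3)."""
    seq, stack = [], []
    while len(seq) + len(stack) < max_len:
        # Either close the most recently opened token or open a new one.
        if stack and rng.random() < 0.5:
            seq.append(stack.pop())
        else:
            tok = rng.randrange(vocab_size)
            stack.append(tok)
            seq.append(tok)
    # Close everything that is still open, innermost first.
    while stack:
        seq.append(stack.pop())
    return seq

def is_well_nested(seq):
    """Check the hierarchical property by cancelling adjacent matching pairs."""
    stack = []
    for tok in seq:
        if stack and stack[-1] == tok:
            stack.pop()
        else:
            stack.append(tok)
    return not stack

rng = random.Random(0)
sample = nested_sequence(rng, vocab_size=5, max_len=12)
print(sample, is_well_nested(sample))
```

A corpus of such sequences has no meaning at all, but a masked language model still has to track long-range, nested dependencies to predict the closing tokens.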

Evaluating Agents without Rewards

Content Link

“Children explore the world by crawling around and playing with objects they find. Inspired by this, the field of intrinsic motivation […] seeks mathematical objectives for RL agents that do not depend on a specific task and can be applicable to any unknown environment.”

3 Common Types of Intrinsic Motivation

  • “Input entropy encourages encountering rare sensory inputs, measured by a learned density model”
  • “Information gain rewards the agent for discovering the rules of its environment”
  • “Empowerment rewards the agent for maximizing the influence it has over its sensory inputs or environments”

“We propose the methodology of evaluating and comparing intrinsic objectives by correlation analysis on a fixed dataset”

→ Similar to Offline RL more generally, the authors use previously logged experience to measure what Intrinsic Motivation scores the agents achieve. Note the Offline data itself has no sense of these metrics, so the scores could turn out to be completely uncorrelated. Surprisingly, these intrinsic motivation rewards actually correlate more strongly with human similarity than the direct metric of task reward does. Greatness cannot be planned, long live NOVELTY SEARCH! (Search → Kenneth Stanley, Jeff Clune)

“All studied intrinsic objectives correlate more strongly with human similarity than the task rewards do.”
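A rough sketch of what “correlation analysis on a fixed dataset” can look like: score each logged agent with an intrinsic objective, then correlate those scores with task reward. The count-based entropy below is a simple stand-in for the learned density model mentioned in the quote, and the toy agents and rewards are made up for illustration.

```python
import math
from collections import Counter

def input_entropy(observations):
    """Shannon entropy (in nats) of the empirical distribution over
    discrete observations. A count-based stand-in for a learned density model."""
    counts = Counter(observations)
    n = len(observations)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical logged data: discretized observations and task reward per agent.
logged_agents = [
    {"obs": [0, 0, 0, 0], "reward": 1.0},
    {"obs": [0, 0, 1, 1], "reward": 2.0},
    {"obs": [0, 1, 2, 3], "reward": 4.0},
]

entropies = [input_entropy(agent["obs"]) for agent in logged_agents]
rewards = [agent["reward"] for agent in logged_agents]
r = pearson(entropies, rewards)
print(round(r, 3))
```

Nothing is trained here. The objectives are only evaluated after the fact on data that already exists, which is what makes the comparison cheap.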

Few-Shot Text Generation with Pattern-Exploiting Training

Content Link

“Providing pretrained language models with simple task descriptions or prompts in natural language yields impressive few-shot results for a wide range of text classification tasks when combined with gradient-based learning from examples.”

“Further improvements are often possible by choosing a different pretraining objective that more closely matches the downstream task of interest”

“Instead of making pretraining more similar to a downstream task, we can reformulate the task itself to make it more similar to the pretraining objective.”

“Enabling users to explain a task to a pretrained model, making it much easier for the model to understand the task”

“We adapt PET to train generative models on text generation tasks […] enable us to fine-tune a pretrained PEGASUS model […] with PET.”

→ Using these patterns helps the model transition from the PEGASUS pre-training task to downstream abstractive summarization. In these patterns, “x” represents the new data instance. For our particular interest, x could be something like a news article. The <mask> or underscore ( _ ) marks where the model will write the abstractive summary.

If you have already been studying PET a bit… → “We do not require a verbalizer as the output space already consists of natural language sentences.”
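As a small illustration of what a pattern does, here is a hypothetical template function that wraps an input x with a short instruction and a mask slot. The exact wording “Summary: … Text: …” and the names below are my own assumptions, not necessarily the patterns used in the paper.

```python
MASK = "<mask>"

def summarization_pattern(x, instruction="Summary:"):
    """Wrap the input text x in a pattern with one mask slot
    where the model is expected to write its summary."""
    return f"{instruction} {MASK} Text: {x}"

article = "The city council voted on Monday to expand the bike lane network."
prompt = summarization_pattern(article)
print(prompt)
```

Swapping the instruction (for example to “Headline:”) changes the task description without touching the model, which is the appeal of this approach in the few-shot setting.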

Old video I made explaining PEGASUS: (pre-training task for abstractive summarization, basically the same idea as SpanBERT)

Marrying PET with a language model pre-trained with the PEGASUS task

When BERT Plays the Lottery, All Tickets Are Winning

Content Link

“The first version of our attempt to survey BERTology literature (Rogers et al. 2020) provided an overview of about 40 papers in February 2020. By June, there were over a hundred. The final TACL camera-ready version has about 150 BERT-related citations, and no illusions of completeness: we ran out of journal-allotted pages in August 2020.”

“The Lottery Ticket Hypothesis proposes that randomly initialized neural networks contain subnetworks that could be re-trained alone to reach (and sometimes exceed) the performance of the full model”

“Most of BERT’s self-attention heads can be pruned based on importance scores derived from the model’s gradients”

“For the Base-Transformer models, trained for machine translation, the heads pruned last tended to have syntactic functions”

→ Illustration of Magnitude Pruning compared to Structured Pruning. Structured pruning doesn’t seem to work as well although it would be more desirable for uncovering insights about the inner workings of BERT.
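The difference between the two pruning styles can be sketched on toy weights. This is a minimal illustration, assuming each attention head is just a flat list of weights and that head importance scores are given. The head names, weights, and scores are hypothetical.

```python
def magnitude_prune(heads, sparsity):
    """Unstructured pruning: zero the smallest-magnitude individual
    weights, wherever they sit across all heads."""
    entries = sorted(
        (abs(w), name, i) for name, ws in heads.items() for i, w in enumerate(ws)
    )
    k = int(len(entries) * sparsity)  # number of weights to remove
    pruned = {name: list(ws) for name, ws in heads.items()}
    for _, name, i in entries[:k]:
        pruned[name][i] = 0.0
    return pruned

def structured_prune(heads, importance, n_remove):
    """Structured pruning: zero entire heads with the lowest importance scores."""
    drop = set(sorted(heads, key=lambda name: importance[name])[:n_remove])
    return {
        name: [0.0] * len(ws) if name in drop else list(ws)
        for name, ws in heads.items()
    }

# Toy weights for two attention heads (made-up numbers).
heads = {
    "head_0": [0.9, -0.05, 0.4, 0.01],
    "head_1": [0.02, -0.7, 0.03, 0.6],
}
importance = {"head_0": 0.8, "head_1": 0.3}  # hypothetical gradient-based scores

print(magnitude_prune(heads, sparsity=0.5))
print(structured_prune(heads, importance, n_remove=1))
```

Both calls remove half of the weights, but magnitude pruning keeps the largest weights in every head, while structured pruning throws away a whole head, including its large weights.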

Core Idea → “Given all that: if BERT is so overparameterized, could we make it more interpretable by pruning it down to its most essential components? […] We would use pruning as a technique for model analysis rather than model compression”

“For most GLUE tasks, the ‘good’ subnetworks can be retrained to reach performance close to that of the full model, but so can randomly sampled subnetworks of the same size. This is good news for BERT compression (it’s a lottery you can’t lose), but bad news for interpretability.”

Interfaces for Explaining Transformers

Content Link

There are two “Explorables” at the beginning of the article. These are interactive visualizations of attention, and I HIGHLY RECOMMEND testing these interfaces for yourself (reminder, link to article in section title).

What’s AI (YouTube) → AI in 2020 Recap

Content Link

Similar to above, I recommend just going to the link rather than trying to snip out quotes from this.

Microsoft Research in 2020

Content Link

“Microsoft researchers pursue the big questions about what the world will be like in the future and the role technology will play. Not only do they take on the responsibility of exploring the long-term vision of their research, but they must also be ready to react to the immediate needs of the present. This year in particular, they were asked to use their roles as futurists to address pressing societal challenges.”

Research Areas

  • Artificial Intelligence
  • Graphics and Multimedia
  • Human language technologies
  • Medical, health and genomics
  • Programming languages and software engineering
  • Quantum computing
  • Security, privacy, and cryptography

Ruder Newsletter

Content Link

Minimum viable datasets

“One of the most important factors for making progress on an idea in ML is the speed of iteration, i.e. how long it takes to try a hypothesis or a set of hyper-parameters on a dataset and to obtain results.”


“This year, it felt like such progress was nicely juxtaposed with progress in making models smaller.”

Roguelikes for RL Research or: The Promise of Procedural Generation

“Compared to deterministic environments for reinforcement learning (RL), procedurally generated environments are interesting as agents are forced to generalise and are not able to memorise sequences of actions that lead to previously visited states (see e.g. Go-Explore and Montezuma’s Revenge)”
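A tiny sketch of the procedural generation idea, assuming a grid world where each level is built from a random seed. The function, grid symbols, and seed ranges are hypothetical and only meant to show why memorising action sequences stops working.

```python
import random

def generate_level(seed, size=5, wall_prob=0.2):
    """Build a size x size grid from a seed:
    '#' is a wall, '.' is floor, 'S' is the start, 'G' is the goal."""
    rng = random.Random(seed)
    grid = [
        ["#" if rng.random() < wall_prob else "." for _ in range(size)]
        for _ in range(size)
    ]
    grid[0][0] = "S"
    grid[size - 1][size - 1] = "G"
    return ["".join(row) for row in grid]

# Train and evaluate on disjoint seeds so the agent never sees a test level.
train_seeds = range(0, 100)
test_seeds = range(100, 120)

for row in generate_level(seed=0):
    print(row)
```

The same seed always reproduces the same level, so experiments stay repeatable, while held-out seeds give levels the agent could not have memorised.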

DeepMind’s annual report: Why it’s hard to run a commercial AI Lab

Content Link

“DeepMind is not a normal company seeking to grab a share of a specific market. It is an AI research lab that has had to repurpose itself into a semi-commercial outfit to ensure its survival”

Hugging Face Newsletter

Content Link

“This December, we had our largest community event ever: the Hugging Face Datasets Sprint 2020.”

“The datasets hub is now the largest open-source NLP datasets hub and will soon pass 600 datasets, covering a significant portion of the NLP dataset world.”

“Each dataset is provided with an editable “Dataset card” […] which describes the content of the datasets and welcomes information about the curation process leading to the creation of the dataset.”

“New pre-print on avoiding dataset biases […] Led by Research Scientist Victor Sanh, we show a method to train a model to ignore dataset biases without explicitly identifying/modeling them by learning from the errors of a “dumb” model.”


Hugging Face AutoNLP

Content Link

Please fill out this survey to help Hugging Face develop an AutoNLP tool!

All Content Links

Data-Efficient Image Transformers:

Pre-Training a Language Model without Human Language:

Evaluating Agents without Rewards:

Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning:

Few-Shot Text Generation with Pattern-Exploiting Training:

When BERT plays the Lottery, All Tickets are Winning:

Explaining Transformers:

AI in 2020:

MSR in 2020:

Ruder NLP News:

DeepMind Annual Report Blog:

Hugging Face Newsletter:

Hugging Face AutoNLP:

Thank you for Reading

Reminder: all original content is linked under the title of “Content Link”.

-Connor Shorten, Ph.D. student at Florida Atlantic University College of Engineering and Computer Science. YouTuber @ Henry AI Labs.