AI Weekly Update Preview — April 12th, 2021

Connor Shorten
10 min read · Apr 9, 2021


This article presents a few salient quotes from each of the papers that will be covered on the next AI Weekly Update on Henry AI Labs!

Major Themes of the Latest Papers

  • Self-Supervised Learning
  • Vision-Language Learning
  • Generative Modeling (mostly GANs)
  • Meta-Learning
  • NLP
  • Generalization
  • Model-based RL
  • Code Examples (GPT-Neo, RAPIDS+Determined, PT Lightning+DeepSpeed)
  • Meta (Ruder Newsletter, Commentary on Medical AI approval)

Self-Supervised Learning

Large-scale forecasting: Self-supervised learning framework for hyperparameter tuning

“The SSL-HPT algorithm estimates hyperparameters 6–20x faster when compared with baseline search-based algorithms, while producing comparably accurate forecasting results in various applications.”

“Most existing hyperparameter tuning methods — such as grid search, random search, and Bayesian optimal search — are based on one key component: search. Because of this, they are computationally expensive.”

  • Learn more about this bottleneck in a video I recently made explaining Determined’s ASHA algorithm and other ideas related to HP optimization.

Two Parts → Model Selection (MS) and Hyperparameter Tuning (HPT)

“The self-supervised learning framework for model selection (SSL-MS) consists of three steps:

  1. Offline training data preparation. We obtain (a) time series features for each time series, and (b) the best performing model for each time series via offline exhaustive hyperparameter tuning.
  2. Offline training. A classifier (self-supervised learner) is trained with the data from Step (1), where the input feature (predictor) is the time series feature, the label is the best performing model.
  3. Online model prediction. In our online services, for a new time series data, we first extract features, then make inference with our pre-trained classifier, such as random forest.”

“The SSL-HPT framework consists of three steps:

  1. Offline training data preparation. Similar to SSL-MS, we also need to obtain the time series features, then perform offline exhaustive parameter tuning to get the best performed hyper-parameters for each model and data combination.
  2. Offline training. A multi-task neural network (self-supervised learner) is trained with the datasets from Step(1) for each model.
  3. Online hyper-parameters tuning. In our online system, for a new time series data, we first extract features, then make inference with our pre-trained multi-task neural network.”

“SSL-HPT takes constant time to choose hyper-parameters, it makes fast and accurate forecasts at large scale become feasible.”
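To make the three SSL-HPT steps concrete, here is a toy sketch of the offline-search-then-online-lookup idea. Everything specific in it is my own assumption for illustration and not from the paper: the forecasting model is simple exponential smoothing with a single hyperparameter `alpha`, the two time series features are hand-picked, and a 1-nearest-neighbor lookup stands in for the multi-task neural network.

```python
import random
import statistics

ALPHAS = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical search grid

def ses_error(series, alpha):
    """Mean squared one-step-ahead error of simple exponential smoothing."""
    level, total = series[0], 0.0
    for value in series[1:]:
        total += (value - level) ** 2
        level = alpha * value + (1 - alpha) * level
    return total / (len(series) - 1)

def exhaustive_search(series):
    """Step 1 (offline, expensive): try every alpha and keep the best one."""
    return min(ALPHAS, key=lambda a: ses_error(series, a))

def features(series):
    """Two hand-picked features: lag-1 autocorrelation and volatility of the differences."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) or 1e-12
    lag1 = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:])) / var
    diffs = [b - a for a, b in zip(series, series[1:])]
    return (lag1, statistics.pstdev(diffs))

def make_series(rng, noise, n=200):
    """Toy data: a random walk observed with Gaussian noise of the given scale."""
    level, out = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        out.append(level + rng.gauss(0, noise))
    return out

def fit(train_series):
    """Step 2 (offline): store (features, best alpha) pairs as the 'learner'."""
    return [(features(s), exhaustive_search(s)) for s in train_series]

def predict_alpha(model, series):
    """Step 3 (online): one feature extraction plus one lookup, no search."""
    f = features(series)
    nearest = min(model, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], f)))
    return nearest[1]

rng = random.Random(0)
model = fit([make_series(rng, noise) for noise in (0.1, 0.5, 1, 2, 4, 8) for _ in range(5)])
new_series = make_series(rng, 2.0)
alpha = predict_alpha(model, new_series)
print("predicted alpha:", alpha)
```

The point of the design is that the expensive search runs only over the offline training series. For a new series, the cost is a feature extraction and one inference call, no matter how large the search grid was.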

MoCo v3 — An Empirical Study of Training Self-Supervised Visual Transformers

“This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Visual Transformers (ViT).”

→ “We study the frameworks that are based on Siamese networks, including MoCo and others.”

“MoCo v3 is an incremental improvement of MoCo v1/2 […] In MoCo v3 we use the keys that naturally co-exist in the same batch. We abandon the memory queue, which we find has diminishing gain if the batch is sufficiently large (e.g. 4096).”
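A minimal sketch of what "keys that naturally co-exist in the same batch" means for the contrastive loss: for query i, key i is the positive and every other key in the batch is a negative, so no memory queue is needed. This is a simplified illustration in plain Python, not the paper's implementation. It leaves out the momentum encoder and the symmetrized loss, and the temperature value is an arbitrary choice of mine.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def in_batch_contrastive_loss(queries, keys, tau=0.2):
    """InfoNCE loss where the negatives for query i are the other keys in the batch.

    queries[i] and keys[i] are embeddings of two augmented views of image i.
    """
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with label i
    return total / len(queries)

views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(views_a, [[1.0, 0.0], [0.0, 1.0]])
mismatched = in_batch_contrastive_loss(views_a, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)
```

With a larger batch, each query simply sees more negatives, which matches the quote's observation that the queue matters less as the batch grows.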

“We observe that instability [for training self-supervised ViT] is a major issue that degrades accuracy, and it can be hidden by apparently good results.”

“We observe that unstable ViT training may not result in catastrophic failure (e.g. divergence); instead, it can cause mild degradation in accuracy (e.g. 1~3%).”

“The instability problem can not be simply reflected by accuracy numbers.”

“We notice that a sudden change of gradients (a ‘spike’ in Fig. 4) causes a ‘dip’ in the training curve.”

Proposed Solution:

  • “By comparing all layers’ gradients, we observe that the gradient spikes happen earlier in the first layer (patch projection), and are delayed by couples of iterations in the last layers (see Fig. 4).
  • Based on this observation, we hypothesize that the instability happens earlier in the shallower layers.
  • Motivated by this, we explore freezing the patch projection layer during training.
  • We use a fixed random patch projection layer to embed the patches, which is not learned.
  • This can be easily done by applying a stop-gradient operation right after this layer.”
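The proposed fix can be sketched on a toy problem. Below, a fixed random matrix plays the role of the patch projection and a single trainable linear head stands in for the rest of the network. The sizes, learning rate, and data are hypothetical. In a deep learning framework the same effect would come from a stop-gradient after the layer, or from disabling gradients for its parameters.

```python
import random

rng = random.Random(0)
PATCH_DIM, EMBED_DIM = 4, 3

# Patch projection: random at initialization and never updated afterwards.
W_patch = [[rng.gauss(0, 0.5) for _ in range(PATCH_DIM)] for _ in range(EMBED_DIM)]
W_patch_at_init = [row[:] for row in W_patch]
# Stand-in for the rest of the network: one trainable linear head.
w_head = [0.0] * EMBED_DIM

def embed(patch):
    return [sum(w * x for w, x in zip(row, patch)) for row in W_patch]

def predict(patch):
    return sum(w * e for w, e in zip(w_head, embed(patch)))

def train_step(patch, target, lr=0.005):
    """One SGD step on squared error. The gradient stops at the embedding:
    only w_head is updated and W_patch receives no update."""
    emb = embed(patch)
    error = sum(w * e for w, e in zip(w_head, emb)) - target
    for j in range(EMBED_DIM):
        w_head[j] -= lr * 2 * error * emb[j]

# Toy regression targets that a linear head on the frozen embedding can fit.
true_head = [1.0, -2.0, 0.5]
patches = [[rng.gauss(0, 1) for _ in range(PATCH_DIM)] for _ in range(20)]
data = [(p, sum(w * e for w, e in zip(true_head, embed(p)))) for p in patches]

def mean_loss():
    return sum((predict(p) - t) ** 2 for p, t in data) / len(data)

loss_before = mean_loss()
for _ in range(200):
    for patch, target in data:
        train_step(patch, target)
loss_after = mean_loss()
print(loss_before, loss_after, W_patch == W_patch_at_init)
```

The head still learns on top of the frozen embedding, while the projection weights are identical before and after training.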

Vision-Language Learning

Towards General Purpose Vision Systems

“GPV-I can be trained end-to-end on any task that demands a box or text output without any architecture modifications such as adding a new task-head.”

“Tenets of General Purpose Learning

  1. Generality of architecture: The system can learn and perform any task within a broad domain without change to network structure (e.g. learn to classify bird species, without adding new output heads, by re-using ability to encode images, interpret task from text, and produce words)
  2. Generality of concepts across skills: The system can perform tasks in skill-concept combinations not seen during training (e.g. localize ‘muskrat’ after learning to answer questions about ‘muskrats’)
  3. Generality of learning: The system can learn new tasks sample-efficiently with minimal loss to performance on previously learned tasks”

“GPV-I consisting of a visual encoder, language encoder, vision-language co-attention module, and output heads for the supported output modalities — boxes, relevance scores, and text.”

Generative Modeling

Regularizing Generative Adversarial Networks under Limited Data

“We theoretically show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data.”

→ Get caught up with f-divergences in this video:

f-divergence: “An f-divergence is a function D[f](P || Q) that measures the difference between two probability distributions P and Q.”
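For discrete distributions the definition is short enough to write out: D_f(P || Q) = sum over x of q(x) * f(p(x) / q(x)), where f is convex and f(1) = 0. The sketch below plugs in three standard generator functions. The third is triangular discrimination, which is the form I understand the LeCam divergence to take; normalization constants vary between references, so treat that one as an assumption.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete P, Q with q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def kl_generator(t):
    # f(t) = t log t gives the KL divergence.
    return t * math.log(t) if t > 0 else 0.0

def total_variation_generator(t):
    # f(t) = |t - 1| / 2 gives the total variation distance.
    return 0.5 * abs(t - 1)

def triangular_generator(t):
    # f(t) = (t - 1)^2 / (t + 1) gives sum_x (p - q)^2 / (p + q).
    return (t - 1) ** 2 / (t + 1)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
for name, f in [("KL", kl_generator), ("TV", total_variation_generator), ("triangular", triangular_generator)]:
    print(name, f_divergence(p, q, f))
```

Every generator returns zero at t = 1, which is why the divergence vanishes when the two distributions are identical.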

“The proposed regularization scheme

  1. Improves the generalization performance and stabilizes the learning dynamics of GAN models under limited training data
  2. Complements the recent data augmentation methods.”

“Tracking the moving average of the prediction reduces the variance across mini-batches and stabilizes the regularization term […] The moving average becomes stable while the discriminator’s prediction gradually converges to the stationary point.”
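Here is a rough sketch of the moving-average idea. The form of the regularizer is my reading of the paper and is not stated in the quotes above: I assume it penalizes the squared distance between the discriminator's outputs on real images and a running average of its outputs on generated images, and vice versa. The decay value and the numbers are made up.

```python
class MovingAverage:
    """Exponential moving average of the discriminator's mean prediction."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.value = 0.0

    def update(self, batch_predictions):
        batch_mean = sum(batch_predictions) / len(batch_predictions)
        self.value = self.decay * self.value + (1 - self.decay) * batch_mean
        return self.value

def regularizer(d_real, d_fake, anchor_real, anchor_fake):
    """Assumed form: real predictions are pulled toward the fake anchor and
    fake predictions toward the real anchor."""
    real_term = sum((d - anchor_fake) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - anchor_real) ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term

ema_real, ema_fake = MovingAverage(decay=0.5), MovingAverage(decay=0.5)
d_real, d_fake = [2.0, 4.0], [-1.0, -3.0]  # hypothetical discriminator outputs
anchor_real = ema_real.update(d_real)      # 0.5 * 0 + 0.5 * 3.0 = 1.5
anchor_fake = ema_fake.update(d_fake)      # 0.5 * 0 + 0.5 * (-2.0) = -1.0
penalty = regularizer(d_real, d_fake, anchor_real, anchor_fake)
print(anchor_real, anchor_fake, penalty)
```

Because the anchors average over many mini-batches, a single noisy batch moves them only slightly, which is the variance reduction the quote describes.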

Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts

“A few-shot font generation (FFG) method has to satisfy two objectives: the generated images should:

  • Preserve the underlying global structure of the target character
  • Present the diverse local reference style”

“Multiple Localized eXperts Few-Shot Font Generation (MX-Font) […] MX-Font has a multi-headed encoder, named multiple localized experts. Each localized expert is specialized for different local sub-concepts from the given complex glyph image.”


Meta-Learning

Comparing Transfer and Meta Learning Approaches on a Unified Few-Shot Classification Benchmark

“Meta and transfer learning are two successful families of approaches to few-shot learning. Despite highly related goals, state-of-the-art advances in each family are measured largely in isolation of each other.”

“We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet.”

“In contrast, meta-learning approaches struggle to compete on VTAB when trained and validated on MD.”

“BiT is not without limitations, and pushing for scale does not improve performance on highly out-of-distribution MD tasks.”

→ Another very interesting related paper:

Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML

Natural Language Processing

What will it Take to Fix Benchmarking in Natural Language Understanding?

Quotes from Professor Sam Bowman’s Twitter Thread describing the paper:

“On the cranky side, we argue that adversarial filtering is good at making datasets that look hard by the numbers, but that these datasets can drift arbitrarily far away from measuring the thing that we actually want to measure.”

“On the positive side, we lay out four difficult but relatively familiar challenges that we’ll need to face in order to build benchmarks that will allow us to responsibly measure further progress on language understanding”

Proposed criteria for future NLP benchmarks

How many Data Points is a Prompt Worth?

New HuggingFace Blog Post on the Paper

“GPT-3 has popularized prompts, natural language inputs designed to steer the pre-trained language model itself into solving the task, rather than a classifier built on top of it.”

“Prompts are interesting because they allow a practitioner to give information to the model, although in a very different fashion from standard ML supervision.”

“As we interpret a prompt as additional human-crafted information for the model, we measure that edge in terms of data points and quantify: how many data points is a prompt worth?”

Example of a prompt (red text) applied to a sample of the BoolQ dataset

“Writing a prompt is consistently worth hundreds of data points.”

“Furthermore, this advantage holds even with non-informative target tokens and is fairly robust to the choice of prompt.”
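To show what a prompt and its target tokens look like in code, here is a small sketch for a BoolQ-style example. The pattern string, the mask placeholder, and the yes/no target tokens are hypothetical choices of mine, not necessarily the ones used in the paper, and the token scores are made up in place of a real masked language model.

```python
MASK = "<mask>"  # placeholder; the real mask token depends on the tokenizer

# Target tokens: map each label to the word the model should predict at MASK.
VERBALIZER = {True: "yes", False: "no"}

def boolq_prompt(passage, question):
    """Wrap a (passage, question) pair in a cloze pattern ending in a masked answer."""
    return f"{passage} Question: {question}? Answer: {MASK}."

def label_from_scores(token_scores, verbalizer=VERBALIZER):
    """Pick the label whose target token received the highest score at the MASK position."""
    return max(verbalizer, key=lambda label: token_scores.get(verbalizer[label], float("-inf")))

prompt = boolq_prompt(
    "The Pacific is the largest of Earth's oceans.",
    "is the pacific the smallest ocean",
)
prediction = label_from_scores({"yes": 0.2, "no": 0.7})  # made-up scores
print(prompt)
print(prediction)
```

Fine-tuning then trains the language model to put probability on the right target token, instead of training a new classifier head from scratch.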

For Practitioners

  • “We believe that prompt-based fine-tuning should become a standard tool: especially for small- and middle-sized task-specific datasets, designing a prompt yourself is a small effort for a sizable data advantage.”

For Researchers […]

  • “Why is the same prompt worth 3500 MNLI data points but only 282 RTE data points?
  • How are prompts related to standard ML supervision?
  • Do they react differently to adversarial or out-of-domain examples, since they have some zero-shot behaviour?”

A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

“After a discussion about the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse.”

“A goal of our survey is to explain how these methods differ in their requirements as understanding them is essential for choosing a technique suited for a specific low-resource setting.”

Deep Learning with Code Data

CodeTrans: Towards Cracking the Language of Silicone’s Code Through Self-Supervised Deep Learning and High Performance Computing

“We applied CodeTrans to six different kinds of tasks, including:

  • Code Documentation generation
  • Source Code Summarization
  • Code Comment Generation
  • Git Commit Message Generation
  • API Sequence Recommendation
  • Program Synthesis.”

“We have also contributed our pre-trained checkpoints and published the models for each task in HuggingFace model Hub.”


Generalization

Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift

“Covariate shift […] Informally defined as situations in which training data differs from the data seen at final prediction time.”

“We make the observation that modern machine learning systems utilize batching at prediction time for reasons of computational efficiency, especially with hardware such as GPUs and TPUs that amortize cost well across batches of hundreds or thousands of examples.”

“Prediction-time batch norm achieves an mCE of 60.28% on the challenging ImageNet-C benchmark, which is the best result to our knowledge for a model that does not incorporate additional data augmentation.”
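The core idea fits in a few lines: normalize a test batch with statistics computed from that batch, instead of the running averages stored during training. This is a single-feature sketch in plain Python with made-up numbers, and it leaves out the learned scale and shift parameters.

```python
import math

def batch_stats(batch):
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return mean, var

def batch_norm(batch, mean, var, eps=1e-5):
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# Running statistics accumulated on the training distribution (hypothetical).
train_mean, train_var = 0.0, 1.0

# Test activations after a covariate shift: the mean moved from 0 to 5.
shifted_batch = [3.0, 4.0, 5.0, 6.0, 7.0]

standard = batch_norm(shifted_batch, train_mean, train_var)
prediction_time = batch_norm(shifted_batch, *batch_stats(shifted_batch))

print("standard mean:", sum(standard) / len(standard))
print("prediction-time mean:", sum(prediction_time) / len(prediction_time))
```

Standard batch norm passes the shifted activations downstream almost unchanged, while the prediction-time version re-centers them. It only works when predictions are made in batches, since a batch of one has zero variance.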

Colab Notebook to reproduce results

Model-Based RL

Debugging Deep Model-based Reinforcement Learning Systems

“The post is set up in the following format:

  1. Overview of model-based RL,
  2. Core things to tinker with in these systems
  3. Other considerations that may come up (e.g. when working with robotics),
  4. Practical tips: quick things to change or run for a big potential improvement, and
  5. Conclusion”

“Many of the debugging tools I discuss are from the lens of long-term planning for control, and could be tweaked to be better phrased for methods using value functions and model-free control.”



Code Examples

GPT-Neo for Beginners

  • How to load GPT-Neo
  • Generate from “My name is Zack and I like to”
  • Generate from “Below is React code for a to-do list app:”
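For readers who want to try these steps, here is a minimal sketch using the Hugging Face `transformers` text-generation pipeline. The checkpoint name `EleutherAI/gpt-neo-1.3B` and the sampling settings are my assumptions; the linked tutorial may use a different model size or different arguments.

```python
PROMPTS = [
    "My name is Zack and I like to",
    "Below is React code for a to-do list app:",
]

def load_generator(model_name="EleutherAI/gpt-neo-1.3B"):
    """Load GPT-Neo as a text-generation pipeline (downloads the weights on first use)."""
    from transformers import pipeline  # imported lazily so the sketch can be read without it installed
    return pipeline("text-generation", model=model_name)

def generate(generator, prompt, max_length=50):
    """Sample one continuation of the prompt and return the full generated text."""
    outputs = generator(prompt, do_sample=True, max_length=max_length)
    return outputs[0]["generated_text"]

# Usage (not run here, since it downloads several gigabytes of weights):
#   generator = load_generator()
#   for prompt in PROMPTS:
#       print(generate(generator, prompt))
```

Loading the pipeline once and reusing it for both prompts avoids reloading the model for every generation.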

Generate Netflix Movie Descriptions (GPT-Neo)

Speedy Model Training with RAPIDS + Determined AI

“Data prep with RAPIDS, training with Determined […]

  • Read location and historical sales CSVs into cuDF DataFrames residing in GPU memory.
  • Join these data sets into a denormalized DataFrame. This GPU-accelerated join is handled by cuDF.
  • Construct a PyTorch Dataset from the denormalized DataFrame.
  • Train with Determined!”

“RAPIDS is a software suite that bridges the gap from CUDA primitives to data-hungry analytics and machine learning use cases.”

“Determined AI’s deep learning training platform frees the model developer from hassles: operational hassles they are guaranteed to hit in a cluster setting, and model development hassles as they move from toy prototype to scale.”

Accessible Multi-Billion Parameter Model Training with PyTorch Lightning + DeepSpeed

“Organizing PyTorch code with Lightning enables seamless training on multiple-GPUs, TPUs, CPUs and the use of difficult to implement best practices such as model sharding, and even in 16-bit precision without changing your code.”

“DeepSpeed is a deep learning library on top of PyTorch that makes training models at extreme-scale efficient and easy for everyone. DeepSpeed offers powerful training features for data scientists training on massive supercomputers as well as those training on low-end clusters or even on a single GPU.”


Meta

Sebastian Ruder’s Newsletter

  • ICLR 2021 Outstanding Papers
  • Char Wars
  • Speech-first NLP
  • Virtual conference ideas

How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals

“The FDA releases publicly available information on approved devices in the form of a summary document that generally contains information about the device description, indications for use, and performance data of the device’s evaluation study.”

“We have created an annotated database of FDA-approved medical AI devices and systematically analyzed how these devices were evaluated before approval.”

“Case study of multi-site evaluation for pneumothorax detection […] Across the board, we found substantial drop-offs in model performance when the models were evaluated on a different site.”