Learnings from KDD 2022

Last month, I had the opportunity to attend the KDD conference in Washington, DC. In this post, I’m sharing a summary of what I learned from the talks that I attended or found interesting.

1. Multimodal Learning

Using Transformers to classify Multimodal Data

This tutorial focused on using Transformers to classify multimodal data consisting of both text and images. It explored and compared two techniques for classification:

Building a dual-encoder text-image classifier: The dual encoder comprises a separate text encoder (BERT) and an image encoder (ResNet-50).
Building a Joint-Encoder Text-Image Classifier with Align before Fuse (ALBEF): This approach involves an image encoder, a text encoder, and a multimodal encoder, and uses image-text contrastive loss to align the unimodal representations of an image-text pair before fusion.
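To make the "align before fuse" idea more concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is not the tutorial's code: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all my own assumptions, and a real implementation would work on encoder outputs inside a deep learning framework.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (guard against the all-zero vector).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def image_text_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i of images and texts is the positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = cosine similarity of image i and text j, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: each image should pick out its own text.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Hypothetical toy embeddings: matched pairs vs. swapped pairs.
aligned = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[1.0, 0.0], [0.0, 1.0]])
misaligned = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
print(aligned < misaligned)
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are swapped, which is the signal that pulls the unimodal representations together before they reach the multimodal encoder.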

The tutorial was comprehensive, and I’d recommend following along with the notebook and slides linked below.

Relevant Links: Slides, Code and Tutorial

Illustration of Align before Fuse: It consists of an image encoder, a text encoder, and a multimodal encoder.

Multimodal Transformers for detecting bad quality ads in YouTube

The paper reports significant performance gains in content quality prediction for YouTube video ads after moving from simpler feed-forward neural networks to Transformer-based models. The paper compares various flavors of Transformer architectures:

Unimodal text and video feature representations: Pre-trained BERT was used to encode the text, and ResNet was used to encode each frame of the video.
Multimodal learning using Early Fusion, Mid Fusion and Late Fusion: The paper shows the results of experiments with various flavors of self-Attention and co-Attention modules and embedding fusion techniques.
Results: Transformers are effective at condensing multimodal sequential input data into a useful ad representation. Co-Attention was most effective when placed after a few layers of self-Attention blocks (mid fusion). The experiments also found that video modality is the primary driver of quality.
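The difference between early, mid and late fusion is easier to see in code. The sketch below is my own simplification, not the paper's architecture: the attention function uses identity projections (no learned weights), the fusion functions and toy inputs are hypothetical, and pooling is a plain mean.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, context):
    """Scaled dot-product attention with identity projections (no learned weights)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                           for k in context])
        outputs.append([sum(w * v[i] for w, v in zip(weights, context))
                        for i in range(dim)])
    return outputs

def mean_pool(sequence):
    return [sum(col) / len(sequence) for col in zip(*sequence)]

def early_fusion(text, video):
    # Concatenate both sequences first, then self-attend over the joint sequence.
    joint = text + video
    return mean_pool(attention(joint, joint))

def mid_fusion(text, video):
    # Self-attention per modality first, then co-attention across modalities.
    text_h = attention(text, text)
    video_h = attention(video, video)
    text_co = attention(text_h, video_h)   # text queries attend to video
    video_co = attention(video_h, text_h)  # video queries attend to text
    return mean_pool(text_co) + mean_pool(video_co)

def late_fusion(text, video):
    # Encode each modality independently and only concatenate the pooled vectors.
    return mean_pool(attention(text, text)) + mean_pool(attention(video, video))

# Hypothetical 3-dimensional token and frame features.
text_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
video_frames = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0], [0.0, 1.0, 1.0]]
print(len(early_fusion(text_tokens, video_frames)),
      len(mid_fusion(text_tokens, video_frames)),
      len(late_fusion(text_tokens, video_frames)))
```

In this toy version, mid fusion is the only variant where each modality is first refined on its own and then allowed to query the other, which matches the placement the paper found most effective.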

Relevant Links: Paper, Recording.

Model Architecture (left); Multimodal Encoder Architectures (right)

Multi Modality To Text Transfer Transformer (Amazon)

In this paper, the authors present a new generative model that combines different modalities (e.g. text and vision).

The proposed model is an encoder-decoder model in which the non-text components are fused to the text tokens. The experiments were done over Amazon’s ecommerce catalog involving image and text, with the rationale that the image of a product provides more information about the product.
While this architecture was evaluated for attribute generation, image-text matching, and captioning, I do see some potential applications of using a similar architecture in the Integrity domain.

Relevant Links: Paper

Text and non-text inputs are fused in the encoder using early fusion method and the encoder’s last hidden state is used in the decoder for text generation.

Large-Scale Commerce MultiModal Representation Learning

This paper introduces a multimodal model capable of generating rich representations of commerce data.

It provides a diverse and granular understanding of the commerce topics associated with a given piece of content (image, text, image+text).
The model is composed of an image encoder, a text encoder, and a multimodal fusion encoder, and has the capability to generalize to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc.

Relevant Links: Blog, Paper

Model Architecture

Learning Product Embeddings at Pinterest

This paper from Pinterest introduces a single set of product embeddings called ItemSage to provide relevant recommendations in use cases such as user-, image- and search-based recommendations. ItemSage uses a transformer-based architecture capable of aggregating information from both text and image modalities, which enables it to significantly outperform single-modality baselines.

Relevant Links: Paper

Multilingual Taxonomic Web Page Classification at Yahoo

This paper uses multilingual Transformer-based transfer learning models to classify web pages in five high-impact languages. The authors leverage knowledge distillation to train accurate models that are lightweight in terms of (i) model size, and (ii) the input text used. The paper also explores building a model that can accurately classify web pages based only on text extracted from the URL (instead of crawling the entire web page, which can be expensive).

Relevant Links: Paper, Video

2. Few-Shot Learning

Training deep vision models in low-data regimes

The talk showed that incorporating domain-specific and modality-specific inductive biases leads to improved model performance when training data is limited. The following methods for object detection in low-data regimes were discussed:

Kernelized Few-Shot Object Detection With Efficient Integral Aggregation: An Encoding Network encodes support and query images. The Kernelization block forms the linear, polynomial and RBF kernelized representations from features extracted within support regions of support images. These features are then cross-correlated against features of a query image to obtain attention weights, and generate query proposal regions via an Attention Region Proposal Net.
Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection: To address some of the problems with existing Few-Shot Object Detection approaches, this paper proposes TENET, which forms high-order tensor representations that capture multi-way feature occurrences to provide highly-discriminative representations. It also uses a transformer mechanism to dynamically extract correlations between the query image and the entire support set for a class, instead of a single average-pooled support embedding.
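For readers unfamiliar with the linear, polynomial and RBF kernels mentioned for the kernelized detector above, here is a small illustrative sketch. It only shows the three kernel functions and a toy cross-correlation of one support feature against a few query locations. The helper names, the hyperparameters (`degree`, `coef0`, `gamma`) and the feature values are assumptions of mine, and the paper's actual Kernelization block is considerably more involved.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_over_query(support_feature, query_features, kernel):
    """Correlate one support feature with every query location, then normalize."""
    return softmax([kernel(support_feature, q) for q in query_features])

# Hypothetical features: one support region and three query locations.
support = [1.0, 0.0]
query_locations = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
weights = attention_over_query(support, query_locations, rbf_kernel)
print(weights)
```

Query locations that resemble the support feature get the largest attention weights, which is the intuition behind using these correlations to propose regions in the query image.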

Kernelized Few-Shot Object Detector (KFSOD)

Relevant Links: Video

3. Graph Learning

Scaling up Graph Neural Networks at Snap

Graphs are used for multiple applications at Snap: Friendships, Chatting, Story Viewing, Games, Lenses etc. These graphs have hundreds of millions of nodes, and billions of edges. Scalable large-scale graph learning with Graph Neural Networks (GNNs) is hard. This talk discussed the following challenges of scaling GNNs:

Storing very large graphs is non-trivial
Models are slow to train due to data dependency
Realtime inference is slower than traditional models

To solve challenges #1 and #2, can we significantly speed up GNN training by making large graph data smaller? This is where Graph Condensation comes in:

Graph Condensation for Graph Neural Networks: In this paper, the authors aim to condense the large, original graph into a small, synthetic and highly-informative graph, such that GNNs trained on the small graph and large graph have comparable performance. They approach the condensation problem by imitating the GNN training trajectory on the original graph through the optimization of a gradient matching loss and design a strategy to condense node features and structural information simultaneously.
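As a rough illustration of what a gradient matching loss measures, the sketch below compares the gradient of a tiny linear model on "real" data with its gradient on a one-sample "synthetic" dataset. Everything here is a simplification I made up for illustration: the linear model stands in for a GNN, the distance is a sum of (1 - cosine similarity) over gradient vectors, and the data values are hypothetical.

```python
import math

def mse_gradient(weights, features, targets):
    """Gradient of mean squared error for a linear model y = w . x."""
    n = len(features)
    grad = [0.0] * len(weights)
    for x, y in zip(features, targets):
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * error * xi / n
    return grad

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def gradient_matching_loss(real_grads, synthetic_grads):
    """Sum of (1 - cosine similarity) over corresponding gradient vectors."""
    return sum(1.0 - cosine_similarity(g_real, g_syn)
               for g_real, g_syn in zip(real_grads, synthetic_grads))

weights = [0.0, 0.0]
real_x, real_y = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]], [1.0, 2.0, 3.0]
good_syn_x, good_syn_y = [[2.0, 4.0]], [2.0]    # one point, same gradient direction
bad_syn_x, bad_syn_y = [[2.0, -4.0]], [2.0]     # one point, different direction

real_grad = mse_gradient(weights, real_x, real_y)
good_loss = gradient_matching_loss([real_grad], [mse_gradient(weights, good_syn_x, good_syn_y)])
bad_loss = gradient_matching_loss([real_grad], [mse_gradient(weights, bad_syn_x, bad_syn_y)])
print(good_loss < bad_loss)
```

A condensation procedure would update the synthetic data to drive this loss down along the training trajectory, so that a model trained on the small dataset takes steps similar to one trained on the original.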

Graph Condensation

To solve challenge #3, can we deploy GNNs in real-time inference settings without latency overheads? This is where Graph-less Neural Networks (GLNNs) come in:

Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation: GNNs are less popular for practical deployments in industry owing to the scalability challenges caused by data dependency. Namely, GNN inference depends on neighbor nodes multiple hops away from the target, and fetching them burdens latency-constrained applications. In this paper, the authors bring GNNs and MLPs together via Knowledge Distillation (KD). This work shows that the performance of MLPs can be improved by large margins with GNN KD. The paper calls the distilled MLPs Graph-less Neural Networks (GLNNs) as they have no inference graph dependency.
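A distillation objective of this kind is usually a weighted mix of a supervised term and a term that pulls the student toward the teacher's soft predictions. The sketch below shows that general shape for a single node; the function name, the weighting parameter `lam` and the example numbers are my own assumptions rather than the paper's exact formulation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs, label=None, lam=0.5):
    """lam * cross-entropy(label, student) + (1 - lam) * KL(teacher || student).

    `label` may be None for unlabeled nodes, in which case only the
    teacher term contributes.
    """
    student_probs = softmax(student_logits)
    # KL divergence from the teacher's soft labels to the student's prediction.
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0)
    if label is None:
        return (1.0 - lam) * kl
    cross_entropy = -math.log(student_probs[label])
    return lam * cross_entropy + (1.0 - lam) * kl

# Hypothetical node with two classes: the MLP student sees only node features,
# while the teacher probabilities would come from a trained GNN.
matching = distillation_loss([0.0, 0.0], [0.5, 0.5], label=0)
diverging = distillation_loss([0.0, 0.0], [0.9, 0.1], label=0)
print(matching < diverging)
```

At inference time only the student MLP is used, so no neighbor fetching is needed; the graph influences the model solely through the teacher's soft labels during training.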

The GLNN Framework

Graph Minimally-Supervised Learning

It is common for graphs to come with only a small amount of labeled data, since annotating and labeling graphs is time- and resource-consuming. This talk focused on state-of-the-art techniques for graph learning with minimal human supervision in low-resource settings where limited or even no labeled data is available. Topics covered:

Graph Weakly-Supervised Learning: Methodologies and applications of graph learning with weak supervision, with a focus on three types of weak supervision: incomplete supervision, indirect supervision, and inaccurate supervision.
Graph Few-Shot Learning: Two categories of approaches, meta-gradient-based methods and metric-learning-based methods, that show how to handle never-before-seen nodes, edges, and graphs. The talk also discussed graph zero-shot learning.
Graph Self-Supervised Learning: Three main paradigms, including graph generative modeling, graph property prediction and graph contrastive learning.

Relevant Links: Slides

Graph-based Representation Learning at Twitter

High-quality user and item representations are crucial for personalized recommendations. To construct these user and item representations, self-supervised graph embedding has emerged as a principled approach to embed relational data such as user social graphs, user membership graphs, user-item engagements, and other heterogeneous graphs.
This talk discussed the different approaches to self-supervised graph embedding and demonstrated how to effectively utilize the resultant large embedding tables to improve candidate retrieval and ranking.
TwHIN: Embedding the Twitter Heterogeneous Information Network for Personalized Recommendation was an interesting paper discussed as a part of the talk. The paper talks about knowledge-graph embeddings for entities in the Twitter Heterogeneous Information Network (TwHIN). The authors show that these pretrained representations yield significant offline and online improvement for a diverse range of downstream recommendation and classification tasks: personalized ads rankings, search ranking and offensive content detection.
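To give a feel for how knowledge-graph embeddings score an edge in a heterogeneous network, here is a minimal sketch of a translation-style scoring function with a negative-sampling loss. I am assuming a TransE-like score of the form (source + relation) · target; the entity names, relation name and vectors are hypothetical, and this is not code from the paper.

```python
import math

def edge_score(source, relation, target):
    """Translation-style score: dot product of (source + relation) with target."""
    return sum((s + r) * t for s, r, t in zip(source, relation, target))

def log_sigmoid(x):
    return -math.log(1.0 + math.exp(-x))

def negative_sampling_loss(source, relation, positive_target, negative_targets):
    """Push the observed edge's score up and sampled non-edges' scores down."""
    loss = -log_sigmoid(edge_score(source, relation, positive_target))
    for negative in negative_targets:
        loss -= log_sigmoid(-edge_score(source, relation, negative))
    return loss

# Hypothetical 2-dimensional embeddings for a (user, "follows", user) edge.
user_a = [1.0, 0.0]
follows = [0.0, 1.0]
user_b = [1.0, 1.0]          # observed edge target
random_user = [-1.0, -1.0]   # sampled negative target

positive_score = edge_score(user_a, follows, user_b)
negative_score = edge_score(user_a, follows, random_user)
loss = negative_sampling_loss(user_a, follows, user_b, [random_user])
print(positive_score, negative_score)
```

Because every entity type shares one embedding space and each relation type gets its own vector, the same trained table can be reused as input features for very different downstream models.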

Relevant Links: Slides

The end-to-end framework aggregates disparate network data to construct TwHIN; joint embedding is then performed, and the embeddings are consumed in downstream tasks and ML models.

Other Interesting Graph Learning Papers:

4. Adversarial Learning

Towards Adversarial Learning: from Evasion Attacks to Poisoning Attacks

Although deep neural networks (DNNs) have been successfully deployed in various real-world application scenarios, recent studies demonstrated that DNNs are extremely vulnerable to adversarial attacks. This talk provided a comprehensive overview of the recent advances of adversarial learning, including both attack methods and defense methods:

An introduction to various types of evasion and poisoning attack methods, followed by a series of representative defense methods against such attacks.
The presenters also talked about DeepRobust, a PyTorch adversarial learning library which aims to build a comprehensive and easy-to-use platform to foster this research field.

Relevant Links: Slides

Evasion Attack: the network is fed an “adversarial example”: a carefully perturbed input that looks and feels exactly the same as its untampered copy to a human, but that completely throws off the classifier.
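One of the simplest ways to build such an adversarial example is the fast gradient sign method (FGSM): nudge every input feature by a small step in the direction that increases the loss. The sketch below applies it to a hand-written logistic regression classifier. The weights, input and step size `epsilon` are made up for illustration, and this is not code from the talk or from DeepRobust.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(value):
    return (value > 0) - (value < 0)

def fgsm_attack(weights, bias, x, label, epsilon):
    """Perturb x by epsilon in the direction that increases the loss.

    For binary cross-entropy with a logistic model, the gradient of the
    loss with respect to the input is (p - label) * weights.
    """
    p = predict(weights, bias, x)
    gradient = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, gradient)]

# Hypothetical model and input.
weights, bias = [2.0, -1.0], 0.0
x = [0.3, 0.1]
x_adv = fgsm_attack(weights, bias, x, label=1, epsilon=0.3)

clean_probability = predict(weights, bias, x)
adversarial_probability = predict(weights, bias, x_adv)
print(clean_probability > 0.5, adversarial_probability > 0.5)
```

With image classifiers the same idea is applied per pixel with a much smaller step, which is why the perturbed image can look unchanged to a human while the predicted class flips.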

Adversarial Stop Sign

How much can we trust large language models?

Large Language Models (LLMs, e.g., GPT-3, TNLG, T5) have been shown to achieve remarkably high performance on standard benchmarks, due to their high parameter count, extremely large training datasets, and significant compute. Although the high parameter count makes these models more expressive, it can also lead to more memorization, which, coupled with large, unvetted, web-scraped datasets, can cause several negative societal and ethical impacts:

leakage of private, sensitive information — i.e. LLMs are ‘leaky’. ‘Leakage’ is being able to learn information about the training data, which cannot be learned from other models/data (from the same distribution).
generation of biased text — i.e. LLMs are ‘sneaky’, and
generation of hateful or stereotypical text — i.e. LLMs are ‘creepy’.

Carlini et al., Extracting Training Data from Large Language Models, USENIX Security 2021.

This talk also discussed topics such as measuring Leakage in Pre-training Large Language Models and Masked Language Models, Membership Inference Attacks, Reference-based attacks and Shadow-model based attacks.
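The intuition behind a reference-based membership inference attack is that a training sample looks unusually likely under the target model compared with a reference model that never saw it. Below is a minimal sketch of that likelihood-ratio test. The per-token log-probabilities are invented numbers standing in for real model outputs, and the function names and threshold are assumptions of mine.

```python
def sequence_log_likelihood(token_log_probs):
    """Total log-likelihood of a sequence from its per-token log-probabilities."""
    return sum(token_log_probs)

def membership_score(target_log_probs, reference_log_probs):
    """Log-likelihood ratio between the target model and a reference model."""
    return (sequence_log_likelihood(target_log_probs)
            - sequence_log_likelihood(reference_log_probs))

def is_predicted_member(target_log_probs, reference_log_probs, threshold=1.0):
    # A large ratio means the target model finds the text far more likely
    # than the reference does, which hints at memorization.
    return membership_score(target_log_probs, reference_log_probs) > threshold

# Hypothetical log-probabilities for two three-token samples.
memorized_target, memorized_reference = [-0.1, -0.2, -0.1], [-2.0, -1.5, -2.5]
unseen_target, unseen_reference = [-2.1, -1.6, -2.4], [-2.0, -1.5, -2.5]

print(is_predicted_member(memorized_target, memorized_reference))
print(is_predicted_member(unseen_target, unseen_reference))
```

Using a reference model calibrates for text that is simply easy to predict, so a common phrase is not flagged as a training member just because any model would assign it high likelihood.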

Relevant Links: Slides, Recording

Membership Inference Attack on a Masked Language Model

While the talk did not discuss ways to mitigate memorization of training samples by LLMs, when looking at the recent research, I came across this relevant paper: SubMix: Practical Private Prediction for Large-scale Language Models. SubMix limits the leakage of information that is unique to any individual user in the private corpus, and is the first protocol that maintains strong privacy guarantees even when publicly releasing tens of thousands of next-token predictions made by large transformer-based models such as GPT-2.

5. Model Monitoring

Model Monitoring in Practice: Lessons Learned and Open Challenges

ML models are often deployed to automate decisions and critical business processes. A model’s behavior is determined by the picture of the world it was “trained” against — but real-world data can diverge from the picture it “learned.” Consequently, it becomes critical to ensure that these models are making accurate predictions, are robust to shifts in the data, are not relying on spurious features, and are not unduly discriminating against minority groups.

This talk started by motivating the need for ML model monitoring as part of a broader AI model governance and responsible AI framework, and provided a roadmap for thinking about model monitoring in practice.
To achieve model understanding, one approach is to build inherently interpretable predictive models, but this is not always possible. The other approach is to explain pre-built models in a post-hoc manner using local explanations (example: LIME) or global explanations (example: model distillation). The talk also discussed some open-source and commercial tools for model assessment and monitoring.
The final part of the talk covered a case study of monitoring in practice: [Alexa AI] Error Detection in Large-Scale Conversational Assistants through Offline Models (related paper). Two techniques for monitoring and detecting errors in a large-scale conversational assistant were presented. Both systems rely on an offline Transformer-based model to detect errors in the online system.

Relevant Links: Slides, Recording

Engineering for Fairness in ML Lifecycle (paper)

Amazon SageMaker Model Monitor: A System for Real-Time Insights into Deployed Machine Learning Models

Continuous monitoring of production models can be used to identify the right time and frequency to retrain the model. The authors propose a model monitoring service that can automatically detect data, concept, bias, and feature attribution drift in models in real-time. At a regular frequency, the system automatically analyses the collected data based on user-provided rules to determine if there are any rule violations. The system can also provide alerts so that model owners can take corrective actions and thereby maintain high quality models.

One use-case described in the paper is detecting data drift in NLP models:

NLP encoders like word2vec, BERT and RoBERTa operate by converting input words or sequences of words into word-level embeddings. These embeddings are then used by downstream task specific models. A change in the distribution of the input text can clearly impact the performance of the downstream model.
Unlike tabular data which is often fixed dimensional and bounded, text data is often free form, which makes monitoring challenging. To overcome this challenge, the authors demonstrate monitoring the embeddings of the text data as opposed to the raw text itself. The figure below shows an example of configuring a custom monitoring schedule to detect drifts in text data.
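As a rough idea of what monitoring embeddings instead of raw text can look like, the sketch below compares the mean embedding of a baseline batch with the mean embedding of a newly collected batch and raises a flag when their cosine distance exceeds a threshold. This is not the SageMaker Model Monitor API: the function names, the threshold of 0.2 and the toy two-dimensional embeddings are all assumptions for illustration.

```python
import math

def mean_embedding(embeddings):
    """Average a batch of embedding vectors dimension by dimension."""
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def detect_embedding_drift(baseline_embeddings, current_embeddings, threshold=0.2):
    """Return (distance, drifted) comparing the two batches' mean embeddings."""
    distance = cosine_distance(mean_embedding(baseline_embeddings),
                               mean_embedding(current_embeddings))
    return distance, distance > threshold

# Hypothetical sentence embeddings from the same encoder.
baseline = [[1.0, 0.0], [0.9, 0.1]]
similar_batch = [[0.95, 0.05], [1.0, 0.0]]
shifted_batch = [[0.0, 1.0], [0.1, 0.9]]

similar_distance, similar_drifted = detect_embedding_drift(baseline, similar_batch)
shifted_distance, shifted_drifted = detect_embedding_drift(baseline, shifted_batch)
print(similar_drifted, shifted_drifted)
```

A check like this could run on a schedule over the captured inference inputs, with the threshold tuned on historical data so that normal day-to-day variation does not trigger alerts.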

Relevant Links: Paper

High level architecture of “Model Monitor” component

6. Interesting Papers and Talks from Other Topics
