In today’s digital age, with technical constraints and legal regulations constantly increasing, the need to contextualize content in order to personalize ads while preserving user privacy has never been more important. Moreover, advertisers strive to maintain a positive and trustworthy association between their brand and the content in which their ads are displayed. Categorizing text, images and video content can therefore provide a better user experience through better contextual ads, while offering better performance and guarantees to advertisers. In particular, using image categorization of a website’s content to place meaningful adjacent ads can be beneficial for all stakeholders.
The ability to classify and understand images has thus become increasingly valuable across various industries. Whether for content organization, targeted advertising, or customer analysis, image classification plays a crucial role in extracting meaningful insights from visual data.
In this blog post, we will explore the process of fine-tuning a Vision Transformer (ViT) model from the Hugging Face library for image classification into the categories of the Interactive Advertising Bureau (IAB) Tech Lab’s Content Taxonomy, the standard in the ad industry. Vision Transformers have recently demonstrated remarkable success in computer vision, making them an intriguing choice for image classification. By leveraging the power of pre-trained models and aligning the predictions with the IAB taxonomy, we can unlock the potential to accurately categorize images.
Throughout this post, we will delve into the fundamental concepts of ViTs, explaining their architecture and advantages over traditional computer vision techniques. Then, we will introduce the IAB taxonomy categories, highlighting their significance in organizing visual content. Afterwards, we will reach the core of the blog post: the fine-tuning process. Starting with selecting a pre-trained model from the Hugging Face model hub, we will quickly implement data preprocessing, model training, and evaluation leveraging the power of Amazon SageMaker Studio.
Understanding Vision Transformers
Vision Transformers, introduced in 2021 by Dosovitskiy et al., have emerged as a powerful alternative to traditional architectures like convolutional neural networks (CNNs) in computer vision tasks. Transformers were initially designed for natural language processing (NLP), yet they have since demonstrated remarkable success in image classification.
A pure Vision Transformer breaks an image down into smaller patches and treats them as tokens (similar to words in NLP). Each patch is linearly embedded and position embeddings are added; to perform classification, an extra learnable “classification token” ([CLS]) is prepended to the sequence. The resulting sequence of vectors is fed to a standard Transformer encoder and processed through multiple layers of self-attention.
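To make this concrete, here is a minimal sketch (using the same “google/vit-base-patch16-224-in21k” checkpoint we fine-tune later) that runs a single image through the base, headless ViT encoder and prints the shape of the resulting token sequence: a 224x224 image split into 16x16 patches yields 196 patch tokens plus the [CLS] token, each encoded as a 768-dimensional vector.
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel
# Minimal sketch: encode one image with the base (headless) ViT encoder
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
image = Image.new("RGB", (640, 480))  # any RGB image works here
inputs = feature_extractor(images=image, return_tensors="pt")  # resized and normalized to 224x224
with torch.no_grad():
    outputs = model(**inputs)
# (224 / 16)^2 = 196 patch tokens + 1 [CLS] token = 197 tokens of dimension 768
print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])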
Extensive work has been done to compare ViTs to state-of-the-art CNNs on image classification tasks (Maurício et al., 2023). Let’s briefly cover how the two architectures perform and what distinguishes them.
Compared to CNNs, ViTs offer better scalability and generalization capabilities. When pre-trained on large amounts of data and then transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB), ViT attains excellent results compared to state-of-the-art CNNs, while requiring substantially fewer computational resources to train.
Thus, by pre-training ViTs on large-scale datasets such as ImageNet-21k (14 million images, 21,843 classes), we can leverage the learned representations to adapt to specific image classification tasks through fine-tuning, thereby reducing the need for extensive training from scratch, thus decreasing training time and overall costs.
However, CNNs do achieve excellent results when trained from scratch on smaller data volumes than those required by Vision Transformers. This difference in behavior seems to be due to the inductive biases present in CNNs, which these networks exploit to grasp the particularities of the analyzed images more rapidly, even though those same biases make it harder for them to capture global relations. ViTs are free from these biases, which enables them to capture long-range dependencies in images, at the cost of needing more training data. Their self-attention mechanism allows the model to capture both global and local dependencies within the image, enabling it to understand the contextual relationships between different regions.
Among other notable differences between the two architectures, ViTs are more resilient than CNNs to images with natural or adverse perturbations, they are more robust to adversarial attacks, and CNNs are more sensitive to high-frequency features.
Overall, transformer-based architectures, or combinations of ViTs with CNNs, achieve better accuracy than pure CNN networks, largely thanks to the self-attention mechanism. The ViT architecture is also lighter than CNNs, consuming fewer computational resources and requiring less training time, and it is more robust than CNN networks on images that are noisy or augmented.
On the other hand, CNNs can generalize better and reach higher accuracy than ViTs when working with smaller datasets, whereas ViTs have the advantage of learning more from fewer images when fine-tuning. Even so, it has been noted that ViT performance may struggle to generalize when trained on small image datasets.
Understanding the IAB Content Taxonomy
The Interactive Advertising Bureau (IAB) has developed a comprehensive taxonomy that plays a pivotal role in organizing and categorizing digital content. Their taxonomy provides a standardized framework for content classification, enabling efficient content targeting, contextual advertising, and audience segmentation.
The IAB taxonomy is structured hierarchically, with several levels of categories and subcategories. At the top level, it covers broad content categories such as Arts & Entertainment, Sports, and more.
Each category is then further divided into more specific subcategories, creating a more precise and granular classification system that can capture a wide range of content types. For example, within the Sports category, there are subcategories like Basketball, Soccer and Tennis.
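To picture how such a hierarchy can be represented in code, here is a purely illustrative Python snippet using only the example categories mentioned above (not the full official taxonomy):
# Purely illustrative Tier 1 -> Tier 2 excerpt; refer to the official IAB Tech Lab
# Content Taxonomy for the complete, authoritative list of categories.
iab_taxonomy_excerpt = {
    "Arts & Entertainment": [],  # Tier 2 subcategories omitted in this sketch
    "Sports": ["Basketball", "Soccer", "Tennis"],
}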
In the digital advertising industry, the IAB taxonomy serves various purposes. It helps advertisers align their campaigns with specific content categories, ensuring that their advertisements are shown in relevant contexts. Content publishers can use the taxonomy to categorize their content effectively, making it easier for users to discover relevant information. Moreover, the IAB taxonomy facilitates audience segmentation by enabling advertisers to target specific categories that align with their target audience’s interests.
In the following section, we will explore how to fine-tune a ViT model to accurately classify images into the specific IAB taxonomy categories.
Fine-Tuning a Vision Transformer Model
Fine-tuning a Vision Transformer involves taking a model pre-trained on a large-scale dataset and adapting it to a smaller downstream task, such as image classification into the IAB taxonomy categories. This process allows us to leverage the knowledge and representations learned by the pre-trained model, saving time and computational resources. To do so, we remove the pre-trained prediction head and attach a new feedforward layer with K outputs, where K is the number of downstream classes.
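As a quick illustration of this head replacement (a sketch only, separate from the training code later in this post), loading a checkpoint with ViTForImageClassification and a custom num_labels attaches a freshly initialized linear head on top of the [CLS] representation:
from transformers import ViTForImageClassification
# Sketch: num_labels=10 is an arbitrary example value for K
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224-in21k", num_labels=10)
# The pre-training head is discarded and replaced by a new linear layer with K outputs
print(model.classifier)  # Linear(in_features=768, out_features=10, bias=True)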
To fine-tune the ViT model we will be using the Hugging Face library. This library offers a wide range of pre-trained vision transformer models and provides an easy-to-use interface for fine-tuning and deploying these models. By utilizing the Hugging Face library, we can take advantage of their model availability, easy integration, strong community support, and the benefits of transfer learning.
Since we wanted to tackle this problem with supervised training using the IAB categories as labels, we needed a properly labeled dataset. For this, we collected and labeled an in-house dataset of commercial-use allowed images. As a baseline, each image is labeled with a single Tier 1 IAB category. The dataset consists of image files organized in folders per category, so that we can easily load it into the appropriate format using the Hugging Face datasets library.
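For reference, the expected layout looks something like the following (the category and file names here are just illustrative):
dataset/
    Books and Literature/
        image_001.jpg
        image_002.jpg
        ...
    Education/
        image_003.jpg
        ...
    Sports/
        image_004.jpg
        ...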
Using SageMaker’s Studio capabilities, we can easily launch a notebook backed with a GPU instance, such as a ml.g4dn.xlarge, to conduct our experiment of training a baseline classification model.
To get started, let’s first install the required libraries from Hugging Face and PyTorch into our notebook’s virtual environment:
!pip install transformers==4.26.1 datasets==2.10.1 evaluate==0.4.0 -q
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -q
Note that the exclamation mark (!) allows us to run shell commands from inside a Jupyter Notebook code cell.
Now, let’s do the necessary imports for the notebook to run properly:
import evaluate
import json
import numpy as np
import os
import pandas as pd
import pyarrow as pa
import requests
import torch
from datasets import load_dataset, load_from_disk, Dataset, Features, Array3D
from io import BytesIO
from transformers import AutoProcessor, ViTFeatureExtractor, ViTForImageClassification, Trainer, TrainingArguments, default_data_collator
from typing import Tuple
from PIL import Image
Let’s define some variables:
# The directory where our images are saved in folders by category
images_dir = "./dataset"
# The output directory of the processed datasets
train_save_path = "./processed-datasets/train"
val_save_path = "./processed-datasets/val"
test_save_path = "./processed-datasets/test"
# Sizes of dataset splits
val_size = 0.2
test_size = 0.1
# Name of model as named in the HuggingFace Hub
model_name = "google/vit-base-patch16-224-in21k"
We leverage the Hugging Face datasets ImageFolder feature to easily load our custom dataset by specifying the local folder where the images are stored in subfolders per category.
dataset = load_dataset("imagefolder", data_dir=images_dir, split='train')
Let’s perform some data cleaning and remove the non-RGB (single-channel, grayscale) images from the dataset.
# Remove from dataset images which are non-RGB (single-channel, grayscale)
condition = lambda data: data['image'].mode == 'RGB'
dataset = dataset.filter(condition)
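As a side note, if you would rather keep those images than drop them, an alternative (not what we do for this baseline) is to convert them to RGB on the fly:
# Alternative to filtering: convert every image to RGB instead of dropping non-RGB ones
dataset = dataset.map(lambda example: {"image": example["image"].convert("RGB")})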
Now we split our dataset into Train, Validation and Test sets.
def split_dataset(
    dataset: Dataset,
    val_size: float = 0.2,
    test_size: float = 0.1
) -> Tuple[Dataset, Dataset, Dataset]:
    """
    Returns a tuple with three random train, validation and test subsets by splitting the passed dataset.
    Size of the validation and test sets defined as a fraction of 1 with the `val_size` and `test_size` arguments.
    """
    print("Splitting dataset into train, validation and test sets...")
    # Split dataset into train and (val + test) sets
    split_size = round(val_size + test_size, 3)
    dataset = dataset.train_test_split(shuffle=True, test_size=split_size)
    # Split (val + test) into val and test sets
    split_ratio = round(test_size / (test_size + val_size), 3)
    val_test_sets = dataset['test'].train_test_split(shuffle=True, test_size=split_ratio)
    train_dataset = dataset["train"]
    val_dataset = val_test_sets["train"]
    test_dataset = val_test_sets["test"]
    return train_dataset, val_dataset, test_dataset
# Split dataset into train, validation and test sets
train_dataset, val_dataset, test_dataset = split_dataset(dataset, val_size, test_size)
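A quick sanity check on the resulting split sizes (with val_size=0.2 and test_size=0.1 we expect roughly a 70/20/10 split):
# Roughly 70% / 20% / 10% of the filtered dataset
print(len(train_dataset), len(val_dataset), len(test_dataset))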
Finally, we can prepare these images for our model.
When ViT models are trained, specific transformations are applied to images being fed into them so they fit the expected input format.
To make sure we apply the correct transformations, we will use an ‘AutoProcessor’ initialized with a configuration that was saved to Hugging Face’s Hub along with the pretrained “google/vit-base-patch16-224-in21k” model we plan to use.
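If you are curious about what these transformations actually are, a quick way to check is to load the processor and print it; for this checkpoint it resizes images to 224x224 and normalizes the pixel values with the mean and standard deviation stored alongside the model.
from transformers import AutoProcessor
# Printing the processor shows its configuration (resizing and normalization settings)
image_processor = AutoProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
print(image_processor)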
So, we preprocess our datasets with the model’s image AutoProcessor:
def process_examples(examples, image_processor):
    """Processor helper function. Used to process batches of images using the
    passed image_processor.

    Parameters
    ----------
    examples
        A batch of image examples.
    image_processor
        A HuggingFace image processor for the selected model.

    Returns
    -------
    examples
        A batch of processed image examples.
    """
    # Get batch of images
    images = examples['image']
    # Preprocess
    inputs = image_processor(images=images)
    # Add pixel_values
    examples['pixel_values'] = inputs['pixel_values']
    return examples
def apply_processing(
    model_name: str,
    train_dataset: Dataset,
    val_dataset: Dataset,
    test_dataset: Dataset
) -> Tuple[Dataset, Dataset, Dataset]:
    """
    Apply model's image AutoProcessor to transform train, validation and test subsets.
    Returns train, validation and test datasets with `pixel_values` in torch tensor type.
    """
    # Extend the features
    features = Features({
        **train_dataset.features,
        'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    })
    # Instantiate image_processor
    image_processor = AutoProcessor.from_pretrained(model_name)
    # Preprocess images
    train_dataset = train_dataset.map(process_examples, batched=True, features=features, fn_kwargs={"image_processor": image_processor})
    val_dataset = val_dataset.map(process_examples, batched=True, features=features, fn_kwargs={"image_processor": image_processor})
    test_dataset = test_dataset.map(process_examples, batched=True, features=features, fn_kwargs={"image_processor": image_processor})
    # Set to torch format for training
    train_dataset.set_format('torch', columns=['pixel_values', 'label'])
    val_dataset.set_format('torch', columns=['pixel_values', 'label'])
    test_dataset.set_format('torch', columns=['pixel_values', 'label'])
    # Remove unused column
    train_dataset = train_dataset.remove_columns("image")
    val_dataset = val_dataset.remove_columns("image")
    test_dataset = test_dataset.remove_columns("image")
    return train_dataset, val_dataset, test_dataset
# Apply AutoProcessor
train_dataset, val_dataset, test_dataset = apply_processing(model_name,
train_dataset, val_dataset, test_dataset)
We now save our processed datasets:
# Save train, validation and test preprocessed datasets
train_dataset.save_to_disk(train_save_path, num_shards=1)
val_dataset.save_to_disk(val_save_path, num_shards=1)
test_dataset.save_to_disk(test_save_path, num_shards=1)
Let’s proceed to the training step. We begin by loading the train and validation datasets.
train_dataset = load_from_disk(train_save_path)
val_dataset = load_from_disk(val_save_path)
Our new fine-tuned model will have the following number of output classes:
num_classes = train_dataset.features["label"].num_classes
Now, let’s define a ‘ViTForImageClassification’, which places a linear layer (nn.Linear) on top of a pre-trained ViT model. The linear layer is placed on top of the last hidden state of the [CLS] token, which serves as a good representation of an entire image.
We also specify the number of output neurons by setting the `num_labels` parameter.
# Download model from model hub
model = ViTForImageClassification.from_pretrained(model_name, num_labels=num_classes)
# Download feature extractor from hub
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)
Let’s proceed by defining a `compute_metrics` function. This function is used to compute any defined target metrics at every evaluation step.
Here, we use the `accuracy` metric from the `evaluate` library, which can easily be used to compare the predictions with the expected labels on the validation set.
We also define a custom metric to compute the accuracy at “k” for our predictions. The accuracy at “k” is the fraction of instances for which the real label is among the “k” most probable predicted classes.
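As a toy illustration of the metric with made-up numbers (not part of the training code), consider two samples and four classes:
import numpy as np
# Toy example: 2 samples, 4 classes, k = 3
probs = np.array([[0.05, 0.20, 0.60, 0.15],   # 3 most probable classes: 2, 1, 3
                  [0.05, 0.15, 0.30, 0.50]])  # 3 most probable classes: 3, 2, 1
labels = np.array([1, 0])
top_k = np.argsort(probs, axis=1)[:, -3:]     # indices of the 3 most probable classes per sample
hits = [label in row for row, label in zip(top_k, labels)]
print(sum(hits) / len(labels))  # 0.5: label 1 is in the first sample's top 3, label 0 is not in the second's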
# K for top accuracy metric
k_for_top_acc = 3
# Load the accuracy metric
acc_metric = evaluate.load("accuracy", module_type="metric")
def compute_metrics(eval_pred):
    predicted_probs, labels = eval_pred
    # Accuracy
    predicted_labels = np.argmax(predicted_probs, axis=1)
    acc = acc_metric.compute(predictions=predicted_labels, references=labels)
    # Top-K Accuracy
    top_k_indexes = [np.argpartition(row, -k_for_top_acc)[-k_for_top_acc:] for row in predicted_probs]
    top_k_classes = [top_k_indexes[i][np.argsort(row[top_k_indexes[i]])] for i, row in enumerate(predicted_probs)]
    top_k_classes = np.flip(np.array(top_k_classes), 1)
    acc_k = {
        "accuracy_k": sum([label in predictions for predictions, label in zip(top_k_classes, labels)]) / len(labels)
    }
    # Merge metrics
    acc.update(acc_k)
    return acc
As we would like the actual class names as outputs, rather than just integer indexes, we set the ‘id2label’ and ‘label2id’ mappings as attributes of the model’s configuration (which can be accessed as model.config):
# Map integer class indexes to the dataset's label names (and vice versa)
class_names = train_dataset.features["label"].names
id2label = {index: name for index, name in enumerate(class_names)}
label2id = {name: index for index, name in enumerate(class_names)}
model.config.id2label = id2label
model.config.label2id = label2id
Next, we specify the output directories of the model and other artifacts, and the set of hyperparameters we’ll use for training.
model_dir = "./model"
output_data_dir = "./outputs"
# Total number of training epochs to perform
num_train_epochs = 15
# The batch size per GPU/TPU core/CPU for training
per_device_train_batch_size = 32
# The batch size per GPU/TPU core/CPU for evaluation
per_device_eval_batch_size = 64
# The initial learning rate for AdamW optimizer
learning_rate = 2e-5
# Number of steps used for a linear warmup from 0 to learning_rate
warmup_steps = 500
# The weight decay to apply to all layers except all bias and LayerNorm weights in AdamW optimizer
weight_decay = 0.01
main_metric_for_evaluation = "accuracy"
We just need to define two more things before we can start training.
First, the ‘TrainingArguments’, a class that contains all the attributes to customize the training. There we set evaluation and checkpointing to happen at the end of each epoch, define the output directories, and set our hyperparameters (such as the learning rate and batch sizes).
Then, we create a ‘Trainer’, to which we pass the model, the ‘TrainingArguments’, the ‘compute_metrics’ function, the datasets, the data collator and the feature extractor.
# Define training args
training_args = TrainingArguments(
output_dir = model_dir,
num_train_epochs = num_train_epochs,
per_device_train_batch_size = per_device_train_batch_size,
per_device_eval_batch_size = per_device_eval_batch_size,
warmup_steps = warmup_steps,
weight_decay = weight_decay,
evaluation_strategy = "epoch",
save_strategy = "epoch",
logging_strategy = "epoch",
logging_dir = f"{output_data_dir}/logs",
learning_rate = float(learning_rate),
load_best_model_at_end = True,
metric_for_best_model = main_metric_for_evaluation,
)
# Create Trainer instance
trainer = Trainer(
model = model,
args = training_args,
compute_metrics = compute_metrics,
train_dataset = train_dataset,
eval_dataset = val_dataset,
data_collator = default_data_collator,
tokenizer = feature_extractor
)
Start training by calling ‘trainer.train()’:
trainer.train()
By inspecting the training metrics, we can see how loss and accuracy stabilize by the end of training.
log_history = pd.DataFrame(trainer.state.log_history)
log_history = log_history.fillna(0)
log_history = log_history.groupby(['epoch']).sum()
log_history
log_history[["loss", "eval_loss", "eval_accuracy",
"eval_accuracy_k"]].plot(subplots=True)I I
Figure 3: Loss and accuracy after fine-tuning
Now that we are satisfied with our results, we just have to save the trained model.
trainer.save_model(model_dir)
We finally have our fine-tuned ViT model. Now, we need to verify its performance on the test set. Let’s first reload our model, create a new Trainer instance and then call the `evaluate` method to validate the performance of the model on the test set.
# Load dataset
test_dataset = load_from_disk(test_save_path)
# Load trained model
model = ViTForImageClassification.from_pretrained('./model')
# Load feature extractor
feature_extractor = ViTFeatureExtractor.from_pretrained('./model')
# Create Trainer instance
trainer = Trainer(
model=model,
compute_metrics=compute_metrics,
data_collator=default_data_collator,
tokenizer=feature_extractor
)
# Evaluate model
eval_results = trainer.evaluate(eval_dataset=test_dataset)
# Write eval_results to a file which can be accessed later
os.makedirs(output_data_dir, exist_ok=True)
with open(os.path.join(output_data_dir, "eval_results.json"), "w") as writer:
    print(f"Logging evaluation results at {output_data_dir}/eval_results.json")
    writer.write(json.dumps(eval_results))
print(json.dumps(eval_results, indent=4))
{
"eval_loss": 2.441051483154297,
"eval_accuracy": 0.48269581056466304,
"eval_accuracy_k": 0.7103825136612022,
"eval_runtime": 1.9207,
"eval_samples_per_second": 285.83,
"eval_steps_per_second": 35.924
}
Let’s see our model’s performance in the wild. We’ll get a random image from the web and see how the model predicts the most likely labels for it:
# Get test image from the web
test_image_url = 'https://media.cnn.com/api/v1/images/stellar/prod/111024080409-steve-jobs-book.jpg'
response = requests.get(test_image_url)
test_image = Image.open(BytesIO(response.content))
# Resize and display the image
aspect_ratio = test_image.size[0] / test_image.size[1]
max_height = 250
resized_width = int(max_height * aspect_ratio)
resized_img = test_image.resize((resized_width, max_height))
display(resized_img)
# Predict the top k classes for the test image
inputs = feature_extractor(images=test_image, return_tensors="pt").to("cuda")
outputs = model(**inputs)
logits = outputs.logits
top_classes = torch.topk(logits, k_for_top_acc).indices.flatten().tolist()
for i, class_idx in enumerate(top_classes):
    print(str(i + 1), "- Predicted class:", model.config.id2label[class_idx])
Results
1 - Predicted class: Education
2 - Predicted class: Books and Literature
3 - Predicted class: Productivity
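If you also want a confidence score next to each predicted label, a small addition (reusing the logits and top_classes variables from the snippet above) is to pass the logits through a softmax:
# Turn logits into probabilities to report a confidence score per predicted class
probabilities = torch.nn.functional.softmax(logits, dim=-1)
for class_idx in top_classes:
    print(model.config.id2label[class_idx], f"- confidence: {probabilities[0, class_idx].item():.3f}")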
Conclusion & Future Work
By following these steps and leveraging the capabilities of fine-tuning and transfer learning with ViTs, we demonstrate we can achieve a good baseline image classification into the IAB taxonomy categories. This opens up opportunities for various applications such as content targeting, contextual advertising, content personalization and audience segmentation.
With the availability of pre-trained models and user-friendly libraries like Hugging Face, fine-tuning vision transformers has become more accessible and efficient.
As we conclude, while we are happy with the baseline performance we achieved, it’s essential to consider possible future work and improvements to the model. Here are some areas to focus on:
Hyperparameter Tuning: Perform systematic hyperparameter search to identify optimal values, including learning rate, batch size, and weight decay.
Model Architecture: Explore different transformer architectures, such as hybrid models combining transformers with CNNs, to further enhance performance.
Data Augmentation: Apply techniques like random cropping, rotation, flipping, and color jittering to artificially expand the training dataset and improve model generalization (a minimal sketch follows this list).
Dataset Improvement: Train the model on larger and more task-specific datasets. Training sets may vary depending on the task, for example on whether we are categorizing web or video content. In any case, collecting additional annotated images, especially for challenging or underrepresented categories, can boost the model’s performance.
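As an example of the data augmentation point, here is a minimal sketch with torchvision (the chosen transforms and parameter values are illustrative, not tuned), which would be applied to the raw images before they are passed to the image processor:
from torchvision import transforms
# Illustrative augmentation pipeline for the raw training images (parameters not tuned)
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
])
def augment_examples(examples):
    # Apply the augmentations to each PIL image in the batch
    examples["image"] = [augment(image) for image in examples["image"]]
    return examples
# For instance, applied to the raw training split before the AutoProcessor step
# (for fresh augmentations every epoch, datasets' set_transform would be used instead):
# train_dataset = train_dataset.map(augment_examples, batched=True)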
Finally, remember to stay tuned to the latest developments in the field, as new techniques and advancements continue to shape the world of computer vision and artificial intelligence.
Thank you for reading, and if you have any questions or feedback, please feel free to reach out to us. Happy fine-tuning and image classification with ViTs and Hugging Face!
Sources
An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021)
Do Vision Transformers See Like Convolutional Neural Networks? (Raghu et al., 2021, arXiv:2108.08810)