Sarcasm Detection With Transformers

Sarcasm detection is notoriously difficult for machines. Why is that? Well, sarcasm is a complex and nuanced form of communication. It often relies on context, tone, and shared knowledge between speaker and listener. It can be subtle, and it can be highly variable across cultures and languages. Just consider the following aspects of sarcasm:

  • Sarcasm is often context dependent. It relies on knowledge that goes beyond the text itself. To detect sarcasm, one needs to understand the situation, the relationship between speaker and audience, and any cultural background needed to grasp the speaker’s intent.
  • Sarcasm is often subtle. It can even be hard for humans to detect sarcasm in text, let alone machines.
  • Sarcasm comes in many forms. It can be expressed through irony, understatement, or even a simple word choice, and a given speaker’s sarcastic style can be highly individual.
  • Sarcasm is highly variable across languages and cultures. What is sarcastic in one culture may not be in another, even when both speak the same language; one need only travel between regions of the United States, all within English, to see this.
  • Sarcasm can contain highly figurative and abstract language. What is said is often not what is meant. This is a challenge for machines that rely on literal meaning to understand text.
  • Sarcasm is subject to interpretation. Different human annotators may disagree on whether a given text is sarcastic, which makes it difficult to create a gold standard for training and evaluating sarcasm detection models. Should one rely on consensus, or try to match the distribution of how a text is interpreted by different people? (The sketch after this list makes the distinction concrete.)
  • Sarcasm often depends on visual or non-verbal cues. These are not present in text, making it even more difficult for machines to detect sarcasm.
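To make the annotation question concrete, here is a minimal sketch of the two labeling strategies, using made-up annotator votes:

from collections import Counter

# Hypothetical example: five annotators judge one headline (1 = sarcastic, 0 = not)
votes = [1, 1, 0, 1, 0]

# Option A: consensus -- collapse to a single hard label by majority vote
hard_label = Counter(votes).most_common(1)[0][0]  # 1

# Option B: match the distribution of interpretations -- keep a soft label
soft_label = sum(votes) / len(votes)  # 0.6

A hard label is what most classification pipelines expect; a soft label preserves the disagreement but requires a loss function that can consume probabilities.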

In many ways, one might even debate whether sarcasm detection is a well-defined problem. It is not clear that there is a single “correct” interpretation of a given text as sarcastic or not. What counts as sarcasm also shifts over time, so any static dataset will age and become less useful. Let’s keep these caveats in mind as we explore sarcasm detection with machine learning.

With all of that in mind, I thought it would be interesting to see how well a pre-trained Transformer model can detect sarcasm. In this post, I’ll use the Hugging Face transformers library to load a pre-trained model, fine-tune it on a sarcasm detection dataset, and evaluate its performance on a held-out test set.

Sarcasm Detection Dataset - News Headlines

When looking for a dataset to use as a test case for sarcasm detection, I stumbled upon Sarcasm News Headlines. It’s a dataset that combines headlines from The Onion and The Huffington Post: The Onion is a satirical news site that publishes sarcastic headlines, while The Huffington Post publishes real news. It comes pre-split into train/test and is a reasonable size (about 55k examples).

Let’s download the dataset and look at a couple of samples:

from pprint import pprint
from datasets import load_dataset
dataset = load_dataset("raquiba/Sarcasm_News_Headline")
pprint(dataset["train"][0])
pprint(dataset["train"][1])
{'article_link': 'https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205',
 'headline': 'thirtysomething scientists unveil doomsday clock of hair loss',
 'is_sarcastic': 1}
{'article_link': 'https://www.huffingtonpost.com/entry/donna-edwards-inequality_us_57455f7fe4b055bb1170b207',
 'headline': 'dem rep. totally nails why congress is falling short on gender, '
             'racial equality',
 'is_sarcastic': 0}

So we see that our source text is in the headline field and the label is in the is_sarcastic field. It’s nice that the URL of the source article is included too, but we won’t use it in our model. Note the risk here: since the dataset is scraped from only two domains, with one assumed to be sarcastic and the other not, a model can reach “perfect” performance simply by learning to tell the two publications apart, without learning anything about sarcasm.
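Before we drop article_link, we can quickly confirm that hunch: tallying (source domain, label) pairs should show that the label is fully determined by the publication. A small sanity check, assuming every example carries a well-formed URL:

from urllib.parse import urlparse
from collections import Counter

# Tally (source domain, label) pairs; if each domain maps to exactly one label,
# the label is trivially predictable from the source site alone
pairs = Counter(
    (urlparse(example["article_link"]).netloc, example["is_sarcastic"])
    for example in dataset["train"]
)
print(pairs)

With that in mind, let’s rename the columns to the usual text/label convention and drop the URL: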

# Rename `headline` -> `text` and `is_sarcastic` -> `label`; drop `article_link`
dataset = dataset.map(
    lambda example: {
        "text": example["headline"],
        "label": example["is_sarcastic"],
    },
    remove_columns=["headline", "article_link", "is_sarcastic"],
)
dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 28619
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 26709
    })
}) 
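It’s also worth a quick look at the class balance, since a heavily skewed split would make accuracy a misleading metric:

from collections import Counter

# Label counts per split (0 = not sarcastic, 1 = sarcastic)
for split in ("train", "test"):
    print(split, Counter(dataset[split]["label"]))

The classes here come out roughly balanced, so plain accuracy is a reasonable headline metric.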

Fine-Tuning a Pre-Trained Transformer Model

Nothing fancy to start: I’ll load the pre-trained FacebookAI/roberta-base model with the transformers library and fine-tune it on the sarcasm detection dataset.

Let’s load the model and tokenizer:

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
model_name = "FacebookAI/roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Next, let’s tokenize the dataset to prepare it for training:

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

We don’t pad the dataset here because we’ll be using a DataCollatorWithPadding to do that during training. We have rather short sequences, so dynamic padding will help speed up training.
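To see how much dynamic padding buys us, we can peek at the tokenized lengths of a sample (exact numbers will vary with the tokenizer version):

# Token counts for a sample of headlines: typically far below the 512-token cap
sample_ids = tokenized_datasets["train"][:1000]["input_ids"]
lengths = [len(ids) for ids in sample_ids]
print(f"max: {max(lengths)}, mean: {sum(lengths) / len(lengths):.1f}")

Padding each batch to its own longest headline, rather than to 512, keeps the tensors small.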

Now, let’s configure the evaluation metric:

import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
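Accuracy is a reasonable headline number for this dataset, but evaluate can also combine several metrics in one pass if you want a fuller picture. A sketch that adds F1 alongside accuracy, as a drop-in replacement for compute_metrics above:

# Optional: report accuracy and F1 together
combined = evaluate.combine(["accuracy", "f1"])

def compute_metrics_combined(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return combined.compute(predictions=predictions, references=labels)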

And finally, let’s set up the training arguments and train the model:

from datetime import datetime

# Dynamic padding: each batch is padded to its own longest sequence at collation time
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)  # pad_to_multiple_of=8 is optional; it can improve tensor-core utilization

# Set up TrainingArguments and the Trainer, wiring in the data collator
dt_str = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
num_epochs = 5
steps_per_epoch = len(tokenized_datasets["train"]) // 32  # 32 = per_device_train_batch_size below
training_args = TrainingArguments(
    output_dir="./results/" + dt_str,
    num_train_epochs=num_epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=128,
    warmup_steps=min(500, int(0.1 * num_epochs * steps_per_epoch)),  # ~10% of optimizer steps, capped at 500
    weight_decay=0.01,
    logging_dir="./logs/" + dt_str,
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="tensorboard",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

With trainer.train() we can start training the model.
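Concretely (the last command assumes you have TensorBoard installed):

trainer.train()
# Checkpoints are saved to ./results/<timestamp>/checkpoint-* after each epoch;
# follow the training curves with: tensorboard --logdir ./logs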

Evaluating the Model

So how did we do? Let’s consider the test set accuracy after each epoch:

Epoch    Test Set Accuracy
1        0.963
2        0.978
3        0.994
4        0.998
5        0.998

Essentially, the model detects sarcasm with near-perfect accuracy. This is surprising, given the complexity of the task. We should immediately pause and consider what this means.

In my view, what we’ve actually trained is a model that can perfectly detect the difference between The Onion and The Huffington Post. This is not sarcasm detection. It’s domain detection. The model has learned to detect the domain of the source text, and that’s it.

This is a common problem in machine learning: it is easy to train a model that does well on a particular dataset, and much harder to train one that generalizes to new, unseen data.

Interacting with the Model

Let’s test our hypothesis by interacting with the model.

First, let’s load the fine-tuned model from its checkpoint and build a text-classification pipeline, reusing the tokenizer from earlier:

from transformers import pipeline

model = AutoModelForSequenceClassification.from_pretrained('results/2024-02-25_20-24-51/checkpoint-4475')

clf = pipeline('text-classification', model=model, tokenizer=tokenizer)

Now, let’s test the model with some examples.

First, let’s try an Onion article from this week, something I know to be sarcastic and not in the training data. Let’s use “Alabama Supreme Court Justice Invokes ‘VeggieTales’ In Ruling”:

clf("Alabama Supreme Court Justice Invokes ‘VeggieTales’ In Ruling")
[{'label': 'LABEL_0', 'score': 0.99916672706604}]

The model is extremely confident that this is not sarcastic. (LABEL_0 corresponds to is_sarcastic = 0 and LABEL_1 to is_sarcastic = 1; we never set id2label on the model config, so the pipeline reports these default names.)

Let’s try a different Onion article, possibly even more difficult: Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024:

clf("Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024")
[{'label': 'LABEL_0', 'score': 0.9993497729301453}]

Again, the model is very confident that this is not sarcastic. Hmm. Perhaps a model trained on older headlines simply cannot capture the sarcasm of The Onion in 2024.

Let’s try one more Onion article, one that is still recent but more of a low-hanging fruit: Mom Only Likes The Other Outback Steakhouse:

clf("Mom Only Likes The Other Outback Steakhouse")
[{'label': 'LABEL_1', 'score': 0.9997231364250183}]

Finally, a correct prediction! The model is confident that this is sarcastic. So our model can detect sarcasm, but only very specific kinds of it. It is not generalizing to new, unseen data, not even within the same domain.

Let’s also try some headlines from The Huffington Post, which the model should predict as not sarcastic. Here are the five most recent at the time of writing:

clf([
    "Donald Trump Won South Carolina — But There's 1 Big Caveat",
    "Man Sets Himself On Fire In Front Of Israeli Embassy In Washington",
    "Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange",
    "A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.",
    "Climate Change-Fueled Winter Extremes Put 90% Of This Country At 'High Risk'"
])
[{'label': 'LABEL_0', 'score': 0.9993808269500732},
 {'label': 'LABEL_0', 'score': 0.9993786811828613},
 {'label': 'LABEL_0', 'score': 0.9985186457633972},
 {'label': 'LABEL_0', 'score': 0.9993883371353149},
 {'label': 'LABEL_0', 'score': 0.9993487000465393}]

The model is extremely confident that these are not sarcastic.

So the model can detect sarcasm in some cases, but it is not generalizing to new, unseen data, not even within the domains it was trained on. This is a common problem in machine learning: it is easy to train a model that does well on a particular dataset, and much harder to train one that generalizes. Murkier still, we set out to detect sarcasm and really only created a domain classifier. For fuzzy concepts like sarcasm, it is important to be clear about what we are actually detecting, and to collect data at the scale and diversity needed to capture the full range of the concept.
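One practical mitigation is to keep a small, hand-labeled, out-of-domain spot-check set around and score every trained model against it. A minimal sketch; the headlines below are placeholders you would replace with real examples labeled yourself:

# Hypothetical out-of-domain spot check (placeholders, not real data)
spot_check = [
    ("<a sarcastic headline from a site other than The Onion>", 1),
    ("<a literal headline from a site other than HuffPost>", 0),
]
label_map = {"LABEL_0": 0, "LABEL_1": 1}  # default label names; we never set id2label
preds = clf([text for text, _ in spot_check])
n_correct = sum(label_map[p["label"]] == y for p, (_, y) in zip(preds, spot_check))
print(f"{n_correct}/{len(spot_check)} correct out of domain")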

Conclusion

In this post, I explored the use of a pre-trained Transformer model for sarcasm detection. I used the Hugging Face transformers library to fine-tune FacebookAI/roberta-base on a sarcasm detection dataset and evaluated it on a held-out test set, where it reached near-perfect accuracy. Upon further investigation, however, the model turned out not to generalize to new, unseen data: it had learned to detect the domain of the source text, not sarcasm itself. It is easy to train a model that does well on a particular dataset, and much harder to train one that generalizes. For sarcasm detection in particular, it is important to be clear about what we are actually detecting, and to collect data at the scale and diversity needed to capture the full range of the concept.
