Introduction

This has been a great week for natural language processing, and I’m really excited (you should be too)! I happened to browse over to Arxiv Sanity Preserver as I normally do, only instead of being greeted by a barrage of GAN-this or CNN-that, my eyes fell upon two datasets that made me very happy. They were CoQA and QuAC, and in this blog we are going to talk about why CoQA is so exciting.

I also talk about QuAC here if you’re interested in more question answering!

A Conversational Question Answering Challenge

What does a “conversational question answering challenge” even entail? Well, it requires not only reading a passage to uncover answers, but also having a conversation about that information. It requires using context clues in a dialog to uncover what is being asked for. And it requires that information to be rephrased in a way that is abstractive instead of extractive.

Whoa.

This dataset is unlike anything we’ve seen yet in question answering!

The Motivation

So why do we need a dataset like CoQA?

Well, simply because our current datasets just aren’t doing it for us. Humans learn by interacting with one another, asking questions, understanding context, and reasoning about information. Chatbots… Well, chatbots don’t do that. At least not now. And as a result, virtual agents are lacking and chatbots seem stupid (most of the time).

CoQA was created as a dataset to help us measure an algorithm’s ability to participate in a question-answering dialog. Essentially, to measure how conversational a machine can be.

In their paper, the creation of CoQA came with three goals:

  • To create a dataset that features question-answer pairs that depend on a conversation history
    • CoQA is the first dataset to come out to do this at a large scale!
    • After the first question in a dialog, every question is dependent on the history of what was said
    • There are 127,000 questions that span 8,000 dialogs!
  • To create a dataset that seeks natural answers
    • Many question-answering datasets are extractive, searching for lexically similar responses and pulling them directly out of the text
    • CoQA expects a rationale to be extracted from the passage, but the answer itself to be re-phrased and abstractive (see the sketch after this list)
  • To ensure that question-answering systems perform well in a diverse set of domains
    • CoQA has a train set that features 5 domains and a test set that features 7 domains (2 never seen in training!)

And while human performance has an F1 score of 88.8%, the top performing model only achieves an F1 of 65.1%. That is a big margin and a lot of catching up that machines have to do!
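For the curious, that F1 is the word-overlap F1 common to reading-comprehension benchmarks: precision and recall over the bags of words in the predicted and gold answers. Here’s a minimal sketch of the idea (the official evaluation script also normalizes text and takes the best score over multiple reference answers, which I skip here):

```python
from collections import Counter

def word_f1(prediction: str, gold: str) -> float:
    """Word-overlap F1 between a predicted answer and a single gold answer.

    Simplified sketch: the official evaluation also lowercases, strips
    punctuation and articles, and takes the best score over several
    reference answers.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(word_f1("turning 80 years old", "80"))  # 0.4 -- partial credit for overlap
```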

Creating CoQA

How was a dataset like this created?

With the help of Amazon Mechanical Turk, a marketplace that offers people jobs that require human intelligence, such as labelling a dataset. Two AMT workers would be paired together, one asking questions and the other giving answers. The answerer would be asked to highlight the portion of text in the passage that gives the answer and then respond using different words than the ones in the passage (thus becoming abstractive).

What Are the Domains of the Passages?

The passages come from 7 domains, 5 in the training set and 2 reserved for the test set. They are:

The domains of the CoQA dataset

They also went out of their way to collect multiple answers to each question. Since answers are re-phrased in the workers’ own words, having several references gives dialog agents more opportunities to match a correct answer and get a fair score.

Comparing to Other Datasets

What did we have before this?

Well… We had SQuAD, the Stanford Question Answering Dataset. Version 1.0 gave us 100,000+ questions on Wikipedia articles and required models to extract the correct span of text in response to each question. When it was decided that that wasn’t enough, we got Version 2.0, which added 50,000+ unanswerable questions to SQuAD, requiring algorithms to not only find correct answers but also reason about whether an answer even exists in a passage.
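To make the contrast concrete: a SQuAD-style example is purely extractive, meaning the answer is a span copied straight out of the passage, and in Version 2.0 an example can instead be flagged as having no answer at all. Roughly something like this (field names approximate the released JSON, so take them as an assumption):

```python
# Two toy SQuAD 2.0-style entries. Field names approximate the released JSON.
answerable = {
    "question": "What university is the article about?",
    "answers": [{"text": "Stanford University", "answer_start": 23}],  # span copied verbatim
    "is_impossible": False,
}
unanswerable = {
    "question": "When was the university demolished?",
    "answers": [],            # no supporting span exists in the passage
    "is_impossible": True,    # the model has to learn to abstain
}
```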

A comparison of SQuAD and CoQA

This task is still unsolved.

But SQuAD also doesn’t tackle the problems of abstracting information and understanding what is being asked for in a dialog. That’s why CoQA was created, as outlined in its first goal!

More Analysis

The creators of CoQA did an analysis of SQuAD versus CoQA and found that while about half of SQuAD questions are “what” questions, CoQA has a much wider, more balanced distribution of types of questions. They gave us this cool graphic to depict that:
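If you wanted to reproduce a rough version of that breakdown yourself, bucketing questions by their leading word gets you most of the way there. A quick sketch (the example questions are placeholders, not the real data):

```python
from collections import Counter
import string

def question_type_counts(questions):
    """Bucket questions by their leading word (what/who/where/why/...)."""
    firsts = (
        q.strip().lower().split()[0].strip(string.punctuation)
        for q in questions
        if q.strip()
    )
    return Counter(firsts)

# Placeholder questions standing in for the real data.
coqa_like = ["Who had a birthday?", "How old was she turning?", "Why?", "Did she sit down?"]
print(question_type_counts(coqa_like))
# e.g. Counter({'who': 1, 'how': 1, 'why': 1, 'did': 1})
```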

SQuAD and CoQA Question Types

What makes CoQA even more difficult is that sometimes a question is just one word, like “who?” or “where?” or even “why?”

These are totally context dependent! We as humans can’t even begin to try to answer these without knowing the context that they were asked in!

SQuAD and CoQA Answer Lengths

A Linguistic Challenge

In case you haven’t put it together yet, CoQA is very, very difficult. It is filled with things called co-references, which can be something as simple as a pronoun; in general, a co-reference is when two or more expressions refer to the same thing (hence the name!).

This is something that is still an open problem in NLP (co-reference resolution, pronoun resolution, etc.), so to incorporate this step in a question-answering dataset certainly increases the difficulty some more.

About half the questions in CoQA contain an explicit co-reference (some indicator like him, it, her, that). Close to a fifth of the questions contain an implicit co-reference. This is when you ask a question like “where?” We are asking for something to be resolved, but it’s implied. This is very hard for machines. Hell, it can even be hard for people sometimes!
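A crude way to get a feel for the explicit kind is to flag questions that contain one of those pronoun-like markers. Here’s a toy heuristic sketch; the marker list and example questions are purely illustrative, not how the authors actually annotated anything:

```python
# Toy heuristic: flag questions that contain an explicit co-reference marker.
# The marker list and example questions are illustrative only.
MARKERS = {"he", "him", "his", "she", "her", "it", "its", "they", "them", "this", "that"}

def has_explicit_coreference(question: str) -> bool:
    words = {w.strip("?.,!").lower() for w in question.split()}
    return bool(words & MARKERS)

print(has_explicit_coreference("How old would she be?"))  # True: "she" points back in the dialog
print(has_explicit_coreference("Where?"))                 # False: the reference is only implied
```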

Co-references in CoQA

And Now for a Challenge!

How do current models hold up to this challenge? The answer is not well.

See for yourself:

CoQA Scores

Will CoQA do for question answering what ImageNet did for image recognition? Only time will tell, but things are exciting for sure!

So now for a challenge for all of us (just like I posed with QuAC)! We all can push to try and tackle this challenge and take conversational agents to the next evolution. CoQA has a leaderboard which as of now is empty! It is our duty to go out there and try and work on this challenge, try and solve this problem, push knowledge forward, and maybe land on the leaderboard for a little bit of fame and a lot a bit of pride in our work.

I’ll see you all there!