Introduction

If you’re in the field of natural language processing and you’re not excited, you’re about to be! This week, I happened to browse over to Arxiv Sanity Preserver as I normally do, only instead of being greeted by a barrage of GAN-this or CNN-that, my eyes fell upon two datasets that made me very happy. They were CoQA and QuAC, and in this blog we are going to talk about why QuAC is so exciting.

I also talk about CoQA here if you’re interested in more question answering!

The Motivation for QuAC

Why was QuAC created and what is it going to do?

QuAC stems from the idea of a student asking questions of a teacher. The teacher has the knowledge of interest, and the student must hold a dialog with the teacher to get the information they want to know.

The issue? Questions in this scenario are context-dependent, can be abstract, and might not even have an answer. It’s up to the teacher to use what they know to sift through all of this and give the best answer they can.

This is what QuAC wishes to measure, placing machines in the role of teacher so that we may develop intelligent dialog agents.

It does so by compiling 100,000+ questions across 14,000+ dialogs.

[Image: QuAC statistics]
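If you want to poke at those numbers yourself, here is a minimal sketch that counts dialogs and questions in the training file. It assumes you have downloaded the training JSON (e.g. train_v0.2.json from quac.ai) and that the file follows the SQuAD-style layout of sections containing paragraphs containing qas; the field names here are my assumption based on that convention, so double-check them against the actual download.

```python
import json

# Path and field names are assumptions -- adjust to the file you
# actually downloaded from quac.ai.
with open("train_v0.2.json") as f:
    data = json.load(f)["data"]

num_dialogs = 0
num_questions = 0
for section in data:                       # one entry per Wikipedia section
    for paragraph in section["paragraphs"]:
        num_dialogs += 1                   # each paragraph holds one dialog
        num_questions += len(paragraph["qas"])

print(f"{num_dialogs} dialogs, {num_questions} questions")
```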

The Creation Process

How was QuAC created?

With the help of Amazon Mechanical Turk, a marketplace for jobs that require human intelligence, such as labeling a dataset.

Two workers were assigned to a dialog. The teacher could see a Wikipedia passage. The student could see only the title of the article, the introduction paragraph, and the particular sub-heading. From there, the two would begin a dialog, going back and forth so that the student could learn about the article that the teacher had access to. The hope was that, since the student could not see the article, their questions would naturally be lexically different from what the passage contained.

The teacher’s answers, however, had to come straight from the passage, which makes the dataset extractive. Extractive answers are easier to score, for one! They also make evaluation less subjective and more reliable.

Dialogs continued until 12 questions were answered, one of the workers ended the dialog manually, or 2 unanswerable questions were asked in a row.
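To make those stopping rules concrete, here is a small sketch of the termination check as I understand it from the description above. The function and its arguments are my own illustration, not anything from the official collection interface.

```python
def dialog_should_end(answered_count, consecutive_unanswerable, manually_ended):
    """Return True when a QuAC-style dialog should stop.

    Mirrors the three stopping rules described in the text; the names and
    bookkeeping are illustrative, not the authors' actual tooling.
    """
    return (
        answered_count >= 12               # 12 questions have been answered
        or consecutive_unanswerable >= 2   # 2 unanswerable questions in a row
        or manually_ended                  # a worker chose to end the dialog
    )

# Example: 5 answered, 2 unanswerable in a row, nobody ended it manually.
print(dialog_should_end(5, 2, False))  # True
```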

[Image: QuAC conversation example]

What Wikipedia Articles Were Used?

It requires less background knowledge to ask questions about people, so QuAC consists only of Wikipedia articles about people. These people cover a wide range of domains, but in the end, they are all people. The authors also made sure to use only people whose articles have at least 100 incoming links.

Comparison to Other Datasets

QuAC is a unique dataset, for sure. It contains a wide variety of question types. It is context-dependent. It has unanswerable questions. It has longer answers. Luckily, the QuAC paper summarizes how it compares to a lot of other datasets in a handy little chart for us:

[Image: QuAC comparison to other datasets]

What’s more, QuAC is 54% non-factoid. It is also 86% contextual, requiring some sort of coreference resolution (figuring out which expressions refer to the same thing), which is a difficult and still largely unsolved problem. A follow-up like “What did she do after that?” only makes sense once you work out who “she” is and what “that” refers to.

[Image: QuAC question types]

The difficulty is compounded even further by the fact that some coreferences refer to things in the article, while others refer to earlier parts of the dialog. And more than a tenth of the questions boil down to the very abstract “what else?”

These are tough questions to answer, even for a human!

[Image: Coreferences in QuAC]
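One simple way to give a model a fighting chance on these contextual questions is to prepend the previous turns to the current question, so that pronouns and “what else?” have something to resolve against. The sketch below is only a generic illustration of conditioning on dialog history, not the specific mechanism inside the authors’ BiDAF++ variant, and the toy dialog is made up.

```python
def build_model_input(history, question, max_turns=2):
    """Prepend the last few dialog turns to the current question.

    history: list of (question, answer) string pairs from earlier turns.
    max_turns: how many previous turns to keep. Purely illustrative.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    turns = [f"Q: {q} A: {a}" for q, a in recent]
    return " ".join(turns + [f"Q: {question}"])

# Toy dialog (made up): the pronoun "he" in the new question can only be
# resolved by looking back at the earlier turns.
history = [("Where was he born?", "in a small town in Ohio"),
           ("What did he study?", "chemistry")]
print(build_model_input(history, "What did he do after that?"))
```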

Baseline Performance on QuAC

QuAC is very much unsolved. Human performance sits at an F1 of 81.1%, but the top-performing baseline is far behind. It is a BiDAF++ with context, something the authors of QuAC designed, and it only achieves an F1 of 60.2%. That’s a huge margin to improve upon. The rest of the tested baselines did a lot worse:

[Image: QuAC baseline performance]
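For reference, the F1 being reported here is word-overlap F1 between the predicted span and a reference answer. A stripped-down version looks like the sketch below; the official evaluation additionally normalizes text, handles unanswerable questions, and compares against multiple human references, so treat this as the core idea rather than the real scorer.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Word-overlap F1 between a predicted and a reference answer span.

    Simplified: no text normalization, no multi-reference comparison,
    no special handling of unanswerable questions.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the 1996 presidential election",
               "the 1996 presidential election"))  # ~0.89
```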

A Human-Based Evaluation

QuAC also comes with two human-based evaluation metrics. If, for a given question, a machine gets an F1 equal to or higher than the human’s, we say it has reached human equivalence (HEQ). For each question where it does that, we give it an HEQ-Q point. If it manages that for every question in a dialog, we give it an HEQ-D point. We can then report a model’s HEQ-Q and HEQ-D as percentages.

By definition, human performance has an HEQ-Q and HEQ-D of 100%. The top-performing model currently has an HEQ-Q of 55.1% and an HEQ-D of 5.20%. We’ve got a long way to go! But when we reach, and then exceed, those numbers, our dialog agents will have reached human-level performance in settings like this.
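To make that concrete, here is a small sketch of how you could compute HEQ-Q and HEQ-D once you have per-question F1 scores (say, from a scorer like the one above). The (model F1, human F1) pair layout is my own invention for illustration, not the official evaluation script.

```python
def heq_scores(dialogs):
    """Compute HEQ-Q and HEQ-D percentages.

    dialogs: list of dialogs, each a list of (model_f1, human_f1) pairs,
    one pair per question. Hypothetical layout, for illustration only.
    """
    total_q, heq_q, heq_d = 0, 0, 0
    for dialog in dialogs:
        per_question = [model >= human for model, human in dialog]
        total_q += len(per_question)
        heq_q += sum(per_question)          # questions at or above human F1
        heq_d += all(per_question)          # whole dialog at or above human F1
    return 100 * heq_q / total_q, 100 * heq_d / len(dialogs)

# Two toy dialogs: the model matches humans on 3 of 5 questions overall,
# and on every question of the second dialog only.
dialogs = [[(0.5, 0.9), (0.8, 0.7), (0.2, 0.6)],
           [(1.0, 1.0), (0.9, 0.8)]]
print(heq_scores(dialogs))  # (60.0, 50.0)
```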

And Now for a Challenge!

Will QuAC do for question answering what ImageNet did for image recognition? Only time will tell, but things are exciting for sure!

So now for a challenge for all of us (just like the one I posed with CoQA)! We can all push to tackle this challenge and take conversational agents to their next evolution. QuAC has a leaderboard, which as of now is empty! It is our duty to go out there, work on this problem, push knowledge forward, and maybe land on the leaderboard for a little bit of fame and a whole lot of pride in our work.

I’ll see you all there!