CoQA
Basic Information
Full Name: Conversational Question Answering Challenge
Domain: Natural Language Processing
Year: 2019
Dataset Scale
Conversations: 8,000
Questions: 127,000
Data Points: 127,000 question-answer pairs
Research Context
Authors: Siva Reddy, Danqi Chen, Christopher D. Manning
Institutions: Stanford University
Venue: TACL 2019
Technical Details
Format: JSON files with passages, conversations, and answers
Observables: Question answering in conversational context, coreference resolution
Subsets: 7
Publication: TACL 2019 (DOI)
Dataset: Official Website & Leaderboard

Dataset Summary

CoQA (Conversational Question Answering Challenge) introduces conversational dynamics to question answering research. Unlike traditional QA datasets that treat questions independently, CoQA requires models to maintain context across multi-turn conversations while understanding coreferences and providing natural, abstractive answers.

Key Features

  • Conversational Context: Every question after the first depends on dialogue history, requiring models to track conversational state (see the input-construction sketch after this list)
  • Coreference Resolution: ~70% of questions contain explicit or implicit references that must be resolved using dialogue context
  • Abstractive Answers: Natural, rephrased responses rather than exact text extraction, making conversations more realistic
  • Domain Diversity: Training on 5 domains with testing on 7 domains (including 2 unseen) for robust generalization evaluation
  • Large Scale: 127,000 questions across 8,000 conversations providing substantial training data
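
As an illustration of this context tracking, the sketch below shows one common, unofficial way to build a model input by prepending recent dialogue turns to the current question; the function name, separator, and truncation heuristic are illustrative and not part of any official CoQA baseline.

    def build_input(passage, history, question, max_history=2):
        """Prepend recent dialogue turns to the current question.

        history is a list of (question, answer) pairs from earlier turns;
        only the last max_history turns are kept, a simple truncation
        heuristic rather than anything prescribed by CoQA itself.
        """
        turns = [f"Q: {q} A: {a}" for q, a in history[-max_history:]]
        turns.append(f"Q: {question}")
        # A reader model would score the passage against this history-aware query.
        return passage + " ||| " + " ".join(turns)

    example = build_input(
        passage="Jessica went to sit in her favorite rocking chair. Today was her birthday ...",
        history=[("Who had a birthday?", "Jessica"), ("How old would she be?", "80")],
        question="Did she plan to have any visitors?",
    )
    print(example)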

Dataset Structure

CoQA conversations follow a structured format (a schematic record is sketched after the list below) where:

  • Each conversation begins with a text passage from one of seven domains
  • Questions build upon previous dialogue turns through pronouns and implicit references
  • Answers are abstractive (rephrased) rather than extractive (copied) text spans
  • Human annotators first highlight relevant text, then rephrase it naturally
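
For orientation, the released v1.0 JSON roughly has the shape sketched below, shown here as an abbreviated Python literal; the span offsets are illustrative, and field names should be checked against the official files.

    # Schematic shape of one record in the released v1.0 JSON (abbreviated).
    record = {
        "source": "mctest",          # which of the seven domains the passage came from
        "id": "example-id",
        "story": "Jessica went to sit in her favorite rocking chair. Today was her birthday ...",
        "questions": [
            {"turn_id": 1, "input_text": "Who had a birthday?"},
            {"turn_id": 2, "input_text": "How old would she be?"},
        ],
        "answers": [
            # input_text is the abstractive answer; span_* point at the highlighted evidence.
            {"turn_id": 1, "input_text": "Jessica",
             "span_text": "Jessica", "span_start": 0, "span_end": 7},
            {"turn_id": 2, "input_text": "80",
             "span_text": "turning 80", "span_start": 66, "span_end": 76},
        ],
    }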

Domain Distribution

CoQA Domain Distribution
  • Children’s Stories (MCTest): stories designed for reading comprehension (training domain)
  • Literature (Project Gutenberg): classic literature passages (training domain)
  • Educational Content (RACE): middle and high school English passages (training domain)
  • CNN News Articles: news articles and current events (training domain)
  • Wikipedia Articles: encyclopedic content (training domain)
  • Science Articles (AI2 Science Questions): held out for generalization testing (test-only domain)
  • Creative Writing (Reddit WritingPrompts): held out for generalization testing (test-only domain)

Question Characteristics

CoQA Question Analysis
  • Context-dependent questions (100% after the first turn): every question after the first depends on dialogue history
  • Explicit coreferences (~50%): clear pronouns and references (him, it, her, that)
  • Implicit coreferences (~20%): context-dependent references requiring inference
  • Single-word questions: challenging questions like ‘who?’, ‘where?’, ‘why?’

Use Cases

Primary Applications

  • Conversational AI systems and chatbots
  • Reading comprehension with dialogue context
  • Multi-turn question answering systems
  • Virtual assistants requiring contextual understanding

Research Applications

  • Coreference resolution in dialogue
  • Abstractive vs extractive answer generation
  • Domain transfer and generalization studies
  • Dialogue state tracking and context modeling

Quality & Limitations

Strengths

  • Realistic Conversations: Natural dialogue flow with context dependencies mirrors real human communication
  • Challenging Benchmark: Substantial human-machine performance gap (23.7% F1) indicates room for advancement
  • Domain Diversity: Multiple text types ensure robust evaluation across different writing styles
  • Abstractive Answers: Natural rephrasing creates more realistic conversational responses

Limitations

  • Annotation Complexity: Abstractive answering and coreference resolution create inherent annotation challenges
  • Performance Gap: Large gap between human (88.8% F1) and machine (65.1% F1) performance indicates dataset difficulty
  • English Only: Limited to English language conversations
  • Domain Bias: Training domain distribution may not reflect all real-world conversational contexts

Performance Benchmarks

The dataset establishes challenging benchmarks:

  • Human Performance: 88.8% F1 score
  • Best Models (2019): 65.1% F1 score
  • Performance Gap: 23.7% F1 difference indicating significant research opportunities

Modern transformer-based models have since improved these scores, but CoQA remains a challenging benchmark for conversational understanding.
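
These scores are word-overlap F1, macro-averaged over turns (and, on the official sets, over multiple human reference answers). The snippet below is a simplified single-reference sketch of that metric; the official scorer additionally lowercases and strips punctuation and articles before comparing tokens.

    from collections import Counter

    def token_f1(prediction, reference):
        """Simplified word-overlap F1 between one prediction and one reference."""
        pred_tokens = prediction.split()
        ref_tokens = reference.split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("she was turning 80", "turning 80"))  # ~0.67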

Getting Started

  1. Access: Download from the official CoQA website
  2. Format: JSON files containing passages, conversations, and answer annotations (see the loading sketch after this list)
  3. Evaluation: Use official evaluation scripts and submit to the leaderboard for comparison
  4. Baseline Models: Start with transformer-based models fine-tuned for reading comprehension
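
A minimal loading sketch, assuming the v1.0 JSON files have already been downloaded from the official website; the file name below is the dev set's at the time of writing, and the field names follow the released format as described above.

    import json

    # Assumes the dev set has been downloaded from the official website.
    with open("coqa-dev-v1.0.json") as f:
        dev = json.load(f)

    conversation = dev["data"][0]
    print(conversation["source"])        # domain of the passage
    print(conversation["story"][:120])   # opening of the passage
    for q, a in zip(conversation["questions"], conversation["answers"]):
        print(q["turn_id"], q["input_text"], "->", a["input_text"])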

Related Datasets

CoQA complements other conversational QA datasets:

  • QuAC: Another conversational QA dataset with different design choices
  • SQuAD: Traditional single-turn QA for comparison and baseline development

Research Impact

CoQA has driven advances in conversational AI by highlighting the importance of:

  • Context tracking across dialogue turns
  • Coreference resolution in natural language understanding
  • Abstractive answer generation for natural conversations
  • Domain transfer capabilities in question answering systems

The dataset continues to serve as a benchmark for measuring progress in conversational question answering research.


Citation: Reddy, S., Chen, D., & Manning, C. D. (2019). CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7, 249-266.