CoQA | |
---|---|
Basic Information | |
Full Name | Conversational Question Answering Challenge |
Domain | Natural Language Processing |
Year | 2019 |
Dataset Scale | |
Conversations | 8,000 |
Questions | 127,000 |
Data Points | 127,000 question-answer pairs |
Research Context | |
Authors | Siva Reddy, Danqi Chen, Christopher D. Manning
Institutions | Stanford University |
Conference | TACL 2019 |
Technical Details | |
Format | JSON files with passages, conversations, and answers
Observables | Question answering in conversational context, coreference resolution |
Subsets | 7 domains (5 training + 2 test-only)
Publication | TACL 2019 • DOI |
Dataset | Official Website & Leaderboard |
Dataset Summary
CoQA (Conversational Question Answering Challenge) introduces conversational dynamics to question answering research. Unlike traditional QA datasets that treat questions independently, CoQA requires models to maintain context across multi-turn conversations while understanding coreferences and providing natural, abstractive answers.
Key Features
- Conversational Context: Every question after the first depends on dialogue history, requiring models to track conversational state
- Coreference Resolution: ~70% of questions contain explicit or implicit references that must be resolved using dialogue context
- Abstractive Answers: Natural, rephrased responses rather than exact text extraction, making conversations more realistic
- Domain Diversity: Training on 5 domains with testing on 7 domains (including 2 unseen) for robust generalization evaluation
- Large Scale: 127,000 questions across 8,000 conversations providing substantial training data
Dataset Structure
CoQA conversations follow a structured format (see the sketch after this list) where:
- Each conversation begins with a text passage from one of seven domains
- Questions build upon previous dialogue turns through pronouns and implicit references
- Answers are abstractive (rephrased) rather than extractive (copied) text spans
- Human annotators first highlight relevant text, then rephrase it naturally
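As a minimal sketch, one record has roughly the shape below. The field names (`story`, `questions`, `answers`, `turn_id`, `span_text`, `input_text`, `source`) reflect the public JSON release; the text values are placeholders, not actual dataset content.

```python
# Abbreviated shape of a single CoQA record (placeholder values, not real data).
record = {
    "source": "wikipedia",                 # one of the domains listed below
    "id": "example-conversation-id",
    "story": "The passage that grounds the whole conversation ...",
    "questions": [
        {"turn_id": 1, "input_text": "Who is the passage about?"},
        {"turn_id": 2, "input_text": "Where does she live?"},  # pronoun resolved via turn 1
    ],
    "answers": [
        {
            "turn_id": 1,
            "span_start": 0,               # character offsets of the highlighted rationale
            "span_end": 42,
            "span_text": "evidence text copied from the passage",
            "input_text": "a natural, rephrased answer",  # the abstractive answer
        },
        {
            "turn_id": 2,
            "span_start": 57,
            "span_end": 90,
            "span_text": "another evidence span",
            "input_text": "another rephrased answer",
        },
    ],
}
```

The pairing of `span_text` (highlighted evidence) with `input_text` (natural rephrasing) mirrors the two-step annotation process described above.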
Domain Distribution
Name | Description | Role |
---|---|---|
Children’s Stories (MCTest) | Stories designed for reading comprehension | Training domain |
Literature (Project Gutenberg) | Classic literature passages | Training domain |
Educational Content (RACE) | Middle and high school English passages | Training domain |
CNN News Articles | News articles and current events | Training domain |
Wikipedia Articles | Encyclopedic content | Training domain |
Science Articles (AI2 Science Questions) | Science passages held out for generalization testing | Test-only domain |
Creative Writing (Reddit WritingPrompts) | User-written stories held out for generalization testing | Test-only domain |
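To see how a downloaded split is distributed over these domains, you can tally each record's `source` field. This is a quick sketch assuming the `source` field name and file naming of the public JSON release; adjust both if your copy differs.

```python
import json
from collections import Counter

# Tally conversations per source domain in whichever CoQA split you downloaded.
# The "source" field name follows the public JSON release; adjust if yours differs.
with open("coqa-dev-v1.0.json", encoding="utf-8") as f:
    records = json.load(f)["data"]

domain_counts = Counter(record["source"] for record in records)
for domain, count in domain_counts.most_common():
    print(f"{domain:>12}  {count} conversations")
```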
Question Characteristics
Type | Percentage | Description |
---|---|---|
Context-dependent questions | 100% after first turn | Every question after the first depends on dialogue history |
Explicit coreferences | ~50% | Clear pronouns and references (him, it, her, that) |
Implicit coreferences | ~20% | Context-dependent references requiring inference |
Single-word questions | — | Challenging questions like ‘who?’, ‘where?’, ‘why?’ |
Use Cases
Primary Applications
- Conversational AI systems and chatbots
- Reading comprehension with dialogue context
- Multi-turn question answering systems
- Virtual assistants requiring contextual understanding
Research Applications
- Coreference resolution in dialogue
- Abstractive vs extractive answer generation
- Domain transfer and generalization studies
- Dialogue state tracking and context modeling
Quality & Limitations
Strengths
- Realistic Conversations: Natural dialogue flow with context dependencies mirrors real human communication
- Challenging Benchmark: Substantial human-machine performance gap (23.7% F1) indicates room for advancement
- Domain Diversity: Multiple text types ensure robust evaluation across different writing styles
- Abstractive Answers: Natural rephrasing creates more realistic conversational responses
Limitations
- Annotation Complexity: Abstractive answering and coreference resolution create inherent annotation challenges
- Performance Gap: Large gap between human (88.8% F1) and machine (65.1% F1) performance indicates dataset difficulty
- English Only: Limited to English language conversations
- Domain Bias: Training domain distribution may not reflect all real-world conversational contexts
Performance Benchmarks
The dataset establishes challenging benchmarks:
- Human Performance: 88.8% F1 score
- Best Models (2019): 65.1% F1 score
- Performance Gap: 23.7% F1 difference indicating significant research opportunities
Modern transformer-based models have since improved these scores, but CoQA remains a challenging benchmark for conversational understanding.
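For intuition about these numbers, CoQA is scored with a macro-averaged, SQuAD-style word-overlap F1. The snippet below is a simplified single-reference sketch of that metric; the official evaluation script additionally averages over multiple human reference answers, so use it only to build intuition, not to report results.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (the normalization used by SQuAD-style scorers)."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def word_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between a predicted answer and one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)   # both empty counts as a match
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(word_f1("in the garden", "the garden"))   # ≈ 0.67
```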
Getting Started
- Access: Download from the official CoQA website
- Format: JSON files containing passages, conversations, and answer annotations (see the loading sketch below)
- Evaluation: Use official evaluation scripts and submit to the leaderboard for comparison
- Baseline Models: Start with transformer-based models fine-tuned for reading comprehension
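As a starting point, the sketch below walks through one conversation from a downloaded training file. The file name and field names reflect the public v1.0 release and should be verified against the files you actually download.

```python
import json

# Minimal walkthrough of one CoQA conversation.
# File and field names ("coqa-train-v1.0.json", "story", "questions", "answers",
# "turn_id", "input_text", "span_text") follow the public release; verify locally.
with open("coqa-train-v1.0.json", encoding="utf-8") as f:
    data = json.load(f)["data"]

conversation = data[0]
passage = conversation["story"]                        # grounding passage
for question, answer in zip(conversation["questions"], conversation["answers"]):
    turn = question["turn_id"]                         # questions/answers align by turn
    print(f"Q{turn}: {question['input_text']}")        # conversational question
    print(f"A{turn}: {answer['input_text']}")          # abstractive answer
    print(f"   evidence: {answer['span_text']!r}")     # highlighted rationale span
```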
Related Work
CoQA complements other conversational QA datasets:
- QuAC: A contemporaneous conversational QA dataset that keeps answers extractive and withholds the source passage from the questioner
- SQuAD: A traditional single-turn extractive QA dataset, useful for comparison and baseline development
Research Impact
CoQA has driven advances in conversational AI by highlighting the importance of:
- Context tracking across dialogue turns
- Coreference resolution in natural language understanding
- Abstractive answer generation for natural conversations
- Domain transfer capabilities in question answering systems
The dataset continues to serve as a benchmark for measuring progress in conversational question answering research.
Citation: Reddy, S., Chen, D., & Manning, C. D. (2019). CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7, 249-266.