Abstract

We introduce NewsTweet, a dataset and data collection pipeline designed to study the embedding of social media in digital journalism. Our descriptive analysis of articles collected from Google News reveals that 13% of stories include embedded tweets. The dataset provides a foundation for exploring how social media content is sourced and which users become newsworthy.

Key Contributions

  • Large-Scale Dataset: A dataset of 273,899 news articles, with 35,218 containing embedded tweets, collected from Google News RSS feeds over a four-month period.
  • Data Collection Pipeline: An automated pipeline that acquires news articles, extracts embedded tweets, and collects the corresponding user timelines from Twitter’s API.
  • Descriptive Statistics: Statistics on how often tweets are embedded, broken down by news category, outlet, and embedded user.

Dataset Characteristics

Scale and Coverage

  • News Sources: 5,961 unique news domains aggregated through Google News RSS feeds.
  • Time Period: Data collection began on May 15, 2019; the paper describes the first four months of data.
  • Content Types: Focuses specifically on embedded tweets, since Twitter is the most frequently embedded platform.
  • Metadata: Includes article source, Google News category (e.g., Sports, Health), and full tweet and user objects from the Twitter API.

Technical Implementation

  • An automated data pipeline that crawls Google News for articles, extracts tweet IDs, and downloads tweet and user data via the Twitter API.
  • The system is designed to continuously track users once they are found in an embedded tweet.
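
The tweet ID extraction step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that articles carry Twitter's standard embed markup, a blockquote with class "twitter-tweet" that contains a link to the tweet's status URL, and the names EmbeddedTweetParser and extract_tweet_ids are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches tweet permalinks such as https://twitter.com/<user>/status/<id>
STATUS_URL = re.compile(
    r"https?://(?:www\.|mobile\.)?twitter\.com/(?:#!/)?\w+/status(?:es)?/(\d+)"
)


class EmbeddedTweetParser(HTMLParser):
    """Collects tweet IDs linked from inside twitter-tweet blockquotes."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # nesting depth inside an embed blockquote
        self.tweet_ids = []   # IDs in order of first appearance

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "blockquote":
            classes = (attrs.get("class") or "").split()
            # Enter an embed, or track a blockquote nested inside one.
            if self._depth or "twitter-tweet" in classes:
                self._depth += 1
        elif tag == "a" and self._depth:
            match = STATUS_URL.match(attrs.get("href") or "")
            if match and match.group(1) not in self.tweet_ids:
                self.tweet_ids.append(match.group(1))

    def handle_endtag(self, tag):
        if tag == "blockquote" and self._depth:
            self._depth -= 1


def extract_tweet_ids(html):
    """Return the IDs of tweets embedded in an article's HTML (hypothetical helper)."""
    parser = EmbeddedTweetParser()
    parser.feed(html)
    return parser.tweet_ids
```

Links to tweets that appear outside an embed blockquote are ignored, so ordinary hyperlinks to Twitter are not counted as embeds. The returned IDs could then be passed to the Twitter API to download the tweet and user objects.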

Key Findings

Embedding Prevalence

  • 13% of news articles in our Google News-sourced collection contained embedded tweets.
  • Significant variation across categories: Sports (24% of articles) and Entertainment (14%) had the highest rates of embedding, while Health (2%) had the lowest.
  • News outlets that publish the most articles are well-known mass media organizations, while outlets with the highest average number of embeds per article are often focused on Sports and Entertainment.
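
The 13% figure follows directly from the dataset counts reported above (35,218 of 273,899 articles). The sketch below reproduces that arithmetic and shows one way per-category rates could be computed. The function embedding_rates and its input format are assumptions made for illustration, not part of the released pipeline.

```python
from collections import Counter


def embedding_rates(articles):
    """Fraction of articles with an embedded tweet, per category.

    articles: iterable of (category, has_embedded_tweet) pairs
    (a hypothetical input format).
    """
    totals, embedded = Counter(), Counter()
    for category, has_embed in articles:
        totals[category] += 1
        if has_embed:
            embedded[category] += 1
    return {category: embedded[category] / totals[category] for category in totals}


# Overall prevalence from the counts reported for the dataset.
overall = 35_218 / 273_899
print(f"{overall:.1%}")  # 12.9%, reported as 13%
```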

User and Content Patterns

  • Public figures dominate: Well-known figures like politicians and celebrities, alongside organizations, are embedded far more often than ordinary users.
  • Embedding patterns differ by user: some have a few tweets embedded many times, while others have many different tweets embedded.
  • The Health category, despite having few embedded tweets, had the highest proportion of unique tweets (93%), suggesting that tweets embedded in Health stories are rarely reused across multiple stories.
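
One plausible reading of "proportion of unique tweets" is the number of distinct tweet IDs divided by the total number of embed occurrences in a category. The paper's exact definition may differ, so the sketch below is an assumption, and unique_tweet_proportion is a hypothetical helper.

```python
def unique_tweet_proportion(embedded_ids):
    """Distinct tweet IDs divided by total embed occurrences.

    embedded_ids: list with one tweet ID per embed occurrence, so a tweet
    embedded in three stories appears three times.
    """
    if not embedded_ids:
        return 0.0
    return len(set(embedded_ids)) / len(embedded_ids)


# Four embeds, one tweet reused once: three distinct IDs out of four.
print(unique_tweet_proportion(["1", "2", "2", "3"]))  # 0.75
```

Under this reading, a value near 1.0 means that almost every embed in the category refers to a different tweet.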

Societal Impact

This work provides a foundational dataset for researchers to investigate how social media is shaping news narratives and public discourse. It enables future studies on journalistic sourcing, the rise of internet celebrities, and the role of user-generated content in the modern media landscape.

Applications

  • Journalism Research: Studying how sourcing routines are evolving in the digital age.
  • Media Studies: Analyzing the interplay between traditional news outlets and social media platforms.
  • Computational Social Science: Modeling information flow and the dynamics of newsworthiness.
  • Platform Studies: Examining how platform policies and features influence news content.

Citation

@article{mujib2020newstweet,
  title={NewsTweet: a dataset of social media embedding in online journalism},
  author={Mujib, Munif Ishad and Heidenreich, Hunter Scott and Murphy, Colin J and Santia, Giovanni C and Zelenkauskaite, Asta and Williams, Jake Ryland},
  journal={arXiv preprint arXiv:2008.02870},
  year={2020}
}