Look, Don't Tweet: Unified Data Models for Social NLP

Abstract

This is my undergraduate senior thesis, completed at Drexel University in 2021. The scope (308 million posts across four platforms, structural topology analysis, and domain adaptation experiments with Transformer models) was unusually broad for a senior thesis, spanning large-scale data engineering, graph-structural analysis, and representation-learning experiments.

Social media research is often siloed by platform, with tools built specifically for Twitter’s flat structure or Reddit’s tree structure. This fragmentation makes cross-platform analysis difficult. In this work, I introduce PyConversations, an open-source Python package that normalizes data from Twitter, Facebook, Reddit, and 4chan into a single, platform-agnostic data model. (Note: the repository is archived and no longer actively maintained.)

Leveraging this tool, I processed over 308 million posts to analyze the structural “shape” of online conversations. I then evaluated the efficacy of domain-adaptive pre-training (DAPT) for Transformer-based language models, finding that training on a toxic domain (4chan) boosts hate-speech detection by over 5 F1.

The Engineering Problem: Data Normalization

Social media platforms impose different structural constraints on discourse, making it difficult to feed heterogeneous data into a single ML pipeline:

Twitter: Technically allows infinite depth, but functionally operates as a flat stream or shallow tree.
Facebook: Enforces a hard limit of two depth levels (comments and replies), resulting in “short and fat” conversation trees.
Reddit & 4chan: Allow for deep, branching tree structures.

To solve this, I designed a Universal Message Schema and the PyConversations library. This system ingests raw dumps from these disparate sources and maps them to a unified Directed Acyclic Graph (DAG) format, preserving the parent-child relationships regardless of the source platform’s constraints.

Key Contributions

PyConversations Library: An open-source package for robust conversational analysis, featuring graph-based traversing and filtering.
Massive Dataset Analysis: Processed a collection of 308 million posts and 15.8 million conversations, creating one of the largest comparative cross-platform analyses at the time of thesis submission.
Structural Insights: Quantified how UI constraints shape human behavior. For instance, Facebook’s depth limit forces users to “bunch” comments, creating uniquely wide conversation trees compared to Reddit’s deep, narrow threads.
Domain Adaptation Experiments: Continued-pretrained RoBERTa on platform-specific slices (e.g., the 4chan-adapted RoBERTa-4chan), demonstrating that exposing models to toxic domains improved hate-speech detection F1 by over 5 points.

Structural Analysis Findings

By treating conversations as graphs, we uncovered distinct topological signatures for each platform:

The “Shape” of Discourse

We measured the width (max posts at any depth) and depth (max distance from root) of conversation trees.

Facebook exhibited a “short and fat” topology due to its 2-level nesting limit.
4chan threads were surprisingly shallow despite having no depth limits. This suggests that the platform’s ephemerality (threads are deleted quickly) and the “bump limit” mechanic discourage long-term dialogue, though data scraping limitations on this transient platform also contribute to this topology.
Reddit maintained the most robust tree structures, with “good faith” communities like r/ChangeMyView showing distinct patterns of sustained engagement.

Information Density

We analyzed Innovation Rate, a measure of how quickly a text introduces new vocabulary. We found that Twitter threads have negative innovation rates (indicating high novelty per token) likely forced by the strict character limits. In contrast, Reddit posts showed higher redundancy, typical of longer-form essay writing.

Representation Learning & Domain Adaptation

We experimented with “Warm-Start” tuning: taking a standard RoBERTa model and pre-training it further on platform-specific data before fine-tuning on downstream tasks (TweetEval).

Limited gains on most general tasks: Domain-adaptive pre-training added little on sentiment and emotion (from well under 1 up to a few F1 points), with irony detection the exception (+5.6 to +5.9 F1). Base RoBERTa already covers most of the signal for general NLP tasks.
The Toxic Exception: The notable exception was Hate Speech Detection. The 4chan-adapted model (RoBERTa-4chan) was the strongest here, outperforming the baseline by over 5 F1. This highlights that for specialized, out-of-distribution language (like toxic slang), domain adaptation remains valuable.

Significance

This work bridges the gap between Computational Social Science and ML Engineering. It provides the community with a reusable tool (PyConversations) to handle the messy reality of social data and offers empirical evidence on the limits and benefits of domain-adaptive pre-training for LLMs.

Citation

@thesis{heidenreich2021look,
  title={Look, Don't Tweet: Representation Learning and Social Media},
  author={Hunter Heidenreich},
  year={2021},
  school={Drexel University},
  type={Undergraduate Senior Thesis}
}

For related work on how social media content surfaces in digital journalism, including a dataset of embedded tweets across 273,899 news articles, see NewsTweet Dataset: Social Media in Digital Journalism.

Abstract#

The Engineering Problem: Data Normalization#

Key Contributions#

Structural Analysis Findings#

The “Shape” of Discourse#

Information Density#

Representation Learning & Domain Adaptation#

Significance#

Citation#

Related Work#