Abstract

Undergraduate thesis exploring representation learning for social media text and developing tools for cross-platform conversational analysis. Built PyConversations, a Python module for analyzing social media conversations, and found that large pre-trained models don’t always outperform domain-specific approaches.

What I Built

PyConversations Module

  • Graph-based modeling: Models conversations as Directed Acyclic Graphs (DAGs) to quantify topological structure such as depth, width, and density (see the DAG sketch after this list)
  • Unified interface: Polymorphic design normalizing heterogeneous data from Twitter, Reddit, 4chan, and Facebook into a single analysis schema
  • Linguistic dynamics: Implements information-theoretic feature extraction, including harmonic mixing laws and entropy measures
  • Stream processing: Memory-efficient generators to ingest and traverse multi-gigabyte JSON dumps (e.g., 147M+ Reddit posts) without loading the full corpus into RAM (a streaming sketch also follows)
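
To make the DAG idea concrete, here is a minimal sketch in plain Python. The schema (dicts with id and reply_to fields) and the function name are illustrative, not PyConversations’ actual API: posts point to their parents through reply edges, and depth and width fall out of a single breadth-first traversal.

```python
from collections import defaultdict, deque

def conversation_stats(posts):
    """Compute depth and width of a conversation DAG.

    `posts` is a list of dicts with 'id' and 'reply_to' keys
    (reply_to is None for the root) -- an illustrative schema,
    not PyConversations' actual data model.
    """
    children = defaultdict(list)
    roots = []
    for p in posts:
        if p["reply_to"] is None:
            roots.append(p["id"])
        else:
            children[p["reply_to"]].append(p["id"])

    # Breadth-first traversal: depth = longest root-to-leaf path,
    # width = largest number of posts at any single depth level.
    level_counts = defaultdict(int)
    queue = deque((r, 0) for r in roots)
    while queue:
        node, depth = queue.popleft()
        level_counts[depth] += 1
        for child in children[node]:
            queue.append((child, depth + 1))

    depth = max(level_counts) + 1 if level_counts else 0
    width = max(level_counts.values()) if level_counts else 0
    return {"depth": depth, "width": width}


posts = [
    {"id": "a", "reply_to": None},
    {"id": "b", "reply_to": "a"},
    {"id": "c", "reply_to": "a"},
    {"id": "d", "reply_to": "b"},
]
print(conversation_stats(posts))  # {'depth': 3, 'width': 2}
```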
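For the streaming bullet: large Reddit dumps ship as compressed, line-delimited JSON, so constant-memory ingestion can be as simple as a generator that parses one line at a time. A minimal sketch, assuming a bz2-compressed dump (the file name below is hypothetical):

```python
import bz2
import json

def stream_posts(path):
    """Yield one parsed post at a time from a bz2-compressed,
    line-delimited JSON dump, keeping memory use flat."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Hypothetical dump name; aggregates without materializing the corpus.
total = sum(1 for _ in stream_posts("RC_2019-06.bz2"))
```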

Research Contributions

  • Representation learning: Investigated domain-specific vs. general-purpose Transformers (BERT vs. specialized variants) on social media text (see the embedding sketch after this list)
  • Topological analysis: Demonstrated that conversational structure (context) is as critical as content for classification tasks
  • Cross-platform study: Comparative analysis of communication dynamics across moderated (Reddit/Twitter) and unmoderated (4chan) spaces
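
The comparison in the first bullet can be reproduced in spirit with off-the-shelf encoders. A minimal sketch using Hugging Face transformers; the two model names are representative examples (a general-domain BERT and a Twitter-specific variant), not necessarily the exact variants evaluated in the thesis:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed(text, model_name):
    """Return a mean-pooled sentence embedding from a pre-trained encoder."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

post = "lmaooo this thread is wild @user"
general = embed(post, "bert-base-uncased")    # general-domain encoder
domain = embed(post, "vinai/bertweet-base")   # Twitter-specific encoder
```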

Key Findings

  • Model performance: Standard pre-trained models didn’t always outperform smaller, domain-specific approaches
  • Context importance: Conversational context and dialogue structure proved crucial for understanding social media interactions (a feature-fusion sketch follows this list)
  • Domain adaptation: Social media text benefits from specialized handling rather than generic approaches
  • Cross-platform challenges: Different platforms require adapted approaches despite surface-level similarities
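
One way to act on the context finding is to fuse topological features (e.g., the depth and width computed earlier) with text features before classification. The sketch below uses random placeholders and scikit-learn purely to show the fusion step; it is not the thesis pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
text_emb = rng.random((200, 768))    # placeholder for encoder embeddings
structure = rng.random((200, 2))     # placeholder for per-post depth, width
labels = rng.integers(0, 2, size=200)

# Fusion: concatenate content and structure into one feature matrix.
X = np.hstack([text_emb, structure])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```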

Team & Recognition

  • Hunter Heidenreich - Lead Researcher and Developer
  • Jake Williams - Faculty Advisor
  • 🏆 First Place - Research Undergraduate Senior Thesis at Drexel University

Impact

This library served as the engineering backbone for my thesis, Look, Don’t Tweet, enabling the processing of 308 million posts to evaluate Transformer performance on toxic data.

The model-performance findings suggested that larger models aren’t automatically better for specialized domains, a perspective that has only grown more relevant as the field has trended toward ever-larger models.