Document Processing

LLMs for Insurance Document Automation

LLM applications for insurance document automation using parameter-efficient fine-tuning and analysis of calibration …...

Document Processing

LLMs for Page Stream Segmentation

Enhanced TABME benchmark for page stream segmentation, creating TABME++, showing fine-tuned decoder-based LLMs …...

Natural Language Processing
Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that actually learned to …

Computational Social Science
Data visualization showing congressional bill analysis and legislative patterns

Congressional Data Analysis & Classification

Data science project scraping 47,000+ congressional bills, analyzing legislative patterns, and building ML models …...

Natural Language Processing

Analytical Model of Word2Vec and GloVe Statistics

Analytical model of Word2Vec and GloVe statistics. First analytical solution to Word2Vec's softmax skip-gram with bias …...

Natural Language Processing

EigenNoise: Data-Free Word Vector Initialization

Investigation into EigenNoise, a data-free initialization scheme for word vectors that approaches pre-trained model …...

Computational Social Science
Network visualization showing social media conversational analysis and text relationships

PyConversations: Social Media Conversational Analysis

Undergraduate thesis exploring representation learning for social media text and developing tools for cross-platform …

AI Safety

GPT-2 Susceptibility to Universal Adversarial Triggers

Investigation into whether universal adversarial triggers can control both topic and stance of GPT-2's generated text …...

Natural Language Processing

Data-Driven WordNet Construction from Wiktionary

Explores a data-driven approach to construct a WordNet-like semantic network using the entirety of the noisy, …...

Natural Language Processing
Types and distribution of coreferences in QuAC dataset showing dialogue complexity

QuAC: Question Answering in Context Dataset

Analysis of QuAC's conversational QA through student-teacher interactions, featuring 100K+ context-dependent questions …

Natural Language Processing

CoQA Dataset: Advancing Conversational Question Answering

Analysis of CoQA, a conversational QA dataset with multi-turn dialogue, coreference resolution, and natural answers for …

Natural Language Processing
One-hot encoding and count vectorization visualization showing sparse vector representation

Count Vectorization with scikit-learn in Python

Learn count vectorization in Python: convert text to numerical vectors using scikit-learn's CountVectorizer with …