Mapping My Facebook Data -- Part 1: Simple NLP

Part 1 in a series where we analyze our social media data, diving into our Facebook data

Cover image

Introduction

A little while ago, I got hooked on thinking about how much text data I generate in a day. If you’re like me, chances are you do a lot of writing. Emails. Text messages. Facebook. And maybe you have some other creative outlets as well. Perhaps you journal or write music or something. Maybe you’re a student with writing assignments. I really wanted to get some insight into all this data I was producing and thought it would be cool to analyze it using natural language processing.

This series is going to document how and what I did so that you can do the same if you’re interested.

Finding My Data

When thinking through all the data I had, I decided to focus on several sources. They were:

Though I used all of these sources for my full project, for this series I’m just going to use my Facebook data.

For most of the data, I just put it in a text file and called it a day. For the Facebook data, I had to do a bit more pre-processing.

How to Get Your Facebook Data

How do we get our hands on our Facebook data? Well, it’s easier than you might think. As of right now (August 20th, 2018), you can download your data by:

I chose to download everything for all time, and in a JSON format. When my download was ready, I got a bunch of folders like this:

Facebook Download Data

Filled with my requested JSON.

Pre-Processing Your Facebook Data

I downloaded all my Facebook data, but I don’t actually need all of it for this project; I only care about my posts, comments, and chat history. To handle this, I wrote a pre-processing script for each of the three that dumps the content I want into text files.

First up, I processed my messages:

import os
import json
import datetime

chat_root = 'messages'
writing_dir = 'data/facebook_chat'

# Make sure the output folder exists before appending to files inside it
os.makedirs(writing_dir, exist_ok=True)

# Each sub-folder of 'messages' is one conversation holding a message.json
for f in os.listdir(chat_root):
    file = os.path.join(chat_root, f, 'message.json')
    with open(file, 'r') as in_file:
        data = json.load(in_file)

    for msg in data['messages']:
        try:
            # Keep only messages I sent, skipping Facebook's default greeting
            if msg['sender_name'] == 'Hunter Heidenreich':
                if msg['content'] != "You are now connected on Messenger.":
                    content = msg['content']
                    ts = datetime.datetime.fromtimestamp(msg['timestamp_ms'] / 1000)

                    # Bucket the text into one file per day: year.month.day.txt
                    filename = os.path.join(writing_dir,
                                            str(ts.year) + '.' + str(ts.month) + '.' + str(ts.day) + '.txt')

                    with open(filename, 'a+') as out_file:
                        out_file.write(content + '\n')
                        print('wrote.')
        except KeyError:
            # Messages without a 'content' key (stickers, photos, etc.) are skipped
            pass

Here, I iterate through all the sub-folders in my messages folder and read in each conversation’s JSON. For every message, I check whether I sent it and whether it’s something other than Facebook’s default “You are now connected on Messenger.” If so, I keep it. I convert the timestamp to a date and append the message to a file named year.month.day.txt; I stamp all my text files this way so I can track how my vocabulary changes over time.

If a message is missing one of these keys, I just skip it.
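
Since the same year.month.day.txt naming shows up in all three scripts, a small helper like this (my own convenience, not part of the original scripts) could keep it in one place:

import os


def dated_filename(out_dir, ts):
    """Build the year.month.day.txt path used to bucket text by day."""
    return os.path.join(out_dir, '{}.{}.{}.txt'.format(ts.year, ts.month, ts.day))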

I do very similar things for both the posts I wrote:

import os
import datetime
import json

in_data = 'posts/your_posts.json'
writing_dir = 'data/facebook_posts'

os.makedirs(writing_dir, exist_ok=True)

with open(in_data, 'r') as in_file:
    data = json.load(in_file)

for post in data['status_updates']:
    try:
        ts = datetime.datetime.fromtimestamp(post['timestamp'])
        post_text = post['data'][0]['post']

        # Same year.month.day.txt bucketing as the chat script
        filename = os.path.join(writing_dir, str(ts.year) + '.' + str(ts.month) + '.' + str(ts.day) + '.txt')

        with open(filename, 'a+') as out_file:
            out_file.write(post_text + '\n')
    except KeyError:
        # Posts without text (e.g. photo-only posts) get printed so I can inspect them
        print(post)

And my comments:

import os
import datetime
import json

comment_path = 'comments/comments.json'
writing_dir = 'data/facebook_comments'

os.makedirs(writing_dir, exist_ok=True)

with open(comment_path, 'r') as in_file:
    data = json.load(in_file)

for comment in data['comments']:
    try:
        for d in comment['data']:
            # Only keep comments that I authored
            if d['comment']['author'] == "Hunter Heidenreich":
                ts = datetime.datetime.fromtimestamp(d['comment']['timestamp'])
                com = d['comment']['comment']

                filename = os.path.join(writing_dir, str(ts.year) + '.' + str(ts.month) + '.' + str(ts.day) + '.txt')

                with open(filename, 'a+') as out_file:
                    out_file.write(com + '\n')
    except KeyError:
        # Comments missing the expected structure get printed for inspection
        print(comment)

From there, I’m ready to rock and roll with my Facebook data.

Loading in Our Data

To get started, we’ll write a simple function that lists all the files in a given category. This lets us keep track of which source is which, and we’ll stick with these naming schemes as we manipulate and analyze our data.

import os


def get_file_list(text_dir):
    files = []
    for f in os.listdir(text_dir):
        files.append(os.path.join(text_dir, f))
    return files


comments = 'data/facebook_comments'
posts = 'data/facebook_posts'
chats = 'data/facebook_chat'

comment_files = get_file_list(comments)
post_files = get_file_list(posts)
chat_files = get_file_list(chats)

all_files = comment_files + post_files + chat_files

Before we actually read in our data, we’ll write a function that preprocesses it in several different ways.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer


def preprocess_text(lines):

    def _flatten_list(two_list):
        one_list = []
        for el in two_list:
            one_list.extend(el)
        return one_list

    # Split into sentences, strip punctuation, tokenize, and lowercase
    translator = str.maketrans('', '', string.punctuation)
    upd = []
    for line in lines:
        upd.extend(nltk.sent_tokenize(line))
    lines = [line.translate(translator) for line in upd]
    lines = [nltk.word_tokenize(line) for line in lines]
    lines = [[word.lower() for word in line if word not in [
        '\'', '’', '”', '“']] for line in lines]

    raw = lines

    # Remove English stopwords (build the set once instead of per word)
    stop_words = set(stopwords.words('english'))
    stop = [[word for word in line if word not in stop_words] for line in raw]

    # Stem and lemmatize the stopword-filtered tokens
    snowball_stemmer = SnowballStemmer('english')
    stem = [[snowball_stemmer.stem(word) for word in line] for line in stop]

    wordnet_lemmatizer = WordNetLemmatizer()
    lemma = [[wordnet_lemmatizer.lemmatize(word) for word in line] for line in stop]

    # Flatten each variant from a list of sentences into one flat word list
    raw = _flatten_list(raw)
    stop = _flatten_list(stop)
    stem = _flatten_list(stem)
    lemma = _flatten_list(lemma)

    return raw, stop, stem, lemma
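
If you haven’t used these NLTK resources before, the tokenizer, stopword list, and WordNet data each need a one-time download:

import nltk

nltk.download('punkt')      # sentence and word tokenizers
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # data for the WordNet lemmatizer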

What we are doing here is producing four variations of our text:

- raw: the tokenized, lowercased words with punctuation stripped
- stop: the raw words with English stopwords removed
- stem: the stopword-filtered words run through the Snowball stemmer
- lemma: the stopword-filtered words run through the WordNet lemmatizer

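As a quick sanity check (this toy sentence is mine, not from the original data), you can run a single line through the function and compare the four outputs:

sample = ["The cats were running and laughing at the dogs!"]
raw_ex, stop_ex, stem_ex, lemma_ex = preprocess_text(sample)

print(raw_ex)    # all tokens, lowercased, punctuation stripped
print(stop_ex)   # the same tokens minus NLTK English stopwords
print(stem_ex)   # Snowball stems, e.g. 'running' -> 'run' (stems may not be real words)
print(lemma_ex)  # WordNet lemmas, e.g. 'cats' -> 'cat' (defaults to noun lemmatization)
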
With this in mind, we can now create a basic object that holds our file data and lets us combine different sources of Facebook writing that happened on the same day:

class FileObject:
    def __init__(self):
        self._datetime = None

        self._lines = []

        self._raw = []
        self._stop = []
        self._stem = []
        self._lemma = []

    @property
    def lines(self):
        return self._lines

    def preprocess_text(self):
        self._raw, self._stop, self._stem, self._lemma = preprocess_text(
            self._lines)

Now let’s load our data and pre-process it. I’ll demonstrate the code on the combined data, but it works the same way for the other file lists (there’s a quick sketch of that below):

all_text = {}

for f in all_files:
    base = os.path.basename(f)

    if base not in all_text:
        all_text[base] = FileObject()

    with open(f, 'r') as in_file:
        all_text[base]._lines += in_file.readlines()

for k, v in all_text.items():
    v.preprocess_text()

This may take a little while, but once it’s done, we can start looking at some basic properties of our text!
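
Since the snippets below also reference chat_text, post_text, and comment_text, here’s one way to build those per-source dictionaries with the same pattern (the load_file_objects helper is just a convenience I’m introducing here):

def load_file_objects(file_list):
    """Load a list of daily text files into FileObject instances keyed by filename."""
    text = {}
    for f in file_list:
        base = os.path.basename(f)
        if base not in text:
            text[base] = FileObject()
        with open(f, 'r') as in_file:
            text[base]._lines += in_file.readlines()
    for fo in text.values():
        fo.preprocess_text()
    return text


chat_text = load_file_objects(chat_files)
post_text = load_file_objects(post_files)
comment_text = load_file_objects(comment_files)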

What Are My Top Words?

Let’s start somewhere basic. We have lists of lists of words loaded for our various sources, so let’s do a count and see what our most frequent words are. Let’s look at our top 20.

We can do so by writing:

from collections import Counter

def file_objects_to_words(fo_text):
    raw, stop, stem, lemma = [], [], [], []
    for key, value in fo_text.items():
        raw.append(value._raw)
        stop.append(value._stop)
        stem.append(value._stem)
        lemma.append(value._lemma)
    return raw, stop, stem, lemma


def top_20_words(list_of_word_lists):
    full_count = Counter()

    for word_list in list_of_word_lists:
        for word in word_list:
            full_count[word] += 1

    return full_count.most_common(20)

raw, stop, stem, lemma = file_objects_to_words(all_text)
print('All words -- raw')
print(top_20_words(raw))
print('All words -- stop')
print(top_20_words(stop))
print('All words -- stem')
print(top_20_words(stem))
print('All words -- lemma')
print(top_20_words(lemma))

print('Chat words -- lemma')
_, _, _, lemma = file_objects_to_words(chat_text)
print(top_20_words(lemma))
print('Post words -- lemma')
_, _, _, lemma = file_objects_to_words(post_text)
print(top_20_words(lemma))
print('Comment words -- lemma')
_, _, _, lemma = file_objects_to_words(comment_text)
print(top_20_words(lemma))

And we’ll get a nice output of:

All words -- raw
[('i', 47935), ('you', 43131), ('to', 24551), ('and', 20882), ('the', 18681), ('that', 16725), ('a', 15727), ('it', 15140), ('haha', 13516), ('me', 12597), ('like', 10149), ('so', 9945), ('im', 9816), ('but', 8942), ('do', 8599), ('is', 8319), ('of', 8266), ('just', 8154), ('be', 8092), ('my', 8017)]
All words -- stop
[('haha', 13516), ('like', 10149), ('im', 9816), ('na', 6528), ('yeah', 5970), ('dont', 5361), ('okay', 5332), ('good', 5253), ('know', 4896), ('would', 4702), ('think', 4555), ('thats', 4163), ('want', 4148), ('cause', 3922), ('really', 3803), ('get', 3683), ('wan', 3660), ('hahaha', 3518), ('well', 3332), ('feel', 3231)]
All words -- stem
[('haha', 13516), ('like', 10580), ('im', 9816), ('na', 6528), ('yeah', 5970), ('think', 5772), ('want', 5396), ('dont', 5364), ('okay', 5333), ('good', 5261), ('know', 5078), ('would', 4702), ('get', 4535), ('feel', 4270), ('that', 4163), ('caus', 3992), ('go', 3960), ('realli', 3803), ('make', 3727), ('wan', 3660)]
All words -- lemma
[('haha', 13516), ('like', 10207), ('im', 9816), ('na', 6530), ('yeah', 5970), ('dont', 5361), ('okay', 5333), ('good', 5254), ('know', 5038), ('would', 4702), ('think', 4631), ('want', 4419), ('thats', 4163), ('cause', 3934), ('get', 3901), ('really', 3803), ('wan', 3660), ('hahaha', 3518), ('feel', 3358), ('well', 3332)]
Chat words -- lemma
[('haha', 13464), ('like', 10111), ('im', 9716), ('na', 6497), ('yeah', 5957), ('okay', 5329), ('dont', 5316), ('good', 5226), ('know', 4995), ('would', 4657), ('think', 4595), ('want', 4384), ('thats', 4150), ('cause', 3913), ('get', 3859), ('really', 3760), ('wan', 3649), ('hahaha', 3501), ('feel', 3336), ('well', 3299)]
Post words -- lemma
[('game', 68), ('video', 41), ('check', 38), ('like', 36), ('birthday', 36), ('happy', 34), ('im', 33), ('noah', 29), ('song', 28), ('play', 24), ('music', 21), ('one', 20), ('get', 19), ('guy', 18), ('time', 18), ('made', 18), ('evans', 18), ('hey', 17), ('make', 17), ('people', 16)]
Comment words -- lemma
[('noah', 120), ('im', 67), ('like', 60), ('evans', 60), ('thanks', 51), ('game', 50), ('haha', 47), ('cote', 44), ('one', 43), ('coby', 40), ('wilcox', 38), ('would', 33), ('man', 31), ('dont', 29), ('really', 29), ('know', 28), ('think', 28), ('na', 24), ('want', 23), ('get', 23)]

I like looking at just my lemmatized words, which is why that’s all I logged for the individual sources. I find it interesting how often I use some variation of ‘haha’ in my chats, and how many of my most common comment words are people’s names.
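
Out of curiosity, here’s a quick sketch (not part of the original analysis) that counts every ‘haha’-style token in the chat lemmas with a simple regex:

import re

haha_pattern = re.compile(r'^(ha)+h*$')  # matches 'ha', 'haha', 'hahah', 'hahaha', ...
_, _, _, chat_lemma = file_objects_to_words(chat_text)

haha_total = sum(1 for word_list in chat_lemma
                 for word in word_list
                 if haha_pattern.match(word))
print(haha_total)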

What’s My Individual Word Usage Look Like?

So what if we wanted to graph our individual words to see how our usage decays from top word to bottom word? We could write a general bar graph function that would look like:

from matplotlib import pyplot as plt


def plot_bar_graph(x, y, x_label='X', y_label='Y', title='Title', export=False, export_name='default.png'):
    plt.bar(x, y)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    plt.xlim(0, len(x))
    plt.ylim(0, max(y))

    if export:
        plt.savefig(export_name)

    plt.show()

From there, we can modify our top-20 function to return the full counter and feed our comment lemma list directly into the graph:

def aggregate_words_counts(list_of_word_lists):
    full_count = Counter()

    for word_list in list_of_word_lists:
        for word in word_list:
            full_count[word] += 1

    return full_count


counter = aggregate_words_counts(lemma)
word_counts_raw = list(counter.values())
word_counts_sorted = sorted(word_counts_raw)

# cap is the full vocabulary here, but it can be lowered to zoom in on a subset
cap = len(word_counts_raw)

plot_bar_graph(range(len(word_counts_raw[:cap])), word_counts_raw[:cap],
               x_label='Words', y_label='Counts', title='Word Frequencies (Raw)',
               export=True, export_name='visualizations/word_freq_bar_raw.png')
plot_bar_graph(range(len(word_counts_sorted[:cap])), word_counts_sorted[:cap],
               x_label='Words', y_label='Counts', title='Word Frequencies (Sorted)',
               export=True, export_name='visualizations/word_freq_bar_sorted.png')

And we get two very nice graphs:

Raw Facebook Word Frequency

Sorted Facebook Word Frequency
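
Because the counts span several orders of magnitude, it can also help to look at the sorted counts on a log scale (a small variation on the plot above, not one of the original visualizations):

plt.bar(range(len(word_counts_sorted)), word_counts_sorted)
plt.yscale('log')
plt.xlabel('Words')
plt.ylabel('Counts (log scale)')
plt.title('Word Frequencies (Sorted, Log Scale)')
plt.show()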

What are Some Basic Stats on My Data?

Let’s generate some basic stats on our data. Let’s set up a function to create a table:

def plot_table(cell_data, row_labels=None, col_labels=None, export=False, export_name='default.png'):
    _ = plt.figure(figsize=(6, 1))

    _ = plt.table(cellText=cell_data, rowLabels=row_labels,
                  colLabels=col_labels, loc='center')

    plt.axis('off')
    plt.grid(False)

    if export:
        plt.savefig(export_name, bbox_inches='tight')

    plt.show()

And then generate the data to dump into this function:

import numpy as np


def top_k_words(list_of_word_lists, k=10):
    full_count = Counter()

    for word_list in list_of_word_lists:
        for word in word_list:
            full_count[word] += 1

    return full_count.most_common(k)

raw, stop, stem, lemma = file_objects_to_words(all_text)

counter = aggregate_words_counts(lemma)
data = [[len(all_files)],
        [top_k_words(lemma, k=1)[0][0] + ' (' + str(top_k_words(lemma, k=1)[0][1]) + ')'],
        [top_k_words(lemma, k=len(list(counter.keys())))[-1][0] + ' (' + str(top_k_words(lemma, k=len(list(counter.keys())))[-1][1]) + ')'],
        [len(counter.items())],
        [np.mean(list(counter.values()))],
        [sum([len(word_list) for word_list in raw])],
        [sum([sum([len(w) for w in word_list]) for word_list in raw])]]
row_labels = ['Collection size: ', 'Top word: ', 'Least common: ', 'Vocab size: ', 'Average word usage count: ',
              'Total words: ', 'Total characters: ']

plot_table(cell_data=data, row_labels=row_labels,
                  export=True, export_name='visualizations/basic_stats_table.png')

Facebook Basic NLP stats

These are just some of the stats I thought were interesting. I used the combined data this time because I felt that would be the most interesting to see.

We can see that I have 2,147 days of text activity on Facebook.

My top word is ‘haha’ (no surprise there).

A total vocab size of 19,508 words :o

And I’ve used almost 4 million characters.

That is a lot of text data!

What’s My Vocab Usage Look Like Over Time?

I want to know how my vocab usage shifts over time. How can we generate that? Well, luckily we time-stamped all our files!

First, let’s create our plot function:

def plot(x, y, x_label='X', y_label='Y', title='Title', export=False, export_name='default.png'):
    plt.plot(x, y)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)

    if export:
        plt.savefig(export_name)

    plt.show()

And now let’s write some functions to graph our word usage over time:

import datetime


def get_datetimes(list_of_files):
    base_files = [os.path.basename(f) for f in list_of_files]
    no_ext = [os.path.splitext(f)[0] for f in base_files]
    splits = [f.split('.') for f in no_ext]
    times = np.array(
        [datetime.datetime(int(t[0]), int(t[1]), int(t[2])) for t in splits])
    return times

def unique_vocab(word_list):
    cnt = Counter()
    for word in word_list:
        cnt[word] += 1
    return cnt

raws = []
names = []
for key, value in all_text.items():
    raws.append(value._raw)
    names.append(key)

# Total and unique word counts per file, computed before sorting so they stay aligned
raw_wc = [len(word_list) for word_list in raws]
raw_wc_u = [len(unique_vocab(word_list)) for word_list in raws]
labels = get_datetimes(names)

# Sort everything by date together
labels, raw_wc, raw_wc_u = zip(*sorted(zip(labels, raw_wc, raw_wc_u)))

plot(labels, raw_wc,
     x_label='Date', y_label='Word Count', title='Word Count Over Time',
     export=True, export_name='visualizations/word_count_by_time.png')

plot(labels, raw_wc_u,
     x_label='Date', y_label='Unique Word Count', title='Unique Word Count Over Time',
     export=True, export_name='visualizations/unique_word_count_by_time.png')

And here we go:

Facebook Word Count By Time

Unique Facebook Word Count By Time

I think it’s really interesting that in mid-2013 I used a lot of words. I’m not quite sure what I was up to, but when you strip it down to unique words, out of the 20,000 words I used that day, not that many were unique…

Not to mention, you can definitely see a drop-off in my Facebook usage after 2017.

I think this is really cool stuff!
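
If you want to put a number on that 2013 observation, one rough sketch (not from the original post) is to plot unique words divided by total words for each day, reusing raw_wc and raw_wc_u from above:

# Ratio of unique to total words per day -- lower values mean more repetition
diversity = [u / max(c, 1) for u, c in zip(raw_wc_u, raw_wc)]

plot(labels, diversity,
     x_label='Date', y_label='Unique / Total Words',
     title='Lexical Diversity Over Time')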

Wrapping Up

And there we have it! A basic analysis of some of our Facebook data. Hopefully you learned a trick or two and maybe something about yourself from your Facebook data! I know that I certainly did when I started analyzing mine. If you have cool visuals or things you want to share, drop me a comment! I’m curious to see what other people find in their own data.

Next time, I think we’ll take a stab at some sentiment analysis of our Facebook data and see if we can find any interesting tidbits inside of that.

If you enjoyed this article or found it helpful in any way, I would love you forever if you passed me along a dollar or two to help fund my machine learning education and research! Every dollar helps me get a little closer and I’m forever grateful.

And stay tuned for some more Facebook NLP analysis!