Twitter Archive to Markdown

I have been wanting to convert my Twitter Export archive into a more useful format for a while. I had vague dreams of maybe turning them into some sort of digest posts on WordPress but there are a fuckton of tweets (25,000) and it would require a LOT of scrubbing. I could probably manage to find and remove Retweets pretty easily, but then there is the issue of media and getting the images into the digest posts and it’s just not worth the hassle.

What I can do, and did, is preserve the data in a better, more digestible, and searchable format. Specifically, Markdown. Well, ok, the files are not doing anything fancy, so it’s just plaintext, pretending to be Markdown.

I have no idea if Twitter still offers an export dump of your data, I have not visited the site at all in over a year. I left, I deleted everything, I blocked it on my DNS. But assuming they do, or you have one, it’s a big zip file that can be unrolled into a sort of local, Twitter-like interface. There are a lot of files in this ball, and while I am keeping the core archive, I mostly just care about the content.

If you dig in, it’s easy to find. There is a folder called data, and the tweets are in a file called “tweets.js”. It’s a JSON-style format. If you want the media, it’s in a folder called “Tweets_Media” or something like that. I skimmed through mine, and most of the images looked familiar because I already have them, so I removed the copy since I didn’t need it.

But I kept the Tweets.js file.

So, what to do with it? It has a bunch of extraneous metadata for each Tweet that makes it a cluttered mess. It’s useful for a huge website, but all I want is the date and the text. Here is a sample Tweet in the file.

{
    "tweet" : {
      "edit_info" : {
        "initial" : {
          "editTweetIds" : [
            "508262277464608768"
          ],
          "editableUntil" : "2014-09-06T15:05:44.661Z",
          "editsRemaining" : "5",
          "isEditEligible" : true
        }
      },
      "retweeted" : false,
      "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
      "entities" : {
        "hashtags" : [ ],
        "symbols" : [ ],
        "user_mentions" : [ ],
        "urls" : [ ]
      },
      "display_text_range" : [
        "0",
        "57"
      ],
      "favorite_count" : "0",
      "id_str" : "508262277464608768",
      "truncated" : false,
      "retweet_count" : "0",
      "id" : "508262277464608768",
      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",
      "favorited" : false,
      "full_text" : "\"Sorry, you are over the limit for spam reports!\"  Heh...",
      "lang" : "en"
    }
  },

So I wrote a quick and simple Python script (it’s below). I probably could have done something fancy with Beautiful Soup or Pandas, but instead I did a quick and basic scan that pulls the data I care about. If a line contains “created_at”, pull it out to get the date; if it has “full_text”, pull it out to get the text.
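
For comparison, here is a minimal sketch of a different approach from the line scan described above: treating the file as JSON and letting Python's json module do the parsing. It assumes the file starts with a short JavaScript assignment (something like `window.YTD.tweets.part0 = `) followed by an ordinary JSON array, which may not hold for every export. The `load_tweets` name and the trimmed-down sample are made up for illustration.

```python
import json


def load_tweets(raw_text):
    """Return (created_at, full_text) pairs from the text of a tweets.js file."""
    # Assumption: everything before the first "[" is a JavaScript assignment
    # prefix, and everything from there on is a plain JSON array.
    start = raw_text.index("[")
    records = json.loads(raw_text[start:])
    return [(r["tweet"]["created_at"], r["tweet"]["full_text"]) for r in records]


# A cut-down sample with only the two fields of interest.
sample = r'''window.YTD.tweets.part0 = [
  {
    "tweet" : {
      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",
      "full_text" : "\"Sorry, you are over the limit for spam reports!\"  Heh..."
    }
  }
]'''

for created_at, full_text in load_tweets(sample):
    print(created_at, "-", full_text)
```

One side effect of going through a JSON parser is that escaped characters such as `\"` come back as the real characters, which a plain text split does not do.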

Once I was able to output these two lines, I went about cleaning them up a bit. I don’t need the titles, so I started by splitting on “:”. This was quickly problematic if the Tweet contained a colon, and because the time contained several colons. Instead I did a split on ‘ ” : ” ‘. Specifically, quote, space, colon, space, quote. Only the breaks I wanted had the spaces and quotes, so that got me through step one. The end quotation mark was easy to slice off as well.

I considered simplifying things by using the same transformation on the date and the text, but the date also had this +0000 in it that I wanted to remove. It’s not efficient, but it was just as simple to have two very similar operations.
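
To make the splitting step concrete, here is a small sketch run against the two lines from the sample Tweet. The `field_value` helper is a hypothetical name, and it trims the trailing comma and quote a little more defensively than a fixed slice would.

```python
def field_value(line):
    """Pull the value out of a line shaped like:  "key" : "value",  """
    # Split on quote-space-colon-space-quote and keep the right-hand side.
    value = line.split('" : "')[1]
    # Drop the newline, then the trailing comma if there is one.
    value = value.rstrip()
    if value.endswith(","):
        value = value[:-1]
    # What is left ends with the closing quote, so slice it off.
    return value[:-1]


date_line = '      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",\n'
text_line = r'      "full_text" : "\"Sorry, you are over the limit for spam reports!\"  Heh...",' + "\n"

# Splitting on a bare colon chops the time into pieces.
print(date_line.split(":"))

# Splitting on the quoted separator keeps each value whole.
print(field_value(date_line).replace(" +0000 ", " "))
print(field_value(text_line))
```

Note that the text comes out with its JSON escapes (`\"`) still in place, since this is a plain string split and not a JSON parse.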

After some massaging, I was able to output something along the lines of “date – text”.

But then I noticed that for some reason the Tweets are apparently not in date order. I had decided that I was just going to create a series of year based archival files, so I needed them to be in order.
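
If a true chronological order is wanted, one option is to parse each date string and sort on the result. This is only a sketch: it assumes the dates have already had the +0000 removed, that they look like the sample ("Sat Sep 06 14:35:44 2014"), and that Python is running with an English locale so the day and month names parse. `sort_by_date` and the sample texts are hypothetical.

```python
from datetime import datetime

# Matches dates like "Sat Sep 06 14:35:44 2014" (no timezone offset).
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"


def sort_by_date(tweets):
    """Return [date, text] pairs ordered oldest to newest."""
    return sorted(tweets, key=lambda tweet: datetime.strptime(tweet[0], DATE_FORMAT))


tweets = [
    ["Sat Sep 06 14:35:44 2014", "second"],
    ["Mon Jan 02 09:00:00 2012", "first"],
    ["Sat Sep 06 14:40:00 2014", "third"],
]

for date, text in sort_by_date(tweets):
    print(date, "-", text)
```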

So I added a few more steps to sort each Tweet during processing into an array of arrays based on the year. Once again, this isn’t the cleanest code. It assumes a range of something like 2004 to 2026, which covers my needs for certain. I also had some “index out of range” errors with my array of arrays, which probably have a clever loopy solution, but instead it’s just a big pre-initialized copy/paste array.
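
One possible "loopy" alternative to a pre-initialized array of arrays is a dictionary keyed by year, which creates each bucket the first time it is needed. This is a sketch and not what the script below does. It assumes the year is the last token of the date string, as in the sample, and `group_by_year` is a made-up name.

```python
from collections import defaultdict


def group_by_year(tweets):
    """Bucket (date, text) pairs by year without fixing a year range up front."""
    by_year = defaultdict(list)
    for date, text in tweets:
        # Assumption: the date ends with the four-digit year.
        year = int(date.split()[-1])
        by_year[year].append((date, text))
    return by_year


tweets = [
    ("Sat Sep 06 14:35:44 2014", "a"),
    ("Mon Jan 02 09:00:00 2012", "b"),
    ("Sat Sep 06 14:40:00 2014", "c"),
]

grouped = group_by_year(tweets)
for year in sorted(grouped):
    print(year, len(grouped[year]))
```

Because missing keys are created on demand, there is nothing to go out of range, whatever years show up in the data.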

Part of the motivation for doing the array of arrays was also that I could make the script output my sorted yearly files directly, but I just did it manually from the big ball final result. The job is done, but it could easily be done by adjusting the lower output block a bit.
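
As a rough idea of what that adjustment could look like, the sketch below writes one file per year from a year-to-tweets mapping. The `write_yearly_files` function and the `tweets_YYYY.md` file names are assumptions for illustration, and the demo writes into a temporary folder so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def write_yearly_files(by_year, out_dir):
    """Write one "date : text" file per year and return the paths written."""
    out_dir = Path(out_dir)
    written = []
    for year in sorted(by_year):
        path = out_dir / f"tweets_{year}.md"  # hypothetical naming scheme
        with open(path, "w", encoding="utf-8") as output:
            for date, text in by_year[year]:
                output.write(f"{date} : {text}\n")
        written.append(path)
    return written


by_year = {
    2014: [("Sat Sep 06 14:35:44 2014", "second")],
    2012: [("Mon Jan 02 09:00:00 2012", "first")],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_yearly_files(by_year, tmp)
    names = [p.name for p in paths]
    first_file = paths[0].read_text(encoding="utf-8")

print(names)
print(first_file, end="")
```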

Anyway, here is the code, and a link to a Git repository for it.

# A simple script that takes an exported tweets.js file and outputs it to a markdown text file for archiving.
# In pulling data for this, I noticed that older Twitter exports use a csv file instead of a .js file.
# As such, this is for newer exports.
# The Tweets.js file is in the 'data' directory of a standard Twitter archive export file.

# Open the tweets.js file containing all the tweets, should be in the same folder
with open("tweets.js", encoding="utf-8") as file:
    filedata = file.readlines()

tweet_data = []
current_tweet = []
# The Tweets don't seem to be in order, so I needed to sort them out, this is admittedly ugly
# but I only need to cover so many years of sorting and this was the easiest way to avoid index errors
sorted_tweets = [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

# Does a simple search through the file.  It pulls out the date posted and the full text.
# This does not do anything with images, sorry, that gets more complicated, it would be doable
for line in filedata:
    if "created_at" in line:
        # Split on quote-space-colon-space-quote, drop the +0000 offset,
        # then slice off the trailing quote, comma and newline
        posted_at = line.split("\" : \"")[1].replace(" +0000 ", " ")[:-3]
        current_tweet.append(posted_at)
    elif "full_text" in line:
        current_tweet.append(line.split("\" : \"")[1][:-3])
        #        current_tweet.append(line.split(":")[1].split("\"")[1])
        tweet_data.append(current_tweet)
        current_tweet = []
        # Because full text is always after the date, it just moves on after it gets both
    else:
        pass

# An ugly sort, it simply looks for the year in the date, then creates an array of arrays based on year.
# I did it this way partly in case I wanted to output to separate files based on year, but I can copy/paste that
# It probably is still out of order based on date, but whatever, I just want a simple archive file
for each in tweet_data:
    for year in range(2004, 2026):
        if str(year) in each[0]:
            sorted_tweets[year - 2004].append(each)

# Prints the output and dumps it to a file.
with open("output.md", encoding="utf-8", mode="w") as output:
    for eachyear in sorted_tweets:
        for each in reversed(eachyear):
            output.write(each[0] + " : " + each[1] + "\n")
            print(each[0] + " : " + each[1])

Sorting Out all My Writing

Coding Python isn’t the only project I’ve been working on recently, though it IS the major one. Another project I’ve been working on, one that is at least tangential to “modernizing how I code”, is organizing all of my writing. I write a LOT. I sometimes list “writing” as a hobby, but I almost never list it as a “Primary Hobby”, even though it’s arguably the hobby I have done the longest, even longer than collecting toys, and one that I would like to think I do pretty well. Ok, no, scratch that, I’ve been a “Gamer” since before I could really write. Actually, it seems like all of my “major hobbies” started when I was like 5-10, so I guess those “formative years” really do matter. My first programming was on the family’s old Franklin PC with two 5.25-inch floppy drives, writing BASIC that my dad had taught me. He had been going to college for Computer Science at the time.

Anyway, writing.

I write, a lot. I write about all sorts of topics. Sometimes I write technical write-ups, sometimes I write (purposely) shitty Final Fantasy VII Fan Fiction. I write casual blog posts about music, movies, and toys. I write detailed instructions for work or FAQs for Video Games. They aren’t all “winners”, but I have gotten a lot of compliments over the years for my writing style and methods. I also save everything. I mean, literally EVERYTHING I create. There are a few things I no longer have, and I still think about them sometimes and wish I had copies. A few years ago I even started transcribing some of my old paper journals and stories into digital text.

The end result is that I have a lot of files in a lot of formats. Some are text files, some are Word files, some are exported XML archive files. A few are PDF-based exports, as well as some old “Windows Live Writer” files.

As part of my personal journey to “level up” a bit on my computer skills (which are already pretty great), I have been working on getting more accustomed to using Markdown. Markdown is essentially “Fancy Text Files”. They are plain text files, with special symbols inserted occasionally to make things look prettier in a Markdown reader. The thing is, this means they are very compact in size and can still be read by even the most basic reader (albeit with the random symbols inserted sometimes).

Most of this effort involves a LOT of copying and pasting. I’ve converted a bunch of Word Docs I had over to Markdown files. Text docs aren’t generally huge to start with, but the Markdown versions are sometimes a quarter of the file size. When we are talking hundreds to thousands of files, this is a significant savings. So far, I’ve been skipping reviews if they have embedded images, but I already have those images saved elsewhere, so I may revisit that concept.

This also means finally sorting through some other “to sort” boxes. For example, for a while, I was posting blog posts with Microsoft’s now discontinued “Windows Live Writer”. The shitty part is, it used a proprietary format that even Word can’t open. Fortunately, there is an open source alternative, “Open Live Writer”. I don’t use it to post, but I can open those old Live Writer files and convert them to useful Markdown files.

One fun thing I did was export all of my Reddit Posts, and pull out anything over 500 characters as a “Journal Entry”.

Another source is old WordPress Exports. I have used my newfound l33t Pythonista Skills to build a sweet little script that takes a WordPress XML export, and parses through it for dates, titles, and content. Next, it cleans up the post content a bit (it’s not perfect, sadly), and spits it all out to a series of files in the format I want. (This script could easily be modified to work with other similar data exports, like Reddit.)
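
To give a feel for the parsing step, here is a minimal sketch using Python's built-in XML parser. It is not the script from the repository. It assumes the export is an RSS-style file where each post is an `item` with `title`, `wp:post_date` and `content:encoded` children, and the namespace URLs (in particular the `1.2` in the `wp` one) are assumptions that may differ between WordPress versions. `parse_posts` and the tiny sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URLs; check the top of the actual export file.
NAMESPACES = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}


def parse_posts(xml_text):
    """Return a list of dicts holding the title, date and content of each item."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "date": item.findtext("wp:post_date", default="", namespaces=NAMESPACES),
            "content": item.findtext("content:encoded", default="", namespaces=NAMESPACES),
        })
    return posts


sample = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Hello</title>
      <wp:post_date>2020-01-02 03:04:05</wp:post_date>
      <content:encoded><![CDATA[<p>First post.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

for post in parse_posts(sample):
    print(post["date"], "-", post["title"])
```

The content still contains HTML at this point, so a cleanup pass like the one described above would come after this.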

That code can be found over on GitHub. It’s probably buggy, but it works for the most part.

Which brings up sorting. I have posted a few times about digital organization, and I’ve gotten the text down to a science as well. There is a folder called “Journal” in my OneDrive, which syncs to several PCs and my NAS. Inside, it’s sorted by year, and inside each year are files named YYYY.MM.DD – TOPIC.md. I’ve also incorporated this into my blogging workflow, so partially written posts in the current year get X_ added to the front. That way they all sort to the bottom, but I still have an idea of when I had the idea.
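
The naming scheme is simple enough to generate in code. The sketch below is one way to build such a path; `journal_path`, the `draft` flag, and the example topic are hypothetical, and it uses a plain hyphen as the separator, which is an assumption about the actual file names.

```python
from datetime import date
from pathlib import PurePosixPath


def journal_path(root, day, topic, draft=False):
    """Build root/YYYY/YYYY.MM.DD - TOPIC.md, with X_ in front for drafts."""
    name = f"{day:%Y.%m.%d} - {topic}.md"
    if draft:
        # The X_ prefix sorts after the digits, so drafts land at the bottom.
        name = "X_" + name
    return PurePosixPath(root) / str(day.year) / name


print(journal_path("Journal", date(2024, 3, 5), "Glee Rant"))
print(journal_path("Journal", date(2024, 3, 5), "Glee Rant", draft=True))
```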

This whole new system also allows me an easy way to just Journal occasionally.  One thing I’ve been trying to work on is that “not everything has to be a blog post”.  Sometimes it’s good to just, write, for myself, date it, and spit it out.

It’s healthy to get those thoughts out sometimes. For example, would you like to know how many times I’ve randomly bitched about the show Glee over the past 10-15 years?  Because it’s more than is probably healthy.

Anyway, this project is still a work in progress, but I’ve made a LOT of progress and I’m pretty happy with how it’s been going.