Twitter Archive to Markdown

I have been wanting to convert my Twitter Export archive into a more useful format for a while. I had vague dreams of maybe turning them into some sort of digest posts on WordPress but there are a fuckton of tweets (25,000) and it would require a LOT of scrubbing. I could probably manage to find and remove Retweets pretty easily, but then there is the issue of media and getting the images into the digest posts and it’s just not worth the hassle.

What I can do, and did, is preserve the data in a better, more digestible, and searchable format. Specifically, Markdown. Well, OK, the files are not doing anything fancy, so it’s just plaintext pretending to be Markdown.

I have no idea if Twitter still offers an export dump of your data; I have not visited the site at all in over a year. I left, I deleted everything, I blocked it on my DNS. But assuming they do, or you have one, it’s a big zip file that can be unrolled into a sort of local, Twitter-like interface. There are a lot of files in this ball, and while I am keeping the core archive, I mostly just care about the content.

If you dig in, it’s easy to find: there is a folder called data, and the tweets are in a file called “tweets.js”. It’s a JSON-style format. If you want the media, it’s in a folder called “Tweets_Media” or something like that. I skimmed through mine, and most of the images looked familiar because I already have them, so I removed the copy since I didn’t need it.

But I kept the tweets.js file.

So, what to do with it? It has a bunch of extraneous metadata for each Tweet that makes it a cluttered mess. It’s useful for a huge website, but all I want is the date and the text. Here is a sample Tweet from the file.

{
    "tweet" : {
      "edit_info" : {
        "initial" : {
          "editTweetIds" : [
            "508262277464608768"
          ],
          "editableUntil" : "2014-09-06T15:05:44.661Z",
          "editsRemaining" : "5",
          "isEditEligible" : true
        }
      },
      "retweeted" : false,
      "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
      "entities" : {
        "hashtags" : [ ],
        "symbols" : [ ],
        "user_mentions" : [ ],
        "urls" : [ ]
      },
      "display_text_range" : [
        "0",
        "57"
      ],
      "favorite_count" : "0",
      "id_str" : "508262277464608768",
      "truncated" : false,
      "retweet_count" : "0",
      "id" : "508262277464608768",
      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",
      "favorited" : false,
      "full_text" : "\"Sorry, you are over the limit for spam reports!\"  Heh...",
      "lang" : "en"
    }
  },

So I wrote a quick and simple Python script (it’s below). I probably could have done something fancy with Beautiful Soup or Pandas, but instead I did a quick and basic scan that pulls the data I care about. If a line contains “created_at”, pull it out to get the date; if it has “full_text”, pull it out to get the text.
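For comparison, here is a minimal sketch of the other route, using Python’s built-in json module instead of a line scan. It is not what my script does. It assumes the file is one JSON array with a JavaScript assignment in front of it (the `window.YTD.tweets.part0 = ` prefix in the sample is my assumption about what that looks like), and `load_tweets` is just a name I made up for the example.

```python
import json


def load_tweets(raw_text):
    """Parse the contents of a tweets.js file into a list of (date, text) pairs."""
    # Assumption: everything before the first "[" is a JavaScript assignment
    # and the rest of the file is one valid JSON array.
    start = raw_text.index("[")
    entries = json.loads(raw_text[start:])
    pairs = []
    for entry in entries:
        tweet = entry["tweet"]
        pairs.append((tweet["created_at"], tweet["full_text"]))
    return pairs


# Tiny hand-made sample in the same shape as the entry shown above.
sample = '''window.YTD.tweets.part0 = [
  {
    "tweet" : {
      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",
      "full_text" : "\\"Sorry, you are over the limit for spam reports!\\"  Heh..."
    }
  }
]'''

for date, text in load_tweets(sample):
    print(date, "-", text)
```

One side effect of going through json is that escaped characters in the text (like the `\"` in the sample) come out decoded, where the line scan leaves them exactly as they appear in the file.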

Once I was able to output these two lines, I went about cleaning them up a bit. I don’t need the key names, so I started by splitting on “:”. This was quickly problematic if the Tweet contained a colon, and because the time itself contains colons. Instead I split on ‘" : "’, specifically quote, space, colon, space, quote. Only the breaks I wanted had the spaces and quotes, so that got me through step one. The end quotation mark was easy to slice off as well.
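To make the difference concrete, here is a small sketch run against the “created_at” line from the sample Tweet. The variable names are only for illustration.

```python
line = '      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",\n'

# Splitting on a bare colon also cuts the time apart.
naive = line.split(":")
print(len(naive))  # 4 pieces instead of the 2 I wanted

# Splitting on quote-space-colon-space-quote only matches the key/value break.
key, value = line.split('" : "')
value = value[:-3]  # drop the closing quote, the comma, and the newline
print(value)
```

The `[:-3]` slice relies on the line ending in a quote, a comma, and a newline, which holds for these two fields in the sample but would not hold for the last field of an object.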

I considered simplifying things by using the same transformation on the date and the text, but the date also had this +0000 in it that I wanted to remove. It’s not efficient, but it was just as simple to have two very similar operations.
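An alternative to stripping the +0000 as text would be to parse the whole date, which also makes it easy to print in any format. This is a sketch only, not what the script does. The format string is my reading of the sample Tweet, and `parse_created_at` is a made-up helper name. The day and month names are matched using the current locale, so this assumes an English one.

```python
from datetime import datetime


def parse_created_at(raw):
    # Format assumed from the sample: weekday, month, day, time, UTC offset, year.
    return datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")


stamp = parse_created_at("Sat Sep 06 14:35:44 +0000 2014")
print(stamp.strftime("%Y-%m-%d %H:%M"))
```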

After some massaging, I was able to output something along the lines of “date – text”.

But then I noticed that for some reason the Tweets are apparently not in date order. I had decided that I was just going to create a series of year based archival files, so I needed them to be in order.

So I added a few more steps to sort each Tweet during processing into an array of arrays based on the year. Once again, this isn’t the cleanest code. It assumes a fixed range of years, 2004 through 2025 (range(2004, 2026) in the code), which covers my needs for certain. I also had some “index out of range” errors with my array of arrays, which probably have a clever loopy solution, but instead it’s just a big pre-initialized copy/paste array.
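For what it’s worth, the “clever loopy solution” could look something like the sketch below. A defaultdict creates each year’s list the first time it is needed, so there is nothing to pre-initialize and no index to run off the end of. The sample data and the `posted` helper are made up for the example, and the date format assumes the +0000 has already been removed, as in my script.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample in the [date, text] shape the script builds.
tweet_data = [
    ["Sat Sep 06 14:35:44 2014", "second of 2014"],
    ["Tue Mar 05 09:00:00 2013", "only one from 2013"],
    ["Wed Jan 01 08:00:00 2014", "first of 2014"],
]


def posted(tweet):
    # Parse the cleaned-up date string so it can be compared and sorted.
    return datetime.strptime(tweet[0], "%a %b %d %H:%M:%S %Y")


# Group by year; missing years are simply never created.
by_year = defaultdict(list)
for tweet in tweet_data:
    by_year[posted(tweet).year].append(tweet)

# Walk the years in order and sort each year's Tweets by actual date.
for year in sorted(by_year):
    by_year[year].sort(key=posted)
    for date, text in by_year[year]:
        print(year, date, text)
```

Sorting on the parsed date also puts the Tweets inside each year in true date order, which the year-only bucketing does not guarantee.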

Part of the motivation for doing the array of arrays was also that I could make the script output my sorted yearly files directly, but I just did it manually from the big ball final result. The job is done, but it could easily be automated by adjusting the lower output block a bit.
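A rough sketch of what that adjusted output block might look like is below. It takes the same array of arrays, with one list per year, and writes one file per non-empty year. The function name, the `tweets-YEAR.md` file names, and the `first_year` parameter are all my own inventions for the example, and the demo writes into a throwaway temp folder.

```python
import tempfile
from pathlib import Path


def write_yearly_files(sorted_tweets, out_dir, first_year=2004):
    """Write one Markdown file per non-empty year; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for offset, tweets in enumerate(sorted_tweets):
        if not tweets:
            continue  # skip years with no Tweets
        path = out_dir / f"tweets-{first_year + offset}.md"
        with open(path, "w", encoding="utf-8") as output:
            for date, text in tweets:
                output.write(f"{date} : {text}\n")
        written.append(path)
    return written


# Demo: the first year is empty, the second has one Tweet.
demo = [[], [["Sat Sep 06 14:35:44 2014", "hi"]]]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_yearly_files(demo, tmp, first_year=2013)
    names = [p.name for p in paths]
    contents = paths[0].read_text(encoding="utf-8")

print(names)
```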

Anyway, here is the code, and a link to a Git repository for it.

# A simple script that takes an exported tweets.js file and outputs it to a markdown text file for archiving.
# In pulling data for this, I noticed that older Twitter exports use a csv file instead of a .js file.
# As such, this is for newer exports.
# The Tweets.js file is in the 'data' directory of a standard Twitter archive export file.

# Open the tweets.js file containing all the tweets, should be in the same folder
with open("tweets.js", encoding="utf-8") as file:
    filedata = file.readlines()

tweet_data = []
current_tweet = []
# The Tweets don't seem to be in order, so I needed to sort them out, this is admittedly ugly
# but I only need to cover so many years of sorting and this was the easiest way to avoid index errors
sorted_tweets = [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

# Does a simple search through the file.  It pulls out the date posted and the full text.
# This does not do anything with images, sorry, that gets more complicated, it would be doable
for line in filedata:
    if "created_at" in line:
        # Take the value after the key, drop the +0000 offset, then trim the trailing quote, comma, and newline
        posted_at = line.split("\" : \"")[1].replace(" +0000 ", " ")[:-3]
        current_tweet.append(posted_at)
    elif "full_text" in line:
        current_tweet.append(line.split("\" : \"")[1][:-3])
        #        current_tweet.append(line.split(":")[1].split("\"")[1])
        tweet_data.append(current_tweet)
        current_tweet = []
        # Because full text is always after the date, it just moves on after it gets both
    else:
        pass

# An ugly sort, it simply looks for the year in the date, then creates an array of arrays based on year.
# I did it this way partly in case I wanted to output to separate files based on year, but I can copy/paste that
# It probably is still out of order based on date, but whatever, I just want a simple archive file
for each in tweet_data:
    for year in range(2004, 2026):
        if str(year) in each[0]:
            sorted_tweets[year - 2004].append(each)

# Prints the output and dumps it to a file.
with open("output.md", encoding="utf-8", mode="w") as output:
    for eachyear in sorted_tweets:
        for each in reversed(eachyear):
            output.write(each[0] + " : " + each[1] + "\n")
            print(each[0] + " : " + each[1])

Ramen Junkie

Josh Miller aka “Ramen Junkie”. I write about my various hobbies here. Mostly coding, photography, and music. Sometimes I just write about life in general. I also post sometimes about toy collecting and video games at Lameazoid.com.
