Twitter Archive to Markdown

I have been wanting to convert my Twitter Export archive into a more useful format for a while. I had vague dreams of maybe turning them into some sort of digest posts on WordPress but there are a fuckton of tweets (25,000) and it would require a LOT of scrubbing. I could probably manage to find and remove Retweets pretty easily, but then there is the issue of media and getting the images into the digest posts and it’s just not worth the hassle.

What I can, and did do, is preserve the data in a better, more digestible, and searchable format. Specifically, Markdown. Well, ok, the files are not doing anything fancy, so it’s just plaintext pretending to be Markdown.

I have no idea if Twitter still offers an export dump of your data; I have not visited the site at all in over a year. I left, I deleted everything, I blocked it on my DNS. But assuming they do, or you have one, it’s a big zip file that can be unrolled into a sort of local, Twitter-like interface. There are a lot of files in this ball, and while I am keeping the core archive, I mostly just care about the content.

If you dig in, it’s easy to find. There is a folder called data, and the tweets are in a file called “tweets.js”. It’s some sort of JSON style format. If you want the media, it’s in a folder called “Tweets_Media” or something like that. I skimmed through mine, and most of the images looked familiar because I already have them, so I removed the copy because I didn’t need it.

But I kept the tweets.js file.

So, what to do with it? It has a bunch of extraneous metadata for each Tweet that makes it a cluttered mess. It’s useful for a huge website, but all I want is the date and the text. Here is a sample Tweet in the file.

{
    "tweet" : {
      "edit_info" : {
        "initial" : {
          "editTweetIds" : [
            "508262277464608768"
          ],
          "editableUntil" : "2014-09-06T15:05:44.661Z",
          "editsRemaining" : "5",
          "isEditEligible" : true
        }
      },
      "retweeted" : false,
      "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
      "entities" : {
        "hashtags" : [ ],
        "symbols" : [ ],
        "user_mentions" : [ ],
        "urls" : [ ]
      },
      "display_text_range" : [
        "0",
        "57"
      ],
      "favorite_count" : "0",
      "id_str" : "508262277464608768",
      "truncated" : false,
      "retweet_count" : "0",
      "id" : "508262277464608768",
      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",
      "favorited" : false,
      "full_text" : "\"Sorry, you are over the limit for spam reports!\"  Heh...",
      "lang" : "en"
    }
  },

So I wrote a quick and simple Python Script (it’s below). I probably could have done something fancy with Beautiful Soup or Pandas, but instead I did a quick and basic scan that pulls the data I care about. If a line contains “created_at”, pull it out to get the date; if it has “full_text”, pull it out to get the text.

Once I was able to output these two lines, I went about cleaning them up a bit. I don’t need the titles, so I started by splitting on “:”. This was quickly problematic if the Tweet contained a colon, and because the time has its own colons. Instead I did a split on ‘" : "’. Specifically, quote, space, colon, space, quote. Only the breaks I wanted had the spaces and quotes, so that got me through step one. The end quotation mark was easy to slice off as well.
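To make the difference concrete, here is a small sketch of the two splits. The date line is taken from the sample Tweet above; the text line is made up for illustration.

```python
# Two lines shaped the way they appear in tweets.js.
date_line = '      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",\n'
text_line = '      "full_text" : "Note: colons in a tweet break a naive split",\n'  # made-up text

# Splitting on a bare colon also cuts the time into pieces.
naive_parts = date_line.split(":")

# Splitting on quote-space-colon-space-quote only matches the key/value break.
# The [:-3] slice drops the closing quote, the comma, and the newline.
date_value = date_line.split('" : "')[1][:-3]
text_value = text_line.split('" : "')[1][:-3]

print(len(naive_parts))  # 4 pieces instead of 2
print(date_value)        # Sat Sep 06 14:35:44 +0000 2014
print(text_value)        # Note: colons in a tweet break a naive split
```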

I considered simplifying things by using the same transformation on the date and the text, but the date also had this +0000 in it that I wanted to remove. It’s not efficient, but it was just as simple to have two very similar operations.
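As an aside, the +0000 could also be handled by parsing the date instead of trimming the string. This is only a sketch of that alternative, not what the script does, and it assumes every “created_at” value follows the same layout as the sample above.

```python
from datetime import datetime

raw = "Sat Sep 06 14:35:44 +0000 2014"

# %z understands the +0000 offset, so nothing has to be cut out by hand.
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")

# Format it again without the offset.
clean = parsed.strftime("%a %b %d %H:%M:%S %Y")

print(clean)        # Sat Sep 06 14:35:44 2014
print(parsed.year)  # 2014
```

A parsed date also has the advantage that it can be sorted properly later on.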

After some massaging, I was able to output something along the lines of “date – text”.

But then I noticed that for some reason the Tweets are apparently not in date order. I had decided that I was just going to create a series of year based archival files, so I needed them to be in order.

So I added a few more steps to sort each Tweet during processing into an array of arrays based on the year. Once again, this isn’t the cleanest code. It assumes a range of something like 2004 to 2026, which covers my needs for certain. I also had some “index out of range” errors with my array of arrays, which probably have a clever loopy solution, but instead it’s just a big pre-initialized copy/paste array.
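One possible version of the “clever loopy solution” is a dictionary keyed by year, which never needs pre-initializing. This is just a sketch: the sample Tweets are made up, and it assumes the year is the last four characters of the cleaned-up date.

```python
from collections import defaultdict

# Made-up [date, text] pairs in the same shape the script builds.
tweet_data = [
    ["Sat Sep 06 14:35:44 2014", "second"],
    ["Mon Jan 02 09:00:00 2012", "first"],
    ["Sun Sep 07 10:00:00 2014", "third"],
]

# A missing year gets an empty list automatically, so no index errors.
tweets_by_year = defaultdict(list)
for date, text in tweet_data:
    year = int(date[-4:])  # assumption: the date string ends with the year
    tweets_by_year[year].append([date, text])

for year in sorted(tweets_by_year):
    print(year, len(tweets_by_year[year]))
```

Reading the year from the end of the string also avoids matching a four digit number that happens to show up somewhere else in the date.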

Part of the motivation of doing the array of arrays was also that I could make the script output my sorted yearly files directly, but I just did it manually from the big ball final result. The job is done, but it could easily be automated by adjusting the lower output block a bit.
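For reference, writing the yearly files directly could look something like the sketch below. The write_yearly_files helper and the tweets-YEAR.md file names are placeholders, and it expects a dictionary mapping each year to a list of [date, text] pairs.

```python
from pathlib import Path


def write_yearly_files(tweets_by_year, out_dir="."):
    """Write one Markdown file per year and return the paths that were written."""
    written = []
    for year, tweets in sorted(tweets_by_year.items()):
        path = Path(out_dir) / f"tweets-{year}.md"  # placeholder naming scheme
        with open(path, mode="w", encoding="utf-8") as output:
            for date, text in tweets:
                output.write(f"{date} : {text}\n")
        written.append(path)
    return written
```

Calling write_yearly_files({2014: [["Sat Sep 06 14:35:44 2014", "Heh..."]]}) would create a single tweets-2014.md in the current folder.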

Anyway, here is the code, and a link to a Git repository for it.

# A simple script that takes an exported tweets.js file and outputs it to a markdown text file for archiving.
# In pulling data for this, I noticed that older Twitter exports use a csv file instead of a .js file.
# As such, this is for newer exports.
# The Tweets.js file is in the 'data' directory of a standard Twitter archive export file.

# Open the tweets.js file containing all the tweets, should be in the same folder
with open("tweets.js", encoding="utf-8") as file:
    filedata = file.readlines()

tweet_data = []
current_tweet = []
# The Tweets don't seem to be in order, so I needed to sort them out, this is admittedly ugly
# but I only need to cover so many years of sorting and this was the easiest way to avoid index errors
sorted_tweets = [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

# Does a simple search through the file.  It pulls out the date posted and the full text.
# This does not do anything with images, sorry, that gets more complicated, it would be doable
for line in filedata:
    if "created_at" in line:
        # Split on the '" : "' separator, drop the +0000 offset, then trim the trailing '",' and newline
        posted_at = line.split("\" : \"")[1].replace(" +0000 ", " ")[:-3]
        current_tweet.append(posted_at)
    elif "full_text" in line:
        current_tweet.append(line.split("\" : \"")[1][:-3])
        #        current_tweet.append(line.split(":")[1].split("\"")[1])
        tweet_data.append(current_tweet)
        current_tweet = []
        # Because full text is always after the date, it just moves on after it gets both
    else:
        pass

# An ugly sort, it simply looks for the year in the date, then creates an array of arrays based on year.
# I did it this way partly in case I wanted to output to separate files based on year, but I can copy/paste that
# It probably is still out of order based on date, but whatever, I just want a simple archive file
for each in tweet_data:
    for year in range(2004, 2026):
        if str(year) in each[0]:
            sorted_tweets[year - 2004].append(each)

# Prints the output and dumps it to a file.
with open("output.md", encoding="utf-8", mode="w") as output:
    for eachyear in sorted_tweets:
        for each in reversed(eachyear):
            output.write(each[0] + " : " + each[1] + "\n")
            print(each[0] + " : " + each[1])
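For comparison, the “fancier” route may not need Beautiful Soup or Pandas at all, since the standard json module can probably do the job. The sketch below assumes tweets.js is a JSON array with a short JavaScript assignment in front of it (something like window.YTD.tweets.part0 = ), which is not confirmed anywhere above, so check the first line of your own file. The load_tweets helper is a made-up name and the second sample Tweet is invented.

```python
import json
from datetime import datetime


def load_tweets(js_text):
    """Return (datetime, text) pairs, oldest first, from the contents of tweets.js."""
    # Assumption: everything before the first "[" is a JavaScript assignment.
    payload = js_text[js_text.index("["):]
    tweets = []
    for entry in json.loads(payload):
        tweet = entry["tweet"]
        posted = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
        tweets.append((posted, tweet["full_text"]))
    return sorted(tweets, key=lambda pair: pair[0])


sample = '''window.YTD.tweets.part0 = [
  {"tweet": {"created_at": "Sat Sep 06 14:35:44 +0000 2014",
             "full_text": "\\"Sorry, you are over the limit for spam reports!\\"  Heh..."}},
  {"tweet": {"created_at": "Mon Jan 02 09:00:00 +0000 2012",
             "full_text": "An earlier, made-up tweet"}}
]'''

for posted, text in load_tweets(sample):
    print(posted.strftime("%Y-%m-%d %H:%M"), ":", text)
```

Going this way would also put the Tweets in true date order and un-escape the quotes in the text, at the cost of being a little less quick and dirty.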

Thoughts on Twitter, Musk, and Alternatives…

I have really really tried to mostly avoid discussing Twitter and Musk and everything that has happened there over the past year and a half to two years. I do occasionally share news in the link blog posts, but even there, I mostly just avoid it. I am pretty outspoken about my dislike of Musk and Twitter on other forums but not on my own forums.

Watching this death spiral is really entertaining though.

And it is a death spiral. It may not actually result in the death of Twitter, god knows we won’t get that lucky, but it’s just increasingly looking shittier and shittier over there. I stopped using Twitter completely the day Musk took over. I deleted a bunch of random secondary meme accounts I had after that, and I did log in a few times to pull all my Tweet archive data. I want to, someday, maybe, write a Python Script that will parse through it all and compile it into a bunch of daily digests I can dump into a WordPress blog, for posterity. I also started running some Python scripts before the API was cut off to delete all my old Tweets from the site. As far as I know, I still have my @ handles, mostly kept to prevent them from getting scooped up by spammers and bots.

I am not sure though. I blocked Twitter shortly after I started using NextDNS (Referral Link) everywhere. I can’t even check on my own accounts without a bunch of extra steps anymore. At this point, I really don’t care. I am not going back ever so long as Musk is even remotely connected to the service and I doubt he ever gives it up. I do keep watch from the sidelines. I see mentions of large businesses or politicians or news outlets moving permanently to Threads. I see people talking about how blue-checked bots are topping all the replies. I see complaints about all the crypto scams and weed gummies being advertised. I see it, and I quietly laugh to myself. Because all of this happening was clearly going to be the outcome of a big whiny racist narcissist forcibly taking things over.

I’m not entirely convinced this wasn’t the intended outcome honestly. People like Musk, with their “free speech advocacy”, generally dislike actual open discussion and speech. They dislike when people can talk openly to each other and let ideas swell and become reality while smashing down stupid racist bullshit and conspiracy lies.

Fun fact, you can post a tweet with phrases like “Transwomen aren’t women” but if you post about “CIS people” you get flagged for using a slur.

Probably the first and biggest stupidity was the new pay-to-play blue check system that was implemented pretty early on. Blue Checks were originally issued as a way to verify people and companies were actually who they said they were. Someone at Twitter would do due diligence to make sure @McDonalds was actually run by the popular restaurant chain. This also meant not allowing blue checks for “@MacDonalds” or “@McD0nalds” or various other typo-style fake accounts. It meant something. Early on, this was changed so Blue Checks just meant you had a paid subscription. Anyone could get a blue check. It also showed that you were supporting the racist jackass and his company, so a lot of previously verified celebrity types refused to pay. Some were given checks anyway, which also upset these companies and people since it, of course, implies support. It’s essentially a false endorsement.

As more advertisers fled the platform as it became increasingly filled with assholes and bots and scams, the Blue Check system has just been pushed more and more in a desperate attempt to make up for lost ad revenue. The irony being that even if EVERYONE signed up, it’s nowhere near what advertisers were paying. The latest stupidity is that they now require new users to pay in to start posting. It’s pushed as a way to “deter bots”. Twitter doesn’t seem to understand just how cheap $8/month/account is for priority visibility for scams. One might wonder if it’s still worthwhile if so many are jumping ship, but it’s like those scam emails full of spelling errors. The scammers do this to weed out the intelligent users so only the choicest of marks remain. Twitter is doing a GREAT job of weeding out the intelligence from its system, leaving nothing but easy marks for these scammers.

I almost would feel bad for these people if they weren’t mostly the same people pushing all the hate-filled stupidity on the world in politics during the past decade. But that’s probably left to another discussion, if ever.

The really funny part is how this isn’t even the first time this has happened to a microblog service centered around “Free speech”. Gab, Truth, Parler, and others I am sure I’ve forgotten are all basically complete failures. They failed to take off or get any real traction after filling up with right-wing extremists, which at best just drives away any legitimate advertisers. Truth recently pushed a scam IPO as a way to grift money for Trump’s lawsuits, which is failing pretty spectacularly.

Because of course it is. It was a grift to funnel money in a “legitimate” manner, and now it’s just a bunch of bag holders getting fucked over.

Alternatives

I have not really quite settled on a good alternative to Twitter yet. I’m not entirely sure I really NEED one. I wasn’t using Twitter a lot before the fall, though I had used it since 2006 when it was very very new. The alternatives all have their own sort of pitfalls.

Threads seems to be the most active. It’s run by Facebook and is technically a spin-off of Instagram. I kind of like Threads, because it’s full of people posting Toy photos. Basically, everything I used to like about Instagram, before it became TikTok but with ads every 3 posts, is Threads. I don’t super like that it’s a Facebook property. I also hate how the timeline feels really really algorithm-driven.

BlueSky feels the most like “old Twitter”, and I don’t mean “2021/2022 Twitter”, I mean like, “2007-2008 Twitter”. OLD old Twitter. But it’s also kind of dead as fuck. Even now that it’s open to anyone without the need for invites, it feels a bit deserted.

Mastodon is probably my favorite. People claim it’s “hard to use” but it really isn’t. The real technical hurdles on Mastodon kind of stem from servers and admins who tend to be a little… eccentric, for lack of a better thing to call them. There are admins who will ban entire other instances because ONE user on that other instance says something that is kind of maybe offensive to … somebody. Or heck, even blatantly offensive to everyone. But the whole server gets banned over one person. Which feels a bit shitty, especially since there also feels like a lot of mindset that “once banned, it’s banned forever.”

The federation also had some weirdness. Sometimes I get a new follower, so I go and check them out to see if I want to follow back, but in the app, they LOOK like they have a blank profile. But if I open their profile in a web browser, it’s complete and they have posts. So there is clearly some weird syncing issue there. I’m not familiar enough with how the federation works to know the details, but from what I have gleaned from other discussions, it’s something like that. Or maybe that server is banned for some reason.

It’s also kind of clunky to re-toot something, from that something. If I link to a Toot, and you want to re-toot it, from what I can tell, you need to cut and paste the URL and do a search to find it from your own server. Or do a weird login jaunt from the local server. And it’s all very doable, but it’s kludgy as fuck.

Anyway, I kind of post to all three, sometimes I post the same thing to all three, sometimes I kind of segment it out depending on “audience”. Not that I really have an audience. My pseudo plan is to mostly use Threads for Toy stuff, and BlueSky or Mastodon for everything else. I’m not entirely sure yet. There also aren’t really easy tools to post things like, blog posts, automatically to Threads or BlueSky. This was a factor that always felt like part of why Google Plus failed.

Twitter Drama and Mastodon

What a completely non-eventful roller coaster the latest Twitter Drama is shaping up to be. I suppose it’s somewhat in the “early stages” and a lot of people, including myself, may be acting a bit overdramatic, but I don’t think Elon Musk buying Twitter will be anything good long term.

Twitter isn’t, wasn’t, whatvern’t that great. It was ok, personally, I’ve been kind of struggling to care about Twitter as a platform for a while. It’s probably just some sort of burn out, I’ve been there since essentially the beginning, in 2006. Back when good ol’ Leo Laporte was the number one most followed user, until Kevin Rose was. Then Leo again, it was sort of a competition. Those whopping follow counts were in the thousands as well back then. Twitter is definitely much larger and much more since then. And I find it hard to keep up with anymore.

I’ve tried using lists, but for some reason Twitter only lets you easily pin 5 lists. How useless is that? I have dozens of lists. Politics lists, tech lists, toys lists, music lists, transformers lists, also split across several sub lists, like “Toys – News,” and “Toys – Bloggers”, “Tech – News,” or “Tech – Cybersecurity”. Segmentation of content makes it much easier to follow and be in the right mindset for each topic.

Over time, it also became sort of a crazy place for politics and the spread of misinformation campaigns promoted by trolls and bots. These aren’t the classic style trolls of the days of Ye Olde Usenet, where one person might be harassing another over something the latter was taking a bit too seriously. These are weaponized trolls pushed by people with absolutely awful agendas against large groups of people. This was bad during the Obama Era of the US but made absolutely worse during the Trump Era.

It’s not entirely just a Trump thing, or a US thing, there is idiocy going on all around the world, but I’m still going to use the US as a frame of reference, since I am in the US. It’s also a problem across many Social Platforms. Lately there have been a lot of actual efforts to stem the spread of lies and stupidity on a lot of platforms, Twitter included. This is where we end up with more rift and part of Musk’s stated reasoning for pissing away billions of dollars on a platform that isn’t worth anywhere near that.

Free Speech.

Which is the real crux of the issue. Some are trying to confuse it with the idea that people angry over this don’t like that Musk is a Billionaire. How it’s hypocritical because Bezos bought the Washington Post. The problem isn’t that Musk is a billionaire, it’s that he’s kind of a jackass. And he wants to open the platform back up to let other jackasses be jackasses. “Free Speech” isn’t at all about free speech to these people; it’s about freedom to be an asshole. This is why people are upset. They are tired of people spreading lies and idiocy, then just screaming people down when they are called out on it.

It was getting better.

It will be interesting to see what comes out of all this. I don’t think it’s going to be anything good. For one, every discussion about Musk buying Twitter on Reddit seems to end up locked. Because even just discussing the issue, people can’t stay civilized. There have also been a LOT of “Free Speech” platforms pop up over the past several years, and basically every single one failed. Some still limp along, but they all devolve into a bunch of jackasses calling for violence and spouting endless hate speech. They get kicked off their hosting platforms for violating TOS, sometimes the creators realize what a mess they unleashed and close things down themselves, sometimes they just fall apart because they can’t create any real way to financially support the platform.

Twitter may be big enough to survive for a while, but that’s not even real clear. It’s still one of the smallest social platforms in terms of users at around 350 million. For comparison, Facebook and TikTok have Billions, with an s. Basically a measurable 25% of the entire world’s population. There is a greater than good chance that at least half of Twitter’s users are bot accounts, either actual scripted agent bots or sweat shop people in 3rd world countries clicking retweet buttons “bots”. Add this in with a lot of people leaving Twitter in disgust, and it will be interesting to see what the user base is in a month or two.

So what’s the alternative? A lot of people are pushing and moving to Mastodon. Mastodon isn’t quite the same as Twitter but it’s very similar, especially to old Twitter. For starters, it’s Federated, which means, anyone can host a Mastodon server (called Instances), and it can connect to other Mastodon Instances. This means there are many Instances themed around specific topics. It also means that if an Instance becomes full of idiots, then it can easily be blocked by other Instances.

This is not my first attempt at Mastodon either. I’ve used it off and on for a while and even ran a script for a long time that would sync my Twitter and Mastodon profiles, creating an illusion of activity. Now I’m trying to use it full time though. I have wanted to make it work for a while anyway, now, with all of the attention it’s getting, seems like as good of a time as any. I guess maybe it might be best to just treat it more like the “Classic Twitter” days, and just toss stuff out into the Ether and see if anyone reacts.

Currently I’m on the core Mastodon.social, though I may look into moving elsewhere, but if you want to give me a follow, you can find me at https://mastodon.social/@RamenJunkie.

Next Thing CHiP as a Twitter Bot

There was a post that came across on Medium recently, How to Make a Twitter Bot in Under an Hour.  It’s pretty straightforward, though it seems to be pretty geared towards non “techie” types, mostly because it’s geared towards people making the bot on a Mac and it uses something called Heroku to run the bot.  Heroku seems alright, except that this sort of feels like an abuse of their free tier, and it’s not free for any real projects.

I already have a bunch of IOT stuff floating around that’s ideal for running periodic services.  I also have a VPS if I really wanted something dedicated.  So I adapted the article for use in a standard Linux environment.  I used one of my CHiPs but this should work on a Raspberry Pi, an Ubuntu box, a VPS, or pretty much anything running Linux.

The first part of the article is needed, set up a new Twitter account, or use one you already have if you have extras.  Go to apps.twitter.com, create an app and keys, keep it handy.

Install git and python and python’s twitter extension.

sudo apt-get install git

sudo apt-get install python-twitter

This should set up everything we’ll need later.  Once it’s done, clone the repository.

git clone https://github.com/tommeagher/heroku_ebooks.git

This should download the repository and its files.  Next it’s time to set up the configuration files.

cd heroku_ebooks

cp local_settings_example.py local_settings.py

pico local_settings.py

This should open up an editor with the settings file open.  It’s pretty straightforward: you’ll need to copy and paste the keys from Twitter into the file.  There are 4 of them total, and make sure you don’t leave any extra spaces inside the single quotes.  You’ll also need to add one or more accounts for the bot to model itself after.  Finally, change DEBUG = True to DEBUG = False and add your bot’s username to the TWEET_ACCOUNT = '' entry at the bottom.
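As a rough guide, the edited settings file ends up looking something like the sketch below.  The variable names for the keys and source accounts are assumptions rather than something confirmed here, so go by whatever names are actually in your local_settings.py.  All of the values are placeholders.

```python
# local_settings.py (sketch; variable names are assumptions, values are placeholders)

# The four keys from apps.twitter.com, with no extra spaces inside the quotes.
MY_CONSUMER_KEY = 'your-consumer-key'
MY_CONSUMER_SECRET = 'your-consumer-secret'
MY_ACCESS_TOKEN_KEY = 'your-access-token'
MY_ACCESS_TOKEN_SECRET = 'your-access-token-secret'

# One or more accounts for the bot to model itself after.
SOURCE_ACCOUNTS = ['some_account', 'another_account']

ODDS = 8        # the bot tweets roughly 1 run in 8
DEBUG = False   # Python booleans are capitalized: True / False

# The bot's own username.
TWEET_ACCOUNT = 'your_bot_name'
```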

Once that is all done do a Control+O to write out the file and Control+X to exit.  Now it’s time to test out the bot with the following…

python ebooks.py

It may pause for a second while it does its magic.  If you get the message “No, sorry, not this time.” it means the bot decided not to tweet, just run the command again until it tweets, since we’re testing it at the moment.  If it worked, it should print a tweet to the command line and the tweet should show up in the bot’s timeline.  If you get some errors, you may need to do some searching and troubleshooting, and double check the settings file.

Next we need to automate the Twitter Bot Tweets.  This is done using Linux’s built in cron.  But first we need to make our script executable.

chmod 755 ebooks.py

Next, enter the following….

sudo crontab -e

Then select the default option, which should be nano.  This will open the cron scheduler file.  You’ll want to schedule the bot to run according to whatever schedule you want.  Follow the columns above as a guide.  For example:

# m h  dom mon dow   command

*/15 * * * * python /home/chip/heroku_ebooks/ebooks.py

m = minutes = */15 = every 15 minutes of an hour (0, 15, 30, 45)

h = hour = * (every hour)

dom = day of month = * = every day, and so on.  The command to run, in this case, is “python /home/chip/heroku_ebooks/ebooks.py”.  If you’re running this on a Raspberry Pi, or your own server, you will need to change “chip” to the username whose directory has the files.  Or, if you want to put the files elsewhere, it just needs to be the path to the files.  For example, on a Raspberry Pi, it would be “python /home/pi/heroku_ebooks/ebooks.py”.
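If the */15 notation is unfamiliar, this small Python sketch shows which minutes it expands to.  It only handles the * and */n forms used here, not the full cron syntax, and expand_minutes is a made-up helper.

```python
def expand_minutes(field):
    """Expand a cron minute field of the form '*' or '*/n' into a list of minutes."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, 60, step))
    raise ValueError("only '*' and '*/n' are handled in this sketch")


print(expand_minutes("*/15"))  # [0, 15, 30, 45]
```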

If everything works out, the bot should tweet on schedule as long as the CHIP is powered on and connected.  Remember, by default the bot only tweets 1/8th of the time when the script is run (this can be adjusted in the settings file), so you may not see it tweet immediately.
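As a back-of-the-envelope check, assuming each run decides independently with 1-in-8 odds, the every-15-minutes schedule works out like this:

```python
runs_per_day = (24 * 60) // 15   # cron fires every 15 minutes
odds = 8                         # the bot tweets about 1 run in 8

expected_tweets_per_day = runs_per_day / odds

# Chance that four runs in a row (one full hour) all decide not to tweet.
chance_of_quiet_hour = (1 - 1 / odds) ** 4

print(runs_per_day)                    # 96
print(expected_tweets_per_day)         # 12.0
print(round(chance_of_quiet_hour, 2))  # 0.59
```

So roughly a dozen tweets a day on average, with quiet hours being fairly common.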

This is also a pretty low overhead operation, you could conceivably run several Twitter Bots on one small IOT device, with a staggered schedule even.  Simply copy the heroku_ebooks directory to a new directory, change the keys and account names and set up a new cron job pointing to the new directory.
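A staggered schedule could be worked out with something like the following sketch.  The extra directory names are hypothetical, and it simply offsets each bot by a few minutes within the same 15 minute cycle.

```python
def staggered_cron_lines(bot_dirs, interval=15):
    """Build one crontab line per bot, spreading the start minutes across the interval."""
    if not bot_dirs:
        return []
    lines = []
    offset_step = interval // len(bot_dirs)
    for index, bot_dir in enumerate(bot_dirs):
        offset = index * offset_step
        # A comma separated list of minutes, e.g. 5,20,35,50
        minutes = ",".join(str(m) for m in range(offset, 60, interval))
        lines.append(f"{minutes} * * * * python {bot_dir}/ebooks.py")
    return lines


for line in staggered_cron_lines([
    "/home/chip/heroku_ebooks",
    "/home/chip/heroku_ebooks_2",  # hypothetical copy for a second bot
    "/home/chip/heroku_ebooks_3",  # hypothetical copy for a third bot
]):
    print(line)
```

The printed lines can then be pasted into the cron scheduler file in place of the single */15 entry.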