Monday 2024-07-15 – Link List

Blogging Intensifies Link List for Monday 2024-07-15

Sunday 2024-07-14 – Link List

Blogging Intensifies Link List for Sunday 2024-07-14

Twitter Archive to Markdown

I have been wanting to convert my Twitter Export archive into a more useful format for a while. I had vague dreams of maybe turning them into some sort of digest posts on WordPress, but there are a fuckton of tweets (25,000) and it would require a LOT of scrubbing. I could probably manage to find and remove Retweets pretty easily, but then there is the issue of media and getting the images into the digest posts, and it's just not worth the hassle.

What I can, and did, do is preserve the data in a better, more digestible, and searchable format. Specifically, Markdown. Well, OK, the files aren't doing anything fancy, so it's really just plaintext pretending to be Markdown.

I have no idea if Twitter still offers an export dump of your data; I have not visited the site at all in over a year. I left, I deleted everything, I blocked it on my DNS. But assuming they do, or you have one, it's a big zip file that can be unrolled into a sort of local, Twitter-like interface. There are a lot of files in this ball, and while I am keeping the core archive, I mostly just care about the content.

If you dig in, it's easy to find: there is a folder called "data", and the tweets are in a file called "tweets.js". It's essentially JSON, wrapped in a JavaScript variable assignment. If you want the media, it's in a folder called "Tweets_Media" or something like that. I skimmed through mine, and most of the images looked familiar, because I already have them, so I removed the copy because I didn't need it.

But I kept the Tweets.js file.

So, what to do with it? It has a bunch of extraneous metadata for each Tweet that makes it a cluttered mess. It's useful for a huge website, but all I want is the date and the text. Here is a sample Tweet from the file.

{
    "tweet" : {
      "edit_info" : {
        "initial" : {
          "editTweetIds" : [
            "508262277464608768"
          ],
          "editableUntil" : "2014-09-06T15:05:44.661Z",
          "editsRemaining" : "5",
          "isEditEligible" : true
        }
      },
      "retweeted" : false,
      "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
      "entities" : {
        "hashtags" : [ ],
        "symbols" : [ ],
        "user_mentions" : [ ],
        "urls" : [ ]
      },
      "display_text_range" : [
        "0",
        "57"
      ],
      "favorite_count" : "0",
      "id_str" : "508262277464608768",
      "truncated" : false,
      "retweet_count" : "0",
      "id" : "508262277464608768",
      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",
      "favorited" : false,
      "full_text" : "\"Sorry, you are over the limit for spam reports!\"  Heh...",
      "lang" : "en"
    }
  },

So I wrote a quick and simple Python script (it's below). I probably could have done something fancy with Beautiful Soup or Pandas, but instead I did a quick and basic scan that pulls the data I care about: if a line contains "created_at", pull it out to get the date; if it has "full_text", pull it out to get the text.
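As an aside, the file is close enough to plain JSON that Python's built-in json module could parse it too. A minimal sketch, assuming the export wraps the array in a `window.YTD.tweets.part0 = [...]` style assignment (the function name here is made up):

```python
import json

def parse_tweets_js(raw):
    """Parse the contents of tweets.js by stripping the JavaScript
    assignment prefix and handing the rest to the json module."""
    # Everything from the first "[" onward is a normal JSON array
    return [entry["tweet"] for entry in json.loads(raw[raw.index("["):])]

# Usage sketch:
# with open("tweets.js", encoding="utf-8") as f:
#     tweets = parse_tweets_js(f.read())
```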

Once I was able to output these two lines, I went about cleaning them up a bit. I don't need the field names, so I started by splitting on ":". This quickly became problematic, because the time contains several colons, and a Tweet could contain one as well. Instead I did a split on '" : "'. Specifically: quote, space, colon, space, quote. Only the breaks I wanted had the spaces and quotes, so that got me through step one. The trailing quotation mark was easy to slice off as well.
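Using the sample "created_at" line from above, the difference looks like this (the `[:-3]` slice drops the trailing quote, comma, and newline):

```python
# The sample line, as it appears in tweets.js (note the trailing '",' plus newline)
line = '      "created_at" : "Sat Sep 06 14:35:44 +0000 2014",\n'

# A naive split on ":" breaks inside the timestamp itself
mangled = line.split(":")[1]            # ' "Sat Sep 06 14'

# Splitting on quote-space-colon-space-quote only breaks at the field boundary
value = line.split("\" : \"")[1][:-3]   # 'Sat Sep 06 14:35:44 +0000 2014'
```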

I considered simplifying things by using the same transformation on the date and the text, but the date also had this +0000 in it that I wanted to remove. It's not efficient, but it was just as simple to have two very similar operations.

After some massaging, I was able to output something along the lines of “date – text”.

But then I noticed that, for some reason, the Tweets are apparently not in date order. I had decided that I was just going to create a series of year-based archival files, so I needed them to be in order.

So I added a few more steps to sort each Tweet during processing into an array of arrays based on the year. Once again, this isn't the cleanest code. It assumes a range of something like 2004 to 2026, which covers my needs for certain. I also had some "index out of range" errors with my array of arrays, which probably have a clever loopy solution, but instead it's just a big pre-initialized copy/paste array.
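For what it's worth, one such cleverer solution would be to parse the raw created_at string with Python's datetime module and sort everything chronologically in one pass, instead of bucketing by year. A sketch of that alternative (not what my script does), assuming a list of [date, text] pairs with the original "+0000" still in the date:

```python
from datetime import datetime

def sort_tweets(tweet_data):
    """Sort [date, text] pairs chronologically by parsing the raw
    created_at string, e.g. "Sat Sep 06 14:35:44 +0000 2014"."""
    fmt = "%a %b %d %H:%M:%S %z %Y"
    return sorted(tweet_data, key=lambda t: datetime.strptime(t[0], fmt))
```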

Part of the motivation for the array of arrays was also that I could make the script output my sorted yearly files directly, but I just did it manually from the big ball final result. The job is done, but it could easily be done by adjusting the lower output block a bit.

Anyway, here is the code, and a link to a Git repository for it.

# A simple script that takes an exported tweets.js file and outputs it to a markdown text file for archiving.
# In pulling data for this, I noticed that older Twitter exports use a csv file instead of a .js file.
# As such, this is for newer exports.
# The Tweets.js file is in the 'data' directory of a standard Twitter archive export file.

# Open the tweets.js file containing all the tweets; it should be in the same folder
with open("tweets.js", encoding="utf-8") as file:
    filedata = file.readlines()

tweet_data = []
current_tweet = []
# The Tweets don't seem to be in order, so I needed to sort them out; this is admittedly ugly,
# but I only need to cover so many years of sorting and this was the easiest way to avoid index errors
sorted_tweets = [[] for _ in range(25)]  # one empty list per year, 2004 onward

# Does a simple search through the file.  It pulls out the date posted and the full text.
# This does not do anything with images, sorry, that gets more complicated, it would be doable
for line in filedata:
    if "created_at" in line:
        posted_at = line.split("\" : \"")[1].replace(" +0000 ", " ")[:-3]
        current_tweet.append(posted_at)
    elif "full_text" in line:
        current_tweet.append(line.split("\" : \"")[1][:-3])
        tweet_data.append(current_tweet)
        current_tweet = []
        # Because full text is always after the date, it just moves on after it gets both
    else:
        pass

# An ugly sort: it simply looks for the year in the date, then creates an array of arrays based on year.
# I did it this way partly in case I wanted to output to separate files based on year, but I can copy/paste that
# It probably is still out of order based on date, but whatever, I just want a simple archive file
for each in tweet_data:
    for year in range(2004, 2026):
        if str(year) in each[0]:
            sorted_tweets[year - 2004].append(each)

# Prints the output and dumps it to a file.
with open("output.md", encoding="utf-8", mode="w") as output:
    for eachyear in sorted_tweets:
        for each in reversed(eachyear):
            output.write(each[0] + " : " + each[1] + "\n")
            print(each[0] + " : " + each[1])
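And for the record, adjusting that lower output block to write year-based files directly would look something like this. A sketch using the same sorted_tweets array-of-arrays; the filename pattern is made up:

```python
def write_yearly_files(sorted_tweets, start_year=2004):
    """Write one markdown file per year from the array-of-arrays above."""
    for offset, yearly in enumerate(sorted_tweets):
        if not yearly:
            continue  # skip years with no tweets
        filename = "tweets-" + str(start_year + offset) + ".md"
        with open(filename, encoding="utf-8", mode="w") as out:
            for each in reversed(yearly):
                out.write(each[0] + " : " + each[1] + "\n")
```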

NAS Recovery

What a fun time it's been with my Synology NAS lately. And before I get going here, I want to make it clear: nothing here is a knock against Synology, or WD for that matter. The NAS I have is like ten years old; if it had failed, I was already pricing out a new, up-to-date Synology. Heck, I may still get one anyway.

But for now, it seems to be working fine again.

As I mentioned, it's been like ten years or so. I ran on one 4TB WD Red drive for a long time. Eventually, I did add a second drive to make things RAID and redundant. Sometime last year, my original WD drive died on me; I ordered a replacement and swapped it out, and everything was fine.

Sometime, maybe a month ago now, I received an error about a drive failure. The newer drive was already showing bad. I filed an RMA request with Western Digital, wiped the drive, and then sent it in. They sent me a replacement.

A short time before the replacement arrived, I found another error: "Volume has crashed". It showed that the (at the time) only drive was "Healthy", and I could still read all of my data. This was starting to feel a bit suspect. I have everything important backed up online to OneDrive, but just in case, I started pulling things off to other storage as a secondary backup. This basically involved eating up all the spare space on my project server (temporarily) and using a USB enclosure and an old 2TB drive that seems to be failing, but would work well enough for short-term storage. The point was, if I had to rebuild things, I would not have to download mountains of data off OneDrive. USB transfer is much easier and faster.

With everything backed up, I received the replacement for my RMA drive. My hope was, I could attach the replacement drive, and whatever was causing this Volume to show as crashed would clear itself out. Unfortunately, I could not really interact with the volume at all.

After several attempts at various workarounds, I gave up on recovering the Volume. I had the data, that is what matters.

I pulled the crashed drive out, which allowed me to create a new volume using the new drive. I then recreated the set of shared network folders, Books, Video, Music, Photo, General Files, as well as reestablished the home folders for users.

Fortunately, because I kept the same base names, all of my Network Mapped drives to the NAS just worked. Fixing my own connections would be easy; hassling with the connections on my wife and kids' laptops would be a pain. They get all annoyed when I do anything on their laptops.

Unfortunately, the crashed volume seems to have killed all of the apps I had set up. This is not a huge loss honestly, I don’t actually use most of the built-in Synology apps anymore beyond Cloud Sync and the Torrent client. The main one I need to reconfigure is the VPN client. I may just move that to a docker instance on my project PC. Fortunately, last year, I pulled both my email and blog archives off of the NAS. All my email is consolidated again in Outlook, and my blog archive is in a Docker container now. This means I can just remove all of these apps instead of reinstalling them.

I did find that I had failed to do a fresh local backup of my "Family Videos" folder, but I was able to resync that down from the OneDrive backup. Speaking of which, rebuilding all those sync connections was a little tedious since they are spread across two OneDrive accounts, but I got them worked out, and thankfully everything recognized the existing files and called it good. While I didn't put everything back on the NAS (I have a few things that are less important that I'm just going to store on the file server/project server), I somehow gained about 1.5TB of space. I've repeatedly checked, and everything is there as it should be. I can only speculate that there was some sort of residual cruft that had built up over time in logs or something somewhere. I also used to use Surveillance Station, so it's possible I had a mountain of useless videos stored on it.

In general, it’s actually been a bit of an excuse to clean up a few things. I had some folders in there that used to sync my DropBox and Google Drive, neither of which I use anymore, for example.

I am 99% sure everything is back in working order, and the last step I keep putting off is to wipe the drive from the crashed volume (it still reads healthy) and re-add it to the current, new volume.

It’s been a hassle, but not really that bad. The main hassle is because it’s large amounts of data, it often means starting a copy and just, letting it run for hours.

Saturday 2024-06-22 – Link List

Blogging Intensifies Link List for Saturday 2024-06-22