This one is pretty basic tier, but more useful than most. I had a website full of links to pages of video files, and I wanted a list I could stick into yt-dlp. I could do a bunch of copying and pasting, or I could use Python to scrape the page and grab all the links.

It takes a URL as a CLI argument, like so:

> python3 main.py https://website.com

It skims the page with Beautiful Soup and spits out a text file named in date-time-URL format, which is useful if a site changes over time. Scraping https://website.com on May 1st, for example, would produce something like 2024-05-01_13-45-09-website.com.txt.

The site I was scraping used some relative links, so the script checks whether each href starts with “http://” or “https://”. If it does, it just dumps the URL as-is; otherwise, it prepends “REPLACEME_” to the link, so it’s easy to do a Find/Replace operation later and add whatever the full base URL is.

For example, if the href is “/video/12345.php”, which takes you to “website.com/video/12345.php”, the script outputs “REPLACEME_/video/12345.php” on its own line. It’s then easy to replace “REPLACEME_” with the real base URL across 1-1000+ URLs in one pass. I didn’t just prepend the base URL automatically because, at least for my use case, the full links added a bit more than just the base URL, and I wanted the script to work more universally.
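If you’d rather not do the Find/Replace in a text editor, a few lines of Python handle it too. This is just a sketch: the input filename and the “https://website.com” base URL are placeholders for whatever your scrape actually produced.

# Hypothetical cleanup pass: swap the REPLACEME_ slug for the real base URL.
base_url = "https://website.com"  # placeholder; use your site's base URL

with open("2024-05-01_13-45-09-website.com.txt") as f:  # placeholder filename
    fixed = [line.replace("REPLACEME_", base_url, 1) for line in f]

with open("links-fixed.txt", "w") as f:
    f.writelines(fixed)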

Anyway, here is the script. It does use two non-standard libraries, httplib2 and Beautiful Soup, so you may need a quick pip install first; if either is missing, Python will complain and tell you what to do.
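On most setups, that’s just:

> pip install httplib2 beautifulsoup4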

## Simple URL Extractor
## To use, at CLI: $> python3 main.py [Replace With URL, No Braces]
## Outputs a list of all href links on a page to a file named with the date, time, and URL.
## Useful for pushing to a bulk downloader program, though it does no processing, so URLs may need to be edited.
## If a link is not a full URL, it prepends an easily find/replaceable slug.

import httplib2
import sys
from datetime import datetime
from bs4 import BeautifulSoup, SoupStrainer

current_datetime = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

try:
    url = sys.argv[1]
except IndexError:
    print("Error: No URL Defined! Please use main.py [URL]")
    sys.exit(1)

http = httplib2.Http()
response, content = http.request(url)  # httplib2 returns (headers, page content)
filename = f"{current_datetime}-{url.split('//')[1].replace('/','_')}.txt"

with open(filename, "x") as f:
    for link in BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            write_link = link['href']
            # Relative links get the slug prepended for easy find/replace later
            if not write_link.startswith(("http://", "https://")):
                write_link = f"REPLACEME_{write_link}"
            f.write(f"{write_link}\n")


## References
## https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup
## https://stackoverflow.com/questions/4033723/how-do-i-access-command-line-arguments
## https://stackoverflow.com/questions/14016742/detect-and-print-if-no-command-line-argument-is-provided
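
Once the list is cleaned up, you can hand the whole file to yt-dlp with its batch-file flag (links-fixed.txt here is just the example filename from the snippet above):

> yt-dlp -a links-fixed.txt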
