This one is pretty basic tier, but more useful than most. I had a website full of links to pages of video files, and I wanted a list I could stick into yt-dlp. I could do a bunch of copying and pasting, or I could use Python to scrape the page and grab all the links.

It takes a URL as a CLI argument, like so:

> python3 main.py https://website.com

It skims the page with Beautiful Soup and spits out a text file named in date-time-URL format, which is useful if a site changes over time. Scraping https://website.com on May 1st, for example, would produce something like 2024-05-01_13-45-09-website.com.txt.

The site I was scraping used some relative links, so the script checks whether each href starts with “http://” or “https://”. If it does, it just dumps the URL as-is; otherwise, it prepends “REPLACEME_” to the link, so it’s easy to do a Find/Replace operation later and add whatever the full base URL is.

For example, if the href is “/video/12345.php”, which takes you to “website.com/video/12345.php”, the script outputs “REPLACEME_/video/12345.php” on its own line. It’s then easy to replace “REPLACEME_” with the real base URL across 1-1000+ URLs in one pass. I didn’t just prepend the base URL automatically because, at least for my use case, the full links added a bit more than just the base URL, and I wanted the script to work more universally.
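If you’d rather not do the Find/Replace in a text editor, a few lines of Python handle it too. This is just a sketch: the input filename and the “https://website.com” base URL are placeholders for whatever your scrape actually produced.

# Hypothetical cleanup pass: swap the REPLACEME_ slug for the real base URL.
base_url = "https://website.com"  # placeholder; use your site's base URL

with open("2024-05-01_13-45-09-website.com.txt") as f:  # placeholder filename
    fixed = [line.replace("REPLACEME_", base_url, 1) for line in f]

with open("links-fixed.txt", "w") as f:
    f.writelines(fixed)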

Anyway, here is the script. It does use two non-standard libraries, httplib2 and Beautiful Soup, so you may need a quick pip install first; if either is missing, Python will complain and tell you what to do.
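On most setups, that’s just:

> pip install httplib2 beautifulsoup4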

## Simple URL Extractor
## To use, at CLI: $> python3 main.py [Replace With URL, No Braces]
## Outputs a list of all href links on a page to a file named with the date, time, and URL.
## Useful for pushing to a bulk downloader program, though it does no processing, so URLs may need to be edited.
## If a link is not a full URL, it prepends an easily find/replaceable slug.

import httplib2
import sys
from datetime import datetime
from bs4 import BeautifulSoup, SoupStrainer

current_datetime = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

try:
    url = sys.argv[1]
except IndexError:
    print("Error: No URL Defined! Please use main.py [URL]")
    sys.exit(1)

http = httplib2.Http()
response, content = http.request(url)  # httplib2 returns (headers, page content)
filename = f"{current_datetime}-{url.split('//')[1].replace('/','_')}.txt"

with open(filename, "x") as f:
    for link in BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            write_link = link['href']
            # Relative links get the slug prepended for easy find/replace later
            if not write_link.startswith(("http://", "https://")):
                write_link = f"REPLACEME_{write_link}"
            f.write(f"{write_link}\n")


## References
## https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup
## https://stackoverflow.com/questions/4033723/how-do-i-access-command-line-arguments
## https://stackoverflow.com/questions/14016742/detect-and-print-if-no-command-line-argument-is-provided
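
Once the list is cleaned up, you can hand the whole file to yt-dlp with its batch-file flag (links-fixed.txt here is just the example filename from the snippet above):

> yt-dlp -a links-fixed.txt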
