Python – simple web scraper

Just outputs HTML elements for now, if they exist; I made a special ‘scrape’ page here you can use. To build this I was looking at a random quote script someone else made. I had never used the BeautifulSoup module before this, and it’s super easy. This could be turned into a full-on web scraper by adding more arguments and using more BS functions. I do not recommend using this on URLs or websites you do not own or do not have permission for; they can spot these requests if they are looking.

This thing can be hugely improved. Right now it outputs decent results for most HTML tags like <a>, <p>, <div>, <head>, etc. It only allows one search and then exits; to make better use of the data, an exit command could be added so more searches can happen against the data already stored in ‘soup’. When it starts, the script assumes plain HTML tags rather than CSS class data; this too could be improved to get more use out of the request: run the script and it should ask what type of data you want, class, tag, other, etc. I have a version written in PHP, with quite a few more features, that I stopped working on, and I can’t wait to get it working in Python in way less code.
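The multi-search idea above could look something like this. It's a minimal sketch, not the script itself: the `search_soup` helper and the tag/class query tuples are my assumptions, and an inline HTML snippet stands in for the live request so it runs on its own. An interactive version would wrap the loop in `while True` with `input()` and break on an exit command.

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for res.text so this sketch runs without a live request.
html = """
<div class="quote"><p>First quote</p></div>
<div class="quote"><p>Second quote</p></div>
<a href="https://example.com">A link</a>
"""
soup = BeautifulSoup(html, "html.parser")

def search_soup(soup, kind, value):
    """Search the stored soup by tag name or CSS class; 'kind' is 'tag' or 'class'."""
    if kind == 'class':
        return soup.find_all(class_=value)  # CSS class lookup
    return soup.find_all(value)             # plain tag lookup

# Reusing 'soup' for several searches instead of exiting after one.
queries = [('tag', 'p'), ('class', 'quote'), ('tag', 'a')]
for kind, value in queries:
    results = search_soup(soup, kind, value)
    print(f"{len(results)} result(s) for {kind} '{value}'")
```

The key point is that `soup` is parsed once and queried many times, so extra searches cost nothing extra over the network.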

Note I have formatted the ‘scrape’ page a bit.

# Updated May 2 2022: <a> now shows target text and link
# also added str.strip(get_text()) to remove extra whitespace

import requests
from bs4 import BeautifulSoup

# dc_scrape 1.0, to learn with; just outputs web elements for now, like <p>
# You may run this code to test with; the website is mine. - Ryan B.

dc_content = [] # empty list to store scraped data
dc_url = '' #specific url to scrape

res = requests.get(dc_url) # making that web request for the dataaaaah
print(f"Now scraping {dc_url.upper()}") # nerd printing for thrills
soup = BeautifulSoup(res.text, "html.parser") # this has all the data
element = input('Enter html element, p, a, etc.: ') # pick element

posts = soup.find_all(element) # finding element if it exists
posts_count = 0 # start a visual counter for results
for post in posts: # for loops are always clutch
    if element == 'a':
        href = post.get('href') # may be None for anchors without an href
        if href and ('https://' in href or 'http://' in href):
            posts_count += 1 # only count valid links
            print(str(posts_count) + '.',str.strip(post.get_text()),href) # get target text and url only if http in them
    else:
        posts_count += 1 # number incrementor
        print(str(posts_count) + '.',str.strip(post.get_text())) # get_text() inside <element>, results not perfect
if posts_count == 0:
    print(f'Looks like there are no results for: {element}') # no results message
print(f'End of scraping {dc_url.upper()} for: {element}') # its all done now
