Python – DCScrape v1.2, updated again

UPDATE: Version 1.3 is under way and will be a refactor into an OOP design, plus improvements and additions. I may design a more advanced scraper for purchase; it can be customized for targeted data and specific data sets for a fee. I noticed a bug after running this script with the app Pyto for inputs; this may just be the app, as I did not notice it in Visual Studio Code.

I have a page here you can scrape to test on, with my permission. The new class search is very simple, with no real formatting. A lot of the time a class is used to style an entire section of content, like <div class="container">Bunch of content</div>, so you might get a huge block of data depending on the website. This will be improved later.
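
To see why, here is a tiny self-contained example (the HTML is made up) of a class search on a container div; get_text() flattens the whole section into one block:

from bs4 import BeautifulSoup

# Hypothetical HTML where one class wraps a whole section
html = '''
<div class="container">
  <h1>Title</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
for post in soup.find_all(class_="container"):
    # Everything inside the div comes back as one chunk of text
    print(post.get_text())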

This simple web scraper is fairly universal but may not work great on all sites; scrapers should be customized for best results. I believe this v1.2, in its procedural/functional style of code, is a good start.
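
Since v1.3 will move this into a class (see the update note above), the refactor might look roughly like the sketch below. This is only a rough idea of the shape, not the final v1.3 design; the method names are placeholders:

import requests
from bs4 import BeautifulSoup

class DCScrape:
    # Rough idea of the planned OOP refactor: one object holds the URL,
    # the fetched response, and the parsed soup; each search is a method.
    def __init__(self, url):
        self.url = url
        self.res = requests.get(url)
        self.soup = BeautifulSoup(self.res.text, "html.parser")

    def elements(self, tag):
        # Text of every matching HTML element, e.g. elements('a')
        return [post.get_text().strip() for post in self.soup.find_all(tag)]

    def classes(self, class_name):
        # Text of everything styled with a given class name
        return [post.get_text() for post in self.soup.find_all(class_=class_name)]

    def images(self):
        # src attribute of every img tag
        return [img.get('src') for img in self.soup.find_all('img')]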

Updates

  1. Endless simple searches that reuse the stored response ('res') more efficiently.
  2. Type exit to end, or when changing the URL to scrape.
  3. Class name search: c class_name
  4. All image links: img
  5. HTML searches example for all links: a

Upcoming

  1. Improve class and image functions
  2. Images download with urllib (see the sketch after this list)
  3. Refactor dcscrape into class with methods
  4. Storage with CRUD functionality
  5. More…
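
Item 2 above might look something like this rough sketch, feeding it the src links from an img search. The folder and file-name scheme are placeholders, not the final design:

import os
from urllib.request import urlretrieve

def download_images(srcs, folder='images'):
    # srcs would come from an img search, e.g.
    # [post.get('src') for post in soup.find_all('img')]
    os.makedirs(folder, exist_ok=True)
    for n, src in enumerate(srcs, start=1):
        if src and src.startswith('http'):
            ext = os.path.splitext(src)[1] or '.jpg' # guess the extension from the URL
            filename = os.path.join(folder, f'image_{n}{ext}')
            urlretrieve(src, filename) # fetch the image and write it to disk
            print(f'- Saved {src} as {filename}')

Below is the full v1.2 script.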
import requests
from bs4 import BeautifulSoup

# dc_scrape version 1.2
# Outputs stripped text, results may vary.

# 07-24-2022 Search for all img src links: img
# 07-24-2022 Search class names: c class_name 
# 06-23-2022 Now does multiple searches via 1 request.
# 06-23-2022 Outputs web element text: <p> <a> <li> etc.

# USAGE, only on sites you have permission on.
# When entering an HTML search do not use the < > brackets
# Links search: a
# All img links: img
# Class search: c class_name
# To end searches, type: exit
# Restart (or type exit) after changing the URL 'dc_url' to scrape, or it will keep using old data.

# Next update img and class improvements

# These three lines below run only once, when the script starts
dc_url = 'https://dreamcreator.io/scrape/' # URL TO SCRAPE
res = requests.get(dc_url) # one request; every search reuses this response
print(f"- Now scraping {dc_url.upper()}")

# Main function
def dcscrape():
    element = input("- Enter an HTML element to scrape (p, a, article, ...), for img links: img, for a class: 'c class_name', type 'exit' to end!: ") # pick element

    if element == 'exit':
        print('- EXIT command used, exiting...')
        exit()

    soup = BeautifulSoup(res.text, "html.parser")

    is_class_search = element.startswith('c ') # require 'c ' so elements like 'code' are not mistaken for class searches
    if is_class_search:
        dclass = element.split()[1]
        print(f'- Class detected... {dclass}')
    # Find all matches by class name if specified, or by HTML element by default
    posts = soup.find_all(class_=dclass) if is_class_search else soup.find_all(element)
    posts_count = 0 # start a visual counter for numbered results
    # Main loop for all results with N. prefix
    for post in posts:
        posts_count += 1 # number incrementor

        if is_class_search:
            print(f'{posts_count}. {post.get_text()}')

        elif element == 'img':
            print(f"{posts_count}. {post.get_text().strip()} {post.get('src')}")

        elif element == 'a':
            href = post.get('href')
            if href and ('https://' in href or 'http://' in href):
                print(f"{posts_count}. {post.get_text().strip()} {href}") # target text and URL, only for absolute http(s) links
            else:
                posts_count -= 1 # fix the counter since the item is not a valid absolute link

        else:
            print(f'{posts_count}. {post.get_text().strip()}') # get_text() inside <element>; results are not perfect, e.g. stray spaces
    if posts_count == 0:
        print(f'- Looks like there are no results for: {element}') # no results message
    dcscrape() # prompt for the next search (simple recursive loop)
        
# Start everything
dcscrape()

What makes a web scraper more advanced? It can check its results against some other data, grab many other forms of data like user information or last-updated dates, or act as a site watcher for things like news or weather; really, the sky's the limit. I use my very own to perform searches on targeted patterns, save the result time(s), do lookups, and go even deeper. One of my recent functions checks for users registered on very bad websites so I can block them on mine, even if a user account is needed, yeap.
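
The site-watcher idea, for instance, can start as simply as hashing a page and comparing the hash on each check. A minimal sketch; the URL and interval are placeholders:

import hashlib
import time

import requests

def watch(url, interval=300):
    # Re-fetch the page every `interval` seconds and report when it changes
    last_hash = None
    while True:
        page = requests.get(url)
        page_hash = hashlib.sha256(page.content).hexdigest()
        if last_hash is not None and page_hash != last_hash:
            print(f'- {url} changed at {time.ctime()}')
        last_hash = page_hash
        time.sleep(interval)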
