Scrape a Website for Changes with Python

How do you keep track of recent news in your field?

I use a simple website listener script that scrapes websites for specific keywords and, when it finds one, notifies me by email.

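It’s really easy to implement. All you need is Python 3.6 and four libraries: requests (downloads the website), time (adds a delay between scrapes), smtplib (sends emails over SMTP) and BeautifulSoup (parses the downloaded page). time and smtplib ship with Python; requests and BeautifulSoup are third-party packages, as is lxml, the parser the script asks BeautifulSoup to use.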

# Import requests (to download the webpage)
import requests

# Import time (adds a delay between scrapes)
import time

# Import smtplib (to send an email)
import smtplib

# Import BeautifulSoup (to parse the website)
from bs4 import BeautifulSoup

The script scrapes the Verge front page and, if it finds any reference to TensorFlow, sends me an email. If TensorFlow is not found, the script waits 60 seconds and scrapes again.

# loop until a match is found (the break at the bottom ends the loop),
while True:
    # set the url as the Verge,
    url = "http://theverge.com/"
    # set the headers like we are a browser,
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    # download the homepage (give up after 30 seconds rather than hang forever),
    response = requests.get(url, headers=headers, timeout=30)
    # parse the downloaded homepage and grab all text, then,
    soup = BeautifulSoup(response.text, "lxml")
    
    # if the word "TensorFlow" does not occur in the page text (ignoring case),
    if "tensorflow" not in soup.get_text().lower():
        # wait 60 seconds,
        time.sleep(60)
        # continue with the script,
        continue
        
    # but if the word "Tensorflow" occurs any other number of times,
    else:
        # create an email message with just a subject line,
        msg = 'Subject: Tensorflow has been mentioned on the Verge!'
        # set the 'from' address,
        fromaddr = 'YOUR_EMAIL_ADDRESS'
        # set the 'to' addresses,
        toaddrs  = ['AN_EMAIL_ADDRESS','A_SECOND_EMAIL_ADDRESS', 'A_THIRD_EMAIL_ADDRESS']
        
        # set up the email server (uncomment the server lines to actually send),
        # server = smtplib.SMTP('smtp.gmail.com', 587)
        # server.starttls()
        # add my account login name and password,
        # server.login("YOUR_EMAIL_ADDRESS", "YOUR_PASSWORD")
        
        # Print the email's contents
        print('From: ' + fromaddr)
        print('To: ' + str(toaddrs))
        print('Message: ' + msg)
        
        # send the email
        # server.sendmail(fromaddr, toaddrs, msg)
        # disconnect from the server
        # server.quit()
        
        break
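
The message above is a bare subject line. If you want the alert to carry a body and proper headers, one option is the standard library's email.message.EmailMessage. This is only a sketch: the build_alert function, its parameters and the example.com addresses are made up for illustration and are not part of the script above.

```python
from email.message import EmailMessage

def build_alert(fromaddr, toaddrs, keyword, url):
    """Build an alert email (hypothetical helper, not part of the original script)."""
    msg = EmailMessage()
    # the subject mirrors the one-line message used in the script
    msg["Subject"] = f"{keyword} has been mentioned on {url}"
    msg["From"] = fromaddr
    # several recipients go into one comma-separated To header
    msg["To"] = ", ".join(toaddrs)
    # a short plain-text body saying what was found and where
    msg.set_content(f"The keyword {keyword!r} was found on {url}.")
    return msg

# placeholder addresses, assumed for the example
alert = build_alert("me@example.com",
                    ["you@example.com", "them@example.com"],
                    "TensorFlow", "http://theverge.com/")
print(alert["Subject"])
print(alert["To"])
```

With the server lines in the script uncommented, server.send_message(alert) should send it in place of server.sendmail(fromaddr, toaddrs, msg).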

How easy was that?!
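
The script watches for a single word, but the same idea stretches to several keywords. Here is a minimal sketch, assuming you already have the page text as a string (for example from soup.get_text()). The find_keywords name and the sample text are my own placeholders.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case, in the order given."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

# stand-in for the text of a downloaded page (assumed for the example)
sample_text = "Google releases TensorFlow 2.0 alongside new Keras tooling"
print(find_keywords(sample_text, ["Tensorflow", "PyTorch", "keras"]))
# prints ['Tensorflow', 'keras']
```

An empty list means nothing matched, so the loop can sleep and try again. Anything else can go straight into the email subject.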


Geoffrey Momin is an Engineer and Technology Consultant. He is actively researching the application of blockchain, artificial intelligence and conversational interfaces to improve human capital and enterprise management.
