Web Scraping With Python – A Step-By-Step Guide

What is web scraping?

Web scraping is an automation technique for extracting data from websites. It has been gaining popularity as the use of machine learning algorithms grows, because those algorithms need large datasets. Web scraping extracts large sets of data from websites accessed over the HTTP protocol. Unlike the long and mind-numbing process of collecting data manually, web scraping uses automation to gather thousands or even millions of data points in a fraction of the time. Most of this data starts out as unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. Python is the preferred language for web scraping.
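
To make the idea concrete, here is a minimal sketch of the whole pipeline: fetch a page, parse the HTML, and turn one piece of unstructured markup into a structured record. The URL is just a placeholder for illustration.

import requests
from bs4 import BeautifulSoup

# Placeholder URL, used purely for illustration
response = requests.get("https://example.com")
soup = BeautifulSoup(response.content, "html.parser")

# One unstructured HTML element becomes one structured record
record = {"title": soup.find("h1").text, "url": response.url}
print(record)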

Why is Python a popular programming language for Web Scraping?

Python is the most popular language for web scraping because it handles most parts of the process with ease:

  • Python has many libraries for web scraping.
  • Python needs less code to implement more functionality.
  • Python is open source, and most of its libraries are free to use and modify.
  • Python can handle large amounts of data.
  • Python supports many databases.
  • Python can access OS components.

What are the use cases of Web Scraping?

  1. Price Monitoring & Market Research: Companies can use web scraping to collect data on their own products and on competing products. It also powers price-comparison services. High-quality scraped data obtained in large volumes is very helpful for companies analyzing consumer trends.
  2. News Monitoring: Scraping news sites can provide detailed reports on current news and trends.
  3. Sentiment Analysis: Companies can use web scraping to collect data from social media websites such as Facebook and Twitter and generate reports on the general sentiment about their products and services.
  4. Email Marketing: Companies can also use web scraping for email marketing, collecting email addresses from various sites and then sending bulk promotional and marketing emails.

Popular Python Libraries Used for Web Scraping

  1. BeautifulSoup
  2. Scrapy
  3. Selenium
  4. Requests
  5. Urllib3
  6. Lxml
  7. MechanicalSoup

Practical Step-by-Step Guide to Scraping Data From a Website

In this example, I am going to scrape data from https://www.programmableweb.com and store the extracted data in a CSV file.

I am going to use the following tools and libraries:

  • Python 3.4
  • Requests
  • BeautifulSoup
  • CSV

The site lists many APIs, categorized into different sectors. In this guide I am scraping data for the Financial APIs category, which is what the code below uses (swap the category slug, e.g. to transportation, to scrape another sector). The category URL is https://www.programmableweb.com/category/financial/api
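
The listing is paginated through the page query parameter, so successive listing URLs can be generated by appending an increasing page number, which is exactly what the script later in this guide does:

base = "https://www.programmableweb.com/category/financial/api?pw_view_display_id=apis_all&page="
# The first three listing pages (page numbering starts at 0)
for n in range(3):
    print(base + str(n))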

[Image: the programmableweb.com data table to be scraped]

From this site, we are going to grab the following information:

  • API Name
  • Description
  • Category
  • Followers

After collecting the information, we are going to store it in a CSV file.
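
It helps to fix the CSV schema up front. Here is a minimal sketch of writing one row with Python's csv.DictWriter, using the columns listed above (the sample values are invented):

import csv

header = ['API Name', 'Description', 'Category', 'Followers']
with open('sample.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writeheader()
    # Invented sample row, only to show the shape of the output
    writer.writerow({'API Name': 'Example API',
                     'Description': 'A made-up description.',
                     'Category': 'Financial',
                     'Followers': 42})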

Installing the required libraries


pip install requests

pip install beautifulsoup4

Importing the required libraries and dependencies

from bs4 import BeautifulSoup
import requests
import csv

Defining a function to visit each API's sub-page and scrape the required data:

def getSubPage(link, headers):
    url = "https://www.programmableweb.com"
    response = requests.get(url + link, headers=headers, allow_redirects=False)
    # Check the response status code; detail pages that redirect are skipped
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        APITagsHtml = soup.find('div', attrs={'class': 'tags'})
        APIDescHtml = soup.find('div', attrs={'class': 'api_description tabs-header_description'})
        # Guard against layout changes: fall back to the listing data if either block is missing
        if APITagsHtml is None or APIDescHtml is None:
            return False
        # Join the tag links into a comma-separated sub-category string
        APItags = ', '.join(a.get_text(strip=True) for a in APITagsHtml.find_all('a'))
        return [APItags, APIDescHtml.text]
    else:
        return False
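
A quick way to sanity-check the function is to call it with a single detail-page path. The path below is made up, so substitute any href harvested from the listing table:

# '/api/example-api' is a hypothetical path used only to illustrate the call
headers = {'User-Agent': 'Mozilla/5.0'}
result = getSubPage('/api/example-api', headers)
if result:
    print('Sub-categories:', result[0])
    print('Description:', result[1])
else:
    print('Detail page unavailable; falling back to listing data')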

Python code to scrape the data:

url="https://www.programmableweb.com/category/financial/api?pw_view_display_id=apis_all&page="
page_num = 0
APIData = []
file = open('apidata_financial.csv', 'w+', newline ='')
with file:
    # identifying the CSV header 
    header = ['API Name', 'API Description', 'Category', 'Sub-Category','Followers','Inner Page Status','Inner Page URL']
    #Initializing the CSV Object
    writer = csv.DictWriter(file, fieldnames = header)
    writer.writeheader()
    # Defining HTTP Request Header
    headers = {
                'Accept-Encoding': 'gzip, deflate, sdch',
                'Accept-Language': 'en-US,en;q=0.8',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                'Cache-Control': 'max-age=0',
                'Connection': 'keep-alive',
            }
    while True:
        #generate page url with page number to scrap data in pagination
        pageUrl = url+ str(page_num)
        #Sending Request to fetch the entire HTML page
        response=requests.get(pageUrl,headers=headers)
        htmlcontent=response.content
        #Parse HTML data
        soup=BeautifulSoup(htmlcontent,'html.parser')
        #Check if we landed the last page and the response is empty then break the loop
        if soup.find('div',attrs={'class':'view-empty'}):
            break
        else:
            page_num = page_num + 1
            table = soup.find('table', attrs={'class':'views-table'})
            table_body = table.find('tbody')
            rows = table_body.find_all('tr')
            for row in rows:
                APIName = row.find('td', attrs={'class':'views-field-title'}).text
                APIFollower = row.find('td', attrs={'class':'views-field-count'}).text
                APIDescription = row.find('td', attrs={'class':'views-field-field-api-description'}).text
                APICategory = row.find('td', attrs={'class':'views-field-field-article-primary-category'}).text
                APINameHtml = row.find('td', attrs={'class':'views-field-title'})
                APILink = APINameHtml.find('a').get('href')
                APIDataResponse = getSubPage(APILink, headers)
                if APIDataResponse:
                    writer.writerow({'API Name' : APIName, 'API Description': APIDataResponse[1], 'Category': APICategory, 'Sub-Category': APIDataResponse[0], 'Followers': APIFollower})
                else:
                    writer.writerow({'API Name' : APIName, 'API Description': APIDescription, 'Category': APICategory, 'Sub-Category': APICategory, 'Followers': APIFollower})
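
One caveat: the script above sends requests back-to-back, one per listing page plus one per API row, which can put real load on the server. As a small refinement (not part of the original script), you can throttle the crawl with a short pause; the one-second delay here is an arbitrary choice:

import time

REQUEST_DELAY = 1.0  # arbitrary pause in seconds; tune to the site's tolerance

def politeGet(pageUrl, headers):
    # Thin wrapper around requests.get that rate-limits the crawl
    time.sleep(REQUEST_DELAY)
    return requests.get(pageUrl, headers=headers)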

The Output

[Image: Python scraper output]

The final script

from bs4 import BeautifulSoup
import requests
import csv

def getSubPage(link, headers):
    url = "https://www.programmableweb.com"
    response = requests.get(url + link, headers=headers, allow_redirects=False)
    # Check the response status code; detail pages that redirect are skipped
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        APITagsHtml = soup.find('div', attrs={'class': 'tags'})
        APIDescHtml = soup.find('div', attrs={'class': 'api_description tabs-header_description'})
        # Guard against layout changes: fall back to the listing data if either block is missing
        if APITagsHtml is None or APIDescHtml is None:
            return False
        # Join the tag links into a comma-separated sub-category string
        APItags = ', '.join(a.get_text(strip=True) for a in APITagsHtml.find_all('a'))
        return [APItags, APIDescHtml.text]
    else:
        return False

url="https://www.programmableweb.com/category/financial/api?pw_view_display_id=apis_all&page="
page_num = 0
APIData = []
file = open('apidata_financial.csv', 'w+', newline ='')
with file:
    # identifying CSV header 
    header = ['API Name', 'API Description', 'Category', 'Sub-Category','Followers','Inner Page Status','Inner Page URL']
    writer = csv.DictWriter(file, fieldnames = header)
    writer.writeheader()
    headers = {
                'Accept-Encoding': 'gzip, deflate, sdch',
                'Accept-Language': 'en-US,en;q=0.8',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                'Cache-Control': 'max-age=0',
                'Connection': 'keep-alive',
            }
    while True:
        pageUrl = url+ str(page_num)
        print(pageUrl)
        response=requests.get(pageUrl,headers=headers)
        htmlcontent=response.content
        soup=BeautifulSoup(htmlcontent,'html.parser')
        if soup.find('div',attrs={'class':'view-empty'}):
            break
        else:
            page_num = page_num + 1
            table = soup.find('table', attrs={'class':'views-table'})
            table_body = table.find('tbody')
            rows = table_body.find_all('tr')
            for row in rows:
                APIName = row.find('td', attrs={'class':'views-field-title'}).text
                APIFollower = row.find('td', attrs={'class':'views-field-count'}).text
                APIDescription = row.find('td', attrs={'class':'views-field-field-api-description'}).text
                APICategory = row.find('td', attrs={'class':'views-field-field-article-primary-category'}).text
                APINameHtml = row.find('td', attrs={'class':'views-field-title'})
                APILink = APINameHtml.find('a').get('href')
                APIDataResponse = getSubPage(APILink, headers)
                if APIDataResponse:
                    APIData.append([APIName,APIDataResponse[1],APICategory,APIDataResponse[0],APIFollower, 200,APILink])
                    writer.writerow({'API Name' : APIName, 'API Description': APIDataResponse[1], 'Category': APICategory, 'Sub-Category': APIDataResponse[0], 'Followers': APIFollower})
                else:
                    APIData.append([APIName,APIDescription,APICategory,APICategory,APIFollower])
                    writer.writerow({'API Name' : APIName, 'API Description': APIDescription, 'Category': APICategory, 'Sub-Category': APICategory, 'Followers': APIFollower, 'Inner Page Status':301, 'Inner Page URL':APILink})
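
To sanity-check the run, you can read the file back with csv.DictReader; this assumes the script above completed and wrote apidata_financial.csv:

import csv

with open('apidata_financial.csv', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        print(row['API Name'].strip(), '|', row['Followers'].strip())
        if i >= 4:  # show only the first five records
            break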

Conclusion

In this project, we used the most popular web-scraping package, Beautiful Soup, which creates a parse tree that can be used to extract data from a website's HTML. From https://www.programmableweb.com/, we scraped data such as API name, category, description, and followers, and finally wrote the data to a CSV file, apidata_financial.csv. I hope this helps you build small web scrapers. If you want to build a scraper and collect data for your business needs, feel free to contact us.
