sia.hackernoon.com

In this article, we will build a program that allows you to scrape or grab data from a website with a Python script. This method of gathering data is called web scraping.

Web scraping means using Python, or any other programming language, to download, clean, and use the data from a web page. Many websites restrict automated access, so before scraping, check the site's robots.txt file, a dedicated page that lists which endpoints crawlers are allowed to request.

Append /robots.txt to the root URL of a website to find out about the allowed endpoints. For example, let’s use https://news.ycombinator.com/robots.txt.

The result should be a plain-text file that looks like this:

The screenshot states which endpoints we are allowed and not allowed to scrape from the YCombinator website. A crawl delay is the pause, in seconds, that a program should leave between requests so that constant scraping does not overload the servers and slow down the website.

In this exercise, we scrape the home page of the news site, which the rules for our user agent allow.
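As a minimal sketch, the same check can be done in code with `urllib.robotparser` from Python's standard library. The rules in `SAMPLE_ROBOTS_TXT` are a hypothetical sample written for illustration, not a copy of the live file, and `load_rules` is a helper name introduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real site's file will differ.
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Disallow: /vote?
Disallow: /reply?
Crawl-delay: 30
"""


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# can_fetch() answers whether a given user agent may request a given URL.
print(rules.can_fetch("*", "https://news.ycombinator.com/news"))
print(rules.can_fetch("*", "https://news.ycombinator.com/vote?id=1"))

# crawl_delay() returns the Crawl-delay value for the user agent, if any.
print(rules.crawl_delay("*"))
```

To check a live site instead of a sample string, `RobotFileParser` also provides `set_url()` and `read()`, which download and parse the file in one step.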

Getting Started

The Python web scraper requires two modules for scraping the data:

Beautiful Soup
Requests

Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML files. It uses a parser to turn the raw markup into a tree of objects that is easy to search and navigate, which saves programmers hours of manual and repetitive work.
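To make this concrete, here is a small sketch that parses an HTML string and reads a few values from it. `SAMPLE_HTML` is hypothetical markup made up for this example; it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="headline">Hello, scraper</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Read the text of single elements and of every matching element.
page_title = soup.title.get_text()
headline = soup.find(name="h1", id="headline").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(page_title)
print(headline)
print(paragraphs)
```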

Requests

The requests HTTP library downloads the HTML of a page, given the link to the website, with the `.get()` function.

Creating a Web Scraper

Now to the nitty-gritty of this project. Create a new directory, and in it, a file named app.py that will contain the script for the web scraper program.

Copy and paste the following code:

# app.py

import requests

response = requests.get('https://news.ycombinator.com/news')
yc_web_page = response.text

print(yc_web_page)

The code above does the following:

Imports the `requests` module
Calls the `.get()` function of `requests` to download the HTML from the link of the website provided, and stores the result in the `response` variable
Reads the content of the web page with `.text`
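The download step can also be wrapped in a function that fails loudly when something goes wrong. This is a sketch rather than part of the original script: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML, raising on HTTP errors."""
    # The timeout stops the call from waiting forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://news.ycombinator.com/news')` would return the same string that the script stores in `yc_web_page`, or raise an exception instead of silently returning an error page.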

If you run this code with the command `python app.py` and it fails with a `ModuleNotFoundError` instead of printing the page, it means the required modules need to be installed.

Run the following commands to install the modules.

pip install requests

pip install beautifulsoup4
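One way to confirm that the installation worked is to ask Python whether it can locate both modules. The sketch below uses `importlib.util.find_spec` from the standard library; `missing_modules` is a hypothetical helper name. Note that the `beautifulsoup4` package is imported under the name `bs4`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


# An empty list means both modules are installed and importable.
print(missing_modules(["requests", "bs4"]))
```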

The output of the script is the HTML source of the page and should look like this:

Next, let’s update the `app.py` file with the rest of the code using Beautiful Soup:

# app.py

import requests
from bs4 import BeautifulSoup # add this

response = requests.get('https://news.ycombinator.com/news')

yc_web_page = response.text

# add this 
soup = BeautifulSoup(yc_web_page, 'html.parser')

article_tag = soup.find(name="a", class_='titlelink')
article_title = article_tag.get_text()

article_link = article_tag.get('href')
article_upvote = soup.find(name="span", class_="score").get_text()

result = {
  "title": article_title,
  "link": article_link,
  "point": article_upvote
}

print(result)

The code snippet above does the following:

Imports the `BeautifulSoup` class from the `bs4` module
Parses the document in `yc_web_page` with the `BeautifulSoup` class and the `html.parser` parser, and stores the parsed HTML in the `soup` variable

Before going over the rest of the code, let’s open our web browser with the link provided in `.get()`.

Next, right-click on the page and click Inspect to view the Elements tab of the YCombinator news page.

Our web page should look like this:

With Beautiful Soup, we can target specific elements on the page with their class names:

The `article_tag` variable holds the first element returned by the `find()` function, which we call with the element's tag name, `a`, and its class name. The class is passed as `class_`, with a trailing underscore, because `class` is a reserved word in Python

The `.get_text()` function extracts the link title from `article_tag`
The `.get()` function extracts the link from `article_tag` through its `href` attribute
The same applies to the `article_upvote` variable, where the tag name, `<span>`, and the class name, `score`, are used to extract the points from the first matching element on the page
The `result` variable holds the extracted data as a dictionary of key and value pairs
Finally, the script prints the result
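The steps above can be sketched as a function that also copes with a page where nothing matches, since `find()` returns `None` in that case (for example, if the site's markup has changed). `SAMPLE_HTML` is hypothetical markup that only mirrors the class names used in this article, and `extract_first_article` and `extract_all_titles` are helper names introduced for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names used in this article.
SAMPLE_HTML = """
<table>
  <tr><td><a class="titlelink" href="https://example.com/a">First story</a></td></tr>
  <tr><td><span class="score">120 points</span></td></tr>
  <tr><td><a class="titlelink" href="https://example.com/b">Second story</a></td></tr>
  <tr><td><span class="score">45 points</span></td></tr>
</table>
"""


def extract_first_article(html):
    """Return the first article as a dict, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    article_tag = soup.find(name="a", class_="titlelink")
    if article_tag is None:
        # find() returns None when no element matches the tag and class.
        return None
    score_tag = soup.find(name="span", class_="score")
    return {
        "title": article_tag.get_text(),
        "link": article_tag.get("href"),
        "point": score_tag.get_text() if score_tag is not None else None,
    }


def extract_all_titles(html):
    """Return the text of every matching link, using find_all()."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text() for tag in soup.find_all(name="a", class_="titlelink")]


print(extract_first_article(SAMPLE_HTML))
print(extract_all_titles(SAMPLE_HTML))
```

Without the `None` check, calling `.get_text()` on a missing element raises an `AttributeError`, which is the usual symptom of a selector that no longer matches the page.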

With the whole script written, running it should scrape the data from the news home page of YCombinator and print a result that looks like this:

Conclusion

This article taught you how to build a Python web scraper to extract data from a web page.

A web scraper also saves time and effort, because it produces large data sets much faster than collecting the data manually.

How to Build a Python Web Scraper: Scrape Data from any Website
