In this article, we will build a Python script that scrapes, or grabs, data from a website. This method of gathering data is called web scraping.
Web scraping means using Python, or any other programming language, to download, clean, and use the data from a web page. Many websites don’t want you scraping their data, so to show what is permissible, most of them publish a dedicated robots.txt file that lists which endpoints automated programs may and may not request.
Append /robots.txt to the root URL of a website to find out about the allowed endpoints. For example, let’s use https://news.ycombinator.com/robots.txt.
The result is a plain text file that looks like this:
The screenshot shows which endpoints we are and are not allowed to scrape on the YCombinator website. A crawl delay is the pause a program should leave between requests, so that constant scraping does not overload the servers and slow down the website.
In this exercise, we scrape the home page of the news site, which the rules for our user agent allow.
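The same check can be done from Python with the standard library's urllib.robotparser module. The sketch below parses a small set of illustrative rules (an assumption made for this example, not a copy of the site's actual robots.txt) and asks whether two paths may be fetched and what crawl delay applies.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; they are NOT the site's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Disallow: /submit
Crawl-delay: 30
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether any user agent ("*") may request each URL under these rules.
home_allowed = parser.can_fetch("*", "https://news.ycombinator.com/news")
submit_allowed = parser.can_fetch("*", "https://news.ycombinator.com/submit")

# The number of seconds to wait between requests, or None if not specified.
delay_seconds = parser.crawl_delay("*")

print(home_allowed, submit_allowed, delay_seconds)
```

To check a live file instead, call parser.set_url() with the robots.txt URL and then parser.read() before asking can_fetch().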

Getting Started

The Python web scraper requires two modules to scrape the data:

Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML files. It uses a parser to turn the raw markup into a tree of objects that you can search and navigate, which saves programmers hours of manual and repetitive work.
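As a minimal sketch of what that looks like, the example below parses a short HTML string and reads three values from it. The HTML, class name, and URL are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="https://example.com/docs">Docs</a>
  </body>
</html>
"""

# Build a searchable tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.get_text()                       # text of the <title> tag
intro_text = soup.find("p", class_="intro").get_text()   # text of the first matching <p>
link_target = soup.find("a").get("href")                 # value of the href attribute

print(page_title)
print(intro_text)
print(link_target)
```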

Requests

The requests HTTP library downloads the HTML of a page when you pass the link to the website to its .get() function.
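A slightly more defensive way to call .get() is sketched below. The function name fetch_html, the User-Agent string, and the 10-second timeout are assumptions chosen for this example, not part of the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising an error on HTTP failures."""
    # Hypothetical identifier; use a value that describes your own scraper.
    headers = {"User-Agent": "tutorial-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://news.ycombinator.com/news') would then return the same HTML that response.text holds in the script in this article.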

Creating a Web Scraper

Now to the nitty-gritty of this project. Create a new directory, and in it, a file named app.py that will contain the code for the web scraper program.
Copy and paste the following code:
# app.py

import requests

response = requests.get('https://news.ycombinator.com/news')
yc_web_page = response.text

print(yc_web_page)
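The code above imports requests, sends a GET request to the YCombinator news page, stores the HTML of the response in yc_web_page, and prints it.
If you run this code with the command python app.py and it fails with a ModuleNotFoundError instead of printing the page's HTML, it means the modules need to be installed.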
Run the following commands to install the modules.
pip3 install requests

pip3 install beautifulsoup4
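To confirm that both installs worked, a short check with the standard library can help. This is only a sketch, and the helper name missing_modules is made up for this example.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the import names from `names` that Python cannot find."""
    return [name for name in names if find_spec(name) is None]


# "bs4" is the import name of the beautifulsoup4 package.
missing = missing_modules(["requests", "bs4"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules are installed.")
```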
Running the script should now print the HTML source of the page, which looks like this:
Next, let’s update the app.py file with the rest of the code using Beautiful Soup:
# app.py

import requests
from bs4 import BeautifulSoup # add this

response = requests.get('https://news.ycombinator.com/news')

yc_web_page = response.text

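# add this
# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(yc_web_page, 'html.parser')

# First <a> element with the class "titlelink" (the article title link);
# find() returns None if the page markup does not contain this class
article_tag = soup.find(name="a", class_='titlelink')
article_title = article_tag.get_text()

# URL the title points to, and the text of the first score element
article_link = article_tag.get('href')
article_upvote = soup.find(name="span", class_="score").get_text()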

result = {
  "title": article_title,
  "link": article_link,
  "point": article_upvote
}

print(result)
The code snippet above imports BeautifulSoup from the bs4 package and parses the downloaded HTML with Python's built-in html.parser.
Before going over the rest of the code, let’s open the link passed to .get() in our web browser.
Next, right-click on the page, and click Inspect to view the Elements tab of the YCombinator news page.
Our web page should look like this:
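With Beautiful Soup, we can target specific elements on the page by their class names. In the script, soup.find() returns the first element that matches the given tag name and class, .get_text() returns the text inside that element, and .get('href') returns the value of its href attribute.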
With the whole script written, running it should scrape the first article on the YCombinator news home page and print a dictionary with its title, link, and points.
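The script reads only the first article on the page. One possible way to collect every article is sketched below with find_all(). The sample HTML is hypothetical: it reuses the titlelink and score class names from the script, but the row layout and the athing class are assumptions, so inspect the live page before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr class="athing"><td><a class="titlelink" href="https://example.com/a">First story</a></td></tr>
  <tr><td><span class="score">120 points</span></td></tr>
  <tr class="athing"><td><a class="titlelink" href="https://example.com/b">Second story</a></td></tr>
  <tr><td><span class="score">45 points</span></td></tr>
</table>
"""


def extract_articles(html):
    """Return one dictionary per article row found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for row in soup.find_all("tr", class_="athing"):
        link = row.find("a", class_="titlelink")
        if link is None:
            continue  # skip rows without a title link
        # In this sample, the score sits in the row right after the title row.
        detail_row = row.find_next_sibling("tr")
        score = detail_row.find("span", class_="score") if detail_row else None
        articles.append({
            "title": link.get_text(),
            "link": link.get("href"),
            "point": score.get_text() if score else None,
        })
    return articles


articles = extract_articles(SAMPLE_HTML)
for article in articles:
    print(article)
```

Checking for None before calling .get_text() keeps the loop from crashing when a row has no score or when the page markup changes.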

Conclusion

This article taught you how to use a Python web scraper to extract data from a web page.
A web scraper also saves time and effort, because it can produce large data sets much faster than collecting the data manually.
