Large organizations invest heavily in building their applications and frown on developers extracting data from their websites. That is why many of them publish rules, grouped by user agent, that tell crawlers what is permitted.


On most sites, you can find these rules in the robots.txt file at the root of the live URL, as in this link:

https://www.amazon.com/robots.txt
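
As a rough illustration of how such rules are organized, the sketch below groups Allow and Disallow lines by user agent. The parseRobotsTxt function and the sample text are illustrative assumptions (the sample is not Amazon's actual file), and the parser deliberately ignores wildcards, Sitemap, and Crawl-delay lines.

```javascript
// Minimal robots.txt reader: groups Allow/Disallow rules by User-agent.
// Simplified sketch: no wildcard matching, no Sitemap or Crawl-delay support.
function parseRobotsTxt(text) {
  const groups = {};        // user agent -> { allow: [], disallow: [] }
  let currentAgents = [];   // agents the following rules apply to
  let lastWasAgent = false; // consecutive User-agent lines share one group

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const idx = line.indexOf(":");
    if (!line || idx === -1) continue;

    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === "user-agent") {
      if (!lastWasAgent) currentAgents = [];
      currentAgents.push(value);
      if (!groups[value]) groups[value] = { allow: [], disallow: [] };
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      for (const agent of currentAgents) groups[agent][field].push(value);
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

// Hypothetical sample, written by hand for this example.
const sample = `
User-agent: *
Disallow: /cart
Allow: /help

User-agent: ExampleBot
Disallow: /
`;

const rules = parseRobotsTxt(sample);
console.log(rules["*"].disallow);          // paths closed to every crawler
console.log(rules["ExampleBot"].disallow); // paths closed to ExampleBot only
```

Reading the file this way shows which paths a site asks crawlers to avoid before any scraping starts.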

In this article, you’ll learn about using Bright Data’s Web Scraper integrated development environment (IDE) to scrape datasets at scale using its ready-made functions and coding templates.

Benefits of the Web Scraper IDE

These are some of the benefits of using Bright Data’s Web Scraper IDE:

What is Bright Data?

Bright Data is a proxy network that helps you turn websites into structured data. To get started with the platform, create an account.

Check out this resource to learn more about Bright Data.

https://www.youtube.com/watch?v=YzoLTalL6Uo&embedable=true

Working with Web Scraper IDE

On your account dashboard, click the Datasets and Web Scraper IDE icon, and afterward, select the Get started button to open the template window.

The new window pops up a dialog box where you can select one of the pre-existing dataset options or create one from scratch.

Select the eBay discovery and PDP options, and the page should look something like this with the collector code.

Now scroll down the page, and under the input tab, enter the name of a product whose data you want to extract and analyze. Once done, click the Preview button to run the preview and start the extraction.

PS: You can also enter your own scripts in the Interaction code section.

After the preview runs, the output tab shows the result from the eBay website, formatted into fields such as product_url, title, image, and price.

Saving the collector

To save the collector, click on the Finish editing button to open the configuration page as seen below:

Initiate the collector by API

Under the My Scrapers tab, click the Initiate by API button to initiate this project and work with the scripts provided.

Creating authorization token

An API token identifies you as the account's rightful owner and authorizes your requests to the API.

Click on the Account settings menu at the bottom left of the window to create an API token.

After adding the API token, you will receive a secret code for verification; enter it to confirm.

Once that is done, copy your API token, as you won’t be able to retrieve it later; you would have to create a new one.

Return to the New collector page, copy the script for your operating system (OS), and paste it into your terminal. Make sure to replace API_TOKEN, which follows the word Bearer, with the key you copied in the previous section.

In your command line interface (CLI) or terminal, the command should look something like this:

curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -d '[{"keyword":"ralph lauren polo shirt","count":10,"location":"","condition":"New unused"}]' "https://api.brightdata.com/dca/trigger?collector=c_liopmjh61f3o3lz7dz&queue_next=1"
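
For readers who would rather trigger the collector from Node than from curl, here is a minimal sketch of the same request. The buildTriggerRequest helper is a hypothetical name introduced for illustration; the URL, headers, and payload are copied from the curl command above, and the commented fetch call assumes Node 18 or later.

```javascript
// Builds the same request as the curl command, ready to hand to fetch.
// buildTriggerRequest is an illustrative helper, not part of any Bright Data SDK.
function buildTriggerRequest(apiToken, collectorId, inputs) {
  const url = new URL("https://api.brightdata.com/dca/trigger");
  url.searchParams.set("collector", collectorId);
  url.searchParams.set("queue_next", "1");

  return {
    url: url.toString(),
    options: {
      method: "POST", // curl switches to POST when -d is supplied
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(inputs),
    },
  };
}

const request = buildTriggerRequest(
  "API_TOKEN", // placeholder: replace with your own token
  "c_liopmjh61f3o3lz7dz",
  [{ keyword: "ralph lauren polo shirt", count: 10, location: "", condition: "New unused" }]
);

console.log(request.url);

// To send it (Node 18+, inside an async function or an .mjs module):
// const response = await fetch(request.url, request.options);
```

Keeping the request construction separate from the network call makes it easy to inspect what will be sent before spending any credits.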

Sending this request activates the code in the Result API section of the New collector dashboard page. Copy that code and paste it into the CLI as well.

PS: Remember to put your API token key in place of the value API_TOKEN.

curl "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod" -H "Authorization: Bearer API_TOKEN"

Run the command in the CLI. While the dataset is being prepared, the response is an object with a status that reads building, along with a message.

If the building response continues to show, resend the request. When successful, you should see this result object.
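
Resending the request by hand can be automated with a small polling loop. The sketch below is one possible approach, not Bright Data's recommended client: pollDataset, fetchDataset, and the stub replies are hypothetical, and the only thing assumed about the API is what is described above, namely that an unfinished dataset comes back as an object whose status is building.

```javascript
// Calls fetchJson repeatedly until the reply is no longer { status: "building" }.
// fetchJson is injected so the loop can be tried without network access.
async function pollDataset(fetchJson, { maxAttempts = 10, delayMs = 5000 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await fetchJson();
    const stillBuilding =
      result !== null &&
      typeof result === "object" &&
      !Array.isArray(result) &&
      result.status === "building";
    if (!stillBuilding) return result;
    // Wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Dataset still building after ${maxAttempts} attempts`);
}

// Real fetcher mirroring the curl command (Node 18+); replace API_TOKEN.
// Usage: pollDataset(fetchDataset).then(console.log)
const fetchDataset = async () => {
  const response = await fetch(
    "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod",
    { headers: { Authorization: "Bearer API_TOKEN" } }
  );
  return response.json();
};

// Stub fetcher for a dry run: "building" twice, then a made-up record.
const replies = [
  { status: "building", message: "not ready" },
  { status: "building", message: "not ready" },
  [{ title: "sample item" }],
];
let calls = 0;
const stubFetcher = async () => replies[Math.min(calls++, replies.length - 1)];

const demo = pollDataset(stubFetcher, { delayMs: 10 }).then((data) => {
  console.log(`ready after ${calls} requests`, data);
  return data;
});
```

The attempt limit keeps the loop from running forever if the dataset never finishes building.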

Using Postman

To get the same object shown above, let’s use Postman to call the Result API.

If you do not have Postman, download it here. Postman is an API platform for building, publishing, monitoring, testing, and documenting APIs. Check this resource article to learn more about Postman and its use.

Open the Postman app and input these values:

Creating a Node Server

Node is a JavaScript runtime environment that allows the execution of JavaScript code outside the web browser, enabling developers to build server-side applications and command-line tools.


Let’s create a web server. Initializing the project in the terminal requires the package manager npm, which is installed automatically with Node.js on your local machine. Confirm that Node is installed with this command:

node --version

It displays the current version of Node.

  1. Create a new directory. For this project, it is named datasets.
  2. Change into the directory and initialize the project with the command:

cd datasets

npm init -y

The -y flag accepts all the defaults and generates a package.json file that looks like this:

package.json

{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  ...
}

  3. Install the following packages:

npm install -D nodemon

Nodemon monitors your source code for changes and automatically restarts the server.

npm install csv-parse

The csv-parse package is a parser for converting CSV text input into arrays or objects.

Now, update the scripts section in the package.json file to this:

{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "start": "node index.mjs",
    "start:dev": "nodemon index.mjs"
  },
  ...
}

  4. Next, create a new file in the root directory with the command:

touch index.mjs

To test this file, write a basic JavaScript script and run the server with the following command:

npm run start:dev
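
Any small script will do for this check. As one possible example, the following could go in index.mjs; the greet function is just a placeholder invented for the test run.

```javascript
// index.mjs: throwaway check that the start:dev script runs this file
const greet = (project) => `Hello from the ${project} project`;

console.log(greet("datasets"));
console.log(`Running on Node ${process.version}`);
```

With nodemon running, editing and saving the file should restart the script and print the messages again.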

Social Media Data from Bright Data

Scraping large datasets requires lots of effort and work using technologies like Node or Python. The way to get around this is to use a platform like Bright Data to obtain the information you need to get your results as soon as possible.

Let’s get an Instagram dataset from Bright Data with these steps:

  1. Sign up for a Bright Data account.
  2. Go to https://brightdata.com/cp/datasets/ or select the Dataset Marketplace on the Datasets & Web Scraper IDE.

  3. Open the Dataset Marketplace, and under Categories, select Instagram.com from the Social media dropdown.

  4. Click on View dataset and download the sample dataset in CSV format.

Make sure to save the dataset in the root directory of the Node web server.

Your folder structure should look something like this:

.
└── datasets
    ├── node_modules
    ├── instagram.csv
    ├── package-lock.json
    ├── package.json
    └── index.mjs

Reading CSV datasets in Node.js

In this section, Node will read the comma-separated values (CSV) dataset downloaded from Bright Data.

Update the index.mjs file with the code:

import { parse } from "csv-parse";
import { createReadStream } from "node:fs";

// Rows that pass the filter are collected here.
const instagramAccount = [];

// csv-parse returns every value as a string, so convert the numeric
// columns explicitly before comparing them.
const isInstagramAccount = (info) => {
  return (
    Number(info["posts_count"]) > 300 &&
    Number(info["followers"]) > 6000 &&
    info["biography"] !== "" &&
    info["posts"] !== ""
  );
};

// Stream the file through the parser one row at a time.
createReadStream("instagram.csv")
  .pipe(
    parse({
      columns: true, // use the header row as object keys
    })
  )
  .on("data", (data) => {
    if (isInstagramAccount(data)) {
      instagramAccount.push(data);
    }
  })
  .on("error", (err) => {
    console.error("error", err);
  })
  .on("end", () => {
    console.log(`${instagramAccount.length} accounts are live`);
    console.log("done");
  });

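The code above does the following:

  1. Imports the parse function from csv-parse and createReadStream from Node’s built-in fs module.
  2. Defines isInstagramAccount, which keeps only rows with more than 300 posts, more than 6,000 followers, a non-empty biography, and non-empty posts.
  3. Streams instagram.csv through the parser, with columns: true turning each row into an object keyed by the header row.
  4. Pushes each matching row into the instagramAccount array, logs any error, and prints the number of matching accounts when the file ends.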

Running the script with the command npm run start:dev should display a result like this in the terminal:

643 accounts are live
done
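
To see how the filter decides which rows count, it can help to run it on a few hand-written rows without touching the CSV file. The rows below are hypothetical and only imitate the shape csv-parse produces with columns: true, where every value is a string; the account column is invented for readability.

```javascript
// The same conditions as the filter in index.mjs, applied to in-memory rows.
const isInstagramAccount = (info) =>
  Number(info["posts_count"]) > 300 &&
  Number(info["followers"]) > 6000 &&
  info["biography"] !== "" &&
  info["posts"] !== "";

// Hypothetical rows, not taken from the real dataset.
const rows = [
  { account: "a", posts_count: "450", followers: "12000", biography: "Travel", posts: "[...]" },
  { account: "b", posts_count: "120", followers: "90000", biography: "Food", posts: "[...]" },
  { account: "c", posts_count: "900", followers: "7000", biography: "", posts: "[...]" },
];

const matches = rows.filter(isInstagramAccount);
console.log(matches.map((row) => row.account)); // only "a" meets all four conditions
```

Row b fails on the post count and row c on the empty biography, which is the same reasoning the stream applies to each row of the CSV.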

Conclusion

Web scraping is an integral part of data extraction used in data science. The Web Scraper IDE by Bright Data does all the heavy lifting in the background, presenting only the relevant data for your use.

This article walked you through using the Web Scraper IDE and building a custom script to query large datasets, without fear of being blocked by the anti-bot measures companies use to protect their data.
