Web Scraping with Node + Puppeteer

Ibnu Gunawan P

Wed, 28 Jul 2021, 2:44:00 pm


Web Scraping with Node and Puppeteer

Web scraping is a method for extracting data (such as images, text, or URLs) from web pages, especially when a site does not expose an API for accessing that data. Puppeteer, a Node.js library, is a popular tool for the job. In this article, I'll try my hand at scraping data from a particular website.

Determining the Data to Extract

For this example, we'll extract data from medium.com. To identify the data to extract, press F12 to open the browser's developer tools and inspect the HTML elements on the page that contain the data you need.

[Image: /assets/articles/medium.png, copying a selector from the trending section of medium.com in the browser's dev tools]

As an example, let's say I want to capture the trending topics on medium.com by copying their selector, as shown in the image above. Here's the selector I obtained:

#root > div > div.ar.do > div > div > div > div > div > div > div > div
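You can sanity-check a copied selector before handing it to Puppeteer. A quick way, assuming you still have the dev tools console open on medium.com, is to query it directly; the match count and text will vary as the live page changes:

// Paste into the browser console on medium.com to see what the selector matches.
const nodes = document.querySelectorAll('#root > div > div.ar.do > div > div > div > div > div > div > div > div');
console.log(nodes.length, [...nodes].map(el => el.innerText));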

Trying Out Puppeteer

Once you have the selector, the next step is to test it by visiting https://try-puppeteer.appspot.com/, a playground where you can run Puppeteer scripts directly in the browser.

[Image: /assets/articles/Screen_Shot_2021-07-28_at_17.20.23.png, the try-puppeteer playground]

Here, you can try out the selector you copied earlier by modifying the sample script as shown below:

const browser = await puppeteer.launch();

const page = await browser.newPage();
await page.goto('https://medium.com');

const log = await page.$$eval('#root > div > div.ar.do > div > div > div > div > div > div > div > div', items => items.map(el => el.innerText));

console.log(log);
await page.screenshot({path: 'screenshot.png'});

await browser.close();

Explanation

Puppeteer is a Node.js library that provides an API for controlling the Chromium browser. With Puppeteer, you can automate and control various browser actions.

  • Line 1 launches Puppeteer to open a browser.
  • Line 2 creates a new tab within the browser.
  • Line 3 directs the opened tab to the Medium website.
  • Line 4 stores the data to be scraped in a variable: page.$$eval finds every element matching the selector copied earlier, maps over the resulting array, and returns each element's text.
  • Line 5 prints the scraped data as text to the console.
  • Line 6 takes a screenshot of the medium.com page.
  • Line 7 closes the browser.
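
Beyond launching, navigating, and extracting text, the same API can fill in forms, click elements, and wait for content to appear. Here's a minimal sketch of those actions; the URL and selectors in it are hypothetical placeholders, not taken from any real page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/search'); // hypothetical page

  // Type into an input field and click a button (hypothetical selectors).
  await page.type('#search-input', 'puppeteer');
  await page.click('#search-button');

  // Wait until the results container exists before reading from the page.
  await page.waitForSelector('.results');

  await browser.close();
})();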

Scraping Results

[Image: /assets/articles/Screen_Shot_2021-07-28_at_17.25.33.png, console output listing the scraped trending topics]

From this, we've successfully obtained a list of trending topics on medium.com in text format.

So far, we've performed web scraping through the Puppeteer-provided website. However, if you'd like to try it directly in your code, you can do so by installing the Puppeteer package for Node.js.

Scraping with Node.js

To start scraping, you can create a new project using the following command:

npm init -y

Installation

To install Puppeteer, simply run the following command in your terminal:

npm i puppeteer
# or yarn add puppeteer

The Puppeteer installation automatically downloads a compatible build of Chromium, roughly 170 MB on macOS, 282 MB on Linux, and 280 MB on Windows. If you'd rather skip that download, you can install puppeteer-core instead, which ships without a bundled browser and drives one you already have installed:

npm i puppeteer-core
# or yarn add puppeteer-core
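
Because puppeteer-core ships without a browser, you have to tell it where one lives when launching. A minimal sketch, assuming Chrome is installed at the default macOS path (adjust executablePath for your system):

const puppeteer = require('puppeteer-core');

(async () => {
  // puppeteer-core downloads nothing, so point it at an existing browser.
  // This path assumes a default macOS Chrome install; adjust as needed.
  const browser = await puppeteer.launch({
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  });
  const page = await browser.newPage();
  await page.goto('https://medium.com');
  await browser.close();
})();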

Once all preparations are ready, you can proceed with writing the code.

Scraping Code

Create a file named index.js and enter the following code:

// Importing Puppeteer
const puppeteer = require("puppeteer");

// Continue with the code previously tested at https://try-puppeteer.appspot.com/
(async () => {
    const browser = await puppeteer.launch();

    const page = await browser.newPage();
    await page.goto('https://medium.com');
    
    const log = await page.$$eval('#root > div > div.ar.do > div > div > div > div > div > div > div > div', items => items.map(el => el.innerText));

    console.log(log);
    await page.screenshot({path: 'screenshot.png'});
    
    await browser.close();
})();
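
Then run the script from the project folder:

node index.js

The trending topics are printed to the console, and a screenshot.png of the page is saved next to index.js.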

Author's Message

How easy was that? Puppeteer not only lets you retrieve data as shown above, but also offers options for handling authenticated pages, managing browser cookies, running the browser headlessly (or in a visible window), and much more. You can see an example of a scraping application I created [here]. For the complete documentation, see the official Puppeteer site at https://pptr.dev/.
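
As a small taste of those options, here's a minimal sketch that launches a visible (non-headless) browser and sets a cookie before navigating; the cookie name and value are made-up placeholders:

const puppeteer = require('puppeteer');

(async () => {
  // headless: false opens a visible window instead of the default headless mode.
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Set a cookie before visiting the page (name and value are hypothetical).
  await page.setCookie({ name: 'session', value: 'example-token', domain: 'medium.com' });

  await page.goto('https://medium.com');
  await browser.close();
})();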

That's all I can provide for now. I hope you find it useful. Thank you! 😁