Pyppeteer: the Puppeteer for Python Developers

24 February 2022 | 10 min read

The web acts like a giant, powerful database, with tons of data generated every single day. With the rise of trends such as big data and data science, data has become more useful than ever: it is used to train machine learning algorithms, generate insights, forecast the future, and much more. Extracting this data manually, page by page, can be a slow and time-consuming process. Web scraping, the practice of programmatically extracting data from the web, offers a helpful solution. Thanks to browser automation, which emulates human actions such as clicking and scrolling through a web system, users can gather useful data simply and efficiently without being hindered by a manual process.

There are a number of tools and libraries in Python for web scraping. Some of the most popular options include requests, BeautifulSoup, Scrapy, MechanicalSoup, lxml, and selenium. In this article, you’ll learn about another powerful alternative, Pyppeteer, and explore how to get started with it as a Python developer.


What is Pyppeteer?

Pyppeteer is a Python wrapper for Puppeteer, the JavaScript (Node) browser automation library. It works similarly to Selenium, supporting both headless and non-headless modes, though unlike Selenium, Pyppeteer natively supports only Chromium-based browsers.

Headless mode simply refers to running the web browser in the background, without the graphical user interface (GUI). This is typically preferable for tasks like web automation, automated testing, and web scraping, as it significantly reduces the browser’s load time and the computing power required, since all the work happens in the background.

Why Pyppeteer?

While tools like requests and BeautifulSoup excel at extracting data from static sites, they struggle with dynamic or reactive sites whose UIs rely heavily on JavaScript, built with frameworks such as ReactJS, AngularJS, or VueJS. They simply weren’t made to deal with dynamically created content.

Pyppeteer, on the other hand, gives you the ability to control the entire browser and its elements instead of using HTTP libraries like requests to get the content of the page. This gives you much more flexibility in terms of what you can accomplish. Some of Pyppeteer’s specific use cases include:

  • Creating screenshots or PDFs of website pages
  • Automating keyboard input, form submission, UI testing, and more
  • Crawling a single-page application to produce pre-rendered content (i.e., server-side rendering)
  • Generating an automated testing environment to run tests within fully updated versions of Chrome and JavaScript

You can learn more about all that you can do with it by visiting the official Pyppeteer documentation.

Implementing Pyppeteer

Now that you know a little more about Pyppeteer, let’s get started with the tutorial on how to implement it in Python.

Setting Up Your Virtual Environment

First, it’s best practice to create a separate development environment so that you don’t mess up any existing libraries:

# Install virtualenv (if you don't have it)

## installing | Windows
pip install --upgrade virtualenv

## installing | Linux | Mac
pip3 install --upgrade virtualenv

# create virtual environment | Windows | Linux | Mac
virtualenv pyp-env

# activating pyp-env | Windows
pyp-env\Scripts\activate

# activating pyp-env | Linux | Mac
source pyp-env/bin/activate

Installation

Pyppeteer is the main dependency to install. Note that it requires Python 3.6+. You can install it directly using pip or from the Pyppeteer GitHub repository:

# Installing using pip | Windows
C:\> python -m pip install pyppeteer

# Installing using pip | Linux | Mac
$ python3 -m pip install pyppeteer

# Installing from the GitHub repository | Windows
C:\> python -m pip install -U git+https://github.com/miyakogi/pyppeteer.git@dev

# Installing from the GitHub repository | Linux | Mac
$ python3 -m pip install -U git+https://github.com/miyakogi/pyppeteer.git@dev

Note:

  1. Pyppeteer may stall for a while the first time you run a script, because it needs time to download a recent version of the Chromium browser. Alternatively, you can complete this download manually before running your scripts using the following command:
$ pyppeteer-install
  2. M1 Macs may have problems running Pyppeteer in arm64 mode and may need to run it via Rosetta instead.

It’s also worth mentioning that Pyppeteer has async support by default, which means your script or application can handle the browser automation and scraping process asynchronously. This can be a real performance booster for tasks involving many HTTP calls.
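The benefit is easiest to see with plain asyncio: tasks that spend most of their time waiting (on HTTP calls, page loads, and so on) can overlap instead of running back to back. Here’s a toy sketch that simulates three one-second “page loads” with asyncio.sleep:

```python
import asyncio
import time

async def fake_page_load(name: str) -> str:
    # stand-in for an `await page.goto(...)` that takes about a second
    await asyncio.sleep(1)
    return name

async def main() -> list:
    # run all three "loads" concurrently instead of one after another
    return await asyncio.gather(
        fake_page_load("page-1"),
        fake_page_load("page-2"),
        fake_page_load("page-3"),
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results)   # ['page-1', 'page-2', 'page-3']
print(f"{elapsed:.1f}s")  # roughly 1s total, not 3s
```

The same pattern applies to Pyppeteer coroutines: awaiting several of them with asyncio.gather lets the browser work on multiple pages concurrently.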

Capturing Screenshots with Pyppeteer

Next, you’ll learn how to use Pyppeteer to capture screenshots from a website and save them as an image.

First, import your required libraries:

import asyncio
from pyppeteer import launch

Then create an async function to open a website and capture a screenshot:

import asyncio
from pyppeteer import launch

async def main():
    # launch chromium browser in the background
    browser = await launch()
    # open a new tab in the browser
    page = await browser.newPage()
    # navigate the tab to the given URL
    await page.goto("https://www.python.org/")
    # create a screenshot of the page and save it
    await page.screenshot({"path": "python.png"})
    # close the browser
    await browser.close()

print("Starting...")
asyncio.get_event_loop().run_until_complete(main())
print("Screenshot has been taken")
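A quick aside on the last three lines: asyncio.get_event_loop().run_until_complete(...) is the classic pattern used in Pyppeteer’s examples, but on Python 3.7+ the simpler asyncio.run(...) also works and manages the event loop for you. A minimal sketch with a stand-in coroutine:

```python
import asyncio

async def main():
    # placeholder for the browser work shown above
    await asyncio.sleep(0)
    return "done"

# asyncio.run creates a fresh event loop, runs the coroutine, and closes the loop
result = asyncio.run(main())
print(result)  # done
```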

Finally, run your script (app.py):

# Windows
C:\> python app.py
......
Starting...
Screenshot has been taken

# Linux | Mac
$ python3 app.py
......
Starting...
Screenshot has been taken

When you see “Screenshot has been taken,” you should find a new image named “python.png” in your current directory. It should look something like this:

Python website screenshot

This is a very basic example of taking a screenshot with Pyppeteer. As mentioned previously, though, Pyppeteer is also fit for use on more complex, dynamic websites. In the next section, you’ll explore a second example, where you’ll learn how to build a simple web scraping script that extracts topic titles from an interactive site. This is where Pyppeteer shines, as this is almost impossible to accomplish with alternative tools like requests or BeautifulSoup.

Scraping Complex Page Content with Pyppeteer

Let’s say you’re given the task of scraping article ideas for a given list of topic names from Educative.io/edpresso . The content of the page renders interactively based on what you type in the search box, as pictured here:

Educative io website

Examining this gif, you can quickly brainstorm the steps the script needs to take to effectively extract the interactive article ideas. Those steps might include:

  1. Locating the search box
  2. Writing a target topic on the search box
  3. Waiting for the topics to load
  4. Extracting all article titles about that topic
  5. Deleting the topic from the search box
  6. Repeating steps 2-5 until you iterate through all the necessary topics

Setting Up

Before you proceed to implementing the algorithm, remember that Pyppeteer launches the Chromium browser in headless mode by default. When building a script with a lot of unpredictability, it’s typically preferable to configure it to run in non-headless mode, as this eases the burden of blind debugging.

Here’s how to configure Pyppeteer to run in non-headless mode:

# launch browser in non-headless mode
browser = await launch({"headless": False})

# It's also a good choice to allow full screen.
# To start the launched browser maximized:
browser = await launch({"headless": False, "args": ["--start-maximized"]})

Now let’s get back to setting up the various aspects of the script.

The first lines of code for opening a website are similar to those used in the first example in this article. Here, though, you’ll need to add a new line that locates the search box using a CSS selector.

1. Locating the Search Box

Your code will look like this:

import asyncio
from typing import List
from pyppeteer import launch

async def get_article_titles(keywords: List[str]):
    # launch browser in non-headless mode
    browser = await launch({"headless": False, "args": ["--start-maximized"]})
    # create a new page
    page = await browser.newPage()
    # set page viewport to the largest size
    await page.setViewport({"width": 1600, "height": 900})
    # navigate to the page
    await page.goto("https://www.educative.io/edpresso")
    # locate the search box
    entry_box = await page.querySelector(
       "#__next > div.ed-grid > div.ed-grid-main > div > div.flex.flex-row.items-center.justify-around.bg-gray-50.dark\:bg-dark.lg\:py-0.lg\:px-6 > div > div.w-full.p-0.m-0.flex.flex-col.lg\:w-1\/2.lg\:py-0.lg\:px-5 > div.pt-6.px-4.pb-0.lg\:sticky.lg\:p-0 > div > div > div.w-full.dark\:bg-dark.h-12.flex-auto.text-sm.font-normal.rounded-sm.cursor-text.inline-flex.items-center.hover\:bg-alphas-black06.dark\:hover\:bg-gray-A900.border.border-solid.overflow-hidden.focus-within\:ring-1.border-gray-400.dark\:border-gray-900.focus-within\:border-primary.dark\:focus-within\:border-primary-light.focus-within\:ring-primary.dark\:focus-within\:ring-primary-light > input"
   )

2. Writing a Target Topic

# type keyword in the search box
await entry_box.type(keyword)

3. Waiting for the Topics to Load

# wait for search results to load
await page.waitFor(4000)

4. Extracting the Article Ideas

# extract the article titles
topics = await page.querySelectorAll("h2")
for topic in topics:
    title = await topic.getProperty("textContent")
    # print the article titles
    print(await title.jsonValue())

5. Deleting the Topic from the Search Box

# clear the input box
for _ in range(len(keyword)):
    await page.keyboard.press("Backspace")

6. Repeating Steps 2-5 (Iterating over Topics)

for keyword in keywords:
    # type keyword in search box
    await entry_box.type(keyword)
    # wait for search results to load
    await page.waitFor(4000)
    # extract the article titles
    topics = await page.querySelectorAll("h2")

    # print the article titles
    for topic in topics:
        title = await topic.getProperty("textContent")
        print(await title.jsonValue())

    # clear the input box
    for _ in range(len(keyword)):
        await page.keyboard.press("Backspace")

Completing the Script

Now that you’ve built the various pieces of the algorithm, it’s time to put the whole script together. Your complete source code should look like this:

import asyncio
from typing import List
from pyppeteer import launch

async def get_article_titles(keywords: List[str]):
   # launch browser in non-headless mode
   browser = await launch({"headless": False, "args": ["--start-maximized"]})
   # create a new page
   page = await browser.newPage()
   # set page viewport to the largest size
   await page.setViewport({"width": 1600, "height": 900})
   # navigate to the page
   await page.goto("https://www.educative.io/edpresso")
   # locate the search box
   entry_box = await page.querySelector(
       "#__next > div.ed-grid > div.ed-grid-main > div > div.flex.flex-row.items-center.justify-around.bg-gray-50.dark\:bg-dark.lg\:py-0.lg\:px-6 > div > div.w-full.p-0.m-0.flex.flex-col.lg\:w-1\/2.lg\:py-0.lg\:px-5 > div.pt-6.px-4.pb-0.lg\:sticky.lg\:p-0 > div > div > div.w-full.dark\:bg-dark.h-12.flex-auto.text-sm.font-normal.rounded-sm.cursor-text.inline-flex.items-center.hover\:bg-alphas-black06.dark\:hover\:bg-gray-A900.border.border-solid.overflow-hidden.focus-within\:ring-1.border-gray-400.dark\:border-gray-900.focus-within\:border-primary.dark\:focus-within\:border-primary-light.focus-within\:ring-primary.dark\:focus-within\:ring-primary-light > input"
   )

   for keyword in keywords:
       print("====================== {} ======================".format(keyword))
       # type keyword in search box
       await entry_box.type(keyword)
       # wait for search results to load
       await page.waitFor(4000)
       # extract the article titles
       topics = await page.querySelectorAll("h2")
       for topic in topics:
           title = await topic.getProperty("textContent")
           # print the article titles
           print(await title.jsonValue())

       # clear the input box
       for _ in range(len(keyword)):
           await page.keyboard.press("Backspace")

print("Starting...")
asyncio.get_event_loop().run_until_complete(
   get_article_titles(["python", "opensource", "opencv"])
)
print("Finished extracting articles titles")

Running the Script

Once your script is complete, it’s time to see if it works. Launch it as you would normally run a Python script, as shown here:

$ python3 app.py

Starting...
====================== python ======================
What is the contextlib module?
What is the difference between String find() and index() method?
Installing pip3 in Ubuntu
What is a private heap space?
......
====================== opensource ======================
Knative
How to use ASP.NET Core
What is apache Hadoop?
What is OpenJDK?
What is Azure Data Studio?
.....
====================== opencv ======================
What is OpenCV in Python?
Eye Blink detection using OpenCV, Python, and Dlib
How to capture a frame from real-time camera video using OpenCV
Finished extracting articles titles

When you run your script, it will automatically launch a Chromium browser and then open a new tab for the Educative.io page. Then it will go through all the steps highlighted above and print out the scraped article titles for each keyword. If you see results similar to the output above when you run your script, then congratulations—you made it!

Conclusion

In this article, you learned about web scraping and explored Pyppeteer’s abilities, building scripts to do anything from capturing simple website screenshots to scraping dynamic, interactive web pages. These are just the basics, however. Now that you know the fundamentals, take some time to explore the rest out of your own curiosity by visiting Pyppeteer’s official documentation.

Kalebu Gwalugano

I’m Kalebu Gwalugano, a Mechatronics Engineer and Experienced Python Developer with expertise in building (Web|Mobile) Backends, DataScience solutions, IoT Architectures, Microservices, and DevOps.