You can download a file with Playwright by targeting the file download button on the page using any Locator
and clicking it. Alternatively, you can also extract the link from an anchor tag using the get_attribute
method and then download the file using requests
. This is better as sometimes the PDFs and other downloadable files will open natively in the browser instead of triggering a download on button click.
Here is some sample code that downloads a random paper from arXiv using Playwright and requests
:
import requests
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless = False)
page = browser.new_page()
page.goto('https://arxiv.org/abs/2301.05226')
# Get the href from the "download pdf" button
href = page.locator("a.download-pdf").get_attribute('href')
absolute_url = f"https://arxiv.org{href}"
# Download the file using requests
file = requests.get(absolute_url)
with open('output.pdf', 'wb') as f:
f.write(file.content)