How to scrape data from a website to Excel

14 October 2024 | 17 min read

Collecting data from websites and organizing it into a structured format like Excel can be super handy. Maybe you're building reports, doing research, or just want a neat spreadsheet with all the info you need. But copying and pasting manually? That's a time sink no one enjoys. In this guide, we'll discuss a few ways to scrape data from websites and save it directly into Excel.

Together we'll talk about methods for both non-techies and devs, using everything from built-in Excel tools to coding your own solutions with Python. By the end, you'll have a clear picture of which method fits your needs the best.

Here are the methods we're going to cover:

  • Manually copy-pasting for smaller tasks.
  • Using Excel VBA for some automation.
  • Excel Power Query for built-in web extraction.
  • Leveraging Google Sheets functions for quick extraction.
  • Writing a Python script to scrape sites.
  • Pulling data using APIs for cleaner and structured information.

Pick what works for you and let's dive in!

Why would you want to scrape data into Excel?

Before jumping into the "how," let's talk about the "why." Why would anyone want to pull data from a website into Excel? Well, because Excel is a powerful tool for analyzing, organizing, and visualizing data. If you're dealing with any sort of structured information, it's much easier to work with it in a spreadsheet format.

Here are a few common reasons:

  1. Building reports and dashboards
    Imagine you need to monitor prices from multiple e-commerce sites, track stock performance, or keep an eye on some key market metrics. Scraping the data and pushing it into Excel means you can automate your reporting, create visual dashboards, and stay up-to-date without needing to copy and paste data manually.

  2. Data analysis and research
    For researchers, marketers, or anyone working with large datasets, having data in Excel makes it simple to run quick calculations, apply filters, and spot trends. It's a go-to for financial modeling, trend analysis, and basic data science.

  3. Data cleanup and transformation
    Excel has a lot of built-in features for cleaning and structuring data. If you scrape a messy table from a website, you can use Excel to tidy it up—remove duplicates, reformat columns, or even run simple formulas.

  4. Combining multiple data sources
    Sometimes you need to pull data from several websites and merge it into a single view. Excel's flexibility lets you bring together all this information, align it, and create consolidated reports.

  5. Easy export and sharing
    Once you've got your data in Excel, you can easily share it as a spreadsheet, convert it to CSV, or even export it to other tools. It's a universal format that most people are comfortable with.

Method 1: Manually copy-pasting data

Sometimes the simplest approach is all you need. If you're working with a small amount of data, manual copy-pasting can be the fastest way to get it into Excel. It's straightforward and doesn't require any special tools or setup.

When to use it:

  • Small data sets: If you only need to pull a few rows or a small table, it's quicker to just select and paste.
  • No repeated updates: Ideal when you don't expect the website data to change often.
  • Quick one-offs: If you just need to grab a snapshot of data without automation.

How to do it:

  1. Select the data: Open the website, and highlight the text or table you want to copy.
  2. Copy: Right-click and select Copy, or use the shortcut Ctrl + C.
  3. Open Excel: Go to the Excel sheet where you want to insert the data.
  4. Paste: Right-click in a cell and select Paste, or use Ctrl + V. Excel will usually recognize table formats and adjust the cells automatically.

Pros and cons:

Pros:

  • No setup, no coding.
  • Perfect for small, static data sets.
  • Great for capturing text or single tables quickly.

Cons:

  • Tedious for large or complex data.
  • Prone to human errors if you miss a cell or copy incorrectly.
  • Doesn't work well if the site's layout changes often.

When to move on:

If you find yourself repeating this process regularly or dealing with large tables, it's probably time to move to a more automated solution. Check out the next methods for a smoother workflow!

Method 2: Using Excel VBA (Visual Basic for Applications)

If you're looking to automate web scraping directly from within Excel, VBA can be a great way to go. VBA is a built-in programming language for Excel, and it lets you write custom scripts to automate repetitive tasks, like scraping a webpage and pulling data into your sheet.

Note: This method is specific to Microsoft Excel. LibreOffice and similar tools (like OpenOffice) do support macros, but the syntax and setup might be a bit different. So, if you're on one of those platforms, you'll have to adapt the script accordingly.

When to use it:

  • Automated web queries: If you want to run a script on demand and fetch updated data directly into your sheet.
  • Internal/personal use: Perfect for small projects or one-off tasks that don't require complex error handling.
  • Excel enthusiasts: If you're already familiar with VBA, it's a natural choice to extend your Excel capabilities.

How to set it up:

  1. Enable the developer tab:

    • Go to File > Options > Customize Ribbon and enable Developer.
  2. Open the VBA editor:

    • Hit Alt + F11 to open the VBA editor. Alternatively, switch to the Developer tab and click Visual Basic.
  3. Create a new module:

    • Right-click on VBAProject in the left panel and select Insert > Module.
  4. Write the VBA script:
    Paste the following code snippet as a basic example that pulls data from a webpage.

Sub ScrapeWebContent()
    Dim httpRequest As Object
    Set httpRequest = CreateObject("MSXML2.XMLHTTP")
    
    Dim url As String
    url = "https://apastyle.apa.org/style-grammar-guidelines/tables-figures/sample-tables" ' Replace with your target URL

    ' Send HTTP GET request
    httpRequest.Open "GET", url, False
    httpRequest.send

    ' Check if the request was successful
    If httpRequest.Status = 200 Then
        ' Declare and create the htmlDoc object
        Dim htmlDoc As Object
        Set htmlDoc = CreateObject("htmlfile")
        
        htmlDoc.body.innerHTML = httpRequest.responseText

        ' Example: Extract all hyperlinks
        Dim links As Object
        Set links = htmlDoc.getElementsByTagName("a")

        Dim i As Integer
        For i = 0 To links.Length - 1
            ThisWorkbook.Sheets(1).Cells(i + 1, 1).Value = links(i).href
            ThisWorkbook.Sheets(1).Cells(i + 1, 2).Value = links(i).innerText
        Next i
    Else
        MsgBox "Failed to retrieve the webpage. Status: " & httpRequest.Status & " - " & httpRequest.statusText, vbCritical
    End If

    Set httpRequest = Nothing
End Sub

How the script works:

Using VBA in Excel to scrape data

  • The script uses MSXML2.XMLHTTP to send an HTTP GET request.
  • Because the request is sent synchronously, the macro pauses until the full response arrives.
  • It then loads the response HTML into an htmlfile document, grabs all the links, and writes each link's URL and text into separate cells on the first sheet.

Pros and cons:

Pros:

  • Can be run entirely within Excel, no external software needed.
  • Good for pulling small data snippets quickly.
  • Reusable for similar tasks.

Cons:

  • Can break easily if the webpage structure changes.
  • Not compatible with non-Microsoft office suites (e.g., LibreOffice, OpenOffice).

When to move on:

If you're dealing with large datasets, complex navigation, or need more flexibility, consider using a dedicated programming language like Python. But if you just need a quick solution inside Excel, VBA will do the trick.

Method 3: Using Excel Power Query

Excel's Power Query is a solid tool for pulling in data from various sources, including web pages. It's built right into Excel, so you don't need to write any code, and it's much more flexible than simple copy-pasting. With Power Query, you can pull in tables from websites, transform the data as needed, and update the query anytime with just a few clicks.

When to use it:

  • Table data: Great for structured data like tables on HTML pages.
  • Automated updates: If the data on the webpage changes periodically, you can refresh your query to pull in the latest info without redoing the setup.
  • No coding required: If you want automation but aren't comfortable with VBA or scripting, Power Query is a nice middle ground.

How to set it up:

  1. Open Excel and go to the data tab:

    • Click on the Data tab in Excel and select From Web.
  2. Enter the URL:

    • Paste in the URL of the webpage you want to scrape and click OK.
  3. Select the table:

    • Excel will show a preview of the tables found on that page. Choose the one you want to import.
  4. Transform data if needed:

    • The Power Query Editor lets you clean and format the data before importing it into Excel. You can remove columns, apply filters, or modify the data type.
  5. Load data into Excel:

    • Click Close & Load to pull the data directly into your spreadsheet.

Using Excel Power Query to scrape a website

How it works:

Power Query connects to the web page, pulls in the HTML content, and identifies the tables within the page. It gives you an interface to select which table(s) to extract, and you can even preview and tweak the data before loading it into Excel. Once set up, you can refresh the query whenever you want updated data.

Pros and cons:

Pros:

  • No programming needed, very beginner-friendly.
  • Can handle most simple table extractions.
  • Built-in tools for data cleaning and transformation.
  • Easily refreshed to keep your data up-to-date.

Cons:

  • Limited to structured data (i.e., tables).
  • Doesn't work well with dynamic content like JavaScript-generated tables.
  • Some web pages may block the query, depending on permissions and site configurations.
  • LibreOffice and other similar tools don't have a built-in Power Query equivalent. If you're on those platforms, you'll need to rely on more manual methods or external tools.

When to move on:

If you find that Power Query is unable to extract the data (e.g., for more complex sites with dynamic elements), or you need more control over the scraping process, a script-based solution like Python might be more effective.

Method 4: Using Google Sheets functions

Google Sheets can pull data from a website directly into your spreadsheet using a few built-in functions like IMPORTHTML, IMPORTXML, and IMPORTDATA. This approach is a solid option for grabbing structured data like tables or lists, and once the data is in Google Sheets, you can easily convert it to Excel if needed. Best part? No coding required.

When to use it:

  • For simple table data: Ideal for quickly pulling in tables or lists without any complex setup.
  • Small data sets: Great for lightweight extraction when you don't need advanced transformations.
  • Public or accessible data: Works best when the data is not blocked by login pages or complex JavaScript.

How to set it up:

  1. Open a Google Sheet:

Go to Google Sheets and create a new spreadsheet.

  2. Use the IMPORTHTML function:

Use the following syntax:

=IMPORTHTML(url, query, index)

For example:

=IMPORTHTML("https://example.com/sample-table", "table", 1)

This pulls the first table from the given URL.

Using the IMPORTHTML function in Google Sheets

  3. Use the IMPORTXML function:

Use the following syntax:

=IMPORTXML(url, xpath_query)

For example:

=IMPORTXML("https://example.com/sample-page", "//h1")

This pulls all <h1> headers from the page using an XPath query.

  4. Use the IMPORTDATA function:

Syntax:

=IMPORTDATA(url)

Use this if the data is available as a downloadable .csv or .txt file. For example:

=IMPORTDATA("https://example.com/data.csv")

How it works:

Each function sends a request to the given URL and extracts data based on the specified parameters:

  • IMPORTHTML: Fetches HTML tables or lists (<table> or <ul>/<ol>).
  • IMPORTXML: Uses XPath queries to target specific elements within the HTML structure.
  • IMPORTDATA: Downloads and imports a .csv or .txt file directly into the sheet.

Pros and cons:

Pros:

  • No coding skills needed—just a few straightforward formulas.
  • Automatically updates when the source data changes.
  • Easily shareable or exportable to Excel with Google Sheets' built-in options.

Cons:

  • Limited to public data; won't work with login-protected or dynamically rendered content.
  • XPath queries can be difficult to set up if the page structure is complex.
  • May break if the site structure changes or if the data is behind a paywall.
  • This method is specific to Google Sheets, so getting the data into Excel takes one extra step: once the import finishes, go to File > Download > Microsoft Excel (.xlsx) to save the sheet as an Excel file.

When to move on:

If the data you need is behind a login, requires pagination handling, or is dynamically generated with JavaScript, you'll need to look at more robust options like Python scripts.

Method 5: Scraping with Python

If you want full control over your web scraping and don't mind writing a bit of code, Python is your best bet. It's highly flexible, has plenty of libraries to make the job easier, and works well for both simple and complex projects.

For this section, we'll walk through a basic example using ScrapingBee's Python client to fetch data and BeautifulSoup to parse it. By the end, we'll save the extracted data into an Excel file using pandas. ScrapingBee handles a lot of the challenges you'd normally face with basic HTTP requests, such as bypassing captchas, using premium proxies, and handling JavaScript rendering.

When to use it:

  • Complex data extraction: If the website uses dynamic content, has nested tables, or you need to pull specific elements.
  • High flexibility: When other tools don't offer the level of control you need.
  • Automated scripts: For scenarios where you want to run the scraping on a schedule or from a server.

Setting up the environment

Before we jump into the code, there's one big challenge to consider: scraping without getting blocked. Websites often have measures in place to detect and block scrapers, which can quickly become a headache if you're just trying to pull some data. We've talked about these challenges in more detail in a previous article, but here's a simple way to bypass a lot of these issues: using ScrapingBee's premium proxies.

First, register on ScrapingBee for free and grab your API key. We offer 1,000 free credits to get started, with each request costing around 25 credits depending on your configuration. Once you've signed up and logged in, head over to your dashboard and copy your unique API token.

Now, let's update our script to use ScrapingBee. You won't need to deal with setting up custom proxies or handling captchas—ScrapingBee will handle it for you in the background.

Make sure you have Python installed and set up. Then install the required libraries:

pip install scrapingbee beautifulsoup4 pandas openpyxl

Writing the script

Let's say you want to scrape a sample website with some table data. Here's a simple script to get you started:

from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Set up ScrapingBee client with your API key
client = ScrapingBeeClient(api_key="YOUR_SCRAPINGBEE_API_KEY")

# Step 2: Fetch the webpage content using ScrapingBee
url = "https://example.com/sample-table"
response = client.get(
    url,
    params={
        'premium_proxy': True,  # Use premium proxies for better success rate
        'block_resources': True,  # Block images and CSS to speed up the request
        'country_code': 'us',  # Request from a specific country
    }
)

# Step 3: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Step 4: Find the table and extract rows
table = soup.find('table')  # Assuming there is a <table> element in the HTML
rows = table.find_all('tr')

# Step 5: Extract the data into a list of rows
data = []
for row in rows:
    cols = row.find_all('td')
    if cols:  # skip header rows that only contain <th> cells
        data.append([col.text.strip() for col in cols])

# Step 6: Create a DataFrame and save to Excel
df = pd.DataFrame(data, columns=["Column1", "Column2", "Column3"])  # Adjust column names as needed
df.to_excel("output.xlsx", index=False)
print("Data successfully scraped and saved to 'output.xlsx'.")

How the script works:

  • Set up ScrapingBee: Initialize the ScrapingBee client with your API key.
  • Fetch the web page: Use client.get() to send the request, while leveraging advanced parameters like premium proxies and resource blocking.
  • Parse with BeautifulSoup: BeautifulSoup helps us navigate the HTML structure and locate the desired table.
  • Extract data: We go through each row (<tr>) in the table, skip header rows that contain no <td> cells, and grab the text from each data cell.
  • Create an Excel file: With pandas, we load the data into a DataFrame and export it to an Excel file (for simple tables there's also a shortcut, shown in the sketch below).
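
By the way, when the page is nothing more than a plain HTML table, pandas can do the row-by-row extraction for you with read_html. Here's a minimal sketch of that shortcut, assuming the same placeholder URL and API key as above and that a parser such as lxml or html5lib is installed alongside pandas:

from io import StringIO

from scrapingbee import ScrapingBeeClient
import pandas as pd

client = ScrapingBeeClient(api_key="YOUR_SCRAPINGBEE_API_KEY")
response = client.get("https://example.com/sample-table")  # placeholder URL from the example above

# read_html parses every <table> on the page and returns a list of DataFrames
tables = pd.read_html(StringIO(response.text))
tables[0].to_excel("output_readhtml.xlsx", index=False)

It's less flexible than the BeautifulSoup loop (you can't easily cherry-pick cells or clean values on the way in), so treat it as a convenience for simple tables rather than a replacement.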

Pros and cons:

Pros:

  • High flexibility and control over what data to extract.
  • Can handle more complex scenarios (nested elements, multiple pages, etc.).
  • ScrapingBee takes care of tough challenges like captchas and IP blocking.
  • This method is entirely script-based, so it doesn't depend on the office suite you're using. As long as you can run Python, you're good to go.

Cons:

  • Requires coding skills and some understanding of HTML.
  • Some advanced features (like premium proxies) consume more credits.
  • Needs a bit more setup compared to built-in Excel tools.

When to move on:

If you need to scrape a lot of data on a regular basis, or want to avoid handling dynamic content manually, consider a browser-automation framework like Selenium, or explore the additional options in ScrapingBee's API, such as JavaScript rendering (there's a quick sketch below).
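
As a rough illustration of those API options, here's what switching on JavaScript rendering might look like with the same client. Treat this as a sketch rather than a recipe: render_js and wait are ScrapingBee request parameters as documented at the time of writing, and the URL is just a placeholder, so check the current API docs before relying on it.

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key="YOUR_SCRAPINGBEE_API_KEY")

# Ask ScrapingBee to execute the page's JavaScript in a headless browser
# and give dynamic content a moment to appear before returning the HTML.
response = client.get(
    "https://example.com/dynamic-table",  # placeholder URL
    params={
        'render_js': True,  # render the page in a headless browser
        'wait': 2000,       # wait 2000 ms before capturing the HTML
    }
)

print(response.status_code)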

Method 6: Using APIs instead of scraping

Before diving into scraping, it's always worth checking if the website has an API (Application Programming Interface). APIs provide a direct way to access data in a structured format, which is far more reliable and efficient than scraping HTML pages. Using an API lets you avoid messy HTML parsing, dynamic content, and anti-scraping mechanisms.

When to use it:

  • Available API: If the site offers an API, it's almost always the best option. Check their documentation or developer page for details.
  • Cleaner and faster: APIs are built to deliver data in a streamlined way, so you get exactly what you need without dealing with the overhead of HTML.
  • Avoid legal issues: APIs are designed for data sharing, so you're less likely to encounter legal problems compared to scraping.

Finding an API

  1. Check the website's footer or documentation:
    Websites like Reddit, Twitter, and many news sites have public APIs. Look for a "Developers" or "API" section at the bottom of the page.

  2. Use browser DevTools:
    Open your browser's DevTools (F12 or Ctrl + Shift + I), go to the Network tab, and look for calls made to endpoints like /api/ as you navigate the site.

  3. Search the web:
    Try searching [website name] API or public API for [topic] to see if there's already an open source or public API available.

Writing a Python script to pull data from an API

Here's how to use ScrapingBee's Python client to pull data from an API and save it to Excel. Let's say we're using a mock API endpoint to grab user data:

  1. Install required libraries:
    Make sure you have scrapingbee, pandas, and openpyxl installed:

pip install scrapingbee pandas openpyxl

  2. Write the script:

from scrapingbee import ScrapingBeeClient
import pandas as pd

# Step 1: Set up ScrapingBee client
client = ScrapingBeeClient(api_key="YOUR_SCRAPINGBEE_API_KEY")

# Step 2: Define the API endpoint
url = "https://jsonplaceholder.typicode.com/users"  # Replace with your target API

# Step 3: Fetch the data
response = client.get(url)

# Step 4: Check if the request was successful
if response.status_code == 200:
    data = response.json()  # Parse the JSON response (a list of user records here)

    # Step 5: Create a DataFrame
    df = pd.DataFrame(data)

    # Step 6: Save the DataFrame to an Excel file
    df.to_excel("api_output.xlsx", index=False)
    print("Data successfully pulled from API and saved to 'api_output.xlsx'.")
else:
    print(f"Failed to retrieve data. HTTP Status code: {response.status_code}")

How the script works:

  • Set up ScrapingBee: Initialize the ScrapingBee client with your API key.
  • Define the API endpoint: Specify the URL of the API you want to pull data from. In this example, we use a mock API https://jsonplaceholder.typicode.com/users.
  • Fetch the data: Use the ScrapingBee client to send a GET request to the API.
  • Check the response: If the response code is 200 (success), extract the data using .json().
  • Convert to Excel: Load the data into a pandas DataFrame and save it as an Excel file (see the sketch below for flattening the nested fields first).
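
One thing to note about this particular mock API: each user record contains nested objects (address and company), and those would land in the Excel cells as raw dictionaries. Here's a minimal sketch of flattening them with pandas' json_normalize, assuming the same endpoint and API key as in the script above:

from scrapingbee import ScrapingBeeClient
import pandas as pd

client = ScrapingBeeClient(api_key="YOUR_SCRAPINGBEE_API_KEY")
response = client.get("https://jsonplaceholder.typicode.com/users")

if response.status_code == 200:
    # json_normalize expands nested objects into dotted columns,
    # e.g. "address.city" and "company.name"
    flat_df = pd.json_normalize(response.json())
    flat_df.to_excel("api_output_flat.xlsx", index=False)

If the API you end up using returns flat records, the plain DataFrame from the main script is all you need.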

Pros and cons:

Pros:

  • Reliable and structured data format (usually JSON).
  • Faster and cleaner compared to scraping.
  • Avoids dealing with messy HTML or anti-scraping techniques.

Cons:

  • Limited to what the API offers (might not provide all the data you want).
  • Some APIs have rate limits or require authentication keys.
  • Not every website has a public API available.

When to use each approach:

  • Use APIs when they're available. They save time, provide clean data, and won't break if the site's layout changes.
  • Use scraping if the data you need isn't exposed through an API or if the API is restricted (e.g., requiring paid access).

If you're scraping a website regularly, it's worth checking if you can get access to an API (even a private one) to simplify your workflow and ensure long-term stability.

Conclusion

Choosing the right method to pull website data into Excel depends on your needs and technical comfort level. If you're dealing with small amounts of data and need a quick solution, manual copy-pasting or using Google Sheets functions can get the job done. For those who want to automate things within Excel, VBA and Power Query offer decent options without having to leave the spreadsheet environment.

But when you need more flexibility and control—like handling dynamic pages or pulling data from multiple sources—a scripting approach using Python or APIs is the way to go. Python scripts are highly versatile, allowing you to tackle complex scenarios, handle large datasets, and customize your extraction process with various libraries.

The key takeaway? Start with the simplest method that meets your requirements, and then move to more advanced tools if needed. In the end, what really matters is getting clean, structured data that you can analyze, report, or share seamlessly.

Happy scraping!

Ilya Krukowski

Ilya is an IT tutor and author, web developer, and ex-Microsoft/Cisco specialist. His primary programming languages are Ruby, JavaScript, Python, and Elixir. He enjoys coding, teaching people and learning new things. In his free time he writes educational posts, participates in OpenSource projects, tweets, goes in for sports and plays music.