How to scrape all text from a website for LLM training

03 July 2024 | 14 min read

Artificial Intelligence (AI) is rapidly becoming a part of everyday life, and with it, the demand for training custom models has increased. Many people these days would like to train their very own... AI, not dragon, duh! One crucial step in training a large language model (LLM) is gathering a significant amount of text data. In this article, I'll show you how to collect text data from all pages of a website using web scraping techniques. We'll build a custom Python script to automate this process, making it easy to gather the data you need to train your model.

The source code can be found on GitHub.

Why train your own model?

But the first question to answer is why would someone want to train their own model? Here are a few reasons:

  1. Customization: Pre-made models might not meet some specific needs. Training your own model allows for customization tailored to particular tasks, industries, or datasets.
  2. Privacy: For sensitive applications, using a custom model ensures that your data remains private and secure. Unfortunately, there have already been some data leaks related to AI usage, so you should be extra careful when feeding potentially secret data to the engine.
  3. Performance: Custom models can be optimized for better performance on specific types of data or queries, outperforming general-purpose models in those areas.

How long does it take to train a model?

Training an LLM can indeed be time-consuming and resource-intensive. The duration depends on several factors, including the size of the model, the amount of data, and the computational resources available. For example, training a small to medium-sized model might take up to several days or even weeks on high-performance GPUs. Larger and more advanced models, like GPT-3, can take months and require powerful distributed computing setups.

How much data is needed?

The amount of data required varies based on the complexity and size of the model. Here are some rough estimates:

  • For small models, a few million words (equivalent to 10-20 "Prisoner of Azkaban" books) might be sufficient.
  • Medium-sized models typically need tens of millions of words.
  • Large models require hundreds of millions to billions of words.
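
To get a rough sense of where your own corpus falls on this scale, a simple word count is usually enough. Here's a minimal sketch (the extracted_texts.txt filename is just an assumption matching the file we'll produce later in this article):

def count_words(path="extracted_texts.txt"):
    """Count whitespace-separated words in a plain-text corpus file."""
    total = 0
    with open(path, encoding="utf-8") as file:
        for line in file:
            total += len(line.split())
    return total

print(f"Corpus size: {count_words():,} words")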

Tools for training a model

Training your own language model requires not only powerful hardware but also specialized software. Here are some of the tools and frameworks you might use:

  1. TensorFlow: An open-source machine learning framework developed by Google. TensorFlow is widely used for building and training neural networks.

  2. Vertex AI: A machine learning platform that lets you train and deploy ML models and AI applications, and customize large language models for use in your AI-powered applications.

  3. PyTorch: An open-source deep learning platform developed by Facebook's AI Research lab. PyTorch is known for its dynamic computation graph and ease of use, making it popular among researchers and developers.

  4. Hugging Face Transformers: A library that provides state-of-the-art pre-trained models and tools for natural language processing. It's built on top of TensorFlow and PyTorch, making it easy to use with either framework.

  5. CUDA: A parallel computing platform and application programming interface model created by NVIDIA. CUDA allows developers to use NVIDIA GPUs for general purpose processing, which is essential for training large models efficiently.

  6. TPUs (Tensor Processing Units): Custom-developed application-specific integrated circuits by Google specifically for accelerating machine learning workloads. TPUs can significantly speed up the training process for large models.

So, as you can see, there's a lot to pick from!
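
To give you a feel for where the scraped text will eventually end up, here's a hedged sketch of loading a plain-text corpus with the Hugging Face datasets library and tokenizing it. This is not a full training recipe: the extracted_texts.txt filename (the file we'll produce later) and the gpt2 tokenizer are just placeholders for illustration.

# Requires: pip install datasets transformers
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the scraped corpus as a dataset with one example per line.
dataset = load_dataset("text", data_files={"train": "extracted_texts.txt"})
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    # Convert raw text into token IDs the model can consume.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
print(tokenized)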

Preparing to write a custom script to load training data

Prerequisites

Okay, so now we understand why one might want to dig into model training in the first place. However, the next important question is: where to get the training data? One potential solution is to fetch this data from the website related to the task you're solving. That's why we will write a custom Python script to achieve this.

But before we dive into the code, let's briefly cover some prerequisites. I expect that you have:

  • A basic knowledge of the Python programming language.
  • Already installed Python 3 on your machine.
  • Installed your favorite code editor or IDE.
  • An internet connection to install necessary libraries.

This is basically it!

Setting up your environment

When you're ready to proceed, create a new folder for your Python project. Inside it, create two files:

  • find_links.py — inside we'll write a simple script to find all website URLs.
  • extract_data.py — this file will be the main star today as it'll contain a script to download website data.

I'm going to use Poetry to manage the project dependencies (if you'd rather not use it, a plain pip command is provided below). To initialize the project, run:

poetry init

Just follow the wizard to help you initialize the project. Next, open up pyproject.toml and add a few dependencies under tool.poetry.dependencies:

python = "^3.12" # adjust according to your Python version
beautifulsoup4 = "^4.12"
requests = "^2.31"
lxml = "^5.1"
scrapingbee = "^2.0"
pandas = "^2.2.2"

Finally, run:

poetry install --no-root

If you prefer to avoid Poetry, simply install dependencies with pip:

pip install beautifulsoup4 requests lxml scrapingbee pandas

Let me briefly cover the tools we're going to utilize today:

  • requests — library to make HTTP requests easily.
  • beautifulsoup4 — enables us to parse HTML content.
  • lxml — Python bindings for the libxml2 and libxslt libraries; BeautifulSoup relies on it to parse XML (our sitemaps).
  • pandas — data analysis and manipulation tool.
  • scrapingbee — ScrapingBee Python client will be used to easily hook up proxies to avoid getting blocked. It also has many other goodies including custom JS manipulation, page screenshotting, and so on.

Alright, at this point we're ready to go!

Finding all website pages

Now, before we can download any data, there's another important task to solve: we need to find out which pages the website in question actually contains! This can be a challenge of its own, so I have prepared a dedicated article for you that lists a few potential solutions and things to watch out for.

Today we're going to employ a relatively solid and, at the same time, not so complex approach. Specifically, I suggest we scan the website's sitemap to extract links from there. Therefore, let's open the find_links.py file and import all the necessary dependencies.

import requests
from bs4 import BeautifulSoup as Soup
import os
import csv

We will save the URLs into a CSV file containing the actual link, last modification date, and priority (if you ever require this additional data). So, add a constant and, at the very end of the file, an example function call (it has to stay below the function definition we're about to write):

# Constants for the attributes to be extracted from the sitemap.
ATTRS = ["loc", "lastmod", "priority"]

# Example usage
parse_sitemap("https://yourwebsite/sitemap.xml")

Now, let's code the main function:

def parse_sitemap(url, csv_filename="urls.csv"):
    """Parse the sitemap at the given URL and append the data to a CSV file."""
    # Return False if the URL is not provided.
    if not url:
        return False

    # Attempt to get the content from the URL.
    response = requests.get(url)
    # Return False if the response status code is not 200 (OK).
    if response.status_code != 200:
        return False

    # Parse the XML content of the response.
    soup = Soup(response.content, "xml")

    # Recursively parse nested sitemaps.
    for sitemap in soup.find_all("sitemap"):
        loc = sitemap.find("loc").text
        parse_sitemap(loc, csv_filename)

    # Define the root directory for saving the CSV file.
    root = os.path.dirname(os.path.abspath(__file__))

    # Find all URL entries in the sitemap.
    urls = soup.find_all("url")

    rows = []
    for url_entry in urls:
        row = []
        for attr in ATTRS:
            found_attr = url_entry.find(attr)
            # Use "n/a" if the attribute is not found, otherwise get its text.
            row.append(found_attr.text if found_attr else "n/a")
        rows.append(row)

    # Check if the file already exists
    file_exists = os.path.isfile(os.path.join(root, csv_filename))

    # Append the data to the CSV file.
    with open(os.path.join(root, csv_filename), "a+", newline="") as csvfile:
        writer = csv.writer(csvfile)
        # Write the header only if the file doesn't exist
        if not file_exists:
            writer.writerow(ATTRS)
        writer.writerows(rows)

In fact, this code is not too complex:

  1. We fetch the given URL.
  2. Parse the response with Beautiful Soup.
  3. Try to find any nested sitemaps and recursively process those.
  4. Then in every sitemap find the url tag.
  5. Fetch the required attributes from every found URL.
  6. Save the data to the CSV file.

That's it! If you have issues processing a sitemap because your request is being blocked (the website is trying to protect itself from crawlers), you can take advantage of the ScrapingBee client, as I'll show in the section below.
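
As a quick preview, here's a hedged sketch of what that swap could look like: fetch the sitemap through the ScrapingBee client instead of plain requests (assuming you already have an API key, which we'll set up in the proxy section):

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key="YOUR_TOKEN")

def fetch_sitemap(url):
    """Return the raw sitemap XML, or None if the request fails."""
    # Sitemaps are plain XML, so JavaScript rendering can be switched off to keep requests cheap.
    response = client.get(url, params={"render_js": False})
    if response.status_code != 200:
        return None
    return response.content

You would then pass the returned content to BeautifulSoup exactly as before.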

To run the script, use the following command:

poetry run python find_links.py

Fetching website data from every page

Alright, at this point you should have a urls.csv file with all the website links, ready to be crawled. So, open up the extract_data.py file and let's get down to business.

Simple script to load website data

Let's code the first version of our script now. Import the necessary libraries:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from scrapingbee import ScrapingBeeClient

Next, let's use Pandas to read our CSV file, get the contents of the loc column (which contains our URLs), and convert it to a list. We'll also prepare a variable to hold all the fetched data:

df = pd.read_csv('urls.csv')
urls = df['loc'].tolist()

all_texts = []

Then our job is to iterate over the URLs, make a request, and process each page's data:

for url in urls:
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        text = soup.get_text(separator='\n', strip=True)
        all_texts.append(text)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")

So, we make a request, process the HTML with BeautifulSoup, and then simply use the get_text() function to extract the page text in a readable format. One caveat: get_text() also returns the contents of <script> and <style> tags, so if your pages contain a lot of inline JavaScript or CSS, you may want to strip those elements first.
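
If that becomes a problem, here's a small hedged sketch of one way to drop those elements before extracting the text (a helper you could call instead of the plain get_text() line):

def extract_clean_text(soup):
    """Remove script/style/noscript elements, then extract readable text."""
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)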

Finally, let's save the data to a UTF8-encoded text file:

with open('extracted_texts.txt', 'w', encoding='utf-8') as file:
    for text in all_texts:
        file.write(text + '\n\n')

Here's the first version of the script:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Load URLs from CSV
df = pd.read_csv('urls.csv')
urls = df['loc'].tolist()

all_texts = []

for url in urls:
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        text = soup.get_text(separator='\n', strip=True)
        all_texts.append(text)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")

# Save extracted texts to a file, ensuring newlines are correctly written
with open('extracted_texts.txt', 'w', encoding='utf-8') as file:
    for text in all_texts:
        file.write(text + '\n\n')

print("Text extraction completed successfully!")

In fact, it already works quite nicely. However, you might encounter a typical issue: your requests might be blocked as websites are usually not too fond of crawlers. Thus, let me explain how to alleviate this problem.

Using proxies to avoid getting blocked

As we've already discussed in one of the previous articles, scraping without getting blocked can be a complex task. You can find detailed information in that article, but here I'll show you a neat trick to take advantage of premium proxies easily — this will help you fly under the radar.

First of all, register on ScrapingBee for free. You'll receive 1000 credits as a gift (each HTTP request costs approximately 25 credits). After logging in, navigate to your dashboard and copy your API token.

You'll need it to send the requests.

Now, return to your Python script and set up a ScrapingBee client:

# your imports...

client = ScrapingBeeClient(api_key="YOUR_TOKEN")

# other code...
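
As a side note, you may prefer not to hard-code the token. Here's a minimal sketch that reads it from an environment variable instead (the SCRAPINGBEE_API_KEY name is just my pick, not something the library requires):

import os

from scrapingbee import ScrapingBeeClient

# Read the API key from the environment so it doesn't end up in version control.
client = ScrapingBeeClient(api_key=os.environ["SCRAPINGBEE_API_KEY"])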

Then, find the line where you send a request:

response = requests.get(url)

Replace it with the following:

response = client.get(
    url,
    params={
        'premium_proxy': True,    # Use premium proxies for tough sites
        'country_code': 'gb',
        'block_resources': True,  # Block images and CSS to speed up loading
        'device': 'desktop',
    }
)

This is it! Now ScrapingBee will do all the heavy lifting for you, including the proxy setup, and you can concentrate on fetching the data. To learn more about other features, refer to the ScrapingBee Python client documentation.

Using multiple threads

Visiting one page after another sequentially is far from optimal, especially if you have hundreds of pages to process. Therefore, let's enhance our script by introducing threads. This way we can process multiple pages at once, greatly speeding up the overall process.

Let's revisit the imports:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from scrapingbee import ScrapingBeeClient
from concurrent.futures import ThreadPoolExecutor, as_completed # <== add this

The initial setup is going to be the same:

# Load URLs from CSV
df = pd.read_csv("urls.csv")
urls = df["loc"].tolist()
client = ScrapingBeeClient(api_key="YOUR_TOKEN")

Next, code the main function:

def main():
    all_texts = []

    # Use ThreadPoolExecutor to handle threading
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Submit all URL processing tasks to the executor
        futures = {executor.submit(extract_text_from_url, url): url for url in urls}

        # Collect results as they complete
        for future in as_completed(futures):
            result = future.result()
            all_texts.append(result)

    # Save extracted texts to a file, ensuring newlines are correctly written
    with open("extracted_texts.txt", "w", encoding="utf-8") as file:
        for text in all_texts:
            file.write(text + "\n\n")

    print("Text extraction completed successfully!")


if __name__ == "__main__":
    main()

Here we spin up a pool of up to five workers to process pages concurrently. For every URL we submit the extract_text_from_url function to the executor (we'll write this function in a moment), then collect the results as they complete and write all the data to the text file as before.

The extract_text_from_url() function contains the code that we've already seen:

def extract_text_from_url(url):
    """Fetch and extract text from a single URL."""
    print(f"Processing {url}")
    try:
        response = client.get(
            url,
            params={
                'premium_proxy': True,  # Use premium proxies for tough sites
                'country_code': 'gb',
                "block_resources": True,  # Block images and CSS to speed up loading
                'device': 'desktop',
            }
        )

        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")

        # Extract text, ensuring newlines are preserved
        text = soup.get_text(separator="\n", strip=True)
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return ""

Nice! Now the overall process will be performed much faster.

Implementing retries for additional robustness

When scraping a website, you might encounter annoying intermittent issues like temporary server errors or network problems. To handle these gracefully, we can implement retries to ensure that failed requests are, well, retried a few times before giving up.

There can be various approaches to solving this task, but I suggest we use the tenacity library to handle retries. Therefore, add it to your pyproject.toml file:

tenacity = "^8.4.1"

Or install the library with pip:

pip install tenacity

Next, import the library in your Python script:

from tenacity import retry, stop_after_attempt, wait_exponential

Now, the main trick is to extract your response fetching logic into a separate function and wrap it with a special decorator:

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
def fetch_with_retry(url):
    """Fetch the URL with retries."""
    response = client.get(
        url,
        params={
            'premium_proxy': True,  # Use premium proxies for tough sites
            'country_code': 'gb',
            "block_resources": True,  # Block images and CSS to speed up loading
            'device': 'desktop',
        }
    )
    response.raise_for_status()
    return response

We use the @retry decorator from tenacity to wrap the fetch_with_retry function, specifying that it should stop after 3 attempts and use exponential backoff between retries. Note the reraise=True flag: without it, tenacity raises its own RetryError once the attempts are exhausted, which our except requests.exceptions.RequestException clause wouldn't catch; with it, the original exception is re-raised and handled as before.

Then update your extract_text_from_url() function:

def extract_text_from_url(url):
    """Fetch and extract text from a single URL."""
    print(f"Processing {url}")
    try:
        response = fetch_with_retry(url) # <=== modify this
        soup = BeautifulSoup(response.content, "html.parser")

        text = soup.get_text(separator="\n", strip=True)
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return ""

All other code is left intact. Here's the new version of the script:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from scrapingbee import ScrapingBeeClient
from concurrent.futures import ThreadPoolExecutor, as_completed
from tenacity import retry, stop_after_attempt, wait_exponential

# Load URLs from CSV
df = pd.read_csv("urls.csv")
urls = df["loc"].tolist()
client = ScrapingBeeClient(api_key="YOUR_TOKEN")


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
def fetch_with_retry(url):
    """Fetch the URL with retries."""
    response = client.get(
        url,
        params={
            'premium_proxy': True,  # Use premium proxies for tough sites
            'country_code': 'gb',
            "block_resources": True,  # Block images and CSS to speed up loading
            'device': 'desktop',
        }
    )
    response.raise_for_status()
    return response


def extract_text_from_url(url):
    """Fetch and extract text from a single URL."""
    print(f"Processing {url}")
    try:
        response = fetch_with_retry(url)
        soup = BeautifulSoup(response.content, "html.parser")

        text = soup.get_text(separator="\n", strip=True)
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return ""


def main():
    all_texts = []

    # Use ThreadPoolExecutor to handle threading
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Submit all URL processing tasks to the executor
        futures = {executor.submit(extract_text_from_url, url): url for url in urls}

        # Collect results as they complete
        for future in as_completed(futures):
            result = future.result()
            all_texts.append(result)

    # Save extracted texts to a file, ensuring newlines are correctly written
    with open("extracted_texts.txt", "w", encoding="utf-8") as file:
        for text in all_texts:
            file.write(text + "\n\n")

    print("Text extraction completed successfully!")


if __name__ == "__main__":
    main()

Great job, now we've made things more robust!

Conclusion

So, in this article, we've discussed how to scrape text from all pages of a website to collect data for training a language model. We covered everything from finding all the website pages using sitemaps, fetching and parsing HTML content, to using proxies and multithreading to make the process more efficient.

With the scraped data in hand, you can proceed to training your model. However, remember that data quality is crucial for building a high-performing model, so you might want to further clean your data.
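
For instance, here's a hedged sketch of a very basic cleaning pass over the extracted file: it normalizes whitespace and drops exact duplicate lines (real pipelines typically go much further, e.g. language filtering and near-duplicate detection):

import re

def clean_corpus(in_path="extracted_texts.txt", out_path="cleaned_texts.txt"):
    """Normalize whitespace and drop exact duplicate lines from the corpus."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = re.sub(r"\s+", " ", line).strip()
            if line and line not in seen:
                seen.add(line)
                dst.write(line + "\n")

clean_corpus()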

Feel free to refer to the source code on GitHub for the final version of the scripts. You might also be interested in learning how to employ AI to efficiently scrape website data.

Thank you for staying with me. Happy scraping, and good luck on your model training journey!

Ilya Krukowski

Ilya is an IT tutor and author, web developer, and ex-Microsoft/Cisco specialist. His primary programming languages are Ruby, JavaScript, Python, and Elixir. He enjoys coding, teaching people and learning new things. In his free time he writes educational posts, participates in OpenSource projects, tweets, goes in for sports and plays music.