Scrapegraph AI Tutorial: Scrape websites easily with LLaMA AI

30 May 2024 | 19 min read

Artificial intelligence is everywhere in tech these days, and it's wild how it's become a go-to tool for things like web scraping. Let's dive into how Scrapegraph AI can totally simplify your scraping game: just tell it what you need in plain English and watch it work its magic.

I'm going to show you how to get Scrapegraph AI up and running, how to set up a language model, how to process JSON, scrape websites, use different AI models, and even turn your data into audio. Sounds like a lot, but it's easier than you think, and I'll walk you through it step by step.

Why use AI for scraping?

Web scraping is super useful for pulling data from websites, but it comes with its own headaches. Websites change all the time: new layouts, new designs, and that breaks traditional scraping scripts that rely on things staying the same. Developers often end up reworking their scripts just to keep up, turning what should ideally be a one-time job into a never-ending chore.

That's where AI steps in and changes the game. Instead of making complicated rules to find and grab specific elements, you can let AI figure out the pages for you, kind of like how a human would. These AI models are built to get the gist of a site's content and structure on the fly. This flexibility means you spend less time fixing scripts and more time on what matters—working with the data you need.

This not only saves you a ton of time but also makes your scraping setups more robust and flexible, ready to handle whatever the web throws your way.

Prerequisites

Before we get started with setting up and using Scrapegraph AI, here are a few prerequisites I'm assuming you already have:

  • A basic understanding of Python – you don't need to be an expert, but you should be able to write some basic scripts.
  • Python 3 installed on your local computer – make sure you have an up-to-date version.
  • An operating system that's either Mac, Linux, or Windows 10+ – our instructions will focus on these systems.
  • Access to a terminal and a code editor – you'll need these tools to write your scripts and run them.

You can find the source code for this tutorial on GitHub at this link.

Installing Ollama

The first step to working with Scrapegraph AI is installing Ollama. Ollama simplifies the process of downloading, setting up, and running large language models, which serve as the core intelligence for the AI tools we'll be using.

Here's how to install Ollama:

  1. Visit the official Ollama website.
  2. Select the version that corresponds with your operating system.
  3. Follow the installation guidelines provided on their website.


It's important to note that installing Ollama doesn't automatically include a language model. You'll need to select and download a language model that fits your specific needs separately.

But which model should you choose? Selecting the appropriate model depends on your specific needs, but here are some popular options:

  • LLaMA — ideal for general purposes, LLaMA is versatile and cost-effective. It excels in tasks involving understanding and generating text that sounds human-like.
  • Mistral — best suited for handling large datasets or integrating smoothly with existing data pipelines.
  • Phi — recommended for high-precision tasks, such as extracting detailed financial data.
  • Gemma — provides strong performance in multilingual contexts, making it suitable for scraping data from sources in various languages.
  • Gemini — excellent for real-time data processing, particularly useful for sites that update frequently.

You might need to experiment to find the best model for your needs. For some performance comparisons, refer to the benchmarks provided in the Scrapegraph repository.

If uncertain, starting with LLaMA is probably your best bet. But then comes the choice: should you opt for 8B or 70B? What does the 'B' signify? Let's clarify that.

Parameters in AI Models

The 'B' in model names like '8B' or '70B' stands for billion, indicating the number of parameters the model supports: 8 billion and 70 billion, respectively.

Parameters in an AI model are akin to the knobs and dials of the model; they are learned from the training data and represent the information the model uses to make decisions or predictions. The significance of these parameters includes:

  • Capacity to learn — more parameters typically allow a model to learn more complex patterns, akin to a more detailed map for better navigation.
  • Performance — a higher count of parameters often improves performance on complex tasks, as the model can make finer distinctions between types of input.
  • Computational requirements — more parameters mean higher demands on computational power, affecting processing, memory, and storage, especially if you intend to run these models locally.
  • Cost — operating larger models can be costlier, not just in terms of hardware but also operational costs like electricity and cooling, particularly with high data volumes.

For web scraping tasks involving large data volumes or complex content, a model with more parameters might be more effective, but it also demands robust hardware and potentially higher operational costs.

For instance, LLaMA 3 8B has a file size of around 4.7GB and needs at least 8GB of RAM, while LLaMA 3 70B is significantly larger (around 40GB) and requires over 32GB of RAM.

Ultimately, choosing a larger model isn't always better; it's about balancing your specific needs and available resources. More parameters mean handling more complexity but also require more from your infrastructure.

Setting up a language model

For this demo, I recommend using LLaMA 8B unless your PC is high-performance with lots of disk space. Start by pulling the model with this command in your terminal. It downloads the LLaMA 3 8B model along with the nomic-embed-text embedding model, which Scrapegraph AI will need later:

ollama pull llama3 && ollama pull nomic-embed-text

The download might take some time, depending on your internet speed. After it's done, you'll need to run this command to start using the model. It checks if everything is set up right:

ollama run llama3

If set up correctly, you'll see a simple interface where you can interact with the model. For example, let's ask the model to introduce itself. Here's what you might see:

>>> Introduce yourself

My name is LLaMA, and I'm a large language model...

If you get a response, it means everything is working, and LLaMA is ready to help us. When you're done testing, close the prompt by typing /bye.

Next, you'll use this command to start serving the model on your machine. This lets it work with Scrapegraph AI:

ollama serve

This command sets up a local Ollama instance at 127.0.0.1:11434, which Scrapegraph AI will use. If the port is already busy, Ollama probably started on its own after installation, so there's nothing more to do. Keep this terminal window open and move on to the next step.
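If you'd like to confirm the server is actually reachable before wiring it up to Scrapegraph AI, a quick request to Ollama's HTTP API does the trick. Here's a minimal sketch using only Python's standard library; it assumes the default address above and queries the /api/tags endpoint, which lists the models you've pulled:

# Quick sanity check that the local Ollama server is up and responding.
# Assumes the default address (127.0.0.1:11434); /api/tags lists local models.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:11434/api/tags") as response:
    data = json.load(response)

for model in data.get("models", []):
    print(model["name"])  # e.g. llama3:latest, nomic-embed-text:latest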

Installing Playwright and Scrapegraph AI

The last step in our setup process is to install Scrapegraph AI. The simplest way to do this is using the Pip package manager. Just enter this command in your terminal to install the scrapegraphai package.

pip install scrapegraphai

If you like to keep your Python projects organized, consider setting up a new project with Poetry and adding scrapegraphai as a dependency in that environment.

Installing Scrapegraph AI also installs Playwright, a browser automation tool that enhances your web scraping abilities. We've covered Playwright more thoroughly in one of the previous blog posts. If this is your first time using Playwright, there's an extra step: run this command to download the browser binaries it needs:

playwright install

With these tools installed, you're all set to start scraping!

Scraping JSON data

For our first exercise, we're going to scrape some JSON content with a bit of AI magic. I'll start by creating a local demo.json file within my Python project. You can use any JSON data for this task; if you like, you can download my sample from GitHub. Here's the JSON data if you want to copy and paste it quickly:

[
  {
      "id": 1,
      "name": "Product A",
      "price": 25.99,
      "availability": true,
      "description": "A popular gadget with multiple features",
      "categories": ["electronics", "gadgets"],
      "manufacturer": {
          "name": "GadgetPro",
          "location": "USA"
      },
      "reviews": [
          {
              "user": "JohnDoe",
              "rating": 4,
              "comment": "Great value for the price!"
          },
          {
              "user": "JaneSmith",
              "rating": 5,
              "comment": "Absolutely love this product!"
          }
      ]
  },
  {
      "id": 2,
      "name": "Product B",
      "price": 45.50,
      "availability": false,
      "description": "High-quality appliance for home use",
      "categories": ["home", "appliance"],
      "manufacturer": {
          "name": "HomeGoodsInc",
          "location": "Canada"
      },
      "reviews": [
          {
              "user": "AliceBrown",
              "rating": 3,
              "comment": "Useful, but a bit overpriced."
          },
          {
              "user": "DaveWilson",
              "rating": 5,
              "comment": "A must-have in every household!"
          }
      ]
  },
  {
      "id": 3,
      "name": "Product C",
      "price": 15.75,
      "availability": true,
      "description": "Eco-friendly and sustainable",
      "categories": ["sustainability", "eco-friendly"],
      "manufacturer": {
          "name": "EcoProducts",
          "location": "Germany"
      },
      "reviews": [
          {
              "user": "ClaraKlein",
              "rating": 4,
              "comment": "Good for the environment and your wallet."
          },
          {
              "user": "OmarFarouk",
              "rating": 5,
              "comment": "Excellent product with great eco credentials!"
          }
      ]
  },
  {
      "id": 4,
      "name": "Product D",
      "price": 29.99,
      "availability": true,
      "description": "A durable item for outdoor activities",
      "categories": ["outdoors", "durable"],
      "manufacturer": {
          "name": "OutdoorGear",
          "location": "Sweden"
      },
      "reviews": [
          {
              "user": "BethJones",
              "rating": 4,
              "comment": "Sturdy and reliable."
          },
          {
              "user": "TomNorton",
              "rating": 5,
              "comment": "Perfect for any outdoor enthusiast!"
          }
      ]
  },
  {
      "id": 5,
      "name": "Product E",
      "price": 55.00,
      "availability": true,
      "description": "Luxury beauty product",
      "categories": ["beauty", "luxury"],
      "manufacturer": {
          "name": "LuxBeautyCo",
          "location": "France"
      },
      "reviews": [
          {
              "user": "SandraLee",
              "rating": 5,
              "comment": "Feels luxurious and well worth the price."
          },
          {
              "user": "RajPatel",
              "rating": 4,
              "comment": "Great quality, but a little pricey."
          }
      ]
  }
]

The sample JSON data includes detailed information on products, such as prices, manufacturers, and reviews. While this data isn't overly complex, processing it manually might still take some time, especially for a beginner developer. Luckily, we have our AI tools to help simplify the process.

# Import the JSONScraperGraph class from scrapegraphai.graphs
from scrapegraphai.graphs import JSONScraperGraph

# Configuration settings for the scraper graph
graph_config = {
    "llm": {
        "model": "ollama/llama3",  # Specifies the large language model to use
        "temperature": 0,          # Temperature controls randomness; 0 makes it deterministic
        "format": "json",          # Output format is set to JSON
        "base_url": "http://localhost:11434",  # Base URL where the Ollama server is running
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",  # Specifies the embedding model to use
        "temperature": 0,                    # Keeps the generation deterministic
        "base_url": "http://localhost:11434",  # Base URL for the embeddings model server
    },
    "verbose": True,  # Enables verbose output for more detailed log information
}

First, you'll need to import the necessary class from scrapegraphai.graphs. You'll then set up the configuration for your scraping graph. This configuration will include specifying your language model and setting up a few parameters like the model's temperature (to control randomness), the format of the output, and the base URL for the server running your model.

If you decide to use a different language model, be sure to update the model name in your configuration (however, don't change "ollama/nomic-embed-text").
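For example, if you'd rather try Mistral, a one-line tweak after defining graph_config is all it takes (this assumes you've already pulled the model with ollama pull mistral):

# Hypothetical swap: point the config at Mistral instead of LLaMA
# (requires `ollama pull mistral` to have been run beforehand).
graph_config["llm"]["model"] = "ollama/mistral"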

Next, open your demo.json file to read its content. This step is straightforward—just make sure to handle the file correctly to avoid any encoding issues.

# Open the demo.json file and read its content into a variable
with open("demo.json", "r", encoding="utf-8") as file:
    text = file.read()

Now, create an instance of your scraper by setting up a prompt that asks for specific details from the JSON data, such as product names, average ratings, and prices. Run the scraper to process the JSON content. Depending on your system, this might take a few minutes.

# Create an instance of JSONScraperGraph
json_scraper_graph = JSONScraperGraph(
    prompt="List all the product names along with the average rating based on the reviews. Also include price for every product name.",
    source=text,
    config=graph_config,
)

# Execute the scraping and processing of JSON data
result = json_scraper_graph.run()

# Print the result to the console
print(result)

Once the script finishes running, it should output a Python dictionary containing the information you requested. Here's a peek at what the results might look like:

{
    "products": [
        {"name": "Product A", "average_rating": 4.5, "price": 25.99},
        {"name": "Product B", "average_rating": 4, "price": 45.5},
        {"name": "Product C", "average_rating": 4.5, "price": 15.75},
        {"name": "Product D", "average_rating": 4.5, "price": 29.99},
        {"name": "Product E", "average_rating": 4.5, "price": 55.0},
    ]
}

The output will show each product's name along with its average rating and price. This format is not only easy to read but also ready to use in further applications or analyses.
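Since the result is a plain Python dictionary, it plugs straight into further processing. Here's a small sketch that sorts the products by rating; it assumes the "products", "average_rating", and "price" keys from the sample output above, and an LLM's output structure can vary slightly between runs:

# Sort the extracted products by average rating, highest first.
# Assumes the keys shown in the sample output above; adjust if your run differs.
products = result.get("products", [])
for product in sorted(products, key=lambda p: p["average_rating"], reverse=True):
    print(f'{product["name"]}: {product["average_rating"]} stars at ${product["price"]}')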

One of the coolest aspects of using AI for this kind of task is the ability to let users who might not be familiar with programming interact with the data. By providing simple, natural language prompts, users can effectively analyze data without needing to know how to code.

Remember, Scrapegraph isn't limited to JSON; it can also handle other data types like XML, CSV, and PDF, making it a versatile tool for various data extraction and processing tasks.
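The workflow barely changes between formats: you swap the graph class and feed it the right source. As a rough sketch, here's what the CSV variant might look like, assuming a hypothetical demo.csv file and the CSVScraperGraph class shipped with the library (check the Scrapegraph AI docs if the class name has changed in your version):

# A minimal sketch of the same pattern applied to CSV data.
# demo.csv is a hypothetical file; graph_config is the same as in the JSON example.
from scrapegraphai.graphs import CSVScraperGraph

with open("demo.csv", "r", encoding="utf-8") as file:
    csv_text = file.read()

csv_scraper_graph = CSVScraperGraph(
    prompt="List all the product names along with their prices.",
    source=csv_text,
    config=graph_config,
)

print(csv_scraper_graph.run())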

Scraping the web

Next, we'll explore how to scrape data from a website, using a real-world example. Let's say we want to gather recent news titles from Wired. For this task, we'll use the SmartScraperGraph tool from the Scrapegraph AI suite, keeping the main configuration the same as before.

You'll begin by importing the necessary classes and setting up the configuration for your scraper. This configuration will specify your AI model, control randomness with a temperature setting, define the output format, and set the base URL for your model's server.

# Import the SmartScraperGraph class from scrapegraphai.graphs
from scrapegraphai.graphs import SmartScraperGraph

# Configuration settings for the scraper graph
graph_config = {
    "llm": {
        "model": "ollama/llama3",  # Specifies the large language model to use
        "temperature": 0,          # Temperature controls randomness; 0 makes it deterministic
        "format": "json",          # Output format is set to JSON
        "base_url": "http://localhost:11434",  # Base URL where the Ollama server is running
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",  # Specifies the embedding model to use
        "temperature": 0,                    # Keeps the generation deterministic
        "base_url": "http://localhost:11434",  # Base URL for the embeddings model server
    },
    "verbose": True,  # Enables verbose output for more detailed log information
}

With the setup ready, you simply provide the URL from which you want to scrape—let's use Wired as an example—and the specific prompt for the data you want to extract. In our case, it could be something like, "List me all the titles." This prompt tells the AI exactly what information to look for and extract from the page.

# Create an instance of SmartScraperGraph with specific instructions
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",  # The AI prompt specifies what to extract
    source="https://www.wired.com/",  # URL of the website from which to scrape data
    config=graph_config,  # Uses predefined configuration settings
)

# Execute the scraping process
result = smart_scraper_graph.run()

# Print the results to the console
print(result)

After running your scrape, you will receive the results in a structured format, typically a Python dictionary. The output will list all the scraped titles under a key such as "titles". These might include various headlines from articles across the Wired website, such as guides on smart speakers, reviews of the latest tech gadgets, and features on cultural topics.

{
    "titles": [
        "Buying Guide: The Best Smart Speakers With Alexa, Google Assistant, and Siri",
        "Product Review: Insta360 Adds 8K Video to Its 360-Action Camera Hybrid",
        "Product Review: Samsung's Flagship QD-OLED Has Glorious, Reflection-Free Picture Quality",
        "WIRED Classics: Crowdsourcing",
        "Going Viral: TikTok and the Evolution of Digital Blackface",
        "We ❤️ Hardware: ‘You Must Believe You Can Repair It'",
        "Hip Hop's 50th: Hip Hop 2073: A Vision of the Future, 50 Years From Now",
        "Trending Stories: Science, Gear, Culture",
        "Videos: Autocomplete Interview, Tech Support",
    ]
}
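Since the result is again just a dictionary, it's easy to keep around for later use, for example by dumping it to disk:

# Save the scraped titles for later use (a minimal sketch; the structure of
# the dictionary depends on the prompt and may vary between runs).
import json

with open("wired_titles.json", "w", encoding="utf-8") as file:
    json.dump(result, file, ensure_ascii=False, indent=2)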

Imagine telling someone two decades ago that you could automatically gather all this information with a few lines of code. It's a testament to how far technology has come, allowing us to easily access and organize data from the internet with little more than a simple command. This capability is not only fascinating but incredibly powerful for developers and businesses alike.

Dealing with proxies

Web scraping can often feel like navigating a minefield due to various technical and legal challenges. One of the critical challenges is avoiding being blocked by websites. This is where the use of proxies becomes invaluable. Our detailed guide on web scraping without getting blocked covers many tactics, with proxies being a cornerstone strategy.

When scraping websites, especially at scale, it's common for your IP address to get blocked if too many requests are sent from the same source. Proxies mask your real IP address, making your requests appear as if they're coming from different locations. This not only helps in avoiding IP bans but also in scraping geo-restricted data.

Scrapegraph AI simplifies the integration of proxies into your scraping projects. It has built-in support for automatic proxy rotation, leveraging the free-proxy library, which allows you to specify criteria such as anonymity, security level, country preferences, connection timeout, and how frequently to rotate proxies. This setup is great for casual use where simple HTTP proxies suffice.

graph_config = {
    "llm": { },  # Same LLM settings as in the earlier examples
    "loader_kwargs": {
        "proxy": {
            "server": "broker",  # Use the built-in broker from the free-proxy library
            "criteria": {
                "anonymous": True,      # Only pick anonymous proxies
                "secure": True,         # Require secure (HTTPS) proxies
                "countryset": {"IT"},   # Restrict proxies to a given country set
                "timeout": 10.0,        # Connection timeout in seconds
                "max_shape": 3,
            },
        },
    },
}

For those needing more robust and reliable proxy solutions, services like ScrapingBee offer premium proxies. These premium services provide a higher level of anonymity and reliability, essential for more aggressive scraping tasks or when scraping particularly sensitive or heavily fortified websites.

Setting up these proxies involves specifying your server details, username, and password in your configuration, ensuring all your requests are routed through the proxy server.

graph_config = {
    "llm": { },  # Same LLM settings as in the earlier examples
    "loader_kwargs": {
        "proxy": {
            "server": "http://your_proxy_server:port",  # Address of your proxy provider
            "username": "your_username",                # Credentials supplied by the provider
            "password": "your_password",
        },
    },
}

Using proxies is essential not just for circumventing anti-scraping measures but also for maintaining the operational integrity of your scraping activities. Without proxies, the risk of being blocked increases significantly, which can lead to incomplete data collection and potential legal issues, depending on the target's scraping policies.

In essence, integrating effective proxy management into your scraping strategy is not just an option; it's a necessity for ensuring access to data across the web without interruptions or penalties.

Scraping search results

Scraping search results efficiently can be a game changer, especially when you're looking for specific information like the best places to eat in a city you live in or plan to visit.

To start, you'll need to import the appropriate class for creating a search graph, which allows you to query and scrape data from various sources based on your input criteria. Your configuration settings will generally stay the same as previous setups, ensuring consistency across your scraping tasks.

# Import the SearchGraph class from scrapegraphai.graphs
from scrapegraphai.graphs import SearchGraph

# Configuration settings for the scraper graph
graph_config = {
  # Same config as before
}

For a practical example, let's search for the top pizzerias in Riga, the city I call home. By setting up a prompt in your search graph, you can direct the AI to focus on finding the best pizzerias. The AI will handle the query and scrape the necessary information from the web.

# Create an instance of SearchGraph for specific search query
search_graph = SearchGraph(
    prompt="List me the best pizzerias in Riga",  # The AI prompt specifies what to find
    config=graph_config,  # Uses predefined configuration settings
)

# Execute the search process using the SearchGraph instance
result = search_graph.run()

# Print the results to the console
print(result)

Once the search is executed, the process might take a few minutes depending on your computer's performance. But patience pays off! The results will typically return in a structured format, listing the pizzerias along with essential details like their names, ratings, and addresses.

{
    "best_pizzerias": [
        {
            "name": "Pizzaiolo",
            "rating": 4.5,
            "address": "Krišjāņa Valdemāra iela 13, Rīga",
        },
        {"name": "Pizza Marvella", "rating": 4.2, "address": "Skārņu iela 55, Rīga"},
        {"name": "La Biga", "rating": 4.1, "address": "Kronvalda iela 10, Rīga"},
        {"name": "Pizzeria Riga", "rating": 4.5, "address": "Riga, Latvia"},
    ]
}

With these results, not only do you get a quick guide to some of the best spots for pizza in the city, but you also gain insights into the effectiveness of AI-powered search scraping. After wrapping up this article, I know exactly where I'll be heading for a delicious slice!

Using OpenAI models for text to speech generation

Imagine being able to listen to a summary of a website as easily as playing a song. That's exactly what we can achieve by combining OpenAI's GPT-3.5 with its text-to-speech models. This technology is not just cool; it's potentially a game-changer, especially for people with visual impairments.

To tap into this capability, you'll first need to register on the OpenAI platform. Once registered, you can generate an API key. Depending on your usage, you may also need to purchase some API credits, which you can do under the Billing section. Typically, each request might cost around 5 cents.

Once you have your API key, you'll need to prepare your script:

# Import the SpeechGraph class from scrapegraphai.graphs
from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "tts-1-hd",
        "voice": "fable",
    },
    "output_path": "audio_summary.mp3",
}

A few things to note here:

  • API key — insert your OpenAI API key to authenticate your requests (see the sketch after this list for a way to avoid hard-coding it).
  • Model selection — choose a model for generating text. For example, gpt-3.5-turbo is a solid choice for understanding and summarizing content.
  • Text-to-speech (TTS) model — select a TTS model like tts-1-hd to convert text into speech.
  • Voice selection — pick a voice from the available options to personalize how your audio sounds. You can listen to different voice samples in the OpenAI documentation to decide which one fits best.
  • Output — decide on the filename and format for the audio file that will be generated, such as audio_summary.mp3.
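Rather than pasting the key directly into the config, you might prefer to read it from an environment variable. A minimal sketch, assuming you've exported it under the conventional name OPENAI_API_KEY:

# Read the OpenAI key from the environment instead of hard-coding it.
# OPENAI_API_KEY is just a conventional variable name; use whatever you exported.
import os

openai_key = os.environ["OPENAI_API_KEY"]
graph_config["llm"]["api_key"] = openai_key
graph_config["tts_model"]["api_key"] = openai_key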

Now, you're ready to use these settings to analyze a website and generate an audio summary. Simply set up a prompt to instruct the AI what to focus on, like summarizing the main points of a website (for example, I'll use my own website):

speech_graph = SpeechGraph(
    prompt="Summarize information provided on the website.",
    source="https://bodrovis.tech",
    config=graph_config,
)

result = speech_graph.run()
print(result)

After running the process, you'll get a detailed summary in both text and audio format. This not only makes the content accessible but also adds a layer of convenience for those who prefer auditory learning.
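If you want to double-check where the audio ended up, the file is written to the output_path you set in the config:

# Verify that the audio summary was written to the configured output path.
from pathlib import Path

audio_file = Path(graph_config["output_path"])
if audio_file.exists():
    print(f"Audio summary saved to {audio_file} ({audio_file.stat().st_size} bytes)")
else:
    print("No audio file found; check the run output for errors")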

The end result? A neatly summarized audio file of your chosen content. Truly, the possibilities with AI are as exciting as they are boundless.

Conclusion

We've shown how Scrapegraph AI simplifies web scraping, from handling JSON to turning text into speech. We started by setting up a language model and then explored using AI to pull and process data from various sources. Hopefully, you found this article useful!

This tech isn't just for developers—it helps businesses gain insights and improves accessibility for those with visual impairments. Indeed, the world of AI is always evolving, offering endless opportunities for innovation.

Thanks for joining this exploration. Here's to discovering new tools and methods in your future projects!

Ilya Krukowski

Ilya is an IT tutor and author, web developer, and ex-Microsoft/Cisco specialist. His primary programming languages are Ruby, JavaScript, Python, and Elixir. He enjoys coding, teaching people and learning new things. In his free time he writes educational posts, participates in OpenSource projects, tweets, goes in for sports and plays music.