How to Scrape TikTok in 2024: Scrape Profile Stats and Videos

15 July 2024 | 40 min read

Are you a data analyst thirsty for social media insights and trends? A Python developer looking for a practical social media scraping project? Maybe you're a social media manager tracking metrics or a content creator wanting to download and analyze your TikTok data? If any of these describe you, you're in the right place!

TikTok, the social media juggernaut, has taken the world by storm. TikTok's global success is reflected in its numbers:

  • Massive Download Volume: TikTok has been downloaded over 4.1 billion times.
  • Explosive Growth: In 2024, TikTok has over 1 billion monthly active users globally, surpassing many other social media platforms in engagement and content consumption.
  • High User Engagement: Users spend an average of 55.8 minutes per day on TikTok, browsing personalized feeds and uploading millions of videos daily.
  • Visual Search: For Gen Z, TikTok is the new search engine. 40% of Gen Z prefer TikTok over Google for local searches.

Mind-blowing right? TikTok is really a treasure trove of data waiting to be explored. But why should we make use of this information?

Why Scrape TikTok?

From decoding viral trends to analyzing user behavior, scraped TikTok data is the new oil for many fields:

Data Analysts:
  • Discover emerging trends
  • Understand user behavior
  • Track the performance of specific content
  • Gather engagement metrics for data-driven insights

Developers:
  • Build robust scraping scripts
  • Automate data collection processes
  • Integrate scraping with data analysis libraries

Social Media Managers:
  • Monitor follower growth
  • Track video performance
  • Analyze engagement metrics
  • Analyze competitors
  • Devise strategies to improve engagement and reach

Digital Marketers:
  • Measure campaign effectiveness
  • Understand target audience preferences
  • Optimize marketing efforts

Tech Bloggers and Content Creators:
  • Download and analyze videos
  • Monitor competitor content
  • Identify trending topics
  • Generate ideas and improve content strategy
  • Increase follower engagement

Researchers:
  • Analyze the spread of viral content
  • Study user interaction patterns
  • Conduct sociological and psychological research
  • Study social media trends and patterns
  • Conduct studies on digital communication methods

Brands:
  • Monitor brand presence
  • Understand consumer sentiment
  • Identify influencers for collaborations

Tech Giants:
  • Analyze platform features
  • Understand user engagement
  • Inform product development and competitive strategy

Celebrities:
  • Track public perception
  • Engage with fans
  • Manage online presence

Impressive! TikTok really does offer something for everyone. It's a goldmine for learning what's popular, how users behave, and which trends are emerging.

What This Guide Will Cover

In this comprehensive guide, we'll delve into the nitty-gritty of scraping TikTok using Python. Whether you're a scraping newbie or a seasoned pro, this guide will provide you with the tools and knowledge to effectively extract and analyze TikTok data.

Precisely, we'll walk you through:

  1. Ethical Scraping Etiquette: Keeping your scraping activities legal and respectful to TikTok's terms of service
  2. Setting Up Your Environment: Essential tools and technologies (libraries)
  3. Scraping Profile Stats: Extracting key metrics from TikTok profiles
  4. Downloading Videos and Their Stats: Advanced techniques for downloading TikTok videos and scraping their stats
  5. Outsmarting TikTok's Defenses: Handling dynamic content, bypassing anti-scraping measures, and proxy magic
  6. Data Analysis 101: Quick tips to make sense of your scraped TikTok data with efficient storage methods
  7. Scaling Your Scraper: Taking your project from hobby to heavy-duty

Pro Tip: New to web scraping? Don't sweat it! Check out our primer on What Is Web Scraping or dive into our Python Web Scraping: Full Tutorial With Examples (2024). Trust me, a solid foundation will make your TikTok scraping journey much smoother!

By the time you finish this guide, you'll be scraping TikTok with Python like a pro, ready to unlock insights that could revolutionize your strategy or research. Buckle up – you're in for a treat! 🚀🎵


Article's cover image


Getting Started With TikTok Scraping

Step 1: Reviewing Ethical Considerations for Scraping TikTok

First and foremost, every website has rules (a robots.txt file) regarding web scraping, and TikTok is no exception. Before we start scraping, it's important to check TikTok's robots.txt file. It tells us which parts of TikTok are okay to scrape and which are off-limits.

Think of it as a treasure map that shows us where to dig and where not to!

Viewing TikTok's robots txt file

As we can see, we can scrape general sections like /foryou, /discover, and /about, but we need to avoid areas such as /inapp, /auth, and specific directory paths.

This means we are allowed to:

  • Scrape profile stats: Extract key metrics from TikTok profiles that are accessible through allowed paths
  • Download videos and their stats: Download TikTok videos and scrape their associated data, as long as the content resides within the permitted areas

We always want to ensure our scraping follows the website's terms of service and legal guidelines. It's all about being a good web scraper citizen.
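You don't even have to eyeball the robots.txt file manually: Python's standard library ships urllib.robotparser for exactly this. Here's a minimal sketch; the rules below are illustrative, not TikTok's actual current file:

```python
from urllib.robotparser import RobotFileParser

# Minimal sketch: check paths against robots.txt rules before scraping.
# These rules are illustrative; always fetch the live file for real decisions.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Allow: /foryou
Disallow: /inapp
""".splitlines())

print(rp.can_fetch("*", "https://www.tiktok.com/foryou"))  # True
print(rp.can_fetch("*", "https://www.tiktok.com/inapp"))   # False
```

For the live file, you'd instead call rp.set_url("https://www.tiktok.com/robots.txt") followed by rp.read(), then query can_fetch the same way.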

TikTok's Ethical Scraping Checklist

  • Terms of Service Compliance: Review and follow TikTok's terms of service
  • Respect Robots.txt: Check and adhere to the robots.txt file directives
  • Scraping Activity Monitoring: Scrape only the data you need and avoid excessive requests
  • Data Protection Compliance: Collect no PII (Personally Identifiable Information) and comply with GDPR, CCPA, and other data protection regulations

Step 2: Setting Up Our Python Environment

To start scraping TikTok, we need to set up our environment. We'll need to install a few Python libraries and understand the site's web structure.

Installing Python3

First, let's make sure we have Python3 installed on our machine. If not, we can grab it from the official Python website and follow the installation instructions.

Python3's download page

Installing Python is easy as pie (see what I did there?)! 🥧

Creating Our Virtual Environment

Now that Python is ready, we should create a virtual environment to keep things organized. This way, our TikTok scraping project won't mess with other projects on our machine.

Here's how to create one:

python -m venv tiktok-scraping-env
source tiktok-scraping-env/bin/activate  # On Windows use `tiktok-scraping-env\Scripts\activate`

Think of this virtual environment (tiktok-scraping-env) as a designated sandbox for our TikTok-scraping adventure!

Installing Required Libraries: Nodriver, Asyncio, BeautifulSoup

Python leads the charge in our TikTok scraping adventure, but it can't storm the battlefield alone. To start scraping TikTok data, we'll enlist several Python libraries:

  • Asyncio: The great multitasking maestro that lets us run operations asynchronously, so we won't have to wait for each request to finish before sending the next. (It ships with Python's standard library, so there's nothing extra to install.)
  • BeautifulSoup: Our data excavator, parsing HTML and extracting the juicy bits we need.
  • Nodriver: The stealthy ninja of our team, allowing us to interact with web pages just like a real browser would, but without the overhead of a full browser instance. It's like having a VIP pass that lets us slip past the velvet ropes unnoticed.

Pro Tip: Having had my fair share of adventures using Selenium for web scraping, I can vouch that its undetected_chromedriver bypasses a fair number of anti-bot systems, including those from Cloudflare and Akamai. However, undetected_chromedriver can hit its limits against advanced anti-bot technology like TikTok's. This is where I summon Nodriver, its official successor.

Are you still feeling a bit overwhelmed by these tech terms? Don't sweat it! We've got your back. Here's your tech dictionary to check out and learn more, served with a side of fun:

  • BeautifulSoup Tutorial: Scraping Web Pages With Python. The soup kitchen of web scraping: learn how to extract data from web pages, turning you into a master data chef.
  • Web Scraping Tutorial Using Selenium & Python (+ Examples). Become a web-scraping maestro, orchestrating your browser to perform scraping tasks.
  • How To Use Undetected_chromedriver (Plus Working Alternatives). Learn to be a scraping ninja and sneak past website defenses without breaking a sweat!
  • Scraping With Nodriver: Step By Step Tutorial With Examples. Discover Nodriver: it's like scraping with an invisibility cloak!
  • How To Use Asyncio To Scrape Websites With Python. Master the art of asynchronous scraping: it's like teaching your scraper to juggle multiple tasks!

Pro Tip: Don't feel pressured to become an expert in all these tools overnight. In my scraping journey, I've found that understanding the basics of each is a great start. You can always dive deeper as you need. Remember, even web scraping ninjas started with "Hello World"!

Now, let's continue our adventure and summon these trusty sidekicks using pip:

pip install nodriver beautifulsoup4

Here, we install both libraries with a single command. (No need to install asyncio separately; it's already part of Python's standard library.)

Step 3: Checking Our Setup

To ensure our code environment is set up correctly and good to go, let's write a small script to open a browser window and navigate to TikTok's homepage:

import nodriver as uc
import asyncio

async def main():
    try:
        print("Starting browser...")
        browser = await uc.start(headless=False)
        
        print("Navigating to TikTok Homepage...")
        page = await browser.get("https://www.tiktok.com/")
        print("TikTok page loaded. If you see the TikTok homepage, the setup is successful.")

        print("Waiting for 10 seconds before closing...")
        await asyncio.sleep(10)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Test finished.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())

Running this script should open a browser that navigates to TikTok's homepage, where it waits for 10 seconds before closing the homepage without issues:

A random screenshot of TikTok's homepage opened by the browser window

If that happens, our environment and setup are complete and ready!

Scraping TikTok Profile Stats

Alright, folks! It's time to start our TikTok scraping adventure. But before we dive in, let me let you in on a little secret - we're not just randomly picking a TikTok profile to scrape. Oh no, we're much more considerate than that!

For this scraping escapade, we'll be using the handle @fpl_insights. And don't worry, this isn't some random, unsuspecting victim of our data curiosity. This handle belongs to a good friend of mine who's fully aware of our little digital reconnaissance mission. He's given us the green light to use his profile as our scraping guinea pig.

So, we're all set for guilt-free scraping!

Understanding TikTok Profile Structure

Let's put on our detective hats and explore the structure of a TikTok profile page. Trust me, this bit of investigation will make our scraping adventure much smoother, as we need to target the right elements and extract the data we need.

Key Data Points to Scrape: Identifying HTML Elements

In our TikTok profile treasure hunt, we're after these juicy bits of information:

  • Username
  • Display name
  • Followers count
  • Following count
  • Likes count
  • Bio

Now, let's see where these elements usually hang out on the TikTok profile page:

Annotated TikTok profile page showing locations of username, display name, follower count, following count, like count, bio, and video count.

TikTok Profile Data Elements: Inspecting the Profile Page

To extract this information, we need to inspect the page and locate the correct HTML elements. TikTok uses specific data attributes that make our job easier.

Now, let's learn how to peek under TikTok's hood. Here's how:

  1. Open a TikTok profile page in the browser
  2. Right-click on the element to inspect
  3. Select 'Inspect' from the context menu
  4. Voila! The browser's Developer Tools will pop up, showing the HTML structure

Pro Tip: Using keyboard shortcuts can save tons of time. On most browsers, you can open Developer Tools with Ctrl+Shift+I (Windows/Linux) or Cmd+Option+I (Mac). It's like having a secret passageway into the website's structure!

Let's now go on a treasure hunt for our data points!

Extracting Username

To find the username:

  1. Right-click on the username
  2. Click 'Inspect' in the context menu:

Developer Tools window showing the HTML structure of the TikTok username element.

From our Developer Tools, we can see something like:

<h1 data-e2e="user-title" class="...">...</h1>

The data-e2e="user-title" attribute marks the spot for the username!

Extracting Display Name

The display name is usually below the username:

Developer Tools window showing the HTML structure of the TikTok display name element.

The data-e2e="user-subtitle" is our treasure map to the display name!

Extracting Followers Count

Developer Tools window showing the HTML structure of the TikTok followers count element.

The data-e2e="followers-count" attribute is our key to unlocking the followers count!

Extracting Following Count

Developer Tools window showing the HTML structure of the TikTok following count element.

The data-e2e="following-count" attribute points us to the following count!

Extracting Likes Count

Developer Tools window showing the HTML structure of the TikTok likes count element.

The data-e2e="likes-count" attribute is our treasure chest containing the likes count!

Extracting Bio

Developer Tools window showing the HTML structure of the TikTok bio element.

The data-e2e="user-bio" attribute is our final piece of treasure, leading us to the bio description!

From our screenshots, here's a breakdown of the HTML attributes we're targeting:

  • Username: <h1 data-e2e="user-title">
  • Display name: <h2 data-e2e="user-subtitle">
  • Followers count: <strong data-e2e="followers-count">
  • Following count: <strong data-e2e="following-count">
  • Likes count: <strong data-e2e="likes-count">
  • Bio: <h2 data-e2e="user-bio">

Pro Tip: In my scraping adventures, I've found that TikTok, like many social media platforms, loves to play hide and seek with its HTML structure. They might change these data-e2e attributes in the future. If your scraper suddenly starts bringing home empty treasure chests, checking for changes in these attributes should be your first move. It's like they're constantly rearranging the furniture in their HTML house!
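One cheap insurance policy against that furniture-rearranging: a tiny health check that reports which attributes vanished from the page before you start debugging selectors at 2 AM. This is a hedged sketch using plain string matching; the attribute names come from the breakdown above, while the helper name and sample HTML are ours:

```python
# Hedged sketch: detect when TikTok renames its data-e2e attributes.
# EXPECTED_ATTRIBUTES mirrors the breakdown above; anything missing from a
# freshly fetched page is a strong hint the markup changed.
EXPECTED_ATTRIBUTES = (
    "user-title", "user-subtitle", "followers-count",
    "following-count", "likes-count", "user-bio",
)

def missing_attributes(html):
    """Return the data-e2e values that no longer appear in the page HTML."""
    return [attr for attr in EXPECTED_ATTRIBUTES
            if f'data-e2e="{attr}"' not in html]

sample = '<h1 data-e2e="user-title">fpl_insights</h1>'
print(missing_attributes(sample))
# ['user-subtitle', 'followers-count', 'following-count', 'likes-count', 'user-bio']
```

Run it on the HTML your scraper fetches; an empty list means your treasure map is still accurate.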

Now that we've got our treasure map (the HTML structure), we're ready to write some code that'll automatically do all this treasure hunting.

Isn't web scraping exciting? It's like being a digital archaeologist, unearthing data artifacts from the vast internet landscape.

Let's move on to the coding part and turn our manual exploration into automated magic!

Scraping TikTok Profile Stats: Step-by-Step Process

Let's break down our code step by step, shall we?

Step 1: Setting Up Our Imports

import asyncio
import nodriver as uc
from bs4 import BeautifulSoup

Here, we're importing our trusty sidekicks. asyncio for asynchronous operations, nodriver (alias uc) for stealthy browsing, and BeautifulSoup for parsing HTML.

Step 2: Defining Our Scraping Function

async def scrape_tiktok_profile(username):
    try:
        print(f"Initiating scrape for TikTok profile: @{username}")
        browser = await uc.start(headless=False)
        print("Browser started successfully")

        page = await browser.get(f"https://www.tiktok.com/@{username}")
        print("TikTok profile page loaded successfully")

        await asyncio.sleep(10)  # Wait for 10 seconds
        print("Waited for 10 seconds to allow content to load")

This function is the heart of our operation. It starts a browser (visible, not headless), navigates to the TikTok profile, and waits for the content to load. The await asyncio.sleep() is like giving TikTok a moment to catch its breath before we start poking around.

Pro Tip: From my experience, I've learned that patience is key. That's why we have that 10-second wait after loading the page. It gives JavaScript time to work magic and load all the dynamic content. Without this wait, we might end up with incomplete data, which is about as useful as a chocolate teapot!
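If you'd rather not gamble on a fixed sleep, you can poll until the content is actually there. Here's a hedged sketch; the wait_until helper and the idea of passing an async check function are our own convention, not part of nodriver's API:

```python
import asyncio

# Hedged sketch: poll until the page is ready instead of sleeping blindly.
# `check` is any async callable returning a truthy value once content loaded,
# e.g. one that re-runs page.evaluate and looks for data-e2e="user-title".
async def wait_until(check, timeout=10.0, interval=0.5):
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        result = await check()
        if result:
            return result  # content appeared; stop waiting early
        if asyncio.get_running_loop().time() + interval > deadline:
            raise TimeoutError("page content did not load in time")
        await asyncio.sleep(interval)
```

The upside: on a fast connection you proceed in a second or two instead of always paying the full 10 seconds, and on a slow one you get a clear TimeoutError instead of silently incomplete data.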

Step 3: Extracting the HTML

        html_content = await page.evaluate('document.documentElement.outerHTML')
        print(f"HTML content retrieved (length: {len(html_content)} characters)")

        soup = BeautifulSoup(html_content, 'html.parser')
        print("HTML content parsed with BeautifulSoup")

Here, we're grabbing all the HTML (html_content = ...) from the page and feeding it to BeautifulSoup. It's like taking a snapshot of the webpage and handing it to our data excavator soup = BeautifulSoup(...).

Step 4: Mining for Gold (Data)

        profile_info = {
            'username': soup.select_one('h1[data-e2e="user-title"]').text.strip() if soup.select_one('h1[data-e2e="user-title"]') else None,
            'display_name': soup.select_one('h2[data-e2e="user-subtitle"]').text.strip() if soup.select_one('h2[data-e2e="user-subtitle"]') else None,
            'follower_count': soup.select_one('strong[data-e2e="followers-count"]').text.strip() if soup.select_one('strong[data-e2e="followers-count"]') else None,
            'following_count': soup.select_one('strong[data-e2e="following-count"]').text.strip() if soup.select_one('strong[data-e2e="following-count"]') else None,
            'like_count': soup.select_one('strong[data-e2e="likes-count"]').text.strip() if soup.select_one('strong[data-e2e="likes-count"]') else None,
            'bio': soup.select_one('h2[data-e2e="user-bio"]').text.strip() if soup.select_one('h2[data-e2e="user-bio"]') else None
        }
        print("Profile information extracted successfully")
        return profile_info

This is where the magic happens! We're using BeautifulSoup's select_one method to pluck each element from the page content. It's like a treasure hunt, using the data-e2e attributes we identified earlier as our map to the profile information and stats.

Step 5: Error Handling and Cleanup

    except Exception as e:
        print(f"An error occurred while scraping: {str(e)}")
        return None
    finally:
        if 'browser' in locals():
            browser.stop()
        print("Browser closed")

Even the best-laid plans can go awry sometimes, so we've got this try-except block to catch any hiccups. And no matter what happens, we make sure to finally close our browser – we're responsible digital citizens, after all!

Step 6: Calling the Main Function

async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)

    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())

This is our main function, which orchestrates the whole operation. Here, we declare our username (fpl_insights, in this case). Then, we call our scraping function (scrape_tiktok_profile) and print(...) out the results (if successful). It's like the director of our little data heist movie!

Step 7: Putting It All Together

Now that we've explored each piece of our TikTok scraping puzzle, it's time to assemble our digital Voltron!

Here's the complete code that brings all our steps together:

import asyncio
import nodriver as uc
from bs4 import BeautifulSoup

async def scrape_tiktok_profile(username):
    try:
        print(f"Initiating scrape for TikTok profile: @{username}")
        browser = await uc.start(headless=False)
        print("Browser started successfully")

        page = await browser.get(f"https://www.tiktok.com/@{username}")
        print("TikTok profile page loaded successfully")

        await asyncio.sleep(10)  # Wait for 10 seconds
        print("Waited for 10 seconds to allow content to load")

        html_content = await page.evaluate('document.documentElement.outerHTML')
        print(f"HTML content retrieved (length: {len(html_content)} characters)")

        soup = BeautifulSoup(html_content, 'html.parser')
        print("HTML content parsed with BeautifulSoup")

        profile_info = {
            'username': soup.select_one('h1[data-e2e="user-title"]').text.strip() if soup.select_one('h1[data-e2e="user-title"]') else None,
            'display_name': soup.select_one('h2[data-e2e="user-subtitle"]').text.strip() if soup.select_one('h2[data-e2e="user-subtitle"]') else None,
            'follower_count': soup.select_one('strong[data-e2e="followers-count"]').text.strip() if soup.select_one('strong[data-e2e="followers-count"]') else None,
            'following_count': soup.select_one('strong[data-e2e="following-count"]').text.strip() if soup.select_one('strong[data-e2e="following-count"]') else None,
            'like_count': soup.select_one('strong[data-e2e="likes-count"]').text.strip() if soup.select_one('strong[data-e2e="likes-count"]') else None,
            'bio': soup.select_one('h2[data-e2e="user-bio"]').text.strip() if soup.select_one('h2[data-e2e="user-bio"]') else None
        }
        print("Profile information extracted successfully")
        return profile_info

    except Exception as e:
        print(f"An error occurred while scraping: {str(e)}")
        return None

    finally:
        if 'browser' in locals():
            browser.stop()
        print("Browser closed")

async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)

    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())

Pro Tip: From my coding experience, I've found that it's crucial to add plenty of print statements throughout your code. They're like breadcrumbs in a digital forest, helping you track the progress of your scraper and quickly identify where things might go wrong. Trust me, your future self will thank you when you're debugging at 2 AM!
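When those breadcrumbs start multiplying, Python's built-in logging module lets you keep them all and simply dial the verbosity up or down instead of deleting print lines. A minimal sketch:

```python
import logging

# Minimal sketch: logging instead of print() breadcrumbs.
# Switch level to logging.DEBUG when you need the fine-grained trail.
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("tiktok-scraper")
log.setLevel(logging.INFO)  # per-logger level, independent of the root logger

log.info("Initiating scrape for TikTok profile: @%s", "fpl_insights")
log.debug("HTML length: %d characters", 123456)  # hidden at INFO level
```

Drop these calls in wherever the scraper currently prints, and your 2 AM debugging sessions get timestamps for free.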

When we run our script, it'll work its magic and produce output similar to this:

Terminal output showing the scraped TikTok profile information including username, display name, follower count, following count, like count, and bio for the @fpl_insights profile

And there we have it, folks! We've just walked through a TikTok profile scraper that's stealthy, efficient, and, most importantly, consensual. Remember, always get permission before scraping someone's data – it's not just good manners; it's often a legal requirement!

Storing Our Scraped TikTok Profile Data

Now that we've successfully scraped our TikTok profile data, it's time to talk about storing this digital gold. Remember, data is only as valuable as it is accessible and analyzable.

Let's explore some options to make storing our adventure harvests a breeze!

Option 1: Storing Data in JSON

JSON (JavaScript Object Notation) is like the Swiss Army knife of data formats - versatile, readable, and widely supported. It's a lightweight data interchange format that's easy for humans to read and write and for machines to parse and generate.

Here's how we can continue our code and save our TikTok treasure in JSON:

import json

# Add JSON storage to the existing main function
async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)

    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")

        # Save to a JSON file
        with open(f"{username}_profile.json", 'w', encoding='utf-8') as f:
            json.dump(profile_info, f, ensure_ascii=False, indent=4)
        print(f"Profile information saved to {username}_profile.json")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())

This will create a file {username}_profile.json with our scraped data in a nicely formatted JSON structure that's as easy to read as your favorite book.

Option 2: Storing Data in CSV

CSV (Comma-Separated Values) is the darling of spreadsheet enthusiasts. It's another hugely popular format, especially useful when you scrape multiple profiles and analyze the data in spreadsheet software like Microsoft Excel or Google Sheets.

Here's how to save our TikTok data in CSV:

import csv

# Add CSV storage to the existing main function
async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)

    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")
        
        # Save to a CSV file
        csv_filename = f"{username}_profile.csv"
        with open(csv_filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=profile_info.keys())
            writer.writeheader()
            writer.writerow(profile_info)
        print(f"Profile information saved to {csv_filename}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())

This will create a file {username}_profile.csv with our scraped data in CSV format that'll make any data analyst's heart skip a beat.
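Reading the data back is just as painless with csv.DictReader. Here's a hedged round-trip sketch that uses an in-memory buffer (io.StringIO) and made-up values so it runs standalone:

```python
import csv
import io

# Hedged sketch: round-trip a profile dict through CSV (in-memory for demo).
profile = {"username": "fpl_insights", "follower_count": "12.5K"}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=profile.keys())
writer.writeheader()
writer.writerow(profile)

# Rewind and read the rows back as dictionaries
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)  # [{'username': 'fpl_insights', 'follower_count': '12.5K'}]
```

Swap the StringIO buffer for open(csv_filename) and the same DictReader call loads your saved file, handy when you later want to append new profiles or feed the data into analysis code.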

Option 3: Storing Data in SQLite Database

For the data hoarders among us (I see you, and I salute you), storing data in a SQLite database is like having our own personal Fort Knox of information. SQLite is a lightweight, serverless database engine perfect for small to medium-sized projects.

Let's now store our hard-earned profile stats using SQLite:

import sqlite3

# Add SQLite storage to the existing main function
async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)

    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")
        
        # Save to SQLite database
        db_name = f"{username}_tiktok_profile_stat.db"
        conn = sqlite3.connect(db_name)
        cursor = conn.cursor()
        
        # Create table if it doesn't exist
        cursor.execute('''
        CREATE TABLE IF NOT EXISTS tiktok_profiles
        (username TEXT PRIMARY KEY, display_name TEXT, follower_count INTEGER, 
        following_count INTEGER, like_count INTEGER, bio TEXT)
        ''')
        
        # Insert data
        cursor.execute('''
        INSERT OR REPLACE INTO tiktok_profiles
        (username, display_name, follower_count, following_count, like_count, bio)
        VALUES (:username, :display_name, :follower_count, :following_count, :like_count, :bio)
        ''', profile_info)
        
        conn.commit()
        conn.close()
        print(f"Profile information saved to SQLite database: {db_name}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())

This will create an SQLite database file {username}_tiktok_profile_stat.db and store our scraped data in a table tiktok_profiles, ready for all our querying needs.
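And those querying needs pay off quickly. Here's a hedged sketch of what later analysis could look like, using an in-memory database and made-up numbers so it runs standalone:

```python
import sqlite3

# Hedged sketch: querying stored profiles (in-memory DB, illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute('''CREATE TABLE tiktok_profiles
    (username TEXT PRIMARY KEY, display_name TEXT, follower_count INTEGER,
     following_count INTEGER, like_count INTEGER, bio TEXT)''')
conn.execute("INSERT INTO tiktok_profiles VALUES (?, ?, ?, ?, ?, ?)",
             ("fpl_insights", "FPL Insights", 12000, 150, 340000, "FPL tips"))

# Find accounts above a follower threshold
row = conn.execute("SELECT username, follower_count FROM tiktok_profiles "
                   "WHERE follower_count > 10000").fetchone()
print(row)  # ('fpl_insights', 12000)
conn.close()
```

Point sqlite3.connect at your real {username}_tiktok_profile_stat.db file instead of ":memory:" and the same queries work on your scraped data.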

Pro Tip: In my years of data wrangling, I've learned that there's no one-size-fits-all solution for data storage. JSON is great for maintaining the data's original structure, CSV is perfect for quick spreadsheet analysis, and SQLite allows for more complex querying. My advice? Use them all! Each has its strengths, and having your data in multiple formats means you're always prepared, no matter what analysis adventure comes your way.
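One caveat before you lean on those INTEGER columns: TikTok abbreviates large counts as strings like "15.3K" or "1.2M", and SQLite will happily store them as text. A small hedged helper (the parse_count name and rounding behavior are our choices) can normalize them first:

```python
# Hedged sketch: convert TikTok-style abbreviated counts to integers.
def parse_count(text):
    """'15.3K' -> 15300, '1.2M' -> 1200000; returns None if unparseable."""
    if text is None:
        return None
    text = text.strip().upper().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    try:
        if text and text[-1] in multipliers:
            return round(float(text[:-1]) * multipliers[text[-1]])
        return round(float(text))
    except ValueError:
        return None  # unexpected format; better None than a wrong number

print(parse_count("15.3K"))  # 15300
print(parse_count("1.2M"))   # 1200000
```

Run the scraped follower, following, and like counts through this before inserting, and numeric comparisons like `WHERE follower_count > 10000` behave as expected.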

Scraping TikTok Profile: Downloading Videos and Their Stats

Alright, TikTok explorers! We've conquered the profile structure, and now it's time to dive into the real treasure trove: the videos themselves. Buckle up, because we're about to embark on a data-mining adventure that would make even Indiana Jones jealous!

Understanding TikTok Videos Structure

Before we start our code-powered excavation, let's take a moment to understand the landscape of TikTok videos. It's like studying a map before venturing into uncharted territory – trust me, it'll save us from falling into any data pitfalls!

Key Video Data Points to Scrape

In our TikTok video expedition, we're after these golden nuggets of information for each video:

  • Video URL
  • Video Description
  • Music Title
  • Posted Date
  • Tags
  • Views count
  • Likes count
  • Comments count
  • Shares count
  • Bookmarks count

Pro Tip: From experience, I can tell you that understanding the structure of your target is half the battle won. It's like having X-ray vision for websites!

Let's take a closer look at where these elements typically reside on a TikTok video page:

Annotated TikTok profile video page highlighting the locations of video URL, views count, likes count, comments count, shares count, and bookmarks count.

Annotated TikTok profile video page highlighting the locations of video description and tags, posted date, and music title

TikTok Video Data Elements: Inspecting the Video Page

Now, let's put on our digital archaeologist hats and start digging into the HTML structure. Here's how we'll unearth these precious data elements:

  1. Open a TikTok video page in a browser
  2. Right-click on the element to inspect
  3. Select 'Inspect' from the context menu
  4. Voilà! The browser's Developer Tools will appear, revealing the HTML structure

Let's go treasure-hunting for our video data points!

Extracting Views Count

The views count is typically displayed prominently on the profile video page:

Developer Tools window highlighting the HTML element for TikTok video view count.

Extracting Video Description and Hashtags

The video description and hashtags are typically displayed together:

Developer Tools window highlighting the HTML element for TikTok video description and tags.

As we can see, the video description is within a <span> element with a unique class css-j2a19r-SpanText. The hashtags are separated but all share the same attribute data-e2e="search-common-link".

Extracting Music Title

Developer Tools window highlighting the HTML element for TikTok video music title.

Extracting Posted Date

Developer Tools window highlighting the HTML element for TikTok video posted date.

We can see that the date is isolated in the last <span> element within a parent element that has the attribute data-e2e="browser-nickname". Therefore, we'll use the last-child selector to target that date's section - ['span[data-e2e="browser-nickname"] span:last-child'].
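Here's a quick sanity check of that :last-child selector on a simplified snippet; the HTML below is illustrative, not TikTok's exact markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup: nickname, separator, then the posted date last.
html = '''
<span data-e2e="browser-nickname">
  <span>FPL Insights</span>
  <span>-</span>
  <span>3d ago</span>
</span>
'''
soup = BeautifulSoup(html, "html.parser")

# :last-child targets the final <span> inside the nickname container
date = soup.select_one('span[data-e2e="browser-nickname"] span:last-child')
print(date.text)  # 3d ago
```

Because text nodes don't count as children in CSS, the selector reliably lands on the last span element, which is where the date lives.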

Extracting Likes, Comments, Shares, and Bookmarks Counts

These engagement metrics are usually grouped together:

Developer Tools window highlighting the HTML elements for TikTok video likes, comments, shares, and bookmarks counts.

From our digital excavation, here's a summary of the CSS selectors we're targeting:

  • Video URL: <meta property="og:url">
  • Video Description: span.css-j2a19r-SpanText
  • Music Title: .css-pvx3oa-DivMusicText
  • Posted Date: span[data-e2e="browser-nickname"] span:last-child
  • Tags: [data-e2e="search-common-link"]
  • Views Count: [data-e2e="video-views"]
  • Likes Count: [data-e2e="like-count"]
  • Comments Count: [data-e2e="comment-count"]
  • Shares Count: [data-e2e="share-count"]
  • Bookmarks Count: [data-e2e="undefined-count"]

Pro Tip: From my experience, I've noticed that TikTok, like a mischievous genie, occasionally changes its HTML structure. If your scraper suddenly starts returning empty data, checking for updates in these selectors should be your first wish!

Now that we've mapped out our digital dig site, it's time to craft the enchanted tools (code) that will automate our treasure hunt.

Crafting Our TikTok Video Scraper: Step-by-Step Process

Let's roll up our sleeves and start coding our TikTok video scraper. Buckle up, fellow data enthusiasts, because we're about to embark on an exciting journey!

Step 1: Installing Playwright

First things first, we need to get our digital workbench ready. This time, we're calling in the big guns: Playwright!

Why Playwright?

Playwright is like the Swiss Army knife of web automation. It offers several advantages over Selenium or Nodriver:

  • Better handling of dynamic content: Playwright excels at interacting with JavaScript-heavy sites like TikTok, where content loads dynamically (content that might change and flicker like a firefly).
  • Cross-browser support: It allows us to use Chromium, Firefox, or WebKit with the same code.
  • Powerful selectors: Playwright's built-in selectors are more robust, making it easier to target elements even when the page structure changes.
  • Network interception: We can even decide to modify or mock network requests, which is helpful for bypassing some anti-bot measures.
  • Auto-wait functionality: Playwright automatically waits for elements to be ready before interacting with them, reducing the need for explicit waits.

It's like upgrading from a rowboat to a modern sailboat - we're still navigating the same waters, but with much better tools to handle the currents and winds of web scraping!

Let's now use pip to send an invite to Playwright for our party:

pip install playwright
playwright install

Here, we install the Playwright library, and the playwright install command then downloads the browser binaries (Chromium, Firefox, and WebKit) that Playwright drives.

Hungry to know more about Playwright? Check out our separate guide on Scraping The Web With Playwright.

Step 2: Installing the yt-dlp Library

We also need to equip ourselves with a powerful tool for video downloading: the yt-dlp library. This library is essential for downloading the video content from TikTok (if we decide to).

Let's now extend our party invitation to yt-dlp with pip:

pip install yt-dlp

This will download and install the yt-dlp library, preparing our Python environment for the video downloading capabilities we'll be implementing in our scraper.

Step 3: Setting Up Our Imports

Let's start our code by importing the necessary libraries. It's like preparing our adventurer's backpack before the journey.

from playwright.async_api import async_playwright
import asyncio, random, json, logging, time, os, yt_dlp

Let's take a minute to understand our library imports:

Library Purpose
playwright.async_api Provides tools for browser automation and web scraping
asyncio Enables asynchronous programming for efficient I/O operations
random Generates random numbers for sleep intervals
json Handles JSON data for storing and parsing scraped information
logging Sets up logging for tracking our scraper's progress
time Provides time-related functions for adding delays
os Interacts with the operating system for file operations
yt_dlp A fork of youtube-dl for downloading videos

Pro Tip: From my scraping experience, I can tell you that combining these libraries is like assembling a crack team of specialists for our TikTok expedition. Each one brings a unique skill to the table, making our TikTok data expedition smooth and efficient!

Step 4: Configuring Our Scraper

We begin our TikTok video scraping journey by setting up a robust Python logging system. This is crucial for monitoring our scraper's progress and troubleshooting any issues that may arise:

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Set to True to download videos
DOWNLOAD_VIDEOS = False

Here, we're setting up logging to monitor our scraper's progress. Think of it as establishing a mission control center for our data expedition. We configure logging to output informative messages about each step of the scraping process, allowing us to track our progress in real-time and maintain a record of our scraper's activities.

The DOWNLOAD_VIDEOS toggle is our secret weapon. Set it to True, and our scraper will not only gather data but also download all the videos in the TikTok profile to a directory. It's like choosing between taking notes at a museum or sneaking out with the exhibits!

Step 5: Crafting Our Stealth Techniques

From my experience in web scraping, patience is a virtue. We don't want to overwhelm TikTok's servers (or look suspiciously bot-like), so let's create a function to add random pauses and page scrolls to act more human-like and avoid detection:

async def random_sleep(min_seconds, max_seconds):
    await asyncio.sleep(random.uniform(min_seconds, max_seconds))

async def scroll_page(page):
    await page.evaluate("""
        window.scrollBy(0, window.innerHeight);
    """)
    await random_sleep(1, 2)

These functions will help our scraper mimic human behavior. It's like teaching a robot to do the "scroll and pause" dance!
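To see the jitter in action, here's a tiny standalone check (the demo function and the 0.05-0.15 second window are purely illustrative, not part of the scraper):

```python
import asyncio, random, time

async def random_sleep(min_seconds, max_seconds):
    await asyncio.sleep(random.uniform(min_seconds, max_seconds))

async def demo():
    # Time three short sleeps; each should land somewhere in the requested window
    durations = []
    for _ in range(3):
        start = time.monotonic()
        await random_sleep(0.05, 0.15)
        durations.append(time.monotonic() - start)
    return durations

durations = asyncio.run(demo())
print(durations)  # three different values, all roughly 0.05-0.15s
```

Because every delay is drawn fresh from random.uniform, no two runs produce the same rhythm, which is exactly what makes the traffic look less mechanical.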

Step 6: Dealing With CAPTCHAs

TikTok's bouncers are always on the lookout for bots:

A random screenshot of TikTok's pageload captcha for human verification

CAPTCHAs are like the final boss in a video game. We can't avoid them, but we can prepare for them. Therefore, we need to devise a way to handle them:

async def handle_captcha(page):
    try:
        # Check for the presence of the CAPTCHA dialog
        captcha_dialog = page.locator('div[role="dialog"]')
        is_captcha_present = await captcha_dialog.count() > 0 and await captcha_dialog.is_visible()
        
        if is_captcha_present:
            logging.info("CAPTCHA detected. Please solve it manually.")
            # Wait for the CAPTCHA to be solved
            await page.wait_for_selector('div[role="dialog"]', state='detached', timeout=300000)  # 5 minutes timeout
            logging.info("CAPTCHA solved. Resuming script...")
            await asyncio.sleep(0.5)  # Short delay after CAPTCHA is solved
    except Exception as e:
        logging.error(f"Error while handling CAPTCHA: {str(e)}")

When the script detects a CAPTCHA (is_captcha_present), this function pauses our scraper and allows us to prove we're human by manually solving the random captcha. It's like having a personal assistant who taps you on the shoulder when they need your expertise!

After we solve the captcha ("Hey TikTok, look, I can identify traffic lights too!"), the script continues automatically with minimal or no recurrence of CAPTCHA (thanks to Playwright's ability to maintain a persistent browser session and mimic human-like behavior πŸ˜‰).

Pro Tip: While our current approach involves manually solving the CAPTCHA, there are more advanced strategies to level up your scraping game:

  • Automated CAPTCHA Solving: For those who want to take their scraping to the next level, our guide on How To Bypass ReCAPTCHA & HCaptcha When Web Scraping is a must-read. It's like having a digital locksmith in your toolkit!
  • Scraping APIs: For a truly hands-off approach, consider using a scraping API like ScrapingBee. It's the equivalent of hiring a professional team to handle CAPTCHAs and other anti-bot measures for you.

Step 7: Navigating to the Video Page

Next, we need to visit each video's page:

async def extract_video_info(page, video_url, views):
    await page.goto(video_url, wait_until="networkidle")
    await random_sleep(2, 4)
    await handle_captcha(page)

This step sets the stage for data extraction like a digital archaeologist carefully approaching an ancient artifact.

Step 8: Defining Our Data Extraction Function

Now, let's craft a versatile function for extracting text content:

    video_info = await page.evaluate("""
            () => {
                const getTextContent = (selectors) => {
                    for (let selector of selectors) {
                        const elements = document.querySelectorAll(selector);
                        for (let element of elements) {
                            const text = element.textContent.trim();
                            if (text) return text;
                        }
                    }
                    return 'N/A';
                };
                // Function will be used in the next step
            }
        """)

Here, we're handling the dynamic nature of TikTok's web content by attempting to extract data using multiple selectors. This approach ensures we don't miss any valuable information due to slight changes in TikTok's HTML structure.

Our function patiently tries each selector, moving on to the next if one fails, until it successfully retrieves the desired content.
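The same try-selectors-in-order idea can be sketched in plain Python. The first_non_empty helper below is hypothetical (it's not part of the scraper), but it mirrors what getTextContent does inside the browser:

```python
def first_non_empty(getters, default='N/A'):
    """Return the first non-empty value produced by a sequence of callables."""
    for get in getters:
        try:
            value = (get() or '').strip()
        except Exception:
            value = ''  # a failed selector just means: try the next one
        if value:
            return value
    return default

# Simulate two selectors: the first finds nothing, the second succeeds
result = first_non_empty([lambda: '', lambda: '  1.2M  '])
print(result)  # 1.2M
```

This fallback-chain pattern is why the scraper keeps working when TikTok renames one class but leaves an older data-e2e attribute in place.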

Step 9: Extracting Video Metrics

Let's put our extraction function to work:

    video_info = await page.evaluate("""
            () => {
                // ... (getTextContent function from previous step)

            const getTags = () => {
                const tagElements = document.querySelectorAll('a[data-e2e="search-common-link"]');
                return Array.from(tagElements).map(el => el.textContent.trim());
            };

            return {
                likes: getTextContent(['[data-e2e="like-count"]', '[data-e2e="browse-like-count"]']),
                comments: getTextContent(['[data-e2e="comment-count"]', '[data-e2e="browse-comment-count"]']),
                shares: getTextContent(['[data-e2e="share-count"]']),
                bookmarks: getTextContent(['[data-e2e="undefined-count"]']),
                description: getTextContent(['span.css-j2a19r-SpanText']),
                musicTitle: getTextContent(['.css-pvx3oa-DivMusicText']),
                date: getTextContent(['span[data-e2e="browser-nickname"] span:last-child']),
                tags: getTags()
            };
        }
    """)

    video_info['views'] = views
    video_info['url'] = video_url

    logging.info(f"Extracted info for {video_url}: {video_info}")
    return video_info

Here, we're like data miners, extracting precious metrics from the TikTok bedrock. Remember those juicy attributes (likes, comments, shares, bookmarks, description, musicTitle, date and tags) we collected earlier from our page inspection? We put them to use here.
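One practical wrinkle: TikTok renders these metrics as abbreviated strings like "1.2M" or "45.3K". If you need real numbers for analysis, a small normalizer helps (parse_count is a hypothetical helper, not part of the scraper above):

```python
def parse_count(text):
    """Convert TikTok-style abbreviated counts ('1.2M', '45.3K') into integers."""
    text = str(text).strip().upper()
    multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    if text and text[-1] in multipliers:
        # round() avoids float truncation errors like 45.3 * 1000 -> 45299
        return round(float(text[:-1]) * multipliers[text[-1]])
    try:
        return int(text.replace(',', ''))
    except ValueError:
        return 0  # 'N/A' and anything unparseable falls through to zero

print(parse_count('1.2M'))   # 1200000
print(parse_count('45.3K'))  # 45300
```

You could apply this to the likes, comments, shares, and bookmarks fields right after extraction, or keep the raw strings and convert at analysis time.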

Step 10: Preparing for Video Download

To download the TikTok profile videos, we also need to set up a downloader function using the yt-dlp library we installed earlier:

def download_tiktok_video(video_url, save_path):
    print(f"Preparing to download video: {video_url}")
    ydl_opts = {
        'outtmpl': os.path.join(save_path, '%(id)s.%(ext)s'),
        'format': 'best',
    }

This step is like packing our gear before a big expedition. We're getting ready to capture TikTok videos in their natural habitat!

Step 11: Downloading the Video

Now, let's actually download the video file:

    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(video_url, download=True)
            filename = ydl.prepare_filename(info)
            logging.info(f"Video successfully downloaded: {filename}")
            return filename
    except Exception as e:
        logging.error(f"Error downloading video: {str(e)}")
        return None

This is where the magic happens. Using the yt_dlp library, we're bottling TikTok lightning, saving videos for offline analysis.

Note: Downloading videos is like capturing lightning in a bottle. It's exciting, but for other profiles, we need to be careful! Make sure you have permission and respect copyright laws. Remember, we don't want our data science to turn into a legal drama!

Step 12: Initializing Our Scraping Session

We can now set up our scraping environment:

# Defining an asynchronous function that will handle the scraping process for a given username
async def scrape_tiktok_profile(username, videos=None):
    videos = videos or []  # avoid the mutable-default-argument pitfall
    print(f"Starting scrape for user: {username}")

    # Creating a context manager for Playwright, ensuring proper setup and teardown of resources
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False) # Launches a visible Chrome browser instance

        # Putting on our disguise to blend in with the TikTok crowd
        context = await browser.new_context(
            viewport={'width': 1280, 'height': 720}, # Adjusting our binoculars for the perfect view of TikTok's landscape
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', # Crafting the perfect backstory for our undercover agent
        )
        
        page = await context.new_page() # Opens a new page in the browser

We're setting up our digital command center, ready to explore the TikTok wilderness. Each line of code is like preparing a different piece of equipment for our data expedition. From launching our browser (our trusty steed) to setting our disguise ( user_agent ), we're gearing up for a successful journey into the heart of TikTok's data-rich terrain.

Step 13: Navigating to the TikTok Profile

Time to visit the TikTok profile we're interested in with the await page.goto method:

        url = f"https://www.tiktok.com/@{username}"
        print(f"Navigating to profile: {url}")
        await page.goto(url, wait_until="networkidle")
        await handle_captcha(page)

This is like arriving at our destination. We're at the doorstep of TikTok treasures! We're also calling our previous handle_captcha function here as well.

Step 14: Processing Each Video

Now, let's process each video we find:

        for i, video in enumerate(videos):
            if 'likes' not in video:
                print(f"Processing video {i+1}/{len(videos)}")
                video_info = await extract_video_info(page, video['url'], video['views'])
                videos[i].update(video_info)
                
                if DOWNLOAD_VIDEOS:
                    save_path = os.path.join(os.getcwd(), username)
                    filename = download_tiktok_video(video['url'], save_path)
                    if filename:
                        videos[i]['local_filename'] = filename

This is the heart of our operation. We're examining each TikTok video like a jeweler appraising precious gems.

We're navigating to the profile, scrolling through videos, and extracting information using our extract_video_info function. If we set DOWNLOAD_VIDEOS to True, we're running the downloader function (download_tiktok_video) here as well.

Step 15: Saving Our Progress

The entire process can take a while. Let's ensure we don't lose our hard-earned data if anything interrupts the process. So, let's save our progress every 10 videos into a file in JSON format:

                # Save progress every 10 videos
                if (i + 1) % 10 == 0:
                    with open(f"{username}_progress.json", "w") as f:
                        json.dump(videos, f, indent=2)
                    print(f"Progress saved. Processed {i+1}/{len(videos)} videos.")
                
                await random_sleep(3, 5)

        await browser.close()
        return videos

Regularly saving our progress is like creating save points in a video game. If any interruption happens at any time, we don't want to lose our digital loot that has taken so much time!
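If you want to harden those save points further, an atomic write guarantees the progress file is never left half-written, even if the script dies mid-dump. A sketch (save_progress_atomically is a hypothetical helper, not part of the scraper above):

```python
import json, os, tempfile

def save_progress_atomically(videos, path):
    """Write JSON to a temp file first, then atomically swap it into place,
    so a crash mid-write can never corrupt an existing progress file."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.', suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(videos, f, indent=2)
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)
        raise

save_progress_atomically(
    [{'url': 'https://example.com/v/1', 'views': '1.2M'}],
    'progress_demo.json',
)
```

The temp file lives in the same directory as the target so os.replace stays on one filesystem; swapping this in for the plain open() call above is a small change.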

Step 16: Running Our Scraper

Last but not least, let's fire up our scraper with a main function that brings everything together:

async def main():
    username = "fpl_insights"
    
    # Attempt to load previously saved progress
    try:
        with open(f"{username}_progress.json", "r") as f:
            videos = json.load(f)
        logging.info(f"Loaded progress. {len([v for v in videos if 'likes' in v])}/{len(videos)} videos already processed.")
    except FileNotFoundError:
        # If no previous progress is found, start with an empty list
        videos = []

    # Create a directory for video downloads if DOWNLOAD_VIDEOS is True
    if DOWNLOAD_VIDEOS:
        os.makedirs(username, exist_ok=True)
        os.chdir(username)

    # Start the scraping process
    videos = await scrape_tiktok_profile(username, videos)

    # Log the total number of scraped videos
    logging.info(f"\nTotal videos scraped: {len(videos)}")

    # Save the scraped data to a JSON file
    with open(f"{username}_playwright_video_stats.json", "w") as f:
        json.dump(videos, f, indent=2)
    logging.info(f"Data saved to {username}_playwright_video_stats.json")

if __name__ == "__main__":
    # Run the main function asynchronously
    asyncio.run(main())

First, we try to check if there’s a previously saved progress file for the given username (in case of an interruption). If the file exists, we load the existing data to see how many videos have been processed. If such a file doesn't exist, we start afresh with an empty list (videos = []).
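The "already processed" check relies on a simple convention: a video dict only gains a likes key once extract_video_info has run on it. A toy illustration with dummy data:

```python
# Dummy progress data: v1 was fully processed, v2 was only discovered
videos = [
    {'url': 'https://example.com/v/1', 'views': '10K', 'likes': '1.2K'},
    {'url': 'https://example.com/v/2', 'views': '5K'},
]

processed = [v for v in videos if 'likes' in v]
remaining = [v for v in videos if 'likes' not in v]
print(f"{len(processed)}/{len(videos)} done; next up: {remaining[0]['url']}")
```

This is why an interrupted run resumes cleanly: the loop in scrape_tiktok_profile simply skips any entry that already carries a likes field.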

Next, if we set DOWNLOAD_VIDEOS to True, we create a directory for video downloads and navigate into it. Then, we kick off the scraping process by calling scrape_tiktok_profile with the username and the list of videos, collecting all the juicy data.

After that, we log the total number of scraped videos and save all the data into a JSON file named after the username, ensuring our hard-earned data is safely stored. Finally, we run the main function asynchronously, orchestrating the entire scraping operation.

Step 17: Putting It All Together

Now, let's assemble all our code snippets into one magnificent TikTok scraping machine:

from playwright.async_api import async_playwright
import asyncio, random, json, logging, time, os, yt_dlp

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

DOWNLOAD_VIDEOS = True

async def random_sleep(min_seconds, max_seconds):
    await asyncio.sleep(random.uniform(min_seconds, max_seconds))

async def scroll_page(page):
    await page.evaluate("""
        window.scrollBy(0, window.innerHeight);
    """)
    await random_sleep(1, 2)

async def handle_captcha(page):
    try:
        # Check for the presence of the CAPTCHA dialog
        captcha_dialog = page.locator('div[role="dialog"]')
        is_captcha_present = await captcha_dialog.count() > 0 and await captcha_dialog.is_visible()
        
        if is_captcha_present:
            logging.info("CAPTCHA detected. Please solve it manually.")
            # Wait for the CAPTCHA to be solved
            await page.wait_for_selector('div[role="dialog"]', state='detached', timeout=300000)  # 5 minutes timeout
            logging.info("CAPTCHA solved. Resuming script...")
            await asyncio.sleep(0.5)  # Short delay after CAPTCHA is solved
    except Exception as e:
        logging.error(f"Error while handling CAPTCHA: {str(e)}")

async def hover_and_get_views(page, video_element):
    await video_element.hover()
    await random_sleep(0.5, 1)
    views = await video_element.evaluate("""
        (element) => {
            const viewElement = element.querySelector('strong[data-e2e="video-views"]');
            return viewElement ? viewElement.textContent.trim() : 'N/A';
        }
    """)
    return views

async def extract_video_info(page, video_url, views):
    await page.goto(video_url, wait_until="networkidle")
    await random_sleep(2, 4)
    await handle_captcha(page)

    video_info = await page.evaluate("""
        () => {
            const getTextContent = (selectors) => {
                for (let selector of selectors) {
                    const elements = document.querySelectorAll(selector);
                    for (let element of elements) {
                        const text = element.textContent.trim();
                        if (text) return text;
                    }
                }
                return 'N/A';
            };

            const getTags = () => {
                const tagElements = document.querySelectorAll('a[data-e2e="search-common-link"]');
                return Array.from(tagElements).map(el => el.textContent.trim());
            };

            return {
                likes: getTextContent(['[data-e2e="like-count"]', '[data-e2e="browse-like-count"]']),
                comments: getTextContent(['[data-e2e="comment-count"]', '[data-e2e="browse-comment-count"]']),
                shares: getTextContent(['[data-e2e="share-count"]']),
                bookmarks: getTextContent(['[data-e2e="undefined-count"]']),
                description: getTextContent(['span.css-j2a19r-SpanText']),
                musicTitle: getTextContent(['.css-pvx3oa-DivMusicText']),
                date: getTextContent(['span[data-e2e="browser-nickname"] span:last-child']),
                tags: getTags()
            };
        }
    """)

    video_info['views'] = views
    video_info['url'] = video_url

    logging.info(f"Extracted info for {video_url}: {video_info}")
    return video_info

def download_tiktok_video(video_url, save_path):
    ydl_opts = {
        'outtmpl': os.path.join(save_path, '%(id)s.%(ext)s'),
        'format': 'best',
    }

    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(video_url, download=True)
            filename = ydl.prepare_filename(info)
            logging.info(f"Video successfully downloaded: {filename}")
            return filename
    except Exception as e:
        logging.error(f"Error downloading video: {str(e)}")
        return None

async def scrape_tiktok_profile(username, videos=None):
    videos = videos or []  # avoid the mutable-default-argument pitfall
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            viewport={'width': 1280, 'height': 720},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        )
        
        page = await context.new_page()

        url = f"https://www.tiktok.com/@{username}"
        await page.goto(url, wait_until="networkidle")
        await handle_captcha(page)

        if not videos:
            videos = []
            last_video_count = 0
            no_new_videos_count = 0
            start_time = time.time()
            timeout = 300  # 5 minutes timeout

            while True:
                await scroll_page(page)
                await handle_captcha(page)

                video_elements = await page.query_selector_all('div[data-e2e="user-post-item"]')
                
                for element in video_elements:
                    video_url = await element.evaluate('(el) => el.querySelector("a").href')
                    if any(video['url'] == video_url for video in videos):
                        continue

                    views = await hover_and_get_views(page, element)
                    videos.append({'url': video_url, 'views': views})

                logging.info(f"Found {len(videos)} unique videos so far")

                if len(videos) == last_video_count:
                    no_new_videos_count += 1
                else:
                    no_new_videos_count = 0

                last_video_count = len(videos)

                if no_new_videos_count >= 3 or time.time() - start_time > timeout:
                    break

        logging.info(f"Found a total of {len(videos)} videos")

        for i, video in enumerate(videos):
            if 'likes' not in video:
                video_info = await extract_video_info(page, video['url'], video['views'])
                videos[i].update(video_info)
                logging.info(f"Processed video {i+1}/{len(videos)}: {video['url']}")
                
                if DOWNLOAD_VIDEOS:
                    save_path = os.path.join(os.getcwd(), username)
                    filename = download_tiktok_video(video['url'], save_path)
                    if filename:
                        videos[i]['local_filename'] = filename
                
                # Save progress every 10 videos
                if (i + 1) % 10 == 0:
                    with open(f"{username}_progress.json", "w") as f:
                        json.dump(videos, f, indent=2)
                    logging.info(f"Progress saved. Processed {i+1}/{len(videos)} videos.")
                
                await random_sleep(3, 5)

        await browser.close()
        return videos

async def main():
    username = "fpl_insights"
    
    # Try to load progress
    try:
        with open(f"{username}_progress.json", "r") as f:
            videos = json.load(f)
        logging.info(f"Loaded progress. {len([v for v in videos if 'likes' in v])}/{len(videos)} videos already processed.")
    except FileNotFoundError:
        videos = []

    if DOWNLOAD_VIDEOS:
        os.makedirs(username, exist_ok=True)
        os.chdir(username)

    videos = await scrape_tiktok_profile(username, videos)

    logging.info(f"\nTotal videos scraped: {len(videos)}")

    # Save as JSON
    with open(f"{username}_playwright_video_stats.json", "w") as f:
        json.dump(videos, f, indent=2)
    logging.info(f"Data saved to {username}_playwright_video_stats.json")

if __name__ == "__main__":
    asyncio.run(main())

In my years of code wrangling, I've found that keeping your entire script in one file makes it easier to manage and share. It's like having a Swiss Army knife instead of a scattered toolbox!

Step 18: Running Our TikTok Scraper

We're all set! Let's fire up our scraper and watch the TikTok data roll in.

First, with video downloads enabled:

Terminal screenshot showing the output of the TikTok scraper script with video downloads enabled, displaying progress bars for each video download and extracted metadata.

If we choose to download videos, the process might take a while, especially for a profile with tons of videos. But remember, it's like slow-cooking data - the longer it simmers, the richer the insights you'll get!

Now, let's run it again with video downloads turned off:

Terminal screenshot showing the output of the TikTok scraper script with video downloads turned off, displaying only extracted metadata and faster progress through the profile's videos.

What a journey! Our TikTok scraper is now exploring the digital wilderness, bringing back valuable data insights. It's like we've built an automatic data vacuum cleaner for TikTok videos!

Exporting TikTok Video Data to CSV

By default, we're already storing our scraped data in JSON format, which is great for maintaining data structure and compatibility with many data processing tools. However, let's add an option to export our TikTok treasure trove to a CSV file.

Let's continue our code and slightly modify our main() function to include this feature:

import csv

async def main():
    # ... (previous code remains the same)

    videos = await scrape_tiktok_profile(username, videos)

    print(f"\nTotal videos scraped: {len(videos)}")

    # Save as CSV
    csv_filename = f"{username}_playwright_video_stats.csv"
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        # Define all fields, including the list-valued 'tags'
        fieldnames = ['url', 'views', 'likes', 'comments', 'shares', 'bookmarks', 
                      'description', 'musicTitle', 'date', 'tags']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for video in videos:
            # Handle 'tags' separately as it's a list
            video_data = {k: video.get(k, 'N/A') for k in fieldnames if k != 'tags'}
            video_data['tags'] = ','.join(video.get('tags', []))
            writer.writerow(video_data)
    print(f"Data saved to {csv_filename}")

if __name__ == "__main__":
    asyncio.run(main())

This addition allows us to export our data to a neat CSV file {username}_playwright_video_stats.csv, perfect for spreadsheet analysis or data visualization tools.
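To sanity-check the export, you can read the CSV straight back with csv.DictReader, splitting the comma-joined tags into a list again. A self-contained round-trip sketch (dummy row, in-memory buffer):

```python
import csv, io

# A miniature version of the export: one dummy row, tags joined with commas
row = {'url': 'https://example.com/v/1', 'views': '1.2M', 'tags': '#fpl,#football'}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['url', 'views', 'tags'])
writer.writeheader()
writer.writerow(row)

# Read it back: DictReader yields dicts, and we split tags into a list again
buf.seek(0)
records = [dict(r, tags=r['tags'].split(',')) for r in csv.DictReader(buf)]
print(records[0]['tags'])  # ['#fpl', '#football']
```

Joining tags with commas is lossy if a tag ever contains a comma; a different separator (or sticking with JSON for list fields) sidesteps that edge case.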

Storing TikTok Video Data in SQLite

For the database enthusiasts among us, let's add an option to store our TikTok video data in a SQLite database.

To do this, we'll only modify our main() function as well:

import sqlite3

async def main():
    # ... (previous code remains the same)

    videos = await scrape_tiktok_profile(username, videos)

    print(f"\nTotal videos scraped: {len(videos)}")
    
    # Save to SQLite
    
    # Define our file output name and initiate a sqlite3 connection
    db_filename = f"{username}_tiktok_data.db"
    conn = sqlite3.connect(db_filename)
    cursor = conn.cursor()
    
    # Create table with all fields
    cursor.execute('''CREATE TABLE IF NOT EXISTS videos
                      (url TEXT PRIMARY KEY, views TEXT, likes TEXT,
                       comments TEXT, shares TEXT, bookmarks TEXT,
                       description TEXT, musicTitle TEXT, date TEXT, tags TEXT)''')
    
    # Insert or update data for each video
    for video in videos:
        cursor.execute('''INSERT OR REPLACE INTO videos
                          (url, views, likes, comments, shares, bookmarks,
                           description, musicTitle, date, tags)
                          VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''',
                       (video['url'], video['views'], video['likes'],
                        video['comments'], video['shares'], video['bookmarks'],
                        video.get('description', 'N/A'), 
                        video.get('musicTitle', 'N/A'),
                        video.get('date', 'N/A'),
                        ','.join(video.get('tags', []))))
    
    # Commit changes and close connection
    conn.commit()
    conn.close()
    print(f"Data saved to SQLite database: {db_filename}")

if __name__ == "__main__":
    asyncio.run(main())

This SQLite integration turns our TikTok data into a queryable database, perfect for complex data analysis and app integration.

Pro Tip: From my data hoarding escapades, I can say that SQLite is like a secret weapon for data scientists. It's lightweight, requires no separate server, and can handle surprisingly large datasets. It's perfect for when you want to level up from simple file storage without the complexity of a full-fledged database server!
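To illustrate the "queryable" part, here's a quick hypothetical sketch: an in-memory database with dummy rows in the same schema, answering a question that would be awkward with flat files:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database for the demo
conn.execute('''CREATE TABLE videos
                (url TEXT PRIMARY KEY, views TEXT, likes TEXT, tags TEXT)''')
conn.executemany('INSERT INTO videos VALUES (?, ?, ?, ?)', [
    ('https://example.com/v/1', '1.2M', '45.3K', '#fpl,#football'),
    ('https://example.com/v/2', '890', '12', '#cooking'),
])

# Find every video tagged #fpl, using a parameterized LIKE query
rows = conn.execute(
    "SELECT url, views FROM videos WHERE tags LIKE ?", ('%#fpl%',)
).fetchall()
conn.close()
print(rows)  # [('https://example.com/v/1', '1.2M')]
```

Since the counts are stored as text ("1.2M"), numeric sorting or aggregation would first need a conversion step; storing parsed integers in INTEGER columns is a worthwhile refinement if you plan heavy analysis.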

And there we have it, intrepid data explorers! We've built a comprehensive TikTok scraping tool that not only extracts valuable data but also offers flexible storage options, whether we prefer the simplicity of CSV, the ubiquity of JSON, or the power of SQLite.

Supercharge Your Scraping With ScrapingBee

Hey there, data adventurers! πŸ•΅οΈβ€β™‚οΈ Remember those pesky CAPTCHAs we encountered while starting our scraping? Well, what if I told you there's a magical tool that can make them disappear faster than a viral TikTok dance?

Enter ScrapingBee - the superhero sidekick every web scraper dreams of!

Why ScrapingBee is Your New Best Friend

  • CAPTCHA? What CAPTCHA? πŸ•΅οΈβ€β™‚οΈ ScrapingBee API deftly navigates around CAPTCHAs , allowing you to focus on the data gold while it handles the heavy lifting of CAPTCHA management.
  • Proxy Power-Up: Imagine having a whole fleet of ships (IP addresses) at your command. That's what proxies are! Instead of sending every request from one ship, you spread the load across many, making it far harder for TikTok or other advanced anti-bot systems to pin you down.
  • JavaScript Jedi: ScrapingBee handles JavaScript execution like a pro, perfect for dynamic content like TikTok's.
  • API Simplicity: Simple API calls replace complex browser automation. It's like trading your rusty screwdriver for a laser cutter!

Ready to Become a Scraping Wizard?

Don't let advanced anti-bot measures keep you from your data dreams. With ScrapingBee, you're not just scraping - you're performing data sorcery!

Start Your Free ScrapingBee Trial Now!

No credit card required, and you get 1000 free API calls. That's enough to scrape more TikTok videos than you can watch in a day!

Signing up on ScrapingBee website for free

After signing up, you can proceed to copy your API key from the dashboard.

Copying the API Key from ScrapingBee dashboard

Let's See ScrapingBee in Action!

Enough talk, let's see some code!

First, we need to install the ScrapingBee Python SDK:

pip install scrapingbee

Now, let's see how to get started with using ScrapingBee API to scrape a TikTok profile with the elegance of a ballet dancer and the stealth of a ninja:

from scrapingbee import ScrapingBeeClient
import json

# Your ScrapingBee API key - your ticket to scraping paradise!
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

# The TikTok profile URL - let's scrape @fpl_insights again
url = "https://www.tiktok.com/@fpl_insights"

# Our extraction rules - telling ScrapingBee what treasure to find
extract_rules = {
    "videos": {
        "selector": "div[data-e2e='user-post-item']",
        "type": "list",
        "output": {
            "url": {
                "selector": "a",
                "output": "@href"
            },
            "views": {
                "selector": "strong[data-e2e='video-views']",
                "output": "text"
            }
        }
    },
    "profile_info": {
        "selector": "h2[data-e2e='user-subtitle']",
        "output": "text"
    }
}

# The magic happens here!
response = client.get(
    url,
    params={
        'extract_rules': extract_rules,
        'render_js': True,
        'premium_proxy': True
    }
)

if response.ok:
    data = response.json()
    print(json.dumps(data, indent=2))
else:
    print(f"Oops! Something went wrong: {response.content}")

This extract_rules dictionary is like giving ScrapingBee a treasure map. It's saying:

  1. Find all the video items (div[data-e2e='user-post-item'])
  2. For each video, grab the URL and views count
  3. Oh, and while you're at it, snag the profile subtitle (h2[data-e2e='user-subtitle']) too

The response = client.get(...) part is where the magic happens. ScrapingBee takes our treasure map, navigates TikTok's defenses, and brings back the gold - all with one simple API call.
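
One practical wrinkle: the view counts that come back are TikTok's display strings like "1.2M" or "350K", not numbers. Here's a small helper (our own, not part of the ScrapingBee SDK) to normalize them before sorting or aggregating:

```python
def parse_views(text: str) -> int:
    """Convert TikTok-style display counts ("1.2M", "350K", "999") to ints."""
    text = text.strip().upper().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return round(float(text)) if text else 0
```

Run each scraped views string through parse_views before doing any math on it; parse_views("1.2M") gives 1200000.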

Why You'll Fall in Love With ScrapingBee

  • Focus on what matters: No more wrestling with CAPTCHAs or proxy configurations. ScrapingBee handles the grunt work while you focus on the data magic.
  • Scale like a boss: Whether you're scraping stats for 10 videos or 10,000, ScrapingBee's got your back. It's like having an army of tiny web-scraping minions at your command!
  • Stay on the right side of the law: ScrapingBee helps you respect rate limits and terms of service. It's like having a digital lawyer on your team!
  • Save time, save money: Time is money, and ScrapingBee saves you both. It's an investment that pays for itself faster than you can say "viral TikTok"!
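
On that "respect rate limits" point, a little client-side discipline still helps. Here's a hedged sketch (our own wrapper, not a ScrapingBee feature) that retries any request callable with exponential backoff, so transient failures don't turn into a flood of hammering requests:

```python
import time

def with_backoff(fetch, retries: int = 3, base_delay: float = 1.0):
    """Call fetch(); on failure wait base_delay, 2x, 4x... then retry.

    Re-raises the last exception if all attempts fail.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch (names from the earlier example):
# response = with_backoff(lambda: client.get(url, params={'render_js': True}))
```

Wrapping client.get this way costs three lines at the call site and keeps you a polite citizen of the web.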

So, why wait? You don't need a credit card to start.

Start your free ScrapingBee trial now and turn those websites into data goldmines! Plus, your first 1,000 requests are on us!

Conclusion

Wow, what a journey it's been! From setting up our TikTok scraping environment to unleashing the power of ScrapingBee, we've certainly covered a lot of ground.

Let's recap our epic journey:

  1. We set up our Python environment and learned about the challenges of scraping TikTok.
  2. We crafted powerful scrapers using Nodriver and Playwright to navigate TikTok's dynamic content (profile stats and video downloads).
  3. We tackled CAPTCHAs and other anti-bot measures head-on.
  4. We explored various ways to store our scraped data, from JSON and CSV files to an SQLite database.
  5. Finally, we discovered the magic of ScrapingBee, which simplifies our scraping process and helps us avoid blocks.

Remember, with great scraping power comes great responsibility. Always scrape ethically, respect website terms of service, and use the data wisely. This not only ensures your professional integrity but also contributes to a healthy data ecosystem.

Whether you're analyzing TikTok trends, conducting research, or satisfying your data curiosity, you now have the tools to dive deep into the TikTok data ocean.

So, what are you waiting for? Grab your digital scuba gear (your Python environment), take a deep breath, and dive into the fascinating world of TikTok data.

Further Reading

Ready to level up your scraping skills? Check out more of our resources:

  • How to Scrape YouTube – Learn how to scrape another social media giant
  • Python Web Scraping: Full Tutorial With Examples (2024) – Master the art of web scraping with Python
  • API vs. Web Scraping: What’s the Difference? – Understand when to use each approach
  • Avoiding Detection – Stay under the radar while scraping
  • Scraping With Nodriver: Step By Step Tutorial With Examples – Deep dive into using Nodriver for scraping
  • How To Scrape Data From Twitter.Com – Apply your skills to another platform

The scraping journey doesn't end here, folks. Keep learning, keep adapting, and most importantly, keep asking questions. The world of web scraping is ever-evolving, and the best scrapers are those who never stop exploring!

Happy scraping, and may your datasets be ever bountiful! πŸŠβ€β™‚οΈπŸ“ŠπŸŽ‰

Ismail Ajagbe

A DevOps Enthusiast, Technical Writer, and Content Strategist who's not just about the bits and bytes but also the stories they tell. I don't just write; I communicate, educate, and engage, turning the 'What's that?' into 'Ah, I get it!' moments.