Are you a data analyst thirsty for social media insights and trends? A Python developer looking for a practical social media scraping project? Maybe you're a social media manager tracking metrics or a content creator wanting to download and analyze your TikTok data? If any of these describe you, you're in the right place!
TikTok, the social media juggernaut, has taken the world by storm. TikTok's global success is reflected in its numbers:
- Massive Download Volume: TikTok has been downloaded over 4.1 billion times.
- Explosive Growth: In 2024, TikTok has over 1 billion monthly active users globally, surpassing many other social media platforms in engagement and content consumption.
- High User Engagement: Users spend an average of 55.8 minutes daily on TikTok, browsing their personalized feeds and uploading millions of videos daily.
- Visual Search: For Gen Z, TikTok is the new search engine: 40% of Gen Z prefer TikTok over Google for local searches.
Mind-blowing right? TikTok is really a treasure trove of data waiting to be explored. But why should we make use of this information?
Why Scrape TikTok?
From decoding viral trends to analyzing user behavior, scraping TikTok data is the new oil for various fields:
Audience | Use-Case |
---|---|
Data Analysts | - Discover emerging trends - Understand user behavior - Track the performance of specific content - Gather engagement metrics for data-driven insights |
Developers | - Build robust scraping scripts - Automate data collection processes - Integrate scraping with data analysis libraries |
Social Media Managers | - Monitor follower growth - Track video performance - Analyze engagement metrics - Analyze competitors - Devise strategies to improve engagement and reach |
Digital Marketers | - Measure campaign effectiveness - Understand target audience preferences - Optimize marketing efforts |
Tech Bloggers and Content Creators | - Download and analyze videos - Monitor competitor content - Identify trending topics - Generate ideas and improve content strategy - Increase follower engagement |
Researchers | - Analyze the spread of viral content - Study user interaction patterns - Conduct sociological and psychological research - Study social media trends and patterns - Conduct studies on digital communication methods |
Brands | - Monitor brand presence - Understand consumer sentiment - Identify influencers for collaborations |
Tech Giants | - Analyze platform features - Understand user engagement - Inform product development and competitive strategy |
Celebrities | - Track public perception - Engage with fans - Manage online presence |
Impressive! TikTok really does have something for everyone! It's a goldmine for learning about what's popular, user behavior, and emerging trends.
What This Guide Will Cover
In this comprehensive guide, we'll delve into the nitty-gritty of scraping TikTok using Python. Whether you're a scraping newbie or a seasoned pro, this guide will provide you with the tools and knowledge to effectively extract and analyze TikTok data.
Precisely, we'll walk you through:
- Ethical Scraping Etiquette: Keeping your scraping activities legal and respectful to TikTok's terms of service
- Setting Up Your Environment: Essential tools and technologies (libraries)
- Scraping Profile Stats: Extracting key metrics from TikTok profiles
- Downloading Videos and Their Stats: Advanced techniques for downloading TikTok videos and their stats
- Outsmarting TikTok's Defenses: Handling dynamic content, bypassing anti-scraping measures, and proxy magic
- Data Analysis 101: Quick tips to make sense of your scraped TikTok data with efficient storage methods
- Scaling Your Scraper: Taking your project from hobby to heavy-duty
Pro Tip: New to web scraping? Don't sweat it! Check out our primer on What Is Web Scraping or dive into our Python Web Scraping: Full Tutorial With Examples (2024). Trust me, a solid foundation will make your TikTok scraping journey much smoother!
By the time you finish this guide, you'll be scraping TikTok with Python like a pro, ready to unlock insights that could revolutionize your strategy or research. Buckle up, because you're in for a treat!
Getting Started With TikTok Scraping
Step 1: Reviewing Ethical Considerations for Scraping TikTok
First and foremost, every website has rules (a robots.txt file) regarding web scraping, and TikTok is no exception. Before we start scraping, it's important to check TikTok's robots.txt file. This file tells us which parts of TikTok are okay to scrape and which are off-limits.
Think of it as a treasure map that shows us where to dig and where not to!
As we can see, we can scrape general sections like /foryou, /discover, and /about, but we need to avoid areas such as /inapp, /auth, and specific directory paths. The robots.txt file is a guideline, telling web crawlers which parts of TikTok they can access and which are off-limits.
This means we are allowed to:
- Scrape profile stats: Extract key metrics from TikTok profiles that are accessible through allowed paths
- Download videos and their stats: Download TikTok videos and scrape their associated data, as long as the content resides within the permitted areas
We always want to ensure our scraping follows the website's terms of service and legal guidelines. It's all about being a good web scraper citizen.
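We can even automate that courtesy check: Python's standard library ships with a robots.txt parser. Below is a sketch using urllib.robotparser against a simplified, illustrative rule set (the real directives live at https://www.tiktok.com/robots.txt and may differ):

```python
from urllib.robotparser import RobotFileParser

# A simplified, illustrative robots.txt -- check the live file at
# https://www.tiktok.com/robots.txt for TikTok's actual directives.
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /foryou
Allow: /discover
Disallow: /inapp
Disallow: /auth
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Ask the parser before scraping each path
for path in ["/foryou", "/discover", "/inapp", "/auth"]:
    allowed = parser.can_fetch("*", f"https://www.tiktok.com{path}")
    print(f"{path}: {'allowed' if allowed else 'off-limits'}")
```

Calling `can_fetch()` before each request keeps the scraper honest even if the rules change between runs.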
TikTok's Ethical Scraping Checklist
Consideration | Description |
---|---|
Terms of Service Compliance | Review and follow TikTok's terms of service |
Respect Robots.txt | Check and adhere to the robots.txt file directives |
Scraping Activity Monitoring | Scrape only necessary data and avoid excessive requests |
Data Protection Compliance | Ensure no PII (Personally Identifiable Information) is collected and comply with GDPR, CCPA, and other data protection regulations |
Step 2: Setting Up Our Python Environment
To start scraping TikTok, we need to set up our environment. We'll need to install a few Python libraries and understand the site's web structure.
Installing Python3
First, let's make sure we have Python3 installed on our machine. If not, we can grab it from the official Python website and follow the installation instructions.
Installing Python is easy as pie (see what I did there?)!
Creating Our Virtual Environment
Now that Python is ready, we should create a virtual environment to keep things organized. This way, our TikTok scraping project won't mess with other projects on our machine.
Here's how to create one:
python -m venv tiktok-scraping-env
source tiktok-scraping-env/bin/activate # On Windows use `tiktok-scraping-env\Scripts\activate`
Think of this virtual environment (tiktok-scraping-env) as a designated sandbox for our TikTok-scraping adventure!
Installing Required Libraries: Nodriver, Asyncio, BeautifulSoup
Python can lead the battle for our TikTok scraping adventure, but it can't go to the battlefront alone. To start scraping TikTok data, we'll need several Python libraries:
- Asyncio : The great multitasking maestro that will allow us to run operations asynchronously. We won't have to wait for each request to finish before sending the next.
- BeautifulSoup : Our data excavator, parsing HTML and extracting the juicy bits we need.
- Nodriver : The stealthy ninja of our team, allowing us to interact with web pages just like a real browser would, but without the overhead of a full browser instance. It's like having a VIP pass that lets us slip past the velvet ropes unnoticed.
Pro Tip: Having had my fair share of adventures using Selenium for web scraping, I can say that its undetected_chromedriver bypasses a fair number of anti-bot systems, including those from Cloudflare and Akamai. However, undetected_chromedriver can hit its limits against advanced anti-bot technology like TikTok's. This is where I summon Nodriver, its official successor.
Are you still feeling a bit overwhelmed by these tech terms? Don't sweat it! We've got your back. Here's your tech dictionary to check out and learn more, served with a side of fun:
Article | Description |
---|---|
BeautifulSoup Tutorial: Scraping Web Pages With Python | The soup kitchen of web scraping - learn how to extract data from web pages, turning you into a master data chef. |
Web Scraping Tutorial Using Selenium & Python (+ Examples) | Become a web-scraping maestro - it's like interacting with web pages, orchestrating your browser to perform scraping tasks. |
How To Use Undetected_chromedriver (Plus Working Alternatives) | Learn to be a scraping ninja - sneak past website defenses without breaking a sweat! |
Scraping With Nodriver: Step By Step Tutorial With Examples | Discover Nodriver - it's like scraping with an invisibility cloak! |
How To Use Asyncio To Scrape Websites With Python | Master the art of asynchronous scraping - it's like teaching your scraper to juggle multiple tasks! |
Pro Tip: Don't feel pressured to become an expert in all these tools overnight. In my scraping journey, I've found that understanding the basics of each is a great start. You can always dive deeper as you need. Remember, even web scraping ninjas started with "Hello World"!
Now, let's continue our adventure and summon these trusty sidekicks using pip:
pip install asyncio nodriver beautifulsoup4
Here, we install these libraries with a single command.
Step 3: Checking Our Setup
To ensure our code environment is set up correctly and good to go, let's write a small script to open a browser window and navigate to TikTok's homepage:
import nodriver as uc
import asyncio

async def main():
    try:
        print("Starting browser...")
        browser = await uc.start(headless=False)
        print("Navigating to TikTok Homepage...")
        page = await browser.get("https://www.tiktok.com/")
        print("TikTok page loaded. If you see the TikTok homepage, the setup is successful.")
        print("Waiting for 10 seconds before closing...")
        await asyncio.sleep(10)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Test finished.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())
Running this script should open a browser that navigates to TikTok's homepage, waits for 10 seconds, and then closes without issues:
If that happens, our environment and setup are complete and ready!
Scraping TikTok Profile Stats
Alright, folks! It's time to start our TikTok scraping adventure. But before we dive in, let me let you in on a little secret - we're not just randomly picking a TikTok profile to scrape. Oh no, we're much more considerate than that!
For this scraping escapade, we'll be using the handle @fpl_insights. And don't worry, this isn't some random, unsuspecting victim of our data curiosity. This handle belongs to a good friend of mine who's fully aware of our little digital reconnaissance mission. He's given us the green light to use his profile as our scraping guinea pig.
So, we're all set for guilt-free scraping!
Understanding TikTok Profile Structure
Let's put on our detective hats and explore the structure of a TikTok profile page. Trust me, this bit of investigation will make our scraping adventure much smoother, as we need to target the right elements and extract the data we need.
Key Data Points to Scrape: Identifying HTML Elements
In our TikTok profile treasure hunt, we're after these juicy bits of information:
- Username
- Display name
- Followers count
- Following count
- Likes count
- Bio
Now, let's see where these elements usually hang out on the TikTok profile page:
TikTok Profile Data Elements: Inspecting the Profile Page
To extract this information, we need to inspect the page and locate the correct HTML elements. TikTok uses specific data attributes that make our job easier.
Now, let's learn how to peek under TikTok's hood. Here's how:
- Open a TikTok profile page in the browser
- Right-click on the element to inspect
- Select 'Inspect' from the context menu
- Voila! The browser's Developer Tools will pop up, showing the HTML structure
Pro Tip: Using keyboard shortcuts can save tons of time. On most browsers, you can open Developer Tools with Ctrl+Shift+I (Windows/Linux) or Cmd+Option+I (Mac). It's like having a secret passageway into the website's structure!
Let's now go on a treasure hunt for our data points!
Extracting Username
To find the username:
- Right-click on the username
- Click 'Inspect' in the context menu:
From our Developer Tools, we can see something like:
<h1 data-e2e="user-title" class="...">...</h1>
The data-e2e="user-title" attribute marks the spot for the username!
Extracting Display Name
The display name is usually below the username:
The data-e2e="user-subtitle" attribute is our treasure map to the display name!
Extracting Followers Count
The data-e2e="followers-count" attribute is our key to unlocking the followers count!
Extracting Following Count
The data-e2e="following-count" attribute points us to the following count!
Extracting Likes Count
The data-e2e="likes-count" attribute is our treasure chest containing the likes count!
Extracting Bio
The data-e2e="user-bio" attribute is our final piece of treasure, leading us to the bio description!
From our screenshots, here's a breakdown of the HTML attributes we're targeting:
Data | HTML Attribute |
---|---|
Username | <h1 data-e2e="user-title"> |
Display name | <h2 data-e2e="user-subtitle"> |
Followers count | <strong data-e2e="followers-count"> |
Following count | <strong data-e2e="following-count"> |
Likes count | <strong data-e2e="likes-count"> |
Bio | <h2 data-e2e="user-bio"> |
Pro Tip: In my scraping adventures, I've found that TikTok, like many social media platforms, loves to play hide and seek with its HTML structure. They might change these data-e2e attributes in the future. If your scraper suddenly starts bringing home empty treasure chests, checking for changes in these attributes should be your first move. It's like they're constantly rearranging the furniture in their HTML house!
Now that we've got our treasure map (the HTML structure), we're ready to write some code that'll automatically do all this treasure hunting.
Isn't web scraping exciting? It's like being a digital archaeologist, unearthing data artifacts from the vast internet landscape.
Let's move on to the coding part and turn our manual exploration into automated magic!
Scraping TikTok Profile Stats: Step-by-Step Process
Let's break down our code step by step, shall we?
Step 1: Setting Up Our Imports
import asyncio
import nodriver as uc
from bs4 import BeautifulSoup
Here, we're importing our trusty sidekicks: asyncio for asynchronous operations, nodriver (aliased as uc) for stealthy browsing, and BeautifulSoup for parsing HTML.
Step 2: Defining Our Scraping Function
async def scrape_tiktok_profile(username):
    try:
        print(f"Initiating scrape for TikTok profile: @{username}")
        browser = await uc.start(headless=False)
        print("Browser started successfully")
        page = await browser.get(f"https://www.tiktok.com/@{username}")
        print("TikTok profile page loaded successfully")
        await asyncio.sleep(10)  # Wait for 10 seconds
        print("Waited for 10 seconds to allow content to load")
This function is the heart of our operation. It starts a browser (visible, not headless), navigates to the TikTok profile, and waits for the content to load. The await asyncio.sleep() call is like giving TikTok a moment to catch its breath before we start poking around.
Pro Tip: From my experience, I've learned that patience is key. That's why we have that 10-second wait after loading the page. It gives JavaScript time to work magic and load all the dynamic content. Without this wait, we might end up with incomplete data, which is about as useful as a chocolate teapot!
Step 3: Extracting the HTML
        html_content = await page.evaluate('document.documentElement.outerHTML')
        print(f"HTML content retrieved (length: {len(html_content)} characters)")
        soup = BeautifulSoup(html_content, 'html.parser')
        print("HTML content parsed with BeautifulSoup")
Here, we're grabbing all the HTML from the page (html_content = ...) and feeding it to BeautifulSoup (soup = BeautifulSoup(...)). It's like taking a snapshot of the webpage and handing it to our data excavator.
Step 4: Mining for Gold (Data)
        profile_info = {
            'username': soup.select_one('h1[data-e2e="user-title"]').text.strip() if soup.select_one('h1[data-e2e="user-title"]') else None,
            'display_name': soup.select_one('h2[data-e2e="user-subtitle"]').text.strip() if soup.select_one('h2[data-e2e="user-subtitle"]') else None,
            'follower_count': soup.select_one('strong[data-e2e="followers-count"]').text.strip() if soup.select_one('strong[data-e2e="followers-count"]') else None,
            'following_count': soup.select_one('strong[data-e2e="following-count"]').text.strip() if soup.select_one('strong[data-e2e="following-count"]') else None,
            'like_count': soup.select_one('strong[data-e2e="likes-count"]').text.strip() if soup.select_one('strong[data-e2e="likes-count"]') else None,
            'bio': soup.select_one('h2[data-e2e="user-bio"]').text.strip() if soup.select_one('h2[data-e2e="user-bio"]') else None
        }
        print("Profile information extracted successfully")
        return profile_info
This is where the magic happens! We're using BeautifulSoup's select_one method to pluck each element from the page and strip() to clean up its text. It's like a treasure hunt, using the data-e2e attributes we identified earlier as our map to the profile information and stats.
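Note that each dictionary entry calls select_one twice (once to check, once to extract). A small hypothetical helper, not part of the original script, can tidy this up and works with any object exposing a select_one method:

```python
def safe_text(soup, selector):
    """Return the stripped text of the first element matching selector, or None."""
    element = soup.select_one(selector)
    return element.text.strip() if element else None

# With this helper, each dictionary entry becomes a single call, e.g.:
# 'username': safe_text(soup, 'h1[data-e2e="user-title"]')
```

This keeps the extraction dictionary readable and evaluates each selector only once.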
Step 5: Error Handling and Cleanup
    except Exception as e:
        print(f"An error occurred while scraping: {str(e)}")
        return None
    finally:
        if 'browser' in locals():
            browser.stop()
            print("Browser closed")
Even the best-laid plans can go awry sometimes, so we've got this try-except block to catch any hiccups. And no matter what happens, the finally block makes sure our browser gets closed; we're responsible digital citizens, after all!
Step 6: Calling the Main Function
async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)
    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())
This is our main function, which orchestrates the whole operation. Here, we declare our username (fpl_insights, in this case), call our scraping function (scrape_tiktok_profile), and print out the results if successful. It's like the director of our little data heist movie!
Step 7: Putting It All Together
Now that we've explored each piece of our TikTok scraping puzzle, it's time to assemble our digital Voltron!
Here's the complete code that brings all our steps together:
import asyncio
import nodriver as uc
from bs4 import BeautifulSoup

async def scrape_tiktok_profile(username):
    try:
        print(f"Initiating scrape for TikTok profile: @{username}")
        browser = await uc.start(headless=False)
        print("Browser started successfully")
        page = await browser.get(f"https://www.tiktok.com/@{username}")
        print("TikTok profile page loaded successfully")
        await asyncio.sleep(10)  # Wait for 10 seconds
        print("Waited for 10 seconds to allow content to load")
        html_content = await page.evaluate('document.documentElement.outerHTML')
        print(f"HTML content retrieved (length: {len(html_content)} characters)")
        soup = BeautifulSoup(html_content, 'html.parser')
        print("HTML content parsed with BeautifulSoup")
        profile_info = {
            'username': soup.select_one('h1[data-e2e="user-title"]').text.strip() if soup.select_one('h1[data-e2e="user-title"]') else None,
            'display_name': soup.select_one('h2[data-e2e="user-subtitle"]').text.strip() if soup.select_one('h2[data-e2e="user-subtitle"]') else None,
            'follower_count': soup.select_one('strong[data-e2e="followers-count"]').text.strip() if soup.select_one('strong[data-e2e="followers-count"]') else None,
            'following_count': soup.select_one('strong[data-e2e="following-count"]').text.strip() if soup.select_one('strong[data-e2e="following-count"]') else None,
            'like_count': soup.select_one('strong[data-e2e="likes-count"]').text.strip() if soup.select_one('strong[data-e2e="likes-count"]') else None,
            'bio': soup.select_one('h2[data-e2e="user-bio"]').text.strip() if soup.select_one('h2[data-e2e="user-bio"]') else None
        }
        print("Profile information extracted successfully")
        return profile_info
    except Exception as e:
        print(f"An error occurred while scraping: {str(e)}")
        return None
    finally:
        if 'browser' in locals():
            browser.stop()
            print("Browser closed")

async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)
    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())
Pro Tip: From my coding experience, I've found that it's crucial to add plenty of print statements (or proper logging) throughout your scraper. They act like breadcrumbs, showing exactly where things went wrong when a run fails.
When we run our script, it'll work its magic and produce output similar to this:
And there we have it, folks! We've just walked through a TikTok profile scraper that's stealthy, efficient, and, most importantly, consensual. Remember, always get permission before scraping someone's data. It's not just good manners; it's often a legal requirement!
Storing Our Scraped TikTok Profile Data
Now that we've successfully scraped our TikTok profile data, it's time to talk about storing this digital gold. Remember, data is only as valuable as it is accessible and analyzable.
Let's explore some options to make storing our adventure harvests a breeze!
Option 1: Storing Data in JSON
JSON (JavaScript Object Notation) is like the Swiss Army knife of data formats - versatile, readable, and widely supported. It's a lightweight data interchange format that's easy for humans to read and write and for machines to parse and generate.
Here's how we can continue our code and save our TikTok treasure in JSON:
import json

# Add JSON storage to the existing main function
async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)
    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")

        # Save to a JSON file
        with open(f"{username}_profile.json", 'w', encoding='utf-8') as f:
            json.dump(profile_info, f, ensure_ascii=False, indent=4)
        print(f"Profile information saved to {username}_profile.json")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())
This will create a file {username}_profile.json with our scraped data in a nicely formatted JSON structure that's as easy to read as your favorite book.
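It's also worth sanity-checking that the JSON round trip preserves every field. Here's a minimal standalone sketch using a sample record with the same shape as our profile_info dictionary (the values are illustrative, not real scraped data):

```python
import json
import os
import tempfile

# A sample record shaped like our scraped profile_info (values are illustrative)
profile_info = {"username": "fpl_insights", "follower_count": "19.2K", "bio": "FPL tips"}

path = os.path.join(tempfile.gettempdir(), "fpl_insights_profile.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(profile_info, f, ensure_ascii=False, indent=4)

# Read it back to confirm nothing was lost in serialization
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == profile_info)  # True
```

Because ensure_ascii=False is set, any non-ASCII characters in bios survive the trip unescaped.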
Option 2: Storing Data in CSV
CSV (Comma-Separated Values) is the darling of spreadsheet enthusiasts. It's another hugely popular format, especially useful when scraping multiple profiles and analyzing the data in spreadsheet software like Microsoft Excel or Google Sheets.
Here's how to save our TikTok data in CSV:
import csv

# Add CSV storage to the existing main function
async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)
    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")

        # Save to a CSV file
        csv_filename = f"{username}_profile.csv"
        with open(csv_filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=profile_info.keys())
            writer.writeheader()
            writer.writerow(profile_info)
        print(f"Profile information saved to {csv_filename}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())
This will create a file {username}_profile.csv with our scraped data in CSV format that'll make any data analyst's heart skip a beat.
Option 3: Storing Data in SQLite Database
For the data hoarders among us (I see you, and I salute you), storing data in a SQLite database is like having our own personal Fort Knox of information. SQLite is a lightweight, serverless database engine perfect for small to medium-sized projects.
Let's now store our hard-earned profile stats using SQLite:
import sqlite3

# Add SQLite storage to the existing main function
async def main():
    username = "fpl_insights"
    profile_info = await scrape_tiktok_profile(username)
    if profile_info:
        print("\nProfile Information:")
        for key, value in profile_info.items():
            print(f"{key.replace('_', ' ').title()}: {value}")

        # Save to SQLite database
        db_name = f"{username}_tiktok_profile_stat.db"
        conn = sqlite3.connect(db_name)
        cursor = conn.cursor()

        # Create table if it doesn't exist
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS tiktok_profiles
            (username TEXT PRIMARY KEY, display_name TEXT, follower_count INTEGER,
             following_count INTEGER, like_count INTEGER, bio TEXT)
        ''')

        # Insert data
        cursor.execute('''
            INSERT OR REPLACE INTO tiktok_profiles
            (username, display_name, follower_count, following_count, like_count, bio)
            VALUES (:username, :display_name, :follower_count, :following_count, :like_count, :bio)
        ''', profile_info)
        conn.commit()
        conn.close()
        print(f"Profile information saved to SQLite database: {db_name}")
    else:
        print("Failed to scrape profile information.")

if __name__ == "__main__":
    uc.loop().run_until_complete(main())
This will create an SQLite database file {username}_tiktok_profile_stat.db and store our scraped data in a table tiktok_profiles, ready for all our querying needs.
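One caveat: the counts we scraped are strings like "19.2K", while the table declares INTEGER columns. SQLite's dynamic typing will happily store the text anyway, but numeric queries (sorting, averaging) work better on real integers. A small hypothetical helper, not part of the original script, can normalize them before insertion:

```python
def parse_count(value):
    """Convert a TikTok-style abbreviated count like '19.2K' or '1.5M' to an int."""
    if value is None:
        return None
    value = value.strip().upper()
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    if value and value[-1] in multipliers:
        # round() guards against float artifacts, e.g. 19.2 * 1000 = 19199.999...
        return int(round(float(value[:-1]) * multipliers[value[-1]]))
    try:
        return int(value.replace(",", ""))
    except ValueError:
        return None

print(parse_count("19.2K"))  # 19200
print(parse_count("1.5M"))   # 1500000
print(parse_count("847"))    # 847
```

Applying this to follower_count, following_count, and like_count before the INSERT keeps the stored values genuinely numeric.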
Pro Tip: In my years of data wrangling, I've learned that there's no one-size-fits-all solution for data storage. JSON is great for maintaining the data's original structure, CSV is perfect for quick spreadsheet analysis, and SQLite allows for more complex querying. My advice? Use them all! Each has its strengths, and having your data in multiple formats means you're always prepared, no matter what analysis adventure comes your way.
Scraping TikTok Profile: Downloading Videos and Their Stats
Alright, TikTok explorers! We've conquered the profile structure, and now it's time to dive into the real treasure trove: the videos themselves. Buckle up, because we're about to embark on a data-mining adventure that would make even Indiana Jones jealous!
Understanding TikTok Videos Structure
Before we start our code-powered excavation, let's take a moment to understand the landscape of TikTok videos. It's like studying a map before venturing into uncharted territory. Trust me, it'll save us from falling into any data pitfalls!
Key Video Data Points to Scrape
In our TikTok video expedition, we're after these golden nuggets of information for each video:
- Video URL
- Video Description
- Music Title
- Posted Date
- Tags
- Views count
- Likes count
- Comments count
- Shares count
- Bookmarks count
Pro Tip: I can tell you that understanding the structure of your target is half the battle won. It's like having X-ray vision for websites!
Let's take a closer look at where these elements typically reside on a TikTok video page:
TikTok Video Data Elements: Inspecting the Video Page
Now, let's put on our digital archaeologist hats and start digging into the HTML structure. Here's how we'll unearth these precious data elements:
- Open a TikTok video page in a browser
- Right-click on the element to inspect
- Select 'Inspect' from the context menu
- Voila! The browser's Developer Tools will appear, revealing the HTML structure
Let's go treasure-hunting for our video data points!
Extracting Views Count
The views count is typically displayed prominently on the profile video page:
Extracting Video Description and Hashtags
The video description and hashtags are typically displayed together:
As we can see, the video description sits within a <span> element with a unique class css-j2a19r-SpanText. The hashtags are separate elements, but they all share the same attribute data-e2e="search-common-link".
Extracting Music Title
Extracting Posted Date
We can see that the date is isolated in the last <span> element within a parent element that has the attribute data-e2e="browser-nickname". Therefore, we'll use the last-child selector to target the date's section: span[data-e2e="browser-nickname"] span:last-child.
Extracting Likes, Comments, Shares, and Bookmarks Counts
These engagement metrics are usually grouped together:
From our digital excavation, here's a summary of the CSS selectors we're targeting:
Data | Selectors |
---|---|
Video URL | <meta property="og:url"> |
Video Description | ['span.css-j2a19r-SpanText'] |
Music Title | ['.css-pvx3oa-DivMusicText'] |
Posted Date | ['span[data-e2e="browser-nickname"] span:last-child'] |
Tags | [data-e2e="search-common-link"] |
Views Count | [data-e2e="video-views"] |
Likes Count | [data-e2e="like-count"] |
Comments Count | [data-e2e="comment-count"] |
Shares Count | [data-e2e="share-count"] |
Bookmarks Count | [data-e2e="undefined-count"] |
Pro Tip: From my experience, I've noticed that TikTok, like a mischievous genie, occasionally changes its HTML structure. If your scraper suddenly starts returning empty data, checking for updates in these selectors should be your first wish!
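To make those inevitable updates painless, the selectors from the table can be gathered into a single dictionary, a hypothetical constant rather than part of the original code, so a TikTok markup change means editing exactly one place:

```python
# Hypothetical constant collecting the CSS selectors from the table above;
# update these values if TikTok changes its markup.
VIDEO_SELECTORS = {
    "description": "span.css-j2a19r-SpanText",
    "music": ".css-pvx3oa-DivMusicText",
    "posted_date": 'span[data-e2e="browser-nickname"] span:last-child',
    "tags": '[data-e2e="search-common-link"]',
    "views": '[data-e2e="video-views"]',
    "likes": '[data-e2e="like-count"]',
    "comments": '[data-e2e="comment-count"]',
    "shares": '[data-e2e="share-count"]',
    "bookmarks": '[data-e2e="undefined-count"]',
}

# The scraper can then iterate over the map instead of hard-coding each lookup
for name, selector in VIDEO_SELECTORS.items():
    print(f"{name}: {selector}")
```

When the scraper starts returning empty fields, diffing this one dictionary against the live page is the fastest fix.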
Now that we've mapped out our digital dig site, it's time to craft the enchanted tools (code) that will automate our treasure hunt.
Crafting Our TikTok Video Scraper: Step-by-Step Process
Let's roll up our sleeves and start coding our TikTok video scraper. Brace yourselves, fellow data enthusiasts, because we're about to embark on a really exciting journey!
Step 1: Installing Playwright
First things first, we need to get our digital workbench ready. This time, we're calling in the big guns: Playwright!
Why Playwright?
Playwright is like the Swiss Army knife of web automation. It offers several advantages over Selenium or Nodriver:
- Better handling of dynamic content: Playwright excels at interacting with JavaScript-heavy sites like TikTok, where content loads dynamically (content that might change and flicker like a firefly).
- Cross-browser support: It allows us to use Chromium, Firefox, or WebKit with the same code.
- Powerful selectors: Playwright's built-in selectors are more robust, making it easier to target elements even when the page structure changes.
- Network interception: We can even decide to modify or mock network requests, which is helpful for bypassing some anti-bot measures.
- Auto-wait functionality: Playwright automatically waits for elements to be ready before interacting with them, reducing the need for explicit waits.
It's like upgrading from a rowboat to a modern sailboat - we're still navigating the same waters, but with much better tools to handle the currents and winds of web scraping!
Let's now use pip to send an invite to Playwright for our party:
pip install playwright
playwright install
Here, we install the playwright package, and the playwright install command downloads the browser binaries it depends on.
Hungry to know more about Playwright? Check out our separate guide on Scraping The Web With Playwright .
Step 2: Installing the yt-dlp Library
We also need to equip ourselves with a powerful tool for video downloading: the yt-dlp library. This library is essential for downloading the video content from TikTok (if we decide to).
Let's now extend our party invitation to yt-dlp with pip:
pip install yt-dlp
This will download and install the yt-dlp
library, preparing our Python environment for the video downloading capabilities we'll be implementing in our scraper.
Step 3: Setting Up Our Imports
Let's start our code by importing the necessary libraries. It's like preparing our adventurer's backpack before the journey.
from playwright.async_api import async_playwright
import asyncio, random, json, logging, time, os, yt_dlp
Let's take a minute to understand our library imports:
Library | Purpose |
---|---|
playwright.async_api | Provides tools for browser automation and web scraping |
asyncio | Enables asynchronous programming for efficient I/O operations |
random | Generates random numbers for sleep intervals |
json | Handles JSON data for storing and parsing scraped information |
logging | Sets up logging for tracking our scraper's progress |
time | Provides time-related functions for adding delays |
os | Interacts with the operating system for file operations |
yt_dlp | A fork of youtube-dl for downloading videos |
Pro Tip: From my scraping experience, I can tell you that combining these libraries is like assembling a crack team of specialists for our TikTok expedition. Each one brings a unique skill to the table, making our TikTok data expedition smooth and efficient!
Step 4: Configuring Our Scraper
We begin our TikTok video scraping journey by setting up a robust Python logging system. This is crucial for monitoring our scraper's progress and troubleshooting any issues that may arise:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Set to True to download videos
DOWNLOAD_VIDEOS = False
Here, we're setting up logging to monitor our scraper's progress. Think of it as establishing a mission control center for our data expedition. We configure logging to output informative messages about each step of the scraping process, allowing us to track our progress in real-time and maintain a record of our scraper's activities.
The DOWNLOAD_VIDEOS toggle is our secret weapon. Set it to True, and our scraper will not only gather data but also download all the videos in the TikTok profile to a directory. It's like choosing between taking notes at a museum or sneaking out with the exhibits!
Step 5: Crafting Our Stealth Techniques
From my experience in web scraping, patience is a virtue. We don't want to overwhelm TikTok's servers (or look suspiciously bot-like), so let's create a function to add random pauses and page scrolls to act more human-like and avoid detection:
async def random_sleep(min_seconds, max_seconds):
await asyncio.sleep(random.uniform(min_seconds, max_seconds))
async def scroll_page(page):
await page.evaluate("""
window.scrollBy(0, window.innerHeight);
""")
await random_sleep(1, 2)
These functions will help our scraper mimic human behavior. It's like teaching a robot to do the "scroll and pause" dance!
Step 6: Dealing With CAPTCHAs
TikTok's bouncers are always on the lookout for bots.
CAPTCHAs are like the final boss in a video game. We can't avoid them, but we can prepare for them. Therefore, we need to devise a way to handle them:
async def handle_captcha(page):
try:
# Check for the presence of the CAPTCHA dialog
captcha_dialog = page.locator('div[role="dialog"]')
is_captcha_present = await captcha_dialog.count() > 0 and await captcha_dialog.is_visible()
if is_captcha_present:
logging.info("CAPTCHA detected. Please solve it manually.")
# Wait for the CAPTCHA to be solved
await page.wait_for_selector('div[role="dialog"]', state='detached', timeout=300000) # 5 minutes timeout
logging.info("CAPTCHA solved. Resuming script...")
await asyncio.sleep(0.5) # Short delay after CAPTCHA is solved
except Exception as e:
logging.error(f"Error while handling CAPTCHA: {str(e)}")
When the script detects a CAPTCHA (is_captcha_present), this function pauses our scraper and allows us to prove we're human by manually solving the challenge. It's like having a personal assistant who taps you on the shoulder when they need your expertise!
After we solve the CAPTCHA ("Hey TikTok, look, I can identify traffic lights too!"), the script continues automatically, and the CAPTCHA rarely reappears thanks to Playwright's persistent browser session and our human-like behavior.
Pro Tip: While our current approach involves manually solving the CAPTCHA, there are more advanced strategies to level up your scraping game:
- Automated CAPTCHA Solving: For those who want to take their scraping to the next level, our guide on How To Bypass ReCAPTCHA & HCaptcha When Web Scraping is a must-read. It's like having a digital locksmith in your toolkit!
- Scraping APIs: For a truly hands-off approach, consider using a scraping API like ScrapingBee . It's the equivalent of hiring a professional team to handle CAPTCHAs and other anti-bot measures for you, providing relief and efficiency.
Step 7: Navigating to the Video Page
Next, we need to visit each video's page:
async def extract_video_info(page, video_url, views):
await page.goto(video_url, wait_until="networkidle")
await random_sleep(2, 4)
await handle_captcha(page)
This step sets the stage for data extraction like a digital archaeologist carefully approaching an ancient artifact.
Step 8: Defining Our Data Extraction Function
Now, let's craft a versatile text extraction function to extract the text content:
video_info = await page.evaluate("""
() => {
const getTextContent = (selectors) => {
for (let selector of selectors) {
const elements = document.querySelectorAll(selector);
for (let element of elements) {
const text = element.textContent.trim();
if (text) return text;
}
}
return 'N/A';
};
// Function will be used in the next step
}
""")
Here, we're handling the dynamic nature of TikTok's web content by attempting to extract data using multiple selectors. This approach ensures we don't miss any valuable information due to slight changes in TikTok's HTML structure.
Our function patiently tries each selector, moving on to the next if one fails, until it successfully retrieves the desired content.
Step 9: Extracting Video Metrics
Let's put our extraction function to work:
video_info = await page.evaluate("""
() => {
// ... (getTextContent function from previous step)
const getTags = () => {
const tagElements = document.querySelectorAll('a[data-e2e="search-common-link"]');
return Array.from(tagElements).map(el => el.textContent.trim());
};
return {
likes: getTextContent(['[data-e2e="like-count"]', '[data-e2e="browse-like-count"]']),
comments: getTextContent(['[data-e2e="comment-count"]', '[data-e2e="browse-comment-count"]']),
shares: getTextContent(['[data-e2e="share-count"]']),
bookmarks: getTextContent(['[data-e2e="undefined-count"]']),
description: getTextContent(['span.css-j2a19r-SpanText']),
musicTitle: getTextContent(['.css-pvx3oa-DivMusicText']),
date: getTextContent(['span[data-e2e="browser-nickname"] span:last-child']),
tags: getTags()
};
}
""")
video_info['views'] = views
video_info['url'] = video_url
logging.info(f"Extracted info for {video_url}: {video_info}")
return video_info
Here, we're like data miners, extracting precious metrics from the TikTok bedrock. Remember those juicy attributes (likes, comments, shares, bookmarks, description, musicTitle, date, and tags) we collected earlier from our page inspection? We put them to use here.
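One wrinkle worth noting: TikTok serves these metrics as display strings like "1.2M" or "45.3K" rather than raw numbers. If you plan to sort or aggregate them later, a small parsing helper can convert them to integers. This is my own addition, not part of the scraper above:

```python
def parse_count(text):
    """Turn TikTok-style abbreviated counts ('1.2M', '45.3K', '987') into ints.

    Returns None for missing values such as 'N/A'.
    """
    multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    text = (text or '').strip().replace(',', '')
    if not text or text == 'N/A':
        return None
    suffix = text[-1].upper()
    if suffix in multipliers:
        # round() guards against float artifacts like 45299.999...
        return int(round(float(text[:-1]) * multipliers[suffix]))
    try:
        return int(float(text))
    except ValueError:
        return None

print(parse_count('1.2M'))  # 1200000
```

You could call this on the likes, comments, shares, and views fields right after extraction, or defer it to analysis time and keep the raw strings in your JSON.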
Step 10: Preparing for Video Download
To download the TikTok profile videos, we also need to set up a downloader function using the yt-dlp library we installed earlier:
def download_tiktok_video(video_url, save_path):
print(f"Preparing to download video: {video_url}")
ydl_opts = {
'outtmpl': os.path.join(save_path, '%(id)s.%(ext)s'),
'format': 'best',
}
This step is like packing our gear before a big expedition. We're getting ready to capture TikTok videos in their natural habitat!
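For reference, yt-dlp accepts many more options than the two we use. A few that can make downloads politer and quieter are sketched below; the option names are from yt-dlp's documentation, and save_path here is just a hypothetical directory:

```python
import os

save_path = "downloads"  # hypothetical target directory

# An extended ydl_opts sketch; see yt-dlp's documentation for the full list
ydl_opts = {
    'outtmpl': os.path.join(save_path, '%(id)s.%(ext)s'),
    'format': 'best',        # same choice as our scraper
    'ratelimit': 1_000_000,  # cap download speed at ~1 MB/s to stay gentle
    'retries': 3,            # retry transient network errors
    'quiet': True,           # suppress yt-dlp's own console output; we log ourselves
}
```

Throttling and retries matter more on long profile crawls, where one flaky download shouldn't sink the whole expedition.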
Step 11: Downloading the Video
Now, let's actually download the video file:
try:
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
info = ydl.extract_info(video_url, download=True)
filename = ydl.prepare_filename(info)
logging.info(f"Video successfully downloaded: {filename}")
return filename
except Exception as e:
logging.error(f"Error downloading video: {str(e)}")
return None
This is where the magic happens. Using the yt_dlp library, we're bottling TikTok lightning, saving videos for offline analysis.
Note: Downloading videos is like capturing lightning in a bottle. It's exciting, but for other profiles, we need to be careful! Make sure you have permission and respect copyright laws. Remember, we don't want our data science to turn into a legal drama!
Step 12: Initializing Our Scraping Session
We can now set up our scraping environment:
# Defining an asynchronous function that will handle the scraping process for a given username
async def scrape_tiktok_profile(username, videos=None):  # default None avoids Python's mutable-default-argument pitfall
print(f"Starting scrape for user: {username}")
# Creating a context manager for Playwright, ensuring proper setup and teardown of resources
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False) # Launches a visible Chromium browser instance
# Putting on our disguise to blend in with the TikTok crowd
context = await browser.new_context(
viewport={'width': 1280, 'height': 720}, # Adjusting our binoculars for the perfect view of TikTok's landscape
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', # Crafting the perfect backstory for our undercover agent
)
page = await context.new_page() # Opens a new page in the browser
We're setting up our digital command center, ready to explore the TikTok wilderness. Each line of code is like preparing a different piece of equipment for our data expedition. From launching our browser (our trusty steed) to setting our disguise ( user_agent ), we're gearing up for a successful journey into the heart of TikTok's data-rich terrain.
Step 13: Navigating to the TikTok Profile
Time to visit the TikTok profile we're interested in using the page.goto method:
url = f"https://www.tiktok.com/@{username}"
print(f"Navigating to profile: {url}")
await page.goto(url, wait_until="networkidle")
await handle_captcha(page)
This is like arriving at our destination. We're at the doorstep of TikTok treasures! We also call our handle_captcha function here, in case TikTok challenges us on arrival.
Step 14: Processing Each Video
Now, let's process each video we find:
for i, video in enumerate(videos):
if 'likes' not in video:
print(f"Processing video {i+1}/{len(videos)}")
video_info = await extract_video_info(page, video['url'], video['views'])
videos[i].update(video_info)
if DOWNLOAD_VIDEOS:
save_path = os.path.join(os.getcwd(), username)
filename = download_tiktok_video(video['url'], save_path)
if filename:
videos[i]['local_filename'] = filename
This is the heart of our operation. We're examining each TikTok video like a jeweler appraising precious gems.
We're navigating to the profile, scrolling through videos, and extracting information using our extract_video_info function. If we set DOWNLOAD_VIDEOS to True, we run the downloader function (download_tiktok_video) here as well.
Step 15: Saving Our Progress
The entire process can take a while. Let's ensure we don't lose our hard-earned data if anything interrupts the process. So, let's save our progress every 10 videos into a file in JSON format:
# Save progress every 10 videos
if (i + 1) % 10 == 0:
with open(f"{username}_progress.json", "w") as f:
json.dump(videos, f, indent=2)
print(f"Progress saved. Processed {i+1}/{len(videos)} videos.")
await random_sleep(3, 5)
await browser.close()
return videos
Regularly saving our progress is like creating save points in a video game. If any interruption happens at any time, we don't want to lose our digital loot that has taken so much time!
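One defensive variation worth considering (my own sketch, not part of the tutorial's code): the save above writes directly to the progress file, so if the script is killed mid-write, the JSON can end up truncated. Writing to a temporary file and atomically renaming it avoids that:

```python
import json
import os
import tempfile

def save_progress_atomically(videos, path):
    """Write JSON to a temp file in the target directory, then atomically
    replace the destination, so a crash mid-write never corrupts the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(videos, f, indent=2)
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file if anything failed
        raise
```

Swapping this in for the `with open(...)` save keeps our save points crash-proof without changing anything else in the loop.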
Step 16: Running Our Scraper
Last but not least, let's fire up our scraper with a main function that brings everything together:
async def main():
username = "fpl_insights"
# Attempt to load previously saved progress
try:
with open(f"{username}_progress.json", "r") as f:
videos = json.load(f)
logging.info(f"Loaded progress. {len([v for v in videos if 'likes' in v])}/{len(videos)} videos already processed.")
except FileNotFoundError:
# If no previous progress is found, start with an empty list
videos = []
# Create a directory for video downloads if DOWNLOAD_VIDEOS is True
if DOWNLOAD_VIDEOS:
os.makedirs(username, exist_ok=True)
os.chdir(username)
# Start the scraping process
videos = await scrape_tiktok_profile(username, videos)
# Log the total number of scraped videos
logging.info(f"\nTotal videos scraped: {len(videos)}")
# Save the scraped data to a JSON file
with open(f"{username}_playwright_video_stats.json", "w") as f:
json.dump(videos, f, indent=2)
logging.info(f"Data saved to {username}_playwright_video_stats.json")
if __name__ == "__main__":
# Run the main function asynchronously
asyncio.run(main())
First, we try to load a previously saved progress file for the given username (in case of an earlier interruption). If the file exists, we load the existing data to see how many videos have already been processed. If it doesn't, we start afresh with an empty list (videos = []).
Next, if we set DOWNLOAD_VIDEOS to True, we create a directory for video downloads and navigate into it. Then, we kick off the scraping process by calling scrape_tiktok_profile with the username and the list of videos, collecting all the juicy data.
After that, we log the total number of scraped videos and save all the data into a JSON file named after the username, ensuring our hard-earned data is safely stored. Finally, we run the main function asynchronously, orchestrating the entire scraping operation.
Step 17: Putting It All Together
Now, let's assemble all our code snippets into one magnificent TikTok scraping machine:
from playwright.async_api import async_playwright
import asyncio, random, json, logging, time, os, yt_dlp
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
DOWNLOAD_VIDEOS = True
async def random_sleep(min_seconds, max_seconds):
await asyncio.sleep(random.uniform(min_seconds, max_seconds))
async def scroll_page(page):
await page.evaluate("""
window.scrollBy(0, window.innerHeight);
""")
await random_sleep(1, 2)
async def handle_captcha(page):
try:
# Check for the presence of the CAPTCHA dialog
captcha_dialog = page.locator('div[role="dialog"]')
is_captcha_present = await captcha_dialog.count() > 0 and await captcha_dialog.is_visible()
if is_captcha_present:
logging.info("CAPTCHA detected. Please solve it manually.")
# Wait for the CAPTCHA to be solved
await page.wait_for_selector('div[role="dialog"]', state='detached', timeout=300000) # 5 minutes timeout
logging.info("CAPTCHA solved. Resuming script...")
await asyncio.sleep(0.5) # Short delay after CAPTCHA is solved
except Exception as e:
logging.error(f"Error while handling CAPTCHA: {str(e)}")
async def hover_and_get_views(page, video_element):
await video_element.hover()
await random_sleep(0.5, 1)
views = await video_element.evaluate("""
(element) => {
const viewElement = element.querySelector('strong[data-e2e="video-views"]');
return viewElement ? viewElement.textContent.trim() : 'N/A';
}
""")
return views
async def extract_video_info(page, video_url, views):
await page.goto(video_url, wait_until="networkidle")
await random_sleep(2, 4)
await handle_captcha(page)
video_info = await page.evaluate("""
() => {
const getTextContent = (selectors) => {
for (let selector of selectors) {
const elements = document.querySelectorAll(selector);
for (let element of elements) {
const text = element.textContent.trim();
if (text) return text;
}
}
return 'N/A';
};
const getTags = () => {
const tagElements = document.querySelectorAll('a[data-e2e="search-common-link"]');
return Array.from(tagElements).map(el => el.textContent.trim());
};
return {
likes: getTextContent(['[data-e2e="like-count"]', '[data-e2e="browse-like-count"]']),
comments: getTextContent(['[data-e2e="comment-count"]', '[data-e2e="browse-comment-count"]']),
shares: getTextContent(['[data-e2e="share-count"]']),
bookmarks: getTextContent(['[data-e2e="undefined-count"]']),
description: getTextContent(['span.css-j2a19r-SpanText']),
musicTitle: getTextContent(['.css-pvx3oa-DivMusicText']),
date: getTextContent(['span[data-e2e="browser-nickname"] span:last-child']),
tags: getTags()
};
}
""")
video_info['views'] = views
video_info['url'] = video_url
logging.info(f"Extracted info for {video_url}: {video_info}")
return video_info
def download_tiktok_video(video_url, save_path):
ydl_opts = {
'outtmpl': os.path.join(save_path, '%(id)s.%(ext)s'),
'format': 'best',
}
try:
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
info = ydl.extract_info(video_url, download=True)
filename = ydl.prepare_filename(info)
logging.info(f"Video successfully downloaded: {filename}")
return filename
except Exception as e:
logging.error(f"Error downloading video: {str(e)}")
return None
async def scrape_tiktok_profile(username, videos=None):  # default None avoids the mutable-default pitfall
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context(
viewport={'width': 1280, 'height': 720},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
)
page = await context.new_page()
url = f"https://www.tiktok.com/@{username}"
await page.goto(url, wait_until="networkidle")
await handle_captcha(page)
if not videos:
videos = []
last_video_count = 0
no_new_videos_count = 0
start_time = time.time()
timeout = 300 # 5 minutes timeout
while True:
await scroll_page(page)
await handle_captcha(page)
video_elements = await page.query_selector_all('div[data-e2e="user-post-item"]')
for element in video_elements:
video_url = await element.evaluate('(el) => el.querySelector("a").href')
if any(video['url'] == video_url for video in videos):
continue
views = await hover_and_get_views(page, element)
videos.append({'url': video_url, 'views': views})
logging.info(f"Found {len(videos)} unique videos so far")
if len(videos) == last_video_count:
no_new_videos_count += 1
else:
no_new_videos_count = 0
last_video_count = len(videos)
if no_new_videos_count >= 3 or time.time() - start_time > timeout:
break
logging.info(f"Found a total of {len(videos)} videos")
for i, video in enumerate(videos):
if 'likes' not in video:
video_info = await extract_video_info(page, video['url'], video['views'])
videos[i].update(video_info)
logging.info(f"Processed video {i+1}/{len(videos)}: {video['url']}")
if DOWNLOAD_VIDEOS:
save_path = os.path.join(os.getcwd(), username)
filename = download_tiktok_video(video['url'], save_path)
if filename:
videos[i]['local_filename'] = filename
# Save progress every 10 videos
if (i + 1) % 10 == 0:
with open(f"{username}_progress.json", "w") as f:
json.dump(videos, f, indent=2)
logging.info(f"Progress saved. Processed {i+1}/{len(videos)} videos.")
await random_sleep(3, 5)
await browser.close()
return videos
async def main():
username = "fpl_insights"
# Try to load progress
try:
with open(f"{username}_progress.json", "r") as f:
videos = json.load(f)
logging.info(f"Loaded progress. {len([v for v in videos if 'likes' in v])}/{len(videos)} videos already processed.")
except FileNotFoundError:
videos = []
if DOWNLOAD_VIDEOS:
os.makedirs(username, exist_ok=True)
os.chdir(username)
videos = await scrape_tiktok_profile(username, videos)
logging.info(f"\nTotal videos scraped: {len(videos)}")
# Save as JSON
with open(f"{username}_playwright_video_stats.json", "w") as f:
json.dump(videos, f, indent=2)
print(f"Data saved to {username}_playwright_video_stats.json")
logging.info(f"Data saved to {username}_playwright_video_stats.json")
if __name__ == "__main__":
asyncio.run(main())
In my years of code wrangling, I've found that keeping your entire script in one file makes it easier to manage and share. It's like having a Swiss Army knife instead of a scattered toolbox!
Step 18: Running Our TikTok Scraper
We're all set! Let's fire up our scraper and watch the TikTok data roll in.
First, try a run with video downloads enabled (DOWNLOAD_VIDEOS = True). The process might take a while, especially for a profile with tons of videos. But remember, it's like slow-cooking data - the longer it simmers, the richer the insights you'll get!
Now, let's run it again with video downloads turned off (DOWNLOAD_VIDEOS = False). What a journey! Our TikTok scraper is now exploring the digital wilderness, bringing back valuable data insights. It's like we've built an automatic data vacuum cleaner for TikTok videos!
Exporting TikTok Video Data to CSV
By default, we're already storing our scraped data in JSON format, which is great for maintaining data structure and compatibility with many data processing tools. However, let's add an option to export our TikTok treasure trove to a CSV file.
Let's continue our code and slightly modify our main() function to include this feature:
import csv
async def main():
# ... (previous code remains the same)
videos = await scrape_tiktok_profile(username, videos)
print(f"\nTotal videos scraped: {len(videos)}")
# Save as CSV
csv_filename = f"{username}_playwright_video_stats.csv"
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
# Define all fields, including the list-valued 'tags'
fieldnames = ['url', 'views', 'likes', 'comments', 'shares', 'bookmarks',
'description', 'musicTitle', 'date', 'tags']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for video in videos:
# Handle 'tags' separately as it's a list
video_data = {k: video.get(k, 'N/A') for k in fieldnames if k != 'tags'}
video_data['tags'] = ','.join(video.get('tags', []))
writer.writerow(video_data)
print(f"Data saved to {csv_filename}")
if __name__ == "__main__":
asyncio.run(main())
This addition allows us to export our data to a neat CSV file, {username}_playwright_video_stats.csv, perfect for spreadsheet analysis or data visualization tools.
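Since the tags column is comma-joined on export, remember to split it back apart when loading the file. A minimal stdlib sketch with made-up sample data (no pandas required):

```python
import csv
import io

# A tiny in-memory sample mimicking the exported CSV (illustrative data only)
sample = io.StringIO(
    "url,views,likes,tags\n"
    "https://www.tiktok.com/@user/video/1,1.2M,45K,\"#fpl,#football\"\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Undo the comma-join we applied when writing the 'tags' column
    row['tags'] = row['tags'].split(',') if row['tags'] else []

print(rows[0]['tags'])  # ['#fpl', '#football']
```

For real data, replace the StringIO sample with `open(f"{username}_playwright_video_stats.csv", newline='', encoding='utf-8')`.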
Storing TikTok Video Data in SQLite
For the database enthusiasts among us, let's add an option to store our TikTok video data in a SQLite database.
To do this, we again only need to modify our main() function:
import sqlite3
async def main():
# ... (previous code remains the same)
videos = await scrape_tiktok_profile(username, videos)
print(f"\nTotal videos scraped: {len(videos)}")
# Save to SQLite
# Define our file output name and initiate a sqlite3 connection
db_filename = f"{username}_tiktok_data.db"
conn = sqlite3.connect(db_filename)
cursor = conn.cursor()
# Create table with all fields
cursor.execute('''CREATE TABLE IF NOT EXISTS videos
(url TEXT PRIMARY KEY, views TEXT, likes TEXT,
comments TEXT, shares TEXT, bookmarks TEXT,
description TEXT, musicTitle TEXT, date TEXT, tags TEXT)''')
# Insert or update data for each video
for video in videos:
cursor.execute('''INSERT OR REPLACE INTO videos
(url, views, likes, comments, shares, bookmarks,
description, musicTitle, date, tags)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''',
(video['url'], video['views'], video['likes'],
video['comments'], video['shares'], video['bookmarks'],
video.get('description', 'N/A'),
video.get('musicTitle', 'N/A'),
video.get('date', 'N/A'),
','.join(video.get('tags', []))))
# Commit changes and close connection
conn.commit()
conn.close()
print(f"Data saved to SQLite database: {db_filename}")
if __name__ == "__main__":
asyncio.run(main())
This SQLite integration turns our TikTok data into a queryable database, perfect for complex data analysis and app integration.
Pro Tip: From my data hoarding escapades, I can say that SQLite is like a secret weapon for data scientists. It's lightweight, requires no separate server, and can handle surprisingly large datasets. It's perfect for when you want to level up from simple file storage without the complexity of a full-fledged database server!
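Once the data lands in SQLite, querying it is a one-liner. Here's a minimal sketch using the same schema as above; the database is in-memory and the row is made up for illustration, and note that since the counts are stored as TEXT, numeric sorting would need parsing first:

```python
import sqlite3

# In-memory DB for illustration; point at f"{username}_tiktok_data.db" for real data
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS videos
               (url TEXT PRIMARY KEY, views TEXT, likes TEXT,
                comments TEXT, shares TEXT, bookmarks TEXT,
                description TEXT, musicTitle TEXT, date TEXT, tags TEXT)''')
cur.execute("INSERT INTO videos (url, views, tags) VALUES (?, ?, ?)",
            ("https://www.tiktok.com/@user/video/1", "1.2M", "#fpl,#football"))
conn.commit()

# Find every video carrying a given hashtag (tags were comma-joined on insert)
cur.execute("SELECT url, views FROM videos WHERE tags LIKE ?", ("%#fpl%",))
matches = cur.fetchall()
print(matches)
conn.close()
```

The LIKE pattern works because we stored tags as one comma-joined string; for heavier analysis you might normalize tags into their own table instead.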
And there we have it, intrepid data explorers! We've built a comprehensive TikTok scraping tool that not only extracts valuable data but also offers flexible storage options, whether we prefer the simplicity of CSV, the ubiquity of JSON, or the power of SQLite.
Supercharge Your Scraping With ScrapingBee
Hey there, data adventurers! Remember those pesky CAPTCHAs we encountered while starting our scraping? Well, what if I told you there's a magical tool that can make them disappear faster than a viral TikTok dance?
Enter ScrapingBee - the superhero sidekick every web scraper dreams of!
Why ScrapingBee is Your New Best Friend
- CAPTCHA? What CAPTCHA? ScrapingBee's API deftly navigates around CAPTCHAs, allowing you to focus on the data gold while it handles the heavy lifting of CAPTCHA management.
- Proxy Power-Up: Imagine having a whole fleet of ships (IP addresses) at your command. That's what proxies are! Instead of one ship sending out every request, you spread the load across many, like a digital shapeshifter, making it far harder for TikTok and other advanced anti-bot systems to pin you down.
- JavaScript Jedi: ScrapingBee handles JavaScript execution like a pro, perfect for dynamic content like TikTok's.
- API Simplicity: Simple API calls replace complex browser automation. It's like trading your rusty screwdriver for a laser cutter!
Ready to Become a Scraping Wizard?
Don't let advanced anti-bot measures keep you from your data dreams. With ScrapingBee, you're not just scraping - you're performing data sorcery!
Start Your Free ScrapingBee Trial Now!
No credit card required, and you get 1000 free API calls. That's enough to scrape more TikTok videos than you can watch in a day!
After signing up, you can proceed to copy your API key from the dashboard.
Let's See ScrapingBee in Action!
Enough talk, let's see some code!
First, we have to install the ScrapingBee Python SDK :
pip install scrapingbee
Now, let's see how to get started with using ScrapingBee API to scrape a TikTok profile with the elegance of a ballet dancer and the stealth of a ninja:
from scrapingbee import ScrapingBeeClient
import json
# Your ScrapingBee API key - your ticket to scraping paradise!
client = ScrapingBeeClient(api_key='YOUR_API_KEY')
# The TikTok profile URL - let's scrape @fpl_insights again
url = "https://www.tiktok.com/@fpl_insights"
# Our extraction rules - telling ScrapingBee what treasure to find
extract_rules = {
"videos": {
"selector": "div[data-e2e='user-post-item']",
"type": "list",
"output": {
"url": {
"selector": "a",
"output": "@href"
},
"views": {
"selector": "strong[data-e2e='video-views']",
"output": "text"
}
}
},
"profile_info": {
"selector": "h2[data-e2e='user-subtitle']",
"output": "text"
}
}
# The magic happens here!
response = client.get(
url,
params={
'extract_rules': extract_rules,
'render_js': True,
'premium_proxy': True
}
)
if response.ok:
data = response.json()
print(json.dumps(data, indent=2))
else:
print(f"Oops! Something went wrong: {response.content}")
This extract_rules dictionary is like giving ScrapingBee a treasure map. It's saying:
- Find all the video items (div[data-e2e='user-post-item'])
- For each video, grab the URL and views count
- Oh, and while you're at it, snag the display name from the profile info too
The response = client.get(...) part is where the magic happens. ScrapingBee takes our treasure map, navigates TikTok's defenses, and brings back the gold - all with one simple API call.
Why You'll Fall in Love With ScrapingBee
- Focus on what matters: No more wrestling with CAPTCHAs or proxy configurations. ScrapingBee handles the grunt work while you focus on the data magic.
- Scale like a boss: Whether you're scraping 10 videos or 10,000 for stats, ScrapingBee's got your back. It's like having an army of tiny web-scraping minions at your command!
- Stay on the right side of the law: ScrapingBee helps you respect rate limits and terms of service. It's like having a digital lawyer on your team!
- Save time, save money: Time is money, and ScrapingBee saves you both. It's an investment that pays for itself faster than you can say "viral TikTok"!
So, why wait? You don't need a credit card to start.
Start your free ScrapingBee trial now and turn those websites into data goldmines! Plus, your first 1,000 requests are on us!
Conclusion
Wow, what a journey it's been! From setting up our TikTok scraping environment to unleashing the power of ScrapingBee, we've certainly covered a lot of ground.
Let's recap our epic journey:
- We set up our Python environment and learned about the challenges of scraping TikTok.
- We crafted powerful scrapers using Nodriver and Playwright to navigate TikTok's dynamic content (profile stats and video downloads).
- We tackled CAPTCHAs and other anti-bot measures head-on.
- We explored various storage methods to store our scraped data, from JSON and CSV files to a database like SQLite.
- Finally, we discovered the magic of ScrapingBee, which simplifies our scraping process and helps us avoid blocks.
Remember, with great scraping power comes great responsibility. Always scrape ethically, respect website terms of service, and use the data wisely. This not only ensures your professional integrity but also contributes to a healthy data ecosystem.
Whether you're analyzing TikTok trends, conducting research, or satisfying your data curiosity, you now have the tools to dive deep into the TikTok data ocean.
So, what are you waiting for? Grab your digital scuba gear (your Python environment), take a deep breath, and dive into the fascinating world of TikTok data.
Further Reading
Ready to level up your scraping skills? Check out more of our resources:
Topic | Description |
---|---|
How to Scrape YouTube | Learn how to scrape another social media giant |
Python Web Scraping: Full Tutorial With Examples (2024) | Master the art of web scraping with Python |
API vs. Web Scraping: Whatβs the Difference? | Understand when to use each approach |
Avoiding Detection | Stay under the radar while scraping |
Scraping With Nodriver: Step By Step Tutorial With Examples | Deep dive into using Nodriver for scraping |
How To Scrape Data From Twitter.Com | Apply your skills to another platform |
The scraping journey doesn't end here, folks. Keep learning, keep adapting, and most importantly, keep asking questions. The world of web scraping is ever-evolving, and the best scrapers are those who never stop exploring!
Happy scraping, and may your datasets be ever bountiful!