Ever wondered what people across the United States are talking about online? Reddit, often dubbed "the front page of the internet," offers a treasure trove of conversations, and each state has its own dedicated subreddit reflecting local interests. But what exactly are these state-based communities discussing the most?
In total, we analyzed 50,947 threads from the subreddits of US states and territories, collected using Reddit's “top” sort with the “year” filter. We first made a word cloud of the most commonly occurring words in the thread titles. Based on this preliminary analysis, we defined 8 categories, including an “others” category that we excluded from the visualizations, and asked gpt-4o-mini to classify each thread into one of them. The 8 categories are as follows:
- News and Politics
- Classifieds
- Queries
- Opinions and Rants
- Nature and Culture
- Sports
- Humor
- Others
We present our findings below, followed by the technical steps involved in the analysis.
Preliminary Results
Word Cloud of All Analyzed Thread Titles
A word cloud represents word frequencies in an image by sizing each word in proportion to how often it occurs. I generated a word cloud from the words in the titles of all the threads we analyzed. Since the focus was on the subjects being discussed, I considered only nouns (common and proper). The overall word cloud is below:
Popular Category by State
After the gpt-4o-mini analysis, we had a table showing what percentage of threads in each state fell into each category. You can find that table in the linked Google Sheet. Except for Guam and the Virgin Islands, the topics in every state (or territory) of the USA were dominated by one of two categories: “news and politics” or “nature and culture”. The split between these two categories is shown in the map below.
From the choropleth map, we see that the top categories are mostly geographically contiguous. The shift in the most popular topic from “news and politics” to “nature and culture” in states near the Rocky Mountains in the west and the Appalachian Mountains in the east suggests that the mountains and national parks in those areas are significant topics of discussion.
Category-wise Analysis
While the states are dominated by the two categories “news and politics” and “nature and culture”, this could be partly due to the nature of the Reddit platform itself and the kind of discussion it affords. To see differences across states, we took another perspective: for each category, we looked at which states discuss it the most.
News and Politics
Top 5 States:
state | news and politics |
---|---|
California | 83.38% |
Illinois | 75.83% |
Texas | 75.36% |
Ohio | 73.3% |
Virginia | 72.29% |
From the visual above, we first see that the spread of this category is wide, ranging from a little below 20% to a little over 80%. The pattern is again mostly geographically contiguous. The colors also correlate with population density as per the 2020 Census: regions with higher population density seem to discuss news and politics more, while regions with lower populations are more concerned with other things.
An Example Thread:
Word Cloud:
Nature and Culture
Top 5 States:
state | nature and culture |
---|---|
Washington | 74.11% |
Arizona | 73.96% |
Colorado | 59.19% |
Montana | 54.75% |
Alaska | 47.2% |
These figures are almost the inverse of "news and politics." Regions with lower population density, often shaped by natural features like mountains or cold climates (as in Alaska), tend to focus their discussions on these environmental factors.
Example Thread:
Word Cloud:
Opinions and Rants
Top 5 States:
state | opinions and rants |
---|---|
Puerto Rico | 19.43% |
Iowa | 17.91% |
Guam | 17.73% |
District of Columbia | 14.82% |
New Jersey | 14.27% |
Three of the top 5 are small, densely populated places, suggesting that subreddits of regions with higher population density are more likely to be used for sharing opinions and ranting online. Another interesting point: states in the upper third of this category are mostly in the east, while most western states fall into the lower part of the “opinions and rants” table. Perhaps the better weather on the West Coast plays a role.
Example Thread:
Word Cloud:
Queries
Top 5 States:
state | queries |
---|---|
Virgin Islands | 77.59% |
Guam | 35.15% |
Nevada | 24.14% |
South Dakota | 17.38% |
Georgia | 15.21% |
The queries category is topped by the Virgin Islands, followed by Guam, neither of which is shown on the map. It is possible that queries end up on the state (or territory) subreddit when the area is small; in larger states, such questions may go to individual city or county subreddits instead. Apart from this size effect, a high share of queries could also indicate a lot of incoming tourists, as tourists tend to have questions before heading to a destination.
Example Thread:
Word Cloud:
Classifieds
Top 5 States:
state | classifieds |
---|---|
North Dakota | 7.61% |
South Dakota | 6.39% |
Delaware | 6.33% |
Kentucky | 6.29% |
Guam | 6.19% |
The map and table toppers for classifieds-style posts are somewhat similar to the query distribution, showing that smaller states with lower population density tend to use the state subreddit more for hyperlocal utility. In larger states, this kind of discussion may not happen at the state-level subreddit. It looks like most of these threads are about selling dogs or advertising/seeking jobs.
Example Thread:
Word Cloud:
Sports
Top 5 States:
state | sports |
---|---|
Nevada | 3.11% |
Hawaii | 2.87% |
Montana | 2.82% |
North Dakota | 2.79% |
Nebraska | 2.75% |
The states that discuss sports the most tend to be in the north, while most states along the southern border discuss sports the least. Hawaii stands out in this list, taking the second spot on the table.
Example Thread:
Word Cloud:
Humor
Top 5 States:
state | humor |
---|---|
Puerto Rico | 22.4% |
New Jersey | 15.19% |
Minnesota | 13.1% |
Alaska | 11.39% |
Maine | 8.79% |
Humor has a very skewed distribution: 4 or 5 states top the list, while the rest sit in the lower half, most of them close to zero. Most of the top states are on the country's border or disconnected from the mainland. With Puerto Rico topping the list, it is worth asking whether language plays a role here: more than 90% of Puerto Ricans speak Spanish as their first language, unlike the rest of the country, which mostly speaks English.
Example Thread:
Translation: You are Puerto Rican and have turned 40. Choose your personality.
Word Cloud:
Technical Steps Involved
In this section, I describe the technical steps involved in gathering and analyzing the data. If you are a non-technical reader interested only in the insights and trends, feel free to skip this section. Python was the programming language used for all the steps, with the following additional libraries:
- PRAW: Python Reddit API wrapper
- dataset: For working with SQLite databases
- pydantic: For defining data models to interact with OpenAI APIs
- openai: For interacting with the gpt-4o-mini model
- spacy: Natural Language Processing
- wordcloud: For generating the word clouds
- pandas: For working with tabular data
- geopandas: For merging tabular data with maps
- matplotlib and geoplot: For plotting choropleth maps
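These can be installed with pip. A plausible setup (package names as published on PyPI; tabulate is needed by pandas' to_markdown, and the spaCy English model used later needs a one-time download):

pip install praw dataset python-dotenv pydantic openai spacy wordcloud pandas geopandas geoplot mapclassify matplotlib tabulate
python -m spacy download en_core_web_sm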
Getting Data from the Subreddits
I started with a CSV file containing all the US states and their corresponding subreddits. For each subreddit, I used the Reddit API to get the ‘top’ threads using the ‘year’ filter:
import csv
from datetime import datetime

import dataset
from dotenv import dotenv_values
import praw

config = dotenv_values(".env")
username = config["REDDIT_USERNAME"]
reddit = praw.Reddit(
    client_id=config["REDDIT_CLIENT_ID"],
    client_secret=config["REDDIT_CLIENT_SECRET"],
    password=config["REDDIT_PASSWORD"],
    username=username,
    user_agent=f"linux:script.s-hub:v0.0.1 (by /u/{username})",
)
db = dataset.connect("sqlite:///data.db")

# check that the reddit login worked
print(f"Logged Into Reddit As: {str(reddit.user.me())}")

# each row of the CSV has a "State" and a "Subreddit" column
states = []
with open("us_state_subs.csv") as f:
    for state in csv.DictReader(f):
        states.append(state)

for state in states:
    if not state["Subreddit"]:
        continue
    sub_name = state["Subreddit"].split("/r/")[-1].strip("/")
    subreddit = reddit.subreddit(sub_name)
    thread_ids = set()
    for thread in subreddit.top(time_filter="year", limit=None):
        thread_ids.add(thread.id)
        data = vars(thread)
        row = dict(
            thread_id=thread.id,
            title=thread.title,
            score=thread.score,
            subreddit=sub_name,
            created=datetime.utcfromtimestamp(thread.created_utc).isoformat(),
            comment_limit=data.get("comment_limit"),
            permalink=data.get("permalink"),
            selftext=data.get("selftext"),
            link_flair=data.get("link_flair_text"),
        )
        db["threads"].insert(row)
Get credentials for the Reddit API here by creating an app.
The above code retrieved 50,947 threads and stored them in data.db.
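To sanity-check the collection, a quick query against the database could look like this (the subreddit name used here is illustrative):

import dataset

db = dataset.connect("sqlite:///data.db")
# total rows collected across all subreddits
print(db["threads"].count())
# rows from a single subreddit, by the name stored in the "subreddit" column
print(db["threads"].count(subreddit="alaska"))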
The CSV of subreddits looks like this:
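For illustration, assuming the two columns the code above reads (State and Subreddit), the first few rows might look like the following; the exact subreddit links are illustrative:

State,Subreddit
Alabama,/r/Alabama/
Alaska,/r/alaska/
Arizona,/r/arizona/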
Categorizing the Threads with gpt-4o-mini:
gpt-4o-mini, being a new model, gives us the ability to get structured output via a technique known as tool calling. This allowed me to define a list of 8 string values and get exactly one of those as the output, instead of a natural-language response. The model is also much cheaper than ChatGPT, comparable in pricing to the Mistral 7B model we used in some of our previous blogs. I used the following prompt to classify each thread using its title and flair:
Categorize the following reddit thread title into an appropriate category. You may use a flair as hint, if it is provided. The title is:
The flair is:
Code:
from enum import Enum

from pydantic import BaseModel
import openai
from openai import OpenAI
import dataset

db = dataset.connect("sqlite:///data.db")
API_KEY = "<OPENAI_API_KEY_HERE>"
client = OpenAI(api_key=API_KEY)

class ThreadCategory(str, Enum):
    news_and_politics = "news and politics"
    classifieds = "classifieds"
    queries = "queries"
    opinions_and_rants = "opinions and rants"
    nature_and_culture = "nature and culture"
    sports = "sports"
    humor = "humor"
    others = "others"

class Result(BaseModel):
    category: ThreadCategory
def get_gpt_response(thread_title, flair):
    prompt = "Categorize the following reddit thread title into an appropriate category. You may use a flair as hint, if it is provided."
    prompt += " The title is: " + thread_title
    if flair:
        prompt += ". The flair is: " + flair
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": prompt,
            },
        ],
        tools=[
            openai.pydantic_function_tool(Result),
        ],
    )
    # the parsed arguments are an instance of Result
    return completion.choices[0].message.tool_calls[0].function.parsed_arguments.category

threads = list(db["threads"])
for row in threads:
    # skip rows that were already classified in a previous run
    if row.get("gpt_response"):
        print("SKIPPING")
        continue
    resp_val = "others"
    if row["title"] != "...":
        response = get_gpt_response(row["title"], row.get("link_flair"))
        resp_val = str(response.value)
    db["threads"].update({"id": row["id"], "gpt_response": resp_val}, ["id"])
I stored the responses in a column named gpt_response, and wrote the code to skip rows that contain a non-null value for this field. This ensures that if the program is interrupted and restarted for any reason, it starts right where it left off, skipping the already processed rows.
Generating the Category-Wise Percentage Table
After the GPT analysis, the first step was to generate the table showing the percentage of threads falling under each topic category for each state. This was done using the code below:
import dataset
import pandas as pd

db = dataset.connect("sqlite:///data.db")
states_list = pd.read_csv("us_state_subs.csv", dtype=str, na_filter=False)

# create a dict to lookup state names by subreddit
state_names = {}
for idx, s in states_list.iterrows():
    if s["Subreddit"]:
        key = s["Subreddit"].split("/r/")[-1].strip("/").lower()
        state_names[key] = s["State"]

state_wise_counters = {}
for row in db["threads"]:
    key = row["subreddit"].lower()
    state_name = state_names[key]
    if state_name not in state_wise_counters:
        state_wise_counters[state_name] = {"state": state_name, "total_threads_analyzed": 0}
    category = row["gpt_response"]
    if category not in state_wise_counters[state_name]:
        state_wise_counters[state_name][category] = 0
    state_wise_counters[state_name][category] += 1
    state_wise_counters[state_name]["total_threads_analyzed"] += 1

data = []
for state_name in state_wise_counters:
    state = state_wise_counters[state_name]
    highest_val = 0
    highest_category = ""
    for key in state:
        if key not in ["state", "total_threads_analyzed"]:
            if state[key] > highest_val:
                highest_val = state[key]
                highest_category = key
            state[key] = round(100 * state[key] / state["total_threads_analyzed"], 2)
    state["top_category"] = highest_category
    data.append(state)

df = pd.DataFrame.from_records(data, columns=[
    "state", "total_threads_analyzed",
    "news and politics", "opinions and rants", "nature and culture",
    "queries", "humor", "classifieds", "sports", "others",
    "top_category",
]).fillna(0)
df.to_csv("topic_percents.csv", index=False)
The above code outputs the data you see in the linked Google Sheet and was used for further analysis.
Generating Word Clouds
I generated word clouds for 1) all the threads together and 2) each topic category, using the same function and varying only the set of threads. I used spaCy to identify the parts of speech in each title and considered only nouns (common and proper) for the analysis.
import dataset
import os

import spacy
from wordcloud import WordCloud

os.makedirs("results/", exist_ok=True)
db = dataset.connect("sqlite:///data.db")
nlp = spacy.load('en_core_web_sm')

def lemmatize(text):
    words = nlp(text)
    filtered_words = filter(lambda w: w.pos_ in ["NOUN", "PROPN"], words)
    return [word.lemma_.lower() for word in filtered_words]

def generate_wordcloud(category):
    if not category:
        return
    if category == "all":
        rows = db["threads"]
    else:
        rows = db["threads"].find(gpt_response=category)
    word_freq_dict = {}
    for row in list(rows):
        for word in lemmatize(row["title"]):
            if word not in word_freq_dict:
                word_freq_dict[word] = 0
            word_freq_dict[word] += 1
    wc = WordCloud(width=640, height=360).generate_from_frequencies(word_freq_dict)
    wc.to_file(f"results/wordcloud-{category}.png")

generate_wordcloud("all")

categories = ["news and politics", "classifieds", "queries", "opinions and rants", "nature and culture", "sports", "humor"]
for c in categories:
    generate_wordcloud(c)
Generating the Choropleth Maps
I used geopandas, geoplot, and matplotlib to render the choropleth maps. The base map was a GeoJSON map with the US state borders, with Alaska, Hawaii, and Puerto Rico resized and moved for compactness of representation.
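The repositioning can be done as a one-time preprocessing step on the GeoJSON. Below is a minimal sketch using shapely's affine transforms, assuming the unmodified map is in a hypothetical states_map_original.geojson; the scale factors and offsets are illustrative guesses, not the values behind the published map:

import geopandas as gpd
from shapely import affinity

states = gpd.read_file("states_map_original.geojson")  # hypothetical input file

def move_and_scale(geom, xoff, yoff, factor):
    # shrink/grow the geometry around its centroid, then shift it
    geom = affinity.scale(geom, xfact=factor, yfact=factor, origin="centroid")
    return affinity.translate(geom, xoff=xoff, yoff=yoff)

# illustrative offsets (in degrees) and scale factors; tune by eye
adjustments = {
    "Alaska": (40, -35, 0.35),
    "Hawaii": (55, 5, 1.0),
    "Puerto Rico": (-10, 3, 1.5),
}
for name, (xoff, yoff, factor) in adjustments.items():
    mask = states["NAME"] == name
    states.loc[mask, "geometry"] = states.loc[mask, "geometry"].apply(
        lambda g: move_and_scale(g, xoff, yoff, factor)
    )

states.to_file("states_map.geojson", driver="GeoJSON")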
Base Setup:
import mapclassify as mc
import pandas as pd
import geopandas as gpd
import os
from matplotlib.colors import ListedColormap, BoundaryNorm, LinearSegmentedColormap
from matplotlib.ticker import FuncFormatter
import matplotlib.image as image
import matplotlib.pyplot as plt
import geoplot.crs as gcrs
import geoplot as gplt

plt.rcParams["font.size"] = 14
os.makedirs("results/", exist_ok=True)

df = pd.read_csv("topic_percents.csv", index_col=False)
map_df = gpd.read_file("states_map.geojson")
merged_df = map_df.merge(
    df,
    left_on=["NAME"],
    right_on=["state"],
)
Plotting Top Categories by State:
# plotting a column needs it to have numerical values
plot_vals = {"news and politics": 0, "nature and culture": 1, "queries": 2}
merged_df["top_category_plot_val"] = merged_df["top_category"].apply(lambda x: plot_vals[x])
scheme = mc.UserDefined(merged_df["top_category_plot_val"], bins=[0, 1])
cmap = ListedColormap(["#c12f51", "#2fc14c"])
ax = gplt.choropleth(
    merged_df,
    extent=(-124.733174, 24.544701, -66.949895, 52.384358),
    projection=gcrs.PlateCarree(),
    hue='top_category_plot_val',  # column to be used for color coding
    scheme=scheme,
    cmap=cmap,
    linewidth=0,
    edgecolor='black',
    figsize=(16, 9),
    legend=True,
    legend_kwargs=dict(
        loc='lower center',
        bbox_to_anchor=(0.5, -0.30),
    ),
    legend_labels=["News and Politics", "Nature and Culture"],
)

# adding state codes on the plot
# with adjustments for neatness
for _i, r in merged_df.iterrows():
    coords = r.geometry.centroid.coords[0]
    xy = [coords[0] + 96, coords[1]]
    if r["SHORT_CODE"] == "CT":
        xy[0] -= 0.2
        xy[1] -= 0.5
    if r["SHORT_CODE"] == "RI":
        xy[0] += 0.5
        xy[1] -= 0.5
    if r["SHORT_CODE"] == "DE":
        xy[0] += 0.5
        xy[1] -= 0.5
    if r["SHORT_CODE"] == "MD":
        xy[1] += 0.5
    if r["SHORT_CODE"] == "DC":
        xy[1] -= 0.5
    ax.annotate(r["SHORT_CODE"], xy=xy, fontsize=8, ha="center", va="center")

fig = ax.get_figure()
logo = image.imread("logo.png")
logo_ax = fig.add_axes([0.04, 0.78, 0.15, 0.15], anchor="NW", zorder=1)
logo_ax.imshow(logo)
logo_ax.axis("off")

plt.title("Most talked about topic in each State's Subreddit", ha="right", x=6.1, y=-0.3, loc="right", fontsize=18)
plt.savefig('results/map-top-categories.png')
Generating Maps and Tables for Each Category
def dump_results_for_category(category):
    # the top 5 table (ties at the cutoff value are kept)
    df2 = df[["state", category]].sort_values(by=category, ascending=False)
    cutoff_value = df2[category].iloc[4]
    df2 = df2[df2[category] >= cutoff_value]
    df2.to_markdown(f'results/top-{category}.md', index=False)

    # choropleth map: a light-to-dark green gradient over the data range
    # (geoplot normalizes the hue values, so the colormap only needs
    # the two endpoint colors)
    colors = ["#ccffec", "#003d27"]
    cmap = LinearSegmentedColormap.from_list('custom_colormap', colors)
    legend_format = lambda val, pos: str(val) + "%"
    ax = gplt.choropleth(
        merged_df,
        extent=(-124.733174, 24.544701, -66.949895, 52.384358),
        projection=gcrs.PlateCarree(),
        hue=category,  # column to be used for color coding
        cmap=cmap,
        linewidth=0,
        edgecolor="black",
        figsize=(16, 9),
        legend=True,
        legend_kwargs=dict(
            spacing="proportional",
            location="bottom",
            format=FuncFormatter(legend_format),
            aspect=40,
            shrink=0.75,
        ),
    )

    # adding two letter state codes
    # with minor adjustments for neatness
    for _i, r in merged_df.iterrows():
        coords = r.geometry.centroid.coords[0]
        xy = [coords[0] + 96, coords[1]]
        if r["SHORT_CODE"] == "CT":
            xy[0] -= 0.2
            xy[1] -= 0.2
        if r["SHORT_CODE"] == "RI":
            xy[0] += 0.5
            xy[1] -= 0.5
        if r["SHORT_CODE"] == "DE":
            xy[0] += 0.5
            xy[1] -= 0.5
        if r["SHORT_CODE"] == "MD":
            xy[1] += 0.3
        if r["SHORT_CODE"] == "DC":
            xy[1] -= 0.5
        ax.annotate(r["SHORT_CODE"], xy=xy, fontsize=10, ha="center", va="center")

    fig = ax.get_figure()
    logo = image.imread("logo.png")
    logo_ax = fig.add_axes([0.04, 0.78, 0.15, 0.15], anchor="NW", zorder=1)
    logo_ax.imshow(logo)
    logo_ax.axis("off")

    plt.title(f"States that talk about {category} the most", ha="right", x=6.1, y=0, loc="right", fontsize=18)
    plt.savefig(f'results/map-{category}.png')

categories = list(filter(lambda x: x not in ["state", "total_threads_analyzed", "top_category", "others"], df))
for category in categories:
    dump_results_for_category(category)
The above code outputs the table with the top 5 states and the choropleth map for each category.
Conclusion
In this blog, we presented our findings from categorizing the threads on the subreddits of US states into one of 8 topics. The distribution of these topics across the states could be influenced by factors such as state size, natural features, language, or even geographical position. Overall, most states predominantly discuss either news and politics or nature and culture; this dominance could be an attribute of the platform itself.
This analysis was made much easier by the tool calling feature and the structured output it enables in the gpt-4o-mini model.