Topic Analysis of US State Subreddits Using gpt-4o-mini

21 October 2024 | 14 min read

Ever wondered what people across the United States are talking about online? Reddit, often dubbed "the front page of the internet," offers a treasure trove of conversations, and each state has its own dedicated subreddit reflecting local interests. But what exactly are these state-based communities discussing the most?

In total, we looked at 50,947 threads from the subreddits of the different states of the USA, using the “top” sort with the “year” filter on Reddit. We first made a word cloud of the commonly occurring words in the thread titles. Based on this preliminary analysis, we defined 8 categories, including an “others” category which we excluded from visualizations, and asked gpt-4o-mini to classify each thread into one of them. The 8 categories are as follows:

  • News and Politics
  • Classifieds
  • Queries
  • Opinions and Rants
  • Nature and Culture
  • Sports
  • Humor
  • Others

We present our findings below, followed by the technical steps involved in the analysis.

Preliminary Results

Word Cloud of All Analyzed Thread Titles

A word cloud represents word frequencies in an image by sizing each word in proportion to its frequency of occurrence. I generated a word cloud from the words in the titles of all the threads we analyzed. Since the focus was on the subjects being discussed, I considered only the words that were nouns (common nouns and proper nouns). The overall word cloud is below:

wordcloud-all.png

After the gpt-4o-mini analysis, we had a table showing what percentage of each state's threads fell into each category. You can find that table on the linked Google Sheet. Except for Guam and the Virgin Islands, the topics in all the states (or territories) of the USA were dominated by one of two categories: “news and politics” and “nature and culture”. The split between these two categories is shown in the map below.

map-top-categories.png

From the choropleth map, we see that the top categories are mostly geographically contiguous. The shift in the most popular topics from 'news and politics' to 'nature and culture' in states near the Rocky Mountains in the west and the Appalachian Mountains in the east suggests that the mountains and national parks in those areas are significant topics of discussion.

Category-wise Analysis

While the two categories “news and politics” and “nature and culture” dominate most states, this could be partly due to the nature of the Reddit platform itself and the kind of discussion it affords. To surface differences across states, we took another perspective: for each category, we looked at which states discuss it the most.

News and Politics

map-news and politics.png

Top 5 States:

| State | News and Politics |
| --- | --- |
| California | 83.38% |
| Illinois | 75.83% |
| Texas | 75.36% |
| Ohio | 73.3% |
| Virginia | 72.29% |

From the visual above, we first see that this category spans a huge range, from a little below 20% to a little over 80%. The pattern is again mostly geographically contiguous. The colors also correlate with population density as per the 2020 Census: regions with higher population density seem to discuss news and politics more, while regions with lower populations are more concerned with other things.

Example Thread:

image.png Link to Thread

Word Cloud:

wordcloud-news and politics.png

Nature and Culture

map-nature and culture.png

Top 5 States:

| State | Nature and Culture |
| --- | --- |
| Washington | 74.11% |
| Arizona | 73.96% |
| Colorado | 59.19% |
| Montana | 54.75% |
| Alaska | 47.2% |

These figures are almost the inverse of "news and politics." Regions with lower population density, often shaped by natural features like mountains or cold climates (as in Alaska), tend to focus their discussions on these environmental factors.

Example Thread:

image1.png Link to Thread

Word Cloud:

wordcloud-nature and culture.png

Opinions and Rants

map-opinions and rants.png

Top 5 States:

| State | Opinions and Rants (%) |
| --- | --- |
| Puerto Rico | 19.43 |
| Iowa | 17.91 |
| Guam | 17.73 |
| District of Columbia | 14.82 |
| New Jersey | 14.27 |

Three of the top five are small and densely populated, suggesting that subreddits for regions with higher population density are more likely to feature opinion-sharing and ranting. Another interesting observation is that the states in the upper third of this category are mostly located in the east, while most western states fall into the lower part of the “opinions and rants” table; perhaps the better weather on the West Coast plays a part.

Example Thread:

image2.png Link to Thread

Word Cloud:

wordcloud-opinions and rants.png

Queries

map-queries.png

Top 5 States:

| State | Queries |
| --- | --- |
| Virgin Islands | 77.59% |
| Guam | 35.15% |
| Nevada | 24.14% |
| South Dakota | 17.38% |
| Georgia | 15.21% |

The queries category is topped by the Virgin Islands, followed by Guam, neither of which is shown on the map. One possibility is that queries land on the state subreddit when the state or territory is geographically small; in larger states, these questions might go to individual city or county subreddits instead. Apart from this size effect, a high share of queries could also point to a lot of incoming tourists, as tourists tend to ask questions before heading to a destination.

Example Thread:

image3.png Link to Thread

Word Cloud:

wordcloud-queries.png

Classifieds

map-classifieds.png

Top 5 States:

| State | Classifieds (%) |
| --- | --- |
| North Dakota | 7.61 |
| South Dakota | 6.39 |
| Delaware | 6.33 |
| Kentucky | 6.29 |
| Guam | 6.19 |

The map and table toppers for classifieds-style posts are broadly similar to the distribution of queries, suggesting that smaller states with lower population density tend to use the state subreddit more for hyperlocal utility; in larger states, this kind of discussion may move to city- or county-level subreddits. Judging by the threads, most of these posts involve selling dogs or advertising/seeking jobs.

Example Thread:

image4.png Link to Thread

Word Cloud:

wordcloud-classifieds.png

Sports

map-sports.png

Top 5 States:

| State | Sports (%) |
| --- | --- |
| Nevada | 3.11 |
| Hawaii | 2.87 |
| Montana | 2.82 |
| North Dakota | 2.79 |
| Nebraska | 2.75 |

The states that discuss sports the most are mostly in the north, while most states along the southern border discuss sports the least. Hawaii stands out in this list, taking the second spot on the table.

Example Thread:

image5.png Link to Thread

Word Cloud:

wordcloud-sports.png

Humor

map-humor.png

Top 5 States:

| State | Humor (%) |
| --- | --- |
| Puerto Rico | 22.4 |
| New Jersey | 15.19 |
| Minnesota | 13.1 |
| Alaska | 11.39 |
| Maine | 8.79 |

Humor shows a very skewed distribution, with four or five states topping the list and the rest in the lower half, most of them close to zero. Most of the states at the top are on the country's border or disconnected from the mainland. With Puerto Rico topping the list, it is worth asking whether language plays a role here: more than 90% of Puerto Ricans speak Spanish as their first language, unlike the rest of the country, which mostly speaks English.

Example Thread:

image6.png Link to Thread

Translation: You are Puerto Rican and have turned 40. Choose your personality.

Word Cloud:

wordcloud-humor.png

Technical Steps Involved

In this section, I describe the technical steps involved in gathering and analyzing the data. If you are a non-technical reader interested only in the insights and trends, feel free to skip this section. Python was the programming language used for all the steps, with the following additional libraries:

  • PRAW: Python Reddit API wrapper
  • dataset: For working with SQLite databases
  • python-dotenv: For loading API credentials from a .env file
  • pydantic: For defining data models to interact with OpenAI APIs
  • openai: For interacting with the gpt-4o-mini model
  • spacy: For natural language processing (part-of-speech tagging)
  • wordcloud: For generating the word clouds
  • pandas: For working with tabular data
  • geopandas: For merging tabular data with maps
  • mapclassify: For binning values on the choropleth maps
  • matplotlib and geoplot: For plotting choropleth maps
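
These can be installed with pip. The package names below are the standard PyPI distributions for each library; note that spaCy additionally needs the small English model downloaded, since the word cloud code later uses en_core_web_sm:

pip install praw dataset python-dotenv pydantic openai spacy wordcloud pandas geopandas mapclassify matplotlib geoplot
python -m spacy download en_core_web_sm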

Getting Data from the Subreddits

I started with a CSV file containing all the US states and their corresponding subreddits. For each subreddit, I used the Reddit API to get the ‘top’ threads using the ‘year’ filter:

import csv
from datetime import datetime

import dataset
from dotenv import dotenv_values
import praw

config = dotenv_values(".env")
username = config["REDDIT_USERNAME"]
reddit = praw.Reddit(
    client_id=config["REDDIT_CLIENT_ID"],
    client_secret=config["REDDIT_CLIENT_SECRET"],
    password=config["REDDIT_PASSWORD"],
    username=username,
    user_agent=f"linux:script.s-hub:v0.0.1 (by /u/{username})",
)

db = dataset.connect("sqlite:///data.db")

# check that the reddit login worked
print(f"Logged Into Reddit As: {reddit.user.me()}")

# read the state -> subreddit mapping
states = []
with open("us_state_subs.csv") as f:
    for state in csv.DictReader(f):
        states.append(state)

for state in states:
    if not state["Subreddit"]:
        continue

    # extract the bare subreddit name from the CSV value
    sub_name = state["Subreddit"].split("/r/")[-1].strip("/")
    subreddit = reddit.subreddit(sub_name)

    # fetch as many of the past year's top threads as the API returns
    for thread in subreddit.top(time_filter="year", limit=None):
        data = vars(thread)
        row = dict(
            thread_id=thread.id,
            title=thread.title,
            score=thread.score,
            subreddit=sub_name,
            created=datetime.utcfromtimestamp(thread.created_utc).isoformat(),
            comment_limit=data.get("comment_limit"),
            permalink=data.get("permalink"),
            selftext=data.get("selftext"),
            link_flair=data.get("link_flair_text"),
        )
        db["threads"].insert(row)

Get credentials for the Reddit API here by creating an app.
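
The .env file read by dotenv_values contains the four Reddit credentials used in the script above (placeholder values shown):

REDDIT_USERNAME=your_username
REDDIT_PASSWORD=your_password
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret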

The above code retrieved 50,947 threads and stored them in data.db.
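
As a quick sanity check (not part of the original pipeline), the dataset library can report the stored row counts directly:

import dataset

db = dataset.connect("sqlite:///data.db")
print(len(db["threads"]))                      # total rows; should print 50947
print(db["threads"].count(subreddit="texas"))  # threads fetched from r/texas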

The CSV of subreddits looks like this:

| State | Subreddit |
| --- | --- |
| Alabama | r/Alabama |
| Alaska | r/alaska |
| Arizona | r/arizona |
| Arkansas | r/Arkansas |
| California | r/California |
| Colorado | r/Colorado |
| Connecticut | r/Connecticut |
| Delaware | r/Delaware |
| District of Columbia | r/washingtondc |
| Florida | r/florida |
| Georgia | r/Georgia |
| Guam | r/guam |
| Hawaii | r/Hawaii |
| Idaho | r/Idaho |
| Illinois | r/illinois |
| Indiana | r/Indiana |
| Iowa | r/Iowa |
| Kansas | r/kansas |
| Kentucky | r/Kentucky |
| Louisiana | r/Louisiana |
| Maine | r/Maine |
| Maryland | r/maryland |
| Massachusetts | r/massachusetts |
| Michigan | r/Michigan |
| Minnesota | r/minnesota |
| Mississippi | r/mississippi |
| Missouri | r/missouri |
| Montana | r/Montana |
| Nebraska | r/Nebraska |
| Nevada | r/Nevada |
| New Hampshire | r/newhampshire |
| New Jersey | r/newjersey |
| New Mexico | r/NewMexico |
| New York | r/newyork |
| North Carolina | r/NorthCarolina |
| North Dakota | r/northdakota |
| Ohio | r/Ohio |
| Oklahoma | r/oklahoma |
| Oregon | r/oregon |
| Pennsylvania | r/Pennsylvania |
| Puerto Rico | r/PuertoRico |
| Rhode Island | r/RhodeIsland |
| South Carolina | r/southcarolina |
| South Dakota | r/SouthDakota |
| Tennessee | r/Tennessee |
| Texas | r/texas |
| Utah | r/Utah |
| Vermont | r/vermont |
| Virgin Islands | r/virginislands |
| Virginia | r/Virginia |
| Washington | r/Washington |
| West Virginia | r/WestVirginia |
| Wisconsin | r/wisconsin |
| Wyoming | r/wyoming |

Categorizing the Threads with gpt-4o-mini

gpt-4o-mini supports structured output via a technique known as tool calling. This allowed me to define a list of 8 string values and get exactly one of those as the output, instead of a free-form natural language response. The model is also much cheaper than ChatGPT, comparable in pricing to the Mistral 7B model we used in some of our previous blogs. I used the following prompt to classify each thread from its title and flair:

Categorize the following reddit thread title into an appropriate category. You may use a flair as hint, if it is provided. The title is: {title}. The flair is: {flair}

Code:

from enum import Enum

from pydantic import BaseModel
import openai
from openai import OpenAI

import dataset

db = dataset.connect("sqlite:///data.db")

API_KEY = "<OPENAI_API_KEY_HERE>"
client = OpenAI(api_key=API_KEY)

# the 8 allowed categories; a string enum means the model can
# only return one of these exact values
class ThreadCategory(str, Enum):
    news_and_politics = "news and politics"
    classifieds = "classifieds"
    queries = "queries"
    opinions_and_rants = "opinions and rants"
    nature_and_culture = "nature and culture"
    sports = "sports"
    humor = "humor"
    others = "others"

class Result(BaseModel):
    category: ThreadCategory

def get_gpt_response(thread_title, flair):
    prompt = "Categorize the following reddit thread title into an appropriate category. You may use a flair as hint, if it is provided."
    prompt += " The title is: " + thread_title
    if flair:
        prompt += ". The flair is: " + flair

    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        tools=[
            # constrain the output to the Result schema via tool calling
            openai.pydantic_function_tool(Result)
        ],
    )
    return completion.choices[0].message.tool_calls[0].function.parsed_arguments.category

threads = list(db["threads"])
for row in threads:
    # skip rows already categorized, so an interrupted run can resume
    if row.get("gpt_response"):
        print("SKIPPING")
        continue

    resp_val = "others"
    if row["title"] != "...":  # skip placeholder titles
        response = get_gpt_response(row["title"], row.get("link_flair"))
        resp_val = str(response.value)

    db["threads"].update({"id": row["id"], "gpt_response": resp_val}, ["id"])

I stored the responses in a column named gpt_response, and wrote the code to skip the rows that contain a non-null value for this field. This ensures that if the program is interrupted and restarted for any reason, it starts right where it left off, skipping the already processed rows.

Generating the Category-Wise Percentage Table

After the GPT analysis, the first step was to obtain the table showing the percentage of threads falling under each topic category for each state. This was done using the code below:

import dataset
import pandas as pd

db = dataset.connect("sqlite:///data.db")
states_list = pd.read_csv("us_state_subs.csv", dtype=str, na_filter=False)

# create a dict to lookup state names by subreddit
state_names = {}
for idx, s in states_list.iterrows():
    if s["Subreddit"]:
        key = s["Subreddit"].split("/r/")[-1].strip("/").lower()
        state_names[key] = s["State"]

state_wise_counters = {} 
for row in db["threads"]:
    key = row["subreddit"].lower()
    state_name = state_names[key]
    if state_name not in state_wise_counters:
        state_wise_counters[state_name] = {"state": state_name, "total_threads_analyzed": 0}

    category = row["gpt_response"]
    if category not in state_wise_counters[state_name]:
        state_wise_counters[state_name][category] = 0
    state_wise_counters[state_name][category] += 1
    state_wise_counters[state_name]["total_threads_analyzed"] += 1

data = []
for state_name in state_wise_counters:
    state = state_wise_counters[state_name]
    
    highest_val = 0
    highest_category = ""
    for key in state:
        if key not in ["state", "total_threads_analyzed"]:
            if state[key]>highest_val:
                highest_val = state[key]
                highest_category = key
            state[key] = round(100*state[key]/state["total_threads_analyzed"], 2)
    state["top_category"] = highest_category
    
    data.append(state)

df = pd.DataFrame.from_records(data, columns=[
    "state", "total_threads_analyzed",
    "news and politics", "opinions and rants", "nature and culture",
    "queries", "humor", "classifieds", "sports", "others",
    "top_category",
]).fillna(0)
df.to_csv("topic_percents.csv", index=False)

The above code outputs the data you see in the linked Google Sheet, which was used for further analysis.
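
If you want to explore the output yourself, here is a minimal sketch (assuming the column layout produced above) for loading the table back with pandas:

import pandas as pd

# load the per-state category percentages produced above
df = pd.read_csv("topic_percents.csv")

# each state alongside its dominant category
print(df[["state", "top_category"]].to_string(index=False))

# states where "news and politics" accounts for over 70% of threads
print(df[df["news and politics"] > 70][["state", "news and politics"]])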

Generating Word Clouds

I generated word clouds for 1) all the threads together and 2) each topic category, using the same function and varying only the set of threads. I used spaCy to identify the parts of speech in each title and considered only nouns (common and proper) for the analysis.

import dataset
import os
import spacy
from wordcloud import WordCloud

os.makedirs("results/", exist_ok=True)
db = dataset.connect("sqlite:///data.db")

nlp = spacy.load('en_core_web_sm')
def lemmatize(text):
    words = nlp(text)
    filtered_words = filter(lambda w: w.pos_ in ["NOUN", "PROPN"], words)
    return [word.lemma_.lower() for word in filtered_words]

def generate_wordcloud(category):
    if not category:
        return
    if category == "all":
        rows = db["threads"]
    else:
        rows = db["threads"].find(gpt_response=category)

    word_freq_dict = {}
    for row in list(rows):
        for word in lemmatize(row["title"]):
            if word not in word_freq_dict:
                word_freq_dict[word] = 0
            word_freq_dict[word] += 1
    wc = WordCloud(width=640, height=360).generate_from_frequencies(word_freq_dict)
    wc.to_file(f"results/wordcloud-{category}.png")

generate_wordcloud("all")
categories = ["news and politics", "classifieds", "queries", "opinions and rants", "nature and culture", "sports", "humor"]
for c in categories:
    generate_wordcloud(c)

Generating the Choropleth Maps

I used geopandas, geoplot, and matplotlib to render the choropleth maps. The base map was a GeoJSON map with the US state borders, with Alaska, Hawaii, and Puerto Rico resized and moved for compactness of representation.

Base Setup:

import mapclassify as mc
import pandas as pd
import geopandas as gpd
import os

from matplotlib.colors import ListedColormap, BoundaryNorm, LinearSegmentedColormap
from matplotlib.ticker import FuncFormatter
import matplotlib.image as image
import matplotlib.pyplot as plt
import geoplot.crs as gcrs
import geoplot as gplt

plt.rcParams["font.size"] = 14

os.makedirs("results/", exist_ok=True)
df = pd.read_csv("topic_percents.csv", index_col=False)
map_df = gpd.read_file("states_map.geojson")
merged_df = map_df.merge(
    df,
    left_on=["NAME"],
    right_on=["state"],
)

Plotting Top Categories by State:

# plotting a column needs it to have numerical values
plot_vals = {"news and politics":0, "nature and culture": 1, "queries": 2}
merged_df["top_category_plot_val"] = merged_df["top_category"].apply(lambda x: plot_vals[x])

scheme = mc.UserDefined(merged_df["top_category_plot_val"], bins=[0,1])
cmap = ListedColormap(["#c12f51", "#2fc14c"])

ax = gplt.choropleth(
    merged_df,
    extent=(-124.733174, 24.544701, -66.949895, 52.384358),
    projection=gcrs.PlateCarree(),
    hue='top_category_plot_val', # column to be used for color coding
    scheme=scheme,
    cmap=cmap,
    linewidth=0,
    edgecolor='black',
    figsize=(16, 9),
    legend=True,
    legend_kwargs=dict(
        loc='lower center',
        bbox_to_anchor = (0.5, -0.30)
    ),
    legend_labels=["News and Politics", "Nature and Culture"],
)

# adding state codes on the plot
# with adjustments for neatness
for _i, r in merged_df.iterrows():
    coords = r.geometry.centroid.coords[0]
    xy = [coords[0]+96, coords[1]]
    if r["SHORT_CODE"]=="CT":
        xy[0] -= 0.2
        xy[1] -= 0.5 
    if r["SHORT_CODE"]=="RI":
        xy[0] += 0.5
        xy[1] -= 0.5
    if r["SHORT_CODE"]=="DE":
        xy[0] += 0.5
        xy[1] -= 0.5
    if r["SHORT_CODE"]=="MD":
        xy[1] += 0.5
    if r["SHORT_CODE"]=="DC":
        xy[1] -= 0.5
    ax.annotate(r["SHORT_CODE"], xy=xy, fontsize=8, ha="center", va="center")

fig = ax.get_figure()
logo = image.imread("logo.png")
logo_ax = fig.add_axes([0.04, 0.78, 0.15, 0.15], anchor="NW", zorder=1)
logo_ax.imshow(logo)
logo_ax.axis("off")

plt.title("Most talked about topic in each State's Subreddit",  ha="right", x=6.1, y=-0.3, loc="right", fontsize=18)
plt.savefig('results/map-top-categories.png')

Generating Maps and Tables for Each Category

def dump_results_for_category(category):
    # the top 5 table
    df2 = df[["state", category]].sort_values(by=category, ascending=False)
    cutoff_value = df2[category].iloc[4]
    df2 = df2[df2[category] >= cutoff_value]
    df2.to_markdown(f'results/top-{category}.md', index=False)
    
    # choropleth map: a two-color gradient from light to dark green
    # (the hue column is normalized into the colormap's 0-1 range when plotting)
    colors = ["#ccffec", "#003d27"]
    cmap = LinearSegmentedColormap.from_list('custom_colormap', colors)

    legend_format = lambda val, pos: str(val)+"%"

    ax = gplt.choropleth(
        merged_df,
        extent=(-124.733174, 24.544701, -66.949895, 52.384358),
        projection=gcrs.PlateCarree(),
        hue=category, # column to be used for color coding
        cmap=cmap,    # the green gradient defined above
        linewidth=0,
        edgecolor="black",
        figsize=(16, 9),
        legend=True,
        legend_kwargs=dict(
            spacing="proportional",
            location="bottom",
            format=FuncFormatter(legend_format),
            aspect=40,
            shrink=0.75,
        ),
    )

    # adding two letter state codes
    # with minor adjustments for neatness
    for _i, r in merged_df.iterrows():
        coords = r.geometry.centroid.coords[0]
        xy = [coords[0]+96, coords[1]]
        if r["SHORT_CODE"]=="CT":
            xy[0] -= 0.2
            xy[1] -= 0.2
        if r["SHORT_CODE"]=="RI":
            xy[0] += 0.5
            xy[1] -= 0.5
        if r["SHORT_CODE"]=="DE":
            xy[0] += 0.5
            xy[1] -= 0.5
        if r["SHORT_CODE"]=="MD":
            xy[1] += 0.3
        if r["SHORT_CODE"]=="DC":
            xy[1] -= 0.5
        ax.annotate(r["SHORT_CODE"], xy=xy, fontsize=10, ha="center", va="center")    

    fig = ax.get_figure()
    logo = image.imread("logo.png")
    logo_ax = fig.add_axes([0.04, 0.78, 0.15, 0.15], anchor="NW", zorder=1)
    logo_ax.imshow(logo)
    logo_ax.axis("off")

    plt.title(f"States that talk about {category} the most", ha="right", x=6.1, y=0, loc="right", fontsize=18)
    plt.savefig(f'results/map-{category}.png')

categories = list(filter(lambda x: x not in ["state", "total_threads_analyzed", "top_category", "others"], df))
for category in categories:
    dump_results_for_category(category)

The above code outputs the table with the top 5 states and the choropleth map for each category.

Conclusion

In this blog, we categorized the threads from the subreddits of US states into one of 8 topics and discussed the findings. We saw that the distribution of these topics across the states could be influenced by factors such as state size, natural features, language, or even geographical position. Overall, most states predominantly discuss either news and politics or nature and culture, which could be an attribute of the platform itself.

This analysis was much easier with the availability of the tool calling feature and the ability to get structured output from the gpt-4o-mini model.

Karthik Devan

I work freelance on full-stack development of apps and websites, and I'm also trying to work on a SaaS product. When I'm not working, I like to travel, play board games, hike and climb rocks.