Study of Amazon’s Best Selling & Most Read Book Charts Since 2017

24 June 2024 | 31 min read

Amazon is best known as an online shopping website and, among tech folks, for Amazon Web Services. However, it started out as an online bookstore. It is also well known for the Kindle eBook and audiobook experiences it offers.

These extensive offerings in the literature space have given Amazon a great deal of data about reading patterns on a global scale. Amazon presents this data by publishing 4 charts every week: the most read and the most sold books in the fiction and non-fiction categories in the USA.

A Screenshot of Amazon Book Charts Page

We obtained the books on the charts for each week, starting on May 14, 2017. For each week's charts, we could obtain the top 20 books, along with their name, author, cover, etc., and also some other data fields such as the chart position and the number of weeks the book has spent on the chart. We then aggregated the data over the 371 weeks we analyzed, for each of the 4 charts to obtain some interesting insights.

I'll discuss the insights first, after which I will cover the technical details of how the task was accomplished. If you are a non-technical reader, please feel free to skip the latter part. Before we delve into the insights, let's first look at what exactly these charts represent. The verbatim definitions of these charts from Amazon's chart pages are below:

Amazon's Most Sold charts rank books according to the number of copies sold and pre-ordered through Amazon.com, Audible.com, Amazon Books stores, and books read through digital subscription programs (once a customer has read a certain percentage – roughly the length of a free reading sample). Bulk buys are counted as a single purchase.

Amazon's Most Read charts rank titles by the average number of daily Kindle readers and Audible listeners each week. Categories not ranked on Most Read charts include dictionaries, encyclopedias, religious texts, daily devotionals, and calendars.

The Fiction Charts

I collected the data on the charts for the 371 weeks since Amazon started publishing them. That is too long a time span to inspect manually, so I had to compute some aggregate metrics to make sense of the data. One of the first and most obvious things I calculated was which books had been on the chart for the longest period. In other words, these are the longest charting books.

Longest Charting Books

For both the most-read and the most-sold charts under fiction, I counted how many weeks each book had spent on the chart. Then I ranked the books by this metric and the top 20 books are in the tables below.

Longest Charting Most Read Books:

Book Name | Weeks
Harry Potter and the Deathly Hallows | 371
Harry Potter and the Sorcerer's Stone | 371
Harry Potter and the Goblet of Fire | 371
Harry Potter and the Half-Blood Prince | 370
Harry Potter and the Prisoner of Azkaban | 365
Harry Potter and the Chamber of Secrets | 346
Harry Potter and the Order of the Phoenix | 331
Where the Crawdads Sing | 161
A Game of Thrones | 90
Dune | 88
The Handmaid's Tale | 84
Oathbringer | 80
Lessons in Chemistry | 77
Beneath a Scarlet Sky | 71
Demon Copperhead | 64
Project Hail Mary | 63
A Court of Mist and Fury | 60
American Dirt | 60
Little Fires Everywhere | 59
The Silent Patient | 59
Ready Player One | 59

Longest Charting Most Sold Books:

Book Name | Weeks
Harry Potter and the Sorcerer's Stone | 206
Where the Crawdads Sing | 197
It Ends with Us | 114
Verity | 99
The Seven Husbands of Evelyn Hugo | 97
The Silent Patient | 90
The Housemaid | 83
Lessons in Chemistry | 80
Little Fires Everywhere | 74
A Court of Thorns and Roses | 74
The Handmaid's Tale | 69
The Midnight Library | 66
Things We Never Got Over | 65
Beneath a Scarlet Sky | 64
Reminders of Him | 59
Fourth Wing | 59
Before We Were Yours | 58
The Last Thing He Told Me | 58
Haunting Adeline | 55
It Starts with Us | 51

The Harry Potter series dominates the most-read charts, with the 7 books of the series taking the top 7 spots and leaving the rest behind by a huge margin. The top 3 have been on this chart every week the chart was published. The first book after the Harry Potter series, "Where the Crawdads Sing", has spent less than half as many weeks on the chart as the book just above it.

However, we see some diversity in the most-sold charts. The top spot in the most-sold section is still held by the first book in the Harry Potter series - Harry Potter and the Sorcerer's Stone, but none of the other books from the series are in this table. One could assume that very few readers who bought the first book made it through to the end to buy the subsequent books in the series.

Another possibility is that they switched to eBooks/Audiobooks afterward. Could this mean the audience prefers eBooks/Audiobooks when it comes to lengthy series or bulkier books? Another data point that supports this hypothesis is "A Game of Thrones".

It appears in the most-read table but not in the most-sold table. Besides "Harry Potter and the Sorcerer's Stone", the books that occur in both tables are "Where the Crawdads Sing", "The Handmaid's Tale", "Lessons in Chemistry", "Beneath a Scarlet Sky", "Little Fires Everywhere", and "The Silent Patient".

Stint Analysis

In the previous section, we looked at the total time the books have spent on the charts. But this period may or may not be continuous. When we see that Dune has spent 88 weeks on the chart, it does not mean that Dune entered the chart one week, stayed there for 88 weeks, and left, never to return. It has entered and exited the charts several times. In the stint analysis section, I tried to see how books enter, stay in, and exit the charts.

Here, a stint would mean a continuous number of weeks for which a book was on the chart. Now, let's see the longest stints in most-read and most-sold fiction charts.
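The stint definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming each book's chart weeks are available as ISO date strings exactly 7 days apart, as stored by the scraper described later. The function name `find_stints` and the sample dates are hypothetical, not the author's code or data.

```python
from datetime import date, timedelta

def find_stints(week_dates):
    """Group weekly chart dates into stints of consecutive weeks.

    Returns a list of (start_date, number_of_weeks) tuples in date order.
    """
    weeks = sorted(date.fromisoformat(d) for d in set(week_dates))
    stints = []  # each item: (start, last_week_seen, week_count)
    for week in weeks:
        if stints and week - stints[-1][1] == timedelta(days=7):
            # exactly one week after the previous entry: extend the stint
            start, _, count = stints[-1]
            stints[-1] = (start, week, count + 1)
        else:
            # first entry, or a gap of more than one week: new stint
            stints.append((week, week, 1))
    return [(start.isoformat(), count) for start, _, count in stints]

# hypothetical weeks for one book: three in a row, a gap, then two more
sample = ["2021-07-04", "2021-07-11", "2021-07-18", "2021-10-24", "2021-10-31"]
stints = find_stints(sample)
longest = max(stints, key=lambda s: s[1])
print(stints, longest, len(stints))
```

The longest stint and the number of stints, the two metrics used in this section, both fall out of the same list.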

Longest Stints in the Most Read Charts:

Book Name | Start Date | Weeks
Harry Potter and the Goblet of Fire | 2017-05-14 | 371
Harry Potter and the Deathly Hallows | 2017-05-14 | 371
Harry Potter and the Sorcerer's Stone | 2017-05-14 | 371
Harry Potter and the Half-Blood Prince | 2017-12-03 | 342
Harry Potter and the Prisoner of Azkaban | 2018-01-07 | 337
Harry Potter and the Order of the Phoenix | 2017-05-14 | 331
Harry Potter and the Chamber of Secrets | 2018-01-14 | 320
Where the Crawdads Sing | 2018-09-23 | 112
Lessons in Chemistry | 2022-08-14 | 77
Demon Copperhead | 2022-11-27 | 64
Project Hail Mary | 2021-05-09 | 59
The Covenant of Water | 2023-05-21 | 57
Fourth Wing | 2023-06-04 | 55
A Court of Thorns and Roses | 2023-06-25 | 52
A Court of Mist and Fury | 2023-06-25 | 52
Before We Were Yours | 2017-08-06 | 50
Rhythm of War | 2020-11-22 | 50
A Game of Thrones | 2018-10-07 | 49
Dune | 2021-07-04 | 49
The Vanishing Half | 2020-06-28 | 47

Longest Stints in the Most Sold Charts:

Book Name | Start Date | Weeks
Where the Crawdads Sing | 2018-09-09 | 116
It Ends with Us | 2021-06-20 | 91
Lessons in Chemistry | 2022-07-24 | 79
The Seven Husbands of Evelyn Hugo | 2021-06-27 | 72
Verity | 2021-12-26 | 71
Reminders of Him | 2022-01-09 | 59
Fourth Wing | 2023-05-07 | 59
Harry Potter and the Sorcerer's Stone | 2018-05-06 | 52
Iron Flame | 2023-07-23 | 48
The Housemaid | 2023-01-01 | 47
The Midnight Library | 2020-12-13 | 46
Where the Crawdads Sing | 2022-03-20 | 43
The Last Thing He Told Me | 2021-05-09 | 43
It Starts with Us | 2022-07-24 | 40
The Silent Patient | 2019-12-08 | 39
The Silent Patient | 2019-02-10 | 38
American Dirt | 2020-01-26 | 37
Things We Never Got Over | 2022-01-23 | 37
Haunting Adeline | 2023-03-12 | 37
A Court of Thorns and Roses | 2023-10-15 | 36

In the most-read longest stints table, some books also appear in the corresponding longest charting table, while others appear here for the first time. These newcomers are probably the ones that achieved momentary popularity and slowly faded away.

To look at the opposite of this effect, I looked at the books with the most stints, i.e., the books that enter and exit the charts every now and then. We could say these are the books that stand the test of time but cannot beat the longest charters and move to the top. The books with the most stints on both charts are below.

Most Stints in the Most Read Charts:

Book Name | Stints
A Game of Thrones | 10
Oathbringer | 9
Dune | 8
The Silent Patient | 6
Haunting Adeline | 6
The Housemaid | 6
The Handmaid's Tale | 6
The Way of Kings | 5
Harry Potter and the Chamber of Secrets | 5
It Ends with Us | 5
Where the Crawdads Sing | 5
When We Believed in Mermaids | 5
House of Earth and Blood | 4
American Dirt | 4
The Last Thing He Told Me | 4
The Fellowship of the Ring | 4
A Court of Silver Flames | 3
Never Lie | 3
Where the Forest Meets the Stars | 3
Thin Air | 3
It | 3
A Clash of Kings | 3
Ready Player One | 3
A Court of Thorns and Roses | 3
The Storyteller's Secret | 3
Fire & Blood | 3
House of Sky and Breath | 3
Anxious People | 3
A Court of Wings and Ruin | 3
Beneath a Scarlet Sky | 3
Things We Never Got Over | 3
The Letter | 3
Harry Potter and the Prisoner of Azkaban | 3
The Perfect Marriage | 3

Most Stints in the Most Sold Charts:

Book Name | Stints
Harry Potter and the Sorcerer's Stone | 27
If Animals Kissed Good Night | 11
Beneath a Scarlet Sky | 11
The Handmaid's Tale | 10
The Silent Patient | 10
Then She Was Gone | 10
The Housemaid's Secret | 9
Oh, the Places You'll Go | 8
When We Believed in Mermaids | 8
Eleanor Oliphant Is Completely Fine | 8
Harry Potter and the Chamber of Secrets | 8
Little Fires Everywhere | 7
The Tattooist of Auschwitz | 7
Demon Copperhead | 7
Remarkably Bright Creatures | 7
Project Hail Mary | 7
The Housemaid | 7
1984 | 7
The Hobbit | 7
A Court of Thorns and Roses | 6
The Hate U Give | 6
Haunting Adeline | 6
Dune | 6
It's Not Easy Being a Bunny | 6
Before We Were Yours | 6
The Boy, the Mole, the Fox and the Horse | 6
Where the Crawdads Sing | 6
The Invisible Life of Addie LaRue | 6
Mad Honey | 6

In these tables, apart from the books that also feature in the longest charting tables, we see some classics such as "1984" and books that have been (or will be) adapted as movies, such as "The Hobbit" and "Eleanor Oliphant Is Completely Fine".

Being adapted as a movie is arguably a good measure of a fiction book's popularity, though most readers swear by reading the book over watching the movie.

Animated Timelapse

To visually sum up our discussion above, let's look at animated time-lapses of the longest charting books in the two charts for the 371 weeks we analyzed. The growing bars represent the number of weeks each book has spent on the chart as time progresses.

Timelapse of Most Read Fiction Chart

Timelapse of Most Sold Fiction Chart

As time progresses we see multiple books entering, staying in, and leaving the table, with some books steadily growing their bars. One observation that caught my eye was how Dune's bar entered the most-read table a few weeks after the first Dune movie was released in late 2021.

It grows for a while and then mostly stays static, and grows again after the second movie was released in early 2024.

This suggests a link between movie releases and book readership, although a single title is not enough to establish a correlation. Having your book adapted into a movie may be good for your book after all. A literature geek with a keener eye than I have can surely make more such observations from the animation.
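The growing bars in such a time-lapse are running totals of weeks on the chart. The sketch below shows one way to compute the data behind each frame, assuming chart entries are dictionaries with `date` and `book_name` fields like the scraped rows described later. The function name `cumulative_frames`, the `top_n` parameter, and the sample rows are hypothetical.

```python
from collections import Counter

def cumulative_frames(entries, top_n=3):
    """Return one (date, [(book, weeks_so_far), ...]) frame per chart week."""
    # group book names by chart week
    by_date = {}
    for entry in entries:
        by_date.setdefault(entry["date"], []).append(entry["book_name"])

    totals = Counter()
    frames = []
    for week in sorted(by_date):  # ISO dates sort chronologically
        totals.update(by_date[week])  # one more week for every charting book
        frames.append((week, totals.most_common(top_n)))
    return frames

# hypothetical chart entries covering three weeks
sample = [
    {"date": "2017-05-14", "book_name": "Book A"},
    {"date": "2017-05-14", "book_name": "Book B"},
    {"date": "2017-05-21", "book_name": "Book A"},
    {"date": "2017-05-21", "book_name": "Book C"},
    {"date": "2017-05-28", "book_name": "Book A"},
    {"date": "2017-05-28", "book_name": "Book C"},
]
frames = cumulative_frames(sample)
print(frames[-1])
```

Each frame could then be drawn as a horizontal bar chart, one frame per animation step.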

The Non-Fiction Charts

I performed the same set of analyses on the non-fiction most-read and most-sold charts as I did with the fiction charts. Here too, let's look at the longest charting books and the stints, and finish with animated time-lapses.

Longest Charting Books

The top 20 longest charting books for most-read and most-sold non-fiction are in the tables below.

Longest Charting Most Read Books:

Book Name | Weeks
Sapiens | 371
The Subtle Art of Not Giving a F*ck | 339
12 Rules for Life | 329
Can't Hurt Me | 289
Atomic Habits | 287
The 7 Habits of Highly Effective People | 237
How to Win Friends and Influence People | 231
If You Tell | 198
The Daily Stoic | 177
Born a Crime | 174
Greenlights | 165
Educated | 164
The Body Keeps the Score | 155
The 48 Laws of Power | 153
Becoming | 146
Never Split the Difference | 144
The Power of Now | 123
You Are a Badass | 100
Unfu*k Yourself | 98
A Promised Land | 95

Longest Charting Most Sold Books:

Book Name | Weeks
The Subtle Art of Not Giving a F*ck | 292
Atomic Habits | 282
Can't Hurt Me | 224
The Four Agreements | 184
The Body Keeps the Score | 184
12 Rules for Life | 169
Rich Dad Poor Dad | 164
The 48 Laws of Power | 155
How to Win Friends and Influence People | 146
Educated | 120
Greenlights | 120
The 5 Love Languages | 116
Becoming | 116
Unfu*k Yourself | 101
The 7 Habits of Highly Effective People | 88
You Are a Badass | 79
If You Tell | 76
Girl, Wash Your Face | 75
Sapiens | 69
Killers of the Flower Moon | 69
Untamed | 69

From the two tables, the first observation I can make is that most books occurring in the most-read table also occur in the most-sold table. In contrast to the fiction charts, the numbers decrease fairly evenly rather than a single book or series taking the lead by a large margin.

Most of the books in these two tables fall in the self-help category, with the notable exceptions of Sapiens, Born a Crime, and Killers of the Flower Moon. Sapiens has stayed on the most-read charts for all the 371 weeks we looked at, but only barely made it to the most-sold table.

This is another example in support of the hypothesis I described for fiction: bulkier books tend to appear in the most-read charts rather than the most-sold charts.

Stint Analysis

In this section, let's look at which non-fiction books have spent the longest continuous periods on the chart, and which books have entered and exited the charts the most times. Firstly, the longest stints on the most-read and most-sold charts are below.

Longest Stints in the Most Read Chart:

Book Name | Start Date | Weeks
Sapiens | 2017-05-14 | 371
Can't Hurt Me | 2018-12-09 | 289
Atomic Habits | 2018-12-30 | 286
The 7 Habits of Highly Effective People | 2017-05-14 | 189
12 Rules for Life | 2020-06-21 | 188
The Subtle Art of Not Giving a F*ck | 2020-12-20 | 183
Educated | 2018-03-04 | 162
Greenlights | 2020-10-25 | 158
The 48 Laws of Power | 2021-08-22 | 148
How to Win Friends and Influence People | 2017-05-14 | 148
The Subtle Art of Not Giving a F*ck | 2017-05-14 | 148
Becoming | 2018-11-18 | 146
The Body Keeps the Score | 2021-06-13 | 133
The Daily Stoic | 2021-12-19 | 131
Born a Crime | 2017-05-14 | 129
12 Rules for Life | 2018-01-28 | 112
The Power of Habit | 2017-05-14 | 87
A Promised Land | 2021-01-10 | 86
Untamed | 2020-03-15 | 83
If You Tell | 2019-11-03 | 79
Girl, Wash Your Face | 2018-04-15 | 79

Longest Stints in the Most Sold Chart:

Book Name | Start Date | Weeks
Atomic Habits | 2020-06-21 | 209
The Body Keeps the Score | 2021-01-17 | 150
The Subtle Art of Not Giving a F*ck | 2017-05-14 | 128
Educated | 2018-02-25 | 119
Becoming | 2018-10-14 | 96
Greenlights | 2020-10-25 | 80
The 48 Laws of Power | 2022-12-25 | 78
Girl, Wash Your Face | 2018-04-01 | 72
Outlive | 2023-03-26 | 65
12 Rules for Life | 2018-01-21 | 64
Untamed | 2020-03-08 | 57
Can't Hurt Me | 2018-12-02 | 52
The 48 Laws of Power | 2021-12-26 | 50
Rich Dad Poor Dad | 2021-12-26 | 48
The Subtle Art of Not Giving a F*ck | 2021-12-12 | 46
Atlas of the Heart | 2021-12-05 | 41
Killers of the Flower Moon | 2023-05-21 | 40
The Four Agreements | 2021-08-01 | 40
Girl, Stop Apologizing | 2019-02-24 | 40
The Subtle Art of Not Giving a F*ck | 2022-12-25 | 39
Unfu*k Yourself | 2018-12-30 | 39

In these tables, we see that the pattern is very similar to that of the longest charting tables. Could this mean non-fiction books appear on the charts, stay for however long they can, and then leave the chart almost permanently? Let's look at the most-stints tables to get an idea about this.

Most Stints in the Most Read chart:

Book Name | Stints
The Power of Now | 21
Never Split the Difference | 19
Think and Grow Rich | 14
If You Tell | 11
Extreme Ownership | 10
How to Win Friends and Influence People | 9
The Daily Stoic | 8
The Rise and Fall of the Third Reich | 7
First 100 Words | 7
Killers of the Flower Moon | 7
Unfu*k Yourself | 6
You Are a Badass | 6
Rich Dad Poor Dad | 6
Alexander Hamilton | 5
12 Rules for Life | 4
Maybe You Should Talk to Someone | 4
Elon Musk | 4
The Mountain Is You | 4
Never Finished | 4
Born a Crime | 4
The 48 Laws of Power | 4
The Subtle Art of Not Giving a F*ck | 4
I'm Glad My Mom Died | 4
The 7 Habits of Highly Effective People | 4

Most Stints in the Most Sold Chart:

Book Name | Stints
How to Win Friends and Influence People | 28
The 5 Love Languages | 28
Sapiens | 24
12 Rules for Life | 22
The Subtle Art of Not Giving a F*ck | 22
Can't Hurt Me | 22
Rich Dad Poor Dad | 22
The 7 Habits of Highly Effective People | 20
The Four Agreements | 20
Born a Crime | 19
You Are a Badass | 14
Killers of the Flower Moon | 14
Never Split the Difference | 13
The Psychology of Money | 13
The 48 Laws of Power | 10
Greenlights | 10
Think and Grow Rich | 10
The Mountain Is You | 10
Unfu*k Yourself | 9
On Tyranny | 8
If You Tell | 8
Atomic Habits | 8

We see that the patterns in the most-stints tables for non-fiction are not very different from their fiction counterparts. However, a good number of books appearing here do not appear in the longest charting or longest stints tables. This suggests that non-fiction books too have their set of steady sailors that don't rise to the top but stand the test of time.

Animated Timelapse

To summarize our discussion above and see some patterns visually, let's look at animated time-lapses for the most-read and most-sold charts for non-fiction books. These are similar to the ones we saw for fiction books.

Timelapse of Most Read Non Fiction Chart

Timelapse of Most Sold Non Fiction Chart

Debut Patterns

We analyzed two key parameters associated with the debut of books on these charts. One is the position at which a book debuts and the other is the time between publication and debuting on the charts. I am presenting the debut patterns for fiction and non-fiction charts together as there was not much variation among them.

Debut Position Distributions

Debut Position - Most Read Fiction

Debut Position - Most Read Non Fiction

Debut Position - Most Sold Fiction

Debut Position - Most Sold Non Fiction

In all the charts, we see that a book is most likely to debut in the second half of the chart. However, only in the most-read charts, if a book makes it to the first half, there is a good chance it will be in the top two.
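A debut position distribution like the ones plotted above can be sketched as follows. This is an illustration under assumptions: a book's debut is taken to be its earliest appearance in the collected data, the rows carry `date`, `book_name`, and `chart_position` fields, and the sample rows are made up. Books already charting in the very first collected week would count as debuts under this simplification.

```python
import numpy as np

def debut_positions(entries):
    """Return {book_name: chart position in the first week the book appears}."""
    debuts = {}
    for entry in sorted(entries, key=lambda e: e["date"]):
        # setdefault keeps only the earliest position seen for each book
        debuts.setdefault(entry["book_name"], entry["chart_position"])
    return debuts

# hypothetical chart entries
sample = [
    {"date": "2017-05-14", "book_name": "Book A", "chart_position": 18},
    {"date": "2017-05-21", "book_name": "Book A", "chart_position": 5},
    {"date": "2017-05-21", "book_name": "Book B", "chart_position": 2},
    {"date": "2017-05-28", "book_name": "Book C", "chart_position": 18},
]
positions = list(debut_positions(sample).values())

# one bin per chart position 1..20; counts[i] is the number of debuts at i + 1
counts, _ = np.histogram(positions, bins=np.arange(1, 22))
print(counts)
```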

Time to Debut

I also looked at the time a book takes to debut after it has been published. Essentially, this is the difference (in weeks) between its debut date and its publication date, as seen on Amazon's product page. There were some instances of the publication date occurring after the debut date, for which we do not have a satisfactory explanation. Let's see the cumulative plot for the weeks to debut below.

Time to Debut Most Read Fiction vs. Non Fiction

Time to Debut Most Sold Fiction vs. Non Fiction

When we compare fiction with non-fiction for times to debut, we see that fiction books tend to debut slightly quicker than non-fiction books under both most-read and most-sold. Overall, there is no significant variation in this pattern among the 4 charts. We see that around half the books that have been on these charts debuted within one week of publication, and around three-quarters debuted within 3 years of publication. So, all 4 charts are mostly filled with newer books.
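The time-to-debut metric and its cumulative view can be sketched as below. The helper names and the sample dates are hypothetical, and whole weeks are computed with floor division, which is an assumption about rounding. A publication date after the debut date simply yields a negative value.

```python
from datetime import date

def weeks_to_debut(debut_date, publication_date):
    """Whole weeks between publication and chart debut (negative if debut came first)."""
    delta = date.fromisoformat(debut_date) - date.fromisoformat(publication_date)
    return delta.days // 7

def cumulative_share(weeks_list, thresholds):
    """Fraction of books that debuted within each threshold (in weeks)."""
    total = len(weeks_list)
    return {t: sum(w <= t for w in weeks_list) / total for t in thresholds}

# hypothetical (debut date, publication date) pairs
pairs = [
    ("2021-05-09", "2021-05-04"),  # 5 days apart    -> 0 weeks
    ("2018-09-23", "2018-08-14"),  # 40 days apart   -> 5 weeks
    ("2020-01-05", "2017-01-05"),  # 1095 days apart -> 156 weeks
]
weeks = [weeks_to_debut(debut, published) for debut, published in pairs]
shares = cumulative_share(weeks, thresholds=[1, 52, 156])
print(weeks, shares)
```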

A Pinch of Salt

Fans of the books featured on these tables and visuals have enough reasons to be excited. However, we cannot see any of this data as an absolute measure of any book's popularity. While the books featured here are surely popular, we cannot say that a book not featured here is not popular. The data from Amazon's charts can be skewed from the overall popularity by many factors.

Firstly, not everyone buys/rents all their books from Amazon. For instance, I prefer the charm of a cozy bookstore for my book buying, while using Amazon for just the obscure or new ones I cannot find in bookstores. Some titles I find in almost every bookstore, like "The Alchemist" by Paulo Coelho, are barely seen on Amazon charts.

The size could also play a role; The Alchemist fits in the side pocket of my shorts while Harry Potter does not. The former is the kind of book I'd buy at a train station or a bus stop to read on the way, and probably even sell to a used bookshop at my destination. The latter is something I'd have delivered to my home to read over a few weeks.

Other factors that could skew this chart are the age of books and second-hand sales. The older books go around well in the second-hand market offline, while newer books are easier to find online.

These days, it is easier for an author/publisher to launch a book online, and take all the effort to reach offline stores only if the book performs well. This could also lead to a behavior where readers buy only first-hand books online and second-hand copies of older books offline.

Technical Steps Involved

Python was the language of choice for this analysis. Briefly, the process involved scraping the chart pages from the Amazon website, scraping the product page of each book on the charts, and putting together the tables and visuals.

Packages Used

  • requests : For sending HTTP requests, receiving, and parsing responses.
  • beautifulsoup4 : For scraping data from HTML responses.
  • dataset : For working with SQLite databases. I used SQLite databases to store the scraped data.
  • scrapingbee : For scraping pages with bot detection, rate limiting, etc.
  • numpy : For numerical analyses, such as frequency distributions.
  • pandas : For analyzing data in tabular format.
  • matplotlib : For plotting line charts, bar charts, and animated visuals.

Scraping the Charts Data

The chart URLs have the pattern: https://www.amazon.com/charts/{week_start_date}/most{read_or_sold}/{non}fiction. The week start dates range from 14th May 2017 (the earliest available date) to 16th June 2024 (the latest available when we drafted this). This is a total of 371 weeks. In addition, the next two parts of the URLs can be mostread or mostsold and fiction or nonfiction respectively, giving us 4 charts for each week. In total, I had to scrape 1,484 chart pages (371 x 4 = 1484). Since the charts have 20 positions, I would have 29,680 chart entries in total (1484 x 20 = 29680). Let's look at the chart pages' scraping code below.

from datetime import datetime, timedelta

from bs4 import BeautifulSoup
import dataset
import requests

def _parse_weeks(text):
    '''For parsing weeks in text to an integer value
    '''
    text = text.upper().split(" WEEK")[0]
    return 1 if text=="FIRST" else int(text)

# connect to SQLite DB
db = dataset.connect("sqlite:///data.db")
# Creates a new one if data.db does not exist

# iterate over all sundays starting from 2017-05-14
date = datetime(2017, 5, 14)
end_date = datetime(2024, 6, 17)

while date<=end_date:
    date_str = date.isoformat().split("T")[0]

    # 4 possible chart urls for each date
    chart_names = [
        "mostread/fiction",
        "mostread/nonfiction",
        "mostsold/fiction",
        "mostsold/nonfiction",
    ]

    # get chart entries for each URL
    for chart_name in chart_names:
        url = f'https://www.amazon.com/charts/{date_str}/{chart_name}'
        print(url)

        # check if URL was scraped before
        last_key = chart_name + "-" + date_str + "-20"
        exists = db["chart_entries"].find_one(key=last_key)
        if exists:
            # if URL was already scraped, books were added
            # so, skip this URL
            print("SKIPPING: ALREADY PROCESSED")
            continue

        # if URL wasn't scraped before, go ahead
        # send HTTP request to URL
        r = requests.get(url)
        # convert response into a Soup
        soup = BeautifulSoup(r.text, features="html5lib")

        # each chart entry is within a div of a particular class
        # use CSS selectors to select these divs
        cards = soup.select("div.kc-horizontal-rank-card")
        assert len(cards)==20

        for card in cards:
            # run loop over each div containing a chart entry
            chart_position = int(card.select_one("div.kc-rank-card-rank").text.strip())

            # construct a unique key to avoid duplicate entries into db
            key = chart_name + "-" + date_str + "-" + str(chart_position)

            exists = db["chart_entries"].find_one(key=key)
            if exists:
                print("SKIPPING:", key, "- ALREADY EXISTS")
                continue

            # scrape necessary fields using CSS selectors
            author_el = card.select_one("div.kc-rank-card-author")
            chart_entry = dict(
                key=key,
                chart_name=chart_name,
                date=date_str,
                chart_position=chart_position,
                book_name=card.select_one("div.kc-rank-card-title").text.strip(),
                author_name=author_el.get("title") if author_el else "N/A",
                # use previously defined function to get weeks as int
                weeks_on_chart=_parse_weeks(card.select_one("div.kc-wol,div.kc-wol-first").text),
                link=card.select_one("a.kc-cover-link").get("href"),
                cover_image=card.select_one("a.kc-cover-link img").get("src"),
            )

            # make sure no fields are blank
            for dict_key in chart_entry:
                assert chart_entry[dict_key]

            # add the chart entry to db
            db["chart_entries"].insert(chart_entry)

    # repeat for subsequent sunday
    date = date + timedelta(days=7)

The above code is written in such a way that if it crashes for some reason, it can be restarted right where it was interrupted. Or, if we want to refresh our analysis after a few weeks, we can get data for just the newer weeks. Without the key checks, a scraper that starts from scratch on every run would leave duplicate entries in the database. Once this was done, I ensured there were exactly 29,680 chart entries in the database.
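As a quick cross-check of the arithmetic above, the weeks and chart URLs can be enumerated without sending any requests. This is a small sketch that reuses the URL pattern and date range stated earlier; the function name `chart_urls` is hypothetical.

```python
from datetime import date, timedelta

CHARTS = [
    "mostread/fiction",
    "mostread/nonfiction",
    "mostsold/fiction",
    "mostsold/nonfiction",
]

def chart_urls(start=date(2017, 5, 14), end=date(2024, 6, 16)):
    """Build every weekly chart URL between two Sundays, inclusive."""
    urls = []
    week = start
    while week <= end:
        for chart in CHARTS:
            urls.append(f"https://www.amazon.com/charts/{week.isoformat()}/{chart}")
        week += timedelta(days=7)
    return urls

urls = chart_urls()
weeks = len(urls) // len(CHARTS)
print(weeks, len(urls), len(urls) * 20)  # weeks, chart pages, chart entries
```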

Scraping Book Metadata

While the chart pages had the title, cover image, and URL for each book, I had to scrape the book's product page to obtain additional data about the book such as publication date, number of pages, category, etc.

Though there were 29,680 chart entries, I did not have to scrape that many product pages, as most books repeat on the chart. The simplest algorithm was to iterate over the chart entries and scrape a product page only if it hadn't been scraped before.

This approach yielded 3,497 unique books on the chart.
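The "scrape only if not seen before" idea can be sketched with a set of book ids. The link format used here (`/dp/<id>/...`) and the sample rows are assumptions for illustration only; the id is taken from the third path segment, as in the scraper that follows.

```python
def unique_books(entries):
    """Return the first chart entry for each distinct book id, in order."""
    seen = set()
    unique = []
    for entry in entries:
        # assumed link shape: "/dp/<book_id>/..." so index 2 holds the id
        book_id = entry["link"].split("/")[2]
        if book_id in seen:
            continue  # product page already handled, skip it
        seen.add(book_id)
        unique.append(entry)
    return unique

# hypothetical chart entries: the same book charts in two different weeks
sample = [
    {"book_name": "Book A", "link": "/dp/EXAMPLE0001/ref=week1"},
    {"book_name": "Book B", "link": "/dp/EXAMPLE0002/ref=week1"},
    {"book_name": "Book A", "link": "/dp/EXAMPLE0001/ref=week2"},
]
books = unique_books(sample)
print(len(sample), "entries ->", len(books), "product pages")
```

The scraper below applies the same check against the `books` table instead of an in-memory set, so the progress also survives restarts.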

Since Amazon is more rigorous with blocking automated access to the product pages, I used ScrapingBee API to overcome this limitation. Let's look at the code below.

from contextlib import contextmanager
from datetime import datetime
import html
import signal
import time
import traceback

from bs4 import BeautifulSoup
import dataset
import requests
from scrapingbee import ScrapingBeeClient

# connect to SQLite DB
db = dataset.connect("sqlite:///data.db")

# initialize ScrapingBeeClient using the ScrapingBee Python Package
spb_client = ScrapingBeeClient(api_key="{{REDACTED}}")

# for parsing date strings
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}
def _parse_date(raw_date):
    chunks = raw_date.lower().replace(",","").split(" ")
    month = int(MONTHS[chunks[0]])
    day = int(chunks[1])
    year = int(chunks[2])
    return datetime(year, month, day).isoformat().split("T")[0]

# for setting a timeout for requests
class TimeoutException(Exception): pass

@contextmanager
def _time_limit(seconds):
    def signal_handler(sigint, frame):
        raise TimeoutException
    signal.signal(signal.SIGALRM, signal_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def run():
    counter = 0
    for entry in db["chart_entries"]:
        # each book has an unique id on Amazon, present in the URL
        book_id = entry["link"].split("/")[2]
        base_url = "https://amazon.com"
        url = base_url + entry["link"]

        # printing for monitoring
        counter += 1
        print(counter, entry["book_name"], url)

        # check if book was already scraped
        exists = db["books"].find_one(book_id=book_id)
        if exists:
            # skip if scraped
            print("SKIPPING: ALREADY PROCESSED")
            continue

        # get the HTML in the page through ScrapingBee
        # throws an Exception if request does not complete in 60s
        with _time_limit(60):
            r = spb_client.get(url, params={"timeout": "10000", "return_page_source": "true"})

        # write HTML response to file, to inspect if required
        with open("last.html", "wb") as f:
            # r.content is bytes, so write it as-is in binary mode
            f.write(r.content)

        # create dict to store scraped data
        book_data = dict(
            book_id=book_id,
            name=entry["book_name"],
            author=entry["author_name"],
            url=url,
        )

        soup = BeautifulSoup(r.content, features="html5lib")
        if "/dogsofamazon" in str(r.content):
            # this is 404 page
            # mark as missing and move to next book
            book_data["missing"] = True
            print("BOOK PAGE MISSING, SKIPPING")
            # add book as missing to DB
            db["books"].insert(book_data)
            continue

        # scrape required fields
        categories = list(map(
            lambda s: s.text.strip(),
            soup.select("#wayfinding-breadcrumbs_feature_div a.a-link-normal"),
        ))
        book_data["categories"] = "|".join(categories)

        details = {}

        # most metadata is available as key value pairs
        # the format is slightly different for regular and audible links
        detail_soups = soup.select("#detailBullets_feature_div ul:first-child li .a-list-item")
        if len(detail_soups)>0:
            for item in detail_soups:
                spans = item.select("span")
                key = spans[0].text.replace("\u200F", "").replace("\u200E", "").replace("\n","")
                key = html.unescape(key).split(":")[0].strip()
                key = key.lower().replace(" ", "_").replace("-","_")
                value = html.unescape(spans[1].text).strip()
                if "_date" in key:
                    value = _parse_date(value)
                book_data["details_" + key] = value
        else:
            detail_soups = soup.select("#audibleProductDetails tr, #prodDetails tr")
            assert len(detail_soups)>0
            for item in detail_soups:
                key = item.select_one("th").text.lower()
                key = key.strip().replace(" ","_").replace("-","_").replace(".","_")
                value = item.select_one("td").text.strip()
                if "_date" in key:
                    value = _parse_date(value)
                book_data["details_" + key] = value

        # add scraped data to DB
        db["books"].insert(book_data)

    # indicate a successful run over all entries without errors
    return "FINISHED"

for i in range(400):
    # allow for 400 restarts in case of known exceptions
    # if run completes without exceptions, break the loop, exit the program
    try:
        if run()=="FINISHED":
            break
    # most exceptions below are network or Amazon related
    # they are fixed by simply retrying
    except AssertionError:
        print("ASSERTION ERROR, RESTARTING")
        time.sleep(20)
        continue
    except TimeoutException:
        print("TIMED OUT, RESTARTING")
        time.sleep(60)
        continue
    except requests.exceptions.ChunkedEncodingError:
        print("CHUNKED ENCODING ERROR, RESTARTING")
        time.sleep(2)
        continue
    except Exception:
        # any Exception other than the above indicates a bug in the code
        # bug is to be fixed and program is to be restarted
        print(traceback.format_exc())
        break

In the above code, there were a lot of possible network errors that could be fixed by simply restarting the program. I had written the code in such a way that I could leave the machine aside for a while without babysitting it and restarting it manually when something went wrong.

This involved restarting the program in case of network errors, or a particular request taking too long to complete.
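The same retry idea can be expressed as a small reusable helper. Note that this is a different arrangement from the code above: it retries a single call rather than restarting the whole `run()` loop. The helper name `retry`, its parameters, and the simulated failure are hypothetical.

```python
import time

def retry(task, retryable, attempts=5, delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, retrying only on the given exceptions.

    Waits delay * attempt_number seconds between tries and re-raises the
    last error once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except retryable:
            if attempt == attempts:
                raise
            sleep(delay * attempt)

# simulated flaky task: fails twice with a network-style error, then succeeds
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "page html"

# sleep is replaced with a no-op so the demo finishes instantly
result = retry(flaky_fetch, retryable=(ConnectionError,), sleep=lambda s: None)
print(result, calls["count"])
```

Any exception not listed in `retryable` propagates immediately, which mirrors the intent above of stopping on errors that point to a bug in the code.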

Once the above code was finished, the data was ready for analysis and visualization.

Basic Setup for Analysis and Visualization

There were some common datasets and packages that I used for all the visualizations. I prepared them as shown in the code below.

from datetime import datetime, timedelta

import dataset
import matplotlib.animation as animation
import matplotlib.image as image
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
import pandas as pd

# formatting settings for matplotlib
plt.rcParams["font.family"] = "Noto Sans"
plt.rcParams["font.size"] = 16

# common function for adding our logo to plots
logo = image.imread("logo.png")
def _add_logo(fig):
    logo_ax = fig.add_axes([0.02, 0.82, 0.13, 0.13], anchor="NE", zorder=1)
    logo_ax.imshow(logo)
    logo_ax.axis("off")

# connect to db
db = dataset.connect("sqlite:///data.db")

# get data for the 4 charts from DB
mostread_fiction = list(db["chart_entries"].find(chart_name="mostread/fiction"))
mostsold_fiction = list(db["chart_entries"].find(chart_name="mostsold/fiction"))
mostread_nonfiction = list(db["chart_entries"].find(chart_name="mostread/nonfiction"))
mostsold_nonfiction = list(db["chart_entries"].find(chart_name="mostsold/nonfiction"))

# read all necessary book metadata into a dict
book_metadata = {}
date_keys = ["details_publication_date", "details_audible_com_release_date", "details_release_date"]
for book in db["books"]:
    book_id = book["book_id"]
    book_metadata[book_id] = book
    for key in date_keys:
        if book[key]:
            book_metadata[book_id][key] = datetime(*list(map(int, book[key].split("-"))))

With one list for each of the 4 charts, it was possible to write one function per visualization and run each list through it in turn. This follows the DRY (Don't Repeat Yourself) principle.
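
A small sketch of that pattern, using made-up stand-in data rather than the real database lists: the charts live in a dict keyed by display name, and a single function is applied to each. The names `charts`, `file_suffix` and `count_entries` are hypothetical.

```python
# hypothetical stand-ins for the four lists loaded from the database
charts = {
    "Most Read Fiction": [{"book_name": "A"}, {"book_name": "B"}],
    "Most Sold Fiction": [{"book_name": "A"}],
    "Most Read Non Fiction": [],
    "Most Sold Non Fiction": [{"book_name": "C"}],
}


def file_suffix(chart_name):
    # turn a display name into a filename-friendly slug
    return chart_name.lower().replace(" ", "-")


def count_entries(entries, chart_name):
    # placeholder for a real table or plot function
    return f"{file_suffix(chart_name)}: {len(entries)} entries"


# one function, applied to every chart in turn
summaries = [count_entries(entries, name) for name, entries in charts.items()]
for line in summaries:
    print(line)
```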

Longest Charting Table

To make the longest charting table, I iterated over all the chart entries, while keeping count of how many times each book occurs. I then used Pandas to identify the top 20 most occurring books.

def table_longest_charting(entries, filename):
    books_dict = {}
    for entry in entries:
        book_name = entry["book_name"]
        if book_name not in books_dict:
            # start counter if it does not exist
            books_dict[book_name] = {"weeks": 0}
        # increment the counter
        books_dict[book_name]["weeks"] += 1

    index = [] # book names
    weeks = [] # number of weeks spent on charts
    for book_name in books_dict:
        index.append(book_name)
        weeks.append(books_dict[book_name]["weeks"])

    df = pd.DataFrame({"Weeks": weeks}, index=index)
    df = df.sort_values(by="Weeks", ascending=False)

    # identify the value for 20th position
    # filter out everything below it
    # this makes sure all books tied at 20th position are included
    cutoff_value = df["Weeks"].iloc[19]
    df = df[df["Weeks"] >= cutoff_value]
    df.to_markdown(filename, index=True)

# run above function on all 4 datasets
table_longest_charting(mostread_fiction, "longest-charting-most-read-fiction.md")
table_longest_charting(mostsold_fiction, "longest-charting-most-sold-fiction.md")
table_longest_charting(mostread_nonfiction, "longest-charting-most-read-nonfiction.md")
table_longest_charting(mostsold_nonfiction, "longest-charting-most-sold-nonfiction.md")
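
The counting and the tie-inclusive cutoff can also be sketched with only the standard library. This is an illustrative alternative, not the code behind the published tables. The function name `longest_charting` and the toy data are assumptions, and it additionally handles the case where fewer than `top_n` books exist.

```python
from collections import Counter


def longest_charting(entries, top_n=20):
    """Return (book_name, weeks) pairs for the top_n books, ties included."""
    weeks = Counter(entry["book_name"] for entry in entries)
    ranked = weeks.most_common()  # sorted by count, highest first
    if len(ranked) <= top_n:
        return ranked
    # value at the last qualifying position; everything tied with it is kept
    cutoff = ranked[top_n - 1][1]
    return [(name, count) for name, count in ranked if count >= cutoff]


# toy data: A charts for 3 weeks, B and C for 2 weeks each, D for 1 week
toy_entries = [{"book_name": n} for n in "AAABBCCD"]
top = longest_charting(toy_entries, top_n=2)
print(top)
```

With `top_n=2`, both B and C are kept because they tie at the cutoff position.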

Stint Analysis

Stint analysis involved tracking books as they enter, stay on, and leave the chart over successive weeks. This section produced two tables: the longest stints and the books with the most stints. The code for this is below.

def stints(entries, chart_name):
    file_suffix = chart_name.lower().replace(" ", "-")

    # to keep track of books by week
    weekwise_books = {}
    for entry in entries:
        if entry["date"] not in weekwise_books:
            weekwise_books[entry["date"]] = set()
        book_name = entry["book_name"]
        weekwise_books[entry["date"]].add(book_name)

    # the dates to iterate over
    dates = sorted(weekwise_books.keys())

    # to keep track of books that entered and left
    stints = []

    # to keep track of books that are currently staying on the charts
    open_stints = {}

    # close any stints still open on the final week in the data
    last_date = dates[-1]
    for date in dates:
        week_book_names = weekwise_books[date]

        # for books on the chart in this week
        for book_name in week_book_names:
            # check if this book entered the charts this week
            # if yes, initialize stint
            if book_name not in open_stints:
                open_stints[book_name] = {
                    "book_name": book_name,
                    "start_date": date,
                    "weeks": 0,
                }

            # increment weeks in stint
            open_stints[book_name]["weeks"] += 1

        # for books that were staying on the charts
        open_stint_book_names = list(open_stints.keys())
        for book_name in open_stint_book_names:
            # check if a book exited this week
            if book_name not in week_book_names or date==last_date:
                # record the stint weeks
                stints.append(open_stints[book_name])
                # end the stint
                del open_stints[book_name]

    # make a dataframe with the stints
    data = {"Book Name": [], "Start Date": [], "Weeks": []}
    index = []
    for obj in stints:
        data["Book Name"].append(obj["book_name"])
        data["Start Date"].append(obj["start_date"])
        data["Weeks"].append(obj["weeks"])
        index.append(obj["book_name"])

    df = pd.DataFrame(data, index)
    df = df.sort_values(by="Weeks", ascending=False)

    # write the longest stints
    cutoff_value = df["Weeks"].iloc[19]
    top_df = df[df["Weeks"] >= cutoff_value]
    top_df.to_markdown("longest-stints-" + file_suffix + ".md", index=False)

    # make a dataframe with number of stints per book
    stint_count_dict = {x: index.count(x) for x in set(index)}
    counts_index = list(stint_count_dict.keys())
    counts_data = {"Book Name": [], "Stints": []}
    for book_name in counts_index:
        counts_data["Book Name"].append(book_name)
        counts_data["Stints"].append(stint_count_dict[book_name])

    stint_count_df = pd.DataFrame(counts_data, counts_index)
    stint_count_df = stint_count_df.sort_values(by="Stints", ascending=False)

    # write books with most number of stints
    cutoff_value = stint_count_df["Stints"].iloc[19]
    top_df = stint_count_df[stint_count_df["Stints"] >= cutoff_value]
    top_df.to_markdown("most-stints-"+file_suffix+".md", index=False)

# run above function on the 4 datasets
stints(mostread_fiction, "Most Read Fiction")
stints(mostread_nonfiction, "Most Read Non Fiction")
stints(mostsold_fiction, "Most Sold Fiction")
stints(mostsold_nonfiction, "Most Sold Non Fiction")

The above code produces 8 tables in total, 2 for each dataset.
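
To make the stint logic easier to follow, here is a stripped-down sketch run on made-up data. The function name `find_stints`, the input shape and the book names are assumptions for illustration. A stint ends as soon as a book misses a week, and a later reappearance starts a new stint.

```python
def find_stints(weekly_books):
    """weekly_books: list of (date, set_of_book_names), sorted by date.

    Returns a list of (book_name, start_date, weeks) tuples, one per stint.
    """
    finished = []
    open_stints = {}  # book_name -> [start_date, weeks so far]
    for date, books in weekly_books:
        # close stints of books that dropped off this week
        for name in list(open_stints):
            if name not in books:
                start, weeks = open_stints.pop(name)
                finished.append((name, start, weeks))
        # open or extend stints of books on the chart this week
        for name in sorted(books):
            stint = open_stints.setdefault(name, [date, 0])
            stint[1] += 1
    # stints still running after the last week are closed at the end
    for name, (start, weeks) in sorted(open_stints.items()):
        finished.append((name, start, weeks))
    return finished


toy_weeks = [
    ("2024-01-07", {"Book A", "Book B"}),
    ("2024-01-14", {"Book A"}),
    ("2024-01-21", {"Book A", "Book B"}),
]
result = find_stints(toy_weeks)
for row in result:
    print(row)
```

Here Book A has one stint of 3 weeks, while Book B has two separate stints of 1 week each.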

Debut Pattern Analysis

The debut pattern analysis worked on a filtered list of chart entries, containing only each book's first appearance on the chart. Based on this, I analyzed the position at which books debut and the time they take to debut after publication.

def debut_trends(entries):
    # common for both plots

    # to keep track over iterations
    debuted_book_names = set()
    debut_positions = []
    times_to_debut = []

    for entry in entries:
        book_name = entry["book_name"]
        book_id = entry["link"].split("/")[2]
        if book_name not in debuted_book_names:
            book = book_metadata[book_id]
            if book["details_publication_date"]:
                chart_date = datetime(*list(map(int, entry["date"].split("-"))))
                diff = chart_date - book["details_publication_date"]
                times_to_debut.append(int(diff.days/7))
            debuted_book_names.add(book_name)
            debut_positions.append(entry["chart_position"])
    return debut_positions, times_to_debut

def plot_debut_position(debut_positions, chart_name):
    file_suffix = chart_name.lower().replace(" ", "-")

    # bin the positions
    positions = list(range(1,21))
    positionwise_debut_percentages = list(map(
        lambda val: debut_positions.count(val)/len(debut_positions),
        positions,
    ))

    # plot bar chart
    fig = plt.figure(figsize=(16, 9))
    bar_plot = plt.bar(
        positions, positionwise_debut_percentages,
        color="#ffc91f",
    )
    plt.xlabel("Debut Position on Chart")
    plt.ylabel("% of Books")
    plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    plt.xticks(positions)
    plt.bar_label(bar_plot, fmt=lambda x: f'{round(x*100, 1)}%')
    plt.title(
        "Amazon Book Charts\nDebut Position Distribution - " + chart_name,
    )
    _add_logo(fig)
    plt.savefig("debut-position-distribution-" + file_suffix + ".png")
    plt.clf()

def plot_times_to_debut(times_to_debut_a, times_to_debut_b, chart_name, legend):
    file_suffix = chart_name.lower().replace(" ", "-").replace(".", "")

    # Compute histogram values first
    # bin edges run up to max + 1 so that every whole week gets its own bin,
    # including the largest value
    histogram_a, _bins = np.histogram(times_to_debut_a, bins=range(0, max(times_to_debut_a) + 2))
    cumulative_histogram_a = np.cumsum(histogram_a)
    normalized_cumulative_histogram_a = cumulative_histogram_a/cumulative_histogram_a[-1]

    histogram_b, _bins = np.histogram(times_to_debut_b, bins=range(0, max(times_to_debut_b) + 2))
    cumulative_histogram_b = np.cumsum(histogram_b)
    normalized_cumulative_histogram_b = cumulative_histogram_b/cumulative_histogram_b[-1]

    # plot computed histogram as line chart
    # avoided plt.hist due to limited formatting options
    fig = plt.figure(figsize=(16, 9))
    plt.plot(normalized_cumulative_histogram_a[:500], color = "#ffc91f", linewidth=2)
    plt.plot(normalized_cumulative_histogram_b[:500], color = "#1fffcb", linewidth=2)
    plt.ylim(0,1)
    plt.xlabel("Weeks After Publication")
    plt.ylabel("% of Books Debuted")
    plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    plt.title("Amazon Book Charts\nCumulative Time To Debut - " + chart_name)
    plt.legend(legend)

    _add_logo(fig)
    plt.savefig(f"time-to-debut-{file_suffix}.png")
    plt.clf()


# Most Read Debut Trends
debut_positions_1, times_to_debut_1  = debut_trends(mostread_fiction)
debut_positions_2, times_to_debut_2 = debut_trends(mostread_nonfiction)
plot_debut_position(debut_positions_1, "Most Read Fiction")
plot_debut_position(debut_positions_2, "Most Read Non Fiction")
plot_times_to_debut(
    times_to_debut_1,
    times_to_debut_2,
    "Most Read Fiction vs. Non Fiction",
    ["Fiction", "Non Fiction"],
)

# Most Sold Debut Trends
debut_positions_3, times_to_debut_3 = debut_trends(mostsold_fiction)
debut_positions_4, times_to_debut_4 = debut_trends(mostsold_nonfiction)
plot_debut_position(debut_positions_3, "Most Sold Fiction")
plot_debut_position(debut_positions_4, "Most Sold Non Fiction")
plot_times_to_debut(
    times_to_debut_3,
    times_to_debut_4,
    "Most Sold Fiction vs. Non Fiction",
    ["Fiction", "Non Fiction"]
)

The above code produced 6 visuals in total.
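
The cumulative time-to-debut curve is easier to check by hand on a tiny example. The sketch below uses only the standard library and made-up values. The name `cumulative_debut_share` is hypothetical, and clamping pre-publication debuts to week 0 is a choice made for this sketch rather than something the plotting code above does.

```python
from collections import Counter
from itertools import accumulate


def cumulative_debut_share(times_to_debut):
    """Share of books that debuted within 0, 1, 2, ... weeks of publication.

    Expects whole weeks; pre-publication debuts (negative values) are
    clamped to week 0 in this sketch.
    """
    weeks = [max(0, t) for t in times_to_debut]
    counts = Counter(weeks)
    # one slot per week, from 0 up to and including the largest value
    per_week = [counts.get(w, 0) for w in range(max(weeks) + 1)]
    total = len(weeks)
    return [running / total for running in accumulate(per_week)]


toy_times = [0, 0, 1, 3]  # made-up values, in weeks
shares = cumulative_debut_share(toy_times)
print(shares)  # [0.5, 0.75, 0.75, 1.0]
```

Half of the toy books debut in week 0, three quarters have debuted by week 1, and the curve reaches 100% at week 3.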

Animated Bar Charts

I made animated bar charts to visualize the tables as weeks passed by, with bars indicating how long each book had been on the charts. For this, I used the animation capabilities of Matplotlib. The code is below.


# used one of 6 colors for each book
# color of a book is fixed through the timelapse
# helps to keep track of a book as it moves

def animated_bar_chart(entries, chart_name):
    file_suffix = chart_name.lower().replace(" ", "-")

    # dates to iterate over
    dates = list(set([x["date"] for x in entries]))
    dates.sort()

    fig = plt.figure(figsize=(12, 9))
    colorset = ["#ffcb1f", "#53ff1f", "#1fffcb", "#1f53ff", "#cb1fff", "#ff1f53"]
    charted_books = {}

    # returns a plot for a given date
    def animate(i):
        print(i+1, end="\r"*4)

        # clear previous plot
        plt.clf()

        # get entries for given date
        date = dates[i]
        date_obj = datetime(*map(int, date.split("-")))
        date_entries = list(filter(lambda entry: entry["date"]==date, entries))
        date_entries.sort(key=lambda entry: entry["chart_position"])

        # get book names and number of weeks on chart
        for entry in date_entries:
            # if its a new entrant, start counter and allot a color
            if entry["book_name"] not in charted_books:
                color = colorset[len(charted_books)%len(colorset)]
                charted_books[entry["book_name"]] = {"color": color, "weeks": 0}
            # increment counter for new entrants & previously charted books
            charted_books[entry["book_name"]]["weeks"] += 1

        # construct a data frame to get top 20
        names = []
        weeks_on_chart = []
        colors = []
        for book_name in charted_books:
            names.append(book_name)
            weeks_on_chart.append(charted_books[book_name]["weeks"])
            colors.append(charted_books[book_name]["color"])

        df = pd.DataFrame({"Name": names, "Weeks": weeks_on_chart, "Color": colors})
        df = df.sort_values(by=["Weeks", "Name"], ascending=False).head(20)

        # draw the bar plot for this frame
        bar_plot = plt.barh(
            df["Name"][::-1],
            df["Weeks"][::-1],
            color=df["Color"][::-1],
            height=0.3,
        )

        # customize the plot
        ax = plt.gca()
        ax.get_xaxis().set_visible(False)
        ax.spines["top"].set_visible(False)
        ax.spines["right"].set_visible(False)
        ax.spines["bottom"].set_visible(False)
        ax.spines["left"].set_visible(False)

        plt.tick_params(left=False)
        plt.bar_label(bar_plot, padding=10)

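        # note: %-d (day without leading zero) is platform-specific
        # and is not supported on Windows, where %#d is the equivalent
        date_str = date_obj.strftime("%-d %b %Y")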
        plt.title(
            "\n".join([
                f"Number of Weeks on Amazon's {chart_name} Chart",
                "Date: " + date_str
            ]),
            x=1.05, y=1.02, ha="right",
        )
        plt.subplots_adjust(left=0.6, top=0.87)
        _add_logo(fig)

    # generate plot for all dates and put it together
    ani = animation.FuncAnimation(fig, animate, frames=len(dates))
    # save as GIF
    ani.save(f"animated-{file_suffix}.gif", writer=animation.PillowWriter(fps=3))

animated_bar_chart(mostread_fiction, "Most Read Fiction")
animated_bar_chart(mostread_nonfiction, "Most Read Non Fiction")
animated_bar_chart(mostsold_fiction, "Most Sold Fiction")
animated_bar_chart(mostsold_nonfiction, "Most Sold Non Fiction")
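
The bookkeeping behind each frame can be separated from the plotting. The sketch below is a simplified illustration with made-up data, and `frames_top_n` is a hypothetical name. It keeps a running week count and a fixed color per book, then yields the rows that one frame would draw. Unlike the function above, it does not order a week's entries by chart position before assigning colors.

```python
def frames_top_n(entries, top_n=20, colors=("#ffcb1f", "#53ff1f", "#1fffcb")):
    """Yield (date, rows) per week; rows are (name, weeks_so_far, color)."""
    charted = {}  # book_name -> {"color": ..., "weeks": ...}
    for date in sorted({e["date"] for e in entries}):
        for e in entries:
            if e["date"] != date:
                continue
            # a new entrant gets the next color in the cycle and keeps it
            book = charted.setdefault(
                e["book_name"],
                {"color": colors[len(charted) % len(colors)], "weeks": 0},
            )
            book["weeks"] += 1
        rows = [(name, b["weeks"], b["color"]) for name, b in charted.items()]
        # most weeks first; ties broken by name in descending order
        rows.sort(key=lambda r: (r[1], r[0]), reverse=True)
        yield date, rows[:top_n]


toy_entries = [
    {"date": "2024-01-07", "book_name": "Book A"},
    {"date": "2024-01-07", "book_name": "Book B"},
    {"date": "2024-01-14", "book_name": "Book A"},
    {"date": "2024-01-14", "book_name": "Book C"},
]
frames = list(frames_top_n(toy_entries))
for date, rows in frames:
    print(date, rows)
```

Book A keeps the same color in both frames while its bar grows from 1 week to 2.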

Conclusion

We looked at some interesting tables and visuals from the Amazon book charts over a period of 7 years. Under each of the two categories of fiction and non-fiction, we saw 2 charts for most-sold and most-read books.

We saw the longest charting books, looked at how books debut, stay on, and exit the charts, and also at animated bar charts to visually summarize everything.

Some of these patterns differed between the two categories, for example the distribution of books in the longest charting table. However, in all the charts, we saw mostly newer books debuting. This suggests Amazon has been a great platform for debuting books in recent times.

Apart from the effect of newer books, we could also see that TV and film have a huge impact on the charts. The Harry Potter series, which dominates the most-read fiction charts, has also done very well as a movie series.

The appearance of A Game of Thrones and Dune in the most stints table (most-read) suggests that fans pick up these books every time a new TV show season or a new movie is released in the franchise.

To look into the older books that have stood the test of time and made it into these charts, we looked at the tables for most stints on the charts.

The book topping the Most Stints table for Most Read Non-Fiction, The Power of Now, was published way back in 2004 and has been revisited by Amazon readers multiple times since then.

Think and Grow Rich, which was first published more than 80 years ago, appears only in this table.

Karthik Devan

I work freelance on full-stack development of apps and websites, and I'm also trying to work on a SaaS product. When I'm not working, I like to travel, play board games, hike and climb rocks.