How: Books from 99% Invisible

One of my favorite podcasts is 99% Invisible. I love listening to it, and I’ve noticed that many of its episodes revolve around an author’s new book. I originally wanted to automate collecting all these books and display them nicely on a website, but getting all the book images turned out to be more trouble than I wanted to take on. In the end, I put them into three Goodreads lists.

List of books mentioned in 99% Invisible

My Thought Process

Before scraping the 99% Invisible website, I checked their robots.txt file to make sure scraping wasn’t disallowed, and it wasn’t. While exploring the site, I found an episode sitemap that conveniently listed all their episode pages.
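Here is a rough sketch of what those two steps can look like in Python. It is not the original script: the function names are made up for illustration, and it assumes the sitemap follows the standard sitemaps.org format. Both functions take text that has already been downloaded, so nothing here touches the network.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard sitemaps.org sitemap files (an assumption
# about the site's sitemap, not something confirmed in this post).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def episode_urls(sitemap_xml):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

In practice you would download robots.txt and the sitemap first, check each episode URL with is_allowed, and only then request the pages.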

Each episode page mentions books in several different ways. Sometimes a title appears in an <a> tag followed by an <em> tag, sometimes just inside an <em> tag, or occasionally an <em> tag is followed by an <a> tag. My first version of the scraper handled these patterns well and captured most of the book titles, but as I looked at more episodes, I realized the formatting wasn’t consistent. Many titles were mentioned in plain text within paragraphs, with no special HTML tags at all.
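A minimal sketch of the tag-based pass, using only the standard library. It assumes the three patterns are nestings (an <em> inside an <a>, a bare <em>, or an <a> inside an <em>), so collecting the text inside every <em> covers all of them. The class and function names are hypothetical, and the titles in the usage example are placeholders rather than real results.

```python
from html.parser import HTMLParser


class EmTextCollector(HTMLParser):
    """Collect the text inside each <em> element, including nested tags."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # how many <em> tags are currently open
        self._buffer = []     # text pieces seen inside the current <em>
        self.candidates = []  # finished candidate titles, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "em" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.candidates.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def candidate_titles(html):
    """Return the text of every <em> element as a candidate book title."""
    parser = EmTextCollector()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

For example, candidate_titles('<p><a href="#"><em>Sample Title</em></a></p>') returns ['Sample Title']. Since <em> is also used for plain emphasis, the output is only a list of candidates that still needs checking.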

To handle those unstructured cases, I created a small algorithm that searched for keywords such as “book,” “books,” “author,” and “authors.” Whenever one of those words appeared, I grabbed a short chunk of text before and after it and sent that snippet to Gemini to extract the most likely book title.
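The keyword search can be sketched like this. The window size of 80 characters is a guess, since the post doesn't say how long the "short chunk" was, and the function name is made up. The call to Gemini is left out on purpose: this only produces the snippets that would be sent to it.

```python
import re

# Match the four keywords as whole words, ignoring case.
KEYWORDS = re.compile(r"\b(?:books?|authors?)\b", re.IGNORECASE)


def keyword_snippets(text, window=80):
    """Return the text surrounding each keyword occurrence.

    `window` is the number of characters kept on each side of the
    keyword (an assumed value, not taken from the original project).
    """
    snippets = []
    for match in KEYWORDS.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end].strip())
    return snippets
```

Keywords that sit close together produce overlapping snippets. A fuller version might merge those before sending them to the model, to avoid asking about the same sentence twice.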

A large part of the project involved cleaning and reformatting the scraped data so I could compare the titles against the Google Books API to verify them and fetch the authors. Many of the initial results were false positives, but after filtering I ended up with around 270 verified books.
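As an illustration of the verification step, here is a sketch that normalizes a candidate title, builds a search URL for the Google Books volumes endpoint, and picks the closest match from a response. The normalization rules, the 0.85 similarity threshold, and all the function names are assumptions for the example, not the rules actually used in the project.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/books/v1/volumes"


def normalize(title):
    """Lowercase, drop any subtitle after a colon, strip punctuation."""
    title = title.lower().split(":")[0]
    # Keeps ASCII letters and digits only, which is good enough for a sketch.
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return " ".join(title.split())


def search_url(title):
    """Build a title search URL for the Google Books volumes endpoint."""
    return API_URL + "?" + urlencode({"q": f'intitle:"{title}"', "maxResults": 5})


def best_match(candidate, response, threshold=0.85):
    """Return title and authors of the closest result, or None.

    `response` is the decoded JSON from the search; `threshold` is an
    assumed cutoff for treating a candidate as verified.
    """
    target = normalize(candidate)
    best, best_score = None, 0.0
    for item in response.get("items", []):
        info = item.get("volumeInfo", {})
        found = normalize(info.get("title", ""))
        score = SequenceMatcher(None, target, found).ratio()
        if score > best_score:
            best, best_score = info, score
    if best is None or best_score < threshold:
        return None  # treat as a false positive
    return {"title": best.get("title"), "authors": best.get("authors", [])}
```

A candidate that comes back as None gets dropped, which is how false positives are filtered out in this sketch. The threshold would need tuning against real results.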

Final Thoughts

I did this project in June 2025. It helped me realize I was more excited about collecting the data than about building the final website to display the results. I felt very clever when I thought of the algorithm to get the book titles, and while I know it's probably missing some edge cases, I'm happy that my "first" coding project is finished. I've been holding onto this data for months, and it feels nice to be done.

What I Used

Link to the Google Sheet that contains the entire list of books with episode numbers: sheet

Happy reading :)