How to scrape a million pages?

Scraping a million pages made me realize that a million pages is not that much, but it still requires some design decisions. In this conceptual post, I present the architecture of my scraper-crawler, whose purpose is to discover pages, fetch them, and process them.

This is useful when you want to extract information from many pages at once, for instance to extract all the track metadata from a music provider or to download the Wikipedia corpus. A related but different problem is revisiting identical targets on a schedule, for instance to follow the updates of plane ticket prices.

System objective

Web pages and their hyperlinks form a directed graph in which pages are nodes and links are edges.

The goal for our scraper-crawler is to walk the graph and harvest desired data.

High-level system architecture

Simply put:

Scraper = parser + crawler + fetcher (+ storage)

Our scraper starts at an origin page and works recursively, discovering new links and following the relevant ones.

The parser turns an HTML page into the structured fields we want to extract. It also emits the discovered URLs, which are potential future directions for harvesting.

The crawler scores and queues URLs with priorities: it schedules the nodes of the frontier to visit.

The fetcher commits the move by executing the HTTP request and feeding the documents to the parser.

We need to store the raw pages (an HTML cache), the structured database of parsed records, and the crawl queue.
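To make this concrete, here is a minimal sketch of the overall loop; fetch, parse, score and the frontier/store objects are placeholders for the components described below.

```python
# Minimal sketch of the scraper loop. fetch(), parse(), score() and the
# frontier/store objects are placeholders for the components described below.

def run(seed_url, frontier, store):
    frontier.push(seed_url, priority=0)
    while not frontier.empty():
        url = frontier.pop()                # crawler: highest-priority URL
        html = fetch(url)                   # fetcher: HTTP request + error handling
        if html is None:
            continue
        store.cache_html(url, html)         # storage: raw HTML cache
        records, links = parse(html)        # parser: structured rows + new URLs
        store.insert(records)               # storage: parsed records
        for link in links:
            priority = score(link)          # crawler: prioritization logic
            if priority is not None and not frontier.seen(link):
                frontier.push(link, priority)
```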

Let’s examine each of these components more closely.

Parser

Parsing used to be the most tedious task, but it is now alleviated by LLMs. The parser’s role is to turn a single HTML page into one or several database rows.

The input content is organized into div tags whose class and id attributes should correspond to certain semantics. A standard choice is BeautifulSoup, with the methods .select_one() and .select() to process CSS selector paths (e.g. .playlist > a.track_title) and .get() to fetch HTML attributes.
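For instance, a parser for a hypothetical playlist page matching the selector above could look like this:

```python
from bs4 import BeautifulSoup

def parse_playlist(html):
    """Turn one playlist page into database rows plus newly discovered URLs.
    The selectors assume a hypothetical page layout."""
    soup = BeautifulSoup(html, "html.parser")
    rows, links = [], []
    for a in soup.select(".playlist > a.track_title"):
        rows.append({
            "title": a.get_text(strip=True),
            "url": a.get("href"),
        })
        links.append(a.get("href"))
    return rows, links
```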

You then need to program the filtering and extraction logic on the text content itself, typically using regex. I feel like regex used to be fun to write as long as you didn’t have too many of them, but now I don’t see the point of writing regex yourself since you can kindly ask any LLM to do it for you. I still believe it’s a good mental exercise, and the underlying theory of finite automata is quite interesting.
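As a small example, here is a regex extracting a hypothetical “3:45”-style track duration:

```python
import re

# Hypothetical example: extract "3:45"-style durations from free text.
DURATION_RE = re.compile(r"(\d+):([0-5]\d)")

def parse_duration(text):
    m = DURATION_RE.search(text)
    if m is None:
        return None
    minutes, seconds = int(m.group(1)), int(m.group(2))
    return 60 * minutes + seconds
```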

Parsing failures should be logged.
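A minimal way to do it is to wrap the parse and log the offending URL, reusing parse_playlist from above:

```python
import logging

logging.basicConfig(filename="scraper.log", level=logging.INFO)

def safe_parse(url, html):
    try:
        return parse_playlist(html)
    except Exception:
        # Log the offending URL so the page can be re-parsed after a fix.
        logging.exception("parse failed for %s", url)
        return [], []
```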

Crawler

The crawler organizes newly discovered links into a priority queue. This prioritization is very important: it encodes the goal of the scraping.

Indeed, our scraper should follow paths that lead to the resources we want to harvest; many links should be ignored, or be taken only to land in an area of interest (thus skipping the exploration of many of their children).
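As an illustration, the priority can be a simple hand-written heuristic over the URL; the path patterns below are hypothetical:

```python
from urllib.parse import urlparse

def score(url):
    """Lower is more urgent; None means ignore. The URL patterns are hypothetical."""
    path = urlparse(url).path
    if "/track/" in path:      # pages we actually want to harvest
        return 0
    if "/playlist/" in path:   # hubs that link to many tracks
        return 1
    if "/artist/" in path:     # worth visiting, but later
        return 2
    return None                # everything else is ignored
```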

The crawler writes to a priority queue that should be stored on disk rather than in memory, so that it is easy to stop and resume the job. It should also keep the set of visited URL hashes to prevent re-fetching.
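Here is a minimal sketch of such a disk-backed frontier using sqlite3; the table and column names (and the default file path) are my own choices.

```python
import hashlib
import sqlite3

class Frontier:
    """Disk-backed priority queue plus a visited set (lower priority pops first)."""

    def __init__(self, path="crawl.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY, priority INTEGER)"
        )
        self.db.execute("CREATE TABLE IF NOT EXISTS visited (hash TEXT PRIMARY KEY)")
        self.db.commit()

    def push(self, url, priority):
        self.db.execute("INSERT OR IGNORE INTO queue VALUES (?, ?)", (url, priority))
        self.db.commit()

    def pop(self):
        row = self.db.execute(
            "SELECT url FROM queue ORDER BY priority LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        url = row[0]
        self.db.execute("DELETE FROM queue WHERE url = ?", (url,))
        self.db.execute(
            "INSERT OR IGNORE INTO visited VALUES (?)",
            (hashlib.sha256(url.encode()).hexdigest(),),
        )
        self.db.commit()
        return url

    def seen(self, url):
        h = hashlib.sha256(url.encode()).hexdigest()
        return self.db.execute(
            "SELECT 1 FROM visited WHERE hash = ?", (h,)
        ).fetchone() is not None

    def empty(self):
        return self.db.execute("SELECT 1 FROM queue LIMIT 1").fetchone() is None
```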

Fetcher

The fetcher is in charge of the requests, i.e. it is the part that communicates with the servers. To do so, it takes the top-priority item from the crawler and executes the HTTP request. It is also in charge of handling potential errors, especially the 4xx errors that indicate a client-side problem. Implementing an exponential-backoff algorithm with jitter (ubiquitous in networking) might help resolve some 429 “Too many requests” errors.
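A sketch of such a fetch with the requests library; the retry count, timeout and delays are arbitrary choices.

```python
import logging
import random
import time

import requests

def fetch(url, max_retries=5):
    """Fetch a URL, retrying 429/5xx responses with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            logging.exception("request failed for %s", url)
            return None
        if resp.status_code == 200:
            return resp.text
        if resp.status_code == 429 or resp.status_code >= 500:
            # Exponential backoff with jitter: 1-2s, 2-4s, 4-8s, ...
            time.sleep(2 ** attempt * (1 + random.random()))
            continue
        logging.warning("got %s for %s", resp.status_code, url)
        return None
    return None
```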

An important task here is to log errors so you’re able to react.

The base fetching frequency should be controlled (rate limiting). A one-second delay between requests was enough for my task.
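A minimal way to enforce this is to sleep just enough between consecutive requests:

```python
import time

MIN_DELAY = 1.0  # seconds between requests
_last_request = 0.0

def throttle():
    """Sleep just enough to keep at least MIN_DELAY between consecutive requests."""
    global _last_request
    wait = MIN_DELAY - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
```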

Storage

The choice of storage is not very important as long as you don’t need concurrent writes. I kept things minimal and used SQLite, writing to different tables in a single .db file. I regularly copied it locally and inspected the database with a few SQL queries of interest.
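The storage layer can then be as small as this sketch (the tracks schema is a hypothetical example):

```python
import sqlite3

class Store:
    def __init__(self, path="scraper.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tracks (url TEXT PRIMARY KEY, title TEXT)"
        )
        self.db.commit()

    def cache_html(self, url, html):
        self.db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, html))
        self.db.commit()

    def insert(self, records):
        self.db.executemany(
            "INSERT OR REPLACE INTO tracks VALUES (:url, :title)", records
        )
        self.db.commit()
```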

Further development advice

I rented an AWS EC2 instance with 1 GB of memory and 128 GB of disk to keep the scraper running 24/7. VS Code tunneling does not work there, but fortunately I am fluent enough in vim to fix minor bugs.

I wrapped the scraper in a small CLI that would make it easy to stop and resume the job.
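As an illustration, a small argparse wrapper with hypothetical subcommands:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Scraper CLI (hypothetical commands)")
    sub = parser.add_subparsers(dest="command", required=True)
    run_cmd = sub.add_parser("run", help="start or resume the crawl")
    run_cmd.add_argument("--seed", help="seed URL for a fresh crawl")
    sub.add_parser("stats", help="print queue size and record counts")
    args = parser.parse_args()

    if args.command == "run":
        ...  # build the Frontier and Store, then call run()
    elif args.command == "stats":
        ...  # query the SQLite tables

if __name__ == "__main__":
    main()
```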

Rotating proxies might be needed after 10-100k requests. You can find or buy proxies online. Other methods that can hide your identity exist.
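With the requests library, the proxy is just a parameter of the call; rotating through a (hypothetical) pool can look like this:

```python
import itertools

import requests

# Hypothetical proxy pool; use whatever list or provider you have.
PROXIES = itertools.cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
])

def fetch_via_proxy(url):
    proxy = next(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```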

Throttling requests remains important, or you might get 429 errors and eventually an IP ban.

Iteration is important: as you start filling the database, you might notice parsing errors or a crawl schedule that does not seem right. If the raw HTML pages are cached, then filling a new database schema is easy.
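Since the raw pages are cached, re-filling the database is just a loop over the cache, reusing the sketches above:

```python
def reparse_all(store):
    """Re-run the (possibly fixed) parser over every cached page."""
    for url, html in store.db.execute("SELECT url, html FROM pages"):
        records, _ = safe_parse(url, html)
        store.insert(records)
```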

To scale, you could split components into async workers.
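For example, with asyncio the fetching and parsing could communicate through queues; a rough sketch of what that could look like:

```python
import asyncio

async def fetch_worker(url_queue, html_queue):
    while True:
        url = await url_queue.get()
        html = await asyncio.to_thread(fetch, url)   # reuse the blocking fetcher
        if html is not None:
            await html_queue.put((url, html))
        url_queue.task_done()

async def parse_worker(html_queue, store):
    while True:
        url, html = await html_queue.get()
        records, links = safe_parse(url, html)
        store.insert(records)
        # newly discovered links would be scored and pushed back to the frontier here
        html_queue.task_done()
```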

I suspect there are many errors I have not encountered because the website I was scraping was very simple.

Conclusion

I presented the conceptual architecture of a system I just developed. I wanted to quickly write it up before moving on to more interesting parts of this project.