Running multiple crawlers in a single process.
When crawling many web pages, it is important for an application to take advantage of the asynchronous programming model (APM). Scrapy is an asynchronous crawling framework for Python that, with small changes, suits this need perfectly.
Scrapy has a Crawler component that includes the request scheduler and the visited-URL queue, together with all the configuration parameters that control how the crawl is performed. So if you want to crawl multiple domains simultaneously, the best isolation level is a separate Crawler for each domain. This gives you a custom configuration context per domain.
The conventional way to use Scrapy is its Crawler-per-process model. However, if you need to crawl 2000 domains simultaneously on one machine, running 2000 processes there would be very inefficient. So our goal is to run multiple Crawlers within a single Scrapy process.
Gregory Begelman describes in his post how we solved this problem and how we managed to crawl 300 domains concurrently from a single process that used just one processor core and only about 3 GB of RAM.
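The crawler-per-domain idea can be pictured with a small standalone sketch. It does not use Scrapy's API: it assumes Python 3 and the standard asyncio module, and the names `DomainCrawler`, `crawl_all`, and `run_demo`, as well as the in-memory "sites", are hypothetical. Each crawler object owns its configuration, queue, and visited set, while all of them share one event loop in one process.

```python
import asyncio
from collections import deque


class DomainCrawler:
    """Hypothetical per-domain crawler with its own config, queue and seen set."""

    def __init__(self, domain, links, max_pages=100):
        self.domain = domain
        self.links = links          # fake site graph: url -> list of linked urls
        self.max_pages = max_pages  # per-domain configuration value
        self.queue = deque(["/"])
        self.seen = {"/"}
        self.crawled = []

    async def fetch(self, url):
        # Stand-in for non-blocking network I/O; yields control to other crawlers.
        await asyncio.sleep(0)
        return self.links.get(url, [])

    async def run(self):
        while self.queue and len(self.crawled) < self.max_pages:
            url = self.queue.popleft()
            for link in await self.fetch(url):
                if link not in self.seen:
                    self.seen.add(link)
                    self.queue.append(link)
            self.crawled.append(url)
        return self.crawled


async def crawl_all(crawlers):
    # All crawlers are interleaved on a single event loop, hence a single process.
    return await asyncio.gather(*(c.run() for c in crawlers))


def run_demo():
    site_a = {"/": ["/a", "/b"], "/a": ["/", "/b"]}
    site_b = {"/": ["/x"], "/x": ["/y"], "/y": ["/"]}
    crawlers = [
        DomainCrawler("a.example", site_a),
        DomainCrawler("b.example", site_b, max_pages=2),
    ]
    return asyncio.run(crawl_all(crawlers))
```

Because no state is shared between the two objects, changing `max_pages` for one domain has no effect on the other, which is the kind of isolation a separate Crawler per domain provides.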
Bloom filter as a duplicate filter for visited URLs.
The default duplicate filter that Scrapy uses for visited URLs keeps a set of URL fingerprints: SHA-1 hashes stored as 40-character hex strings, each of which takes 77 bytes in Python 2.7. Say you have to scrape a site with 2M pages; the fingerprint set may then grow to 2M * 77 bytes = 154 MB per Crawler. To scrape 300 such domains simultaneously you would need 300 * 154 MB ≈ 46 GB of memory. Fortunately there is another way: the Bloom filter. It is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. It can return false positives but never false negatives, so a crawler using it may occasionally skip a page it has not actually visited, but it will never treat a visited URL as new. When we configure it to store 2M items with a false-positive probability of 0.00001 (0.001%), we get a duplicate filter that is good enough and takes only 11 MB of memory. This enables us to crawl 300 domains using only about 3 GB of memory.
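The estimate above is easy to check in a few lines. The sketch below redoes the arithmetic for the plain fingerprint set. As an assumption not taken from the article, it also applies the textbook Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2 to estimate the raw bit-array size. The constant and function names are illustrative, and memory measured in a running process can differ from the raw bit count.

```python
import math

FINGERPRINT_BYTES = 77       # size of one 40-char hex string in Python 2.7 (from the text)
PAGES_PER_SITE = 2_000_000
DOMAINS = 300


def naive_set_bytes(pages=PAGES_PER_SITE, per_item=FINGERPRINT_BYTES):
    """Bytes held by the fingerprint strings alone, ignoring container overhead."""
    return pages * per_item


def bloom_bits(capacity, error_rate):
    """Textbook optimal bit count for `capacity` items at `error_rate`."""
    return math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)


per_crawler = naive_set_bytes()
print(per_crawler / 1e6, "MB of fingerprints per crawler")
print(per_crawler * DOMAINS / 1e9, "GB of fingerprints for all crawlers")
print(bloom_bits(PAGES_PER_SITE, 0.00001) / 8 / 1e6, "MB of raw Bloom filter bits")
```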
Below is a Bloom filter for Scrapy, implemented with pybloom:
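from pybloom import BloomFilter
from scrapy.utils.job import job_dir
from scrapy.dupefilter import BaseDupeFilter

class BLOOMDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None):
        self.file = None
        # Sized for 2M URLs at a false-positive probability of 0.00001.
        self.fingerprints = BloomFilter(2000000, 0.00001)

    @classmethod
    def from_settings(cls, settings):
        return cls(job_dir(settings))

    def request_seen(self, request):
        # The raw URL is used as the key instead of a request fingerprint.
        fp = request.url
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

    def close(self, reason):
        self.fingerprints = None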
To enable it in your project, add the following line to the project's settings.py:
# Somewhere in settings.py
DUPEFILTER_CLASS = "project_name.bloom_filter.BLOOMDupeFilter"
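For readers who want to see what a Bloom filter does internally, here is a minimal standalone sketch using only the Python 3 standard library. It is not pybloom's implementation: the class `TinyBloomFilter`, the SHA-256 based double hashing, and the helper `request_seen` are illustrative assumptions, and only the `add` / `in` interface mirrors how the filter is used in the dupefilter above.

```python
import hashlib
import math


class TinyBloomFilter:
    """Illustrative Bloom filter; a sketch, not pybloom's implementation."""

    def __init__(self, capacity, error_rate):
        # Textbook sizing: m bits and k hash functions for n items at rate p.
        self.num_bits = math.ceil(
            -capacity * math.log(error_rate) / math.log(2) ** 2
        )
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set; may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def request_seen(seen, url):
    """Same decision as the dupefilter: True if probably seen, else record it."""
    if url in seen:
        return True
    seen.add(url)
    return False


seen_urls = TinyBloomFilter(capacity=1000, error_rate=0.00001)
first = request_seen(seen_urls, "http://example.com/page")
second = request_seen(seen_urls, "http://example.com/page")
```

The first call records the URL and reports it as new; the second finds all of its bits already set and reports it as seen. Memory use depends only on the configured capacity and error rate, not on the length of the URLs stored.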