I recently took a brief look at the local job market and found a simple fact: Python is greatly misunderstood and, as a result, extremely underestimated.
Having worked for several software development companies, I have often heard technical people say that Python is slow and doesn’t have a real threading mechanism; that its lack of static types makes it error-prone; that some of the “basic” OOP features are missing; and that it is more of a toy language with no real use in a production environment. So I felt a deep need to write an introduction to Python for potential users, employers and anyone else who may be interested. It may be useful for business owners and start-up founders as well as for technical people.
1. “Python is a programming language that allows you to work more quickly and integrate your systems more effectively. You can learn to use Python and see almost immediate gains in productivity and lower maintenance costs.” – In my personal experience, development in Python is about three times faster than in Java.
2. “Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual machines.” – and I would add also all of them.
3. “Python is free to use, even for commercial products, because of its OSI-approved open source license.”
The world is changing. Reactive Manifesto.
First of all, I would like to introduce you to the Reactive Manifesto.
It was created recently to give a name to the notable changes in today’s requirements for network applications. The Reactive Manifesto lists the rules that an application has to comply with, namely scalability, responsiveness and resilience. It clearly says that scalability and responsiveness are all about asynchronous programming models (APM).
A year ago I wrote a post about asynchronous programming with Python, where I explained the advantages of APM over the threading model. There I claimed that Python actually excels at APM with respect to the simplicity of writing code. With one of the largest official repositories of open source code (36195 packages at the time of writing this post), Python offers many different libraries that make it dead easy to write APM-based code. The most advanced of them is Twisted, an event-driven networking engine.
Gregory Begelman describes here how we built an APM crawler system on top of Scrapy that is able to scrape 2000 domains simultaneously on one machine with 8 cores and 32G of RAM. In this post I elaborate on it even further.
Scalability and Resilience.
Today the major players on the market, such as Google, Facebook, Yahoo, YouTube and many others, use Python as part of their core infrastructure.
Thus, Python is a perfect fit for the reactive application model described above. Considering its ease of use and its open source nature, with a mature community and a huge repository, I conclude that Python is the best and most efficient choice for today’s network application programming needs and challenges.
Running multiple crawlers in a single process.
When crawling many web pages, it’s important for an application to take advantage of APM. Scrapy is an asynchronous crawling framework for Python that, with small changes, perfectly suits this need.
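To make the benefit concrete, here is a minimal sketch of many downloads in flight on a single thread. It is only an illustration: it uses the standard library's asyncio rather than Scrapy or Twisted, it simulates each download with asyncio.sleep instead of touching the network, and the names fetch, crawl and timed_crawl are made up for this example.

```python
import asyncio
import time


async def fetch(url, delay):
    # Stand-in for a non-blocking HTTP request: the sleep hands control
    # back to the event loop instead of blocking the thread.
    await asyncio.sleep(delay)
    return url, "fake body of %s" % url


async def crawl(urls, delay):
    # Start every request at once; gather returns results in input order.
    return await asyncio.gather(*(fetch(url, delay) for url in urls))


def timed_crawl(urls, delay=0.1):
    # Run the whole crawl on one thread and measure the wall-clock time.
    started = time.perf_counter()
    pages = asyncio.run(crawl(urls, delay))
    return pages, time.perf_counter() - started


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(50)]
    pages, elapsed = timed_crawl(urls)
    print("fetched %d pages in %.2fs" % (len(pages), elapsed))
```

Fetched one after another, the 50 simulated pages would take about 50 times the delay. Here they overlap, so the total stays close to a single delay.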
Scrapy has a Crawler component that includes a request scheduler as well as a queue of visited URLs, together with all the configuration parameters that control how the crawling process should be performed. Thus, if you want to crawl multiple domains simultaneously, the best isolation level is a separate Crawler for each domain. This allows you to have a custom configuration context per domain.
The conventional way to use Scrapy is its Crawler-per-process model. However, if you need to crawl 2000 domains simultaneously on one machine, running 2000 processes on it would be very inefficient. So our goal is to run multiple Crawlers within just one Scrapy process.
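The idea of one isolated context per domain, all living in one process, can be sketched without Scrapy. The code below is not Scrapy's API and not the solution from Gregory's post. DomainCrawler, its settings keys (max_pages, download_delay) and the fake link generation are all hypothetical, chosen only to show that each domain keeps its own queue, visited set and configuration while a single event loop drives them all.

```python
import asyncio
from collections import deque


class DomainCrawler:
    """Hypothetical per-domain context: own settings, queue and seen set."""

    def __init__(self, domain, settings):
        self.domain = domain
        self.settings = settings
        self.queue = deque(["http://%s/" % domain])
        self.seen = set()
        self.fetched = []

    async def run(self):
        limit = self.settings["max_pages"]
        while self.queue and len(self.fetched) < limit:
            url = self.queue.popleft()
            if url in self.seen:
                continue
            self.seen.add(url)
            # Simulated non-blocking download with a per-domain delay.
            await asyncio.sleep(self.settings["download_delay"])
            self.fetched.append(url)
            # Pretend every page links to two pages not seen before.
            n = len(self.seen)
            self.queue.extend(
                "http://%s/p%d" % (self.domain, 2 * n + i) for i in range(2)
            )


async def crawl_all(configs):
    # One crawler object per domain, all scheduled on the same event loop.
    crawlers = [DomainCrawler(d, s) for d, s in configs.items()]
    await asyncio.gather(*(c.run() for c in crawlers))
    return {c.domain: c.fetched for c in crawlers}


def run_in_one_process(configs):
    return asyncio.run(crawl_all(configs))


if __name__ == "__main__":
    configs = {
        "a.example": {"max_pages": 3, "download_delay": 0.05},
        "b.example": {"max_pages": 5, "download_delay": 0.01},
    }
    for domain, urls in run_in_one_process(configs).items():
        print(domain, len(urls), "pages")
```

Because each crawler owns its state, changing the limits or delay for one domain does not affect the others, which is the isolation property described above.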
Gregory Begelman describes in his post how we solved this problem, and how we managed to crawl 300 domains concurrently from a single process that used just one processor core and only about 3G of RAM.
Bloom filter as a duplicate filter for visited URLs.
The default duplicate filter that Scrapy uses for filtering visited URLs keeps a set of URL fingerprints. These are basically SHA-1 hashes written as 40 hexadecimal characters, and each such string takes 77 bytes in Python 2.7. Let’s say you have to scrape a site with 2M pages; then your duplicate filter might grow to 2M * 77 B = 154 MB per Crawler. To scrape 300 such domains simultaneously, you would need 300 * 154 MB, or about 46 GB of memory. Fortunately there is another way: the Bloom filter. It’s a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. It can report false positives (a URL that was never added may be reported as seen) but never false negatives. When we configure it to store 2M items with a false-positive probability of 0.00001 (that is, 0.001%), we get a duplicate filter that is good enough and needs only 11 MB of memory. This enables us to crawl 300 domains using only about 3 GB of memory.
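For readers who have not met the data structure before, here is a minimal pure-Python sketch of a Bloom filter. It is not pybloom's implementation. TinyBloom is a made-up name, the bit count and number of hash functions come from the standard textbook sizing formulas, and the bit positions are derived from one SHA-1 digest by double hashing, which is an assumption of this sketch. The size it reports is a theoretical figure, so it need not match the memory a real library uses or the number quoted above.

```python
import hashlib
import math


class TinyBloom:
    """Illustrative Bloom filter; not the pybloom implementation."""

    def __init__(self, capacity, error_rate):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2.
        ln2 = math.log(2)
        self.num_bits = int(math.ceil(-capacity * math.log(error_rate) / ln2 ** 2))
        self.num_hashes = max(1, int(round(self.num_bits / capacity * ln2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = TinyBloom(2000000, 0.00001)
    seen.add("http://example.com/")
    print("http://example.com/" in seen)
    print("bit array size in bytes:", len(seen.bits))
```

The key point is that the filter stores only bits, never the URLs themselves, so its size depends on the capacity and error rate and not on the length of the URLs.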
Below is the Bloom filter for Scrapy, implemented with pybloom:
from pybloom import BloomFilter
from scrapy.utils.job import job_dir
from scrapy.dupefilter import BaseDupeFilter  # scrapy.dupefilters in newer Scrapy


class BLOOMDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None):
        self.file = None
        # Capacity of 2M URLs with an error rate of 0.00001
        self.fingerprints = BloomFilter(2000000, 0.00001)

    @classmethod
    def from_settings(cls, settings):
        return cls(job_dir(settings))

    def request_seen(self, request):
        fp = request.url
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

    def close(self, reason):
        self.fingerprints = None
To enable it in your project, add the following line to the project’s settings.py:
# Somewhere in settings.py
DUPEFILTER_CLASS = "project_name.bloom_filter.BLOOMDupeFilter"