About the websockets of the 90’s and the creativity within

October 1, 2014

Recently I was involved in a discussion about different ways of establishing a real-time connection between a browser and a server. One of my colleagues just said: “I would use a GIF for that”, and I was like “whaat?”, and then he showed me this library: gifsockets. It builds on the Graphics Interchange Format, known as GIF. A GIF animation in the browser consists of a set of frames that successively replace each other. When downloading a GIF animation, the browser receives it frame by frame and holds the connection until it gets a special terminating character from the server. Potentially, if that terminating character is never sent, the browser might hold the connection open forever. It’s also possible to convert plain text into a frame (an RGB image) and send it to the browser as part of a regular GIF stream.
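
To make the text-to-frame step a bit more concrete, here is a minimal sketch using Pillow to render a message as an RGB image; the function name and frame size are illustrative and not part of gifsockets itself:

```python
from PIL import Image, ImageDraw

def text_to_rgb_frame(text, size=(320, 80)):
    # Render a plain-text message as an RGB frame; a gifsockets-style
    # server would encode such frames and append them to a GIF stream
    # that is never terminated, so the browser keeps waiting for more.
    frame = Image.new("RGB", size, "white")
    ImageDraw.Draw(frame).text((10, 30), text, fill="black")
    return frame
```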

Back in the 90’s, when there were no websockets, people used this trick to establish real-time communication between the browser and the server. Although it is clearly a big hack – something being used in a way it was never meant to be – I think it is an amazing example of creativity in programming.

Extracting textual time-based content from blog pages using the LCA technique on a DOM tree.

January 20, 2014

Today there is an increasing interest in scraping the latest data from the internet, especially textual data. There are a lot of content-providing sites, such as blogs, news sites and forums. This content is time-based (it is updated periodically over time). Extracting time-based content from millions of sites is not a trivial task; the main difficulty is that we don’t know beforehand the format of the HTML pages we are going to scrape.

In this post I will describe a method for extracting time-based textual content from blog pages in an automatic manner.

Let us look at how posts are organized in the vast majority of blogs:

Posts consist of repeating DOM elements such as divs, spans and so on. Their common feature is that every post contains a timestamp. The timestamp may appear at the beginning of the post or at its end, but it is almost always a separate DOM element.

So it should not be too hard to detect the dates using some sort of date-parsing library, such as dateutil.

The date detection process would go like this:
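
Here is a minimal sketch of that step, assuming the page has already been parsed into an lxml tree; the function name and the length heuristic are only illustrative:

```python
from dateutil import parser as dateparser
from lxml import html

def find_date_elements(page_source):
    # Walk the DOM and keep the elements whose own text parses as a date.
    tree = html.fromstring(page_source)
    date_elements = []
    for element in tree.iter():
        text = (element.text or "").strip()
        if not text or len(text) > 40:   # skip empty or clearly non-date text
            continue
        try:
            dateparser.parse(text)
            date_elements.append(element)
        except (ValueError, OverflowError):
            pass
    return date_elements
```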

Now that we have the dates, we assume that each of them is part of a post we need to collect, but we don’t know where each post starts and where it ends.

Fortunately, there is an efficient way to identify post boundaries: find the Lowest Common Ancestor (LCA) of the date elements in the DOM tree. Once we have the dates’ LCA, we can be fairly sure that its children provide the required entry points to the posts themselves. Below is a picture that visualizes the idea:

[Screenshot: date elements and their LCA in a DOM tree]

Another, more complex case is multiple post threads on the same page:

[Screenshot: multiple post threads on a single page]

All these cases are supported by the following code:
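
A minimal sketch of the idea is shown below; it assumes graph is a pattern.graph.Graph built from the DOM tree (with the shortest_path method mentioned in the next paragraph), date_nodes are the graph nodes found in the previous step, and at least three date elements were detected. The function name and candidate selection are illustrative:

```python
from itertools import combinations

def find_lca_candidates(graph, date_nodes):
    # Shortest paths between every pair of date elements; the nodes where
    # these paths intersect are the LCA candidates for the post containers.
    paths = []
    for a, b in combinations(date_nodes, 2):
        path = graph.shortest_path(a, b)
        if path:
            paths.append(set(path) - {a, b})
    candidates = set()
    for p1, p2 in combinations(paths, 2):
        candidates |= p1 & p2          # intersection points of the paths
    return list(candidates)
```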

Here graph is a pattern.graph.Graph instance representing the HTML DOM tree. We use its shortest_path method to find the shortest paths between the date elements; the intersection points of these shortest paths provide the LCA candidates.

Then we might want to iterate through the LCA’s children and extract text from them. This may be accomplished with something similar to the following:
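
A minimal sketch, assuming the LCA candidate has been mapped back to its lxml element (the function name is illustrative):

```python
def extract_posts(lca_element):
    # lca_element is expected to be an lxml HTML element.
    # Each direct child of the LCA is treated as one post; collect the
    # text of the whole subtree under every child.
    posts = []
    for child in lca_element:
        text = " ".join(t.strip() for t in child.itertext() if t.strip())
        if text:
            posts.append(text)
    return posts
```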

And, yippee! We have got the posts!

This was a high-level overview of the method. Obviously, more optimization is needed to make use of it in a production environment. Hopefully this will give some insights to those who are interested in data mining, and especially in time-based content extraction.

Thanks to my sister Natalia Vishnevsky for editing this and for the moral support.

Python as an optimal solution for today’s network application programming challenges.

November 3, 2013

Recently I made a brief exploration of the local job market and found a simple fact: Python is greatly misunderstood and, as a result, extremely underestimated.

Having been employed by different software development companies, I have heard many times from technical people that Python is slow, that it doesn’t have a real threading mechanism, that the lack of static types makes it error-prone, that some of the “basic” OOP features are missing, and that it is more of a toy language with no real use in a production environment… So I felt a deep need to write an introduction to Python for potential users, employers and everyone who may be interested. It might be of great use for business owners and start-up founders as well as for technical people.

Introduction.

1. “Python is a programming language that allows you to work more quickly and integrate your systems more effectively. You can learn to use Python and see almost immediate gains in productivity and lower maintenance costs.” – From my personal experience, development in Python is about 3 times faster than in Java.

2. “Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual machines.” – and, I would add, the same code runs on all of them.

3. “Python is free to use, even for commercial products, because of its OSI-approved open source license.”

The world is changing. Reactive Manifesto.

First of all, I would like to introduce to you the Reactive Manifesto.

It was created recently to name the changes that are remarkably taking place in today’s requirements for network applications. The Reactive Manifesto lists the properties an application has to comply with, namely scalability, responsiveness and resilience. It clearly states that scalability and responsiveness are all about asynchronous programming models (APM).

Responsiveness.

A year ago I wrote a post about asynchronous programming with Python, where I explained the advantages of APM over the threading model. There I claimed that Python actually excels at APM with respect to simplicity of writing code. Having one of the largest official repositories of open source code (36,195 packages at the time of writing this post), Python offers many different libraries that make it dead easy to write APM-based code. The most advanced of them is Twisted, an event-driven networking engine.
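
To illustrate the point, here is the classic Twisted echo server – a complete asynchronous TCP server in a handful of lines (the port number is arbitrary):

```python
from twisted.internet import protocol, reactor

class Echo(protocol.Protocol):
    def dataReceived(self, data):
        # Called asynchronously whenever a client sends data;
        # echo it straight back without blocking the event loop.
        self.transport.write(data)

class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()

reactor.listenTCP(8000, EchoFactory())
reactor.run()
```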

Gregory Begelman describes here how we built an APM-based crawling system on top of scrapy, able to scrape 2000 domains simultaneously on one machine with 8 cores and 32G of RAM. In this post I elaborate on it even further.

Scalability and Resilience.

Today major players on the market, such as Google, Facebook, Yahoo, YouTube and many others, use Python as part of their core infrastructure.

Such well-known and commonly used computing systems as Amazon Web Services, Google App Engine, Apache Hadoop and Nokia Disco have been integrated with Python as well.

Thus, Python is a perfect fit for the reactive application model described above. Considering its simplicity of use and its open source nature, with a mature community and a huge package repository, I conclude that Python is the optimal and most efficient choice for today’s network application programming needs and challenges.

Tips on optimizing scrapy for high performance

November 2, 2013

Running multiple crawlers in a single process.

When crawling many web pages it’s important for an application to take advantage of APM. Scrapy is an asynchronous Python crawling framework that, with small changes, perfectly suits this need.

Scrapy has a Crawler component that includes the request scheduler and the visited-urls queue, together with all the configuration parameters controlling how the crawling process should be performed. Thus, if you want to crawl multiple domains simultaneously, the best isolation level is to create a separate Crawler for each domain. This allows you to have a custom configuration context per domain.

The conventional use of scrapy is its Crawler-per-process model. However, if you need to crawl 2000 domains simultaneously on one machine, it would be very inefficient to run 2000 processes there. So our goal is to run multiple Crawlers within just one scrapy process.

Gregory Begelman describes in his post how we solved this problem, and how we managed to run 300 concurrent domains from a single process that took just one processor core and only about 3G of RAM.
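
For illustration only – this is not the exact code from that post – the general idea can be sketched with scrapy’s CrawlerProcess (available in recent scrapy versions), which schedules several spiders on a single Twisted reactor; the spider classes and project name are placeholders:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical per-domain spider classes defined elsewhere in the project.
from myproject.spiders import ExampleComSpider, ExampleOrgSpider

process = CrawlerProcess(get_project_settings())
for spider_cls in (ExampleComSpider, ExampleOrgSpider):
    # Each call creates a separate Crawler with its own scheduler and
    # duplicate filter, but they all share one process and one reactor.
    process.crawl(spider_cls)
process.start()  # blocks until every crawl has finished
```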

A Bloom filter as a duplicate filter for visited urls.

The default duplicate filter that scrapy uses for filtering visited urls keeps a set of url fingerprints – basically sha1 hashes, 40 characters each, which is 77 bytes per string in Python 2.7. Let’s say you have to scrape a site with 2M pages; then your duplicate filter may grow up to 2M * 77 bytes = 154MB per Crawler. To scrape 300 such domains simultaneously, you would need 300 * 154MB ≈ 46GB of memory. Fortunately there is another way – a Bloom filter. It’s a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. When we configure it to store 2M items with a false-positive probability of 0.00001%, we get a duplicate filter that is good enough at only about 11MB of memory. This enables us to crawl 300 domains using only about 3GB of memory.

Below is a Bloom duplicate filter for scrapy, implemented with pybloom.
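
A minimal sketch of such a filter, built on pybloom’s BloomFilter and scrapy’s BaseDupeFilter interface; the class name, capacity and error rate are illustrative, and recent scrapy versions import BaseDupeFilter from scrapy.dupefilters (older versions used scrapy.dupefilter):

```python
from pybloom import BloomFilter
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class BloomDupeFilter(BaseDupeFilter):
    """Duplicate filter that stores visited-url fingerprints in a Bloom filter."""

    def __init__(self, capacity=2000000, error_rate=0.0000001):
        # 2M items at a 0.00001% false-positive rate, as discussed above.
        self.fingerprints = BloomFilter(capacity=capacity, error_rate=error_rate)

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True          # probably seen before: skip the request
        self.fingerprints.add(fp)
        return False
```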

To enable it in your project, add the following line to the project’s settings.py:
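
Assuming the filter above lives in a module such as myproject/dupefilters.py (the path is just a placeholder):

```python
DUPEFILTER_CLASS = 'myproject.dupefilters.BloomDupeFilter'
```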