Reinventing Quick Sort by improving Stupid Sort.

March 14, 2018

Stupid sort (also known as bogosort) is a sort that just randomly shuffles the array until it eventually becomes sorted.

Although stupid sort is not popular in production environments, it may come up as an opening question in a programming interview at any level. The discussion about optimizing this kind of sort can be very interesting and useful for a better understanding of some sorting techniques.

In a nutshell stupid sort may be written like this:
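
The original listing is not reproduced here; a minimal Python sketch of the idea, with the shuffle swapping two random elements each time (as described below), might look like this:

```python
import random

def is_sorted(arr):
    # True if every element is not greater than the one after it
    return all(arr[i] <= arr[i + 1] for i in range(len(arr) - 1))

def shuffle(arr):
    # swap two randomly chosen elements
    i = random.randrange(len(arr))
    j = random.randrange(len(arr))
    arr[i], arr[j] = arr[j], arr[i]

def stupid_sort(arr):
    # keep shuffling until the array happens to be sorted
    while not is_sorted(arr):
        shuffle(arr)
```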

The average running time of this algorithm is terribly long. For each permutation (out of n!) we need to verify the order of all the elements in the array (n elements), which gives us O(n * n!). On my hexa-core Xiaomi Redmi 3 phone this process is very slow even for n=10, ending up with more than 36 million operations. In the worst case the running time of the algorithm is unbounded, for the same reason that a tossed coin might turn up heads any number of times in a row.

The best case occurs if the list as given is already sorted; in this case the expected time complexity is O(n).

For any collection of fixed size, the expected running time of the algorithm is finite, for much the same reason that the infinite monkey theorem holds: there is a non-zero probability of getting the right permutation, so given an unbounded number of tries it will almost surely eventually be chosen.

Let us demonstrate how we can decrease the average running time by adding a couple of simple rules. In our example the shuffle swaps 2 random elements each time. What if we do not swap those pairs that are already placed in the right order (the left element smaller than the right one)? This immediately saves us a lot of swaps and significantly decreases the running time.

Let us write down this “directional shuffle”:
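
A sketch of this rule added to the shuffle above (again, an illustration rather than the original listing):

```python
def shuffle(arr):
    # pick two random positions and order them so that i <= j
    i = random.randrange(len(arr))
    j = random.randrange(len(arr))
    if i > j:
        i, j = j, i
    # swap only if the pair is out of order ("directional" shuffle)
    if arr[i] > arr[j]:
        arr[i], arr[j] = arr[j], arr[i]
```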

Now, after adding these new rules, my phone was able to sort 100 elements in less than a second.

Although the running time has decreased, it still grows very fast as the number of elements in the array increases. For example, sorting 1000 elements takes about 100 seconds. We increased the array size by x10 and got an x1000 slowdown, which lets us estimate our algorithm as O(n^3). It turns out that if we could only split our 1000-element array into 10 sub-arrays of 100 elements each and then sort each of them separately, it would take us less than 10*(1 sec) = 10 seconds (instead of 100). The only thing we need to figure out is how to split the array in such a way that merging it back would be trivial. Apparently we can change our shuffling function a little bit so that after shuffling, the values of the elements in the left part of the array are smaller than those of the elements in the right part. A shuffle written this way may be called many times on the same part of the array until that part eventually becomes sorted (the stupid-sort technique). Then, once all parts of the array are sorted, the entire array becomes sorted (trivially).

Let us modify our “shuffle” to swap elements with respect to some randomly chosen element of the array (pivot):
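
A sketch of this pivot-based shuffle, reconstructed from the description in the next paragraph:

```python
def shuffle(arr, l, r):
    # choose a random pivot and move it out of the way, to the end of the range
    p = random.randint(l, r)
    arr[p], arr[r] = arr[r], arr[p]
    pivot = arr[r]
    # swap elements so that everything smaller than the pivot ends up on the left
    i = l
    for j in range(l, r):
        if arr[j] < pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    # place the pivot between the two parts and report its final position
    arr[i], arr[r] = arr[r], arr[i]
    return i
```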

After one run of this function the original array is divided into 2 parts, left and right, separated by the pivot. Each element on the left is smaller than the pivot, and each element on the right is greater than or equal to the pivot. Please note that the function has O(n) time complexity and works in place.

Now our stupid_sort can be written as follows:
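
One way to write it, sketched here under the same assumptions: keep a worklist of sub-arrays and apply the shuffle to each of them until the whole array is sorted.

```python
def stupid_sort(arr):
    # sub-arrays that still need to be shuffled, as (left, right) index pairs
    parts = [(0, len(arr) - 1)]
    while not is_sorted(arr):
        l, r = parts.pop()
        if l < r:
            p = shuffle(arr, l, r)
            parts.append((l, p - 1))
            parts.append((p + 1, r))
```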

We can actually achieve the same functionality using recursion:
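
A sketch of the recursive version:

```python
def stupid_sort(arr, l, r):
    if l >= r:
        return
    p = shuffle(arr, l, r)
    stupid_sort(arr, l, p - 1)
    stupid_sort(arr, p + 1, r)
```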

Note that the is_sorted(arr) call has been replaced with if l >= r:, since a partition of size 1 (or 0) is trivially sorted. Once all the partitions are sorted, the original array, which consists of those sorted partitions, becomes sorted as well.

Here is the full code summing up all of the above:
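
Since the original listing is not shown, here is a self-contained sketch that puts all the pieces together:

```python
import random

def shuffle(arr, l, r):              # a.k.a. partitioning
    p = random.randint(l, r)
    arr[p], arr[r] = arr[r], arr[p]
    pivot = arr[r]
    i = l
    for j in range(l, r):
        if arr[j] < pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    arr[i], arr[r] = arr[r], arr[i]
    return i

def stupid_sort(arr, l=0, r=None):   # a.k.a. Quick Sort
    if r is None:
        r = len(arr) - 1
    if l >= r:
        return
    p = shuffle(arr, l, r)
    stupid_sort(arr, l, p - 1)
    stupid_sort(arr, p + 1, r)

if __name__ == '__main__':
    data = [random.randint(0, 1000) for _ in range(1000)]
    stupid_sort(data)
    print(data == sorted(data))      # True
```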

Written this way, the shuffling function has a special name – partitioning – and the algorithm is called Quick Sort. Its expected complexity is bounded by the number of partitioning levels, which is log(n), multiplied by the O(n) cost of shuffling the partitions at each level, giving O(n log n).

Thus, we have just demonstrated how by means of gradually improving the
most basic and simple function, our stupid sort has been transformed into
one of the most efficient sorting algorithms known today.

Lua as Python’s secret weapon.

May 28, 2015

There are Lies, Damn Lies, and Benchmarks

Pure Python

Hello again! Today I would like to share some remarkable benchmark findings that demonstrate how Python may be boosted by embedding Lua into Python code. The whole thing started with a simple task my friend asked me to complete in order to compare Python to other languages. So, below is the benchmark we are going to test:

fig 1.0
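
The original snippet (fig 1.0) is not reproduced here; below is a sketch of what it likely looked like, where the exact form of the “sum and update” action is my assumption:

```python
import random
import time

N = 5000000

start = time.time()
a = [random.randint(0, 1000000) for _ in range(N)]
b = [random.randint(0, 1000000) for _ in range(N)]
print('Pure Python init', time.time() - start)

start = time.time()
for i in range(N):
    if a[i] != b[i]:      # elements at the same index differ...
        a[i] += b[i]      # ...so "sum and update"
print('Pure Python sum', time.time() - start)
```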

We create 2 arrays of 5M random integers each and then compare them. If the elements at corresponding indexes are different (which they almost always are), we perform a “sum and update” action, which gives us the following result:

Pure Python init 7.01951003075
Pure Python sum 0.525348901749

Thus, Python requires about 7 seconds to initialize the arrays and another 0.52 seconds to perform the “sum and update” action.
Can we get more out of Python? Yes, we certainly can! Let’s see below:

NumPy

Apparently NumPy is capable of summing two arrays in a very efficient way. It is also superior in array initialization.

Consider the following code:
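
The original listing is not shown; a NumPy sketch of the same benchmark might be:

```python
import time
import numpy as np

N = 5000000

start = time.time()
a = np.random.randint(0, 1000000, N)
b = np.random.randint(0, 1000000, N)
print('Numpy init', time.time() - start)

start = time.time()
mask = a != b            # positions where the elements differ
a[mask] += b[mask]       # vectorized "sum and update" at those positions
print('Numpy sum', time.time() - start)
```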

The output:

Numpy init 0.211035966873
Numpy sum 0.0101628303528

We can see that NumPy requires just 0.2 seconds for array initialization versus 7 seconds for pure Python, which is about 35x faster. The same goes for the “sum and update” action: NumPy takes 0.01 seconds versus 0.5, which is about 50x faster. Pretty impressive numbers!

PyPy

Running the pure Python code (fig 1.0) with PyPy gives us the same results as NumPy:

Regular init 0.203986883163
Regular sum 0.0113749504089

C

Now let’s see what we can get out of C. My friend claimed he could write this function in C and that it would take him no more than 5 minutes. Well, he actually did write it within 5 minutes and then spent another hour trying to figure out why the function was not working. Probably every experienced C programmer would immediately spot the problem: in C you cannot just declare an array of 5000000 elements, because it is too big to be placed on the stack. What you should do instead is allocate it on the heap with malloc. While my friend was struggling with the C code, I wanted to figure out how to speed up the Python code without writing a custom module in C. And I found this simple solution:

Lua

Lua is an amazing language! It’s available as open source, and it’s easy to read and understand, which makes it easy to learn. You can find a detailed and authoritative introduction to all aspects of Lua programming written by Lua’s chief architect Roberto Ierusalimschy here. In spite of being simple, it is nevertheless a very powerful tool and a solid piece of programming art. To say the least, Lua is one of the fastest dynamic languages today, so please do not underestimate it. An important feature for us is that Lua has a very small memory footprint – less than 200K for the whole interpreter. This allows us to seamlessly embed it into any environment in order to facilitate and speed up the process.

Here is our benchmark written in Lua but wrapped with Python:
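
The original wrapper code is not shown; here is a sketch of how it could look using the lupa binding (both the choice of binding and the body of the Lua benchmark are my assumptions):

```python
import time
from lupa import LuaRuntime

lua = LuaRuntime()

lua.execute("""
function benchmark()
    local N = 5000000
    local t0 = os.clock()
    local a, b = {}, {}
    for i = 1, N do
        a[i] = math.random(1000000)
        b[i] = math.random(1000000)
    end
    local init_time = os.clock() - t0

    t0 = os.clock()
    for i = 1, N do
        if a[i] ~= b[i] then
            a[i] = a[i] + b[i]    -- "sum and update"
        end
    end
    return init_time, os.clock() - t0
end
""")

init_time, sum_time = lua.globals().benchmark()
print('LUA init:', init_time)
print('LUA sum:', sum_time)
```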

Output:

LUA init: 0.080163
LUA sum: 0.007911

The result is amazing! We can see that Lua is faster than PyPy and faster than NumPy by 1.5-3x, and faster than pure Python by 50-100x!

  1. Python – 1x
  2. NumPy – 50x
  3. PyPy – 50x
  4. Lua – 100x
  5. C – 100x

As you can see, it was easy to embed Lua in Python, and the results are outstanding! Thus, we can say that Lua is definitely a great help when you need to speed up some critical parts of your Python code. Many commercial products use Lua today, to name just a few: Adobe Photoshop Lightroom, World of Warcraft, FarCry, SimCity 4 and others.

Hope you enjoyed this post and it will be useful to you.

Extracting textual time-based content from blog pages using the LCA technique on a DOM tree.

January 20, 2014

Today there is increasing interest in scraping the latest data from the internet, especially textual data. There are a lot of content-providing sites, such as blogs, news sites, forums, etc. This content is time-based (periodically updated over time). Extracting time-based content from millions of sites is not a trivial task. The main difficulty is that we don’t know beforehand the format of the HTML page that we are going to scrape.

In this post I will describe a method for extracting time-based textual content from blog pages in an automatic manner.

Let us look at how posts are organized in the vast majority of blogs:

There are posts that consist of repeating DOM elements such as divs, spans and so on. The common feature is that every post contains a time stamp. The time stamp might be at the beginning of the post or at its end, but it is almost always a separate DOM element.

So it should not be too hard to detect the dates using some sort of date parsing library, such as dateutil.

The dates detection process would go like this:
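
The original code is not reproduced; below is a sketch of the detection step, assuming the page is parsed into a DOM tree with lxml (the choice of lxml is my assumption):

```python
import lxml.html
from dateutil import parser as date_parser

def find_date_elements(html):
    """Return the DOM elements whose own text parses as a date."""
    tree = lxml.html.fromstring(html)
    date_elements = []
    for el in tree.iter():
        text = (el.text or '').strip()
        # skip empty nodes and long texts unlikely to be a lone time stamp
        if not text or len(text) > 40:
            continue
        try:
            date_parser.parse(text, fuzzy=True)
            date_elements.append(el)
        except (ValueError, OverflowError):
            continue
    return date_elements
```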

Now that we have the dates, we assume that each of them is part of a post that we need to collect, but we don’t know where each post starts and where it ends.

Fortunately there is an efficient way to identify the post boundaries: finding the Lowest Common Ancestor (LCA) of the given date elements in the DOM tree. Having the dates’ LCA, we can be pretty sure that its children provide the required entry points to the posts themselves. Below is a picture that visualizes the idea:

[Screenshot from 2014-01-19 18:31:38: the LCA of the date elements marks the post boundaries in the DOM tree]

Here is another, more complex case: multiple post threads on one page:

[Screenshot from 2014-01-19 17:57:45: multiple post threads on one page, each with its own LCA]

All these cases are supported by the following code:

Here graph is a pattern.graph.Graph instance representing the HTML DOM tree. We use its shortest_path method to find the shortest paths between the date elements; the intersection points of these shortest paths provide the LCA candidates.
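
That listing is not reproduced here. As a stand-in, the following sketch computes the same LCA directly from the ancestor chains of the lxml date elements found above (the original used pattern.graph shortest paths instead):

```python
def ancestor_chain(el):
    # path from the document root down to the element itself
    chain = list(el.iterancestors())
    chain.reverse()
    return chain + [el]

def lowest_common_ancestor(date_elements):
    """Deepest element that is an ancestor of all given date elements."""
    chains = [ancestor_chain(el) for el in date_elements]
    lca = None
    for level in zip(*chains):
        if all(node is level[0] for node in level):
            lca = level[0]
        else:
            break
    return lca
```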

Then we may want to iterate through the LCA’s children and extract text from them. This may be accomplished with something similar to the following:
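
For example, with the lxml-based sketch above, the extraction could be as simple as:

```python
def extract_posts(lca):
    # each direct child of the LCA is assumed to hold one post
    posts = []
    for child in lca:
        text = child.text_content().strip()
        if text:
            posts.append(text)
    return posts
```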

And, yippee! We have got the posts!

This was a high-level overview of the method. Obviously, more optimization is needed to make use of it in a production environment. Hopefully this will give some insights to those who are interested in data mining, and especially in time-based content extraction.

Thanks to my sister Natalia Vishnevsky for editing this and for the moral support.

Python as an optimal solution for today’s network application programming challenges.

November 3, 2013

Recently I made a brief exploration of the local job market and found a simple fact: Python is greatly misunderstood and, as a result, extremely underestimated.

Having been employed by different software development companies, I have heard many times from technical people that Python is slow, that it doesn’t have a real threading mechanism, that due to the lack of static types it’s error-prone, that some of the “basic” OOP stuff is missing, and that it is more like a toy language with no real use in a production environment… So I felt a deep need to write an introduction to Python for potential users, employers and all who may be interested. It might be of great use for business owners and start-up founders as well as for technical people.

Introduction.

1. “Python is a programming language that allows you to work more quickly and integrate your systems more effectively. You can learn to use Python and see almost immediate gains in productivity and lower maintenance costs.” – From my personal experience, development in Python is about 3 times faster than in Java.

2. “Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual machines.” – and I would add also all of them.

3. “Python is free to use, even for commercial products, because of its OSI-approved open source license.”

The world is changing. Reactive Manifesto.

First of all, I would like to introduce you to the Reactive Manifesto.

It has been recently created to name the changes that are taking place in today’s requirements for network applications. The Reactive Manifesto lists the properties an application has to comply with, namely scalability, responsiveness and resilience. It clearly says that scalability and responsiveness are all about asynchronous programming models (APM).

Responsiveness.

A year ago I wrote a post about asynchronous programming with Python, where I explained the advantages of APM over the threading model. There I claimed that Python actually excels at APM with respect to the simplicity of writing code. Having one of the largest official repositories of open source code (36,195 packages at the time of writing this post), Python offers many different libraries that make it dead easy to write APM-based code. The most advanced of them is Twisted, an event-driven networking engine.
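
As a tiny illustration of the style (not taken from that post), here is a sketch of three concurrent “requests” with Twisted deferreds; fetch() is a hypothetical stand-in for a real asynchronous network call:

```python
from twisted.internet import defer, reactor

def fetch(url):
    # stand-in for a real asynchronous call (e.g. an HTTP request)
    d = defer.Deferred()
    reactor.callLater(1.0, d.callback, 'response from %s' % url)
    return d

@defer.inlineCallbacks
def main():
    # the three "requests" run concurrently, so this takes ~1 second, not ~3
    results = yield defer.gatherResults(
        [fetch(u) for u in ('http://a', 'http://b', 'http://c')])
    print(results)
    reactor.stop()

main()
reactor.run()
```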

Gregory Begelman describes here how we built an APM-based crawler system on top of Scrapy that is able to scrape 2000 domains simultaneously on one machine with 8 cores and 32G of RAM. In this post I elaborate on it even further.

Scalability and Resilience.

Today the major players on the market, such as Google, Facebook, Yahoo, YouTube and many others, use Python as part of their core infrastructure.

Such well-known and commonly used computing systems as Amazon Web Services, Google App Engine, Apache Hadoop and Nokia Disco have been integrated with Python as well.

Thereby, Python is a perfect fit for the reactive application model described above. Considering its simplicity of use and its open source nature, with a mature community and a huge repository, I conclude that Python is an optimal and highly efficient choice for today’s network application programming needs and challenges.

Tips on optimizing scrapy for high performance

November 2, 2013

Running multiple crawlers in a single process.

When crawling many web pages, it’s important for an application to take advantage of APM. Scrapy is an asynchronous Python crawling framework that, with small changes, perfectly suits this need.

Scrapy has a Crawler component that includes a request scheduler as well as a visited-urls queue, together with all the configuration parameters related to how the crawling process should be performed. Thus, if you want to crawl multiple domains simultaneously, the best isolation level is to create a separate Crawler for each domain. This allows you to have a custom configuration context per domain.

The conventional use of scrapy is its Crawler-per-process model. However, if you need to crawl 2000 domains simultaneously on one machine, it would be very inefficient to have 2000 processes there. So our purpose is to run multiple Crawlers within just one scrapy process.

Gregory Begelman describes in his post how we solved this problem and how we managed to run 300 concurrent domains from a single process that took just one processor core and only about 3G of RAM.
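
As a rough illustration (not the exact code from that post), running several spiders in one process and one reactor with today’s Scrapy API looks roughly like this; the spider imports are hypothetical:

```python
from scrapy.crawler import CrawlerProcess

# hypothetical spiders, one per domain
from myproject.spiders import SiteASpider, SiteBSpider

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(SiteASpider)   # each call gets its own Crawler, scheduler and dupefilter
process.crawl(SiteBSpider)
process.start()              # starts the Twisted reactor and blocks until all crawls finish
```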

A Bloom filter as a duplicate filter for visited urls.

The default duplicate filter that scrapy uses for filtering visited urls keeps a list of url fingerprints – basically sha1 hashes, 40 characters long, each taking 77 bytes in Python 2.7. Let’s say you have to scrape a site with 2M pages; then your duplicate filter list may grow up to 2M * 77B = 154MB per Crawler. In order to be able to scrape 300 such domains simultaneously, you would need 300 * 154MB ≈ 45GB of memory. Fortunately there is another way – the Bloom filter. It’s a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. When we configure it to store 2M items with a false positive probability of 0.00001%, we get a duplicate filter that is good enough and takes only 11MB of memory. This enables us to crawl 300 domains using only about 3GB of memory.

Below is a Bloom duplicate filter for scrapy implemented with pybloom.
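
The original listing is not reproduced here; a sketch of how such a filter might look (module paths follow recent Scrapy versions and may differ in older ones):

```python
from pybloom import BloomFilter
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class BloomDupeFilter(BaseDupeFilter):
    """Drop requests whose fingerprint has (probably) been seen before."""

    def __init__(self, capacity=2000000, error_rate=0.0000001):
        self.fingerprints = BloomFilter(capacity=capacity, error_rate=error_rate)

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```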

To enable it in your project, add the following line to the project’s settings.py:
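
Assuming the filter above lives in myproject/dupefilters.py (a hypothetical path), the setting would be:

```python
DUPEFILTER_CLASS = 'myproject.dupefilters.BloomDupeFilter'
```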