Alexey Vishnevsky

There is always one more bug to fix

March 14, 2018

Stupid sort (also known as bogosort) is a sort that just randomly shuffles the array until it eventually becomes sorted.

Although stupid sort is hardly ever seen in a production environment, it may come up as an opening question in a programming interview at any level. A discussion about optimizing this kind of sort can be very interesting and useful for a better understanding of some sorting techniques.

In a nutshell stupid sort may be written like this:
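The original code listing appears to be missing here; a minimal sketch, assuming a `shuffle` that swaps two random elements (as discussed later in the post) and an `is_sorted` helper, might look like this:

```python
import random


def is_sorted(arr):
    # every adjacent pair is in non-decreasing order
    return all(arr[i] <= arr[i + 1] for i in range(len(arr) - 1))


def shuffle(arr):
    # swap two randomly chosen elements
    i = random.randrange(len(arr))
    j = random.randrange(len(arr))
    arr[i], arr[j] = arr[j], arr[i]
    return arr


def stupid_sort(arr):
    # keep shuffling until the array happens to be sorted
    while not is_sorted(arr):
        shuffle(arr)
    return arr
```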

The average running time of this algorithm is terribly long. For each permutation (out of n!) we need to verify the order of all n elements in the array, which gives us O(n*n!). On my hexa-core Xiaomi Redmi 3 phone this process is very slow even for n=10, which already amounts to more than 36 million operations. In the worst case, the running time of the algorithm is unbounded, for the same reason that a tossed coin might turn up heads any number of times in a row.

The best case occurs if the list as given is already sorted; in this case the expected time complexity is O(n).

For any collection of fixed size, the expected running time of the algorithm is finite, for much the same reason that the infinite monkey theorem holds: there is a non-zero probability of getting the right permutation, so given an unbounded number of tries it will almost surely eventually be chosen.

Let us demonstrate how we can reduce the average running time by adding a couple of simple rules. In our example, `shuffle` swaps two random elements each time. What if we do not swap pairs that are already in the right order (the left element smaller than the right one)? This immediately saves us a lot of swaps and significantly decreases the running time.

Let us write down this “directional shuffle”:

```python
def shuffle(arr, l, r):
    rint = random.randint
    while l < r:
        m = rint(l, r)
        if arr[l] > arr[m]:
            arr[m], arr[l] = arr[l], arr[m]
        l += 1
    return arr
```

Now, after adding these new rules, my phone was able to sort 100 elements in less than a second.
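For reference, the driver loop around this directional shuffle might look like the following (with `is_sorted` as a hypothetical helper, since the original listing does not show one):

```python
import random


def is_sorted(arr):
    return all(arr[i] <= arr[i + 1] for i in range(len(arr) - 1))


def shuffle(arr, l, r):
    # directional shuffle: only swap a pair when it is out of order
    rint = random.randint
    while l < r:
        m = rint(l, r)
        if arr[l] > arr[m]:
            arr[m], arr[l] = arr[l], arr[m]
        l += 1
    return arr


def stupid_sort(arr):
    while not is_sorted(arr):
        shuffle(arr, 0, len(arr) - 1)
    return arr


arr = [random.randint(1, 100) for _ in range(100)]
stupid_sort(arr)
print(is_sorted(arr))  # True
```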

Although the running time has decreased, it still grows very fast as the number of elements in the array increases. For example, sorting 1000 elements takes about 100 seconds: we increased the array size by x10 and got a x1000 slowdown, which suggests our algorithm is roughly O(n^3). It turns out that if we could only split our 1000-element array into 10 sub-arrays of 100 elements each and then sort each one separately, this would take us less than 10*(1 sec) = 10 seconds (instead of 100). The only thing we have to figure out is how to split the array in such a way that merging it back is trivial. Apparently, we can change our shuffling function a little bit, so that after shuffling, the values of the elements in the left part of the array are smaller than those in the right part. Shuffling written in this way may be called many times on the same part of the array until that part eventually becomes sorted (the stupid-sort technique). Then, once all parts of the array are sorted, the entire array is sorted (trivially).

Let us modify our “shuffle” to swap elements with respect to some randomly chosen element of the array (pivot):

```python
def shuffle(arr, l, r):
    m = l
    # choose the pivot randomly and place it as the last element
    p = random.randint(l, r)
    arr[p], arr[r] = arr[r], arr[p]
    # swap elements arr[l], arr[m] according to the pivot arr[r]
    while l < r:
        if arr[l] < arr[r]:
            arr[l], arr[m] = arr[m], arr[l]
            m += 1
        l += 1
    # place the pivot between the two partitions
    arr[m], arr[r] = arr[r], arr[m]
    return m
```

The result of one run of this function is that the original array is divided into two parts, left and right, separated by the pivot. Each element on the left is smaller than the pivot, and each element on the right is greater than or equal to it. Notice that the function has O(n) time complexity and works in place.
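A quick way to convince ourselves of this property is to run the function on a small array and check both sides of the returned pivot index:

```python
import random


def shuffle(arr, l, r):
    # partition-style shuffle around a randomly chosen pivot
    m = l
    p = random.randint(l, r)
    arr[p], arr[r] = arr[r], arr[p]
    while l < r:
        if arr[l] < arr[r]:
            arr[l], arr[m] = arr[m], arr[l]
            m += 1
        l += 1
    arr[m], arr[r] = arr[r], arr[m]
    return m


arr = [5, 2, 9, 1, 7, 3, 8]
m = shuffle(arr, 0, len(arr) - 1)
print(all(x < arr[m] for x in arr[:m]))       # True
print(all(x >= arr[m] for x in arr[m + 1:]))  # True
```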

Now our stupid_sort can be written as follows:

```python
def stupid_sort(arr):
    r = len(arr) - 1
    while not is_sorted(arr):
        p = shuffle(arr, 0, r)
        p1 = shuffle(arr, 0, p)
        p2 = shuffle(arr, p, r)
        p3 = shuffle(arr, 0, p1)
        p4 = shuffle(arr, p1, p)
        p5 = shuffle(arr, p, p2)
        p6 = shuffle(arr, p2, r)
        # etc..
```

We can actually achieve the same functionality using recursion:

```python
def stupid_sort(arr, l, r):
    if l >= r:
        return
    p = shuffle(arr, l, r)
    stupid_sort(arr, l, p)
    stupid_sort(arr, p + 1, r)
```

Note that the is_sorted(arr) call is replaced with if l >= r:, since a partition of at most one element is trivially sorted (and a two-element partition becomes sorted after a single shuffle call). Then the original array, which consists of sorted partitions, is also sorted.

Here is the full code summarizing all of the above:

```python
import time
import random


def shuffle(arr, l, r):
    m = l
    p = random.randint(l, r)
    arr[p], arr[r] = arr[r], arr[p]
    while l < r:
        if arr[l] < arr[r]:
            arr[l], arr[m] = arr[m], arr[l]
            m += 1
        l += 1
    arr[m], arr[r] = arr[r], arr[m]
    return m


def stupid_sort(arr, l, r):
    if l >= r:
        return
    p = shuffle(arr, l, r)
    stupid_sort(arr, l, p)
    stupid_sort(arr, p + 1, r)


n = 100000
arr = [random.randint(1, n) for i in range(n)]
st = time.time()
stupid_sort(arr, 0, len(arr) - 1)
print(time.time() - st)
print(arr)
```

Written in this way, the shuffling function has a special name – partitioning – and the algorithm is called Quick Sort. Its expected complexity is bounded by the number of partitioning levels, which is log(n), multiplied by the O(n) cost of shuffling all the partitions on each level, giving O(n log(n)).

Thus, we have just demonstrated how, by gradually improving the most basic and simple function, our stupid sort has been transformed into one of the most efficient sorting algorithms known today.
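To see this growth rate empirically, one can time the final version on increasing input sizes (a rough benchmark sketch, repeating the listing above so it is self-contained; absolute numbers will of course differ between machines):

```python
import random
import sys
import time

# random pivots keep the recursion shallow on average,
# but raise the limit to be safe on unlucky runs
sys.setrecursionlimit(100000)


def shuffle(arr, l, r):
    m = l
    p = random.randint(l, r)
    arr[p], arr[r] = arr[r], arr[p]
    while l < r:
        if arr[l] < arr[r]:
            arr[l], arr[m] = arr[m], arr[l]
            m += 1
        l += 1
    arr[m], arr[r] = arr[r], arr[m]
    return m


def stupid_sort(arr, l, r):
    if l >= r:
        return
    p = shuffle(arr, l, r)
    stupid_sort(arr, l, p)
    stupid_sort(arr, p + 1, r)


for n in (1000, 10000, 100000):
    arr = [random.randint(1, n) for _ in range(n)]
    st = time.time()
    stupid_sort(arr, 0, len(arr) - 1)
    # O(n log n): a 10x larger input should take a bit more than 10x longer
    print(n, round(time.time() - st, 3))
```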

January 20, 2014

Today there is increasing interest in scraping the latest data from the internet, especially textual data. There are a lot of content-providing sites, such as blogs, news sites, forums, etc. This content is time-based (updated periodically over time). Extracting time-based content from millions of sites is not a trivial task. The main difficulty is that we don't know beforehand the format of the HTML page we are going to scrape.

In this post I will describe a method for extracting time-based textual content from blog pages automatically.

Let us look at how posts are organized in the vast majority of blogs:

```html
<div class="post-container">
    <div class="post-1">
        <p>...</p>
        <abbr class="published entry-date">January 11, 2014</abbr>
    </div>
    <div class="post-2">
        <p>...</p>
        <abbr class="published entry-date">January 10, 2014</abbr>
    </div>
    <div class="post-3">
        <p>...</p>
        <abbr class="published entry-date">January 9, 2014</abbr>
    </div>
</div>
```

Posts consist of repeating DOM elements such as divs, spans, and so on. The common feature is that every post contains a time stamp. The time stamp might be at the beginning of the post or at its end, but it is almost always a separate DOM element.

So it should not be too hard to detect the dates using some sort of date-parsing library, such as dateutil.

The dates detection process would go like this:

```python
import urllib.request

from lxml import etree
from lxml.html import HTMLParser
from dateutil.parser import parse


def is_valid_date(text):
    """Based on dateutil: return True if text is a valid date format,
    and False otherwise."""
    try:
        parse(text)
        return True
    except (ValueError, OverflowError):
        return False


def main():
    response = urllib.request.urlopen("http://alexeyvishnevsky.com/")
    element_tree = etree.parse(response, HTMLParser())
    date_candidates = []
    for element in element_tree.iter():
        text = element.xpath("text()")
        if not len(text):
            continue
        if is_valid_date(text[0]):
            date_candidates.append(element)
    print([d.xpath("text()")[0] for d in date_candidates])


main()
```

Now that we have the dates, we assume that each of them is part of a post we need to collect, but we don't know where each post starts and where it ends.

Fortunately, there is an efficient method to identify the post boundaries: finding the Lowest Common Ancestor (LCA) of the given date elements in the DOM tree. Having the dates' LCA, we can be pretty sure that its children provide the required entry points to the posts themselves. Below is a picture that visualizes the idea:
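The idea can be illustrated on the toy markup from earlier. Here is a sketch using the standard library's ElementTree and a hand-built parent map, rather than the actual pipeline: the LCA of two date elements is the deepest node that appears in both of their ancestor chains.

```python
import xml.etree.ElementTree as ET

html = """<div class="post-container">
  <div class="post-1"><p>t</p><abbr>January 11, 2014</abbr></div>
  <div class="post-2"><p>t</p><abbr>January 10, 2014</abbr></div>
</div>"""

root = ET.fromstring(html)
# ElementTree has no parent pointers, so build a child -> parent map
parent = {child: node for node in root.iter() for child in node}


def ancestors(node):
    # chain of ancestors from node up to the root
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def lowest_common_ancestor(a, b):
    seen = set(ancestors(b))
    for node in ancestors(a):
        if node in seen:  # first (deepest) shared ancestor
            return node
    return None


dates = root.findall(".//abbr")
lca = lowest_common_ancestor(dates[0], dates[1])
print(lca.get("class"))  # post-container
```

The `div.post-1` and `div.post-2` children of this LCA are exactly the per-post entry points we are after.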

Another, more complex case is multiple post threads on the same page:

All these cases are supported by the following code:

```python
import itertools


def find_lca_candidates(graph, date_candidates):
    lca_local_candidates = None
    lca_global_candidates = set()
    comparing_set_for_local_date_candidates = itertools.combinations(
        date_candidates, 2)
    for item in comparing_set_for_local_date_candidates:
        path = graph.shortest_path(item[0].xpath, item[1].xpath)
        if not lca_local_candidates:
            lca_local_candidates = set(path)
        else:
            lca_local_candidates = lca_local_candidates.intersection(
                set(path))
        if len(lca_local_candidates) == 1:
            item = lca_local_candidates.pop()
            lca_global_candidates.add(item)
            lca_local_candidates = None
    return lca_global_candidates
```

Here graph is a pattern.graph.Graph instance representing the HTML DOM tree. We use its shortest_path method to find the shortest paths between the date elements; the intersection points of these shortest paths provide the LCA candidates.

Then we might want to iterate over the LCA's children and extract text from them. This may be accomplished with something similar to the following:

```python
def collect_posts_from_page():
    posts = []
    for lca_candidate in lca_global_candidates:
        for child in lca_candidate:
            posts.append(extract_text_from_etree(child))
    return posts
```

And, yippee! We have got the posts!

This was a high-level overview of the method. Obviously, more optimization is needed to make use of it in a production environment. Hopefully this will give some insights to those who are interested in data mining, and especially in time-based content extraction.