Today there is increasing interest in scraping the latest data from the internet, especially textual data. There are many content-providing sites, such as blogs, news sites and forums. Their content is time-based: it is updated periodically. Extracting time-based content from millions of sites is not a trivial task. The main difficulty is that we don’t know beforehand the format of the HTML page that we are going to scrape.

In this post I will describe a method for automatically extracting time-based textual content from blog pages.

Let us look at how posts are organized in the vast majority of blogs:

Posts consist of repeating DOM elements such as divs, spans and so on. Their common feature is that every post contains a time stamp. The time stamp might be at the beginning of the post or at the end of it, but it is almost always a separate DOM element.

So it should not be too hard to detect the dates using a date parsing library such as dateutil.

The date detection process would go like this:
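A minimal sketch of this step, not the original implementation. It assumes the page is well-formed enough for the standard library's xml.etree.ElementTree (real-world HTML usually needs a more lenient parser), uses dateutil when it is installed and a few hard-coded strptime formats otherwise, and relies on a made-up heuristic (short text that contains a digit) so that ordinary sentences are not mistaken for dates. The names parse_date, find_date_elements, FALLBACK_FORMATS and SAMPLE are illustrative only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

try:
    from dateutil import parser as date_parser  # third-party: python-dateutil
except ImportError:
    date_parser = None

# Used only when dateutil is missing (hypothetical, far from exhaustive).
FALLBACK_FORMATS = ("%B %d, %Y", "%Y-%m-%d", "%d %B %Y")


def parse_date(text):
    """Return a datetime if text looks like a standalone date, else None."""
    text = text.strip()
    # Heuristic guard (an assumption): dates are short and contain a digit.
    if not (6 <= len(text) <= 40) or not any(ch.isdigit() for ch in text):
        return None
    if date_parser is not None:
        try:
            return date_parser.parse(text)
        except (ValueError, OverflowError):
            return None
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def find_date_elements(root):
    """Collect (element, datetime) pairs for elements whose own text is a date."""
    found = []
    for element in root.iter():
        parsed = parse_date(element.text or "")
        if parsed is not None:
            found.append((element, parsed))
    return found


SAMPLE = """<div id="content">
  <div class="post"><span>January 19, 2014</span><p>First post body.</p></div>
  <div class="post"><span>January 12, 2014</span><p>Second post body.</p></div>
</div>"""

dates = find_date_elements(ET.fromstring(SAMPLE))
for element, parsed in dates:
    print(element.tag, parsed.date())
```

Only the text that belongs directly to an element is tested, which matches the observation that the time stamp is almost always a separate DOM element.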

Now that we have the dates, we assume that each of them is part of a post that we need to collect, but we don’t know where the post starts and where it ends.

Fortunately, there is an efficient method to identify the post boundaries: find the Lowest Common Ancestor (LCA) of the date elements in the DOM tree. Once we have the dates’ LCA, we can be fairly sure that its children provide the entry points to the posts themselves. The picture below visualizes the idea:

[Image: screenshot illustrating the LCA of the date elements, 2014-01-19 18:31:38]

Another, more complex case is a page with multiple post threads:

[Image: screenshot of a page with multiple post threads, 2014-01-19 17:57:45]

All these cases are supported by the following code:

Here graph is a pattern.graph.Graph instance representing the HTML DOM tree. We use its shortest_path method to find the shortest paths between the date elements. The intersection points of these shortest paths provide the LCA candidates.
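The same idea can be sketched without pattern.graph. In a tree, the shortest path between two nodes always passes through their lowest common ancestor, so the sketch below swaps the shortest-path search for a plain walk up the parent chain. This is a different technique from the one described above, chosen so the example runs on the standard library alone. It assumes the date elements have already been detected (here they are simply picked by a hypothetical class="date" attribute) and that the markup parses with xml.etree.ElementTree. All function names and the sample markup are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_parent_map(root):
    """ElementTree keeps no parent pointers, so record child -> parent."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors(element, parents):
    """Return [element, parent, grandparent, ..., root]."""
    chain = [element]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def lowest_common_ancestor(a, b, parents):
    """Deepest element that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a, parents))
    for node in ancestors(b, parents):
        if node in seen:
            return node
    return None


def lca_candidates(date_elements, parents):
    """Count the LCA of each consecutive pair of date elements."""
    counts = Counter()
    for first, second in zip(date_elements, date_elements[1:]):
        counts[lowest_common_ancestor(first, second, parents)] += 1
    return counts


SAMPLE = """<body>
  <div id="thread-a">
    <div class="post"><span class="date">2014-01-19</span><p>A1</p></div>
    <div class="post"><span class="date">2014-01-18</span><p>A2</p></div>
    <div class="post"><span class="date">2014-01-17</span><p>A3</p></div>
  </div>
  <div id="thread-b">
    <div class="post"><span class="date">2014-01-16</span><p>B1</p></div>
    <div class="post"><span class="date">2014-01-15</span><p>B2</p></div>
    <div class="post"><span class="date">2014-01-14</span><p>B3</p></div>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)
parents = build_parent_map(root)
# Assumption: dates were detected earlier; here they are selected by class.
date_elements = [el for el in root.iter("span") if el.get("class") == "date"]
candidates = lca_candidates(date_elements, parents)
for element, count in candidates.most_common():
    print(element.tag, element.get("id"), count)
```

In this two-thread sample each thread container is the LCA of two pairs of neighbouring dates, while the page root appears only once, at the point where one thread ends and the next begins. How to filter the candidates (for example by a minimum count) is a heuristic left open here.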

Then we might want to iterate through the LCA’s children and extract text from them. This may be accomplished with something similar to the following:
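A minimal sketch of this last step, assuming the LCA element has already been located and that the markup parses with the standard library's xml.etree.ElementTree. The helpers element_text and extract_posts and the sample markup are illustrative, not the original code.

```python
import xml.etree.ElementTree as ET


def element_text(element):
    """Join all text fragments inside element, collapsing whitespace."""
    # Joining with a space keeps neighbouring elements from running together;
    # it may also split a word that is broken up by inline markup.
    return " ".join(" ".join(element.itertext()).split())


def extract_posts(lca):
    """Treat each child of the LCA that contains text as one post."""
    posts = []
    for child in lca:
        text = element_text(child)
        if text:
            posts.append(text)
    return posts


SAMPLE = """<div id="content">
  <div class="post"><span>January 19, 2014</span><p>First post body.</p></div>
  <div class="post"><span>January 12, 2014</span><p>Second post body.</p></div>
  <div class="spacer"></div>
</div>"""

lca = ET.fromstring(SAMPLE)  # assumption: this element is the detected LCA
posts = extract_posts(lca)
for post in posts:
    print(post)
```

Children with no text, such as the empty spacer div in the sample, are skipped. A real page would likely need further filtering, for instance keeping only the children that contain one of the detected dates.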

And, Yuppee!! We have got the posts!

This was a high-level overview of the method. Obviously, more optimization is needed before it can be used in a production environment. Hopefully this will give some insights to those who are interested in data mining, and especially in time-based content extraction.

Thanks to my sister Natalia Vishnevsky for editing this and for the moral support.