Alexey Vishnevsky

About websockets of the 90’s and the creativity within..

Recently I took part in a discussion about different ways of establishing a real-time connection between a browser and a server. One of my colleagues just said: “I would use a GIF for that”, and I was like “whaat?”, and then he showed me this library: gifsockets. This is an implementation of the Graphics Interchange Format protocol, known as GIF. It’s supposed to represent an animation in the browser that consists of a set of.. Read More

Extracting textual time-based content from a blog page using the LCA technique on a DOM tree.

Today there is increasing interest in scraping the latest data from the internet, especially textual data. There are many content-providing sites, such as blogs, news sites, and forums. This content is time-based (periodically updated over time). Extracting time-based content from millions of sites is not a trivial task. The main difficulty here is that we don’t know beforehand the format of the HTML page that we.. Read More

Python as an optimal solution for today’s network application programming challenges.

Recently I made a brief exploration of the local job market and found a simple fact: Python is greatly misunderstood and, as a result, extremely underestimated. Having been employed by different software development companies, I have heard many times from technical people that Python is slow, that it doesn’t have a real threading mechanism, and that due to the lack of static types it’s error-prone; that.. Read More

Tips on optimizing Scrapy for high performance

Running multiple crawlers in a single process. When crawling many web pages it’s important for an application to take advantage of APM. Scrapy is an asynchronous Python crawling framework that, with small changes, perfectly suits this need. Scrapy has a Crawler component that includes a request scheduler as well as a visited URLs queue, together with all the configuration parameters related to how the crawling process should be performed. Thus.. Read More