The last time Hackerfall tried to access this page, it returned a not found error. A cached version of the page is below, or clickhereto continue anyway

How to Mine the Web - Digital Bacon

UPDATE: I forgot to mention that PiCloud has been bought by Dropbox (Congrats!) at the end of 2013. Fortunately MultyVacwill rise from its ashes. How well MultyVac supports web mining has yet to be seen.

Recently, I've done a number of projects utilizing web mining techniques. Also known as web scraping, web mining is a set of methods to extract useful information from websites, be it content, connections, or others. With the stagnation in semantic web and the often-lacking API endpoints, web mining has become a popular way to retrieve and consume publicly available data in a scalable fashion.

A web scraper typically has 3 components: HTTP Client, Content Parsing, and Data Consumption. As I have been using mostly Python for my recent work, thus many of my recommended libraries will be in Python.

First Step

The first step to mine web for information is search for an API. Its surprising how trigger-happy developers may get with web scraping. If an API already exists for the needed data, dont scrape it! Web mining often requires much work to start and maintain, as any HTML changes may throw off the script. The best way to scrap the web is not do it.

HTTP Client

The HTTP client is responsible for chatting communicating with the internet. It needs to be able to make basic HTTP requests (e.g. GET and POST) and output the response. A good example is the GNU wget, retrieving the raw response at the URI of interests. Extra points go to libraries and tools that can understand and format the response. My favorite is requests, which saves developers time on common routines, including status code parsing and header extraction.

Often, this client may also disguised its requests as if from a popular browser, in order to better mirror the traditional user experience. Alternatively, a pool of IP addresses may be used. through cloud computing services such as AWS and PiCloud, in order to distribute the load. If the data you need is populated through an AJAX call, use tools such as Fiddler to examine the particular HTTP request that populated the data of interest. If the same response can be replicated through manipulating the GET arguments (i.e. the stuff following ? in the URI), you can make less requests to get the same information. Depending on the website, AJAX responses are often JSON objects, making it easier to parse and traverse.

Content Parsing

Once the content is retrieved, we need to traverse the data structure and identify the data of interest. If the response is in JSON format, there are many standard libraries for parsing it into a memory object. If the response is HTML, or even a javascript snippet, however, the task becomes a bigger challenge. Broadly speaking, the two solutions: Structural traversal and pattern matching.

We can traverse the XML tree (of which HTML is a type) with tools that supports XPATH expressions, or if you are using Python, use BeautifulSoup. By identifying the specific order or unique trait, XML parsers can quickly traverse to the node containing the target data. This method is sensitive to UI updates, as the node orders and attributes may change. If the website doesn't update often, this may be a good and easy-to-follow technique.

An alternative is to treat the whole response as a raw text file and apply text pattern techniques. Yes, I am talking about Regular Expressions (regex). Sometimes there are nothing uniquely identifying the node of interests; yet, the raw data itself is unique. For example, a price tag may appear in a random row of a table, but is always prefixed by the dollar sign ($). In situations such as this, regex is a powerful technique to quickly extract the needed data, while ignoring the structure of the HTML. Regex is also surprisingly effective when the target data is embedded within the javascript tags, instead of the HTML.

While these are two very different techniques, they are not mutually exclusive. I have often traversed to a set of relevant HTML nodes using BeautifulSoup, then use regex to extract the actual data of interest.

Data Consumption

An important step many developers forget, or not emphasize enough, is actually consuming the data. While it is the last step of importing data from the web, it should be the one of the first to be considered when designing the solution. How is the data ultimately used? What is the nature of data and how to make it easier to manipulate later? We need to think about these questions before we set out to build a web mining solution. If you are dealing to text data, you may want to apply text analytics techniques afterwards, requiring you to clean up the text and strip any HTML tags or encoded characters. If you are importing data tables, you may need to properly cast each row into useful data types (e.g. Int32, Boolean, String, etc). Ask any database engineers, and they can give you much thorough list of the dangers of improperly processed data.

Finally, it is important have an attitude of gratitude when mining information from the web. Often, the information we seek are not the primary purpose of the target websites, thus not guaranteed the quality, support, or even access. If we can avoid overstepping our boundary with the traffic load and usage, web community as a whole may be more inclined to share their data.

Continue reading on