Web Data is Big Data
Saturday, 19 May 2012 17:58
In the world of Big Data, there’s a lot of talk about unstructured data — after all, “variety” is one of the three Vs. Often these discussions dwell on log file data, sensor output or media content. But what about data on the Web itself — not data from Web APIs, but data on Web pages that were designed more for eyeballing than machine-driven query and storage? How can this data be read, especially at scale? Recently, I had a chat with the CTO and Founder of Kapow Software, Stefan Andreasen, who showed me how the company’s Katalyst product tames data-rich Web sites not designed for machine-readability.
Scraping the Web
Code that performs data extraction through this sort of string manipulation is sometimes said to be performing Web “scraping.” The term pays homage to “screen scraping,” a similar, though much older, technique used to extract data from mainframe terminal screen text. Web scraping has significant relevance to Big Data. Even in cases where the bulk of a Big Data set comes from flat files or databases, augmenting that with up-to-date reference data from the Web can be very attractive, if not outright required.
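To make the idea concrete, here is a minimal Python sketch of the kind of delimiter-based string manipulation that hand-written scraping code tends to rely on. The HTML fragment, the class names and the extract_between helper are all invented for this illustration; they are assumptions, not taken from any real site or product.

```python
def extract_between(text, start_delim, end_delim):
    """Return every substring found between start_delim and end_delim."""
    results = []
    pos = 0
    while True:
        start = text.find(start_delim, pos)
        if start == -1:
            break  # no further opening delimiter
        start += len(start_delim)
        end = text.find(end_delim, start)
        if end == -1:
            break  # opening delimiter without a matching close
        results.append(text[start:end].strip())
        pos = end + len(end_delim)
    return results


# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td class="city">Boston</td><td class="rate">129.00</td></tr>
  <tr><td class="city">Denver</td><td class="rate">98.50</td></tr>
</table>
"""

cities = extract_between(SAMPLE_HTML, '<td class="city">', '</td>')
rates = [float(r) for r in extract_between(SAMPLE_HTML, '<td class="rate">', '</td>')]

# Pair the two columns up into rows of (city, rate).
rows = list(zip(cities, rates))
print(rows)
```

Note how tightly the extraction is bound to the exact markup: if the page author renamed a class or added an attribute to the cell, the delimiters would stop matching and the function would quietly return an empty list.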
Unlocking Important Data
Similarly, there’s lots of commercial data available online that may not be neatly packaged in code-friendly formats either. Consider airline and hotel frequent flyer/loyalty program promotions. You can log into your account and read about them, but just try getting a list of all such promotions that may apply to a specific property or geographic area, and keeping the list up-to-date. If you’re an industry analyst wanting to perform ad hoc analytical queries across such offers, you may be really stuck.
Such an approach is neither reliable nor scalable. Writing the code is expensive and updating it is too. What is really needed for this kind of work is a scripting engine which determines the URLs it needs to visit, the data it needs to extract and the processing it must subsequently perform on the data. What’s more, if the data to be extracted, and the delimiters around it, could be identified visually, authoring and updating would be far faster than with manual inspection of HTML markup.
An engine like this has really been needed for years, but the rise of Big Data has increased the urgency, because this data is no longer needed just for simple and quick updates. In the era of Big Data, we need to collect lots of this data and analyze it.
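As a rough sketch of what such an engine separates out, the Python below keeps the three concerns (which URLs to visit, what to extract, how to process it) as data rather than hard-coded logic. It is only an illustration under stated assumptions: FieldRule, run_job, the example.com URLs and the in-memory FAKE_SITE are all hypothetical, and nothing here is meant to describe how any commercial product actually works.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldRule:
    """One extraction rule: a field name, its delimiters, and a converter."""
    name: str
    start: str
    end: str
    convert: Callable[[str], object] = str


def extract_all(text, start, end):
    """Return every stripped substring found between start and end."""
    values, pos = [], 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            return values
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            return values
        values.append(text[i:j].strip())
        pos = j + len(end)


def run_job(urls, rules, fetch):
    """Visit each URL, apply every rule, and return one record per row."""
    records = []
    for url in urls:
        page = fetch(url)
        # One list of converted values per rule, in rule order.
        columns = [[r.convert(v) for v in extract_all(page, r.start, r.end)]
                   for r in rules]
        # Align the columns into rows, tagging each with its source URL.
        for row in zip(*columns):
            record = {"source": url}
            record.update({r.name: value for r, value in zip(rules, row)})
            records.append(record)
    return records


# Hypothetical pages held in memory so the sketch runs without a network.
FAKE_SITE = {
    "http://example.com/offers?page=1":
        "<li><b>Boston</b> <i>500</i></li><li><b>Denver</b> <i>250</i></li>",
    "http://example.com/offers?page=2":
        "<li><b>Austin</b> <i>1000</i></li>",
}

# The rules are plain data: changing them does not touch run_job.
RULES = [
    FieldRule("city", "<b>", "</b>"),
    FieldRule("bonus_points", "<i>", "</i>", int),
]

records = run_job(sorted(FAKE_SITE), RULES, FAKE_SITE.__getitem__)
total = sum(r["bonus_points"] for r in records)
print(records)
print(total)
```

Because the rules live outside the loop that applies them, a layout change on the site means editing a rule rather than rewriting code, and the collected records can be handed to whatever analysis step comes next.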
Making it Real
That’s great for public Web sites that you wish to extract data from, but it’s also good for adding an API to your own internal Web applications without having to write any code. In effect, Katalyst builds data services around existing Web sites and Web applications, does so without requiring any coding, and makes any breaking layout changes in those sites and applications minimally disruptive.
Maybe the nicest thing about Katalyst is that it’s designed with data extraction and analysis in mind, and it provides a manageability layer atop all of its data integration processes, making it perfect for Big Data applications where repeatability, manageability, maintainability and scalability are all essential.
Web Data is BI, and Big Data