Blogger: Marcus Collins
My colleague Lyn Robson recently blogged about the emerging database meta models. Lyn mentioned the term Internet Scale in the context of data processing. The term covers conditions where the traditional relational model is often not the correct choice and Map Reduce may be more appropriate.
As organizations ask when Map Reduce should be considered, they should take a wider definition of internet scale, one that includes:
- Data volumes – huge data volumes are still a key criterion for internet scale. The first examples were web clickstream logs, and these remain an important data source. Similar volumes are now seen in other domains, for example security logs for intrusion detection and credit card transactions for fraud detection.
- Complex processing – for all its power, SQL lacks the ability to perform complex data processing, for example statistical analysis.
- Semi-consistent data – the Map Reduce programming model is similar to the relational full table scan. As such, the ability to process semi-consistent input data is important to such processing. I’ll explore this more in an upcoming paper on Cloud Databases.
The traditional database vendors have not missed how important Map Reduce is becoming. In the recently published document "Data Warehouses: Navigating the Maze of Technical Options", Burton Group Senior Analyst Marcus Collins looks at the differing database technologies that database architects must choose from as they design their data warehouses.
One of these is Map Reduce, a programming model first published by Google in 2004 for the processing of large data sets. Whilst not a database technology per se, the framework is included in this overview because of the coverage it has received and the work currently underway to integrate database access into it. Map Reduce consists of two functions, map and reduce, and a framework for running a large number of instances of these programs on commodity hardware. The map function reads a set of records from an input file, processes these records, and outputs a set of intermediate records. These output records take the generic form (key, data). As part of the map phase, a split function distributes the intermediate records across many buckets using a hash function. The reduce function then processes the intermediate records. Both the map and reduce functions can be written in any programming language.
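To make the flow concrete, here is a minimal single-machine sketch of the model in Python. The function names and the in-memory bucket structure are my own illustration; a real framework distributes this same work across many machines:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn, num_buckets=4):
    """Single-machine sketch of the Map Reduce flow:
    map -> hash-partition into buckets -> group by key -> reduce."""
    # Map phase: each input record yields intermediate (key, data) pairs.
    buckets = [defaultdict(list) for _ in range(num_buckets)]
    for record in records:
        for key, data in map_fn(record):
            # The split function: hashing the key picks the bucket, so all
            # records sharing a key land in the same bucket for the reducer.
            buckets[hash(key) % num_buckets][key].append(data)

    # Reduce phase: each key's grouped values are processed together.
    results = {}
    for bucket in buckets:
        for key, values in bucket.items():
            results[key] = reduce_fn(key, values)
    return results
```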
As an example of Map Reduce, consider counting the number of words that occur within a document or set of documents. The map function splits each document into words and outputs each word together with the digit “1”. The output records are therefore of the form (word, 1). The Map Reduce framework groups all the records with the same key (i.e., word) and feeds them into the reduce function. The reduce function sums the input values and outputs each word and the total number of occurrences in the document(s).
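Expressed against the sketch above (again, illustrative code rather than any particular framework’s API), word count looks like this:

```python
def word_count_map(document):
    # Emit (word, 1) for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def word_count_reduce(word, counts):
    # Sum the 1s emitted for this word across all documents.
    return sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(docs, word_count_map, word_count_reduce))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
# (key order may vary from run to run)
```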
Both the map and reduce functions can be parallelized, thus providing the scalability required when the input data is internet scale. This example shows the similarity between the map function and a parallel full table scan, and it is this synergy that database vendors are exploiting.
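As a rough illustration of that parallelism, the map phase of the sketch could be fanned out across local processes. A real framework schedules map tasks across a whole cluster; this is only the single-machine analogue:

```python
from multiprocessing import Pool

def parallel_map_phase(records, map_fn, workers=4):
    # Run the map function over slices of the input in parallel, mirroring
    # how a Map Reduce framework assigns input splits to many map tasks.
    # map_fn must be a top-level function that returns a list of
    # (key, data) pairs, so its results can be passed between processes.
    with Pool(workers) as pool:
        outputs = pool.map(map_fn, records)
    # Flatten the per-record outputs into one stream of intermediate records.
    return [pair for output in outputs for pair in output]
```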
The primary benefit of the Map Reduce framework is the flexibility it allows in the type of data that can be accessed and in the processing performed within the map and reduce functions. The map function can read from standard flat files (e.g., web clickstream logs), web crawls, and database tables. There is no limit on the programming language or the type of processing that the map function can perform. For example, the map function could be written in Perl or Python and perform unstructured text analysis or statistical analysis. This flexibility should be compared with the limited processing supported by the SQL language, even when we include the increased sophistication provided by the major database vendors (e.g., stored procedures or functions).
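For instance, in the style of Hadoop Streaming, where the map and reduce programs are ordinary executables that read lines from stdin and write tab-separated key/value lines to stdout, a mapper over a clickstream log might look like the following. The log format here (timestamp, user id, URL as whitespace-separated fields) is assumed purely for illustration:

```python
#!/usr/bin/env python
import sys

# Streaming-style mapper: count page hits in a web clickstream log.
for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 3:
        url = fields[2]
        # Emit (url, 1) as a tab-separated intermediate record.
        print(url + "\t1")
```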
There are a number of drawbacks with Map Reduce. The records processed and written by Map Reduce have no schema, taking only the generic form (key, data). Schemas document the record structure, provide independence from schema changes through the use of views, and require that data adhere to the schema definition, thereby guarding against corrupt data. Map Reduce’s lack of schema support provides none of these capabilities. Map Reduce also makes no use of any indexing scheme; rather, the map function uses brute force and massive parallelism to provide the necessary processing throughput.
The Burton Group document "Data Warehouses: Navigating the Maze of Technical Options" covers a number of other database technologies, and I’ll explore these in a series of follow-on blogs.