Explain the overlapping use cases between data warehousing and distributed computing technologies such as Hadoop
As customers, analysts, and journalists explore Hadoop and MapReduce, the most frequent question is: “When should I use Hadoop, and when should I put the data into a data warehouse?” The answer is best explained with an example of a recently deployed big data source: smart meters.
Smart meters are deployed in homes worldwide to help consumers and utility companies better manage the use of water, electricity, and gas. Historically, meter readers would walk from house to house recording meter readouts and reporting them to the utility company for billing purposes. Because of the labor costs, many utilities switched from monthly readings to quarterly ones. This delayed revenue and made it impossible to analyze residential usage in any detail.
Consider a fictional company called CostCutter Utilities that serves 10 million households. Once a quarter, they gathered 10 million readings to produce utility bills. With government regulation and the price of oil skyrocketing, CostCutter started deploying smart meters so they could take hourly readings of electricity usage. They now collect 21.6 billion sensor readings per quarter from the smart meters. Analysis of the meter data over months and years can be correlated with energy-saving campaigns, weather patterns, and local events, providing savings insights for both consumers and CostCutter Utilities. When consumers are offered a billing plan with cheaper electricity from 8 p.m. to 5 a.m., they demand five-minute intervals in their smart meter reports so they can identify high-use activity in their homes. At five-minute intervals, the smart meters are collecting more than 100 billion meter readings every 90 days, and CostCutter Utilities now has a big data problem: their data volume exceeds their ability to process it with existing software and hardware. So CostCutter Utilities turns to Hadoop to handle the incoming meter readings.
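Those volumes follow directly from the household count and the reading interval. As a quick back-of-the-envelope check (assuming a 90-day quarter, as the figures above imply; the class name is purely illustrative):

    // Rough volume check for CostCutter Utilities' meter readings.
    // Assumes 10 million households and a 90-day quarter.
    public class MeterVolume {
        public static void main(String[] args) {
            long households = 10_000_000L;
            long days = 90;
            long hourly = households * 24 * days;           // 21,600,000,000 readings per quarter
            long fiveMinute = households * 12 * 24 * days;  // 259,200,000,000 readings per quarter
            System.out.printf("Hourly:      %,d%n", hourly);
            System.out.printf("Five-minute: %,d%n", fiveMinute);
        }
    }

At five-minute intervals the count works out to roughly 259 billion readings per quarter, comfortably past the “more than 100 billion” figure above.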
Hadoop now plays a key role in capturing, transforming, and publishing the data. Using tools such as Apache Pig, advanced transformations can be applied in Hadoop with little manual programming effort, and because Hadoop is a low-cost storage repository, data can be held for months or even years. Since Hadoop has already cleaned and transformed the data, it can be loaded directly into the data warehouse and the meter data management system (MDMS). Using analysis of trends by household, neighborhood, time of day, and local events, marketing is developing additional offers that help consumers save money. And now the in-home display unit can give consumers detailed knowledge of their usage.
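To make the kind of transformation concrete, here is a minimal sketch written against the standard Hadoop MapReduce Java API rather than Pig; the input record layout (comma-separated meterId, ISO timestamp, kWh) and the class names are assumptions for illustration only. It rolls raw interval readings up into a daily total per meter, the sort of refined result that would then be loaded into the warehouse and MDMS.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Aggregates raw smart-meter interval readings (meterId,timestamp,kWh)
    // into one total per meter per day -- a typical "transform in Hadoop,
    // load the refined result into the warehouse" job.
    public class DailyUsage {

        public static class ReadingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",");
                if (f.length < 3 || f[1].length() < 10) return;  // skip malformed records
                double kwh;
                try {
                    kwh = Double.parseDouble(f[2]);
                } catch (NumberFormatException e) {
                    return;                                      // skip unparseable readings
                }
                String day = f[1].substring(0, 10);              // assumes ISO timestamp, e.g. 2012-06-01T13:05
                ctx.write(new Text(f[0] + "," + day), new DoubleWritable(kwh));
            }
        }

        public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable v : values) total += v.get();
                ctx.write(key, new DoubleWritable(total));       // meterId,day -> daily kWh
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "daily-usage");
            job.setJarByClass(DailyUsage.class);
            job.setMapperClass(ReadingMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A Pig script would express the same grouping in a few lines; either way, the point is that the heavy lifting runs in parallel across the Hadoop cluster before anything reaches the warehouse.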
Note that Hadoop is not an Extract-Transform-Load (ETL) tool; it is a platform that supports running ETL processes in parallel. Data integration vendors do not compete with Hadoop; rather, Hadoop is another channel through which their data transformation modules can run.
As corporations start using larger amounts of data, migrating it over the network for transformation or analysis becomes unrealistic; moving terabytes from one system to another daily can bring the wrath of the network administrator down on a programmer. It makes more sense to push the processing to the data. Moving all the raw data to one storage area network (SAN) or ETL server is infeasible at big data volumes, and even if you can move the data, processing it is slow, limited by SAN bandwidth, and often fails to meet batch processing windows. With Hadoop, raw data is loaded once onto low-cost commodity servers, and only the higher-value refined results are passed to other systems. ETL processing runs in parallel across the entire cluster, resulting in much faster operations than pulling data from a SAN into a collection of ETL servers. Using Hadoop, data does not get loaded into a SAN only to be pulled back out across the network multiple times, once for each transformation.
It should be no surprise that many Hadoop systems sit side by side with data warehouses. These systems serve different purposes and complement one another. For example:
A major brokerage firm uses Hadoop to preprocess raw clickstreams generated by customers using its website. Processing these clickstreams provides valuable insight into customer preferences, which are passed to a data warehouse. The data warehouse then couples these customer preferences with marketing campaigns and recommendation engines to offer investment suggestions and analysis to consumers. There are other approaches to investigative analytics on clickstream data using analytic platforms; see “MapReduce and the Data Scientist” for more details.
An eCommerce service uses Hadoop for machine learning to detect fraudulent supplier websites. The fraudulent sites exhibit patterns that Hadoop uses to produce a predictive model. The model is copied into the data warehouse, where it is used to find sales activity that matches the pattern. Once a match is found, the supplier is investigated and potentially discontinued.
Complex Hadoop jobs can use the data warehouse as a data source, simultaneously leveraging the massively parallel capabilities of both systems. Any MapReduce program can issue SQL statements to the data warehouse. From this perspective, a MapReduce program is “just another program,” and the data warehouse is “just another database.” Now imagine 100 MapReduce programs concurrently accessing 100 data warehouse nodes in parallel; both the raw processing and the data warehouse scale to meet any big data challenge. Inevitably, visionary companies will take this step to achieve competitive advantages.
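As a hedged sketch of that pattern, the mapper below opens a JDBC connection to the warehouse and issues a lookup query for each input record; the JDBC URL, credentials, table, and column names are placeholders rather than any particular vendor’s interface. In practice, each of the many concurrent map tasks would hold its own connection to a warehouse node, which is what makes the “100 programs against 100 nodes” picture possible.

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a mapper that treats the data warehouse as "just another database":
    // each map task looks up warehouse data for the records it is processing.
    // The JDBC URL, credentials, and schema below are illustrative placeholders.
    public class WarehouseLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Connection conn;
        private PreparedStatement lookup;

        @Override
        protected void setup(Context ctx) throws IOException {
            try {
                conn = DriverManager.getConnection(
                        "jdbc:exampledb://warehouse-host/usage", "etl_user", "secret");
                lookup = conn.prepareStatement(
                        "SELECT plan_name FROM billing_plan WHERE household_id = ?");
            } catch (SQLException e) {
                throw new IOException("Cannot reach the data warehouse", e);
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String householdId = line.toString().split(",")[0];
            try {
                lookup.setString(1, householdId);
                try (ResultSet rs = lookup.executeQuery()) {
                    if (rs.next()) {
                        // Join the Hadoop-side record with the warehouse-side attribute.
                        ctx.write(new Text(householdId), new Text(rs.getString("plan_name")));
                    }
                }
            } catch (SQLException e) {
                throw new IOException("Warehouse query failed", e);
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException {
            try {
                if (lookup != null) lookup.close();
                if (conn != null) conn.close();
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }
    }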