Question

In: Computer Science

Explain the overlapping use cases between data warehousing and distributed computing technologies such as Hadoop

Solutions

Expert Solution

As customers, analysts, and journalists explore Hadoop and MapReduce, the most frequent questions are: “When should I use Hadoop, and when should I put the data into a data warehouse?” The answers are best explained with an example of a recently deployed big data source: smart meters. Smart meters are deployed in homes worldwide to help consumers and utility companies better manage the use of water, electricity, and gas. Historically, meter readers walked from house to house recording meter readouts and reporting them to the utility company for billing purposes. Because of the labor costs, many utilities switched from monthly readings to quarterly readings. This delayed revenue and made it impossible to analyze residential usage in any detail.

Consider a fictional company called CostCutter Utilities that serves 10 million households. Once a quarter, they gathered 10 million readings to produce utility bills. With government regulation and the skyrocketing price of oil, CostCutter started deploying smart meters so they could get hourly readings of electricity usage. They now collect 21.6 billion sensor readings per quarter from the smart meters. Analysis of the meter data over months and years can be correlated with energy-saving campaigns, weather patterns, and local events, providing savings insights both for consumers and for CostCutter Utilities. When consumers are offered a billing plan with cheaper electricity from 8 p.m. to 5 a.m., they demand five-minute intervals in their smart meter reports so they can identify high-use activity in their homes. At five-minute intervals, the smart meters are collecting more than 100 billion meter readings every 90 days, and CostCutter Utilities now has a big data problem: their data volume exceeds their ability to process it with existing software and hardware. So CostCutter Utilities turns to Hadoop to handle the incoming meter readings.
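
A quick back-of-the-envelope calculation, sketched here in Python and assuming the 90-day quarter mentioned above, shows where those figures come from and why the move to five-minute intervals creates the problem:

    # Back-of-the-envelope reading volumes for the fictional CostCutter Utilities example.
    households = 10_000_000            # homes with smart meters
    days_per_quarter = 90              # the 90-day quarter used in the example

    hourly = households * 24 * days_per_quarter              # 24 readings per meter per day
    five_minute = households * 24 * 12 * days_per_quarter    # 288 readings per meter per day

    print(f"Hourly readings per quarter:      {hourly:,}")        # 21,600,000,000
    print(f"Five-minute readings per quarter: {five_minute:,}")   # 259,200,000,000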

Hadoop now plays a key role in capturing, transforming, and publishing the data. Using tools such as Apache Pig, advanced transformations can be applied in Hadoop with little manual programming effort, and because Hadoop is a low-cost storage repository, data can be held for months or even years. Because Hadoop has already cleaned and transformed the data, it can be loaded directly into the data warehouse and the meter data management system (MDMS). Marketing uses analysis of trends by household, neighborhood, time of day, and local events to develop additional money-saving offers for consumers, and the in-home display unit can now give consumers detailed knowledge of their usage.
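
The text names Apache Pig for this transformation work; purely as an illustration of the same cleaning step, here is a minimal Hadoop Streaming mapper sketched in Python (the record layout and validation rules are assumptions for the example, not CostCutter's actual format):

    #!/usr/bin/env python3
    # clean_readings_mapper.py - hypothetical map-only cleaning of raw smart-meter records.
    # Assumed input, one record per line: meter_id,timestamp,kwh
    import sys

    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue                  # drop malformed records
        meter_id, timestamp, kwh = fields
        try:
            usage = float(kwh)
        except ValueError:
            continue                  # drop non-numeric readings
        if usage < 0:
            continue                  # drop impossible negative usage
        # Emit cleaned, tab-separated records for loading into the warehouse/MDMS.
        print(f"{meter_id}\t{timestamp}\t{usage:.3f}")

A script like this would typically be submitted with the hadoop jar ... hadoop-streaming.jar command, pointing -mapper at the script and -input/-output at HDFS directories, which is what lets the cleaning run in parallel across the whole cluster rather than on a single ETL server.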

Note that Hadoop is not an Extract-Transform-Load (ETL) tool. It is a platform that supports running ETL processes in parallel. The data integration vendors do not compete with Hadoop; rather, Hadoop is another channel for use of their data transformation modules.

As corporations start using larger amounts of data, migrating it over the network for transformation or analysis becomes unrealistic. Moving terabytes from one system to another daily can bring the wrath of the network administrator down on a programmer; it makes more sense to push the processing to the data. Moving all of the data to one storage area network (SAN) or ETL server becomes infeasible at big data volumes, and even when the data can be moved, processing it is slow, limited by SAN bandwidth, and often fails to meet batch processing windows. With Hadoop, raw data is loaded once onto low-cost commodity servers, and only the higher-value refined results are passed to other systems. ETL processing runs in parallel across the entire cluster, resulting in much faster operations than can be achieved by pulling data from a SAN into a collection of ETL servers. With Hadoop, data is not loaded into a SAN only to be pulled back out across the network multiple times for each transformation.

It should be no surprise that many Hadoop systems sit side by side with data warehouses. These systems serve different purposes and complement one another. For example:

A major brokerage firm uses Hadoop to preprocess raw click streams generated by customers using its website. Processing these click streams provides valuable insight into customer preferences, which is passed to a data warehouse. The data warehouse then couples these customer preferences with marketing campaigns and recommendation engines to offer investment suggestions and analysis to consumers (a minimal sketch of this kind of preprocessing follows these examples). There are other approaches to investigative analytics on clickstream data using analytic platforms; see “MapReduce and the Data Scientist” for more details.

An eCommerce service uses Hadoop for machine learning to detect fraudulent supplier websites. The fraudulent sites exhibit patterns that Hadoop uses to produce a predictive model. The model is copied into the data warehouse where it is used to find sales activity that matches the pattern. Once found, that supplier is investigated and potentially discontinued.
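
As referenced in the brokerage example above, the clickstream preprocessing could look roughly like the following Hadoop Streaming mapper and reducer, sketched in Python (the log format, field positions, and per-customer page-view aggregation are illustrative assumptions, not the firm's actual pipeline):

    #!/usr/bin/env python3
    # clickstream_mapper.py - emit one (customer:page, 1) pair per raw web-log hit.
    # Assumed input, one hit per line: "timestamp customer_id page ..."
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 3:
            continue                  # skip malformed log lines
        customer_id, page = parts[1], parts[2]
        print(f"{customer_id}:{page}\t1")

Hadoop sorts the mapper output by key before it reaches the reducer, so identical customer:page keys arrive together and can be summed in a single pass:

    #!/usr/bin/env python3
    # clickstream_reducer.py - sum page views per customer:page key.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, n = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(n)
    if current_key is not None:
        print(f"{current_key}\t{count}")

The aggregated per-customer preferences, not the raw clicks, are what get loaded into the data warehouse.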

Complex Hadoop jobs can use the data warehouse as a data source, simultaneously leveraging the massively parallel capabilities of both systems. Any MapReduce program can issue SQL statements to the data warehouse. In this context, a MapReduce program is “just another program,” and the data warehouse is “just another database.” Now imagine 100 MapReduce programs concurrently accessing 100 data warehouse nodes in parallel. Both the raw processing layer and the data warehouse scale to meet any big data challenge. Inevitably, visionary companies will take this step to achieve competitive advantages.
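
As a rough sketch of a MapReduce task treating the warehouse as “just another database,” the following Python mapper enriches each reading with a SQL lookup; the pyodbc driver, the "warehouse" DSN, and the dim_household table are assumptions made for the illustration, not a specific vendor's API:

    #!/usr/bin/env python3
    # enrich_mapper.py - each map task issues SQL against the data warehouse.
    # Assumes an ODBC data source named "warehouse" with a dim_household(meter_id, rate_plan) table.
    import sys
    import pyodbc   # assumed to be installed on the cluster nodes

    conn = pyodbc.connect("DSN=warehouse")    # one connection per map task
    cursor = conn.cursor()

    for line in sys.stdin:
        meter_id, timestamp, kwh = line.rstrip("\n").split("\t")
        # A real job would batch or cache these lookups instead of issuing one query per record.
        cursor.execute("SELECT rate_plan FROM dim_household WHERE meter_id = ?", meter_id)
        row = cursor.fetchone()
        rate_plan = row[0] if row else "UNKNOWN"
        print(f"{meter_id}\t{timestamp}\t{kwh}\t{rate_plan}")

    conn.close()

Run across hundreds of map tasks, this is exactly the pattern of many programs concurrently querying many warehouse nodes in parallel.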


Related Solutions

Explain in detail Hadoop architecture. Give an example of Hadoop MapReduce using real data
We list multiple hot technologies related to distributed systems/computing. Students need to write a paragraph on ALL the below topics   What is Client Server model? Compare and contrast centralized and distributed computing.   Explain Servers, Clients, Thread, Code Migration, Software agents for the same. What is a process? Explain the various states of a process through state transition diagram. Explain the layered protocols. Compare and contrast the OSI and TCP/IP model. What is spontaneous networking? Compare and contrast some of...
Topic: Data Models Explain what Hadoop is in detail, and what are its basic components?
Identify which cloud computing service model is appropriate to use in each of the following cases: A company needs to build a web portal for clients all over the world to access web pages. Google allows customers to do computing on their system. A customer does not need to have a powerful computer, just some means to access the Internet. A small business does not want to purchase software, such as Visio drawing tools and databases.
1. Compare and contrast the main differences between data and data warehousing. Question 1b. Discuss the two major techniques for Data Mining. Question 2a Define Data Warehousing (DW)? Question 2b List and discuss two characteristics of Data Warehousing Question 3a State two reasons why data visualisation is important? Question 3b. Suggest two ways to build an Interactive Dashboard to the Air transport logistics Director Question 5. List two benefits of using a dashboard?
Computing Equivalent Units of Production Using the data given for Cases 1–3 below, and assuming the use of the average cost method, compute the separate equivalent units of production—one for materials and one for labor and overhead—under each of the following assumptions (labor and factory overhead are applied evenly during the process in each assumption): Assumptions: All materials go into production at the beginning of the process. All materials go into production at the end of the process. (Note that...
Explain in detail the role of Distributed ledger technologies in the following types of innovation: - Process innovation - Product innovation - Exploitation of new markets - Organizational innovation
Explain the difference between a data warehouse and a data mart. Use the analogy of a supply chain.
Can you explain the difference between overlapping and nonoverlapping in a Moore sequence recognizer and a Mealy sequence recognizer?
Are distributed ledger technologies general purpose technology, social technology or both? Explain in detail why and why it is important for long-run economic growth