Explain the overlapping use cases between data warehousing and distributed computing technologies such as Hadoop
As customers, analysts, and journalists explore Hadoop and MapReduce, the most frequent question is: “When should I use Hadoop, and when should I put the data into a data warehouse?” The answer is best explained with an example of a recently deployed big data source: smart meters.
Smart meters are deployed in homes worldwide to help consumers and utility companies better manage the use of water, electricity, and gas. Historically, meter readers would walk from house to house recording meter readouts and reporting them to the utility company for billing purposes. Because of the labor costs, many utilities switched from monthly readings to quarterly ones. This delayed revenue and made it impossible to analyze residential usage in any detail.
Consider a fictional company called CostCutter Utilities that serves 10 million households. Once a quarter, they gathered 10 million readings to produce utility bills. With government regulation and the price of oil skyrocketing, CostCutter started deploying smart meters so they could take hourly readings of electricity usage. They now collect 21.6 billion sensor readings per quarter from the smart meters. Analysis of the meter data over months and years can be correlated with energy-saving campaigns, weather patterns, and local events, providing savings insights for both consumers and CostCutter Utilities. When consumers are offered a billing plan with cheaper electricity from 8 p.m. to 5 a.m., they demand five-minute intervals in their smart meter reports so they can identify high-use activity in their homes. At five-minute intervals, the smart meters are collecting more than 100 billion meter readings every 90 days, and CostCutter Utilities now has a big data problem: their data volume exceeds their ability to process it with existing software and hardware. So CostCutter Utilities turns to Hadoop to handle the incoming meter readings.
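Those volumes follow directly from the household count and the reading interval. As a quick back-of-the-envelope check (assuming a 90-day quarter, as the figures above imply; the class name is purely illustrative):

    // Rough volume check for CostCutter Utilities' meter readings.
    // Assumes 10 million households and a 90-day quarter.
    public class MeterVolume {
        public static void main(String[] args) {
            long households = 10_000_000L;
            long days = 90;
            long hourly = households * 24 * days;           // 21,600,000,000 readings per quarter
            long fiveMinute = households * 12 * 24 * days;  // 259,200,000,000 readings per quarter
            System.out.printf("Hourly:      %,d%n", hourly);
            System.out.printf("Five-minute: %,d%n", fiveMinute);
        }
    }

At five-minute intervals the count works out to roughly 259 billion readings per quarter, comfortably past the “more than 100 billion” figure above.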
Hadoop now plays a key role in capturing, transforming, and publishing the data. Using tools such as Apache Pig, advanced transformations can be applied in Hadoop with little manual programming effort, and because Hadoop is a low-cost storage repository, data can be held for months or even years. Since Hadoop has already cleaned and transformed the data, it can be loaded directly into the data warehouse and the meter data management system (MDMS). Using analysis of trends by household, neighborhood, time of day, and local events, marketing is developing additional offers that help consumers save money. And now the in-home display unit can give consumers detailed knowledge of their usage.
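To make the kind of transformation concrete, here is a minimal sketch written against the standard Hadoop MapReduce Java API rather than Pig; the input record layout (comma-separated meterId, ISO timestamp, kWh) and the class names are assumptions for illustration only. It rolls raw interval readings up into a daily total per meter, the sort of refined result that would then be loaded into the warehouse and MDMS.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Aggregates raw smart-meter interval readings (meterId,timestamp,kWh)
    // into one total per meter per day -- a typical "transform in Hadoop,
    // load the refined result into the warehouse" job.
    public class DailyUsage {

        public static class ReadingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",");
                if (f.length < 3 || f[1].length() < 10) return;  // skip malformed records
                double kwh;
                try {
                    kwh = Double.parseDouble(f[2]);
                } catch (NumberFormatException e) {
                    return;                                      // skip unparseable readings
                }
                String day = f[1].substring(0, 10);              // assumes ISO timestamp, e.g. 2012-06-01T13:05
                ctx.write(new Text(f[0] + "," + day), new DoubleWritable(kwh));
            }
        }

        public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable v : values) total += v.get();
                ctx.write(key, new DoubleWritable(total));       // meterId,day -> daily kWh
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "daily-usage");
            job.setJarByClass(DailyUsage.class);
            job.setMapperClass(ReadingMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A Pig script would express the same grouping in a few lines; either way, the point is that the heavy lifting runs in parallel across the Hadoop cluster before anything reaches the warehouse.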
Note that Hadoop is not an Extract-Transform-Load (ETL) tool; it is a platform that supports running ETL processes in parallel. Data integration vendors do not compete with Hadoop; rather, Hadoop is another channel through which their data transformation modules can run.
As corporations start using larger amounts of data, migrating it over the network for transformation or analysis becomes unrealistic; moving terabytes from one system to another daily can bring the wrath of the network administrator down on a programmer. It makes more sense to push the processing to the data. Moving all the raw data to one storage area network (SAN) or ETL server is infeasible at big data volumes, and even if you can move the data, processing it is slow, limited by SAN bandwidth, and often fails to meet batch processing windows. With Hadoop, raw data is loaded once onto low-cost commodity servers, and only the higher-value refined results are passed to other systems. ETL processing runs in parallel across the entire cluster, resulting in much faster operations than pulling data from a SAN into a collection of ETL servers. Using Hadoop, data does not get loaded into a SAN only to be pulled back out across the network multiple times, once for each transformation.
It should be no surprise that many Hadoop systems sit side by side with data warehouses. These systems serve different purposes and complement one another. For example:
A major brokerage firm uses Hadoop to preprocess raw clickstreams generated by customers using its website. Processing these clickstreams provides valuable insight into customer preferences, which are passed to a data warehouse. The data warehouse then couples these customer preferences with marketing campaigns and recommendation engines to offer investment suggestions and analysis to consumers. There are other approaches to investigative analytics on clickstream data using analytic platforms; see “MapReduce and the Data Scientist” for more details.
An eCommerce service uses Hadoop for machine learning to detect fraudulent supplier websites. The fraudulent sites exhibit patterns that Hadoop uses to produce a predictive model. The model is copied into the data warehouse, where it is used to find sales activity that matches the pattern. Once a match is found, the supplier is investigated and potentially discontinued.
Complex Hadoop jobs can use the data warehouse as a data source, simultaneously leveraging the massively parallel capabilities of both systems. Any MapReduce program can issue SQL statements to the data warehouse. From this perspective, a MapReduce program is “just another program,” and the data warehouse is “just another database.” Now imagine 100 MapReduce programs concurrently accessing 100 data warehouse nodes in parallel; both the raw processing and the data warehouse scale to meet any big data challenge. Inevitably, visionary companies will take this step to achieve competitive advantages.
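As a hedged sketch of that pattern, the mapper below opens a JDBC connection to the warehouse and issues a lookup query for each input record; the JDBC URL, credentials, table, and column names are placeholders rather than any particular vendor’s interface. In practice, each of the many concurrent map tasks would hold its own connection to a warehouse node, which is what makes the “100 programs against 100 nodes” picture possible.

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a mapper that treats the data warehouse as "just another database":
    // each map task looks up warehouse data for the records it is processing.
    // The JDBC URL, credentials, and schema below are illustrative placeholders.
    public class WarehouseLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Connection conn;
        private PreparedStatement lookup;

        @Override
        protected void setup(Context ctx) throws IOException {
            try {
                conn = DriverManager.getConnection(
                        "jdbc:exampledb://warehouse-host/usage", "etl_user", "secret");
                lookup = conn.prepareStatement(
                        "SELECT plan_name FROM billing_plan WHERE household_id = ?");
            } catch (SQLException e) {
                throw new IOException("Cannot reach the data warehouse", e);
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String householdId = line.toString().split(",")[0];
            try {
                lookup.setString(1, householdId);
                try (ResultSet rs = lookup.executeQuery()) {
                    if (rs.next()) {
                        // Join the Hadoop-side record with the warehouse-side attribute.
                        ctx.write(new Text(householdId), new Text(rs.getString("plan_name")));
                    }
                }
            } catch (SQLException e) {
                throw new IOException("Warehouse query failed", e);
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException {
            try {
                if (lookup != null) lookup.close();
                if (conn != null) conn.close();
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }
    }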