Business Case
Your company needs to analyze a large dataset that is 10 times bigger than the hard drive of your most powerful computer. The dataset contains records of ATM transactions. The security team has provided a smaller dataset of suspicious transactions; your task is to identify transactions similar to the suspicious ones.
Directions
Explain which parallel computing techniques would be appropriate for this case and why. Explain which component of the Hadoop ecosystem may be used in this case, and why.
Examples of parallel computing approaches include distributed calculations vs. MapReduce, or others.
Consider the key components of the Apache Hadoop environment (Hive, Spark, Pig, etc.) or others.
An appropriate parallel computing approach for this case is distributed machine learning and data mining, run across a cluster of machines.
Data mining discovers meaningful patterns, turning raw data into information. Patterns that prove useful are not merely information but a source of knowledge: insight that was previously hidden inside the huge volume of data is revealed and converted into actionable knowledge.
Using either supervised or unsupervised learning, one can build a model for suspicious transactions. In supervised learning, a labelled training set (here, the security team's smaller dataset of known suspicious transactions) is provided in advance so the model can be trained; whenever a new transaction matches the learned pattern of suspicious activity, it is reported. Unsupervised learning is preferred for anomaly detection: if an activity occurs that deviates from normal behaviour, it is flagged. A minimal sketch of the unsupervised approach is shown below.
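As a rough sketch only, here is how the unsupervised anomaly-detection idea might look in Python with scikit-learn's IsolationForest. The feature layout (amount, hour of day, number of recent withdrawals) is a hypothetical simplification chosen for illustration, not the actual schema of the dataset.

# Sketch: unsupervised anomaly detection on ATM transaction features.
# Feature names and values below are illustrative assumptions, not real data.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one transaction: [amount, hour_of_day, withdrawals_last_24h]
normal_transactions = np.array([
    [40.0, 12, 1],
    [60.0, 18, 2],
    [20.0, 9, 1],
    [100.0, 14, 1],
])

# Train on (mostly) normal traffic; the model learns what "usual" looks like.
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(normal_transactions)

# New transactions are scored; a label of -1 marks an anomaly to report.
new_transactions = np.array([
    [55.0, 13, 1],     # looks ordinary
    [900.0, 3, 8],     # large amount at 3 a.m. with many recent withdrawals
])
labels = model.predict(new_transactions)
for tx, label in zip(new_transactions, labels):
    status = "SUSPICIOUS" if label == -1 else "ok"
    print(tx, status)

In practice the model would be trained and applied in a distributed fashion (for example via Spark MLlib), since the full dataset does not fit on one machine; the snippet only illustrates the anomaly-detection principle.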
MapReduce, a component of the Hadoop ecosystem, can be used in this case for the following reasons:
It is the core processing component, and it provides the processing logic. Using MapReduce, we can write applications that process large datasets with parallel, distributed algorithms across a cluster, so a dataset 10 times larger than a single disk never needs to fit on one machine. This parallel processing is exactly what Big Data analysis requires. MapReduce has two main functions:
Map: The map function transforms one dataset into another, breaking individual elements down into key/value pairs (tuples).
Reduce: The reduce function takes the map output as its input; its main job is to produce an aggregated, summarized result. A sketch of both functions appears below.
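As a rough illustration, here is how the two functions might look in a Hadoop Streaming job written in Python. The record layout (card_id, timestamp, amount) is a hypothetical assumption made only for this example; the real dataset's schema would differ.

mapper.py (Map step):

#!/usr/bin/env python3
# Sketch of the Map step for Hadoop Streaming.
# Assumes each input line is a CSV record: card_id,timestamp,amount
# (a hypothetical layout used only for illustration).
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) < 3:
        continue                      # skip malformed records
    card_id, _timestamp, amount = fields[0], fields[1], fields[2]
    # Emit key/value pairs: card_id -> transaction amount
    print(f"{card_id}\t{amount}")

reducer.py (Reduce step):

#!/usr/bin/env python3
# Sketch of the Reduce step: aggregate count and total amount per card.
# Hadoop delivers the mapper output to the reducer grouped and sorted by key.
import sys

current_card, total, count = None, 0.0, 0

def emit(card, total, count):
    print(f"{card}\t{count}\t{total:.2f}")

for line in sys.stdin:
    if not line.strip():
        continue
    card, amount = line.strip().split("\t")
    if card != current_card:
        if current_card is not None:
            emit(current_card, total, count)
        current_card, total, count = card, 0.0, 0
    total += float(amount)
    count += 1

if current_card is not None:
    emit(current_card, total, count)

The job would then be submitted with the Hadoop Streaming jar, pointing -mapper and -reducer at these scripts; the per-card summaries produced by the reducers could afterwards be compared against the security team's suspicious-transaction set.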