In: Operations Management
Hadoop is one of the feasible and affordable solutions for big data analytics. Its success rests on the numerous add-on products across its four functional areas. Describe what the following add-on products are and how each can help in big data analytics, in around 50 to 100 words each.
1) Pig
2) Spark
3) Storm
4) Atlas
5) Flume
6) Solr
7) HBase
8) Oozie
Hadoop-Pig
Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL. Apache Pig lets people focus more on analysing bulk data sets and spend less time writing MapReduce programs. Just as pigs eat anything, the Pig programming language is designed to work on any kind of data; that is why it is named Pig.
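To illustrate what a one-line Pig Latin GROUP ... COUNT statement saves you from writing by hand, here is a plain-Python sketch of the same aggregation over a small, made-up in-memory dataset (in Pig the records would be LOADed from HDFS):

```python
from collections import Counter

# Hypothetical records; in Pig these would be LOADed from HDFS.
records = [
    ("alice", "click"), ("bob", "click"), ("alice", "view"),
    ("alice", "click"), ("bob", "view"),
]

# Rough equivalent of the Pig Latin:
#   grouped = GROUP records BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(records);
counts = Counter(user for user, _ in records)

print(dict(counts))  # {'alice': 3, 'bob': 2}
```

On a real cluster, Pig compiles such a script into one or more MapReduce jobs automatically.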
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. It is an open-source cluster computing framework with APIs in Scala, Java, Python and R, and a general-purpose distributed computing engine used for processing and analysing large amounts of data. Spark is still maturing and lacks some important enterprise-grade features.
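The transformations Spark distributes across a cluster follow a functional pattern that can be sketched in plain Python. The word-count below mirrors a PySpark chain like `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`; the data and names are illustrative, and no cluster is involved:

```python
# Stand-in for lines read from a distributed file.
lines = ["spark is fast", "spark is general purpose"]

words = [w for line in lines for w in line.split()]  # flatMap: line -> words
pairs = [(w, 1) for w in words]                      # map: word -> (word, 1)

counts = {}
for word, n in pairs:                                # reduceByKey: sum counts
    counts[word] = counts.get(word, 0) + n

print(counts["spark"])  # 2
```

Spark's advantage is that each of these stages runs in parallel across many machines, with intermediate results kept in memory rather than written to disk between stages.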
Apache Storm
Apache Storm is a free and open-source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use. Twitter, for example, has used Storm as its realtime processing pipeline; other users include FullContact, Inc. (US) and Lookout, Inc. (US).
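Storm structures a computation as a topology: a spout emits an unbounded stream of tuples and bolts process them one at a time. The generator pipeline below is only a conceptual sketch of that model, not the Storm API (which is Java-based, with other languages supported via its multi-lang protocol); all names are made up:

```python
def sentence_spout(sentences):
    """Emit tuples one at a time, like a Storm spout."""
    for s in sentences:
        yield s

def split_bolt(stream):
    """Split each incoming sentence tuple into word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Keep a running count per word as tuples arrive."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the spout and bolts together, as a topology would.
topology = count_bolt(split_bolt(sentence_spout(["storm is fast", "storm scales"])))
print(topology)
```

In real Storm the stream never ends and the counts would be emitted continuously rather than returned at the end.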
Hadoop-Atlas
Atlas is a scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allowing integration with the whole enterprise data ecosystem. It is a data governance tool which facilitates gathering, processing and maintaining metadata. Unlike spreadsheets and wiki docs, it has functioning components which can monitor your data processes, data stores and files, and record updates in a metadata repository. Other popular data governance tools include IBM Data Governance, Talend and Collibra.
Apache Flume
Apache Flume is a system used for moving massive quantities of streaming data into HDFS. Collecting log data from web-server log files and aggregating it in HDFS for analysis is one common use case of Flume. It is a distributed, reliable and available system for efficiently collecting, aggregating and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
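Flume's architecture is a pipeline of source → channel → sink: the source ingests events, the channel buffers them, and the sink drains them into storage such as HDFS. A toy model of that flow, with every name illustrative (real Flume agents are configured declaratively, not coded like this):

```python
from collections import deque

channel = deque()  # buffers events between source and sink, like a Flume channel

def source(log_lines):
    """Like a Flume source: wrap raw log lines as events and enqueue them."""
    for line in log_lines:
        channel.append({"body": line})

def sink(store):
    """Like an HDFS sink: drain buffered events into durable storage."""
    while channel:
        store.append(channel.popleft()["body"])

hdfs = []  # stand-in for a file in HDFS
source(["GET /index.html 200", "GET /missing 404"])
sink(hdfs)
print(len(hdfs))  # 2
```

The channel is what gives Flume its reliability: if the sink is slow or down, events wait in the channel instead of being lost.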
Hadoop-Solr
Apache Solr is an open-source search platform built upon a Java library called Lucene. Solr is a popular search platform for websites because it can index and search multiple sites and return recommendations for related content based on the search query's taxonomy. Solr can be used along with Hadoop: as Hadoop handles a large amount of data, Solr helps us find the required information in such a large source.
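The core data structure behind Lucene and Solr is the inverted index: a map from each term to the documents containing it, so lookups avoid scanning every document. A minimal sketch with made-up documents (analyzers, fields and relevance scoring are all omitted):

```python
docs = {
    1: "hadoop handles large data",
    2: "solr searches large indexes",
}

index = {}  # term -> set of document ids (the inverted index)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def search(term):
    """Return the ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))

print(search("large"))  # [1, 2]
```

Because the index is keyed by term, query time depends on the number of matching documents, not the total size of the collection.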
Hadoop-HBase
Apache HBase is the Hadoop database. It is a distributed, scalable, big data store. It is a subproject of the Apache Hadoop project and is used to provide real-time read and write access to your big data. HBase is called the Hadoop database because it is a NoSQL database that runs on top of Hadoop. It combines the scalability of Hadoop, by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and the deep analytic capabilities of MapReduce.
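Conceptually, an HBase table is a sorted map from row key to columns (named `family:qualifier`) to values, which is what makes single-row reads and writes fast. The dict-based sketch below shows only that data model; the real client API (for example the Java `Table.put`/`get` calls) looks different, and all names here are made up:

```python
table = {}  # row key -> {"family:qualifier": value}

def put(row, column, value):
    """Write one cell, like an HBase Put."""
    table.setdefault(row, {})[column] = value

def get(row, column):
    """Read one cell by row key, like an HBase Get; None if absent."""
    return table.get(row, {}).get(column)

put("user#1001", "info:name", "Alice")
put("user#1001", "info:city", "Hong Kong")

print(get("user#1001", "info:name"))  # Alice
```

Real HBase adds timestamped versions per cell and keeps rows sorted by key across many region servers, so the same lookup pattern scales to billions of rows.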
Hadoop-Oozie
Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural centre, and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive and Apache Sqoop. Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is a scalable, reliable and extensible system.
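An Oozie Workflow being a DAG of actions means each action runs only after the actions it depends on have succeeded. The sketch below simulates that scheduling order in Python; real workflows are defined in a `workflow.xml` file, and the action names here are invented:

```python
workflow = {                 # action -> the actions it depends on
    "ingest": [],
    "pig-clean": ["ingest"],
    "hive-report": ["pig-clean"],
    "sqoop-export": ["hive-report"],
}

def run(dag):
    """Execute actions in dependency order, as a workflow engine would."""
    done, order = set(), []
    while len(done) < len(dag):
        for action, deps in dag.items():
            if action not in done and all(d in done for d in deps):
                order.append(action)   # "execute" the action here
                done.add(action)
    return order

print(run(workflow))  # ['ingest', 'pig-clean', 'hive-report', 'sqoop-export']
```

A Coordinator adds the outer trigger: it launches a whole Workflow like this on a schedule, or when its input data becomes available.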