Problem 2: Do a side-by-side comparison of Cascading and the following technologies with regard to writing Hadoop applications. Make sure you include the advantages and disadvantages of each, as well as when to use each technology over the other.
Answer:
Cascalog was created for developers who want to…
Cascalog queries run as a series of MapReduce jobs. Through Cascading’s Tap abstraction, queries can read from HDFS, various databases, and the local filesystem.
Cascalog data processing code can be written in Clojure or Java. Cascalog is mainly used for processing “Big Data” on Hadoop and for analysing data residing on a local computer. Like Pig, Hive, and Cascading, Cascalog is a data processing tool. The major difference is that Cascalog operates at a significantly higher level of abstraction than those tools.
Cascading provides a set of high-level APIs that internally
call the Hadoop MapReduce framework and launch MapReduce jobs. It
allows any Java developer to write simple Java programs and solve a
MapReduce problem using simple constructs such as Grouping,
Aggregate, and Function.
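To make the three constructs named above concrete, here is a minimal word-count sketch. It does not use the Cascading API. It uses plain JDK streams as a stand-in, and the class and method names (PipelineSketch, splitWords, countWords) are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Stand-in sketch (NOT the Cascading API): the Function, Grouping, and
// Aggregate steps of a word count, written with plain JDK streams.
class PipelineSketch {

    // "Function" step: turn one input line into zero or more lowercase words.
    static List<String> splitWords(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // "Grouping" and "Aggregate" steps: group identical words, then count
    // the members of each group. A TreeMap keeps the keys sorted.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> splitWords(line).stream())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hadoop runs jobs", "Cascading plans Hadoop jobs");
        // prints {cascading=1, hadoop=2, jobs=2, plans=1, runs=1}
        System.out.println(countWords(lines));
    }
}
```

The point of the sketch is the shape of the program: the developer names the steps and the framework decides how they run. In Cascading that planning step produces MapReduce jobs. Here everything runs in a single JVM.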
The Cascading framework provides an abstraction layer on top of
Hadoop and allows enterprises to leverage existing skills and
resources to build data processing applications on Apache Hadoop,
without specialized Hadoop skills.
Because Cascading is Java-based, it fits naturally with JVM-based
languages like Scala, Clojure, JRuby, Jython, and Groovy. Within
many of these languages, scripting and query languages have
been created that simplify ad-hoc and production-ready analytics as
well as machine learning applications.
Core MapReduce programming, on the other hand, requires a
developer to understand MapReduce constructs like partitioning,
sort, shuffle, and map/reduce, which take time to learn and
can lengthen the cycle of getting up to speed.
The advantage of writing a core MapReduce job is that, if you know
it well, you have finer-grained control over your flow and can
probably write more efficient flows.
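The constructs a core MapReduce developer has to reason about (map, partitioning, shuffle and sort, reduce) can be sketched in one process as follows. This is an illustration only, not Hadoop code: the class MapReducePhases and its methods are hypothetical names, and the partitioning rule is a simple hash-modulo assumption in the spirit of a default hash partitioner.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the MapReduce phases (NOT the Hadoop API).
class MapReducePhases {

    // Map phase: emit a (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Partitioning: choose which reducer receives a key (hash modulo, assumed).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Shuffle and sort: route each pair to its partition and collect the
    // values per key; the TreeMap keeps keys sorted within a partition.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> pairs, int numReducers) {
        List<TreeMap<String, List<Integer>>> parts = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            parts.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> p : pairs) {
            parts.get(partition(p.getKey(), numReducers))
                    .computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                    .add(p.getValue());
        }
        return parts;
    }

    // Reduce phase: sum all values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each key.
    static Map<String, Integer> run(List<String> lines, int numReducers) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (TreeMap<String, List<Integer>> part : shuffle(pairs, numReducers)) {
            for (Map.Entry<String, List<Integer>> e : part.entrySet()) {
                result.put(e.getKey(), reduce(e.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hadoop runs jobs", "Cascading plans Hadoop jobs");
        // prints {cascading=1, hadoop=2, jobs=2, plans=1, runs=1}
        System.out.println(run(lines, 2));
    }
}
```

Even for a word count, every phase is written out by hand. That is the learning cost described above, and also the source of the extra control: each phase can be replaced or tuned independently.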
Cascading is a proven application development platform for building
Big Data applications on Apache Hadoop. Whether solving simple or
complex data problems, Cascading balances an optimal level of
abstraction with the necessary degrees of freedom through a
computation engine, a systems integration framework, and data
processing and scheduling capabilities.