(a) What is an RDD in Spark? What is it used for?
(b) What is In-Memory Computing? Briefly describe its
advantages.
a.) RDD stands for Resilient Distributed Dataset. It is an immutable collection of objects that can be processed in parallel. Apache Spark uses RDDs to run computations in parallel across the nodes of a cluster, for example a Hadoop cluster. An RDD holds arbitrary objects rather than rows with a fixed schema; the tabular view of data in Spark comes from DataFrames, which are built on top of RDDs. An RDD is divided into smaller chunks of data called partitions, and different partitions are given to different nodes of the cluster for processing. RDDs are also fault tolerant: each RDD records the transformations used to build it (its lineage), so a partition that is lost when a node fails can be recomputed.
Use: RDDs are used on big data clusters to manipulate very large data sets. With an RDD we can distribute the processing task across different nodes and cut down the processing time.
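The idea of splitting a dataset into chunks and processing them in parallel can be sketched in plain Python. This is only a toy model and not Spark's API: the function names `split_into_partitions` and `parallel_map_reduce` are made up for illustration, and worker threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a sequence into num_partitions roughly equal, immutable chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(tuple(data[start:end]))  # tuples: chunks are never mutated
        start = end
    return partitions


def parallel_map_reduce(partitions, map_fn, reduce_fn, initial):
    """Apply map_fn to every element, one worker per partition, then combine."""
    def process(partition):
        return reduce(reduce_fn, (map_fn(x) for x in partition), initial)

    # Each worker thread stands in for a cluster node handling one partition.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = list(pool.map(process, partitions))
    return reduce(reduce_fn, partial_results, initial)


numbers = list(range(1, 11))
partitions = split_into_partitions(numbers, 3)
sum_of_squares = parallel_map_reduce(
    partitions, lambda x: x * x, lambda a, b: a + b, 0
)
print(partitions)      # [(1, 2, 3, 4), (5, 6, 7), (8, 9, 10)]
print(sum_of_squares)  # 385
```

Because the source data and the function applied to it are both known, a chunk whose worker failed could simply be processed again on another worker. That is the same idea behind RDD fault tolerance.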
b.) In-memory computing means keeping data in main memory (RAM) and processing it there, instead of reading it from disk each time it is needed. It has several advantages, the main one being faster processing.
Advantages:
1.) As rough figures, an HDD can read data at about 128 MB/s and an SSD at about 512 MB/s, while RAM (main memory) can read data at 2000 MB/s or more, and often much faster on current hardware. At these rates, reading 1000 MB takes about 8 s from an HDD, about 2 s from an SSD, and 0.5 s or less from RAM, so data is retrieved much faster.
2.) In-memory computing is also useful in databases, two common uses being caching data and storing sessions.
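The caching idea can be illustrated with a small Python sketch: data is read from disk once, kept in RAM, and reused by later computations. The class name `InMemoryDataset` and the `disk_reads` counter are hypothetical, introduced only for this example.

```python
import os
import tempfile


class InMemoryDataset:
    """Loads a file from disk once and serves later requests from RAM."""

    def __init__(self, path):
        self.path = path
        self._lines = None   # filled on first access, then reused
        self.disk_reads = 0  # counts how often the disk was actually read

    def lines(self):
        if self._lines is None:
            with open(self.path) as f:
                self._lines = [line.rstrip("\n") for line in f]
            self.disk_reads += 1
        return self._lines


# Create a small sample file to stand in for data stored on disk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("3\n1\n4\n1\n5\n")
    path = tmp.name

dataset = InMemoryDataset(path)
total = sum(int(x) for x in dataset.lines())    # first call reads from disk
largest = max(int(x) for x in dataset.lines())  # second call is served from RAM
os.remove(path)

print(total, largest, dataset.disk_reads)  # 14 5 1
```

Two computations ran over the data, but the disk was read only once. The second computation used the copy already held in memory.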