Question

Develop code in a Scala Maven project to monitor a folder in HDFS in real time such that any new file in the folder will be processed. For each RDD in the stream, the following subtasks are performed concurrently:

(a) Count the word frequency and save the output in HDFS. Note: for each word, make sure space (" "), comma (","), semicolon (";"), colon (":"), period ("."), apostrophe ("'"), quotation mark ('"'), exclamation mark ("!"), question mark ("?"), and brackets ("[", "{", "(", "<", "]", ")", "}", ">") are trimmed.

(b) Filter out the short words (i.e., < 5 characters) and save the output in HDFS.

(c) Count the co-occurrence of words in each RDD where the context is the same line; and save the output in HDFS.

Solutions

Expert Solution

Answer

Given: for each RDD in the stream, subtasks (a), (b) and (c) are performed concurrently.

Introduction:

* The Resilient Distributed Dataset (RDD), a distributed collection of items, is Spark's primary abstraction.

* RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.

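Step 1: Initial Configuration

· To read data from Hadoop's HDFS, add a dependency on hadoop-client for your version of HDFS. The line below is sbt syntax; in a Maven project, declare the same coordinates (groupId org.apache.hadoop, artifactId hadoop-client) as a <dependency> entry in pom.xml:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"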

· Open Scala IDE. From the menu, choose File -> New -> Project -> Maven and select "Create a simple project". Use the default workspace location.

· Click Next to go to the POM settings page.

· Enter Group Id = ca.sparkera.spark and Artifact Id = WordCount.

· Click Finish.

· Open the pom.xml file in the editor and replace its contents with the POM for this project (Spark and Hadoop dependencies). Save the file. The JAR files are downloaded automatically and the workspace is built.

· Right-click the project -> Configure -> Add Scala Nature.

· Right-click the project -> Properties -> Scala Compiler -> Use Project Settings -> Scala Installation: Latest 2.10 bundle (dynamic) to update the Scala compiler version.

· Right-click the source folder src/main/java -> Refactor -> Rename and rename it to src/main/scala. Create a package under it, for example ca.sparkera.spark.

· Right-click the package -> New -> Scala Object to create a Scala object in that package, for example WordCount.scala, and give WordCount as the object name.

· Paste the code from Step 2 into WordCount.scala.

· Create a text file, and create a directory in HDFS where the text file will be stored: $ hdfs dfs -mkdir /spark

· Upload the text file to that HDFS directory: $ hdfs dfs -put /home/sparkdata.txt /spark
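
The steps above ask for pom.xml to be replaced, so here is a minimal sketch of what such a file might look like. The Group Id and Artifact Id come from the steps above. Everything else is an assumption: the project version, the Spark version 1.6.3 (chosen only because it is published for Scala 2.10), and the hadoop.version placeholder, which you must set to match your cluster. The spark-streaming dependency is only needed if the job uses the streaming API to monitor a folder, as the question requires.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Coordinates entered in the Scala IDE wizard -->
  <groupId>ca.sparkera.spark</groupId>
  <artifactId>WordCount</artifactId>
  <version>0.0.1-SNAPSHOT</version> <!-- assumed project version -->

  <properties>
    <!-- Assumed versions: align these with your own cluster -->
    <scala.binary.version>2.10</scala.binary.version>
    <spark.version>1.6.3</spark.version>
    <hadoop.version>REPLACE_WITH_YOUR_HDFS_VERSION</hadoop.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- Only needed when using StreamingContext / textFileStream -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</project>
```

Scala IDE compiles the Scala sources itself once the Scala nature is added. Building from the command line with mvn would also need a Scala compiler plugin in the build section, which this sketch leaves out.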

Step 2: Code

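package ca.sparkera.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits; required on Spark versions before 1.3

object WordCount {

  def main(args: Array[String]): Unit = {
    // Expect the input path as the only argument.
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    val conf = new SparkConf()
    conf.setAppName("SparkWordCount").setMaster("local")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile(args(0))

    rdd.flatMap(_.split(" "))      // split each line into words on single spaces
      .map((_, 1))                 // pair each word with a count of 1
      .reduceByKey(_ + _)          // sum the counts per word
      .map(x => (x._2, x._1))      // swap to (count, word) so the data can be sorted by count
      .sortByKey(false)            // sort in descending order of count
      .map(x => (x._2, x._1))      // swap back to (word, count)
      .saveAsTextFile("SparkWordCountResult")

    sc.stop()
  }
}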
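
The batch job above splits lines on single spaces only, so it does not yet apply the trimming rule from (a), the length filter from (b), or the per-line co-occurrence count from (c). Here is a minimal sketch of those three pieces as plain Scala functions, with no Spark dependency, so they can be tried in the REPL first. The object name TextOps and all function names are hypothetical. Two points are assumptions, not something the question states: "co-occurrence" is read as every ordered pair of word positions on the same line, and typographic quotes are added to the trim set.

```scala
object TextOps {
  // Characters trimmed from both ends of a word, as listed in subtask (a).
  // The curly quotes (\u2018 \u2019 \u201C \u201D) are an added assumption.
  val TrimChars: Set[Char] = " ,;:.'\"!?[]{}()<>\u2018\u2019\u201C\u201D".toSet

  // Strip the listed characters from the start and the end of one word.
  def trimWord(w: String): String =
    w.dropWhile(TrimChars).reverse.dropWhile(TrimChars).reverse

  // Split a line on whitespace, trim every token, and drop tokens that end up empty.
  def tokenize(line: String): List[String] =
    line.split("\\s+").map(trimWord).filter(_.nonEmpty).toList

  // (a) Word frequency over a collection of lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // (b) Keep only words with at least 5 characters.
  def longWords(words: Seq[String]): Seq[String] =
    words.filter(_.length >= 5)

  // (c) All ordered pairs of words at different positions in one line.
  def pairsInLine(words: Seq[String]): Seq[(String, String)] = {
    val indexed = words.zipWithIndex
    for ((a, i) <- indexed; (b, j) <- indexed if i != j) yield (a, b)
  }

  // (c) Co-occurrence counts, where the context is a single line.
  def cooccurrence(lines: Seq[String]): Map[(String, String), Int] =
    lines.flatMap(l => pairsInLine(tokenize(l)))
      .groupBy(identity)
      .map { case (p, ps) => (p, ps.size) }
}
```

For example, TextOps.tokenize("Hello, (world)!") yields List("Hello", "world"). The same functions can be passed to RDD transformations such as flatMap once the logic is moved into a Spark job.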

Step 3: Run the Project

Right-click WordCount.scala -> Run As -> Run Configurations -> Arguments to run the application. Add the input file path as the program argument, for example /Users/will/Downloads/testdata/. The output files containing the word count result are written to the location shown in the console output.

Step 4: Output

$ ls -l /Users/will/workspace/WordCount/SparkWordCountResult
total 1248
-rw-r--r-- 1 will staff      0 17 Oct 23:10 _SUCCESS
-rw-r--r-- 1 will staff  42669 17 Oct 23:10 part-00000
-rw-r--r-- 1 will staff  15600 17 Oct 23:10 part-00001
-rw-r--r-- 1 will staff 129997 17 Oct 23:10 part-00002
-rw-r--r-- 1 will staff 443519 17 Oct 23:10 part-00003
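
The WordCount job reads its input once and exits, whereas the question asks for a folder to be monitored in real time. Below is a hedged sketch of how that could look with Spark Streaming's textFileStream, running the three subtasks concurrently by submitting them from separate Scala Futures. It is not the original author's solution. The object name StreamingWordTasks, the 10-second batch interval, the output directory layout, the added curly quotes in the trim set, and the reading of co-occurrence as ordered pairs of word positions on one line are all assumptions. It also assumes Spark 1.3 or later with the spark-streaming artifact on the classpath.

```scala
package ca.sparkera.spark

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object StreamingWordTasks {
  // Characters trimmed from both ends of each word (curly quotes are an added assumption).
  private val trimChars: Set[Char] = " ,;:.'\"!?[]{}()<>\u2018\u2019\u201C\u201D".toSet

  // Split a line on whitespace, trim each token, and drop empty tokens.
  def tokenize(line: String): Array[String] =
    line.split("\\s+")
      .map(w => w.dropWhile(trimChars).reverse.dropWhile(trimChars).reverse)
      .filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StreamingWordTasks <hdfs-input-dir> <hdfs-output-dir>")
      System.exit(1)
    }
    val Array(inputDir, outputDir) = args.take(2)

    // "local[*]" is for running inside the IDE; drop setMaster when submitting to a cluster.
    val conf = new SparkConf().setAppName("StreamingWordTasks").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed batch interval

    // Each batch contains the lines of files that appeared in inputDir since the last batch.
    val lines = ssc.textFileStream(inputDir)

    lines.foreachRDD { (rdd: RDD[String], time: Time) =>
      if (!rdd.isEmpty()) {
        // One array of trimmed words per line, cached because all three tasks reuse it.
        val words = rdd.map(tokenize).cache()
        val suffix = time.milliseconds // keeps output paths unique per batch

        // (a) Word frequency.
        val taskA = Future {
          words.flatMap(_.toSeq).map((_, 1)).reduceByKey(_ + _)
            .saveAsTextFile(s"$outputDir/wordcount-$suffix")
        }
        // (b) Drop words shorter than 5 characters and save the rest.
        val taskB = Future {
          words.flatMap(_.toSeq).filter(_.length >= 5)
            .saveAsTextFile(s"$outputDir/longwords-$suffix")
        }
        // (c) Co-occurrence counts with the line as context.
        val taskC = Future {
          words.flatMap { ws =>
            for (i <- ws.indices; j <- ws.indices if i != j) yield ((ws(i), ws(j)), 1)
          }.reduceByKey(_ + _)
            .saveAsTextFile(s"$outputDir/cooccurrence-$suffix")
        }

        // Wait for all three jobs before releasing the cached data.
        Await.result(Future.sequence(Seq(taskA, taskB, taskC)), Duration.Inf)
        words.unpersist()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the HDFS directory from Step 1, the program arguments could be /spark as the input folder and any writable HDFS path as the output folder. textFileStream only notices files that arrive after the job has started, so files should be moved or renamed into the monitored folder once they are fully written.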

