Create a Java program to find the mean and standard deviation of a large set of data by breaking it down into smaller sets and using threads to process each smaller data set. The simulation will use only 3 threads, and each data set will contain at most 100 integers. Create two classes: HadoopSim for the thread tasks and HadoopDriver for the main routine.
(Then the main routine must gather up the results of each thread and compute the overall average and standard deviation. This exercise will NOT require the use of the synchronized keyword, as each thread will be working only on its own integer array in memory, i.e., there is no shared data.)
How to calculate the mean and standard deviation:
1. Compute mean - add up all of the numbers and divide that sum by the number of numbers (N)
2. Compute standard deviation as follows:
a. Subtract the mean from each number in the set and square the differences;
b. Sum up all of the squares of differences;
c. Divide this sum by the number of numbers minus 1 (N - 1); and,
d. Take the (positive) square root of the result.
Note that the population formula for σ divides the sum of squares by N; this is appropriate when computing the standard deviation of a population. Instead, use the formula for the standard deviation of a sample, which divides by N - 1 (step c above).
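The steps above can be sketched for a single in-memory array as follows. This is only an illustration and not one of the classes the assignment asks for: the class name SampleStats and its method names are assumptions made for the example, and the array must hold at least two values for the N - 1 division to make sense.

```java
public class SampleStats {
    // Step 1: mean = sum of the values divided by N.
    static double mean(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return (double) sum / data.length;
    }

    // Steps 2a-2b: square each difference from the mean and add them up.
    static double sumDiffsSq(int[] data, double mean) {
        double total = 0.0;
        for (int v : data) {
            double diff = v - mean;
            total += diff * diff;
        }
        return total;
    }

    // Steps 2c-2d: divide by N - 1 (sample formula), then take the square root.
    static double sampleStdDev(int[] data) {
        return Math.sqrt(sumDiffsSq(data, mean(data)) / (data.length - 1));
    }

    // Population formula (divides by N), included only for contrast.
    static double populationStdDev(int[] data) {
        return Math.sqrt(sumDiffsSq(data, mean(data)) / data.length);
    }

    public static void main(String[] args) {
        int[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.printf("mean = %.2f%n", mean(data));
        System.out.printf("sample sd = %.4f%n", sampleStdDev(data));
        System.out.printf("population sd = %.4f%n", populationStdDev(data));
    }
}
```

For this array the sum of squared differences is 32, so the population formula gives exactly 2 while the sample formula gives the square root of 32/7, which is slightly larger.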
The constructor for the HadoopSim class is passed a file name (the file contains one of the smaller sets of integers). It opens the file for reading, which may throw FileNotFoundException; if this happens, the constructor should output an error message and exit the application with System.exit(). It then reads the contents of the file (a Scanner is recommended) into an array of ints of size 100, not all of which may be used, counting the number of numbers as it reads.
This is the skeleton code for the HadoopSim class:
import java.util.Scanner;
public class HadoopSim implements Runnable
{
private final int SIZE = 100;
private int [] arrayData = new int [SIZE];
private int count = 0;
private int sum = 0;
private double sumDiffsSq = 0.0;
private String fileName;
private Scanner scan;
private double newMean = 0.0;
//Constructor to help read the file using Scanner
public HadoopSim(String filename)
{
Scanner scan = new Scanner(System.in);
}
//count is set when reading the file
public int getCount()
{
}
//sum is set in a thread by the run() method
public int getSum()
{
}
//call after all threads have been completed
public void setNewMean(double m)
{
}
//method to compute each task's sum of differences squared using the new mean
public void setSumDiffSqs()
{
}
//returns the sum of differences squared for the data in each task's array
public double getSumDiffSqs()
{
}
}
Following the instantiation of a HadoopSim object, the data file has been read into memory (in the int array), and the number of numbers has been calculated and stored in the count field. This class also has a method, void run(), that loops through the array of ints and computes its sum, setting the result in the sum instance variable. Thus, the run method simply adds up all of the numbers in the array, but this is the labor-intensive part of the work. See the API documentation for Java's Thread class.
The HadoopDriver, with a main method, is responsible for:
1) prompting the user for the names of 3 files;
2) instantiating 3 HadoopSim objects;
3) instantiating 3 threads, passing each a different task;
4) starting all 3 threads using the Thread method start() (if IllegalThreadStateException is thrown, display an error message and call System.exit());
5) waiting for the 3 threads to finish their run methods, using the Thread join() method (if InterruptedException is thrown, display an error message and call System.exit());
6) adding up the total counts by calling the getCount() method of each task (see below);
7) adding up the total sums by calling the getSum() method of each task (see below);
8) computing the overall mean by dividing the result of step 7) by the result of step 6) (be careful here, as you are dividing two ints and want a double!); this is needed by each task in order to compute the sum of the differences squared;
9) setting the new overall mean in each task by calling the task's setNewMean(double m) method (see below);
10) getting the sum of the differences squared from each task by calling the task's setSumDiffSqs() method and then its getSumDiffSqs() method (see below); and then
11) computing and displaying the overall sample mean and standard deviation as described above.
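The start/join part of this list (steps 3 through 8) can be sketched on in-memory arrays, without any file handling. The names JoinSketch, SumTask, and overallMean are hypothetical and exist only for this illustration; the assignment's own classes are HadoopSim and HadoopDriver.

```java
public class JoinSketch {
    // Hypothetical task: sums its own array inside run(), as HadoopSim is meant to.
    static class SumTask implements Runnable {
        private final int[] data;
        private int sum = 0;

        SumTask(int[] data) {
            this.data = data;
        }

        @Override
        public void run() {
            int total = 0;
            for (int v : data) {
                total += v;
            }
            sum = total;
        }

        int getCount() {
            return data.length;
        }

        int getSum() {
            return sum;
        }
    }

    static double overallMean(int[][] sets) {
        SumTask[] tasks = new SumTask[sets.length];
        Thread[] threads = new Thread[sets.length];
        // Steps 3-4: one thread per task, all started before any is joined.
        for (int i = 0; i < sets.length; i++) {
            tasks[i] = new SumTask(sets[i]);
            threads[i] = new Thread(tasks[i]);
            threads[i].start();
        }
        // Step 5: join() blocks until the thread's run() has finished.
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            System.out.println("Error: interrupted while waiting for a task");
            System.exit(1);
        }
        // Steps 6-8: the sums are safe to read only after join() returns.
        int totalCount = 0;
        int totalSum = 0;
        for (SumTask task : tasks) {
            totalCount += task.getCount();
            totalSum += task.getSum();
        }
        return (double) totalSum / totalCount; // cast avoids int division
    }

    public static void main(String[] args) {
        int[][] sets = {{100, 90, 75, 80}, {99, 92, 60, 75, 70}, {99, 92, 60, 75, 70, 50}};
        System.out.printf("mean = %.2f%n", overallMean(sets));
    }
}
```

Reading getSum() before join() returns could yield 0 or a stale value, because the thread may not have run yet; that is why the joins come before the totals.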
Sample output shown below
In summary, then, once a HadoopSim object has been instantiated and run as a thread, the count and sum are correctly set for the data in the array of integers. Note that the HadoopSim constructor handles opening the file, reading the data into memory (the int array), and setting the count field. Further note that the computation-intensive work is done in the run() method (i.e., in a separate thread). Lastly, note that you will be heavily penalized if your solution does not observe this division of labor. The main routine will gather up the results of each thread and produce the overall mean and standard deviation of the larger set (i.e., the union of the 3 smaller sets). This exercise's architecture is consistent with the Hadoop technique: having different threads or computers do the intensive computation on several smaller data sets, and having the main routine receive the results in order to calculate overall results.
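The reason the main routine can work from per-task results is that counts, sums, and sums of squared differences (taken around the same overall mean) are all additive across the smaller sets. A small check of that claim, using hypothetical names (PartitionCheck, stdDevFromPartitions, stdDevDirect) and no threads, might look like this:

```java
public class PartitionCheck {
    static double sumDiffsSq(int[] data, double mean) {
        double total = 0.0;
        for (int v : data) {
            double diff = v - mean;
            total += diff * diff;
        }
        return total;
    }

    // Sample standard deviation assembled from per-partition pieces.
    static double stdDevFromPartitions(int[][] parts) {
        int count = 0;
        long sum = 0;
        for (int[] p : parts) {
            count += p.length;
            for (int v : p) {
                sum += v;
            }
        }
        double mean = (double) sum / count;
        double total = 0.0;
        for (int[] p : parts) {
            total += sumDiffsSq(p, mean); // each partition uses the overall mean
        }
        return Math.sqrt(total / (count - 1));
    }

    // Sample standard deviation computed over the union in one pass.
    static double stdDevDirect(int[] all) {
        long sum = 0;
        for (int v : all) {
            sum += v;
        }
        double mean = (double) sum / all.length;
        return Math.sqrt(sumDiffsSq(all, mean) / (all.length - 1));
    }

    public static void main(String[] args) {
        int[][] parts = {{100, 90, 75, 80}, {99, 92, 60, 75, 70}, {99, 92, 60, 75, 70, 50}};
        int[] all = {100, 90, 75, 80, 99, 92, 60, 75, 70, 99, 92, 60, 75, 70, 50};
        System.out.printf("partitioned = %.2f, direct = %.2f%n",
                stdDevFromPartitions(parts), stdDevDirect(all));
    }
}
```

Note that the squared differences must use the overall mean, not each partition's own mean; this is why the tasks need the second setNewMean step before they can contribute to the standard deviation.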
Sample output, using the following data files:
sort1.txt:
100 90 75 80
sort2.txt:
99 92 60 75 70
sort3.txt:
99 92 60 75 70 50
Your output (with a couple of debug statements thrown in) might be:
Enter the name of the first file
sort1.txt
Enter the name of the second file
sort2.txt
Enter the name of the third file
sort3.txt
task1 returns count = 4, sum = 345
task2 returns count = 5, sum = 396
task3 returns count = 6, sum = 446
Driver newMean = 79.133333
Driver sumDiffsSq: 571.337778 1026.822222 1875.573333
Totals sum = 1187, count = 15, sumDiffsSq = 3473.733333
Average = 79.13
Standard Deviation = 15.75
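As a check of the arithmetic, the last two lines of this output follow from the totals line, using the sample formula (division by N - 1 = 14):

```latex
\bar{x} = \frac{1187}{15} \approx 79.13,
\qquad
s = \sqrt{\frac{3473.733333}{15 - 1}} \approx \sqrt{248.1238} \approx 15.75
```

Dividing by N = 15 instead would give approximately 15.22, which is the population value rather than the sample value asked for here.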
Here is a solution to the question above:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class HadoopDriver {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String[] ordinals = {"first", "second", "third"};
        HadoopSim[] tasks = new HadoopSim[3];
        Thread[] threads = new Thread[3];
        // Steps 1-3: prompt for each file, build its task, and wrap the task in a thread
        for (int i = 0; i < 3; i++) {
            System.out.println("Enter the name of the " + ordinals[i] + " file");
            tasks[i] = new HadoopSim(in.nextLine().trim());
            threads[i] = new Thread(tasks[i]);
        }
        // Step 4: start all threads
        try {
            for (Thread t : threads) {
                t.start();
            }
        } catch (IllegalThreadStateException e) {
            System.out.println("Error: could not start a thread");
            System.exit(1);
        }
        // Step 5: wait for every run() to finish before reading any sum
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            System.out.println("Error: interrupted while waiting for a thread");
            System.exit(1);
        }
        // Steps 6-7: total the counts and sums
        int totalCount = 0;
        int totalSum = 0;
        for (int i = 0; i < 3; i++) {
            totalCount += tasks[i].getCount();
            totalSum += tasks[i].getSum();
            System.out.printf("task%d returns count = %d, sum = %d%n",
                    i + 1, tasks[i].getCount(), tasks[i].getSum());
        }
        if (totalCount < 2) {
            System.out.println("Error: need at least two numbers for a sample standard deviation");
            System.exit(1);
        }
        // Step 8: overall mean (cast so the division is not done in ints)
        double newMean = (double) totalSum / totalCount;
        System.out.printf("Driver newMean = %.6f%n", newMean);
        // Steps 9-10: give each task the overall mean and collect its sum of squared differences
        double totalSumDiffsSq = 0.0;
        StringBuilder line = new StringBuilder("Driver sumDiffsSq:");
        for (HadoopSim task : tasks) {
            task.setNewMean(newMean);
            task.setSumDiffSqs();
            totalSumDiffsSq += task.getSumDiffSqs();
            line.append(String.format(" %.6f", task.getSumDiffSqs()));
        }
        System.out.println(line);
        // Step 11: sample standard deviation divides by N - 1, not N
        System.out.printf("Totals sum = %d, count = %d, sumDiffsSq = %.6f%n",
                totalSum, totalCount, totalSumDiffsSq);
        System.out.printf("Average = %.2f%n", newMean);
        System.out.printf("Standard Deviation = %.2f%n",
                Math.sqrt(totalSumDiffsSq / (totalCount - 1)));
    }
}

class HadoopSim implements Runnable {
    private static final int SIZE = 100;
    private final int[] arrayData = new int[SIZE];
    private int count = 0;
    private int sum = 0;
    private double sumDiffsSq = 0.0;
    private double newMean = 0.0;

    // Constructor: reads the file into arrayData and sets count (no summing here)
    public HadoopSim(String fileName) {
        try (Scanner scan = new Scanner(new File(fileName))) {
            while (scan.hasNextInt() && count < SIZE) {
                arrayData[count++] = scan.nextInt();
            }
        } catch (FileNotFoundException e) {
            System.out.println("Error: cannot open file " + fileName);
            System.exit(1);
        }
    }

    // Thread body: adds up the numbers that were read
    @Override
    public void run() {
        int total = 0;
        for (int i = 0; i < count; i++) {
            total += arrayData[i];
        }
        sum = total;
    }

    // count is set when reading the file
    public int getCount() {
        return count;
    }

    // sum is set in a thread by the run() method
    public int getSum() {
        return sum;
    }

    // call after all threads have been completed
    public void setNewMean(double m) {
        newMean = m;
    }

    // computes this task's sum of differences squared using the new mean;
    // loops to count (not SIZE) so that zeros in the data are included
    public void setSumDiffSqs() {
        double total = 0.0;
        for (int i = 0; i < count; i++) {
            double diff = arrayData[i] - newMean;
            total += diff * diff;
        }
        sumDiffsSq = total;
    }

    // returns the sum of differences squared for the data in this task's array
    public double getSumDiffSqs() {
        return sumDiffsSq;
    }
}

Save both classes in HadoopDriver.java. When prompted, enter each file name as a path relative to the directory you run the program from, or as an absolute path.
Test case:
Input (put this data in the files):
sort1.txt:
100 90 75 80
sort2.txt:
99 92 60 75 70
sort3.txt:
99 92 60 75 70 50
Output:
Enter the name of the first file
sort1.txt
Enter the name of the second file
sort2.txt
Enter the name of the third file
sort3.txt
task1 returns count = 4, sum = 345
task2 returns count = 5, sum = 396
task3 returns count = 6, sum = 446
Driver newMean = 79.133333
Driver sumDiffsSq: 571.337778 1026.822222 1875.573333
Totals sum = 1187, count = 15, sumDiffsSq = 3473.733333
Average = 79.13
Standard Deviation = 15.75