Question

In: Computer Science

Assignment – 1 1- The DISTINCT(X) operator is used to return only distinct (unique) values for...

Assignment – 1

1- The DISTINCT(X) operator is used to return only distinct (unique) values for datatype (or column) X in the entire dataset .

As an example, for the following table A:

A.ID

A.ZIPCODE

A.AGE

1

12345

30

2

12345

40

3

78910

10

4

78910

10

5

78910

20

DISTINCT(A.ID) = (1, 2, 3, 4, 5)

DISTINCT(A.ZIPCODE) = (12345, 78910)

DISTINCT(A.AGE) = (30, 40, 10, 20)

Implement the DISTINCT(X) operator using Map-Reduce. Provide the algo-

rithm pseudocode. You should use only one Map-Reduce stage, i.e. the algorithm should

make only one pass over the data.

2-The SHUFFLE operator takes a dataset as input and randomly re-orders it.

Hint: Assume that we have a function rand(m) that is capable of outputting a random integer between [1, m].

Implement the SHUFFLE operator using Map-Reduce. Provide the algorithm pseudocode.

3-What is the communication cost (in terms of total data flow on the network between mappers and reducers) for following query using Map-Reduce:

Get DISTINCT(A.ID from A WHERE A.AGE > 30 )

The dataset A has 1000M rows, and 400M of these rows have A.AGE <= 30. DISTINCT(A.ID) has 1M elements. A tuple emitted from any mapper is 1 KB in size.

4-Consider the checkout counter at a large supermarket chain. For each item sold, it generates a record of the form [ProductId, Supplier, Price]. Here, ProductId is the unique identifier of a product, Supplier is the supplier name of the product and Price is the sales price for the item. Assume that the supermarket chain has accumulated many terabytes of data over a period of several months.

The CEO wants a list of suppliers, listing for each supplier the average sales price of items provided by the supplier. How would you organize the computation using the Map-Reduce computation model?

For the following questions give short explanations of your answers.

5-True or False: Each mapper/reducer must generate the same number of output key/value pairs as it receives on the input.

6-True or False: The output type of keys/values of mappers/reducers must be of the same type as their input.

7-True or False: The input to reducers is grouped by key.

8-True or False: It is possible to start reducers while some mappers are still running.

Solutions

Expert Solution

Solutions:

1.The solution exploits MapReduce’s ability to group keys together to remove duplicates. Use a mapper to emit X from each record as the key.

The reducer simply emits the keys.

Pseudo code:

map(key, record):

output (value, null) from column X of each input row

reduce(key, records):

output key

2.All the mapper does is output the record as the value along with a random key. In other words, each record is sent to a random reducer. The reducer emits the values.

Pseudo code:

map(key, record):

rand(m) = pick a random integer in [1, m]

output (rand(n), record  

reduce(key, records):

for record in records:

output record

3.Let, p = 1000M, r = 400M, q = 1M

There will be 2 jobs and the output of WHERE is chained to DISTINCT :

WHERE emits (p - r) tuples from the mapper

DISTINCT emits (p - r) tuples from the mapper

Total = 2(p - r) = 2(600M) = 1200M * 1KB = 1.12 TB (approx)

Total = (p-r) = 600M * 1KB, if the values are filtered in the mapper.

4.SELECT AVG(Price) FROM DATA GROUP BY Supplier

Pseudo code :

map(key, record):

output [ record(SUPPLIER), record(PRICE) ]

reduce(SUPPLIER, list of PRICE):

emit [ SUPPLIER, AVG(PRICE) ]


Related Solutions

For the following schema, provide the following unique/distinct values a) A suitable primary key b) An...
For the following schema, provide the following unique/distinct values a) A suitable primary key b) An example of a different candidate key c) An example of a superkey Accounts(account_id, account_type_code, customer_id, account_name, date_opened, date_closed, current_balance)
. Let xj , j = 1, . . . n be n distinct values. Let...
. Let xj , j = 1, . . . n be n distinct values. Let yj be any n values. Let p(x) = c1 + c2x + c3x 2 + · · · + cn x ^n−1 be the unique polynomial that interpolates the data (xj , yj ), j = 1, . . . , n (Vandermonde approach). (a) Remember that (xj , yj ), j = 1, . . . , n are given. Derive the n...
Given an array A[1..n], with distinct values and k with 1 ≤ k ≤ n. We...
Given an array A[1..n], with distinct values and k with 1 ≤ k ≤ n. We want to return the k smallest element of A[1.....n], in non-decreasing order. For example: A = [5, 4, 6, 2, 10] and k = 4, the algorithm returns [2, 4, 5, 6]. There are at least the following four approaches: a. heapify A and then extract k elements one by one b. sort the array (e.g. using MergeSort or HeapSort) and then read the...
Assignment: Create data file consisting of integer, double or String values. Create unique Java application to...
Assignment: Create data file consisting of integer, double or String values. Create unique Java application to read all data from the file echoing the data to standard output. After all data has been read, display how many data were read. For example, if 10 integers were read, the application should display all 10 integers and at the end of the output, print "10 data values were read" My issue is displaying how many integers were read and how many strings...
// return index i such that A[i] = x, return -1 if x is not found...
// return index i such that A[i] = x, return -1 if x is not found int mysearch(A, left, right, x) { //add statements } (a) (10 pts) Given a sorted array A of size n, i.e. data has been sorted. Give a Divide and Conquer recursive “mysearch” algorithm which first tests the element at position n/4 for equality with some value x, and then possibly checks the element at position 3n/4. The result is either discovering x or reducing...
The following language over Σ = {1, #} contains #-separated lists of distinct unary values: A...
The following language over Σ = {1, #} contains #-separated lists of distinct unary values: A = {x1#x2# . . . #xk | k ≥ 0, xi = 1∗ for all i = 1 . . . k, xi 6= xj for all i 6= j} Use the pumping lemma to show A is not regular. In other words, give a string s ∈ A and argue that no matter how you partition s = xyz with |y| > 0,...
(a) Let X be a continuous random variable which only takes on positive values on the...
(a) Let X be a continuous random variable which only takes on positive values on the interval [1, 4]. If P(X) = (√ x + √ 1 x )C 2 for all x in this interval, compute the value of C. (b) Let X be a random variable with normal distribution. Let z represent the z-score for X, and let a be a positive number. Prove that P(z < |a|) = P(z < a) + P(z > −a) − 1.
ONLY SPSS In this week’s assignment, you will explore the different types of graphs used to...
ONLY SPSS In this week’s assignment, you will explore the different types of graphs used to visualize data. Results from SPSS should be copied and pasted into a Word document for submission. Each graph must contain a narrative description of what it represents and an interpretation of the image. Please use the provided datasets for building these figures. Pie chart Bar chart Scatterplot Histogram Data is X Y 382 4215 397 4472 344 3421 275 4077 406 3845 337 2844...
ONLY SPSS In this week’s assignment, you will explore the different types of graphs used to...
ONLY SPSS In this week’s assignment, you will explore the different types of graphs used to visualize data. Results from SPSS should be copied and pasted into a Word document for submission. Each graph must contain a narrative description of what it represents and an interpretation of the image. Please use the provided datasets for building these figures. Pie chart Bar chart Scatterplot Histogram Data is X Y 382 4215 397 4472 344 3421 275 4077 406 3845 337 2844...
The following equation can be used to compute values of y as a function of x:...
The following equation can be used to compute values of y as a function of x: ? = ?? −??sin(??)(0.012? 4 − 0.15? 3 + 0.075? 2 + 2.5?) where a and b are parameters. Write a Matlab script file (m file) with the following steps: Step 1: Define scalars a = 2, b = 5. Step 2: Define vector x holding values from 0 to π/2 in increments of Δx = π/40. Step 3: generate the vector y using...
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT