In: Computer Science
Write pseudo-code to solve the following problem using MapReduce and explain how it works.
Each line in the file lists a user ID, the ID of the movie the user watched, the rating the user gave for the movie, and the timestamp. For example line 1 indicates that the user’s ID is 196, the movie ID is 242, the user gave this movie a rating of 3, and the timestamp is 881250949. Given the file, find out the top similar movies for each movie. Hint: Similarity is defined as how similarly two movies are rated by the same user across all users who have watched both movies.
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
HELLO I AM PROVIDING STEP BY STEP SOLUTION AS PER YOUR REQUIREMENT
PLEASE GO THROUGH IT
Solution:
To solve this problem we can use chaining of mapreduce.
This works like this output of first mapreduce
is given as output for second mapreduce.
Step-1 In the first step the Data is given as input to first JOB,
and output of this step will be given as input to second Step
Step-2 Second JOB will wait for output from first JOB then first output will work as
input for Second JOB and will provide final output at end
MapReduce Job-1 : Its job is to collect all user rated items
MapReduce Job-2 : Its job is to find similarity between items using correlation formula
MAPREDUCE JOB 1:
Algorithm-1
Map-1's Job is to Emit the user_id as key and (item and rating) as value
Input:-
key-line offset
value- row of input file contains (item_id,user_id,rating)
Output:-
Key- user_id
Value-(item_id,rating) pair
Require: Input dataset containing User_id, Item_id, rating fields
Procedure:
user_id, item_id, rating = line.split('\t')
key=user_id
value=item_id
value.append(rating)
emit(key,value)
Algorithm-2
Reducer-1's Job is to For each user, emit a row containing their "postings"(item , rating pairs)
Input:-
key- user_id
value- Sequence of (user_id , rating)
Output:-
Key- user_id
Value-row contain all posting of user (item_id , rating)
Procedure:
item_count = 0
item_sum = 0
final = []
for item_id, rating in values
{
item_count += 1
item_sum += rating
final.append((item_id, rating))
Key=user_id
Value= item_count, item_sum, final
Emit (key,value)
}
MAPREDUCE JOB : 2
Algorithm-3
Map-2: The output drops the user from the key entirely, instead it emits the pair of items as the key
Job-2: Require Output of job-1 as input to the job-2.
Input:-
key-user_id
value- row containing all the posting of user (user_id , rating)
Output:-
Key- (item_id , item_id)
Value-(rating , rating)
Procedure:
item_count , item_sum, ratings = values
for item1, item2 in combinations(ratings, 2)
{
key=(item1[0], item2[0])
value=( item1[1], item2[1])
Emit=(key , value)
}
Algorithm-4
Reduce-2: Sum components of each co rating pair across all users who rated both item x and
item y, then calculate correlation similarity.
Input:-
key- (item_id , item_id)
value- sequence of rating pair(rating , rating)
Output:-
Key- (item_id , item_id)
Value-(similarity , n)
Procedure
sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
sum_x += item_x
n += 1
item_pair, co_ratings = pair_key, lines
item_xname, item_yname = item_pair
for item_x, item_y in lines:
{
sum_xx += item_x * item_x
sum_yy += item_y * item_y
sum_xy += item_x * item_y
sum_y += item_y
similarity = normalized_correlation(n, sum_xy, sum_x, sum_y,sum_xx, sum_yy)
Key=(item_xname,item_yname)
Value= (similarity)
Emit(key,value)
}