In: Computer Science
Write pseudo-code to solve the following problem using MapReduce and explain how it works.
Each line in the file lists a user ID, the ID of the movie the user watched, the rating the user gave for the movie, and the timestamp. For example line 1 indicates that the user’s ID is 196, the movie ID is 242, the user gave this movie a rating of 3, and the timestamp is 881250949. Given the file, find out the top similar movies for each movie. Hint: Similarity is defined as how similarly two movies are rated by the same user across all users who have watched both movies.
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
HELLO I AM PROVIDING STEP BY STEP SOLUTION AS PER YOUR REQUIREMENT
PLEASE GO THROUGH IT
Solution:
        To solve this problem we can use chaining of mapreduce. 
        This works like this output of first mapreduce 
        is given as output for second mapreduce.
        
        Step-1 In the first step the Data is given as input to first  JOB,
                   and output of this step will be given as input to second Step
        Step-2 Second JOB will wait for output from first JOB then first output will work as 
               input for Second JOB and will provide final output at end
MapReduce Job-1 : Its job is to collect all user rated items 
MapReduce Job-2 : Its job is to find similarity between items using correlation formula
MAPREDUCE JOB 1:
Algorithm-1
        Map-1's Job is to Emit the user_id as key and (item and rating) as value
        
        Input:-
                key-line offset
                value- row of input file contains (item_id,user_id,rating)
        Output:- 
                Key- user_id
                Value-(item_id,rating) pair
        Require: Input dataset containing User_id, Item_id, rating fields
        Procedure:
                user_id, item_id, rating = line.split('\t')
                key=user_id
                value=item_id
                value.append(rating)
                emit(key,value)
Algorithm-2
        Reducer-1's Job is to For each user, emit a row containing their "postings"(item , rating pairs)
        Input:-
                key- user_id
                value- Sequence of (user_id , rating)
        Output:-
                Key- user_id
                Value-row contain all posting of user (item_id , rating)
        Procedure:
                item_count = 0
                item_sum = 0
                final = []
                for item_id, rating in values
                {
                        item_count += 1
                        item_sum += rating
                        final.append((item_id, rating))
                        Key=user_id
                        Value= item_count, item_sum, final
                        Emit (key,value)
                }
MAPREDUCE JOB : 2
Algorithm-3
        Map-2: The output drops the user from the key entirely, instead it emits the pair of items as the key
        Job-2: Require Output of job-1 as input to the job-2.
        Input:-
                key-user_id
                value- row containing all the posting of user (user_id , rating)
        Output:-
                Key- (item_id , item_id)
                Value-(rating , rating)
        Procedure:
                item_count , item_sum, ratings = values
                for item1, item2 in combinations(ratings, 2)
                {
                        key=(item1[0], item2[0])
                        value=( item1[1], item2[1])
                        Emit=(key , value)
                }
Algorithm-4
        Reduce-2: Sum components of each co rating pair across all users who rated both item x and
                          item y, then calculate correlation similarity.
        Input:-
                key- (item_id , item_id)
                value- sequence of rating pair(rating , rating)
        Output:-
                Key- (item_id , item_id)
                Value-(similarity , n)
        Procedure
                sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
                sum_x += item_x
                n += 1
                item_pair, co_ratings = pair_key, lines
                item_xname, item_yname = item_pair
                for item_x, item_y in lines:
                {
                        sum_xx += item_x * item_x
                        sum_yy += item_y * item_y
                        sum_xy += item_x * item_y
                        sum_y += item_y
                        similarity = normalized_correlation(n, sum_xy, sum_x, sum_y,sum_xx, sum_yy)
                        Key=(item_xname,item_yname)
                        Value= (similarity)
                        Emit(key,value)
                }