Read movie.csv into a dataframe named movies, display the first 5 rows, and answer the 10 questions below.
url = 'https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv'
6) Use the count method to find the number of non-missing values for each column.
7) Display the count of missing values for each column
8) List the frequency for the top ten directors
9) List the top ten director_name that has the highest average of director_facebook_likes
10) List the top ten movie_title that has the longest duration
Find the solutions below. The code refers to the dataframe as df (the question calls it movies).
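The snippets that follow use a dataframe called df without showing how it is created. A minimal sketch of the loading step is given here. To keep it runnable offline, a tiny made-up CSV (hypothetical rows, only four of the real columns) stands in for the file at the URL from the question.

```python
import io
import pandas as pd

# In the exercise the CSV would be read straight from the URL:
#   url = 'https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv'
#   df = pd.read_csv(url)
# Assumption: the small sample below is invented for illustration only.
sample_csv = io.StringIO(
    "director_name,director_facebook_likes,duration,movie_title\n"
    "Director A,100,120,Movie 1\n"
    "Director B,,95,Movie 2\n"
    "Director A,100,,Movie 3\n"
)
df = pd.read_csv(sample_csv)

# Display the first 5 rows (only 3 exist in this sample)
print(df.head(5))
```

With the real file, only the argument to pd.read_csv changes; the rest of the solutions work on df unchanged.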
# 6) Use the count method to find the number of non-missing values for each column.
The count() method returns the number of non-null values. Passing axis=0 (the default) counts down each column.
df.count(axis = 0)
# 7) Display the count of missing values for each column
isnull() returns a boolean frame that is True wherever a value is missing. Summing it with axis=0 counts the True values down each column, which gives the number of missing values per column.
df.isnull().sum(axis=0)
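Answers 6 and 7 are two halves of the same picture: for every column, the non-missing count plus the missing count should equal the number of rows. The sketch below checks that on a small invented dataframe (the rows are hypothetical, not taken from movie.csv).

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration, with a few missing values
df = pd.DataFrame({
    "director_name": ["Director A", None, "Director B", "Director A"],
    "duration": [120.0, 95.0, np.nan, np.nan],
    "movie_title": ["Movie 1", "Movie 2", "Movie 3", "Movie 4"],
})

non_missing = df.count()       # non-null values per column
missing = df.isnull().sum()    # null values per column

# The two counts add up to the number of rows in every column
total = non_missing + missing
print(pd.DataFrame({"non_missing": non_missing, "missing": missing}))
```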
# 8) List the frequency for the top ten directors
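First, group the dataframe by director_name and take the number of rows per group with size(). reset_index(name='frequency') turns the result back into a dataframe and names the count column frequency. Next, sort by the frequency column in descending order with sort_values(). Sorting leaves the index out of order, so reset_index(drop=True) renumbers the rows from 0 and discards the old index. Finally, head(10) selects the top 10. Rows with a missing director_name are excluded, because groupby() drops null keys by default.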
df.groupby('director_name').size().reset_index(name='frequency').sort_values('frequency',ascending = False).reset_index(drop=True).head(10)
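As a possible shortcut, value_counts() on the director_name column produces the same frequencies already sorted in descending order, and it also skips missing names by default. The sketch below compares it with the groupby chain on invented data; the result is a Series indexed by director rather than a two-column dataframe.

```python
import pandas as pd

# Toy data, made up for illustration (one missing director)
df = pd.DataFrame({
    "director_name": [
        "Director A", "Director B", "Director A",
        "Director C", "Director A", "Director B", None,
    ]
})

# Shorter alternative: frequencies sorted from most to least common
top_directors = df["director_name"].value_counts().head(10)

# The groupby chain from the answer, for comparison
chain = (
    df.groupby("director_name")
      .size()
      .reset_index(name="frequency")
      .sort_values("frequency", ascending=False)
      .reset_index(drop=True)
      .head(10)
)
print(top_directors)
print(chain)
```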
# 9) List the top ten director_name that has the highest average of director_facebook_likes
First, select only director_name and director_facebook_likes from the dataframe, then group by director_name. mean() gives the average director_facebook_likes per director, and reset_index() turns director_name back into a regular column. Sort in descending order of that average, renumber the rows with reset_index(drop=True), and print the top 10 with head(10).
df[['director_name', 'director_facebook_likes']].groupby('director_name').mean().reset_index().sort_values('director_facebook_likes', ascending = False).reset_index(drop=True).head(10)
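The same ranking can be written a little more compactly by selecting the likes column after grouping, which yields a Series indexed by director_name. This is only a sketch on invented data; note that a director whose likes are all missing gets a NaN average and is sorted to the end.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({
    "director_name": ["Director A", "Director A", "Director B", "Director C"],
    "director_facebook_likes": [100.0, 300.0, 50.0, np.nan],
})

top_liked = (
    df.groupby("director_name")["director_facebook_likes"]
      .mean()                         # average likes per director
      .sort_values(ascending=False)   # NaN averages are placed last
      .head(10)
)
print(top_liked)
```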
# 10) List the top ten movie_title that has the longest duration
Sort the dataframe by the duration column in descending order; rows with a missing duration are placed last. Reset the index so the rows are numbered from 0, then select the movie_title column and take the top 10 with head(10).
df.sort_values('duration',ascending = False).reset_index()['movie_title'].head(10)
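An alternative worth knowing is DataFrame.nlargest, which picks the rows with the largest values in a column without sorting the whole frame. The sketch below uses invented data and asks for the top 2 so the toy frame stays small; on the full dataset the first argument would be 10. Keeping the duration column next to movie_title makes the result easier to check.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (one missing duration)
df = pd.DataFrame({
    "movie_title": ["Movie 1", "Movie 2", "Movie 3", "Movie 4"],
    "duration": [120.0, np.nan, 200.0, 95.0],
})

# Top 2 here; use 10 for the exercise
longest = (
    df.nlargest(2, "duration")[["movie_title", "duration"]]
      .reset_index(drop=True)
)
print(longest)
```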