#########################PANDAS LANGUAGE##################
#########################MATPLOT LIB#########################
In [40]: # import the libraries, then load the file
import pandas as pd
import matplotlib.pyplot as plt
users = pd.read_table('u.user', sep='|', index_col='user_id')
Describe and show the dataframe
In [ ]:
# describe information of all columns
# describe information of all numeric columns only
# describe information of all object columns only
# show first 10 rows of users dataframe
Detecting duplicate rows
In [10]:
# check whether a row is identical to a previous row
# count all duplicate rows in the dataframe
# show only duplicate rows in the dataframe
# drop all duplicate rows in the dataframe
# check whether duplicates occur in a single specific column
# check more than one column for duplicates
In [11]:
# display the 3 most frequent occupations in 'users'
# change the data type of the column named age from int to float
# for each occupation, calculate the minimum and maximum ages
In [12]:
# for each occupation in 'users', count the number of occurrences
# plot a bar chart of the above output for each occupation
In [13]:
# for each occupation, calculate the mean age
# plot a pie chart of the above output
In [14]:
# for each combination of occupation and gender, calculate the mean age
# plot a bar chart of the above output for each occupation and gender
In [15]:
# sort 'users' by 'occupation' and then by 'age' (in a single command)
Describe and show the dataframe:
df.describe()
Describe information of all columns:
df.describe(include='all')
Describe information of all numeric columns only:
df.describe(include=[np.number])  # requires: import numpy as np
Describe information of all object columns only:
df.describe(include=[object])
Show first 10 rows of users dataframe:
df.head(10)
Show duplicated rows:
df[df.duplicated()]
Check whether a row is identical to a previous row:
# The following returns a boolean Series that tells us whether the value in
# column 'col' of the current row equals the value in the row directly above it.
df.col.eq(df.col.shift())
OR
def compare_previous(a):
    return np.concatenate(([False], a[1:] == a[:-1]))
df['match'] = compare_previous(df.col.values)
# To flag rows that repeat any earlier row in full, use df.duplicated().
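The two readings of "identical to a previous row" can give different answers, so a small comparison may help. This is a minimal sketch on a hypothetical toy dataframe (the values are made up, not taken from u.user); `same_as_previous` and `seen_before` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data, not taken from u.user
df = pd.DataFrame({
    'age': [24, 24, 53, 24],
    'occupation': ['technician', 'technician', 'other', 'technician'],
})

# True where every column equals the row directly above it
same_as_previous = df.eq(df.shift()).all(axis=1)

# True where the full row has already appeared anywhere earlier
seen_before = df.duplicated()

print(same_as_previous.tolist())
print(seen_before.tolist())
```

The last row repeats the first one but not its immediate neighbour, so only `duplicated()` flags it.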
Count all duplicate rows in the dataframe:
df.duplicated().sum()
Show only duplicate rows in the dataframe:
df[df.duplicated(keep=False)]  # keep=False shows every copy, including the first
Drop all duplicate rows in the dataframe:
df.drop_duplicates()
Check whether duplicates occur in a single specific column:
df.duplicated(subset=['Name_of_column'])
Check more than one column for duplicates:
df.duplicated(subset=['column_1', 'column_2'], keep=False)
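The `keep` argument decides which copies of a duplicate get marked. The sketch below shows its three settings on a hypothetical toy dataframe; the column names and values are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    'age': [30, 30, 41, 30],
    'zip_code': ['10001', '10001', '94043', '55105'],
})

cols = ['age', 'zip_code']

first_kept = df.duplicated(subset=cols)               # default keep='first'
last_kept = df.duplicated(subset=cols, keep='last')   # last copy is not marked
all_marked = df.duplicated(subset=cols, keep=False)   # every copy is marked

print(first_kept.tolist())
print(last_kept.tolist())
print(all_marked.tolist())
```

Rows 0 and 1 share both values; row 3 shares only the age, so it is never flagged.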
Display the 3 most frequent occupations in 'users':
users.occupation.value_counts().head(3)
Change the data type of the column named age from int to float:
users['age'] = users.age.astype(float)
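Note that `astype` returns a new Series rather than changing the column in place, so the result has to be assigned back to take effect. A minimal sketch with a hypothetical three-row dataframe:

```python
import pandas as pd

# Hypothetical toy data for illustration
users = pd.DataFrame({'age': [24, 53, 23]})

converted = users.age.astype(float)    # new Series; 'users' is unchanged so far
still_int = pd.api.types.is_integer_dtype(users.age)

users['age'] = converted               # assign back to change the column
now_float = pd.api.types.is_float_dtype(users.age)

print(still_int, now_float)
```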
For each occupation, calculate the minimum and maximum ages:
users.groupby('occupation').age.agg(['min', 'max'])
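To see what the aggregation returns, here is a minimal sketch on a hypothetical stand-in for `users` (the rows are made up). The named-aggregation variant with `youngest` and `oldest` is an optional alternative, and those two column names are my own choice.

```python
import pandas as pd

# Hypothetical stand-in for the 'users' dataframe
users = pd.DataFrame({
    'age': [24, 33, 53, 42, 29],
    'occupation': ['technician', 'writer', 'other', 'technician', 'writer'],
})

# One row per occupation, with columns 'min' and 'max'
age_range = users.groupby('occupation').age.agg(['min', 'max'])

# Same result with custom column names (named aggregation)
age_range_named = users.groupby('occupation').agg(
    youngest=('age', 'min'),
    oldest=('age', 'max'),
)

print(age_range)
print(age_range_named)
```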
For each occupation in 'users', count the number of occurrences:
users.occupation.value_counts()
Plot bar chart of the above output for each occupation:
users.occupation.value_counts().plot(kind='bar')
plt.xlabel("Occupation")
plt.ylabel("Number of users")
plt.show()
For each occupation, calculate the mean age:
users.groupby('occupation').age.mean()
Plot pie chart of the above output:
mean_age = users.groupby('occupation').age.mean()
fig = plt.figure(figsize=(15, 12))
plt.pie(mean_age, labels=mean_age.index)
plt.show()
For each combination of occupation and gender, calculate the mean age:
users.groupby(['occupation', 'gender']).age.mean()
Plot bar chart of the above output for each occupation and gender:
users.groupby(['occupation', 'gender']).age.mean().unstack().plot(kind='bar')
plt.xlabel("Occupations")
plt.ylabel("Mean age")
plt.show()
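One way to draw mean age per occupation and gender is to unstack the grouped result so that each gender becomes its own column, then let pandas draw one bar per column within each occupation. This is a sketch on hypothetical toy data; the `Agg` backend line is only there so it runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the 'users' dataframe
users = pd.DataFrame({
    'age': [24, 53, 23, 33, 42, 57, 36, 29],
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer', 'other',
                   'technician', 'writer', 'technician', 'other'],
})

# Rows: occupation, columns: gender, values: mean age
mean_age = users.groupby(['occupation', 'gender']).age.mean().unstack()

ax = mean_age.plot(kind='bar')  # one group of bars per occupation
ax.set_xlabel('Occupation')
ax.set_ylabel('Mean age')
plt.tight_layout()
```

In an interactive session, `plt.show()` after the last line displays the figure.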
Sort 'users' by 'occupation' and then by 'age' (in a single command):
users.sort_values(['occupation', 'age'])
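In current pandas the multi-column sort is done with `sort_values`, which also accepts a list for `ascending` to set the direction per column. A minimal sketch on a hypothetical stand-in for `users`:

```python
import pandas as pd

# Hypothetical stand-in for the 'users' dataframe, indexed by user_id
users = pd.DataFrame(
    {
        'age': [42, 24, 53, 33],
        'occupation': ['technician', 'technician', 'other', 'other'],
    },
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)

# Occupation A-Z, then youngest first within each occupation
by_occ_age = users.sort_values(['occupation', 'age'])

# Occupation A-Z, then oldest first within each occupation
by_occ_age_desc = users.sort_values(['occupation', 'age'], ascending=[True, False])

print(by_occ_age)
print(by_occ_age_desc)
```

`sort_values` returns a sorted copy, so assign the result if the new order should be kept.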