In: Computer Science
######################LANGUAGE PANDAS####################
#####################MATPLOTLIB###########################
#########################################################
# read ufo.csv into a DataFrame called 'ufo'
# print the head and the tail
# examine the default index, data types, and shape of ufo dataframe
# count the number of missing values in each column
# count total number of null vlaues in the dataframe
# print those rows which has null values
# fill null values,
#if any column is numerical has null value than fill this column
with mean of that column
#if any column is categorical than fill this column with most frequent value of that column
# calculate the most frequent value for each of the columns (in a single command)
# what are the four most frequent colors reported?
# for reports in VA, what's the most frequent city?
# show only the UFO reports from Arlington, VA
# show only the UFO reports in which the City is missing
# how many rows remain if you drop all rows with any missing values?
# replace any spaces in the column names with an underscore
# create a new column called 'Location' that includes both City
and State
# For example, the 'Location' for the first row would be 'Ithaca,
NY'
# map existing values to a different set of values
# like in column 'is_male', convert F value to 0 and M to 1 with
pandas
# writing generic code to replace spaces with underscores
# In other words, your code should not reference the specific
column names
# convert datatype of column 'time' to the datetime format
ufo.csv file
City | Colors Reported | Shape Reported | State | Time | |
Ithaca | TRIANGLE | NY | ######## | ||
Willingboro | OTHER | NJ | ######## | ||
Holyoke | OVAL | CO | ######## | ||
Abilene | DISK | KS | ######## | ||
New York Worlds Fair | LIGHT | NY | ######## | ||
Valley City | DISK | ND | ######## | ||
Crater Lake | CIRCLE | CA | ######## | ||
Alma | DISK | MI | ######## | ||
Eklutna | CIGAR | AK | ######## | ||
Hubbard | CYLINDER | OR | ######## | ||
Fontana | LIGHT | CA | ######## | ||
Waterloo | FIREBALL | AL | ######## | ||
Belton | RED | SPHERE | SC | ######## | |
Keokuk | OVAL | IA | ######## | ||
Ludington | DISK | MI | ######## | ||
Forest Home | CIRCLE | CA | ######## | ||
Los Angeles | CA | ######## | |||
Hapeville | GA | ######## | |||
Oneida | RECTANGLE | TN | ######## | ||
Bering Sea | RED | OTHER | AK | ######## | |
Nebraska | DISK | NE | ######## | ||
STEP - BY - STEP PROCESS
(1.) read ufo.csv into a DataFrame called 'ufo'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
(2.) print the head
df.head()
(3.) print the tail
df.tail()
(4.) examine the default index
df.index
(5.) examine the data types
df.dtypes
(6.) examine the shape of ufo dataframe
df.shape
18241 rows total
(7.) count total number of null vlaues in the dataframe
df.isnull().sum()
(8.) print those rows which has null values
null_data = df[df.isnull().any(axis=1)]
null_data
(9.) if any column is numerical has null value than fill this column with mean of that column
df.select_dtypes(include=["float", 'int']).columns
no column have numeric values.
(10.) if any column is categorical than fill this column with most frequent value of that column
df.dtypes.name == 'category'
No column have categorical values, all column dtype is 'object'
(11.) calculate the most frequent value for each of the columns
df.mode()
(12.) what are the four most frequent colors reported?
n = 4
df['Colors Reported'].value_counts()[:n].index.tolist()
(13.) for reports in VA, what's the most frequent city?
df['City'].loc[df['State'] == 'VA'].mode()
(14.) show only the UFO reports from Arlington, VA
df1 = df.loc[df['City'] == 'Arlington']
df1.loc[df1['State'] == 'VA']
(15.) show only the UFO reports in which the City is missing
city_miss = df[df.City.isnull()]
city_miss
(16.) how many rows remain if you drop all rows with any missing values?
df = df.dropna()
df.shape
2486 rows left after removing missing values
(17.) replace any spaces in the column names with an underscore
df.columns = df.columns.str.replace(' ', '_')
df.columns
(18.) create a new column called 'Location' that includes both City and State
df['Location'] = df['City'] + ", " + (df['State'])
(19.) convert datatype of column 'time' to the datetime format
df['Time'] = pd.to_datetime(df.Time)
df.head()
NOTE: Use jupyter notebook and python for better understanding.
Thumbs Up Please !!!