During Week 5, you will be provided a dataset with which you will create a full data pipeline in Python. This will include ingest, storage, cleansing, preprocessing, and utilization. The utilization of these data may be descriptive (e.g., a simple report) or predictive (e.g., a machine learning algorithm).
During your final IP, you will utilize the dataset provided by Data.Gov. This dataset provides demographics by zip code and is available in many formats, such as CSV, JSON, XML, and RDF. For this assignment, complete a Python program that covers the following:
Any help would be great. Chegg stop deleting my questions for no reason.
1. Data ingestion means converting data from its source form (e.g., an RDBMS, a CSV file, or another format) into a form Python can work with.
The most popular Python package for working with data is pandas.
2. Data cleaning means removing irregularities in the data, e.g., missing values in rows or columns. Cleaning is necessary because we can't arrive at a reliable report or model with missing data.
3. Data preprocessing is also an important step; it aims to reduce the data to only the necessary fields. For example, if the data contains two columns representing the same thing, we can remove one column to reduce the data size, or if two columns represent closely related fields, they can be combined to save space.
4. For a descriptive report, choose the columns that are worth comparing, i.e., the ones that convey the important point. Example: comparing house prices with the number of rooms lets you infer that a house with more rooms tends to cost more.
5. For a predictive model, it is very important to check the nature of the data, because the models that apply change drastically with the nature of the data.
Example: linear regression or a linear SVM suits data whose relationships are roughly linear.
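
To make the five steps above concrete, here is a minimal sketch of how they could fit together. It is not the assignment solution: the inline sample data and the column names (zip_code, participants, female, male) are made up for illustration and are not the actual Data.gov columns. It uses only the standard library so it runs without installing anything; with pandas, the rough equivalents would be `pandas.read_csv` for ingestion and `DataFrame.dropna` / `DataFrame.drop_duplicates` for cleaning.

```python
import csv
import io
from statistics import mean

# Hypothetical sample standing in for the real demographics CSV.
# Column names and values are invented for illustration only.
SAMPLE_CSV = """zip_code,participants,female,male
10001,44,22,22
10002,35,19,16
10003,,1,
10002,35,19,16
10004,20,12,8
"""


def ingest(text):
    """Step 1: parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 2: drop rows with missing values and exact duplicate rows."""
    seen, result = set(), []
    for row in rows:
        key = tuple(row.items())
        complete = all((value or "").strip() for value in row.values())
        if complete and key not in seen:
            seen.add(key)
            result.append(row)
    return result


def preprocess(rows):
    """Step 3: keep only the needed fields and convert them to integers."""
    # 'male' is dropped as redundant: in this sample it is always
    # participants minus female, so it carries no extra information.
    return [
        {
            "zip_code": row["zip_code"],
            "participants": int(row["participants"]),
            "female": int(row["female"]),
        }
        for row in rows
    ]


def describe(rows, column):
    """Step 4: simple descriptive report for one numeric column."""
    values = [row[column] for row in rows]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def fit_line(xs, ys):
    """Step 5: ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


rows = preprocess(clean(ingest(SAMPLE_CSV)))
report = describe(rows, "participants")
slope, intercept = fit_line(
    [row["participants"] for row in rows],
    [row["female"] for row in rows],
)

print("Rows after cleaning:", len(rows))
print("Participants report:", report)
print("Predicted female count for 50 participants:", slope * 50 + intercept)
```

To try it on the real file, replace SAMPLE_CSV with the text of the downloaded CSV (for example, read it with `open(path, newline="")`) and change the column names to match. The straight-line fit in step 5 is only reasonable if a scatter plot of the two columns looks roughly linear, which is the "nature of the data" check mentioned in point 5.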