Question

During Week 5, you will be provided a dataset that you will use to create a full data pipeline in Python. This will include ingest, storage, cleansing, preprocessing, and utilization. The utilization of these data may be in a descriptive manner (e.g., simple report) or predictive manner (e.g., machine learning algorithm).

During your final IP, you will utilize the dataset provided by Data.Gov. This dataset provides demographics by zip code and is available in many formats, such as CSV, JSON, XML, and RDF. For this assignment, complete a Python program that covers the following:

  1. Ingest the data via one of the provided formats.
  2. Create a data structure to store the data.
  3. Conduct any cleansing and preprocessing needed.
  4. Create either a descriptive report or a predictive algorithm using the data.

Any help would be great. Chegg stop deleting my questions for no reason.

Solutions

Expert Solution

1. Data ingestion means converting data from its source form (an RDBMS, CSV, or another format) into a form that Python can work with.

The most popular Python package for working with data is pandas.

  • With pandas you can easily load data from many formats into a DataFrame. Check the example below:
    • import pandas as pd
    • df = pd.read_csv("data.csv")  # df now holds the data as a DataFrame
  • This example covers both steps: ingestion and creating a data structure to store the data.
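
A slightly fuller sketch of the ingestion step, so it can be run without the real file. The inline sample and the column names (`zip_code`, `count_participants`, and so on) are assumptions made for illustration, not the actual Data.gov schema; swap in the real file path and headers.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file.
# The column names here are assumptions, not the actual dataset headers.
SAMPLE_CSV = """zip_code,count_participants,count_female,count_male
10001,44,22,22
10002,35,19,16
10003,1,1,0
"""


def ingest_csv(source):
    """Read CSV data (a path or file-like object) into a DataFrame."""
    # Reading zip_code as text keeps leading zeros such as "00501".
    return pd.read_csv(source, dtype={"zip_code": str})


df = ingest_csv(io.StringIO(SAMPLE_CSV))
print(df.shape)   # number of rows and columns
print(df.head())  # first few records
```

The same function works with a path such as "data.csv". For the JSON version of the data, `pd.read_json` plays the same role.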

2. Data cleaning means removing irregularities in the data, e.g., missing values in rows or columns. It is necessary because we can't arrive at a reliable report or model with missing data.
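
A minimal cleaning sketch along those lines, assuming a small made-up frame with hypothetical column names. It counts missing values, drops exact duplicate rows, and drops rows that have no ZIP code. Which rows are safe to drop depends on the real data.

```python
import numpy as np
import pandas as pd

# Made-up records for illustration; column names are assumptions.
raw = pd.DataFrame({
    "zip_code": ["10001", "10002", "10002", "10003", None],
    "count_participants": [44, 35, 35, np.nan, 12],
})

# How many values are missing in each column?
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Remove exact duplicate rows, then rows without a ZIP code
# (a record cannot be attributed to an area without it).
cleaned = (
    raw.drop_duplicates()
       .dropna(subset=["zip_code"])
       .reset_index(drop=True)
)
print(cleaned)
```

Rows that are only missing a count are kept here so they can be filled in during preprocessing instead of being thrown away.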

3. Data preprocessing is also an important step. It aims to reduce the data to only the necessary fields. For example, if the data contains two columns representing the same thing, we can remove one column to reduce the data size; if they represent two closely related fields, they can be combined to save space.

  • If only a few values are missing, they can be filled in using the mean, median, or mode of the column, which also comes under data preprocessing.
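
Both ideas can be sketched in a few lines of pandas. The frame below is made up, and `participants_total` is a hypothetical redundant column added only to demonstrate the drop. The median is used for filling here, but the mean or mode works the same way.

```python
import numpy as np
import pandas as pd

# Made-up data; column names are assumptions for illustration.
data = pd.DataFrame({
    "zip_code": ["10001", "10002", "10003", "10004"],
    "count_participants": [40.0, np.nan, 20.0, 30.0],
    "participants_total": [40.0, np.nan, 20.0, 30.0],  # same values as above
    "count_female": [22.0, 19.0, 8.0, 15.0],
})

# Drop a column that carries exactly the same values as another one.
# Series.equals treats NaNs in matching positions as equal.
if data["count_participants"].equals(data["participants_total"]):
    data = data.drop(columns=["participants_total"])

# Fill the few missing counts with the column median.
median_value = data["count_participants"].median()
data["count_participants"] = data["count_participants"].fillna(median_value)
print(data)
```

Filling is only reasonable when few values are missing. If most of a column is empty, dropping the column is usually the safer choice.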

4. To create a descriptive report, choose the columns that are worth comparing, i.e., those that convey an important point. Example: comparing house prices with the number of rooms lets you infer that a house with more rooms tends to cost more.
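
A simple descriptive report might look like the sketch below. The data is made up and the column names are assumptions. It prints summary statistics, a derived percentage, the areas with the most participants, and how strongly two columns move together.

```python
import pandas as pd

# Made-up data; column names are assumptions for illustration.
demo = pd.DataFrame({
    "zip_code": ["10001", "10002", "10003", "10004"],
    "count_participants": [40, 30, 20, 10],
    "count_female": [20, 18, 5, 5],
})

# Derived column: share of female participants per ZIP code.
demo["percent_female"] = demo["count_female"] / demo["count_participants"] * 100

# Summary statistics (count, mean, std, min, quartiles, max).
summary = demo[["count_participants", "count_female"]].describe()

# The two ZIP codes with the most participants.
top = demo.sort_values("count_participants", ascending=False).head(2)

# Pearson correlation between the two count columns.
corr = demo["count_participants"].corr(demo["count_female"])

print(summary)
print(top[["zip_code", "count_participants", "percent_female"]])
print(f"Correlation between the two counts: {corr:.2f}")
```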

5. To create a predictive model, it is very important to check the nature of the data, because the models that are applicable change drastically with the nature of the data.

Example: linear regression or a linear SVM can be applied to data that is roughly linear in nature.
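
As a rough sketch of the linear case, here is a one-variable linear regression fitted with NumPy on the rooms-versus-price example from point 4. The numbers are made up and deliberately lie on a straight line. Real data will be noisier, and a library such as scikit-learn would be the usual choice for anything larger.

```python
import numpy as np

# Made-up data: number of rooms and house price (in thousands).
rooms = np.array([1, 2, 3, 4, 5], dtype=float)
price = np.array([110, 150, 190, 230, 270], dtype=float)

# Check how linear the relationship is before picking a linear model.
r = np.corrcoef(rooms, price)[0, 1]

# Least-squares fit of price = slope * rooms + intercept.
slope, intercept = np.polyfit(rooms, price, deg=1)

# Predict the price of a hypothetical 6-room house.
predicted = slope * 6 + intercept

print(f"correlation: {r:.3f}")
print(f"price = {slope:.1f} * rooms + {intercept:.1f}")
print(f"predicted price for 6 rooms: {predicted:.1f}")
```

If the correlation is weak or a scatter plot shows a curve, a linear model is probably the wrong tool and something else should be tried.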

