In: Statistics and Probability
In this exercise, we will look at descriptive statistics and how to explore and summarize data sets. For this, we use the Heart Disease dataset from the UCI data repository. This dataset consists of 4 small datasets of people with heart disease admitted to 4 hospitals.
For now, we only work with the file. this data consists of 271 instances with 7 attributes. The attributes are described as below:
Age: age in years
sex: 1 = male; 0 = female
cp: chest pain type
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic
Trestbps: resting blood pressure
Chol: cholesterol level
Thalach: maximum heart rate achieved
heart_problem: 1= have heart problem; 0=No heart problem
Instruction: Use Microsoft Excel to do your work. Please submit your work as ONE MS excel file and create one tab for each question. Show your work as rigorously as possible. name the file as lastname_fastname_hw1.excel.
Using the attached data, answer the following questions:
1. How many patients have heart disease? (0.5)
2. What is the average Cholesterol level of people with heart disease and without heart disease? What is the standard deviation? (1)
3. What is the median and average age of people with,
a. cholesterol higher than 240.0? (0.5)
b. cholesterol higher than 240.0 with heart disease? (0.5)
c. cholesterol higher than 240.0 without heart disease? (0.5)
4. Create a histogram of resting blood pressure. (1)
5. Create boxplots based on the sex of the patients for the following attributes:
a. cholesterol level (1.5)
b. maximum heart rate achieved (1.5)
6. For each Box plot, answer the following questions:
a. What is the H-Spread (Q3-Q1) of cholesterol level for male and females? (0.5)
b. What are the Lower Hinge and Upper Hinge values for maximum heart rate for male and female? (0.5)
7. In order to find if two attributes are related and their values change together, we can use Scatter plot. Follow the instruction below and answer the questions:
a. Create two scatter plots of age and resting blood pressure for people with heart disease and without heart disease. Is there any visual correlation? (1+1)
b. Calculate the average resting blood pressure of each age (HINT : Use Groupby for age) for people with heart disease. (1)
c. Calculate the average resting blood pressure of each age (HINT : Use Groupby for age) for people without heart disease. (1)
d. Now create two scatter plots using the previous results. Do you observe a correlation now? Do people without heart disease have higher blood pressure as they age than people with heart disease? (2)
8.Compare the resting blood pressure of people with heart disease and without. (1)
LINK TO Data set
https://docs.google.com/document/d/1KYER8cMeWPcOlMJpegWNIDAF4maIAthKTM3Hrpr8rxk/edit?usp=sharing
1 |
Count of patients having heart disease : |
101 |
|
2 |
Average cholestrol level of people having heart disease |
269.1881188 |
|
Average cholestrol level of people not having heart disease |
239.9529412 |
||
Standard deviation of cholestrol level pf people |
67.65771142 |
||
3 |
Median age |
Average age |
|
Cholestrol > 240 |
49 |
48.35251799 |
|
Cholestrol > 240 having heart disease |
50 |
49.41935484 |
|
Cholestrol > 240 with no heart disease |
48 |
47.49350649 |
For solution 3, we create subsets for the three categories so
that we can calculated median and average for the three categories.
We create these categories by applying filter on the columns
Cholestrol dummy variable (>240) and heart problem. First we
create dummy for Cholestrol by adding for each row one variable
“=IF(<chol_variable> > 240,1,0)” and get the following
dummy(for few rows).
Cholestrol(>240) |
0 |
1 |
0 |
0 |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
1 |
1 |
0 |
1 |
Chol > 240 |
Chol > 240 heart patients |
Chol > 240 no heart disease |
29 |
31 |
29 |
32 |
33 |
32 |
33 |
35 |
33 |
35 |
36 |
35 |
35 |
38 |
35 |
36 |
40 |
36 |
37 |
41 |
37 |
37 |
43 |
37 |
37 |
46 |
37 |
38 |
48 |
38 |
38 |
48 |
38 |
38 |
48 |
38 |
39 |
49 |
39 |
39 |
50 |
39 |
39 |
50 |
39 |
39 |
51 |
39 |
40 |
57 |
40 |
40 |
59 |
40 |
40 |
60 |
40 |
41 |
65 |
41 |
41 |
32 |
41 |
41 |
39 |
41 |
41 |
40 |
41 |
41 |
43 |
41 |
41 |
48 |
41 |
42 |
48 |
42 |
42 |
48 |
42 |
43 |
53 |
43 |
43 |
54 |
43 |
44 |
54 |
44 |
45 |
55 |
45 |
46 |
57 |
46 |
46 |
58 |
46 |
47 |
44 |
47 |
47 |
44 |
47 |
47 |
46 |
47 |
47 |
47 |
47 |
48 |
49 |
48 |
48 |
52 |
48 |
48 |
52 |
48 |
48 |
52 |
48 |
49 |
52 |
49 |
49 |
53 |
49 |
50 |
53 |
50 |
52 |
55 |
52 |
52 |
55 |
52 |
52 |
55 |
52 |
53 |
56 |
53 |
53 |
56 |
53 |
53 |
59 |
53 |
53 |
65 |
53 |
53 |
41 |
53 |
54 |
43 |
54 |
54 |
44 |
54 |
54 |
47 |
54 |
54 |
50 |
54 |
54 |
52 |
54 |
54 |
52 |
54 |
54 |
54 |
54 |
55 |
56 |
55 |
55 |
58 |
55 |
55 |
65 |
55 |
55 |
55 |
|
55 |
55 |
|
55 |
55 |
|
55 |
55 |
|
56 |
56 |
|
57 |
57 |
|
57 |
57 |
|
57 |
57 |
|
58 |
58 |
|
59 |
59 |
|
59 |
59 |
|
60 |
60 |
|
61 |
61 |
|
61 |
61 |
|
62 |
62 |
|
31 |
||
33 |
||
35 |
||
36 |
||
38 |
||
40 |
||
41 |
||
43 |
||
46 |
||
48 |
||
48 |
||
48 |
||
49 |
||
50 |
||
50 |
||
51 |
||
57 |
||
59 |
||
60 |
||
65 |
||
32 |
||
39 |
||
40 |
||
43 |
||
48 |
||
48 |
||
48 |
||
53 |
||
54 |
||
54 |
||
55 |
||
57 |
||
58 |
||
44 |
||
44 |
||
46 |
||
47 |
||
49 |
||
52 |
||
52 |
||
52 |
||
52 |
||
53 |
||
53 |
||
55 |
||
55 |
||
55 |
||
56 |
||
56 |
||
59 |
||
65 |
||
41 |
||
43 |
||
44 |
||
47 |
||
50 |
||
52 |
||
52 |
||
54 |
||
56 |
||
58 |
||
65 |
4 First we create BINS for Histogram :
Here, MIN = 98, MAX = 190
So the bins created are :
110 |
125 |
140 |
155 |
170 |
185 |
Then click on Data Tab -> Data analysis Toolpack -> Histogram -> Input Range : select the column trestbps -> Bin Range : select the above values -> Tick chart output -> OK
Bin |
Frequency |
95-110 |
11 |
110-125 |
30 |
125-140 |
26 |
140-155 |
9 |
155-170 |
4 |
170-185 |
1 |
More |
1 |
Bins are edited manually to make it understandable.(Else it was
just 110,125,….,185,more)