Please find one medical dataset that is suitable for correlation, logistic regression and linear regression.
For example, here are two graphs.
For the first, I dusted off the elliptical machine in our basement and measured my pulse after one minute of ellipticizing at various speeds:
Speed, kph | Pulse, bpm |
---|---|
0 | 57 |
1.6 | 69 |
3.1 | 78 |
4 | 80 |
5 | 85 |
6 | 87 |
6.9 | 90 |
7.7 | 92 |
8.7 | 97 |
12.4 | 108 |
15.3 | 119 |
Graph of my pulse rate vs. speed on an elliptical exercise machine.
For the second graph, I dusted off some data from McDonald (1989): I collected the amphipod crustacean Platorchestia platensis on a beach near Stony Brook, Long Island, in April, 1987, removed and counted the number of eggs each female was carrying, then freeze-dried and weighed the mothers:
Weight, mg | Eggs |
---|---|
5.38 | 29 |
7.36 | 23 |
6.13 | 22 |
4.75 | 20 |
8.10 | 25 |
8.62 | 25 |
6.30 | 17 |
7.44 | 24 |
7.26 | 20 |
7.17 | 27 |
7.78 | 24 |
6.23 | 21 |
5.42 | 22 |
7.87 | 22 |
5.25 | 23 |
7.37 | 35 |
8.01 | 27 |
4.92 | 23 |
7.03 | 25 |
6.45 | 24 |
5.06 | 19 |
6.72 | 21 |
7.00 | 20 |
9.39 | 33 |
6.49 | 17 |
6.34 | 21 |
6.16 | 25 |
5.74 | 22 |
Graph of number of eggs vs. dry weight in the amphipod Platorchestia platensis.
There are three things you can do with this kind of data:
(1) One is a hypothesis test:
The test asks whether there is an association between the two variables; in other words, as the X variable goes up, does the Y variable tend to change (up or down)? For the exercise data, you'd want to know whether pulse rate was significantly higher at higher speeds. The P value is 1.3×10⁻⁸, but the relationship is so obvious from the graph, and so biologically unsurprising (of course my pulse rate goes up when I exercise harder!), that the hypothesis test wouldn't be a very interesting part of the analysis. For the amphipod data, you'd want to know whether bigger females had more eggs or fewer eggs than smaller females, which is neither biologically obvious nor obvious from the graph. It may look like a random scatter of points, but there is a significant relationship (P=0.015).
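The passage does not say how the P values were obtained. One standard route, assumed here, is to turn the correlation coefficient into a t statistic, t = r·√(n−2)/√(1−r²), with n−2 degrees of freedom. The sketch below does this with the standard library only; the helper names (`pearson_r`, `two_tailed_p`, `slope_test`) are made up for illustration, and the tail area of the t distribution is approximated numerically with Simpson's rule rather than taken from a statistics package.

```python
import math

# Data copied from the two tables above.
speed = [0, 1.6, 3.1, 4, 5, 6, 6.9, 7.7, 8.7, 12.4, 15.3]
pulse = [57, 69, 78, 80, 85, 87, 90, 92, 97, 108, 119]
weight = [5.38, 7.36, 6.13, 4.75, 8.10, 8.62, 6.30, 7.44, 7.26, 7.17,
          7.78, 6.23, 5.42, 7.87, 5.25, 7.37, 8.01, 4.92, 7.03, 6.45,
          5.06, 6.72, 7.00, 9.39, 6.49, 6.34, 6.16, 5.74]
eggs = [29, 23, 22, 20, 25, 25, 17, 24, 20, 27,
        24, 21, 22, 22, 23, 35, 27, 23, 25, 24,
        19, 21, 20, 33, 17, 21, 25, 22]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_tailed_p(t, df, steps=2000):
    """Approximate two-tailed P for Student's t (assumes df >= 2).

    The upper tail is integrated with Simpson's rule after substituting
    x = 1/u, which maps [|t|, infinity) onto (0, 1/|t|].
    """
    if df < 2:
        raise ValueError("this sketch assumes at least 2 degrees of freedom")
    t = abs(t)
    if t == 0:
        return 1.0
    # Log of the normalising constant of the t density.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def g(u):
        if u == 0:
            return 0.0  # limit of the transformed integrand when df >= 2
        return math.exp(log_c
                        - (df + 1) / 2 * math.log1p(1 / (df * u * u))
                        - 2 * math.log(u))

    upper = 1 / t
    h = upper / steps  # steps must be even for Simpson's rule
    total = g(0) + g(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    return min(1.0, 2 * total * h / 3)


def slope_test(xs, ys):
    """Return (t, df, P) for the null hypothesis of no linear association."""
    r = pearson_r(xs, ys)
    df = len(xs) - 2
    t = r * math.sqrt(df / (1 - r * r))
    return t, df, two_tailed_p(t, df)


t_ex, df_ex, p_ex = slope_test(speed, pulse)
t_am, df_am, p_am = slope_test(weight, eggs)
print(f"exercise: t = {t_ex:.2f}, df = {df_ex}, P = {p_ex:.2g}")
print(f"amphipod: t = {t_am:.2f}, df = {df_am}, P = {p_am:.2g}")
```

If the assumption about the test is right, the two printed P values should land close to the 1.3×10⁻⁸ and 0.015 quoted above. In practice a statistics package would replace the hand-rolled integration.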
(2) The second goal is to describe how tightly the two variables are associated:
This is usually expressed with r, which ranges from −1 to 1, or r², which ranges from 0 to 1. For the exercise data, there's a very tight relationship, as shown by the r² of 0.98; this means that if you knew my speed on the elliptical machine, you'd be able to predict my pulse quite accurately. The r² for the amphipod data is a lot lower, at 0.21; this means that even though there's a significant relationship between female weight and number of eggs, knowing the weight of a female wouldn't let you predict the number of eggs she had with very much accuracy.
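As a rough check on those two figures, r can be computed directly from the tabulated data as the sum of cross-products divided by the square root of the product of the two sums of squares, and then squared. This is a minimal sketch using only the standard library; `r_and_r2` is a hypothetical helper name, not something from the source.

```python
import math

# Data copied from the two tables above.
speed = [0, 1.6, 3.1, 4, 5, 6, 6.9, 7.7, 8.7, 12.4, 15.3]
pulse = [57, 69, 78, 80, 85, 87, 90, 92, 97, 108, 119]
weight = [5.38, 7.36, 6.13, 4.75, 8.10, 8.62, 6.30, 7.44, 7.26, 7.17,
          7.78, 6.23, 5.42, 7.87, 5.25, 7.37, 8.01, 4.92, 7.03, 6.45,
          5.06, 6.72, 7.00, 9.39, 6.49, 6.34, 6.16, 5.74]
eggs = [29, 23, 22, 20, 25, 25, 17, 24, 20, 27,
        24, 21, 22, 22, 23, 35, 27, 23, 25, 24,
        19, 21, 20, 33, 17, 21, 25, 22]


def r_and_r2(xs, ys):
    """Return (r, r squared) for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # cross-products
    sxx = sum((x - mx) ** 2 for x in xs)                    # sum of squares, X
    syy = sum((y - my) ** 2 for y in ys)                    # sum of squares, Y
    r = sxy / math.sqrt(sxx * syy)
    return r, r * r


r_ex, r2_ex = r_and_r2(speed, pulse)
r_am, r2_am = r_and_r2(weight, eggs)
print(f"exercise: r = {r_ex:.3f}, r^2 = {r2_ex:.2f}")
print(f"amphipod: r = {r_am:.3f}, r^2 = {r2_am:.2f}")
```

Both r values are positive, so in each data set Y tends to rise with X; r² only says how tight that trend is.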
(3) The final goal is to determine the equation of a line that goes through the cloud of points:
The equation of a line is given in the form Ŷ=a+bX, where Ŷ is the value of Y predicted for a given value of X, a is the Y intercept (the value of Y when X is zero), and b is the slope of the line (the change in Ŷ for a change in X of one unit). For the exercise data, the equation is Ŷ=63.5+3.75X; this predicts that my pulse would be 63.5 bpm when the speed of the elliptical machine is 0 kph, and that my pulse would go up by 3.75 bpm for every 1 kph increase in speed. This is probably the most useful part of the analysis for the exercise data; if I wanted to exercise with a particular level of effort, as measured by pulse rate, I could use the equation to predict the speed I should use. For the amphipod data, the equation is Ŷ=12.7+1.60X, or about 1.6 more eggs for each additional milligram of dry weight. For most purposes, just knowing that bigger amphipods have significantly more eggs (the hypothesis test) would be more interesting than knowing the equation of the line, but it depends on the goals of your experiment.
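A minimal sketch of fitting that line by ordinary least squares, assuming that is how the quoted equations were obtained: the slope is the sum of cross-products divided by the sum of squares of X, and the intercept makes the line pass through the two means. `fit_line` and the 100 bpm target are illustrative choices, not part of the source. The fitted coefficients should come out close to the quoted ones, though the last digit may differ slightly because the tabulated values are rounded.

```python
# Data copied from the two tables above.
speed = [0, 1.6, 3.1, 4, 5, 6, 6.9, 7.7, 8.7, 12.4, 15.3]
pulse = [57, 69, 78, 80, 85, 87, 90, 92, 97, 108, 119]
weight = [5.38, 7.36, 6.13, 4.75, 8.10, 8.62, 6.30, 7.44, 7.26, 7.17,
          7.78, 6.23, 5.42, 7.87, 5.25, 7.37, 8.01, 4.92, 7.03, 6.45,
          5.06, 6.72, 7.00, 9.39, 6.49, 6.34, 6.16, 5.74]
eggs = [29, 23, 22, 20, 25, 25, 17, 24, 20, 27,
        24, 21, 22, 22, 23, 35, 27, 23, 25, 24,
        19, 21, 20, 33, 17, 21, 25, 22]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (a, b) for Y-hat = a + b*X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # forces the line through (mean X, mean Y)
    return a, b


a_ex, b_ex = fit_line(speed, pulse)
a_am, b_am = fit_line(weight, eggs)
print(f"exercise: Y-hat = {a_ex:.1f} + {b_ex:.2f} X")
print(f"amphipod: Y-hat = {a_am:.1f} + {b_am:.2f} X")

# Using the line "backwards": which speed gives a chosen pulse?
target_pulse = 100  # bpm; hypothetical target picked for illustration
speed_for_target = (target_pulse - a_ex) / b_ex
print(f"speed for {target_pulse} bpm: about {speed_for_target:.1f} kph")
```

Solving the equation for X, as in the last lines, is the "predict the speed I should use" step described above; it is only sensible within the range of speeds that were actually measured.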