Question

In building a regression tree, instead of the mean we can use the median, and instead of minimizing the squared error we can minimize the absolute error. Why does this help in the case of noise?

Expert Solution

In building a regression tree, we can use the median instead of the mean, and we can minimize the absolute error instead of the squared error. Why does this help in the case of noise?
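To make the question concrete, here is a minimal sketch of a one-split regression tree (a stump) fitted twice: once with the mean and squared error, once with the median and absolute error. The data, the `best_split` helper, and the `min_leaf` parameter are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

def best_split(xs, ys, center, loss, min_leaf=2):
    """Try every threshold on x; return (threshold, left_value, right_value)
    for the split with the smallest total loss."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_value, right_value = center(left), center(right)
        total = (sum(loss(y - left_value) for y in left)
                 + sum(loss(y - right_value) for y in right))
        if best is None or total < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (total, threshold, left_value, right_value)
    return best[1:]

# Hypothetical data: y is about 1 for x <= 4 and about 5 for x >= 5,
# with a single noisy reading (y = 40) at x = 8.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 40.0]

mse_split = best_split(xs, ys, mean, lambda r: r * r)  # mean + squared error
mae_split = best_split(xs, ys, median, abs)            # median + absolute error

print("squared error :", mse_split)
print("absolute error:", mae_split)
```

On this toy data the absolute-error stump splits between the two plateaus (threshold 4.5) and predicts values close to 1 and 5. The squared-error stump moves its split to 6.5 to limit the damage done by the noisy reading, and its right-hand leaf value is pulled far above 5.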

Noise is the part of the residual that is infeasible to model by any means other than a purely statistical description. Such modelling limitations also arise from limitations of the measurement device (e.g. finite bandwidth and resolution).

Error is the component of the residual that remains after accounting for the noise.

According to these definitions:

a) Noise and error are uncorrelated.

b) The residual may be reduced either by reducing noise or by reducing error.

c) These definitions are compatible with the intuitive statements that "noise does not introduce bias" and "bias is a class of error".

Regression analysis is a quantitative research method which is used when the study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of relationships between a dependent variable and one or more independent variables.

The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).

A regression model specifies the relation of the dependent variable (Y) to a function of the independent variables (X) and the unknown parameters (β):

Y ≈ f (X, β)

The regression equation can be used to predict the value of ‘y’ for a given value of ‘x’, where ‘x’ and ‘y’ are paired measurements from a sample of size ‘n’. The regression equation is

$\hat{y} = a + bx$

where $a$ is the y-intercept and $b$ is the slope of the line.
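The usual least-squares choice of the intercept a and the slope b in y = a + bx can be sketched as follows. This uses the standard closed-form formulas for simple linear regression; the function name `least_squares_line` and the sample points are assumptions made for illustration, not taken from the original answer.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Intercept: the line passes through the point of means
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical points lying exactly on y = 1 + 2x
a, b = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the sample points lie exactly on a line, the fit should recover that line's intercept and slope.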

Do not be intimidated by the visual complexity of the correlation and regression formulae above. You do not have to apply the formulae manually; correlation and regression analyses can be run in popular analytical software such as Microsoft Excel, Microsoft Access, SPSS and others.

Linear regression analysis is based on the following set of assumptions:

1. Assumption of linearity. There is a linear relationship between dependent and independent variables.

2. Assumption of homoscedasticity. The residuals have equal (constant) variance across all values of the independent variables.

3. Assumption of absence of collinearity or multicollinearity. There is no strong correlation between two or more independent variables.

4. Assumption of normal distribution. The residuals are normally distributed.

Cherly Dixon is interested in whether students who consume more caffeine also study more. She randomly selects 2 students at their school and records each student's caffeine intake (mg) and the number of hours spent studying.

Let x = caffeine consumed (mg) and y = hours spent studying.

The least-squares regression line for the data has the form

y = mx + b

This is computer output from a least-squares regression analysis on the data:

Predictor	Coef	SE Coef	T	P
Constant	2.544	0.134	18.955	0.000
Caffeine (mg)	0.164	0.057	2.862	0.005

S = 1.532	R-Sq = 60.032%	R-Sq(adj) = 58.621%
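Reading the output as b = 2.544 (the Constant row) and m = 0.164 (the Caffeine row), the fitted line can be used for prediction as in the sketch below. The function name `predicted_hours` and the 10 mg input are hypothetical.

```python
INTERCEPT = 2.544   # "Constant" coefficient from the output
SLOPE = 0.164       # "Caffeine (mg)" coefficient from the output
SE_SLOPE = 0.057    # standard error of the slope from the output

def predicted_hours(caffeine_mg):
    """Predicted hours of study for a given caffeine intake in mg."""
    return INTERCEPT + SLOPE * caffeine_mg

# The T column is the coefficient divided by its standard error.
t_slope = SLOPE / SE_SLOPE

print(predicted_hours(10))
print(round(t_slope, 3))
```

The recomputed t value differs slightly from the 2.862 in the output because the printed coefficients are rounded to three decimals.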

The median-median line traces back to a line-fitting approach proposed by Wald (1940). He suggested a very simple method where the points on a scatterplot are separated into a left half and a right half based on the median of the x-scores in a sample of bivariate data. The means of the x-scores and the y-scores are calculated using the data from only the left half of the scatterplot and then calculated using only the data on the right half. Concentrating on the two points, $(\bar{x}_L, \bar{y}_L)$ and $(\bar{x}_R, \bar{y}_R)$, Wald proposed finding a line connecting these points that is then adjusted up or down to better fit the full array of points on the scatterplot. (The subscripts R and L denote which half of the data is used and no subscript implies the means are taken over the entire sample.) His line of fit has slope $b = (\bar{y}_R - \bar{y}_L) / (\bar{x}_R - \bar{x}_L)$ and y-intercept $a = \bar{y} - b\bar{x}$.
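A minimal sketch of Wald's two-half fit follows. It assumes the slope is the difference of the right-half and left-half y-means divided by the difference of the x-means, and that the intercept makes the line pass through the overall means. The function name `wald_line` and the handling of an odd sample size (the middle point is left out of both halves) are assumptions made here, not details from the text.

```python
from statistics import mean

def wald_line(xs, ys):
    """Return (intercept, slope) of a two-half line of fit based on means."""
    pairs = sorted(zip(xs, ys))
    half = len(pairs) // 2
    left, right = pairs[:half], pairs[-half:]   # middle point skipped if n is odd
    x_left, y_left = mean(x for x, _ in left), mean(y for _, y in left)
    x_right, y_right = mean(x for x, _ in right), mean(y for _, y in right)
    slope = (y_right - y_left) / (x_right - x_left)
    intercept = mean(ys) - slope * mean(xs)     # line passes through overall means
    return intercept, slope

# Hypothetical points lying exactly on y = 2x
intercept, slope = wald_line([1, 2, 3, 4], [2, 4, 6, 8])
print(intercept, slope)
```

Because this version is built on means, a single extreme y-value in either half still shifts the fitted line.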

A similar procedure suggested by Nair and Shrivastava (1942) breaks up the points on a scatterplot into three regions with each region containing about the same number of points. The means of the x and y points in the left and right regions are used to find the slope of the line of fit much as Wald suggested.

Brown and Mood (1951) used the two-region approach but found the slope of the line of fit using medians in place of means. The primary advantage of using this measure of center comes from the median’s inherent ability to resist the strong effect of outliers. Most students of statistics know that the mean can be affected greatly by outliers since they are included with equal weight with the rest of the data in the sum of the scores. But the median takes on the same value whether the largest score in a data set is just somewhat larger than the rest of the data or is much larger than the second biggest score. In the context of fitting points on a scatterplot, this implies that a single point far from the general sloping trend of the rest of the points would not apply such a large “tug” on the location of the line of fit if the median is used to find the line.
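The resistance of the median can be checked on a small hypothetical data set in which only the largest score changes. The numbers below are made up for illustration.

```python
from statistics import mean, median

base = [10, 11, 12, 13]
mild = base + [15]       # largest score only somewhat larger than the rest
extreme = base + [1000]  # largest score far beyond the second biggest

for label, scores in (("mild", mild), ("extreme", extreme)):
    print(label, "mean =", mean(scores), "median =", median(scores))
```

The median is 12 in both cases, while the mean moves from 12.2 to 209.2.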

Readers may recall that the least squares line is found by minimizing the sum of the squared distances that each point lies from the line. Since these distances are squared, Hartwig and Dearing (1979, p. 34) noted that “… cases lying farther and farther from the regression line increase the sum of the squared residuals at an increasing rate ... [and the line] will have to come reasonably close to them to satisfy the least squares criterion and, therefore, the least squares regression line will lack resistance to the excessive influence of a few atypical cases.” This means that Brown and Mood’s method is not only simple to apply, but also has the advantage of not allowing outlying cases to have undue impact on determining the line of fit.
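The link between the loss and the fitted value can be sketched with a brute-force search: for a single constant prediction c, the sum of squared residuals is smallest at the mean, and the sum of absolute residuals is smallest at the median. The data and the integer search grid are hypothetical.

```python
from statistics import mean, median

def total_loss(values, c, loss):
    """Sum of loss(residual) when every value is predicted by the constant c."""
    return sum(loss(v - c) for v in values)

values = [1, 2, 3, 4, 100]   # one atypical case
candidates = range(0, 101)   # integer grid of candidate predictions

best_squared = min(candidates, key=lambda c: total_loss(values, c, lambda r: r * r))
best_absolute = min(candidates, key=lambda c: total_loss(values, c, abs))

print(best_squared, mean(values))      # squared error picks the mean
print(best_absolute, median(values))   # absolute error picks the median
```

The atypical value 100 drags the squared-error optimum to 22, while the absolute-error optimum stays at 3, among the typical values.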

Like Brown and Mood, Tukey (1971) utilized the medians in finding his line of fit, but he did so borrowing the three-region approach of Nair and Shrivastava. His line of fit, called the resistant line, is considered a basic methodology of exploratory data analysis. The median-median line provides the first iteration in the procedure to find Tukey’s resistant line. To obtain Tukey’s resistant line, the residuals are used to adjust the parameters in an iterative fashion. Appendix A provides the algorithm used by Texas Instruments to compute the median-median line. Tukey’s resistant line may be obtained using Minitab under the EDA submenu of Stat.

For those who wish to learn more about the broad methodology of exploratory data analysis, Velleman and Hoaglin (1981) provide a gentle review of the entire subject. To learn more about the median-median line at a more sophisticated mathematical level, readers are encouraged to consider Emerson and Hoaglin (1983) and Johnstone and Velleman (1985).

3. Examples

In order to illustrate the comparative performance of the median-median line and the least squares line, we consider three example data sets from popular introductory statistics texts. The first contains no outliers. The second illustrates the impact of influential observations and outliers on the least squares line. The third example examines the data in two ways: one way has a single extreme influential observation while the other has a single outlier.

3.1 Manatees and Motor Boats

An investigation of the relationship between the number of manatees killed by boats in Florida and the number of powerboat registrations in that state, from 1977 to 1990 (Moore, 1997, p 347), shows a strong positive correlation with no obvious outliers or influential points (r = 0.941, p-value < 0.001). To calculate the equation of the median-median line, one first divides the data into three regions by x-score (in this case, the number of powerboat registrations, in thousands) (see Table 1). In the case when there are ties in the x-scores that would result in these points being in different regions, all points with the same x-score are placed in an outer region. This may result in the middle region having less than 1/3 of the observations (see Appendix A and Appendix B). While this does not occur with the data for Example 3.1, it does happen with the data for Example 3.2.

Table 1. Manatee data divided into three regions based on x-score. In each region, the first column x is the number of powerboat registrations, in thousands, and the second column y is the number of manatees killed.

	Region 1		Region 2		Region 3	
	x	y	x	y	x	y
	447	13	513	24	614	33
	460	21	526	15	645	39
	481	24	559	34	675	43
	498	16	585	33	711	50
	512	20			719	47
median	481	20	542.5	28.5	675	43

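Once the data are divided into three regions, the median of the x- and median of the y-scores are calculated for each region. The resulting three points for these data are termed the left, middle, and right median points, $(x_L, y_L)$, $(x_M, y_M)$, and $(x_R, y_R)$.

The slope of the median-median line is the slope of the line passing through the left and right median points, $(x_L, y_L)$ and $(x_R, y_R)$.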
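As a sketch, the region medians and the slope for the manatee data can be computed directly from the values listed in Table 1. The intercept rule used here (averaging the intercepts through the three median points, which is equivalent to shifting the outer line one-third of the way towards the middle median point) is an assumption, since the Appendix A algorithm is not reproduced in this text.

```python
from statistics import median

# (x, y) pairs per region, copied from Table 1
regions = [
    [(447, 13), (460, 21), (481, 24), (498, 16), (512, 20)],
    [(513, 24), (526, 15), (559, 34), (585, 33)],
    [(614, 33), (645, 39), (675, 43), (711, 50), (719, 47)],
]

# Median of the x-scores and median of the y-scores, taken separately
medians = [(median(x for x, _ in r), median(y for _, y in r)) for r in regions]
(x_l, y_l), (x_m, y_m), (x_r, y_r) = medians

# Slope of the line through the left and right median points
slope = (y_r - y_l) / (x_r - x_l)

# Assumed intercept rule: average of the three intercepts
intercept = ((y_l + y_m + y_r) - slope * (x_l + x_m + x_r)) / 3

print(medians)
print(round(slope, 4), round(intercept, 2))
```

Under these assumptions the slope is 23/194, roughly 0.119 manatees killed per thousand powerboat registrations.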

A three median regression line is an alternative and graphical approach to the least squares regression line, to find a relationship between two variables. A three median regression line can be fitted if the variables are numerical and the relationship is linear. This method is especially useful when there are outliers as it is not easily affected by them.

To find the regression line using the three median method:

1. Divide the data into three groups.

Note: The number of data points in the two outside groups must always be the same. If there is one point left over, put it in the middle group. If there are two points left over, put one in each outside group.

2. Locate the median of each group of points.
3. Place your ruler or any straight edge on the right and left median points.
4. Move the ruler one-third of the way towards the middle median point.
5. Find the gradient, $m = (y_u - y_l) / (x_u - x_l)$, where u = upper value and l = lower value.
6. Find or calculate the y-intercept.
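The steps above can be sketched in code as follows. The function name `three_median_line` and the sample data are hypothetical; the grouping follows the note about leftover points, and the one-third shift is applied as a vertical move of the line towards the middle median point.

```python
from statistics import median

def three_median_line(points):
    """Return (gradient, y_intercept) of the three-median line for (x, y) points."""
    pts = sorted(points)
    n = len(pts)
    # Outside groups are equal in size; one leftover point goes to the middle,
    # two leftover points go one to each outside group.
    outer = n // 3 + (1 if n % 3 == 2 else 0)
    groups = (pts[:outer], pts[outer:n - outer], pts[n - outer:])
    (x_l, y_l), (x_m, y_m), (x_u, y_u) = [
        (median(x for x, _ in g), median(y for _, y in g)) for g in groups
    ]
    gradient = (y_u - y_l) / (x_u - x_l)
    c_outer = y_l - gradient * x_l                # line through the outer medians
    gap = y_m - (gradient * x_m + c_outer)        # vertical distance to middle median
    return gradient, c_outer + gap / 3            # move one-third of the way

# Hypothetical data on y = 2x + 1, with one outlier at x = 9 (60 instead of 19)
data = [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 13), (7, 15), (8, 17), (9, 60)]
m, c = three_median_line(data)
print(m, c)
```

In this example the outlier does not change the median of its group, so the fitted line is still y = 2x + 1.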

Example:

y = the amount of fertilizer (gm) and x = crop yields.

The data is divided into three groups as represented by the double line separating the cells in the graph. The median for the first group is (2.5, 3.7), the second group is (6.5, 8) and the third group is (10.5, 11.3). Below are the three points when they are plotted, as well as the line when it is moved 1/3 towards the middle median point.

From this graph, it can be estimated that the gradient is approximately 1.8 and the value of c (the y-intercept) is approximately 1.75, and thus the equation is: y = 1.8x + 1.75

venereology answered 1 year ago
