In: Advanced Math
Q1. Define/outline the following:
a. the difference between a ‘one-tailed test’ and a ‘two-tailed test’.
b. the importance of sample size in the context of OLS regression.
*c. four different types of data structures and discuss their potential usefulness and application in finance.
d. the OLS assumptions
e. Correlation is not causation. Briefly discuss.
(c. is the part I'm confused most about)
(a) Differences Between a One-tailed and a Two-tailed Test
The fundamental differences between a one-tailed and a two-tailed test are:
1. A one-tailed test checks for an effect in one specified direction only (e.g. H1: μ > μ0). The entire rejection region sits in one tail of the sampling distribution.
2. A two-tailed test checks for an effect in either direction (H1: μ ≠ μ0). The rejection region is split between both tails, typically α/2 in each.
3. At the same significance level, a one-tailed test has more power to detect an effect in the specified direction, but it cannot detect an effect in the opposite direction, so it should be used only when that direction can be ruled out in advance.
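The distinction can be sketched numerically. This is a minimal illustration with an assumed z-statistic of 1.8, using the standard normal CDF; the numbers are hypothetical.

```python
# Hypothetical example: test whether a stock's mean daily return differs
# from zero, given an assumed z-statistic. Uses the standard normal CDF.
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

z = 1.8  # assumed z-statistic, for illustration only

# One-tailed test (H1: mean > 0): p-value is the area in the upper tail only.
p_one_tailed = 1.0 - normal_cdf(z)

# Two-tailed test (H1: mean != 0): area in both tails, twice the one-tail p.
p_two_tailed = 2.0 * (1.0 - normal_cdf(abs(z)))

print(round(p_one_tailed, 4))  # ~0.0359
print(round(p_two_tailed, 4))  # ~0.0719
```

Note that the two-tailed p-value is exactly twice the one-tailed value here, which is why the same z-statistic can be significant at the 5% level one-tailed but not two-tailed.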
(d) The Seven Classical OLS Assumptions
Assumption 1: The regression model is linear in the coefficients and the error term
This assumption addresses the functional form of the model. In statistics, a regression model is linear when all terms in the model are either the constant or a parameter multiplied by an independent variable. You build the model equation only by adding the terms together. These rules constrain the model to one type:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

In the equation, the betas (βs) are the parameters that OLS estimates, and epsilon (ε) is the random error term.
Assumption 2: The error term has a population mean of zero
The error term accounts for the variation in the dependent variable that the independent variables do not explain. Random chance should determine the values of the error term. For your model to be unbiased, the average value of the error term must equal zero.
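As a sketch of this point, with synthetic data: when the model includes an intercept, the OLS residuals average to (essentially) zero by construction, which is how the estimator absorbs any non-zero mean in the errors.

```python
# Minimal sketch (synthetic data): fit OLS with an intercept via numpy's
# least squares and check that the residuals average to zero.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(size=200)  # true model: beta0=2, beta1=3

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
print(np.isclose(residuals.mean(), 0.0))  # intercept forces mean residual ~ 0
```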
Assumption 3: All independent variables are uncorrelated with the error term
If an independent variable is correlated with the error term, we can use the independent variable to predict the error term, which violates the notion that the error term represents unpredictable random error. We need to find a way to incorporate that information into the regression model itself.
Assumption 4: Observations of the error term are uncorrelated with each other
One observation of the error term should not predict the next observation. For instance, if the error for one observation is positive and that systematically increases the probability that the following error is positive, that is a positive correlation. If the subsequent error is more likely to have the opposite sign, that is a negative correlation. This problem is known both as serial correlation and autocorrelation.
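A simple diagnostic for this is the lag-1 autocorrelation of the residuals. The sketch below uses simulated errors (an AR(1) process with an assumed coefficient of 0.8) to contrast independent errors with serially correlated ones.

```python
# Sketch: lag-1 autocorrelation check on two simulated error series.
# Values near zero suggest no serial correlation; values near +1/-1 do.
import numpy as np

rng = np.random.default_rng(1)
independent = rng.normal(size=500)

# Build serially correlated errors: each depends on the previous one (AR(1)).
rho = 0.8
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = rho * correlated[t - 1] + rng.normal()

def lag1_autocorr(e):
    """Correlation between each error and the one that follows it."""
    return np.corrcoef(e[:-1], e[1:])[0, 1]

print(round(lag1_autocorr(independent), 2))  # close to 0
print(round(lag1_autocorr(correlated), 2))   # close to 0.8
```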
Assumption 5: The error term has a constant variance (no heteroscedasticity)
The variance of the errors should be consistent for all observations. In other words, the variance does not change for each observation or for a range of observations. This preferred condition is known as homoscedasticity (same scatter). If the variance changes, we refer to that as heteroscedasticity (different scatter).
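A quick way to see heteroscedasticity is to compare the spread of the errors across the range of an explanatory variable. The sketch below uses simulated errors whose standard deviation is assumed to grow with x, so the spread in the upper half of the range is clearly larger.

```python
# Sketch (synthetic data): compare error spread across the range of x.
# Under homoscedasticity the spread would be roughly constant; here it grows.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 1000)
errors = rng.normal(scale=x)  # heteroscedastic: std dev proportional to x

low_x_spread = errors[x < 5].std()
high_x_spread = errors[x >= 5].std()
print(high_x_spread > 1.5 * low_x_spread)  # spread differs -> heteroscedastic
```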
Assumption 6: No independent variable is a perfect linear function of other explanatory variables
Perfect correlation occurs when two variables have a Pearson’s correlation coefficient of +1 or -1. When one of the variables changes, the other variable also changes by a completely fixed proportion. The two variables move in unison.
Perfect correlation suggests that two variables are different forms of the same variable. For example, games won and games lost have a perfect negative correlation (-1). The temperature in Fahrenheit and Celsius have a perfect positive correlation (+1).
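The temperature example can be verified directly: because Fahrenheit is a fixed linear function of Celsius, their Pearson correlation is exactly +1 (the temperature values below are arbitrary).

```python
# Sketch of the example above: Celsius and Fahrenheit are perfectly
# positively correlated because one is a fixed linear function of the other.
import numpy as np

celsius = np.array([-10.0, 0.0, 15.0, 25.0, 40.0])  # arbitrary sample values
fahrenheit = celsius * 9.0 / 5.0 + 32.0

r = np.corrcoef(celsius, fahrenheit)[0, 1]
print(round(r, 6))  # 1.0 -- perfect positive correlation
```

A regressor that is a perfect linear function of another, like these two, makes the OLS coefficients unidentifiable, which is exactly why assumption 6 rules it out.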
Assumption 7: The error term is normally distributed (optional)
OLS does not require that the error term follow a normal distribution to produce unbiased estimates with the minimum variance; the Gauss-Markov theorem only needs the assumptions above. Normality matters mainly for hypothesis tests and confidence intervals in small samples, which is why this assumption is labelled optional.
(b) Sample Size as an Optimisation Problem
Sample size determines how precisely OLS can estimate the regression coefficients: standard errors shrink roughly in proportion to 1/√n, so larger samples give narrower confidence intervals and more powerful hypothesis tests. Choosing a sample size is therefore an optimisation problem: collecting data is costly, so you want the smallest n that still delivers the precision or power you need. For multiple regression, a common rule of thumb is at least 10-20 observations per estimated parameter; with too few observations the model overfits the sample and predicts poorly out of sample. If you are going to use ordinary least squares, you also require that the true residuals be independent, since dependence effectively reduces the information each observation contributes. Many of the sample size/precision/power issues for multiple linear regression are best understood by first considering the simple linear regression of Y on a single X and then extending the reasoning to additional predictors.
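The 1/√n effect can be sketched in the simple regression setting with simulated data (the true coefficients and sample sizes below are assumed for illustration):

```python
# Sketch: the standard error of an OLS slope shrinks as the sample grows
# (synthetic data; the point is the roughly 1/sqrt(n) rate).
import numpy as np

rng = np.random.default_rng(3)

def slope_se(n):
    """Analytic OLS slope standard error for one simulated sample of size n."""
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)       # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)  # coefficient covariance matrix
    return np.sqrt(cov[1, 1])              # standard error of the slope

print(slope_se(50) > slope_se(5000))  # larger sample -> smaller standard error
```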
(e) Correlation Does Not Imply Cause
It seems self-explanatory, but it is not always easy to understand exactly what this phrase means until you examine it carefully. First, it is important to understand what correlation and causation are. A correlation is a mutual relationship or connection between two variables. Causation is the relationship between cause and effect: when a cause results in an effect, that is causation. In other words, a correlation between two events or variables simply indicates that a relationship exists, whereas causation is more specific and says that one event actually causes the other.
When we say that correlation does not imply cause, we mean that just because you can see a connection or a mutual relationship between two variables, it doesn't necessarily mean that one causes the other. Of course, it might be the case that one event or variable causes the other, but we can't know that by looking at the correlation alone. More research would be necessary before that conclusion could be reached.
(c) Types of data structures
1. Arrays- An array stores a collection of items at contiguous memory locations. Items of the same type are stored together so that the position of each element can be calculated and retrieved easily. Arrays can be fixed-length or dynamically resizable.
2. Trees- A tree stores a collection of items in an abstract, hierarchical way. Each node is linked to other nodes and can have multiple sub-values, also known as children.
3. Graphs- A graph stores a collection of items in a non-linear fashion. Graphs are made up of a finite set of nodes, also known as vertices, and lines that connect them, also known as edges. These are useful for representing real-life systems such as computer networks.
4. Hash tables- A hash table, or a hash map, stores a collection of items in an associative array that maps keys to values. A hash table uses a hash function to convert each key into an index into an array of buckets, from which the desired data item can be fetched.
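In finance, the hash table is arguably the most immediately useful of the four: Python's dict gives average O(1) lookup by key. A minimal sketch with hypothetical ticker quotes:

```python
# Sketch: hash-table (dict) lookup in a finance setting -- mapping assumed
# ticker symbols to hypothetical prices gives average O(1) retrieval by key.
prices = {"AAPL": 189.30, "MSFT": 412.10, "GOOG": 152.75}

prices["TSLA"] = 248.50   # insert a new key/value pair
print(prices["MSFT"])     # 412.1 -- constant-time lookup by ticker
print("AMZN" in prices)   # False -- membership testing is also O(1) on average
```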
Basic data types
In finance, the classes int, float, and string provide the atomic data types.
Standard data structures
The classes tuple, list, dict, and set have many application areas in finance, with list being the most flexible workhorse in general.
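A short sketch of these four containers in a finance context (all values are hypothetical):

```python
# Sketch of the standard Python containers with hypothetical finance data.
trade = ("AAPL", 100, 189.30)          # tuple: immutable record of one trade
prices = [189.30, 190.10, 188.75]      # list: ordered, mutable price series
portfolio = {"AAPL": 100, "MSFT": 50}  # dict: holdings keyed by ticker
exchanges = {"NYSE", "NASDAQ", "NYSE"} # set: unique values only

prices.append(191.20)    # lists grow dynamically, hence their flexibility
print(len(exchanges))    # 2 -- the duplicate "NYSE" is dropped
```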
Arrays
A large class of finance-related problems and algorithms can be cast to an array setting; NumPy provides the specialized class numpy.ndarray, which offers convenience and compactness of code as well as high performance.
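As a sketch of this point, simple returns for a whole price series can be computed elementwise on an ndarray, without an explicit Python loop (the prices below are made up):

```python
# Sketch: a vectorized finance calculation with numpy.ndarray (assumed data).
# Simple returns for the whole series are computed without a Python loop.
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])
returns = prices[1:] / prices[:-1] - 1.0  # elementwise, vectorized

print(returns.round(4))  # first return is (102/100 - 1) = 0.02
```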
Both the basic data structures and the NumPy ones allow for highly vectorized implementations of algorithms. Depending on the specific shape of the data structures, care should be taken with regard to the memory layout of arrays; choosing the right layout can speed up code execution by a factor of two or more.
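The memory-layout point can be sketched as follows: NumPy arrays can be stored row-major ("C" order) or column-major ("F", Fortran order), and operations that traverse memory contiguously tend to be faster, so the layout should match the dominant access pattern.

```python
# Sketch: the same 2D data in row-major (C) and column-major (Fortran) layout.
import numpy as np

a_c = np.zeros((1000, 1000), order="C")  # rows are contiguous in memory
a_f = np.asfortranarray(a_c)             # same values, column-major layout

print(a_c.flags["C_CONTIGUOUS"], a_f.flags["F_CONTIGUOUS"])  # True True

# Summing down the columns (axis=0) walks memory contiguously in the Fortran
# layout, while summing along rows (axis=1) favors the C layout -- this is
# where the factor-of-two (or more) speed differences come from.
```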