Question

In: Computer Science

1. How does logistic regression map all outcomes to either 0 or 1? The equation for the log-likelihood function (LLF) is:

LLF = Σᵢ ( yᵢ log(p(xᵢ)) + (1 − yᵢ) log(1 − p(xᵢ)) )

How does logistic regression use this in maximum likelihood estimation?

2. We can apply PCA to reduce the number of features in a data set for model construction. But why do we still need regularization?
What is the difference between lasso and ridge regression? What is the role of the hyperparameter in regularization?

Solutions

Expert Solution

1. How does logistic regression map all outcomes to either 0 or 1?

Ans: The basis of logistic regression is the logistic function, also called the sigmoid function, which takes in any real-valued number and maps it to a value between 0 and 1.
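As a quick illustration (a minimal sketch in Python, not part of the original answer), the sigmoid σ(z) = 1 / (1 + e⁻ᶻ) squashes any real input into the open interval (0, 1):

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs map near 0, large positive inputs near 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```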

2. How does logistic regression use this in maximum likelihood estimation?

Ans: The parameters of a logistic regression model can be estimated by the probabilistic framework called maximum likelihood estimation. Under this framework, a probability distribution for the target variable (class label) must be assumed, and then a likelihood function is defined that calculates the probability of observing the outcome given the input data and the model.

The maximum likelihood approach to fitting a logistic regression model both aids in better understanding the form of the logistic regression model and provides a template that can be used for fitting classification models more generally. This is particularly true as the negative of the log-likelihood function used in the procedure can be shown to be equivalent to the cross-entropy loss function.

In order to use maximum likelihood, we need to assume a probability distribution. In the case of logistic regression, a binomial probability distribution is assumed for the data sample, where each example is one outcome of a Bernoulli trial. The Bernoulli distribution has a single parameter, the probability of a successful outcome (p), giving the two cases below (a short code sketch of the resulting log-likelihood follows the list):

  • P(y=1) = p
  • P(y=0) = 1 – p
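Tying this back to the LLF in the question, here is a minimal sketch (a hypothetical helper, not from the original answer) that evaluates the log-likelihood for a vector of predicted probabilities; its negative, averaged over examples, is exactly the cross-entropy loss mentioned above:

```python
import numpy as np

def log_likelihood(y, p):
    # LLF = sum_i ( y_i * log(p(x_i)) + (1 - y_i) * log(1 - p(x_i)) )
    # y: array of 0/1 labels; p: predicted P(y=1) for each example.
    eps = 1e-12                      # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.8, 0.6])
print(log_likelihood(y, p))            # closer to 0 means a better fit
print(-log_likelihood(y, p) / len(y))  # mean negative LLF = cross-entropy
```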

3. We can apply PCA to reduce the number of features in a data set for model construction. But why do we still need regularization?

Ans: Dimensionality reduction is the process through which we remove irrelevant features (those that do not contribute to the goal) and collapse a larger number of correlated variables into a smaller number of independent ones. Regularization is the process of penalizing complexity in a model so as to prevent overfitting and improve generalization.

PCA considers only the variance of the features (X) but not the relationship between features and labels while doing this compression. Regularization, on the other hand, acts directly on the relationship between features and labels and hence develops models which are better at explaining the labels given the features.
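As an illustration (a sketch assuming scikit-learn is available; the data and names are made up), the two techniques are complementary: PCA compresses X without ever looking at y, so a regularized model can still be fit on top of the compressed features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_reduced = PCA(n_components=3).fit_transform(X)  # unsupervised: ignores y
model = Ridge(alpha=1.0).fit(X_reduced, y)        # regularization on top of PCA
print(model.coef_)
```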

4. What is the difference between lasso and ridge regression?

Ans: A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.

The key difference between these two is the penalty term.

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function.
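Written out (a standard formulation in LaTeX, not quoted from the original answer), the ridge objective is:

\[
\min_{\beta}\; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]

where \(\hat{y}_i\) is the model's prediction for example \(i\).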

Here, if lambda is zero, we get back ordinary least squares (OLS). However, if lambda is very large, it will penalize the coefficients too heavily and lead to under-fitting. That said, it is important how lambda is chosen. This technique works very well to avoid the over-fitting issue.

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.
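In the same notation (again my reconstruction, not quoted from the original), the lasso objective changes only the penalty term:

\[
\min_{\beta}\; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]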

Again, if lambda is zero we get back OLS, whereas a very large value will drive coefficients to zero and hence under-fit.

The key difference between these techniques is that lasso shrinks the less important features' coefficients all the way to zero, removing some features altogether. So it works well for feature selection when we have a huge number of features.
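A small sketch (assuming scikit-learn; the data is synthetic) makes the difference visible: lasso drives weak coefficients to exactly zero, while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only the first feature really matters; the rest are noise.
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # most entries are exactly 0.0
print(Ridge(alpha=0.1).fit(X, y).coef_)  # small but non-zero entries
```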

5. What is the role of the hyperparameter in the regularization task?

Ans: A hyperparameter is a parameter in machine learning whose value is set before learning takes place. Hyperparameters are like settings that we can change and alter to control the algorithm's behavior.

In regularization, we add an extra term to our cost function, such as the squared Frobenius norm of the weight matrix W. The parameter lambda is called the regularization parameter or hyperparameter and denotes the degree of regularization. Setting lambda to 0 results in no regularization, while large values of lambda correspond to more regularization. Lambda is usually set using cross-validation.
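As a sketch of that last point (assuming scikit-learn, where lambda is called `alpha`), cross-validation can pick the regularization strength from a grid of candidates:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=200)

# Try 13 log-spaced lambda values; 5-fold cross-validation picks the best.
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(model.alpha_)  # the selected regularization strength
```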

Thus, the hyperparameter is the knob we tweak to make the algorithm better suited to the data and more accurate.

Hope this answers your question; leave an upvote if you find this helpful.

