
How to Read an Unweighted Linear Probability Model


Note on required packages: The following code requires the packages sandwich, lmtest, and tidyverse. The packages sandwich and lmtest include functions to estimate regression error variance that may change with the explanatory variables. The package tidyverse is a collection of packages convenient for manipulating and graphing data. If you have not already done so, download, install, and load the libraries with the following code:
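A minimal sketch of that code (the install.packages() line is only needed the first time):

    # Install once if not already installed:
    # install.packages(c("sandwich", "lmtest", "tidyverse"))
    library("sandwich")   # heteroskedasticity-consistent variance estimators
    library("lmtest")     # coeftest() for corrected hypothesis tests
    library("tidyverse")  # data manipulation and ggplot2 graphics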


Introduction

We established in a previous tutorial that binary variables can be used to estimate proportions or probabilities that an event will occur. If a binary variable is equal to 1 when the event occurs, and 0 otherwise, estimates for the mean can be interpreted as the probability that the event occurs.

A linear probability model (LPM) is a regression model where the outcome variable is a binary variable, and one or more explanatory variables are used to predict the outcome. Explanatory variables can themselves be binary or continuous.

Data Set: Mortgage loan applications

The data set, loanapp.RData, includes actual data from 1,777 mortgage loan applications, including whether or not a loan was approved, and a number of possible explanatory variables, including demographic information about the applicants and financial variables related to the applicant's ability to pay the loan, such as the applicant's income and employment information, value of the mortgaged property, and credit history.

The code below loads the R data set, which creates a data frame called df, and a list of descriptions for the variables called desc.
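A sketch of that load call, assuming loanapp.RData sits in the working directory:

    # Creates the data frame `df` and the description list `desc`
    load("loanapp.RData")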

Estimating a Linear Probability Model

Model Setup

Let us estimate a linear probability model with loan approval status as the outcome variable (approve) and the following explanatory variables (the estimating code is sketched after the list):

  • loanprc: Loan amount relative to the price of the property

  • loaninc: Loan amount relative to total income

  • obrat: Value of other debt obligations relative to total income

  • mortno: Dummy variable equal to 1 if the applicant has no previous mortgage history, 0 otherwise

  • unem: Unemployment rate in the industry where the applicant is employed
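A sketch of the estimating code, consistent with the Call shown in the output below; the object name `model` is chosen here so later snippets can refer to the fitted model:

    model <- lm(approve ~ loanprc + loaninc + obrat + mortno + unem, data = df)
    summary(model)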

    ## 
    ## Call:
    ## lm(formula = approve ~ loanprc + loaninc + obrat + mortno + unem, 
    ##     data = df)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.05548  0.03789  0.11512  0.16194  0.52705 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  1.240e+00  4.374e-02  28.356  < 2e-16 ***
    ## loanprc     -1.927e-03  4.188e-04  -4.601 4.51e-06 ***
    ## loaninc     -4.676e-05  5.604e-05  -0.835  0.40409    
    ## obrat       -5.906e-03  9.597e-04  -6.154 9.31e-10 ***
    ## mortno       5.358e-02  1.661e-02   3.225  0.00128 ** 
    ## unem        -8.628e-03  3.520e-03  -2.451  0.01433 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3228 on 1771 degrees of freedom
    ## Multiple R-squared:  0.05718,    Adjusted R-squared:  0.05452 
    ## F-statistic: 21.48 on 5 and 1771 DF,  p-value: < 2.2e-16

Visualizing the Linear Probability Model

Let us visualize the actual and predicted outcomes with a plot. The code below calls the ggplot() function to visualize how loan approval depends on the size of the loan as a percentage of the price of the property.
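A minimal version of that call, consistent with the layer-by-layer description that follows:

    ggplot(df, aes(x = loanprc, y = approve)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)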

On the vertical axis we have the actual value of approve (equal to 0 or 1) or the predicted probability of a loan approval. The black points show the actual values and the blue line shows the predicted values.

The first parameter sets the data layer, pointing to the data frame, df.

The second parameter sets the aesthetics layer (also known as the mapping layer). We call the function aes() to map the variable loanprc to the x-axis and approve to the y-axis.

Next we add the geometry layer with a call to geom_point(). This produces a scatter plot with points.

Finally, we create the best-fit linear regression line using the function geom_smooth(method="lm", se=FALSE). This function creates both a geometry and a statistics layer. The function estimates the best-fit simple linear regression function (using loanprc as the only explanatory variable) using the function lm(). We set se=FALSE because we do not wish to view the confidence bounds around the line. As we discuss below, the standard errors computed by the lm() function that are used to create the confidence bounds are incorrect for a linear probability model.

It is a strange-looking scatter plot because all the values for approve are either at the top (=1) or at the bottom (=0). The best-fitting regression line does not visually appear to describe the behavior of the values, but it is still chosen to minimize the average squared vertical distance between all the observations and the predicted values on the line.

The strange look of the scatter plot is telling as to how well the model predicts the data. You can see that the model fails to predict well the large number of unapproved loans (approve=0) with values of loanprc between 0 and 150. Although none of these loans was approved, the linear model predicts a probability of approval between 60% and 100%.

The negative slope of the line indicates that an increase in the size of the loan relative to the property price leads to a decrease in the probability that the loan is approved. The magnitude of the slope indicates how much the approval probability decreases for each 1 percentage point increase in the size of the loan relative to the property price.

Predicting marginal effects

Since the average of the binary outcome variable is equal to a probability, the predicted value from the regression is a prediction for the probability that someone is approved for a loan.

Since the regression line slopes downward for loanprc, we see that as an applicant's loan amount relative to the property price increases, the probability that he/she is approved for a loan decreases.

The coefficient on loanprc is the estimated marginal effect of loanprc on the probability that the outcome variable is equal to 1. With a coefficient equal to -0.0019, our model predicts that for every 1 percentage point increase in the loan amount relative to the price of the property, the probability that the applicant is approved for a mortgage loan decreases by 0.19 percentage points.

Heteroskedasticity

All linear probability models have heteroskedasticity. Because all of the actual values for \(y_i\) are equal to either 0 or 1, while the predicted values are probabilities anywhere between 0 and 1 (and sometimes even greater or smaller), the magnitude of the residuals grows or shrinks as the predicted values grow or shrink.
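To see why, note that given the explanatory variables, \(y_i\) is a Bernoulli random variable. Writing \(p_i\) for the probability that \(y_i = 1\) and \(u_i\) for the regression error, the error variance is

\[ \mathrm{Var}(u_i \mid x_i) = p_i(1 - p_i) \]

Because \(p_i\) changes with the explanatory variables, the error variance changes with them too, which is exactly heteroskedasticity.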

Visualizing Heteroskedasticity

Let us plot the predicted values against the squared residuals to see this:
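A sketch of that plot, reusing the `model` object defined above (the column names yhat and resid_sq are chosen here, and the code assumes df has no missing values in the regression variables):

    df$yhat <- fitted(model)       # predicted probabilities
    df$resid_sq <- resid(model)^2  # squared residuals
    ggplot(df, aes(x = yhat, y = resid_sq)) +
      geom_point()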

You can see that as the predicted probability that a loan is approved (the x-axis) increases, the estimate of the variance increases for some observations and decreases for others.

Correcting for Heteroskedasticity

In order to conduct hypothesis tests and construct confidence intervals for the marginal effects an explanatory variable has on the outcome variable, we must first correct for heteroskedasticity. We can use the White estimator to correct for heteroskedasticity.

We compute the White heteroskedastic variance/covariance matrix for the coefficients with a call to vcovHC (which stands for Variance / Covariance Heteroskedasticity-Consistent):
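A sketch of that call, reusing the `model` object from above (the name `vcv` is chosen here):

    vcv <- vcovHC(model, type = "HC1")  # White heteroskedasticity-consistent estimator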

The first parameter in the call above is our original output from our call to lm() above, and the second parameter, type="HC1", tells the function to use the White correction.

Then we call coeftest() to use this estimate of the variance/covariance matrix to properly compute our standard errors, t-statistics, and p-values for the coefficients.
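A sketch of that call, passing the corrected variance/covariance matrix computed above:

    coeftest(model, vcov. = vcv)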

    ## 
    ## t test of coefficients:
    ## 
    ##                Estimate  Std. Error t value  Pr(>|t|)    
    ## (Intercept)  1.2402e+00  4.5220e-02 27.4251 < 2.2e-16 ***
    ## loanprc     -1.9267e-03  4.0644e-04 -4.7404 2.303e-06 ***
    ## loaninc     -4.6765e-05  7.5451e-05 -0.6198 0.5354670    
    ## obrat       -5.9063e-03  1.2134e-03 -4.8677 1.230e-06 ***
    ## mortno       5.3579e-02  1.4884e-02  3.5999 0.0003271 ***
    ## unem        -8.6279e-03  4.0353e-03 -2.1381 0.0326438 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Suppose we wish to test the hypothesis that a higher loan value relative to the property price leads to a decrease in the probability that a loan application is accepted. The null and alternative hypotheses are given by,

\[ H_0: \beta_{loanprc} = 0 \] \[ H_A: \beta_{loanprc} < 0 \]

The coefficient is negative (-0.0019) and the p-value in the output is 2.303e-06, which rounds to 0.000. This is the p-value for a two-tailed test. The p-value for a one-tailed test is half that amount, which is still essentially 0.000. Since 0.000 < 0.05, we reject the null hypothesis and conclude that we have statistical evidence that, given the estimated effects of all the other explanatory variables in the model, an increase in the value of the loan relative to the property price leads to a decrease in the probability that a loan is approved.
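A sketch of the one-tailed computation, indexing the matrix returned by coeftest() by row and column name (the names `ct` and `p_one_tailed` are chosen here):

    ct <- coeftest(model, vcov. = vcv)
    p_one_tailed <- ct["loanprc", "Pr(>|t|)"] / 2  # halve the two-tailed p-value
    p_one_tailed < 0.05                            # TRUE: reject the null at the 5% level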

Problems using the Linear Probability Model

There are some problems using a binary dependent variable in a regression.

There is heteroskedasticity. But that's OK; we know how to correct for it.

A linear model for a probability will eventually be incorrect, because probabilities are by definition bounded between 0 and 1. Linear equations (i.e., straight lines) have no bounds. They eventually continue up to positive infinity in one direction, and down to negative infinity in the other. It is possible for the linear probability model to predict probabilities greater than 1 and less than 0.

Use caution when the predicted values are near 0 and 1. It is useful to examine the predicted values from your regression to see if any are near these boundaries. In the example above, all the predicted values are between 0.7 and 0.95, so fortunately our regression equation is not making any mathematically impossible predictions.
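A quick check along these lines, reusing the `model` object:

    range(fitted(model))  # smallest and largest predicted probabilities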

Also, be cautious when using the regression equation to make predictions outside of the sample. The predicted values in your regression may have all fallen between 0 and 1, but an out-of-sample prediction may fall outside that range.

The error term is not normal. When it is normal, then with small or large sample sizes, the sampling distributions of your coefficient estimates and predicted values are also normal.

While the residuals and the error term are never normal, with a large enough sample size the central limit theorem does deliver normal distributions for the coefficient estimates and the predicted values. The problem that the error term is not normal is really only a problem with small samples.


Source: https://www.murraylax.org/rtutorials/linearprob.html
