Least-Angle Regression (LARS)

Group 5 Overfit Noise: Yahya Hussein, Andrew Farabow, Bryan Nguyen, Jack Jiang, Manning Luo, Shelby Neal

November 30, 2021

History of LARS

  • LARS is a relatively new algorithm for model selection and is suited for high-dimensional data.
  • It was developed in 2004 by Stanford affiliates Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani.
  • LARS takes advantage of a less greedy and more efficient approach compared to other model selection algorithms.

Core Concepts

  • LARS is considered a shrinkage method.
    • A regression approach that shrinks coefficient estimates and automatically selects variables, making it well suited to high-dimensional data
  • LARS is very similar to forward stepwise regression and forward stagewise regression.
  • Estimated coefficients are increased along a direction that makes equal angles with the predictors currently in the model, i.e., the predictors equally correlated with the residual.

Core Concepts

  • LARS first standardizes all predictors of the regression model to have mean zero and unit norm.
  • Then, LARS selects the predictor with the largest correlation with the response variable.
  • LARS increases the coefficient of that predictor, starting from zero and moving toward its least-squares value.

Core Concepts

  • This continues until another predictor has as much correlation with the current residual.
  • That predictor is then added to the model, and both coefficients are increased in their joint least-squares direction until a further predictor catches up in correlation with the residual.
  • This process repeats until the model reaches the full least-squares solution.

Why and When to Use Lars

Some other model selection methods

  • Forward selection: starts with no variables in the model, and at each step it adds to the model the variable with the most explanatory power, stopping if the explanatory power falls below some threshold.
    • This is a fast and simple method, but it can also be too greedy: we fully add variables at each step, so correlated predictors don’t get much of a chance to be included in the model.
    • Example: Build a model for the deliciousness of a sandwich
      • Variables: peanut butter and jelly

Why and When to Use Lars

Some other model selection methods

  • Forward stagewise regression: tries to remedy the greediness of forward selection by only partially adding variables. Whereas forward selection finds the variable with the most explanatory power and goes all out in adding it to the model, forward stagewise finds the variable with the most explanatory power and updates its weight by only epsilon in the correct direction.
    • The problem now is that we have to make tons of updates, so forward stagewise can be very inefficient.

Why and When to Use Lars

  • LARS: Instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles/correlations with each of the variables currently in our model.
    • LARS should be used when overfitting is a concern, when you want sparse parameter values, or when you want your model to be easily interpretable

Why and When to Use Lars

Advantages

  1. Computationally as fast as forward selection but may sometimes be more accurate.
  2. Numerically very efficient when the number of features is much larger than the number of data instances.
  3. It can easily be modified to produce solutions for other estimators, such as the lasso.

Disadvantages

  1. LARS is highly sensitive to noise and can sometimes produce unpredictable results.
  2. The selected predictors may exhibit some collinearity in high-dimensional data, but that is expected for such data.

Mathematical Theory

Setting up the Problem

  • As with the classic forward selection model, start with all coefficients \(\beta_j=0;\ j\in[1,m];\ m=\text{number of predictors}\)
  • Find the predictor \(x_j\) that has the highest correlation with \(y\)
  • \(\begin{cases}\text{increase } \beta_j & \text{if } \operatorname{corr}(x_j, y) > 0\\\text{decrease } \beta_j & \text{if } \operatorname{corr}(x_j, y) < 0\end{cases}\), then calculate the residuals of the new model fit (\(\hat{y}=\beta_j x_j,\ r=y-\hat{y}\)); a small R sketch of this first step follows below
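
As a minimal R sketch of this first step (our own illustration using the built-in mtcars data, which also appears in Baby Example 2 below, not code from the lars package):

# Standardize predictors to mean zero and unit norm; center the response
X <- scale(as.matrix(subset(mtcars, select = -c(mpg))))  # mean 0, unit variance
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")                # rescale to unit L2 norm
y <- mtcars$mpg - mean(mtcars$mpg)

# Find the predictor most correlated with y
cors <- drop(t(X) %*% y)   # proportional to the correlations after standardizing
j    <- which.max(abs(cors))
names(cors)[j]             # "wt", matching LARS Step 1 in Baby Example 2 below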

Mathematical Theory

Estimating the First Coefficient

  • Now that we have some residuals, we can calculate the correlation between the residuals and \(x_j\), as well as the other predictors.
  • At each increment/decrement of \(\beta_j\), we examine these correlations and only stop adjusting \(\beta_j\) when we find some new predictor \(x_k\) that has equal correlation with the residuals as \(x_j\).
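
Continuing the sketch from the previous slide, one way to mimic this numerically is to nudge \(\beta_j\) by a small epsilon of our own choosing until another predictor's correlation with the residual catches up (LARS itself computes this stopping point analytically rather than by iterating):

# Nudge beta_j in small steps until a competitor catches up
beta <- setNames(rep(0, ncol(X)), colnames(X))
eps  <- 1e-3
repeat {
  r     <- y - X %*% beta                     # current residuals
  c_all <- drop(t(X) %*% r)                   # correlations with the residual
  k     <- which.max(abs(c_all))
  if (k != j) break                           # a new predictor has caught up
  beta[j] <- beta[j] + eps * sign(c_all[j])   # move beta_j toward its LS value
}
names(c_all)[k]                               # the competing predictor x_k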

Mathematical Theory

Fitting the Model

  • Once we’ve found the competing predictor \(x_k\), we increase \((\beta_j, \beta_k)\), but this time in their joint least squares direction, and continue until some new competing predictor has as much correlation with the residuals.
    Figure: LARS algorithm with m = 2 predictors [4]
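
Continuing the sketch, a single move of this kind with two active predictors might look like the following (our own illustration; the actual algorithm computes the exact step length \(\gamma\) at which the next competitor catches up):

# One step along the joint least-squares direction of the active set {j, k}
active <- c(j, k)
XA     <- X[, active]
r      <- y - X %*% beta
delta  <- solve(t(XA) %*% XA, t(XA) %*% r)  # joint LS coefficients on the residual
gamma  <- 0.1                               # a small, arbitrary step length
beta[active] <- beta[active] + gamma * drop(delta)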

Mathematical Theory

Optimizing the Solution

  • For \(m \ge 3\), once we have estimated the first two coefficients, subsequent steps with 3 or more active predictors fit the model along equiangular vectors.
    • Since we are now dealing in higher-dimensional space, the direction is no longer simply the bisector of two vectors
  • We stop adjusting the model when
    \(\langle r, x_j \rangle = 0 \ \ \forall j,\)
    i.e., the residual is uncorrelated with every predictor (the full least-squares solution); this is checked numerically below.
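
Using the X and y from the earlier sketches, this property of the full least-squares fit is easy to verify directly:

# At the full least-squares solution, <r, x_j> = 0 for every predictor
ols <- lm(y ~ X - 1)             # no intercept needed: X and y are already centered
max(abs(t(X) %*% resid(ols)))    # numerically zero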

Mathematical Theory

Visualizing the Goal

  • The end goal is to reduce the absolute correlations between the predictors and the residual as much as possible at each step.
    Figure: Correlations in LARS with m = 10 predictors [4]

Baby Example 1 - LARS & USArrests Dataset

# USArrests data from 1973
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
# Call lars library to use lars() function
library(lars)
# Matrix of predictors for lars function
X = model.matrix(USArrests$UrbanPop ~ ., data = USArrests)[,-1]
y = USArrests$UrbanPop # Response variable
# Call lars function to perform least angle regression
us.lars = lars(X,y,type="lar")
plot(us.lars)

# Summary and coefficients of our lars fit
summary(us.lars)
LARS/LAR
Call: lars(x = X, y = y, type = "lar")
  Df     Rss      Cp
0  1 10266.4 12.0289
1  2  8639.9  4.5183
2  3  8338.6  4.7568
3  4  7867.1  4.0000
coef(us.lars)
         Murder    Assault      Rape
[1,]  0.0000000 0.00000000 0.0000000
[2,]  0.0000000 0.00000000 0.4753210
[3,] -0.2871069 0.00000000 0.6088300
[4,] -1.4115375 0.05190045 0.6984111
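
The fitted lars object can also be used for prediction at any point along the path; for example, a quick sketch predicting UrbanPop for the first three states at the final step of the path (here s indexes the LARS step):

# Fitted values at the last LARS step, where all three predictors are active
predict.lars(us.lars, newx = X[1:3, ], s = 3, type = "fit", mode = "step")$fit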

Baby Example 2 - LARS & mtcars Dataset

head(mtcars, 5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Create the x matrix to be used in lars() function
car_x <- as.matrix(subset(mtcars, select=-c(mpg)))
head(car_x, 5)
                  cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
# Using lars function on mtcars dataset
car_lars <- lars(car_x, mtcars$mpg, type="lar", trace=TRUE, 
                 normalize=TRUE, intercept=TRUE)
LAR sequence
Computing X'X .....
LARS Step 1 :    Variable 5     added
LARS Step 2 :    Variable 1     added
LARS Step 3 :    Variable 3     added
LARS Step 4 :    Variable 8     added
LARS Step 5 :    Variable 10    added
LARS Step 6 :    Variable 4     added
LARS Step 7 :    Variable 6     added
LARS Step 8 :    Variable 7     added
LARS Step 9 :    Variable 9     added
LARS Step 10 :   Variable 2     added
Computing residuals, RSS etc .....
# Let's see what variables were added to model and in which order
colnames(car_x)
 [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
# Get summary of our mtcars lars object in the form of an ANOVA table
summary(car_lars)
LARS/LAR
Call: lars(x = car_x, y = mtcars$mpg, type = "lar", trace = TRUE, normalize = TRUE, 
Call:     intercept = TRUE)
   Df     Rss       Cp
0   1 1126.05 130.3246
1   2  992.54 113.3155
2   3  378.79  27.9310
3   4  194.17   3.6454
4   5  190.76   5.1607
5   6  184.29   6.2386
6   7  170.09   6.2177
7   8  169.29   8.1030
8   9  157.32   8.3992
9  10  151.71   9.6000
10 11  147.49  11.0000
plot(car_lars)
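
Since Cp is smallest at step 3 in the summary above (Cp ≈ 3.65), a reasonable place to stop is three steps into the path; the corresponding coefficients can be extracted with coef(), a usage sketch where mode = "step" indexes the LARS steps:

# Coefficients after 3 LARS steps (wt, cyl and hp), where Cp is minimized
coef(car_lars, s = 3, mode = "step")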

Demo: Crime and Social Factors by DC Ward

Predict the number of crimes for each of Washington, DC’s eight wards from a dataset of socioeconomic statistics.

Predictors - ACS Social Characteristics DC Ward
US Census data on topics such as household types, relationship and marital status, school enrollment and educational attainment, languages spoken and immigration info. Sorted by Ward and provided by https://opendata.dc.gov.

Response - Crimes Committed in the Past Two Years
Provided by the DC Metropolitan Police Department, hosted at https://dcatlas.dcgis.dc.gov.

This problem is perfect for LARS, since the number of predictors is much greater than the number of samples (153 vs 8) and multicollinearity is very high.

# Load the dataset for the predictors
social_data = read.csv("ACS_Social_Characteristics_DC_Ward.csv")
# Clean up name of 1st column
names(social_data)[1] <- 'OBJECTID'
# Order the dataset by ward 1-8 for convenience
x = social_data[order(social_data$WARD),]

# Drop unneeded columns and convert to a matrix
x = as.matrix(subset(x, select = -c(OBJECTID, STATEFP, SLDUST, 
                                    GEOID, NAMELSAD, LSAD,  
                                    LSY, MTFCC, FUNCSTAT, 
                                    ALAND, AWATER, INTPTLAT, 
                                    INTPTLON, NAME, WARD, 
                                    GIS_ID, SHAPEAREA, SHAPELEN)))
# Load the response data
social_data = read.csv("dc-crimes-search-results.csv")

# Since the dataset is just a raw list of crimes, 
# calculate the total number of crimes for each ward
# and generate a new matrix from that information
y = as.matrix(data.frame(y=c(
  nrow(social_data[social_data$WARD == 1,]),
  nrow(social_data[social_data$WARD == 2,]),
  nrow(social_data[social_data$WARD == 3,]),
  nrow(social_data[social_data$WARD == 4,]),
  nrow(social_data[social_data$WARD == 5,]),
  nrow(social_data[social_data$WARD == 6,]),
  nrow(social_data[social_data$WARD == 7,]),
  nrow(social_data[social_data$WARD == 8,])
)))
# Fit the model
lars_fit = lars(x, y, type="lar", trace=TRUE)
LAR sequence
LARS Step 0 :    1 Variables with Variance < eps; dropped for good
Computing X'X .....
LARS Step 1 :    Variable 26    added
LARS Step 2 :    Variable 6     added
LARS Step 3 :    Variable 9     added
LARS Step 4 :    Variable 25    added
LARS Step 5 :    Variable 7     added
LARS Step 6 :    Variable 56    added
LARS Step 7 :    Variable 30    added
Computing residuals, RSS etc .....

We can consult the dataset to see what the variables selected in the trace above represent.

Males 15 years and over: Never married

Total households: Male householder, no spouse/partner present  

Total households: Male householder, no spouse/partner present: 
                  Householder living alone: 65 years and over  

Males 15 years and over  

Total households: Male householder, no spouse/partner present: 
          With own children of the householder under 18 years  

Population 3 years and over enrolled in school: 
                Elementary school (grades 1-8)  

Males 15 years and over: Divorced  
# Plot fitted values vs actual values to assess goodness
fitted_values = predict.lars(lars_fit, type="fit", x, s=2)$fit

library(ggplot2)  # needed for ggplot()
ggplot(data=as.data.frame(y), aes(x=fitted_values, y=y)) + 
      geom_point() + labs(x='Predicted Values', 
      y='Actual Values', title='Predicted vs. Actual Values')

plot(lars_fit)

References

  1. https://tibshirani.su.domains/ftp/lars.pdf
  2. https://ir.library.louisville.edu/cgi/viewcontent.cgi?article=3487&context=etd
  3. https://cran.r-project.org/web/packages/lars/lars.pdf
  4. https://web.stanford.edu/~hastie/TALKS/bradfest.pdf
  5. https://en.wikipedia.org/wiki/Least-angle_regression
  6. https://www.geeksforgeeks.org/least-angle-regression-lars/