Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
LARS/LAR
Call: lars(x = X, y = y, type = "lar")
Df Rss Cp
0 1 10266.4 12.0289
1 2 8639.9 4.5183
2 3 8338.6 4.7568
3 4 7867.1 4.0000
Murder Assault Rape
[1,] 0.0000000 0.00000000 0.0000000
[2,] 0.0000000 0.00000000 0.4753210
[3,] -0.2871069 0.00000000 0.6088300
[4,] -1.4115375 0.05190045 0.6984111
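The fit above can be reproduced from R's built-in USArrests data. This is a minimal sketch. The names `X` and `y` come from the Call line, `fit` is only an illustrative name, and it is an assumption that UrbanPop was the response and the three crime rates the predictors (the coefficient columns and the step-0 RSS are consistent with that).

```r
library(lars)

# Assumed setup: the three crime rates are predictors, UrbanPop is the response
X <- as.matrix(USArrests[, c("Murder", "Assault", "Rape")])
y <- USArrests$UrbanPop

fit <- lars(x = X, y = y, type = "lar")

print(fit)        # Df / Rss / Cp table, one row per step
print(coef(fit))  # coefficient path: row k + 1 holds the coefficients after step k
```

Reading the coefficient path row by row shows the order of entry: Rape becomes non-zero first, then Murder, then Assault.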
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Create the x matrix to be used in lars() function
car_x <- as.matrix(subset(mtcars, select=-c(mpg)))
head(car_x, 5)
cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
# Using lars function on mtcars dataset
car_lars <- lars(car_x, mtcars$mpg, type="lar", trace=TRUE,
                 normalize=TRUE, intercept=TRUE)
LAR sequence
Computing X'X .....
LARS Step 1 : Variable 5 added
LARS Step 2 : Variable 1 added
LARS Step 3 : Variable 3 added
LARS Step 4 : Variable 8 added
LARS Step 5 : Variable 10 added
LARS Step 6 : Variable 4 added
LARS Step 7 : Variable 6 added
LARS Step 8 : Variable 7 added
LARS Step 9 : Variable 9 added
LARS Step 10 : Variable 2 added
Computing residuals, RSS etc .....
[1] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
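The trace reports variables by column index, so it helps to translate those indices back into column names. The sketch below does this through the `actions` component of the fitted object; `entry_order` is an illustrative name that does not appear in the original slides.

```r
library(lars)

car_x <- as.matrix(subset(mtcars, select = -c(mpg)))
car_lars <- lars(car_x, mtcars$mpg, type = "lar")

# actions holds one column index per step. A negative index marks a variable
# being dropped (possible with other type values such as "lasso"), so abs()
# keeps the lookup safe.
entry_order <- colnames(car_x)[abs(unlist(car_lars$actions))]
print(entry_order)
```

Matching the trace above (5, 1, 3, 8, 10, 4, 6, 7, 9, 2) against the column names gives the entry order wt, cyl, hp, am, carb, drat, qsec, vs, gear, disp.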
LARS/LAR
Call: lars(x = car_x, y = mtcars$mpg, type = "lar", trace = TRUE, normalize = TRUE,
    intercept = TRUE)
Df Rss Cp
0 1 1126.05 130.3246
1 2 992.54 113.3155
2 3 378.79 27.9310
3 4 194.17 3.6454
4 5 190.76 5.1607
5 6 184.29 6.2386
6 7 170.09 6.2177
7 8 169.29 8.1030
8 9 157.32 8.3992
9 10 151.71 9.6000
10 11 147.49 11.0000
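One common way to read this table is to pick the step with the smallest Cp and inspect the coefficients at that step. The sketch below does this using the `Cp` and `beta` components of the fitted object; `best_row` and `best_coef` are illustrative names, and choosing by minimum Cp is only one possible selection rule.

```r
library(lars)

car_x <- as.matrix(subset(mtcars, select = -c(mpg)))
car_lars <- lars(car_x, mtcars$mpg, type = "lar")

# Cp has one entry per row of the table; row 1 is step 0
best_row <- unname(which.min(car_lars$Cp))
best_step <- best_row - 1

# beta has one row per step and one column per predictor
best_coef <- car_lars$beta[best_row, ]

print(best_step)
print(best_coef[best_coef != 0])
```

In the table above the minimum is Cp = 3.6454 at step 3, where the active variables are wt, cyl and hp (variables 5, 1 and 3 in the trace).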
Predict the number of crimes for each of Washington, DC’s eight wards from a dataset of socioeconomic statistics.
Predictors - ACS Social Characteristics DC Ward
US Census data on topics such as household types, relationship and marital status, school enrollment and educational attainment, languages spoken and immigration info. Sorted by Ward and provided by https://opendata.dc.gov.
Response - Crimes Committed in the Past Two Years
Provided by the DC Metropolitan Police Department, hosted at https://dcatlas.dcgis.dc.gov.
LARS is well suited to this problem: the number of predictors far exceeds the number of samples (153 vs. 8), and the predictors are highly collinear.
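To illustrate how lars() behaves when predictors outnumber samples, here is a small sketch on simulated data with the same shape (8 rows, 153 columns). The data are random numbers, not the DC data, and the names `x_sim`, `y_sim` and `sim_lars` are illustrative.

```r
library(lars)

set.seed(1)   # arbitrary seed, for reproducibility only
n <- 8        # samples (one per ward)
p <- 153      # predictors

# Simulated stand-ins, NOT the DC data
x_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
y_sim <- rnorm(n)

sim_lars <- lars(x_sim, y_sim, type = "lar")

# With an intercept, at most n - 1 variables can enter the path
n_steps <- length(sim_lars$actions)
print(n_steps)
```

The fit runs even though p is much larger than n, but the path is limited to at most n - 1 = 7 steps, so only a handful of the 153 predictors can ever be selected.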
# Load the dataset for the predictors
social_data = read.csv("ACS_Social_Characteristics_DC_Ward.csv")
# Clean up name of 1st column
names(social_data)[1] <- 'OBJECTID'
# Order the dataset by ward 1-8 for convenience
x = social_data[order(social_data$WARD),]
# Drop unneeded columns and convert to a matrix
x = as.matrix(subset(x, select = -c(OBJECTID, STATEFP, SLDUST,
                                    GEOID, NAMELSAD, LSAD,
                                    LSY, MTFCC, FUNCSTAT,
                                    ALAND, AWATER, INTPTLAT,
                                    INTPTLON, NAME, WARD,
                                    GIS_ID, SHAPEAREA, SHAPELEN)))
# Load the response data (one row per reported crime)
crime_data = read.csv("dc-crimes-search-results.csv")
# Since the dataset is just a raw list of crimes,
# calculate the total number of crimes for each ward (1-8)
# and generate a new one-column matrix from that information.
# na.rm = TRUE keeps rows with a missing WARD out of the counts.
y = as.matrix(data.frame(y = sapply(1:8, function(ward)
  sum(crime_data$WARD == ward, na.rm = TRUE))))
LAR sequence
LARS Step 0 : 1 Variables with Variance < eps; dropped for good
Computing X'X .....
LARS Step 1 : Variable 26 added
LARS Step 2 : Variable 6 added
LARS Step 3 : Variable 9 added
LARS Step 4 : Variable 25 added
LARS Step 5 : Variable 7 added
LARS Step 6 : Variable 56 added
LARS Step 7 : Variable 30 added
Computing residuals, RSS etc .....
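The "LARS Step 0" line in the trace reports a predictor with essentially zero variance, which lars() excludes before the path starts. The sketch below reproduces that message on a toy matrix with one constant column; the data are simulated and every name in it is illustrative.

```r
library(lars)

set.seed(1)   # arbitrary seed, for reproducibility only

# Toy predictors: two random columns and one constant column
x_toy <- cbind(a = rnorm(20), b = rnorm(20), const = rep(5, 20))
y_toy <- rnorm(20)

# trace = TRUE prints the Step 0 message for the constant column
toy_lars <- lars(x_toy, y_toy, type = "lar", trace = TRUE)

# The dropped column never enters, so its coefficient stays at zero
print(toy_lars$beta[, "const"])
```

A constant column carries no information about differences between wards, so dropping it does not change the fit.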
We can consult the dataset to see what the variables added on the previous slide represent.
Males 15 years and over: Never married
Total households: Male householder, no spouse/partner present
Total households: Male householder, no spouse/partner present: Householder living alone: 65 years and over
Males 15 years and over
Total households: Male householder, no spouse/partner present: With own children of the householder under 18 years
Population 3 years and over enrolled in school: Elementary school (grades 1-8)
Males 15 years and over: Divorced