Experience with R: notes for Polynomial Regression, Splines and GAM

Linear models have significant advantages in terms of interpretation/inference. However, standard linear regression has significant limitations in terms of prediction power. Polynomial Regression is just a particular kind of “Basis Function”.
# use MASS library
# Use the poly() function to fit a polynomial regression
library(MASS)
attach(Boston);
names(Boston)
polyfit2 <- lm(nox ~ poly(dis, 2));
summary(polyfit2)
polyfit3 <- lm(nox ~ poly(dis, 3));
summary(polyfit3)
# The plot of regression fit with power of 3:
windows();
plot(dis, nox);
lines(sort(dis), polyfit3$fit[order(dis)], col = 2, lwd = 3)
###########################################################
Instead of fitting a high dimensional polynomial over the entire range of x. a spline works by fitting different low dimensional polynomials over different regions of x. For example a cubic spline works by fitting a cubic y=ax3+bx2+cx+d but the coefficients a, b, c and d may differ depending on which part of x we are looking at. The more knots that are used the more flexible the spline is.

It appears from this example that there are 8 parameters (or degrees of freedom) for us to choose. However, in reality there are a number of constraints. For example it makes sense to insist that the cubic curves meet at the knots. To make the curve smooth we also insist that the first and second derivatives are equal at the knots. This means there are really only 5 parameters we get to choose. In general there will be 4 + #knots free parameters to choose between.
# Use the bs() function to fit a spline regression to nox (Y) and dis (X).
# require the {gam} and {splines} package
#############################################################
library(gam)
# Generate the B-spline basis matrix for a polynomial spline using bs(), basis function
# df: degrees of freedom; df = length(knots) + 3 + intercept = #knots + 4, since we make equal at knot points, first and second derivatives
# knots the internal breakpoints that define the spline. The default is NULL,
# which results in a basis for ordinary polynomial regression.
# degree: degree of the piecewise polynomial—default is 3 for cubic splines.
splinefit3 <- lm(nox ~ bs(dis, 3))
summary(splinefit3);
# df = 4, which means 1 knot
splinefit4 <- lm(nox ~ bs(dis, 4))
summary(splinefit4);

windows();
plot(dis, nox);
lines(sort(dis), splinefit4$fit[order(dis)], col = 2, lwd = 3)

# smoothspline
Smooth Spline: What we really want to do is find some function, say g(xi) such that it fits the observed data wells. However, if we don’t put any constraints on g(xi) then we can always set RSS equal to zero simply by choosing a g(xi) that interpolates all the Y’s. What we really want is a g that makes RSS small but is also smooth. A penalty on the integrative of second derivative of g(xi).
# cv ordinary (TRUE) or “generalized” cross-validation (GCV) when FALSE.
smoothfit <- smooth.spline(dis, nox, cv = T)

###########################################################
## Generalized additive model (GAM)
When we have several predictors and want to achieve a non-linear fit, a natural way to extend the multiple linear regression model is to replace each linear part, ßjXij, with fj(Xij) where fj is some smooth non-linear function.
Pros of GAM: (Generalized additive model)
1. By allowing one to fit a non-linear fj to each Xj, model non-linear relationships. 2. We can potentially make more accurate predictions.
3. Because we are fitting an additive model we can still examine the effect of each Xj on Y individually, still good for inference.
cons: The model is restricted to be additive. a simple interaction between X1 and X2 can’t automatically be modeled using GAM
We can extend logistic regression to allow for non-linear relationships using the GAM framework
###########################################################
gamfit = gam(nox ~ ., data = Boston); #same as lm, all linear functions
summary(gamfit)
par(mfrow=c(4, 4)); plot(gamfit);
# GAM to use non-linear fits for most of the variables.
# To do this we use the notation s(X) instead of X.
gamfit=gam(medv~s(crim)+s(zn)+s(indus)+chas+s(nox)+s(rm)+s(dis)+s(rad)+s(tax)+s(ptratio)+s(black)+s(lstat),data=Boston)
par(mfrow=c(4, 4)); plot(gamfit,se=T)

# Now use the significant non linear varialbes. and test the prediction power
tr=1:400
gamfit=gam(medv~crim+zn+indus+chas+nox+s(rm)+dis+rad+tax + ptratio+black+s(lstat),data=Boston,subset=tr) #s(x, df=4, spar=1)
mean((predict(gamfit,Boston)[-tr]-medv[-tr])^2)

Experience with R

Tuesday, October 21, 2008

notes for Polynomial Regression, Splines and GAM

2 comments:

Blog Archive

Labels