Tuesday, October 21, 2008

Notes for nnet, tree

# Use {nnet} for Neural Networks
Neural networks were first developed as a model for the human brain. Getting a prediction for Y involves two steps:
first we get predictions for Z1,…,ZM (the hidden units) using the X's; second we predict Y using Z1,…,ZM.

Regression equation:
Zm = σ(α0m + α1m X1 + … + αpm Xp) for m = 1,…,M, where σ is the sigmoid (logistic) function, and
f(X) = β0 + β1 Z1 + … + βM ZM.
One common approach is to find the α's and β's that minimize the RSS on the training data.
Classification equation:
for each class k we compute Tk = β0k + β1k Z1 + … + βMk ZM and convert the Tk's into class probabilities with the softmax function.
Algorithm (gradient descent):
Given the Z's and f(X)'s one can compute the derivatives of the RSS with respect to the α's and β's. Hence we can either increase or decrease each α and β (according to the sign of its derivative) so as to reduce the RSS.

By making M large enough, neural networks allow one to fit almost arbitrarily flexible functions. However, if we keep increasing the number of hidden units, the test error rate eventually starts to increase because the fit becomes too flexible (overfitting).
One fix is to add a penalty on large α's and β's to the RSS. This penalty (called weight decay) forces the neural network fit to be smoother. The penalty function is λ times the sum of all the squared α's and β's, where λ is the decay parameter. Once we use weight decay in the fit, the error rate becomes fairly insensitive to the number of hidden units (as long as we have enough).
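To make the two-step prediction and the weight-decay penalty concrete, here is a minimal sketch in R with M = 2 hidden units; the data and the α and β values below are made up for illustration, not fitted by nnet().

sigmoid <- function(v) 1 / (1 + exp(-v))
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)         # 10 observations, 2 predictors
y <- X[, 1] - 2 * X[, 2] + rnorm(10, sd = 0.1)
alpha0 <- c(0.1, -0.2)                              # intercepts for the 2 hidden units
alpha  <- matrix(c(0.5, -1, 1, 0.3), 2, 2)          # alpha[j, m]: weight from Xj to hidden unit Zm
beta0  <- 0
beta   <- c(1, -1)
Z  <- sigmoid(sweep(X %*% alpha, 2, alpha0, "+"))   # step 1: hidden units Z1, Z2 from the X's
fX <- beta0 + Z %*% beta                            # step 2: predict Y from Z1, Z2 (linear output)
RSS <- sum((y - fX)^2)
decay <- 0.1
penalty <- decay * (sum(alpha^2) + sum(beta^2))     # weight-decay penalty on the coefficients
RSS + penalty                                       # penalized criterion that weight decay minimizes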


# The spam data frame is assumed to have been loaded already; check its size and
# use the first 2500 rows as the training set.
dim(spam)
tr = 1:2500;
library(nnet)
# The nnet() function implements a single hidden layer neural network. The
# syntax is almost identical to that for lm() etc.
# There are several additional arguments we can pass to nnet(). First we must tell
# it how many hidden units to use via size=?. We can also specify a
# decay value via decay=? (by default this is zero if we don't specify anything).
# By default the maximum number of iterations is 100; you can change this
# using maxit=?. Finally, if you are using nnet() for a regression (rather than a
# classification) problem, you need to set linout=T to tell nnet() to use a linear
# output (rather than the sigmoid output that is used in a classification situation).
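# As a sketch of what a regression call looks like (the toy data frame and variable
# names below are made up), linout = T asks for a linear output unit instead of the
# default sigmoid:
set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- sin(toy$x1) + 0.5 * toy$x2 + rnorm(100, sd = 0.1)
toyfit <- nnet(y ~ ., data = toy, size = 5, decay = 0.01, maxit = 500, linout = T)
predict(toyfit, toy)[1:5]   # fitted values on the linear (regression) scale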

# Fit a neural network to your training data using 2 hidden units and 0 decay.
# decay is the weight-decay parameter (default 0); it is the penalty on large coefficients.
nnfit = nnet(email ~ ., spam, size = 2, subset = tr);
summary(nnfit)
# Refit the neural network with 10 hidden units and 0 decay.
nnfit = nnet(email ~ ., spam, size = 10, subset = tr);

# Use the nnet.cv() function (on the training data) to estimate the optimal decay
# when using 10 hidden units. Note that nnet.cv() is not part of the {nnet} package;
# it appears to be a helper function provided with the course materials (a manual
# cross-validation alternative is sketched after this code block).
decayseq <- 10^seq(-3, 0, length = 10)
nnfit <- nnet.cv(email ~ ., data = spam, size = 10, trace = T, decay = decayseq, cv = 5);
windows()
plot(nnfit$decay, nnfit$cv, type = "l", ylim = c(0.03, 0.08))
lines(nnfit$decay, nnfit$train, col = "red");
## use the optimal decay
nnfit = nnet(email ~ ., spam, subset = tr, size = 10, decay = 1.4678)
nnpred <- predict(nnfit, spam, type = "class");
table1 <- table(nnpred[-tr], spam[-tr, "email"])
table1;
(table1[1, 1] + table1[2, 2]) / sum(table1)   # test-set accuracy from the confusion matrix
mean(nnpred[-tr] == spam[-tr, "email"])       # test-set accuracy
mean(nnpred[-tr] != spam[-tr, "email"])       # test-set error rate
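# Since nnet.cv() is not in {nnet}, here is a rough sketch of the same idea done by hand:
# 5-fold cross-validation over decayseq on the training rows (the fold assignment and
# error definition below are assumptions, not the original nnet.cv() code).
set.seed(1)
folds <- sample(rep(1:5, length.out = length(tr)))
cv.error <- numeric(length(decayseq))
for (i in seq_along(decayseq)) {
  err <- numeric(5)
  for (f in 1:5) {
    fit <- nnet(email ~ ., spam, subset = tr[folds != f], size = 10,
                decay = decayseq[i], trace = F)
    pred <- predict(fit, spam[tr[folds == f], ], type = "class")
    err[f] <- mean(pred != spam[tr[folds == f], "email"])
  }
  cv.error[i] <- mean(err)
}
decayseq[which.min(cv.error)]   # decay value with the lowest estimated CV error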

#######################################################################
## library(tree)
Trees: one way to make predictions in a regression problem is to divide the predictor space (i.e. all the possible values for X1, X2, …, Xp) into distinct regions, say R1, R2, …, Rk. For every X that falls in a particular region (say Rj) we make the same prediction. We create the partitions by iteratively splitting one of the X variables into two regions, so the model can always be represented using a tree structure. This provides a very simple way to explain the model to a non-expert.
For region Rj the best prediction is simply the average of all the responses from our training data that fell in Rj.
To choose a split we consider dividing into two regions, Xj greater than s and Xj less than s, for all possible values of s and j. We then choose the s and j that result in the lowest RSS on the training data.
The end points of the tree are called "terminal nodes" (or leaves).
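As a small illustration, here is a sketch in R of one greedy split on a single numeric predictor (the data and function name are made up): scan every candidate split point s and keep the one giving the lowest RSS; a real tree repeats this over all predictors and then within each resulting region.

best.split <- function(x, y) {
  s.cand <- sort(unique(x))
  s.cand <- (head(s.cand, -1) + tail(s.cand, -1)) / 2   # midpoints between adjacent x values
  rss <- sapply(s.cand, function(s) {
    left  <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = s.cand[which.min(rss)], rss = min(rss))
}
set.seed(2)
x <- runif(50); y <- ifelse(x < 0.4, 1, 3) + rnorm(50, sd = 0.2)
best.split(x, y)   # the chosen split point should be close to 0.4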

Classification tree: for each region (or node) we predict the most common category among the training data within that region. Minimizing RSS no longer makes sense, so instead we choose splits that minimize the Gini index or the cross-entropy. If a node has similar probabilities for each category, the entropy (and the Gini index) will be larger; a pure node gives a small value.
We can often improve accuracy by "pruning" the tree, i.e. cutting off some of the terminal nodes, and we use cross-validation to see which tree size has the lowest error rate.
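To illustrate the two criteria, a quick sketch computing the Gini index and cross-entropy from a node's class proportions (the proportions are made-up numbers):

gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(p * log(p))
p.pure  <- c(0.9, 0.1)   # nearly pure node
p.mixed <- c(0.5, 0.5)   # evenly mixed node
c(gini(p.pure), gini(p.mixed))       # 0.18 vs 0.50: Gini is larger for the mixed node
c(entropy(p.pure), entropy(p.mixed)) # 0.33 vs 0.69: so is the cross-entropy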
#######################################################################
library(tree)
# The OJdata data frame is assumed to have been loaded already; use the first 400 rows as the training set.
tr = 1:400;
OJ.tree <- tree(Purchase ~ ., OJdata, subset = tr);
summary(OJ.tree)
OJ.tree
plot(OJ.tree)
text(OJ.tree, pretty = 0, cex = 0.8)

tree.pred <- predict(OJ.tree, OJdata, type = "class")
table1 <- table(tree.pred[-tr], OJdata$Purchase[-tr])
table1
(table1[1, 1] + table1[2, 2])/sum(table1)   # test-set accuracy
(table1[1, 2] + table1[2, 1])/sum(table1)   # test-set error rate
### Pruning the Tree #####################
## Use the cv.tree() function on the tree fitted above to compute the cross-validation statistics for different tree sizes.
# By default K = 10 (i.e. the data are divided into 10 parts); here we use K = 30 folds.
cv.OJ.tree <- cv.tree(OJ.tree, K = 30)
names(cv.OJ.tree);
windows()
par(mfrow = c(1, 2))
plot(cv.OJ.tree$size, cv.OJ.tree$dev, type ='b')
plot(cv.OJ.tree$k, cv.OJ.tree$dev, type ='b')
cv.OJ.tree$k[order(cv.OJ.tree$dev)[1]]
# k: the cost-complexity pruning parameter; each value of k corresponds to a particular subtree of the full tree.
# The k value with the lowest cross-validation deviance is around 12.2, so we prune with k near that value.
prune.OJ.tree <- prune.tree(OJ.tree, k = 12.3)
windows();
plot(prune.OJ.tree)
text(prune.OJ.tree, pretty = 0, cex = 0.8)

summary(prune.OJ.tree)
tree.pred <- predict(prune.OJ.tree, OJdata, type = "class")
table1 <- table(tree.pred[-tr], OJdata$Purchase[-tr])
table1
(table1[1, 1] + table1[2, 2])/sum(table1)   # test-set accuracy for the pruned tree
(table1[1, 2] + table1[2, 1])/sum(table1)   # test-set error rate for the pruned tree
# The mindev argument in tree.control() controls how far the tree is grown
# (the default value is 0.01). Larger values of mindev give smaller trees,
# and smaller values of mindev give bigger trees.
# Find the best mindev by comparing test error rates over a grid of values.
mindev.seq = 10^seq(-3, -1, length = 100)
library(tree)
tr = 1:800;        # now use the first 800 rows as the training set
test.error <- c();
tree.size <- c();
for(j in 1:100){
  OJ.tree <- tree(Purchase ~ ., OJdata, subset = tr,
                  control = tree.control(nobs = 800, mindev = mindev.seq[j]));
  tmp1 <- summary(OJ.tree)$size                        # number of terminal nodes
  tree.size <- c(tree.size, tmp1);
  tree.pred <- predict(OJ.tree, OJdata, type = "class")
  table1 <- table(tree.pred[-tr], OJdata$Purchase[-tr])
  tmp2 <- (table1[1, 2] + table1[2, 1])/sum(table1)    # test error rate for this mindev
  test.error <- c(test.error, tmp2);
}
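# One way to examine the results (sketch): plot the test error against tree size and
# against mindev, and pick the smallest tree whose test error is close to the minimum.
windows()
par(mfrow = c(1, 2))
plot(tree.size, test.error, type = "b", xlab = "tree size", ylab = "test error rate")
plot(log10(mindev.seq), test.error, type = "b", xlab = "log10(mindev)", ylab = "test error rate")
tree.size[which.min(test.error)]   # size of the tree with the lowest test error
mindev.seq[which.min(test.error)]  # corresponding mindev value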
