###########################################################

Bagging and Boosting

The bootstrap approach does the next best thing by taking repeated random samples, with replacement, of the same size as the original sample.

Bagging: use the bootstrap approach, taking B repeated samples from the training data, so we end up with B different training data sets. We train our method on each data set and average the B predictions. If the predictions were independent, averaging would divide the variance by B (i.e. divide the standard deviation by sqrt(B)); in practice the bootstrap fits are correlated, so the reduction is smaller but still substantial.
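As a sanity check on that variance claim, here is a small base-R simulation (hypothetical numbers, not from the original post): averaging B independent predictions with variance 1 gives an average with variance about 1/B.

```r
set.seed(1)
B <- 100

# 10000 draws of a single noisy prediction (sd = 1) ...
single <- rnorm(10000, sd = 1)

# ... versus 10000 averages of B independent noisy predictions
avg <- rowMeans(matrix(rnorm(10000 * B, sd = 1), ncol = B))

var(single)  # close to 1
var(avg)     # close to 1/B = 0.01
```

Because bagged trees are fit to overlapping bootstrap samples, the real-world reduction is less than this independent-draw ideal.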

Classification bagging: for any particular X, there are two possible approaches: 1) Record the class that each bootstrapped data set predicts and take the most commonly occurring class as the overall prediction (voting). 2) If our classifier produces probability estimates, average the probabilities across the B fits and predict the class with the highest average probability.
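A toy illustration (hypothetical probabilities, not from the post) showing that the two approaches can even disagree for the same test point:

```r
# Hypothetical P(class = "yes") from 5 bootstrap classifiers for one test point
probs <- c(0.6, 0.1, 0.6, 0.6, 0.1)

# 1) Voting: each classifier votes for its most likely class; take the majority
votes <- ifelse(probs > 0.5, "yes", "no")
vote.pred <- names(which.max(table(votes)))   # "yes" (3 of 5 votes)

# 2) Averaging: average the probabilities first, then classify
avg.pred <- ifelse(mean(probs) > 0.5, "yes", "no")   # mean = 0.4, so "no"
```

Averaging lets two very confident "no" classifiers outweigh three lukewarm "yes" votes, which is why the two rules can differ.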

library(ipred)

# Bag 1000 trees on the training rows of the (NA-free) Hitters data
bagging.fit <- bagging(Salary~., data = hitters.noNA, subset = tr, nbagg = 1000)

# predict() takes no nbagg argument; it aggregates over all bagged trees
bagging.pred <- predict(bagging.fit, hitters.noNA)[-tr]

# Test-set MSE
mean((hitters.noNA$Salary[-tr] - bagging.pred)^2);

###########################################################

Boosting works in a similar way except that, in each iteration of the algorithm (i.e. for each new data set), the fitting procedure places more weight on observations that were misclassified in previous iterations.

The algorithm:
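The post leaves the algorithm's details out; as a sketch, one round of AdaBoost-style reweighting (the classic form of this idea, shown here with made-up labels and weak-learner predictions) looks like this:

```r
# Labels y in {-1, +1} and one weak learner's predictions (obs 2 is misclassified)
y    <- c(1, 1, -1, -1, 1)
pred <- c(1, -1, -1, -1, 1)
w    <- rep(1/5, 5)                     # start with equal weights

err   <- sum(w * (pred != y))           # weighted error rate: 0.2
alpha <- 0.5 * log((1 - err) / err)     # weight given to this weak learner
w     <- w * exp(-alpha * y * pred)     # misclassified points get up-weighted
w     <- w / sum(w)                     # renormalize to sum to 1

w  # the misclassified observation now carries weight 0.5
```

The next iteration fits a new weak learner to the reweighted data, so it concentrates on the points the previous one got wrong.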

Boosting allows one to produce a more flexible decision boundary than bagging. If we keep boosting, the training error rate will go to zero; even after the training error reaches zero, the test error can continue to go down (as the plot shows).

Relative influence plots: at each split, the variable is chosen that gives the maximum reduction in RSS over simply fitting a constant to the whole region; call this quantity I. The relative influence of Xi is the sum of these reductions over all regions for which Xi provides the best split. In boosting, we sum the relative influences over all the trees.

Partial dependence plot: shows the relationship between the response and one or more predictors after accounting for (averaging out) the effects of all the other predictors.
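To make "averaging out the other predictors" concrete, here is a hand-rolled partial dependence calculation for a known model f(x1, x2); the model and variable names are made up for illustration.

```r
set.seed(1)
x2 <- rnorm(200)                         # observed values of the other predictor
f  <- function(x1, x2) 2 * x1 + x2^2     # pretend this is our fitted model

# Partial dependence of f on x1: fix x1 at each grid value, average over observed x2
grid <- seq(-2, 2, by = 1)
pd   <- sapply(grid, function(v) mean(f(v, x2)))

# pd is linear in x1 with slope 2: the x2^2 term averages out to a constant
diff(pd)  # each grid step of size 1 raises pd by exactly 2
```

This is the same computation `plot.gbm` performs for a fitted boosting model, just written out by hand.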

Boosting often suffers from overfitting; as before, we use shrinkage to penalize the fit and obtain much better predictions on test data.

library(gbm);

# gbm() takes a single shrinkage value, so loop over a grid and record the test MSE
# (shrinkage = 0 would learn nothing, so the grid starts just above zero)
shrinkage.seq = seq(0.001, 0.5, length = 20);
test.mse = rep(NA, length(shrinkage.seq));

for (i in seq_along(shrinkage.seq)) {
  boost.fit <- gbm(Salary~., data = hitters.noNA[tr, ], distribution = "gaussian",
                   n.trees = 1000, shrinkage = shrinkage.seq[i], verbose = FALSE)
  boost.pred <- predict(boost.fit, hitters.noNA[-tr, ], n.trees = 1000)
  test.mse[i] <- mean((hitters.noNA$Salary[-tr] - boost.pred)^2)
}

plot(shrinkage.seq, test.mse, type = 'b');

# Produce 2 partial dependence plots for the 2 most influential variables. Also produce a joint partial influence plot for these 2 variables

par(mfrow=c(1, 3));

plot(boost.fit, i = "CHmRun");

plot(boost.fit, i = "Walks");

plot(boost.fit, i = c("CHmRun", "Walks"));

###########################################################

SVM: the basic idea of a support vector classifier is to find the straight line that gives the biggest separation between the classes, i.e. the points are as far from the line as possible. Let C be the minimum perpendicular distance between each point and the separating line. We find the line that maximizes C. This line is called the "optimal separating hyperplane".

In practice it is usually not possible to find a hyperplane that perfectly separates the two classes. In this situation we try to find the plane that gives the best separation between the correctly classified points, subject to the points on the wrong side of the line not being off by too much. Let ξ_i represent the amount by which the ith point is on the wrong side of the margin (the dashed line). We want to maximize C subject to the restriction Σ_i ξ_i ≤ constant, where the constant is a tuning parameter that we choose.

The basic idea of a support vector machine: instead of working with the original predictors, we can create transformations (a basis) b1(x), b2(x), …, bM(x) and find the optimal hyperplane in the space spanned by b1(X), b2(X), …, bM(X). In practice we choose a kernel function, which takes the place of the basis. Common kernel functions include: linear, polynomial, radial basis, sigmoid.
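For instance, the radial basis kernel measures similarity as K(x, z) = exp(-gamma * ||x - z||^2); a minimal base-R sketch:

```r
# Radial basis (RBF) kernel between two feature vectors
rbf <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

rbf(c(1, 2), c(1, 2))   # 1: identical points are maximally similar
rbf(c(1, 2), c(3, 4))   # exp(-8): similarity decays with squared distance
```

The kernel trick lets the classifier work with these pairwise similarities directly, without ever constructing the basis functions explicitly.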

library(e1071)

# Salary is treated as a two-level factor in SalaryData, so svm() fits a classifier
# (default kernel is radial)
svmfit <- svm(Salary~., SalaryData, subset = tr);

svmpred <- predict(svmfit, SalaryData)[-tr];

# Confusion matrix on the test set
table1 <- table(svmpred, SalaryData$Salary[-tr]);

table1

(table1[1, 1] + table1[2, 2])/sum(table1);  # accuracy
(table1[1, 2] + table1[2, 1])/sum(table1);  # misclassification rate

mean(svmpred != SalaryData$Salary[-tr])  # misclassification rate, computed directly

# opti.cost is assumed to be a cost value chosen beforehand, e.g. by cross-validation
# with e1071's tune(); each call below refits with a different kernel
svmfit <- svm(Salary~., SalaryData, subset = tr, kernel = "linear", cost = opti.cost);

svmfit <- svm(Salary~., SalaryData, subset = tr, kernel = "polynomial", cost = opti.cost);

svmfit <- svm(Salary~., SalaryData, subset = tr, kernel = "radial", cost = opti.cost);

svmfit <- svm(Salary~., SalaryData, subset = tr, kernel = "sigmoid", cost = opti.cost);

## Sunday, October 26, 2008
