Logistic Regression and Application (R)
Logistic Regression and Classification
Logistic regression is a machine learning classification algorithm used to assign observations to a discrete set of classes. Well-known examples of this kind of classification are email spam vs. ham, fraudulent vs. legitimate transactions, and malignant vs. benign tumors. Given a feature vector $x = (x_1, \dots, x_p)$, the goal is to predict a binary target $y \in \{0, 1\}$.
Unlike linear regression, where the target variable $y$ is modeled directly as a linear function $\beta^T x$ of the features, in logistic regression we relate the probability $p(x) = P(y = 1 \mid x)$ to $\beta^T x$ through the sigmoid (logistic) function.
Define
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \text{so that} \qquad p(x) = \sigma(\beta^T x) = \frac{1}{1 + e^{-\beta^T x}}.$$
- Display a graph of the sigmoid function $\sigma(z)$: [figure omitted]. It shows that no matter what value $z = \beta^T x$ takes, the output $\sigma(z)$ always lies strictly between 0 and 1, so it can be interpreted as a probability.
To summarize, logistic regression starts with the idea of linear regression and transforms its output using the sigmoid function to return a probability value.
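As a quick numerical illustration (a minimal sketch of mine; the sigmoid function is defined again in the implementation below), extreme inputs are squashed into the open interval (0, 1):
# Sigmoid squashes any real input into (0, 1)
sigmoid <- function(z) { 1 / (1 + exp(-z)) }
sigmoid(c(-10, -1, 0, 1, 10))
# [1] 4.539787e-05 2.689414e-01 5.000000e-01 7.310586e-01 9.999546e-01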
Maximum Likelihood Estimation (MLE) of the Model
In logistic regression, our goal is to learn a set of parameters $\beta = (\beta_0, \beta_1, \dots, \beta_p)$ that makes the observed labels as likely as possible under the model.
The probability model is based on the binomial (Bernoulli) distribution: each label $y_i \in \{0, 1\}$ is modeled as
$$P(y_i \mid x_i; \beta) = p(x_i)^{y_i}\,(1 - p(x_i))^{1 - y_i},$$
where $p(x_i) = \sigma(\beta^T x_i)$. Assuming independent observations, the likelihood of the whole data set is
$$L(\beta) = \prod_{i=1}^{n} p(x_i)^{y_i}\,(1 - p(x_i))^{1 - y_i},$$
and the log-likelihood, which is easier to maximize, is
$$\ell(\beta) = \sum_{i=1}^{n} \big[\, y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \,\big].$$
Setting the derivatives of $\ell(\beta)$ to zero yields no closed-form solution, so typically numerical approximations and optimization procedures are used to find the (best) vector $\hat{\beta}$ that maximizes $\ell(\beta)$.
There are many well-known techniques; here I implement my own procedure, a logistic regression classifier trained with gradient ascent, as illustrated in the following steps.
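For reference, here is the one step of calculus the implementation relies on. Using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and writing $p_i = \sigma(\beta^T x_i)$,
$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \left( \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \right) p_i (1 - p_i)\, x_{ij} = \sum_{i=1}^{n} (y_i - p_i)\, x_{ij},$$
so in matrix form $\nabla \ell(\beta) = X^T (y - \sigma(X\beta))$, and each gradient-ascent step is $\beta \leftarrow \beta + \eta\, \nabla \ell(\beta)$ for a learning rate $\eta > 0$.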
Here I provide an implementation in the simple case where there is only one feature variable, ldl (low-density lipoprotein cholesterol), predicting the binary response chd (coronary heart disease) from the South African heart disease data set.
- Load and clean data.
# Load data
data <- read.csv('SAheart.data')
data <- data[, c('ldl','chd')]
# remove NA rows
data <- na.omit(data)
- Normalize the feature variable ldl (see Nota-Bene below).
# Normalize the feature variable
normalize <- function(x) { (x - mean(x)) / sd(x) }
data['ldl'] <- lapply(data['ldl'], normalize)
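A quick sanity check (a minimal sketch, assuming the data was loaded as above) confirms the normalized column now has zero mean, up to floating-point error, and unit standard deviation:
# Verify zero mean and unit variance
round(mean(data$ldl), 12)
sd(data$ldl)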
- Split training data and test data.
# Train test split
train <- data[1:100, ]
test <- tail(data, -100)
X_train <- train$ldl
y_train <- train$chd
X_test <- test$ldl
y_test <- test$chd
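The split above is deterministic: the first 100 rows become the training set. If the rows of the file happen to be ordered in some systematic way, a randomized split is usually safer. A minimal sketch (the seed value 42 is an arbitrary choice of mine, not from the original):
# Randomized train/test split (alternative to the deterministic split above)
set.seed(42)                    # arbitrary seed for reproducibility
idx <- sample(nrow(data), 100)  # 100 random row indices for training
train <- data[idx, ]
test <- data[-idx, ]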
- Calculate the objective function $\ell(\beta)$.
# Sigmoid function
sigmoid <- function(w) { 1/(1+exp(-w)) }
# Objective Function
cost <- function(beta, X, y) {
p <- sigmoid(X %*% beta)
t(y)%*%log(p) + t(1-y)%*%log(1-p)
}
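A practical caveat: with a large learning rate, sigmoid(X %*% beta) can saturate to exactly 0 or 1 in floating point, so log(p) or log(1-p) returns -Inf and the convergence check breaks down. One common guard (my addition, not part of the original) is to clip the probabilities away from the boundary:
# Numerically safer objective: clip probabilities away from 0 and 1
cost_safe <- function(beta, X, y, tol=1e-12) {
  p <- sigmoid(X %*% beta)
  p <- pmin(pmax(p, tol), 1 - tol)  # keep p inside [tol, 1-tol]
  t(y) %*% log(p) + t(1-y) %*% log(1-p)
}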
- Calculate the gradient function $\nabla \ell(\beta)$.
# Gradient function
grad <- function(beta, X, y) {
t(X) %*% (y - sigmoid(X%*%beta))
}
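To convince yourself that the analytic gradient matches the objective, you can compare it against a central finite-difference approximation on a tiny toy data set (all numbers below are arbitrary illustrations of mine):
# Finite-difference check of the gradient on toy data
X_toy <- matrix(c(1, 1, 1, -0.5, 0.3, 1.2), nrow=3)  # bias column + one feature
y_toy <- matrix(c(0, 1, 1))
b <- matrix(c(0.1, -0.2))
h <- 1e-6
num_grad <- sapply(1:2, function(j) {
  e <- matrix(0, 2); e[j] <- h
  (cost(b + e, X_toy, y_toy) - cost(b - e, X_toy, y_toy)) / (2*h)
})
max(abs(grad(b, X_toy, y_toy) - num_grad))  # should be tiny, on the order of 1e-9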
- Create the logistic regression fit and complete the implementation of gradient ascent. It is much better to look at the convergence of the objective values $\ell(\beta)$ than at the convergence of the parameters themselves.
# Fit logistic regression, set learning coefficient and tolerance term
fit_logit <- function(X, y, bias=TRUE, eta=1e-2, eps=1e-5, max_iters=200) {
# Type conversion
if (!is.matrix(X)) { X <- as.matrix(X) }
if (!is.matrix(y)) { y <- as.matrix(y) }
# Add bias
if (bias) { X <- cbind(1, X) }
# Algorithm initialization
iters <- 0
# Initialize beta to 0
beta <- matrix(0, ncol(X))
prev_cost <- cost(beta, X, y)
# Update beta using gradient ascent until converged or max iterations reached
while (iters < max_iters) {
iters <- iters+1
# Compute the gradient and update beta
beta <- beta + eta*grad(beta, X, y)
curr_cost <- cost(beta, X, y)
# Check whether converged
if (abs(curr_cost-prev_cost) < eps) { break }
prev_cost <- curr_cost
}
# Create the logit Object
logit <- list(bias=bias)
logit[['beta']] <- beta
logit[['probs']] <- sigmoid(X%*%beta)
logit[['residuals']] <- y - logit[['probs']]
logit[['preds']] <- ifelse(logit[['probs']] > .5, 1, 0)
logit[['score']] <- sum(logit[['preds']] == y)/nrow(y)
logit[['iters']] <- iters
attr(logit, 'class') <- 'logit'
logit
}
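Fitting on the training data then looks like this (a usage sketch; the actual numbers depend on the data, so I do not reproduce them here):
# Fit on the training set and inspect the fitted object
logit.fit <- fit_logit(X_train, y_train)
logit.fit$beta   # intercept and coefficient for ldl
logit.fit$iters  # gradient-ascent iterations actually used
logit.fit$score  # training accuracy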
- Predict the labels for a set of test examples.
# Predict on new data
predict.logit <- function(logit, X, probs=FALSE, ...) {
if (!is.matrix(X)) { X <- as.matrix(X) }
if (logit[['bias']]) { X <- cbind(1, X) }
if (probs) { sigmoid(X%*%logit[['beta']]) }
else { ifelse(sigmoid(X%*%logit[['beta']]) > .5, 1, 0) }
}
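Because the fitted object carries the class attribute 'logit', calling the generic predict() dispatches to predict.logit. With probs=TRUE it returns the raw probabilities instead of the hard 0/1 labels:
# Probabilities vs. hard labels for the first few test examples
head(predict(logit.fit, X_test, probs=TRUE))
head(predict(logit.fit, X_test))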
- Try different learning rate $\eta$ values.
# Score function
score <- function(y_pred, y) {
if (!is.matrix(y_pred)) { y_pred <- as.matrix(y_pred) }
if (!is.matrix(y)) { y <- as.matrix(y) }
sum(y_pred == y)/nrow(y)
}
# Try different learning rate
etas <- list(.25, .2, .15, .1, .05, .01, .005)
for (eta in etas) {
logit.fit <- fit_logit(X_train, y_train, eta=eta)
# Predict the labels for test examples
y_pred <- predict(logit.fit, X_test)
# print accuracy score
print(score(y_pred, y_test))
}
The output is
[1] 0.3314917
[1] 0.3259669
[1] 0.4585635
[1] 0.6657459
[1] 0.6823204
[1] 0.6823204
[1] 0.6823204
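As a cross-check (my addition, not in the original post), R's built-in glm() fits the same model by iteratively reweighted least squares; its coefficients should be close to the beta found by gradient ascent with a small learning rate:
# Compare against R's built-in logistic regression
glm.fit <- glm(chd ~ ldl, data=train, family=binomial)
coef(glm.fit)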
Nota-Bene
In machine learning, it is standard practice to normalize or scale the feature variables to speed up the convergence of the learning algorithms and to ensure that the features contribute equally to the learning task. One way to achieve this normalization is to make the values of each feature have zero mean and unit variance, i.e., to replace each variable $x$ by $(x - \bar{x}) / s_x$, where $\bar{x}$ is the sample mean (subtracting it in the numerator centers the values at zero) and $s_x$ is the sample standard deviation.
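Note that base R already provides exactly this z-score normalization through scale(), so the hand-rolled normalize() above could be replaced by a one-line equivalent:
# Built-in equivalent of normalize(): (x - mean(x)) / sd(x)
data$ldl <- as.numeric(scale(data$ldl))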