Logistic Regression and Classification

Logistic regression is a machine learning classification algorithm used to assign observations to a discrete set of classes. Well-known examples of this kind of classification are labelling email as spam or ham, flagging transactions as fraudulent or legitimate, and diagnosing a tumor as malignant or benign. Given a feature vector X and a qualitative response Y taking values in the set C = {C1, …, Ck}, a classification task is to build a function f : X → Y (a classifier) that takes the feature vector X as input and predicts the value of Y, i.e. Y ∈ C. The model (or function, or classifier) f is built using a set of training observations (x1, y1), …, (xn, yn) for a given n. In order to avoid crisp decisions and to give the user flexibility in the classification process, we are often more interested in estimating the probability that Y belongs to each category in C than in estimating the value of Y directly.

Unlike linear regression, where the target variable y is related to the features via the linear relationship

y = β0 + β1 x1 + … + βk xk + ϵ,

in logistic regression we relate p(y), the probability of Y belonging to a certain class (which ranges between 0 and 1), to the features x1, …, xk via the logistic (or logit) transformation given by

log[p(y) / (1 − p(y))] = β0 + β1 x1 + … + βk xk.

Define S(w) as the logistic sigmoid function given by

S(w) = 1 / (1 + e^(−w)).

Since the relation above is equivalent to p(y) = S(β0 + β1 x1 + … + βk xk), the graph of S(w) for −∞ < w < ∞ (a minimal R sketch of it is given below) shows that no matter what values β0, β1, …, βk or x1, …, xk take, p(y) always lies between 0 and 1.
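
A minimal base-R sketch of this graph (the finite range [−10, 10] stands in for −∞ < w < ∞):

# Plot the logistic sigmoid S(w)
w <- seq(-10, 10, by=0.1)
S <- 1/(1+exp(-w))
plot(w, S, type='l', xlab='w', ylab='S(w)', main='Logistic sigmoid')
abline(h=c(0, 1), lty=2)   # horizontal asymptotes at 0 and 1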

To summarize, logistic regression starts from the idea of linear regression and transforms its output with the sigmoid function so that it returns a probability value.

Maximum Likelihood Estimation (MLE) of the Model

In logistic regression, our goal is to learn a set of parameters βᵀ = (β0, β1, …, βk) using the available training data. Although we could use (non-linear) least squares to fit the logistic regression model, the more general method of maximum likelihood estimation (MLE) is preferred, since it has better statistical properties. The idea behind MLE is to choose the most likely values of the parameters β0, …, βk given the observed sample

(x1(i), …, xk(i), yi),   1 ≤ i ≤ n.

In logistic regression, the probability model is based on the binomial distributions:

f(xi, pi) = f(yi, pi) = { pi,        if yi = 1
                          1 − pi,    if yi = 0,

where xi = (x1(i), …, xk(i))ᵀ is the vector of features and 0 < pi < 1 are the probabilities associated with the binomials in the model. In other words, the feature vector xi specifies the class yi = 1 with probability pi, that is

p(yi = 1) = pi = e^(β0 + β1 x1(i) + … + βk xk(i)) / (1 + e^(β0 + β1 x1(i) + … + βk xk(i))) = e^(xiᵀβ) / (1 + e^(xiᵀβ)),

where β = (β0, β1, …, βk)ᵀ and xi = (1, x1(i), …, xk(i))ᵀ for 1 ≤ i ≤ n.

Typically, numerical approximations and optimization procedures are used to find the (best) vector β satisfying

β = argmax ℓ(β),

where ℓ(β) denotes the log-likelihood of the observed sample.

There are many known techniques; here I implement my own procedure, a logistic regression classifier fitted by gradient ascent, as illustrated in the implementation that follows.
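
Concretely, the objective being maximized is

ℓ(β) = Σ_{i=1}^{n} [ yi log(pi) + (1 − yi) log(1 − pi) ],

whose gradient has the closed form ∇ℓ(β) = Σ_{i=1}^{n} (yi − pi) xi = Xᵀ(y − p), where X is the matrix whose rows are the vectors xiᵀ and p = (p1, …, pn)ᵀ. Gradient ascent repeatedly applies the update β ← β + η ∇ℓ(β), with learning rate η > 0, until the objective stops improving; these are exactly the quantities computed by the cost and grad functions below.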

Here I provide an implementation in the simple case where there is only one feature variable x1. I have uploaded the data set SAheart.data to GitHub; the task is to predict the value of the variable chd (the response, a coronary heart disease diagnosis) from the feature ldl (low-density lipoprotein cholesterol). I use the first 100 rows for training and the remaining rows for testing. The procedure consists of the following steps:

  • Load and clean data.
# Load data
data <- read.csv('SAheart.data')
data <- data[, c('ldl','chd')]

# remove NA rows
data <- na.omit(data)
  • Normalize the feature variable x1 (see Nota-Bene below).
# Normalize the feature variable
normalize <- function(x) { (x - mean(x)) / sd(x) }
data['ldl'] <- lapply(data['ldl'], normalize)
  • Split training data and test data.
# Train test split
train <- data[1:100, ]
test <- tail(data, -100)

X_train <- train$ldl
y_train <- train$chd
X_test <- test$ldl
y_test <- test$chd
  • Calculate the objective function ℓ(β0, β1).
# Sigmoid function
sigmoid <- function(w) { 1/(1+exp(-w)) }

# Objective Function
cost <- function(beta, X, y) {
    p <- sigmoid(X %*% beta)
    t(y)%*%log(p) + t(1-y)%*%log(1-p)
}
  • Calculate the gradient function ∇ℓ(β0, β1) (a quick finite-difference check of this gradient is sketched after the code).
# Gradient function
grad <- function(beta, X, y) {
    t(X) %*% (y - sigmoid(X%*%beta))
}
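
A quick way to verify grad against cost is a central finite-difference approximation (a minimal sketch, assuming the cost, grad, X_train and y_train objects defined above are in scope); the reported maximum absolute discrepancy should be close to zero.

# Optional check: compare the analytic gradient with central finite differences
grad_check <- function(beta, X, y, h=1e-5) {
    numeric_grad <- sapply(seq_along(beta), function(j) {
        e <- rep(0, length(beta)); e[j] <- h
        (cost(beta+e, X, y) - cost(beta-e, X, y)) / (2*h)
    })
    max(abs(numeric_grad - grad(beta, X, y)))
}

# Example: check at beta = 0 on the training data (bias column added by hand)
Xb <- cbind(1, as.matrix(X_train))
grad_check(matrix(0, ncol(Xb)), Xb, as.matrix(y_train))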
  • Create the logistic regression and complete the implementation of gradient ascent. It is much better to look at the convergence of the objective values ℓ(β0, β1) than at the convergence of the parameters themselves.
# Fit logistic regression; eta is the learning rate, eps the convergence tolerance
fit_logit <- function(X, y, bias=T, eta=10e-3, eps=10e-6, max_iters=200) {
    # Type conversion
    if (!is.matrix(X)) { X <- as.matrix(X) }
    if (!is.matrix(y)) { y <- as.matrix(y) }

    # Add bias
    if (bias) { X <- cbind(1, X) }

    # Algorithm initialization
    iters <- 0

    # Initialize beta to 0
    beta <- matrix(0, ncol(X))
    prev_cost <- cost(beta, X, y)

    # Update beta using gradient ascent until converged or max iterations are reached
    while (iters < max_iters) {
        iters <- iters+1
        # Compute the gradient and update beta
        beta <- beta + eta*grad(beta, X, y)

        curr_cost <- cost(beta, X, y)
        # Check whether converged
        if (abs(curr_cost-prev_cost) < eps) { break }

        prev_cost <- curr_cost
    }
    # Create the logit Object
    logit <- list(bias=bias)
    logit[['beta']] <- beta
    logit[['probs']] <- sigmoid(X%*%beta)
    logit[['residuals']] <- y - logit[['probs']]
    logit[['preds']] <- ifelse(logit[['probs']] > .5, 1, 0)
    logit[['score']] <- sum(logit[['preds']] == y)/nrow(y)
    logit[['iters']] <- iters
    attr(logit, 'class') <- 'logit'
    logit
}
  • Predict the labels for a set of test examples.
# Predict on new data
predict.logit <- function(logit, X, probs=F, ...) {
    if (!is.matrix(X)) { X <- as.matrix(X) }
    if (logit[['bias']]) { X <- cbind(1, X) }
    if (probs) { sigmoid(X%*%logit[['beta']]) }
    else { ifelse(sigmoid(X%*%logit[['beta']]) > .5, 1, 0) }
}
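
For instance, to obtain probability estimates rather than hard 0/1 labels on the test set (a small usage sketch reusing fit_logit from above):

# Example: probability estimates instead of 0/1 labels
fit <- fit_logit(X_train, y_train)
head(predict(fit, X_test, probs=TRUE))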
  • Try different learning rate η values.
# Score function
score <- function(y_pred, y) {
    if (!is.matrix(y_pred)) { y_pred <- as.matrix(y_pred) }
    if (!is.matrix(y)) { y <- as.matrix(y) }
    sum(y_pred == y)/nrow(y)
}

# Try different learning rate
etas <- list(.25, .2, .15, .1, .05, .01, .005)

for (eta in etas) {
    logit.fit <- fit_logit(X_train, y_train, eta=eta)
    # Predict the labels for test examples
    y_pred <- predict(logit.fit, X_test)
    # print accuracy score
    print(score(y_pred, y_test))
}

The output is

[1] 0.3314917
[1] 0.3259669
[1] 0.4585635
[1] 0.6657459
[1] 0.6823204
[1] 0.6823204
[1] 0.6823204
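
As an additional sanity check, the same task can be fitted with R's built-in glm using the binomial family (a sketch reusing the train/test split above); the resulting test accuracy should be comparable to the best gradient-ascent scores.

# Sanity check: logistic regression via glm (iteratively reweighted least squares)
glm.fit <- glm(chd ~ ldl, data=train, family=binomial)
glm.probs <- predict(glm.fit, newdata=test, type='response')
score(ifelse(glm.probs > .5, 1, 0), y_test)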

Nota-Bene

In machine learning, it is a standard routine to normalize or scale the feature variables to speed up the convergence of the learning algorithms and to ensure that the features contribute equally to the learning task. One way to achieve this normalization is to make the values of each feature in the data have zero mean and unit variance, i.e., replace the variable x1 with (x1 − μ)/σ, where μ and σ are respectively the mean and the standard deviation of the values of x1 in the training data.
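
Note that the normalization code above computes the mean and standard deviation over the full data set; to follow this note strictly, μ and σ can be computed from the 100 training rows only and then applied to all rows, as in the following sketch.

# Alternative: normalize using statistics of the training rows only
raw <- na.omit(read.csv('SAheart.data')[, c('ldl', 'chd')])
mu <- mean(raw$ldl[1:100])
sigma <- sd(raw$ldl[1:100])
raw$ldl <- (raw$ldl - mu) / sigma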