Not-so-Naive Classification with the Naive Bayes Classifier

Posted in Bayes Reasoning at 8:59 pm by Auro Tripathy

A common (and successful) learning method is the Naive Bayes classifier. When supplied with a moderate-to-large training set to learn from, the Naive Bayes Classifier does a good job of filtering out less relevant attributes and make good classification decisions. In this article, I introduce the basics of a  Naive Bayes classifier, provide an often-cited example, and provide working R code.

Introduction to Naive Bayes Classifiers

The Naive Bayes classifier is based on Bayes’ theorem with the independence assumptions between features.


The Bayes’ rule (above) plays a central role in the probabilistic reasoning since it helps us ‘invert’ probabilistic relationships between P(Class | x ) and P(x | Class).

So what’s naive about Naive Bayes?

It naively assumes that the attributes of any instance of the training-set are conditionally independent of each other (in our example below, the cool temperatures are completely independent of the sunny outlook). We represent this independence as:

P(x1, x…, x| Classj) = ∏i P(xi, | Classj), or

P(x1, x…, x| Classj) = P(x| Classj) × P(x| Classj) ×…× P(x| Classj)

In plain English, if each feature (predictor) x is independent of every other  feature, then the probability a data-point  (x1, x…, xk) is in Classj is simply the product of  all the individual probabilities of feature xi in Classj.


Let’s build a classifier that predicts whether I should play tennis given the forecast. It takes four attributes to describe the forecast; namely, the outlook, the temperature, the humidity, and the presence or absence of wind. Furthermore the values of the four attributes are qualitative (also known as categorical). They take on the values shown below.

Outlook ∈ [Sunny, Overcast, Rainy]
Temperature ∈ [Hot, Mild, Cool]
Humidity ∈ [High, Normal]
Windy ∈ [Weak, Strong]

The class label is the variable, Play and takes the values yes or no.

Play∈ [Yes, No]

We read-in training data below that has been collected over 14 days.


 The Learning Phase

In the learning phase, we compute the table of likelihoods (probabilities) from the training data. They are:

P(Outlook=o|ClassPlay=b), where o ∈ [Sunny, Overcast, Rainy] and b ∈ [yes, no]

P(Temperature=t|ClassPlay=b), where t ∈ [Hot, Mild, Cool] and b ∈ [yes, no],

P(Humidity=h|ClassPlay=b), where h∈ [High, Norma] and b ∈ [yes, no],

P(Wind=w|ClassPlay=b), where w ∈ [Weak, Strong] and b ∈ [yes, no].


We also calculate P(ClassPlay=Yes) and P(ClassPlay=No).


Classification Phase

Let’s say, we get a new instance of the weather condition, x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong) that will have to be classified (i.e., are we going to play tennis under the conditions specified by x’).

With the MAP rule, we compute the posterior probabilities. This is easily done by  looking up the tables we built in the learning phase.

P(ClassPlay=Yes|x’) = [P(Sunny|ClassPlay=Yes) × P(Cool|ClassPlay=Yes) ×

P(High|ClassPlay=Yes) × P(Strong|ClassPlay=Yes)] ×


= 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053

P(ClassPlay=No|x’) =  [P(Sunny|ClassPlay=No) ×P(Cool|ClassPlay=No) ×

P(High|ClassPlay=No) × P(Strong|ClassPlay=No)] ×


= 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0205

Since P(ClassPlay=Yes|x’) less than P(ClassPlay=No|x’), we classify the new instance x’ to be “No”.

The R Code

The R code works with the example dataset above and shows you a programmatic way to invoke the Naive Bayes classifier in R.

tennis.anyone <- read.table("http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv", header=TRUE, sep=",")
library(e1071) #naive Bayes classifier library
classifier<-naiveBayes(tennis.anyone[,1:4], tennis.anyone[,5])
table(predict(classifier, tennis.anyone[,-5]), tennis.anyone[,5], dnn=list('predicted','actual'))
#new data #15
tennis.anyone[15,-5] <- as.factor(c(Outlook = "Sunny", Temperature = "Cool", Humidity = "High", Wind = "Strong"))

print(tennis.anyone[15,-5] )

result <- predict(classifier, tennis.anyone[15,-5] )

Created by Pretty R at inside-R.org

Things t0 watch-out for – data underflow during multiplications

Calculating the product below may cause underflows.

P(x| Classj) × P(x| Classj) ×…× P(x| Classj) × P(Classj).

You can easily side-step the issue by moving the computation to the logarithmic domain.
log(P(x| Classj) × P(x| Classj) ×…× P(x| Classj) × P(Classj)) =
log(P(x| Classj)) + log(P(x| Classj)) +…+ log(P(x| Classj)) + log(P(Classj))


Bayesian Reasoning and Machine Learning, by David Barber





Regularization – Predictive Modeling Beyond Ordinary Least Squares Fit

Posted in Linear Regression at 6:07 pm by Auro Tripathy

Introduction to Linear Functions and Regularization

A simple yet powerful prediction model assumes that the function is linear in the input even in cases where the input consists of hundreds of variables and the input variables far outstrip the number of observations. Such prediction models, known are linear regression/classification models, can often outperform fancier non-linear models.

The most popular method of estimation of the parameters (used interchangeably with the word, coefficients)  is the method of Ordinary Least Squares (OLS). The linear model can be written as

f(x) = βo + ∑Xβj

where j=1 to p, X is the input vector, and βjs are the unknown coefficients.

 We solve for the coefficients β (βo1,…βp) that minimize the residual sum of squares (RSS).

RSS() = ∑(yi – βo - ∑Xijβj)2

where i=1 to N observations, j= 1 to p variables

Reasons why OLS estimation is often unsatisfactory are:

  1. Large variance in prediction accuracy. A solution to improving the overall accuracy is to shrink (or set to zero) some of the coefficients. The overall effect is to prevent or reduce over-fitting.
  2. With a large number of input predictors, one would like to determine a smaller subset that would exhibit the strongest effects so we see the big picture.

The process of regularization involves a family of penalty terms that can  be added to OLS to achieve the shrinkage (in the coefficients).

The Ridge penalty term shrinks the regression coefficients by introducing the complexity parameter, λ, the greater the value of λ, the greater the amount of shrinkage. By varying λ, the coefficients are shrunk towards zero (and to each other).

λ ∑ βj2 where j=1 to p

While the Ridge penalty does a proportional shrinkage, the LASSO penalty λ, translates each coefficients by a constant factor stopping at zero. LASSO also does feature-selection; if many features are correlated, LASSO will just pick one.

λ ∑ |βj|, where j=1 to p

Elastic Net penalty is a combination of the LASSO and Ridge regression penalty.

λ∑( α|βj| + (1 – α)βj), where j=1 to p

The first term encourages a sparse solution in the coefficients and the second term encourages highly correlated features to be averaged. The parameter α determines the mix of penalties and lies in the range of 0 and 1. With α set to 0, we get the Ridge penalty and with α set to 1, we get the LASSO penalty.


We now demonstrate this with a example dataset with 204 binary attributes and 704 observations.

Getting the Data

The R snippet below will download the dataset  from where it is hosted. The data has been previously saved as an R object in the .rda format. We reload it back in to the R object, hiv.data.

download.file("http://www.shatterline.com/MachineLearning/data/hiv.rda","hiv.rda", mode="wb")

load("hiv.rda", verbose=TRUE) #contains hiv.train & hiv.test

Visualizing the Data

The image function in R helps us visualize the dataset. You can see below the relatively strong correlation between the variables.  See the visualize.matrix function below.

Fitting/Plotting Data

The code snippet below shows the coefficient shrinkage is proportional to λ when we apply the Ridge penalty.

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0) #Ridge penalty

The code snippet below shows that, with the LASSO penalty, the coefficient hit zero (unlike Ridge) as λ shrinks.

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=1) #Lasso penalty

The code snippet below shows a mix of the Ridge and LASSO penalties with the Elastic Net penalty for a specific value of  α=0.2 (could also be chose with cross-validation).

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.2) #Elastic Net penalty


Ten-fold cross-validation shows us that the number of active variables are approximately 30.

cv.fit <- cv.glmnet(hiv.train$x,hiv.train$y) #10-fold cross-validation
legend("topleft",legend=c("10-fold Cross Validation"))
Predicting the Test Data with the Model

The code snippet below predicts the error at every value of  λ.

pred.y <- predict(fit, hiv.test$x) #predict the test data
mean.test.error <- apply((pred.y - hiv.test$y)^2,2,mean)
points(log(fit$lambda), mean.test.error, col="blue",pch="*")
legend("topleft",legend=c("10-fold Cross Validation","Test HIV Data"),pch="*",col=c("red","blue"))

Plotting the Regularization Path

The code snippet below shows the regularization path by plotting the coefficients against  (log of) λ. Each curve represents a coefficient in the model. As  λ gets smaller, more coefficients enter the model from a zero value. (see to the left).



# Author Auro Tripathy, auro@shatterline.com
# Adapted from ...Trevor Hastie's talk

visualize.matrix <- function(mat) {

  image(1:nrow(mat$x), 1:ncol(mat$x), z=mat$x,
        col = c("darkgreen", "white"),
        xlab = "Observations", ylab = "Attributes")

  title(main = "Visualizing the Sparse Binary Matrix",
        font.main = 4)
  return (dim(mat$x)) #returns the dimensions of the matrix

              "hiv.rda", mode="wb")
     verbose=TRUE) #contains hiv.train & hiv.test

print(length(hiv.train$y)) #length of response variable

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0) #Ridge penalty
legend("bottomleft",legend=c("Ridge Penalty, alpha=0"))

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=1) #Lasso penalty
legend("bottomleft",legend=c("LASSO Penalty, alpha=1"))

fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.2) #ElasticNet penalty
legend("bottomleft",legend=c("Elastic Net, alpha=0.2"))

cv.fit <- cv.glmnet(hiv.train$x,hiv.train$y) #10-fold cross-validation
legend("topleft",legend=c("10-fold Cross Validation"))
pred.y <- predict(fit, hiv.test$x) #predict the test data
mean.test.error <- apply((pred.y - hiv.test$y)^2,2,mean)
points(log(fit$lambda), mean.test.error, col="blue",pch="*")
legend("topleft",legend=c("10-fold Cross Validation","Test HIV Data"), pch="*", col=c("red","blue"))


  1. Prof Trever Hastie’s talk
  2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction,  Trevor Hastie , Robert Tibshirani , Jerome Friedman 



Heart-Disease Predictor Using Logistic Regression

Posted in Linear Regression at 10:48 am by Auro Tripathy

Probability is the very guide of life.
- Cicero

Given a two-column dataset, column one being age and column two being the presence/absence of heart-disease, we build a model (in R) that predicts the probability of heart-disease at an age. For a realistic model we aught to have big datasets with additional predictor variables such as blood-pressure, cholesterol, diabetes, smoking etc. However, the one-and-only predictor variable we have is age and the sample-size is 100 subjects!

Plotting the data (see below) doesn’t really provide a clear picture of the nature of the relationship between heart-disease and age. The problem is that the response variable (presence/absence of heart disease) is binary.

Let’s create intervals of the independent variable (age) and compute the frequency of occurrence of the response variable (presence/absence of heart disease). You can get the table below  here.


 A short and lucid tutorial in logistic regression is here (text) and here (video). The logistic curve is an S-shaped curve that takes the form,
y = [exp(b0 + b1x)] / [1 + exp(b0 + b1x)]

Clearly, the curve is non-linear, but the logit-transform makes it linear.
logit(y) = b0 + b1x

Thus, logistic regression is linear regression on the logit transform of y, where y is the probability of success at each value of x. Logistic regression fits b0 and b1, the regression coefficients.

The glm package in R is used to fit generalized regression models and can be used for logistic regression by specifying the family parameter to be binomial with the logit link like so:

> glm.out = glm(cbind(chd.present, chd.absent) ~ age.mean,
+               family=binomial(logit), data=frequency.coronary.data)

Plotting the fit shows us the close relationship between the fitted values and the observed values.

Below is the R code that generated the plots.

coronary.data <- read.table("http://www.shatterline.com/MachineLearning/data/AGE-CHD-Y-N.txt",
plot(CHD ~ Age, data=coronary.data, col="red")
title(main="Scatterplot of presence/absence of \ncoronary heart disease by age \nfor 100 subjects")

library(calibrate) #needed to label observation
frequency.coronary.data <- read.table("http://www.shatterline.com/MachineLearning/data/frequency-table-of-age-group-by-chd.txt",
frequency.coronary.data[,"age.mean"] <- (frequency.coronary.data$age.start +
frequency.coronary.data <- frequency.coronary.data[, c(1,2,6,3,4,5)] #reorder cols
#With "family=" set to "binomial" with a "logit" link, 
# glm( ) produces a logistic regression
glm.fit = glm(cbind(chd.present, chd.absent) ~ age.mean,
              family=binomial(logit), data=frequency.coronary.data)

plot(chd.present/age.group.total ~ age.mean, data=frequency.coronary.data)
lines(frequency.coronary.data$age.mean, glm.fit$fitted, type="l", col="red")

       frequency.coronary.data$age.mean, cx=0.6)
title(main="Percentage of subjects with heart disease in each age group")

Created by Pretty R at inside-R.org


  1. http://www.youtube.com/watch?v=qSTHZvN8hzs&list=WL980F0C0E5B4CD53D#t=24m03s
  2. http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html
  3. Applied Logistic Regression, David W. Hosmer, Jr., Stanley Lemeshow, Rodney X. Sturdivant


My First Brush w/Open Data – Hospital Charges

Posted in Data Visualization at 4:35 am by Auro Tripathy

Curious about what a medical procedure may cost you? Then, read on…

Recent data on the top 100 medical procedures is available here. The Government will soon release data on yet another 30 procedures. Below is a box plot showing the bewildering variation in in-patient cost for medical procedures nationwide.

The R code below can be executed, without changes, to generate the plot above.

You can also use openrefine to discover that a medical procedure with code 207 can cost up to a million dollars!

# Author Auro Tripathy, auro@shatterline.com
# The box plot code, written in R is reproducible and is licensed under Creative Commons, Attribution-NonCommercial-ShareAlike, CC BY-NC-SA

# Medicare Provider Charge Data: Inpatient
# The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals 
# that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most 
# frequently billed discharges, paid under Medicare based on a rate per discharge using the 
# Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs 
# represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.

# Read further and get the data from the link below
# http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient.html

temp.zipped <- tempfile()
hospital.charges <- read.csv(unz(temp.zipped, "Medicare_Provider_Charge_Inpatient_DRG100_FY2011.csv"), header=TRUE)

#min/max needed to bound the plot
max <- max(hospital.charges$Average.Covered.Charges)
min <- min(hospital.charges$Average.Covered.Charges)

#if you want to study the data further, use openrefine (aka Google Refine)

procedures <- unique(hospital.charges$DRG.Definition)

procedure.charges.array <- array(list(NULL), c(100))

for (i in 1:length(procedures)) {

  procedure.charges <- hospital.charges[which(hospital.charges$DRG.Definition == procedures[i]), ]

  #used in the box-and-whiskers plot below
  procedure.charges.array[[i]] <- array(procedure.charges$Average.Covered.Charges, dim=nrow(procedure.charges))


#retain the three-digit medical code
procedures.labels <- as.character(procedures)
for (i in 1:length(procedures.labels)) {
  procedures.labels[i] <- substr(procedures.labels[i], 1, 3)

#boxplot to show the media, the quartiles, and the outliers
boxplot(x = procedure.charges.array, main="Boxplot Showing Variation in In-Patient Cost for Medical Procedures Nationwide",
        xlab="Medical Procedure Code",
        col = c("lightgreen", "brown2", "cyan4"),
        ylim=c(min, max), yaxt="n", col.ticks = "red", col.axis = "azure4", names=procedures.labels, las=2)

axis(2, axTicks(2), labels=sprintf("$%2d", axTicks(2)), las=1)

Created by Pretty R at inside-R.org