Predict disease classes using genetic microarray data

Objective

The purpose of this article is to show how to develop a method that uses genetic data for disease classification. Data is extracted from a DNA microarray which measures the expression levels of large numbers of genes simultaneously.

Samples in the datasets represent patients. For each patient 7070 genes expressions (values) are measured in order to classify the patient’s disease into one of the following cases: EPD, JPA, MED, MGL, RHB.

Data

Gene data is in genes-in-rows format, comma-separated values. There are following 3 files on my GitHub:

Training dataset: pp5i_train.gr.csv
Training data classes: pp5i_train_class.txt
Test dataset: pp5i_test.gr.csv

Instructions

Training data: file pp5i_train.gr.csv, with 7070 genes for 69 samples. A separate file pp5i_train_class.txt has classes for each sample, in the order corresponding to the order of samples in pp5i_train.gr.cdv.

Test data: file pp5i_test.gr.csv, with 23 unlabelled samples and same genes. Assume that the class distribution is similar.

The goal is to learn the best model from the training data and use it to predict the label (class) for each sample in test data. Randomization experiments showed that one can get about 10-12 (from 23) correct answers with random guessing.

Hints

Be sure that don’t use the sample number as one of the predictors. Training data is ordered by class, so sample number will be appear to be a good predictor on cross-validation, but it will not work on the test data.

One of the MED samples in the training data is very likely misclassified (by a human). So the best result expected to get on cross-validation is one error (on a MED sample) out of 69. However, this doesn’t affect accuracy on the test set.

Here I use Python and its libraries. The following steps are one way of finding the best model.

Steps

Read data

import pandas as pd

train_data = pd.read_csv('pp5i_train.gr.csv', index_col=0)
test_data = pd.read_csv('pp5i_test.gr.csv', index_col=0)
train_class = pd.read_csv('pp5i_train_class.txt')
print(train_class['Class'].value_counts())

The class distribution as follows:

MED    39
EPD    10
MGL     7
RHB     7
JPA     6
Name: Class, dtype: int64

Data Cleaning

Threshold both train and test data to a minimum value of 20, maximum of 16,000. Remove null values.

import numpy as np

train_data[(train_data < 20) | (train_data > 16000)] = np.nan
test_data[(test_data < 20) | (test_data > 16000)] = np.nan

train_data_clean = train_data[(~train_data.isnull().any(1)) & (~test_data.isnull().any(1))]
test_data_clean = test_data[(~train_data.isnull().any(1)) & (~test_data.isnull().any(1))]

Select top genes by class

Remove from train data genes with fold differences across samples less than 2. Fold difference is defined as a ratio between maximum and minimum values (Max/Min) for a given data set.

train_data_fold = train_data_clean[train_data_clean.apply(lambda x: x.max()/x.min() > 2, axis=1)]

For each class, generate subsets with top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30 top genes with the highest absolute T-value. The Objective is to find for each class the set of best genes to discriminate it from the other classes.

from scipy import stats

classes = np.unique(train_class.values)

t_values = pd.DataFrame(index=train_data_fold.index.tolist())
for label in classes:
  for gene in train_data_fold.index.tolist():
    cur_samples = train_data_fold.loc[gene][np.ravel((train_class==label).values.tolist())]
    rest_samples = train_data_fold.loc[gene][np.ravel((train_class!=label).values.tolist())]
    t_values.loc[gene, label] = abs(stats.ttest_ind(cur_samples, rest_samples,
                                                    equal_var=False, nan_policy='raise')[0])

For \(N = 2, 4, 6, 8, 10, 12, 15, 20, 25, 30\) combine top genes for each class into one file (removing duplicates, if any) and call the resulting file pp5i_train.topN.gr.csv. Add the class as the last column, remove sample no, transpose each file to “gene-in-columns” format.

N = [2,4,6,8,10,12,15,20,25,30]

for n in N:
  top_genes = set()
  for label in classes:
    top_genes.update(t_values.sort_values(label, ascending=False).head(n).index.tolist())
  pd.concat([train_data_fold.loc[list(top_genes), :].T.reset_index(drop=True), train_class],
            axis=1).to_csv('pp5i_train.top'+str(n)+'.gr.csv', index=False)

Find the best classifier/best gene set combination

Use following classifiers:

Naïve Bayes
Decision Tree
K-NN
Neural Network
Random Forest

For each classifier, using default settings, measure classifier accuracy on the training set using previously generated files with top \(N = 2, 4, 6, 8, 10, 12, 15, 20, 25, 30\) genes.

Select the model and the gene set with the lowest cross-validation error. Cross-validation is the main tool to measure classification accuracy. In Scikit-learn, it shows how it can be calculated.

Once found the gene set with the lowest cross-validation error, vary 1-2 additional relevant parameters for each classifier to improve the accuracy.

Naïve Bayes

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

GNBclf = GaussianNB()

def search_gene(clf):
    best_score = 0
    accuracy = []
    for n in N:
        scores = cross_val_score(GNBclf, globals()['x_train%s'%n], globals()['y_train%s'%n], cv=5, n_jobs=-1)
        score = scores.mean()
        accuracy.append(score)
        print("N=%d accuracy: %0.2f (+/- %0.2f)" % (n, score, scores.std() * 2))
        best_n = n if score > best_score else best_n
        best_score = score if score > best_score else best_score
    return best_n, accuracy

best_n, scores = search_gene(GNBclf)

N=2 accuracy: 0.94 (+/- 0.11)
N=4 accuracy: 0.92 (+/- 0.14)
N=6 accuracy: 0.95 (+/- 0.10)
N=8 accuracy: 0.95 (+/- 0.10)
N=10 accuracy: 0.95 (+/- 0.10)
N=12 accuracy: 0.93 (+/- 0.08)
N=15 accuracy: 0.92 (+/- 0.11)
N=20 accuracy: 0.93 (+/- 0.08)
N=25 accuracy: 0.95 (+/- 0.10)
N=30 accuracy: 0.95 (+/- 0.10)

Next, try different types of naïve bayes:

from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB

MNBclf = MultinomialNB()
CNBclf = ComplementNB()
BNBclf = BernoulliNB()

for c in ['G', 'M', 'C', 'B']:
    scores = cross_val_score(globals()[c+'NBclf'], globals()['x_train%s' % best_n], globals()['y_train%s' % best_n], cv=5, n_jobs=-1)
    print("N=%d %sNBclf accuracy: %0.2f (+/- %0.2f)" % (best_n, c, scores.mean(), scores.std() * 2))

N=6 GNBclf accuracy: 0.95 (+/- 0.10)
N=6 MNBclf accuracy: 0.96 (+/- 0.11)
N=6 CNBclf accuracy: 0.86 (+/- 0.12)
N=6 BNBclf accuracy: 0.57 (+/- 0.09)

Decision Tree

from sklearn.tree import DecisionTreeClassifier

DTclf = DecisionTreeClassifier()

best_n, scores = search_gene(DTclf)

N=2 accuracy: 0.78 (+/- 0.31)
N=4 accuracy: 0.78 (+/- 0.20)
N=6 accuracy: 0.80 (+/- 0.29)
N=8 accuracy: 0.84 (+/- 0.31)
N=10 accuracy: 0.80 (+/- 0.29)
N=12 accuracy: 0.79 (+/- 0.21)
N=15 accuracy: 0.72 (+/- 0.35)
N=20 accuracy: 0.82 (+/- 0.27)
N=25 accuracy: 0.69 (+/- 0.37)
N=30 accuracy: 0.73 (+/- 0.29)

Then search for best parameters in decision tree model:

from sklearn.model_selection import GridSearchCV


params = {
    'max_depth': [30, 60, 90, None],
    'class_weight': ['balanced', None]
}

def print_results(results):
    print('Best Params: {}\n'.format(results.best_params_))
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 2), round(std * 2, 2), params))

def search_param(clf, params, n):
    cv = GridSearchCV(clf, params, cv=5, n_jobs=-1, iid=False)
    cv.fit(globals()['x_train%s' % n], globals()['y_train%s' % n])

    print('N=%s:' % n)    
    print_results(cv)

search_param(DTclf, params, best_n)

N=8:
Best Params: {'class_weight': None, 'max_depth': 30}

81 (+/-0.19) for {'class_weight': 'balanced', 'max_depth': 30}
82 (+/-0.27) for {'class_weight': 'balanced', 'max_depth': 60}
81 (+/-0.19) for {'class_weight': 'balanced', 'max_depth': 90}
86 (+/-0.18) for {'class_weight': 'balanced', 'max_depth': None}
89 (+/-0.14) for {'class_weight': None, 'max_depth': 30}
78 (+/-0.23) for {'class_weight': None, 'max_depth': 60}
76 (+/-0.34) for {'class_weight': None, 'max_depth': 90}
77 (+/-0.26) for {'class_weight': None, 'max_depth': None}

K-NN

For K-NN, test accuracy with \(K = 2, 3, 4\).

from sklearn.neighbors import KNeighborsClassifier

KNNclf = KNeighborsClassifier(n_jobs=-1)

params = {
    'n_neighbors': [2, 3, 4]
}

best_score = 0
for n in N:
    cv = GridSearchCV(KNNclf, params, cv=5, n_jobs=-1, iid=False)
    cv.fit(globals()['x_train%s'%n], globals()['y_train%s'%n])
    score = max(cv.cv_results_['mean_test_score'])
    best_n = n if score > best_score else best_n
    best_K = cv.best_params_['n_neighbors'] if score > best_score else best_K
    best_score = score if score > best_score else best_score
    print('N=%s:'%n)
    print_results(cv)

N=2:
Best Params: {'n_neighbors': 4}

0.93 (+/-0.15) for {'n_neighbors': 2}
0.95 (+/-0.1) for {'n_neighbors': 3}
0.96 (+/-0.11) for {'n_neighbors': 4}
N=4:
Best Params: {'n_neighbors': 3}

0.9 (+/-0.07) for {'n_neighbors': 2}
0.93 (+/-0.02) for {'n_neighbors': 3}
0.93 (+/-0.02) for {'n_neighbors': 4}
N=6:
Best Params: {'n_neighbors': 4}

0.91 (+/-0.07) for {'n_neighbors': 2}
0.93 (+/-0.02) for {'n_neighbors': 3}
0.94 (+/-0.06) for {'n_neighbors': 4}
N=8:
Best Params: {'n_neighbors': 3}

0.89 (+/-0.14) for {'n_neighbors': 2}
0.93 (+/-0.08) for {'n_neighbors': 3}
0.92 (+/-0.09) for {'n_neighbors': 4}
N=10:
Best Params: {'n_neighbors': 3}

0.9 (+/-0.13) for {'n_neighbors': 2}
0.93 (+/-0.08) for {'n_neighbors': 3}
0.9 (+/-0.05) for {'n_neighbors': 4}
N=12:
Best Params: {'n_neighbors': 2}

0.92 (+/-0.09) for {'n_neighbors': 2}
0.9 (+/-0.08) for {'n_neighbors': 3}
0.9 (+/-0.05) for {'n_neighbors': 4}
N=15:
Best Params: {'n_neighbors': 3}

0.86 (+/-0.16) for {'n_neighbors': 2}
0.9 (+/-0.08) for {'n_neighbors': 3}
0.9 (+/-0.08) for {'n_neighbors': 4}
N=20:
Best Params: {'n_neighbors': 4}

0.89 (+/-0.2) for {'n_neighbors': 2}
0.86 (+/-0.07) for {'n_neighbors': 3}
0.89 (+/-0.15) for {'n_neighbors': 4}
N=25:
Best Params: {'n_neighbors': 4}

0.89 (+/-0.18) for {'n_neighbors': 2}
0.89 (+/-0.1) for {'n_neighbors': 3}
0.92 (+/-0.09) for {'n_neighbors': 4}
N=30:
Best Params: {'n_neighbors': 2}

0.93 (+/-0.12) for {'n_neighbors': 2}
0.89 (+/-0.1) for {'n_neighbors': 3}
0.92 (+/-0.09) for {'n_neighbors': 4}

KNN4clf = KNeighborsClassifier(n_neighbors=4, n_jobs=-1)

params = {
    'n_neighbors': [2, 3, 4],
    'weights' : ['uniform', 'distance']
}

search_param(KNN4clf, params, best_n)

N=2:
Best Params: {'n_neighbors': 4, 'weights': 'uniform'}

93 (+/-0.15) for {'n_neighbors': 2, 'weights': 'uniform'}
93 (+/-0.13) for {'n_neighbors': 2, 'weights': 'distance'}
95 (+/-0.1) for {'n_neighbors': 3, 'weights': 'uniform'}
95 (+/-0.1) for {'n_neighbors': 3, 'weights': 'distance'}
96 (+/-0.11) for {'n_neighbors': 4, 'weights': 'uniform'}
95 (+/-0.1) for {'n_neighbors': 4, 'weights': 'distance'}

Neural Network

from sklearn.neural_network import MLPClassifier

NNclf = MLPClassifier()
best_n, scores = search_gene(NNclf)

N=2 accuracy: 0.84 (+/- 0.27)
N=4 accuracy: 0.72 (+/- 0.43)
N=6 accuracy: 0.88 (+/- 0.18)
N=8 accuracy: 0.90 (+/- 0.13)
N=10 accuracy: 0.84 (+/- 0.17)
N=12 accuracy: 0.81 (+/- 0.27)
N=15 accuracy: 0.86 (+/- 0.19)
N=20 accuracy: 0.81 (+/- 0.23)
N=25 accuracy: 0.94 (+/- 0.06)
N=30 accuracy: 0.83 (+/- 0.11)

params = {
    'hidden_layer_sizes' : [(100,), (200,), (400,)],
    'activation' : ['identity', 'logistic', 'tanh', 'relu']
}

search_param(NNclf, params, best_n)

N=8:
Best Params: {'activation': 'logistic', 'hidden_layer_sizes': (200,)}

83 (+/-0.21) for {'activation': 'identity', 'hidden_layer_sizes': (100,)}
89 (+/-0.15) for {'activation': 'identity', 'hidden_layer_sizes': (200,)}
86 (+/-0.12) for {'activation': 'identity', 'hidden_layer_sizes': (400,)}
94 (+/-0.11) for {'activation': 'logistic', 'hidden_layer_sizes': (100,)}
97 (+/-0.06) for {'activation': 'logistic', 'hidden_layer_sizes': (200,)}
97 (+/-0.06) for {'activation': 'logistic', 'hidden_layer_sizes': (400,)}
91 (+/-0.13) for {'activation': 'tanh', 'hidden_layer_sizes': (100,)}
95 (+/-0.1) for {'activation': 'tanh', 'hidden_layer_sizes': (200,)}
97 (+/-0.07) for {'activation': 'tanh', 'hidden_layer_sizes': (400,)}
87 (+/-0.17) for {'activation': 'relu', 'hidden_layer_sizes': (100,)}
88 (+/-0.07) for {'activation': 'relu', 'hidden_layer_sizes': (200,)}
89 (+/-0.09) for {'activation': 'relu', 'hidden_layer_sizes': (400,)}

Random Forest

from sklearn.ensemble import RandomForestClassifier

RFclf = RandomForestClassifier(n_jobs=-1)
best_n, scores = search_gene(RFclf)

N=2 accuracy: 0.86 (+/- 0.12)
N=4 accuracy: 0.93 (+/- 0.14)
N=6 accuracy: 0.89 (+/- 0.17)
N=8 accuracy: 0.94 (+/- 0.16)
N=10 accuracy: 0.91 (+/- 0.13)
N=12 accuracy: 0.87 (+/- 0.26)
N=15 accuracy: 0.92 (+/- 0.14)
N=20 accuracy: 0.89 (+/- 0.10)
N=25 accuracy: 0.90 (+/- 0.11)
N=30 accuracy: 0.87 (+/- 0.09)

params = {
    'n_estimators': [100, 150, 300],
    'max_depth' : [30, 60, 90, None],
    'class_weight' : ['balanced']
}

search_param(RFclf, params, best_n)

N=8:
Best Params: {'class_weight': 'balanced', 'max_depth': 90, 'n_estimators': 150}

95 (+/-0.1) for {'class_weight': 'balanced', 'max_depth': 30, 'n_estimators': 100}
95 (+/-0.15) for {'class_weight': 'balanced', 'max_depth': 30, 'n_estimators': 150}
97 (+/-0.07) for {'class_weight': 'balanced', 'max_depth': 30, 'n_estimators': 300}
96 (+/-0.1) for {'class_weight': 'balanced', 'max_depth': 60, 'n_estimators': 100}
97 (+/-0.07) for {'class_weight': 'balanced', 'max_depth': 60, 'n_estimators': 150}
97 (+/-0.07) for {'class_weight': 'balanced', 'max_depth': 60, 'n_estimators': 300}
95 (+/-0.1) for {'class_weight': 'balanced', 'max_depth': 90, 'n_estimators': 100}
98 (+/-0.06) for {'class_weight': 'balanced', 'max_depth': 90, 'n_estimators': 150}
97 (+/-0.07) for {'class_weight': 'balanced', 'max_depth': 90, 'n_estimators': 300}
92 (+/-0.18) for {'class_weight': 'balanced', 'max_depth': None, 'n_estimators': 100}
95 (+/-0.15) for {'class_weight': 'balanced', 'max_depth': None, 'n_estimators': 150}
96 (+/-0.1) for {'class_weight': 'balanced', 'max_depth': None, 'n_estimators': 300}

Use the gene names from best train gene set and extract the data corresponding to these genes from the test set. Convert test set to “gene-in-columns” format to prepare it for classification.

best_n = 8
y_test = test_data.loc[pd.read_csv('pp5i_train.top'+str(best_n)+'.gr.csv').drop(labels='Class',
                                                                           axis=1).columns.tolist(), :].T.reset_index(drop=True)

Generate predictions for the test set

Now have the best train file, pp5i_train.bestN.csv, (with 69 samples and bestN number of genes found) and a corresponding test file, pp5i_test.bestN.csv, with the same genes and 23 test samples.

pd.read_csv('pp5i_train.top'+str(best_n)+'.gr.csv').to_csv('pp5i_train.best'+str(best_n)+'.csv', index=False)
y_test.to_csv('pp5i_test.best'+str(best_n)+'.csv', index=False)

Use the best train file and the matching test file and generate predictions for the test file class.

from sklearn.metrics import classification_report
from sklearn.utils import class_weight

x_train = pd.read_csv('pp5i_train.best'+str(best_n)+'.csv').drop(labels='Class', axis=1).values.tolist()
y_train = LE.transform(np.ravel(pd.read_csv('pp5i_train.best'+str(best_n)+'.csv', usecols=['Class']).values.tolist()))
sample_weight = class_weight.compute_sample_weight('balanced', y_train)

clf = RandomForestClassifier(n_estimators=150, max_depth=90, n_jobs=-1, class_weight='balanced')
clf.fit(x_train, y_train, sample_weight=sample_weight)
np.savetxt('pp5i_test_class.txt', LE.inverse_transform(clf.predict(y_test)), fmt='%s')
print(classification_report(y_train, clf.predict(x_train), target_names=classes))

precision    recall  f1-score   support

         EPD       1.00      1.00      1.00        10
         JPA       1.00      1.00      1.00         6
         MED       1.00      1.00      1.00        39
         MGL       1.00      1.00      1.00         7
         RHB       1.00      1.00      1.00         7

   micro avg       1.00      1.00      1.00        69
   macro avg       1.00      1.00      1.00        69
weighted avg       1.00      1.00      1.00        69