Model Evaluation (Part 3)

A practical approach

This article is a part of my 'Practical Model Evaluation' series that I started a few months ago. You can find the links to the first and second parts below.


In today's article, we're going to be evaluating our models on different metrics:

  1. The time it takes to train.

  2. The time it takes to do batch inference on the held-out data.

  3. Overall accuracy.

  4. How well they perform across classes.

  5. How performances change if one of the input features changes.

First things first, load in all the libraries and data as needed.

import random
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score
import category_encoders as ce
from xgboost import XGBClassifier
import h2o
from h2o.automl import H2OAutoML

# set seed for reproducibility (any fixed integer works;
# calling random.seed() with no argument reseeds from system entropy)
random.seed(42)

training_data = pd.read_csv("../input/model-evaluation/train_data_2018.csv")
testing_data = pd.read_csv("../input/model-evaluation/test_data_2018.csv")

# save out copy of testing data to use w/ GCP
with open("test_data_2018.csv", "+w") as file:
    testing_data.to_csv(file, index=False, na_rep='NA')

# split into predictors & target variables
X_training = training_data.drop("job_title", axis=1)
y_training = training_data["job_title"]

X_testing = testing_data.drop("job_title", axis=1)
y_testing = testing_data["job_title"]

# encoded copy of our training data for training TPOT model
encoder_X = ce.OrdinalEncoder()
X_encoded = encoder_X.fit_transform(X_training)
X_testing_encoded = encoder_X.transform(X_testing)

encoder_y = ce.OrdinalEncoder()
y_encoded = encoder_y.fit_transform(y_training)
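
To make the encoding step concrete, here is a minimal pure-Python sketch of what ordinal encoding does to a single categorical column: each distinct value gets an integer code in order of first appearance, and the same mapping is reused at transform time. The helper names (fit_ordinal, transform_ordinal) and the sample values are made up for illustration. Starting codes at 1 and mapping unseen values to -1 are assumptions of this sketch that mirror what I understand to be the category_encoders defaults, so check the library documentation before relying on them.

```python
def fit_ordinal(values):
    """Assign integer codes (starting at 1) in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def transform_ordinal(values, mapping, unknown=-1):
    """Replace each value with its code; unseen values get `unknown`."""
    return [mapping.get(value, unknown) for value in values]


# hypothetical column values, not taken from the survey data
train_column = ["Master's", "Bachelor's", "Master's", "Doctoral"]
test_column = ["Doctoral", "Bachelor's", "No degree"]

education_codes = fit_ordinal(train_column)
encoded_train = transform_ordinal(train_column, education_codes)
encoded_test = transform_ordinal(test_column, education_codes)
print(encoded_train)
print(encoded_test)
```

The point to notice is that the mapping is learned from the training column only and then applied to the test column, which is the same fit-on-train, transform-on-test pattern used with encoder_X above.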

Load in our models

For the TPOT model, we'll retrain a fresh copy of the winning pipeline rather than loading a saved model.

# load our saved XGBoost model
xgboost_model = XGBClassifier()
xgboost_model.load_model("../input/evaluation/xgboost_baseline.model")
xgboost_model._le = LabelEncoder().fit(training_data["job_title"])

# initialize H2O instance & load winning AutoML model
h2o.init()
h2o_model = h2o.load_model("../input/model-evaluation/GBM_5_AutoML_20191205_060406")

# convert our held-out data to an H2OFrame, H2O's alternative to a pandas DataFrame
# (required for H2O AutoML): predictors first, then the target column
test_features_h2o = h2o.H2OFrame(X_testing)
test_target_h2o = h2o.H2OFrame(list(y_testing))
test_data_h2o = test_features_h2o.cbind(test_target_h2o)

# train new model using the pipeline generated by TPOT 
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

training_features, testing_features, training_target, testing_target = \
            train_test_split(X_encoded.values, y_encoded.values, random_state=None)

exported_pipeline = GradientBoostingClassifier(learning_rate=0.1, max_depth=4, max_features=0.7500000000000001, min_samples_leaf=3, min_samples_split=2, n_estimators=100, subsample=0.45)
exported_pipeline.fit(training_features, training_target)
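
Because the split above passes random_state=None, each run shuffles the rows differently, so the refit pipeline and its scores can vary from run to run. The following is a rough pure-Python sketch of a seeded hold-out split, only to illustrate the idea. holdout_split is a hypothetical helper, not scikit-learn's implementation, and the 25% test share is chosen to match scikit-learn's default test size.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=None):
    """Shuffle row indices and split them into train and test parts."""
    rng = random.Random(seed)  # seed=None draws from system entropy
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(20))  # stand-in for 20 data rows

# the same seed gives the same partition every time
train_a, test_a = holdout_split(rows, seed=0)
train_b, test_b = holdout_split(rows, seed=0)
print(test_a == test_b)
print(len(train_a), len(test_a))
```

If you want the retrained pipeline to be comparable between runs, passing a fixed integer as random_state plays the same role as the seed argument here.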

When doing this on Kaggle and using Cloud AutoML, you'll need to connect your GCP account to your notebook as shown below.

from google.cloud import automl_v1beta1 as automl
from kaggle.gcp import KaggleKernelCredentials
from kaggle_secrets import GcpTarget
from google.cloud import storage

# don't change this value: it is the only region that works currently
REGION = 'us-central1'

# these you'll change based on your GCP project/data
PROJECT_ID = 'kaggle-automl-example' # this will come from your specific GCP project
DATASET_DISPLAY_NAME = 'data_jobs_info_2018' # name of your uploaded dataset (from GCP console)
TARGET_COLUMN = 'job_title' # column with feature you're trying to predict

# these can be whatever you like
MODEL_DISPLAY_NAME = 'kaggle_automl_example_model' # what you want to call your model
TRAIN_BUDGET = 1000 # max time to train model in milli-hours, from 1000-72000

storage_client = storage.Client(project=PROJECT_ID, credentials=KaggleKernelCredentials(GcpTarget.GCS)) 
tables_gcs_client = automl.GcsClient(client=storage_client, credentials=KaggleKernelCredentials(GcpTarget.GCS)) 
tables_client = automl.TablesClient(project=PROJECT_ID, region=REGION, gcs_client=tables_gcs_client, credentials=KaggleKernelCredentials(GcpTarget.AUTOML))

With the setup done, let's get to evaluating our models.

We'll look at how much time each of these models took to train/retrain.

Training/retraining time

If you wish to add a new class, for example, you'll likely need to train these four types of models from scratch. The training durations for each of the models are listed below:

#   Model          Time to train
1   XGBoost        10.2 s ± 71.7 ms (using %%timeit)
2   TPOT           10–15 minutes (depending on run)
3   H2O AutoML     36 minutes (HT Erin LeDell)
4   Cloud AutoML   1 hour (user-specified)

It looks like the XGBoost baseline is probably your best bet if what you care about is training a model as fast as possible. But what about inference time?
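
The XGBoost figure above comes from the %%timeit notebook magic, which runs a cell repeatedly and reports the mean and standard deviation. Outside a notebook, roughly the same measurement can be sketched with the standard-library timeit module. The train_model function below is a placeholder workload, not one of the models from this series, and five runs is an arbitrary choice.

```python
import statistics
import timeit


def train_model():
    """Placeholder for a real training call such as model.fit(X, y)."""
    return sum(i * i for i in range(50_000))


# time 5 independent runs, executing the workload once per run
run_times = timeit.repeat(train_model, repeat=5, number=1)

mean_s = statistics.mean(run_times)
std_s = statistics.stdev(run_times)
print(f"{mean_s * 1000:.2f} ms ± {std_s * 1000:.2f} ms per run (5 runs)")
```

For models that take minutes or hours to train, repeated runs are rarely practical, which is presumably why the slower rows in the table report a single duration.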

Inference time

How quickly can each model generate predictions? To find out, I'll use the %%time magic, which runs a cell and reports how long it took.

%%time

tpot_predictions = exported_pipeline.predict(X_testing_encoded)

Output

CPU times: user 72 ms, sys: 0 ns, total: 72 ms
Wall time: 68.7 ms

%%time

xgb_predictions = xgboost_model.predict(X_testing_encoded)

Output

CPU times: user 140 ms, sys: 4 ms, total: 144 ms
Wall time: 144 ms

xgb_predictions

Out

array(['Research Assistant', 'Consultant', 'Consultant', ...,
       'Research Assistant', 'Consultant', 'Consultant'], dtype=object)

%%time

h20_predictions = h2o_model.predict(test_data_h2o)

Output

gbm prediction progress: |████████████████████████████████████████████████| 100%
CPU times: user 64 ms, sys: 12 ms, total: 76 ms
Wall time: 999 ms

def download_to_kaggle(bucket_name, destination_directory, file_name, prefix=None):
    """Copy data from your GCS bucket into the working directory of your Kaggle notebook."""
    os.makedirs(destination_directory, exist_ok=True)
    full_file_path = os.path.join(destination_directory, file_name)
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix)
    # every matching blob is written to the same path, so choose a prefix
    # that matches a single file, or later blobs overwrite earlier ones
    for blob in blobs:
        blob.download_to_filename(full_file_path)

%%time
# name of the bucket to store your results & data in
BUCKET_NAME = "kaggle-automl-example"
# url of the data you're using to test
gcs_input_uris = "gs://kaggle-automl-example/test_data_2018.csv"
# folder to store outputs in (you should create this folder)
gcs_output_uri_prefix = 'gs://kaggle-automl-example/predictions'

# predict
cloud_predictions = tables_client.batch_predict(
    model_display_name=MODEL_DISPLAY_NAME, 
    gcs_input_uris=gcs_input_uris,
    gcs_output_uri_prefix=gcs_output_uri_prefix
)

Output

CPU times: user 40 ms, sys: 8 ms, total: 48 ms
Wall time: 4.28 s

# from here we need to download our result file
# you can find the file path in the GCP console in the buckets for your project
RESULT_FILE_PATH = "gs://kaggle-automl-example/predictions/prediction-kaggle_automl_example_model-2019-12-05T05:07:28.873Z/tables_1.csv"

# save to working directory
with open('cloud_automl_results.csv', "wb") as file_obj:
    storage_client.download_blob_to_file(RESULT_FILE_PATH, file_obj)

# load predictions into dataframe
cloud_predictions_df = pd.read_csv("cloud_automl_results.csv")

Based only on wall-clock inference time, the TPOT model is the fastest of the four (68.7 ms), followed by XGBoost (144 ms), H2O AutoML (999 ms), and Cloud AutoML (4.28 s).
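
Note that %%time reports two different things: CPU time spent in the notebook's own process, and wall-clock time. When the model runs outside the Python process, wall time can be far larger than CPU time. That is one plausible reading of the H2O and Cloud AutoML outputs above (76 ms CPU against 999 ms wall, and 48 ms against 4.28 s), though I have not verified it. Here is a minimal sketch of measuring both with the standard library; time.sleep stands in for waiting on an external service and is purely illustrative.

```python
import time


def timed(fn):
    """Run fn once and return (cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    fn()
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    return cpu_s, wall_s


def wait_on_service():
    """Stand-in for a remote prediction call: waits without using CPU."""
    time.sleep(0.2)


cpu_s, wall_s = timed(wait_on_service)
print(f"CPU time:  {cpu_s * 1000:.1f} ms")
print(f"Wall time: {wall_s * 1000:.1f} ms")
```

For comparing models from a user's point of view, wall time is usually the number that matters, which is why the ranking above uses it.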

Comparing metrics

Now that we have our predictions, let's compare how these models performed in terms of metrics. For this example, we'll only consider raw accuracy: what percentage of job titles did each model assign correctly?

If we were looking at per-class probabilities rather than the predicted category, we could use log loss instead, but for the sake of simplicity, let's just use accuracy here.
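
To make the difference between the two metrics concrete, here is a small pure-Python sketch of both on made-up predictions. The class names appear in this dataset, but the labels and probabilities are invented for illustration, and accuracy and log_loss are hand-rolled helpers rather than the scikit-learn functions.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, prob_rows, eps=1e-15):
    """Mean negative log of the probability given to the true class."""
    total = 0.0
    for label, probs in zip(y_true, prob_rows):
        p = min(max(probs.get(label, 0.0), eps), 1 - eps)  # avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# invented example: true labels and per-class probabilities
y_true = ["Consultant", "Data Analyst", "Consultant", "Data Engineer"]
prob_rows = [
    {"Consultant": 0.7, "Data Analyst": 0.2, "Data Engineer": 0.1},
    {"Consultant": 0.5, "Data Analyst": 0.4, "Data Engineer": 0.1},
    {"Consultant": 0.6, "Data Analyst": 0.3, "Data Engineer": 0.1},
    {"Consultant": 0.2, "Data Analyst": 0.2, "Data Engineer": 0.6},
]

# the predicted category is simply the most probable class in each row
y_pred = [max(probs, key=probs.get) for probs in prob_rows]

print("accuracy:", accuracy(y_true, y_pred))
print("log loss:", round(log_loss(y_true, prob_rows), 4))
```

Accuracy only checks whether the top choice was right, so the second row counts as a plain miss. Log loss also looks at how much probability went to the true class, so a near miss like that one is penalized less than a confident wrong answer would be.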

# TPOT Accuracy
tpot_predictions_df = pd.DataFrame(data= {'job_title': tpot_predictions})
tpot_predictions_unencoded = encoder_y.inverse_transform(tpot_predictions_df)
print("TPOT: " + str(accuracy_score(y_testing, tpot_predictions_unencoded)))

# H2O accuracy
h20_predictions_df = h20_predictions.as_data_frame()
print("H2O: " + str(accuracy_score(y_testing, h20_predictions_df["predict"])))

# XGBoost accuracy
print("XGBoost: " + str(accuracy_score(y_testing, xgb_predictions)))

# Cloud AutoML accuracy
# the results file has one score column per class, named job_title_<class>_score
score_columns = cloud_predictions_df.columns[cloud_predictions_df.columns.str.startswith('job_title_')]
prediction_probs = cloud_predictions_df[score_columns]
titles_uncleaned = prediction_probs.idxmax(axis=1)

# strip the prefix and suffix to recover the class name
predicted_titles_cloud = titles_uncleaned.str.replace('job_title_', '', regex=False)
predicted_titles_cloud = predicted_titles_cloud.str.replace('_score', '', regex=False)

print("Cloud AutoML*: " + str(accuracy_score(cloud_predictions_df.job_title, predicted_titles_cloud)))
print("* some rows missing from Cloud AutoML predictions")

Output

TPOT: 0.4915254237288136
H2O: 0.4987378290659935
XGBoost: 0.0576992426974396
Cloud AutoML*: 0.5312270389419544
* some rows missing from Cloud AutoML predictions
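
The Cloud AutoML step above works because the results file has one score column per class: the predicted title is the column with the highest score, with the prefix and suffix stripped. Here is a toy pure-Python version of that logic. The column naming pattern job_title_<class>_score is inferred from the string replacements in the code above, the rows are invented, and predicted_title is a hypothetical helper.

```python
PREFIX = "job_title_"
SUFFIX = "_score"


def predicted_title(row):
    """Return the class whose score column holds the largest value."""
    score_columns = {
        name: value for name, value in row.items()
        if name.startswith(PREFIX) and name.endswith(SUFFIX)
    }
    best_column = max(score_columns, key=score_columns.get)
    # strip the fixed prefix and suffix to recover the class name
    return best_column[len(PREFIX):-len(SUFFIX)]


# invented rows shaped like the batch prediction output
rows = [
    {"job_title": "Consultant",
     "job_title_Consultant_score": 0.55,
     "job_title_Data Analyst_score": 0.45},
    {"job_title": "Data Analyst",
     "job_title_Consultant_score": 0.30,
     "job_title_Data Analyst_score": 0.70},
]

titles = [predicted_title(row) for row in rows]
print(titles)
```

Note that the true job_title column is not picked up as a score column, because it lacks the trailing underscore of the prefix.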

Error analysis

For our error analysis, we're going to be using confusion matrices. The idea of a confusion matrix is that you have the actual labels on one axis, the predicted labels on the other axis, and then the count or proportion of classifications in the matrix itself. They're mostly handy for quickly comparing performance across multiple classes, which is how we'll use them here.
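
As a rough illustration of that idea, the sketch below builds a small confusion matrix by hand in pure Python, with true labels on the rows and predicted labels on the columns, and then normalizes each row. The labels are invented and the helper names are hypothetical; the plotting function used later relies on scikit-learn's confusion_matrix instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, classes):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in classes] for t in classes]


def normalize_rows(matrix):
    """Turn each row into proportions of that true class."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([n / total if total else 0.0 for n in row])
    return normalized


# invented labels for illustration
classes = ["Consultant", "Data Analyst", "Data Engineer"]
y_true = ["Consultant", "Consultant", "Data Analyst", "Data Analyst",
          "Data Engineer", "Data Engineer"]
y_pred = ["Consultant", "Data Analyst", "Data Analyst", "Data Analyst",
          "Consultant", "Data Engineer"]

counts = confusion_counts(y_true, y_pred, classes)
for label, row in zip(classes, normalize_rows(counts)):
    print(f"{label:>14}: {row}")
```

With row normalization, the diagonal entry of each row is the share of that true class the model got right, which makes classes of different sizes directly comparable.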

Below is a custom function for plotting confusion matrices, written by Dr. Rachael Tatman and based on one from the scikit-learn documentation. We'll use it to compare classifications from the four models.

Custom function

import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils.multiclass import unique_labels

# this function is based on one from the scikit-learn documentation (https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
# and is modified and redistributed here under a BSD license, https://opensource.org/licenses/BSD-2-Clause
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)

    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    fig.set_figheight(15)
    fig.set_figwidth(15)
    return ax

Plots

# plot_confusion_matrix expects the true labels first, then the predictions
plot_confusion_matrix(testing_data["job_title"], xgb_predictions,
                      classes=unique_labels(testing_data["job_title"]),
                      normalize=True,
                      title='XGBoost Confusion Matrix')

plot_confusion_matrix(testing_data["job_title"], tpot_predictions_unencoded,
                      classes=unique_labels(testing_data["job_title"]),
                      normalize=True,
                      title='TPOT Confusion Matrix')

plot_confusion_matrix(testing_data["job_title"], h20_predictions_df["predict"],
                      classes=unique_labels(testing_data["job_title"]),
                      normalize=True,
                      title='H2O AutoML Confusion Matrix')

plot_confusion_matrix(cloud_predictions_df["job_title"], predicted_titles_cloud,
                      classes=unique_labels(cloud_predictions_df["job_title"]),
                      normalize=True,
                      title='Cloud AutoML Confusion Matrix')

Output

[Figures: normalized confusion matrices for the XGBoost, TPOT, H2O AutoML, and Cloud AutoML models]

Conclusion

These confusion matrices highlight a few things. First, "Consultant" and "Data Engineer" seem to be the two most challenging classes. Second, the models vary a lot in how well they handle different classes. For instance, the H2O model is more accurate than the TPOT model at recognizing "Data Analyst" roles, but less accurate at identifying "Business Analyst" roles. If you value one of those classes more than the others, you should probably take that into account when choosing a model.

Remember to omit the Cloud AutoML bits if you're not using them.