This article is a part of my 'Practical Model Evaluation' series that I started a few months ago. You can find the link to the first part below.
- Part 1: The approach
In this article, we're going to be working on classifying roles into job titles based on some information about the role. The data is from the 2018 Kaggle data science survey.
The data is not very clean, so you'll need to do some data prep before you attempt anything on it.
In this tutorial, we'll use four different libraries, most of them automated machine learning (AutoML) tools, to build four different models.
Automated machine learning is the process of automating the tasks of applying machine learning to real-world problems. AutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment. ~ Wikipedia definition of AutoML
The libraries we'll be using include:
XGBoost (not automated machine learning: we'll be using this as a baseline)
TPOT, an open-source automated machine learning library developed at the University of Pennsylvania
H2O AutoML, a second open-source automated machine learning library, developed by researchers at H2O.ai
Cloud AutoML, an enterprise-focused automated machine learning product
Loading the data
Load our pre-cleaned data.
# Importing libraries
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, accuracy_score, confusion_matrix
import pandas as pd
import category_encoders as ce
# set a seed for reproducibility
random.seed(42)
# read in our data
df_2018 = pd.read_csv("../input/data-prep-for-job-title-classification/data_jobs_info_2018.csv")
df_2019 = pd.read_csv("../input/data-prep-for-job-title-classification/data_jobs_info_2019.csv")
Data preparation
We'll split the data into training and testing sets:
# split into predictor & target variables
X = df_2018.drop("job_title", axis=1)
y = df_2018["job_title"]
# Split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20)
# save the split training data to use with Cloud AutoML
with open("train_data_2018.csv", "+w") as file:
pd.concat([X_train, y_train], axis=1).to_csv(file, index=False)
with open("test_data_2018.csv", "+w") as file:
pd.concat([X_test, y_test], axis=1).to_csv(file, index=False)
For H2O AutoML and Cloud AutoML, we won't need to do much more: both can handle categorical features without us encoding them first.
For TPOT and XGBoost, however, we'll need to make sure that all our input data is numeric. We'll be using ordinal label encoding for this.
# encode all features using ordinal encoding
encoder_x = ce.OrdinalEncoder()
X_encoded = encoder_x.fit_transform(X)
# you'll need to use a different encoder for each dataframe
encoder_y = ce.OrdinalEncoder()
y_encoded = encoder_y.fit_transform(y)
# split encoded dataset
X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(X_encoded, y_encoded,
train_size=0.80, test_size=0.20)
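Note that the target is now just a column of integers, so any predictions we make later will need to be mapped back to the original job-title strings. A quick sketch, assuming your version of category_encoders supports `inverse_transform` on the OrdinalEncoder:
# sanity check: the ordinal encoding round-trips back to the original job titles
print(y_test_encoded.head())
print(encoder_y.inverse_transform(y_test_encoded.head()))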
XGBoost Baseline
Using the default arguments, we will train a simple XGBoost model.
from xgboost import XGBClassifier
# train XGBoost model with default parameters
my_model = XGBClassifier()
my_model.fit(X_train_encoded, y_train_encoded, verbose=False)
# and save our model
my_model.save_model("xgboost_baseline.model")
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py:219: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py:252: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
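It can also be handy to get a rough accuracy number for this baseline so we have something to compare the AutoML models against later. A quick sketch (not part of the saved workflow) using the `accuracy_score` we imported at the top:
# quick sanity check: score the baseline on the held-out encoded test set
baseline_predictions = my_model.predict(X_test_encoded)
print("XGBoost baseline accuracy:",
      accuracy_score(y_test_encoded.values.ravel(), baseline_predictions))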
Cloud AutoML
Now let's train our Cloud AutoML model! We'll be using both the GCP console and notebook code here, so you'll probably want to open those in separate tabs or windows.
Prepare your account and project
You’ll need to create a GCP account (if you already have a Google account, you can use that one) and enable billing.
If you’re not able to enable billing you can still follow along with the rest of the workshop, just skip the Cloud AutoML parts.
From there, create a new project. You should set the region of your project to “US-Central”.
Go to the AutoML Tables page in the Google Cloud Console and click Enable API. This will let you train an AutoML Tables model in your current project.
Creating your dataset
We'll create our AutoML dataset through the GCP console rather than in the notebook. Importing a dataset can take a while, and if the import code sits directly above the modeling code, rerunning the notebook top to bottom will fail: the modeling cell runs before the dataset has finished importing.
Click on “Datasets” in the list on the left-hand side of your screen, and then click on the blue [+] New Dataset text near the top of your screen.
Give your dataset a name, and make sure the region is US-CENTRAL1.
Select “Upload files from your computer” and select the file with the dataset you want. (If you’d rather push the CSV to your bucket from the notebook instead, see the sketch after these steps.)
Click on Browse under the “Select Files” button, and a side panel will pop up.
If you haven’t created any buckets, you’ll see the text “No buckets found”. To create a new bucket, click on the icon that looks like a shopping basket with a plus sign in it.
Follow the prompts to create your bucket. Important: in the “Choose where to store your data” step, make sure you pick “Region” and set the location to “us-central1 (Iowa)”.
Select the bucket where you’d like to store your data.
Import your dataset. (This may take a while.)
Once your dataset is done importing, take a close look at your imported data and make sure it looks the way you’d expect.
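As an aside, if you'd rather not upload the CSV through the browser, the file we wrote out earlier could also be copied into your bucket from the notebook with the google-cloud-storage client, and then imported into AutoML Tables from Cloud Storage instead. A minimal sketch, assuming a bucket you've already created (“your-automl-bucket” is a placeholder) and the same Kaggle credential setup used later in this notebook:
from google.cloud import storage
from kaggle.gcp import KaggleKernelCredentials
from kaggle_secrets import GcpTarget
# copy the training CSV we saved earlier into an existing GCS bucket
# ("your-automl-bucket" is a placeholder -- use the bucket you created above)
storage_client = storage.Client(project="kaggle-automl-example",
                                credentials=KaggleKernelCredentials(GcpTarget.GCS))
bucket = storage_client.bucket("your-automl-bucket")
bucket.blob("train_data_2018.csv").upload_from_filename("train_data_2018.csv")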
Training our model
To train an AutoML model from inside Kaggle Notebooks, you’ll need to attach a notebook to your Google Cloud Account.
After that, you can modify the following code to start your AutoML model training:
from google.cloud import automl_v1beta1 as automl
from kaggle.gcp import KaggleKernelCredentials
from kaggle_secrets import GcpTarget
from google.cloud import storage
# don't change this value!
REGION = 'us-central1' # don't change: this is the only region that works currently
# these you'll change based on your GCP project/data
PROJECT_ID = 'kaggle-automl-example' # this will come from your specific GCP project
DATASET_DISPLAY_NAME = 'data_jobs_info_2018' # name of your uploaded dataset (from GCP console)
TARGET_COLUMN = 'job_title' # column with feature you're trying to predict
# these can be whatever you like
MODEL_DISPLAY_NAME = 'kaggle_automl_example_model' # what you want to call your model
TRAIN_BUDGET = 1000 # maximum training budget in milli node hours, from 1000 to 72000
storage_client = storage.Client(project=PROJECT_ID, credentials=KaggleKernelCredentials(GcpTarget.GCS))
tables_gcs_client = automl.GcsClient(client=storage_client, credentials=KaggleKernelCredentials(GcpTarget.GCS))
tables_client = automl.TablesClient(project=PROJECT_ID, region=REGION, gcs_client=tables_gcs_client, credentials=KaggleKernelCredentials(GcpTarget.AUTOML))
# you'll need to make sure your model is predicting the right column
tables_client.set_target_column(
dataset_display_name=DATASET_DISPLAY_NAME,
column_spec_display_name=TARGET_COLUMN,
)
Output
name: "projects/452234229115/locations/us-central1/datasets/TBL5202814255845343232"
display_name: "data_jobs_info_2018"
create_time {
seconds: 1575401904
nanos: 806497000
}
etag: "AB3BwFo5q1Fga2k13Y3rkhfmayPrJGyEQjwXFcL73vYEuzVs5XVK53PbLCmEHorp77ZA"
example_count: 13862
tables_dataset_metadata {
primary_table_spec_id: "5034575782656081920"
target_column_spec_id: "9148800959235751936"
target_column_correlations {
key: "1078350426987823104"
value {
cramers_v: 0.02835003252597241
}
}
target_column_correlations {
key: "1294523209101606912"
value {
cramers_v: 0.05767843611880267
}
}
target_column_correlations {
key: "1654811179291246592"
value {
cramers_v: 0.029940460488645747
}
}
target_column_correlations {
key: "1943041555442958336"
value {
cramers_v: 0.015452905323776783
}
}
target_column_correlations {
key: "2231271931594670080"
value {
cramers_v: 0.01498704796962198
}
}
target_column_correlations {
key: "2807732683898093568"
value {
cramers_v: 0.026341049526701674
}
}
target_column_correlations {
key: "3095963060049805312"
value {
cramers_v: 0.05113002137364723
}
}
target_column_correlations {
key: "3384193436201517056"
value {
cramers_v: 0.026228333569284504
}
}
target_column_correlations {
key: "3672423812353228800"
value {
cramers_v: 0.050759289946223245
}
}
target_column_correlations {
key: "3960654188504940544"
value {
cramers_v: 0.014786091207045333
}
}
target_column_correlations {
key: "4248884564656652288"
value {
cramers_v: 0.03745949678439566
}
}
target_column_correlations {
key: "4537114940808364032"
value {
cramers_v: 0.027984289925978574
}
}
target_column_correlations {
key: "4753287722922147840"
value {
cramers_v: 0.03955251126873317
}
}
target_column_correlations {
key: "501889674684399616"
value {
cramers_v: 0.010360098755500056
}
}
target_column_correlations {
key: "5113575693111787520"
value {
cramers_v: 0.005501737941815892
}
}
target_column_correlations {
key: "5401806069263499264"
value {
cramers_v: 0.02990491323748493
}
}
target_column_correlations {
key: "5690036445415211008"
value {
cramers_v: 0.01480733951023409
}
}
target_column_correlations {
key: "5906209227528994816"
value {
cramers_v: 0.050180324750210566
}
}
target_column_correlations {
key: "6266497197718634496"
value {
cramers_v: 0.010363774712942754
}
}
target_column_correlations {
key: "6554727573870346240"
value {
cramers_v: 0.026584870607957077
}
}
target_column_correlations {
key: "6842957950022057984"
value {
cramers_v: 0.015852384216764638
}
}
target_column_correlations {
key: "7059130732135841792"
value {
cramers_v: 0.051269547098965126
}
}
target_column_correlations {
key: "7419418702325481472"
value {
cramers_v: 0.011611357384300698
}
}
target_column_correlations {
key: "7707649078477193216"
value {
cramers_v: 0.05960229203419278
}
}
target_column_correlations {
key: "790120050836111360"
value {
cramers_v: 0.04437117282601206
}
}
target_column_correlations {
key: "7995879454628904960"
value {
cramers_v: 0.0674765679112893
}
}
target_column_correlations {
key: "8284109830780616704"
value {
cramers_v: 0.045443076652998474
}
}
target_column_correlations {
key: "8572340206932328448"
value {
cramers_v: 0.02821290209068001
}
}
target_column_correlations {
key: "8860570583084040192"
value {
cramers_v: 0.01589312144662912
}
}
stats_update_time {
seconds: 1575523334
nanos: 662000000
}
}
# let our model know that input columns may have missing values
for col in tables_client.list_column_specs(project=PROJECT_ID,
                                           dataset_display_name=DATASET_DISPLAY_NAME):
    if TARGET_COLUMN in col.display_name:
        continue
    tables_client.update_column_spec(project=PROJECT_ID,
                                     dataset_display_name=DATASET_DISPLAY_NAME,
                                     column_spec_display_name=col.display_name,
                                     nullable=True)
# and then you'll need to kick off your model training
response = tables_client.create_model(MODEL_DISPLAY_NAME, dataset_display_name=DATASET_DISPLAY_NAME,
train_budget_milli_node_hours=TRAIN_BUDGET,
exclude_column_spec_names=[TARGET_COLUMN])
# check if it's done yet (it won't be)
response.done()
Output
False
Once training has kicked off, we don't need to do anything else here: the model is stored in our GCP account automatically and will be ready to use once training finishes.
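If you did want the notebook to wait for training to finish, you could poll the same `done()` call in a loop, though a full training run can take hours, so in practice it's easier to just come back later. A minimal sketch:
import time
# optionally block until the Cloud AutoML training operation reports it's done
# (a full run can take hours, so this is rarely worth doing inside a notebook)
while not response.done():
    time.sleep(600)  # check again every ten minutes
print("Cloud AutoML training complete")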
TPOT
Now for TPOT. The best thing about this library, which was developed in academia and is built on top of scikit-learn, is that you can export the best pipeline it finds together with all the Python code required to train it.
from tpot import TPOTClassifier
# create & fit a TPOT classifier (8 generations, population of 20,
# stopping early after 2 generations without improvement)
tpot = TPOTClassifier(generations=8, population_size=20,
verbosity=2, early_stop=2)
tpot.fit(X_train_encoded, y_train_encoded)
# save our model code
tpot.export('tpot_pipeline.py')
# print the model code to see what it says
!cat tpot_pipeline.py
/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Generation 1 - Current best internal CV score: 0.4962519650431746
Generation 2 - Current best internal CV score: 0.4962519650431746
The optimized pipeline was not improved after evaluating 2 more generations. Will end the optimization process.
TPOT closed prematurely. Will use the current best pipeline.
Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=3, max_features=0.35000000000000003, min_samples_leaf=11, min_samples_split=18, n_estimators=100, subsample=0.6500000000000001)
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
train_test_split(features, tpot_data['target'].values, random_state=None)
# Average CV score on the training set was:0.4962519650431746
exported_pipeline = GradientBoostingClassifier(learning_rate=0.1, max_depth=3, max_features=0.35000000000000003, min_samples_leaf=11, min_samples_split=18, n_estimators=100, subsample=0.6500000000000001)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
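Since we already have an encoded train/test split in memory, we don't actually need the placeholder 'PATH/TO/DATA/FILE' part of the export; we can re-create the exported estimator and fit it on our own data. A minimal sketch with the hyperparameters copied (and rounded) from the export; `adapted_pipeline` is just an illustrative name:
from sklearn.ensemble import GradientBoostingClassifier
# re-create the exported pipeline and fit it on the split we encoded earlier
adapted_pipeline = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=3, max_features=0.35,
    min_samples_leaf=11, min_samples_split=18,
    n_estimators=100, subsample=0.65)
adapted_pipeline.fit(X_train_encoded, y_train_encoded.values.ravel())
tpot_predictions = adapted_pipeline.predict(X_test_encoded)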
H2O.ai AutoML
For our final model, we'll use the free AutoML library from H2O.ai. One feature of this library I like is that, as each model is trained, it is evaluated both on its own and as a member of a stacked ensemble.
import h2o
from h2o.automl import H2OAutoML
# initialize an H2O instance running locally
h2o.init()
Output
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpt4h02t2p
JVM stdout: /tmp/tmpt4h02t2p/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpt4h02t2p/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: | 02 secs |
H2O cluster timezone: | Etc/UTC |
H2O data parsing timezone: | UTC |
H2O cluster version: | 3.26.0.8 |
H2O cluster version age: | 1 month and 17 days |
H2O cluster name: | H2O_from_python_unknownUser_wh0ijv |
H2O cluster total nodes: | 1 |
H2O cluster free memory: | 3.556 Gb |
H2O cluster total cores: | 4 |
H2O cluster allowed cores: | 4 |
H2O cluster status: | accepting new members, healthy |
H2O connection url: | |
H2O connection proxy: | {'http': None, 'https': None} |
H2O internal security: | False |
H2O API Extensions: | Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, and Core V4 |
Python version: | 3.6.6 final |
# convert our data to H2OFrames, H2O's alternative to pandas DataFrames
train_data = h2o.H2OFrame(X_train)
# the target comes over as a single unnamed column, which H2O names "C1" by default
target_data = h2o.H2OFrame(list(y_train))
train_data = train_data.cbind(target_data)
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y="C1", training_frame=train_data)
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
# View the top five models from the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=5)
# The leader model can be accessed with `aml.leader`
Output
| model_id | mean_per_class_error | logloss | rmse | mse |
| --- | --- | --- | --- | --- |
| GBM_5_AutoML_20191205_060406 | 0.680639 | 1.46891 | 0.696247 | 0.48476 |
| XGBoost_1_AutoML_20191205_060406 | 0.681909 | 1.44647 | 0.6996 | 0.489441 |
| DeepLearning_grid_1_AutoML_20191205_060406_model_1 | 0.682223 | 1.82753 | 0.692592 | 0.479684 |
| GBM_1_AutoML_20191205_060406 | 0.682883 | 1.49003 | 0.696902 | 0.485673 |
| XGBoost_2_AutoML_20191205_060406 | 0.683375 | 1.46133 | 0.7057 | 0.498013 |
# save the model out (we'll need it for tomorrow!)
h2o.save_model(aml.leader)
Output
'/kaggle/working/GBM_5_AutoML_20191205_060406'
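Before wrapping up the H2O section, it's worth a quick look at what the leader model's predictions look like. A minimal sketch, scoring the held-out 2018 test rows we split off earlier (`h2o_test` is just an illustrative name):
# score the held-out 2018 test rows with the AutoML leader
h2o_test = h2o.H2OFrame(X_test)
leader_predictions = aml.leader.predict(h2o_test)
leader_predictions.head()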
Check if our models have been saved
Before we wrap up for the day, we want to make sure we've saved all of our models for tomorrow! The Cloud AutoML model is saved automatically on GCP, but we've saved each of the other models in our current working directory. Let's just double-check that that's the case:
# check to see that we've saved all of our models
! ls
Output
GBM_5_AutoML_20191205_060406 test_data_2018.csv xgboost_baseline.model
__notebook__.ipynb tpot_pipeline.py
__output__.json train_data_2018.csv
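As a final note, here's a minimal sketch of how those saved artifacts could be loaded back in tomorrow (assuming the same file names shown above; the `reloaded_*` names are just for illustration):
from xgboost import XGBClassifier
import h2o
# reload the XGBoost baseline from its saved file
reloaded_xgb = XGBClassifier()
reloaded_xgb.load_model("xgboost_baseline.model")
# reload the H2O AutoML leader (requires a running H2O instance)
h2o.init()
reloaded_h2o = h2o.load_model("GBM_5_AutoML_20191205_060406")
# the TPOT export (tpot_pipeline.py) is plain Python -- edit the data path and run it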
Alright, we've got our three locally saved models (the Cloud AutoML model lives in GCP) and the code for the notebook. We're all set!
Reference
Most of the code used in this tutorial comes from a Kaggle event led by Dr. Rachael Tatman.