Model Evaluation (Part 1)

A practical approach

Building machine learning models in a work setting is often more complicated than it first appears. Achieving a satisfactory score on your chosen metric is undoubtedly crucial, but it is only one part of the picture. When selecting the best model for a given problem, you must take several factors into account, such as:

  • Model performance

  • Cost and computation time for training and inference

  • Your time

Firstly, we will discuss each of these points individually, and then we will explore how to strike a balance between them when deciding on the appropriate model to use.

Time, your time

It’s easy to forget that your time is a limited and valuable resource. It can be hard to predict how long a data science project will take, but here are some things that can take more time than you might anticipate.

  • One task that can be particularly time-consuming is scoping your project: evaluating what is feasible, determining what data is available and which type of model would be most valuable, and establishing timelines that all stakeholders can agree to. If you have the luxury of working with a project or product manager, they can assist you with this.

  • Another time-consuming aspect is setting up your environment and installing dependencies. For instance, when I was in grad school, I spent an entire summer dealing with various audio codecs and their dependencies. You may also underestimate the time required to get a new modeling framework and its dependencies in place. Especially with novel frameworks, you may encounter new issues or bugs that must be resolved before model training can commence.

  • Data collection and preparation are also significant time sinks. Ideally, the required data will already exist and be in a format suitable for modeling. However, this is not always the case, and correcting it can take much more time than anticipated. According to the 2018 Kaggle machine learning and data science survey, data scientists spend over 50% of their time gathering, cleaning, and visualizing data.

  • Finally, communicating with stakeholders is crucial to ensure that your model addresses a real need, that your stakeholders understand the strengths and limitations of your model and of machine learning in general, and that your model is actually put into use. After all, a model that no one uses is pointless.

Below are some tips you can use to get your model up and training in as little time as possible:

  • If your data is tabular and stored in a SQL database, it is preferable to perform as much of the data cleaning as possible in SQL. A well-written SQL query is generally faster than executing the equivalent data manipulations in Python or R (see the first sketch after this list).

  • It is advisable to begin with a well-established, older model family and implementation, as there is usually more sample code available and fewer bugs to contend with, especially when it comes to getting it installed correctly for training on your chosen compute.

  • If feasible, look for a container image, such as a Docker image, that already has the correct versions of all the dependencies for the packages you intend to use. This will save you time on setup.

  • Before commencing data cleaning, take some time to document what your data should look like before it is fed into the model, then list all the steps required to transform the data into that format. This will help you stay focused during the data cleaning process and gives you a checklist to monitor your progress (a small sketch of this follows below).
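For the SQL tip, here's a minimal sketch of what pushing cleaning into the query can look like. The connection string, the survey_responses table, and its columns are hypothetical placeholders, not the actual project data:

```python
# Minimal sketch of pushing data cleaning into SQL rather than pandas.
# The connection string and the survey_responses table/columns are
# hypothetical placeholders -- adjust them for your own database.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/surveys")

# Filter rows, drop nulls, and cast types in the query itself so the
# database does the heavy lifting and only clean rows reach Python.
query = """
    SELECT
        respondent_id,
        LOWER(TRIM(job_title))            AS job_title,
        CAST(years_experience AS INTEGER) AS years_experience
    FROM survey_responses
    WHERE job_title IS NOT NULL
      AND years_experience IS NOT NULL
"""

df = pd.read_sql(query, engine)
```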
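And for the last tip, here's one rough way to turn "document what your data should look like" into a concrete checklist. The expected columns and dtypes are made-up examples, not the project's real schema:

```python
# Sketch: write down the expected shape of the model-ready data up front,
# then use it as a checklist while cleaning. Columns and types are hypothetical.
import pandas as pd

EXPECTED_COLUMNS = {
    "job_title": "object",        # cleaned, lower-cased label
    "years_experience": "int64",  # no nulls, cast to integer
    "uses_python": "bool",        # yes/no survey answer mapped to bool
}

def check_model_ready(df: pd.DataFrame) -> None:
    """Raise if the cleaned DataFrame doesn't match the documented schema."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    assert not missing, f"missing columns: {missing}"
    for column, dtype in EXPECTED_COLUMNS.items():
        assert str(df[column].dtype) == dtype, f"{column} should be {dtype}"
        assert df[column].notna().all(), f"{column} should have no nulls"
```

Running check_model_ready on your cleaned DataFrame tells you at a glance which transformation steps are still outstanding.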

Model performance

The most prevalent method of measuring a model's performance is through its loss metric. For classification tasks, I usually employ cross-entropy; for regression tasks, mean squared error is preferable, particularly when outliers are of concern, since squaring the errors penalizes large mistakes heavily. Nonetheless, while these metrics are valuable for training purposes, they may obscure important distinctions between individual models.
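As a quick illustration, both metrics are available in scikit-learn; the labels and predictions below are toy values, purely to show the calls:

```python
# Toy example of the two loss metrics mentioned above, using scikit-learn.
# All numbers are made-up illustrative values.
from sklearn.metrics import log_loss, mean_squared_error

# Classification: cross-entropy (log loss) over predicted class probabilities.
y_true_clf = [0, 2, 1, 2]                 # true class labels
y_prob_clf = [[0.8, 0.1, 0.1],            # predicted probability per class
              [0.2, 0.2, 0.6],
              [0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4]]
print("cross-entropy:", log_loss(y_true_clf, y_prob_clf))

# Regression: mean squared error between true and predicted values.
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5,  0.0, 2.0, 8.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
```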

Error analysis

Here's where error analysis comes in handy. In machine learning, "error analysis" typically refers to examining the number and types of mistakes a model has made. This can aid in enhancing the model during training and tuning by identifying areas where additional feature engineering or data preprocessing may be required.

Incorporating an error analysis discussion with your final model can also improve trust in the model's outcomes.

All machine learning models make errors. It’s important to be able to communicate the types of errors your model is likely to make and consider that when selecting which model to implement.

For this project, we’ll be looking at multi-class classification problems and using confusion matrices to quickly summarize errors.
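Here's a minimal sketch of that with scikit-learn; the job-title labels are placeholders rather than the survey's actual categories:

```python
# Minimal sketch: summarizing multi-class errors with a confusion matrix.
# The job-title labels below are placeholders, not the real survey classes.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

labels = ["data scientist", "data analyst", "data engineer"]
y_true = ["data scientist", "data analyst", "data engineer",
          "data scientist", "data analyst"]
y_pred = ["data scientist", "data scientist", "data engineer",
          "data analyst", "data analyst"]

# Rows are true classes, columns are predicted classes; off-diagonal
# counts show which classes the model confuses with each other.
cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```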

Interpretability

When assessing model performance, it is essential to consider interpretability. Specifically, why did a model make a particular decision for a specific class? When working with stakeholders who possess a wealth of knowledge about the data under consideration, being able to explain this can help establish confidence in your model.

In this project, we will use counterfactuals as a means of interpreting model decisions. Counterfactuals let you ask questions such as "Which feature would I need to modify, and how, to obtain a different output?" or, in the simpler-to-compute scenario, "How would my model's output change if I altered a single feature in a specific manner?"
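To make the single-feature version concrete, here's a rough sketch that trains a toy classifier and sweeps one feature of a single row until the predicted class flips. The data, features, and model are stand-ins, not the project's actual setup:

```python
# Sketch of a single-feature counterfactual probe on a toy classifier.
# The features, data, and model here are stand-ins for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # three made-up features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic labels

model = RandomForestClassifier(random_state=0).fit(X, y)

row = X[0].copy()
original_pred = model.predict(row.reshape(1, -1))[0]

# Sweep feature 0 over a range of values, holding the others fixed,
# and report the first value where the model's prediction flips.
for new_value in np.linspace(-3, 3, 61):
    candidate = row.copy()
    candidate[0] = new_value
    if model.predict(candidate.reshape(1, -1))[0] != original_pred:
        print(f"Prediction flips when feature 0 is set to {new_value:.2f}")
        break
else:
    print("No flip found in the range swept.")
```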

Counterfactuals have two main advantages: they can be used with any class of model, and they're easy for someone without much of a machine learning background to understand. This is important; not everyone on your team is going to have, or need, a deep background in machine learning.

If you're curious, you can check out the counterfactual explanations chapter of Christoph Molnar's book "Interpretable Machine Learning: A Guide for Making Black Box Models Explainable" for more details.

For this project, we'll be predicting which job title a data-science-related role will have, given the responsibilities of the role, using data from the 2018 & 2019 Kaggle Data Science Survey.