How Accurate is Your Accuracy?

By validation dataset, we mean the specific set of data used to calculate a model’s final quality metric. This is where we get the definitive measure of how well our model performs.

A Quick Note Before We Begin

The ideas in this post are specifically for metrics like accuracy, which are based on a series of independent “yes” or “no” outcomes. These methods do not apply to “aggregated” metrics like Mean Squared Error (MSE), Mean Absolute Error (MAE), or Root Mean Squared Error (RMSE). We’ll be focusing on a different kind of statistical analysis here.

The Train-Validation Split

It's a common practice to split a dataset in an 80/20 ratio, dedicating 80% of the data to training the model and the remaining 20% to validation. But is 20% of a dataset enough to get a good estimate of a quality metric?

As any data scientist or machine learning engineer knows, a larger training dataset generally leads to better results. However, here we're not focusing on the bias-variance tradeoff (the balance between underfitting and overfitting). Instead, we're going to discuss something else: the size of the validation set. The bigger the validation set, the more accurate the estimate of the model's quality metric.

A model’s quality metric is a statistical value, which means it has a confidence interval - a range in which the true value is likely to fall. For example, instead of just saying your model has an accuracy of 93%, it’s more accurate to say 93% ± 2% (with a confidence level of 95%, where z = 1.96). Unfortunately, this level of detail is rarely shared, even by many experienced practitioners. Even by me.

Multiple Observations

In statistical terms, any quality metric we get from our validation dataset is a single observation [1]. What does that mean for us? It means if we were to create slightly different validation sets, all sampled from the same original dataset, we would get slightly different metric values. By collecting a set of these metric values, we can then estimate the metric's expected value [2].

In simple terms, the expected value is just the mean of all our observations. The more observations we have, the more precise our estimate becomes, resulting in a narrower confidence interval.
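As a quick illustration, here's a minimal sketch with scikit-learn (the dataset and model are toy stand-ins for your own): each random train/validation split yields one observation of the metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Toy data and model; stand-ins for your own dataset and estimator.
X, y = make_classification(n_samples=2000, random_state=42)
model = LogisticRegression(max_iter=1000)

# Five different 80/20 splits -> five observations of the accuracy metric.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores)         # five slightly different accuracy values
print(scores.mean())  # estimate of the metric's expected value
```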

As an example, let’s consider the following.

We have five observations of a model's quality metric: x = [75, 72, 78, 73, 77]. What is the expected value of this metric? And what is its 95% confidence interval? For this example, let's assume the observations are normally distributed and use a z-score of 1.96.

Mean of observations: $$\bar{x} = {75 + 72 + 78 + 73 + 77 \above{1pt} 5} = 75.0$$

Variance: $$var = { (75 - 75.0)^2 + (72 - 75.0)^2 + (78 - 75.0)^2 + (73 - 75.0)^2 + (77 - 75.0)^2 \above{1pt} 5} = 5.2$$

(We divide by $n$ here for simplicity, i.e. the population variance. For a sample this small, dividing by $n - 1$ and using a t-distribution instead of $z$ would be more rigorous.)

Standard deviation [3]: $$\sigma = \sqrt {var} = \sqrt {5.2} = 2.28 $$

Standard error: $$SE = {\sigma \above{1pt} \sqrt{n}} = {2.28 \above{1pt} \sqrt{5}} = 1.02$$

Margin of error: $$ME = z \cdot SE = 1.96 \cdot 1.02 = 2.0$$

Confidence interval [4]: $$CI_x = \bar{x} \plusmn ME = 75.0 \plusmn 2.0$$
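The same computation in a few lines of Python (to match the text, we use the population standard deviation, i.e. ddof=0):

```python
import numpy as np

x = np.array([75, 72, 78, 73, 77])  # five observations of the metric
z = 1.96                            # 95% confidence, normal assumption

mean = x.mean()               # 75.0
sigma = x.std(ddof=0)         # 2.28 (population std, as in the text)
se = sigma / np.sqrt(len(x))  # standard error: 1.02
me = z * se                   # margin of error: ~2.0

print(f"{mean:.1f} +/- {me:.2f}")  # 75.0 +/- 2.00
```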

For 5 observations of accuracy, we have to shuffle the dataset and retrain the model 5 times. But is it a good idea to spend compute resources (especially GPU time) on this kind of task? Can we use just a single observation of the quality metric to get its confidence interval?

Single Observation

For a single observation we have a few main options:

  • Wald CI [5]
  • Wilson CI [6]

For simplicity, let's consider only the Wald CI in the following example.

Let's say we have a model accuracy of x = 75% on a binary classification task, and the validation dataset has 1000 samples. What is the confidence interval? Assume the data has a normal distribution and the $z$ score is 1.96 (95% confidence level).

So, we have:

  • Accuracy: $75 \% = 0.75$
  • Sample size: $n = 1000$
  • Number of correct classifications: $k = 0.75 \cdot 1000 = 750$

We consider:

$$k \sim \mathrm{Binomial}(n=1000, p)$$

So the accuracy

$$\hat {p} = {k \above{1pt} n}$$

is itself a random variable, which means we can estimate a confidence interval for $p$.

Confidence interval: $$CI = \hat {p} \plusmn z \cdot \sqrt { { \hat {p} (1 - \hat {p}) \above{1pt} n } } $$

  • $\hat p=0.75$, $n=1000$, $z=1.96$ (for a 95% confidence level).

Standard error: $$SE = \sqrt { { 0.75 (1 - 0.75) \above{1pt} 1000 } } = 0.0137$$

So, confidence interval is: $$CI = 0.75 \plusmn 1.96 \cdot 0.0137 = 0.75 \plusmn 0.0268$$
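The whole Wald calculation fits in a few lines of Python:

```python
import math

p_hat = 0.75  # observed accuracy
n = 1000      # validation set size
z = 1.96      # 95% confidence level

se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error: 0.0137
margin = z * se                          # margin of error: 0.0268

print(f"{p_hat} +/- {margin:.4f}")  # 0.75 +/- 0.0268
```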

Be careful with the statistical tests you choose. The Wald confidence interval, for example, isn't reliable for small sample sizes and can be misleading. It tends to shrink to a width of zero as your model's accuracy ($\hat {p}$) gets very close to 1 or 0, giving a false sense of certainty.
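If you'd rather not hand-roll the formulas, the statsmodels package (assuming it's available in your environment) implements both intervals in proportion_confint; a quick sketch comparing the two on the example above:

```python
from statsmodels.stats.proportion import proportion_confint

k, n = 750, 1000  # correct predictions, validation set size

# method="normal" is the Wald interval; "wilson" behaves better
# for small n or accuracy close to 0 or 1.
wald = proportion_confint(k, n, alpha=0.05, method="normal")
wilson = proportion_confint(k, n, alpha=0.05, method="wilson")

print(wald)    # ~(0.7232, 0.7768)
print(wilson)  # ~(0.7222, 0.7758), nearly identical here
```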

With these simple formulas in mind, we can estimate the required size of our validation dataset. This helps us achieve the precision we need when assessing a model's performance metric.

| Accuracy | Val Dataset Size | Confidence Interval |
| --- | --- | --- |
| 0.80 | 100 | $0.80 \plusmn 0.08$ |
| 0.80 | 200 | $0.80 \plusmn 0.06$ |
| 0.80 | 500 | $0.80 \plusmn 0.04$ |
| 0.80 | 1000 | $0.80 \plusmn 0.02$ |
| 0.80 | 5000 | $0.80 \plusmn 0.01$ |
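Inverting the Wald formula gives the required validation set size directly: $n = {z^2 \, \hat p (1 - \hat p) \above{1pt} d^2}$, where $d$ is the desired half-width of the interval. A small helper (the function name is my own) reproduces the table's ballpark:

```python
import math

def required_val_size(p: float, half_width: float, z: float = 1.96) -> int:
    """Validation set size so the Wald CI is roughly p +/- half_width."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

# How many samples to pin an accuracy of ~0.80 down to +/- 0.025?
print(required_val_size(0.80, 0.025))  # 984, in line with the ~1000 row above
```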

Final Thoughts

It's common for ML engineers to overlook the size of the validation dataset and its impact on how reliably we can measure a model's performance. This is understandable - we're often busy with more pressing tasks like data cleaning, model training, and hyperparameter tuning. In most projects, data is a limited resource, and the focus is on putting as much of it as possible into the training process.

This narrow focus on training often causes us to forget about other critical stages. A complete project involves not only validation but also serving, monitoring, integration (with frontend and backend), and presenting results to stakeholders to collect feedback.

References