Is it Really a Test Set?
Best Practice
We've done it! After much stressing, finagling, maybe even praying, we've finally gotten a good score on our holdout test set. This was particularly difficult because of our small sample size, but after so many tries, we did it! All that stratifying, quantifying, and optimizing to get the right distributions, weights, and initial conditions has finally paid off.
Or has it?
A single word above betrays a hidden, pernicious, and fundamental flaw in our approach to building a generalizable and useful classification algorithm.
"Finally".
After all our efforts to get a "good" score on our holdout test set, what was our mistake? We allowed the test results to influence our methods.
"No no," we reflect, "that's not true. We had the right splits; a training set, a validation set, and a test set. We even stratified by patient and location, so that each location and patient was only seen within their respective dataset" (the training, validation, or testing sets respectively). "How could our test dataset result have influenced our optimization methods?"
It's true that those and other best practices help achieve better results while also increasing confidence that those results will generalize. That said, any time we allow feedback from some result to make its way into the system, we immediately begin to fit to that feedback. This is particularly pernicious if, as mentioned earlier, we have a small dataset.
This becomes a hidden and vicious problem when that feedback is coming from our test set, because the whole point of a holdout test set is that it provides no feedback into our system. We only get a true measure of a model's generalizability from a genuine holdout test set, one that has been withheld throughout the entire optimization and feedback process.
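One mechanical safeguard is to draw the test split exactly once and pin it to disk, so that no later run can quietly redraw it. This is only a sketch of that idea; the cache file name and the `split_fn` argument (for example, the `grouped_three_way_split` above) are assumptions, not part of any standard API.

```python
# Sketch: freeze the holdout split by caching its indices on the first run,
# then reload them forever after. Assumes the DataFrame has a JSON-friendly
# (e.g. integer) index.
import json
from pathlib import Path


def frozen_test_indices(df, split_fn, path="test_indices.json"):
    cache = Path(path)
    if cache.exists():
        # The test set was already drawn once; never redraw it.
        return json.loads(cache.read_text())
    _, _, test = split_fn(df)  # e.g. grouped_three_way_split from above
    indices = test.index.tolist()
    cache.write_text(json.dumps(indices))
    return indices
```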
"But, the test set isn't affecting the hyperparameter search, how is it providing feedback into our system?" Unfortunately, if, after seeing the test set results, we then go back and influence the system in some way, it's providing feedback through us. If we have any sway in how the dataset is handled, preprocessed, or provided to the model, and changes are applied to that system in response to any results, we propagate change through our model and begin fitting to those results. Only the very first time we received results from our "holdout" test set did we get an actual measure of our model's generalizability.
Finally?
I harped on the word "finally" earlier because it subtly implies that there have been many feedback loops in which we saw the test results, got frustrated that they weren't where we wanted them to be, and subsequently influenced the system. The more the results of a test influence our model training system, the more our model is fit to those data.
Yes, there are best practices that try to mitigate this, but when we allow the results from our test set to influence the system through us, we "train to the test" with each feedback loop. In the end this isn't that harmful to the training process itself, unless we need to effectively measure generalizability, which is usually the point of the "holdout" test set. The answer to the question "am I fitting to the test set?" isn't black and white, so treat it like another training loop. The model is "fit" to the training dataset because training often iterates over the entire dataset tens, hundreds, or even thousands of times. Allowing test-result feedback into the system once or twice is going to be less of a problem than, say, a dozen times.
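If we want to be honest about how many of those feedback loops have happened, we can make the count explicit. The class below is only an illustration of that bookkeeping (the `HoldoutGuard` name is made up, and accuracy is just a placeholder metric), not a standard tool.

```python
# A hypothetical guard around the holdout set: it hands back test metrics,
# but records every access so repeated peeks are impossible to ignore.
import warnings
from sklearn.metrics import accuracy_score


class HoldoutGuard:
    def __init__(self, X_test, y_test):
        self._X, self._y = X_test, y_test
        self.evaluations = 0  # how many times feedback could have leaked

    def evaluate(self, model):
        self.evaluations += 1
        if self.evaluations > 1:
            warnings.warn(
                f"Holdout evaluated {self.evaluations} times; only the first "
                "result was an unbiased estimate of generalization."
            )
        return accuracy_score(self._y, model.predict(self._X))
```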
The solution? An easy answer would be to say "well, just have another test set that we hold out the entire time!" Sure, but what if our results are good up until that test, after which they fail again? Do we say, "aw, bummer, it's not going to generalize," and then go live on a farm? No, we are paid to do this, so we will likely let that result feed back into our system and try to improve the predictions.
So the cycle continues.
There are two takeaways I see from this mental model:
We should be wary if we are having a hard time getting our model to converge to good results on our test set, especially if the training and validation results are good. Even if the model eventually gives good test set results, that initial difficulty (training and validation good, test set not) is an indicator that the results likely will not generalize beyond the distributions we have at hand.
For each dataset split, the "fitting" process is a function of how many times we let results feed back into the training system to alter that system somehow. We only know that the model generalizes when we see it consistently perform well on data that hasn't influenced the training system at all.
During some of my consulting work I've helped develop models for some large pharma companies. How did they evaluate whether our team knew what we were doing? By withholding an entire dataset that they alone had access to, which they would run our model on once we finished the training process. Our team succeeded in part because we went out of our way to seek out a totally separate target dataset beyond what the pharma company had provided to us. That external dataset was crucial because we needed to know that our model generalized beyond the data we were given. In the end, we passed.

