We as a community are acutely aware that the data available for building models can be extremely noisy. There are two approaches to dealing with this. First, use only consistent, low-noise sets of data. Second, develop noise-robust fitting procedures. I don't intend to argue for one of these here, except to note that I believe we have to take the second approach and leverage all available data.
With regard to evaluating algorithms, the first approach does not require much further discussion here. The second approach is more interesting.
The correct way to approach developing noise-robust algorithms is to begin by assessing the typical nature and quantity of noise on our problem of interest. We then develop techniques to deal with these particular kinds of noise. The reason this is the correct approach is that general robustness to arbitrary noise is hard to achieve, very hard. We always want to leverage our knowledge and understanding of the problem at hand to make it easier. Dealing with noise is an issue where this approach can provide significant leverage.
However, if we have developed algorithms to deal with the kinds of idiosyncratic noise that arise in our data, we can't evaluate them on data that doesn't present similar kinds of noise. Suppose, for example, that our fitting procedure has two components: a learning algorithm A and a noise-robustness module B. Now suppose we use a set of data to evaluate the relative performance of A alone against A combined with B (AB). Clearly, if the evaluation data is noise-free, or exhibits different kinds of noise than those for which B was developed, we should not expect AB to perform better. In fact, we can reasonably expect it to perform worse.
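The effect can be seen in a minimal sketch. Everything here is an assumption chosen for illustration, not anything specified above: the problem is fitting the slope of a line y = 2x, A stands for ordinary least squares, and AB stands for a robust fit (a Theil–Sen-style median of pairwise slopes, playing the role of "A plus noise-robustness module B"); the idiosyncratic noise B was built for is modeled as a handful of gross outliers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (names and noise model are assumptions, not from
# the text): true line is y = 2x with small measurement noise.
x = np.linspace(0.0, 1.0, 50)
y_clean = 2.0 * x + rng.normal(0.0, 0.01, size=x.size)

# Contaminated copy: gross outliers of the kind B was built to handle.
y_noisy = y_clean.copy()
y_noisy[::10] += 5.0

def fit_A(x, y):
    # "A": plain least-squares slope through the origin.
    return (x @ y) / (x @ x)

def fit_AB(x, y):
    # "AB": robust slope via the median of all pairwise slopes
    # (Theil-Sen style), standing in for A plus robustness module B.
    i, j = np.triu_indices(x.size, k=1)
    return np.median((y[j] - y[i]) / (x[j] - x[i]))

def err(slope):
    # Absolute error against the true slope of 2.
    return abs(slope - 2.0)

print("clean data:  A err =", err(fit_A(x, y_clean)),
      " AB err =", err(fit_AB(x, y_clean)))
print("noisy data:  A err =", err(fit_A(x, y_noisy)),
      " AB err =", err(fit_AB(x, y_noisy)))
```

On the contaminated data the robust AB fit should beat A decisively, while on the clean data the two are close and AB has no advantage; this is exactly why evaluating AB on data without the noise it was designed for tells us little.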
More generally, it is not reasonable to use noise-free data to compare algorithms perfected for noise-free data against algorithms perfected for noisy data. We must decide a priori which case is most relevant to the problem at hand and evaluate that case.
(Note that when I refer to evaluation data I mean both the training and testing data used for the evaluation.)