Looking at the work of MIT’s Curtis Northcutt on the large number of errors in popular datasets
This is one of the occasional posts we do about topics that interest us. If you are seeing this and you aren’t signed up to our email lists, you can receive blog posts like this, as well as our Friday Digest, by signing up below.
By Christopher Brennan
Those working in AI, including at Deepnews, love to announce their shiny new models, highlighting how innovative and efficient they are at accomplishing tasks and helping people.
However, the raw material behind those models is data, and there is a growing call from some in the community, such as entrepreneur Andrew Ng, to move from a “model-centric” view of developing artificial intelligence to a more “data-centric” one. This means that when researchers want to improve something, they look more at what they are putting into their model, rather than at the code that makes it up.
Recent years have seen closer looks at what exactly is being fed into machines that help them perform, with implications for everything from highlighting quality news to making suggestions for law enforcement that can have terrible consequences. While questions of bias because of a lack of comprehensive data on certain groups have gotten a lot of deserved attention, there is also the question of whether some data being used in the development of AI is just not accurate.
Last week I had the pleasure of speaking with Curtis Northcutt, a researcher at MIT who has led a look into mislabelled data in major test sets. These test sets are used to measure how good a new model is. For example, ImageNet is tens of thousands of images that have been labelled with classes (a picture of a baseball labelled “baseball”) so that when someone creates a new AI aimed at “seeing,” the creators have something to check against to make sure their model is doing a good job.
What he found was that significant percentages of the data in those test sets, 5.8% in the case of ImageNet and higher for others, were mislabelled. The errors ran the gamut from misunderstanding the image completely, to misidentifying the type of dog, to failing to account for multiple valid labels, such as a bucket full of baseballs that was labelled only “bucket.” If a model says during a test that the image contains “baseballs,” it is marked wrong, which skews how useful we think it is. Northcutt likens the test sets to an answer key for a high school exam, through which a teacher determines who gets high scores and is then offered more opportunities.
“If those tests are wrong then the benchmarks are wrong and we have no idea how machine learning is progressing,” he said (making an appearance on Deepnews’s occasional Friday Clubhouse show, which you’ll be alerted to if you follow my account).
It is of course difficult and expensive to have humans go through thousands of rows of data by hand and verify that every label is correct, though Northcutt has been part of a push to find likely mistakes in a more systematic way. Through a process called “confident learning,” he and his team were able to find the labels that were likely to be errors by looking at the places where a model was confident that the image was something else (“baseballs” instead of just “bucket”).
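The full method is implemented in Northcutt’s open-source cleanlab library; what follows is a minimal, simplified sketch of the core idea only, with made-up probabilities. An example is flagged when the model’s confidence in some class other than the given label exceeds that class’s average self-confidence threshold:

```python
# Simplified sketch of the "confident learning" idea (toy data; the real
# implementation is Northcutt's open-source `cleanlab` library).

def class_thresholds(probs, labels, num_classes):
    """Average self-confidence per class: mean predicted probability of
    class j over examples whose given label is j."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for p, y in zip(probs, labels):
        sums[y] += p[y]
        counts[y] += 1
    return [s / c if c else 1.0 for s, c in zip(sums, counts)]

def likely_label_errors(probs, labels, num_classes):
    """Flag examples where the model is confident (above threshold)
    that the example belongs to a class other than its given label."""
    t = class_thresholds(probs, labels, num_classes)
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        confident = [j for j in range(num_classes) if j != y and p[j] >= t[j]]
        if confident:
            best = max(confident, key=lambda j: p[j])
            flagged.append((i, y, best))  # index, given label, suggested label
    return flagged

# Toy example: class 0 = "bucket", class 1 = "baseballs".
probs = [
    [0.9, 0.1],  # labelled bucket, model agrees
    [0.2, 0.8],  # labelled bucket, model confident it's baseballs
    [0.3, 0.7],  # labelled baseballs, model agrees
]
labels = [0, 0, 1]
print(likely_label_errors(probs, labels, 2))  # -> [(1, 0, 1)]
```

The second image, labelled “bucket,” is flagged because the model’s 0.8 confidence in “baseballs” clears that class’s threshold. In practice the probabilities come from a model’s out-of-sample predictions, not from training on the same data it is checking.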
Those errors were then checked by human workers, which is how Northcutt’s paper came to its eye-popping final number of mislabelled data. You can find some interesting examples at Labelerrors.com.
The fact that there are so many mistakes is interesting in and of itself, but Northcutt’s team also went further and tested well-known AI models on the data both with and without the errors. What they found was that some less advanced, less complex models performed better on the corrected test sets, which suggests they would also perform better in the “real world.”
“Simpler models, when there’s a lot of noise, don’t fit to that noise,” Northcutt said.
The problem appears to be that some of the more advanced models have learned to perform better on the mislabelled data. In the education metaphor, this is the teacher “teaching to the test” when the test itself is wrong. And these mislabelled test sets are being used for AI models that get a lot of attention. For example, OpenAI’s CLIP, which I have written about before, was evaluated on ImageNet.
So what do we do about all these errors? For his part, Northcutt, a strong supporter of keeping knowledge open source, has made cleaned versions of test sets available through his GitHub page.
Beyond that, there is probably something to Ng and others’ push to change mindsets to focus on the data that can improve outcomes, rather than just announcing a model with nice top-line figures. At Deepnews, we developed a test set for our English-language model by hand, based on various news categories (local news, national magazines, etc.), with human-labelled scores of 1, 2, 3, 4, or 5. We are now doing the same for French with similar categories. This makes sense because we are doing something new with AI and there is no publicly available test set we could use, but this process, though time-consuming, also helps us be confident about where our model can deliver for clients and where it may need improvement.
Work like Northcutt’s will only become more important as the field works out how to create AI models that are effective and reliable. For those who want to dig deeper into the idea of “confident learning” and related topics, Northcutt also recommends this paper from Charles Elkan and Keith Noto at UC San Diego, as well as work done by Snorkel.