Humans are leading AI systems astray because we can't agree on labeling

Is it a bird? Is it a plane? Asking for a friend's machine-learning code

Top datasets used to train AI models and benchmark how the technology has progressed over time are riddled with labeling errors, a study shows.

Data is a vital resource in teaching machines how to complete specific tasks, whether that's identifying different species of plants or automatically generating captions. Most neural networks are spoon-fed lots and lots of annotated samples before they can learn common patterns in data.
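
To see why the annotations matter so much, here's a minimal sketch of supervised learning on made-up data (scikit-learn, every name and number invented for illustration): the labels are the model's only notion of "truth", so it learns whatever they say, right or wrong.

```python
# Minimal sketch of supervised learning on annotated samples.
# Hypothetical toy data, not anything from the study.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))            # raw samples
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # human-provided annotations

# The (X, y) pairs are the only supervision signal: the model learns to
# reproduce whatever the labels say.
model = LogisticRegression().fit(X, y)
print(f"training accuracy: {model.score(X, y):.3f}")
```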

But these labels aren’t always correct, and training machines on error-prone datasets can degrade their performance and accuracy. In the study, led by MIT researchers, analysts combed through ten popular datasets that have been cited more than 100,000 times in academic papers, and found that on average 3.4 per cent of the samples are wrongly labeled.
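
The team behind the paper flagged candidate errors automatically before having humans double-check them. For flavour, here's a heavily simplified, hypothetical sketch in the spirit of confident learning: treat a label as suspect when a model's out-of-sample predicted probabilities back some other class more convincingly than the given one. This is an illustration, not the researchers' actual code.

```python
import numpy as np

def find_likely_label_errors(labels, pred_probs):
    """Hypothetical helper: flag samples whose given label disagrees with
    out-of-sample predicted probabilities (confident-learning style,
    heavily simplified; assumes every class appears in `labels`)."""
    n, k = pred_probs.shape
    # Per-class confidence threshold: the mean predicted probability of
    # class j among samples currently labeled j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    confident = pred_probs >= thresholds   # (n, k): classes the model "trusts"
    suspects = []
    for i in range(n):
        candidates = np.flatnonzero(confident[i])
        # Suspicious if the model is confident in some class, but not the given one.
        if candidates.size and labels[i] not in candidates:
            suspects.append(i)
    return np.array(suspects, dtype=int)

# Made-up numbers: sample 2 is labeled 0, but the model is confident it's class 1.
labels = np.array([0, 1, 0])
pred_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.1, 0.9]])
print(find_likely_label_errors(labels, pred_probs))  # -> [2]
```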

The datasets they looked at range from photographs in ImageNet and sounds in AudioSet to reviews scraped from Amazon and sketches in QuickDraw.

Examples of the mistakes compiled by the researchers show that some cases are clear blunders, such as a drawing of a light bulb tagged as a crocodile. Others are less clear-cut: should a picture of a bucket of baseballs be labeled as ‘baseballs’ or ‘bucket’?
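
Part of the trouble is the framing: single-label datasets force annotators to pick exactly one class. Here's a toy comparison (invented class list, numpy only) of a one-hot target against a multi-hot target that could carry both answers.

```python
import numpy as np

classes = ["baseball", "bucket", "crocodile", "light bulb"]  # toy label space

# Single-label annotation: a one-hot target forces one arbitrary choice.
one_hot = np.eye(len(classes))[classes.index("baseball")]

# Multi-label annotation: a multi-hot target lets the photo carry both answers.
multi_hot = np.zeros(len(classes))
for name in ("baseball", "bucket"):
    multi_hot[classes.index(name)] = 1.0

print(one_hot)    # [1. 0. 0. 0.]
print(multi_hot)  # [1. 1. 0. 0.]
```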

What would happen if a self-driving car were trained on a dataset that frequently mislabels three-way intersections as four-way intersections?
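
Nobody has run that exact experiment here, but you can get a feel for the stakes with a toy one (synthetic scikit-learn data, nothing to do with driving): flip a growing fraction of training labels and watch held-out accuracy slide.

```python
# Toy experiment: corrupt an increasing share of training labels and
# measure the hit to held-out accuracy. All data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for noise in (0.0, 0.05, 0.2, 0.4):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise   # corrupt this fraction of labels
    y_noisy[flip] = 1 - y_noisy[flip]         # binary flip: 0 <-> 1
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
    print(f"label noise {noise:.0%}: held-out accuracy {acc:.3f}")
```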

Read more at https://www.theregister.com/2021/04/01/mit_ai_accuracy/