Top datasets used to train AI models and benchmark how the technology has progressed over time are riddled with labeling errors, a study shows.

Data is a vital resource in teaching machines how to complete specific tasks, whether that’s identifying different species of plants or automatically generating captions. Most neural networks are spoon-fed lots and lots of annotated samples before they can learn common patterns in data.

But these labels aren’t always correct; training machines using error-prone datasets can decrease their performance or accuracy. In the aforementioned study, led by MIT, analysts combed through ten popular datasets that have been cited more than 100,000 times in academic papers and found that on average 3.4 per cent of the samples are wrongly labelled.

See https://www.theregister.com/2021/04/01/mit_ai_accuracy/

#technology #AI #openstandards

  • ChickenBumMcSplatter@lemmy.ml
    3 upvotes · 4 years ago

    I purposely try to sneak in one bad label when doing captchas. For instance, I’ll click all the tractors and one car. If I’m not getting paid to train the AI, then don’t expect me to care.

    • ufra@lemmy.ml
      1 upvote · 4 years ago

      Where does the idea that they use captcha for training come from? How does it already know the answers if it is asking you to provide them? Maybe it is some complicated probability thing that I don’t understand…

      • ksynwa@lemmy.ml
        3 upvotes · 4 years ago

        From what I have read, the reCaptcha given to you is easier or harder based on how Google profiles you (no idea how it works for hCaptcha). If Google suspects you of being a bot, it will give you harder challenges and will probably serve you images whose labels it already knows with confidence, so it can judge whether you’re a bot or not. Otherwise it’ll give you a mix of images, some with labels it knows and some without, and consolidate information from your answers.
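
To make the "mix of known and unknown images" idea concrete, here is a rough Python sketch. It is not how reCaptcha or hCaptcha actually work internally (that isn't public as far as I know); the image names, the `submit` and `consolidate` functions, and the thresholds are all made up for illustration. The idea: a user's answers only count if they match the images with known labels, and an unknown image only gets a label once enough users agree.

```python
from collections import Counter, defaultdict

# Hypothetical control set: images whose labels the service already trusts.
KNOWN_LABELS = {"img_a": "tractor", "img_b": "car"}


def submit(answers, known, votes):
    """Check answers against the known images. If every known image is
    answered correctly, record the remaining answers as votes.
    Returns True if the user passed the check."""
    control = {img: lab for img, lab in answers.items() if img in known}
    passed = bool(control) and all(known[img] == lab for img, lab in control.items())
    if passed:
        for img, lab in answers.items():
            if img not in known:
                votes[img][lab] += 1
    return passed


def consolidate(votes, min_votes=3, min_share=0.7):
    """Assign a label to an unknown image once enough users agree.
    min_votes and min_share are arbitrary illustrative thresholds."""
    labels = {}
    for img, counter in votes.items():
        total = sum(counter.values())
        label, count = counter.most_common(1)[0]
        if total >= min_votes and count / total >= min_share:
            labels[img] = label
    return labels


votes = defaultdict(Counter)
# Three users get the control image right and label the unknown image.
for _ in range(3):
    submit({"img_a": "tractor", "img_x": "tractor"}, KNOWN_LABELS, votes)
# One user gets the control image wrong, so their answers are discarded.
submit({"img_a": "car", "img_x": "car"}, KNOWN_LABELS, votes)
print(consolidate(votes))
```

Under this toy scheme, a deliberately wrong click on a known image just gets the whole submission thrown out, and a wrong click on an unknown image has to outvote everyone else before it affects the label.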