I wanted to extract some crime statistics broken down by type of crime and by population group, all normalized by population size, of course. I got a nice set of tables summarizing the data for each year that I requested.

When I shared these summaries, I was told they are entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?

I compared results from ChatGPT-4, Copilot and Grok, and the results are the same. (Gemini says the data is unavailable, btw :)

So are LLMs reliable for research like that?
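For comparison, the normalization step described above is simple to do deterministically once the raw counts and population figures come from a source that can be checked. A minimal sketch, assuming made-up group names and counts (none of these numbers are real statistics, and `rate_per_100k` is a name chosen only for this illustration):

```python
def rate_per_100k(incidents: int, population: int) -> float:
    """Return incidents per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return incidents * 100_000 / population


# Hypothetical counts, invented purely for illustration (not real statistics).
counts = {
    ("group_a", "burglary"): 120,
    ("group_b", "burglary"): 60,
}
populations = {"group_a": 400_000, "group_b": 150_000}

# Normalize each count by the population of its group.
rates = {
    key: rate_per_100k(n, populations[key[0]])
    for key, n in counts.items()
}

for (group, crime), rate in sorted(rates.items()):
    print(f"{group:8s} {crime:10s} {rate:6.1f} per 100k")
```

Every number in the output can be traced back to an input row, which is exactly what a table generated by an LLM does not guarantee.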

  • jet@hackertalks.com
    28 up · 1 down · edited · 22 days ago

    LLMs are totally unreliable for research. They are just probable token generators.

    Especially if you’re looking for new data that nobody has talked about before, you’re just going to get convincing hallucinations, like talking to a slightly drunk professor at a loud bar who can’t ever admit they don’t know something.

    Example: ask an LLM this: “what open source software developer died in the September 11th attacks?”

    It will give you names, and when you try to verify those names, you’ll find out those people didn’t die. It’s just generating probable tokens.
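To illustrate what “generating probable tokens” means, here is a toy sketch. The probability table is invented for the example; a real model computes such probabilities with a neural network over a huge vocabulary, so this only shows the sampling step, not how any particular LLM works internally.

```python
import random

# Hypothetical next-token probabilities, invented for illustration.
# Nothing in this table records whether a name is factually correct.
next_token_probs = {
    "Alice": 0.40,
    "Bob": 0.35,
    "Carol": 0.25,
}


def sample_next_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(next_token_probs, rng) for _ in range(5)]
print(samples)
```

The sampler always returns some name from the table. It has no step that checks the name against reality, which is why a plausible answer and a true answer can come apart.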

    • mods_mum@lemmy.today (OP)
      9 up · 22 days ago

      That seems pretty fucking important :) Thanks for educating me. I’ll stick to raw R for now.

      • INeedMana@lemmy.world
        5 up · 22 days ago

        Asking an LLM for raw R code that accomplishes some task and fixing the bugs it hallucinates can be a time booster, though

    • LANIK2000@lemmy.world
      4 up · edited · 22 days ago

      Tried the example, got 2 names of people who did die in the attacks, but they sure as hell weren’t developers or anywhere near the open source sphere. Also love the classic “that’s not correct” with the AI response being “ah yes, of course”. Shit has absolutely 0 reflection. I mean, it makes sense: people usually have doubts in their head BEFORE they write something down. The training data completely skips the thought process, so LLMs can’t learn to doubt.

    • ViaFedi@lemmy.ml
      3 up · 5 down · 22 days ago

      Solutions exist where you give the LLM a bunch of files, e.g. PDFs, which it will then solely base its knowledge on.

      • jet@hackertalks.com
        8 up · 2 down · 22 days ago

        It’s still a probable token generator, you’re just training it on your local data. Hallucinations will absolutely happen.

        • slacktoid@lemmy.ml
          1 up · 1 down · edited · 22 days ago

          This isn’t training; it’s called a RAG (retrieval-augmented generation) workflow, as there is no training step per se.
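A rough sketch of the retrieval half of such a workflow, to show why no training step is involved: passages from your files are searched at question time and pasted into the prompt, while the model’s weights stay untouched. Real systems usually rank passages with embedding similarity; plain word overlap is swapped in here to keep the example self-contained. The passages and all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count the words shared by query and passage (a stand-in for embeddings)."""
    q = Counter(tokenize(query))
    p = Counter(tokenize(passage))
    return sum((q & p).values())


def retrieve(query, passages, k=1):
    """Return the k passages that share the most words with the query."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=1):
    """Paste the retrieved passages into the prompt; the model is not modified."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical passages standing in for text extracted from your PDFs.
passages = [
    "The report covers burglary counts for 2019.",
    "Appendix B lists population estimates by region.",
]

prompt = build_prompt("Where are the population estimates?", passages)
print(prompt)
```

The resulting prompt is then sent to the LLM as ordinary input. Since the answer is still generated token by token, the earlier point stands: retrieval narrows what the model sees, but it does not rule out hallucinations.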