• BotCheese@beehaw.org
    1 year ago

    And we’re nowhere near done scaling LLMs

    I think we might be. I remember hearing that OpenAI was training on so much literary data that they couldn’t find enough left over for testing the model, though I may be misremembering.

    • newde@feddit.nl
      1 year ago

      No, that’s definitely the case. However, Microsoft is now working on making LLMs depend more heavily on several high-quality sources. For example, encyclopedias will be more important sources than random Reddit posts.
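
      A rough sketch of what "more important sources" could mean in practice: sample training documents by source with different weights. The source names and the weights here are made up for illustration, not anything Microsoft has published.

```python
import random
from collections import Counter

# Hypothetical source weights (illustrative numbers only).
SOURCE_WEIGHTS = {"encyclopedia": 5.0, "textbook": 3.0, "reddit_post": 1.0}

def sample_sources(n, weights=SOURCE_WEIGHTS, seed=0):
    """Draw n source labels, favoring sources with a higher weight."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[s] for s in names], k=n)

def mixture_counts(n, weights=SOURCE_WEIGHTS, seed=0):
    """Count how often each source shows up in a sample of size n."""
    return Counter(sample_sources(n, weights, seed))
```

      With these weights, `mixture_counts(9000)` draws encyclopedia entries roughly five times as often as Reddit posts, so the low-quality source is still seen, just much less.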

        • Zaktor
          1 year ago

          Cunningham’s Law may be very helpful in this respect.

          “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.”

    • lloram239@feddit.de
      1 year ago

      There are still plenty of videos to watch and games to play. We might be running short on books, but there are many other sources of information that aren’t accessible to LLMs at the moment.

      Also, just because the training set contained most of the books doesn’t mean the model itself was large enough to learn from all of them. The more detailed your questions get, the bigger the chance it will get them wrong, even if that knowledge should have been in the training set. For example, ChatGPT is pretty terrible as a walkthrough for games, even though there should be more than enough walkthroughs in the training set to learn from. The same goes for summarizing movies: it will handle the most popular ones, but quickly falls apart with anything a little lesser known.

      There is, of course, also the possibility that using the LLM itself as a knowledge store is a bad idea. Humans use books for that, not their brains. So an LLM that is very good at looking things up in a library could answer a lot more without the enormous model size and training cost.
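
      To make the "look it up in a library" idea concrete, here is a minimal sketch, assuming a tiny in-memory library and plain word overlap as the search. The passages, `lookup`, and `build_prompt` are all made up for illustration; a real system would use a proper search index and an actual model call.

```python
import re
from collections import Counter

# Hypothetical "library": made-up passages standing in for real books or walkthroughs.
LIBRARY = {
    "walkthrough": "To open the sealed gate, light all three torches in the courtyard.",
    "movie_summary": "The film follows a lighthouse keeper who finds a stranded sailor.",
    "encyclopedia": "A lighthouse is a tower that emits light to guide ships at sea.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lookup(question, library=LIBRARY):
    """Return the (title, passage) pair sharing the most words with the question."""
    q = Counter(tokenize(question))

    def score(item):
        words = Counter(tokenize(item[1]))
        return sum((q & words).values())  # size of the multiset overlap

    return max(library.items(), key=score)

def build_prompt(question, library=LIBRARY):
    """Put the retrieved passage in front of the question for the model to read."""
    title, passage = lookup(question, library)
    return f"Source ({title}): {passage}\nQuestion: {question}\nAnswer using only the source."
```

      The point is that the facts live in `LIBRARY`, not in the model's weights, so adding a lesser-known game or movie means adding a passage rather than retraining anything.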

      Basically, there are still a ton of unexplored areas, even if we have collected all the digital books.