• Zaktor
    1 year ago

    What do you think happens to data when it’s scraped? Copying the data is a fundamental requirement for using it in training. These models are trained in big datacenters where the original work is split up and tokenized and used over and over again.

    The difference between you training a model and you reading a book (put online by its author in clear text, to avoid the obvious issue of actual piracy for human use) is that you reading on a website is the intention of the copyright holder, and you as a person have a fundamental right to remember things and be inspired. You don’t, however, have a right to copy and use the text for other purposes, whether that’s making a t-shirt with a memorable line, printing it out to give to someone else, or tokenizing it to train a computer algorithm.

    • treadful@lemmy.zip
      1 year ago

      What do you think happens to data when it’s scraped? Copying the data is a fundamental requirement for using it in training. These models are trained in big datacenters where the original work is split up and tokenized and used over and over again.

      Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.

      The difference between you training a model and you reading a book (put online by its author in clear text, to avoid the obvious issue of actual piracy for human use) is that you reading on a website is the intention of the copyright holder and you as a person have a fundamental right to remember things and be inspired.

      Copyright holders can’t say what I do with their work, nor what I do with the knowledge of their book. They can only say how I copy and distribute it. I don’t need consent to burn an author’s book, create fan art around it, or quote characters in my blog. I do need their consent to copy and distribute their works directly.

      You don’t, however, have a right to copy and use the text for other purposes, whether that’s making a t-shirt with a memorable line, printing it out to give to someone else, or tokenizing it to train a computer algorithm.

      And at some point the pieces get so small that they are no longer copyrightable. You can’t copyright most short phrases or individual words.

      • Zaktor
        1 year ago

        Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.

        It very much is. You can’t just run a cipher on a copyrighted work and say “it’s not the same, so I didn’t copy it”. Tokenization can be reversed to recover the original text. And “distributing” is separate from violating copyright. It’s not distriburight, it’s copyright. Copying a work without authorization for private use is still violating copyright.
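
To make the reversibility point concrete, here is a minimal sketch of a toy tokenizer that maps text to integer ids and back. The function names (`build_vocab`, `encode`, `decode`) and the splitting rule are assumptions made for illustration, not how any particular model tokenizes. Real tokenizers differ, and some normalize text first, in which case the round trip is not exact.

```python
import re

# Hypothetical splitting rule: runs of word characters, runs of whitespace,
# or single punctuation marks. Every character falls into exactly one of
# these, so joining the pieces reproduces the input exactly.
PIECE = re.compile(r"\w+|\s+|[^\w\s]")

def build_vocab(text):
    """Assign an integer id to each distinct piece, in order of first appearance."""
    vocab = {}
    for piece in PIECE.findall(text):
        vocab.setdefault(piece, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn text into a list of integer token ids."""
    return [vocab[piece] for piece in PIECE.findall(text)]

def decode(ids, vocab):
    """Invert the vocabulary and rebuild the text from token ids."""
    inverse = {idx: piece for piece, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)

text = "It's not distriburight, it's copyright."
vocab = build_vocab(text)
ids = encode(text, vocab)
restored = decode(ids, vocab)
```

In this toy setup, `restored` is identical to `text`: the ids look nothing like the original, but with the vocabulary in hand the original comes straight back.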

        • treadful@lemmy.zip
          1 year ago

          You can’t just run a cipher on a copyrighted work and say “it’s not the same, so I didn’t copy it”.

          Yes I can. I can download a Web page, encrypt it on my machine, and I’m not distributing said work.

          And “distributing” is separate from violating copyright. It’s not distriburight, it’s copyright. Copying a work without authorization for private use is still violating copyright.

          That’s just false.

          • Zaktor
            1 year ago

            You absolutely do not know what you’re talking about. This is just basic copyright law, but there’s a weird internet mythology that if you can access something on the net, you can take it as long as you don’t share it further. The reason the mass-sharers tended to get prosecuted is that they were easier and more valuable targets, not that the people they were sharing it with weren’t also breaking the law.