The New York Times is suing OpenAI and Microsoft for copyright infringement, claiming the two companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with its content as a result.

As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”

The complaint also argues that these AI models “threaten high-quality journalism” by hurting the ability of news outlets to protect and monetize content. “Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the lawsuit states.

The full text of the lawsuit can be found here

  • EvilMonkeySlayer@kbin.social · 10 upvotes · 11 months ago

    > No it doesn’t, the training data isn’t inside the LLM.

    This is factually incorrect. You can extract the data. How do you think the legal cases are being brought?

    For example

    The model has to contain the data in order to produce works.

    Wholesale commercial copyright infringement, where you’re profiting off of others’ work on a large scale, is a whole different ball game.

    They’re training their models on large amounts of pirated content and profiting off it.

    Of course the rights holders are going to say “wait a minute, why are you making money off my content without my permission? And how much of my work did you pirate to use?”

    You cannot hand wave away mass piracy to train their models, and then distribute said models based on an act of mass copyright infringement.

    Do you not understand the basics of the law?

    > it’s idiotic to think that it’s reasonable to demand such a thing.

    Again, the law is the law. If they mass-pirate a bunch of media and the model then contains chunks of it, they are breaking the law.

    I can’t believe this is a hard concept for someone to understand.

    • admiralteal@kbin.social · 4 upvotes · 11 months ago

      Even if the compression is extremely lossy, compression is insufficient to be transformative.

      The whole “the original data isn’t in the model” argument is one only techbro idiots find compelling.

      • SinAdjetivos@beehaw.org · 0 upvotes · edited · 11 months ago

        No, that’s the current legal precedent within the US.

        Kelly v. Arriba Soft

        The court opinion:

        “The Court finds two of the four factors weigh in favor of fair use, and two weigh against it. The first and fourth factors (character of use and lack of market harm) weigh in favor of a fair use finding because of the established importance of search engines and the “transformative” nature of using reduced versions of images to organize and provide access to them. The second and third factors (creative nature of the work and amount or substantiality of copying) weigh against fair use.”

        That “compression is transformative” principle has been pretty solidly enshrined as precedent at this point (e.g. Perfect 10, Inc. v. Amazon.com, Inc.), though with no real guidelines as to what amount is required to be considered transformative.

        The major argument as to whether the sort of LLM training in the parent article still constitutes fair use depends on whether there exists “market harm” or whether the “substantiality of copying” is especially egregious (note that these are the two fronts that the NYT is taking). There is precedent for copying of style not being fair use (Dr. Seuss Enters., L.P. v. Penguin Books USA, Inc.), which I suspect is why NYT is approaching it the way that they are…

        Now, all that being said, my personal opinion is fuck the US legal system and fuck copyright. There is no solution to the core issues surrounding this topic that isn’t inherently contradictory and/or just a corporate power grab. However, the “techbro idiots” are “right” and you’re not, but it’s because they are idiots who are largely detached from any sort of material reality and see no problem with subjecting the rest of us to their insanity.

        • lemonflavoured@kbin.social · 1 upvote · 11 months ago

          > Now, all that being said, my personal opinion is fuck the US legal system and fuck copyright.

          Some form of copyright has to exist, and - as angrily explained to me by authors - it needs to extend somewhat beyond the life of the author. I’m certainly never going to agree with it being indefinite though.

            • lemonflavoured@kbin.social · 1 upvote · 11 months ago

              Authors need to be able to make a living from writing, unless you want far fewer books to be written. And there is some logic in allowing them to leave some sort of rights to their family, even if it’s only for like 10 years.

    • Zima@kbin.social · 2 upvotes · 11 months ago

      > The model has to contain the data in order to produce works.

      As far as I understand, this isn’t true. Can you elaborate on why it needs to contain the data?

      • EvilMonkeySlayer@kbin.social · 1 upvote · 11 months ago

        It contains large parts of the data in order to create. The link I provided shows that the models do contain chunks of the original works.

        Otherwise, how would it create the words etc.?

        I am amazed that we now have people on the level of crypto coin idiocy going on about AI models who don’t understand this.

        • Zima@kbin.social · 1 upvote · 11 months ago

          You would probably claim I don’t deserve my job given my level of technical illiteracy, however you think you are inferring that. Anyway, they do make reasonable efforts to design models that don’t memorize and are able to generalize. This is quite basic, or fundamental, in machine learning in general.

          Previous models had semantic reasoning capacity without memorization, e.g. word2vec.

          You should also realize that just because current models are memorizing despite efforts to prevent it doesn’t mean that models need to memorize. Like I said initially, they are actually designed to work without needing to memorize.
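
          To make “memorizing” concrete, here is a minimal Python sketch that measures the longest run of consecutive words an output shares with a training text. The function name `longest_shared_run` and both sample strings are hypothetical, made up for illustration; this is not how any real extraction tool or study works, only one simple way to tell verbatim reproduction apart from paraphrase.

```python
from difflib import SequenceMatcher


def longest_shared_run(training_text, output_text):
    """Return the longest run of consecutive words found in both texts.

    Hypothetical helper for illustration: compares word lists, not characters.
    """
    train_words = training_text.split()
    out_words = output_text.split()
    matcher = SequenceMatcher(None, train_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(train_words), 0, len(out_words))
    return train_words[match.a:match.a + match.size]


# Made-up sample data (an assumption, not from any real model).
training = "the quick brown fox jumps over the lazy dog"
verbatim_like = "a quick brown fox sleeps"
paraphrase_like = "a fast dark fox"

print(longest_shared_run(training, verbatim_like))
print(longest_shared_run(training, paraphrase_like))
```

          A long shared run hints at verbatim reproduction, while a run of one or two words is what you would expect from any two texts on the same topic. Where to draw the line between the two is a judgment call this sketch does not settle.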

          • EvilMonkeySlayer@kbin.social · 1 upvote · 11 months ago

            You’re contradicting yourself.

            In one sentence you say it doesn’t memorize (with “reasonable effort”) then in the next you admit it does.

            “Reasonable effort” is weasel wording.

            Make up your mind.

            • Zima@kbin.social · 1 upvote · 11 months ago

              ?? Are you trolling? If you design a car to combust gasoline without burning the lubricants, but you still end up burning them, it doesn’t mean that the lubricants are needed for the combustion itself. Conversely, you have not made any nuanced argument explaining why memorization is necessary. I gave you an example where we know there is no memorization and you ignored it.

              “Otherwise how would it create the words” is just saying you wouldn’t know.

              • EvilMonkeySlayer@kbin.social · 1 upvote · 11 months ago

                So, me pointing out the flaw in your argument is trolling?

                What?

                If you choose to use weasel wording to try and get out of something that is your call.

                • Zima@kbin.social · 1 upvote · 11 months ago

                  OK, I believe that you believe that. It’s OK. I have professional experience in this space, so you’re either not reading carefully or you don’t understand much about the topic.

                  Perhaps you might want to reconsider this in more abstract terms. The engine example you ignored could help you with that.

                  Do you really think that the fact that we have language models that don’t memorize and are simple enough that we can know for certain is not all we need to show that language models don’t necessarily have to memorize? You keep repeating the same (illogical) argument and ignore the simpler arguments that disprove your claim.

                  • EvilMonkeySlayer@kbin.social · 1 upvote · 11 months ago

                    So, now it’s gone from “reasonable effort” to most definitely you can say without any doubt that all the trained models contain no copyrighted data at all?

                    Come on. Make up your mind.

    • CJOtheReal@ani.social · 0 upvotes · 11 months ago

      This entire comment screams of 0 technical knowledge.

      The LLM does not contain the training data. It contains nothing but math. It generates an answer by calculation, and in the end you get the answer that is statistically most likely what you want. Otherwise the fucking thing wouldn’t produce fake news and make shit up.

      Sure, if you want it to write you a very specific thing and you know exactly what to ask, you might get a small text that is “copyrighted”, but that’s because you asked for it, not because it’s inside. It just gives you the answer you most likely find helpful, statistically.

      It’s like asking you to read a page very well and then asking you the next day to write down what was on the page, while giving you lots of hints. You didn’t actually copy from it in that case.
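
      As a rough illustration of “statistically most likely”, here is a toy bigram model in Python. It is a sketch under heavy assumptions: real LLMs are neural networks, not word-count tables, and the names `train_bigram` and `generate` and the one-line corpus are invented for this example. The model stores only counts of which word follows which, then repeatedly emits the most frequent follower.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_words=10):
    """Greedily emit the most frequent follower of the last word."""
    out = [start]
    for _ in range(max_words - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this word
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


# Tiny made-up corpus (an assumption for illustration only).
corpus = "to be or not to be that is the question"
model = train_bigram(corpus)
print(generate(model, "that"))
```

      Note that with a corpus this small, the most likely continuation can coincide with a stretch of the training text even though only counts are stored. How far that carries over to large models is exactly what is being argued about in this thread.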

      • admiralteal@kbin.social · 7 upvotes · edited · 11 months ago

        > It’s like asking you to read a page very well and then asking you the next day to write down what was on the page, while giving you lots of hints. You didn’t actually copy from it in that case.

        My guy, if you compellingly re-wrote Harry Potter from memory and charged people for access to your work, you can definitely expect J.K. Rowling to sue you.

      • EvilMonkeySlayer@kbin.social · 3 upvotes · 11 months ago

        > This entire comment screams of 0 technical knowledge

        Yes, your comment does.

        There is literally software to extract this stuff from models now.

        This “it’s just math” is techbro idiocy. It’s like the idiots regurgitating crypto coin bullshit.

      • leaskovski@kbin.social · 1 upvote · 11 months ago

        It’s all black magic to me, so if you have resources on this, that would be great. My initial thought is that it would surely have a data source to refer to? Your last example is someone referring to their memory of something and recreating it. By referring to that memory, isn’t that in essence a reference back to the original data that someone has remembered?