Zuckerberg approved training Llama on LibGen [pdf](storage.courtlistener.com)
155 points by stefan_ 6 months ago | 201 comments
  • mrtksn6 months ago

    We are approaching the "UBI or Guillotine" fork simply because rules and regulations work selectively. Just like with the "If we pay for copyright our business becomes impossible" defense, this is yet another vast unfairness against those who had to transfer their resources to learn a skill. An awful lot of people had a hard life or got into debt for things that big tech is immune from.

    Or maybe we will come to the conclusion that all this works only if there's no such thing as IP, reset the playing field for everyone, and anyone who wants to make money will have to actually work for it every single time. IIRC that's what's happening in China and it's how they surpassed the US in innovation.

    Technically, that's a deregulation - just not the kind of deregulation the big tech is pushing for. Maybe the next time there's a graph showing how regulations made EU lag behind, add the graph of China too to spice things up.

    With so many technical people out of work and promises to make the employed ones obsolete too, it can be a good idea to let people build things instead of unfairly concentrating even more power onto kleptocratic entities.

    • JumpCrisscross6 months ago |parent

      > We are approaching the "UBI or Guillotine" fork

      Even in the 18th century, the French aristocracy mostly cruised through the Revolution from afar, surviving with fortunes largely intact to this day [1]. If the fork is UBI or guillotine, the selfish move by the private-jetting billionaire class—personally and financially more mobile and global than the French aristocracy ever was—is the latter.

      > if there's no such thing as IP, reset the playing field for everyone

      Your thesis is letting Altman, Zuckerberg and Musk have free rein would decrease inequality?

      > IIRC that's what's happening in China

      Not really [2].

      [1] https://www.bbc.com/news/magazine-37655777

      [2] https://www.chinaiplawupdate.com/2023/08/china-prosecutes-11...

      • lupire6 months ago |parent

        Extremely misleading citation.

        > Criminal trademark infringement made up the majority of IP crimes with 10,384 people prosecuted accounting for 88.9% of the total.

        Trademark infringement is of a completely different character from copyright.

        Trademark infringement is pure fraud and lying.

        Take out trademark infringement, and you have only 1 prosecution per year per 700,000 people.

        • JumpCrisscross6 months ago |parent

          > Take out trademark infringement, and you have only 1 prosecution per year per 700,000 people

          What is it in America? Did we even have a single criminal non-trademark IP prosecution in 2024?

      • XorNot6 months ago |parent

        The other way to look at it though is that revolution won't solve your problems, and Americans are far too confident that it will.

        • JumpCrisscross6 months ago |parent

          > other way to look at it though is that revolution won't solve your problems, and Americans are far too confident that it will

          Americans are largely not for a revolution because most of us aren’t idiots. There is idle chatter of a civil war, but that’s again (a) bluster (not that this can’t take on a life of its own) and (b) about consolidating control versus wholesale rebuilding the American class structure.

          • tyleo6 months ago |parent

            FWIW there is a difference between revolution and civil war. I see a decent number of people advocate for the first but basically no one advocate for the second. In either case the numbers aren’t a majority.

            • JumpCrisscross6 months ago |parent

              > see a decent number of people advocate for the first but basically no one advocate for the second

              Anyone advocating for the first (as a popular revolt) thinking it wouldn’t result in the second isn’t thinking realistically.

              • motorest6 months ago |parent

                > Anyone advocating for the first (as a popular revolt) thinking it wouldn’t result in the second isn’t thinking realistically.

                There are plenty of examples in modern Europe where revolutions and regime changes didn't involve a civil war.

                • JumpCrisscross6 months ago |parent

                  > plenty of examples in modern Europe where revolutions and regime changes didn't involve a civil war

                  Where internal power structures were preserved (or where the society was restructured under occupation), yes.

                  • motorest6 months ago |parent

                    > Where internal power structures were preserved (or where the society was restructured under occupation), yes.

                    No. See for example Spain's or Portugal's transition from autocracy to democracy. The latter involved a military coup and exile of its former dictator.

            • lupire6 months ago |parent

              MAGA leadership advocates for the second.

        • tyleo6 months ago |parent

          I’m no advocate for revolution but the American problem is that our revolution actually worked. Americans freed themselves from a prior group of elites unlike the grandparent comment is claiming of the French elites.

          • JumpCrisscross6 months ago |parent

            > Americans freed themselves from a prior group of elites unlike the grandparent comment is claiming of the French elites

            The American Revolution was one of American elites overthrowing their overseers. It worked and was not super disruptive because power (and class) structures were preserved. From the states through to the system of law and the people in power. (We also didn’t do any mass or political executions.)

      • bushbaba6 months ago |parent

        Unlike then, today global mobility is within the means of most of the Western world. A French Revolution today could very well extend globally to identify and repatriate.

        • JumpCrisscross6 months ago |parent

          > French Revolution today could very well extend globally to identify and repatriate

          We have zero historical or contemporary precedent for this, and strong incentives for everyone else in the world to not play along. (As they did in sheltering the French aristocracy.)

          In a hypothetical American revolution, foreign powers would be looking for their slice of the pie. To think through this dispassionately, imagine civil war breaking out in Russia or China. A second American revolution à la the first would put today’s billionaires and political elite in a room to draft a new constitution to their liking.

    • wesapien6 months ago |parent

      Isn't UBI just going to raise inflation? People who don't need it will claim it and use the existing tax loopholes. Tax laws will need to be rewritten.

      • webmaven6 months ago |parent

        The "U" in UBI is for "Universal". There is no means-testing. Everyone gets it regardless of assets or income, which means there is no need to spend any effort on checking whether someone is "poor enough".

        Though the state would have to make sure the person receiving the benefit actually exists, is still alive, etc.

        • wesapien6 months ago |parent

          I understand what UBI means, but its effect is what I think people do not understand. Based on the Cantillon effect, UBI will just accelerate the separation between the rich and the poor.

          • aesh2Xa1 6 months ago |parent

            I'm not sure that the Cantillon effect is majorly at play.

            The very nature of Cantillon is unequally obtained new money, whereas UBI is universal. Any effect it has would be related to the poorest/neediest spenders now purchasing the sort of goods they do (and, realistically, no increase in spending by the richest). You might see increased consumption in neighborhoods/regions with high concentrations of poor, too.

            The better fit for "UBI creates an economic problem" seems to be pricing stickiness. The above commenter focused on controlling general inflation through monetary/fiscal policy (keeping money supply stable, using tax mechanisms), but didn't actually address the concern about producers simply raising prices to absorb the UBI.

      • tricorn6 months ago |parent

        No, you can do a UBI that keeps the money supply the same, and use it as a way to stabilize the economy. With a $2000/mo UBI, 50% flat tax on other income, 25% VAT, phase it in by doing 10% of that the first year (and 90% of your current taxes, 90% of current support payments), second year 20% and 80%, so the impact isn't too disruptive. Adjust the flat tax rate as the Federal budget changes (a spending bill is automatically a tax bill as well). Adjust the VAT to control inflation.
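The phase-in described above can be sketched as a simple schedule. This is only an illustration: the comment gives the first two years (10%/90%, then 20%/80%), and the sketch assumes the same 10-point step continues until year ten. The function name `phase_in_schedule` and its parameters are hypothetical.

```python
def phase_in_schedule(ubi_monthly=2000, flat_tax_pct=50, vat_pct=25, years=10):
    """Return one row per year, scaling the new system up linearly.

    Assumption: the 10%-per-year step stated for years 1 and 2 continues
    unchanged until the new system is fully in effect.
    """
    schedule = []
    for year in range(1, years + 1):
        new_pct = 100 * year / years  # share of the new system in effect
        schedule.append({
            "year": year,
            "ubi_monthly": ubi_monthly * new_pct / 100,
            "flat_tax_pct": flat_tax_pct * new_pct / 100,
            "vat_pct": vat_pct * new_pct / 100,
            # share of current taxes and current support payments still applied
            "legacy_pct": 100 - new_pct,
        })
    return schedule


# Show the two years the comment spells out explicitly.
for row in phase_in_schedule()[:2]:
    print(row)
```

Under these assumptions the first year pays a $200/mo UBI with a 5% flat tax and 2.5% VAT alongside 90% of the existing system, and the second year doubles the new-system figures while the legacy share drops to 80%.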

        • nradov6 months ago |parent

          You've got to be kidding. As a regular middle class citizen my taxes are high enough already. There's no way I'll vote for UBI so that some slackers can sit around getting high and playing Xbox.

          • theasisa6 months ago |parent

            Based on your comments you are in the US. Your taxes are very low among Western countries.

            That slacker is already getting high and playing on Xbox. With UBI they will have fewer worries about staying alive and the opportunity to try things to get more money. UBI is a great incentive for people to try new things without there being a financial risk of you losing your income. Just check the trials and their results - people are more productive and happy in general.

          • OKRainbowKid6 months ago |parent

            And instead, you vote for billionaires chilling on their yachts, paid for by your labor?

      • motorest6 months ago |parent

        > Isn't UBI just going to raise inflation?

        Even assuming this scaremongering scenario, the world would be in a far better place if society assured everyone would be guaranteed a certain income.

        Also, the scenario that supports the hypothesis of higher inflation is that more people in society are suddenly able to afford goods and services that were out of their reach without UBI. Can anyone actually put to words why that is undesirable?

        • aesh2Xa1 6 months ago |parent

          I think one criticism is that prices would change to capture the UBI. I think I read the idea in "Progress and Poverty," although I've certainly seen it elsewhere since:

          - If everyone suddenly has more money (say $2 more per day)

          - And milk is a basic necessity

          - The milk seller knows everyone needs milk and now has $2 more to spend

          - They can gradually raise the price of milk by close to $2

          - Consumers must still buy milk at the higher price

          - The intended benefit of the extra $2 is effectively captured by the milk seller

          The increases in general purchasing power can be absorbed by suppliers of essential goods. If you have just excess discretionary income in the general case, then non-essential goods can bump in price, too.

          • s__s6 months ago |parent

            The milk seller doesn’t even need to consciously increase prices to match the raise in household income. It will happen organically.

            For the sake of argument, imagine UBI provides everyone with a million dollars a year. That doesn’t make everyone a millionaire. It just makes everyone’s money less valuable.

            It’s no different on a smaller scale.

      • weatherlite6 months ago |parent

        It's gonna be complex and messy. On the one hand yes, many people receiving UBI = inflation. On the other hand many highly paid software devs (and soon after - accountants, lawyers, marketers, sales people, etc.) are losing their incomes = very deflationary.

        It's gonna be interesting that's for sure.

        • wesapien6 months ago |parent

          UBI has less friction as far as implementation since we don't need to qualify anyone. With AI, we can afford to have that extra step (nuance) and be able to make sure it's a needs-based approach. The future requires various combinations of changes. Fix the tax system and then UBI (in this specific order) OR !UBI (needs-based distribution).

          • tricorn6 months ago |parent

            Implement UBI as part of fixing taxes. A UBI combined with a flat tax plus a national sales tax, and including universal healthcare, can continue to be a progressive tax while eliminating a lot of the overhead of keeping track of it all. Look at the effective tax rates with a 50% flat tax, 25% sales tax, and $2000 per month UBI with UHC.
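To make "look at the effective tax rates" concrete, here is a minimal sketch using the figures from the comment: a 50% flat tax on other income and a $2000/month ($24,000/year) UBI. It deliberately ignores the 25% sales tax and universal healthcare, since their effect depends on spending patterns the comment does not specify. The function name `effective_rate` is hypothetical.

```python
def effective_rate(income, flat_tax=0.50, ubi_annual=24_000):
    """Net income tax (flat tax paid minus UBI received) as a fraction of other income.

    Assumption: sales tax/VAT and healthcare are left out of the calculation.
    """
    if income <= 0:
        raise ValueError("income must be positive")
    return (flat_tax * income - ubi_annual) / income


# Print the net rate at a few illustrative income levels.
for income in (24_000, 48_000, 96_000, 240_000):
    print(f"${income:>7,}: {effective_rate(income):+.0%}")
```

With these numbers the break-even point is $48,000 of other income, where the flat tax paid equals the UBI received. Below that the net rate is negative, and above it the rate rises toward 50% without reaching it, which is the sense in which a flat tax plus UBI remains progressive.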

      • PaulRobinson6 months ago |parent

        If it's truly universal, no. Several experiments (controlled and natural), have shown this.

        • wesapien6 months ago |parent

          Have there been experiments/testing at city/state scale? UBI is country scale and it's way more complex than testing it on a small town of people who I assume are selected for their needs.

      • tharmas6 months ago |parent

        Indeed it would as the landlords would just raise rents accordingly.

        We saw a bit of that with Covid cheques.

        • PaulRobinson6 months ago |parent

          I think you'll find rising rents are more correlated with rising interest rates than Covid cheques, but given one of the key grievances perceived by UBI advocates is class inequality and lack of social mobility, if UBI became politically possible then so would rent controls and controls on prices of key essential commodities while waiting for it to "settle in".

          • tharmas6 months ago |parent

            Good point about the interest rates. However, in the UK landlords adjusted their rents accordingly when the Govt introduced Housing Benefits (years before interest rates began to rise). A lot of govt MPs are landlords.

            I'm not against the idea of UBI, I just see the landlords eating it up like they do with people's wages.

    • ColdTakes6 months ago |parent

      There isn't going to be a revolution. Americans are all talk no action.

    • Workaccount2 6 months ago |parent

      The legal problem is in outputting IP; I have yet to see a convincing argument that training on copyrighted data is a breach of IP laws.

      The trained models are trillionths the size of their training sets. There is no archive of copied data in them.

      • agilob6 months ago |parent

        >argument that training on copyrighted data is a breach of IP laws.

        You pay for access to materials, not using or remembering the material in its original format.

        • 93po6 months ago |parent

          Nearly every website does not charge me anything to retrieve information that is their intellectual property.

          • agilob6 months ago |parent

            Is the comment above about every website or libgen?

      • swatcoder6 months ago |parent

        Training on copyrighted works licensed for such use is inarguably conforming.

        Acquiring and using works without such license is just piracy. Whatever your stand on piracy is, most individuals and businesses are not free to incorporate it into their projects. Normal people have faced significant penalties for piracy, and conscientious business operators avoid it.

        Sure would be disappointing to all those people if there were suddenly a ruling that said "well, but it's okay that these guys did it because they're filthy rich and went real hard with it"

        • Workaccount2 6 months ago |parent

          Again, models are not archives of data.

          Llama 3.1 70B is around 45GB in size, despite being trained on likely hundreds of petabytes of data. And before you say it, they are not fancy compression algos either, the loss is so high they would be useless.
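As a rough check of the size comparison, the ratio can be computed from the figures in the comment itself. This is a sketch, not a measurement: 45 GB is the model size quoted above, and 100 PB is an assumed lower bound for "hundreds of petabytes". The function name `size_ratio` is hypothetical.

```python
GB = 10**9   # bytes per gigabyte (decimal)
PB = 10**15  # bytes per petabyte (decimal)


def size_ratio(model_gb, training_pb):
    """Model size divided by training-set size, both converted to bytes."""
    return (model_gb * GB) / (training_pb * PB)


# Figures taken at face value from the comment; 100 PB is an assumed lower bound.
ratio = size_ratio(45, 100)
print(f"{ratio:.1e}")
```

Taking those figures at face value, the model is on the order of a few ten-millionths of the training set, roughly one part in two million. That is a very large reduction, though under these numbers it is closer to millionths than trillionths.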

          • rmgk6 months ago |parent

            Your argument is essentially: “I have downloaded and watched this movie, but because I cannot recreate the images, there was no copyright infringement involved”.

            • 93po6 months ago |parent

              I would say it's more, I checked out a book from the library, read it, and learned some things about writing style and storytelling that I'm now going to apply to my own original works.

              • swatcoder6 months ago |parent

                Libraries obey copyright, loaning out books for which they've acquired some right to lend to members. When I borrow a library book and read it that way, everything that happens is respecting the rights of the copyright's owner.

                That has nothing to do with how LLMs were trained. They were trained on countless works for which Meta, etc. had acquired no legitimate right for use at all.

                • 93po6 months ago |parent

                  I don't know of a law that says you have to purchase a book to be legally allowed to read it.

                  • triceratops6 months ago |parent

                    The legal owner of the book has to allow you to read it. And the legal owner can't make additional copies to allow you to read it.

                    • 93po6 months ago |parent

                      If I find a book on a park bench and read it, am I breaking the law in terms of intellectual property?

                      • triceratops6 months ago |parent

                        If they're training LLMs on books found on park benches, we don't have a problem. That's obviously not what we're talking about though.

                        • 93po6 months ago |parent

                          My point is that "the legal owner of the book has to allow you to read it" is not true.

                          I will accept the argument they got the source material in a way where someone broke American law. I really do not think they've broken any laws whatsoever in terms of using it for LLM training

                          • triceratops6 months ago |parent

                            > they got the source material in a way where someone broke American law

                            Isn't inducing or offering someone incentives to break laws illegal by itself? I'll admit that isn't specifically an IP law violation, but it can't possibly be kosher.

                            For example if a buyer of goods can reasonably be expected to know the goods were stolen, they can also be charged. Isn't this the same thing?

              • Funes-6 months ago |parent

                I would go a step further, even, and say it's akin to borrowing a book and formally registering every little detail about it but the actual text itself, with extreme breadth and precision: grammar, style, lexicon (potential morpheme combinations, basically), wider discourse structure, use of special characters and formatting, etc., and then discarding the book.

              • JHonaker6 months ago |parent

                Yes but your library still legally obtained those copies in the first place.

                • adrian_b6 months ago |parent

                  Most, if not all, pirated books are copies of books that had been legally obtained, so this is not how they are distinguished from books borrowed from a library. The only thing that makes them pirated is that the price paid for the original book is considered to not have covered the right of also distributing copies of the book.

                  Nowadays the surviving public libraries might pay special prices for the right of lending books, but that was not true in the past, when they just bought the books from the market like anyone else, at the same price.

                  I am pretty sure that the public libraries that I frequented as a child, many decades ago, did not pay anything for a book above the price that I would have paid myself, but nonetheless at that time nobody would have thought that they do not have the right to lend the books to whomever they pleased.

                  • rmgk6 months ago |parent

                    The point in the article is that Meta used LibGen to train, not legally obtained books from their local library. The problem is that if you and I made use of LibGen and some of the “right holders” (more likely some IP specialized law firms) realized that, we would be prosecuted.

                    Giving Meta exclusive access to those copies is the problem (which is effectively what we are doing if they are not prosecuted, or, alternatively, if we accepted that LibGen is fair use for everyone).

          • dns_snek6 months ago |parent

            What our society has to decide is whether these use cases are beneficial or detrimental to society at large, and adjust IP laws accordingly.

            Whether LLMs are archives of data, a compression method, or whatever else is just an unimportant technical implementation detail.

          • swatcoder6 months ago |parent

            Was this replied to the wrong comment? I'm not sure what it has to do with what I wrote.

            But here's another way to think about what I'm saying, in case you missed it:

            Personally, I'd love to download a complete archive of JSTOR. I'd train myself, and maybe I could even use it as input into some product I mean to launch soon. JSTOR doesn't offer a license for that, at least not to me, but I'm sure I can scrape their site or find an archive elsewhere and make it happen anyway.

            Do you think I should do that? What do you think might happen if I tried?

      • JeremyNT6 months ago |parent

        How can it possibly be the case that it's ok for Meta to download and ingest the entire contents of LibGen but it is not ok for an individual human to selectively download a single work and read it?

        Whatever legal contortions used to justify this are, quite frankly, bullshit. This isn't how anything should work even if these companies can buy themselves a regulatory regime where it does.

    • bdndndndbve6 months ago |parent

      The idea that abolishing IP protections and letting AI companies run rampant is an offramp for wealth inequality is such a wild take to me?

      Realistically billionaires are using racist and homophobic populism as a way to direct working class energy away from wealth inequality. Making people think "woke" is the reason why the earth is on fire and they can't have health insurance.

    • netfl0 6 months ago |parent

      Ah yes, because the working class is primarily concerned with protecting their intellectual property…

      • impomura6 months ago |parent

        the working class is paywalled out of education because of IP laws that can seemingly be ignored by the AI companies

      • bdndndndbve6 months ago |parent

        I think OP is coming from the "temporarily embarrassed billionaire" perspective where if only we had a libertarian hellscape without pesky laws they would be a funeral baron who runs Bartertown.

    • casey2 6 months ago |parent

      How can you get the definition of fairness so backwards? Giant corporations provide literally everything you take for granted and they should be punished because you are envious? I don't get it.

      There is a reason everyone with over 130 IQ wants to work for them rather than starting their own companies.

      • Lucasoato6 months ago |parent

        They shouldn’t be punished because people are envious, they should be punished because they’re not respecting other people's intellectual property without an agreement in place.

        We can’t protect IPs only when that benefits big corps. We should protect them always or accept that the world is better if we go in another direction, changing the rules for everybody.

        • visarga6 months ago |parent

          Training on copyrighted data should be legally allowed

          - of course exact reproduction of protected content is a no-no

          - but learning is ok, as long as it is transformative. User prompts and responses are pushing the model outside its training distribution anyway - users add their own intent, making usage transformative

          - when LLMs synthesize from multiple sources, the result is transformative

          - if you try to protect expression it is meaningless now, but if you protect abstract ideas it kneecaps creativity

          - the problems of copyright started with the advent of the internet, not with AI

          - revenues from royalty are almost zero today, as each new content competes against an unbounded list of other works that have been accumulating for decades online

          - because royalties are shit, creatives now focus on ads, and this leads to enshittification and attention-grabbing junk everywhere; attention is scarce, content is post-scarcity

          - we actually like interactive participation more than passive consumption; we now edit Wikipedia, contribute to open source, have papers published for free on arXiv, use social networks where our comments are shared with the world, play games instead of reading books - it is another age, the interactive age

          - AI is actually more than an infringement tool, it is useful for many legit purposes

          - and AI is the worst possible infringement tool, it can hallucinate details and get things wrong; by comparison, copying is free and easy and precise to the letter

          So the idea that training is infringement is pretty abusive, it tries to make copyright be about abstractions, which is wrong. We can't return to the 1990s, so we have to live with its demise. It's been dying for 3 decades already.

          • triceratops6 months ago |parent

            LLMs are allowed to "learn" from all this content because humans are allowed to. Most humans have to access the content legally to learn. But for training LLMs it's basically "Copyright lol, yolo".

            Is there a reason a human can't torrent movies and say "But I'm just learning from them"?

          • xena6 months ago |parent

            How do writers eat when the market value of their writing is zero?

            • visarga6 months ago |parent

              It's been reduced to zero for 3 decades. When you publish your work, there are a million other works competing for attention. That is the real issue. When you search for an image, you get thousands of images instantly, faster than diffusion models. Content doesn't matter anymore, attention matters, curation matters too.

              Even if you forbid AI from training on copyrighted works, people are going to comment about them online, and the model will pick up the ideas. There is no way to protect ideas from spreading and reaching AIs.

              • miohtama6 months ago |parent

                Also AI models trained by Chinese are not going to stop using copyrighted material.

            • Workaccount2 6 months ago |parent

              How do chimney sweeps eat when everyone has a gas or electric furnace?

      • saagarjha6 months ago |parent

        People who are smart typically have better things to do than talk about their IQs. Or sell ads, for that matter.

      • bdndndndbve6 months ago |parent

        How can you get the definition of fairness so backwards? The King provides literally everything you take for granted and he should be punished because you are envious? I don't get it.

        There's a reason why every vassal with a sizeable estate wants to be in the King's court rather than starting their own country.

  • boramalper6 months ago

    Alluded to multiple times in the comments already but worth being explicit: Aaron Swartz killed himself 12 years ago yesterday while facing "a cumulative maximum penalty of $1 million in fines, 35 years in prison" [0] after downloading academic journal articles, which would be only a small percentage of what's available on LibGen.

    Free for me, not for thee.

    [0] https://en.wikipedia.org/wiki/Aaron_Swartz

    • JumpCrisscross6 months ago |parent

      > Free for me, not for thee

      Swartz was charged with 35 to 50 years, realistically faced up to 10, and was offered 6 months if he pleaded guilty [1]. That offer moreover wasn’t the final offer.

      Put another way, it’s not clear that the law is being applied to Zuckerberg differently than it was to Swartz given the law wasn’t actually ever applied to Swartz. (Or that they wouldn’t gladly trade this lawbreaking for $1mm in fines and a negotiation over penalties where the prosecution opens with 6 months jail.)

      The prosecutor acted inappropriately in that case; MIT, more wildly so. That doesn’t, however, carry over to a transgression of the law given we never got to that stage.

      [1] https://www.forbes.com/sites/forbesdev/2023/02/28/increase-w...?

      • inetknght6 months ago |parent

        > it’s not clear that the law is being applied to Zuckerberg differently than it was to Swartz given the law wasn’t actually ever applied to Swartz

        Has Zuckerberg actually been charged with something with equivalent potential consequences?

        If not, then your statement is false on its face.

        • JumpCrisscross6 months ago |parent

          > Has Zuckerberg actually been charged with something with equivalent potential consequences?

          I didn’t say Zuckerberg has been subjected to what Swartz was. Swartz never wielded the nation-state level power of a billionaire—it’s difficult to imagine how he could be subjected to similar psychological stress.

          I said the law isn’t being applied to Zuckerberg (or anyone who has downloaded LibGen, for that matter) differently because the law was never applied to Swartz. Given the unpopular Swartz prosecution ended Ortiz’s career, and the lack of recent criminal copyright cases, it’s unlikely anyone would attempt to apply it as they did then. To anyone, including Zuckerberg.

          TL;DR: If you dislike what Zuckerberg is doing, you’re probably advocating for a clarification of the law. If you like it, erm, nothing much to do here.

          • inetknght6 months ago |parent

            > the law was never applied to Swartz

            Merely being charged with or investigated for a crime is absolutely an application of the law.

  • bagels6 months ago

    LibGen is the most generic name ever, had to look it up. Turns out that LibGen is a collection of pirated books.

    https://en.m.wikipedia.org/wiki/Library_Genesis

    • perihelions6 months ago |parent

      Shadow libraries are a heavily-discussed, recurring topic on HN,

      https://hn.algolia.com/?query=libgen&type=all ("LibGen")

      https://hn.algolia.com/?query=anna's%20archive&type=all ("Anna's Archive")

      https://hn.algolia.com/?query=z%20library&type=all ("Z-Library")

      https://hn.algolia.com/?query=scihub&type=all ("SciHub")

    • A_D_E_P_T6 months ago |parent

      It's not just a collection, it's the collection. It contains almost every scientific book ever printed, for one thing.

      Frankly, it's a massive boon to researchers. It's like a top-tier research university library at your fingertips, and usually more convenient than the real thing.

      • reddalo6 months ago |parent

        Also free. That helps.

        But the sad state of affairs is that if Aaron Swartz does it, he ends up dead; if Meta does it, everything is fine.

        • A_D_E_P_T6 months ago |parent

          A lot of people would gladly pay. I'm a paying subscriber to Anna's Archive, which vastly improves the experience of that site. (It's borderline unusable without a subscription.)

          Thing is, the Elsevier/Springer model makes it incredibly difficult to pay them. With single papers or book chapters in the $30-40 range, an afternoon's research can easily cost $600. (Note that the authors and reviewers don't get royalties on this, and the Editor-in-Chief of any given journal usually only makes a small stipend!)

          There are services like DeepDyve, but they're intentionally gimped and difficult to use, because their user interface is 100% built around preventing you from downloading or screenshotting the papers you "rent"!

          If the publishers set up a $100/month all-open-access program, and if the experience were at least halfway decent, I'd bet that a lot of people would sign up. And that's not cheap!

      • mistercheph6 months ago |parent

        Funny that the world where almost all human knowledge and art is free and accessible for everyone exists in parallel to one where articles like "Which McDonald's meal are you?" are paywalled, and funny which world civilized nations have chosen in order to protect The Suite Life of Zack & Cody and all the artists whose livelihoods depend on reruns of iCarly.

    • ppp9996 months ago |parent

      A lot less generic than X

  • resiros6 months ago

    I would argue that it's the right call: 1) it's in the world's best interest. I am running llama locally on my laptop, and the ability to have the distilled world's knowledge at your fingertips will generate much much more value than what it takes. 2) it does not 'take' any value from the book creators. No one's going to 'not buy a book' because an LLM has been trained with its content (in contrast you might argue that you are likely to not buy a book because you downloaded it from libgen).

    Copyright laws are not millennia-old ethical laws that everyone agrees on (like don't steal), they are a modern human construct that were created for the greater good (incentivize creation), and we should revisit them with new tech.

    • lnkl6 months ago |parent

      "1) it's in the world's best interest."

      How is pleasing Meta's shareholders in the world's best interest?

      • TiredOfLife6 months ago |parent

        How is using llama for free locally pleasing shareholders?

      • jbentley16 months ago |parent

        Things can both please shareholders and create value for users of that thing.

    • edoceo6 months ago |parent

      > incentivize creation

      Humans do that naturally (see: children)

      The copyright laws are to protect profit.

    • ulbu6 months ago |parent

      wat? facebook is going to 'not buy a book' for each book it's gone through. world's best interest that one of the wealthiest companies in the world doesn't pay its dues? world's best interest? when we know nothing about the societal and political effects llms will have in the hands of such people?

      what are you rationalising about?

  • 1vuio0pswjnm76 months ago

    PDF: https://ia902305.us.archive.org/34/items/gov.uscourts.cand.4...

    Text: https://www.courtlistener.com/docket/67569326/373/kadrey-v-m...

    "Meta's request is preposterous. With one possible exception, there is not a single thing in those briefs that should be sealed."

    "It is clear that Meta's sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage. Rather, it is designed to avoid negative publicity."

    "If Meta again submits an unreasonably broad sealing request, all materials will simply be unsealed."

    "One final comment. Between this sealing request and assertions in Meta's opposition brief such as "[t]hat document expressly discusses torrents and seeding", Opp. at 7, the Court is becoming concerned that Meta and its counsel are starting to travel down a familiar road. See In re Facebook, Inc. Consumer Privacy User Profile Litigation, 655 F. Supp. 3d 899 (N.D. Cal. 2023)."

  • consumer4516 months ago

    It is very difficult for me to believe that Meta's recent political relations moves are not related to the open cases where Meta is the defendant.

    • qwertox6 months ago |parent

      I don't understand your comment. This is about a lawsuit which shows that Zuckerberg OK'd the downloading and use of LibGen data. The case has existed since at least mid-2023 and was in the discovery phase until 13 Dec 2024. Shortly before the deadline Meta provided this new information, because they had to.

      • credit_guy6 months ago |parent

        I guess the parent is saying that the new administration could be more business friendly in prosecuting this type of case. It might even drop this case altogether. But only if Meta is "friendly" to the administration too.

      • tux36 months ago |parent

        They're saying that Meta has been kowtowing to the incoming administration in hopes of getting in their good graces.

        Rather famously, some elements of that administration are above the criminal code, so that's not implausible.

      • lupire6 months ago |parent

        PP is referring to Facebook/Meta's new policy changes like banning intelligence/sanity-based insults on the Platform, but carving out an exception specifically and explicitly for transgender people as targets, and removing tampons from men's bathrooms.

      • monsieurbanana6 months ago |parent

        He's saying that he wants to pay Trump to win these lawsuits, which is a smart move as we know justice is for sale.

      • bamboozled6 months ago |parent

        The pivot would potentially help his cause though, would it not?

    • aprilthird20216 months ago |parent

      The antitrust one is most relevant as the new party in power would be gleeful to see it broken up but otherwise disagrees with the concept of antitrust

      • frob6 months ago |parent

        "For my friends, everything; for my enemies, the law."

    • hatenberg6 months ago |parent

      Nah it’s just that Zuck watched the Barbie movie and realized Space Karen was getting entirely too much limelight and declared a Year of Masculinity

      • lupire6 months ago |parent

        Calling someone Karen is a misogynist slur, and calling a man by a woman's name without consent is doubly a misogynist slur.

        • hatenberg6 months ago |parent

          Ok Karen

  • elashri6 months ago

    There are three positions around the usage of shadow libraries.

    1- Should we, as a society, develop this argument into a broader discussion about knowledge publication, the publishing industry's greed, and its rent-seeking business model?

    2- Big corporations shouldn't just ignore copyright law while maintaining the strongest copyright protections and going after small folks.

    3- The usual argument about how LLM training is different from people actually using pirated textbooks because they are expensive (college and learning are hard and expensive, especially in places like Africa).

    These are different angles and I think we can try to address all of them as they are not exclusive. There are good arguments around point 3 on two sides. I don't think there is a good argument why we should allow the status quo regarding the first point though. As for point 2, it is more complicated to even discuss, especially on HN.

    • miohtama6 months ago |parent

      We can rewrite copyright laws.

  • Havoc6 months ago

    I guess the zuck would download a car...

    Will be interesting to see where this lands, because all outcomes seem to have significant secondary effects.

    • mnky9800n6 months ago |parent

      I would download a car

      • blooalien6 months ago |parent

        https://cults3d.com/en/collections/best-stl-files-cars-3d-pr...

        You're welcome! :P

  • lxgr6 months ago

    In hindsight (considering how LLMs are trained etc.) it makes total sense, but "Big Tech vs. Big Copyright" is something I didn't have on my 2020s bingo card.

    I wonder who will come out on top, and whether there will be any incidental improvements for consumers, but unfortunately I can imagine an "AI training exemption" all too well.

    • kccqzy6 months ago |parent

      That's not surprising to me at all. Even in the 2000s there was a famous lawsuit about Google Books scanning books without approval and the proposed settlement was essentially allowing Google to sell scanned ebooks while giving copyright holders a cut[0]. At that time Google truly felt like a don't-be-evil corporation, and lawyers for the copyright holders wanted to give Google all this data as long as Google paid the copyright holders. In the 2020s however I cannot imagine any Big Tech company having that don't-be-evil spirit and I also cannot imagine them voluntarily paying anything to copyright holders.

      [0]: https://www.newyorker.com/business/currency/what-ever-happen...

    • nonrandomstring6 months ago |parent

      > "Big Tech vs. Big Copyright"

      Indeed. But when do those intersect or diverge?

      I don't blame him. What would you do? If I had a near perfect data training set of all the most useful books and a hungry AI to train, it would be the logical step.

      The reason this is news is because of the stinking hypocrisy of it all. It's really the same topic as the Swartz-Altman discussion here [0], in that these giant companies want to have it both ways.

      Where is Zuckerberg's shout-out for Alexandra Elbakyan? [1] Or for Brewster Kahle? Or any of the vast army of people who preserve and curate the vital culture of humanity by protecting it from intellectual property dungeons?

      The colossal hypocrisy is that a company like Meta wishes to live under the protective umbrella of "Intellectual Property". It wants to stop me just stealing its stuff and setting up a better Facebook.

      Were it exposed to the same rules it wishes to live by, it would be torn apart by vibrant and deserving competition within days.

      All that Zuckerberg, Meta or OpenAI are doing is setting the ground for the abolition of intellectual property. They are literally the proverbial people who will buy the rope with which to hang themselves.

      (Edit. that doesn't make sense insert <proverb about buying ropes that actually makes sense>)

      [0] https://news.ycombinator.com/item?id=42671427

      [1] https://en.wikipedia.org/wiki/Alexandra_Elbakyan

    • criley26 months ago |parent

      I don't view Big Tech as being against copyright. They simply hold a position that they will not pay for something unless forced to ("make me" - a very common position for the powerful to hold).

      In fact, I'd argue that Big Tech is pro copyright, because once they force the copyright holder to negotiate, the cost is irrelevant to them and they build a moat around that access.

      For example, Google stole Reddit content for Gemini until Reddit was forced to the table, and now Google has a seemingly exclusive agreement around Reddit data for AI purposes.

      • jsheard6 months ago |parent

        > I don't view Big Tech as being against copyright. They simply hold a position that they will not pay for something unless forced to

        Yep, the contradiction between them feeling entitled to use anything they want for training, while simultaneously having license terms which forbid using the output of their models to train other models is pretty glaring. Information wants to flow freely but only in one direction apparently.

        • lxgr6 months ago |parent

          > having licenses which forbid using the output of their models to train other models

          I haven't been following it closely, but aren't there already court rulings saying that generative AI output by itself is not copyrightable?

          • jsheard6 months ago |parent

            There's not much caselaw as to whether those terms are actually enforceable yet, but it at least indicates what they want to happen.

      • swatcoder6 months ago |parent

        Yup. For Big Tech, the ideal outcome of these cases isn't that copyright is widely or deeply undermined as they rely heavily on it themselves (let alone how their customers and investors benefit from it).

        Their ideal outcome is that there's some narrow carveout that gives them permission to ignore copyright where they want to, while extending similar permission to as few/irrelevant others as possible.

      • n144q6 months ago |parent

        > I'd argue that Big Tech is pro copyright

        I agree but for a different reason -- cost is actually relevant, in the sense that only the biggest player can afford to pay for the copyrights. If you are a small player, however good your tech stack or your model is, if you can't afford it, you can't compete with Google.

      • 52-6F-626 months ago |parent

        In the past we called that tyranny, when a power thought it could act entirely without restraint.

        Now I guess it’s defended as good business and good science by so many flunkies.

        Knife's edge stuff. Tech people should all be reading the books, not Mark’s steamroller.

        There goes the gravy train

    • dialup_sounds6 months ago |parent

      Odds are that licensing gets streamlined into something like compulsory mechanical licensing and rates get negotiated into something that Big Tech and Big Media can both live with.

      The whole conflict boils down to one party having piles of money and another party having something they want. That's not an intractable problem.

    • CuriouslyC6 months ago |parent

      Big tech will win, because what they're doing is already basically legal, and they're worming their way up the new administration's ass.

    • visarga6 months ago |parent

      Maybe training on copyrighted data should be allowed if the size of the training set is huge, as each individual example is just a drop in the ocean compared to the full training set.

      If you train a 20B-parameter model on 20T tokens, even with 1000 tokens per example, the model extracts about 1 byte of information per example. What is the value of 1 byte of copyright infringement?
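
A back-of-envelope sketch of the arithmetic above. It assumes one byte of storage per parameter (8-bit weights), which the comment does not state; 16-bit weights would double the result. It also treats total model size as a crude upper bound on what could be retained per example, not a measurement of what is actually extracted. The function and variable names are illustrative only.

```python
def bytes_per_example(n_params: int, n_tokens: int, tokens_per_example: int,
                      bytes_per_param: float = 1.0) -> float:
    """Model storage divided evenly across all training examples."""
    n_examples = n_tokens / tokens_per_example   # 20T / 1000 = 20B examples
    model_bytes = n_params * bytes_per_param     # assumption: 1 byte per parameter
    return model_bytes / n_examples


# Figures taken from the comment: 20B parameters, 20T tokens, 1000 tokens per example.
estimate = bytes_per_example(20 * 10**9, 20 * 10**12, 1000)
print(f"{estimate:.1f} bytes per example")
```

Under these assumptions the ratio comes out to one byte per example; changing `bytes_per_param` or `tokens_per_example` scales it linearly.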

      • lxgr6 months ago |parent

        By the same logic, pirating movies should be allowed as long as the person doing it watches enough of them for each individual one to be almost meaningless…?

        • webmaven6 months ago |parent

          If by "pirating" you mean distributing copies, probably not. But if you mean downloading copies, probably yes. Consider the case of the film student studying the entire oeuvres of multiple directors.

        • visarga6 months ago |parent

          Yes, if they watch a billion movies, it should be free to watch any copyrighted one.

    • TiredOfLife6 months ago |parent

      The hilarious thing is that the same people that freely pirate music, videos, books and articles are on the side of huge copyright hoarders like Disney

      • jazzyjackson6 months ago |parent

        I just wish the big corps would change the law to allow everyone to pirate freely, but instead they’re arguing for a carve out specially for training language models.

  • Funes-6 months ago

    Yes. Every "AI" company is training their software on everything, regardless of what they claim, and making millions, billions of dollars on it.

    • consumer4516 months ago |parent

      YouTube was mostly a library of pirated content when Google bought it for $1.6B.

      Spotify began by uploading an employee's pirated MP3s, and is now valued at $92B.

      There are plenty of other examples. One of the ways to success is to ignore silly legal matters, build a product people want, and worry about the legality later. It's not just AI companies, the pattern is well established.

      • disqard6 months ago |parent

        Eric Schmidt said so, and it caused a giant uproar.

        To me, it's just "more of the same", but apparently he said the quiet part out loud, which was somehow verboten.

        (Edit to add: I'm not saying "I think this is okay", but rather "this is Standard Operating Procedure for startups" -- even Reddit was seeded with fake accounts and content, to give the appearance of an active online community. This sort of hustle is a core part of SV culture, and I don't think this is going to change in a hurry.)

        Excerpt:

        ...in the example that I gave of the TikTok competitor, and by the way, I was not arguing that you should illegally steal everybody's music. What you would do if you're a Silicon Valley entrepreneur, which hopefully all of you will be, is if it took off, then you'd hire a whole bunch of lawyers to go clean the mess up, right? But if nobody uses your product, it doesn't matter that you stole all the content.

        • consumer4516 months ago |parent

          Thank you so much for this reference. It is the truth, and I have bookmarked it.

          https://finance.yahoo.com/news/ex-google-ceo-schmidt-advised...

      • Funes-6 months ago |parent

        >was

        >began

        You're very obviously missing a key point here. It's rather simple: pirating is integral to "AI", as it is of the utmost importance with regards to its optimization and even to building its basic functionalities. It will never cease to happen nor is it part of some "preliminary" process in which executives "ignore silly legal matters" in order to kick-start their projects only to discard those practices once they eventually take off. Comparisons to YouTube, Spotify, etc., are invalid for this very reason.

        • consumer4516 months ago |parent

          I should have been very clear that "silly legal matters" was meant tongue in cheek. I do not think that this is cool at all.

          You raise a good point. However, both Spotify and YouTube benefited from network effects and being the biggest gorilla in the room. Can you remove the initial illegality from their later success, since the latter depended on the former?

          What seems inevitable is that some deal is made with major rights holders, the little guy gets screwed, as has happened before.

    • freefaler6 months ago |parent

      Not yet... they're not profitable still, but will be in the future (those who survive). Nvidia is making all the money from the eager investors who subsidize the "free" chatGPT & related tools.

    • emahhh6 months ago |parent

      Exactly. Not surprised at all.

    • visarga6 months ago |parent

      Do you mean "making cents per million tokens"? And the benefit obviously belongs to the person who prompts, because they solve a task or get help. The value of that help can be from trivial to life changing.

  • jpc06 months ago

    I'm at a moral impasse on this specifically.

    Llama is probably one of the few LLMs that doesn't directly generate income for its maker; I can't exactly see how it would for Meta, other than by assisting their current ad generation.

    Them being open weight isn't as good as what a "proper" open source LLM would be, but vs OpenAI which likely did the same thing it's significantly better.

    On the other hand if copyright is enforced it should be enforced across the board, if I did the same thing while training an AI would I get the same treatment... Equal before the law and all that...

    On the third point, I cannot legally obtain scientific papers without very significant cost to myself. My local libraries don't have a reasonable selection, and even the university libraries that will let me hold a membership as a member of the public, or as an alumnus, specifically exclude scientific papers from that membership, so you need to pay per paper.

  • paolgiacomelli6 months ago

    Let's then stop calling it "Artificial Intelligence" and call it what it is, "plagiarism software", because "It doesn't create anything, but copies existing works from existing artists and modifies them in such a way that they can escape copyright."

    Noam Chomsky, New York Times - March 8, 2023

  • oidar6 months ago

    A few questions I have on this:

    Is it possible for an LLM of llama3/sonnet3.5/GPT4o quality to be trained on freely available works?

    Are there other types of LLMs that can be trained on smaller data sets with comparable quality?

    If that is not possible, and the courts shut down training on copyrighted works - what position will the "rule following" nations be in compared to nations that don't follow those rules?

    • Workaccount26 months ago |parent

      It's not even clear that training on copyrighted data even is a breach of IP law. People start their arguments on that assumption so they have an argument, but in reality that question isn't even resolved yet, and frankly it looks like the courts will likely determine that it's not a breach of IP law to train on copyrighted data (but is a breach to output it).

      • jazzyjackson6 months ago |parent

        How did they get the copyrighted data? O, right, they downloaded it without permission.

      • jorams6 months ago |parent

        Note that training is not even relevant here. Downloading copyrighted content you don't have the right to download is illegal. Distributing content you don't have the right to distribute is illegal. Meta did both. They did so knowingly, very deliberately even. It is unambiguously copyright infringement, on a massive scale.

  • aprilthird20216 months ago

    Maybe I'm not aware, but as Meta's Llama is free for use, why is the lawsuit against them, vs OpenAI or any other company who have assuredly (from my experience in these companies) done the exact same thing if not worse and make a profit from it?

    • Havoc6 months ago |parent

      They made the "mistake" of being honest in their initial papers about what the dataset used is.

      Everyone, including them, quickly changed their tune though. Now dataset is confidential trade secrets lol

    • est316 months ago |parent

      I suppose it's easier to reach a deal with a company that is selling/renting out their model instead of giving it away for free, so they went against them first?

      Both from a budget at FB point of view, because probably they allocate a smaller budget to training than OpenAI does, so they can't be as generous, as it's not a core part of their business. And probably also from a point of view of the publishers, they probably don't like it either: with OpenAI they can do limited time deals, but with Llama the license allows redistribution so once it's published and a lot of business activity has been established on top of it, one can't come back and renegotiate, after say 10 years.

      It's in the best interest of big copyright to not have open(-ish) models in the ecosystem, they want entities they can seek rents from.

    • dialup_sounds6 months ago |parent

      0. There already are similar lawsuits against OpenAI https://www.documentcloud.org/documents/23963237-authors-v-o...

      1. Because they literally said they used Books3 in the original Llama paper. The provenance of datasets used by other models is not as well documented. Books3 is known to be pirated.

      2. Being free to use doesn't mitigate the authors' complaint in any way. (Compare: "I stole your bike, but then I gave it away.") The authors and artists (in the case of image models) want either a) to not have their work included in training sets or b) to be paid for that use via licensing. In either case they must enjoin the trainer of the model.

      • aprilthird20216 months ago |parent

        Your points make sense. Thanks.

        For point 1, there are employees at OpenAI who do know the provenance of the datasets used and I am sure (based on my experiences) that it includes knowingly downloaded and inserted copyrighted works. Is not one single employee of OpenAI willing to blow the whistle?

        • dialup_sounds6 months ago |parent

          Honestly, I doubt anybody cares that much. It's pretty much an open secret predicated on the untested idea that training is fair use, and the stakes aren't really that high in the long run even if it's not.

          Losses for the tech companies will just mean training data gets more expensive. They're all spending tens of billions on new data centers, so there's not even a question of whether they can afford it.

          • aprilthird20216 months ago |parent

            Yes, that's all true. Plus these days the datasets are sanitized from copyrighted works

    • lxgr6 months ago |parent

      "Free to use" has never been a legitimate excuse in copyright litigation.

      While commercial motives do matter (as far as I understand), being able to find evidence and bring a case practically matters even more.

      • aprilthird20216 months ago |parent

        Yeah I understand that, for example, pirating copyrighted works for free is still illegal and can amount to damages for the copyright holder.

        Idk, just rubs me the wrong way when there are companies making money on the exact same product sourced the exact same way right now, as Meta chooses to make it free and gets sued. Seems like we should be logical enough to conclude if they get sued, every similar company should be investigated and fined (if wrongdoing was found) as well.

        • lxgr6 months ago |parent

          Meta makes their models free to most users (not all – there's a "large commercial use" exemption!), and I do appreciate that practically.

          Still, they don't publish their detailed training methodology, which must have immense value for them internally. Even if they choose to never exercise the "large user exemption" in their current license, they can decide to license Llama 4 under restrictive terms (or not release weights at all – better start using Meta products if you want to gain access!) whenever it's convenient for them to do so.

          All of that doesn't exactly scream "public good worthy of a copyright exemption" to me, in a world where libraries, retro computing/gaming archivists and others are still continuously harassed by copyright holders.

    • consumer4516 months ago |parent

      I am far from a lawyer, but while it's free to use, Meta also uses Llama in their for-profit products. That is significant, is it not?

    • patrickhogan16 months ago |parent

      Evidence

      • aprilthird20216 months ago |parent

        There is definitely lots of evidence in these companies internally that they ignored the fact that adding copyrighted works into the model might be legally grey. I hope someone steps up and blows the whistle or shares the evidence from all the major AI companies. I think the most recent OpenAI whistleblower died before he could testify

  • hdjjhhvvhga6 months ago

    What a coincidence - it recently started giving 502 errors. I guess only the biggest companies are entitled to training their models on the heritage of humanity for free, smaller ones and individuals don't have this right.

  • vladkens6 months ago

    Content distribution services are so "convenient" that it's easier to get everything in one place on a pirate torrent, LOL.

  • kristianp6 months ago

    Are Meta still training new llama models using libgen while this court case is happening? You'd think there would be an injunction against that for the duration of proceedings.

    • aprilthird20216 months ago |parent

      No they aren't

  • buyucu6 months ago

    Since I have loads of pdfs from libgen on my laptop, I guess I should not complain about the Zuck for just this once.

    Also libgen is amazing. You're missing out if you're not using it.

    • JTyQZSnP3cQGa8B6 months ago |parent

      I'm actually training a new generation biological LLM stored in my skull, and I use audio and video material from TPB in order to achieve the best AGI ever. You can't sue me because OpenAI and Facebook are doing it, also I don't have enough money in my bank account and that's why you can't sue me.

      It's funny that most AI bros I have met were severely against piracy a few years ago.

  • mkmk6 months ago

    A partial list of ISBNs in the dataset https://github.com/psmedia/Books3Info

  • ChrisArchitect6 months ago

    [dupe] https://news.ycombinator.com/item?id=42651007

  • puppycodes6 months ago

    IP laws falsely equate work with value, once again stuck in the gilded cage of correlation and causation.

  • bananapub6 months ago

    glorious to come a day after the anniversary of Aaron Swartz's death, chased to it by the US Feds for a similar thing, just with much more noble intentions than Zuckerberg's desire to rule the future.

  • gorbachev6 months ago

    This is willful copyright infringement. Statutory damages are up to a maximum of $150,000 per work infringed. Meta's market capitalization would cover 10,333,333 works, if maximum damages were levied against it.
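
The figure above can be reproduced as a rough sketch. The market capitalization of about $1.55 trillion is an assumption back-derived from the comment's own numbers, not a sourced figure; the $150,000 cap is the one quoted in the comment. All names are illustrative.

```python
MAX_STATUTORY_DAMAGES_USD = 150_000         # per work infringed, as quoted in the comment
ASSUMED_MARKET_CAP_USD = 1_550_000_000_000  # assumption: ~$1.55T, implied by the comment


def works_covered(market_cap_usd: int, damages_per_work_usd: int) -> int:
    """Number of works whose maximum damages the market cap could pay in full."""
    return market_cap_usd // damages_per_work_usd


covered = works_covered(ASSUMED_MARKET_CAP_USD, MAX_STATUTORY_DAMAGES_USD)
print(f"{covered:,} works")
```

The integer division rounds down, so a partially covered work is not counted.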

    Not that the billionaire class will ever be held to the same legal standard as the rest of us.

  • agilob6 months ago

    My comment from 2 months ago saying:

    >FBI should go after OpenAI and Sam Altman the same way they went after Aaron Swartz.

    Should now be expanded to:

    FBI should go after OpenAI, Sam Altman and Meta the same way they went after Aaron Swartz.

    • portaouflop6 months ago |parent

      No they should go a million times harder after them given that they built billion dollar businesses on top of this - but maybe that’s the exact reason why they don’t?

      • ALittleLight6 months ago |parent

        Disagree. They were wrong to go after Aaron Swartz and would be wrong to go after OpenAI or Meta. The federal government should instead require any university receiving federal support, or enrolling students on federal loans, to require anyone associated with the university to publish only in completely open journals - providing unrestricted access to all. Then the government should void whatever intellectual property protections currently fetter academic literature.

        • portaouflop6 months ago |parent

          Fair point and I agree with you on that - I guess my point was that Aaron’s “crime” pales in comparison to what they do.

          But yea copyright is broken and is holding back humanity as a whole

    • Havoc6 months ago |parent

      That was my first thought too, though wiki suggests that at least a portion of Aaron's sentence wasn't about piracy but unauthorized access. i.e. "hacking"

      So not a direct analog

      • danparsonson6 months ago |parent

        That always seemed to me as just a convenient stick with which to beat him - much as Al Capone was done for tax evasion (without wishing to draw any other parallels between the two)

      • bagels6 months ago |parent

        He was sentenced? I thought he didn't make it to trial?

        • marcosdumay6 months ago |parent

          AFAIK, he wasn't. Also AFAIK, he had access, so this would be a really tough "crime" for them to push.

    • nostradumbasp6 months ago |parent

      Not going to happen. All the US agencies and military branches stand to gain a ton in surveillance capabilities with GenAI technologies. More capable text condensation models, the ability to "befriend" and track hundreds of thousands of citizens online with fake personas to extract information, fingerprinting citizens' writing styles, anonymizing writing styles, harvesting resumes, mass propaganda campaigns, echo chambering groups of citizens, mass sentiment analysis for strategy development and psyops, smearing corporate/political campaigns to forward agendas, mass psychological weapon assessments, etc. It's too much for all branches of government to not forward this as quietly as possible while leveraging strategic corporate partnerships for data harvests.

      It's the new radium. Expect it to be in your shaving cream without accountability and hope future generations look back on it thinking we were really dumb. We're all part of the biggest experiment in human history we just have to trust who is at the wheel.

  • themerone6 months ago

    Artists have been imitating the works of others for thousands of years. Why is it so controversial to have an AI do the same?

    • lelandfe6 months ago |parent

      How many NYT articles do you think you could recreate from memory? https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dk...

      Oh, you said zero? How beguiling. Maybe there is a difference.

      • Workaccount26 months ago |parent

        The problem there is outputting the data, not inputting it.

        OAI can put a dumb IP filter on ChatGPT output and resolve the case. Training plays no part here.

        • lelandfe6 months ago |parent

          The Times's lawsuit does allege in part that 1. the training data was not licensed and therefore OpenAI has committed copyright violation, and 2. that the resultant model is a copy or derivative work of The Times's body of copyrighted articles.

          That regurgitation is merely evidence of those two, and so putting a filter on the output explicitly does not resolve the case.

          • Workaccount26 months ago |parent

            The question is whether legally you need a license to view copyrighted material. Training doesn't copy anything; I think this is where people are confused. People assume this is how training works, because they have a false intuition about how LLMs must work.

            LLMs are not data archives; I don't know how many times this has to be repeated.

            • lelandfe6 months ago |parent

              The NYT and their lawyers are confused, then:

              > Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.

              https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

              This case is still ongoing a year later.

              • Workaccount26 months ago |parent

                They need to be confused because they need the judge to be confused too.

                NYT isn't suing because LLMs will print out NYT articles for free. They are suing because LLMs are poised to be better/more favorable news reporters than them. It's a long term survival case, not a copyright one (despite that being the weapon used in the fight)

                • munchler6 months ago |parent

                  I agree. That said, if AI puts journalism out of business, then AI will quickly run out of content to train on and report. I think this is a situation where technology has gotten way out ahead of the law.

      • munchler6 months ago |parent

        So if they fix this problem, you’d then be OK with generative AI?

        BTW, plenty of humans have memorized copyrighted material, such as song lyrics. Do you think that should be prohibited? Maybe the difference isn’t as great as you think.

        • 6 months ago |parent
          [deleted]
    • 52-6F-626 months ago |parent

      This is it, isn’t it? That’s why all the effort pours in to make these machines produce Rembrandt and Tolstoy copies. And why I still have to do my taxes by hand rather than the machine handling it with speed and accuracy.

      It’s the core of it all— jealousy of a creative spirit.

      Artists are not machines, but living souls.

      If you were remotely open enough to see for yourself, then you wouldn’t struggle with engaging in the world in a creative manner and you wouldn’t feel that jealousy but encouragement by what you see pouring out of your fellow humans as a reflection of each other.

      No machine will grant you that understanding, you just have to engage directly.

      It will never succeed to supplant it, no matter the billions of dollars burned to try.

      • CuriouslyC6 months ago |parent

        AI doesn't make art, it makes images - it's like a camera this way. Art is in the composition, the message and the aesthetic of the one using the tool to create an image.

        • 52-6F-626 months ago |parent

          Ignoring my point doesn’t make it go away.

          Using an AI tool to create an image of a painting betrays the person who seeks to be “artist” by short circuiting the practice that leads the prospect to their path of enlightenment through mastery.

          In our constricted 3d world there is no circumstance where an algorithmically generated image of a painting will equally serve the prospective artist in its procedural work on the prospective artist, internally. There is no other pursuit in art, and any pursuant will come to that conclusion in any number of ways but always through submission to the course of mastery (for which there is no shortcut).

          Worse, the companies at the helm of this side of the technology are pushing it in order to stand middle man to humankind’s modus operandi: to create.

          Keep your mind open to perspectives beyond the software industry.

    • 6 months ago |parent
      [deleted]
    • energy1236 months ago |parent

      I do think a new way of thinking about copyright is needed for AI. Allow tech firms to train on all material, but there should be an AI tax that serves as compensation back to the public commons for what was taken from it and privatized.

      The status quo favors large players who can navigate the legal system.

      • CuriouslyC6 months ago |parent

        Training on copyrighted content as research is entirely fair use. Just require that fair use defense to hinge on that research being public, i.e. that it's only a defense of open weight models.

        A tax on AI is stupid because the big players can dodge taxes well and have the ear of power now, so any regulation would favor them. It would only serve to prevent challengers to their dominance.

    • xena6 months ago |parent

      Food costs money. When the price of your labor becomes zero, you can't afford to eat.

      • 6 months ago |parent
        [deleted]
    • pieix6 months ago |parent

      The controversy here is that LibGen doesn't legally distribute its content. Mass-scale training on pirated content is... legally murky, to say the least.

    • recursivecaveat6 months ago |parent

      It would also be a crime for an artist to download all of libgen and imitate/learn from it. Zuck is just a billionaire and hence above the law.

    • threeducks6 months ago |parent

      [dead]