Judge dismisses majority of GitHub Copilot copyright claims

Posted by stronghup@reddit | programming | View on Reddit | 520 comments

[-]

Last_Tower_4970@reddit

they are scammers! we all agree on that! we just have to formulate a plan to make them pay!

[-]

sleeping-in-crypto@reddit

I don’t think people realize what is happening here nor what the endgame of all this is where copyright is not respected just because wants to use your property to train its models:

A world of walled gardens where the training software cannot access, gated by a labyrinthine network of access and licenses that only the most seasoned lawyers can unwind (at proportionate cost, of course).

If that’s the world you want, sure, keep promoting decisions where MS, OpenAI and others can keep using the entire world’s creative works without permission or license.

You know the worst part is, this is the biggest missed opportunity in human history to date: wouldn’t you love to be part of a generation-defining project to turn the world’s knowledge into AI? They robbed you of that by doing it for profit and then claiming they owe you nothing. You gave them everything…. And they believe they owe you nothing.

[-]

bwainfweeze@reddit

They aren’t turning the world’s knowledge into AI. They’re turning the worlds opinions int AI. And its foibles and its fears.

Half of us are dumber than average and language models can’t tell who is who.

[-]

Modest_MLE@reddit

It doesn't matter. They need all that material

[-]

bwainfweeze@reddit

meltbox@reddit

I understand what they’re saying but entirely disagree. Large models of all kinds have perfect recall and their feedback explicitly prioritizes closeness to the training data.

While the representation is not lossless it is always mimicking the input data as closely as possible and interpolating between points when not possible.

Go see the arc challenge and the interviews that Google researcher has done on AI. Current models appear to be purely recall driven. They don’t really have anything akin to what would be considered interdisciplinary reasoning or transfer like humans do.

So my position is that even though the encoding looks nothing like the original data an approximation of the original data can be recovered from the internal weights and the appropriate input. Therefore you are essentially distributing a lossy version of copyrighted material which is still not okay. IE re-encoding a movie with artifacts and worse quality is still illegal even if entire minute long segments are missing.

The other issue here is say a human is also capable of copyright infringement, but won’t because there are legal consequences for doing so.

But a machine gets no consequences because it’s not human? The argument is absurd even if you assume that LLMs do change data like a human because unlike a human it can’t in any reasonably effective way be restricted from violating copyright.

[-]

purleyboy@reddit

Here's an explanation on Wikipedia

[-]

hackingdreams@reddit

It's going to be hard difficult, possibly impossible, to 'prove' a piece of code was used in training.

Not really. The Plaintiffs will ask for the logs, and Microsoft will provide them, or admit they don't have them. Admitting they don't exist looks really, really bad for Microsoft in the light of the law. It's essentially "the dog ate my homework."

This isn't some shifty fly-by-night operation. This is a multi-billion dollar behemoth.

[-]

This will certainly be interesting.

[-]

Chroiche@reddit

it's not just decompression though, it combines things. Sometimes the result is vastly different to any of the training data, sometimes it's verbatim. The analogy to movie compression doesn't really hold up besides verbatim replication.

[-]

These things are not sentient; they do not learn, and they are not people.

How do you know this?

Can you propose a test for how we determine whether something or someone is sentient?

Will that test have very unfortunate implications about which people are not sentient and therefore don't count as real people?

[-]

EveryQuantityEver@reddit

How do you know this?

Because it's a fucking machine. If you're going to try to claim that these glorified autocompletes are sentient, then you're not someone that can be taken seriously.

[-]

Xyzzyzzyzzy@reddit

Do you believe in the literal, objective existence of human souls?

Because there's two, and only two, alternatives here:

It's impossible for a machine to be sentient, because sentient beings have souls, and machines cannot have souls. It is, theoretically, possible for a machine to faithfully simulate the physical processes taking place in a human nervous system. Since you believe humans are sentient, that means there must be some non-physical difference between an actual human nervous system and a simulated human nervous system - if the difference were physical, we could simulate it. That's the definition of a soul: an intangible, unmeasurable, yet real "special something" that all people possess.
It's possible for a machine to be sentient. If a machine faithfully simulates all of the physical processes of a human nervous system, that machine would possess sentience. "Because it's a fucking machine" isn't a real argument and it's not what you actually believe. Maybe it's meant to show - yet again - that you think all people should automatically agree with you and you despise everyone who has different opinions about things. I got that message loud and clear already, so you can cut the fake emotional bullshit and actually answer the question if you want.

[-]

kintar1900@reddit

I made no claim to any of those things. I'm talking about the claim made by the previous poster, and how that logic could conceivably be applied to humans.

[-]

EveryQuantityEver@reddit

And I'm saying that you can't do that, because AI and humans are not the same thing, not even close.

[-]

uCodeSherpa@reddit

No. This is idiotic. Humans can take what they learned and produce novel behavior. AI cannot.

[-]

Moleculor@reddit

Humans can take what they learned and produce novel behavior.

Prove, in this world where all art is derivative, that humans always produce behavior that can be described as novel.

Every action I ever take is influenced and inspired by my past experiences, past experiences that are overwhelmingly influenced by other people.

My speech, my mannerisms, etc? Most of them I can identify who inspired them, and major influences on them.

[-]

EveryQuantityEver@reddit

Prove, in this world where all art is derivative, that humans always produce behavior that can be described as novel.

That's not required. AI never produces things that are novel.

[-]

Moleculor@reddit

Nor do humans, so I'm not sure what your point is.

[-]

EveryQuantityEver@reddit

You tried to claim that all human behavior had to be novel for it to count. That's not true. However, while humans can create things that are novel, AI never can.

[-]

Moleculor@reddit

You tried to claim that all human behavior had to be novel for it to count.

Alright, fair enough. I misspoke. I asked for humans always producing novel behavior, and then went on to describe how none of my behavior is novel.

Prove to me that some of my behavior is novel.

[-]

EveryQuantityEver@reddit

I asked for humans always producing novel behavior, and then went on to describe how none of my behavior is novel.

Except that's not true. You came up with the ways to combine it, and I guarantee you that you had manerisms before you were exposed to those people.

[-]

Moleculor@reddit

You came up with the ways to combine it, and I guarantee you that you had manerisms before you were exposed to those people.

All my mannerisms developed from my interactions with the world. Everything I did was inspired by either seeing other people, or getting feedback from the world around me.

Everything about me is tied to, inspired by, or otherwise linked to things outside of me.

[-]

Xyzzyzzyzzy@reddit

Can you give an example of something created by a human that is truly novel?

This would help clarify what "novel" means to you. Currently it's not possible for anyone to respond to what you're saying, because you're using "novelty" in such a vague hand-wavey way that I can only define it as "a thing is novel when u/EveryQuantityEver says so".

[-]

kintar1900@reddit

Exactly. The definition of "novel" is INCREDIBLY vague, and no human produces anything that is not derivative in some way of something they have seen, heard, or otherwise experienced.

[-]

uCodeSherpa@reddit

ALL art is derivative

Well THAT is a claim that definitely needs some backup.

AI folk really like to spit these claims out of their ass without any semblance of support for them.

[-]

Moleculor@reddit

Well THAT is a claim that definitely needs some backup.

There's a massive amount that has been said on the topic.

[-]

uCodeSherpa@reddit

Mark Twain said something about a kaleidoscope of ideas, therefore he’s right. All art is derivative.

This is mentally handicapped dude.

[-]

Xyzzyzzyzzy@reddit

I don’t care WHO said something. I care about the merits of their statements.

You do care who said something, because you obviously don't care about the merits of your own statements.

[-]

Moleculor@reddit

If you found a quote from Mark Twain somewhere in those search results, you likely also found links explaining why it was right.

I won't take the time to explain to you what has already been explained. If you refuse to put in the effort, I have no reason to do so either.

[-]

jmlinden7@reddit

Generative AI, by definition, produces novel outputs. Maybe bad quality outputs sure, but novel

[-]

uCodeSherpa@reddit

No it doesn’t.

[-]

currentscurrents@reddit

They totally can produce novel behavior, unless you can find me this spaghetti tent on the internet somewhere.

[-]

uCodeSherpa@reddit

This is not novel behaviour.

I agree that if the AI can draw something looking like spaghetti and it can draw a tent, then it can draw a tent made of spaghetti.

This is not what anyone is talking about when they discuss AIs doing something new.

[-]

currentscurrents@reddit

That’s as new as anything humans make.

Look at fantasy creatures - they’re all just real creatures glued together. A unicorn is a horse with a horn, a gryphon is eagle+lion, a mermaid is woman+fish, etc.

[-]

uCodeSherpa@reddit

For the record, a human still made this spaghetti tent. The AI just drew it from a prompt.

But I mean. Okay. Lots of fantastical creatures are inspired by real creatures and this is proof that AIs work like a human brain? I mean, neurological science does not “know” how a human brain works, so you claiming to is basically a fantastical creature.

Either way, again, an AI successfully giving a lizard feathers is not novel. It is pretty fuckin cool that even a nobody such as myself can feed an AI a prompt of a fantastic creature and get something back that follows the constraints of what the AI has been trained on. No disagreement there.

However if you ask for an AI to invent a bunch of cool creatures, it can ONLY work within the boundaries of its training. It cannot imagine possibly something new.

Lots of humans lacking imagination (I fully admit I am one of those people) is not proof that AI produces novel behavior.

An AI would not have come up with the theory of relativity on its own, for example.

[-]

lelanthran@reddit

at the fundamental level both humans and LLMs are processing, storing, and recombining past experiences for new output.

Largely irrelevant, because ... scale matters, in law.

Humans aren't reading and absorbing a few billion lines of copyrighted code; the LLM is.

Just like possessing a single joint isn't illegal where I am, but possessing 4000 tons of weed is illegal.

Scale matters. The argument that "using copyright just the way a human does, but scaled up a billion times" is stupid.

[-]

accountForStupidQs@reddit

Scale only matters when a specific threshold is put into legislation. Saying one controversial statement and several million are both equally legal, and killing one person or one thousand are both illegal. Whether something counts as copyright infringement, until a law says otherwise, will not depend on if it's done slowly or quickly, once or one billion times.

[-]

Full-Spectral@reddit

It does matter, to the people who are being infringed on. If you copy one song, no one is going to bother coming after you. If you copy a hundred thousand songs and put them on a server for people to download, people are going to come after you, because it's now very relevant to them and impactful.

[-]

accountForStupidQs@reddit

Legality has naught to do with getting caught or someone suing you. Doing something illegal is still illegal, even if you don't get caught.

[-]

Full-Spectral@reddit

But for copyright, it is at the copyright holder's discretion. They own the copyright and can choose to take action or not. No one is going to waste time coming after Joe Blow for copying some code into his project used by 5 people. But a huge corporation, sucking up the entire internet is another issue and people will choose to take action over that.

[-]

lelanthran@reddit

Scale only matters when a specific threshold is put into legislation.

My point is that they put those thresholds into legislation because scale matters. IOW, you've got cause and effect backwards - a legislated threshold is the result of scale mattering.

This (slurping up the entire worlds corpus of copyrighted text to derive a new product) is something new, so why do you expect the legislation to be in place for this?

[-]

kintar1900@reddit

Humans aren't reading and absorbing a few billion lines of copyrighted code

Regardless of the rest of the discussion, this made me laugh. I've been in software for over 25 years. I'm positive I have read well over a billion lines of code, most of which falls under SOME form of copyright.

[-]

[-]

EveryQuantityEver@reddit

It really isn't something new, though. I don't see why they should get to claim they need special treatment, as if they somehow are entitled to do what they want.

[-]

Xyzzyzzyzzy@reddit

I don't see why we should let luddites abuse the legal system to impede technological advancements to protect their own narrow self-interest, but it happens anyways.

[-]

Ictogan@reddit

I'd argue that depending on how much of the original data ends up being stored in the model, the model may be considered to contain a condensation of the original work.

[-]

oorza@reddit

I'm not sure this applies here, I think new case law will be written.

In the case of a derived work, we're usually talking about things humans create. AI work can't be copyrighted and doesn't have IP protections currently, so what it outputs doesn't matter, so the question becomes: is the LLM itself a derived work? Obviously not, but that doesn't mean anything beyond this case law can't apply here.

[-]

Chii@reddit

Even if you can't comprehend how the copyrighted content is baked into the model, it's still baked in

but it can contain more than just what got baked in. The argument cannot hold, because this argument does not hold for a brain either.

The books or movies i've read and remembered in my brain doesn't constitute any infringement. It's only when i deliberately extract the movie out that it constitutes infringement.

Why should there be any differentiation under the eyes of copyright law between a brain and the LLM?

[-]

Red_not_Read@reddit

If I publish source code with a GPLv2 license, and you read an memorize it, and then verbatim regurgitate it into your closed-source application, then that's a license violation.

An LLM can be thought of as a container that stores copies of the source code it has seen, and then renders that source code on demand, only without the accompanying license text.

The specific detail of the algorithms and data structures that comprise the LLM, or the precise math that describes the format of the original source copy (knowledge) that the LLM holds is somewhat immaterial.

What's going to matter, I think, is whether Copilot is emitting what looks like verbatim copies of code (like a source code database), or if it can be argued that Copilot is learning and applying learned knowledge, which would not look like an exact copy of previously seen code, but may validly reflect algorithms and data structures previously seen.

It's going to be fascinating.

[-]

kintar1900@reddit

An LLM can be thought of as a container that stores copies of the source code it has seen

No. This is a VERY incorrect explanation of the way LLMs work, and shows a lack of understanding of the fundamental math underlying complex AI models.

The specific detail of the algorithms and data structures that comprise the LLM, or the precise math that describes the format of the original source copy (knowledge) that the LLM holds is somewhat immaterial.

If this argument holds, then it also holds for the human mind, because LLMs store data based in large part on the way biological systems store data.

What's going to matter, I think, is whether Copilot is emitting what looks like verbatim copies of code...

This has been tested in copyright trials before, where two artists or engineers came up with the exact same thing without ever seeing the others' work. It even gets WORSE when you start trying to apply this test to source code, because there are only so many ways to solve a given problem within the constraints of a given programming language and environment. It's not only possible, but highly likely that two human software engineers will produce eerily similar or exact copies of code for a given problem.

[-]

EveryQuantityEver@reddit

If this argument holds, then it also holds for the human mind, because LLMs store data based in large part on the way biological systems store data.

No, it doesn't, because LLMs are not people.

[-]

kintar1900@reddit

And our legal system has such a GREAT record at not applying human-like tests to non-humans ("Corporations are people!") or vice-versa?

I'm not claiming an LLM is intelligent or sentient. I'm talking about the mechanics of the arguments being made for copyright violation in TRAINING data.

[-]

EveryQuantityEver@reddit

("Corporations are people!")

If you would ever want to sue a company, or hold it accountable for a contract, yes, you would like them to be treated as one.

I'm talking about the mechanics of the arguments being made for copyright violation in TRAINING data

And those mechanics are irrelevant, but you're trying to say that they learn like people, when that's just not true at all.

[-]

uCodeSherpa@reddit

Said do NOT store data similar to biological systems. This is an absurd claim. World class neurological scientists do not know how biological systems store data and you’re out here stating that programmers have figured it out. Not only that, but that neurological science has seen this and been like “yup. That it. You guys got it”

Absolutely moronic claim.

[-]

kintar1900@reddit

World class neurological scientists do not know how biological systems store data

I think you're conflating "know how it's stored" with "are capable of reading the stored information".

Neuroscience agrees that changing the strength and number of connections between neurons, including the level of signal from connected neurons required to cause a neuron to fire, is the core mechanism for storing memory in a biological brain. This discovery is what lead to the creation of the first digital "neurons".

If I'm wrong, please provide a link because I would love to be corrected and to know what the current science says.

[-]

uCodeSherpa@reddit

This tells us absolutely nothing about whether AIs actually model a brain. Yeah, they both have mechanism for firing at different strengths.

By this logic, guns firing in a war accurately simulates a human brain.

There is a universe of information missing between “how the human brain neuron work” and “these two systems have a kind of similar way to pinging their brethren”

[-]

kintar1900@reddit

By this logic, guns firing in a war accurately simulates a human brain.

Illegal straw-man argument on the field. Offense is assessed a five-yard penalty.

[-]

hachface@reddit

This debate-club bullshit is so played out.

[-]

uCodeSherpa@reddit

It’s not my fault that your claim is absolutely, demonstrably absurd. Maybe do a little introspection and stop claiming that AIs work the same way that human brains work?

[-]

Red_not_Read@reddit

No. This is a VERY incorrect explanation of the way LLMs work, and shows a lack of understanding of the fundamental math underlying complex AI models.

Ugh, how rude. Thanks, but my practical knowledge of LLMs, neural nets, and transformers is just fine, thank you.

It's not necessary to get into the details of the how at this level of conversation. It's the what that matters and the what is that the network contains license source code.

You don't have to painfully explain that it's not in there as arrays of characters, or Huffman trees, or what have you, but as values encoded across billions of nodes across a vague multi-dimensional space. That's the how, and it doesn't matter.

It's not only possible, but highly likely that two human software engineers will produce eerily similar or exact copies of code for a given problem.

That's fine, and if the LLM emits similar ideas to those used in open source code, then that's fine... but if it emits literally the same blocks of non-trivial code... then I don't know how you can argue that it's somehow not plagiarism.

[-]

kintar1900@reddit

Thanks, but my practical knowledge of LLMs, neural nets, and transformers is just fine, thank you.

Okay, then. I'll avoid further basic explanations.

You don't have to painfully explain [that it's stored] as values encoded across billions of nodes across a vague multi-dimensional space

Thank you, because this is a good example of the difficulty we're having (we as in 'the world') talking about LLMs and what they are or are not. You seem to be of the opinion that since correctly-structured prompts can produce output which exactly mimics the training set, that it constitutes copyright infringement to train the model with that data. I am arguing that the way LLMs encode data is SO SIMILAR to the way the human mind encodes data that any legal conclusion which states that the weights and connections of the model constitute a copy of the source material will by definition require that human minds be treated the same way.

One experiment that I'd love to see performed, but which I just don't have the computing resources to perform myself, would be this:

Find an example of code being used as an argument for infringement
Train an LLM on source code that produces similar results (and in the same programming language) as the example code snippet, but which does not include the code snippet.
See how long it takes to produce a duplicate of the example code from the trained model.

My hypothesis is that it is possible. I think this experiment would put the argument to bed forever. The argument would then turn into whether or not the experiment's prompt generation step was run long enough or correctly enough to produce valid results.

[-]

a_marklar@reddit

Reddit fuzzes voting so complaining about -1 is not only weak, it's usually wrong.

I didn't down or upvote you, but when I see someone say this:

I am arguing that the way LLMs encode data is SO SIMILAR to the way the human mind encodes data that any legal conclusion which states that the weights and connections of the model constitute a copy of the source material will by definition require that human minds be treated the same way.

I roll my eyes. I'm sure other people would hit the downvote button instead.

[-]

kintar1900@reddit

Why does it make you roll your eyes? How am I wrong?

I post these things because I want discussion, including actionable corrections on my take. Unfortunately, what I usually see are just reiterations of the same (typically flawed or massively over-simplified) claims, or random mudslinging.

[-]

a_marklar@reddit

Well the truth is that I have a knee jerk reaction to anyone who anthropomorphizes software. Beyond that, the statement is not currently falsifiable so it's actually nice sounding bullshit. Combine it with the language like "SO SIMILAR", "by definition", the idea that we'd apply laws equally to humans and software, and my eyes can't stop themselves. Forgive me.

I post these things because I want discussion, including actionable corrections on my take.

Hell yeah. I'm replying because I get that and I would love honest feedback if I asked for it too.

[-]

kintar1900@reddit

Hell yeah. I'm replying because I get that and I would love honest feedback if I asked for it too.

There are too few Redditors with that attitude. Thank you.

Well the truth is that I have a knee jerk reaction to anyone who anthropomorphizes software

Can't say I blame you, and I wasn't trying to anthropomorphize LLMs. I personally can't stand it when people talk about AI systems "thinking" or "wanting", etc. My statement is entirely around the (false) claim further up this thread that a neural network stores a copy of the data it was trained on. I brought up the similarity to biological systems to point out logical fallacies in arguments about why training a neural net on copyrighted data constitutes copyright infringement.

Beyond that, the statement is not currently falsifiable so it's actually nice sounding bullshit.

Which statements? Everything I've said about the way ANNs encode data being similar to the way we think -- a phrase I should have included in my original statement -- that biological brains encode data is based on various papers and articles I've read since the mid-90s. HOWEVER, I have recently been informed by an acquaintance that I'm out of date and there's currently research being done on whether or not neurons themselves perform networked processing within themselves, which is FREAKING AWESOME! :D

the idea that we'd apply laws equally to humans and software, and my eyes can't stop themselves

While I understand, I'm apparently WAY more cynical about our legal system than you. We already treat corporations like they're individuals, and in some cases give them more rights than people. :/ Couple that with the greed expressed by US corporations, and I can 100% believe that if someone in a corporation's legal team thought there was a chance in hell that they could claim copyright on the output of a human because the person had been exposed to copyrighted data, they'd do it.

[-]

loup-vaillant@reddit

Reddit fuzzes voting so complaining about -1 is not only weak, it's usually wrong.

Reddit doesn’t fuzz when there are fewer than n votes (n is probably less than 5). A controversial vote with 10 ups and 10 down, sure, it will get fuzzed. but:

If you have more than 1 point, you know for sure you had at least one upvote.
If you have less than 1 point, you know for sure you had at least one down vote.

[-]

Red_not_Read@reddit

Have an upvote. We're here to argue conflicting opinions.

I'm actually pro-LLM, in software too, and my argument is basically that it's going to continue to be a challenge for normal people (by which I really mean non-tech, e.g. judges and the government) to make pragmatic decisions about all this.

[-]

kintar1900@reddit

Thanks, and I agree 100%. Our (the USA and to a lesser degree the EU) governments have consistently shown that they do not put sufficient weight on technical experts who weigh in on proposed tech regulations. It's disappointing.

[-]

totoro27@reddit

It actually does matter. If the model was as simple as what you described, then the legal conversation would be much simpler. I think it is inevitable that the legal conversation will get into what exactly these models are doing under the hood.

[-]

loup-vaillant@reddit

If this argument holds, then it also holds for the human mind,

Oh but it totally holds for the human mind: try and rewrite a novel from memory, then sell it as your own: if the original author ever sees this, they will sue your ass, and win.

[-]

Moleculor@reddit

An LLM can be thought of as a container that stores copies of the source code it has seen

No, this is why copyright claims were thrown out: the model provably does not contain substantial copies of existing works.

An LLM only contains 'copies' if you define 'copies' as 'incredibly tiny fragments'. A word or such.

It's like arguing that replied with a sardonic smile is a copy of someone else's work.¹

It's a sentence fragment from A Game Of Thrones, so technically, yes, you can find it contained within an existing work...

But you can also find it within Days of Atonement by Michael Gregorio, Life in the New World by Charles Sealsfield, and The Memoirs of Queen Hortense by Queen Hortense.

You can also find it within interviews, fanfiction of Hearts of Iron 4, and more.

An LLM is a mathematical slurry with numeric connections between all these tiny fragments. Their design is literally based on theories of how the human mind operates. And it only works because it doesn't contain whole, complete copies of works; they'd be too slow to search through.

¹ And I'm not even entirely sure that fragment is short enough to be a legitimate example, because my understanding is the fragments, called tokens, are generally only a few characters in size. Like... sard, I guess. Or maybe sardonic.

[-]

Ask an LLM to reproduce something that was in its training data, it has a good chance of producing something very close.

In short enough snippets that it's reasonable to think that a human might have reproduced it in the same situation without having seen the so-called original work.

Copyright cares about the work as a whole, or substantial enough portions of it that it threatens the profits of the person who made the work. The entire novel, not one sentence fragment from page 237.

It's why Google was so successful in defending itself from copyright lawsuits from the Author's Guild when they created their book search engine.

[-]

loup-vaillant@reddit

To be honest, I’ve sometimes straight up Ctrl-C Ctrl-V snippets of code, rearranged them to my style (indentation naming, a bit of refactoring…), and… well are a couple dozen lines enough to count as infringement? I never knew where the limit actually is to be honest.

But it does speed up my work sometimes, even when I know the end result would have been the same if I started from scratch. Especially when I’m the original author, who somehow has ceded all rights (including attribution in practice) to some previous employer.

[-]

Red_not_Read@reddit

Of course it isn't... Why don't you take a stab at describing how an LLM incorporates its training data, in a way that can be easily understood by normal people.

[-]

EveryQuantityEver@reddit

because this argument does not hold for a brain either.

An AI model is not a brain. The two cannot be considered to be the same.

[-]

batweenerpopemobile@reddit

brb, getting music industry to sue monster rancher franchise for deriving monsters from copyrighted data.

[-]

Monster-Fenrick@reddit

I don't think gathering the last two digits of track numbers constitutes a copyright violation. It's the equivalent of looking at specific page numbers in a book and counting how many words are on it and using that number to reference a table to decide what monster to create.

[-]

travelsonic@reddit

how the copyrighted content is baked into the model

Copyright status doesn't make sense to id as if copyright status is the problem IMO, as opposed to licensing status, if licensing is needed, etc. Implying copyright status makes for a pieces use in training being problematic or not, IMO, would miss that copyright is automatic in the U.S and many other countries - and therefore, works used with permission even (implicitly or explicitly) would still be "copyrighted works," for instance.

[-]

oorza@reddit

Decompression in lossy video codecs isn't as simple as you might think, the analogy stands fine I think. You can add a bunch of processing filters on both sides of video codecs - stuff like noise reduction/addition, color adjustments, etc.

If I take The Avengers film and add a fansub track, replace parts of the music with a custom score I wrote, and then add a ton of video filters to the (de)compression, it's nowhere near a verbatim replication. But it'd still be copyright infringement to sell it.

It's the same thing as an AI: taking an original piece of IP, layering some changesets on top of it, pushing it through a lossy codec, then decoding it again with more filters on top of it. That describes both an LLM and a bunch of weird pirated anime on the internet, but only the latter is currently illegal.

[-]

Xyzzyzzyzzy@reddit

If something is a derivative work, then you can use the derivative content to point to exactly which works it was derived from. If I write a song that starts out "is this the real life?/is this just fantasy?/caught in a landslide/no escape from reality", it's clearly derivative of Bohemian Rhapsody by Queen. You don't need to know anything about me to say that. You just need to show that my lyrics are the same as their lyrics.

If an LLM produces a derivative output, we should be able to show which prior work it is derived from, right? LLMs can produce indisputably derivative outputs, and when they do, we can show the original works they're derived from, the same as if a person creates a derivative work.

But - correct me if I'm wrong - you're going a step further and saying that the LLM itself is a derivative work of every item in its training data set, so every output produced by the LLM is derived from the entire training data set. If I have ChatGPT write an SQL snippet to add a new column to my AccountsPayable table, it's derived from all of the SQL in its training corpus. It's also derived from St. Paul's letter to the Ephesians, Quotations from Chairman Mao, and "To a Mouse" by Robert Burns. ("Wee, sleeket, cowran, tim'rous beastie/O, what a panic's in they breastie!")

That seems like a dramatic expansion of copyright, and a massive transfer of legal and economic power to existing copyright holders at the expense of all future creative work.

Even if we have different standards for human-written and LLM-produced works, LLMs are ubiquitous. A current copyright holder could claim that they have good reason to believe my work was written by an LLM, it's derivative, it's a violation of their copyright, and they'll sue me unless I pay them $5k to settle the claim. Even if the claim is frivolous, I can only be assured of winning in court if I can prove that my work was written before February 14th, 2019, when GPT-2 was made available to the public.

We already have patent trolling; now we can have copyright trolling, too. If I have seen further than others, it is by standing upon the shoulders of giants, so the giants are entitled to compensation. It's an RIAA lobbyist's wet dream!

[-]

oursland@reddit

If something is a derivative work, then you can use the derivative content to point to exactly which works it was derived from.

You should double-check your "facts". The whole reason there's a project out there to copyright each and every melody is precisely because you can lose a plagiarism case by simply having the same notes in a sequence. This is true regardless if you play them at a different pace or have nothing to do with an original work.

The reality is, if there is significant similarity and an expert claims that it is unlikely that two independent works would result in this similarity, then you're going to lose your plagiarism/copyright case.

[-]

Xyzzyzzyzzy@reddit

...that reinforces my point? They're suing people based on "this creative work resembles that prior creative work". It's a good thing you included that first sentence, because otherwise I'd think you're agreeing with me!

[-]

sparr@reddit

If you run it 100 times and you get 99 new movies and the exact original movie once, that's [at least] one instance of copyright infringement.

[-]

quetzalcoatl-pl@reddit

So, if I take Avengers, run it through H265 (compression), add subtitles (combine things), and add my voiceover (even more combining things, my personal products added) - then it is not copyright violation? YAY, hold my beer, I'm opening new business!

[-]

Helluiin@reddit

The analogy to movie compression doesn't really hold up besides verbatim replication.

if you compress a movie and put a filter over it thats also combining things. yet most people would probably call that copyrightinfringement

[-]

hackingdreams@reddit

it's not just decompression though, it combines things.

So? Now it's just incorporating copyrighted data from multiple sources instead of one.

The fact it can generate code that's verbatim to the training data indicates that it is, in fact, a sophisticated compression scheme. You just admitted it.

[-]

Which one?

I think under current copyright law the answer is "all of them". Taking a copy of something and incorporating it into another work is basis of a derivative work.

If the assertion is it's not derivative, then that's also saying that the model can be made without the copyrighted works, which it can't be.

[-]

kintar1900@reddit

Taking a copy of something and incorporating it into another work is basis of a derivative work.

You're still missing the point of the parent comment. For your statement to hold, you must have recognizable copies of the original work taken verbatim from the source. Creating a "new" movie by cutting scenes from Jaws and Back to the Future together would be copyright infringement (if we ignore parody law). Parent comment's point is that taking Jaws and Back to the Future and producing movies with the same scene structure, color palette, or character arcs is not copyright infringement, and is a much closer example to the way generative AI works.

[-]

rebbsitor@reddit

Parent comment's point is that taking Jaws and Back to the Future and producing movies with the same scene structure, color palette, or character arcs is not copyright infringement

This is a misunderstanding of copyright law. Creating a derivative work without the permission of the copyright holder is itself copyright infringement. The fact that the starting point is a copyrighted work that you're modifying means this is a derivative work and is copyright infringement.

[-]

Xyzzyzzyzzy@reddit

That's not actually how copyright law works in the US, though. Copyright only protects certain types of creative work from reproduction. It's not a blanket prohibition on all derivative works.

For example, if you write a cookbook, you hold copyright on the words that you wrote in the cookbook. You do not hold copyright on the recipes themselves. I can't copy-paste the text from your book to my book, but I can write instructions for making your split pea soup in my own words. You have copyright on the text and images, not the process of making the soup.

If I publish a paper describing a new sorting algorithm with improved performance under certain conditions, I have copyright on the paper and the code in the paper. I do not have copyright on the algorithm itself, because algorithms are not protected by copyright.

There's a well-understood process for your company to use the algorithm without risking a copyright violation. You give Alice the code for the algorithm from my paper. Alice writes, in her own words, a detailed description of what the code does, without any actual code. You pass that to Bob, who has not read my paper. Bob uses the description to code the algorithm. If I claim you infringed on my copyright, you can show that you didn't. You absolutely copied my algorithm, but there's no general prohibition on copying algorithms. If I wanted to protect my algorithm, I'd have to apply for a patent, which is an entirely different thing.

[-]

rebbsitor@reddit

You're correct that there isn't a blanket prohibition on derivative works. However, if someone makes a derivative work, their rights are only to the new parts they've created. They need permission of the copyright holder of the works they're derived from to legally distribute the derived work. Unless their use of the copyrighted work falls under Fair Use.

Your examples are not derivative works. They're things (ideas, facts, etc.) that were never subject to copyright protection in the first place.

However, what the person I responded to is talking about is a shot for shot recreation of a film generated by AI. There is artistic expression in the shot composition, arrangement of scenes, and the overall narrative that also have copyright protection.

Someone can’t legally take a book/movie and simply retell it, especially if the retelling is too close to the original in terms of plot, characters, and specific language. This would likely constitute copyright infringement because it involves reproducing the original work's protected elements.

Copyright law protects the specific expression of ideas, including the unique plot, characters, dialogue, and overall narrative structure of a book/movie. Retelling a story without substantial changes like summarizing or paraphrasing significant portions of the text can be seen as copying.

However, if someone retells a story in a way that transforms it significantly, adding original elements, or changing the setting, characters, or perspective, it might be considered a derivative work.

[-]

Xyzzyzzyzzy@reddit

I think we just understand this differently:

movies with the same scene structure, color palette, or character arcs

I read that as, like, making a movie "in the style of Jaws" or that is a "homage to Jaws" or something like that, which is clearly permitted. Not literally recreating Jaws from scratch shot-for-shot and scene-for-scene, which is 100% copyright infringement.

Sorry for the confusion!

[-]

__loam@reddit

I think the legal question is whether it's fair use, which is actually more complex than most people assume.

[-]

hackingdreams@reddit

It doesn't matter. If the model spits out code that looks sufficiently close to my GPL'd code because it was trained on my GPL'd code, you essentially created a sophisticated copy and paste machine. You can throw as many fancy terms at it as you like, but it ultimately does not matter how you got to the same damned code.

And folks, boy does Copilot like to generate the same code as it's provided - down to the bugs, comments, and often even copyright notices.

Building a copy machine with illudium unborkable compression technology doesn't matter in the slightest - it's still a copy machine.

[-]

kintar1900@reddit

It doesn't matter. If the model spits out code that looks sufficiently close to my GPL'd code because it was trained on my GPL'd code, you essentially created a sophisticated copy and paste machine.

If you believe this, I seriously hope you've never read any GPL'd code, and then written you own code that does something similar. By your own logic, being exposed to the GPL'd code has altered your training set, and given you the capability to produce other code which performs the same or similar function, and you are therefore violating the GPL terms.

[-]

hegbork@reddit

Yep. That's how it works. Have you heard about cleanroom implementations?

Btw. It was Microsoft who threatened the entire industry around 20-25 years ago when the code for NT was leaked. Anyone who looked at it would taint all their future work. At that time every open source operating system purged committers who admitted publicly to even breathing in the direction of that code.

[-]

__loam@reddit

I mean that's literally the legal reality right now lol.

[-]

accountForStupidQs@reddit

The how should absolutely matter, lest we say monkeys with typewriters are prima face copyright infringement because they may eventually produce the works of Agatha Christie

[-]

kintar1900@reddit

I think you're getting downvoted because you used prima face. It's pretty obvious nobody arguing against LLMs in this thread actually understands legal reasoning, much less the way LLMs actually work. :/

[-]

EveryQuantityEver@reddit

No, I think the one that doesn't understand is you. Mainly because you keep thinking that the "How" of how LLMs work is enough to paper over the "What" of what they're doing when it's copyright infringement. Its like saying that courts can't do anything with Bitcoin because "transactions are immutable!" and thinking that the court will just shrug its shoulders.

[-]

sparr@reddit

how do you prove that Back to the Future was one of the masks

If there exists any short* prompt that gets the model to reliably reproduce a clip from BttF, that is probably sufficient proof.

* too short to uniquely describe BttF

[-]

giltirn@reddit

Excellent answer!

[-]

purleyboy@reddit

An LLM is fundamentally a neural network. Each node (neuron) has an activation value and output weights. These numbers (and the node connections) are refined and adjusted with each piece of training data. The continual refining means that the end network is not a representation of any one piece of training data, but of all pieces of training data effectively overlaid. So, you generally will not get the training data back as output from an LLM. Compression is all about maintaining maximum original information (minimal information loss) with minimal storage. LLMs are not good at this. You may get an LLM to output a very small piece of code that is identical to training data, but often times this is because there are limited ways to perform a simple piece of logic. The actual training data is not stored as a facsimile in the LLM.

[-]

sonobanana33@reddit

Can you explain how it can output verbatim stuff then?

[-]

purleyboy@reddit

Here's a great article for you on memorization. This is the exception that is being addressed in future training techniques.

[-]

sonobanana33@reddit

This is the exception that is being addressed in future training techniques.

I think the lawsuit is on what has been done, not what might one day be done.

[-]

purleyboy@reddit

Yes, the NYT went on a determined hunt to find an instance of memorization. It took the firm they hired >10,000 prompt refinements to get a result they could use as the basis of the lawsuit. We'll see how that plays out. However, back to the technicality of it all, this is absolutely the exception and not the norm.

[-]

sonobanana33@reddit

10,000 prompt

Surely microsoft has more than 10k users? At 1 prompt per day… at least one user per day is violating :)

Doesn't sound that impressive if you put in perspective the fact that copilot has more than one user.

[-]

purleyboy@reddit

I'm not sure if you're serious or not. The 10,000 prompt refinements were not 10,000 random prompts but using very sophisticated techniques to attempt to essentially jailbreak the LLM and find an example of memorization and then continue to refine the prompt until they could get output as close as possible to training data. I haven't seen the prompt that was used but I've read that the prompt itself is going to be used to defend OpenAI. It may be such a contrived prompt that it works against NYT in court. We'll have to wait and see the case play out.

[-]

sonobanana33@reddit

It may be such a contrived prompt

So you actually have no idea of what the prompts are. Perhaps you violated copyright repeatedly yourself and are unaware of it?

[-]

purleyboy@reddit

As far as I'm aware the case evidence is not yet public, so we'll have to wait.

[-]

drekmonger@reddit

Case evidence is public, as I indicated in another comment:

Here's the complaint: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf

Here's exhibit J, as mentioned in the complaint:https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dkt-1-68-Ex-J.pdf

[-]

purleyboy@reddit

Thanks for that. I am guessing that OpenAI's argument will be based off using something like the ACR measure to demonstrate that it is unlikely that a typical prompt will expose incidents of memorization.

[-]

sonobanana33@reddit

we went from "it's impossible to reproduce the input" to "it's difficult"

[-]

purleyboy@reddit

It's a pretty big topic worthy of multiple Phds' worth of study. A gross over simplification is that it's like a human brain. I can read a book and give you a good synopsis but not a word for word replication. However, once in a while I may have memorized one thing word for word. In general (general being the key word), it's impossible to get the source training material out of an LLM. But there are exceptions.

[-]

sonobanana33@reddit

it's like a human brain

Remember that in r/programming people are likely to have taken ML and AI courses at university.

In general if something appears many times in the training data, it's probably very likely to be reproduced.

[-]

drekmonger@reddit

So you actually have no idea of what the prompts are

We do know what the prompts were.

Here's the complaint: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf

Here's exhibit J, as mentioned in the complaint:https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dkt-1-68-Ex-J.pdf

Essentially the prompts are asking for completions. The "prompts" contain no instructions. They are just start of the article and implicitly GPT-4 will attempt to complete it.

Out of something on the order of 10,000 tries (unclear if that's per prompt or all prompts in total), the model outputs something close (but not completely exact) to the source article used in training.

[-]

Swoop3dp@reddit

It's not really compression. LLMs don't reproduce the training data word by word. (unless the model overfits the data, which you want to avoid)

lossy image compression also dosent reproduce the image pixel by pixel

[-]

GlitteringFriggit@reddit

Some if the queries I've made to claude have returned 200+ line verbatim copies of code (including all the original comments and everything). And to be clear I wasn't trying to get it to return copied code, these were just random queries. I only noticed due to the "humanness" of the comments, searched online and found the exact code from a 13 year old stack overflow post.

[-]

liveart@reddit

First of all: it's not compression. At least not in any meaningful sense of the word. The comparison is like calling deleting 99.99% of a file 'compression'. The bits just are not there. Second I'm not sure why people don't just... look up copyright law. You don't need to get that far into it to reach the fair use portion on wikipedia and find:

Amount and substantiality of the portion used in relation to the copyrighted work as a whole: Courts look at both the quantity and quality of the copyrighted material that was used. Using a large portion of the copyrighted work is less likely to be fair use. However, courts have occasionally found use of an entire work to be fair use, and in other contexts, using even a small amount of a copyrighted work was determined not to be fair use because the selection was an important part—or the "heart"—of the work.

In no way, shape, or form is a 'substantial' portion of any of these works included in the model. Given the model sizes and the amount of data used if it were any type of compression it would be the most impressive compression in the world.

But lets say you're still not convinced. What else can we find under fair use?

Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes: Nonprofit educational and noncommercial uses are more likely to be fair use. This does not mean that all nonprofit education and noncommercial uses are fair use or that all commercial uses are not fair. Instead, courts will balance the purpose and character of the use against the other factors below. Additionally, "transformative" uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.

AI Models are by necessity transformative. Even if you want to torture the definition of 'compression' to try to make it fit the model is still massively transformative. An AI model looks and functions nothing like the original works. Two out of the four major standards for fair use are inherent in the creation of AI models. Ironically the stronger argument against fair use is the effect on the market/value of the copyrighted work. There is a real risk of AI models pushing down demand for the works they trained from, however it's tougher to prove and no one wants to just say "I just want money" so instead we have these threads where people try to torture the definition of compression and don't just look up what is generally considered fair use.

[-]

Ictogan@reddit

You say that, but the currently best performing compression algorithm in the Large Text Compression Benchmark(compressing a subset of wikipedia) is a program that trains a transformer model on the data http://www.mattmahoney.net/dc/text.html#1072 . So transformer models can very well contain exact copies of their training data, to the degree that they can even be used as a lossless compression algorithm.

[-]

Chii@reddit

But if I take the latest Avengers movie and compress it, it's still copyright infringement even if there is no facsimile of the movie

but the compressed content cannot be used to make a different move than the avengers. Therefore, the primary use is to infringe copyright.

With LLM's on the other hand, there's a case to be made that it can be used to create new works that don't infringe at all. Just because the LLM potentially contain some encoded form of the training data is irrelevant - the digits of pi also contain that same information, and yet you do not get to claim that people who use pi are infringing.

LLM is distilling information out of a large body of works. This information cannot be copyrighted - in the same way that a recipe cannot be copyrighted, only the particular expression of it. Somebody else can take the information of the recipe in a cookbook, and reproduce it in their own expression and it would not infringe (nor should it).

[-]

Syxez@reddit

Just because the LLM potentially contain some encoded form of the training data is irrelevant - the digits of pi also contain that same information, and yet you do not get to claim that people who use pi are infringing.

I think what ultimately matters is the end result. Pi contains the Avenger movie, but someone cannot get it without knowing the whole exact movie in the first place. What would be illegal is providing a pointer to the movie in Pi, this would allow anyone to create the movie using a very simple program.

The cases the were dismisses by the judge were dismisses on the ground that the similarity between generated code and code trained-on wasn't high enough. In the case of LLMs, you don't need any pointer to get the data, your search prompt is already integrated in token form and linked internally to the data you search, however, current LLMs do not encode the vast majority of the data with enough accuracy for the data to be recreated similar enough.

This is bound to change of course if training on the data becomes more intensive in the years to come. There are already cases of techniques where you train significantly more on "outlier data" to better integrate it, resulting in sometimes 100% accuracy recreations, like the exact ascii-art recreations of some current models.

[-]

quetzalcoatl-pl@reddit

just a tongue-in-cheek note:

Pi contains the Avenger movie, but someone cannot get it without knowing the whole exact movie in the first place. What would be illegal is providing a pointer to the movie in Pi

Actually, there's a nonzero chance that the data that forms the-pointer-to-the-position-of-Avengers-in-Pi is of similar bit-length to a decent copy of Avengers movie ;)

[-]

hackingdreams@reddit

With LLM's on the other hand, there's a case to be made that it can be used to create new works that don't infringe at all.

Except, they do, because they're made from the original content. I know this is hard folks, but if you take a movie, chop it up into little pieces, re-arrange it, and post it on YouTube as your work, guess what's going to happen to your YouTube channel? It's going to get flagged as copyright infringement, because odds are really fucking good you just committed copyright infringement.

You're never beating the allegations as long as you admit that some amount of the source work is being duped into the finished content, and the plaintiffs have plenty of evidence of that happening. Hell, so does anyone - ask your favorite LLM to generate a fast inverse square root algorithm, and 99.99999% of the time it'll spit out the Quake III algorithm, complete with a fucking GPL header.

[-]

Chii@reddit

because odds are really fucking good you just committed copyright infringement.

that's because you're just describing something akin to format transcribing. It has nothing to do with the operations of an LLM, as what copilot might output.

It's not to say that copilot cant produce infringing content - it's a case by case basis, based on what the user prompts copilot to do.

But i would make the general claim that the LLM in copilot itself (the neural weights) are not themselves infringing. Just like i would not infringe by distributing digits of pi. If someone points to a certain index of pi digit, and say "here are the avengers movie", then they would be infringing.

[-]

mccoyn@reddit

Just like i would not infringe by distributing digits of pi. If someone points to a certain index of pi digit, and say "here are the avengers movie", then they would be infringing.

This analogy doesn't hold up. The digits of pi are in no way derived from the Avengers movie. But, the neural weights are derived from the copyrighted code. If you left that code out of the training data, you would get different weights.

[-]

WaitForItTheMongols@reddit

but the compressed content cannot be used to make a different move than the avengers.

Sure it can, you just need a different decoding algorithm.

In the simplest case: I can take the Batman movie, XOR it with the Avengers movie, and get My Secret Algorithm.

Throw away the Batman movie, pretend it never existed.

Now, if I take My Secret Algorithm, and XOR it with the Avengers movie, I can get a totally different movie! The existence of My Secret Algorithm proves that the data in the compressed Avengers movie can be used to recover any movie of your choosing. If you use the same decompression algorithm as was used for compression, you get Avengers, but if you use a different algorithm, you get any other movie.

[-]

Chii@reddit

XOR it with the Avengers movie

but you would have to first infringe here, because this algorithm would be a derivative work of the avengers movie already. It has nothing to do with the data in the compression any more.

your proposed scenario is as if your prompt in the LLM is the actual infringing content.

[-]

WaitForItTheMongols@reddit

I'm not saying that's what you would do.

My point is to construct a logical proof that, because we can use XOR to generate Such An Algorithm, then Such An Algorithm exists. We could arrive at Such An Algorithm in a potentially non-infringing way as well. But the question of "Can you compress one set of data with one algorithm, and decompress with another algorithm, to recover a different set of data" has an answer that is emphatically YES.

[-]

sonobanana33@reddit

Is a fan cut of the avengers not infringing?

[-]

Chii@reddit

but that's not what the LLM is doing - it's not cutting pieces of the works from the training data and rearranging it, unless you go to the level of names or tiny snippets of code.

[-]

NickWalker12@reddit

It's going to be hard difficult, possibly impossible, to 'prove' a piece of code was used in training.

Laws can be passed requiring LLMs to make available publicly the full database of training data (losslessly compressed), which the LLM can be falsified against, as well as the associated license for each piece of media. Fair compensation can be given for the distribution cost incurred to the company.

But I'm honestly more shocked that this isn't already law in the EU, given:

LLMs can reproduce source data.
GDPR gives EU citizens the right of erasure.

What this is really getting at is that fair use laws have, for a very long time, allowed for much more egregious uses of copyrighted material than AI participates in. Many of the smaller, independent artists who have been the most vocal about being anti-AI are, themselves, far more guilty of re-use than the AI is. Even if we were to re-write laws to accommodate AI, we're not going to be able to find a balance that satisfies that group. Either we allow AI to re-use infinitesimal portions of copyrighted material, or we prevent much of what is currently protected by fair use.

[-]

loup-vaillant@reddit

The underlying vectors that form the core model of an LLM do not contain any facsimile of the training data.

And we know this how, exactly?

I can take a picture and encode it into some giant QR code, and while the QR-code itself will look that it does not contain any facsimile of the original image (it certainly won’t look anything like the original image), there’s no doubt all the information is there, and that’s what matters here.

Can you actually say the underlying vectors do not contain loads of information from the training data? I think not, those vectors are mostly a representation of that data. A lossy one for sure, but if it did not contain information from the training data, what would even be the point of that training?

It's going to be hard difficult, possibly impossible, to 'prove' a piece of code was used in training.

It is possibly just as hard to prove that a human has read a particular novel, or learned from a particular piece of code. The fact remains that if I reproduced source code from memory and distributed it as if it were my own, I’d be infringing copyright.

[-]

simon_o@reddit

It's going to be hard difficult, possibly impossible, to 'prove' a piece of code was used in training.

That has never been an obstacle when it's been the big corps suing the little guys.

[-]

purleyboy@reddit

Code copyright laws are a mess. Technically speaking, all code snippets on StackOverflow fall under CC by SA, meaning everyone here who has ever cut and pasted from StackOverflow should be publishing an attribution for each code snippet. No one does this. It's impractical. CC by SA also has copyleft implications. In fact, from a commercial sense, you may be better off telling your developers to use github copilot over using StackOverflow to get snippet type code generated because there are (currently) no legal issues with the generated code. Specifically, there are no copyleft issues.

[-]

hackingdreams@reddit

Technically speaking, all code snippets on StackOverflow fall under CC by SA, meaning everyone here who has ever cut and pasted from StackOverflow should be publishing an attribution for each code snippet. No one does this. It's impractical.

I've been at companies that very much track this, and add it to their open source disclosures. There's a whole industry built around tracking usage of open source code, making sure that license attribution is done correct, and making sure your products aren't in violation.

In this case, the burden to detect and complain about this copyright infringement lies on StackOverflow, as ultimately it's their copyright being violated by these offenders. StackOverflow apparently doesn't care, as it hasn't sued anyone over it yet. Therefore, this entirely faction of the argument is moot.

In fact, from a commercial sense, you may be better off telling your developers to use github copilot over using StackOverflow to get snippet type code generated because there are (currently) no legal issues with the generated code.

Except, you know, the very likely possibility that you get sued for copyright infringement right after Microsoft loses this case. The amount of emergency lawyering that would have to be done, the ridiculous degree of code auditing, the fucking five alarm panic this would cause at most companies means that this advice is the worst fucking advice I've literally ever heard. "Might as well commit copyright infringement, it's not yet been settled it's copyright infringement." No fucking thanks.

[-]

purleyboy@reddit

I am involved in a lot of M&A and am very familiar with the various OSS scanning tools and the legal implications. First, Github indemnifies paying customers for any law suits, but I understand if corporations find this unattractive.

Second, the copyright laws in code are ineffective today. As you mention, StackOverflow doesn't pursue anyone for violating their license. Legally though, companies should care about code being pasted from StackOverflow just as much, or possibly more, than code generated from Github Copilot.

I've used Blackduck, Snyk and Mend plenty of times. In over 50 M&As I've been involved in every single company has OSS license violations in their code base (typically lack of attribution). We fix it post acquisition, but I see enough to know how bad our industry is.

The reality is that most companies have violations on OSS licenses due to developers adding libraries or cutting and pasting code without any oversight. Right now the same is happening with GAI. Whether your company allows it or not, guaranteed your developers are using it.

[-]

balefrost@reddit

As you mention, StackOverflow doesn't pursue anyone for violating their license.

I don't think that's correct. The copyright lies with whoever contributed the code to SO. The original author licensed their contribution to SO as CC BY-SA, and so SO has to distribute it under that license.

The original author could dual-license their content, for example charging companies to use it commercially, releasing it under a proper open source license, or (in some jurisdictions) releasing it to the public domain.

I think the only person who could pursue such cases would be the original contributor. Most contributors don't care - they intended their work to be used freely.

I agree, though, that it's a legal minefield.

(Incidentally, I couldn't reply to /u/hackingdreams or even to your grandparent comment.)

[-]

dysprog@reddit

I've been at companies that very much track this, and add it to their open source disclosures. There's a whole industry built around tracking usage of open source code, making sure that license attribution is done correct, and making sure your products aren't in violation.

I had a company track me down and beg me to grant an MIT License for a code snippet I posted on stackoverflow.

Apparently the legal department had been looking for me for a few years and somehow found me on Facebook of all places.

(No idea why it took them so long, I use my legal name on SO, which is fairly google-able. There are about 5 people by that name, and it's also my gmail address)

I considered squeezing some dollars out of a corporation for the principle of it, but decided I'd rather not give payment related information in case it was some absurd scam.

[-]

kintar1900@reddit

This is the most cogent comment on this thread. Thank you.

[-]

KSRandom195@reddit

Specifically, there are no copyleft issues.

No copyleft issues so far.

[-]

purleyboy@reddit

Well, the US copyright office ruled last year that you cannot copyright AI generated content, unless a human has been significantly involved in refining the content. So, at the moment, technically you cannot copyright any raw code generated by a coding assistant. Which leads to a whole other discussion about what in the future is copywriteable and what is not.

[-]

Worth_Trust_3825@reddit

Corpos go at each other over code snippets. Check out IBM v Microsoft over public api copyright, and Oracle v Google over java usage in android, where the case hung mostly on google copying an indexof function.

[-]

I'm going to bet Microsoft settles the licensing issue if they can. I don't think they have a leg to stand on.

However this decision already sets a good precedent, with the judge agreeing that AIs don't copy code... which is true, they really don't.

You have more danger from a junior dev just copying code from Stack Overflow, than an AI writing something similar to something else.

[-]

Gubru@reddit

Hard disagree. If massive corporations can train on your data, then you can train on massive corporations' data.

[-]

nightcracker@reddit

A new specimen in the wild: the temporarily embarrassed data broker.

[-]

FatStoic@reddit

Your data: In public github/gitlab, publically accessible.

Their data: In private github/gitlab/homerolled equivalent, only accessible by them.

[-]

Gubru@reddit

Their data: every book, movie, tv show, and other piece of media ever published.

[-]

DarthNihilus@reddit

My data: All of that and a private gitea instance

[-]

lngns@reddit

Microsoft's official stance is that your code is free-to-use, but code by "entities with lawyers" is not.

[-]

amadvance@reddit

Yep, this is a huge blow to copyright. It's really good news.

It's unfortunate that it helps Microsoft, but we should keep the long-term goal in mind.

[-]

sonobanana33@reddit

Good luck when they sue you.

[-]

exodusTay@reddit

if this goes thru i hope someone trains an llm with all the leaked code online and makes it public.

[-]

otherwiseguy@reddit

This is literally how reading a textbook works. I do not cite all of the books I've ever read on programming when I write code. It's just how learning works--human or AI.

In general I'm ok with the training on whatever is publicly available as long as the output of AI cannot be copyrighted (and it can't). It seems like a decent trade off.

Copyright has absolutely no legal basis to prevent training. The only potential violation would be the produced output. I believe that the output snippets are too short to be copyright violations. All works are copyrighted by default upon creation, but that does not mean that every subset of that work can be copyrighted. There’s a minimum length and complexity requirement that’s poorly defined by US law.

You cannot ask an LLM to reproduce the training data as it literally does not have it

Surely you mean something other than what you said. LLMs would be useless it they could not reproduce the training data. All they do is learn the distribution of the training data.

At the end of the day all of this is more like lossy compression than anything else and that is how I expect the law to eventually treat it. Just because your JPEG has been encoded 100x it doesn't change where it came from.

[-]

purleyboy@reddit

A good analogy is to think of LLMs as storing concepts rather than raw data. When you hear about LLMs being built on vector databases think of tokens as living in a multidimensional vector space. Each token, and higher order sets (words) are then associated with others based on how close they are together in the vector space. So 'cat' and 'dog' may be very close in a sub vector space for 'domestic animals'. By having billions of dimensions we get emergent properties from LLMs that we are still trying to understand. It's mind blowingly amazing. Once in a while we get unexpected behaviors (e.g. hallucinations). We also sometimes see evidence of memorization, but this is unusual and is an area of study to remove it.

[-]

a_marklar@reddit

A good way to think of LLMs (and NNs) is lossy compression. Get garbage like concepts, hallucinations, etc, etc out of your mind. Ironically, that is all noise.

The concepts are expressed as a stream of tokens using statistical properties. In this way any output is typically non-deterministic. Unlike compression, you'll never get the same output twice.

Where did you hear this? The vast majority of neural networks are completely deterministic by default. They are literally pure functions. Are you thinking of MoE systems like ChatGPT that are not deterministic?

[-]

purleyboy@reddit

Yes, I'm taking about LLMs. Their output is non deterministic. Run the same prompt multiple times, you'll get a different output each time.

[-]

a_marklar@reddit

I just ran the same prompt 10x through llama.cpp with temperature=0.0 and got the same output each time. Which makes sense because that is how it is supposed to work.

[-]

Moleculor@reddit

The trick here is to get an LLM to reproduce a sufficient amount of copyrighted code that a human reproducing such would also be found guilty of copyright infringement.

Even if you actively tell an LLM "please be 100% deterministic", the results that I've seen aren't likely to generate more than short snippets of code from existing sources, if they even generate from existing sources at all (they often don't). And in the cases where they do produce snippets of code that seem to match code that exists elsewhere, the snippets that are small enough that it's reasonable to think that another human might have produced that same code without having seen the pre-existing code.

If we can successfully argue that a human that produced a snippet of code didn't copy it from existing code (because there are only so many ways you can solve problems), then should we be holding LLMs to some impossible higher standard? Eventually you'll get to snippets of code so small that there literally are few, or even just one, method of writing that code. (Example: #include <iostream>.)

[-]

purleyboy@reddit

Ah, yes. Agreed when temperature is 0.0.

[-]

bruisedandbroke@reddit

open source doesn't mean free. they're being trained on code which is licensed under the GNU GPL and other strong copyleft licenses, and the result ends up in closed source software which violates the terms of these licenses

copilot doesn't discriminate! everyone's work gets stolen

[-]

CyberKiller40@reddit

And that part of the lawsuit is still ongoing. The dimissed piece was the one which looked like a hoax to get a DMCA takedown against the whole tool. Licensing issues are still being decided.

[-]

TomWithTime@reddit

I wish we would either make a decision here or give up. It's not ok to use some code as is but it's ok to use it to inspire almost identical code? That seems so practically worthless.

Whether it's ai art or ai code, can't you just say (or as a corp / wealthy person with the money to do so, follow through with) that you training data was actually a human built and legally distinct replica? Based on my understanding of ai that should yield an identical or almost identical model, so the obstacles we're trying to establish are only obstacles to non corporate entities and we aren't protecting the original either way

I understand that's basically been IP/copyright law forever, it just seems so stupid

[-]

CyberKiller40@reddit

When you know how the ai tools work to generate things, you stop worrying. You can get an identical replica, yes, if you specify the similarity factor too close. But in general the content in the neutral network isn't code or anything at all and the nn can't know what it is. It starts as random noise and it's iteratively regenerated to be similar to a source. With enough iterations it will be identical, the point is to stop iterating at a reasonable point.

Another thing is, people attempt to protect/compare too small pieces of code. Similar in the SCO vs Novell case, where most of the copyrighted code was variable decelerations like 'int i;' it's like this here too, basic functions with just a small number of reasonably optimal solutions will end up very very similar even if humans would write them.

[-]

TomWithTime@reddit

That reflects my understanding, I take issue with the human elements in play. We make arbitrary rules that are easy to bend or bypass

basic functions with just a small number of reasonably optimal solutions will end up very very similar even if humans would write them

That's what I'm thinking, I just hope the people who enforce the rules can understand

[-]

It is quite literally learning by any definition of the word.

[-]

joe1134206@reddit

Nope. It isn't conscious. It does what it's told like any computer software. It doesn't understand concepts in true human terms. You wouldn't say your phone is "learning" when you update your domino's pizza app, but AI evangelists have adopted this holier than thou gatekeeping attitude about it that hasn't materialized in reality whatsoever.

[-]

r3drocket@reddit

To me, the distinction is that a large company is going to produce a large amount of profit off of this thing and while that thing does benefit society in general, perhaps it should be priced to produce less profit, or be freely available or very cheaply available to use.

And anytime an AI model starts to take jobs away from people we should start to wonder whether or not that's a good thing or not. It's a pretty big slap in the face, if your data was used to train something that ultimately puts you out of work, benefiting some capitalist and worsening our overall inequality.

If you read more about Adobe's efforts to produce a image generation model that doesn't disenfranchise artist, you realize very quickly that it still disenfranchises artist because it still reproduces artwork very similar to the things that the artists are trying to sell even though they were paid art for training on their art. It effectively allows Adobe to compete against anybody who wants to produce artwork.

[-]

ElijahQuoro@reddit

Does it mean, that if you have read open source code you should never apply ideas and approaches you have seen there? Where is the line here?

[-]

space_interprise@reddit

The line is between ideas and actual code lines, since AI isn't capable of truly creating something, only predicting the most likely next word it can become a issue when you go to more specific and the most likely word is a direct copy of someones code

[-]

travelsonic@reddit

since AI isn't capable of truly creating something

Being creative maybe, being able to make creative decisions, yes (200% agreed), but ... IDK, perhaps it is my pedantic side but "creating" seems like a stretch in the literal sense of "a combination of X elements from places Y and Z didn't exist before, it does now." Surely, when it happens (said combination came into existence) it is "created" whether man or machine made it? Surely it didn't will itself into existence out of thin air.

[-]

StickiStickman@reddit

since AI isn't capable of truly creating something, only predicting the most likely next word

People still say this, seriously? Of course it can. I make Llama generate a poem that hasn't existed before. Just like Github Copilot can also generate new code.

So far the only instance I've seen of it actually reusing code verbatim is for very well known code that's many times in the dataset and they specifically tried to get it to do that, like Dooms fast inverse square root.

[-]

if you're using other people's libraries and snippets, you follow the terms of their license. do it all you want under MIT or the unlicense but other libraries require attribution and for your code to be licensed under the same license.

people do this all the time, it's how you learn. if you're talking about design patterns, it's pretty much the only way to learn, but if youre talking about copying people's source code, it's theft if your software is closed source.

the line has been drawn longer than I've been alive! Microsoft are trying to ride out the AI bubble until it dries out, and relies on their expensive lawyers and the flawed American legal system to escape repercussions.

[-]

andrerav@reddit

if you're using other people's libraries and snippets, you follow the terms of their license. do it all you want under MIT or the unlicense but other libraries require attribution and for your code to be licensed under the same license.

Using a GPL licensed library (in the form of a linked library, package, etc) does not mean you have to license your own code as GPL.

Copying the code (or part of the code) from a GPL licensed library does mean that you have to license your own code as GPL.

[-]

MidgetAbilities@reddit

I think you’re thinking of LGPL. My understanding is using a GPL package will indeed cause your own code to require release under GPL as well (assuming you distribute the code in any capacity to users outside your organization).

[-]

purleyboy@reddit

Distribution is key here. If your code is running on your web server then it is not distributed.

[-]

MidgetAbilities@reddit

For GPL that is correct. But some licenses may define distribution differently. Most notable is AGPL which considers someone accessing the software over a network to be distribution as well. This isn’t a very common license though.

[-]

C_Madison@reddit

Using a GPL licensed library (in the form of a linked library, package, etc) does not mean you have to license your own code as GPL.

It does, at least according to the FSF, who created the GPL That's the whole point of the GPL and why there's a separate LGPL. That's why the GPL is also called a "viral" license. Using it "infects" your code. There are people who see it differently, but as long as there's no case law on it it's probably a good idea to follow what the authors say was the intention. For me this means: No GPL anywhere near my code.

[-]

waterkip@reddit

Yes you do, that is why they also have LGPL.

[-]

zanza19@reddit

Human and machine are different, that's quite a clear line.

[-]

theQuandary@reddit

If I read and memorize GPL code, I'm not free to write it down and use it without permission.

LLMs are notorious for spitting out exactly what they saw (there's been a few big security complaints over them doing this with copied secrets).

Coding assistants are only trained on public and open source code.

100% they are not. MS might claim that's the case, same as Facebook will pretend they don't use private messages and Google doesn't use youtube content, but none of that is true. Access to that content is a "competitive edge" they have over the competition and you can bet they are using that, especially in case of diffused models where it would be hard to "prove" they did it.

On top of that, code being "public" or even "open source" still does not give anyone right to re-use it without proper attribution.

[-]

purleyboy@reddit

I would assume that the legal argument is that the underlying LLM does not store nor have access to any training data. The LLM has been trained just like a human coder would learn by looking at public or OSS code. As to whether private IP is being used for training in an unauthorized manner, that would appear to be an easy case against Github, however, you assert this is happening with no evidence.

[-]

Pharisaeus@reddit

The LLM has been trained just like a human coder would learn by looking at public or OSS code.

If you memorize a piece of GPL code and verbatim reproduce it, I assure you it would violate the license, even though technically you just "learned it" and not copied ;)

[-]

Open Source is not the same as licensed. If you upload your code somewhere, but do not include any lincense document (e.g a license.txt file), no usage rights are granted to anyone (in most jurisdictions). Rights must be granted explicitly. You may learn from it and use your gained knowledge to reimplement something similar, but you may not use it directly. The question is whether AI really learns or just pieces together information. The same question could be asked for humans though, there are no precise legal definitions on what seperates "learning from it and reimplementing something similar yourself" from "copying and rephrasing it". I don't think the problem can be solved by intellectual property laws, especially not when international laws are even harder to enforce than local ones. At this point we should rather think about how to expand the current financing model where a few governments and big companies invest in Open Source software and open standards to something more cooperative and with less political influence from single governtments/companies.

[-]

purleyboy@reddit

I agree completely with your pinion. I think there's a fairly common misunderstanding about how training LLMs work. The training process is similar to how individuals learn from public access of code. The end result of generated code is non deterministic for significant chunks of code. LLMs do not store or copy the actual source code used for training.

[-]

Berkyjay@reddit

Dude, most of us have public repos and share the shit out of our code. We aren't medieval alchemists jealously guarding our work.

[-]

MaleficentFig7578@reddit

We get to use their code too. We get to train AIs on the Windows source code leak with impunity.

[-]

bastardoperator@reddit

Massive corporations also build compilers, build tools and even entire languages… and you have never once attributed any of those tools despite the fact you wouldn’t be able to do dick without them.

This entire case will be dismissed, the other portions of the case have been dismissed with prejudice. I think most OSS licenses are about to be invalidated in terms of financial loses. You can’t say it’s free and priceless at the same time.

Who do you think will benefit if there is more copyright? You, and the 4 cents of royalties you hope to get for your code, or Big Content?

Sadly in my view letting the likes of Altman have at it is the lesser of two evils. At least this way I get to use an AI when it helps.

[-]

Eirenarch@reddit

Joke's on them, I use their tool to write code for me muahahahaha

[-]

umtala@reddit

Apparently nobody read the article.

The court’s dismissal primarily focused on the accusation that GitHub Copilot violates the Digital Millennium Copyright Act (DMCA) by suggesting code without proper attribution. An amended version of the complaint had taken issue with GitHub’s duplication detection filter, which allows users to “detect and suppress” Copilot suggestions matching public code on GitHub.

The developers argued that turning off this filter would “receive identical code” and cited a study showing how AI models can “memorise” and reproduce parts of their training data, potentially including copyrighted code.

However, Judge Tigar found these arguments unconvincing. He determined that the code allegedly copied by GitHub was not sufficiently similar to the developers’ original work. The judge also noted that the cited study itself mentions that GitHub Copilot “rarely emits memorised code in benign situations.”

They claimed that Microsoft was violating DMCA. They didn't provide evidence of this claim, and in fact their evidence showed the opposite of what they were trying to prove. Therefore those claims were dismissed. The copyright infringement claims were not dismissed.

[-]

lookmeat@reddit

And honestly it's the wrong way to fight this. I am surprised that there hasn't been better organization in defining the rights of this. I guess the IP companies with a lot of lawyers already realized and just did contracts behind the scenes.

What AI is doing is very much on the space of "fair use", because they don't generate a copy of the art, but a program that contains inspiration of the art and has the potential to create a copy of it. You know what else has a potential to make a copy? Photographs, but you can't sue camera makers for copyright infringement! Hell you can't even sue the photographer, only whomever publishes a picture containing your IP (and even then there's a lot of stipulations of fair use, a picture containing your work isn't a copyright infringment, but a photocopy of it does).

That said, copyright lets you forbid use of work for certain things without a license. Artists can license their work for ML-use separately of other uses. Basically extend the Creative-Commons, GPL and other popular "free to use, but protecting the artist still" licenses to not allow the use of the art for ML-training. A separate commercial-ML license could be proposed for those uses.

lookmeat@reddit

This doesn’t map cleanly tot he camera metaphor because camera manufacturers give you a tool to reproduce. If you use that tool to ingest copyrighted material you are responsible for that.

You got it right in that it isn't the camera.

The problem with large AI models is not the model itself but the trained models which are already “sold” to the end user with the copyrighted data ingested.

You got it wrong in that the copyrighted data is not contained therein. Say that I read a copyrighted book on how to write better, then I write an novels using the rules and guidelines within that book, to the point that there's examples of all rules within the book, though not the contents itself. Someone could, just reading my novel, learn the same writing techniques without reading the original book. Tell me is that plagiarism?

Like I said in my first example, it does open the question of what is a clean room implementaiton. But lets be clear, the ML model does not contain any work on its own. Just enough knowledge that it could recreate it. If we are going to say that is a problem, it's going to cause a huge problem with writer groups.

So rather it's like Sony investigating paintings and pictures all over the world to define a color scheme (think RGB) and then making a camera that uses that. It has content and knowledge that came from analyzing other works. But is it plagiarism?

So I hope those examples explain to you that the AI doesn't have the knowledge. Otherwise any author who has consumed a copyrighted work and was inspired by it could be liable.

And again, it's not like authors do not have any protection or way to defend themselves. They only need to explicitly state they do not give a license for commercial ML use. And they could try to sue retroactively arguing they never gave that right away to the companies, they just assumed. It'll be a complicated lawsuit, and one that will set new precedents (but this tech is a unique situation) but it actually has a shot.

[-]

meltbox@reddit

Perhaps. I guess I just don’t agree that human and AI interpretation are equivalent. For one thing humans often can claim fair use because we don’t train a human on text to then output text. Humans can train on text, movies, images, emotions, real life experiences etc and then output some combination into a unique text. Human and LLM learning are not analogous and even the creator of neural nets admitted that neural nets shouldn’t have been called that because they don’t really mimic how neurons work.

AI today is also largely single discipline with some glue logic to make them seem to be able to merge models together. Plus some hard logic filters to keep undesired stuff from coming out like say…. black Nazis or copyrighted code with GPL headers.

Also for example color spaces aren’t based off works but rather human perception, maybe behavioral and neuroscience. While works might be sampled to understand what areas of that perception to prioritize they aren’t actually encoded int the color space in any way.

It’s like someone else pointed out. Pi contains a copyrighted movie in it but if you need the source material or information as large as the source material (series of pointers) to decode the copyrighted movie from pi then the copyright infringement is not on PI but rather the pointer which decodes it.

Just like a colorspace doesn’t infringe but rather the representation of the copyrighted work in that colorspace.

[-]

meltbox@reddit

By this definition so is zipping up a movie. Look ma, the bits are different now!

Come on…

[-]

jherico@reddit

No, it's not. Zipping is a lossless bi-direction encoding system. Training a network is ENTIRELY different.

Your analogy is like saying that showing a movie to someone and having them be able to recount the plot is the same as pointing a camera at a movie and recording the whole thing.

[-]

Kinglink@reddit

However, Judge Tigar found these arguments unconvincing. He determined that the code allegedly copied by GitHub was not sufficiently similar to the developers’ original work. The judge also noted that the cited study itself mentions that GitHub Copilot “rarely emits memorised code in benign situations.”

Judge basically says that it's not memorizing code, which kind of leads to it isn't just copying code. (Which is absolutely correct, so at least the judge seems to understand what AIs are doing.)

Copyright infringement is probably going to get settled. I don't see Microsoft trying to go "we trained the AI on stuff that has impermissible copyrights to do so" but maybe they're just mad enough to try that.

[-]

PaintItPurple@reddit

"Rarely emits memorized code in benign situation" is basically the opposite of "is not memorizing code." It means that the software is memorizing code, but is programmed to avoid emitting that code verbatim under normal circumstances.

[-]

Kinglink@reddit

No it's not memorizing code.

If you think a LLM has every single code it's used to train on in it somewhere, that would be literally impossible I used the comparison of a 16 gig model that's trained on 4 billion of pictures. If a each picture was in there, it'd be 4 bytes of data. It's just physically impossible. The same is true for code models. They don't have every piece of code sitting around in it's model.

What is going on is there's weights, and if you have certain weights just right it might recreate the exact input that it trained on. But it's rather hard to do that reliably with out heavily influencing the data set.

[-]

meltbox@reddit

It’s compressing it. Is a zip file no longer the zipped content? Sure. But it contains a synthesized version of it.

Neural nets are just lossy compression really.

[-]

Kinglink@reddit

Again no it's not. how are you taking 4 billion DISTINCT images and compressing it into a 16 gig model.

This isn't r/idiot This is r/programming, you should know that level of compression is literally impossible. Even if you consider it's "Lossy"

[-]

meltbox@reddit

It’s compressing the patterns commonly linked to words. It’s why image generation nets can be trained to output a likeness or cats or a ball. They learn the attributes important to those and compress it down into a series of weights which represent an axis.

That’s the cool thing about neural nets. Turns out you CAN compress that much data. It’s just lossy. Turns out most of that lost data isn’t really that important.

Also they make mistakes hence the occasionally cursed stuff they can output. Hence the lossy part.

[-]

Kinglink@reddit

You can make up any BS you want, but you didn't address anything I said. Explain how 4 billion DISTINCT images are compressed into 16 gigs... Saying "it's lossy" means nothing and in fact proves it's NOT a copy. But even if it's the most lossy thing ever, 4 bytes an image is not a copy in ANY universe.

But to further this a tiny bit. A lot of it is also deduplication. So for example the model learns every picture of a cat has a cat in it and extracts the attributes common to cats. This effectively is lossy compression on the attributes of a cat sourced from all the cats for example.

Which proves it's NOT COPYING but learning.

Nice try dude, but you basically are proving you either are ignoring what you understand about the technology, or just dump enough to think that's copying. Either way you already understand what it's doing... Stop trying to say it's compressing or copying it. You know it's not.

[-]

meltbox@reddit

Copying a likeness is copying. Again I said lossy compression, not lossless. You’re entirely right for lossless. But copying a portion of something is still copying.

Hence how Disney can sue you for drawing their characters even if they’re not carbon copied directly from a scene Disney has rendered or drawn.

I think maybe we fundamentally differ on what we consider copying.

[-]

Kinglink@reddit

No... even your definition of copying is wrong in this case... You still haven't explained how 16 gigs, can represent 4 billion images. You just keep saying compression.

Here's a compression I'll just put a 1 down if there's any green. That's Compression!!!"

Except, again that's not how it works, now stop annoying me with your stupidity and if you really want to have this conversation educate yourself, and then talk to someone else, you've wasted enough of my time.

[-]

josefx@reddit

So going by all this if I "train" an AI on ten thousand movies and it just happens to spit out a perfect copy of Terminator 2 I am legally in the clear, because it clearly could not store all the movies and that it did end up outputting Terminator 2 is just an AI thing that sometimes happens and copyright holders have to accept?

[-]

DeadlockAsync@reddit

That would be so statistically impossible that it would be evidence that you, as the user, coerced the AI to produce the output.

It'd be the same thing as using your phone's auto correct to output a novel. If the novel your phone's auto correct output perfectly matched an existing novel then it was you coercing it to do so, not the phone outputting it.

[-]

meltbox@reddit

Modern keyboard next word prediction would best be served by a LLM by the way. It’s usually a worse version of exactly that.

But also if your keyboard LLM was trained on a book it would be more likely to output that book. IE if you typed out the first sentence yourself it would very likely type out the rest of the first paragraph for you either verbatim or damn close depending on the context window and how much other data it was trained on.

And then are you really arguing that a 90% match of the first paragraph is not copyright infringement?

But keyboards instead use word distributions to predict the next most likely word and have no significant context window.

And this is also one huge reason a model won’t output the terminator in full when trained on it. Context window too small. Plus it’s merging the terminator with every other movie it’s seen and being seeded with random values.

Reddit pisses me off because clearly people don’t know how neural nets work. They’re just efficient lossy information compression and representation. The model can also only extrapolate in dimensions it already has represented by some weight. If the dimension it has to manipulate isn’t represented by some combination of those weights then the model is going to spit out nonsense.

This is why data cleaning is so important because it frees up weights to be used for what you need and not noise.

[-]

DeadlockAsync@reddit

And then are you really arguing that a 90% match of the first paragraph is not copyright infringement?

No, I am arguing that the copyright infringer is the user, not the LLM/system.

I don't think I am the user you meant to respond to either based off the contents of your comment.

[-]

Ur-Best-Friend@reddit

That's a remarkably good example.

[-]

Kinglink@reddit

Have you ever heard of clean room development. In one room a developer reverse engineers a software and then explains how the software works, passing that through a narrow slot to a second room that creates a personal version of their software.

This allows the person in the second room to create a similar bios, potentially the same bios but be able to prove that the copyright wasn't violated and yes IS a defense against a copyright.

Let's ask an alternative thing. Assume that an AI never watched Terminator 2 and spit out a perfect copy of Terminator 2.. are you legally in the clear? What if a group of people did the same thing and was proven to never have seen Terminator 2?

But you're also asking for an AI to copy a 2 hour movie exactly. When code or images are copied it's usually sections of the image, or snippits of code. Not entire files...

So really your comparision doesn't work, what you'd probably see is an AI gets a scene very similar to Terminator 2.. .and yet we have that too in movies, people call it an homage, but how many movies and tv shows have used the Akira Bike slide? About 40 of them can been seen right there. There's not frame for frame recreations, but again you're not going to get a frame for frame recreation at least not without a lot of work.

A lot of work that's going to look more and more like a clean room process when you really start to look at it.

[-]

meltbox@reddit

There’s still a difference between trying to record or create a scene which looks and feels like a Terminator scene vs what an AI will do which is essentially recall parameters of that scene and modify it slightly to not be exactly that scene.

For example if you take a copyrighted work and change some details on top you as a creator will still get sued. So why would AI be allowed to? Because it was harder to tell/prove? That’s an insane argument.

[-]

meltbox@reddit

This. Rarely emits copyrighted material would be like saying Napster is fine if the majority of content was legal if you didn’t add “pirated” to the end of the search.

Stupid ruling on this basis alone.

[-]

Ur-Best-Friend@reddit

Even if that were the case, which it's not, that actually wouldn't be copyright infringment. There are plenty of provisions protecting transformative art. I'm not sure there's a single song by The Prodigy that doesn't heavily employ the uses of samples, sometimes from several dozens of songs in a single one of their tracks. If that was copyright infringment they'd long since have been sued into the ground. Same with collage art etc.

If the end result is significantly different from what it was based on, that's not copyright infringment.

[-]

PaintItPurple@reddit

I can't speak to The Prodigy specifically, but big artists who don't get sued usually will get permission to use samples. Otherwise they can and do get sued. Heck, Robin Thicke's Blurred Lines didn't even use a direct sample and actually re-did the Marvin Gaye song they borrowed from, and they still got sued and had to pay out a bunch of money. Making the leap from "derivative" to "transformative" is not as trivial as you're making it out to be. In fact, artists who are sued this way almost never even try to argue that the use is transformative, and instead will often try to argue that the use of the copyrighted work is too minimal to matter.

[-]

hardolaf@reddit

It kind of is "memorizing" the code though. It's just that because of the trimming process, it becomes less precise and thus less likely to regurgitate the training data.

[-]

Kinglink@reddit

I like the word "training" because that's kind of what it's doing. In a lot of ways it's like a training a person, because if you take a junior and show him the first function he has ever seen.. he can write that function. He probably can't write another function, maybe he's able to make obvious changes (strings for instance) but even then, I remember students who thought that the strings were immutable otherwise the program doesn't work.

Then you show them more and more programs and they'll start realizing that each of the things are pieces that can be combined in other ways.

LLMs aren't (that) inventive, or creative (depends what you call their hallucinations, but those are kind of crap), but go down the same path.

The difference between a human and a LLM is you can teach the human the rules, but to teach an LLM the rules kind of violates the concepts of AI, because the idea is your focused on training data, to generate the internal rule set, rather than specifying the hard coded rules.

That being said, with any suitable training size it should be unable to remember any specific code. Though when you ask for a specific well known piece of code Q_rsqrt, don't be shocked when it is able to recreate it. Again, most people who really focus on it, probably can too, or the one piece they forget(the specific numbers), is what computers are actually good at memorizing.

[-]

hardolaf@reddit

I like the word "training" because that's kind of what it's doing. In a lot of ways it's like a training a person, because if you take a junior and show him the first function he has ever seen.. he can write that function. He probably can't write another function, maybe he's able to make obvious changes (strings for instance) but even then, I remember students who thought that the strings were immutable otherwise the program doesn't work.

The difference being that humans have memory and LLMs don't. LLMs are dumb and never actually "learn" anything. They're just a correlation graph for an instantaneous data point to generate a response based on some parameters. LLMs don't think, they don't actually learn anything even during training. All training is doing is creating a correlation graph function to take some input and create some output. They're dumb and can be trivially made to regurgitate all of the training prior to trimming. After trimming, they have a harder time outputting the training data because a lot of it has been trimmed but you can still often find full copies of training data that can be coaxed out them with the correct input.

Drawing parallels between humans and neural networks only makes them appear to be more magical than they really are which is to say a 40-50 year old math concept that's incredibly dumb and usually worse than just hiring more qualified people to make a better product for you via better methods. LLMs are largely just a way to spend money on CapEx and OpEx without increasing headcount because people management is a lot more complicated than scaling hardware and jobs up or down for the business because management hasn't yet been replaced by unfeeling androids who will casually cut 50% of staff at a moment's notice because they need to make the books work out for them.

[-]

jherico@reddit

The difference being that humans have memory and LLMs don't

I love it when people make points like this when we don't have a very good understanding of how human memory works, other than it's done using biological neural nets.

An LLM's memory is an elaborate encoding of concepts into an N-dimensional vector space, implemented as a set of neural nets. How do you know how similar or different this kind of memory is from that of a human? Are you a pioneering researcher into the way the human brain functions?

Training an LLM and training a human have roughly the same outcome... they can respond to queries in a sensible way based on what they've been exposed to. Arguments related to LLMs vs humans should focus on the differences and similarities that can be quantified. Just saying that LLMs have no memory or that it's fundamentally different from human long term memory is at best an "educated wish".

[-]

meltbox@reddit

We don’t for sure. But go take a look at the Arc AI challenge. It’s fantastic evidence that AIs are pretty much all recall no reasoning. Humans appear to be able to solve far more complex problems than AI with far less known underlying knowledge.

This would imply AI is mostly just a huge search machine with great pattern matching and not at all like how humans typically figure out complex problems.

It also implies they fundamentally do not function like humans as modern GPUs supposedly have similar raw processing power to a human brain.

It’s something about how we are wired vs a simple matrix cruncher.

[-]

Uristqwerty@reddit

I'd have to dig through my watch history, but I saw a video some months ago about how memories are stored (I think in mice, but presumably there are similarities). It seemed similar to a bloom filter in reverse, where all sorts of random factors that happened to be triggered at the moment the memory were created combined into a key, and if some fraction of those factors line up in the future, it recalls the memory. They also slightly change every time its recalled, to better fit the new context, and there was something about how the memory itself is encoded as a sequence of neurons triggering in time. I think there was also something about them playing back in reverse.

That could all be complete misremembered bunk, but the underlying point is that neuroscientists have figured out a lot in the past half-century.

[-]

hardolaf@reddit

biological neural nets

Which are, let's be clear, absolutely nothing like the mathematical neural nets invented 50ish years ago which were named "neural nets" as a marketing ploy and that the author of the paper coining the term would later go on to apologize for naming them neural nets after he learned more about how neurons worked from biologists.

[-]

jherico@reddit

You're reaching. You can say there are important distinctions but ANNs are not "absolutely nothing like" biological neural nets. They have MANY equivalent concepts and ANNs were inspired by a growing understanding of biological neural nets. One of the paper's authors was, after all, a neurophysiologist.

the author of the paper coining the term would later go on to apologize for naming them neural nets

Are you talking about McCulloch or Pitts? Because there were two authors on the paper. Either way, "citation needed". However, even if one or both of them did express regret at the use of the term, they would likely have been specifically talking about it's use for the concepts they described in their 1943 paper, which predates widespread use of computers and many advancements in how we build artificial neural nets. As our understanding of biological NNs has increased, so has the complexity of what we consider a neural net in computing contexts.

[-]

EnglishMobster@reddit

The difference being that humans have memory and LLMs don't.

LLMs very much do have memory.

What do you think checkpoints are? What do you think determines the weights in a model? What do you think sets different models apart?

If it's random chance and it couldn't "remember" anything, then LLMs would never be able to go through the training process. It remembers patterns, and those patterns inform weights in the data.

This is very similar to how a brain works, and that's intentional. Before they were called "LLMs" or "AI", these were called neural networks - because they work like neurons.

Brains are experts at finding and recognizing patterns. We only can do that because of our memory. LLMs also recognize patterns. They, too, can only do so because they have memory.

[-]

hardolaf@reddit

LLMs very much do have memory.

LLMs have a very limited amount of memory and no ability to actually remember things without having them in immediate, short-term memory in running program. So it's more correct to say that they have state but each time you restart the program, they have only the same initial state and no ability to remember anything from prior sessions unless you reload the exact state which is not how memory works in brains.

The topic of long-term memory for LLMs is still very much in its infancy with papers still being published this year on ways to achieve some sort of long-term memory.

This is very similar to how a brain works, and that's intentional. Before they were called "LLMs" or "AI", these were called neural networks - because they work like neurons.

As for being like human memory, there is a growing body of research which is explicitly demonstrating that neural nets (and LLMs by derivation) are fundamentally different from biological brains.

Also, please don't reinvent history as to the naming. The naming of neural nets was because the graph of them reminded the inventor of how neurons appeared to be interconnected in their graph representation structure. That same researcher would years later talk at conferences about how he regretted the naming as further learning on his part had shown him that the two were very dissimilar and that the naming misled a lot of people into thinking that his invention was meant to model neurons in a brain-like structure.

I'm not coming at this from a tech bro perspective, I'm telling you from the perspective of someone who has actually taken college classes on this stuff before it was cool. Handwaving them away as "not having memory" shows a fundamental misunderstanding of how LLMs work.

I've also had classes on it and I spend a significant amount of time working on projects that are "AI" adjacent or using "AI" and attending conferences around the hardware development of "AI" accelerators. I don't fault you for getting these things wrong because there is a lot of bullshit out there from professionals and even universities competing for that sweet, sweet "AI" money that everyone seems willing to shell out these days. Heck, you fell into one of the common misinformation tropes around the very origin of the naming of neural nets themselves. And it's not hard to get that misinformation as it's been paraded around for decades at this point even though everyone can go read the original paper and the author's rationale for free at any time.

[-]

Ur-Best-Friend@reddit

I never bought the argument that training AI is in any significant way different from a human learning from other humans. If you have an artist you like and you pick up some of their techniques through looking at/listening to their work, are you infringing on their copyrights? Well then every artist that has ever existed is a dirty thief, because no human has ever learned any complicated skill without learning from others that came before them.

As long as the output is significantly different, that's not copyright infringment in my book, and that goes for humans and AI both. People throw a huge fit when an AI song has a chord progression that's superficially similar to a human musician's song, and ignore far more blatantly similar tracks made by other human musicians.

[-]

meltbox@reddit

It is literally memorizing the code. Encoding it symbolically in a compressed format doesn’t make it any less of a memorization.

It’s repeatedly been shown that outside the training dataset AIs are less capable than humans with fall lesser memory retention to recall.

Essentially AIs are just incredibly good recall and pattern matching machines. Nothing more.

[-]

angryloser89@reddit

Judge basically says that it's not memorizing code, which kind of leads to it isn't just copying code. (Which is absolutely correct, so at least the judge seems to understand what AIs are doing.)

It sounds like you don't understand what AI is doing?

[-]

Kinglink@reddit

Go on.. tell me how it's copying code, I love hearing people explaining technology they don't understand.

[-]

emperor000@reddit

If you looked at that code to use as a basis of work you were doing, would you be copying it?

[-]

Kinglink@reddit

This is the question but if you learn from a piece of code. No you wouldn't be copying it. If you wrote the code line by line the same, that would possibly copying it, though size of it definitely will matter.

There's a workaround from almost every company I've heard you're not supposed to copy and paste code from stack overflow but rewriting and making it fit the style and guidelines of the code is acceptable. Seems reasonable.

Do people follow that, I think we know the answer is hell no, but if I tell you to print out the line "Hello World". There's only a certain amount of ways to do it, and I'm pretty sure the idea of a copy right on that is laughable.

[-]

st4rdr0id@reddit

Yeah, change a single character and now it's not identical, basically what pirates do with republished apps. Besides, what does "rarely" means here? A ton of testing cases with proper metrics would be needed to scientifically make that claim.

[-]

meltbox@reddit

Yeah this ruling is another judge with no idea what they’re talking about making a ruling which will haunt us for a century.

[-]

FollowTheSnowToday@reddit

This is Reddit not a Wendy's. Of course no one read it.

[-]

RyanCacophony@reddit

love the implication that the patrons of Wendys are such ardent academics lol

[-]

FollowTheSnowToday@reddit

I've always liked the meme-ish thing:

Ah, sir, this is a Wendy's.

There is no outstanding copyright claim, all the 1202(b) claims which would be "copyright claims" were dismissed.

What's left is a breach of contract claim for the open source licenses, that the AI models would be legal if they followed the license conditions. For such conditions to be nominally applicable it would first be found that copyright has been violated. Judge Tigar says as much in the June 24 order:

While OpenAI is likely correct that the attribution and notice terms in the Doe Licenses at issue are conditions, this does not impede Plaintiffs’ ability to bring a breach of contract claim.

...

Accordingly, the Court declines to read in a requirement that a plaintiff must bring suit for copyright infringement in the event of a breach of condition.

Plantiffs are allowed to bring the claim, but there's no likely path to victory here.

Also these articles are always trash, read the actual ruling

[-]

pixel_of_moral_decay@reddit

Which makes sense.

If you compare code you’ll find people regularly the same thing since we’re solving the same problems with the same tools, and all learned the same patterns.

The algorithm was trained on data, which is not subject to copyright, presentation is. As copyright exists today they don’t need the authors consent for that. It’s no different than you reading a book than speaking about what you read or applying it to your next project. You don’t need permission from the author to apply the knowledge only if you quote them or directly reuse their work.

You can read this comment and write a paper influenced by it, that’s not violating any copyright. Only if you quote it beyond fair use doctrine.

Most of these arguments against AI on the grounds of copyright will lose. That’s fundamentally not how copyright was intended to work and not how it’s written.

[-]

VeryDefinedBehavior@reddit

If AI winds up effectively killing copyright as a concept, then maybe AI isn't so bad.

[-]

Spitfire1900@reddit

No one seems to understand that the license you attach to your project does not matter if it’s uploaded to GitHub.

The ToS you agree to by using GitHub allows them to train against your software, even if it’s explicitly denied in your project’s license.

[-]

Luvax@reddit

It may be a ToS violation to upload code where this permission isn't given, but GitHub does not automatically receive the rights they want.

[-]

GrandOpener@reddit

If you have and upload to a GitHub repository, you have agreed to and are legally bound by their policies. At least in the US, I’m pretty sure the act of uploading your code indicates your continued agreement to their ToS.

Remember that code can be dual licensed. Even if you provide a license that prohibits use as training data, your agreement to the ToS can establish a separate license only to GitHub that permits it. Nothing is contradictory about that situation.

If you don’t agree with their ToS, your only real recourse is to not use GitHub.

[-]

Because I don't have any ability to grant it.

That's the problem.

If I submit code I don't own to Github, and then Github uses it in an external project, once the actual owners of the code find out, Github is screwed.

Nope. They'll come after you, because GitHub covered there butt with the agreements you've agreed to.

[-]

Tuna-Fish2@reddit

You don't understand. The actual owners of the code don't give a shit about me. No contract or license exists between me and them. They would much rather sue Github, because Github actually has money. The fact that there is a release from me to Github does not protect Github in any way, except that Github is allowed to turn around and sue me for the amount of money they lost when the actual owners of the code sue them. Except that I don't have that much money.

Otherwise you could find a homeless person who agrees to "sell" some copyrighted work to you while claiming to own it, and when the actual copyright holders come at you, you could just point them at the homeless guy.

[-]

mrbaggins@reddit

They would much rather sue Github, because Github actually has money.

That's nice.

No contract or license exists between me and them.

The copyright of the content you're distributing binds you and connects you to them. Same as I have no "contract" with disney about my DVD collection, I still can't put it on Youtube.

Except that I don't have that much money.

You say that like it magically makes the problem go away.

except that Github is allowed to turn around and sue me for the amount of money they lost when the actual owners of the code sue them.

What would happen is in the opening phase of the lawsuit against github, github will name you as another party, you will both be sued, and github will win their defense because they can prove it's your fault, leaving you on the hook not only for the judgement, but also likely for githubs costs as well.

Otherwise you could find a homeless person who agrees to "sell" some copyrighted work to you while claiming to own it, and when the actual copyright holders come at you, you could just point them at the homeless guy.

Congrats, you invented shell companies/phoenix companies.

Bad news: Veil-piercing (and the equivalent when targeting your "shell-homeless-dude" theory)

[-]

GrandOpener@reddit

Maybe. Certainly you're right that if you are not legally able to grant the rights that GitHub wants, they will not receive them.

But if all those contributions were made through PRs or commits on GitHub, then it would seem that every contributor did agree to the GitHub ToS and did actually give their own individual permission to have their code used as training data. So while you personally can't grant GitHub rights on the whole repository, GitHub still gets what they want.

It's hardly an unachievable goal. You can take your git repo anywhere and there's no shortage of competing host services or project management solutions that integrate with a git repository. Across my career I've only worked with a single company that used GitHub, it's not exactly an industry standard.

[-]

sonobanana33@reddit

I moved to codeberg. They even enabled CI for my account, so I can run CI.

[-]

spareminuteforworms@reddit

Yeah no.

[-]

Halofit@reddit

Yeah yes, actually. Copyright infringement does not require intent. Just because you didn't know you were infringing copyright does not mean you didn't infringe on copyright.

Websites have a specific carve-out for this, where they're not required to pre-emptively monitor for things they host, and rely instead on things like DMCA to resolve issues, but AI training has no such exceptions in law. You are required to ensure you're not infringing on copyright for every single thing you use.

[-]

sonobanana33@reddit

Yeah yes, actually. Copyright infringement does not require intent. Just because you didn't know you were infringing copyright does not mean you didn't infringe on copyright.

Yeah yes, actually, a license that allows redistribution doesn't mean you become the owner if someone redistributes to you.

[-]

Halofit@reddit

Ok, how does that disagree with me?

[-]

Luvax@reddit

Even if you are the copyright holder, you may not be in a position to grant such a license. Additionally, it might be up for debate if the uploader could have reasonably assumed what the specific details of such a license contain. Especially in europe where private entities are usually not expected and upheld to the same kind of legal understanding than companies are.

But sure, this is the US. But the blanket statement I replied to was simply not accurate and "It says so in the ToS, therfore it must be legal" is extreme bullshit.

[-]

Monad_No_mad@reddit

I'm not sure I understand this. How does Github not "receive the rights they want", Their ToS gives them exactly the rights they want and probably even shifts liability to the user if they upload something they did not have adequate rights to.

[-]

__loam@reddit

ToS are not necessarily legally binding if they conflict with the actual law.

[-]

Monad_No_mad@reddit

Yeah but in this case it's pretty clear, the user grants GitHub a license to do certain things to their repo and the user is responsible for making sure they have an appropriate license for what they upload.

If you upload copyrighted material to GitHub it's going to end up being your problem not githubs

[-]

__loam@reddit

These models didn't exist when the ToS was created so I could see there being laws against the retroactive inclusion of user data in training sets.

[-]

Monad_No_mad@reddit

GitHub has the ability to analyze, index, etc... what you upload.

[-]

__loam@reddit

Yes I understand what you're saying, I'm telling you that kind of contract won't necessarily hold up in court.

[-]

Monad_No_mad@reddit

In this case it will because by using GitHu .you

1) allow GitHub to do certain things with what you upload

2) acknowledge that you have the appropriate license for what you upload

It will be your problem, not githubs if you do not have the correct license.

Going even farther, it's hard to complain about rights for anything that's publicly available on the internet. Content has been scrapes, parsed and used for a long time.

[-]

__loam@reddit

Public availability has never represented carte blanche license to do whatever your want. And again, it doesn't matter what github's ToS says if the law disagrees. Just ask any company whose non-competes got voided this year.

[-]

Monad_No_mad@reddit

You are parroting something you don't understand.

If your software licensing is in conflict with githubs ToS then you are the one that will be liable as you are the one that put it there, agreeing to their terms of service.

And public availability is a license for many things, including analysis. This is a fundamental part of how the Internet works.

[-]

__loam@reddit

I mean it's rich to say I'm the one parroting something when you clearly don't think copyright applies to public content on the internet.

[-]

Monad_No_mad@reddit

Just think about it for a minute, how does search work on the internet?

If a model is trained on a website, how is this different then a website being indexed?

[-]

tav_stuff@reddit

Terms of service are not above the law. An analogy would be if I signed a work contract stating that after I leave my job I can’t work for another 2 years. It doesn’t matter if my contract says that — it’s illegal so I can legally ignore it

[-]

sonobanana33@reddit

Yeah and anything I lick is legally mine -_-'

You can state whatever you want… that's now how laws are created.

[-]

teslas_love_pigeon@reddit

May not be how laws are created but it is how corpos enforce their reign of terror. If you're an individual or a SMB do you really want to face the wraith of a trillion dollar corporation that has zero issues spending tens of millions in legal fees to make your life hell?

[-]

crazedizzled@reddit

The ToS you agree to by using GitHub allows them to train against your software, even if it’s explicitly denied in your project’s license.

If a trained LLM like Copilot is not considered a derivative work, then no kind of license is likely going to help you.

It was always a problem, but when one of the largest software companies in the world is selling tools that systematically violate licenses, it makes the problem a lot bigger.

[-]

solartacoss@reddit

i feel the same but with music.

[-]

Saki-Sun@reddit

With the current state of AI code, they seriously need to ingest and regurgitate more of mine.

Judge Jon Tigar’s ruling, unsealed last week, leaves only two claims standing: one accusing the companies of an open-source license violation and another alleging breach of contract.

that's actually impressive

do you not care about the licenses those libraries use? they were carefully chosen, right? why would they allow for them to not be taking into account?

[-]

M4mb0@reddit

Not sure what you mean. In this domain, permissive licenses such as MIT, BSD or Apache are commonplace. I think the people who take offense by copilot are mostly the copyleft crowd. But the solution for them is simple: prove your case in court.

If any entity — and it shouldn't matter if it's a person, copilot, or a thousand monkeys in a cellar hammering on type-writers — produces large snippets that is a clear copy of your code, in violation with the license, then sue. But it seems the people who file these copyright claims are not able to do that:

However, Judge Tigar found these arguments unconvincing. He determined that the code allegedly copied by GitHub was not sufficiently similar to the developers’ original work.

[-]

ZucchiniMore3450@reddit

I agree with you, I am supporting copy left and GPL and against closed source companies.

But this is like I am forbidden to read GPL code and then write MIT license code based on that knowledge.

Maybe our licenses need to catch up with time and add clause about AI.

And about copilot, what did people think when M$ bought github, that it was for altruistic reasons? We all knew they were training models and that's why some projects moved away. They must have consulted dozens of lawyers and judges before doing it, my opinion is not important.

Similar with art and AI, do I need to pay just to see some image? If it appears in my browser and in my disk cache and backup and I didn't forget about it and create art based on that image. Is that illegal? Yeah right.

[-]

PsychologicalStore96@reddit

So, licences are dead ?

[-]

FlukyS@reddit

Well it kind of hits at a weird question which is "inspiration" isn't copyrightable. For instance, I can write a song that is a ripoff of Metallica and I can do so based on my listening of Metallica, in that case as long as I didn't copy a chord progression, lyric...etc word for word it isn't infringing. This is well known and not controversial at all. Copyright comes into play though when an infringing work is released that isn't just inspired by something but directly copying word for word, note for note. In the case of copyrighted code for instance all code can be accidently reproduced a number of times in a number of different places if the problem being solved isn't novel, so it's hard to trace who owns a specific coding pattern. If though for instance someone has a Copilot subscription and it pastes a specific piece of GPL code though and it can be reproduced the works themselves are infringing and would have to be relicensed or removed so most companies that have a legal department won't even allow that conceptually at all. I think the big question here is when someone does infringe on it and is challenged for it in court to comply with the original license will that be upheld and not specifically that Copilot was trained on it because that hasn't been legislated yet.

[-]

ElMachoGrande@reddit

There are programmers who occasionally google a solution to their problem, learns how it works, and then they write their own version of the solution. Then there is another category of developers: liars.

In no way does Copilot reproduce "substantial" portions of any copyrighted work, which is what would be necessary to run afoul of the copyright.

If you're okay with something as long as it's done by a carbon-based brain but not okay with exactly the same thing when done by a silicon-based one, then as far as I'm concerned, you're a luddite.

[-]

__loam@reddit

If you get suggested someone else's code verbatim, that means that you're using the exact same variable names they did, which means that you're writing a common pattern that thousands before you have also written.

This reasoning is bizarre and completely wrong. Microsoft had to put filters on copilot because it was reproducing significant blocks of code licensed with GPL. This is just a fact that we know, you don't need to add weird conditions like having to use specific variable names.

If you're okay with something as long as it's done by a carbon-based brain but not okay with exactly the same thing when done by a silicon-based one, then as far as I'm concerned, you're a luddite.

Well the bar being high

I'm not sure I understand this. How is the bar high? Aren't most cases squashed by the bigger party before they can reach the court statistic?

[-]

FlukyS@reddit

Settled not squashed, the key thing with settled law is if both sides know for a fact something was wrong the settlement will be easier to reach. So it being settled before getting to court is a good thing. The bigger party doesn't matter in quite a lot of jurisdictions and especially it doesn't matter when their usage was provably incorrect. Like copyright cases are normally settled because there is basically no point in taking it to court and spending the money litigating it when both sides know generally going in what the outcome is. That's why it's settled law because there isn't normally any new judicial precedent that needs to be established as most of it has already been done in the 150 years of professional music recording and however long copyright has been litigated.

[-]

spareminuteforworms@reddit

Lets say you've got a 100,000 dollar/year business which could someday could be a 100,000,000 dollar/year (outrageous I know, nobody but big corps are powerful enough to innovate on such scale tell me you haven't worked in crop without telling me), well in your infancy you don't have the super powered lawyer guns to fight your case so your shit gets arbitrarily stolen. Do you think that harboring villains and thieves is ultimately to your benefit?

[-]

FlukyS@reddit

Checkout Busybox lawsuits if you think they will get away with it just because they have money

[-]

Purple-Ad-3492@reddit

Well it kind of hits at a weird question which is "inspiration" isn't copyrightable. For instance, I can write a song that is a ripoff of Metallica and I can do so based on my listening of Metallica, in that case as long as I didn't copy a chord progression, lyric...etc word for word it isn't infringing. This is well known and not controversial at all.

I'd argue that the legal frameworks for these domains differ when it comes to inspiration versus direct copying, in that for music the distinction is more abstract and challenging to prove. But there are cases -

The "Blurred Lines" case dealt with whether certain elements of a song were sufficiently similar to warrant a copyright claim based on a violation of intellectual property rights, which parallels assessing if contractual terms, like those in open-source licenses, have been breached. They were found guilty of infringement and made to pay royalties.

Another example - Olivia Rodrigo credited other artists retroactively for similarities in "vibe" and influence to avoid potential copyright claims after proclaiming her inspirations. The accusation against her for an Elvis Costello lift highlights this nuance, but Costello welcomed the influence, reflecting the more fluid nature in the industry from his perspective.

In contrast, code copyright is stricter. If GitHub Copilot generates code that closely mirrors copyrighted code, it might be considered infringement, as code can be precisely copied. The DMCA claim was dismissed because it couldn’t demonstrate exact replication of the developers’ code, and why the GitHub Copilot case's claims about open-source license violations and breach of contract violations were upheld, as they focus on strict adherence to specific licensing terms, which are more concrete and measurable.

[-]

WaitForItTheMongols@reddit

In the end, an AI isn't a person and doesn't learn like a person or create like a person. It doesn't have inspiration or creative choices. I don't think we can apply human mental processes to how we analyze these models.

[-]

FlukyS@reddit

It doesn't have inspiration or creative choices

Well yes and no, AI isn't an if statement, it is combining training data to do something you ask of it. What you ask is creative work, what it spits out is a combination of various inputs that can be done in ways that are unique. If it can produce any work that wasn't in the training set then you could argue that everything output from AI is in fact creative.

Now the argument that has happened from a legal standpoint and using super established law is can you copyright creative work from AI which you can't unless it has been transformed to the point where it is now your own creative work because only humans can have copyright and the creator or the assignee of copyright in the case of works that were commissioned are the only valid forms of copyrightable material. For example the monkey taking the selfie case, the monkey took the selfie, the selfie was therefore not taken by a human and thus cannot be assigned to the monkey. AI is not human so they can't get assignment either. So it can be considered creative but you don't have the right to protect that work after it is produced if it wasn't heavily changed.

I don't think we can apply human mental processes to how we analyze these models.

Well in terms of copyright you can definitely apply a lot of stuff here because it would be the same for literally any plagiarism case in any creative work. See the Robbie Williams case or the Ed Sheeran case for where the lines can be drawn from a musical standpoint. There is a line, like if I say "gimme fuel gimmie fire give me dababadai" or whatever the lyric is no one would have any ambiguity that it was from Fuel from Metallica, I can change the key but I'd still be infringing on the rights to the lyric writer.

[-]

mccoyn@reddit

If it can produce any work that wasn't in the training set then you could argue that everything output from AI is in fact creative.

FlukyS@reddit

big major do a compilation of all famous artist and sell it, but give no money to orignal authors

If some label is releasing a CD they are required always to license (if they don't have it already) the master recording and pay for publishing rights for both the composition of the music and separately if applicable the publishing of the lyrics of the song. That isn't negotiable. What people confuse in this area is generally that some artists sign deals with large upfront payments as an advance before recording starts that are required to be paid back to the label, they won't generally receive anything until that is paid back. There is also an area too which can fuck people over which is where you get the advance and if you sign with a label that also has publishing they can have a cross-collateralisation clause which would take publishing cash and use it to pay off recording cash. Also there is another way which is if the artist themselves sell off their publishing rights which generally has happened more recently especially for older acts that wanted to cash in before they die basically and in that case they aren't entitled to any money just the purchaser of the rights.

Long story short if someone releases your song in any way or uses your song on TV or in a rally or in a club...etc you have a right to be paid period. There are caveats but those are generally discussed when signing the recording and publishing deals.

"I aggre it’s the same voice, lirycs etc …, but on my tracks list it’s not the name of the people who take me on courts"

Doesn't work like that because records are generally easily traced back to the original sonically and to the time of production. The track on the CD is a copy of an original copy that has traceable elements.

Like you said, it’s not yet on the law, but it’s the first answer we got

I have some very specific code that I wrote that it's clear that it trained on, stuff that no one else has ever written because you'd have to be such a ridiculous dork.

It will literally reproduce the function names and variables that I used.

[-]

wildjokers@reddit

I am skeptical. So you have a prompt that you can provide to get a particular LLM to generate your code verbatim?

[-]

r3drocket@reddit

I write a lot of procedural generation code in openscad for generating organic structures, and I got it to pretty much reproduce the variables and module names I used in my openscad code, it failed to produce functional code, and mostly left the module bodies empty.

I tried this multiple times and one time it reproduced enough code to convince me it clearly has read my obscure openscad code, other times it produced different results.

Not verbatim but it's clearly been trained on my code. Yes I asked for a very obscure use case, but I wanted to see what would happen when I did.

[-]

currentscurrents@reddit

So what's the prompt and what file is it reproducing? Give us the details here so we can try it ourselves.

[-]

wildjokers@reddit

OpenSCAD is very niche and the amount of training data isn't that large. That probably leads to it just not having enough statistical data for it.

[-]

Here is the current USCO guidance on the copyrightability of AI generated material if you want to hear it from the horse's mouth. Naturally there are additional layers of nuance.

If a work’s traditional elements of authorship were produced by a machine, the work lacks human authorship and the Office will not register it. For example, when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology—not the human user. Based on the Office’s understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist—they identify what the prompter wishes to have depicted, but the machine determines how those instructions are implemented in its output. For example, if a user instructs a text-generating technology to “write a poem about copyright law in the style of William Shakespeare,” she can expect the system to generate text that is recognizable as a poem, mentions copyright, and resembles Shakespeare’s style. But the technology will decide the rhyming pattern, the words in each line, and the structure of the text. When an AI technology determines the expressive elements of its output, the generated material is not the product of human authorship. As a result, that material is not protected by copyright and must be disclaimed in a registration application.

[-]