Judge dismisses majority of GitHub Copilot copyright claims
Posted by stronghup@reddit | programming | View on Reddit | 517 comments
jeanbonswaggy@reddit
This isn't good news; massive corporations using your code without credit for profit will never be good
was_fired@reddit
If you read the article it actually is fairly good news.
The most important claim being made is that Microsoft violated open-source / copyleft licenses. That claim is going to go to trial, so the question of whether you can make a closed-source AI model based on GPL code will likely get its day in court.
purleyboy@reddit
It is going to be fascinating to see how this plays out. The underlying vectors that form the core model of an LLM do not contain any facsimile of the training data. That's a strong argument for not being a derivative of OSS. If it is deemed to be a derivative, then it's potentially going to be unenforceable in the long term as the number of LLMs continues to grow at such a fast rate (just look at HuggingFace). It's going to be difficult, possibly impossible, to 'prove' a piece of code was used in training.
meltbox@reddit
What? They do indeed. These vectors are just lossy compression.
It’s like arguing a zip file isn’t a facsimile of the data that was compressed to create it. Or h.264 files aren’t the original movie.
purleyboy@reddit
The underlying technology is neural networks. With each piece of new training data, backpropagation is used to rebalance all of the neural network nodes and activation values. This inherently "overwrites" or "adjusts" any prior data. The result is that the model understands concepts and does not contain the original data (generally speaking - there are some exceptions from memorization). This is not like using ElasticSearch, where you'd literally store the complete text that has been input.
If I read a book I can tell you about the contents and offer opinions, but I cannot give you a verbatim word-for-word copy of the book (maybe a few key phrases). LLMs are similar to this: they learn concepts and general language structure, but they don't store the literal contents of the training data.
Here's some further information. "...ChatGPT does not copy or store training information in a database. Instead, it learns about associations between words, and those learnings help the model update its numbers/weights. The model then uses those weights to predict and generate new words in response to a user request. It does not “copy and paste” training information – much like a person who has read a book and sets it down,..."
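A toy sketch of that rebalancing effect, for the curious (one weight, plain squared-error gradient descent; all numbers are made up for illustration):

    # One weight trained sequentially on two examples: the later training
    # "overwrites" the earlier fit, leaving a blend rather than a copy.
    def step(w, x, y, lr=0.1):
        grad = 2 * (w * x - y) * x   # gradient of the squared error
        return w - lr * grad

    w = 0.0
    for _ in range(100):
        w = step(w, x=1.0, y=2.0)    # fit the first example: w -> ~2.0
    for _ in range(3):
        w = step(w, x=1.0, y=4.0)    # a little training on a second example
    print(w)                         # ~2.98: between the two, neither stored exactly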
meltbox@reddit
I understand what they're saying but entirely disagree. Large models of all kinds have perfect recall, and their training signal explicitly prioritizes closeness to the training data.
While the representation is not lossless it is always mimicking the input data as closely as possible and interpolating between points when not possible.
Go see the ARC challenge and the interviews that Google researcher has done on AI. Current models appear to be purely recall-driven. They don't really have anything akin to what would be considered interdisciplinary reasoning or transfer like humans do.
So my position is that even though the encoding looks nothing like the original data, an approximation of the original data can be recovered from the internal weights and the appropriate input. Therefore you are essentially distributing a lossy version of copyrighted material, which is still not okay. I.e., re-encoding a movie with artifacts and worse quality is still illegal even if entire minute-long segments are missing.
The other issue here is that a human is also capable of copyright infringement, but won't infringe because there are legal consequences for doing so.
But a machine gets no consequences because it’s not human? The argument is absurd even if you assume that LLMs do change data like a human because unlike a human it can’t in any reasonably effective way be restricted from violating copyright.
purleyboy@reddit
Here's an explanation on Wikipedia
hackingdreams@reddit
Not really. The Plaintiffs will ask for the logs, and Microsoft will provide them, or admit they don't have them. Admitting they don't exist looks really, really bad for Microsoft in the light of the law. It's essentially "the dog ate my homework."
This isn't some shifty fly-by-night operation. This is a multi-billion dollar behemoth.
purleyboy@reddit
Sure, for Microsoft. Now go over to HuggingFace and see the ever-increasing number of LLMs being published. We're seeing an acceleration of open-source model weights being published. In coming years it will be almost impossible to regulate/govern these models.
meltbox@reddit
Sure. But an open source model is at least not profiting off this. Microsoft is.
Stealing movies is one thing. Selling stolen movies is a whole other level
__loam@reddit
That's great. Microsoft still may have to go through discovery and have to disclose their training set. If they lose in a big way here, that could make it very difficult for large companies to develop these models. If that makes it so open source models are in a legal grey area, I think that's a good outcome, because it makes it harder for large mega corporations to abuse smaller players.
ConnaitLesRisques@reddit
I feel this is like saying an h.265 stream of the Lion King contains no drawings of Simba.
meltbox@reddit
This. So much this. It’s such a stupid argument.
Just because I could decide to decode it as VP9 and get no output doesn't make it not copyright infringement.
WaitForItTheMongols@reddit
An LLM is ultimately a really fancy form of lossy data compression. You compress the training data into the vectors, and the elements of the material in the training set will come back out.
Any sufficiently advanced compression algorithm produces a bit stream that is indistinguishable from randomness (that's what it means to compress data to the maximum, so that all the redundant material is gone and what remains is purely the entropy of the data).
But if I take the latest Avengers movie and compress it, it's still copyright infringement even if there is no facsimile of the movie. The data doesn't become a movie until I play it back, decompressing and rendering the frames. But when you run an LLM, you get the code back too, and it's just as derived from the source material as my compressed movie is.
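A minimal sketch of the compression point, using zlib and stand-in data (any copyrighted text behaves the same way):

    # The compressed bytes contain no literal facsimile of the input,
    # yet the full input comes back out on decompression.
    import zlib

    original = b"def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)\n" * 100
    compressed = zlib.compress(original, 9)

    print(len(original), len(compressed))           # much smaller, near-random bytes
    assert original not in compressed               # no recognizable copy inside
    assert zlib.decompress(compressed) == original  # ...but it's all still in there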
This will certainly be interesting.
Chroiche@reddit
It's not just decompression though; it combines things. Sometimes the result is vastly different from any of the training data, sometimes it's verbatim. The analogy to movie compression doesn't really hold up apart from verbatim replication.
WaitForItTheMongols@reddit
Of course it's a new thing, but the point is that even if it's combining things, they come from the data it's pulling from. It's all derived works. Even if you can't comprehend how the copyrighted content is baked into the model, it's still baked in and still taking the work of others, usually without their permission.
meltbox@reddit
Agree. This would be like saying taking chicken and beef and putting them both through a grinder together makes it not chicken on the other side.
I mean kind of. But it’s definitely still in there.
kintar1900@reddit
If you're going down THAT route, then any human producing code is also violating copyright, because we produce software based on learning from previous things we've seen. The AI models do that learning differently, and can't make intuitive leaps or reason about specific problems, but at the fundamental level both humans and LLMs are processing, storing, and recombining past experiences for new output.
EveryQuantityEver@reddit
No. These things are not sentient; they do not learn, and they are not people.
Xyzzyzzyzzy@reddit
How do you know this?
Can you propose a test for how we determine whether something or someone is sentient?
Will that test have very unfortunate implications about which people are not sentient and therefore don't count as real people?
EveryQuantityEver@reddit
Because it's a fucking machine. If you're going to try to claim that these glorified autocompletes are sentient, then you're not someone that can be taken seriously.
Xyzzyzzyzzy@reddit
Do you believe in the literal, objective existence of human souls?
Because there are two, and only two, alternatives here:
1. It's impossible for a machine to be sentient, because sentient beings have souls, and machines cannot have souls. It is, theoretically, possible for a machine to faithfully simulate the physical processes taking place in a human nervous system. Since you believe humans are sentient, that means there must be some non-physical difference between an actual human nervous system and a simulated human nervous system; if the difference were physical, we could simulate it. That's the definition of a soul: an intangible, unmeasurable, yet real "special something" that all people possess.
2. It's possible for a machine to be sentient. If a machine faithfully simulates all of the physical processes of a human nervous system, that machine would possess sentience.
"Because it's a fucking machine" isn't a real argument and it's not what you actually believe. Maybe it's meant to show, yet again, that you think all people should automatically agree with you and you despise everyone who has different opinions about things. I got that message loud and clear already, so you can cut the fake emotional bullshit and actually answer the question if you want.
kintar1900@reddit
I made no claim to any of those things. I'm talking about the claim made by the previous poster, and how that logic could conceivably be applied to humans.
EveryQuantityEver@reddit
And I'm saying that you can't do that, because AI and humans are not the same thing, not even close.
uCodeSherpa@reddit
No. This is idiotic. Humans can take what they learned and produce novel behavior. AI cannot.
Moleculor@reddit
Prove, in this world where all art is derivative, that humans always produce behavior that can be described as novel.
Every action I ever take is influenced and inspired by my past experiences, past experiences that are overwhelmingly influenced by other people.
My speech, my mannerisms, etc? Most of them I can identify who inspired them, and major influences on them.
EveryQuantityEver@reddit
That's not required. AI never produces things that are novel.
Moleculor@reddit
Nor do humans, so I'm not sure what your point is.
EveryQuantityEver@reddit
You tried to claim that all human behavior had to be novel for it to count. That's not true. However, while humans can create things that are novel, AI never can.
Moleculor@reddit
Alright, fair enough. I misspoke: I asked you to prove that humans always produce novel behavior, and then went on to describe how none of my behavior is novel.
Prove to me that some of my behavior is novel.
EveryQuantityEver@reddit
Except that's not true. You came up with the ways to combine it, and I guarantee you that you had mannerisms before you were exposed to those people.
Moleculor@reddit
All my mannerisms developed from my interactions with the world. Everything I did was inspired by either seeing other people, or getting feedback from the world around me.
Everything about me is tied to, inspired by, or otherwise linked to things outside of me.
Xyzzyzzyzzy@reddit
Can you give an example of something created by a human that is truly novel?
This would help clarify what "novel" means to you. Currently it's not possible for anyone to respond to what you're saying, because you're using "novelty" in such a vague hand-wavey way that I can only define it as "a thing is novel when u/EveryQuantityEver says so".
kintar1900@reddit
Exactly. The definition of "novel" is INCREDIBLY vague, and no human produces anything that is not derivative in some way of something they have seen, heard, or otherwise experienced.
uCodeSherpa@reddit
Well THAT is a claim that definitely needs some backup.
AI folk really like to spit these claims out of their ass without any semblance of support for them.
Moleculor@reddit
There's a massive amount that has been said on the topic.
uCodeSherpa@reddit
Mark Twain said something about a kaleidoscope of ideas, therefore he’s right. All art is derivative.
This is mentally handicapped dude.
Xyzzyzzyzzy@reddit
You do care who said something, because you obviously don't care about the merits of your own statements.
Moleculor@reddit
If you found a quote from Mark Twain somewhere in those search results, you likely also found links explaining why it was right.
I won't take the time to explain to you what has already been explained. If you refuse to put in the effort, I have no reason to do so either.
jmlinden7@reddit
Generative AI, by definition, produces novel outputs. Maybe bad-quality outputs, sure, but novel ones.
uCodeSherpa@reddit
No it doesn’t.
currentscurrents@reddit
They totally can produce novel behavior, unless you can find me this spaghetti tent on the internet somewhere.
uCodeSherpa@reddit
This is not novel behaviour.
I agree that if the AI can draw something looking like spaghetti and it can draw a tent, then it can draw a tent made of spaghetti.
This is not what anyone is talking about when they discuss AIs doing something new.
currentscurrents@reddit
That’s as new as anything humans make.
Look at fantasy creatures - they’re all just real creatures glued together. A unicorn is a horse with a horn, a gryphon is eagle+lion, a mermaid is woman+fish, etc.
uCodeSherpa@reddit
For the record, a human still made this spaghetti tent. The AI just drew it from a prompt.
But I mean. Okay. Lots of fantastical creatures are inspired by real creatures, and this is proof that AIs work like a human brain? I mean, neurological science does not "know" how a human brain works, so your claiming to know is basically a fantastical creature.
Either way, again, an AI successfully giving a lizard feathers is not novel. It is pretty fuckin cool that even a nobody such as myself can feed an AI a prompt of a fantastic creature and get something back that follows the constraints of what the AI has been trained on. No disagreement there.
However, if you ask an AI to invent a bunch of cool creatures, it can ONLY work within the boundaries of its training. It cannot possibly imagine something new.
Lots of humans lacking imagination (I fully admit I am one of those people) is not proof that AI produces novel behavior.
An AI would not have come up with the theory of relativity on its own, for example.
lelanthran@reddit
Largely irrelevant, because ... scale matters, in law.
Humans aren't reading and absorbing a few billion lines of copyrighted code; the LLM is.
Just like possessing a single joint isn't illegal where I am, but possessing 4000 tons of weed is illegal.
Scale matters. The "it's using copyright just the way a human does, but scaled up a billion times" argument is stupid.
accountForStupidQs@reddit
Scale only matters when a specific threshold is put into legislation. Saying one controversial statement and saying several million are equally legal, and killing one person or one thousand are both illegal. Whether something counts as copyright infringement, until a law says otherwise, will not depend on whether it's done slowly or quickly, once or one billion times.
Full-Spectral@reddit
It does matter, to the people who are being infringed on. If you copy one song, no one is going to bother coming after you. If you copy a hundred thousand songs and put them on a server for people to download, people are going to come after you, because it's now very relevant to them and impactful.
accountForStupidQs@reddit
Legality has naught to do with getting caught or someone suing you. Doing something illegal is still illegal, even if you don't get caught.
Full-Spectral@reddit
But for copyright, it is at the copyright holder's discretion. They own the copyright and can choose to take action or not. No one is going to waste time coming after Joe Blow for copying some code into his project used by 5 people. But a huge corporation, sucking up the entire internet is another issue and people will choose to take action over that.
lelanthran@reddit
My point is that they put those thresholds into legislation because scale matters. IOW, you've got cause and effect backwards - a legislated threshold is the result of scale mattering.
This (slurping up the entire worlds corpus of copyrighted text to derive a new product) is something new, so why do you expect the legislation to be in place for this?
kintar1900@reddit
Regardless of the rest of the discussion, this made me laugh. I've been in software for over 25 years. I'm positive I have read well over a billion lines of code, most of which falls under SOME form of copyright.
mwb1234@reddit
There is no world in which you’ve read a billion lines of code. If you read one line of code per second 24/7/365 it will take you 31 YEARS to read one billion lines. That’s longer than your entire career!
jmlinden7@reddit
An experienced coder can read a bit faster than one line per second
Ictogan@reddit
No you haven't. If reading each line took you one second and you did nothing but read new code for 24 hours each day, it would take you over 31 years to reach a billion lines.
ZorbaTHut@reddit
The actual definition of "derived work" is a lot more restrictive than that, though. US law defines a derivative work as "a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted" (17 U.S.C. § 101), and these are all very direct derived works that include major elements of the original. It doesn't include things like "getting inspired by a novel to write your own novel", nor does it include "learning from analyzing a novel and using that knowledge to write your own novel".
If the compression is lossy enough, it may no longer count as a derived work.
AngryGroceries@reddit
Also it's been stated elsewhere, but "combining" is doing a LOT of legwork here. Technically human learning is just "combining" things that come from "learned data". And in that sense human output is derivative.
While AI models are still nowhere near human levels of learning, it betrays a fundamental misunderstanding of how they work to be comparing them to winzip
EveryQuantityEver@reddit
It really isn't something new, though. I don't see why they should get to claim they need special treatment, as if they somehow are entitled to do what they want.
Xyzzyzzyzzy@reddit
I don't see why we should let luddites abuse the legal system to impede technological advancements to protect their own narrow self-interest, but it happens anyways.
Ictogan@reddit
I'd argue that depending on how much of the original data ends up being stored in the model, the model may be considered to contain a condensation of the original work.
oorza@reddit
I'm not sure this applies here; I think new case law will be written.
In the case of a derived work, we're usually talking about things humans create. AI work can't be copyrighted and currently has no IP protections, so what it outputs doesn't matter, and the question becomes: is the LLM itself a derived work? Obviously not, but all that tells us is that the existing case law can't apply here.
Chii@reddit
but it can contain more than just what got baked in. The argument cannot hold, because this argument does not hold for a brain either.
The books or movies I've read and remembered in my brain don't constitute any infringement. It's only when I deliberately extract the movie out that it constitutes infringement.
Why should there be any differentiation under the eyes of copyright law between a brain and the LLM?
Red_not_Read@reddit
If I publish source code with a GPLv2 license, and you read and memorize it, and then verbatim regurgitate it into your closed-source application, then that's a license violation.
An LLM can be thought of as a container that stores copies of the source code it has seen, and then renders that source code on demand, only without the accompanying license text.
The specific detail of the algorithms and data structures that comprise the LLM, or the precise math that describes the format of the original source copy (knowledge) that the LLM holds is somewhat immaterial.
What's going to matter, I think, is whether Copilot is emitting what looks like verbatim copies of code (like a source code database), or if it can be argued that Copilot is learning and applying learned knowledge, which would not look like an exact copy of previously seen code, but may validly reflect algorithms and data structures previously seen.
It's going to be fascinating.
kintar1900@reddit
No. This is a VERY incorrect explanation of the way LLMs work, and shows a lack of understanding of the fundamental math underlying complex AI models.
If this argument holds, then it also holds for the human mind, because LLMs store data based in large part on the way biological systems store data.
This has been tested in copyright trials before, where two artists or engineers came up with the exact same thing without ever seeing the others' work. It even gets WORSE when you start trying to apply this test to source code, because there are only so many ways to solve a given problem within the constraints of a given programming language and environment. It's not only possible, but highly likely that two human software engineers will produce eerily similar or exact copies of code for a given problem.
EveryQuantityEver@reddit
No, it doesn't, because LLMs are not people.
kintar1900@reddit
And our legal system has such a GREAT record at not applying human-like tests to non-humans ("Corporations are people!") or vice-versa?
I'm not claiming an LLM is intelligent or sentient. I'm talking about the mechanics of the arguments being made for copyright violation in TRAINING data.
EveryQuantityEver@reddit
If you would ever want to sue a company, or hold it accountable for a contract, yes, you would like them to be treated as one.
And those mechanics are irrelevant, but you're trying to say that they learn like people, when that's just not true at all.
uCodeSherpa@reddit
They do NOT store data similarly to biological systems. This is an absurd claim. World-class neurological scientists do not know how biological systems store data, and you're out here stating that programmers have figured it out. Not only that, but that neurological science has seen this and been like "yup. That's it. You guys got it."
Absolutely moronic claim.
kintar1900@reddit
I think you're conflating "know how it's stored" with "are capable of reading the stored information".
Neuroscience agrees that changing the strength and number of connections between neurons, including the level of signal from connected neurons required to cause a neuron to fire, is the core mechanism for storing memory in a biological brain. This discovery is what led to the creation of the first digital "neurons".
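That mechanism is roughly what the first digital neurons modeled. A minimal sketch of a perceptron-style unit, with made-up weights:

    # Weighted connections plus a firing threshold: the original
    # "digital neuron" abstraction of the biology described above.
    def neuron(inputs, weights, threshold):
        activation = sum(i * w for i, w in zip(inputs, weights))
        return 1 if activation >= threshold else 0

    print(neuron([1, 0, 1], [0.5, 0.9, 0.4], threshold=0.8))  # 1: 0.5 + 0.4 >= 0.8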
If I'm wrong, please provide a link because I would love to be corrected and to know what the current science says.
uCodeSherpa@reddit
This tells us absolutely nothing about whether AIs actually model a brain. Yeah, they both have mechanisms for firing at different strengths.
By this logic, guns firing in a war accurately simulate a human brain.
There is a universe of information missing between "how the human brain's neurons work" and "these two systems have a kind of similar way of pinging their brethren".
kintar1900@reddit
Illegal straw-man argument on the field. Offense is assessed a five-yard penalty.
hachface@reddit
This debate-club bullshit is so played out.
uCodeSherpa@reddit
It’s not my fault that your claim is absolutely, demonstrably absurd. Maybe do a little introspection and stop claiming that AIs work the same way that human brains work?
Red_not_Read@reddit
Ugh, how rude. Thanks, but my practical knowledge of LLMs, neural nets, and transformers is just fine, thank you.
It's not necessary to get into the details of the how at this level of conversation. It's the what that matters, and the what is that the network contains licensed source code.
You don't have to painfully explain that it's not in there as arrays of characters, or Huffman trees, or what have you, but as values encoded across billions of nodes across a vague multi-dimensional space. That's the how, and it doesn't matter.
That's fine, and if the LLM emits similar ideas to those used in open source code, then that's fine... but if it emits literally the same blocks of non-trivial code... then I don't know how you can argue that it's somehow not plagiarism.
kintar1900@reddit
Okay, then. I'll avoid further basic explanations.
Thank you, because this is a good example of the difficulty we're having (we as in 'the world') talking about LLMs and what they are or are not. You seem to be of the opinion that since correctly-structured prompts can produce output which exactly mimics the training set, it constitutes copyright infringement to train the model with that data. I am arguing that the way LLMs encode data is SO SIMILAR to the way the human mind encodes data that any legal conclusion which states that the weights and connections of the model constitute a copy of the source material will by definition require that human minds be treated the same way.
One experiment that I'd love to see performed, but which I just don't have the computing resources to perform myself, would be this:
My hypothesis is that it is possible. I think this experiment would put the argument to bed forever. The argument would then turn into whether or not the experiment's prompt generation step was run long enough or correctly enough to produce valid results.
a_marklar@reddit
Reddit fuzzes voting so complaining about -1 is not only weak, it's usually wrong.
I didn't down or upvote you, but when I see someone say this:
I roll my eyes. I'm sure other people would hit the downvote button instead.
kintar1900@reddit
Why does it make you roll your eyes? How am I wrong?
I post these things because I want discussion, including actionable corrections on my take. Unfortunately, what I usually see are just reiterations of the same (typically flawed or massively over-simplified) claims, or random mudslinging.
a_marklar@reddit
Well the truth is that I have a knee jerk reaction to anyone who anthropomorphizes software. Beyond that, the statement is not currently falsifiable so it's actually nice sounding bullshit. Combine it with the language like "SO SIMILAR", "by definition", the idea that we'd apply laws equally to humans and software, and my eyes can't stop themselves. Forgive me.
Hell yeah. I'm replying because I get that and I would love honest feedback if I asked for it too.
kintar1900@reddit
There are too few Redditors with that attitude. Thank you.
Can't say I blame you, and I wasn't trying to anthropomorphize LLMs. I personally can't stand it when people talk about AI systems "thinking" or "wanting", etc. My statement is entirely around the (false) claim further up this thread that a neural network stores a copy of the data it was trained on. I brought up the similarity to biological systems to point out logical fallacies in arguments about why training a neural net on copyrighted data constitutes copyright infringement.
Which statements? Everything I've said about the way ANNs encode data being similar to the way we think -- a phrase I should have included in my original statement -- that biological brains encode data is based on various papers and articles I've read since the mid-90s. HOWEVER, I have recently been informed by an acquaintance that I'm out of date and there's currently research being done on whether or not neurons themselves perform networked processing within themselves, which is FREAKING AWESOME! :D
While I understand, I'm apparently WAY more cynical about our legal system than you. We already treat corporations like they're individuals, and in some cases give them more rights than people. :/ Couple that with the greed expressed by US corporations, and I can 100% believe that if someone in a corporation's legal team thought there was a chance in hell that they could claim copyright on the output of a human because the person had been exposed to copyrighted data, they'd do it.
loup-vaillant@reddit
Reddit doesn’t fuzz when there are fewer than n votes (n is probably less than 5). A controversial vote with 10 ups and 10 down, sure, it will get fuzzed. but:
Red_not_Read@reddit
Have an upvote. We're here to argue conflicting opinions.
I'm actually pro-LLM, in software too, and my argument is basically that it's going to continue to be a challenge for normal people (by which I really mean non-tech, e.g. judges and the government) to make pragmatic decisions about all this.
kintar1900@reddit
Thanks, and I agree 100%. Our (the USA and to a lesser degree the EU) governments have consistently shown that they do not put sufficient weight on technical experts who weigh in on proposed tech regulations. It's disappointing.
totoro27@reddit
It actually does matter. If the model was as simple as what you described, then the legal conversation would be much simpler. I think it is inevitable that the legal conversation will get into what exactly these models are doing under the hood.
loup-vaillant@reddit
Oh but it totally holds for the human mind: try and rewrite a novel from memory, then sell it as your own: if the original author ever sees this, they will sue your ass, and win.
Moleculor@reddit
No, this is why copyright claims were thrown out: the model provably does not contain substantial copies of existing works.
An LLM only contains 'copies' if you define 'copies' as 'incredibly tiny fragments'. A word or such.
It's like arguing that "replied with a sardonic smile" is a copy of someone else's work.¹ It's a sentence fragment from A Game Of Thrones, so technically, yes, you can find it contained within an existing work...
But you can also find it within Days of Atonement by Michael Gregorio, Life in the New World by Charles Sealsfield, and The Memoirs of Queen Hortense by Queen Hortense.
You can also find it within interviews, fanfiction of Hearts of Iron 4, and more.
An LLM is a mathematical slurry with numeric connections between all these tiny fragments. Their design is literally based on theories of how the human mind operates. And it only works because it doesn't contain whole, complete copies of works; they'd be too slow to search through.
¹ And I'm not even entirely sure that fragment is short enough to be a legitimate example, because my understanding is the fragments, called tokens, are generally only a few characters in size. Like "sard", I guess. Or maybe "sardonic".
Red_not_Read@reddit
"A mathematical slurry"... I like it.
wildjokers@reddit
That is not how the Transformer model works at all.
loup-vaillant@reddit
That is not how the Transformer model works most of the time. Ask an LLM to reproduce something that was in its training data, it has a good chance of producing something very close. Just like we humans can (imperfectly) reproduce stuff from memory.
And there’s always the risk that sometimes, it happens by accident.
Moleculor@reddit
In short enough snippets that it's reasonable to think that a human might have reproduced it in the same situation without having seen the so-called original work.
Copyright cares about the work as a whole, or substantial enough portions of it that it threatens the profits of the person who made the work. The entire novel, not one sentence fragment from page 237.
It's why Google was so successful in defending itself from copyright lawsuits from the Author's Guild when they created their book search engine.
loup-vaillant@reddit
To be honest, I've sometimes straight up Ctrl-C Ctrl-V'd snippets of code, rearranged them to my style (indentation, naming, a bit of refactoring…), and… well, are a couple dozen lines enough to count as infringement? I never knew where the limit actually is, to be honest.
But it does speed up my work sometimes, even when I know the end result would have been the same if I started from scratch. Especially when I’m the original author, who somehow has ceded all rights (including attribution in practice) to some previous employer.
Red_not_Read@reddit
Of course it isn't... Why don't you take a stab at describing how an LLM incorporates its training data, in a way that can be easily understood by normal people.
EveryQuantityEver@reddit
An AI model is not a brain. The two cannot be considered to be the same.
batweenerpopemobile@reddit
brb, getting music industry to sue monster rancher franchise for deriving monsters from copyrighted data.
Monster-Fenrick@reddit
I don't think gathering the last two digits of track numbers constitutes a copyright violation. It's the equivalent of looking at specific page numbers in a book and counting how many words are on it and using that number to reference a table to decide what monster to create.
travelsonic@reddit
Copyright status doesn't make sense to point to as the problem, IMO, as opposed to licensing status. Implying that copyright status makes a piece's use in training problematic or not would miss that copyright is automatic in the U.S. and many other countries - and therefore works used with permission (implicit or explicit) would still be "copyrighted works," for instance.
oorza@reddit
Decompression in lossy video codecs isn't as simple as you might think; the analogy stands up fine, I'd say. You can add a bunch of processing filters on both sides of video codecs - stuff like noise reduction/addition, color adjustments, etc.
If I take The Avengers film and add a fansub track, replace parts of the music with a custom score I wrote, and then add a ton of video filters to the (de)compression, it's nowhere near a verbatim replication. But it'd still be copyright infringement to sell it.
It's the same thing as an AI: taking an original piece of IP, layering some changesets on top of it, pushing it through a lossy codec, then decoding it again with more filters on top of it. That describes both an LLM and a bunch of weird pirated anime on the internet, but only the latter is currently illegal.
Xyzzyzzyzzy@reddit
If something is a derivative work, then you can use the derivative content to point to exactly which works it was derived from. If I write a song that starts out "is this the real life?/is this just fantasy?/caught in a landslide/no escape from reality", it's clearly derivative of Bohemian Rhapsody by Queen. You don't need to know anything about me to say that. You just need to show that my lyrics are the same as their lyrics.
If an LLM produces a derivative output, we should be able to show which prior work it is derived from, right? LLMs can produce indisputably derivative outputs, and when they do, we can show the original works they're derived from, the same as if a person creates a derivative work.
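Showing the original work for a verbatim output can be as mechanical as an n-gram overlap check against a corpus. A naive sketch, with a toy corpus and an arbitrary n:

    # Flag an output as plausibly derivative if it shares a long word
    # n-gram with any document in a reference corpus.
    def shared_sources(output, corpus_docs, n=8):
        words = output.split()
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        return [doc for doc, text in corpus_docs.items()
                if any(g in text for g in grams)]

    corpus = {"bohemian_rhapsody": "is this the real life is this just fantasy "
                                   "caught in a landslide no escape from reality"}
    print(shared_sources("is this the real life is this just fantasy", corpus))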
But - correct me if I'm wrong - you're going a step further and saying that the LLM itself is a derivative work of every item in its training data set, so every output produced by the LLM is derived from the entire training data set. If I have ChatGPT write an SQL snippet to add a new column to my AccountsPayable table, it's derived from all of the SQL in its training corpus. It's also derived from St. Paul's letter to the Ephesians, Quotations from Chairman Mao, and "To a Mouse" by Robert Burns. ("Wee, sleeket, cowran, tim'rous beastie/O, what a panic's in thy breastie!")
That seems like a dramatic expansion of copyright, and a massive transfer of legal and economic power to existing copyright holders at the expense of all future creative work.
Even if we have different standards for human-written and LLM-produced works, LLMs are ubiquitous. A current copyright holder could claim that they have good reason to believe my work was written by an LLM, it's derivative, it's a violation of their copyright, and they'll sue me unless I pay them $5k to settle the claim. Even if the claim is frivolous, I can only be assured of winning in court if I can prove that my work was written before February 14th, 2019, when GPT-2 was made available to the public.
We already have patent trolling; now we can have copyright trolling, too. If I have seen further than others, it is by standing upon the shoulders of giants, so the giants are entitled to compensation. It's an RIAA lobbyist's wet dream!
oursland@reddit
You should double-check your "facts". The whole reason there's a project out there to copyright each and every melody is precisely because you can lose a plagiarism case simply by having the same notes in a sequence. This is true regardless of whether you play them at a different pace or whether they have nothing to do with the original work.
The reality is, if there is significant similarity and an expert claims that it is unlikely that two independent works would result in this similarity, then you're going to lose your plagiarism/copyright case.
Xyzzyzzyzzy@reddit
...that reinforces my point? They're suing people based on "this creative work resembles that prior creative work". It's a good thing you included that first sentence, because otherwise I'd think you're agreeing with me!
sparr@reddit
If you run it 100 times and you get 99 new movies and the exact original movie once, that's [at least] one instance of copyright infringement.
quetzalcoatl-pl@reddit
So, if I take Avengers, run it through H.265 (compression), add subtitles (combine things), and add my voiceover (even more combining of things, my personal products added) - then it is not a copyright violation? YAY, hold my beer, I'm opening a new business!
Helluiin@reddit
If you compress a movie and put a filter over it, that's also combining things. Yet most people would probably call that copyright infringement.
hackingdreams@reddit
So? Now it's just incorporating copyrighted data from multiple sources instead of one.
The fact it can generate code that's verbatim to the training data indicates that it is, in fact, a sophisticated compression scheme. You just admitted it.
a_marklar@reddit
Yes, that is the lossy part
ianitic@reddit
It's more like movie compression with a filter on top. If there wasn't a fine-tuning step, it would be closer to just the noise from movie compression (using this analogy). The only thing preventing that is the very lightweight fine-tuning step.
Heck, one of the use cases of something called an autoencoder is literally compression and the pre-training step is nearly identical to LLMs.
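For the curious, a minimal autoencoder sketch (hypothetical layer sizes, PyTorch-style):

    # Encoder squeezes the input into a small code; decoder reconstructs it.
    # Training to minimize reconstruction error is lossy compression.
    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, n_in=784, n_code=32):
            super().__init__()
            self.encoder = nn.Linear(n_in, n_code)  # 784 values -> 32
            self.decoder = nn.Linear(n_code, n_in)  # 32 values -> 784

        def forward(self, x):
            code = torch.relu(self.encoder(x))      # compressed representation
            return self.decoder(code)               # lossy reconstruction

    model = AutoEncoder()
    x = torch.rand(1, 784)                          # stand-in for one input
    loss = nn.functional.mse_loss(model(x), x)      # objective: reproduce the input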
meltbox@reddit
I can’t believe I had to scroll this far to find someone who knew what the hell they were talking about.
Pissed me off.
accountForStupidQs@reddit
But every vector is mutated by every element of the training set. If you were to overlay the first hour of every movie of the past 50 years, frame by frame, each frame would end up being black or nearly black. Does that imply that the black frame is a facsimile of one of the movies? Which one? And if we say only half the movies released were used, how do you prove that Back to the Future was one of the masks, but Jaws wasn't?
The way the training works you're almost performing the opposite operation of what you propose ideal compression is. Where ideal compression leaves only that unique minimum noise that clearly identifies one work from another, the training aims to ignore all uniqueness and encode only those qualities which are common to the whole set of training data
rebbsitor@reddit
I think under current copyright law the answer is "all of them". Taking a copy of something and incorporating it into another work is the basis of a derivative work.
If the assertion is it's not derivative, then that's also saying that the model can be made without the copyrighted works, which it can't be.
kintar1900@reddit
You're still missing the point of the parent comment. For your statement to hold, you must have recognizable copies of the original work taken verbatim from the source. Creating a "new" movie by cutting scenes from Jaws and Back to the Future together would be copyright infringement (if we ignore parody law). Parent comment's point is that taking Jaws and Back to the Future and producing movies with the same scene structure, color palette, or character arcs is not copyright infringement, and is a much closer example to the way generative AI works.
rebbsitor@reddit
This is a misunderstanding of copyright law. Creating a derivative work without the permission of the copyright holder is itself copyright infringement. The fact that the starting point is a copyrighted work that you're modifying means this is a derivative work and is copyright infringement.
Xyzzyzzyzzy@reddit
That's not actually how copyright law works in the US, though. Copyright only protects certain types of creative work from reproduction. It's not a blanket prohibition on all derivative works.
For example, if you write a cookbook, you hold copyright on the words that you wrote in the cookbook. You do not hold copyright on the recipes themselves. I can't copy-paste the text from your book to my book, but I can write instructions for making your split pea soup in my own words. You have copyright on the text and images, not the process of making the soup.
If I publish a paper describing a new sorting algorithm with improved performance under certain conditions, I have copyright on the paper and the code in the paper. I do not have copyright on the algorithm itself, because algorithms are not protected by copyright.
There's a well-understood process for your company to use the algorithm without risking a copyright violation. You give Alice the code for the algorithm from my paper. Alice writes, in her own words, a detailed description of what the code does, without any actual code. You pass that to Bob, who has not read my paper. Bob uses the description to code the algorithm. If I claim you infringed on my copyright, you can show that you didn't. You absolutely copied my algorithm, but there's no general prohibition on copying algorithms. If I wanted to protect my algorithm, I'd have to apply for a patent, which is an entirely different thing.
rebbsitor@reddit
You're correct that there isn't a blanket prohibition on derivative works. However, if someone makes a derivative work, their rights are only to the new parts they've created. They need permission of the copyright holder of the works they're derived from to legally distribute the derived work. Unless their use of the copyrighted work falls under Fair Use.
Your examples are not derivative works. They're things (ideas, facts, etc.) that were never subject to copyright protection in the first place.
However, what the person I responded to is talking about is a shot-for-shot recreation of a film generated by AI. There is artistic expression in the shot composition, arrangement of scenes, and the overall narrative that also has copyright protection.
Someone can’t legally take a book/movie and simply retell it, especially if the retelling is too close to the original in terms of plot, characters, and specific language. This would likely constitute copyright infringement because it involves reproducing the original work's protected elements.
Copyright law protects the specific expression of ideas, including the unique plot, characters, dialogue, and overall narrative structure of a book/movie. Retelling a story without substantial changes like summarizing or paraphrasing significant portions of the text can be seen as copying.
However, if someone retells a story in a way that transforms it significantly, adding original elements, or changing the setting, characters, or perspective, it might be considered a derivative work.
Xyzzyzzyzzy@reddit
I think we just understand this differently:
I read that as, like, making a movie "in the style of Jaws" or that is a "homage to Jaws" or something like that, which is clearly permitted. Not literally recreating Jaws from scratch shot-for-shot and scene-for-scene, which is 100% copyright infringement.
Sorry for the confusion!
__loam@reddit
I think the legal question is whether it's fair use, which is actually more complex than most people assume.
hackingdreams@reddit
It doesn't matter. If the model spits out code that looks sufficiently close to my GPL'd code because it was trained on my GPL'd code, you essentially created a sophisticated copy and paste machine. You can throw as many fancy terms at it as you like, but it ultimately does not matter how you got to the same damned code.
And folks, boy does Copilot like to generate the same code as it's provided - down to the bugs, comments, and often even copyright notices.
Building a copy machine with illudium unborkable compression technology doesn't matter in the slightest - it's still a copy machine.
kintar1900@reddit
If you believe this, I seriously hope you've never read any GPL'd code and then written your own code that does something similar. By your own logic, being exposed to the GPL'd code has altered your training set and given you the capability to produce other code which performs the same or similar function, and you are therefore violating the GPL terms.
hegbork@reddit
Yep. That's how it works. Have you heard about cleanroom implementations?
Btw. It was Microsoft who threatened the entire industry around 20-25 years ago when the code for NT was leaked. Anyone who looked at it would taint all their future work. At that time every open source operating system purged committers who admitted publicly to even breathing in the direction of that code.
__loam@reddit
I mean that's literally the legal reality right now lol.
accountForStupidQs@reddit
The how should absolutely matter, lest we say monkeys with typewriters are prima facie copyright infringement because they may eventually produce the works of Agatha Christie.
kintar1900@reddit
I think you're getting downvoted because you used prima facie. It's pretty obvious nobody arguing against LLMs in this thread actually understands legal reasoning, much less the way LLMs actually work. :/
EveryQuantityEver@reddit
No, I think the one that doesn't understand is you. Mainly because you keep thinking that the "How" of how LLMs work is enough to paper over the "What" of what they're doing when it's copyright infringement. Its like saying that courts can't do anything with Bitcoin because "transactions are immutable!" and thinking that the court will just shrug its shoulders.
sparr@reddit
If there exists any short* prompt that gets the model to reliably reproduce a clip from BttF, that is probably sufficient proof.
* too short to uniquely describe BttF
giltirn@reddit
Excellent answer!
purleyboy@reddit
An LLM is fundamentally a neural network. Each node (neuron) has an activation value and output weights. These numbers (and the node connections) are refined and adjusted with each piece of training data. The continual refining means that the end network is not a representation of any one piece of training data, but of all pieces of training data effectively overlaid. So, you generally will not get the training data back as output from an LLM. Compression is all about maintaining maximum original information (minimal information loss) with minimal storage. LLMs are not good at this. You may get an LLM to output a very small piece of code that is identical to training data, but oftentimes this is because there are limited ways to perform a simple piece of logic. The actual training data is not stored as a facsimile in the LLM.
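Some rough back-of-envelope numbers for that last point (hypothetical ballpark figures, not any specific model):

    # A model's weights are orders of magnitude smaller than its training
    # text - a terrible ratio if the goal were faithful recall.
    params = 70e9                     # e.g. a 70B-parameter model
    model_bytes = params * 2          # ~140 GB of fp16 weights
    train_tokens = 2e12               # ~2 trillion training tokens
    train_bytes = train_tokens * 4    # ~4 bytes of text per token: ~8 TB
    print(model_bytes / train_bytes)  # ~0.0175: under 2% of the training text's size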
sonobanana33@reddit
Can you explain how it can output verbatim stuff then?
purleyboy@reddit
Here's a great article for you on memorization. This is the exception that is being addressed in future training techniques.
sonobanana33@reddit
I think the lawsuit is on what has been done, not what might one day be done.
purleyboy@reddit
Yes, the NYT went on a determined hunt to find an instance of memorization. It took the firm they hired >10,000 prompt refinements to get a result they could use as the basis of the lawsuit. We'll see how that plays out. However, back to the technicality of it all, this is absolutely the exception and not the norm.
sonobanana33@reddit
Surely Microsoft has more than 10k users? At 1 prompt per day… at least one user per day is violating :)
Doesn't sound that impressive if you put it in perspective with the fact that Copilot has more than one user.
purleyboy@reddit
I'm not sure if you're serious or not. The 10,000 prompt refinements were not 10,000 random prompts but using very sophisticated techniques to attempt to essentially jailbreak the LLM and find an example of memorization and then continue to refine the prompt until they could get output as close as possible to training data. I haven't seen the prompt that was used but I've read that the prompt itself is going to be used to defend OpenAI. It may be such a contrived prompt that it works against NYT in court. We'll have to wait and see the case play out.
sonobanana33@reddit
So you actually have no idea of what the prompts are. Perhaps you violated copyright repeatedly yourself and are unaware of it?
purleyboy@reddit
As far as I'm aware the case evidence is not yet public, so we'll have to wait.
drekmonger@reddit
Case evidence is public, as I indicated in another comment:
Here's the complaint: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf
Here's exhibit J, as mentioned in the complaint: https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dkt-1-68-Ex-J.pdf
purleyboy@reddit
Thanks for that. I am guessing that OpenAI's argument will be based off using something like the ACR measure to demonstrate that it is unlikely that a typical prompt will expose incidents of memorization.
sonobanana33@reddit
we went from "it's impossible to reproduce the input" to "it's difficult"
purleyboy@reddit
It's a pretty big topic, worthy of multiple PhDs' worth of study. A gross oversimplification is that it's like a human brain. I can read a book and give you a good synopsis, but not a word-for-word replication. However, once in a while I may have memorized one thing word for word. In general (general being the key word), it's impossible to get the source training material out of an LLM. But there are exceptions.
sonobanana33@reddit
Remember that in r/programming people are likely to have taken ML and AI courses at university.
In general if something appears many times in the training data, it's probably very likely to be reproduced.
drekmonger@reddit
We do know what the prompts were.
Here's the complaint: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf
Here's exhibit J, as mentioned in the complaint: https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dkt-1-68-Ex-J.pdf
Essentially the prompts are asking for completions. The "prompts" contain no instructions. They are just start of the article and implicitly GPT-4 will attempt to complete it.
Out of something on the order of 10,000 tries (unclear if that's per prompt or all prompts in total), the model outputs something close (but not completely exact) to the source article used in training.
Swoop3dp@reddit
It's not really compression. LLMs don't reproduce the training data word by word. (unless the model overfits the data, which you want to avoid)
If I, a human, read some GPL code I will learn new concepts from it. If I then apply those concepts somewhere else nobody will care, unless I actually copy the code.
LLMs basically do the same thing, just on a much bigger scale.
WaitForItTheMongols@reddit
Yes, but again, LLMs do not learn. They modify their parameters using various gradient-descent techniques to optimize their weights against a loss function, but that's not learning. When I read code, I learn. I process it using my understanding of the world, based on other algorithms I'm familiar with. LLMs do not do this. They re-train to fit the new training data, but they do not have an understanding which they can then integrate new information into.
BlackHumor@reddit
Yes they do, that's what their parameters are.
Valmar33@reddit
There is nothing that is actually "learning" anything in an AI system ~ AI bros just love redefining words to make it sound like their souped-up algorithms are doing something magical and different.
BlackHumor@reddit
This has been called "machine learning" since 1959.
Uristqwerty@reddit
Programmers must really love knitting, for all the Strings they use. Except that it's understood that "string" has a precise domain-specific meaning that differs from the more general-purpose word. "Learning" in the context of AI is similarly jargon, carrying a set of connotations that only partially overlaps with everyday use.
BlackHumor@reddit
I would argue that the sense of "learning" in machine learning is much closer to learning for humans than a "string" of characters is to physical strings.
If you start with a system that does not know how to write an essay, and you run it through a properly designed process of machine learning, you will end up with a system that does know how to write an essay now. It will still have problems that humans don't share (such as a tendency to just make stuff up), but the machine has clearly learned something in some sense.
Uristqwerty@reddit
As I see it, learning in humans is more reverse-engineering a process that uses your existing mental and physical tools to create a similar result, and can be abstracted to apply to completely unrelated topics in the future, while machine learning is tuning a prediction algorithm that needs to see some sort of context before it can fill in the "learned" piece. You're figuring out "what did the original author think, and how did it influence their choices" more than the raw pattern of tokens in the end result.
DarthNihilus@reddit
Wow look at this "AI Bro" with his basic knowledge of software engineering terms.
kintar1900@reddit
More specifically, that's what the tokenization and attention layers effectively do: break down the frequencies of relationships in known data to predict relationships in unknown data. It's not "learning" the way that humans learn, but it's definitely capable of producing things that are effectively indistinguishable from human output to a layperson (and some things that are utter nonsense, to boot!)
jmlinden7@reddit
It really doesn't matter if they learn or not. What matters is if they produce output which violates copyright.
hackingdreams@reddit
Not even close. An LLM has no intrinsic idea of what it's even looking at. It's all just a stream of bits that ultimately forms a table of how frequently some bits connect to some other bits - it's a sponge-like frequency table, which is very much like something you'd see in a compression algorithm.
If you, a human, produced identical code to some GPL'd code after having read it earlier, you would be tagged for copyright infringement, and you'd be guilty of it. If you learned something from that code and applied some principles from it, you're fine. If you copied and pasted from it, you're going down.
There's more than sufficient evidence that these machines are essentially sophisticated copy and paste machines. You can call them "Markov-like copy and paste bots" - the next bit they paste is just derived from a frequency table they've stored by being trained on GPL'd data. That makes the whole training model a derivative work, and thus anything it creates a derivative work.
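A toy sketch of what "Markov-like copy and paste" means, with a tiny made-up training set (real models are vastly larger, but the principle is the same):

    # A pure frequency table over adjacent tokens will happily
    # regurgitate its training text verbatim.
    import random
    from collections import defaultdict

    training = "int main ( void ) { return 0 ; }".split()
    table = defaultdict(list)
    for a, b in zip(training, training[1:]):
        table[a].append(b)                   # token -> tokens that followed it

    token, out = "int", ["int"]
    while token in table:
        token = random.choice(table[token])  # sample the next token
        out.append(token)
    print(" ".join(out))                     # the training line, verbatim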
No amount of "but humans work the same way" argument is going to beat the fact that Microsoft had to install filters on its output to prevent it from literally generating code with copyright notices, straight from FOSS code. There's no argument there - it saw, it copied, it pasted.
uCodeSherpa@reddit
Humans DO NOT work the same way and anyone claiming this is a full on moron.
DarthNihilus@reddit
You're just as big of a moron for claiming the opposite.
We don't know how humans work.
uCodeSherpa@reddit
And yet, AI morons have claimed multiple times in this thread alone that AI works just like humans do!
I am very aware that neurological science has no idea how the brain actually works. It is the AI morons that seem to have no idea of this.
Helluiin@reddit
Lossy image compression also doesn't reproduce the image pixel by pixel.
GlitteringFriggit@reddit
Some of the queries I've made to Claude have returned 200+ line verbatim copies of code (including all the original comments and everything). And to be clear, I wasn't trying to get it to return copied code; these were just random queries. I only noticed due to the "humanness" of the comments, searched online, and found the exact code from a 13-year-old Stack Overflow post.
liveart@reddit
First of all: it's not compression. At least not in any meaningful sense of the word. The comparison is like calling deleting 99.99% of a file 'compression'. The bits just are not there. Second, I'm not sure why people don't just... look up copyright law. You don't need to get that far into it to reach the fair use portion on Wikipedia and find: "the amount and substantiality of the portion used in relation to the copyrighted work as a whole".
In no way, shape, or form is a 'substantial' portion of any of these works included in the model. Given the model sizes and the amount of data used, if it were any type of compression it would be the most impressive compression in the world.
But let's say you're still not convinced. What else can we find under fair use?
AI models are by necessity transformative. Even if you want to torture the definition of 'compression' to try to make it fit, the model is still massively transformative. An AI model looks and functions nothing like the original works. Two of the four major standards for fair use are inherent in the creation of AI models. Ironically, the stronger argument against fair use is the effect on the market/value of the copyrighted work. There is a real risk of AI models pushing down demand for the works they trained on, but that's tougher to prove, and no one wants to just say "I just want money", so instead we have these threads where people torture the definition of compression rather than looking up what is generally considered fair use.
Ictogan@reddit
You say that, but the currently best performing compression algorithm in the Large Text Compression Benchmark(compressing a subset of wikipedia) is a program that trains a transformer model on the data http://www.mattmahoney.net/dc/text.html#1072 . So transformer models can very well contain exact copies of their training data, to the degree that they can even be used as a lossless compression algorithm.
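The intuition behind that, sketched in Python with invented numbers: a lossless coder (e.g. an arithmetic coder) can store each token in about -log2(p) bits, where p is the probability the predictive model assigned to the token that actually occurred, so a better predictor directly means a smaller archive:

import math

# Shannon code length: a model that predicts the data well compresses it well.
def compressed_size_bits(tokens, model_prob):
    return sum(-math.log2(model_prob(tok)) for tok in tokens)

# A model that assigns high probability to the real data needs few bits...
print(compressed_size_bits("aaab", lambda t: 0.9 if t == "a" else 0.1))  # ~3.8 bits
# ...while a clueless uniform model over {a, b} needs more.
print(compressed_size_bits("aaab", lambda t: 0.5))                       # 4.0 bits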
Chii@reddit
but the compressed content cannot be used to make a different movie than the Avengers. Therefore, its primary use is to infringe copyright.
With LLMs, on the other hand, there's a case to be made that they can be used to create new works that don't infringe at all. Just because the LLM potentially contains some encoded form of the training data is irrelevant - the digits of pi also contain that same information, and yet you do not get to claim that people who use pi are infringing.
An LLM is distilling information out of a large body of works. This information cannot be copyrighted - in the same way that a recipe cannot be copyrighted, only its particular expression. Somebody else can take the information from a recipe in a cookbook and reproduce it in their own expression, and it would not infringe (nor should it).
Syxez@reddit
I think what ultimately matters is the end result. Pi contains the Avenger movie, but someone cannot get it without knowing the whole exact movie in the first place. What would be illegal is providing a pointer to the movie in Pi, this would allow anyone to create the movie using a very simple program.
The cases that were dismissed by the judge were dismissed on the grounds that the similarity between generated code and the code trained on wasn't high enough. In the case of LLMs, you don't need any pointer to get the data; your search prompt is already integrated in token form and linked internally to the data you're searching for. However, current LLMs do not encode the vast majority of the data with enough accuracy for it to be recreated closely enough.
This is bound to change, of course, if training on the data becomes more intensive in the years to come. There are already techniques where you train significantly more on "outlier data" to better integrate it, sometimes resulting in 100%-accurate recreations, like the exact ASCII-art recreations some current models produce.
quetzalcoatl-pl@reddit
just a tongue-in-cheek note:
Actually, there's a nonzero chance that the data that forms the-pointer-to-the-position-of-Avengers-in-Pi is of similar bit-length to a decent copy of the Avengers movie ;)
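A quick back-of-the-envelope check of that note, with a toy target size (the real movie would be gigabytes):

import math

# In an endless random digit stream, an n-byte target first appears, on
# average, around offset 256**n - and writing that offset down itself takes
# about n bytes, i.e. as much space as the data you were "pointing" at.
n = 3  # tiny target, for illustration
expected_offset = 256 ** n
pointer_bytes = math.ceil(math.log(expected_offset, 256))
print(expected_offset, pointer_bytes)  # 16777216 3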
hackingdreams@reddit
Except, they do, because they're made from the original content. I know this is hard folks, but if you take a movie, chop it up into little pieces, re-arrange it, and post it on YouTube as your work, guess what's going to happen to your YouTube channel? It's going to get flagged as copyright infringement, because odds are really fucking good you just committed copyright infringement.
You're never beating the allegations as long as you admit that some amount of the source work is being duped into the finished content, and the plaintiffs have plenty of evidence of that happening. Hell, so does anyone - ask your favorite LLM to generate a fast inverse square root algorithm, and 99.99999% of the time it'll spit out the Quake III algorithm, complete with a fucking GPL header.
Chii@reddit
that's because you're describing something akin to format transcoding. It has nothing to do with the operations of an LLM, or with what copilot might output.
It's not to say that copilot can't produce infringing content - it's a case-by-case basis, based on what the user prompts copilot to do.
But i would make the general claim that the LLM in copilot itself (the neural weights) are not themselves infringing. Just like i would not infringe by distributing digits of pi. If someone points to a certain index of pi digits and says "here is the Avengers movie", then they would be infringing.
mccoyn@reddit
This analogy doesn't hold up. The digits of pi are in no way derived from the Avengers movie. But, the neural weights are derived from the copyrighted code. If you left that code out of the training data, you would get different weights.
WaitForItTheMongols@reddit
Sure it can, you just need a different decoding algorithm.
In the simplest case: I can take the Batman movie, XOR it with the Avengers movie, and get My Secret Algorithm.
Throw away the Batman movie, pretend it never existed.
Now, if I take My Secret Algorithm, and XOR it with the Avengers movie, I can get a totally different movie! The existence of My Secret Algorithm proves that the data in the compressed Avengers movie can be used to recover any movie of your choosing. If you use the same decompression algorithm as was used for compression, you get Avengers, but if you use a different algorithm, you get any other movie.
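For the skeptical, the whole argument fits in a few lines of Python (random bytes standing in for the movie files):

import os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

avengers = os.urandom(32)  # stand-in bytes for one movie
batman = os.urandom(32)    # stand-in bytes for a completely different movie

secret = xor_bytes(batman, avengers)  # "My Secret Algorithm" (a one-time pad)
# "Decompressing" the Avengers bytes with the secret pad yields Batman:
assert xor_bytes(avengers, secret) == batman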
Chii@reddit
but you would have to first infringe here, because this algorithm would be a derivative work of the avengers movie already. It has nothing to do with the data in the compression any more.
your proposed scenario is as if your prompt to the LLM were itself the infringing content.
WaitForItTheMongols@reddit
I'm not saying that's what you would do.
My point is to construct a logical proof that, because we can use XOR to generate Such An Algorithm, then Such An Algorithm exists. We could arrive at Such An Algorithm in a potentially non-infringing way as well. But the question of "Can you compress one set of data with one algorithm, and decompress with another algorithm, to recover a different set of data" has an answer that is emphatically YES.
sonobanana33@reddit
Is a fan cut of the avengers not infringing?
Chii@reddit
but that's not what the LLM is doing - it's not cutting pieces of the works from the training data and rearranging them, unless you go down to the level of names or tiny snippets of code.
NickWalker12@reddit
Laws could be passed requiring LLM makers to publicly release the full database of training data (losslessly compressed), against which the LLM can be checked, along with the associated license for each piece of media. Fair compensation could be given for the distribution cost incurred by the company.
But I'm honestly more shocked that this isn't already law in the EU, given:
purleyboy@reddit
But laws are local to one jurisdiction. How do you stop Open Source models published outside of that jurisdiction?
NickWalker12@reddit
You can't, really, same as with other laws. But you can make it illegal to distribute to, or operate within, protected jurisdictions if you don't comply (like GDPR does with American companies), and use diplomacy and treaties to open avenues for justice internationally (e.g. extradition). It's one of the many benefits of globalism - we can cooperate with our allies.
Uristqwerty@reddit
In computer science, bits don't have colour. In human society, and especially in law, they do. The LLM might not contain any bits from the training data by the time it finishes pureeing it all through the relevant calculus and squeezing it into a brick of model weights for distribution, but the intangible metadata about the training data's provenance exists in a parallel dimension that math cannot interact with, so no description of the algorithms is enough to say how much of the original colour remains.
EveryQuantityEver@reddit
They'll just make sure that all the training data is recorded and cataloged. It should be, there's no reason to not know what goes into it.
purleyboy@reddit
But who is going to regulate open source models? If the US cracks down domestically then the open source models will be trained and distributed outside of the jurisdiction of the US.
EveryQuantityEver@reddit
Outside the US, and outside of much of the western world. I don't have this paranoid fear that somehow all AI will end up outside the US - not that it's all that useful to start with.
PaintItPurple@reddit
This seems a bit like saying that highly compressed data does not contain a facsimile of the source material. If you've created some fixed data that a computer program can use to recognizably reproduce a copyrighted work, it seems more than fair to say that the data contains a copy of the work, even if you can't point to the copy in the data in a way that humans can read.
KevinCarbonara@reddit
What this is really getting at is that fair use laws have, for a very long time, allowed far more egregious uses of copyrighted material than AI engages in. Many of the smaller, independent artists who have been the most vocal about being anti-AI are themselves far more guilty of re-use than the AI is. Even if we were to re-write laws to accommodate AI, we're not going to find a balance that satisfies that group. Either we allow AI to re-use infinitesimal portions of copyrighted material, or we prevent much of what is currently protected by fair use.
loup-vaillant@reddit
And we know this how, exactly?
I can take a picture and encode it into a giant QR code, and while the QR code itself will look like it does not contain any facsimile of the original image (it certainly won't look anything like the original image), there's no doubt all the information is there, and that's what matters here.
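The same point in a couple of lines of Python, with base64 standing in for the QR code: an encoding can look nothing like the original while preserving every bit of it.

import base64

original = b"any training data: code, images, prose..."
encoded = base64.b64encode(original)          # looks nothing like the original
print(encoded)
assert base64.b64decode(encoded) == original  # yet all the information is there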
Can you actually say the underlying vectors do not contain loads of information from the training data? I think not: those vectors are mostly a representation of that data. A lossy one, for sure, but if they did not contain information from the training data, what would even be the point of the training?
It is possibly just as hard to prove that a human has read a particular novel, or learned from a particular piece of code. The fact remains that if I reproduced source code from memory and distributed it as if it were my own, I’d be infringing copyright.
simon_o@reddit
That has never been an obstacle when it's been the big corps suing the little guys.
purleyboy@reddit
Code copyright laws are a mess. Technically speaking, all code snippets on StackOverflow fall under CC BY-SA, meaning everyone here who has ever cut and pasted from StackOverflow should be publishing an attribution for each code snippet. No one does this. It's impractical. CC BY-SA also has copyleft implications. In fact, from a commercial standpoint, you may be better off telling your developers to use GitHub Copilot instead of StackOverflow for snippet-type code, because there are (currently) no legal issues with the generated code. Specifically, there are no copyleft issues.
hackingdreams@reddit
I've been at companies that very much track this, and add it to their open source disclosures. There's a whole industry built around tracking usage of open source code, making sure that license attribution is done correctly, and making sure your products aren't in violation.
In this case, the burden to detect and complain about this copyright infringement lies on StackOverflow, as ultimately it's their copyright being violated by these offenders. StackOverflow apparently doesn't care, as it hasn't sued anyone over it yet. Therefore, this entire facet of the argument is moot.
Except, you know, the very likely possibility that you get sued for copyright infringement right after Microsoft loses this case. The amount of emergency lawyering that would have to be done, the ridiculous degree of code auditing, the fucking five alarm panic this would cause at most companies means that this advice is the worst fucking advice I've literally ever heard. "Might as well commit copyright infringement, it's not yet been settled it's copyright infringement." No fucking thanks.
purleyboy@reddit
I am involved in a lot of M&A and am very familiar with the various OSS scanning tools and the legal implications. First, GitHub indemnifies paying customers against any lawsuits, but I understand if corporations find this unattractive.
Second, the copyright laws in code are ineffective today. As you mention, StackOverflow doesn't pursue anyone for violating their license. Legally though, companies should care about code being pasted from StackOverflow just as much, or possibly more, than code generated from Github Copilot.
I've used Blackduck, Snyk and Mend plenty of times. In the over 50 M&As I've been involved in, every single company has had OSS license violations in their code base (typically lack of attribution). We fix it post-acquisition, but I see enough to know how bad our industry is.
The reality is that most companies have violations on OSS licenses due to developers adding libraries or cutting and pasting code without any oversight. Right now the same is happening with GAI. Whether your company allows it or not, guaranteed your developers are using it.
balefrost@reddit
I don't think that's correct. The copyright lies with whoever contributed the code to SO. The original author licensed their contribution to SO as CC BY-SA, and so SO has to distribute it under that license.
The original author could dual-license their content, for example charging companies to use it commercially, releasing it under a proper open source license, or (in some jurisdictions) releasing it to the public domain.
I think the only person who could pursue such cases would be the original contributor. Most contributors don't care - they intended their work to be used freely.
I agree, though, that it's a legal minefield.
(Incidentally, I couldn't reply to /u/hackingdreams or even to your grandparent comment.)
dysprog@reddit
I had a company track me down and beg me to grant an MIT License for a code snippet I posted on stackoverflow.
Apparently the legal department had been looking for me for a few years and somehow found me on Facebook of all places.
(No idea why it took them so long, I use my legal name on SO, which is fairly google-able. There are about 5 people by that name, and it's also my gmail address)
I considered squeezing some dollars out of a corporation for the principle of it, but decided I'd rather not give payment related information in case it was some absurd scam.
kintar1900@reddit
This is the most cogent comment on this thread. Thank you.
KSRandom195@reddit
No copyleft issues so far.
purleyboy@reddit
Well, the US Copyright Office ruled last year that you cannot copyright AI-generated content unless a human has been significantly involved in refining the content. So, at the moment, you technically cannot copyright any raw code generated by a coding assistant. Which leads to a whole other discussion about what will and will not be copyrightable in the future.
Worth_Trust_3825@reddit
Corpos go at each other over code snippets. Check out IBM v Microsoft over public API copyright, and Oracle v Google over Java usage in Android, where part of the case hung on Google copying the nine-line rangeCheck function.
simon_o@reddit
?
carrottread@reddit
But it's not GPL as far as Microsoft is concerned. By accepting GitHub's TOS, everyone agreed to grant them rights to use all their public code on GitHub to train their AI model.
hackerbots@reddit
That isn't what the TOS says at all. Now is not the time to start pushing propaganda.
Halofit@reddit
People can and do share code on github that is not originally theirs so the original licence provisions still apply. Microsoft cannot change those terms.
Houndie@reddit
General question: what happens if person A writes code and licenses it under the GPL, person B takes that code and illegally reshares it under a more permissive license, and then person C, without knowing about person A's copy, makes their own version of it off of person B's?
Halofit@reddit
C is still in violation of A's copyright. Intent isn't required for copyright infringement. I'd guess C could try and sue B for damages, but I'm not sure.
fbuslop@reddit
If Microsoft is going to use content because the person who created the repo accepted some TOS, they must validate whether that person even has the rights to change the license for specific entities.
stonerism@reddit
It would be extremely interesting if we can pull more AI technology out of proprietary black boxes.
Kinglink@reddit
I'm going to bet Microsoft settles the licensing issue if they can. I don't think they have a leg to stand on.
However this decision already sets a good precedent, with the judge agreeing that AIs don't copy code... which is true, they really don't.
You have more danger from a junior dev just copying code from Stack Overflow, than an AI writing something similar to something else.
Gubru@reddit
Hard disagree. If massive corporations can train on your data, then you can train on massive corporations' data.
nightcracker@reddit
A new specimen in the wild: the temporarily embarrassed data broker.
FatStoic@reddit
Your data: In public github/gitlab, publicly accessible.
Their data: In private github/gitlab/home-rolled equivalents, only accessible by them.
Gubru@reddit
Their data: every book, movie, tv show, and other piece of media ever published.
DarthNihilus@reddit
My data: All of that and a private gitea instance
lngns@reddit
Microsoft's official stance is that your code is free-to-use, but code by "entities with lawyers" is not.
amadvance@reddit
Yep, this is a huge blow to copyright. It's really good news.
It's unfortunate that it helps Microsoft, but we should keep the long-term goal in mind.
sonobanana33@reddit
Good luck when they sue you.
exodusTay@reddit
if this goes thru i hope someone trains an llm with all the leaked code online and makes it public.
otherwiseguy@reddit
This is literally how reading a textbook works. I do not cite all of the books I've ever read on programming when I write code. It's just how learning works--human or AI.
In general I'm ok with the training on whatever is publicly available as long as the output of AI cannot be copyrighted (and it can't). It seems like a decent trade off.
We do need better open source ecosystems for AI models, but those are progressing.
hamthrowaway01101@reddit
The joke's on them, my code sucks
Which-Tomato-8646@reddit
I got bad news about every framework and library then
TistelTech@reddit
at least they get credit for their work on the library; it looks good on a resume, etc. An LLM steals that credit.
Which-Tomato-8646@reddit
It’s transformative. Not much different from a student looking at your code to learn from it
WordAggravating4639@reddit
Reading Comprehension 0/10
DreamingInfraviolet@reddit
It's good news because I'd rather have great tools than restricting progress in the name of copyright. Someone using my free MIT code doesn't hurt me, but gives society benefits.
lngns@reddit
If you are using the MIT licence while accepting this, you are misunderstanding the MIT licence.
__loam@reddit
The issue with AI generally is that these systems ostensibly create value on the premise that work created with prior labor should be free, while their output competes with the very people who supplied that prior labor. The people who make great tools and bring value to our society need to be treated fairly.
r3drocket@reddit
I'm okay with good tools. I'm not okay with some company making a bunch of money off of them and then making me pay, when I've seen the thing write code that was effectively code I wrote, which it clearly trained on.
Prod_Is_For_Testing@reddit
Copyright has absolutely no legal basis to prevent training. The only potential violation would be the produced output. I believe that the output snippets are too short to be copyright violations. All works are copyrighted by default upon creation, but that does not mean that every subset of that work can be copyrighted. There’s a minimum length and complexity requirement that’s poorly defined by US law.
I don't think there's a violation of existing copyright laws. These tools can only be judged against the existing laws as written; we can't invent new rules from the bench just because the tech feels like a violation.
__loam@reddit
This is a very confident statement considering there are still numerous copyright cases with respect to this technology being actively litigated.
purleyboy@reddit
If your code is your IP and it is closed source then there is nothing to worry about. Coding assistants are only trained on public and open source code.
didroe@reddit
What if your code is open source?
mx2301@reddit
Then it depends on the license the open source code obliges you to follow. Say you have code under a strong copyleft license like the GPL, which states that if the code is included in a project, the project has to be open-sourced as well. (A really strong simplification.)
purleyboy@reddit
But LLMs don't contain any code. The learning process generates vectors that are continuously rebalanced with each new piece of training data. You cannot ask an LLM to reproduce the training data, as it literally does not have it. So this is a new situation we find ourselves in. We'll see what the final judgments on this are. From my personal perspective, I don't see any breaches of OSS licenses in training an LLM.
a_marklar@reddit
Surely you mean something other than what you said. LLMs would be useless if they could not reproduce the training data. All they do is learn the distribution of the training data.
At the end of the day all of this is more like lossy compression than anything else and that is how I expect the law to eventually treat it. Just because your JPEG has been encoded 100x it doesn't change where it came from.
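A toy sketch of that point, with made-up numbers: a lossy codec throws bits away, but what survives is still unmistakably derived from the input, no matter how many times you round-trip it.

# Toy "codec": quantize each sample to one decimal place.
def lossy_roundtrip(values):
    return [round(v, 1) for v in values]

image = [0.12, 0.47, 0.51, 0.89]
for _ in range(100):            # re-encode 100 times
    image = lossy_roundtrip(image)
print(image)  # [0.1, 0.5, 0.5, 0.9] - degraded, yet clearly derived from the original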
purleyboy@reddit
A good analogy is to think of LLMs as storing concepts rather than raw data. When you hear about LLMs being built on vector databases, think of tokens as living in a multidimensional vector space. Each token, and higher-order sets of tokens (words), is then associated with others based on how close they are in the vector space. So 'cat' and 'dog' may be very close in a sub-space for 'domestic animals'. By having billions of dimensions, we get emergent properties from LLMs that we are still trying to understand. It's mind-blowingly amazing. Once in a while we get unexpected behaviors (e.g. hallucinations). We also sometimes see evidence of memorization, but this is unusual, and removing it is an active area of study.
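A toy illustration of that closeness idea, with completely made-up 3-dimensional vectors (real models use hundreds or thousands of dimensions):

import math

embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "carburetor": [0.1, 0.05, 0.95],
}

# Cosine similarity: near 1.0 means pointing the same way, near 0 means unrelated.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

print(cosine(embeddings["cat"], embeddings["dog"]))         # close to 1.0
print(cosine(embeddings["cat"], embeddings["carburetor"]))  # much lower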
a_marklar@reddit
A good way to think of LLMs (and NNs) is lossy compression. Get garbage like concepts, hallucinations, etc, etc out of your mind. Ironically, that is all noise.
Where did you hear this? The vast majority of neural networks are completely deterministic by default. They are literally pure functions. Are you thinking of MoE systems like ChatGPT that are not deterministic?
purleyboy@reddit
Yes, I'm talking about LLMs. Their output is non-deterministic: run the same prompt multiple times and you'll get a different output each time.
a_marklar@reddit
I just ran the same prompt 10x through llama.cpp with temperature=0.0 and got the same output each time. Which makes sense because that is how it is supposed to work.
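The temperature knob is the whole story here. A rough sketch of how sampling typically works (illustrative only, not any particular library's internals):

import math
import random

def sample_next_token(logits, temperature=1.0):
    # temperature == 0 -> greedy argmax: same prompt, same output, every time
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # otherwise: softmax over temperature-scaled logits, then a random draw
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.5]
print([sample_next_token(logits, 0.0) for _ in range(5)])  # always token 0
print([sample_next_token(logits, 1.0) for _ in range(5)])  # varies between runs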
Moleculor@reddit
The trick here is to get an LLM to reproduce a sufficient amount of copyrighted code that a human reproducing such would also be found guilty of copyright infringement.
Even if you actively tell an LLM "please be 100% deterministic", the results I've seen aren't likely to contain more than short snippets of code from existing sources, if they draw on existing sources at all (they often don't). And in the cases where they do produce snippets that seem to match code that exists elsewhere, the snippets are small enough that it's reasonable to think another human might have produced the same code without ever having seen the pre-existing code.
If we can successfully argue that a human who produced a snippet of code didn't copy it from existing code (because there are only so many ways to solve a problem), then should we be holding LLMs to some impossible higher standard? Eventually you get to snippets of code so small that there are literally only a few ways, or even just one way, of writing them. (Example: #include <iostream>.)
purleyboy@reddit
Ah, yes. Agreed when temperature is 0.0.
bruisedandbroke@reddit
open source doesn't mean free. they're being trained on code which is licensed under the GNU GPL and other strong copyleft licenses, and the result ends up in closed source software which violates the terms of these licenses
copilot doesn't discriminate! everyone's work gets stolen
CyberKiller40@reddit
And that part of the lawsuit is still ongoing. The dismissed piece was the one that looked like a hoax to get a DMCA takedown against the whole tool. The licensing issues are still being decided.
TomWithTime@reddit
I wish we would either make a decision here or give up. It's not OK to use some code as-is, but it is OK to use it to inspire almost identical code? That seems practically worthless.
Whether it's AI art or AI code, can't you just claim (or, as a corp / wealthy person with the money to do so, actually follow through with) that your training data was a human-built and legally distinct replica? Based on my understanding of AI, that should yield an identical or almost identical model, so the obstacles we're trying to establish only obstruct non-corporate entities, and we aren't protecting the original either way.
I understand that's basically been IP/copyright law forever, it just seems so stupid.
CyberKiller40@reddit
When you know how the AI tools work to generate things, you stop worrying. You can get an identical replica, yes, if you set the similarity factor too close. But in general the content in the neural network isn't code or anything at all, and the NN can't know what it is. Generation starts as random noise that is iteratively regenerated to be similar to a source. With enough iterations it would be identical; the point is to stop iterating at a reasonable point.
Another thing is that people attempt to protect/compare pieces of code that are too small. Similar to the SCO vs Novell case, where most of the "copyrighted" code was variable declarations like 'int i;'. It's like that here too: basic functions with only a small number of reasonably optimal solutions will end up very, very similar even when humans write them.
TomWithTime@reddit
That reflects my understanding, I take issue with the human elements in play. We make arbitrary rules that are easy to bend or bypass
That's what I'm thinking, I just hope the people who enforce the rules can understand
StickiStickman@reddit
Man, if reading other peoples code and learning from that counts as stealing to you, then we might as well shut down the entire internet.
zanza19@reddit
Machines and humans are different things, and that is well established in law.
The claim that this is "learning" is quite absurd. This isn't learning, it's training a robot so they can sell it.
Moleculor@reddit
The point is that it's not established in law, as demonstrated by the charges that were thrown out with prejudice.
StickiStickman@reddit
It is quite literally learning by any definition of the word.
joe1134206@reddit
Nope. It isn't conscious. It does what it's told, like any computer software. It doesn't understand concepts in true human terms. You wouldn't say your phone is "learning" when you update your Domino's pizza app, but AI evangelists have adopted this holier-than-thou gatekeeping attitude about it that hasn't materialized in reality whatsoever.
r3drocket@reddit
To me, the distinction is that a large company is going to produce a large amount of profit off of this thing, and while the thing does benefit society in general, perhaps it should be priced to produce less profit, or be free or very cheap to use.
And anytime an AI model starts to take jobs away from people, we should wonder whether or not that's a good thing. It's a pretty big slap in the face if your data was used to train something that ultimately puts you out of work, benefiting some capitalist and worsening our overall inequality.
If you read about Adobe's efforts to produce an image generation model that doesn't disenfranchise artists, you realize very quickly that it still disenfranchises them, because it still reproduces artwork very similar to the things the artists are trying to sell, even though they were paid for training on their art. It effectively allows Adobe to compete against anybody who wants to produce artwork.
ElijahQuoro@reddit
Does it mean that if you have read open source code, you should never apply ideas and approaches you have seen there? Where is the line here?
space_interprise@reddit
The line is between ideas and actual lines of code. Since AI isn't capable of truly creating something, only predicting the most likely next word, it can become an issue when you get more specific and the most likely next word is a direct copy of someone's code.
travelsonic@reddit
Being creative, maybe; being able to make creative decisions, yes (200% agreed). But, IDK, perhaps it's my pedantic side: denying "creating" in the literal sense seems like a stretch. If a combination of elements from places Y and Z didn't exist before and now it does, surely it was "created", whether man or machine made it? It didn't will itself into existence out of thin air.
StickiStickman@reddit
People still say this, seriously? Of course it can. I can make Llama generate a poem that has never existed before, just like Github Copilot can generate new code.
So far the only instance I've seen of it actually reusing code verbatim is for very well-known code that appears many times in the dataset, and only when they specifically tried to get it to do that, like Quake III's fast inverse square root.
space_interprise@reddit
One can still make a new poem by guessing the most likely next word. The thing is, how does a computer, a deterministic machine, truly make something new? And if you try to make your own AI you will notice: the smaller the dataset, the more likely it is to just repeat the data back, since there isn't much variation in what the most likely next word can be.
ElijahQuoro@reddit
Are we truly creating something? Don't people just refurbish existing ideas, with some rare insights? I think we need a more critical view of this to define our policies.
IMO, copyrights on code, algorithms, licenses - they all suck. What should be actually sold is maintenance, real ongoing work on existing software which is used somewhere, OSS at least got this part right.
space_interprise@reddit
I do believe that there's still a lot of innovation to be made, especially given that a new finding in chemistry can affect the process for making a chip, which can cause a revolution in how we write code, for example.
As for the license thing, I do agree, but maybe expand it to account for things like: this project is free and open source for personal and small-business usage, but requires a license for high-end usage. I think it's fair that a developer makes some money if a company using their project is making a lot out of it, especially since that usually also correlates with a lot of maintenance work for the project.
bruisedandbroke@reddit
if you're using other people's libraries and snippets, you follow the terms of their license. do it all you want under MIT or the unlicense but other libraries require attribution and for your code to be licensed under the same license.
people do this all the time, it's how you learn. if you're talking about design patterns, it's pretty much the only way to learn. but if you're talking about copying people's source code, it's theft if your software is closed source.
the line has been drawn since before I was alive! Microsoft is trying to ride out the AI bubble until it dries up, relying on their expensive lawyers and the flawed American legal system to escape repercussions.
andrerav@reddit
Using a GPL licensed library (in the form of a linked library, package, etc) does not mean you have to license your own code as GPL.
Copying the code (or part of the code) from a GPL licensed library does mean that you have to license your own code as GPL.
MidgetAbilities@reddit
I think you’re thinking of LGPL. My understanding is using a GPL package will indeed cause your own code to require release under GPL as well (assuming you distribute the code in any capacity to users outside your organization).
purleyboy@reddit
Distribution is key here. If your code is running on your web server then it is not distributed.
MidgetAbilities@reddit
For GPL that is correct. But some licenses may define distribution differently. Most notable is AGPL which considers someone accessing the software over a network to be distribution as well. This isn’t a very common license though.
C_Madison@reddit
It does, at least according to the FSF, who created the GPL. That's the whole point of the GPL and why there's a separate LGPL. That's why the GPL is also called a "viral" license: using it "infects" your code. There are people who see it differently, but as long as there's no case law on it, it's probably a good idea to follow what the authors say the intention was. For me this means: no GPL anywhere near my code.
waterkip@reddit
Yes you do, that is why they also have LGPL.
zanza19@reddit
Human and machine are different, that's quite a clear line.
theQuandary@reddit
If I read and memorize GPL code, I'm not free to write it down and use it without permission.
LLMs are notorious for spitting out exactly what they saw (there have been a few big security complaints about them doing this with copied secrets).
Retaining and using copyrighted material without permission is the purest definition of infringement. Doing it willingly and for commercial gain is just icing on top.
purleyboy@reddit
We live in interesting times, this situation is new and requires legal clarifications. OSS usage licenses typically require attribution if the actual code or a derivative is directly used. The question then is whether an LLM falls under this. To me, the weights in an LLM are not a copy of the originating training material and so should not fall under any existing OSS licensing conditions. I think that's the real legal crux of the issue currently being reviewed in court. I feel that the use of LLMs for everything is about to potentially upset any historical notion of copyright. As I said at the start, we live in interesting times.
bruisedandbroke@reddit
i can't help but worry, but I guess we all have to wait and see, right?
Heuristics@reddit
It's perfectly fine for society to decide that such license restrictions are not applicable to AI.
bruisedandbroke@reddit
I wasn't invited to the society meeting then, and I don't think any developers were ? this type of thing should be strict opt in only. we've failed people by not creating AI legislation sooner.
Heuristics@reddit
You would not be. It's up to the courts to decide, up to the politicians to appoint the judges, and up to the public to elect the politicians.
Pharisaeus@reddit
100% they are not. MS might claim that's the case, just as Facebook will pretend they don't use private messages and Google doesn't use YouTube content, but none of that is true. Access to that content is a "competitive edge" they have over the competition, and you can bet they are using it, especially in the case of diffusion models where it would be hard to "prove" they did it.
On top of that, code being "public" or even "open source" still does not give anyone right to re-use it without proper attribution.
purleyboy@reddit
I would assume that the legal argument is that the underlying LLM does not store, nor have access to, any training data. The LLM has been trained just like a human coder would learn by looking at public or OSS code. As for whether private IP is being used for training in an unauthorized manner, that would appear to be an easy case against GitHub; however, you assert this is happening with no evidence.
Pharisaeus@reddit
If you memorize a piece of GPL code and verbatim reproduce it, I assure you it would violate the license, even though technically you just "learned it" and not copied ;)
purleyboy@reddit
There's a difference between rote memorization (copying), and learning and applying a technique. LLMs do the latter. They cannot do direct copying as they do not store a copy of the original training data.
Pharisaeus@reddit
Not directly, but they store probability of "continuation" and for specific enough prefix then can totally verbatim reproduce parts of the training data. See the lawsuit between New York Times and OpenAI - the plaintiff was able to prompt ChatGPT into reproducing direct quotes from certain articles.
purleyboy@reddit
You're describing 'memorization' in LLMs; this can happen, but it is exceptional and not the norm. The NYT claim is ongoing; the counterclaim is that the NYT hired a company to construct the claim, and that it took them >10,000 refined prompt attempts to generate the output that forms the basis of the claim. We'll see how that plays out in court. I believe GitHub checks generated code against a database of OSS code to ensure there are no verbatim copies.
The difficulty here is that there is no good test for what is fair use and what is copying that needs attribution. A single line of code would not be considered copying, but at what point is something considered copying? There are very few successful lawsuits that have set precedent here.
Necessary-Signal-715@reddit
Open source is not the same as licensed. If you upload your code somewhere but do not include any license document (e.g. a license.txt file), no usage rights are granted to anyone (in most jurisdictions). Rights must be granted explicitly. You may learn from the code and use your gained knowledge to reimplement something similar, but you may not use it directly.
The question is whether AI really learns or just pieces together information. The same question could be asked of humans, though; there are no precise legal definitions of what separates "learning from it and reimplementing something similar yourself" from "copying and rephrasing it".
I don't think the problem can be solved by intellectual property laws, especially not when international laws are even harder to enforce than local ones. At this point we should rather think about how to expand the current financing model, where a few governments and big companies invest in open source software and open standards, into something more cooperative and with less political influence from single governments/companies.
purleyboy@reddit
I agree completely with your opinion. I think there's a fairly common misunderstanding about how training LLMs works. The training process is similar to how individuals learn from publicly accessible code. The generated code is non-deterministic for significant chunks of code. LLMs do not store or copy the actual source code used for training.
Berkyjay@reddit
Dude, most of us have public repos and share the shit out of our code. We aren't medieval alchemists jealously guarding our work.
MaleficentFig7578@reddit
We get to use their code too. We get to train AIs on the Windows source code leak with impunity.
bastardoperator@reddit
Massive corporations also build compilers, build tools and even entire languages… and you have never once attributed any of those tools, despite the fact that you wouldn't be able to do dick without them.
This entire case will be dismissed; the other portions of the case have already been dismissed with prejudice. I think most OSS licenses are about to be invalidated in terms of financial losses. You can't say it's free and priceless at the same time.
If you actually care about open source use a public domain license.
ThatInternetGuy@reddit
They are not using your code. They are training an AI that learns how to code from your code. That's a big difference. Unless you can create a prompt that forces the AI to spit out exact lines of your code, the AI company wins.
r3drocket@reddit
Years ago I wrote some code in a fairly uncommon language to generate stuff in 3D. I have gotten Copilot, not consistently, but at least once, to spit out the variable names and function names that I used in that code. It was not an exact reproduction of the code, but it was close enough to convince me it had clearly trained on my code.
Jmc_da_boss@reddit
If you don't want corporations using your code don't open source it
aykcak@reddit
Not surprising though. The case was set up to fail. Copyright was entirely the wrong angle. I have no idea what they were trying to achieve here
Ginn_and_Juice@reddit
The good thing is that 95% of code is shit code, and they can't comb through all the shit to build a filter, so the model will only produce shittier code over time.
RoboticElfJedi@reddit
Who do you think will benefit if there is more copyright? You, and the 4 cents of royalties you hope to get for your code, or Big Content?
Sadly in my view letting the likes of Altman have at it is the lesser of two evils. At least this way I get to use an AI when it helps.
Eirenarch@reddit
Joke's on them, I use their tool to write code for me muahahahaha
umtala@reddit
Apparently nobody read the article.
They claimed that Microsoft was violating DMCA. They didn't provide evidence of this claim, and in fact their evidence showed the opposite of what they were trying to prove. Therefore those claims were dismissed. The copyright infringement claims were not dismissed.
lookmeat@reddit
And honestly it's the wrong way to fight this. I am surprised that there hasn't been better organization around defining the rights here. I guess the IP companies with a lot of lawyers already realized this and just did contracts behind the scenes.
What AI is doing is very much in the space of "fair use", because they don't generate a copy of the art, but a program that contains inspiration from the art and has the potential to create a copy of it. You know what else has the potential to make a copy? Photographs. But you can't sue camera makers for copyright infringement! Hell, you can't even sue the photographer, only whoever publishes a picture containing your IP (and even then there are a lot of fair use stipulations: a picture containing your work isn't a copyright infringement, but a photocopy of it is).
That said, copyright lets you forbid use of a work for certain things without a license. Artists can license their work for ML use separately from other uses. Basically, extend the Creative Commons, GPL and other popular "free to use, but still protecting the artist" licenses to disallow use of the art for ML training. A separate commercial-ML license could be proposed for those uses.
That said, I don't think developers worry about their code getting copied this way enough to warrant an exodus from GitHub (and leaving is the only way to keep your code out of a place where you're forced to allow its use for ML training), so, honestly, I don't see this mattering enough in the long run. A few people who don't know enough about coding or ML, but know just enough to own code and fear it being copied, will take this far, but I doubt it'll make it through.
jherico@reddit
This. Training an LLM is the very definition of transformative.
Uristqwerty@reddit
Training may be transformative, but using it to generate new content similar to its training set isn't. If the whole process from scraping to generating were [A -> B -> C], saying the [A -> B] subset counts as transformative doesn't necessarily mean that the whole is as well. Especially since [B -> C] is specifically judged by the training algorithm based on how well it matches [B -> A].
lookmeat@reddit
It still works. If I use AI to make a perfect copy of someone else's work, I committed copyright infringement^1. Take a simple example: I take a photograph of a non-public-domain painting in a museum. Then I use this photograph to print out copies of the painting and sell them without the permission of the copyright holder. Clearly I am the one who infringed, and arguing that the camera manufacturer made a lot of effort to ensure that photographs are very detailed copies, or that the printer's sole job is to make copies, would get me nowhere. It's my responsibility to make sure I'm not infringing any IP in how I prompt the AI. As long as the software isn't producing perfect copies, it's no different than asking a really talented kid to make you a picture perfectly in the style of another artist.
The thing is that AI models get their value from knowing how to do multiple styles, and the content used to train them might not have been available for that purpose for free. An artist may choose to put their pictures up for public viewing online, but may want money for any commercial use of their artwork. AI isn't the first case of this; you should see how many companies use memes for commercial purposes without considering that the art may have a creator who holds the copyright - and those are large companies.
IANAL, but artists probably want to get together and sue ML creators who used their art commercially without paying for a license. It's gonna get ugly, and it will probably make it to the SC (unless the ML companies are willing to concede to the artists early). You'd have to sell the argument that training an AI you're going to sell is commercial use, and that they couldn't have used other art (as they wouldn't otherwise be able to train the AI to do that artist's style, lessening the value of the model), and that therefore the artists should have been paid. Otherwise, going forward, licenses must consider use for ML models, prohibiting free use for that purpose.
^1 There's a very interesting philosophical/legal discussion about what happens if I never had access to the original, never interacted with it, and wasn't influenced by it, but the AI did use it for training. Without the AI it'd be a clear "clean room" implementation and would be fair use, but is it really a clean room if the AI has knowledge of the original embedded within itself?
meltbox@reddit
Hard disagree. This doesn't map cleanly to the camera metaphor, because camera manufacturers give you a tool with which to reproduce; if you use that tool to ingest copyrighted material, you are responsible for that.
The problem with large AI models is not the model itself but the trained models, which are already "sold" to the end user with the copyrighted data ingested. It would be like Sony selling me a camera with whole copyrighted books already on the SD card, which IS illegal.
lookmeat@reddit
You got it right in that it isn't the camera.
You got it wrong in that the copyrighted data is not contained therein. Say I read a copyrighted book on how to write better, then I write a novel using the rules and guidelines from that book, to the point that my novel contains examples of all the book's rules, though not its contents. Someone could, just by reading my novel, learn the same writing techniques without reading the original book. Tell me, is that plagiarism?
Like I said in my first example, it does open the question of what counts as a clean-room implementation. But let's be clear: the ML model does not contain any work on its own, just enough knowledge that it could recreate it. If we are going to say that is a problem, it's going to cause a huge problem for writers' groups.
So rather, it's like Sony analyzing paintings and pictures all over the world to define a color scheme (think RGB) and then making a camera that uses it. The camera embodies content and knowledge that came from analyzing other works. But is it plagiarism?
So I hope those examples explain that the AI doesn't contain the work itself. Otherwise any author who has consumed a copyrighted work and was inspired by it could be liable.
And again, it's not like authors have no protection or way to defend themselves. They only need to explicitly state that they do not grant a license for commercial ML use. And they could try to sue retroactively, arguing they never gave that right away and the companies just assumed it. It'll be a complicated lawsuit, and one that will set new precedents (this tech is a unique situation), but it actually has a shot.
meltbox@reddit
Perhaps. I guess I just don't agree that human and AI interpretation are equivalent. For one thing, humans can often claim fair use because we don't train a human on text just to then output text. Humans train on text, movies, images, emotions, real-life experiences, etc., and then output some combination of it all into a unique text. Human and LLM learning are not analogous, and even the creator of neural nets admitted that they shouldn't have been called that, because they don't really mimic how neurons work.
AI today is also largely single-discipline, with some glue logic to make separate models seem merged together, plus some hard logic filters to keep undesired stuff from coming out, like, say... black Nazis or copyrighted code with GPL headers.
Also, color spaces, for example, aren't based on works but rather on human perception, and maybe behavioral and neuroscience research. While works might be sampled to understand which areas of that perception to prioritize, they aren't actually encoded in the color space in any way.
It's like someone else pointed out: pi contains a copyrighted movie in it, but if you need the source material, or information as large as the source material (a series of pointers), to decode the copyrighted movie from pi, then the copyright infringement is not in pi but rather in the pointer which decodes it.
Just like a colorspace doesn’t infringe but rather the representation of the copyrighted work in that colorspace.
meltbox@reddit
By this definition so is zipping up a movie. Look ma, the bits are different now!
Come on…
jherico@reddit
No, it's not. Zipping is a lossless, bi-directional encoding system. Training a network is ENTIRELY different.
Your analogy is like saying that showing a movie to someone and having them be able to recount the plot is the same as pointing a camera at a movie and recording the whole thing.
Kinglink@reddit
The judge basically says that it's not memorizing code, which implies it isn't just copying code. (Which is absolutely correct, so at least the judge seems to understand what AIs are doing.)
Copyright infringement is probably going to get settled. I don't see Microsoft trying to argue "we trained the AI on material we had no permission to train on", but maybe they're just mad enough to try it.
PaintItPurple@reddit
"Rarely emits memorized code in benign situation" is basically the opposite of "is not memorizing code." It means that the software is memorizing code, but is programmed to avoid emitting that code verbatim under normal circumstances.
Kinglink@reddit
No it's not memorizing code.
If you think an LLM contains every piece of code it was trained on somewhere inside it, that would be literally impossible. I used the comparison of a 16 gig model trained on 4 billion pictures: if each picture were in there, each would amount to 4 bytes of data. That's just physically impossible. The same is true for code models: they don't have every piece of code sitting around in their model.
What is going on is that there are weights, and if you have certain weights just right, the model might recreate exact input it trained on. But it's rather hard to do that reliably without heavily influencing the data set.
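The back-of-the-envelope arithmetic behind that comparison (the 16 GB and 4 billion figures are the ones from the comment above, not real model stats):

model_bytes = 16 * 2**30      # a 16 GB model
images = 4_000_000_000        # 4 billion training images
print(model_bytes / images)   # ~4.29 bytes per image - far too small to store copies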
meltbox@reddit
It’s compressing it. Is a zip file no longer the zipped content? Sure. But it contains a synthesized version of it.
Neural nets are just lossy compression really.
Kinglink@reddit
Again, no it's not. How are you taking 4 billion DISTINCT images and compressing them into a 16 gig model?
This isn't r/idiot. This is r/programming; you should know that level of compression is literally impossible, even if you consider it "lossy".
meltbox@reddit
It's compressing the patterns commonly linked to words. It's why image generation nets can be trained to output a likeness of cats or a ball. They learn the attributes important to those and compress them down into a series of weights which represent an axis.
That’s the cool thing about neural nets. Turns out you CAN compress that much data. It’s just lossy. Turns out most of that lost data isn’t really that important.
Also they make mistakes hence the occasionally cursed stuff they can output. Hence the lossy part.
Kinglink@reddit
You can make up any BS you want, but you didn't address anything I said. Explain how 4 billion DISTINCT images are compressed into 16 gigs... Saying "it's lossy" means nothing, and in fact proves it's NOT a copy. But even if it's the most lossy thing ever, 4 bytes an image is not a copy in ANY universe.
Which proves it's NOT COPYING but learning.
Nice try dude, but you're basically proving you either are ignoring what you understand about the technology, or are just dumb enough to think that's copying. Either way you already understand what it's doing... Stop trying to say it's compressing or copying. You know it's not.
meltbox@reddit
Copying a likeness is copying. Again I said lossy compression, not lossless. You’re entirely right for lossless. But copying a portion of something is still copying.
Hence how Disney can sue you for drawing their characters even if they’re not carbon copied directly from a scene Disney has rendered or drawn.
I think maybe we fundamentally differ on what we consider copying.
Kinglink@reddit
No... even your definition of copying is wrong in this case... You still haven't explained how 16 gigs can represent 4 billion images. You just keep saying "compression".
Here's a compression scheme: I'll just put a 1 down if there's any green. That's compression!!!
Except, again, that's not how it works. Now stop annoying me with your stupidity; if you really want to have this conversation, educate yourself and then talk to someone else. You've wasted enough of my time.
josefx@reddit
So, going by all this, if I "train" an AI on ten thousand movies and it just happens to spit out a perfect copy of Terminator 2, I am legally in the clear? Because it clearly could not store all the movies, and the fact that it did end up outputting Terminator 2 is just an AI thing that sometimes happens and copyright holders have to accept?
DeadlockAsync@reddit
That would be so statistically impossible that it would be evidence that you, as the user, coerced the AI to produce the output.
It'd be the same thing as using your phone's autocorrect to output a novel. If the novel your phone's autocorrect output perfectly matched an existing novel, then it was you coercing it to do so, not the phone spontaneously outputting it.
meltbox@reddit
Modern keyboard next-word prediction would best be served by an LLM, by the way. It's usually a worse version of exactly that.
But also, if your keyboard LLM were trained on a book, it would be more likely to output that book. I.e. if you typed out the first sentence yourself, it would very likely type out the rest of the first paragraph for you, either verbatim or damn close, depending on the context window and how much other data it was trained on.
And then are you really arguing that a 90% match of the first paragraph is not copyright infringement?
But keyboards instead use word distributions to predict the next most likely word and have no significant context window.
And this is also one huge reason a model won’t output the terminator in full when trained on it. Context window too small. Plus it’s merging the terminator with every other movie it’s seen and being seeded with random values.
Reddit pisses me off because clearly people don’t know how neural nets work. They’re just efficient lossy information compression and representation. The model can also only extrapolate in dimensions it already has represented by some weight. If the dimension it has to manipulate isn’t represented by some combination of those weights then the model is going to spit out nonsense.
This is why data cleaning is so important because it frees up weights to be used for what you need and not noise.
DeadlockAsync@reddit
No, I am arguing that the copyright infringer is the user, not the LLM/system.
I don't think I am the user you meant to respond to either based off the contents of your comment.
Ur-Best-Friend@reddit
That's a remarkably good example.
Kinglink@reddit
Have you ever heard of clean room development? In one room, a developer reverse-engineers a piece of software and writes up how it works, passing that description through a narrow slot to a second room, where another developer who has never seen the original creates their own version of the software.
This allows the person in the second room to create a similar BIOS, potentially even an identical one, while being able to prove that copyright wasn't violated. And yes, it IS a defense against a copyright claim.
Let's ask an alternative question. Assume that an AI never watched Terminator 2 and spit out a perfect copy of Terminator 2... are you legally in the clear? What if a group of people did the same thing and were proven to never have seen Terminator 2?
But you're also asking for an AI to copy a 2-hour movie exactly. When code or images are copied, it's usually sections of the image, or snippets of code. Not entire files...
So really your comparison doesn't work. What you'd probably see is an AI generating a scene very similar to Terminator 2... and yet we have that too in movies; people call it an homage. How many movies and TV shows have used the Akira bike slide? About 40 of them can be seen right there. Those aren't frame-for-frame recreations, and again, you're not going to get a frame-for-frame recreation, at least not without a lot of work.
A lot of work that's going to look more and more like a clean room process when you really start to look at it.
meltbox@reddit
There’s still a difference between trying to record or create a scene which looks and feels like a Terminator scene vs what an AI will do which is essentially recall parameters of that scene and modify it slightly to not be exactly that scene.
For example if you take a copyrighted work and change some details on top you as a creator will still get sued. So why would AI be allowed to? Because it was harder to tell/prove? That’s an insane argument.
meltbox@reddit
This. "Rarely emits copyrighted material" would be like saying Napster was fine because the majority of the content was legal, so long as you didn't add "pirated" to the end of the search.
Stupid ruling on this basis alone.
Ur-Best-Friend@reddit
Even if that were the case, which it's not, that actually wouldn't be copyright infringement. There are plenty of provisions protecting transformative art. I'm not sure there's a single song by The Prodigy that doesn't heavily employ the use of samples, sometimes from several dozen songs in a single one of their tracks. If that were copyright infringement, they'd long since have been sued into the ground. Same with collage art etc.
If the end result is significantly different from what it was based on, that's not copyright infringement.
PaintItPurple@reddit
I can't speak to The Prodigy specifically, but big artists who don't get sued usually will get permission to use samples. Otherwise they can and do get sued. Heck, Robin Thicke's Blurred Lines didn't even use a direct sample and actually re-did the Marvin Gaye song they borrowed from, and they still got sued and had to pay out a bunch of money. Making the leap from "derivative" to "transformative" is not as trivial as you're making it out to be. In fact, artists who are sued this way almost never even try to argue that the use is transformative, and instead will often try to argue that the use of the copyrighted work is too minimal to matter.
hardolaf@reddit
It kind of is "memorizing" the code though. It's just that because of the trimming process, it becomes less precise and thus less likely to regurgitate the training data.
Kinglink@reddit
I like the word "training" because that's kind of what it's doing. In a lot of ways it's like training a person: if you take a junior and show him the first function he has ever seen... he can write that function. He probably can't write another function; maybe he's able to make obvious changes (strings, for instance), but even then, I remember students who thought the strings were immutable or else the program wouldn't work.
Then you show them more and more programs and they'll start realizing that each of the things are pieces that can be combined in other ways.
LLMs aren't (that) inventive, or creative (depends what you call their hallucinations, but those are kind of crap), but go down the same path.
The difference between a human and an LLM is you can teach the human the rules, but teaching an LLM the rules kind of violates the concept of AI, because the idea is you're focused on training data to generate the internal rule set, rather than specifying hard-coded rules.
That being said, with any suitable training size it should be unable to remember any specific code. Though when you ask for a specific well-known piece of code like Q_rsqrt, don't be shocked when it is able to recreate it. Again, most people who really focus on it probably can too, and the one piece they'd forget (the specific magic numbers) is exactly what computers are good at memorizing.
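(For reference, since Q_rsqrt came up: the original is C from Quake III, and it's been reproduced all over the internet. Here's a Python rendering, purely for illustration, showing that the only part people actually forget is the magic constant:)

```python
import struct

def q_rsqrt(number):
    # Reinterpret the float's bits as a 32-bit integer (the part everyone remembers).
    i = struct.unpack('<i', struct.pack('<f', number))[0]
    i = 0x5F3759DF - (i >> 1)  # the magic constant (the part everyone forgets)
    y = struct.unpack('<f', struct.pack('<i', i))[0]
    return y * (1.5 - 0.5 * number * y * y)  # one Newton-Raphson step

print(q_rsqrt(4.0))  # ~0.4998, vs the exact 0.5
```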
hardolaf@reddit
The difference being that humans have memory and LLMs don't. LLMs are dumb and never actually "learn" anything. They're just a correlation graph for an instantaneous data point, generating a response based on some parameters. LLMs don't think, and they don't actually learn anything even during training. All training is doing is creating a correlation-graph function that takes some input and creates some output. They're dumb, and they can be trivially made to regurgitate all of the training data prior to trimming. After trimming, they have a harder time outputting the training data because a lot of it has been trimmed, but you can still often find full copies of training data that can be coaxed out of them with the correct input.
Drawing parallels between humans and neural networks only makes them appear more magical than they really are, which is to say a 40-50 year old math concept that's incredibly dumb and usually worse than just hiring more qualified people to build a better product for you via better methods. LLMs are largely just a way to spend money on CapEx and OpEx without increasing headcount, because people management is a lot more complicated than scaling hardware and jobs up or down, and management hasn't yet been replaced by unfeeling androids who will casually cut 50% of staff at a moment's notice to make the books work out.
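To put "correlation graph function" in concrete terms, here's a minimal sketch (mine, not any particular model) of what inference actually is: frozen numbers crunched against an input, with nothing remembered between calls:

```python
import numpy as np

# A trained net at inference time is just a fixed function of its input:
# the weights are frozen, and no state survives between calls.
def forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, W1 @ x + b1)  # hidden layer with ReLU
    return W2 @ h + b2                # output scores

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(np.array([1.0, 2.0, 3.0]), W1, b1, W2, b2))
```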
jherico@reddit
I love it when people make points like this when we don't have a very good understanding of how human memory works, other than it's done using biological neural nets.
An LLM's memory is an elaborate encoding of concepts into an N-dimensional vector space, implemented as a set of neural nets. How do you know how similar or different this kind of memory is from that of a human? Are you a pioneering researcher into the way the human brain functions?
Training an LLM and training a human have roughly the same outcome... they can respond to queries in a sensible way based on what they've been exposed to. Arguments related to LLMs vs humans should focus on the differences and similarities that can be quantified. Just saying that LLMs have no memory or that it's fundamentally different from human long term memory is at best an "educated wish".
meltbox@reddit
We don’t for sure. But go take a look at the Arc AI challenge. It’s fantastic evidence that AIs are pretty much all recall no reasoning. Humans appear to be able to solve far more complex problems than AI with far less known underlying knowledge.
This would imply AI is mostly just a huge search machine with great pattern matching and not at all like how humans typically figure out complex problems.
It also implies they fundamentally do not function like humans as modern GPUs supposedly have similar raw processing power to a human brain.
It’s something about how we are wired vs a simple matrix cruncher.
Uristqwerty@reddit
I'd have to dig through my watch history, but I saw a video some months ago about how memories are stored (I think in mice, but presumably there are similarities). It seemed similar to a bloom filter in reverse, where all sorts of random factors that happened to be triggered at the moment the memory was created combined into a key, and if some fraction of those factors line up in the future, it recalls the memory. They also slightly change every time they're recalled, to better fit the new context, and there was something about how the memory itself is encoded as a sequence of neurons triggering in time. I think there was also something about them playing back in reverse.
That could all be complete misremembered bunk, but the underlying point is that neuroscientists have figured out a lot in the past half-century.
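If I'm remembering the video right, the keying idea would look something like this hypothetical sketch (again, the biology is my recollection, and this code is only an illustration of the idea, not a model of real neurons):

```python
# Hypothetical: a memory keyed by the context features active when it formed;
# it is recalled when enough of those features reappear later.
memory = {"features": {"rainy", "coffee", "tuesday", "jazz"}, "content": "that conversation"}

def recall(mem, current_context, threshold=0.5):
    overlap = len(mem["features"] & current_context) / len(mem["features"])
    return mem["content"] if overlap >= threshold else None

print(recall(memory, {"rainy", "jazz", "monday"}))  # 2/4 overlap -> recalled
```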
hardolaf@reddit
Which are, let's be clear, absolutely nothing like the mathematical neural nets invented 50-ish years ago, which were named "neural nets" as a marketing ploy, and which the author of the paper coining the term would later go on to apologize for naming that way, after he learned more about how neurons worked from biologists.
jherico@reddit
You're reaching. You can say there are important distinctions but ANNs are not "absolutely nothing like" biological neural nets. They have MANY equivalent concepts and ANNs were inspired by a growing understanding of biological neural nets. One of the paper's authors was, after all, a neurophysiologist.
Are you talking about McCulloch or Pitts? Because there were two authors on the paper. Either way, "citation needed". However, even if one or both of them did express regret at the use of the term, they would likely have been specifically talking about its use for the concepts they described in their 1943 paper, which predates widespread use of computers and many advancements in how we build artificial neural nets. As our understanding of biological NNs has increased, so has the complexity of what we consider a neural net in computing contexts.
EnglishMobster@reddit
LLMs very much do have memory.
What do you think checkpoints are? What do you think determines the weights in a model? What do you think sets different models apart?
If it's random chance and it couldn't "remember" anything, then LLMs would never be able to go through the training process. It remembers patterns, and those patterns inform weights in the data.
This is very similar to how a brain works, and that's intentional. Before they were called "LLMs" or "AI", these were called neural networks - because they work like neurons.
Brains are experts at finding and recognizing patterns. We only can do that because of our memory. LLMs also recognize patterns. They, too, can only do so because they have memory.
hardolaf@reddit
LLMs have a very limited amount of memory and no ability to actually remember things without having them in immediate, short-term memory in the running program. So it's more correct to say that they have state, but each time you restart the program they have only the same initial state and no ability to remember anything from prior sessions unless you reload the exact state, which is not how memory works in brains.
The topic of long-term memory for LLMs is still very much in its infancy with papers still being published this year on ways to achieve some sort of long-term memory.
As for being like human memory, there is a growing body of research which is explicitly demonstrating that neural nets (and LLMs by derivation) are fundamentally different from biological brains.
Also, please don't reinvent history as to the naming. Neural nets were so named because their graph structure reminded the inventor of how neurons appeared to be interconnected. That same researcher would years later say at conferences that he regretted the naming, as further learning on his part had shown him that the two were very dissimilar, and that the naming misled a lot of people into thinking his invention was meant to model neurons in a brain-like structure.
I've also had classes on it and I spend a significant amount of time working on projects that are "AI" adjacent or using "AI" and attending conferences around the hardware development of "AI" accelerators. I don't fault you for getting these things wrong because there is a lot of bullshit out there from professionals and even universities competing for that sweet, sweet "AI" money that everyone seems willing to shell out these days. Heck, you fell into one of the common misinformation tropes around the very origin of the naming of neural nets themselves. And it's not hard to get that misinformation as it's been paraded around for decades at this point even though everyone can go read the original paper and the author's rationale for free at any time.
Ur-Best-Friend@reddit
I never bought the argument that training AI is in any significant way different from a human learning from other humans. If you have an artist you like and you pick up some of their techniques by looking at/listening to their work, are you infringing on their copyrights? Well then every artist that has ever existed is a dirty thief, because no human has ever learned any complicated skill without learning from others who came before them.
As long as the output is significantly different, that's not copyright infringement in my book, and that goes for humans and AI both. People throw a huge fit when an AI song has a chord progression that's superficially similar to a human musician's song, and ignore far more blatantly similar tracks made by other human musicians.
meltbox@reddit
It is literally memorizing the code. Encoding it symbolically in a compressed format doesn't make it any less of a memorization.
It's repeatedly been shown that outside the training dataset, AIs are less capable than humans, even though humans have far less memory retention and recall.
Essentially AIs are just incredibly good recall and pattern-matching machines. Nothing more.
angryloser89@reddit
It sounds like you don't understand what AI is doing?
Kinglink@reddit
Go on.. tell me how it's copying code, I love hearing people explaining technology they don't understand.
emperor000@reddit
If you looked at that code to use as a basis of work you were doing, would you be copying it?
Kinglink@reddit
This is the question. If you learn from a piece of code, no, you wouldn't be copying it. If you wrote the code out line by line the same, that would possibly be copying it, though the size of it definitely will matter.
There's a workaround at almost every company I've heard of: you're not supposed to copy and paste code from Stack Overflow, but rewriting it and making it fit the style and guidelines of the codebase is acceptable. Seems reasonable.
Do people follow that? I think we know the answer is hell no. But if I tell you to print out the line "Hello World", there's only a certain number of ways to do it, and I'm pretty sure the idea of a copyright on that is laughable.
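(To illustrate how few ways there are, in Python it's basically a choice between these, give or take formatting:)

```python
import sys

print("Hello World")
sys.stdout.write("Hello World\n")
```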
st4rdr0id@reddit
Yeah, change a single character and now it's not identical; that's basically what pirates do with republished apps. Besides, what does "rarely" mean here? A ton of test cases with proper metrics would be needed to make that claim scientifically.
meltbox@reddit
Yeah this ruling is another judge with no idea what they’re talking about making a ruling which will haunt us for a century.
FollowTheSnowToday@reddit
This is Reddit not a Wendy's. Of course no one read it.
RyanCacophony@reddit
love the implication that the patrons of Wendys are such ardent academics lol
FollowTheSnowToday@reddit
I've always liked the meme-ish thing. However, at Wendy's you most likely have to read the menu, even if it is pictures.
coderman93@reddit
FYI, this is a quote from The Office.
FollowTheSnowToday@reddit
/r/todayilearned
nzodd@reddit
Oh, I never read those, I always just get the doner kebab.
RyanCacophony@reddit
Oh I totally get the reference, just found it extra funny being used in this context
Kinglink@reddit
I'm glad it's not just me... Do people think Wendy's is the library?
Full-Spectral@reddit
The burgers are better at Wendy's.
Kinglink@reddit
Barely.
mozilla666fox@reddit
To be fair, you get a coloring book and crayons at Wendy's.
Ashamed_Tangerine355@reddit
perfect response lol
not_a_novel_account@reddit
There is no outstanding copyright claim, all the 1202(b) claims which would be "copyright claims" were dismissed.
What's left is a breach of contract claim over the open source licenses, i.e. that the AI models would be legal if they followed the license conditions. For such conditions to be nominally applicable, it would first have to be found that copyright has been violated. Judge Tigar says as much in the June 24 order.
Plaintiffs are allowed to bring the claim, but there's no likely path to victory here.
Also these articles are always trash, read the actual ruling
pixel_of_moral_decay@reddit
Which makes sense.
If you compare code you'll find people regularly write the same thing, since we're solving the same problems with the same tools, and we all learned the same patterns.
The algorithm was trained on data, and data is not subject to copyright; presentation is. As copyright exists today, they don't need the author's consent for that. It's no different than you reading a book and then speaking about what you read, or applying it to your next project. You don't need permission from the author to apply the knowledge, only if you quote them or directly reuse their work.
You can read this comment and write a paper influenced by it, that’s not violating any copyright. Only if you quote it beyond fair use doctrine.
Most of these arguments against AI on the grounds of copyright will lose. That’s fundamentally not how copyright was intended to work and not how it’s written.
VeryDefinedBehavior@reddit
If AI winds up effectively killing copyright as a concept, then maybe AI isn't so bad.
Spitfire1900@reddit
No one seems to understand that the license you attach to your project does not matter if it’s uploaded to GitHub.
The ToS you agree to by using GitHub allows them to train against your software, even if it’s explicitly denied in your project’s license.
Luvax@reddit
It may be a ToS violation to upload code where this permission isn't given, but GitHub does not automatically receive the rights they want.
GrandOpener@reddit
If you upload to a GitHub repository, you have agreed to and are legally bound by their policies. At least in the US, I'm pretty sure the act of uploading your code indicates your continued agreement to their ToS.
Remember that code can be dual licensed. Even if you provide a license that prohibits use as training data, your agreement to the ToS can establish a separate license only to GitHub that permits it. Nothing is contradictory about that situation.
If you don’t agree with their ToS, your only real recourse is to not use GitHub.
Tuna-Fish2@reddit
Only if you have the ability to grant such a license. If, for example, the code was GPL licensed and contained contributions from people other than you, no such license is granted.
mrbaggins@reddit
You failing to adhere to the guidelines is YOUR problem, not GitHub's.
You're saying you're allowed to put it there. If the license that restricts YOU says you can't change the license terms, then you shouldn't put it on a service that you're agreeing will be given those rights.
Tuna-Fish2@reddit
That is true, but Github can only extract money from me, not license to the relevant code. Because I don't have any ability to grant it.
If I submit code I don't own to Github, and then Github uses it in an external project, once the actual owners of the code find out, Github is screwed. I might personally also be screwed, but assuming I don't have the kind of money that corporate copyright cases get settled for (which is a fair bet), this doesn't exactly help Github.
mrbaggins@reddit
That's the problem.
Nope. They'll come after you, because GitHub covered their butt with the agreements you've agreed to.
Tuna-Fish2@reddit
You don't understand. The actual owners of the code don't give a shit about me. No contract or license exists between me and them. They would much rather sue Github, because Github actually has money. The fact that there is a release from me to Github does not protect Github in any way, except that Github is allowed to turn around and sue me for the amount of money they lost when the actual owners of the code sue them. Except that I don't have that much money.
Otherwise you could find a homeless person who agrees to "sell" some copyrighted work to you while claiming to own it, and when the actual copyright holders come at you, you could just point them at the homeless guy.
mrbaggins@reddit
That's nice.
The copyright of the content you're distributing binds you and connects you to them. Same as how I have no "contract" with Disney about my DVD collection, yet I still can't put it on YouTube.
You say that like it magically makes the problem go away.
What would happen is that in the opening phase of the lawsuit against GitHub, GitHub will name you as another party, you will both be sued, and GitHub will win their defense because they can prove it's your fault, leaving you on the hook not only for the judgement, but also likely for GitHub's costs as well.
Congrats, you invented shell companies/phoenix companies.
Bad news: Veil-piercing (and the equivalent when targeting your "shell-homeless-dude" theory)
GrandOpener@reddit
Maybe. Certainly you're right that if you are not legally able to grant the rights that GitHub wants, they will not receive them.
But if all those contributions were made through PRs or commits on GitHub, then it would seem that every contributor did agree to the GitHub ToS and did actually give their own individual permission to have their code used as training data. So while you personally can't grant GitHub rights on the whole repository, GitHub still gets what they want.
It's only in the unusual case that GPL contributions are somehow made elsewhere and then copied to GitHub by someone other than the original author that things get potentially problematic for GitHub.
Tuna-Fish2@reddit
You are describing, for example, the way contributions to the Linux kernel work. This is not a particularly unusual case.
ykafia@reddit
ToS are not always legally binding
TheBlackCat13@reddit
Licensed aren't always either
ykafia@reddit
True, that's why it should be discussed in court
Ghi102@reddit
Exactly this. The only thing they can really do if you violate their TOS is to stop hosting your stuff and maybe ban you from the platform.
Otis_Inf@reddit
I'm not in the US. I couldn't care less about some site's EULA. the law is what counts.
spareminuteforworms@reddit
Sounds like a plan.
NeverComments@reddit
It's hardly an unachievable goal. You can take your git repo anywhere, and there's no shortage of competing host services or project management solutions that integrate with a git repository. Across my career I've only worked with a single company that used GitHub; it's not exactly an industry standard.
sonobanana33@reddit
I moved to codeberg. They even enabled CI for my account, so I can run CI.
spareminuteforworms@reddit
Yea it seems like it ought to be achievable with the right insurances/contracts.
ficiek@reddit
This sounds like bullshit because anyone can take my code and upload it. Are you a lawyer or are you just guessing?
GrandOpener@reddit
I'm not a lawyer, just an old software developer who's been forced to understand these discussions over many years. What I wrote above is not legal advice, but I am quite confident that it is correct.
You do bring up an interesting edge case though. Let's say you have created code, host it somewhere other than GitHub, and provide only a custom license that permits copying by humans but forbids using it as training data. Let's also suppose this custom license is well written (by a lawyer) and comprehensively prohibits use as training data for anyone who receives it indirectly.
(Side note: it needs to be a custom license, not GPL. One of the core principles of GPL is that you cannot put additional restrictions on how the recipient of your code uses it. Using GPL as training data is obviously legal--the court cases are arguing about whether the LLM weights or outputs constitute derivative works, in which case they would also need to be GPL licensed. If Microsoft wins their case that use as training data is not creating a derivative work, then GPL will more or less explicitly grant permission to use it as training data.)
So anyway, in our example let's say Bob now takes your code and uploads it to GitHub. This is a problem. The GitHub ToS requires users to be able to grant the permissions they want on the code that is uploaded (on the whole this is normal and necessary, otherwise they couldn't even operate a service that lets you share your code with other people). Since Bob is not able to grant them permission to use the code as training data, Bob has violated the ToS. GitHub does not actually receive the legal right to use your code as training data, because Bob cannot legally grant that right.
Now in practice, GitHub is still going to use it because they are going to assume that Bob has the rights he claimed to. From my perspective the biggest troublemaker here is Bob, not GitHub. So what is your recourse? Well, if you don't personally notice Bob's repository, nothing happens. GPL code gets infringed all the time because authors don't notice. That's not legal, but it's reality. If/when you do personally notice Bob's repository, you'll need to have your lawyer send letters to Bob and GitHub. In theory you would have the legal right to force GitHub to remove your code from their training data (and probably also remove Bob's repository), but this may be very expensive depending on how badly GitHub wants to fight it.
Spitfire1900@reddit
That is an issue with more minutiae. If you are the copyright holder of the work and host it on GitHub, then you're giving them permission to train on it.
If you clone someone else's repo, originally hosted on a self-hosted git instance, onto GitHub, then they do not get that permission, but you're at fault in that situation, not GitHub.
Astrogat@reddit
Doesn't GitHub have to do due diligence to make sure you actually have the copyright in that case? If not, do they delete it (which would require them to train a new model) if it turns out someone didn't have the rights to give away the copyright?
Spitfire1900@reddit
The due diligence is the responsibility of the uploader. GitHub may decide to delete the repo if it cannot comply with their terms, but they can’t remove it from already trained models.
There’s still a lot of unanswered questions, but none of them are going to be fixed simply by updating the license to say “you may not use the source code of this work to train AI”.
Astrogat@reddit
Isn't this a breach of DMCA? If they are notified that they are hosting copyrighted content, aren't they obliged to remove it? Why wouldn't that include the model?
sonobanana33@reddit
Yeah no.
Halofit@reddit
Yeah yes, actually. Copyright infringement does not require intent. Just because you didn't know you were infringing copyright does not mean you didn't infringe on copyright.
Websites have a specific carve-out for this, where they're not required to pre-emptively monitor for things they host, and rely instead on things like DMCA to resolve issues, but AI training has no such exceptions in law. You are required to ensure you're not infringing on copyright for every single thing you use.
sonobanana33@reddit
Yeah yes, actually, a license that allows redistribution doesn't mean you become the owner if someone redistributes to you.
Halofit@reddit
Ok, how does that disagree with me?
Luvax@reddit
Even if you are the copyright holder, you may not be in a position to grant such a license. Additionally, it might be up for debate whether the uploader could have reasonably known the specific details of such a license. Especially in Europe, where private individuals are usually not expected to have, nor held to, the same kind of legal understanding that companies are.
But sure, this is the US. Still, the blanket statement I replied to was simply not accurate, and "it says so in the ToS, therefore it must be legal" is extreme bullshit.
Monad_No_mad@reddit
I'm not sure I understand this. How does GitHub not "receive the rights they want"? Their ToS gives them exactly the rights they want, and probably even shifts liability to the user if they upload something they did not have adequate rights to.
__loam@reddit
ToS are not necessarily legally binding if they conflict with the actual law.
Monad_No_mad@reddit
Yeah but in this case it's pretty clear, the user grants GitHub a license to do certain things to their repo and the user is responsible for making sure they have an appropriate license for what they upload.
If you upload copyrighted material to GitHub, it's going to end up being your problem, not GitHub's.
__loam@reddit
These models didn't exist when the ToS was created so I could see there being laws against the retroactive inclusion of user data in training sets.
Monad_No_mad@reddit
GitHub has the ability to analyze, index, etc... what you upload.
__loam@reddit
Yes I understand what you're saying, I'm telling you that kind of contract won't necessarily hold up in court.
Monad_No_mad@reddit
In this case it will, because by using GitHub you:
1) allow GitHub to do certain things with what you upload
2) acknowledge that you have the appropriate license for what you upload
It will be your problem, not GitHub's, if you do not have the correct license.
Going even further, it's hard to complain about rights for anything that's publicly available on the internet. Content has been scraped, parsed and used for a long time.
__loam@reddit
Public availability has never represented carte blanche license to do whatever you want. And again, it doesn't matter what GitHub's ToS says if the law disagrees. Just ask any company whose non-competes got voided this year.
Monad_No_mad@reddit
You are parroting something you don't understand.
If your software licensing is in conflict with GitHub's ToS, then you are the one that will be liable, as you are the one that put it there, agreeing to their terms of service.
And public availability is a license for many things, including analysis. This is a fundamental part of how the Internet works.
__loam@reddit
I mean it's rich to say I'm the one parroting something when you clearly don't think copyright applies to public content on the internet.
Monad_No_mad@reddit
Just think about it for a minute, how does search work on the internet?
If a model is trained on a website, how is this different from the website being indexed?
__loam@reddit
Search is fair use. It obviously benefits the rightsholder by driving attention to their site. A big machine that reproduces your work with no attribution is vastly and obviously different. The nature of the use matters hugely to how the law is applied.
Monad_No_mad@reddit
That's simply not true though; search often summarizes content or displays it without someone needing to direct traffic. Or think about how indexing images works.
I'm not sure why young people have so many misconceptions about the internet.
__loam@reddit
Indexing images has already been litigated and is fair use. Many have argued that summaries and things like ai answers at the top of search are not fair use and to my knowledge this has not been litigated in a court of law.
Also thank you for the completely arrogant and condescending comment. It's very cool coming from someone who clearly doesn't understand what they're talking about.
Monad_No_mad@reddit
I'm being condescending because this portion of the GitHub ToS is clearly not superseded by the law. Also because we already saw that you could scan every book in existence, show that content in search results, and have it still be fair use.
svick@reddit
If I don't have the rights to some code, but upload it to GitHub anyway, GitHub didn't receive the rights because I couldn't have given them.
Monad_No_mad@reddit
Yeah but if you do that you are liable, not GitHub
painefultruth76@reddit
Disney can apparently accident you to death if you use their streaming platform...anything is possible.
svick@reddit
For the record, that case was about forced arbitration, not about absolving Disney of any fault outright.
painefultruth76@reddit
It was an asinine assertion, Disney apologist.
Halofit@reddit
No they can't.
Zulban@reddit
Your thinking sounds reasonable but doesn't pass a simple legal thought experiment. If I upload a repo that has a LICENSE saying "I don't agree to any GitHub ToS" clearly that doesn't make it so.
dravonk@reddit
Someone else uploaded (open source) code written by me onto GitHub, before it was owned by Microsoft and before Copilot existed. Who do I get to sue?
nnomae@reddit
I can't transfer to you a right I do not own. If I upload my employers code to github I don't own it and as such can not transfer to them any permissions based on that code since those rights are not mine to transfer.
__loam@reddit
I'm very skeptical that TOS is going to hold up in court against actual copyright law. Meta is saying the same thing with respect to images on their sites.
Kinglink@reddit
That could be true, but let's say you write a GPL3 program and I take it and use it in my code. It's now GPL3 code (or at least must abide by the GPL3). Now what if I take that code and upload it to GitHub or other publicly available places? That was perfectly fine, even expected.
But just because that's fine and expected, does that mean I can change your license to allow training on it? That's kind of the heart of this case. (And the part that's not been decided yet.)
tav_stuff@reddit
Terms of service are not above the law. An analogy would be if I signed a work contract stating that after I leave my job I can’t work for another 2 years. It doesn’t matter if my contract says that — it’s illegal so I can legally ignore it
sonobanana33@reddit
Yeah, and anything I lick is legally mine -_-'
You can state whatever you want… that's not how laws are created.
teslas_love_pigeon@reddit
May not be how laws are created, but it is how corpos enforce their reign of terror. If you're an individual or an SMB, do you really want to face the wrath of a trillion-dollar corporation that has zero issues spending tens of millions in legal fees to make your life hell?
crazedizzled@reddit
Which is what is going to be argued in court, because that is not how copyright law works.
thebuccaneersden@reddit
Yes, this judge was probably highly qualified to make that decision with prejudice. /s
BarelyAirborne@reddit
This is not good for open source. I'm reluctant to give up my code if it's going to be ingested and regurgitated by someone else's bot that's not going to give me even a whiff of attribution.
neopointer@reddit
I wish there was a license to completely opt-out of this crap.
MikusR@reddit
There is. Don't release the source.
neopointer@reddit
What about the possibility of doing open source without being ripped off?
svick@reddit
If a trained LLM like Copilot is not considered a derivative work, then no kind of license is likely going to help you.
Though that is a big if, and it's basically what this court case is about.
fkih@reddit
I wish there was a license to opt-in. No license? No training.
wolfpack_charlie@reddit
They'd never have a big enough dataset
crazedizzled@reddit
That's the point.
wolfpack_charlie@reddit
More like that's why it'll never be opt-in, unfortunately
boobsbr@reddit
There is: the ToS.
fkih@reddit
Not really relevant, but go off!
neopointer@reddit
That would be the dream. If you don't let it be used for training explicitly, then it's not allowed.
HerrEurobeat@reddit
Well if they are using GPL licensed code to create their product (aka training their AI), shouldn't they need to be able to provide the original sources and copyright information? Since this is virtually impossible, using GPL should opt you out.
I could very well imagine though that they find a loophole in the definition, like that training on code isn't using that piece of code itself, making it not fall under the license or something
Drogzar@reddit
In that case, I'll just very quickly take your code to "train" my "artificial stupidity" (an SQL database that stores whatever I train it with), then ask it to write it down again, and now I can use your code in my own closed-source project, because it was just "used for training" instead of copied.
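A deliberately dumb sketch of that "model" (all names made up, obviously):

```python
class ArtificialStupidity:
    """Training is storing code verbatim; inference is retrieving it."""
    def __init__(self):
        self.weights = {}              # not weights at all: a verbatim store

    def train(self, prompt, code):
        self.weights[prompt] = code    # "learning"

    def generate(self, prompt):
        return self.weights[prompt]    # byte-for-byte "creativity"

model = ArtificialStupidity()
model.train("fast inverse sqrt", "float q_rsqrt(float x) { ... }")
print(model.generate("fast inverse sqrt"))  # your code, now "AI output"
```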
xcdesz@reddit
I don't get this need for attribution. Almost all software out there has massive requirements/dependency lists of open source code being used, with the most attribution being that you need to copy a string from some license file into your project somewhere.
Long before AI, most downstream developers would happily use your open source code without bothering to look up the name of the person who developed it. Why does "attribution" matter to you so much? Also, in many cases it's a massive corporation that is behind those open-sourced frameworks you are using.
dravonk@reddit
It was always a problem, but when one of the largest software companies in the world is selling tools that systematically violate licenses, it makes the problem a lot bigger.
solartacoss@reddit
i feel the same but with music.
Saki-Sun@reddit
With the current state of AI code, they seriously need to ingest and regurgitate more of mine.
I personally relish our AI overlords, although I suspect I will be long retired before they take over.
Additional-Bee1379@reddit
That already happened without the ai.
merrymailingjacky@reddit
Developers should have an option to choose if their work is used to train AI. And no one in their right mind would agree to that.
FenixFVE@reddit
Abolish Copyright!
Uristqwerty@reddit
Copyright is DRM implemented through laws, that unlike technological DRM will disable itself after some number of decades leaving the work open, and permits both archiving and fair use even long before it expires. I do not want to learn what technological DRM schemes companies invent if they feel they can't rely on copyright law's protections, and I do not want to see half humanity's culture self-destruct when the license servers get shut down instead of unlocking.
bananahead@reddit
So no open source?
FenixFVE@reddit
Force everything to be opensource
svick@reddit
That's not what abolishing copyright would achieve. In fact, it would mean that GPL-derived code, which is forced open source today, wouldn't be anymore.
jackstraw97@reddit
How would that be possible without copyright?
Copyright is the underlying structure which allows open source software to have the requirements on the people making use of that open source code to keep downstream projects compliant with the license (open sourcing the derivative work, etc.)
If you’re suggesting that getting rid of copyright by forcing everything to be public domain would help, it would likely have the opposite effect. Everybody who would have previously been bound by the GPL or another license that has requirements of the deriver would instead let people who make derivative works simply close source their whole project no matter how many “open source” sources their code used.
There is no such thing as open source without copyright.
travelsonic@reddit
Disagree. The idea behind copyright is not bad. The problem, IMO, is what it has become thanks to lobbying by the music industry, movie industry, Disney, etc. Bring the duration back to the original 14 years (and, perhaps more controversially, retroactively apply that to existing works based on their date, so that what should be public domain can actually be so) and things would be a lot better: copyright would still exist, the duration would be well within an author's lifetime -AS ORIGINALLY INTENDED-, and the public domain would get regular, consistent, frequent additions, AS ORIGINALLY INTENDED.
trevr0n@reddit
Only if we also abolish capitalism. Otherwise, reform copyright.
Smooth-Zucchini4923@reddit
Aren't those the most important claims? Weird way to phrase this headline.
emperor000@reddit
But isn't 20 a majority out of 22?
bwainfweeze@reddit
Hang out on Hackernews more. We talk about how writers don’t get to pick the titles for their articles most of the time. That’s the editor and the editors are trying to make clickbait titles we all fall for. It’s a racket.
M4mb0@reddit
Good.
autopoiesies@reddit
tell me you've never contributed to open source without telling me you've never contributed to open source
Doctor_McKay@reddit
I agree with them and I am the sole maintainer of several projects with a combined 2,415 stars on GitHub and 137,876 downloads in the past month. You?
GetPsyched67@reddit
Might as well just give it up to AI and extinguish yourself from this career if you don't care.
M4mb0@reddit
I have contributed to many, among them very widely used ones such as cpython, pandas and pytorch.
autopoiesies@reddit
that's actually impressive
do you not care about the licenses those libraries use? they were carefully chosen, right? why would you allow them to not be taken into account?
M4mb0@reddit
Not sure what you mean. In this domain, permissive licenses such as MIT, BSD or Apache are commonplace. I think the people who take offense at Copilot are mostly the copyleft crowd. But the solution for them is simple: prove your case in court.
If any entity — and it shouldn't matter if it's a person, Copilot, or a thousand monkeys in a cellar hammering on typewriters — produces large snippets that are a clear copy of your code, in violation of the license, then sue. But it seems the people who file these copyright claims are not able to do that.
ZucchiniMore3450@reddit
I agree with you, I am supporting copy left and GPL and against closed source companies.
But this is like saying I am forbidden to read GPL code and then write MIT-licensed code based on that knowledge.
Maybe our licenses need to catch up with time and add clause about AI.
And about copilot, what did people think when M$ bought github, that it was for altruistic reasons? We all knew they were training models and that's why some projects moved away. They must have consulted dozens of lawyers and judges before doing it, my opinion is not important.
Similar with art and AI: do I need to pay just to see some image? If it appears in my browser, and in my disk cache and backup, and I don't forget about it and later create art based on that image, is that illegal? Yeah right.
PsychologicalStore96@reddit
So, licences are dead?
FlukyS@reddit
Well it kind of hits at a weird question, which is that "inspiration" isn't copyrightable. For instance, I can write a song that is a ripoff of Metallica, and I can do so based on my listening of Metallica; as long as I didn't copy a chord progression, lyric, etc. word for word, it isn't infringing. This is well known and not controversial at all.
Copyright comes into play, though, when an infringing work is released that isn't just inspired by something but directly copies it word for word, note for note. In the case of copyrighted code, all code can be accidentally reproduced a number of times in a number of different places if the problem being solved isn't novel, so it's hard to trace who owns a specific coding pattern.
If, for instance, someone has a Copilot subscription and it pastes in a specific piece of GPL code that can be reproduced, the works themselves are infringing and would have to be relicensed or removed, so most companies that have a legal department won't even allow that conceptually at all. I think the big question here is: when someone does infringe and is challenged in court to comply with the original license, will that be upheld? Not specifically whether Copilot was trained on it, because that hasn't been legislated yet.
ElMachoGrande@reddit
There are programmers who occasionally google a solution to their problem, learn how it works, and then write their own version of the solution. Then there is another category of developers: liars.
I see the AI as pretty much the same.
spareminuteforworms@reddit
There are people who write attribution into their code base though and abide the license of projects they depend on.
ElMachoGrande@reddit
If you find some snippet of code on the internet, look at it, understand it, and then write your own version of that solution (because you always want it to conform to your code style anyway), do you really provide attribution? I sure as hell don't, unless I copy code verbatim (which I don't).
FlukyS@reddit
And a thing people don't realise, which is actually pretty fucking dangerous: no copyright string doesn't mean it is public domain. Code is still considered the copyright of the writer unless that is explicitly waived. So in the case of copying from Stack Overflow, I'm not sure they have something in their EULA to waive that, but copying random code that doesn't have a proper copyright statement is very, very dangerous.
ElMachoGrande@reddit
Learning how to do something from code, and then writing your own implementation is not prohibited by copyright.
Doctor_McKay@reddit
Developers: put code on github so it can be looked at and learned from
Copilot: looks at and learns from public code on github
Developers: angry for some reason
__loam@reddit
You're kind of purposefully missing the point in bad faith here. Licenses exist for a variety of reasons, and in particular the goal of GPL is to protect open source by forcing derivative works to also be open source. It doesn't really matter what your opinion of this is, GPL is legally binding and private companies have been forced to comply with the copyleft clause and make their codebases open source by the courts. Even if you don't care about the lofty ideals of GPL, you still need to worry about copyleft licenses because they might represent a legal liability to your employer or company. In the more abstract sense, copyright is important because it helps incentivize innovation by letting the creator benefit exclusively from their work (or give it out for free and protect it from people claiming the work for a private company). There's some very problematic implications to the idea that large companies can derive value from labor they don't own just by laundering it through an LLM.
Doctor_McKay@reddit
Reading code and using it as inspiration to write your own isn't a derivative work.
__loam@reddit
That's great but it's also not what we're talking about if it's about copilot.
Doctor_McKay@reddit
It really is. Copilot isn't just spitting out someone else's code verbatim. It's interpreting context and making appropriate adjustments, i.e. exactly what a human developer does when adapting code.
__loam@reddit
The problem is that it often does spit out someone else's code verbatim. It's also not a human. It's a computational system that might be subject to different legal regimes than a human even if it was working in the same way as a human being, which it absolutely does not.
Doctor_McKay@reddit
Copilot isn't exclusively an LLM; it's also context aware and will do things like use the correct variable and method names to match the codebase it's working in. If you get suggested someone else's code verbatim, that means that you're using the exact same variable names they did, which means that you're writing a common pattern that thousands before you have also written.
In no way does Copilot reproduce "substantial" portions of any copyrighted work, which is what would be necessary to run afoul of the copyright.
If you're okay with something as long as it's done by a carbon-based brain but not okay with exactly the same thing when done by a silicon-based one, then as far as I'm concerned, you're a luddite.
__loam@reddit
This reasoning is bizarre and completely wrong. Microsoft had to put filters on copilot because it was reproducing significant blocks of code licensed with GPL. This is just a fact that we know, you don't need to add weird conditions like having to use specific variable names.
Human beings generally have more legal rights than computer programs. It would be pretty weird if tools like my lawnmower had rights.
FlukyS@reddit
That's Theseus' ship though
ElMachoGrande@reddit
Philosophically, yes. Legally, it's all crystal clear. Inspiration and knowledge is fair use.
FlukyS@reddit
Sorry, didn't go deeper into it. I meant that if it was rebuilt but does the same thing, it's probably fine even if taken as inspiration. I can be inspired by Led Zeppelin, but it doesn't mean my song inspired by their work requires payment to them.
spareminuteforworms@reddit
I don't really know how it can be managed. Are you going to effectively grant spatial rights to spans in the lexicons of all languages? Why not just spam copyright claims on every span possible? So I just try to proceed with honor instead: attribute the source of the code, or the inspiration really, so that if someone comes to that code for any reason and has questions about it, god willing Stack Overflow is still there and they can read a hell of a good discussion.
FlukyS@reddit
Well, that's the funny part: there is a bit of common sense involved. If I make up a sentence that has never been said, "fluflamabingbong is underrated in fluterflam", and I can prove that it was first invented here, then any time I see it after today, like on a t-shirt, the revenue from those t-shirt sales is mine. That's how it is managed: if you can't prove it was novel (as in the case of prior art), you can't copyright it, so there is a balance that is understood legally here.
spareminuteforworms@reddit
In the case of prior art why doesn't that basically assign copyright to the prior?
FlukyS@reddit
Yep unless the period of copyright has expired. So for instance I can record some great Bach and release it and his great great grandkids don't get a penny. It just means you can't claim it's yours other than any specifics you did that were potentially transformative.
spareminuteforworms@reddit
Seems like there might need to be a high throughput mechanism to track the assignments of rights and new claims. Seems like right now megacorps are usurping that duty and that seems legally quite gameable... big guy always wins.
FlukyS@reddit
Well the bar being high means these cases are rare enough, the court system already is high throughput for this sort of thing
spareminuteforworms@reddit
I'm not sure I understand this. How is the bar high? Aren't most cases squashed by the bigger party before they can ever become a court statistic?
FlukyS@reddit
Settled not squashed, the key thing with settled law is if both sides know for a fact something was wrong the settlement will be easier to reach. So it being settled before getting to court is a good thing. The bigger party doesn't matter in quite a lot of jurisdictions and especially it doesn't matter when their usage was provably incorrect. Like copyright cases are normally settled because there is basically no point in taking it to court and spending the money litigating it when both sides know generally going in what the outcome is. That's why it's settled law because there isn't normally any new judicial precedent that needs to be established as most of it has already been done in the 150 years of professional music recording and however long copyright has been litigated.
spareminuteforworms@reddit
Let's say you've got a $100,000/year business which could someday be a $100,000,000/year business (outrageous, I know; nobody but big corps are powerful enough to innovate at such scale, tell me you haven't worked in a corp without telling me). Well, in your infancy you don't have the super-powered lawyer guns to fight your case, so your shit gets arbitrarily stolen. Do you think that harboring villains and thieves is ultimately to your benefit?
FlukyS@reddit
Check out the BusyBox lawsuits if you think they'll get away with it just because they have money.
Purple-Ad-3492@reddit
I'd argue that the legal frameworks for these domains differ when it comes to inspiration versus direct copying, in that for music the distinction is more abstract and challenging to prove. But there are cases -
The "Blurred Lines" case dealt with whether certain elements of a song were sufficiently similar to warrant a copyright claim based on a violation of intellectual property rights, which parallels assessing if contractual terms, like those in open-source licenses, have been breached. They were found guilty of infringement and made to pay royalties.
Another example - Olivia Rodrigo credited other artists retroactively for similarities in "vibe" and influence to avoid potential copyright claims after proclaiming her inspirations. The accusation against her for an Elvis Costello lift highlights this nuance, but Costello welcomed the influence, reflecting the more fluid nature in the industry from his perspective.
In contrast, code copyright is stricter. If GitHub Copilot generates code that closely mirrors copyrighted code, it might be considered infringement, as code can be precisely copied. The DMCA claim was dismissed because it couldn't demonstrate exact replication of the developers' code; the claims about open-source license violations and breach of contract were allowed to proceed because they focus on strict adherence to specific licensing terms, which are more concrete and measurable.
WaitForItTheMongols@reddit
In the end, an AI isn't a person and doesn't learn like a person or create like a person. It doesn't have inspiration or creative choices. I don't think we can apply human mental processes to how we analyze these models.
FlukyS@reddit
Well yes and no, AI isn't an if statement, it is combining training data to do something you ask of it. What you ask is creative work, what it spits out is a combination of various inputs that can be done in ways that are unique. If it can produce any work that wasn't in the training set then you could argue that everything output from AI is in fact creative.
Now, the argument that has happened from a legal standpoint, using super-established law, is whether you can copyright creative work from AI. You can't, unless it has been transformed to the point where it is now your own creative work, because only humans can hold copyright; the creator, or the assignee of copyright in the case of commissioned works, are the only valid holders. For example, the monkey-selfie case: the monkey took the selfie, so the selfie was not taken by a human, and the copyright thus cannot be assigned to the monkey. AI is not human, so it can't get assignment either. So AI output can be considered creative, but you don't have the right to protect that work after it is produced unless it was heavily changed.
Well, in terms of copyright you can definitely apply a lot of existing law here, because it would be the same as literally any plagiarism case in any creative work. See the Robbie Williams case or the Ed Sheeran case for where the lines can be drawn from a musical standpoint. There is a line: if I say "gimme fuel, gimme fire, give me dababadai" or whatever the lyric is, no one would have any ambiguity that it was from "Fuel" by Metallica. I can change the key, but I'd still be infringing on the rights of the lyric writer.
mccoyn@reddit
If this is true, my program that averages two numbers is creative.
FlukyS@reddit
Well, it is, but it isn't novel; more likely than not there is enough prior work over the years using that for it not to be copyrightable by itself. You can slap a copyright notice on it if it makes you feel better, but there is a bar you have to clear for novel works when you defend your copyright against someone using it. So a basic sanity check is kind of built into copyright protection.
mccoyn@reddit
I didn't say "copyrightable", I said "creative".
FlukyS@reddit
Well creativity is subjective, ask a person who is 70 years old if Slipknot is good music, a lot will say no and say it isn't creative.
cym13@reddit
True, but also of note is that AI don't just decide to go learn from something by themselves, they're purposefully trained by humans, using data sources selected by humans. These humans know what value they expect the AI to produce in the end, and if they're using AI as a proxy to do things that they wouldn't dare do themselves because "it's not a human, our laws and contracts don't bind it", then I think fair to hold these humans responsible.
FlukyS@reddit
Well, careful: the copyright infringement would happen on the released infringing works, not on the generation or training of the model. A good legal question is: if I infringe on a copyrighted work using Copilot, will Microsoft also be caught in that lawsuit if it can be proven that I used Copilot as part of the infringement? Generating a model isn't infringement. Like, I can write down the chords to a song on a sheet of paper, but it only becomes infringement on the publishing rights for the composition if I try to make money from that specific usage of the copyrighted work. So if you train the model and it doesn't produce infringing works, then it's entirely fine; in a lot of ways it comes down to the model generation, and maybe there should be tools added to confirm the suggestions from Copilot aren't literal copy-pasted segments, or to give content warnings for infringing works. Your comment itself doesn't really stand up legally; it could be made into law, but it would be incredibly hard to legislate for this. I think the current system would hold up pretty well though.
cym13@reddit
To be clear, my comment had no intention of being legal commentary; it was a political one (in the innocent sense): I think that, in general, we should not exempt from responsibility the people who create models using data that doesn't belong to them. I don't think the law is there yet, but I think we should steer it in that direction.
PsychologicalStore96@reddit
If you make an analogy with music production, what I'm reading here is: «a big major label makes a compilation of all the famous artists and sells it, but gives no money to the original authors, and the defense is "I agree it's the same voice, lyrics, etc., but the names of the people taking me to court aren't on my track list, so why should I give them money?"»
Like you said, it's not in the law yet, but it's the first answer we've got.
FlukyS@reddit
If some label is releasing a CD, they are always required to license (if they don't have it already) the master recording and to pay publishing rights for the composition of the music and, separately if applicable, for the lyrics of the song. That isn't negotiable. What confuses people in this area is that some artists sign deals with large upfront payments, an advance paid before recording starts, which has to be paid back to the label; they generally won't receive anything until it is paid back. There is also an area that can fuck people over: if you get the advance and sign with a label that also handles publishing, the deal can have a cross-collateralisation clause which takes publishing cash and uses it to pay off the recording advance. And there is one more way, which is the artists themselves selling off their publishing rights. That has happened more often recently, especially for older acts that wanted to cash in before they die, basically, and in that case they aren't entitled to any money; only the purchaser of the rights is.
Long story short: if someone releases your song in any way, or uses it on TV or at a rally or in a club, etc., you have a right to be paid, period. There are caveats, but those are generally discussed when signing the recording and publishing deals.
It doesn't work like that for records, because a record can generally be traced back to the original, both sonically and by the time of production. The track on the CD is a copy of an original that has traceable elements.
For AI there is no specific law yet, but copyright protection itself is not just well established, it's almost impenetrable legally, unless you are infringing on some other law or someone else's rights.
Jmc_da_boss@reddit
They have always been tenuous at best, with limited and volatile precedent in court.
lIlIlIIlIIIlIIIIIl@reddit
Is that because algorithms can't be copyrighted?
Ravarix@reddit
Code copyright was always a myth. The system barely works for the medium it's designed for.
Zulban@reddit
If I upload a repo to GitHub that has a LICENSE saying "I don't agree to any GitHub ToS" clearly that doesn't make it so. Licenses are for others who see the project, they don't nullify the GitHub ToS.
NeverComments@reddit
If a piece of code is violating a license, it's violating that license regardless of whether it was generated via AI, grabbed off StackOverflow, written by an employee, or came to you in a dream. This lawsuit has zero bearing on that.
r3drocket@reddit
I have some very specific code that I wrote which it has clearly trained on: stuff that no one else has ever written, because you'd have to be such a ridiculous dork.
It will literally reproduce the function names and variables that I used.
wildjokers@reddit
I am skeptical. So you have a prompt that you can provide to get a particular LLM to generate your code verbatim?
r3drocket@reddit
I write a lot of procedural generation code in OpenSCAD for generating organic structures, and I got it to pretty much reproduce the variable and module names I used in my OpenSCAD code. It failed to produce functional code, though, and mostly left the module bodies empty.
I tried this multiple times; one time it reproduced enough code to convince me it has clearly read my obscure OpenSCAD code, other times it produced different results.
Not verbatim, but it's clearly been trained on my code. Yes, I asked for a very obscure use case, but I wanted to see what would happen when I did.
currentscurrents@reddit
So what's the prompt and what file is it reproducing? Give us the details here so we can try it ourselves.
wildjokers@reddit
OpenSCAD is very niche and the amount of training data isn't that large, which probably means the model just doesn't have enough statistical signal for it.
I have also never had ChatGPT produce valid OpenSCAD code; it seems to mix up syntax from a few different code-CAD languages.
horror-pangolin-123@reddit
So basically piracy is ok if it's done via AI?
NeverComments@reddit
Using AI to skirt copyright is like asking a contractor to do it for you and thinking that gives you some magical legal loophole. If the material is infringing then it's infringing. It doesn't matter where you sourced it.
sleeping-in-crypto@reddit
I don’t think people realize what is happening here, nor what the endgame is when copyright isn't respected just because some company wants to use your property to train its models:
A world of walled gardens that training software cannot access, gated by a labyrinthine network of access agreements and licenses that only the most seasoned lawyers can unwind (at proportionate cost, of course).
If that’s the world you want, sure, keep promoting decisions where MS, OpenAI and others can keep using the entire world’s creative works without permission or license.
You know the worst part is, this is the biggest missed opportunity in human history to date: wouldn’t you love to be part of a generation-defining project to turn the world’s knowledge into AI? They robbed you of that by doing it for profit and then claiming they owe you nothing. You gave them everything…. And they believe they owe you nothing.
bwainfweeze@reddit
They aren’t turning the world’s knowledge into AI. They’re turning the world’s opinions into AI. And its foibles and its fears.
Half of us are dumber than average and language models can’t tell who is who.
SmolLM@reddit
Amazing news
Neoshadow42@reddit
Sorry that you don't know how to code, but it's a skill you could learn and feel good about instead of stealing everyone else's.
StickiStickman@reddit
You realize the majority of programmers are literally using this already?
You being in denial is just gonna mean you'll fall behind. It's a skill you could learn and feel good about instead of being a grumpy luddite angry at everyone else.
crazedizzled@reddit
Yeah but the majority of programmers don't know how to code.
Neoshadow42@reddit
Irrelevant; this doesn't excuse the fact that the tooling can't exist in its current form without stealing from people. Using GitHub Copilot isn't a skill.
StickiStickman@reddit
Except there's no "stealing" involved in learning from publicly available code.
wobfan_@reddit
So you're implying that people who use Copilot don't use the code, but only use it to learn and then write it themselves? Interesting take.
sonobanana33@reddit
A majority of hacks and noobs, yes.
Guypersonhumanman@reddit
No they don't.
travelsonic@reddit
Personally, I like this event from the standpoint of "invoking the DMCA was idiotic, and it's good it was tossed in such a way where we still get to see the rest of the case through."
SmolLM@reddit
Wow epic pwnage
JustAPasingNerd@reddit
So you are like 12?
RedPandaDan@reddit
If it's decided that AI model output isn't subject to copyright, doesn't that mean the end of all software licenses, proprietary or not?
Someone posted the Windows 2000 source code online a few years ago; couldn't someone build an AI with that and nothing else in its data set and launder the whole thing?
mccoyn@reddit
This will ultimately be decided by a court. Take this similar situation: it isn't copyright infringement to study a piece of code and generate new code using the knowledge gained from that study. But it is copyright infringement to memorize a piece of code and write an exact copy of it. Where is the line between these? Generally, you should stay away from where most people will think the line is; otherwise you will be in court arguing which side of the line you are on.
NeverComments@reddit
If the AI outputs content that infringes upon someone else’s copyright then using that content is copyright infringement. Separately, the content output by an AI that does not infringe upon someone else’s copyright cannot be copyrighted as-is because it is not the product of human creativity.
Using AI content as part of a larger creative work allows you to copyright that content as an original creation.
Valmar33@reddit
But humans have to choose data to feed into the AI model. So, there is some thin amount of creativity going on here. That is, deciding how to shape the model by choosing the data.
NeverComments@reddit
Here is the current USCO guidance on the copyrightability of AI-generated material, if you want to hear it from the horse's mouth. Naturally there are additional layers of nuance.
skratlo@reddit
It may be over in the US, but it isn't in the rest of the developed world. I'm looking forward to seeing how this plays out in the EU.
iamtherealjebus@reddit
Good, we need to unite and push humanity forward. Stop trying to slow us down.
sonobanana33@reddit
Lol… Surely Newton wouldn't have been so important if all he ever did was copy? :D
notjshua@reddit
Love it!