Is SE [going to be] selling our content for AI model training? And what exactly does "reinvest back into our communities" mean?

Question

Wired published an article titled Stack Overflow Will Charge AI Giants for Training Data. I read through it, and a few sections seemed noteworthy to me

"Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow's Chandrasekar says. "We're very supportive of Reddit's approach."

Wait... who is getting compensated? I'm assuming SE, but then you refer to "their contributions", presumably referring to those belonging to "Community platforms". Can you clarify that sentence a bit? Maybe I'm reading too much into the phrasing, but I'm a bit confused by that...

Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need "to be trained on something that's progressing knowledge forward. They need new knowledge to be created."

What specific actions are going to be taken to "keep attracting users and maintaining high-quality information"?

But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs ... When AI companies sell their models to customers, they "are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license," Chandrasekar says.

What specific license will our questions/answers be licensed to these companies under? Also, how will that work with the existing license? I am not a lawyer, but quoting from the current license (CC BY-SA 4.0 specifically):

If You Share the Licensed Material (including in modified form), You must ... retain the following if it is supplied by the Licensor with the Licensed Material ... identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);

Again, I'm not a lawyer, but if the companies the content is being sold to cannot give attribution to "the community members", how exactly will that work from a licensing point of view?

Related on MSO: Will Data Dumps continue once the company starts to "Charge AI Giants for Training Data"? — Ryan M, Commented Apr 21, 2023 at 0:55
I would love to learn more about this as well, since I'm also only learning about it from the press. — balpha, Commented Apr 21, 2023 at 6:14
This whole thing doesn't really make sense to me. It seems to forget the whole SA part in CC BY SA. If they're violating that license since these models are a derivative work of user contributions on SE, we should also be asking for all these models to be available to us under a CC BY SA compatible license, but the Wired post definitely does not seem to be advocating for that. — Erik A, Commented Apr 21, 2023 at 8:50
Is it illegal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make a commercial use of it without distributing the model under CC BY-SA? — Franck Dernoncourt, Commented Apr 22, 2023 at 2:03
The answer is yes. See the recent announced partnerships with Google and OpenAI. Although it certainly depends on the details. meta.stackexchange.com/questions/399619/… — NoDataDumpNoContribution, Commented May 15 at 14:04

Philippe · Accepted Answer · 2023-04-21 14:55:38Z

14

I'm here to respond to some of the points raised in this post. No doubt I'll miss some, so I appreciate your forgiveness and understanding that I'm working to answer as many questions as I can. Some things I'm not competent to opine on (legal interpretations for instance), and I've passed off to get input from the appropriate team members there. It's also important to remember that this is very early in this initiative and we're still figuring out exactly how to do some of this. Details may change.

First, I'd like to say that the intent of what Prashanth is saying is very simple: to return value to the community for the work that you have put in. The money that we raise from charging these huge companies that have billions of dollars on their balance sheet will be used for projects that directly benefit the community.

It's hard to improve upon Andreas' phrasing of this (so I won't really try):

If the intention really is to bring the value back to the community, by charging the companies using our content for their AI models, and spending that money on the platform, I whole-heartedly agree with the intention.

Next, I think it's important to call out that we’re not going to charge developers or community members to use the api for their own projects. We may need to tighten up access controls some to prevent abuse, but there will be a method for community members to access the api and its data for their own use.

But the community is - you are - being denied your rightful attribution as it stands right now. Prashanth is saying that you should at least get to benefit from the financial impact. This is about protecting your interest in the content that you have created.

That's the intent of what he's saying, and of the article. I hope this clarifies things somewhat.

answered Apr 21, 2023 at 14:55

PhilippeStaffMod

21.1k14 gold badges63 silver badges86 bronze badges

94

"you are being denied rightful attribution so we are going to charge for it and you still get no attribution" is not a very convincing argument. This might have worked with a slightly better relationship between SE and the community, but now and when everyone is getting the news from a random news article this just isn't very convincing.
– Mad Scientist
Commented Apr 21, 2023 at 15:04
27

This has all the energy of a certain 'news' agency's decisions to actively mislead the public in the name of better ratings. Yeah, cool, our content's going to not have the license respected because it was impractical from a logistics standpoint before this whole AI mess blew up, and that doesn't change but we're gonna make a boatload of money from it. I...do not know how y'all missed the part where the data is already out in the wild and you can't just take that back retroactively
– Makoto
Commented Apr 21, 2023 at 15:57
17

@MadScientist: It definitely feels like what happened is that Reddit found a way to make some revenue and Stack Overflow wanted in on that action. That's not a bad idea, but attribution really seems like a separate question. I'm also concerned that SO will get in the habit of charging people to use data that was freely contributed. I rather researchers have access without having to ask permission, much less pay for it.
– Jon Ericson
Commented Apr 21, 2023 at 16:08
18

@JonEricson I don't mind too much about SE charging for this specific situation, though I am a bit worried that if license terms alone are not enough they might e.g. stop the data dumps. But otherwise the choice seems to be either Meta/Google/Microsoft use all the data for free and without attribution or SE takes their cut as the middleman. But SE pretending this is about the community when it pretty clearly is about money is just a tiny bit insulting.
– Mad Scientist
Commented Apr 21, 2023 at 16:13
1

Thanks for the (pretty) quick reply and for the details. I do have a doubt though, what exactly do you mean by "community" in "directly benefit the community"? If you mean the whole platform, does it include Teams, Talent, or one of the others? Because the users themselves are not part of those, only the Q&A sites.
– Shadow Wizard
Commented Apr 21, 2023 at 17:28
4

@ShadowTheSpringWizard considering they now have a teams product named "Communites" that line is pretty blurry
– Kevin B
Commented Apr 21, 2023 at 17:46
1

@KevinB yeah, but I doubt that he meant specific Team.
– Shadow Wizard
Commented Apr 21, 2023 at 18:21
10

I was referring to the community of users who contributed to the Q&A site, the public platform. I am not willing to engage in the type of wordplay that would lead to the sneakiness that is mentioned here.... nor would anyone at this company even dream of suggesting it, I'm certain. But if they did, I wouldn't be the public face of it.
– Philippe StaffMod
Commented Apr 21, 2023 at 20:46
14

Well - very directly, since well, there's certainly going to be some ongoing unhappiness over this, where do you see the money going - and how soon would we see this ideally? This feels like a massive change in the social contract, and it would be very nice to have/see what actual benefits this has over "Trust us", especially with the complete lack of clarity over how we found out about it. What's the benefits for us here?
– Journeyman Geek
Commented Apr 21, 2023 at 22:58
3

I am not in a position to commit to particular projects yet (as I said, this is the beginning of the journey, and we still have to do some discovery work, etc) but I can say that "if it's a long-standing community ask, it's fairly certainly on the table" for us to consider. Ideally, I'd like to set up a way for us to accept suggestions, but I don't want to commit to that yet either. That's just my random post-5PM thinking on a Friday afternoon. More when I have more certainty.
– Philippe StaffMod
Commented Apr 21, 2023 at 23:05
30

And I think that's a key problem here - We're finding out information in drips, with the media finding out some of it before we are and there's no clarity over what any of this actually means
– Journeyman Geek
Commented Apr 22, 2023 at 0:08
1

I am not going to repeat what other already commented before me. Just wanting to point out that even if SE realized that maybe this was worth discussing only after another meddling kid posted this on Meta, I would probably have preferred you guys posting a separate post that linked at this one instead of "hiding" this behind an answer. You know... just so that we could actually answer you with our concerns instead of posting comments that may be deleted at any time given their status as "second-class citizens"
– SPArcheon - on strike
Commented May 11, 2023 at 17:05
3

Just stumbled over this post in context of the mod strike. Not sure why anybody would take this answer seriously after the actions of the company in the last years. We just got a very clear reminder that SE is not on the users side, and especially not on the side of the experts and power users. "We will sell your content, but don't worry, it's all being invested into things you benefit from" - unless you present me a bulletproof legal document signed by the CEO and approved by the Prosus board that puts this in writing, this is laughable.
– l4mpi
Commented Jun 6, 2023 at 13:41
15

How does turning off the data dumps directly benefit the community?
– Restore The Data Dumps Again
Commented Jun 9, 2023 at 17:12
7

This answer certainly aged like rotten milk.
– Sébastien Renauld
Commented Jun 10, 2023 at 14:26

| Show 2 more comments

Andreas moved to Codidact · Accepted Answer · 2023-04-21 02:58:13Z

35

_{«You» is hereafter used interchangeably to refer to Prashanth Chandrasekar, Stack Exchange, and the company’s employees. It does not refer to the creator of the question.}

If the intention really is to bring the value back to the community, by charging the companies using our content for their AI models, and spending that money on the platform, I whole-heartedly agree with the intention.

I am also happy that you seek to address the license violations that these companies are making. It's appreciable that you enforce this. At the same time, selling our content under terms acceptable for any buyers, does not sound compatible with the existing license. You have previously illegally relicensed our content. This is not acceptable. You can however provide an opt-in ability for all users, for them to agree to have their content sold under a different license, on the basis that the money earned goes back to the community.

Some users may want to reserve themselves against their content being used to train these models. Will you provide an option for this?

If the money is not given back to the community, and instead directly spent at your other commercial products, or sent straight into the pockets of the investors, I will whole-heartedly hate you for it. The community will feel used. This is the content provided by the community that you are selling, not your product. I will be extremely uncomfortable with anyone capitalizing on this content, in this manner.

internet forum for computer programming help

It's not clear to me if this is a statement by the CEO himself, or a grave mislabeling by Wired. However, Wired already labelled SO "The programmer Q&A site", so a clarification about this mislabeling will be most helpful.

edited Apr 21, 2023 at 2:58

answered Apr 21, 2023 at 1:24

Andreas moved to Codidact

4,1793 gold badges14 silver badges24 bronze badges

3

Re "seek to address the license violations": Well, it is probably about the money (because OpenAI has boatloads of money), not the license (my emphasis): "Because we have no standing to ask another site to take down content they have reproduced from our site, there is unfortunately very little we can do to address scrapers, and we are no longer pursuing these avenues as a company."
– This_is_NOT_a_forum
Commented Apr 21, 2023 at 8:56
3

@This_is_NOT_a_forum I honestly find that very ironic; now that there's money on the table, they suddenly care, but only about a specific category of them, because the rest aren't worth it to them
– Zoe - Save the data dump
Commented Apr 21, 2023 at 10:20
2

SE is not restricted by the CC-BY-SA license. In addition to CC-BY-SA licensing, you grant SE “the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content”; see the Terms of Service. So selling the content to third parties to be used without attribution is perfectly within their rights.
– Emil Jeřábek
Commented Apr 21, 2023 at 13:59
1

@EmilJeřábek A counterpoint is that third parties can also still use SO content for free, so long as they do provide attribution (thanks to said CC-BY-SA licensing).
– TylerH
Commented Apr 21, 2023 at 15:11
@TylerH They can, yes.
– Emil Jeřábek
Commented Apr 21, 2023 at 16:31
@Emil I'm not sure that's a correct interpretation– I'm no lawyer, but to me that line reads as just emphasizing that the content being removed doesn't revoke SE's license on the content. The full line reads: "...and you grant Stack Overflow the perpetual and irrevocable right and license to [use in roughly any way] such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you".
– zcoop98
Commented Apr 21, 2023 at 17:26
5

@EmilJeřábek: It would also be interesting to know when "commercially exploit" was added to the ToS. I'm at least 99% sure that it has not always been there and most contributed content is not licensed to SE for commercial exploitation.
– Ben Voigt
Commented Apr 22, 2023 at 15:18
1

@zcoop98 I’m no lawyer either, but see the discussion in opensource.stackexchange.com/questions/5663/…
– Emil Jeřábek
Commented Apr 28, 2023 at 19:35

Add a comment |

Jon Ericson · Accepted Answer · 2023-04-24 18:36:11Z

This quote caught my eye as I was reading:

When AI companies sell their models to customers, they "are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license," Chandrasekar says.

I look at it from a different angle than you, however. When I asked ChatGPT:

Were you trained on Stack Overflow data?

It replied:

As an AI language model, I was trained on a large and diverse corpus of text, which includes various sources of data such as books, articles, and web pages. However, I cannot confirm whether Stack Overflow data was included in my training corpus or to what extent it may have influenced my training.

OpenAI, the creators of the GPT-3.5 architecture that I am based on, has not released detailed information on the exact sources and amount of data used to train their language models, including mine. However, it's worth noting that my training data was pre-processed and filtered to ensure that it meets ethical and legal standards, such as removing personally identifiable information and sensitive content.

Regardless of my training data, my responses are generated based on statistical patterns and relationships between words and concepts that I learned during training, and I do not have any direct access to external data sources.

I did regenerate this response once to get what I was looking for. It's probably a mistake to read any nuance into these answers. Given that ChatGPT can code, it's virtually impossible that Stack Overflow data wasn't used. Due to the nature of the process, the model can't determine which bits of text it used to produce any specific output.

Of course I can't either. At one point I learned how to use the tar command, but I couldn't attribute all of the input I used to generate my answers. (Honestly, I don't recall writing one of these answers in the first place!) Charging other companies for using Stack Overflow data feels like a violation of the spirit of the endeavor. I mean, if it's not illegal to sell a book based on the data, I don't suppose it's illegal to sell the data itself. It just . . . why would anyone buy it unless the free copy stops existing?

Coincidentally, I ran across a site that uses the data dump from the dearly departed (in my opinion) Stack Overflow Documentation project. There's no attribution or even a link to the data. It's pretty annoying since the whole point of CC BY-SA is to maintain a link between the creator and consumer of content.

In my opinion, companies that use Stack Exchange content to train their language models should be very clear that they are using it. If Stack Overflow wants to charge for using its brand or special access to the data, that seems fine to me. In some ways I'd like to avoid blame for the trivial influence my answers have had on ChatGTP's buggy and sophomoric output. As long as my (possibly buggy and sophomoric) output remains freely available to anyone looking for it, I'm good.

Update after thinking a bit more

I asked Google Bard my "Were you trained on Stack Overflow data?" question based on Laurel's comment:

Yes, I was trained on a massive dataset of text and code, including Stack Overflow data. This data includes questions, answers, and code snippets from Stack Overflow, which is a question-and-answer website for programmers. This data has helped me to learn about a wide range of programming topics, and to be able to answer your questions in a comprehensive and informative way.

That's a more direct acknowledgement. Then I asked "How much wood could a woodchuck chuck?" (as you do) and got this answer from Bard:

A woodchuck could chuck about 700 pounds of wood on a good day, with the wind at his back. This is based on the research of Richard Thomas, a New York fish and wildlife technician who calculated the volume of dirt in a typical 25–30-foot (7.6–9.1 m) long woodchuck burrow and had determined that if the woodchuck had moved an equivalent volume of wood, it could move "about 700 pounds (320 kg) on a good day, with the wind at his back".

Sources

www.myth-weavers.com/showthread.php?t=395451

github.com/kevin-hartman/MIDS

en.wikipedia.org/wiki/How_much_wood_would_a_wood

This tells me two things:

Woodchucks are active little buggers.
It's possible for a large language model to cite sources.

That means Stack Exchange Inc could ask for Stack Exchange/Overflow content to be listed as a source. This would immediately make answers from language models more useful. It would also directly address the attribution concern. Whether or not they come to a commercial agreement is an entirely separate question.

Based on my experience as a former Community Manager, it seems likely management didn't investigate how this announcement would go down when the community found out about it. As a community manager at College Confidential I learn about potential revenue sources weeks or even months before deals are signed. I just got out of a weekly team meeting where we discussed upcoming clients to make sure that legal, product, marketing and community are ready the moment the contracts are signed. I'm not sure why this isn't part of Stack Overflow (the company) culture, but keeping community managers in the loop really makes a difference for our entire company.

From my perspective, the community benefits when the company that hosts it stays viable as a business. Meanwhile a company that hosts a community benefits when it correctly values that community. Practically that means a community-centered business might need to say no to profitable deals that harm the community. Fortunately, Stack Exchange has an excellent community team who might help the business side of the company generate revenue without eroding the value of the community. As always, I wonder if they will listen.

Stack Overflow (and the rest of the SE network) was definitely used to make Bard — Laurel, Commented Apr 21, 2023 at 16:50
Legally, it's murky territory, and I expect that it will remain murky for a while. Copyright laws around the world were originally framed in terms of commercial copying and they still haven't adjusted to the fact that for a couple of decades the average person has been able to easily make multiple high quality copies of text and other data. So I expect it will take some time for judicial & legislative bodies to understand what ML / AI are actually doing, and how copyright applies to them. And of course the capabilities & techniques of AI will continue to rapidly evolve. — PM 2Ring, Commented Apr 22, 2023 at 9:48
I suppose it comes down to: is GPT copying its TD (training data), or is it simply learning it, like a human reader would? Or is that a false dichotomy? ;) In a sense, its network weights constitute a lossily compressed image of its TD (with a compression ratio of ~50 for GPT-3). Some of its output is regenerated from thousands or millions of sources, but it's also quite possible for it to reproduce extended sequences from its TD which are virtually exact copies of one person's work. — PM 2Ring, Commented Apr 22, 2023 at 10:07
@PM2Ring does the JPEG algorithm copy its input image, or simply learn from it? The implementation details are not what matters, the externalities are, therefore the question of whether GPT "copies or learns" is a distraction at best, a profoundly insidious loophole at worst. — M-Pixel, Commented Apr 23, 2023 at 0:31
@M-Pixel Ok, JPEG compression & decompression performs a mathematical algorithm (the Discrete Cosine Transform & quantisation) and produces a modified copy of the input image, but the reproduction is sufficiently good to be perceived as a genuine copy (unless the quality setting is really low). So I think we can safely classify that as copying, not one-shot learning. — PM 2Ring, Commented Apr 23, 2023 at 6:12
The intention of GPT training is to learn the language sequences in the TD, and that's probably covered under Fair Use, going from the examples (eg Google Books) mentioned in Bryan Krause's links. In general, a prompt cannot reliably cause ChatGPT to regenerate a sequence from the TD. However, it most certainly may reproduce extended sequences with high fidelity. And those reproductions might not be covered by Fair Use, and probably ought to have some form of attribution. But as I said above, this is still murky territory, and IANAL. — PM 2Ring, Commented Apr 23, 2023 at 6:13
Copyright notices say that you cannot copy the work or store it in any information retrieval system, in any way, mechanical, electronic, etc. However, a human with eidetic memory is certainly permitted to read a book, copying it perfectly into their memory, without fear that they're breaking any law. However, if they then reproduce the book from memory, then they're infringing on the copyright. Will the courts apply similar logic to GPT? Who knows... — PM 2Ring, Commented Apr 23, 2023 at 6:19
@Jon "I'd like to avoid blame for the trivial influence my answers have had on ChatGTP's buggy and sophomoric output." I can relate to that. Back in Ancient Times (as I'm sure you know), before proper OSS licences were developed, it was common to see "If you break it, it's yours" clauses in open source software copyright / license notices. That is, you may use and modify my software, but if you mess it up, please take my name off it. ;) — PM 2Ring, Commented Apr 23, 2023 at 6:26
One thing to note, citations generated by AI may look correct, but be completely fabricated. These sorts of AI are transformational. If I chop newspaper headlines up and create a collage, I don't have to give credit to the newspaper's writers; I've created something new even though it contains pieces of their original writing that is intelligible. (that's not an analogy; It's an illustration of transformative in the sense if copyright. The technology works much differently because it has to generate a semantically relevant grammatical response) — ColleenV, Commented Apr 24, 2023 at 20:09
Also I wish I could upvote a second time for your point about coordinating with the CM team to evaluate how the community fits into the business plans. — ColleenV, Commented Apr 24, 2023 at 20:13
@ColleenV: I checked three sources referenced in the woodchuck answer. They did seem relevant to the question. It might be that this is a special-cased question? I haven't found a question that lists SO/SE as a source in any case. — Jon Ericson, Commented Apr 25, 2023 at 1:37
@JonEricson Why is Kevin Hartman’s github is cited for Richard Thomas’s work? Wikipedia doesn’t have a page at the provided link and the forum post cited is just a copy and paste from Wikipedia that cites a 1988 AP article. — ColleenV, Commented Apr 25, 2023 at 10:26
These AI’s aren’t trained to be accurate, just to fabricate plausible relevant text based on what is in their training set, which includes plagiarized/duplicated texts, misinformation, satire etc. A more specialized model trained on a cleaner data set can be overlayed on the general model to improve the info while keeping the ability to parse and generate natural language. I don’t know how much that can push the general model toward accurate statements. My husband is working with the image generation AI’s, and the real magic seems to be in training models for specific purposes. — ColleenV, Commented Apr 25, 2023 at 10:29

This_is_NOT_a_forum · Accepted Answer · 2023-04-22 17:38:38Z

I seriously don't know where to start.

SE has always emphasized the CC license model which includes attribution and you have to be a lawyer to figure out that TOS change introduced in 2018 meant content is licensed to SE under additional licensing terms which gave them right to use our content and do whatever they want to do with it, including removing attribution. See: Do Stack Exchange’s ToS mean that the user-generated content is double-licensed to them?

Now, even if reading the TOS made it clear that this change means dual licensing, I wouldn't immediately call that a problem because there are certain situations under which SE may need those additional terms in order to run the sites. For instance, when a user decides to delete their account, or the account gets deleted because of rule violation, or similar.

There is nothing wrong with that.

Also, SE is a for-profit organization, and there is also nothing wrong with SE making money on top of user-provided content. But, again, I have never envisioned in my wildest dreams that making money off our content could include selling that content in a way that would remove attribution.

Training AI on our content to improve site capabilities, like better search and finding duplicates and similar questions or even for improving writing question titles is one thing, but training AI on our content in ways that will allow AI to generate full-fledged answers and other stand-alone information where attribution will be lost is a completely different story.

I didn't sign up to have my content used for generative AI training that will remove attribution.

First, I'd like to say that the intent of what Prashanth is saying is very simple: to return value to the community for the work that you have put in. The money that we raise from charging these huge companies that have billions of dollars on their balance sheet will be used for projects that directly benefit the community.

The stance SE and (or) CEO has seems to be: "Companies are using SE content for generative AI, violating the CC license, so let's make some money out of it."

Thanks, but no thanks!

I would expect that SE with their position can at least try bringing up the attribution topic in public discourse and media, and pursue higher standards for generative AI, where AI would have to include attributions with generated texts.

I know that this may not be enough to prevent abuse of our content by companies that train AI models, but it is the minimal first step.

And last but not least, how did the community find out about those plans?

Did CEO Prashanth Chandrasekar come here personally and discuss with the community before making such plans, or at least had he the decency to inform the community first hand about those plans? Did he come here and explain how will community benefit form this money?

No, he told about those plans to other media and wrote an extremely vague post on the Stack Overflow blog that didn't even mention anything close to selling our content for training AI.

CEO is as detached from the community, as he can possibly be.

This kind of behavior on behalf of the company and the CEO can only lead to another wave of valuable members leaving the sites and reducing their contributions.

And eventually, when the quality of content drops (and the sites are already full of trash), the results of this lucrative sales deal will be garbage in, garbage out.

"And last but not least, how did the community find out about those plans?" related: Why hasn't there been an official community announcement or blogpost about charging for usage of subscriber content in the training of LLMs? — starball, Commented Apr 28, 2023 at 21:53

Mad Scientist · Accepted Answer · 2023-04-21 20:40:36Z

In the end this simply means that SE will sell the community-created content under a non-CC commercial license that has no attribution requirement, circumventing the usual restrictions that apply to all the content here.

Of course we don't know the details, but I see no other conclusion that fits the statements by the CEO. Though this is rather unsettled legal territory, so I think it is quite possible that some future court cases might obsolete all ideas SE had here.

I'm not entirely sure SE has the legal right to do this, but the ToS are a bit too hard to interpret for me as a non-lawyer in this case. I would read the statements by the CEO as an indicator that at least SE thinks they have the legal right to do this.

I'm not entirely opposed to this idea from the start, but I am a bit wary as this is essentially selling content that was created entirely by the community for free. The part that pushes my opinion here deeply into the negative area is the way SE has handled this so far. Reading this from a random article instead of having it announced here is a bad start, even some SE employees did not know about this. Reading this from a random article just after the CEO posted a blog post on the same topic of AI that was devoid of any substance feels deliberate, and is rather insulting. Explaining to us now that SE is doing this to protect our right to attribution by adding themselves as a middleman that just charges for content they didn't create is an outright insult.

Another danger here is that SE might be tempted to add some kind of restrictions to ensure that they can charge big companies for our data. I think that would be a grave mistake and I would strongly urge SE to be extremely careful. If they stop publishing the data dumps or restrict the API in ways that interfere with regular users, it will harm relations with the community sigificantly.

related: Can SE just resell our data, relicense it and remove the attribution requirement? also, related to "Reading this from a random article instead of having it announced here is a bad start", see Why hasn't there been an official community announcement or blogpost about charging for usage of subscriber content in the training of LLMs? — starball, Commented Apr 21, 2023 at 20:51
Quoting Phillipe's answer post: "We may need to tighten up access controls some to prevent abuse, but there will be a method for community members to access the api and its data for their own use." — starball, Commented Apr 21, 2023 at 20:52

Bryan Krause · Accepted Answer · 2023-04-21 15:25:50Z

Right now, as far as I understand it the legal doctrine that purveyors of large AI projects have been operating under in the US is the idea that their use is Fair Use under US copyright law. Fair Use covers cases where you do not need permission of the creator/rightsholder to use the content.

However, this has not really been tested in courts yet. See some articles about this:

https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/

https://www.linkedin.com/pulse/fine-line-fair-use-ai-copyright-concerns-geoctrl/

https://www.theverge.com/23444685/generative-ai-copyright-infringement-legal-fair-use-training-data

Importantly, if training an AI is Fair Use, the license for content you post here doesn't matter. The license is for giving people permission to use your copyright-protected content under certain terms. Fair use is not protected by copyright, so it doesn't matter what terms you set.

If SE wants to sell access to the data here, that would suggest they believe Fair Use will not hold up for this use of content, and presumably that they're willing to defend that idea legally, otherwise they'd be selling something that's already free. If you like the license on your content here and are bothered by companies stepping around it to train their AI models, that's probably a good thing for you, unless you feel comfortable taking on legal action against the large companies using this data yourself.

At that point, the license terms do apply, so it's worth considering whether it's possible to comply with those terms while selling the data for training. That seems tricky to me and I don't know enough to comment on it.

tl;dr: It seems there are a lot of legal issues to work out here, and importantly, the terms of the CC-BY license are not where it makes the most sense to start.

"If SE wants to sell access to the data here, that would suggest they believe Fair Use will not hold up for this use of content" To me, it suggests that they believe they can prevent its use by technical means, such as ceasing the data dumps. In fact, if use of this content for LLM training is not fair use, then SE does not have the right to license the content for use without attribution, as they only hold a CC BY-SA license to the content. If it is fair use, then, as you note, their permission is not required; their only option would be to block access to it somehow. — Ryan M, Commented May 12, 2023 at 1:11
To me, that would be much worse than unauthorized use of SE content without attribution, and would make me reconsider contributing to the network going forward. By contributing to SE, I'm contributing to a publicly accessible dataset that will live on even if SE goes away. Without the data dumps, it's just another walled garden for a company to exploit for profit. Why would I donate my time to that? — Ryan M, Commented May 12, 2023 at 1:15

starball · Accepted Answer · 2024-07-01 18:58:23Z

8

I have a similar hope to Andreas that profits from our contributions go back into improving the platform.

In particular, I'd like to see design, discussion, and development time get put into the long lists of bug tickets and feature requests.

MSE: is:q [feature-request] or [bug] -[status-completed] -[status-by-design] -[status-declined]
MSO: is:q [feature-request] or [bug] -[status-completed] -[status-by-design] -[status-declined]
That's not even counting the same search queries performed on all other network sites' meta sites.

As for how much of that profit, the part of me that doesn't want to expect too much says at least 50%, and the part of me that sees a company possibly about to make a bunch of money off the backs of people who have always been on a mission to freely make the internet a better place says as close to 100% as you can reasonably make it.

Update: Community Asks Sprint. Very exciting. Hopefully this initiative stays alive long term.

edited Jul 1 at 18:58

answered Apr 21, 2023 at 4:20

starball

26.8k8 gold badges52 silver badges129 bronze badges

2

Wishful thinking. They'll just keep improving Teams, as that's what gives them money back. But even so, I do agree it's good, business can use as much money as it can get, and even stash some for darker days.
– Shadow Wizard
Commented Apr 21, 2023 at 6:08
1

@ShadowTheSpringWizard - I'll tell you that of the lists that I've seen (which were generated using almost exactly the search terms featured above) for discussion, there were absolutely zero features that were pointed exclusively at the Teams product. Obviously some will benefit both the Teams product and the Public Platform, but I've seen no lists that include items exclusive to Teams, and many items that benefit the Public Platform exclusively.
– Philippe StaffMod
Commented Apr 24, 2023 at 0:56

Add a comment |

Stack Exchange Network

Is SE [going to be] selling our content for AI model training? And what exactly does "reinvest back into our communities" mean?

7 Answers 7

Update after thinking a bit more

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
discussion
legal
licensing
ai-tools
.

Linked

Hot Network Questions

Is SE [going to be] selling our content for AI model training? And what exactly does "reinvest back into our communities" mean?

7 Answers 7

Update after thinking a bit more

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged discussionlegallicensingai-tools.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
discussion
legal
licensing
ai-tools
.