
in reply to mcc

So developers will stop sharing information on #StackOverflow and future #Copilot and friends will be forever stuck in the past, answering questions about historically relevant frameworks and languages.
#LLM #StuckOverflow
in reply to chris@strafpla.net

@chris Yeah. But for this to be true, we need a Stack Overflow replacement. And when Reddit went evil, the move to Lemmy didn't seem to succeed as well as the move from Twitter to Mastodon.
in reply to mcc

IIRC Mastodon is older than Lemmy, and the current move to Mastodon/the Fediverse happened in multiple waves, so it may be too early for higher expectations.
For Stack Overflow I expect some degradation of quality since they accept “AI”-generated content. This may additionally frustrate high-quality authors and motivate them to leave. We’ll see.
What would a federated stack overflow look like if we were to invent it?
in reply to chris@strafpla.net

@chris I don't know. It's an interesting question because Stack Overflow is inherently more search-focused than Lemmy or Mastodon.

A good model for a distributed/ownerless SO might wind up looking more like bluesky than mastodon.

in reply to mcc

@chris And, of course, there's the weird element that the SO license *already* does not permit AI on a facial reading, and a distributed SO would probably be *easier* to scrape than the centralized one. So you're not actually preventing AI exploitation, you're only punishing one corporation (SO) for the AI bait-and-switch.
in reply to mcc

I personally see less problem in scraping a federated pool of knowledge but I absolutely hate that stackoverflow now owns this knowledge and can keep people from using it but sell “AI” as a service to them.
in reply to chris@strafpla.net

@chris I suppose one thing to consider is that if a federated pool of knowledge is CC-BY-SA, then we only need a court ruling that OpenAI violates CC-BY-SA and the federated pool becomes AI-safe. Whereas SO can change (or already has changed) the TOS so they own the rights to relicense all content.

…but of course, CC-BY-SA is also incredibly inconvenient for a SO clone because everyone will generally want to copypaste sample code!

in reply to mcc

So we’d be looking for Schrödinger’s license, allowing and forbidding closed derivative works at the same time :-)

(I have a feeling that a lot of licenses only work because nobody has a close look at how their objects are used.)

in reply to chris@strafpla.net

@chris If I were actually trying to create a stackoverflow clone, I'd have the default license be something like "all code blocks are CC0 but all human text outside the code blocks is CC-BY-SA". That would I think match the unspoken expectations both contributors and readers have.


in reply to mcc

That seems like a good and very straightforward approach; it would at least meet my expectations exactly.
in reply to mcc

@chris I *am* worried about the effect "AI" scraping is gonna have on copyleft in general, tho. I think people have for many years released copyleft on the rule of "hey, why not" and now the answer is "bc AI". (More thoughts: https://mastodon.social/@mcc/112209121196262534 ) Like, my proposed license in the last post would be very AI-friendly.
in reply to mcc

Most open source licenses, including permissive ones, require attribution. “AI” does not and cannot do attribution, so the vast majority of open source licenses are already AI-safe.

Of course, “AI” companies are already getting away scot-free with blatantly violating those licenses, so “safe” isn't really the correct word…

@chris

in reply to argv minus one

@argv_minus_one Until today we considered it sufficient if a derivative work attributed all its sources in one place. A collage of images or an application would come with a file or metadata carrying the necessary attributions and licenses.
Changing this would do damage.

I feel as if we’re aiming at our collective foot because we discovered a black spot.

in reply to chris@strafpla.net

@chris @argv_minus_one I have never once seen an LLM style "AI" product which even attempts to comply with attribution licenses.

This said, if the current subject of discussion is Stack Overflow, the content license on Stack Overflow is CC-BY-SA which is substantially stricter than an attribution license.

in reply to mcc

@argv_minus_one Oh, sorry if I wasn’t clear. I don’t mean to suggest that the current batch tries to comply; I just wrote that complying with the attribution clause would easily be possible.

We have different opinions on ML. I consider it okay for Altman et al. - and for the public! - to train the emperor’s new “#AI” on CC material if they follow the rules for derivative works (and the rest of the rules).
And I think that #AISafe is a dangerous can of worms.

in reply to chris@strafpla.net

@chris @mcc SO publishes database dumps so we could all make a fork and start from there with something more libre
in reply to Szymon Nowicki

@hey Good idea!
I was wondering if they still did; I expected that they had already stopped doing this.
I had a tool that indexed local copies of SO for referencing, but I keep forgetting to reinstall it and update the database.
Thanks for reminding me!
in reply to chris@strafpla.net

@chris they still do (https://archive.org/details/stackexchange) and still out of their own infrastructure.

IIRC they made Stack Exchange in response to the enshittification of another Q&A service, and when they designed it they promised to release the content under an open license and make it publicly available, so that once they go evil people can move on somewhere else, taking the content with them.

Which I guess might be heading in that direction.

in reply to mcc

@chris

Or the move from SlashDot to SoylentNews.

Simplest way: if you see a service that shows hints of this, warn your friends and get a large bucket of popcorn.

in reply to Billy Smith

@BillySmith @chris Don't look at me. I was part of the exodus from Slashdot to Kuro5hin. Which I thought actually went pretty well
in reply to mcc

This is really sad, I’m sorry.
It shows again that information can only persist if it is copied and spread.
That’s why publishing exclusively on a corporate platform is such a bad idea. Just imagine YouTube one day really succeeding in locking ‘their’ content away.
in reply to chris@strafpla.net

@chris
Louis Rossman is working on software to get around this. :D

https://en.wikipedia.org/wiki/Louis_Rossmann

https://www.youtube.com/watch?v=dqTYg6vnQvw :D

in reply to Billy Smith

@BillySmith On top of storing a lot of text over the years, I’ve been downloading most videos I consider exceptional for a while. I tried to extend this to everything that I read/watched using tools like https://archivebox.io, but this is still a crutch because it will only be archived for myself. To me something like PeerTube looks very interesting as a concept.
in reply to chris@strafpla.net

@chris
I've done the same.

When i looked at the streaming approach, i could see the future enshittification.

Peertube is great. :D

Another approach can be found here:

https://www.kickstarter.com/projects/mirlo/mirlo

and

https://mirlo.space/

in reply to mcc

@chris

I lost access to my SD account back in '99, but couldn't be bothered to find it again.

It was interesting to watch, but the hints of the bust-out were always there.

in reply to Billy Smith

@BillySmith To me it’s interesting that something that was so interesting to me 25 years ago has completely vanished from my perception today. I may remember Slashdot about 4 times a year or less. To be fair, I may think of Kuro5hin about 5 times, but only because I mention “The Metamorphosis of Prime Intellect” to someone, a #SciFi story that was published there.
in reply to mcc

@chris I've said it before and I'm sorry if I sound like a broken record:

Then they'll just scrape from the Stack Overflow replacement. Any creative works any human ever puts on the internet again is just training data now. There is no way we can share code with each other anymore *without* also giving it as a free gift to Sam fucking Altman and his ilk.

in reply to datarama

@datarama @chris If it is really the case I cannot prevent Altman from creating derivative works of anything I make, then I at least want to create the maximum possible financial consequences for any company which intentionally helps him. Stack Overflow may not have been able to prevent Altman from scraping their site. But they didn't have to accept his money.
in reply to mcc

@chris No, they didn't, and they're assholes for doing so. People *should* be leaving that moral dumpster fire of a site behind.

I just can't see how we can build an alternative without AI barons just using that as a pool of free labour instead. Licenses and copyright only apply to people like you and me now, not to them (as you've also pointed out).

in reply to datarama

yeah we share mcc's concerns about what this means for the commons. we refuse to give up on that, but it's going to be hard.
in reply to Irenes (many)

@ireneista @chris My point is that I can't see how it's even *possible* to maintain an open internet commons anymore, because a robot strip-mine is not the same as a commons.

I hope I'm wrong! But I can't see how. Anything that is openly available is free training data, and we can say "please don't use this as training data" all we want; they don't care (and they don't have to).

in reply to datarama

well, you're describing a constraint. the engineering mindset does say to start by doing that.
in reply to Irenes (many)

we don't know the solution yet either, but then we still don't feel like we have a sufficiently precise formulation of the goal, so... there is certainly stuff to think about
in reply to Irenes (many)

@ireneista @datarama This. What is the precise thing we are aiming at - and does it have the form of a foot?

(and then: who is the multitude of “we” and how many different things can be called “a thing”?)

in reply to Irenes (many)

@ireneista @datarama I meant to address that the people discussing #AISafe may consider their group a homogeneous “we” of known size, but I have my doubts about this.
Nonetheless a very interesting link, thank you!
in reply to Irenes (many)

@ireneista @chris In the EU, ML scrapers are legally required to respect a "machine-readable opt-out" for copyrighted content when training commercial systems (they can ignore copyright entirely for academic research).

The only current specification for that opt-out is W3C's TDMReP (https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240202/). But it doesn't really work for, e.g., free software.

1/
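[For reference: the TDMReP report mentioned above lets a site publish its opt-out in a `/.well-known/tdmrep.json` file. The sketch below shows roughly how a compliant scraper might check it; the field names (`location`, `tdm-reservation`) follow the W3C community group report, but the glob-matching logic here is an illustrative assumption, not the spec's exact semantics.]

```python
import fnmatch

# A minimal mock of a /.well-known/tdmrep.json payload: a list of rules,
# each reserving (1) or waiving (0) text-and-data-mining rights for a
# location pattern. Field names follow the W3C TDMReP report.
TDMREP_EXAMPLE = [
    {"location": "/code/*", "tdm-reservation": 1},  # mining reserved
    {"location": "/blog/*", "tdm-reservation": 0},  # mining allowed
]

def tdm_reserved(path: str, rules: list[dict]) -> bool:
    """Return True if TDM rights are reserved for `path`.

    First matching rule wins; with no matching rule there is no
    machine-readable reservation (matching semantics assumed here).
    """
    for rule in rules:
        if fnmatch.fnmatch(path, rule["location"]):
            return rule.get("tdm-reservation", 0) == 1
    return False

print(tdm_reserved("/code/main.c", TDMREP_EXAMPLE))  # True
print(tdm_reserved("/blog/post-1", TDMREP_EXAMPLE))  # False
```

[As datarama notes next, nothing in this mechanism survives someone copying the files to a host with no such reservation.]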

in reply to datarama

@datarama @ireneista So if a public fediverse instance sets the TDM-reserved flag, its content can’t be used in a study on the prevalence of Nazi propaganda on the fediverse?
in reply to datarama

@ireneista @chris Say I self-host some code I've written and set up my site's TDMReP to tell scrapers that I have opted that code out.

Now someone copies my code and puts it on Github. And now it is no longer opted out. Someone else (including employees at the companies I'd want to opt out from) can unilaterally void my opt-out, simply by copying my things elsewhere.

And free software licenses by their nature permit them to!

2/

in reply to datarama

@ireneista @chris The only way around this would be by putting my opt-out in the license. But 1) that's not "machine-readable", and 2) if my license says you can't copy my code and put it elsewhere, it's not exactly friendly to the commons (or free software in any sense).

And this is about EU, which actually *has* a restriction on this. Other jurisdictions don't even have that.

To me, it looks like anyone who isn't an AI executive is fucked.

3/3

in reply to datarama

@datarama @ireneista Only fucked if your goal is to avoid machine learning on CC works completely, which IMO is pointless / missing the point. You can’t allow and disallow derivative works at the same time, and you have no idea what kind of “derivative works” people will come up with in the future. But derivative works from public works should be public and not be sold as closed source.
(And then there’s the questions AGPL tries to answer.)
in reply to chris@strafpla.net

@chris don't worry, they'll probably just stick bots in every matrix/gitter/slack/discord/zulip they can find and train models on that instead
in reply to caitp

@caitp @chris "so, why exactly do I have to wear a fursuit to fix my issues with systemd?"
in reply to likely not a disguised martian

@kyonshi @caitp When I made this year’s resolution to really embrace systemd for a year or two, I didn’t know about this perk!
Tangentially: whatever part of a system you are in, systemd pops up. Even in this thread.
in reply to chris@strafpla.net

@chris
So this would be the perfect time to start answering questions with subtle bugs in them, and just wait a while until your code is replicated in all kinds of custom projects
in reply to Robbert

@mjrider You’re right, the bugs in current AI generated content are too obvious to really spread :-)
in reply to chris@strafpla.net

@chris TBH SO has felt a little stuck in the past even before this. Seems like the answers I find are quite old and don't accurately reflect the state of the art. I find many answers that use deprecated features of APIs, frameworks, etc.
in reply to mcc

an article went around recently about "rewilding" the Internet that made the analogy to clear cutting an old growth forest. You get incredible wood, but you can only do it once.
in reply to mcc

Earlier today I edited my (small) set of Stack Overflow posts to add the sentence "I do not consent to my words being used to train OpenAI" to the end. Within hours, all these edits were reversed and I got a warning email for "removing or defacing content". I did not remove any content. If this small sentence is "defacing", it is a very minor defacement. In no way was the experience of other users made worse by me adding one sentence.

To Stack Overflow, you are not a person. You are "content".


in reply to mcc

Not only does Stack Overflow say you don't have a right to remove your words from Stack Overflow, according to Stack Overflow, you don't even have the right to decide what words Stack Overflow publishes under your name.
in reply to mcc

Stack Overflow is subject to the CCPA privacy law, just sayin’.
in reply to mcc

this email makes me so pissed off. It's a for-profit fucking enterprise; content which was posted for free leads to the profits of the shareholders. It's unpaid labour!!!!!!
in reply to Sashin

@Sashin @mcc User-Generated Content, baby.

It's been a goldmine for a bit longer than the term "Web 2.0" has been around, but until recently we have been taking it as a social contract that we give it to the corporation and they give it to the world for some ad revenue.

That social contract is rapidly coming apart as investors see more profit potential in newly enabled modes of exploitation.

Unknown parent

chris@strafpla.net

@deflockcom @hacks4pancakes I don’t know about a #searchEngine that unlists “#AI” / #LLM generated content, but this thread may be tangentially interesting to you:

https://mstdn.strafpla.net/@chris/112039450597316623

Unknown parent

chris@strafpla.net
@datarama @ireneista Better: The nazis just have to set the tdm:non-research constraint and they can spread propaganda without anybody being allowed to analyze it. Nifty!
Unknown parent

chris@strafpla.net
Thank you, I missed that.
(And I now have to check what “commercial” means. Let’s hope the study is not published in a book or in a newspaper.)