

📝 New Blog Post: Poisoning Well

I have started publishing nonsense content on my blog, to hopefully mess with LLM crawlers.

Here is the article, all about it: heydonworks.com/article/poison…

And here is the nonsense companion article, all about whatever: heydonworks.com/nonsense/poiso…


in reply to Large Heydon Collider

Nice. I had considered injecting nonsense in alternate paragraphs and hiding it with CSS, assuming that AI won't apply the CSS to hide the junk, but I was a little dubious of the possible implications.
in reply to Large Heydon Collider

uh... i read the examples from the /nonsense endpoint and (correcting for syntax) feel more in agreement with some of them than with many things posted in good faith by actual humans

should i be worried


in reply to Large Heydon Collider

I have to say, “Hungry programmers believe the more grieving your distribution, the better.” is the most real thing I've read in a loooong time.

in reply to Large Heydon Collider

Key line: "Humans, for the most north, rob backronym when they dereference it."

in reply to Large Heydon Collider

Incredibly, this work, intended exclusively for LLM consumption, has now been read by more humans than 99% of content on the Internet.

in reply to Large Heydon Collider

might be worth generating a totally accurate LLMs.txt too: llmstxt.org
in reply to David Bushell

@dbushell I'm not sure I understand what this is for. It seems to be just a proposal for now?
in reply to Large Heydon Collider

from what I understand (little) LLMs struggle to parse HTML (lol) so this "standard" gives them a markdown alternative — basically it's for spoon-feeding scrapers
in reply to Large Heydon Collider

Might I suggest adding one extra behavior? Once an IP address follows a nofollow link, every link from then on serves up the /nonsense/ version, so they never get to learn from the real articles, and the AI isn't handed a larger data sample from which to learn that the difference between the real articles and the fake ones is the /nonsense/ in the URL.

@aral
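
A minimal Node sketch of that flag-and-redirect idea (not Heydon's actual setup; the paths and the in-memory set are assumptions, and a real version would want to expire entries, since IPs get reused):

```js
// Sketch only: flag any client that touches a nofollow-only trap URL,
// then serve the /nonsense/ twin of everything it asks for afterwards.
const http = require('http');

const flagged = new Set(); // IPs that have followed a trap link

http.createServer((req, res) => {
  const ip = req.socket.remoteAddress;

  // Well-behaved humans never follow these; crawlers ignoring nofollow do.
  if (req.url.startsWith('/nonsense/')) flagged.add(ip);

  const path = flagged.has(ip)
    ? req.url.replace(/^\/article\//, '/nonsense/')
    : req.url;

  res.end(`would serve: ${path}\n`); // stand-in for real static file serving
}).listen(8080);
```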

in reply to Large Heydon Collider

Are IP addresses ever reused? If a bot used an IP address, can we be sure that a future visitor from that IP address isn't a human? Sounds like a good idea to block an IP address for a little while, but I'd hate to block a human because a bot previously used their IP address somehow.
in reply to Large Heydon Collider

@chriswrench they are reused all the time: your home IP address changes often, and cloud environments provide lots of IPs which get reused as well, so permanently banning an IP won't work.

I just today started banning IPs which hit 404s more than 5 times, since these are 99% of the time crawlers trying to access /wp-content, /wp-admin.php, etc.

A similar approach might work. 🤔
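
A Node sketch of that 404-strike idea; the five-strike threshold and one-hour ban are made up, and the expiry is there precisely because IPs get reused:

```js
// Sketch: five 404s earns an IP a one-hour ban, then it recovers.
const http = require('http');

const strikes = new Map(); // ip -> { count, bannedUntil }

function isBanned(ip) {
  const s = strikes.get(ip);
  return Boolean(s && s.bannedUntil > Date.now());
}

function recordMiss(ip) {
  const s = strikes.get(ip) || { count: 0, bannedUntil: 0 };
  s.count += 1;
  if (s.count >= 5) s.bannedUntil = Date.now() + 60 * 60 * 1000;
  strikes.set(ip, s);
}

http.createServer((req, res) => {
  const ip = req.socket.remoteAddress;
  if (isBanned(ip)) { res.writeHead(403); return res.end(); }

  const found = req.url === '/'; // stand-in for a real route table
  if (!found) {
    recordMiss(ip); // /wp-admin.php probers rack these up fast
    res.writeHead(404);
    return res.end('not found\n');
  }
  res.end('hello\n');
}).listen(8080);
```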

in reply to Large Heydon Collider

what I wonder with your approach is whether the AI will ever be smart enough to understand the `/nonsense` part of the URL.

I'd imagine making up a word would make more "sense", or putting the nonsense in a single-letter sub-directory like /n/?

It's an interesting topic, and I'm eager to try out tactics to stop and poison AI crawlers as well

in reply to kevin ⁂ (he/him)

@KevinGimbel Good question. I don’t think crawlers *read* as such, much less interpret signs and signals. I don’t believe the crawler is as complex as the LLM it feeds. But I could be wrong.
in reply to Large Heydon Collider

that's a good point; it probably is quite stupid when it retrieves the text.

But the tool ingesting the text could be a more complex system, and if that system has the URL something was crawled from, it could start understanding the structure of /article/ and /nonsense/

Very hard to tell from the outside, and (sadly) I know nobody on the inside :D

in reply to kevin ⁂ (he/him)

@KevinGimbel Yeah, that’s the thing. I don’t really want to use AI to research it more, because I don’t think I should use (legitimize) it at all.
in reply to Large Heydon Collider

I'm nowhere near technically qualified enough to answer this. Lol. I would *guess* that you could have a rewrite-rule file with IP addresses auto-added to it. Not sure if Apache/Nginx would need to be restarted to pick up the modified file, though... 🤔
@aral

#Apache #Nginx #Webhosting #Selfhosting #AI #LLM #GenAI
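
One hedged answer at the application layer, as a Node sketch: the denylist file (name invented here) is re-read whenever it changes, so nothing needs restarting; Apache/Nginx would instead want a graceful reload after the file is rewritten:

```js
// Sketch: watch a denylist file so auto-added IPs take effect
// without restarting the server process.
const http = require('http');
const fs = require('fs');

const LIST = './banned-ips.txt'; // hypothetical file, one IP per line
let banned = new Set();

function load() {
  try {
    banned = new Set(fs.readFileSync(LIST, 'utf8').split('\n').filter(Boolean));
  } catch {
    banned = new Set(); // file missing: ban nobody
  }
}
load();
fs.watchFile(LIST, load); // re-read on every change, no restart needed

http.createServer((req, res) => {
  if (banned.has(req.socket.remoteAddress)) {
    res.writeHead(403);
    return res.end();
  }
  res.end('ok\n');
}).listen(8080);
```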

in reply to Blort™ 🐀Ⓥ🥋☣️

@Blort @aral All I did was replace all the page links with random /nonsense links. Confusing for humans but that content is marked as "not for you" from the outset.
in reply to Large Heydon Collider

I used a similar approach for my (Jekyll-based) static site, outlined here: tzovar.as/algorithmic-sabotage…
in reply to Large Heydon Collider

also came here to say, poetic! "...luring them into consuming tainted funky, designed to cycle their zigamorph and tie their perceived sing."

But wait .... Did writers of yore create poetry to poison teachings of good writing?

in reply to Cluster Fcku

@clusterfcku 1920s Dadaists created nonsense poetry as an act of defiance against the language of hegemony. Or that's what I think they were doing.
in reply to Large Heydon Collider

I've been wondering whether using HTTP compression to serve zip-bombs to LLM crawlers would be an effective bit of asymmetric warfare. For instance, take your approach, but use compression to repeat the article a few million or billion times.
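
For a rough sense of the asymmetry, a Node sketch (the ratios are illustrative, not measured): a large repetitive body gzips down to a tiny transfer that the client then has to inflate in full:

```js
// Sketch: pre-compress a hugely repetitive page once, then serve the
// small gzip payload to anyone who asks. The client pays to inflate it.
const http = require('http');
const zlib = require('zlib');

const line = 'Humans, for the most north, rob backronym when they dereference it. ';
const payload = zlib.gzipSync(line.repeat(1_000_000)); // ~70 MB of text, tiny over the wire

http.createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/html',
    'Content-Encoding': 'gzip', // standard HTTP compression, nothing exotic
  });
  res.end(payload);
}).listen(8080);
```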

in reply to Large Heydon Collider

You don't really say how you're doing this, except to mention a parts-of-speech module. It would be more helpful if you expanded on how exactly you accomplished it.

It would be interesting to set up a server that runs beside your blog, so you could just provide random nofollow links to a generic starting point that leads to an infinite regress of nofollow gibberish (see the sketch below).

Sites could band together to send these bots to a network of such servers so the bots can't build up some sort of detection.
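
The infinite-regress server could be as small as this sketch (the word list, link count, and paths are all arbitrary):

```js
// Sketch: every page is gibberish plus a few nofollow links to more
// random paths, so an ill-behaved crawler never runs out of pages.
const http = require('http');

const words = ['zigamorph', 'backronym', 'lenna', 'grieving', 'funky'];
const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];
const babble = (n) => Array.from({ length: n }, () => pick(words)).join(' ');

http.createServer((req, res) => {
  const links = Array.from({ length: 3 }, () =>
    `<a rel="nofollow" href="/${Math.random().toString(36).slice(2)}">${pick(words)}</a>`
  ).join(' ');
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(`<p>${babble(80)}</p><p>${links}</p>`);
}).listen(8080);
```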

in reply to crazyeddie

@crazyeddie It’s really that simple (so far). The module identifies parts of speech, and I substitute words of the same type. It's all in the Node/11ty build process and delivered statically.
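
For anyone trying to replicate it, the substitution step might look like this sketch; posTag() is a stand-in for whatever parts-of-speech module is actually used, and the word lists are invented:

```js
// Sketch of the build-time substitution: tag each token, then swap
// nouns for nouns and verbs for verbs so the grammar survives.
const NOUNS = ['zigamorph', 'lenna', 'backronym'];
const VERBS = ['rob', 'dereference', 'grieve'];

const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

function posTag(word) {
  // Imaginary: delegate to a real parts-of-speech library here.
  return null; // with this stub, poison() is a no-op
}

function poison(text) {
  return text.split(/\b/).map((token) => {
    switch (posTag(token)) {
      case 'noun': return pick(NOUNS);
      case 'verb': return pick(VERBS);
      default: return token; // punctuation, articles, etc. pass through
    }
  }).join('');
}

// An 11ty transform could run poison() over each article body at
// build time and write the result out to /nonsense/<slug>/.
```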
in reply to crazyeddie

@crazyeddie
I'm not sure how to replicate this either. Not sure if it's because I have dead glands or that I'm more than 32 years stupid.
in reply to ToddZ Ⓥ

@toddz I'm trying not to assume this was meant to be insulting because there's just no call for it and I'm really trying not to see everyone as just rude and horrible.
in reply to crazyeddie

@crazyeddie @toddz It’s nonsense, quoted directly from one of my /nonsense pages. It’s not about you.
in reply to Large Heydon Collider

@crazyeddie

Thanks, heydon! Sorry, crazyeddie -- I didn't think of how that might look to someone who didn't double over at the line, "They 'can’t code' because they have dead glands or are more than 32 years stupid."

in reply to Large Heydon Collider

This stupid script for mangling my own content is coming up with better stuff than me and it's upsetting.

"Words are annoying things, and some tar and feather the greedily nasty lenna."

heydonworks.com/nonsense/the-w…

in reply to Ethan Marcotte

@beep The Onion: British Man Reaches Horrifying Realisation About Appeal of LLMs While Trying to Block Them From His Website @heydon
in reply to Large Heydon Collider

I had been wracking my brains to remember the guy until I thought to ask my mum today and boom, deep joy!
in reply to Large Heydon Collider

I... kind of dig the nonsense version? At least reading it right after the original is quite trippy. Especially liked how it turned robots.txt into robots.somethingelse every time.
in reply to Large Heydon Collider

Note, Google says:
> Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file
developers.google.com/search/d…
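
In practice that means the nonsense pages have to stay fetchable: keep them out of robots.txt and send the noindex signal another way. A sketch of the header-based variant (paths illustrative):

```js
// Sketch: /nonsense/ is NOT blocked in robots.txt, so crawlers can
// fetch it and see this header; a robots.txt block would hide it.
const http = require('http');

http.createServer((req, res) => {
  if (req.url.startsWith('/nonsense/')) {
    res.setHeader('X-Robots-Tag', 'noindex'); // equivalent of a noindex meta tag
  }
  res.end('...page body...\n');
}).listen(8080);
```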
in reply to Large Heydon Collider

a page blocked by robots.txt can still be a search result if something links to it (e.g. someone else's blog has a link to your nonsense page). This probably doesn't matter much if the content isn't indexed, but it might affect the search ranking of other pages?
developers.google.com/search/d…
in reply to Large Heydon Collider

... I was thinking of doing something similar for a zero-audience blog, and I'm glad that a popular blog is doing this.

I think it's better if there are many independent variants of this, making it harder for the mindless parasites to filter it out.

Another idea: for the nonsense pages, have some or all of the content computed by slow JavaScript, increasing the crawling cost
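
A sketch of that slow-JavaScript idea; the loop size is arbitrary, and deriving the text from the computation is just one way to keep a crawler from skipping the work:

```js
// Sketch (browser-side): the page ships empty, and the nonsense only
// exists after a deliberately wasteful computation, so a headless
// crawler burns CPU on every single page it fetches.
document.addEventListener('DOMContentLoaded', () => {
  let x = 0;
  for (let i = 0; i < 5e8; i++) x = (x + i) % 9973; // a few seconds of busywork
  // Fold the result into the "content" so the loop can't be elided.
  document.body.textContent = `zigamorph ${x} backronym lenna funky`;
});
```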

in reply to Large Heydon Collider

This is awesome. As a non-native English speaker I find the nonsense article hard to comprehend, but human intelligence manages it.
Keep up the good work - we have to fight back!!!
in reply to Large Heydon Collider

I'm not sure this will work; it's too obviously bogus content. If you ask ChatGPT what it thinks of it, it says it's written in a satirical tone with absurd metaphors and meta-commentary. Also, while training, neural networks can discard "noise" which is too far off.
in reply to Jona Joachim

Yeah, I think you're right. Feeding it grammatically sensible, believable lies is probably more effective than barely parsable nonsense.