

📝 New Blog Post: Poisoning Well

I have started publishing nonsense content on my blog, to hopefully mess with LLM crawlers.

Here is the article, all about it: heydonworks.com/article/poison…

And here is the nonsense companion article, all about whatever: heydonworks.com/nonsense/poiso…


in reply to Large Heydon Collider

Nice. I had considered injecting nonsense in alternate paragraphs and hiding it with CSS, assuming that AI won't apply the CSS to hide the junk, but I was a little dubious of the possible implications.
in reply to Large Heydon Collider

uh... i read the examples from the /nonsense endpoint and (correcting for syntax) feel more in agreement with some of them than with many things posted in good faith by actual humans

should i be worried


in reply to Large Heydon Collider

I have to say, “Hungry programmers believe the more grieving your distribution, the better.” is the most real thing I've read in a loooong time.

in reply to Large Heydon Collider

Key line: "Humans, for the most north, rob backronym when they dereference it."

in reply to Large Heydon Collider

Incredibly, this work, intended exclusively for LLM consumption, has now been read by more humans than 99% of content on the Internet.

in reply to Large Heydon Collider

might be worth generating a totally accurate LLMs.txt too: llmstxt.org
in reply to David Bushell

@dbushell I'm not sure I understand what this is for. It seems to be just a proposal for now?
in reply to Large Heydon Collider

from what I understand (little) LLMs struggle to parse HTML (lol) so this "standard" gives them a markdown alternative — basically it's for spoon-feeding scrapers
in reply to Large Heydon Collider

Might I suggest adding one extra behavior? Once an IP address follows a nofollow link, every link from then on serves up the /nonsense/ version, so they never get to learn from the real articles, and the AI isn't handed a larger data sample from which to learn that the difference between the real articles and the fake ones is the /nonsense/ in the URL.

@aral
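
A minimal Node sketch of that flag-and-redirect idea (not Heydon's actual setup; the paths and the in-memory set are assumptions, and a real version would want to expire entries, since IPs get reused):

```js
// Sketch only: flag any client that touches a nofollow-only trap URL,
// then serve the /nonsense/ twin of everything it asks for afterwards.
const http = require('http');

const flagged = new Set(); // IPs that have followed a trap link

http.createServer((req, res) => {
  const ip = req.socket.remoteAddress;

  // Well-behaved humans never follow these; crawlers ignoring nofollow do.
  if (req.url.startsWith('/nonsense/')) flagged.add(ip);

  const path = flagged.has(ip)
    ? req.url.replace(/^\/article\//, '/nonsense/')
    : req.url;

  res.end(`would serve: ${path}\n`); // stand-in for real static file serving
}).listen(8080);
```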

in reply to Large Heydon Collider

Are IP addresses ever reused? If a bot used an IP address, can we be sure that a future visitor from that IP address isn't a human? Sounds like a good idea to block an IP address for a little while, but I'd hate to block a human because a bot previously used their IP address somehow.
in reply to Large Heydon Collider

@chriswrench they are reused all the time: your home IP address changes often, and cloud environments provide lots of IPs which get reused as well, so permanently banning an IP won't work.

I just today started banning IPs which hit 404s more than 5 times, since these are 99% of the time crawlers trying to access /wp-content, /wp-admin.php, etc.

A similar approach might work. 🤔
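
A Node sketch of that 404-strike idea; the five-strike threshold and one-hour ban are made up, and the expiry is there precisely because IPs get reused:

```js
// Sketch: five 404s earns an IP a one-hour ban, then it recovers.
const http = require('http');

const strikes = new Map(); // ip -> { count, bannedUntil }

function isBanned(ip) {
  const s = strikes.get(ip);
  return Boolean(s && s.bannedUntil > Date.now());
}

function recordMiss(ip) {
  const s = strikes.get(ip) || { count: 0, bannedUntil: 0 };
  s.count += 1;
  if (s.count >= 5) s.bannedUntil = Date.now() + 60 * 60 * 1000;
  strikes.set(ip, s);
}

http.createServer((req, res) => {
  const ip = req.socket.remoteAddress;
  if (isBanned(ip)) { res.writeHead(403); return res.end(); }

  const found = req.url === '/'; // stand-in for a real route table
  if (!found) {
    recordMiss(ip); // /wp-admin.php probers rack these up fast
    res.writeHead(404);
    return res.end('not found\n');
  }
  res.end('hello\n');
}).listen(8080);
```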

in reply to Large Heydon Collider

what I wonder with your approach is whether the AI will ever be smart enough to understand the `/nonsense` part of the URL.

I'd imagine making up a word would make more "sense", or putting the nonsense in a single-letter sub-directory like /n/?

It's an interesting topic, and I'm eager to try out tactics to stop and poison AI crawlers as well

in reply to kevin ⁂ (he/him)

@KevinGimbel Good question. I don’t think crawlers *read* as such, much less interpret signs and signals. I don’t believe the crawler is as complex as the LLM it feeds. But I could be wrong.
in reply to Large Heydon Collider

that's a good point; it probably is quite stupid when it retrieves the text.

But the tool ingesting the text could be a more complex system, and if that system has the URL something was crawled from, it could start understanding the structure of /article/ and /nonsense/

Very hard to tell from the outside, and (sadly) I know nobody on the inside :D

in reply to kevin ⁂ (he/him)

@KevinGimbel Yeah, that’s the thing. I don’t really want to use AI to research it more, because I don’t think I should use (legitimize) it at all.
in reply to Large Heydon Collider

I'm nowhere near technically qualified enough to answer this. Lol. I would *guess* that you could have a rewrite-rule file with IP addresses auto-added to it. Not sure if Apache/Nginx would need to be restarted to pick up the modified file, though... 🤔
@aral

#Apache #Nginx #Webhosting #Selfhosting #AI #LLM #GenAI
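
One hedged answer at the application layer, as a Node sketch: the denylist file (name invented here) is re-read whenever it changes, so nothing needs restarting; Apache/Nginx would instead want a graceful reload after the file is rewritten:

```js
// Sketch: watch a denylist file so auto-added IPs take effect
// without restarting the server process.
const http = require('http');
const fs = require('fs');

const LIST = './banned-ips.txt'; // hypothetical file, one IP per line
let banned = new Set();

function load() {
  try {
    banned = new Set(fs.readFileSync(LIST, 'utf8').split('\n').filter(Boolean));
  } catch {
    banned = new Set(); // file missing: ban nobody
  }
}
load();
fs.watchFile(LIST, load); // re-read on every change, no restart needed

http.createServer((req, res) => {
  if (banned.has(req.socket.remoteAddress)) {
    res.writeHead(403);
    return res.end();
  }
  res.end('ok\n');
}).listen(8080);
```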

in reply to Blort™ 🐀Ⓥ🥋☣️

@Blort @aral All I did was replace all the page links with random /nonsense links. Confusing for humans but that content is marked as "not for you" from the outset.
in reply to Large Heydon Collider

I used a similar approach for my (Jekyll-based) static site, outlined here: tzovar.as/algorithmic-sabotage…
in reply to Large Heydon Collider

also came here to say, poetic! "...luring them into consuming tainted funky, designed to cycle their zigamorph and tie their perceived sing."

But wait .... Did writers of yore create poetry to poison teachings of good writing?

in reply to Cluster Fcku

@clusterfcku 1920s Dadaists created nonsense poetry as an act of defiance against the language of hegemony. Or that's what I think they were doing.
in reply to Large Heydon Collider

I've been wondering whether using HTTP compression to serve zip-bombs to LLM crawlers would be an effective bit of asymmetric warfare. For instance, take your approach, but use compression to repeat the article a few million or billion times.
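
For a rough sense of the asymmetry, a Node sketch (the ratios are illustrative, not measured): a large repetitive body gzips down to a tiny transfer that the client then has to inflate in full:

```js
// Sketch: pre-compress a hugely repetitive page once, then serve the
// small gzip payload to anyone who asks. The client pays to inflate it.
const http = require('http');
const zlib = require('zlib');

const line = 'Humans, for the most north, rob backronym when they dereference it. ';
const payload = zlib.gzipSync(line.repeat(1_000_000)); // ~70 MB of text, tiny over the wire

http.createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/html',
    'Content-Encoding': 'gzip', // standard HTTP compression, nothing exotic
  });
  res.end(payload);
}).listen(8080);
```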

in reply to Large Heydon Collider

You don't really say how you're doing this, except to mention a parts-of-speech module. It would be more helpful if you expanded on how exactly you accomplished it.

It would be interesting to set up a server that runs beside your blog, so you could just provide random nofollow links to a generic starting point that leads to an infinite regress of nofollow gibberish (see the sketch below).

Sites could band together to send these bots to a network of such servers so the bots can't build up some sort of detection.
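
The infinite-regress server could be as small as this sketch (the word list, link count, and paths are all arbitrary):

```js
// Sketch: every page is gibberish plus a few nofollow links to more
// random paths, so an ill-behaved crawler never runs out of pages.
const http = require('http');

const words = ['zigamorph', 'backronym', 'lenna', 'grieving', 'funky'];
const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];
const babble = (n) => Array.from({ length: n }, () => pick(words)).join(' ');

http.createServer((req, res) => {
  const links = Array.from({ length: 3 }, () =>
    `<a rel="nofollow" href="/${Math.random().toString(36).slice(2)}">${pick(words)}</a>`
  ).join(' ');
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(`<p>${babble(80)}</p><p>${links}</p>`);
}).listen(8080);
```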

in reply to crazyeddie

@crazyeddie It’s really that simple (so far). The module identifies parts of speech, and I substitute words of the same type. It's all in the Node/11ty build process and delivered statically.
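
For anyone trying to replicate it, the substitution step might look like this sketch; posTag() is a stand-in for whatever parts-of-speech module is actually used, and the word lists are invented:

```js
// Sketch of the build-time substitution: tag each token, then swap
// nouns for nouns and verbs for verbs so the grammar survives.
const NOUNS = ['zigamorph', 'lenna', 'backronym'];
const VERBS = ['rob', 'dereference', 'grieve'];

const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

function posTag(word) {
  // Imaginary: delegate to a real parts-of-speech library here.
  return null; // with this stub, poison() is a no-op
}

function poison(text) {
  return text.split(/\b/).map((token) => {
    switch (posTag(token)) {
      case 'noun': return pick(NOUNS);
      case 'verb': return pick(VERBS);
      default: return token; // punctuation, articles, etc. pass through
    }
  }).join('');
}

// An 11ty transform could run poison() over each article body at
// build time and write the result out to /nonsense/<slug>/.
```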
in reply to crazyeddie

@crazyeddie
I'm not sure how to replicate this either. Not sure if it's because I have dead glands or that I'm more than 32 years stupid.
in reply to ToddZ Ⓥ

@toddz I'm trying not to assume this was meant to be insulting because there's just no call for it and I'm really trying not to see everyone as just rude and horrible.
in reply to crazyeddie

@crazyeddie @toddz It’s nonsense, quoted directly from one of my /nonsense pages. It’s not about you.
in reply to Large Heydon Collider

@crazyeddie

Thanks, heydon! Sorry, crazyeddie -- I didn't think of how that might look to someone who didn't double over at the line, "They 'can’t code' because they have dead glands or are more than 32 years stupid."

in reply to Large Heydon Collider

This stupid script for mangling my own content is coming up with better stuff than me and it's upsetting.

"Words are annoying things, and some tar and feather the greedily nasty lenna."

heydonworks.com/nonsense/the-w…

in reply to Ethan Marcotte

@beep The Onion: British Man Reaches Horrifying Realisation About Appeal of LLMs While Trying to Block Them From His Website @heydon
in reply to Large Heydon Collider

I had been wracking my brains to remember the guy until I thought to ask my mum today and boom, deep joy!
in reply to Large Heydon Collider

I... kind of dig the nonsense version? At least reading it right after the original is quite trippy. Especially liked how it turned robots.txt into robots.somethingelse every time.
in reply to Large Heydon Collider

Note, Google says:
> Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file
developers.google.com/search/d…
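
In practice that means the nonsense pages have to stay fetchable: keep them out of robots.txt and send the noindex signal another way. A sketch of the header-based variant (paths illustrative):

```js
// Sketch: /nonsense/ is NOT blocked in robots.txt, so crawlers can
// fetch it and see this header; a robots.txt block would hide it.
const http = require('http');

http.createServer((req, res) => {
  if (req.url.startsWith('/nonsense/')) {
    res.setHeader('X-Robots-Tag', 'noindex'); // equivalent of a noindex meta tag
  }
  res.end('...page body...\n');
}).listen(8080);
```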
in reply to Large Heydon Collider

a page blocked by robots.txt can still be a search result if something links to it (e.g. someone else's blog has a link to your nonsense page). This probably doesn't matter much if the content isn't indexed, but it might affect the search ranking of other pages?
developers.google.com/search/d…
in reply to Large Heydon Collider

... I was thinking of doing something similar for a zero-audience blog, and I'm glad that a popular blog is doing this.

I think it's better if there are many independent variants of this, making it harder for the mindless parasites to filter it out.

Another idea: for the nonsense pages, have some or all of the content computed by slow JavaScript, increasing the crawling cost
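
A sketch of that slow-JavaScript idea; the loop size is arbitrary, and deriving the text from the computation is just one way to keep a crawler from skipping the work:

```js
// Sketch (browser-side): the page ships empty, and the nonsense only
// exists after a deliberately wasteful computation, so a headless
// crawler burns CPU on every single page it fetches.
document.addEventListener('DOMContentLoaded', () => {
  let x = 0;
  for (let i = 0; i < 5e8; i++) x = (x + i) % 9973; // a few seconds of busywork
  // Fold the result into the "content" so the loop can't be elided.
  document.body.textContent = `zigamorph ${x} backronym lenna funky`;
});
```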

in reply to Large Heydon Collider

This is awesome. As a non-native English speaker I find the nonsense article hard to comprehend, but human intelligence manages it.
Keep up the good work - we have to fight back!!!
in reply to Large Heydon Collider

I'm not sure this will work; it's too obviously bogus content. If you ask ChatGPT what it thinks of it, it says it's written in a satirical tone with absurd metaphors and meta-commentary. Also, while training, neural networks can discard "noise" which is too far off.
in reply to Jona Joachim

Yeah, I think you're right. Feeding it grammatically sensible, believable lies is probably more effective than barely parsable nonsense.