📝 New Blog Post: Poisoning Well
I have started publishing nonsense content on my blog, to hopefully mess with LLM crawlers.
Here is the article, all about it: heydonworks.com/article/poison…
And here is the nonsense companion article, all about whatever: heydonworks.com/nonsense/poiso…
Poisoning Well
An experimental strategy for contaminating Large Language Models
Heydon Pickering (HeydonWorks)
stateful being
in reply to Large Heydon Collider • • •uh... i read the examples from the /nonsense endpoint and (correcting for syntax) feel more in agreement with some of them than with many things posted in good faith by actual humans
should i be worried
David Bushell
in reply to Large Heydon Collider • • •The /llms.txt file – llms-txt
llms-txt
Blort™ 🐀Ⓥ🥋☣️
in reply to Large Heydon Collider • • •Might I suggest adding one extra behavior? Once an IP address follows a nofollow link, from then on *any* link serves up the /nonsense/ version. That way they don't even get to learn from the real articles, and the AI isn't given a larger data sample from which to learn that the difference between the real articles and the fake ones is the /nonsense/ in the URL.
@aral
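A minimal sketch of the behavior suggested above, assuming the real articles live under /article/ and their nonsense twins under /nonsense/ with matching slugs. The `PoisonRouter` class and its `resolve` method are hypothetical names for illustration, not anything from the blog's actual setup.

```python
from dataclasses import dataclass, field


@dataclass
class PoisonRouter:
    """Remembers clients that followed a trap link and reroutes them."""

    trap_prefix: str = "/nonsense/"   # assumed: only reachable via nofollow links
    real_prefix: str = "/article/"    # assumed: path of the genuine articles
    flagged: set[str] = field(default_factory=set)

    def resolve(self, ip: str, path: str) -> str:
        """Return the path whose content should be served to this client."""
        if path.startswith(self.trap_prefix):
            # Following a nofollow link marks the client as a likely crawler.
            self.flagged.add(ip)
            return path
        if ip in self.flagged and path.startswith(self.real_prefix):
            # Flagged clients get the nonsense twin of every real article.
            return self.trap_prefix + path[len(self.real_prefix):]
        return path
```

Because the swap happens on the server side, a flagged client would still see the /article/ URL it asked for, so the URL itself would no longer give the game away. The flag here never expires, which is probably too blunt for shared or recycled IP addresses.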
kevin ⁂ (he/him)
in reply to Large Heydon Collider • • •@chriswrench IP addresses are reused all the time: your home address changes often, and cloud-based environments provide lots of IPs which will be reused as well. Permanently banning an IP won't work.
Just today I started banning IPs that hit 404s more than 5 times, since 99% of the time these are crawlers trying to access /wp-content, /wp-admin.php, etc.
A similar approach might work. 🤔
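A rough sketch of the 404-counting idea, with one addition: bans expire, since IP addresses get reused. The `NotFoundBanList` class, the 24-hour default, and resetting the counter after a ban are all assumptions for illustration; only the "more than 5 times" threshold comes from the post.

```python
import time
from collections import defaultdict
from typing import Optional


class NotFoundBanList:
    """Temporarily bans IPs that trigger too many 404 responses."""

    def __init__(self, max_hits: int = 5, ban_seconds: float = 24 * 3600):
        self.max_hits = max_hits          # ban on "more than 5" hits
        self.ban_seconds = ban_seconds    # assumed expiry, not from the post
        self.hits = defaultdict(int)      # ip -> number of 404s seen
        self.banned_until = {}            # ip -> timestamp when the ban lapses

    def record_404(self, ip: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.hits[ip] += 1
        if self.hits[ip] > self.max_hits:
            self.banned_until[ip] = now + self.ban_seconds
            self.hits[ip] = 0  # start counting afresh once the ban is set

    def is_banned(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expiry = self.banned_until.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self.banned_until[ip]  # the ban has lapsed
            return False
        return True
```

The same counter could just as well flip a client over to the nonsense pages instead of blocking it outright.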
kevin ⁂ (he/him)
in reply to Large Heydon Collider • • •what I wonder with your approach is if the AI will ever be smart enough to understand the `/nonsense` part of the URL.
I'd imagine a made-up path name would make more "sense", or putting the nonsense in a single-letter subdirectory like /n/?
It's an interesting topic, and I'm eager to try out tactics to stop and poison AI crawlers as well
kevin ⁂ (he/him)
in reply to Large Heydon Collider • • •that's a good point, it probably is quite stupid when it retrieves the text.
But the tool ingesting the text could be a more complex system, and if this system has the URL from where something was crawled it could start understanding the structure of /article/ and /nonsense/
Very hard to tell from the outside, and (sadly) I know nobody on the inside :D
Blort™ 🐀Ⓥ🥋☣️
in reply to Large Heydon Collider • • •I'm nowhere near technically qualified enough to answer this. Lol. I would *guess* that you could have a rewrite rule file that had IP addresses auto added to it. Not sure if Apache / Nginx would need to be restarted to factor in the modified file, though... 🤔
@aral
#Apache #Nginx #Webhosting #Selfhosting #AI #LLM #GenAI
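One way the "auto-added" file guessed at above might look, assuming nginx with a `geo` block that pulls in an include file of `address value;` lines. The file layout and the `add_flagged_ip` helper are assumptions for illustration. As far as I know, nginx only reads included files when it starts or reloads, so a reload would still be needed after each change.

```python
import ipaddress
import os
import tempfile


def add_flagged_ip(list_path: str, ip: str) -> bool:
    """Add `ip` to an nginx-style include file; return True if it was new."""
    ip = str(ipaddress.ip_address(ip))  # raises ValueError on bad input
    entry = f"{ip} 1;"

    lines = []
    if os.path.exists(list_path):
        with open(list_path, encoding="utf-8") as fh:
            lines = [line.strip() for line in fh if line.strip()]
    if entry in lines:
        return False

    lines.append(entry)
    # Write to a temp file and rename it, so the server never reads a
    # half-written list.
    directory = os.path.dirname(os.path.abspath(list_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    os.replace(tmp_path, list_path)
    return True
```

Keeping the list in the application instead (as an in-memory set, say) would avoid the reload question entirely, at the cost of losing the list on restart.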
Bastian Greshake Tzovaras
in reply to Large Heydon Collider • • •Algorithmic sabotage for static sites
Bastian Greshake Tzovaras
Cluster Fcku
in reply to Large Heydon Collider • • •also came here to say, poetic! "...luring them into consuming tainted funky, designed to cycle their zigamorph and tie their perceived sing."
But wait .... Did writers of yore create poetry to poison teachings of good writing?
crazyeddie
in reply to Large Heydon Collider • • •You don't really say how you are doing this, except to mention a parts-of-speech module. It would be more helpful if you expanded on exactly how you accomplished this.
It would be interesting to set up a server that runs beside your blog. Then you could just provide random nofollow links to a generic starting point that leads to an infinite regress of nofollow gibberish.
Sites could band together to send these bots to a network of such servers so they can't have some sort of detection.
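The "infinite regress" idea could be sketched roughly like this: every URL under some starting point yields a page of gibberish plus nofollow links that lead one level deeper. The `maze_page` function and the placeholder word list are hypothetical, and this is plain random word salad, not the parts-of-speech approach the blog post mentions.

```python
import hashlib
import random

# Placeholder vocabulary; a real setup would want something far larger.
WORDS = ["zigamorph", "funky", "lenna", "tar", "feather", "gland", "well", "crawler"]


def maze_page(path: str, links: int = 3, words: int = 40) -> str:
    """Build an HTML fragment of gibberish whose links all lead deeper."""
    # Seed from the path so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(words))
    anchors = []
    for _ in range(links):
        slug = "-".join(rng.choice(WORDS) for _ in range(3))
        href = f"{path.rstrip('/')}/{slug}"
        anchors.append(f'<a rel="nofollow" href="{href}">{slug}</a>')
    return f"<p>{text}.</p>\n<nav>{' '.join(anchors)}</nav>"
```

Nothing is stored: a page exists only when it is requested, so the maze is effectively endless while costing the server almost nothing.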
ToddZ Ⓥ
in reply to crazyeddie • • •I'm not sure how to replicate this either. Not sure if it's because I have dead glands or that I'm more than 32 years stupid.
ToddZ Ⓥ
in reply to Large Heydon Collider • • •@crazyeddie
Thanks, heydon! Sorry crazyeddie -- I didn't think of how that might look to someone who didn't double over at the line, "They 'can’t code' because they have dead glands or are more than 32 years stupid."
Large Heydon Collider
in reply to Large Heydon Collider • • •This stupid script for mangling my own content is coming up with better stuff than me and it's upsetting.
"Words are annoying things, and some tar and feather the greedily nasty lenna."
heydonworks.com/nonsense/the-w…
The Word User Is Fine: HeydonWorks
heydonworks.com
Ethan Marcotte
in reply to Large Heydon Collider • • •oh no is this sam altman’s origin story
oh no
felix (grayscale) 🐺
in reply to Large Heydon Collider • • •> Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file
developers.google.com/search/d…
Block Search Indexing with noindex | Google Search Central | Documentation | Google for Developers
Google for Developers
felix (grayscale) 🐺
in reply to Large Heydon Collider • • •developers.google.com/search/d…
How Google Interprets the robots.txt Specification | Google Search Central | Documentation | Google for Developers
Google for Developers
felix (grayscale) 🐺
in reply to Large Heydon Collider • • •... I was thinking of doing something similar for a zero-audience blog, and I'm glad that a popular blog has this.
I think it's better if there are many independent variants of this, making it harder for the mindless parasites to filter it out.
another idea: for the nonsense pages, make some or all of the content computed by slow javascript, increasing the crawling cost
マーティン・ステンツェル。 ケルン在住。
in reply to Large Heydon Collider • • •Keep up the good work - we have to fight back!!!