AI scraping

1DebiCates
Mar 8, 12:09 pm

Is there a way to stymie AI scraping of members' content? Like our reviews, our comments in groups? Or is everything we write hopelessly up for grabs as AI training data?

2keristars
Mar 8, 1:20 pm

The only way to prevent it is fully private accounts + groups, I believe. If it's visible publicly, the genAI scrapers are going to grab it.

Some parts of LT were made visible to logged-in accounts only, some years back, to try to mitigate bot traffic. I'm not sure if that's still the case, and I can't remember what was closed off.

3DebiCates
Mar 8, 1:37 pm

>2 keristars: Yes, I remember at some point while re-setting up my account, I elected to have my reviews only available to LT members, not the general web.

That is better than nothing, and I appreciate that option, but I'm sure some of the accounts that get set up here with 0 books and no activity are up to no good, including the thing I'm concerned about: scraping LT for AI purposes.

I'm just spitting in the wind, I'm sure. It's a plague with no cure.

4SandraArdnas
Mar 8, 1:40 pm

I doubt there's much we as individuals can do. At the site level, there are attempts to set up a robots.txt that serves as more legally worded, legally grounded directions to AI bots. The common robots.txt is habitually ignored by them, even if it explicitly says no scraping. While the legalese ones cannot prevent scraping either, they are a much firmer legal basis for suing if ignored. And they are promoted by some players big enough that, if they get enough traction, they just might actually have an effect. Cloudflare is among them, and LT has been one of their users for a while now, IIRC.
https://developers.cloudflare.com/bots/additional-configurations/managed-robots-...
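
For readers unfamiliar with the mechanism discussed above, here is a minimal sketch of how a plain robots.txt is meant to work, using Python's standard `urllib.robotparser`. The rules and the `example.com` URL are illustrative assumptions, not LibraryThing's or Cloudflare's actual configuration; `GPTBot`, `ClaudeBot`, and `Google-Extended` are the crawler tokens published by OpenAI, Anthropic, and Google. The check only describes what a compliant bot would do. As noted in the post, nothing forces a scraper to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse three AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/topic/1"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "Google-Extended", "Mozilla/5.0"):
        print(agent, "allowed" if allowed(agent, page) else "blocked")
```

The file is purely advisory: it expresses the site's wishes, and enforcement has to come from somewhere else, such as blocking at the network edge or legal action.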

5DebiCates
Mar 8, 1:45 pm

>2 keristars: Oh, and I meant to say thank you for suggesting the only option one can think of. Since for me the fundamental pleasure and purpose of LT is to socialize freely with others, taking that option would be a golden goose killer. :,(

Sigh.

6DebiCates
Edited: Mar 8, 1:49 pm

>4 SandraArdnas: Aha, some promising news at last. Thank you!

Speaking of legalese, I wonder if adding a copyright sign ©️ to one's "work" (reviews and comments) would at least be taking a stand, impotent and weird as that may sound.

Even if there was a snowball's chance it would do any good, it would be mightily annoying to see that pompous ©️ in everything I wrote, I'm sure.

7SandraArdnas
Mar 8, 1:52 pm

>3 DebiCates: I'm not sure that's how it works for reviews. My understanding is the options indicate your choice of how LT can use your reviews. I chose that they can use them for commercial purposes as well, expecting it means they can include them in their commercial products for libraries. I don't think there's a difference between your choice and mine as far as AI scrapers are concerned. They consider anything they can crawl fair game, and if it's visible, they can get to it.

8AnnieMod
Edited: Mar 8, 1:54 pm

>3 DebiCates: That setting for the reviews back then meant that LT cannot use your reviews in their library products or pass them to external partners, not that they would not be visible on the site (logged in or not).

Now we have a newer feature that lets you make them visible only to your connections, but before that, your reviews were always scrapable and visible unless your library was private.

9SandraArdnas
Mar 8, 1:58 pm

>6 DebiCates: They ignore entire portals filled with copyrighted material and scrape away until taken to court, at which point they come to a settlement to pay for used content. Using the © would just make you eligible in some class action lawsuit would be my guess :D

10DebiCates
Mar 8, 2:33 pm

>7 SandraArdnas: >8 AnnieMod: Ah, yes, you are right, I remember that now. I guess I just changed it to fit how I wished it worked.

But now there is that newer option...which I'm going to think about implementing. Darn, everything comes at a price in open social functionality. Not that my reviews are that great or anything, but what if everyone did that? It would be a lesser experience.

11keristars
Mar 8, 2:53 pm

>10 DebiCates: Yeah, it feels really crappy that we don't have any control over these scrapers, and the only way to fight back is to not participate at all.

I suppose I've just resigned myself to having all my public writing hoovered up. Plus all the other personal data that is so difficult to prevent corps from collecting.

(wow, that's kind of depressing. sorry. 🫣)

12AnnieMod
Edited: Mar 8, 3:19 pm

>11 keristars: Yeah. I accepted a long time ago that if I want something private, it just does not go anywhere online. Posting on a site makes it mostly fair game for scrapers and bots (and these days LLMs and AI in general).

>10 DebiCates: Keep in mind that the new setting is only there if your library is private. With a public library, reviews are always visible. If you don’t want them to be, put them in the private comments I guess. :)

13DebiCates
Mar 8, 5:59 pm

>9 SandraArdnas: >11 keristars: >12 AnnieMod: It's funny, I didn't get a solution, I realized I remembered things wrong, the newest tweak won't work for me, even the best solution I could do would only buy me a seat at some unlikely class action lawsuit, and all this was exchanged in a way that any ol' AI crappy scraper could get at it in order to imitate humanity more effectively. Yet, in spite of all that, I thank you all very much and enjoyed an intelligent commiseration with real people.

Right? I mean, surely....oh, of course we all are!

ha ha ha

14SandraArdnas
Mar 9, 5:29 am

>13 DebiCates: There are people deliberately poisoning the data for AI bots, so that's one way to go about saying no to indiscriminate scraping :D https://theconversation.com/how-poisoned-data-can-trick-ai-and-how-to-stop-it-25... focuses mostly on outright malicious attacks, but I understand there's also a significant number of people who simply feed it nonsense on reddit and such. IIRC, randomly inserting words in text, for instance, can seriously poison training data, even though a human reading it could parse nonsense from sense.

For my part, I never understood how hoovering up everything could possibly lead to good things, so I chalk it up to 'move fast and break things'. Curating what you train your model on takes time and money. That's simply a no-go if you're chasing profits rather than intelligence.

15bnielsen
Mar 9, 9:16 am

>14 SandraArdnas: That's also what I was thinking. I'm currently taking a look at "Marc View" from TinyCat and the number of errors in automatically imported stuff from various libraries is rather interesting :-)

16timspalding
Mar 9, 9:24 am

LibraryThing blocks all the reputable AI bots at Cloudflare, a service that sits between our servers and the web, stopping bots and other bad actors. So, in theory, we aren't being scraped by OpenAI and Claude. Google's a little trickier, because they seem to mix their general index with the AI index, but we have the AI indexing blocked.

17LeslieWx
Mar 9, 11:51 pm

18DebiCates
Mar 10, 12:33 am

>16 timspalding: I appreciate the information and the efforts on your end! Thanks Tim.