Tripwires for suspicious CK data

1brightcopy
Edited: Nov 23, 2010, 3:24 pm

A few recent threads (http://www.librarything.com/topic/102886, http://www.librarything.com/topic/95059, http://www.librarything.com/topic/102754) have made me wonder if there couldn't be some "tripwires" for CK that could help detect bad data being entered. Even if a tripwire didn't prevent that data from being entered, it'd be nice for it to write the edit to a helpers' log.

An example might be a Series field that ONLY contains numeric data, or even one that doesn't contain anything in parens (that last bit might have to be scaled back if there are a LOT of legit unordered series being entered). I'm mainly thinking of tripwires for Series, but I bet similar tripwires could apply to other CK fields.

A log for this would help in two ways. First, it would help the helpers go back and clean up bad data. Second, it would make it easier to catch someone who is changing a lot of CK data (like in some of the above threads) at an earlier date and send them a profile message to help them understand they aren't using CK correctly. This would probably make the total amount of cleanup a lot smaller.

ETA: This wouldn't be something that prevents a user from entering bad data or something that requires an "are you sure?" step from the user. It's solely aimed at making a helpers' log for suspicious edits.
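
To make the idea concrete, here is a rough sketch of the two Series checks described in this post, written in Python. Everything in it is an assumption for illustration: the function names, the tuple layout of the log, and the choice of Python itself are hypothetical and say nothing about how LT actually stores or processes CK edits. The point it shows is that the edit is always accepted and the tripwire only writes to a log.

```python
import re

def series_tripwires(value):
    """Return the names of the tripwires a Series value sets off."""
    text = value.strip()
    tripped = []
    # Tripwire 1: the field contains nothing but digits.
    if text.isdigit():
        tripped.append("numeric_only")
    # Tripwire 2: nothing in parens, e.g. "Discworld" rather than "Discworld (3)".
    if not re.search(r"\([^()]+\)", text):
        tripped.append("no_parens")
    return tripped

def record_edit(user, value, helpers_log):
    """Accept the edit unconditionally; only append to the (hypothetical) log."""
    for name in series_tripwires(value):
        helpers_log.append((user, value, name))
    return value

log = []
record_edit("someuser", "12", log)
record_edit("someuser", "Harry Potter (1)", log)
print(log)
```

Whether "no_parens" is worth keeping would depend on how many legitimate unordered series turn up in the log.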

2readafew
Nov 23, 2010, 12:42 pm

A big one for series would be someone creating a lot of unique series, i.e. basically not putting the series number in parens.

3brightcopy
Edited: Nov 23, 2010, 12:44 pm

Another one: putting [ and ] in a Series field (probably didn't understand about parens). Or even putting more than one set of parens in the field.
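
A possible sketch of these two checks in Python, again purely illustrative: the function name and the tripwire labels are made up, and the rule for what counts as a "set of parens" (a non-nested pair) is an assumption.

```python
import re

def bracket_tripwires(value):
    """Flag square brackets and more than one parenthesised group."""
    tripped = []
    if "[" in value or "]" in value:
        tripped.append("square_brackets")
    # Count non-nested "(...)" groups; more than one is suspicious.
    if len(re.findall(r"\([^()]*\)", value)) > 1:
        tripped.append("multiple_paren_groups")
    return tripped

print(bracket_tripwires("Harry Potter [1]"))
print(bracket_tripwires("Dune (Chronological) (1)"))
```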

4readafew
Nov 23, 2010, 12:45 pm

Oh, and a search that could return all series with only one book; the vast majority are going to be some kind of wrong.

5brightcopy
Nov 23, 2010, 12:48 pm

4> That'd be good, too, though probably a parallel helper feature rather than a tripwire. I can see it being a tripwire if LT could wait 24 hours after someone enters a single-book series to see if they enter more. But that would likely be a lot more trouble. Perhaps they could have some jobs set to run some queries for what's been done in the last 24 hours and spit out lines to a tripwire log. Again, possibly a lot more trouble than what I'm envisioning. The simpler the feature the more likely it will happen.
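
For what it's worth, the "job that looks at the last 24 hours" version might be sketched like this in Python. The data shapes are assumptions: `edits` as a list of (timestamp, user, series name) tuples and `series_sizes` as a mapping from series name to number of works are hypothetical stand-ins for whatever queries LT would really run.

```python
from datetime import datetime, timedelta

def recent_single_book_series(edits, series_sizes, now, window=timedelta(hours=24)):
    """Return (user, series) pairs for recent edits whose series still has one work."""
    cutoff = now - window
    return [
        (user, series)
        for when, user, series in edits
        if when >= cutoff and series_sizes.get(series, 0) == 1
    ]

now = datetime(2010, 11, 23, 12, 0)
edits = [
    (datetime(2010, 11, 23, 9, 0), "alice", "Harry Potter #1"),
    (datetime(2010, 11, 20, 9, 0), "bob", "Old Series"),
    (datetime(2010, 11, 23, 10, 0), "carol", "Discworld"),
]
series_sizes = {"Harry Potter #1": 1, "Old Series": 1, "Discworld": 39}
print(recent_single_book_series(edits, series_sizes, now))
```

Running it once a day, after the window has passed, is what gives a legitimately new series time to pick up a second book before it is logged.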

6kevmalone
Nov 23, 2010, 1:09 pm

I think we'd have run into problems with a "no parens = suspect" rule before publisher series (think classics), but now we could maybe trip an edit: "Did you mean to enter this as a publisher series?"
That being said, there doesn't seem to be a lot of scope for adding new edits on any fields other than series; maybe no future dates in date fields.

7brightcopy
Nov 23, 2010, 1:15 pm

6> Took me a couple of re-reads but I think I understand your last sentence. Basically, the thing about "trigger an event if they enter a single book series but wait until 24 hours" doesn't have a lot of parallels in the other fields. Correct me if I'm wrong.

Honestly, my own impression is that Series is the #1 CK field people flub. Any other opinions on that? Perhaps followed by Canonical Title/Canonical Name. I wonder if the new Original/Alternative Title fields will have their own screwups.

Perhaps a good tripwire for Canonical Name would be if it doesn't contain a comma.

8readafew
Nov 23, 2010, 1:20 pm

Date fields: people are supposed to use the 2010-11-23 format, but all sorts of stuff gets entered. I've fixed hundreds of them, but I haven't looked in quite a while. With CK search working I might try it again.
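
A small Python sketch of a date tripwire along these lines, combined with the "no future dates" idea from earlier in the thread. It assumes a full YYYY-MM-DD value is the only accepted form; if CK also allows partial dates such as a bare year, the pattern would need loosening. The function name and labels are hypothetical.

```python
import re
from datetime import date, datetime

def date_tripwires(value, today):
    """Flag values that are not real YYYY-MM-DD dates, or that lie in the future."""
    text = value.strip()
    # Check the shape first: strptime alone would also accept "2010-1-5".
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return ["bad_format"]
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return ["bad_format"]  # right shape, impossible date (e.g. Feb 30)
    return ["future_date"] if parsed > today else []

today = date(2010, 11, 23)
for sample in ["2010-11-23", "23/11/2010", "2010-02-30", "2011-01-01"]:
    print(sample, date_tripwires(sample, today))
```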

91dragones
Nov 23, 2010, 1:37 pm

4, 5 > One should not immediately decide that a single book series is some kind of wrong, but apparently that happens every day. You should realize that writing and publishing books takes time. The second book in a series is often published a year or more after the first. And it can take two or more years for the third volume in a series to appear after the first is published... and that's only with anticipated and usual delays. There can be unforeseen delays that stretch out the time even further between volumes appearing in the series.

Then there's books owned by only one member... a lot of those might be spam, but a lot of them aren't spam. They could be more obscure books or they could be new books that are not yet well known. For example, I'm the only one on LT listing a copy of Oracle's Legacy: Dawn of Illumination. It's not spam, but is newly published, just over a month ago. I'll eventually (hopefully about January) have a review online.

Oracle's Legacy: Dawn of Illumination is also the third book of the series, but probably few would have known, even though the second book was published about a year ago, and the first in the series published two years ago. There was no series designated for them until after I got my copies of all three volumes from the author. A series can - and IMO should - be created with the publication of volume one, in anticipation of the other volumes arriving in due time.

10kevmalone
Nov 23, 2010, 1:51 pm

>7 brightcopy: My last sentence says (or attempts to say) that none of the other CK fields seem to lend themselves to any new type of edit. Things that could be done to form consistent data (context-sensitive "Important Places", e.g.) have been done. The one exception seems to be dates: I suggested "no future dates", and @8 suggests format trapping.
Sorry for any confusion.

I agree that series data is getting most recent attention, justly so since the addition of the publisher series field.

11readafew
Nov 23, 2010, 2:02 pm

9> I didn't say all single-book series are wrong, but the vast majority of them exist because someone has entered the series incorrectly, most of the time by adding the order # to the end without parentheses.

example

Harry Potter #1
Harry Potter #2
Harry Potter #3

Instead of
Harry Potter (1)
Harry Potter (2)
Harry Potter (3)

I've got a couple series that currently have only one entry or did when I entered them.
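
The "Harry Potter #1" pattern is regular enough that a tripwire could even propose the corrected form to the helper reading the log. A minimal Python sketch, with a hypothetical function name and an assumed rule (a trailing number set off by whitespace or a comma, with an optional #):

```python
import re

# Name, then whitespace/comma, an optional "#", and a trailing number.
TRAILING_NUMBER = re.compile(r"^(?P<name>.+?)[\s,]+#?(?P<num>\d+)$")

def suggest_series_fix(value):
    """Return the parenthesised form if the value ends in a bare number, else None."""
    match = TRAILING_NUMBER.match(value.strip())
    if match is None:
        return None
    return f"{match.group('name')} ({match.group('num')})"

print(suggest_series_fix("Harry Potter #1"))
print(suggest_series_fix("Harry Potter (1)"))
```

A series whose real name ends in a number would also trip this, which is fine for a log a helper reviews but is one reason not to apply the suggestion automatically.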

12lorax
Nov 23, 2010, 2:08 pm

9>

I don't think anyone's suggesting that single-book series are all wrong, or should be disallowed, but if someone creates a large number of single-book series in a short time, it's an indicator that they may be using the field wrong, and someone should take a look at it. And nobody's suggesting any of these are spam!

(For example, someone using a comma rather than parentheses to delimit the book number will create a whole lot of single-book series. This probably isn't what they wanted to do, and I think it would be good to have a way to catch these other than relying on other members stumbling across them.)
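
The "large number in a short time" rule could be sketched as a simple per-member count in Python. The input shape (an iterable of (user, series name) pairs for series created in some recent window) and the threshold of 5 are assumptions picked for illustration, not numbers anyone in the thread proposed.

```python
from collections import Counter

def users_to_review(new_single_book_series, threshold=5):
    """Return members who created at least `threshold` single-book series."""
    counts = Counter(user for user, _series in new_single_book_series)
    return sorted(user for user, n in counts.items() if n >= threshold)

created = [("alice", f"Harry Potter, {n}") for n in range(1, 6)]
created.append(("bob", "Oracle's Legacy (1)"))
print(users_to_review(created))
```

One new single-book series from a member says very little; several in the same window is the signal worth a helper's look.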

13lilithcat
Nov 23, 2010, 2:16 pm

I don't see why not having parentheses in a series name should send up a warning. I have a lot of books in series for which I could not tell you the order if my life depended on it.

14EveleenM
Nov 23, 2010, 2:17 pm

#1
or even one that doesn't contain anything in parens (that last bit might have to be scaled back if there are a LOT of legit unordered series being entered).

Please, not that! I have a list of at least 50 travel guide series I've been working on, and I don't use series numbers on them - it seems more sensible to let them sort in alphabetical order by location. Having to confirm a "Yes, I really mean it!" step on every work I add to the series would drive me demented.

15brightcopy
Edited: Nov 23, 2010, 2:45 pm

13/14> Again, I don't want to imply that (most) of these things are wrong, just suspicious. In other words, something that might more frequently be an error than not. As always, helpers are there to use their judgment and not to blindly follow a process. I'd like to turn on a tripwire for the lacking-parentheses for a while and just see what percentage is valid and what is in error. Who knows, it may be that it turns out to be overwhelmingly valid data.

Also, I think you might be misreading the whole purpose of a tripwire. It wouldn't really be there to prevent users from entering bad data or to throw up a warning or anything. Rather, it's there to write to a log file that helpers can go through. So no, there wouldn't be any "Yes, I really mean it!" step.

16EveleenM
Nov 23, 2010, 3:07 pm

#15
Also, I think you might be misreading the whole purpose of a tripwire. It wouldn't really be there to prevent users from entering bad data or to throw up a warning or anything. Rather, it's there to write to a log file that helpers can go through. So no, there wouldn't be any "Yes, I really mean it!" step

Right, I did misunderstand. If it's just a matter of logging the data, then I think that's a very good idea. Fire away!

17brightcopy
Edited: Nov 23, 2010, 3:24 pm

Another possible tripwire: validating the html in fields that accept it, such as the Disambiguation Notice. This could be as simple or as complex as Tim has the time/desire for.
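
At the simple end, an HTML tripwire might only check that tags are balanced. A rough Python sketch using the standard library's `html.parser`; the class name, the function name, and the short list of void tags are assumptions, and which tags CK fields actually accept is not something this sketch knows.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # tags that never take a closing tag

class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")

def html_tripwires(text):
    """Return a list of tag-balance problems; an empty list means none were found."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    return checker.problems + [f"unclosed <{tag}>" for tag in checker.open_tags]

print(html_tripwires('See <a href="/work/1">this work</a>.'))
print(html_tripwires("<b>bold"))
```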

16> Yeah, the idea being that maybe we can both nip some of the repeat offenders in the bud before they make the same mistake on lots of data as well as catching obscure stuff that might be missed otherwise. ETA: I put a note in the first post just to clarify.

181dragones
Nov 23, 2010, 3:36 pm

11.> Yes, okay I see what you mean. I even ran across an example of it the other day and it didn't register. Someone started The Arrington Series without putting the order number at the end of the first volume... which might have been okay while that was the only book in print for that series... but volume 2 was recently published and had not been added to the series yet... so I correctly entered volume 2 into the series and added (1) to the end of the first volume; these books now show on the series page in the correct order. Someone might get confused by the order of the books otherwise - especially if (as I believe) there are more to come... In the case of this particular series, I think it's probably best to read them in chronological order.

19stephmo
Nov 23, 2010, 7:34 pm

Characters and places all entered on one line instead of using the + to get more entry points is an error I see regularly.

Abbreviating states and leaving off countries is another...

20brightcopy
Nov 23, 2010, 7:46 pm

19> Building off that idea - anything longer than a certain number of characters in a field/single line of a field. That number will probably vary per field, but (to pick something random) over 100 characters for a place name line is probably an error.

(And yes, some would be exempt like First words/last words, etc. But you get the point.)
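
A per-field length check could be as small as the Python sketch below. The 100-character limit for places is the arbitrary figure from this post; the field key `"places"`, the function name, and the choice to exempt any field without a configured limit are assumptions.

```python
# Per-field limit on a single line; fields not listed here are exempt.
MAX_LINE_LENGTH = {"places": 100}

def long_line_tripwires(field, value):
    """Return the lines of a field value that exceed that field's limit."""
    limit = MAX_LINE_LENGTH.get(field)
    if limit is None:
        return []
    return [line for line in value.splitlines() if len(line) > limit]

crammed = ", ".join(["London, England, UK"] * 6)
print(len(crammed), long_line_tripwires("places", crammed) == [crammed])
print(long_line_tripwires("first_words", crammed))
```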

21keristars
Nov 23, 2010, 8:12 pm

For places, multiple sets of parentheses could be a trigger for shunting it to the log. Especially since multiple sets of parentheses on a single line get screwed up by the parser. Hate that "feature".

22Avron
May 29, 2011, 12:05 am

Late to the party.

But looking at http://www.librarything.com/series could allow others to catch some of the series entry mistakes. I have a look most days and often find an entry or two that needs cleaning up. I've also found there's a number of series that I can add many books to via search. Presumably because someone owns a book and doesn't add the info to any others.

Series, date, and characters are the three I most often see misused in some way.
And while I don't know how you would check, the series info being entered in the wrong language is also fairly common.

23brightcopy
May 29, 2011, 12:37 am

I still think this idea is a good one and perhaps a unique way of actually improving the overall CK data quality here.