New posters: LOOK HERE FIRST!

TalkBug Collectors

Join LibraryThing to post.

New posters: LOOK HERE FIRST!

This topic is currently marked as "dormant"—the last message is more than 90 days old. You can revive it by posting a reply.

1lorax
Dec 4, 2007, 1:56 pm

There are a number of known issues that keep being reported; either people don't do a search or don't use the terms that will bring up the previous reports. Two perennial issues (not things related to recent changes, which will hopefully be fixed soon) are the following:

1.

Symptom: "I just changed XYZ, or combined two books, or separated two books, and it's not updating!" Alternatively, "I just added a book and I can't edit the information!"

Issue: Caching. LT is a heavily-used site and the database doesn't update instantaneously for all changes. Come back in half an hour -- if the change still hasn't propogated, then go ahead and post with all the details.

2.

Symptom: "These two totally unrelated books are combined and I can't separate them!"

Issue: There are two: ISBN re-use (which publishers aren't supposed to do, but which happens anyway) and "hash collision", where the two different titles are converted into the same thing for internal storage. You can't do anything globally; in the first case, deleting the ISBN for your own copy can separate it from the herd (both the correct and incorrect matches, which will remain together) while in the latter case making a small modification to the title (changing punctuation) will sometimes separate out your own copy (but again, can't fix the global problem).

2SatsumaHouse
Dec 4, 2007, 3:30 pm

Re #2: So how does this--"...'hash collision', where the two different titles are converted into the same thing for internal storage"--happen?

3lorax
Edited: Dec 4, 2007, 5:42 pm

The two titles get converted -- hashed -- into some internally-used format, which is shorter than the titles and easier to deal with. I don't know the details of the algorithm they use.

So to make stuff up, say that Title A: A Book gets turned into, for instance, x134a and Title B gets turned into z231w, and the hashed versions are used internally. Then along comes Title C, and it gets turned into x134a as well; the system says "oho, here's another copy of good old x134a" and puts them together. The Wikipedia article on hash collisions may be helpful.

In this case, if the owner of Title A were to do something like change it to "Title A : A Book" (note the additional space) it would perhaps fix the problem. The solutions are more difficult than the statement of the problem, unfortunately; this is mostly intended as a "this one's a known issue" post to cut down on the number of reports for bug collectors to sort through.

4conceptDawg
Dec 4, 2007, 7:38 pm

That is a good description. I probably would have gotten a little too verbose with my lingo had I tried that. To offer a little more insight, we use crc32 and md5 variously on different hashes in the system. I can't remember off the top of my head which one we are using for book titles.

These algorithms are good and you have to have A LOT of items to have collisions. But, well....we have a lot of data, so collisions happen.

Hashing REALLY helps to speed many functions on the site.

5nperrin
Dec 4, 2007, 8:30 pm

1: Your second problem can also happen for a third reason. If two books have a title and author entered identically with each other, but are different books, they cannot be separated no matter what. This happens, for example, with City of Glass by Paul Auster. There is a graphic novel of the same name which is often listed under Auster rather than under the artists who drew/wrote it (it is definitely a different work), and if I call my copy of the novel "City of Glass" by "Paul Auster" and someone calls his copy of the graphic novel "City of Glass" by "Paul Auster" (with IDENTICAL spelling), it is impossible to separate them. This is the reason separation by ISBN is often requested.

6Heather19
Dec 4, 2007, 10:45 pm

lorax: Thanks for this thread! I admit I get a little annoyed at so many repeat threads, but I was new once too, so I can understand the confusion. Taking the time to search a little makes a world of difference, though.

7lorax
Dec 5, 2007, 7:36 pm

@5:

Good point. I haven't seen a lot of confusion around this case, perhaps because it's obvious why they're combined -- there's frustration that they can't be separated, but not head-scratching.

8lorax
Feb 23, 2008, 8:21 pm

Just making this visible since we're having a lot of questions on the issue lately.

9PaulFoley
Edited: Feb 23, 2008, 9:18 pm

we use crc32 and md5 variously on different hashes in the system. I can't remember off the top of my head which one we are using for book titles

Obviously CRC-32 given the range of book ids and the number of reported collisions (I'd be very surprised if you had a single collision using MD5)

These algorithms are good and you have to have A LOT of items to have collisions.

It's not very hard to collide a 32-bit hash...especially when the function you're using is not really designed for that purpose.

10timspalding
Feb 23, 2008, 9:18 pm

Unfortunately, we're using CRC32, not MD5. We need to convert. It's going to be a long weekend's worth of work, I think.

11jjwilson61
Feb 23, 2008, 9:20 pm

Hashes are very useful, but any elementary data structures text will have a section on how to handle hash collisions. Just pretending that they are very rare and won't happen isn't a satisfactory answer for anything but a toy application.

12timspalding
Feb 23, 2008, 9:22 pm

Ignorant question, but I suppose I get to ask them.

Can we turn an MD5 into a integer and store it in a BIGINT? That's the easy fix, since those primary keys went to BIGINT some time ago.

I find MD5s annoying. Using a 32-character string is annoying and wasteful and I find the prospect of a key in binary very annoying.

13timspalding
Feb 23, 2008, 9:23 pm

Just pretending that they are very rare and won't happen isn't a satisfactory answer for anything but a toy application.

Blow me.

14Heather19
Feb 24, 2008, 12:07 am

1: "Issue: There are two: ISBN re-use (which publishers aren't supposed to do, but which happens anyway) and "hash collision" "

ISBN reuse seems to happen semi-often for me, and it's really frustrating that they can't be uncombined. However, what about title re-use? In at least one case, the very same author uses the same title for two different books. I know books with the same title can be uncombined if the authors are different (right?), but this... Is it even possible? The books in question are completely different, different storyline/cover/date/etc, but have the same title and author. High Risk by Carolyne Keene.

Tim, you seem to be trying to figure out some way to help the hash collision problem. But what about others? Like my posts in this thread:
http://www.librarything.com/talktopic.php?topic=30373
Sometimes it seems to be wrong user data, which may be corrected if that user can be convinced to change their data. But what about if we (the user) can't find the problem?

I'm trying to get my LT catalogue sorted out, and to have so many instances of wrongly combined works, with no way to seperate them, is very frustrating.

15grizzly.anderson
Feb 24, 2008, 12:38 am

Tim - re #12 - I don't think a bigint will work for you - IIRC in 32 character hash, you have essentially 32 Bytes (2^256) of possible combinations. In a bigint you have roughly 8 Bytes of possible combinations (2^63 because you lose a bit for the sign). In other words it just won't fit

re #13 - lol - reminds me *a lot* of things I put in the comments in my code. Take a deep breath. Good luck with the long weekend, and thanks for all the hard work on a generally amazing site.

16PaulFoley
Feb 24, 2008, 1:06 am

Can we turn an MD5 into a integer and store it in a BIGINT?
Course you can. (Though you might like to use a hex string with some hyphens in it, like common UUIDs)

17timspalding
Feb 24, 2008, 1:25 am

ISBN collision. ISBN collision is never an issue for combination—LibraryThing doesn't consider a work as being a collection of ISBNs, but of title-plus-author strings. ISBNs are one basis for suggesting combinations, but do not force them.

ISBNs can collide when it comes to covers. They are ISBN based, the majority still coming from Amazon, which uses ISBNs rather than our work system—the bastards.

It's possible your issues are hash collisions. Except in rare situations involving manually-added books or non-Latin titles, that's the only way that a book can be combined with something truly odd. I'd have to look at them for a while to figure it out. Sorry that I can't look at them now.

18timspalding
Feb 24, 2008, 1:31 am

>16 PaulFoley:

Thinking it through myself.

Fundamentally, an MD5 is not a 32-byte thing, but a 16-byte thing, right?

BIGINT is an 8-byte thing. So, using a BIGINT to store an MD5 would have to lose half of it, right?

Still, that would be a lot better than a CRC32, which is a 4-byte thing. It would square the number of possibilities, or—ideally—reduce the number of collisions by 4,294,967,295 times, right?

See, this stuff is alien to me. I'm sure there's some easy way to chop an MD5 into a BIGINT-sized integer. Anyone know it?

19PaulFoley
Feb 24, 2008, 1:56 am

Fundamentally, an MD5 is not a 32-byte thing, but a 16-byte thing, right?
A 128-bit thing, yes.
BIGINT is an 8-byte thing
Is it? I don't know what language you're talking about; I assumed Perl (spit)...I don't use Perl (spit), but the documentation says it's arbitrary size, so I just assumed it was.

Oh, duh! I just realized you mean in the database! Yes, you could chop it in half, it's better than CRC-32 -- or use a 16-character char field (storing 32 chars - hex text - is a waste of space; storing it in packed format doesn't mean you have to think of it as an annoying "a key in binary" - either treat it as a number when you read it in, if you like decimal numbers for book ids, or as a UUID)

20timspalding
Edited: Feb 24, 2008, 2:07 am

Right, MySQL. (See http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html)

I like integers. They're easy to work with. Also, about ten tables depend on this and use BIGINT for the value. I'd sooner change to a random number between 1 and 4—and aren't there really only four essential plots?—than try to change off that standard.

Now to find someone who can write a PHP function to turn an MD5 into a 8-byte integer. Not my idea of a fun evening! I'm sure it's one of those things someone who does this stuff can write as fast as their pen can travel.

Once I have that it would be pretty easy, just:

1. Remember every book's work number.
2. Change the CRC32 routine to use little-MD5s when calculating the edition number.
3. Redo the values in the editions-to-work database.

Hey presto. Anyone want to help me with my brain damage over the math? It'll solve about 2^8 miscombined books!

21E59F
Feb 24, 2008, 2:46 am

I don't do PHP, but it should be simple. You just want to turn, in effect, a 16-byte hash into an 8-byte hash, right? Since the MD5 should be pretty uniform, why not just hash the two halves of it together by doing an XOR, or something like that? If the MD5 pseudo-randomness is good enough, you wouldn't be clustering too much that way, would you?

It wouldn't hurt to have a fail-safe, though. Maybe a checksum of the ISBN when creating a new book hash - used as a collision check, that should solve two problems, both hash collision and different works with identical title and author, at the price of requiring an edition that nobody has entered before to be combined manually.

22PaulFoley
Feb 24, 2008, 2:51 am

Depending on how you store bignums in PHP, I guess either (using GMP) gmp_init(substr(md5(####),16),16) or base2dec(substr(md5(####),16), 16), where #### is your title or whatever, and base2dec is defined here

23PaulFoley
Feb 24, 2008, 3:14 am

Since the MD5 should be pretty uniform, why not just hash the two halves of it together by doing an XOR, or something like that?
Since it should be pretty uniform, there's no gain from doing that.

24GreyHead
Feb 24, 2008, 5:20 am

This seems to get pretty close:
base_convert(substr(md5($text), 16), 16, 10)

Converts "the quick brown fox jumped over the lazy dog" to "13767452364649108884"

Though the max integer from an md5 of 'ffff ffff ffff ffff' is 431 more than the unsigned bigint value of 18,446,744,073,709,551,615 given by the mysql manual (could just be rounding oddities in my code).

25timspalding
Feb 24, 2008, 9:49 am

Easy, just base_convert(substr(md5($text), 16), 16, 10) - 431 :)

26timspalding
Feb 24, 2008, 9:50 am

I'm going to run a test on it.

27Margalioth
Edited: Feb 24, 2008, 1:36 pm

Lorax, I think you had a great idea setting up this thread (I don't post much in bug collectors, site talk, etc, but I read them regularly and see the same Qs over and over and over...).
However, at this point I can't imagine what a new poster will now think if they do read this thread first!!
:-)

ETA the smiley so you can tell I was laughing and not frowning as I wrote this!

28AndrewB
Feb 24, 2008, 1:23 pm

...*starts the "Look HERE, not THERE" thread*

29GreyHead
Edited: Feb 25, 2008, 4:27 am

Big grin - I open a whole stack of threads in new tabs each at the 'first unread' location. When I actually get here I've no idea what the Group or Thread Title is . . . (though I see that the tab tab says "New pos...").

> 25 : Tim : subtracting 431 just gives you some negative integers around the low end of the md5 hashes.

30PaulFoley
Feb 25, 2008, 4:54 am

29> I do the same, but the window title shows the complete thread title and group for the current tab.

re: the "431" thing: http://www.php.net/base_convert says base_convert uses floats internally, so it can't accurately support numbers longer than 53 bits (sort of). Some of the comments give versions that work on bignums.

31jjwilson61
Feb 26, 2008, 1:50 pm

13> Sorry about the toy application remark. LT is obviously not a toy and I realize that you and the other developers are working very hard on it. Not fixing serious bugs because they are rare is a pet peeve of mine but you need to allocate your scarce resources as you see fit.

32HelloAnnie
Feb 26, 2008, 2:05 pm

Did Tim really say "blow me"? That was really surprising to me.

33timspalding
Edited: Feb 26, 2008, 4:52 pm

Ha. The web doesn't convey tone. It was a long day and when someone calls two years of my work a "toy application" I can muster nothing more profound than dismissive irritation.

34HelloAnnie
Feb 26, 2008, 4:49 pm

I actually found it funny, but then again it wasn't directed at me. Anytime you see "blow me" on the web somewhere, it will arise giggles out of me, but then again, I'm immature. It just surprised me because you seem serious and studious! And yes, it's hard to tell sarcasm and other forms of tone on the Internet.

35AnnaClaire
Edited: Feb 26, 2008, 7:17 pm

Did Tim really say "blow me"? That was really surprising to me. (#32)

Me too. Of course, there's no reason to think it applied to me as this is the first time I've piped up in this thread. But still, coming from Tim...

Edited for HTML.

36timspalding
Feb 26, 2008, 9:09 pm

I do apologize if anyone was intended. Come have a beer with us at the Portland office and you'll probably hear worse... :)

37Heather19
Feb 26, 2008, 9:13 pm

lol I had a feeling it was meant as dismissive more then anything else, I think too highly of Tim for it to be anything but. And, 36, someday I'm going to have to meet some of you at one of those get-togethers.... but I'll make sure it's after I turn 21. lol

38AnnaClaire
Feb 26, 2008, 10:08 pm

I do apologize if anyone was intended. (#36)

Intended? Intended for what? ;)

39timspalding
Feb 26, 2008, 11:28 pm

Sigh. I can't even spell. It's fortunate I didn't say "bloke me." That's probably some English slang that requires satisfaction by duelling.

19? Wow. You're almost half my age. That's sad. If I had known young people were about I would have curbed my tongue! :)

40PaulFoley
Feb 27, 2008, 12:50 am

It has rather different meaning in American vs. British English...http://www.effingpot.com/slang.shtml