This topic is currently marked as "dormant"—the last message is more than 90 days old. You can revive it by posting a reply.
1lorax
There are a number of known issues that keep being reported; either people don't do a search or don't use the terms that will bring up the previous reports. Two perennial issues (not things related to recent changes, which will hopefully be fixed soon) are the following:
1.
Symptom: "I just changed XYZ, or combined two books, or separated two books, and it's not updating!" Alternatively, "I just added a book and I can't edit the information!"
Issue: Caching. LT is a heavily-used site and the database doesn't update instantaneously for all changes. Come back in half an hour -- if the change still hasn't propogated, then go ahead and post with all the details.
2.
Symptom: "These two totally unrelated books are combined and I can't separate them!"
Issue: There are two: ISBN re-use (which publishers aren't supposed to do, but which happens anyway) and "hash collision", where the two different titles are converted into the same thing for internal storage. You can't do anything globally; in the first case, deleting the ISBN for your own copy can separate it from the herd (both the correct and incorrect matches, which will remain together) while in the latter case making a small modification to the title (changing punctuation) will sometimes separate out your own copy (but again, can't fix the global problem).
1.
Symptom: "I just changed XYZ, or combined two books, or separated two books, and it's not updating!" Alternatively, "I just added a book and I can't edit the information!"
Issue: Caching. LT is a heavily-used site and the database doesn't update instantaneously for all changes. Come back in half an hour -- if the change still hasn't propogated, then go ahead and post with all the details.
2.
Symptom: "These two totally unrelated books are combined and I can't separate them!"
Issue: There are two: ISBN re-use (which publishers aren't supposed to do, but which happens anyway) and "hash collision", where the two different titles are converted into the same thing for internal storage. You can't do anything globally; in the first case, deleting the ISBN for your own copy can separate it from the herd (both the correct and incorrect matches, which will remain together) while in the latter case making a small modification to the title (changing punctuation) will sometimes separate out your own copy (but again, can't fix the global problem).
2SatsumaHouse
Re #2: So how does this--"...'hash collision', where the two different titles are converted into the same thing for internal storage"--happen?
3lorax
The two titles get converted -- hashed -- into some internally-used format, which is shorter than the titles and easier to deal with. I don't know the details of the algorithm they use.
So to make stuff up, say that Title A: A Book gets turned into, for instance, x134a and Title B gets turned into z231w, and the hashed versions are used internally. Then along comes Title C, and it gets turned into x134a as well; the system says "oho, here's another copy of good old x134a" and puts them together. The Wikipedia article on hash collisions may be helpful.
In this case, if the owner of Title A were to do something like change it to "Title A : A Book" (note the additional space) it would perhaps fix the problem. The solutions are more difficult than the statement of the problem, unfortunately; this is mostly intended as a "this one's a known issue" post to cut down on the number of reports for bug collectors to sort through.
So to make stuff up, say that Title A: A Book gets turned into, for instance, x134a and Title B gets turned into z231w, and the hashed versions are used internally. Then along comes Title C, and it gets turned into x134a as well; the system says "oho, here's another copy of good old x134a" and puts them together. The Wikipedia article on hash collisions may be helpful.
In this case, if the owner of Title A were to do something like change it to "Title A : A Book" (note the additional space) it would perhaps fix the problem. The solutions are more difficult than the statement of the problem, unfortunately; this is mostly intended as a "this one's a known issue" post to cut down on the number of reports for bug collectors to sort through.
4conceptDawg
That is a good description. I probably would have gotten a little too verbose with my lingo had I tried that. To offer a little more insight, we use crc32 and md5 variously on different hashes in the system. I can't remember off the top of my head which one we are using for book titles.
These algorithms are good and you have to have A LOT of items to have collisions. But, well....we have a lot of data, so collisions happen.
Hashing REALLY helps to speed many functions on the site.
These algorithms are good and you have to have A LOT of items to have collisions. But, well....we have a lot of data, so collisions happen.
Hashing REALLY helps to speed many functions on the site.
5nperrin
1: Your second problem can also happen for a third reason. If two books have a title and author entered identically with each other, but are different books, they cannot be separated no matter what. This happens, for example, with City of Glass by Paul Auster. There is a graphic novel of the same name which is often listed under Auster rather than under the artists who drew/wrote it (it is definitely a different work), and if I call my copy of the novel "City of Glass" by "Paul Auster" and someone calls his copy of the graphic novel "City of Glass" by "Paul Auster" (with IDENTICAL spelling), it is impossible to separate them. This is the reason separation by ISBN is often requested.
6Heather19
lorax: Thanks for this thread! I admit I get a little annoyed at so many repeat threads, but I was new once too, so I can understand the confusion. Taking the time to search a little makes a world of difference, though.
7lorax
@5:
Good point. I haven't seen a lot of confusion around this case, perhaps because it's obvious why they're combined -- there's frustration that they can't be separated, but not head-scratching.
Good point. I haven't seen a lot of confusion around this case, perhaps because it's obvious why they're combined -- there's frustration that they can't be separated, but not head-scratching.
9PaulFoley
we use crc32 and md5 variously on different hashes in the system. I can't remember off the top of my head which one we are using for book titles
Obviously CRC-32 given the range of book ids and the number of reported collisions (I'd be very surprised if you had a single collision using MD5)
These algorithms are good and you have to have A LOT of items to have collisions.
It's not very hard to collide a 32-bit hash...especially when the function you're using is not really designed for that purpose.
Obviously CRC-32 given the range of book ids and the number of reported collisions (I'd be very surprised if you had a single collision using MD5)
These algorithms are good and you have to have A LOT of items to have collisions.
It's not very hard to collide a 32-bit hash...especially when the function you're using is not really designed for that purpose.
10timspalding
Unfortunately, we're using CRC32, not MD5. We need to convert. It's going to be a long weekend's worth of work, I think.
11jjwilson61
Hashes are very useful, but any elementary data structures text will have a section on how to handle hash collisions. Just pretending that they are very rare and won't happen isn't a satisfactory answer for anything but a toy application.
12timspalding
Ignorant question, but I suppose I get to ask them.
Can we turn an MD5 into a integer and store it in a BIGINT? That's the easy fix, since those primary keys went to BIGINT some time ago.
I find MD5s annoying. Using a 32-character string is annoying and wasteful and I find the prospect of a key in binary very annoying.
Can we turn an MD5 into a integer and store it in a BIGINT? That's the easy fix, since those primary keys went to BIGINT some time ago.
I find MD5s annoying. Using a 32-character string is annoying and wasteful and I find the prospect of a key in binary very annoying.
13timspalding
Just pretending that they are very rare and won't happen isn't a satisfactory answer for anything but a toy application.
Blow me.
Blow me.
14Heather19
1: "Issue: There are two: ISBN re-use (which publishers aren't supposed to do, but which happens anyway) and "hash collision" "
ISBN reuse seems to happen semi-often for me, and it's really frustrating that they can't be uncombined. However, what about title re-use? In at least one case, the very same author uses the same title for two different books. I know books with the same title can be uncombined if the authors are different (right?), but this... Is it even possible? The books in question are completely different, different storyline/cover/date/etc, but have the same title and author. High Risk by Carolyne Keene.
Tim, you seem to be trying to figure out some way to help the hash collision problem. But what about others? Like my posts in this thread:
http://www.librarything.com/talktopic.php?topic=30373
Sometimes it seems to be wrong user data, which may be corrected if that user can be convinced to change their data. But what about if we (the user) can't find the problem?
I'm trying to get my LT catalogue sorted out, and to have so many instances of wrongly combined works, with no way to seperate them, is very frustrating.
ISBN reuse seems to happen semi-often for me, and it's really frustrating that they can't be uncombined. However, what about title re-use? In at least one case, the very same author uses the same title for two different books. I know books with the same title can be uncombined if the authors are different (right?), but this... Is it even possible? The books in question are completely different, different storyline/cover/date/etc, but have the same title and author. High Risk by Carolyne Keene.
Tim, you seem to be trying to figure out some way to help the hash collision problem. But what about others? Like my posts in this thread:
http://www.librarything.com/talktopic.php?topic=30373
Sometimes it seems to be wrong user data, which may be corrected if that user can be convinced to change their data. But what about if we (the user) can't find the problem?
I'm trying to get my LT catalogue sorted out, and to have so many instances of wrongly combined works, with no way to seperate them, is very frustrating.
15grizzly.anderson
Tim - re #12 - I don't think a bigint will work for you - IIRC in 32 character hash, you have essentially 32 Bytes (2^256) of possible combinations. In a bigint you have roughly 8 Bytes of possible combinations (2^63 because you lose a bit for the sign). In other words it just won't fit
re #13 - lol - reminds me *a lot* of things I put in the comments in my code. Take a deep breath. Good luck with the long weekend, and thanks for all the hard work on a generally amazing site.
re #13 - lol - reminds me *a lot* of things I put in the comments in my code. Take a deep breath. Good luck with the long weekend, and thanks for all the hard work on a generally amazing site.
16PaulFoley
Can we turn an MD5 into a integer and store it in a BIGINT?
Course you can. (Though you might like to use a hex string with some hyphens in it, like common UUIDs)
Course you can. (Though you might like to use a hex string with some hyphens in it, like common UUIDs)
17timspalding
ISBN collision. ISBN collision is never an issue for combination—LibraryThing doesn't consider a work as being a collection of ISBNs, but of title-plus-author strings. ISBNs are one basis for suggesting combinations, but do not force them.
ISBNs can collide when it comes to covers. They are ISBN based, the majority still coming from Amazon, which uses ISBNs rather than our work system—the bastards.
It's possible your issues are hash collisions. Except in rare situations involving manually-added books or non-Latin titles, that's the only way that a book can be combined with something truly odd. I'd have to look at them for a while to figure it out. Sorry that I can't look at them now.
ISBNs can collide when it comes to covers. They are ISBN based, the majority still coming from Amazon, which uses ISBNs rather than our work system—the bastards.
It's possible your issues are hash collisions. Except in rare situations involving manually-added books or non-Latin titles, that's the only way that a book can be combined with something truly odd. I'd have to look at them for a while to figure it out. Sorry that I can't look at them now.
18timspalding
>16 PaulFoley:
Thinking it through myself.
Fundamentally, an MD5 is not a 32-byte thing, but a 16-byte thing, right?
BIGINT is an 8-byte thing. So, using a BIGINT to store an MD5 would have to lose half of it, right?
Still, that would be a lot better than a CRC32, which is a 4-byte thing. It would square the number of possibilities, or—ideally—reduce the number of collisions by 4,294,967,295 times, right?
See, this stuff is alien to me. I'm sure there's some easy way to chop an MD5 into a BIGINT-sized integer. Anyone know it?
Thinking it through myself.
Fundamentally, an MD5 is not a 32-byte thing, but a 16-byte thing, right?
BIGINT is an 8-byte thing. So, using a BIGINT to store an MD5 would have to lose half of it, right?
Still, that would be a lot better than a CRC32, which is a 4-byte thing. It would square the number of possibilities, or—ideally—reduce the number of collisions by 4,294,967,295 times, right?
See, this stuff is alien to me. I'm sure there's some easy way to chop an MD5 into a BIGINT-sized integer. Anyone know it?
19PaulFoley
Fundamentally, an MD5 is not a 32-byte thing, but a 16-byte thing, right?
A 128-bit thing, yes.
BIGINT is an 8-byte thing
Is it? I don't know what language you're talking about; I assumed Perl (spit)...I don't use Perl (spit), but the documentation says it's arbitrary size, so I just assumed it was.
Oh, duh! I just realized you mean in the database! Yes, you could chop it in half, it's better than CRC-32 -- or use a 16-character char field (storing 32 chars - hex text - is a waste of space; storing it in packed format doesn't mean you have to think of it as an annoying "a key in binary" - either treat it as a number when you read it in, if you like decimal numbers for book ids, or as a UUID)
A 128-bit thing, yes.
BIGINT is an 8-byte thing
Is it? I don't know what language you're talking about; I assumed Perl (spit)...I don't use Perl (spit), but the documentation says it's arbitrary size, so I just assumed it was.
Oh, duh! I just realized you mean in the database! Yes, you could chop it in half, it's better than CRC-32 -- or use a 16-character char field (storing 32 chars - hex text - is a waste of space; storing it in packed format doesn't mean you have to think of it as an annoying "a key in binary" - either treat it as a number when you read it in, if you like decimal numbers for book ids, or as a UUID)
20timspalding
Right, MySQL. (See http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html)
I like integers. They're easy to work with. Also, about ten tables depend on this and use BIGINT for the value. I'd sooner change to a random number between 1 and 4—and aren't there really only four essential plots?—than try to change off that standard.
Now to find someone who can write a PHP function to turn an MD5 into a 8-byte integer. Not my idea of a fun evening! I'm sure it's one of those things someone who does this stuff can write as fast as their pen can travel.
Once I have that it would be pretty easy, just:
1. Remember every book's work number.
2. Change the CRC32 routine to use little-MD5s when calculating the edition number.
3. Redo the values in the editions-to-work database.
Hey presto. Anyone want to help me with my brain damage over the math? It'll solve about 2^8 miscombined books!
I like integers. They're easy to work with. Also, about ten tables depend on this and use BIGINT for the value. I'd sooner change to a random number between 1 and 4—and aren't there really only four essential plots?—than try to change off that standard.
Now to find someone who can write a PHP function to turn an MD5 into a 8-byte integer. Not my idea of a fun evening! I'm sure it's one of those things someone who does this stuff can write as fast as their pen can travel.
Once I have that it would be pretty easy, just:
1. Remember every book's work number.
2. Change the CRC32 routine to use little-MD5s when calculating the edition number.
3. Redo the values in the editions-to-work database.
Hey presto. Anyone want to help me with my brain damage over the math? It'll solve about 2^8 miscombined books!
21E59F
I don't do PHP, but it should be simple. You just want to turn, in effect, a 16-byte hash into an 8-byte hash, right? Since the MD5 should be pretty uniform, why not just hash the two halves of it together by doing an XOR, or something like that? If the MD5 pseudo-randomness is good enough, you wouldn't be clustering too much that way, would you?
It wouldn't hurt to have a fail-safe, though. Maybe a checksum of the ISBN when creating a new book hash - used as a collision check, that should solve two problems, both hash collision and different works with identical title and author, at the price of requiring an edition that nobody has entered before to be combined manually.
It wouldn't hurt to have a fail-safe, though. Maybe a checksum of the ISBN when creating a new book hash - used as a collision check, that should solve two problems, both hash collision and different works with identical title and author, at the price of requiring an edition that nobody has entered before to be combined manually.
22PaulFoley
Depending on how you store bignums in PHP, I guess either (using GMP) gmp_init(substr(md5(####),16),16) or base2dec(substr(md5(####),16), 16), where #### is your title or whatever, and base2dec is defined here
23PaulFoley
Since the MD5 should be pretty uniform, why not just hash the two halves of it together by doing an XOR, or something like that?
Since it should be pretty uniform, there's no gain from doing that.
Since it should be pretty uniform, there's no gain from doing that.
24GreyHead
This seems to get pretty close:
base_convert(substr(md5($text), 16), 16, 10)
Converts "the quick brown fox jumped over the lazy dog" to "13767452364649108884"
Though the max integer from an md5 of 'ffff ffff ffff ffff' is 431 more than the unsigned bigint value of 18,446,744,073,709,551,615 given by the mysql manual (could just be rounding oddities in my code).
base_convert(substr(md5($text), 16), 16, 10)
Converts "the quick brown fox jumped over the lazy dog" to "13767452364649108884"
Though the max integer from an md5 of 'ffff ffff ffff ffff' is 431 more than the unsigned bigint value of 18,446,744,073,709,551,615 given by the mysql manual (could just be rounding oddities in my code).
25timspalding
Easy, just base_convert(substr(md5($text), 16), 16, 10) - 431 :)
26timspalding
I'm going to run a test on it.
27Margalioth
Lorax, I think you had a great idea setting up this thread (I don't post much in bug collectors, site talk, etc, but I read them regularly and see the same Qs over and over and over...).
However, at this point I can't imagine what a new poster will now think if they do read this thread first!!
:-)
ETA the smiley so you can tell I was laughing and not frowning as I wrote this!
However, at this point I can't imagine what a new poster will now think if they do read this thread first!!
:-)
ETA the smiley so you can tell I was laughing and not frowning as I wrote this!
29GreyHead
Big grin - I open a whole stack of threads in new tabs each at the 'first unread' location. When I actually get here I've no idea what the Group or Thread Title is . . . (though I see that the tab tab says "New pos...").
> 25 : Tim : subtracting 431 just gives you some negative integers around the low end of the md5 hashes.
> 25 : Tim : subtracting 431 just gives you some negative integers around the low end of the md5 hashes.
30PaulFoley
29> I do the same, but the window title shows the complete thread title and group for the current tab.
re: the "431" thing: http://www.php.net/base_convert says base_convert uses floats internally, so it can't accurately support numbers longer than 53 bits (sort of). Some of the comments give versions that work on bignums.
re: the "431" thing: http://www.php.net/base_convert says base_convert uses floats internally, so it can't accurately support numbers longer than 53 bits (sort of). Some of the comments give versions that work on bignums.
31jjwilson61
13> Sorry about the toy application remark. LT is obviously not a toy and I realize that you and the other developers are working very hard on it. Not fixing serious bugs because they are rare is a pet peeve of mine but you need to allocate your scarce resources as you see fit.
32HelloAnnie
Did Tim really say "blow me"? That was really surprising to me.
33timspalding
Ha. The web doesn't convey tone. It was a long day and when someone calls two years of my work a "toy application" I can muster nothing more profound than dismissive irritation.
34HelloAnnie
I actually found it funny, but then again it wasn't directed at me. Anytime you see "blow me" on the web somewhere, it will arise giggles out of me, but then again, I'm immature. It just surprised me because you seem serious and studious! And yes, it's hard to tell sarcasm and other forms of tone on the Internet.
35AnnaClaire
Did Tim really say "blow me"? That was really surprising to me. (#32)
Me too. Of course, there's no reason to think it applied to me as this is the first time I've piped up in this thread. But still, coming from Tim...
Edited for HTML.
36timspalding
I do apologize if anyone was intended. Come have a beer with us at the Portland office and you'll probably hear worse... :)
37Heather19
lol I had a feeling it was meant as dismissive more then anything else, I think too highly of Tim for it to be anything but. And, 36, someday I'm going to have to meet some of you at one of those get-togethers.... but I'll make sure it's after I turn 21. lol
38AnnaClaire
I do apologize if anyone was intended. (#36)
Intended? Intended for what? ;)
39timspalding
Sigh. I can't even spell. It's fortunate I didn't say "bloke me." That's probably some English slang that requires satisfaction by duelling.
19? Wow. You're almost half my age. That's sad. If I had known young people were about I would have curbed my tongue! :)
19? Wow. You're almost half my age. That's sad. If I had known young people were about I would have curbed my tongue! :)

