ISBN's for old books?

1 Stbalbach
Jun 6, 2009, 2:28 pm

Old books don't have ISBN numbers which means there is no way to scan them in using a bar code scanner, or search or index via ISBN.

I'm wondering if anyone here or elsewhere has started a project to assign old books fake ISBNs: numbers that won't conflict with real ISBNs but still look like them, much as the IP ranges 192.168.0.0/16 and 127.0.0.0/8 are reserved and never assigned as public addresses. That would make it possible to print bar code stickers for old books so they can be entered using a bar code scanner (such as a :CueCat). The database would be public and rely on end users entering books. Once a book is entered and a fake ISBN assigned, that number would be universal: unique to the book, maintained by a central database/website, and usable by any other site (like LT) just like a real ISBN. Thoughts?
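
A rough sketch of the barcode side of this idea, in Python. It assumes the fake numbers are 13-digit EAN-style codes with the standard EAN-13 check digit, and that they start with a prefix that real ISBNs never use (real ISBN-13s begin with 978 or 979). The "200" prefix and the function names are assumptions for illustration only; as far as I know GS1 sets the 200 to 299 range aside for in-house use, but nothing in this thread specifies a prefix.

```python
def ean13_check_digit(first12: str) -> int:
    """Standard EAN-13 check digit: weights 1,3,1,3,... over the first 12 digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def private_book_code(serial: int, prefix: str = "200") -> str:
    """Build a 13-digit code from an assumed non-ISBN prefix plus a serial number.

    The prefix is a hypothetical choice, not something defined in the thread.
    """
    body = prefix + str(serial).zfill(12 - len(prefix))
    return body + str(ean13_check_digit(body))


if __name__ == "__main__":
    print(private_book_code(1))
```

Because the result has the same length and check digit as an ISBN-13, an ordinary scanner should read it, while the prefix keeps it out of the real ISBN space.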

2 timspalding
Edited: Jun 6, 2009, 3:57 pm

There are a bunch of universal identifiers—or attempted universal identifiers. LCCN, OCLC, OpenLibrary IDs. The problem is that somebody needs to identify a book as belonging to a particular number. You need to search a database. That means work, and it produces duplicates. It also requires a central authority. Nobody ever barcodes those numbers onto books.*

I've long contemplated a system for assigning non-arbitrary ids, that wouldn't require a registering authority, particularly for items, like Zines, that have never had real identifiers. A method that, to my knowledge, has never been tried would go like this:

1. Go to page 123.
2. Find the first five non-boilerplate words.
3. Go to page 156.
4. Find the last five non-boilerplate words.
5. Put the two together in lower case without punctuation and hash them (using md5 or sha1; crc32 is too small). Add a checksum to make it barcodable.

So, for example, Pamuk's Istanbul: Memories and the City has "to stop looking like a kosk remaining long enough to." The SHA1 is:

e9e87fcdf3fcf47f8c0b9b12bf8b86350a212a0c
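
A minimal Python sketch of the hashing step, under some assumptions: the two five-word runs are joined with single spaces, only ASCII punctuation is stripped, and the text is encoded as UTF-8 before hashing. The post does not say exactly what string was fed to SHA1, so this may well not reproduce the hash quoted above. The function name `page_words_id` is made up for the example.

```python
import hashlib
import string


def page_words_id(first_words: str, last_words: str) -> str:
    """Lower-case, strip ASCII punctuation, collapse whitespace, then SHA-1."""
    text = f"{first_words} {last_words}".lower()
    # Assumption: only ASCII punctuation is removed; other symbols pass through.
    text = text.translate(str.maketrans("", "", string.punctuation))
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Splitting the ten quoted words five and five is an assumption.
    print(page_words_id("to stop looking like a", "kosk remaining long enough to"))
```

Doing the normalization in code rather than by hand means two people who type the same words with different capitalization or stray punctuation still get the same ID.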

Obviously you'd need edge-case rules. What if the book has no words on page 123, or no page 123? (You reduce by ten and then one until you can do it.) What if there are no words at all? Etc. In this case, I had to decide that diacriticals would not be entered: the word kosk uses a ö and the Turkish s with a cedilla (a she or ş, which almost nobody knows how to type). You would need a very long list of very special rules but, I think, you'd only need to look at them once every ten-thousand books, if that.
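
The "no diacriticals" rule is one a program could apply automatically, so nobody has to decide how to type ş. A small Python sketch using the standard `unicodedata` module follows; the function name is an assumption, and this is only one way to do it.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("köşk"))
```

One limitation: letters with no Unicode decomposition, such as the Turkish dotless ı, pass through unchanged, so they would need an explicit rule of their own.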

And you might want to pick something that allows you to distinguish between a hardcover and an exact softcover reprint, although I'm not sure there's any way to do that well.

Anyway, you see what I'm trying to do—a way of coming up with IDs for non-ISBN books that is non-arbitrary and does not involve searching a database of entries.

Anyone want to help on this?

*If you sell them, you give it your own product id. If you are a library you give it an item number. Somewhere else in your system you may or may not tie it to a universal id.

3 Stbalbach
Jun 6, 2009, 8:09 pm

A very interesting idea, Tim. It makes a lot of sense: the ID is built into the object. It democratizes identification and frees it from gatekeepers. If the standard is handled like an RFC, open source and open to contributions from anyone, all the better.

Some other variables to consider. The words on pp. 123 and 156 may be the same but:

* Publisher. Same book but published in New York or London.
* Year. Same book but re-printed in later years.
* Media type: hardcover, softcover, journal etc. (as you say)

These could be added as additional fields of entry. Not sure if that messes up the hash length though with too much data.

Stephen

4 timspalding
Jun 6, 2009, 9:02 pm

>3 Stbalbach:

Right. The question is what you're aiming to equate and what you're not. Do you consider the same book, reprinted, to be the same or different?

The problem I have with doing year, publisher, media-type and so forth is that they start to look like "cataloging." Figuring out what words start page 123 is easy. Figuring out where the publisher "really" is is not—witness all the publishers who list a bajillion places, like Oxford or O'Reilly. Ditto years. It takes experience to distinguish between edition year and printing year, and reprints often don't change either.

5 andyl
Jun 7, 2009, 5:42 am

#4

Sure, but you are making a trade-off if you ignore publisher: if a US publisher republished a UK original (or vice versa) and reused the typesetting (so that the pagination was exactly equivalent), there would be no way of telling whether you had the US or UK edition. That doesn't matter if you are looking at it as a work, but might for other purposes. If we start to see more modern facsimile editions of older works, that applies doubly.

I guess it boils down to book as content vs book as cultural artefact/collector's item.

6 Nicole_VanK
Jun 7, 2009, 5:53 am

Any modern facsimile editions should have their own ISBN, so they should remain identifiable anyway (unless you would intend this system to replace identification by ISBN entirely).

7 Stbalbach
Edited: Jun 7, 2009, 10:56 am

4>>

I agree. LibraryThing has some of the same issues.

If the idea is, as #5 says, to be a work identifier and not an object ID, then there would be different IDs for the same work, since not every edition of a work has the same page numbers, or the same ID for different objects in the case of facsimile reprints. It gets messy if there is no clear distinction between work and object; I think the system would have to track one or the other. If it's an object ID system, it has to track things like publisher and year of publication. But that could be handled similarly to LT's "Add new book" feature, which ties into library catalogs in a multiple-choice fashion so end users don't have to enter it manually.

8 timspalding
Jun 7, 2009, 10:54 pm

There's an infinitude of different levels to track and ways to track it. To a library, every item has a different barcode. To a rare book dealer the most popular book ever printed stands out if signed by a famous person. An illiterate sees all books as the same.*

What I'm proposing would be a system you could derive from an item-in-hand, without searching databases and making judgement calls on bad data. The question is, what can you track that way?

*I think I've said this before, but I recall a passage in Orwell's Burmese Days, where an itinerant and illiterate book pedlar only distinguishes two types of books--books and bibles, having learned to pick the latter out by their peculiar typographic conventions. The two types had two prices.

9 jjwilson61
Jun 8, 2009, 1:18 am

What about books that don't have 156 pages? How about picking a page something like exactly 2/5ths of the way through and 3/5ths for the second page? Special rules would probably be needed for children's picture books that may only have a few words per page or none.
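
For concreteness, the 2/5ths and 3/5ths idea works out as below. Rounding down and never going below page 1 are assumptions; the post doesn't say how to round.

```python
from typing import Tuple


def fractional_pages(total_pages: int) -> Tuple[int, int]:
    """Pages 2/5 and 3/5 of the way through, rounded down (minimum page 1)."""
    if total_pages < 1:
        raise ValueError("a book needs at least one page")
    first = max(1, (2 * total_pages) // 5)
    second = max(1, (3 * total_pages) // 5)
    return first, second


if __name__ == "__main__":
    print(fractional_pages(300))  # a 300-page book gives pages 120 and 180
```

A 32-page picture book would give pages 12 and 19, which shows why a fixed page like 156 can't cover short books.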

10 timspalding
Jun 8, 2009, 1:20 am

You'd definitely need rules for that sort of thing. I'd rather pick a page, so you don't need to get out your calculator to do it. Above I said pick a page and, if it's not there, try 25 less, until it works. Obviously, you do edge cases for everything.
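
The "try a lower page until it works" rule is simple to pin down in code. This Python sketch leaves the step size as a parameter, since the thread mentions both stepping down by 25 and by ten; `has_text` is a hypothetical callback standing in for "a person looks at that page and finds usable words".

```python
from typing import Callable, Optional


def find_usable_page(
    start: int,
    has_text: Callable[[int], bool],
    step: int = 25,
) -> Optional[int]:
    """Try `start`, then start - step, start - 2*step, ... down to page 1."""
    if step < 1:
        raise ValueError("step must be positive")
    page = start
    while page >= 1:
        if has_text(page):
            return page
        page -= step
    return None  # no usable page found; a further edge-case rule is needed


if __name__ == "__main__":
    usable = {73}
    print(find_usable_page(123, lambda p: p in usable))  # tries 123, 98, then 73
```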

11 zire
Jun 8, 2009, 1:31 am

This message has been flagged by multiple users and is no longer displayed
WHERE WOULD IT HURT????????????????????????

12 thorold
Jun 9, 2009, 2:22 am

Tim,

I think something like this would be great, but I see a few more obvious problems in your approach:
- As formulated, it doesn't work at all for non-Roman alphabets
- There have to be a lot of little rules to cover edge cases, some already discussed above, but also things like non-standard punctuation symbols, whitespace, etc., which mean that it isn't trivial for the unskilled user who only uses the system once in a blue moon to get it right
- It reacts badly to errors: someone who makes a small typing mistake, or interprets the rules wrongly, gets a result that is wrong but can't be picked up by a simple plausibility check
- It doesn't work at all if you don't have a physical (or facsimile) copy in front of you, so you won't be invited to Legacy Library birthday parties any more

I think what we need is a system that produces a unique identifier that can easily be reverse-engineered: something like first X characters of title plus first Y characters of author name plus year plus simple format code. And get the computer to do the character counting and convert everything to (uppercase) Unicode in a simple consistent way, stripping spaces and punctuation, irrespective of what the user types.
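
A sketch of what such a reverse-engineerable key could look like in Python. The values of X and Y (8 and 4 here), the one-letter format code, and the hyphen separators are all assumptions for illustration; the book details in the usage line are illustrative too and not checked against any catalog record.

```python
def _clean(text: str) -> str:
    """Uppercase and keep only letters and digits (drops spaces and punctuation)."""
    return "".join(ch for ch in text.upper() if ch.isalnum())


def catalog_key(title: str, author: str, year: int, fmt: str,
                x: int = 8, y: int = 4) -> str:
    """First x characters of title + first y of author + year + format letter."""
    return f"{_clean(title)[:x]}-{_clean(author)[:y]}-{year}-{_clean(fmt)[:1]}"


if __name__ == "__main__":
    print(catalog_key("Istanbul: Memories and the City", "Pamuk, Orhan",
                      2005, "hardcover"))
```

Unlike a hash, a key built this way can be read back by eye, which is the plausibility check the hashing approach lacks.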

13 timspalding
Jun 9, 2009, 3:34 am

- As formulated, it doesn't work at all for non-Roman alphabets

That's easy to solve, though, I think. Just use unicode.

- There have to be a lot of little rules to cover edge cases, some already discussed above, but also things like non-standard punctuation symbols, whitespace, etc., which mean that it isn't trivial for the unskilled user who only uses the system once in a blue moon to get it right

Well, a lot of that can be handled by the thing which hashes it. Yes, it will be somewhat tricky to deal with a book that has type at various angles and colors on an unnumbered page. But 99.5% of books will be a matter of flipping to a page and typing in a few words.

- It reacts badly to errors: someone who makes a small typing mistake, or interprets the rules wrongly, gets a result that is wrong but can't be picked up by a simple plausibility check

That's true. On the other hand, it won't end up mis-matching to anything. It will just be all alone by itself. When the second person puts the book in, it'll be there.

- It doesn't work at all if you don't have a physical (or facsimile) copy in front of you, so you won't be invited to Legacy Library birthday parties any more

Okay, but the standard way of identifying a book—cataloging it—requires item in hand. We can only do LLs because someone *else* has cataloged the book. Most of the time anyway.

I think what we need is a system that produces a unique identifier that can easily be reverse-engineered: something like first X characters of title plus first Y characters of author name plus year plus simple format code. And get the computer to do the character counting and convert everything to (uppercase) Unicode in a simple consistent way, stripping spaces and punctuation, irrespective of what the user types.

Right, but this is just cataloging, and prone to all the problems there. What are the first five words on page 250 is easy. Titles and authors are hard. Then you've got year: what year? You take classes for that, and of the books we care about here (old books, since there are ISBNs for newer ones) many don't have years. Look at a big chunk of older MARC records and you'll see a lot of ?s in the date field. And as for format, that's even harder. Are you thinking just hard and soft? Hard, sort-of-hard, sort-of-soft? Quarto?

14 thorold
Jun 9, 2009, 6:52 am

OK, I take your points: I think I misread the purpose of the scheme a little in my earlier note.

I spent a few minutes testing it in practice with the books on my office shelf, and I think it could be made to work, but it might be a bit harder than you suggest. The straightforward cases were nearer 40% than 99.5%. Probably close to a worst-case scenario, with dictionaries, law texts, and a few maths and science books, but it's not safe to assume that most works are novels, even if most copies of books on LT are.

Issues I spotted from my sondage:
- words hyphenated across two pages
- multiple columns of text
- text at strange angles (yes, I actually found one: a French textbook with a selection of "press cuttings" pasted at various angles and in no clear order on p.123)
- abbreviations (is "v/t" one word or two?), symbols with subscripts and superscripts
- references (how many words is something like "vol.23 (1940), pp.200-222"?)
- words in phonetic alphabet
- equations on first or last line
- footnotes
- tables
- illustration including text at top or bottom of page
- unnumbered pages (this was a facsimile reprint of an old dictionary)
-

15 thorold
Jun 9, 2009, 6:56 am

...sorry - previous message got truncated somehow. I meant to add:

These are things that could all be solved fairly easily, given clear rules, but we will probably end up with quite a lot of rules.

16 Stbalbach
Jun 9, 2009, 7:44 pm

Ideally, the system can generate ID's using the OCR texts created by Google Books and Internet Archive. That way the database could have millions of IDs without manual determination, and would work equally well if manually entered.

An example: the children's picture book `Goody Two Shoes`
http://www.archive.org/details/goodytwoshoes00newyiala

And the plain text version
http://www.archive.org/stream/goodytwoshoes00newyiala/goodytwoshoes00newyiala_dj...

It's a mess because the OCR is not reliable, and on the first page the book's library stamp is inserted into the text.

17 Stbalbach
Edited: Jun 9, 2009, 7:54 pm

Another idea is to simply use the first (or last) character on the first (or last) line of a page, no matter what the character is, for 5 subsequent pages, starting at page X. That would eliminate a lot of the special case problems, just need some rules for blank or unnumbered pages and pictures and headers and footers.
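
In Python, the first-character variant might look like the sketch below. It assumes you already have the first line of each page keyed by page number (from OCR or by hand), and it uses "_" as a filler for blank or missing pages; that filler is an invented rule, not one from this thread.

```python
from typing import Mapping


def first_char_key(first_lines: Mapping[int, str], start_page: int,
                   count: int = 5, filler: str = "_") -> str:
    """Take the first non-space character of the first line on each of
    `count` consecutive pages, starting at `start_page`."""
    chars = []
    for page in range(start_page, start_page + count):
        line = first_lines.get(page, "").strip()
        chars.append(line[0] if line else filler)
    return "".join(chars)


if __name__ == "__main__":
    pages = {123: "to stop", 124: "  And then", 126: '"Quoted', 127: "7 days"}
    print(first_char_key(pages, 123))
```

Five characters is far too short to be unique on its own, so in practice the key would need more pages or would have to be combined with something else.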

OCR problems like with Goody Two Shoes above would make it not so Goody for automating ID generation.

18 BryanLesiuk
Mar 18, 2010, 1:51 pm

I believe this is an important and interesting topic. I have a few contributions:

1) Multiple schemes are okay. Giving a book multiple numbers is okay. As pointed out above, we already have this: ISBN represents the book-as-work ID, whereas a library-assigned ID represents the book-as-object ID. No single scheme satisfies all needs, therefore supporting multiple schemes is valuable and inevitable. Having a group of standard schemes is superior to everyone using their own (which is what happens right now). I will further point out that one scheme *must* be a globally-unique book-as-object ID, although not every book will have one; there is a reason RFID tag IDs are globally unique.

2) A listing service for book IDs can support multiple schemes and allow finding the book using any applicable scheme. Search engines are good. Database "joins" are good. For example, if the "turn to page 123..." scheme is applicable, you can use it to locate that book in a database.

3) Edge case complexity is a very serious problem. A good first-approximation to deal with complexity is to have different schemes for different classes of books. For example, a scheme based on the Roman alphabet is inapplicable to an Arabic text.

4) I'm not sure about this final recommendation, but I think it's important: a book ID registry (i.e., a user-maintained wiki) might have the ability to record "corrections". For example, if someone assigns an incorrect ID to a book in the database, someone else might come along and insert a "this book actually belongs over here" entry, without deleting the original entry. My intuition says the non-deleting of the original entry is important, though my information theory prowess is insufficient to explain why.
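
To make points 2) and 4) concrete, here is a rough Python sketch of a registry that accepts IDs from any scheme and records corrections as new entries instead of deleting old ones. Every class, field, and scheme name is invented for the example; it is an in-memory toy, not a design for a real service.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    kind: str                       # "assign" or "correct"
    scheme: str                     # e.g. "page-words", "lccn" (hypothetical names)
    ident: str
    book: str
    moved_to: Optional[str] = None  # only used by corrections


@dataclass
class Registry:
    log: List[Record] = field(default_factory=list)  # append-only history

    def assign(self, scheme: str, ident: str, book: str) -> None:
        self.log.append(Record("assign", scheme, ident, book))

    def correct(self, scheme: str, wrong: str, right: str, book: str) -> None:
        """Record 'this book actually belongs over here' without deleting anything."""
        self.log.append(Record("correct", scheme, wrong, book, moved_to=right))

    def lookup(self, scheme: str, ident: str) -> List[str]:
        """Replay the log to get the books currently filed under this ID."""
        books: List[str] = []
        for rec in self.log:
            if rec.scheme != scheme:
                continue
            if rec.kind == "assign" and rec.ident == ident:
                if rec.book not in books:
                    books.append(rec.book)
            elif rec.kind == "correct":
                if rec.ident == ident and rec.book in books:
                    books.remove(rec.book)
                if rec.moved_to == ident and rec.book not in books:
                    books.append(rec.book)
        return books
```

One practical reason to keep the original entry: anyone who already printed or recorded the wrong ID can still look it up and see where the book went.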