I wrote this up on the plane from San Francisco. (I was there on a secret, unbloggable mission!*) It’s a bit involved and it doesn’t “arrive” anywhere, but, if you’re interested in subjects and relevancy ranking, it might be worth thinking about.
There are a couple of differences between user tagging (“free tagging,” “social tagging,” etc.) and traditional library classification. “Who does it?” is the most obvious difference, followed by whether the labeling takes place within a predefined ontology or is made up on the fly.
It’s easy to ignore a third, and very critical, difference. Subject classifications, like the Library of Congress Subject Headings (LCSH), are essentially binary: non-overlapping buckets. Something either does or does not belong in a subject. There are no gradations of belonging.
The idea is, as Clay Shirky and David Weinberger have reminded us, rooted in the physical world. Subject classification escapes the physicality of shelf-order classification, in which a book must be shelved in a single place, but is still restrained by the physicality of the catalog card. A catalog card can only reference a certain number of subjects. Nobody wants a book to take up twenty cards. And the subject cards can only reference so many books. About 90% of all literature could fall under the LCSH subject Man-woman relationships. But it would make no sense to slot this 90% under that heading in a physical card catalog; the catalog would instantly grow by 90%! And there seem to be very real differences in relevancy and “what-the-heck”-ness between real-life members of the “Man-woman relationships” LCSH: High Fidelity, Great Expectations, The Fountainhead, I Kissed Dating Goodbye, and The Official Hottie Hunting Guide.
If you’re very selective, you can keep the numbers down. But, apart from the rule that the first subject is generally the primary one, there’s no good way to relevancy-rank the books belonging to a subject.
Tags can do it, because tens, hundreds or thousands of users applying tags creates a “statistics of meaning.” So, 1984 is tagged dystopia 549 times, torture six times and Great Britain twice. The numbers can be turned into rankings, so 1984 shows up high on a list of books about “dystopia,” lower under “torture” and near the end of a list of books about Great Britain.
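The arithmetic is trivial, but it’s worth seeing how little machinery it takes. A minimal sketch, with invented counts standing in for real user tag data (only the 1984 numbers come from the example above):

```python
# Invented tag counts standing in for real user tagging data.
tag_counts = {
    "1984": {"dystopia": 549, "torture": 6, "great britain": 2},
    "Brave New World": {"dystopia": 401, "satire": 120},
    "The Remains of the Day": {"great britain": 75, "butlers": 40},
}

def rank_by_tag(tag):
    """Books that carry `tag`, most-tagged first."""
    scored = [(book, tags[tag]) for book, tags in tag_counts.items() if tag in tags]
    return sorted(scored, key=lambda pair: -pair[1])

print(rank_by_tag("dystopia"))       # 1984 tops the list
print(rank_by_tag("great britain"))  # 1984 lands near the bottom
```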
This is all well-worn territory. My question is this: Is there any way to relevancy-rank books within subjects?
I was reminded of the question when checking out OCLC’s new project, FictionFinder. I’ll blog about the whole thing later, but for now know that you can search for an LCSH subject and get back a list of books belonging to it. (I can’t link to the results, which are session-based.**) Check out the LCSH “City and Town Life” and the top book is The Red Badge of Courage. Lacking a better method, FictionFinder lets popularity (the number of OCLC libraries with a copy) stand in for relevance. LibraryThing does the same, using our popularity numbers instead. The results are not systematically better (in this case Ulysses wins).
I tried two solutions:
The first was to tie into LibraryThing’s tags. So, figure out what tags are most characteristic of books with the subject “Man-Woman Relationships,” and then use the presence and number of these tags to rank the subject results. So, for example, “Man-Woman Relationships” has a global correlation with “relationships,” “dating” and “romance,” none of which are very prominent among the tags applied to Great Expectations, so it can fall low on the list.
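Here is a rough sketch of that first approach, with invented data. One simplification: I approximate the “globally correlated” tags by pooling the tags of the subject’s own members, which is only a stand-in for computing the correlation across the whole site:

```python
from collections import Counter

# Invented tag counts. Pooling members' tags stands in for a real
# global tag/subject correlation computation.
subject_books = ["High Fidelity", "Great Expectations", "I Kissed Dating Goodbye"]
book_tags = {
    "High Fidelity": Counter({"music": 90, "relationships": 40, "dating": 25}),
    "Great Expectations": Counter({"classic": 50, "victorian": 30}),
    "I Kissed Dating Goodbye": Counter({"dating": 60, "relationships": 30}),
}

# 1. Find the tags most characteristic of the subject's members.
pooled = Counter()
for book in subject_books:
    pooled.update(book_tags[book])
characteristic = {tag for tag, _ in pooled.most_common(3)}

# 2. Score each book by how heavily it carries those characteristic tags.
def score(book):
    return sum(n for tag, n in book_tags[book].items() if tag in characteristic)

ranked = sorted(subject_books, key=score, reverse=True)
print(ranked)  # Great Expectations falls to the bottom
```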
I got far enough down this road to know it was going to help.
The second and more interesting algorithm was to see if books can be ranked within subjects without any other information. This would help OCLC, who are unlikely to pay for LibraryThing data, and any library that employs LCSH, most of which have no “popularity” data to use either.
I hit upon the idea that subjects “reinforce” each other, and that this must leave a statistical signature. For example, it seems that “Love stories” and “Psychological fiction” are commonly applied to books about “Man-Woman Relationships,” but that “Androgynous robot alone on an island — Stories” is not. (Okay, that’s not real, but the point stands.) Can these “related subjects” relevancy rank the subject itself?
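That statistical signature can be sketched from subject data alone. A toy version, with invented subject lists: score each book within a subject by how often its other subjects co-occur with the target subject across the whole collection.

```python
from collections import Counter
from itertools import combinations

# Invented records: each book's LCSH-style subject list.
book_subjects = {
    "Book A": ["Man-woman relationships", "Love stories"],
    "Book B": ["Man-woman relationships", "Love stories", "Psychological fiction"],
    "Book C": ["Man-woman relationships", "Whaling"],
    "Book D": ["Love stories", "Psychological fiction"],
}

# How often each pair of subjects appears on the same record.
cooc = Counter()
for subjects in book_subjects.values():
    for pair in combinations(sorted(subjects), 2):
        cooc[pair] += 1

def rank_within(subject):
    """Rank a subject's members by how strongly their *other* subjects
    co-occur with it across the whole collection."""
    members = [b for b, s in book_subjects.items() if subject in s]
    def score(book):
        return sum(cooc[tuple(sorted((subject, s)))]
                   for s in book_subjects[book] if s != subject)
    return sorted(members, key=score, reverse=True)

print(rank_within("Man-woman relationships"))
```

On this toy data the book whose companion subjects (“Love stories,” “Psychological fiction”) reinforce the target subject rises, and the “Whaling” outlier sinks, which is the behavior the idea predicts.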
I wish they could, but I can’t get it to work well enough. It works for some topics, but falls down for others, laughably.
Some ideas I’ve considered:
- Treating subjects as links, and running some sort of “page-rank” style connection algorithm against them. Maybe this would bring out coincidences that simple statistics misses.
- Using other library data, such as LCC and Dewey. This would be reminiscent of how I made LibraryThing’s LCSH/LCC/Dewey recommendations.
- Doing statistics on other fields, such as the title. So, for example, there’s probably a statistical correlation between “Man-woman relationships” and books with “dating,” “men and women” and “proposal” in the title.
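For the first of these ideas, a bare-bones sketch of what a “page-rank” over subject co-occurrence links might look like. The graph is pure toy data, and how subject-level scores would then feed back into ranking the books themselves is exactly the open question:

```python
# Invented subject co-occurrence graph: adj[i][j] = number of records
# where subject i and subject j appear together.
subjects = ["Man-woman relationships", "Love stories",
            "Psychological fiction", "Whaling"]
adj = [[0, 5, 3, 1],
       [5, 0, 4, 0],
       [3, 4, 0, 0],
       [1, 0, 0, 0]]

def pagerank(adj, d=0.85, iters=50):
    """Power iteration over a column-normalized link matrix, with damping."""
    n = len(adj)
    col = [sum(adj[i][j] for i in range(n)) or 1 for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - d) / n +
                d * sum(adj[i][j] * rank[j] / col[j] for j in range(n))
                for i in range(n)]
    return rank

scores = dict(zip(subjects, pagerank(adj)))
print(scores)  # the weakly connected "Whaling" scores lowest
```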
None strike me as the silver bullet.
Anyway, my plane has landed (allowing me to do real work again), so I end in aporia. Ideas?
*I’m itching to blog it, but I have to hold off for now. I’ll throw some pictures up soon, however. I’d never been to San Francisco before. What a wonderful wonderful town.
**One can understand why OPACs made in 1996 are session based. How frustrating to see a new product with them.