Major author recalculation!


1timspalding
Edited: Nov 26, 2016, 4:05 pm

Very short version

Something changed that most members won't notice, but members who do a lot of data improvements should like.

Short version

I recently revisited LibraryThing's system for calculating and recalculating the primary author of works. This resulted in significant improvements to the authors picked. A HUGE number of works were affected. But almost all were works with only a handful of copies.

If you run across works with a bad primary author, let me know, and I'll look at it.

Longer version

As you may know, LibraryThing's basic notion is that data "bubbles up" from the book level--the level of members' data--to the work level. Because book-level data disagrees, there's a complicated system for picking the best authors (and titles) for the work overall.

For various reasons, the system was not appropriately combing through past works, recalculating the authors as new books were added to them. Also, when recalculation was triggered manually (by a member clicking the link to recalculate, by a combination, or by a split), the system wasn't perfect. Even worse, the two processes--combing-through and the individual process--were slightly different.(1)

Anyway, I spent two work days on this, and the end result is a new system, used everywhere and any time the author needs recalculating.

The raw changes are impressive:

Statistics

Total works changed 732,708

163,820 went from no primary author to a primary author
8,507 went from a primary author to no primary author
The rest changed primary author.

Of these, 35% changed author variant, but not true author. 360,000 changed their author completely(2).

Changes were strongly concentrated on low-copy works. 95% had fewer than 10 copies. 72% had one or two copies. A very common circumstance was a work with two copies whose two books have different authors, and it switched from one to another. Academic books with multiple authors are one common cause.

The New System

The new system sorts in a cascade. That is, it only gets to the next level, if the authors are tied so far.

1. Whether there's a manually-added primary author
2. Number of books using the author, all variants combined
3. Number of books using the specific variant
4. Length of author code; this is arbitrary but tends to produce better results.
5. Alphabetical by author code; arbitrary but provides the same answer every time.

Books without authors are counted as 1/3 of a book. That is, if two books in a work have no author, but a third book has an author, it will pick the third book's author. Most copies need to have no author for it to prefer no author.

The system gives deleted books a 1/100th vote. This occasionally breaks ties.
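
As a rough illustration of the cascade and vote weights described above, here is a minimal sketch. It is not LibraryThing's actual code: the function name, the tuple layout, the use of an empty string for "no author", and the weight given to a deleted book that also has no author are all assumptions made for the example. Only the five sort levels and the 1, 1/3 and 1/100 weights come from the post.

```python
from collections import defaultdict
from fractions import Fraction

# Vote weights from the post; everything else in this sketch is hypothetical.
LIVE_VOTE = Fraction(1)
NO_AUTHOR_VOTE = Fraction(1, 3)
DELETED_VOTE = Fraction(1, 100)


def pick_primary_author(books, manual_author=None):
    """Pick a (author_code, variant) pair for a work.

    books: list of (author_code, variant, deleted) tuples, where
    author_code is "" for a book with no author (an assumption).
    """
    # Level 1: a manually-added primary author wins outright.
    if manual_author is not None:
        return manual_author

    by_code = defaultdict(Fraction)     # level 2: all variants combined
    by_variant = defaultdict(Fraction)  # level 3: the specific variant
    for code, variant, deleted in books:
        if deleted:
            weight = DELETED_VOTE       # assumed to apply even if authorless
        elif code == "":
            weight = NO_AUTHOR_VOTE
        else:
            weight = LIVE_VOTE
        by_code[code] += weight
        by_variant[(code, variant)] += weight

    if not by_variant:
        return None

    def sort_key(candidate):
        code, _variant = candidate
        # Later elements only matter when earlier ones are tied.
        return (-by_code[code], -by_variant[candidate], len(code), code)

    return min(by_variant, key=sort_key)
```

Under these assumptions, a work with two authorless books and one book by "smithjoe" comes out as "smithjoe" (2/3 of a vote against 1), and a deleted copy is enough to break a one-to-one tie between two author codes.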

The Upshot

* Most users won't notice a thing.
* Members who do a lot of librarian work will find the author-selections more rational.
* Members who do a lot of combining and split-assigning will see new low-copy works, from all the works that went from no author to a good author. (One member already noticed, and thought they were from uncombining. See https://www.librarything.com/topic/241944 .)

If you see any work-level author picks you disagree with, take a look at the editions page for the work. If you still see a problem, let me know on this thread, and I'll explain why the author was chosen, or improve the algorithm.



1. This wasn't just idiocy. For speed and memory reasons, the code for processing millions of works at a time can't be the same as the code that processes one, quickly. It's now the same, so recalculations take a little longer than I'd like (4-5 seconds for high-copy works).
2. minus some aliasing of authors that's irritating to calculate across so many works efficiently.

2r.orrison
Nov 26, 2016, 4:16 pm

Possibly related? On https://www.librarything.com/work/3175555/summary look at the Other Authors section first; the first author there is "Lederman, Ross" with a valid link. At the top of the page, the name Ross Lederman links to /author/

3timspalding
Nov 26, 2016, 4:35 pm

Interesting. Where'd you find that?

4MarthaJeanne
Nov 26, 2016, 4:40 pm

>2 r.orrison: The link may be valid, but the work doesn't appear on the page.

5Collectorator
Nov 26, 2016, 5:30 pm

This member has been suspended from the site.

6rodneyvc
Edited: Nov 27, 2016, 12:39 am

>2 r.orrison: I'm seeing the bad link for author Ross Lederman at the top of the summary page for the Tarzan the Fearless / Tarzan's Revenge [videorecording]

7omargosh
Nov 26, 2016, 10:14 pm

Thanks for your work on this, Tim. Those are some pretty big numbers. Should the changes mean that manual recalculations in general shouldn't be very necessary anymore (i.e. going forward)?

8lorannen
Nov 26, 2016, 10:36 pm

>7 omargosh: That's the idea! It should also mean that, when manual recalculations are initiated, they're more successful at finding the right author.

9davidgn
Nov 27, 2016, 12:52 am

Bravo! Do we have a count of authorless works remaining?

10timspalding
Edited: Nov 27, 2016, 1:47 am

First, I fixed something that was causing the manual process to use the old algorithm. Grumble. It's working now.

Tarzan the Fearless / Tarzan's Revenge

So I need to understand where this link came from. It does indeed have no author on the work level. (This fact is half overcome by there being a manual author but, as noted, it's not changing the link. Let's ignore that secondary bug for now.) When I reran the reconciliation script, it was one of only 291 works that changed. So I'm guessing the work was JUST made, or just split, or whatever. I want to find out what happened.

The problem MAY be explained by my note at the top. So… wait for another one?

1) works that have more than one clear author choice will still need help making a decision
http://www.librarything.com/work/3844472/editions

Well, there is a clear author. 2/3 books have "Richardson" as the author. 1/3 has Richardson, Adele. The system chooses "Richardson." It's surely "wrong" but it's working as it was designed to work. We have to work with the data we have.

http://www.librarything.com/work/1260901/editions

Yeah, if there are three answers, it has to choose one. In this case, there are two answers with the same number of votes--the no-author variant is counted for less. It does the best it can.

2) works that have one best choice will still need help choosing it
http://www.librarything.com/work/11851919/editions

I see:

Interceptive Orthodontics/Richardson/ISBN 0904588459 (1 copy separate)
Interceptive Orthodontics/Richardson, Andrew/ISBN 0904588564 (1 copy separate)

That's two choices. Note: It's now choosing "Richardson" as the shorter of the two. I've found that usually works better--because longer ones are more often full of garbage, like a second author jammed in.

http://www.librarything.com/work/5137215/editions

Yeah, it has to make a choice.

Bears: Paws, Claws, and Jaws (Wild World of Animals)/Richardson, Adele D./ISBN 073680823X (1 copy separate)
Bears: Paws, Claws, and Jaws (Wild World of Animals)/Richardson/ISBN 073680823X (1 copy separate)
Bears: Paws, Claws, and Jaws (Wild World of Animals (Bridgestone))/Richardson/ISBN 073680823X (no current copies separate)

In this case, it's decided by the deleted copies--they break ties.

Should the changes mean that manual recalculations in general shouldn't be very necessary anymore (i.e. going forward)?

Yes, it shouldn't be necessary, except in serious cases of lag.

Those are some pretty big numbers.

By the way, it's a little hard to calculate, but it looks to me like 50% were basically moving from one arbitrary choice to another. That is, if a work—usually a very low copy work—has two authors that are tied, the system has to make a choice. The old algorithm and the new algorithm solved that problem differently.

11r.orrison
Nov 27, 2016, 5:58 am

Looking at the Helpers Log for Other Authors, a few changes were made to Tarzan the Fearless / Tarzan's Revenge yesterday between 3-4pm EST:

casaloma added author Lederman, Ross to Tarzan the Fearless / Tarzan's Revenge [videorecording] (Director, primary, all editions)
casaloma added author Lederman, Ross to Tarzan the Fearless / Tarzan's Revenge [videorecording] (primary, all editions)
casaloma added author Hill, Robert F. to Tarzan the Fearless / Tarzan's Revenge [videorecording] (Director, main, all editions)
casaloma added author Crabbe, Buster to Tarzan the Fearless / Tarzan's Revenge [videorecording] (secondary, all editions)
casaloma added author Morris, Glenn to Tarzan the Fearless / Tarzan's Revenge [videorecording] (secondary, all editions)
[...]
casaloma added author Lederman, D. Ross to Tarzan the Fearless / Tarzan's Revenge [videorecording] (Director, primary, all editions)


That work doesn't show in the Work Combination or Work Separation logs.

12Collectorator
Nov 27, 2016, 7:53 am

This member has been suspended from the site.

13MDGentleReader
Nov 27, 2016, 10:56 am

Thank you, @timspalding

14timspalding
Nov 27, 2016, 2:06 pm

>12 Collectorator:

Done. See the topic.

15timspalding
Nov 27, 2016, 2:28 pm

I forgot to thank Collectorator for prompting this with a sort of perfect bug--simple, clear, reproducible, etc. See https://www.librarything.com/topic/227454#5807250

16Noisy
Nov 28, 2016, 9:16 am

Anything that advances the cause is truly welcome.

>10 timspalding: Not sure I'd have gone with the 'Always choose shortest' option. Perhaps an impossible request, but what about calculating the average of all (across LT; excluding the bits in brackets) name-strings and picking the one that's nearest to that average.

17timspalding
Nov 28, 2016, 9:36 am

>16 Noisy:

Hmm. I'm not sure that's right. I think the best answer would be to compare the strings. I mean, for example. If you have

twainmark
twainmarksmithjoe

You want twainmark. Ditto if you have

smithjoe
smithjoeeditor

But other times you have

smithjoe
smith

I see no good way of deciding here. Names are of different lengths.

In theory, I could try to remove some words (ed., editor, by, etc.) and preference ones that didn't have that. But at some point we're devising complicated rules that advance things by 1%. This will, after all, ONLY happen when all the other conditions are tied. That's rare.
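
To make the "remove some words" idea concrete, here is a small sketch. It is only an illustration: the word list, the function names, and the assumption that the raw name string (with commas and spaces) is available at this point are all hypothetical, not a description of how the site works.

```python
import re

# Hypothetical list of role words; the post mentions "ed., editor, by, etc."
ROLE_WORDS = {"ed", "editor", "by"}


def has_role_word(name):
    """True if any word in the name looks like a role rather than a name."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return any(token in ROLE_WORDS for token in tokens)


def prefer_clean(names):
    """Prefer names without role words, then shorter, then alphabetical."""
    return min(names, key=lambda n: (has_role_word(n), len(n), n))
```

With this, "Smith, Joe" beats "Smith, Joe, editor", but "Smith" against "Smith, Joe" still falls back on length. It also flags a name like "Bob, Ed." as containing a role word, which is the kind of edge case that makes such rules costly for a small gain.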

18lorax
Nov 28, 2016, 9:56 am

>17 timspalding:

Can you check that both first name and last name are populated, and choose the one where both are prior to the length check, as a way of distinguishing between "smith" vs. "smithjoe" and "smithjoe" vs. "smithjoeeditor"?

19Noisy
Nov 28, 2016, 10:04 am

>17 timspalding: Just a thought. Really have no idea what the average might be, so I assumed something like '12'. As you say, it's not one of the leading conditions, and the times the rule would be employed would probably be small. Also, the people who will actually notice this are the more 'involved' cleaners and will accept there has to be a compromise.

20timspalding
Nov 28, 2016, 10:07 am

>18 lorax:

What we have are strings, with, sometimes, commas. There's no explicit first-name, last-name distinction. Bob, Ed. has both parts.

21jjwilson61
Nov 28, 2016, 10:46 am

I think, though, that a closest to 12 (or something like that) test is still simple but is likely to eliminate both the multiple name messes and the last name only cases.
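
For anyone who wants to try the idea, a "closest to 12" tie-break could be sketched as below. The target of 12 is the guess from earlier in the thread, and the function name is made up for the example.

```python
TARGET_LENGTH = 12  # a guess from this thread, not a measured average


def closest_to_target(codes, target=TARGET_LENGTH):
    """Pick the author code whose length is nearest the target.

    Ties on distance fall back to alphabetical order so the answer
    is the same every time.
    """
    return min(codes, key=lambda code: (abs(len(code) - target), code))
```

On the examples from earlier in the thread, this picks "twainmark" over "twainmarksmithjoe" and "smithjoe" over "smith", but it also picks "smithjoeeditor" over "smithjoe", since 14 is closer to 12 than 8 is. So the result depends a lot on the target chosen.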

22r.orrison
Edited: Nov 29, 2016, 6:23 pm

I did recalculate author on this work https://www.librarything.com/work/9579395/editions and it chose no author (most popular edition, 2 copies) over the author that appeared on 1 copy.

Edit: Another work where no author is chosen over numerous non-blank options: https://www.librarything.com/work/2197009/editions

23MarthaJeanne
Edited: Nov 29, 2016, 6:00 pm

>22 r.orrison: There are three copies without an author. No author is shorter than GeoCenter.

24r.orrison
Nov 29, 2016, 6:00 pm

GeoCenter is listed as the author of the edition with ISBN 1851371702

25r.orrison
Nov 29, 2016, 6:05 pm

Would it be possible to recognize a list of non-authors and prioritize them lower than other author names? E.g. "anonymous" and "no author".

On a work like https://www.librarything.com/work/10876265/editions where there's a choice between Anonymous and a real name, it would be nice if the system would have a preference for the real name.
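
One way to picture the request is the sketch below. The placeholder list and function names are invented for illustration, and demoting placeholders before counting votes (rather than only on ties) is just one possible reading of the suggestion.

```python
# Hypothetical list of "non-author" names; compared case-insensitively.
PLACEHOLDER_AUTHORS = {"anonymous", "no author", ""}


def is_placeholder(name):
    return name.strip().lower() in PLACEHOLDER_AUTHORS


def prefer_real_name(candidates):
    """candidates: list of (name, votes). Real names first, then most votes."""
    best = min(candidates, key=lambda c: (is_placeholder(c[0]), -c[1], c[0]))
    return best[0]
```

With this ordering, a real name on a single copy would win over "Anonymous" on several copies, and "Anonymous" would only be picked when no other name is available.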

26lorax
Edited: Nov 30, 2016, 11:09 am

>25 r.orrison:

Would it be possible to recognize a list of non-authors and prioritize them lower than other author names? E.g. "anonymous" and "no author".

There's a long-standing RSI for this, let me go find it.

Edited: Never mind, I was misremembering the RSI that pertained to "special" authors; it was about combination. For reference it's at https://www.librarything.com/topic/155018 .

27r.orrison
Nov 30, 2016, 3:04 pm

(You're thinking of https://www.librarything.com/topic/93378. So was I.)

28lorax
Nov 30, 2016, 3:15 pm

Ah, thank you.

29Collectorator
Dec 6, 2016, 4:15 pm

This member has been suspended from the site.

30timspalding
Dec 6, 2016, 4:51 pm

Give me an example, C.

31SimoneA
Dec 9, 2016, 7:48 am

One example of this working counterproductively can be seen on the Smith author page http://www.librarything.com/combine.php?author=smith. There are several works that end up on the 'wrong' author page, because the calculation uses the shortest author form. I understand that a calculation has to be chosen, so nothing can be done about this.
However, I also noticed that the zero copy editions don't seem to contribute to the calculation, for example here http://www.librarything.com/work/2568835/editions. Maybe that could be looked into?

32Collectorator
Dec 14, 2016, 12:08 pm

This member has been suspended from the site.

33timspalding
Dec 14, 2016, 2:35 pm

"This is not the only one that already knew to which author division to go once I changed its author name."

What? Say that another way?

34Collectorator
Dec 14, 2016, 3:14 pm

This member has been suspended from the site.

35KoobieKitten
Edited: Dec 14, 2016, 7:35 pm

I would guess it's doing that because back on Oct 12, 2012 user bw42 other-authored that work and "The Essentials of IT" to John Hamilton (11). How exactly it "remembers" which other-author it was previously after the change to John due to the recalc, and then the change back to John Hamilton after the manual add author change, I don't know.

Edit: In this case, when the author's name is changed using add author, this "remembering" part is good, yes C.?

36timspalding
Edited: Dec 15, 2016, 3:04 am

Imagine if you have a book authored by two well-known split authors. Because it's a coauthorship situation, the book has bounced back and forth between primary authors. When the author was X, members other-authored it to the correct split. When the author becomes Y, members other-authored it as well. If it flips back, what should happen? I think it should remember WHICH X it was with.

It remembers because that makes the most sense. If the link between a work and its other author was broken, all sorts of information would be lost. Sometimes like this and sometimes when someone wrongly combined or changed an author.

Yeah, you can see problems if members engage in extensive renumbering. But I've always been against that…

37r.orrison
Dec 15, 2016, 4:24 am

This is why most people like to keep the Disambiguation Notices about splits around, even when there are no books assigned to those splits...

38Collectorator
Dec 15, 2016, 9:01 am

This member has been suspended from the site.

39timspalding
Edited: Dec 15, 2016, 9:49 am

>38 Collectorator:

Sure. We can go back to it staying with the first author ever entered, and not recalculating if the majority-author changes.

Would that be better? No.

In this case, it has a choice between two authors, each with exactly one edition. When the edition counts are tied, it has to decide between them. What metric do you suggest?

If it can't decide by counts, including deleted books to break the tie of undeleted books, it uses length. In general, I find the shorter author is more commonly correct. That is, wrongness like "Smith, John, editor" is more common than "Smith." But either way is going to have problems, and "Why can't it just remember" isn't a solution to those problems, but merely a decision to pick one error and stick with it forever.

40Collectorator
Dec 15, 2016, 10:57 am

This member has been suspended from the site.

41Collectorator
Dec 15, 2016, 11:07 am

This member has been suspended from the site.

42timspalding
Dec 16, 2016, 3:02 am

>41 Collectorator:

If you set the main author, that trumps all.

43r.orrison
Dec 17, 2016, 4:21 pm

Could you take a look at https://www.librarything.com/work/18465958/editions - the author name appears at the top of the page as a URL segment ("topdemirhuumlseyinga"), although the name is correct on the one edition.

44PhaedraB
Dec 17, 2016, 10:09 pm

>43 r.orrison: It looks fine now.

45r.orrison
Dec 18, 2016, 3:17 am

Oh well.

46timspalding
Edited: Dec 18, 2016, 5:32 am

>43 r.orrison:

Those things can happen, but they should be quickly replaced with the full name. Click to recalculate the author name to be sure.

47Collectorator
Dec 28, 2016, 8:09 am

This member has been suspended from the site.

48timspalding
Dec 28, 2016, 11:05 am

>47 Collectorator:

When you combine works, it picks one work to "win"--the work with the most copies. The old work gets aliased into the winning work, and all of its editions get pointed at the new one too.

Obviously we can't have the losing work's split win by default. That is, if work A is assigned to 1 and work B is assigned to 2, and B gets combined into A, we can't have the losing work's split-assignment triumph, right?

What you're proposing is, I think, that, if the losing work is listed as belonging to split 1, and the winning work is not assigned to any split, then it should "take the hint" and assign the winning work to split 1.

Right?

49krazy4katz
Dec 28, 2016, 11:12 am

>48 timspalding: Actually, if I understand this correctly, I think the opposite seems to happen. Often the work with the most copies is assigned to the correct author. On the multiple author page, if you combine it with a single from "Unknown", the entire group goes to the Unknown author. I have gotten around this by reassigning the works from unknown to the correct author before combining.

If that is not what you and Collectorator are talking about, I apologize and please ignore me. k4k

50timspalding
Dec 28, 2016, 11:51 am

Okay, you're saying that work combination always wipes out split assignment?

51r.orrison
Dec 28, 2016, 11:58 am

No, sometimes. I think in my experience it usually does the right thing, but I've seen it lose the assignment as well.

52krazy4katz
Dec 28, 2016, 12:09 pm

>50 timspalding: I guess I don't know if it is "always" since I stopped combining works before reassigning them once I noticed this happening.

53PhaedraB
Dec 28, 2016, 12:21 pm

>50 timspalding: "Always remember, never say 'never' or 'always'."

In my experience, combining an already assigned work with copies from the Unknown section takes the work out of the split and categorizes it as Unknown. Every time in my experience.

As I recall, this does not happen when combining works already assigned to the same split.

I don't recall combining when the works combined were assigned to two different splits, so I don't know what happens then.

54Collectorator
Dec 28, 2016, 2:39 pm

This member has been suspended from the site.

55timspalding
Dec 28, 2016, 2:47 pm

>54 Collectorator:

Okay, well, I don't know what it does currently, but I'll look.

56Collectorator
Edited: Dec 28, 2016, 4:18 pm

This member has been suspended from the site.

57MarthaJeanne
Dec 28, 2016, 4:33 pm

I don't think it always wipes out the split assignment, but often enough that you need to check the assignment before and after.

58timspalding
Dec 28, 2016, 5:45 pm

Thanks. Checking tomorrow—it's 2am here. Resetting my read-to marker after this.

59AnnieMod
Dec 28, 2016, 8:18 pm

>58 timspalding:

It almost feels like it keeps tabs of what was assigned before combinations and goes with the highest number of works - so works with a lot of copies will end up as unassigned - maybe because there was a 2000 works one assigned once and then separate non-assigned were folded in but if it is 1 on 1, it sometimes remembers to put it in the proper place after that.

At least that is what seems to happen when I do not assign before combining (or it is actually random)

60timspalding
Dec 29, 2016, 8:37 am

>56 Collectorator:

Thank you.

First, the David Wood one acted unexpectedly. It lost its assignment even though the one with more copies was assigned. So, problem--not working as intended.

Moving on.

61Collectorator
Edited: Dec 29, 2016, 2:00 pm

This member has been suspended from the site.

62timspalding
Dec 29, 2016, 2:08 pm

Argh. I do need another.

I'll look.

63timspalding
Edited: Dec 29, 2016, 2:43 pm

Okay, see New Features: http://www.librarything.com/topic/244676

As you can see, I am now explaining where the combined work is going to "go" in the splits. But I decided against:
"What you're proposing is, I think, that, if the losing work is listed as belonging to split 1, and the winning work is not assigned to any split, then it should "take the hint" and assign the winning work to split 1."
I decided against it, because it's not always a question of assigning between splits. Sometimes you're assigning to splits that are aliased away to another author--more and more of this apparently! And sometimes the works are to different authors, or different authors that are combined, etc. You can even have the author change in the course of combination.

Anyway, the logic and ramifications here just spun out of control when I tried to pin down every edge case. Trying to come up with an automatic "best" answer and apply it was too much. So I decided to do what it was supposed to do--have the "winner"'s split data triumph, and be clearer about what was going on and where it would end up.
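
In sketch form, the "winner is the winner" rule looks something like this. The class, field names, and the tie-handling when both works have the same number of copies are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    work_id: int
    copies: int
    split: Optional[int] = None  # None means not assigned to any split


def combine(a, b):
    """Combine two works; the one with more copies wins.

    The winner keeps its id and its split assignment, even when that
    assignment is None and the loser had one. On equal copies the
    first argument wins (an arbitrary choice for this sketch).
    """
    winner, loser = (a, b) if a.copies >= b.copies else (b, a)
    return Work(winner.work_id, winner.copies + loser.copies, winner.split)
```

So combining an unassigned ten-copy work with a three-copy work assigned to split 1 leaves the result unassigned, which is why it can help to assign the larger work before combining.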

64timspalding
Edited: Dec 29, 2016, 2:51 pm

Thanks for all your help on this Collectorator. I really tried to do the thing you proposed--this has been my work most of today. It's just a hard problem. And, in retrospect, I think sticking with "the winner is the winner" is a better principle than doing selective magic.

65leselotte
Aug 18, 2017, 6:03 am

Digging out this thread because I'm not sure this is a bug: I've come across several entries lately that had as main author name a name that doesn't show up in copies / editions of that title.

Example:
https://www.librarything.com/work/9812570/summary

Is it possible to see whether the author was changed manually? Recalculating doesn't help, by the way!
Tia!

66MarthaJeanne
Aug 18, 2017, 6:18 am

>65 leselotte: In that case, the author name on the work is the author page the work is on due to (quite proper) combining.

67leselotte
Aug 18, 2017, 10:02 am

>66 MarthaJeanne: Thank you, MarthaJeanne! Maybe I'll come across some more examples that stumped me (that weren't as clear as the Confucius one to me)

68Collectorator
Aug 18, 2017, 10:14 am

This member has been suspended from the site.

69leselotte
Aug 21, 2017, 1:58 am

>68 Collectorator: Thanks a lot!

70leselotte
Aug 29, 2017, 6:57 am

Here's one that baffles me: Franz Carl Weiskopf. Lots of titles where the author shows up as F.C. Weiskopf, while it's F. C. Weiskopf in work details. Canonical name not set (and shows up as Franz Carl Weiskopf anyways). How come?

71MarthaJeanne
Aug 29, 2017, 8:34 am

>70 leselotte: Look at the 'includes' page: https://www.librarything.com/author/weiskopffranzcarl/names

The FC page was combined into the Franz Carl page.