Marc Export seems broken


1bnielsen
Edited: Jan 6, 2015, 3:48 pm

When I export my books as Marc (selecting all three options here:)

Use existing library records, when available
Fuzzy matching
Integrate your catalog changes into library records

I get a file that only seems to contain 5408 LDR lines (I expected 6318) and some of the 5408 seem badly broken:

LDR 00728 2200229 4500
LDR binary garbage removed since this post wasn't displayed otherwise
LDR 00717 2200241 4500

Can you confirm that there's a problem with the marc export?

I'll try to find some more precise error descriptions, but for a start it would be nice to know if this is something you've seen before?

Exporting gives no error messages and says

Processing done (6,318 records). Click to download.

as expected.

2bnielsen
Jan 6, 2015, 3:55 pm

Update:

If I look at the export file mentioned above, I can guess that bookid = 18282723 is one of the affected books.
Filtering on 18282723 gives one book in the export, but the resulting librarything_bnielsen.marc is just 31 bytes long and doesn't look
like a marc record:

cat /tmp/1book.marc | perl marc.pl --ifmt=marc --ofmt=text | od -tx1z -w8 -v

0000000 4c 44 52 20 11 14 4e 44 >LDR ..ND<
0000010 da 27 79 3c a9 7a 86 e3 >.'y<.z..<
0000020 79 cb 61 6a c9 e8 6a db >y.aj..j.<
0000030 6b 89 bb ad 0a 0a >k.....<

(assuming that you can read the above :-)
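A quick framing check can also tell a real record from garbage like the above when run on the raw export file. This is a minimal Python sketch, not anything LT actually runs. It assumes standard ISO 2709 / MARC 21 framing (a 24-byte leader whose first five bytes give the total record length as ASCII digits, and a 0x1D record terminator), and all function names are made up for illustration.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 record terminator
LEADER_LENGTH = 24           # fixed size of a MARC leader


def split_records(data: bytes) -> list[bytes]:
    """Split raw MARC data into records, keeping each record terminator."""
    parts = data.split(RECORD_TERMINATOR)
    records = [part + RECORD_TERMINATOR for part in parts[:-1]]
    if parts[-1]:
        # Trailing bytes that never got a terminator; keep them for inspection.
        records.append(parts[-1])
    return records


def looks_like_marc_record(raw: bytes) -> bool:
    """Return True if raw passes basic ISO 2709 framing checks."""
    if len(raw) < LEADER_LENGTH or not raw.endswith(RECORD_TERMINATOR):
        return False
    length_field = raw[:5]
    # The first five leader bytes must be ASCII digits giving the record length.
    if not length_field.isdigit():
        return False
    return int(length_field) == len(raw)


def count_records(data: bytes) -> tuple[int, int]:
    """Return (well_framed, suspicious) record counts for raw MARC data."""
    records = split_records(data)
    good = sum(1 for record in records if looks_like_marc_record(record))
    return good, len(records) - good
```

Reading the whole export in binary mode and passing it to `count_records` should show how many records are well framed and how many are not. It only checks framing, so a record can pass and still contain bad field data.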

3bnielsen
Jan 6, 2015, 3:59 pm

Update:

If I take another bookid, 18282612, the resulting export file is much longer (728 bytes) and
deciphering it is easy and looks nice:

LDR 00728 2200229 4500
001 18282612
003 MePoLT
005 20150106155652.0
008 750103s1974 mau 00010 eng
010 |a 74009250
020 |a0201067374
040 \\|dMePoLT|erda
050 0 |aQA76|b.S3588
082 |a001.6/4
100 10|aSchneider, Ben Ross,|d1920-
245 10|aTravels in computerland :|bor, Incompatabilities and interfaces : a full and true account of the implementation of the London stage information bank /|cby Ben Ross Schneider, Jr.
260 0 |aReading, Mass. :|bAddison-Wesley Pub. Co.,|c1974
300 |avii, 244 p. ;|c21 cm.
650 0|aComputers.
920 \\|a1037178
922 \\|a10
923 \\|aYour library

4bnielsen
Jan 6, 2015, 4:00 pm

BTW I have great praise for the ability to export single books. It really makes for fast debugging compared to the old export.

5ccatalfo
Jan 7, 2015, 7:22 am

>2 bnielsen: Thanks: I can confirm there's a unicodedecode error in the log for the export for that bookid.

Going to see just what's happening there.

6bnielsen
Jan 7, 2015, 1:42 pm

Just checked that 18282723 wasn't one of the books affected by the old bug:

http://www.librarything.com/topic/118711

Also checked that the marc converter here is still functioning:

http://marcpm.sourceforge.net/cgi-bin/converter.cgi

7ccatalfo
Jan 9, 2015, 9:18 am

Update: I'm closing in on where this is happening. Not fixed yet but getting closer.

8bnielsen
Jan 10, 2015, 10:24 am

Let me guess:

If (source == "det kongelige bibliotek" and book_id % 17 > 5 and username == "bnielsen") {
output = "LDR " + randombytes(19);
}

And thanks for the update :-)

9ccatalfo
Jan 12, 2015, 7:13 am

>8 bnielsen: Ha, so close, yes, so close...

10melannen
Jan 19, 2015, 3:57 pm

I'm trying to export today as TSV and/or Excel and getting files with no error messages but default filenames that are in nonsense kanji and contents that are scrambled in various ways (columns that don't line up, several thousand books loading into the same cell, more unexpected kanji, etc.) Is this the same problem? It sounds like it could be.

11bnielsen
Jan 19, 2015, 4:37 pm

>10 melannen: Nope. I think you may have found a new problem :-)

I think LT should sanitize contents of the Subjects field before exporting it. The TSV export is _mostly_ UTF-8 except for one or two fields that may hold surprises.

12ccatalfo
Jan 21, 2015, 9:35 am

>10 melannen: >11 bnielsen: Yes, indeed, it sounds like a different problem. If you could make a new Talk post detailing it Ammar can take a look.

>8 bnielsen: Update on this: I've been busy with other stuff but I have not forgotten this. ;)

13melannen
Jan 22, 2015, 5:47 pm

I will do that, as soon as I have time to try again and see if it's still doing it!

14bnielsen
Edited: Jan 22, 2015, 10:13 pm

>12 ccatalfo: Thanks. Now if you could just change the 5 to 18 in >8 bnielsen: :-)

ETA (for my own purposes) http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8

15bnielsen
Edited: Jan 23, 2015, 6:32 am

Just a little more information that might be handy for the people using linux or osx:

Piping the TSV export through

iconv -c -t UTF-8

will strip out everything that isn't valid UTF-8.

Running it through this perl script instead

while (<>) {
    chomp;
    next if /
        ^( ([\x00-\x7F])              # 1-byte pattern
         | ([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
         | ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))   # 3-byte pattern
         | ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))         # 4-byte pattern
        )*$ /x;
    print "$_\n";
}

will display all lines that contain broken UTF-8.
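For anyone without iconv or perl at hand, roughly the same two operations can be sketched in Python. This is only an illustration: the function names are invented here, and it relies on Python's strict UTF-8 decoder to decide what counts as broken. That decoder also rejects overlong and surrogate sequences, as the regex above does, but the output is not guaranteed to match iconv byte for byte.

```python
def drop_invalid_utf8(data: bytes) -> bytes:
    """Remove bytes that are not part of valid UTF-8 (similar in spirit to iconv -c)."""
    # errors="ignore" silently skips undecodable bytes instead of raising.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def broken_lines(data: bytes):
    """Yield (line_number, raw_line) for each line that is not valid UTF-8."""
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding raises on malformed input
        except UnicodeDecodeError:
            yield number, line
```

Reading the export in binary mode (`open(path, "rb").read()`) and looping over `broken_lines` gives the line numbers to look at before deciding whether dropping the bad bytes is acceptable.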

16jouni
Edited: Feb 14, 2015, 4:29 am

>15 bnielsen: Wow, exciting :D

The "iconv" seems to fix most problems for my exported data (174 out of 1230), still have to check how mobile app will handle the result. Guess I'll do some iOS coding today!

Any ideas about what could be wrong with this data (first the original export, then the iconv-converted version)? Is this an example of "UTF-8 encoded UTF-16 surrogate non-characters"?

22176827 Pitkät jäähyväiset 1 Chandler, Raymond. Kalevi Nyytäjä Porvoo: WSOY, 1988. 427, 1 ; 19 cm. 2. p. 1989. 1988 Pitk�at j�a�ahyv�aiset by Raymond. Chandler (1988), 2. p. 1989. 427 p.; 19 cm 19 cm 19 cm 427 crime, detective, mystery Your library Finnish, English English PS3505.H3224L6 9510151386 9510151386, 9789510151389 Los Angeles|Marlowe, Philip|Yhdysvallat|kaunokirjallisuus|murha|rikkaat|rikoskirjallisuus|romaanit|salapoliisikirjallisuus|tytt�aret 813.52 20th Century > American Fiction > American literature > Early 20th Century 1901- > Literature 1 Helsinki Metropolitan Libraries 2007-10-13 15087

22176827 Pitkät jäähyväiset 1 Chandler, Raymond. Kalevi Nyytäjä Porvoo: WSOY, 1988. 427, 1 ; 19 cm. 2. p. 1989. 1988 Pitk�at j�a�ahyv�aiset by Raymond. Chandler (1988), 2. p. 1989. 427 p.; 19 cm 19 cm 19 cm 427 crime, detective, mystery Your library Finnish, English English PS3505.H3224L6 9510151386 9510151386, 9789510151389 Los Angeles|Marlowe, Philip|Yhdysvallat|kaunokirjallisuus|murha|rikkaat|rikoskirjallisuus|romaanit|salapoliisikirjallisuus|tytt�aret 813.52 20th Century > American Fiction > American literature > Early 20th Century 1901- > Literature 1 Helsinki Metropolitan Libraries 2007-10-13 15087

17bnielsen
Feb 15, 2015, 6:51 pm

Ah, now I remember. The above iconv forces data into UTF-8, but it can't do miracles. I think your example in >16 is bad, bad data from HML. Try to get in touch with their technical staff and make sure you find someone who knows the meaning of "a Z39.50 gateway". This looks like trouble with the character set, but whether the data itself is bad, the translation through their Z39.50 profile is bad, or something is going wrong at the LT end of the connection is impossible for you and me to find out.

And of course fixing the Z39.50 connection between HML and LT isn't going to fix the bad data you've already imported, so you'll need to delete your book 22176827 and reimport it after the fix.

18timspalding
Feb 24, 2015, 8:35 pm

Giving to CC.