Tab-delimited export - unprintable character on field label
Talk Bug Collectors
1jjmcgaffey
When I export to tab-delimited text, I get, as the first field, "book id" with an unprintable character (box) in front of it. I did a copy-and-paste to get the above label, but the box didn't show up. I can delete it in Excel, of course, but what is it and why does it show up?
2khms
Someone messed up.
That file is in UTF-16 (an encoding of Unicode that uses two bytes per code unit). Because there are two ways to order those bytes, most significant byte first ("Motorola" or "network" ordering) or least significant byte first ("Intel" ordering), there's a character called the "byte order mark" or BOM (U+FEFF, chosen because its byte-swapped counterpart U+FFFE is guaranteed never to be a legal character) which is inserted at the front of a file so software knows how to read it (FE FF ... -> Motorola, FF FE ... -> Intel).
Only, in this case someone actually added the character twice. The one left over gets shown as an unprintable character.
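To make the byte-order check concrete, here is a minimal Python sketch of how a reader could pick the byte order from the first two bytes and then drop any extra BOM left behind. The function name `decode_utf16_export` and the sample header are made up for illustration; this is not how LibraryThing or Excel actually handles the file.

```python
import codecs


def decode_utf16_export(raw: bytes) -> str:
    """Decode UTF-16 bytes using the leading BOM, then drop leftover BOMs.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):    # FF FE -> "Intel" order
        text = raw[2:].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):  # FE FF -> "Motorola" order
        text = raw[2:].decode("utf-16-be")
    else:
        raise ValueError("no UTF-16 byte order mark found")
    # A second BOM decodes as a real U+FEFF character; strip it from the front.
    return text.lstrip("\ufeff")


# Sample data (assumed): a little-endian header line with the BOM written twice.
raw = codecs.BOM_UTF16_LE * 2 + "book id\ttitle".encode("utf-16-le")

# Python's generic "utf-16" codec consumes only the first BOM.
print(repr(raw.decode("utf-16")))        # the leftover U+FEFF is still there
print(repr(decode_utf16_export(raw)))    # leftover BOM removed
```

The generic codec leaves the second U+FEFF in front of "book id", which matches the stray box in the first field label.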
3bnielsen
Thanks. I think you just found an explanation for a strangeness I see every time I edit a book.
I.e. http://www.librarything.com/work/4759081/edit/25943602
I get a page looking like:
http://www.daimi.au.dk/~bnielsen/cropped.png
I've chosen a larger text size to make the FEFF character show up more clearly.
So it's not just the tab export that's flawed.
I'm using Linpus and Firefox and Linpus is Unicode all the way, so that's probably why I get to see this character. I haven't seen anyone else complain about it.
4khms
Well, Ubuntu is Unicode all the way, too, but I don't see it in Epiphany.
Hmm. Not in Galeon, either.
Nor in Mozilla.
Well. Either LT fixed something, or you found a browser bug.
5PaulFoley
The character U+FEFF is a "zero-width space" (i.e., not actually a space at all, since it has zero width!), so it shouldn't be visible if displayed, no matter how many are there. It's used as a byte-order marker because of this property, though that is an example of Microsoft brain-damage (i.e., not being able to distinguish between internal and external encodings). In any sensible world, UCS-2 would just be defined as big-endian, and the abomination called UTF-16 wouldn't exist. Even worse: many people think it's a good idea to put a "BOM" on UTF-8 data, a "narrow" encoding where the concept of "byte order" is meaningless.
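For the UTF-8 case, a small Python sketch shows what a "BOM" on UTF-8 data looks like in practice. It uses Python's standard `utf-8-sig` codec; the sample string is made up.

```python
import codecs

# Encoding with "utf-8-sig" prepends the three-byte UTF-8 form of U+FEFF.
data = "book id".encode("utf-8-sig")
print(data[:3] == codecs.BOM_UTF8)   # the signature bytes EF BB BF

# A plain UTF-8 decode keeps the mark as an ordinary character...
plain = data.decode("utf-8")
# ...while "utf-8-sig" treats it as a signature and removes it.
stripped = data.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

The three bytes say nothing about byte order; they only act as a signature that the data is UTF-8, and a reader that doesn't expect them ends up with an extra U+FEFF at the start of the text.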
6jjmcgaffey
Well, I never saw it on my workpages (FF2 & WinXP), but it still shows up in the tab-delimited export. Not an urgent problem, anyway.
7khms
The character U+FEFF is a "zero-width space"
Actually, that meaning is being phased out.
The standard says this: (http://www.unicode.org/charts/PDF/UFE70.pdf)
Special
FEFF ZERO WIDTH NO-BREAK SPACE
= BYTE ORDER MARK (BOM), ZWNBSP
• may be used to detect byte order by contrast with the noncharacter code point FFFE
• use as an indication of non-breaking is deprecated; see 2060 instead
→ 200B zero width space
→ 2060 word joiner
→ FFFE
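The names in that chart entry can be checked against the Unicode data that ships with Python. This is only a lookup sketch; the helper `describe` is made up for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return code point, character name, and general category for one character."""
    # Noncharacters such as U+FFFE have no name, so supply a placeholder.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {name} [{unicodedata.category(ch)}]"


# The BOM, the two cross-referenced characters, and the noncharacter FFFE.
for ch in ("\ufeff", "\u200b", "\u2060", "\ufffe"):
    print(describe(ch))
```

U+FEFF, U+200B and U+2060 are all format characters (category Cf), while U+FFFE is unassigned (Cn), which is what lets a reader tell a correctly ordered BOM from a byte-swapped one.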

