Tab-delimited export - unprintable character on field label



1jjmcgaffey
Jan 13, 2008, 3:59 am

When I export to tab-delimited text, I get, as the first field, "book id" with an unprintable character (box) in front of it. I did a copy-and-paste to get the above label, but the box didn't show up. I can delete it in Excel, of course, but what is it and why does it show up?

2khms
Jan 19, 2008, 6:03 am

Someone messed up.

That file is in UTF-16 (an encoding of Unicode built from two-byte units). Because there are two choices about how to order those bytes - most significant byte first ("Motorola" or "network" ordering), or least significant byte first ("Intel" ordering) - there's a character called the "byte order mark" or BOM (U+FEFF, chosen because its byte-swapped form U+FFFE is guaranteed never to be a legal character) which is inserted at the front of a file so software knows how to read it (FE FF ... -> Motorola, FF FE ... -> Intel).

Only, in this case someone actually added the character twice. The reader consumes the first one as the BOM; the one left over gets shown as an unprintable character.
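
As an illustration of the explanation above, here is a minimal Python sketch of BOM-based byte-order detection and of what a doubled BOM does to the first field. The sample header (with a "title" column) and the helper name `detect_utf16_byte_order` are hypothetical; only the "book id" label comes from the thread, and this is not a claim about how the export is actually generated.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on the leading BOM bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF -> most significant byte first
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE -> least significant byte first
        return "little"
    return "unknown"

# Hypothetical export: the BOM written twice, followed by the header row.
header = "book id\ttitle"
export = codecs.BOM_UTF16_LE + codecs.BOM_UTF16_LE + header.encode("utf-16-le")

byte_order = detect_utf16_byte_order(export)

# The "utf-16" codec consumes only the first BOM; the second one
# stays in the text as a literal U+FEFF character.
text = export.decode("utf-16")
first_field = text.split("\t")[0]

# Stripping leading U+FEFF characters recovers the clean label.
cleaned = text.lstrip("\ufeff")

print(byte_order, repr(first_field), repr(cleaned.split("\t")[0]))
```

Under these assumptions, the first field comes back with a stray U+FEFF in front of "book id", which is the kind of leftover character a spreadsheet may draw as a box.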

3bnielsen
Jan 20, 2008, 6:47 pm

Thanks. I think you just found an explanation for a strangeness I see every time I edit a book.
I.e. http://www.librarything.com/work/4759081/edit/25943602
I get a page looking like:
http://www.daimi.au.dk/~bnielsen/cropped.png
I've chosen a larger text size to get the FEFF character to show up more clearly.

So it's not just the tab export that's flawed.

I'm using Linpus and Firefox and Linpus is Unicode all the way, so that's probably why I get to see this character. I haven't seen anyone else complain about it.

4khms
Edited: Jan 27, 2008, 5:13 am

Well, Ubuntu is Unicode all the way, too, but I don't see it in Epiphany.

Hmm. Not in Galeon, either.

Nor in Mozilla.

Well. Either LT fixed something, or you found a browser bug.

5PaulFoley
Jan 27, 2008, 5:51 am

The character U+FEFF is a "zero-width space" (i.e., not actually a space at all, since it has zero width!), so it shouldn't be visible if displayed, no matter how many are there. It's used as a byte-order marker because of this property, though that is an example of Microsoft brain-damage, i.e., not being able to distinguish between internal and external encodings. In any sensible world, UCS-2 would just be defined as big-endian (and the abomination called UTF-16 wouldn't exist). Even worse: many people think it's a good idea to put a "BOM" on UTF-8 data, a "narrow" encoding where the concept of "byte order" is meaningless.
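
For reference, the UTF-8 "BOM" mentioned above is just U+FEFF encoded as the three bytes EF BB BF. A small Python sketch, using the standard `utf-8-sig` codec and the "book id" label from this thread as sample text, shows how it behaves:

```python
import codecs

text = "book id"
plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")   # same bytes, with EF BB BF prepended

# Decoding as plain UTF-8 keeps the BOM as a U+FEFF character;
# decoding as "utf-8-sig" removes it.
decoded_plain = with_bom.decode("utf-8")
decoded_sig = with_bom.decode("utf-8-sig")

print(with_bom[:3].hex(), repr(decoded_plain), repr(decoded_sig))
```

The three bytes carry no byte-order information; they only act as a signature that some software uses to guess the encoding.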

6jjmcgaffey
Jan 27, 2008, 6:07 am

Well, I never saw it on my workpages (FF2 & WinXP), but it still shows up in the tab-delimited export. Not an urgent problem, anyway.

7khms
Edited: Jan 27, 2008, 1:00 pm

The character U+FEFF is a "zero-width space"

Actually, that meaning is being phased out.

The standard says this: (http://www.unicode.org/charts/PDF/UFE70.pdf)

Special
FEFF  ZERO WIDTH NO-BREAK SPACE
= BYTE ORDER MARK (BOM), ZWNBSP
• may be used to detect byte order by contrast
with the noncharacter code point FFFE
• use as an indication of non-breaking is
deprecated; see 2060 ⁠ instead
→ 200B ​ zero width space
→ 2060 ⁠ word joiner
→ FFFE
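
The names in the chart excerpt above can be checked against Python's bundled Unicode database. This is only a sketch using the standard `unicodedata` module; the `"<unnamed>"` placeholder is an arbitrary default chosen here, since U+FFFE is a noncharacter and has no name.

```python
import unicodedata

# Code points mentioned in the chart excerpt above.
code_points = (0xFEFF, 0x200B, 0x2060, 0xFFFE)

info = {}
for cp in code_points:
    ch = chr(cp)
    # unicodedata.name() accepts a default for code points without a name.
    info[cp] = (unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))

for cp, (name, category) in info.items():
    print(f"U+{cp:04X}  {category}  {name}")
```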