tag combination/separation and logs are showing Unicode Character 'REPLACEMENT CHARACTER'



1gangleri
Aug 1, 2010, 9:35 am

Hi!

Please read Wikipedia:Replacement character first.

� is the Unicode Character 'REPLACEMENT CHARACTER' - U+FFFD
http://www.fileformat.info/info/unicode/char/fffd/index.htm
HTML Entity (decimal) & + #65533;
HTML Entity (hex) & + #xfffd;
UTF-8 (hex) 0xEF 0xBF 0xBD (efbfbd)

This character indicates a problem.

Helpers log: Tags (7 days) shows a lot of them.

The character can also be found at Tag info: b�rgertum, Tag info: l�beck etc. However, it seems that these tags have never been used.

I do not think it makes much sense to show such tag combination proposals at Tag info: bürgertum, Tag info: lübeck.

I would suggest implementing a filter function to avoid such situations. (Maybe one should filter URLs for %EF%BF%B.)
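A minimal sketch of such a filter, assuming tags are available as decoded strings and URLs as plain text. The function names are hypothetical and not part of LT. Since the UTF-8 bytes of U+FFFD listed above are EF BF BD, the percent-encoded form checked here is %EF%BF%BD.

```python
REPLACEMENT_CHAR = "\ufffd"           # Unicode 'REPLACEMENT CHARACTER'
ENCODED_REPLACEMENT = "%EF%BF%BD"     # its UTF-8 bytes, percent-encoded


def has_replacement_char(tag: str) -> bool:
    """Return True if the tag text contains U+FFFD."""
    return REPLACEMENT_CHAR in tag


def url_has_replacement_char(url: str) -> bool:
    """Return True if the URL contains U+FFFD, literally or percent-encoded."""
    # Percent-escapes may be written in lower case, so compare in upper case.
    return REPLACEMENT_CHAR in url or ENCODED_REPLACEMENT in url.upper()


def filter_proposals(tags: list[str]) -> list[str]:
    """Keep only the tags that are free of U+FFFD (hypothetical helper)."""
    return [tag for tag in tags if not has_replacement_char(tag)]
```

For example, `filter_proposals(["bürgertum", "b\ufffdrgertum"])` would keep only the first tag.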

Best regards Reinhardt

2Tallulah_Rose
Aug 1, 2010, 4:31 pm

Thanks for pointing me towards this. I found these tags today on the page of Buddenbrooks (where they were used, so I don't know why it shows up differently here...), and as I figured this was just an error with the character 'ü', it seemed to me they should be combined.

However, if the tag info pages say they have never been used, yet they can be found on the book page of Buddenbrooks (they are still there, I've checked), I think there must be another problem as well.

3gangleri
Edited: Aug 1, 2010, 4:48 pm

It might be (or have been) a sporadic compatibility issue. One should differentiate between the representation on one's own screen, which may depend on the operating system and browser, and what is actually stored in LT and how it is encoded.

Copy and paste might be a source of new faults (carried from one's own computer into LT).

I suggest examining the behaviour on other PCs (at work, in libraries, etc.).

I cannot see any 'REPLACEMENT CHARACTER' at Buddenbrooks. Regards Reinhardt

4Tallulah_Rose
Edited: Aug 3, 2010, 12:44 pm

>3 gangleri: "I can not see any 'REPLACEMENT CHARACTER' at Buddenbrooks."
Well, that's weird, because I can see both kinds of tag: the one with the 'REPLACEMENT CHARACTER' and the one with the proper character (ü in this case).
I would be interested to hear how other members perceive this problem.

5bnielsen
Aug 3, 2010, 2:35 pm

I followed the Buddenbrook link in #2 and didn't see any 'REPLACEMENT CHARACTER'.
Could you be a bit more specific about where you see it?

6gangleri
Edited: Aug 3, 2010, 5:15 pm

I normally work on different PCs. I prefer to use Firefox.

I am not familiar with the newest web specifications / implementations / bugs.

Over the last few years I have visited many pages without proper UTF-8 support. There I either see garbage characters in place of German, Icelandic, etc. diacritics, or characters are missing.

Here is how to set UTF-8 mode properly in the German version of Firefox:

Firefox > Bearbeiten (Edit) > Einstellungen (Preferences) > Inhalt (Content) > Schriftarten & Farben (Fonts & Colors; click "Erweitert" / "Advanced" there) > Zeichenkodierung (Character Encoding): select Unicode (UTF-8)

Good luck! Reinhardt

7bnielsen
Aug 3, 2010, 8:16 pm

It is really just a workaround to set this in the browser. The web pages are supposed to tell which charset they are in.

8gangleri
Aug 3, 2010, 10:00 pm

>7 bnielsen:: I know. If you are familiar with the specs, please take a look at the page's source code; if necessary, please open a new topic.

9MikeBriggs
Edited: Aug 4, 2010, 3:21 pm

I see the tags b�rgertum and l�beck on Buddenbrooks. I had to click 'show all' to see them.

10Tallulah_Rose
Aug 4, 2010, 2:48 pm

#9 So you see some which I don't :)
#5 You have to click 'show all' to see them, if you see them at all. Maybe it is just because of my browser.
Back to the problem: if it really is just because of my browser settings, then maybe we really should implement such a filter function to avoid such proposals. What do the others think?

11gangleri
Aug 4, 2010, 3:07 pm

>Tallulah_Rose: Maybe it is a good idea to let us know what operating system and what browser version you are using.

In Firefox, under Hilfe > Über Mozilla Firefox (Help > About Mozilla Firefox), I can copy the string:
Mozilla/5.0 (X11; U; Linux i686; de; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3

The operating system, which can be found under system information, is:
Ubuntu 7.10 (a Linux-based OS; not the newest)

12Tallulah_Rose
Aug 4, 2010, 3:16 pm

Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7 ( .NET CLR 3.5.30729)
system: Windows XP

13gangleri
Aug 4, 2010, 3:58 pm

>12 Tallulah_Rose:: Thanks!

FYI: I found http://www.librarything.com/tag/antolog%EDa&norefer=1 showing

Tag info: antolog�a
Includes: antologia, antolog�a, Antologia, antol�gia, Antol�gia, ANTOLOGIA

some conclusions:
a) there are many such tags in the system
b) &norefer=1 does not relate to one variant only
c) because http://www.librarything.com/tag/anthology shows

Tag info: anthology
Includes: anthology, Anthology, anthologies, Anthologie, bloemlezing, antologi, anthologie, Anthologies, Bloemlezing, antologia, Antologi, antología, antology, antologio, Antología, Antology, ANTHOLOGY, anothology, format - anthology, ANTHOLOGIES, Anthology., |Anthology, Antologia, #anthology, anthololgy, antolog�a — 25 more, .anthology, _anthology, F:anthology, Antológia, Anthrologies, Antol�gia, anthology;, ANthologies, .Anthology, f:anthology, anthology, Anthololgy, |anthology, ANTOLOGIA, Antologio, Anothology, antológia, anthrologies, anthology., ANthology, anthology, (Anthology), (anthology), antol�gia, _Anthology

LT has to deal with multiple levels of tag combination. I was not aware of this. I only knew of some bugs related to multiple levels of author combinations.

14gangleri
Edited: Aug 4, 2010, 4:02 pm

>13 gangleri:: http://www.librarything.com/tag/poetry shows

Tag info: poetry
Includes: poetry, Poetry, poesia, poëzie, poésie, poesi, Poëzie, poezie, poesía, Poesie, POETRY, poesie, Poezie, @poetry, poezio, Poesia, Poetry., poerty, a:poetry, Poésie, poety, Poesi, Poesía, genre - poetry, POESIA, g:poetry — 47 more, Po�zie, P (Poetry), g:Poetry, Poerty, Poety, form:poetry, Poetry*, poetry., %-poetry, Poezja, #poetry, Category: Poetry, poèsie, POESI, Poes�a, Poezio, poetrty, p (poetry), Genre - Poetry, . POETRY, A:poetry, .poetry, poezja, po�zie, poeotry, poemtry, POESIE, poetry;, poes�a, #Poetry, poes�a, poetry*, category: poetry, POesi, Poetry, POetry., poetry, @Poetry, .Poetry, poetry, Poèsie, po�sie, pOETRY, Poeotry, ;poetry, POetry, Po�sie

15bnielsen
Aug 5, 2010, 4:49 am

Thanks. Now I also see them. I'm not sure if it is a bug or a feature. (Although in the best of all worlds LT would keep them from being entered into the system).
Surely there are lots of other typos in the list of tags combined into say poetry?

16gangleri
Aug 5, 2010, 11:15 am

Again: character encoding has changed a lot during the last 30 years. I remember the times of IBM's EBCDIC code, 7-bit ASCII, the special national 8-bit ASCII variants, etc.

Looking at the link from >13 gangleri:: ( http://www.librarything.com/tag/antolog%EDa&norefer=1 ), one can see that a single-octet character encoding is in use. The character should probably be "í" ('LATIN SMALL LETTER I WITH ACUTE').

Unicode encodes "í" as
Unicode Character 'LATIN SMALL LETTER I WITH ACUTE' (U+00ED)
Encodings:
HTML Entity (decimal) & + #237;
HTML Entity (hex) & + #xed;
HTML Entity (named) & + iacute;

UTF-8 (hex) 0xC3 0xAD (c3ad) - (this is %C3%AD ) as used at
http://es.wikipedia.org/wiki/Antolog%C3%ADa Antolog%C3%ADa for Antología
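The difference between the two URL forms can be reproduced with Python's standard `urllib.parse` module. This is only an illustration of the byte-level mechanics, under the assumption that the `%ED` link was produced from a single-octet encoding such as Latin-1; it says nothing about how LT actually stores tags.

```python
from urllib.parse import quote, unquote

word = "antología"

# Percent-encode the same word under two different character encodings.
latin1_url = quote(word, encoding="latin-1")   # single octet for "í": %ED
utf8_url = quote(word, encoding="utf-8")       # two octets for "í": %C3%AD

# Reading the Latin-1 form as if it were UTF-8 cannot decode the lone
# 0xED byte, so the decoder substitutes U+FFFD (REPLACEMENT CHARACTER).
misread = unquote(latin1_url, encoding="utf-8", errors="replace")

# Reading it with the encoding it was written in restores the word.
correct = unquote(latin1_url, encoding="latin-1")

print(latin1_url, utf8_url, misread, correct)
```

The misread string is exactly the "antolog�a" form seen on the tag info page: a valid Latin-1 byte interpreted as broken UTF-8.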

Any site should have a strategy for how to encode information, and its web pages should state which encoding is used. This ensures that all users and visitors see the same information.

WP also has a character normalization function implemented. It is active both on preview and on save. This also facilitates searching for homoglyph text strings.
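As a rough illustration of what character normalization does, here is a sketch using Python's standard `unicodedata` module. The choice of normalization form NFC and the helper name `normalize_tag` are assumptions made for this example, not a description of WP's or LT's code.

```python
import unicodedata


def normalize_tag(tag: str) -> str:
    """Trim whitespace and bring the tag into Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", tag.strip())


# The same visible word, stored in two different ways:
composed = "l\u00fcbeck"      # "ü" as one precomposed code point
decomposed = "lu\u0308beck"   # "u" followed by a combining diaeresis

# Before normalization the strings differ; afterwards they compare equal.
print(composed == decomposed)
print(normalize_tag(composed) == normalize_tag(decomposed))
```

Applying the same function on preview and on save would make both spellings resolve to one tag. Normalization cannot repair a tag that already contains U+FFFD, because the original character is lost at that point.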

I wonder how fixing / migration can be managed at LT. For tags, a maintenance page available to every user could list "abnormal" tags. I am thinking of a multi-column list with
a) old variants
b) input fields for corrected / desired tags
c) check box if the corrected / desired tags should still belong to a set of combined tags if applicable

17bnielsen
Aug 6, 2010, 5:56 am

#16: Nice detective work. You could add the problems with importing from various library sources where the character coding is not stated explicitly and might not even be the same for every book in the source.
Plus html-entities embedded in the data. Etc. Etc.
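One common way to cope with sources that do not state their encoding is to try strict UTF-8 first and fall back to a single-byte encoding, then resolve embedded HTML entities. The sketch below assumes Windows-1252 as the fallback; both that choice and the function name `decode_record` are illustrative assumptions, and a real import would need per-source rules.

```python
import html


def decode_record(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes of unknown encoding and unescape HTML entities."""
    try:
        # Strict UTF-8 rejects most text written in single-byte encodings.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; bytes undefined in it become U+FFFD.
        text = raw.decode(fallback, errors="replace")
    # Turn entities such as &uuml; or &#237; into real characters.
    return html.unescape(text)
```

With this, `"bürgertum"` arrives intact whether the source sent UTF-8 bytes, Latin-1 bytes, or `b&uuml;rgertum`. The heuristic can still guess wrong, for example when a source uses a different single-byte code page.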

Don't get me started on character sets. I grew up with ascii-8-in-12 on a CDC mainframe :-) Google that at your own risk.

18gangleri
Aug 11, 2010, 7:42 pm

>17 bnielsen:: Thanks for the reply. There are many approaches when implementing a system.

There are worst-case scenarios where every field in the database may have its own encoding. This might be the case at some of the book sources used for importing into LT.
There are approaches which follow the KISS principle more or less. Character normalization and proper parametrization in the page's "head" section would ensure correct display in specification-compliant browsers.

from a WP page:
<head>
<title>KISS principle - Wikipedia, the free encyclopedia</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
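Whether a page declares its encoding, as the WP excerpt does, can be checked with a short script. The following is a sketch based on Python's standard `html.parser`; the class and function names are hypothetical. It only looks at `meta` elements and ignores the HTTP `Content-Type` header, which a browser also consults.

```python
from html.parser import HTMLParser
from typing import Optional


class CharsetFinder(HTMLParser):
    """Remember the first charset declared in a meta element."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # Short form: <meta charset="...">
            self.charset = attr["charset"].strip().lower()
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html; charset=..."
            for part in (attr.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value.strip().lower()


def declared_charset(page: str) -> Optional[str]:
    """Return the charset a page declares in its markup, or None."""
    finder = CharsetFinder()
    finder.feed(page)
    return finder.charset
```

Fed the `meta` line quoted from the WP page, `declared_charset` returns `"utf-8"`; for a page without any declaration it returns `None`, which is the situation in which a browser has to guess.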

references:
http://www.librarything.com/topic/83076 Topic: add « preview » mode (button) to addbooks
http://www.librarything.com/topic/81448 Topic: perform Unicode normalization on tags
Please note other topics related to Unicode normalization.