tag combination/separation and logs are showing Unicode Character 'REPLACEMENT CHARACTER'
1gangleri
Hi!
Please read Wikipedia:Replacement character first.
� is the Unicode Character 'REPLACEMENT CHARACTER' - U+FFFD
http://www.fileformat.info/info/unicode/char/fffd/index.htm
HTML Entity (decimal) & + #65533;
HTML Entity (hex) & + #xfffd;
UTF-8 (hex) 0xEF 0xBF 0xBD (efbfbd)
This character indicates a problem.
Helpers log: Tags (7 days) shows a lot of them.
The character can also be found at Tag info: b�rgertum, Tag info: l�beck etc. However, it seems that these tags have never been used.
I do not think it makes much sense to show such tag combination proposals at Tag info: bürgertum, Tag info: lübeck.
I would suggest implementing a filter function to avoid such situations. (Maybe one should filter URLs for %EF%BF%B .)
Best regards Reinhardt
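A minimal sketch of what such a filter could look like, in Python. The function names are hypothetical and nothing here reflects how LT actually stores tags or builds its proposal lists; it only shows the check itself, using the UTF-8 bytes EF BF BD listed above.

```python
from urllib.parse import unquote

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD; its UTF-8 bytes are EF BF BD


def tag_is_damaged(tag: str) -> bool:
    """Return True if the tag text contains U+FFFD."""
    return REPLACEMENT_CHAR in tag


def url_is_damaged(url: str) -> bool:
    """Return True if the URL carries U+FFFD (literally or as %EF%BF%BD),
    or percent-encoded bytes that are not valid UTF-8."""
    # errors="replace" turns every undecodable byte into U+FFFD,
    # so both cases end up as the same character after decoding.
    decoded = unquote(url, encoding="utf-8", errors="replace")
    return REPLACEMENT_CHAR in decoded


def filter_proposals(tags):
    """Drop damaged tags from a list of combination proposals (hypothetical helper)."""
    return [tag for tag in tags if not tag_is_damaged(tag)]
```

For example, `filter_proposals(["bürgertum", "b\ufffdrgertum"])` keeps only the first tag, and `url_is_damaged` flags a URL ending in `b%EF%BF%BDrgertum` but not one ending in `b%C3%BCrgertum`.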
2Tallulah_Rose
Thanks for pointing me towards this. I found these tags today on the page of Buddenbrooks (where they were used, so I don't know why it shows up differently here...), and as I figured it was just an error with the character 'ü', it seemed to me they should be combined.
However, if the tag info pages say they have never been used, and yet they are to be found on the book page of Buddenbrooks (they are still there, I've checked), I think there must be another problem as well.
3gangleri
It might be (or have been) a sporadic compatibility issue. One should differentiate between how a tag is displayed on one's own screen, which may depend on the operating system and browser, and how it is actually encoded in LT.
Copy and paste (from one's own computer to LT) might be a source of new faults.
I suggest examining the behaviour on other PCs (at work, in libraries etc.).
I cannot see any 'REPLACEMENT CHARACTER' at Buddenbrooks. Regards Reinhardt
4Tallulah_Rose
>3 gangleri:: I cannot see any 'REPLACEMENT CHARACTER' at Buddenbrooks. Regards Reinhardt
Well, that's weird, because I can see both kinds of tag: the one with the 'REPLACEMENT CHARACTER' and the one with the proper character (ü in this case).
I would be interested in how other members perceive this problem.
5bnielsen
I followed the Buddenbrooks link in #2 and didn't see any 'REPLACEMENT CHARACTER'.
Could you be a bit more specific about where you see it?
6gangleri
I normally work on different PCs. I prefer to use Firefox.
I am not familiar with the newest web specifications / implementations / bugs.
In recent years I have visited many pages without proper UTF-8 support. There I either see garbage characters in place of German, Icelandic etc. diacritics, or characters are missing altogether.
Here is how to set UTF-8 mode in FF properly (menu labels from the German version, English in parentheses):
Firefox > Bearbeiten (Edit) > Einstellungen (Preferences) > Inhalt (Content) > Schriftarten & Farben (Fonts & Colors; click "Erweitert" / "Advanced" there) > Zeichenkodierung (Character Encoding): please select Unicode (UTF-8)
Good luck! Reinhardt
7bnielsen
It is really just a workaround to set this in the browser. The web pages are supposed to tell which charset they are in.
8gangleri
>7 bnielsen:: I know. If you are familiar with the specs, please take a look at the page's source code; if necessary, please open a new topic.
9MikeBriggs
I see the tags: b�rgertum, l�beck on Buddenbrooks. I had to click 'show all' to see them.
10Tallulah_Rose
#9 So you see some which I don't :)
#5 You have to click 'show all' to see them, if you see them at all. Maybe it is just because of my browser.
Back to the problem: if it really is just because of my browser settings, then maybe we really should implement such a filter function to avoid such proposals. What do the others think?
11gangleri
>Tallulah_Rose: Maybe it is a good idea to let us know what operating system and what browser version you are using.
When I go to Firefox > Hilfe > Über Mozilla Firefox (Help > About Mozilla Firefox), I can copy the string:
Mozilla/5.0 (X11; U; Linux i686; de; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
The information about the operating system:
Ubuntu 7.10 (a Linux-based OS; not the newest)
can be found by looking at the system information.
12Tallulah_Rose
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7 ( .NET CLR 3.5.30729)
system: Windows XP
13gangleri
>12 Tallulah_Rose:: Thanks!
FYI: I found http://www.librarything.com/tag/antolog%EDa&norefer=1 showing
Tag info: antolog�a
Includes: antologia, antolog�a, Antologia, antol�gia, Antol�gia, ANTOLOGIA
Some conclusions:
a) there are many such tags in the system
b) &norefer=1 does not restrict the page to a single variant
c) there is a further level of combination, because http://www.librarything.com/tag/anthology shows
Tag info: anthology
Includes: anthology, Anthology, anthologies, Anthologie, bloemlezing, antologi, anthologie, Anthologies, Bloemlezing, antologia, Antologi, antología, antology, antologio, Antología, Antology, ANTHOLOGY, anothology, format - anthology, ANTHOLOGIES, Anthology., |Anthology, Antologia, #anthology, anthololgy, antolog�a — 25 more, .anthology, _anthology, F:anthology, Antológia, Anthrologies, Antol�gia, anthology;, ANthologies, .Anthology, f:anthology, anthology, Anthololgy, |anthology, ANTOLOGIA, Antologio, Anothology, antológia, anthrologies, anthology., ANthology, anthology, (Anthology), (anthology), antol�gia, _Anthology
So LT has to deal with multiple levels of tag combination. I was not aware of this. I only knew of some bugs related to multiple levels of author combinations.
14gangleri
>13 gangleri:: http://www.librarything.com/tag/poetry shows
Tag info: poetry
Includes: poetry, Poetry, poesia, poëzie, poésie, poesi, Poëzie, poezie, poesía, Poesie, POETRY, poesie, Poezie, @poetry, poezio, Poesia, Poetry., poerty, a:poetry, Poésie, poety, Poesi, Poesía, genre - poetry, POESIA, g:poetry — 47 more, Po�zie, P (Poetry), g:Poetry, Poerty, Poety, form:poetry, Poetry*, poetry., %-poetry, Poezja, #poetry, Category: Poetry, poèsie, POESI, Poes�a, Poezio, poetrty, p (poetry), Genre - Poetry, . POETRY, A:poetry, .poetry, poezja, po�zie, poeotry, poemtry, POESIE, poetry;, poes�a, #Poetry, poes�a, poetry*, category: poetry, POesi, Poetry, POetry., poetry, @Poetry, .Poetry, poetry, Poèsie, po�sie, pOETRY, Poeotry, ;poetry, POetry, Po�sie
15bnielsen
Thanks. Now I also see them. I'm not sure if it is a bug or a feature. (Although in the best of all worlds LT would keep them from being entered into the system).
Surely there are lots of other typos in the list of tags combined into, say, poetry?
16gangleri
Again: character encoding has changed a lot during the last 30 years. I remember the times of IBM's EBCDIC code, 7-bit ASCII, the special national 8-bit ASCII variants etc.
Looking at the link from >13 gangleri:: ( http://www.librarything.com/tag/antolog%EDa&norefer=1 ) one can see that a single-octet character encoding is in use. Probably the character should be "í" ( 'LATIN SMALL LETTER I WITH ACUTE' ).
Unicode encodes "í" as
Unicode Character 'LATIN SMALL LETTER I WITH ACUTE' (U+00ED)
Encodings:
HTML Entity (decimal) & + #237;
HTML Entity (hex) & + #xed;
HTML Entity (named) & + iacute;
UTF-8 (hex) 0xC3 0xAD (c3ad) - (this is %C3%AD ) as used at
http://es.wikipedia.org/wiki/Antolog%C3%ADa Antolog%C3%ADa for Antología
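The difference between %ED and %C3%AD can be reproduced with a few lines of Python. This is only a sketch: the function name is made up, and reading the single octet as ISO 8859-1 (Latin-1) is a guess about where the tag came from, not something the LT pages confirm.

```python
from urllib.parse import quote, unquote_to_bytes


def decode_both_ways(percent_encoded: str):
    """Decode a percent-encoded tag as UTF-8 and, alternatively, as Latin-1."""
    raw = unquote_to_bytes(percent_encoded)
    # A lone 0xED byte is not valid UTF-8, so it becomes U+FFFD here.
    as_utf8 = raw.decode("utf-8", errors="replace")
    # Latin-1 maps every byte to a character, so this never fails;
    # whether the result is the intended text is a separate question.
    as_latin1 = raw.decode("latin-1")
    return as_utf8, as_latin1


as_utf8, as_latin1 = decode_both_ways("antolog%EDa")
print(repr(as_utf8))     # shows the tag with U+FFFD in place of the accented letter
print(as_latin1)         # the tag read as Latin-1
print(quote(as_latin1))  # the same tag percent-encoded as UTF-8
```

Read as UTF-8, the tag comes out with U+FFFD, which matches what the tag info page shows; read as Latin-1, it comes out as "antología", and re-encoding that as UTF-8 gives the %C3%AD form used in the Wikipedia URL.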
Any site should have a strategy for how to encode information, and its web pages should state which encoding is used. This ensures that all users and visitors see the same information.
WP also has a character normalization function implemented. It is active both on preview and on save. This also facilitates searching for homoglyph text strings.
I wonder how fixing / migration could be managed in LT. A tag maintenance page, available to every user, could list "abnormal" tags. I am thinking of a multi-column list with
a) old variants
b) input fields for corrected / desired tags
c) a check box indicating whether the corrected / desired tag should still belong to a set of combined tags, if applicable
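As a rough illustration of the list described above, here is a Python sketch that builds the three columns for a user's tags. Everything in it is an assumption: the function names are invented, "abnormal" is taken to mean "contains U+FFFD or is not in Unicode NFC form", and the check box simply defaults to ticked.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD


def normalize_tag(tag: str) -> str:
    """Return the tag in Unicode Normalization Form C (composed characters)."""
    return unicodedata.normalize("NFC", tag)


def abnormal_tags(tags):
    """Build rows of (old variant, suggested correction, keep in combination).

    Tags containing U+FFFD get an empty suggestion, because the original
    character cannot be recovered from the tag text itself; the user would
    have to type the correction.
    """
    rows = []
    for tag in tags:
        if REPLACEMENT_CHAR in tag:
            rows.append((tag, "", True))
        elif normalize_tag(tag) != tag:
            rows.append((tag, normalize_tag(tag), True))
    return rows
```

With this, a tag such as "poes�a" would be listed with an empty input field, a tag typed with a separate combining accent would be listed with its composed form as the suggestion, and a clean tag such as "poetry" would not be listed at all.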
17bnielsen
#16: Nice detective work. You could add the problems with importing from various library sources where the character coding is not stated explicitly and might not even be the same for every book in the source.
Plus html-entities embedded in the data. Etc. Etc.
Don't get me started on character sets. I grew up with ascii-8-in-12 on a CDC mainframe :-) Google that at your own risk.
18gangleri
>17 bnielsen:: Thanks for the reply. There are many approaches when implementing a system.
There are worst case scenarios where every field in the database may have its own encoding. This might happen at some of the book sources used to import to LT.
There are approaches which follow the KISS principle more or less. Character normalization and a proper declaration in the page's "head" section would ensure correct display in specification-compliant browsers.
from a WP page:
<head>
<title>KISS principle - Wikipedia, the free encyclopedia</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
references:
http://www.librarything.com/topic/83076 Topic: add « preview » mode (button) to addbooks
http://www.librarything.com/topic/81448 Topic: perform Unicode normalization on tags
Please note other topics related to Unicode normalization.
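For anyone who wants to check what a page declares, the charset can be pulled out of a Content-Type value like the one in the meta element quoted from the WP page. The following is a small Python sketch using only the standard library; the function name is hypothetical, and it looks at the declaration only, not at whether the page's bytes really are in that encoding.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type_value: str) -> Optional[str]:
    """Return the charset named in a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    # get_content_charset() reads the "charset" parameter of the header
    # and returns None when the parameter is absent.
    return msg.get_content_charset()


print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

The first call reports utf-8; the second reports None, which is the situation where a browser has to guess and users end up setting the encoding by hand.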

