Opened 9 years ago
Closed 3 years ago
#170 closed defect (invalid)
Handling of codepages in ported GNU software
Reported by: | ak120 | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | rpm | Version: | |
Severity: | high | Keywords: | |
Cc: |
Description
Perhaps it's only a configuration issue, but I could not find any documentation dealing with it.
Example output should be:
C:\usr\bin>csplit.exe --help
Aufruf: C:\usr\bin\csplit.exe [OPTION]... DATEI MUSTER...
Teile der DATEI getrennt durch MUSTER in die Dateien "xx01", "xx02"
Please compare this with the attached image, which shows the 7F character instead!
C:\usr\share\locale\locale.alias (encoded in UTF-8) tells me
german de_DE.ISO-8859-1
that's quite wrong, because almost everybody is using IBM850. Do I have to change my codepage to IBM 819 or 923 (for €) to have it displayed correctly?
Attachments (1)
Change History (8)
by , 9 years ago
Attachment: | csplit.png added |
---|
comment:1 by , 9 years ago
comment:3 by , 8 years ago
The characters in question appear to be typographic quotes “”.
They can't be displayed in codepage 850, or any strict implementation of ISO-8859-x, because those codepages don't contain those characters. (They are contained in codepage 1252, which is Microsoft's extension of Latin-1, and also in 1004.)
The 'house' characters are substituted in by the Unicode conversion routines (iconv or ULS) as a placeholder when converting the string to the target codepage. This will happen with any character that doesn't exist in the current display codepage.
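A quick way to check the claim about which codepages contain the typographic quotes is sketched below. It uses Python's built-in codecs rather than iconv or ULS, so it only illustrates the principle: the helper name `encodable` is made up for this example, and Python's `replace` error handler substitutes `?` where the OS/2 conversion shows the 0x7F 'house' character.

```python
def encodable(ch, codec):
    """Return True if the character exists in the given codepage."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# U+201C and U+201D, the typographic double quotes
QUOTES = "\u201c\u201d"

for codec in ("cp850", "iso8859-1", "cp1252"):
    ok = all(encodable(ch, codec) for ch in QUOTES)
    print(codec, "contains the quotes" if ok else "lacks the quotes")

# With a substitution policy, every missing character becomes a placeholder.
# Python uses "?" here; the placeholder byte itself is implementation-specific.
print("\u201cxx01\u201d".encode("cp850", errors="replace"))
```

Codepage 1004 is not among Python's standard codecs, so it is left out of the loop.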
When I implemented UTF-8 support in PMMail, I added a workaround to replace both typographic quotes (U+201C and U+201D) with the dumb quote (") when the target codepage didn't contain the real things. Similarly, the typographic single quotes (U+2018 & U+2019) are replaced with the dumb apostrophe ', and the en and em dashes (U+2013 & U+2014) are replaced with a hyphen. This should only be done when converting to a codepage which doesn't contain the 'real' versions of these characters, obviously.
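The workaround described above could look roughly like the following. This is a minimal sketch and not PMMail's actual code: the names `FALLBACKS` and `encode_with_fallbacks` are hypothetical, and Python's codecs stand in for the real conversion routines. A fallback is applied only when the target codepage cannot represent the character.

```python
# ASCII stand-ins for the characters named in the comment above.
FALLBACKS = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}

def encode_with_fallbacks(text, codec):
    """Encode text, swapping in ASCII fallbacks only where the codec needs them."""
    out = []
    for ch in text:
        if ch in FALLBACKS:
            try:
                ch.encode(codec)       # the target codepage has the real character
            except UnicodeEncodeError:
                ch = FALLBACKS[ch]     # it does not, so use the plain ASCII form
        out.append(ch)
    # Anything else that is missing still gets the codec's generic placeholder.
    return "".join(out).encode(codec, errors="replace")

print(encode_with_fallbacks("\u201cxx01\u201d", "cp850"))
print(encode_with_fallbacks("\u201cxx01\u201d", "cp1252"))
```

With codepage 850 as the target the quotes come out as plain `"`, while with codepage 1252 they are kept as the real typographic characters.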
To the reporter: it's an awkward workaround, but using codepage 1004 should allow them to be displayed. (I think 1004 can only be set as secondary codepage, however, so this would probably require use of CHCP before running the program.)
follow-up: 6 comment:4 by , 8 years ago
Usually gettext should convert the messages from *.mo to the output character set. So either the ported program itself is at fault for not using a correct output charset, or something is wrong with gettext. It could easily be avoided by using a correct encoding in the *.po files from which the *.mo files are created. Additionally, the whole structure under /usr/share/locale is quite useless: it should distinguish between ISO-8859-1, IBM850 and UTF-8.
With the CP1004 workaround some filenames are not displayed correctly; worst of all is the replacement of the German "ä" by U+201C.
comment:5 by , 8 years ago
Replying to ydario:
I think this happens because files are encoded with UTF-8.
It also happens with ISO encoded files.
comment:6 by , 8 years ago
Replying to ak120:
Usually gettext should convert the messages from *.mo to the output character set.
Feel free to send patches for gettext
comment:7 by , 3 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
There is a separate gettext ticket for this issue, so closing this one for now.
This does need some attention, as it's often wrong for CP866 as well.