Opened 4 years ago

Last modified 3 years ago

#170 new defect

Handling of codepages in ported GNU software

Reported by: ak120 Owned by:
Priority: major Milestone:
Component: rpm Version:
Severity: high Keywords:


Perhaps it's only a configuration issue, but I could not find any documentation dealing with it.

Example output should be:

C:\usr\bin>csplit.exe --help
Aufruf: C:\usr\bin\csplit.exe [OPTION]... DATEI MUSTER...
Teile der DATEI getrennt durch MUSTER in die Dateien "xx01", "xx02"

Please compare this with the attached image, which shows the 0x7F character instead of the quotation marks!

C:\usr\share\locale\locale.alias (encoded in UTF-8) tells me
german de_DE.ISO-8859-1
That seems quite wrong, because almost everybody is using IBM850. Do I have to change my codepage to IBM 819 or 923 (for €) to have it displayed correctly?

Attachments (1)

csplit.png (2.0 KB) - added by ak120 4 years ago.

Change History (7)

Changed 4 years ago by ak120

Attachment: csplit.png added

comment:1 Changed 4 years ago by dmik

This does need some attention, as it's often wrong for CP866 as well.

comment:2 Changed 4 years ago by Yuri Dario

I think this happens because files are encoded with UTF-8.

comment:3 Changed 4 years ago by Alex Taylor

The characters in question appear to be typographic quotes “”.

They can't be displayed in codepage 850, or any strict implementation of ISO-8859-x, because those codepages don't contain those characters. (They are contained in codepage 1252, which is Microsoft's extension of Latin-1, and also in 1004.)
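For anyone who wants to check this, here is a minimal Python sketch. It assumes that Python's bundled `cp850`, `iso8859-1`, `iso8859-15` and `cp1252` codecs match the OS/2 tables closely enough for these two characters; codepage 1004 is not shipped with Python, so it is left out. The helper name `can_encode` is made up for illustration.

```python
# The two typographic double quotes, U+201C and U+201D.
QUOTES = "\u201c\u201d"


def can_encode(text, codepage):
    """Return True if every character of text exists in the given codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Report, per codepage, whether both quotes can be represented.
for cp in ("cp850", "iso8859-1", "iso8859-15", "cp1252"):
    print(cp, can_encode(QUOTES, cp))
```

Only `cp1252` should report `True` here, which matches the observation that the quotes cannot be shown under 850 or a strict ISO-8859-x.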

The 'house' characters are substituted in by the Unicode conversion routines (iconv or ULS) as a placeholder when converting the string to the target codepage. This will happen with any character that doesn't exist in the current display codepage.

When I implemented UTF-8 support in PMMail, I added a workaround to replace both typographic quotes (U+201C and U+201D) with the dumb quote (") when the target codepage didn't contain the real things. Similarly, the typographic single quotes (U+2018 & U+2019) are replaced with the dumb apostrophe ', and the en and em dashes (U+2013 & U+2014) are replaced with a hyphen. This should only be done when converting to a codepage which doesn't contain the 'real' versions of these characters, obviously.
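The replacement described above could be sketched roughly as follows. This is not the PMMail code, just an illustration in Python; `FALLBACKS` and `encode_with_fallbacks` are hypothetical names, and the final `errors="replace"` (which emits `?` for anything else that cannot be encoded) is an assumption of this sketch, not something PMMail is known to do.

```python
# ASCII stand-ins for the typographic characters named above.
FALLBACKS = {
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def encode_with_fallbacks(text, codepage):
    """Encode text, substituting a fallback only where the target
    codepage does not contain the real character."""
    out = []
    for ch in text:
        try:
            ch.encode(codepage)
            out.append(ch)          # codepage has the real character
        except UnicodeEncodeError:
            out.append(FALLBACKS.get(ch, ch))
    # Assumption: anything still unencodable becomes '?'.
    return "".join(out).encode(codepage, errors="replace")


print(encode_with_fallbacks("\u201cxx01\u201d", "cp850"))
print(encode_with_fallbacks("\u201cxx01\u201d", "cp1252"))
```

With `cp850` the quotes degrade to plain `"`, while with `cp1252` the real characters are kept, so the substitution only happens where it is needed.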

To the reporter: it's an awkward workaround, but using codepage 1004 should allow them to be displayed. (I think 1004 can only be set as secondary codepage, however, so this would probably require use of CHCP before running the program.)

comment:4 Changed 3 years ago by ak120

Usually gettext should convert the messages from *.mo to the output character set. So either the ported program itself is not using a correct output charset, or there is something wrong with gettext. The problem could easily be avoided by using a correct encoding in the *.po files from which the *.mo files are created. Additionally, the whole structure under /usr/share/locale is of little use as it stands: it should distinguish between ISO-8859-1, IBM850 and UTF-8.
With the CP1004 workaround some filenames are not displayed correctly; the worst case is the replacement of German "ä" by U+201C.

comment:5 in reply to:  2 Changed 3 years ago by ak120

Replying to ydario:

I think this happens because files are encoded with UTF-8.

It also happens with ISO encoded files.

comment:6 in reply to:  4 Changed 3 years ago by Silvan Scherrer

Replying to ak120:

Usually gettext should convert the messages from *.mo to the output character set.

Feel free to send patches for gettext.

Last edited 3 years ago by Silvan Scherrer