What is the best practice for creating a file name, when running in the C locale, from data that contains a non-ASCII Unicode character?
For example, the data for the file name is 9 She\342\200\231s My Girl.mp3 (where each \NNN is a single non-ASCII byte in octal). Note that 342 200 231, which is E2 80 99 in hex, is the UTF-8 encoded form of U+2019 RIGHT SINGLE QUOTATION MARK.
What file name should be created? The code in question uses wcsrtombs, which stops with a failure when it hits the first of the non-ASCII bytes. The code then terminates the string with a null byte and ends up creating the file name "9 She". Is this the best-practice behavior, or should the code create a file name with exactly the same bytes as the supplied data? Or perhaps something else?
This question came up when I used the free (Linux) version of the unrar utility in the C locale (export LC_ALL=C) on the public domain songs of Tom Lehrer.
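
To make the "something else" option concrete, here is a minimal sketch of one possibility: convert one wide character at a time with wcrtomb and substitute a placeholder for anything the locale cannot represent, instead of truncating at the first failure. The helper name wide_to_name and the choice of placeholder are my own assumptions for illustration; this is not what unrar does, and I am not claiming it is the best practice.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not taken from unrar: converts src to the current
   locale's multibyte encoding one wide character at a time. A character
   that wcrtomb() rejects is replaced by `fallback` instead of ending the
   conversion. Returns the number of replaced characters, or -1 if dst is
   too small (dst then holds the part that did fit). */
static int wide_to_name(const wchar_t *src, char *dst, size_t dstsize,
                        char fallback)
{
    char buf[MB_LEN_MAX];
    size_t used = 0;
    int replaced = 0;
    mbstate_t state;

    assert(dst != NULL && dstsize > 0);
    memset(&state, 0, sizeof state);

    for (; *src != L'\0'; src++) {
        size_t n = wcrtomb(buf, *src, &state);
        if (n == (size_t)-1) {
            /* Not representable in this locale (errno is EILSEQ). */
            buf[0] = fallback;
            n = 1;
            replaced++;
            /* The conversion state is unspecified after an error. */
            memset(&state, 0, sizeof state);
        }
        if (used + n >= dstsize) {
            dst[used] = '\0';   /* keep the partial result terminated */
            return -1;
        }
        memcpy(dst + used, buf, n);
        used += n;
    }
    dst[used] = '\0';
    return replaced;
}
```

In the C locale (the default at program startup, and what LC_ALL=C selects), calling wide_to_name(L"9 She\u2019s My Girl.mp3", out, sizeof out, '_') should give "9 She_s My Girl.mp3" with a return value of 1, rather than the truncated "9 She". The trade-off is that the original bytes are lost, so two different source names can collide.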