Comments on: Text code-page mapping

I recognize that many developers need to be able to input and output older text strings that are not Unicode UTF-8 encoded.

R3.A100 will provide a codec for "arbitrary" text mapping. The default map will be UTF-8, but we'll also support LATIN1 (the first 256 code points of Unicode, also known as ISO-8859-1). If you need some other mapping, such as Windows-1252 or ISO-8859-15, you can specify it by providing your own map (which is simple to create).

In order to specify 8-bit code-page maps, we will need to allow additional inputs to codecs. My current design direction is to allow the codec media argument to be either a word or a block. For example:

text: decode 'text bin  ; default conversion as UTF-8

text: decode [text latin1] bin

text: decode [text 16] bin ; UTF-16 BE

text: decode [text -16] bin ; UTF-16 LE

text: decode reduce ['text char-map] bin

Here, char-map is a reference to a string that maps each byte value to a character (Unicode code point). The map is simple to set up, but note that each byte is a zero-based index into the string: byte value 0 selects the first char, and byte value 255 selects the last.

For example, latin1 is created this way:

char-map: make string! 256
repeat n 256 [append char-map to-char n - 1]
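The same construction can be illustrated outside REBOL. The following is a minimal Python sketch, not R3 code and not the codec's actual implementation: it builds the Latin-1 map as a 256-character string, derives an ISO-8859-15 map by overriding the eight positions where that standard differs from Latin-1, and decodes by using each byte as a zero-based index. The names make_map and decode_with_map are made up for this illustration.

```python
# Illustrative sketch only: a Python stand-in for the char-map idea.

# Latin-1 map: position b holds the character for byte value b (zero-based).
latin1_map = "".join(chr(b) for b in range(256))

# ISO-8859-15 differs from Latin-1 at exactly these eight byte values.
ISO_8859_15_OVERRIDES = {
    0xA4: "\u20ac",  # euro sign
    0xA6: "\u0160",  # S with caron
    0xA8: "\u0161",  # s with caron
    0xB4: "\u017d",  # Z with caron
    0xB8: "\u017e",  # z with caron
    0xBC: "\u0152",  # OE ligature
    0xBD: "\u0153",  # oe ligature
    0xBE: "\u0178",  # Y with diaeresis
}


def make_map(base, overrides):
    """Return a copy of base with the given byte positions replaced."""
    chars = list(base)
    for byte, ch in overrides.items():
        chars[byte] = ch
    return "".join(chars)


def decode_with_map(data, char_map):
    """Decode bytes by using each byte as a zero-based index into char_map."""
    return "".join(char_map[b] for b in data)


iso_8859_15_map = make_map(latin1_map, ISO_8859_15_OVERRIDES)

# Byte 0xA4 is the currency sign in Latin-1 but the euro sign in ISO-8859-15.
price = decode_with_map(b"\xa4400.00", iso_8859_15_map)
```

Here price holds the euro sign followed by 400.00, while decoding the same bytes with latin1_map would give the generic currency sign instead.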

In some cases you might load the map from a text file (which is itself encoded in UTF-8):

char-map: read/string %iso-8859-15
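As a rough illustration of the file-based approach, here is a Python sketch (not REBOL; the file name and the helper names are hypothetical). It assumes the map file contains exactly the 256 map characters as UTF-8 and nothing else, which the article does not actually specify. One practical detail: the map includes CR and LF entries, so the file has to be read and written without newline translation.

```python
import os
import tempfile

# Latin-1 map used as sample data: position b holds the char for byte b.
latin1_map = "".join(chr(b) for b in range(256))


def save_map(path, char_map):
    """Write the map as UTF-8; newline="" keeps CR and LF entries untouched."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(char_map)


def load_map(path):
    """Read a UTF-8 map file and check that it has one char per byte value."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        char_map = f.read()
    if len(char_map) != 256:
        raise ValueError(f"expected 256 characters, got {len(char_map)}")
    return char_map


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1.map")  # hypothetical file name
    save_map(path, latin1_map)
    loaded = load_map(path)
```

The length check catches the most likely mistake, a file with a trailing newline or a missing entry, before the map is used for any conversion.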

The alternative would be to add a refinement to decode to specify the map, but I like that less because it splits up the "spec" of the decoding. Thoughts on that?

The encoding method would reflect the same approach:

bin: encode 'text text  ; default conversion as UTF-8

bin: encode [text latin1] text

bin: encode [text 16] text ; UTF-16 BE

bin: encode [text -16] text ; UTF-16 LE

bin: encode reduce ['text char-map] text

Note that char-map can be either a string or a binary here. The same string used with decode will work, but an internal binary map must then be temporarily allocated for the conversion. If you're converting a lot of strings, make the map a binary so that chars are mapped directly during the conversion.
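One way to picture the difference between the two map forms is the Python sketch below. It is an illustration under assumptions, not a description of R3 internals: the article does not say how the internal binary map is laid out, so a dictionary stands in for it, and raising an error for unmappable characters is just one possible policy. The point is only that the reverse lookup is built once and reused, instead of being rebuilt for every string.

```python
def build_encode_table(char_map):
    """Invert a byte -> char decode map into a char -> byte lookup."""
    table = {}
    for byte, ch in enumerate(char_map):
        # If a char appears more than once, keep the lowest byte value.
        table.setdefault(ch, byte)
    return table


def encode_with_table(text, table):
    """Encode text one char at a time using a prebuilt lookup table."""
    out = bytearray()
    for ch in text:
        byte = table.get(ch)
        if byte is None:
            # Assumed policy for this sketch: refuse chars outside the map.
            raise ValueError(f"cannot encode character {ch!r}")
        out.append(byte)
    return bytes(out)


latin1_map = "".join(chr(b) for b in range(256))

# Build the reverse table once, then reuse it for many strings.
latin1_table = build_encode_table(latin1_map)
encoded = [encode_with_table(s, latin1_table) for s in ("caf\u00e9", "na\u00efve")]
```

Passing the decode string on every call would correspond to calling build_encode_table inside the loop, which repeats the same work for each string.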

Since this change will be part of A100, please post your comments soon.

17 Comments

Comments:

Oldes
18-May-2010 17:40:02 Could you provide an example of what the char-map looks like?
Carl Sassenrath
18-May-2010 18:18:24 Ok, I'll add it above, in the article.
Jerry Tsai
18-May-2010 22:16:08 Since most OSes have built-in string codecs, why don't we just call the related APIs to do the string encoding/decoding for us? That would be very quick. Besides, by doing that, we wouldn't need to offer our own char-maps.
Jerry Tsai
18-May-2010 22:22:32 In the CJK (Chinese, Japanese, Korean) area, we use variable-length encodings a lot (for example, Big-5 is the de facto standard in Taiwan). Variable-length encoding means that some chars are one byte and the others are two bytes. I don't think the method you mention here can handle the variable-length encodings used in the CJK area.
-pekr-
19-May-2010 1:36:07 read/string what? :-) I can see that /string and /lines were silently added without a final resolution being posted to the "pruning down the read/write" topic. So, have we failed to keep read/write as low-level functions (so that removing read/write-io was kind of premature), or is this another example of how life goes, providing "most-wanted" functionality even if not in a "pure" way?
As for maps, I am not sure there is much to say. But maybe we could get a brief introduction to encode/decode. I can see that those are mezzanines that internally call the 'do-codec function. Could more be said about that function? E.g., a simple example of how to create our own codecs? (Streamed, of course, not just load-everything-in-memory-first :-)
Peter Wood
19-May-2010 2:18:36 How does encode handle characters that are not in the target encoding?
For example,
my-string: "€400.00"
my-latin1-string: decode [my-string latin1] bin
What would the first character of my-latin1-string be?
DideC
19-May-2010 12:07:14 Yeah, sorry, but it's unclear how this can work for any codepage, especially the variable-length encodings.
Plus, I see here a switch in the argument order usually found in REBOL functions. I thought the string!/binary! value would be the first arg and the encoding spec the second one, i.e. like enbase/debase.
Anton Rolls
20-May-2010 9:36 Dispense with comments:
text: decode 'text/UTF-8 bin
text: decode 'text/latin1 bin
text: decode 'text/UTF-16/BE bin
text: decode 'text/UTF-16/LE bin
text: decode 'text/:char-map bin
(I'm happy with this argument order.)
Carl Sassenrath
20-May-2010 18:53:59 Some replies:
Use OS API: How do you call it in REBOL? That's what codecs do, regardless of the implementation (OS API or not). However, there's also the issue of portability, and whether all OSes support the same standards. Unlikely.
UTF-8 is multibyte: This is the variable-length standard for R3. I am not anticipating support for other multibyte string encodings (are there many?)
READ refinements: Pekr, we had this discussion before, right? READ and WRITE are still pure. The /string refinement was added for easier file reads, but we could add a mezz function for that (e.g. READ-TEXT, or whatever). Let's discuss it. The /lines is not really needed.
Uncodable chars: What's better: 1) throw an error? or, 2) emit a substitution char? Either is ok with me.
REBOL arg order is defined this way: First arg is the most "abstractly" important (so therefore the encoding name), second arg is second most important, etc. Look at READ and WRITE. Of course, there are always "ties" and exceptions (e.g. SWITCH).
Paths for specifying encoding: Right, we talked about this before? Thanks for posting it. (Note, impl is same as block specs.)
I should finish by saying that char encodings aren't used much in USA, so if we're going to get this right in general, I need your ideas, code, testing, etc.
Andreas
20-May-2010 19:30:41 Re "Uncodable chars":
Definitely the "throw an error" option is better. Silent munging and corruption of text is an extreme nuisance. If you try to de/encode something with a codec which simply cannot de/encode the given data, then throwing an error is the _only_ sensible option.
-pekr-
21-May-2010 5:06:36 Carl, there were two blogs discussing the topic (Pruning down Read and Write, Finalizing Read and Write), but I'm not sure the resolution was ever posted, although there is some conclusion in the blog article chat section. The /string and /lines refinements appeared silently in between some releases.
As for me, I don't mind /string, although I objected. I know that things in real life are not 100% pure either. I would not even mind read-text and write-text as an exception, removing /string from 'read itself.
What I am more interested in is the codec topic in general. It does not belong to this blog article, but it is an interesting topic indeed. There were blogs on ports, devices, and Unicode, but no blog, doc, or example on the codec infrastructure yet.
do-codec: make native! [[
    {Evaluate a CODEC function to encode or decode media types.}
    handle [handle!] "Internal link to codec"
    action [word!] "Decode, encode, identify"
    data [binary! image!]
]]

Are we able to write codecs as mezzanines, for example? Could codecs be streamed? (I expect yes, but the approach would probably be to create a higher-level func, e.g. read-blue-ray, that just opens the file, reads it in chunks, and decodes; that could be done either entirely in C (codec, extension) or as a mezzanine.)
But I expect codecs not to be our priority topic for 3.0, and I am OK with that ... just some food for what might be an interesting blog topic :-)
-pekr-
21-May-2010 5:10:57 char-map. That name reminded me of the missing Unicode-related functionality in R3:
- console does not print some chars, it prints "?" instead
- View can't display Unicode
- 'sort can't sort Unicode properly
And the last point leads me back to char-map. Could such a char-map, or collate-map, be used to extend 'sort with a /collation refinement? Imagine getting data from a DB and not being able to sort it properly using REBOL's 'sort.
Maybe I should add a wish ticket into CC for that?
Brian Hawley
21-May-2010 13:19:22 Pekr, one point: "- console does not print some chars, it prints "?" instead"
That is not R3's fault, it actually sends the character to the host OS to display. If your display doesn't show the character it is because the font that the system is using for the console doesn't contain that character.
There is no solution for this when using the system console except for the user setting it to use a font that contains the characters they want displayed. For Windows, many of the fixed-width TrueType fonts would do. For Mac or Linux it will require a different (distro-specific) solution.
But regardless of what platform, as long as you use the system console then it is the system console's responsibility to have the fonts to display. R3 just outputs the characters.
Your other points are valid though.
Brian Hawley
21-May-2010 14:33:33 Peter Wood, you got your code wrong. The 'text in the code is a keyword, not a variable, so you can't put your string reference there. And decode only decodes binaries, not strings; I think you are looking for encode. You make a valid point about the unencodable characters though.
Carl, if Anton's (good) idea for path specs is implemented, can the get-word trick be done for block specs as well? It would save on reduce overhead. And we should have support for block specs as well as paths, because most data construction functions return blocks.
Oldes
27-May-2010 5:01:40 Uncodable chars - I think the best would be some user-defined callback, so you could easily decide what action should be taken without stopping the encoding. Simply throwing an error does not solve anything. You would still want to substitute and/or remove the uncodable chars, and you would just be left asking how.
Substituting just one char is also not a perfect solution.
What I consider perfect is multisubstitution, where for example chars like ě, é, ë can be substituted with a simple e
I could live with throwing an error only if you could use the information from the error message to continue the encoding after a user-defined char fix.
Edoc
2-Jun-2010 14:38:34
"What I consider perfect is multisubstitution, where for example chars like ě, é, ë can be substituted with a simple e"
This sounds too good to be true. I endorse it wholeheartedly, although it may be difficult to implement.
Brian Hawley
3-Jun-2010 13:21:31 Multisubstitution could be done with the above proposal as-is by using a code page that specifies the substitutions; no callbacks would be necessary. Though keep in mind that this is not for string-to-string conversions, it is for string-to-binary and binary-to-string conversions. It is not hard to implement at all.
User-specified callback functions on conversion errors are very unlikely to be implemented, because they are slow. A similar proposal for error callbacks was made for transcode, and it was rejected because it would have slowed down the function by a lot (an order of magnitude, iirc) even when the option to provide the callback isn't used. The same consideration would likely apply here. The trigger-an-error approach is much more likely to be implementable: that is what transcode does.
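
The code-page approach to multisubstitution mentioned above can be sketched in Python (an illustration only, with hypothetical names; it is not the R3 codec, and a dictionary stands in for a real code page). Several characters share the same target byte, so no callback is needed, and characters missing from the table trigger an error.

```python
def build_fold_table():
    """ASCII maps to itself; a few accented letters fold to plain 'e'."""
    table = {chr(b): b for b in range(128)}
    for ch in "\u011b\u00e9\u00eb":  # e with caron, acute, and diaeresis
        table[ch] = ord("e")
    return table


def encode_with_table(text, table):
    """Encode text via the table, raising an error for unmapped chars."""
    out = bytearray()
    for ch in text:
        byte = table.get(ch)
        if byte is None:
            raise ValueError(f"cannot encode character {ch!r}")
        out.append(byte)
    return bytes(out)


fold_table = build_fold_table()

# The accented letters in the input all come out as a plain "e" byte.
encoded = encode_with_table("caf\u00e9 v\u011bc", fold_table)
```

Because the substitutions live in the table, adding more of them only makes the table larger and leaves the per-character work of the conversion unchanged.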
