Text code-page mapping

Carl Sassenrath, CTO
REBOL Technologies
18-May-2010 19:35 GMT

Article #0322

I recognize that many developers need to be able to input and output older text strings that are not Unicode UTF-8 encoded.

R3.A100 will provide a codec for "arbitrary" text mapping. The default map will be UTF-8, but we'll also support LATIN1 (the first 256 code points of Unicode, also known as ISO-8859-1). If you need some other mapping, such as Windows-1252 or ISO-8859-15, you can specify it by providing your own map (maps are simple to create).

In order to specify 8-bit code-page maps, we will need to allow additional inputs to codecs. My current design direction is to allow the codec media argument to be either a word or a block. For example:

text: decode 'text bin  ; default conversion as UTF-8

text: decode [text latin1] bin

text: decode [text 16] bin ; UTF-16 BE

text: decode [text -16] bin ; UTF-16 LE

text: decode reduce ['text char-map] bin

Here, char-map is a reference to a string that maps each byte value to a character (Unicode code point). The map is simple to set up, but note that each byte is a zero-based index to a char, so byte 0 selects the first char of the string.

For example, latin1 is created this way:

char-map: make string! 256
repeat n 256 [append char-map to-char n - 1]
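
To make the zero-based indexing concrete, here is a minimal sketch of what a map-driven decode amounts to, written as an ordinary R3 function rather than as the codec itself. The name decode-with-map is hypothetical, and the real codec may well work differently internally:

```rebol
; Hypothetical helper, not the codec: decode bytes through a char-map by hand.
decode-with-map: func [
    bin [binary!]       ; 8-bit encoded input
    char-map [string!]  ; 256 chars; byte value B selects the char at position B + 1
    /local out
][
    out: make string! length? bin
    foreach byte bin [
        ; series positions are one-based, so add 1 to the zero-based byte value
        append out pick char-map byte + 1
    ]
    out
]

; The latin1 map from above
char-map: make string! 256
repeat n 256 [append char-map to-char n - 1]

; Latin-1 bytes for c, a, f, e-acute
print decode-with-map #{636166E9} char-map
```

With the latin1 map each byte simply becomes the code point of the same value. A different map changes only the string contents, not the loop.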

In some cases you might load the map from a text file (which is encoded itself in UTF-8):

char-map: read/string %iso-8859-15
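
If no such file is at hand, a map can also be built in code. As a sketch, assuming R3 strings and the latin1 map as a starting point, ISO-8859-15 can be derived by patching the eight byte values where it differs from ISO-8859-1:

```rebol
; Start from the latin1 map: byte value B maps to code point B
char-map: make string! 256
repeat n 256 [append char-map to-char n - 1]

; The eight bytes where ISO-8859-15 differs from ISO-8859-1,
; given as pairs of byte value and Unicode code point
foreach [byte code] [
    164 8364   ; A4: euro sign
    166 352    ; A6: S with caron
    168 353    ; A8: s with caron
    180 381    ; B4: Z with caron
    184 382    ; B8: z with caron
    188 338    ; BC: OE ligature
    189 339    ; BD: oe ligature
    190 376    ; BE: Y with diaeresis
][
    ; poke is one-based, so the byte value is offset by 1
    poke char-map byte + 1 to-char code
]
```

The resulting char-map would then be passed to the codec in the same way as in the decode examples above.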

The alternative would be to add a refinement to decode to specify the map, but I like that less because it splits up the "spec" of the decoding. Thoughts on that?

The encoding method would reflect the same approach:

bin: encode 'text text  ; default conversion as UTF-8

bin: encode [text latin1] text

bin: encode [text 16] text ; UTF-16 BE

bin: encode [text -16] text ; UTF-16 LE

bin: encode reduce ['text char-map] text

Note that char-map can be either a string or a binary here. The same string used with decode will work, but an internal binary map must then be temporarily allocated for the conversion. If you're converting a lot of strings, make the map a binary so that chars are mapped directly during the conversion.
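
As a rough illustration of the reverse direction, the sketch below encodes a string through the same string map by searching the map for each char. The helper name encode-with-map and the use of "?" for unmappable chars are assumptions made for this example, and the layout of the codec's internal binary map is not shown here. The linear search per char is the kind of cost a precomputed binary map would avoid:

```rebol
; Hypothetical helper, not the codec: encode text through a char-map by hand.
encode-with-map: func [
    text [string!]
    char-map [string!]  ; same map as used for decoding
    /local out pos
][
    out: make binary! length? text
    foreach ch text [
        ; find/case, because a plain find on strings ignores case
        either pos: find/case char-map ch [
            ; position is one-based, byte value is zero-based
            append out (index? pos) - 1
        ][
            ; assumed fallback: byte 63 ("?") for chars not in the map
            append out 63
        ]
    ]
    out
]

; The latin1 map
char-map: make string! 256
repeat n 256 [append char-map to-char n - 1]

probe encode-with-map "abc" char-map
```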

Since this change will be part of A100, please post your comments soon.


Updated 24-Mar-2017 - Copyright REBOL Technologies