Comments on: UTF-16 auto-detect for READ/string

R3 A92 adds UTF-16 detection and decoding to READ/string on files. This is mainly an ease-of-use enhancement: you can now read Unicode files with minimal code.

For example, now you can write:

doc: read/string %that-unicode.doc  ; a 16-bit Unicode file

and then process it as a normal REBOL text string.

When you use READ/string to read a whole file and the file begins with a Unicode byte order mark (BOM), the BOM determines the encoding used to decode the file text.

Currently, these are supported:

UTF-8
UTF-16BE (big endian)
UTF-16LE (little endian)

If no BOM is found, then UTF-8 (hence also ASCII) is assumed.
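To make the detection rule concrete, here is a minimal sketch in Python (not REBOL, and not the actual R3 implementation). The function names `detect_encoding` and `read_string` are hypothetical, chosen only for this illustration. It checks for the three BOMs listed above and falls back to UTF-8 when none is present.

```python
import codecs

# BOM byte sequences for the three encodings listed above.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def detect_encoding(data: bytes):
    """Return (encoding name, BOM length). Hypothetical helper."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: assume UTF-8, which also covers plain ASCII.
    return "utf-8", 0

def read_string(data: bytes) -> str:
    """Decode file bytes, skipping the BOM if one was found."""
    name, skip = detect_encoding(data)
    return data[skip:].decode(name)
```

Note that Python's UTF-16 codecs do decode surrogate pairs, so this sketch is more permissive than the behavior described in this post.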

Note that surrogate pairs (the UTF-16 mechanism for encoding code points beyond the 16-bit Basic Multilingual Plane) are not currently supported. Hopefully, not many of you require those at this time.
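For readers unfamiliar with surrogate pairs, the arithmetic defined by UTF-16 can be sketched as follows. This is a Python illustration of the standard, not of anything in R3; the function names are hypothetical.

```python
def to_surrogates(cp: int):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1D11E (the musical G clef) becomes the two 16-bit units D834 and DD1E. A decoder without surrogate support would see those as two separate values rather than one character.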

We will need to add an /as refinement to allow you to specify an encoding when no BOM is provided. This also gives us a way to read the common 8-bit Latin-1 encoding (as used in R2).
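Why an explicit encoding is needed for Latin-1 can be shown with a small Python sketch (an illustration only; `decode_latin1` is a hypothetical name). In Latin-1 every byte maps to the code point with the same value, and there is no BOM to detect, so bytes above 127 are typically invalid when treated as UTF-8.

```python
def decode_latin1(data: bytes) -> str:
    """Latin-1: byte value N is code point U+00NN."""
    return "".join(chr(b) for b in data)

def decodes_as_utf8(data: bytes) -> bool:
    """True if the bytes happen to be valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

latin1_bytes = b"caf\xe9"   # "café" as written by a Latin-1 application
```

Here `decodes_as_utf8(latin1_bytes)` is false, while `decode_latin1(latin1_bytes)` yields the intended text. The default UTF-8 assumption cannot recover such a file, which is the gap an /as refinement would fill.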

Similarly, WRITE will need an /as refinement in order to do the desired encoding. Currently, WRITE only outputs UTF-8 (and of course ASCII) for strings.
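The same point applies on output: one string produces different bytes depending on the chosen encoding. The Python sketch below is only an illustration; `encode_for_write` is a hypothetical helper, and writing a BOM for the UTF-16 forms is an assumption of this sketch, not a statement about how WRITE/as will behave.

```python
import codecs

def encode_for_write(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text for output. BOM handling here is an assumption."""
    if encoding == "utf-16-be":
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    if encoding == "utf-16-le":
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    # "utf-8" and "latin-1" are written without a BOM in this sketch.
    return text.encode(encoding)
```

With this sketch, the single character é is written as C3 A9 in UTF-8, FE FF 00 E9 in UTF-16BE, FF FE E9 00 in UTF-16LE, and E9 in Latin-1.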

Comments:

Maxim Olivier-Adlhoch
22-Oct-2009 23:30:52 maybe /as could also support a block! which would serve as an encoding table.
these can then be generated by anyone who needs them for ANY of those derived encodings used by a small percentage of humanity.
This would make REBOL extremely friendly to those under-supported languages.
for example, some countries use latin-1 with a (very) few differences in diacritics (one or two letters, sometimes).
they could thus create a simple encoding file and share with others. a place could even be reserved on the rebol web sites for these user-created encodings.
:-)
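As a rough illustration of the encoding-table idea in the comment above, here is a Python sketch (not REBOL, and not a proposed API; `make_decoder` and the table layout are hypothetical). It starts from Latin-1 and overrides only the bytes that differ, using ISO-8859-15 as the example variant.

```python
def make_decoder(overrides):
    """Build a decoder from Latin-1 plus a small table of byte overrides."""
    table = [chr(b) for b in range(256)]   # Latin-1 base mapping
    for byte, char in overrides.items():
        table[byte] = char
    def decode(data: bytes) -> str:
        return "".join(table[b] for b in data)
    return decode

# ISO-8859-15 differs from Latin-1 in only eight positions.
LATIN9_OVERRIDES = {
    0xA4: "\u20ac",  # euro sign
    0xA6: "\u0160", 0xA8: "\u0161",  # S and s with caron
    0xB4: "\u017d", 0xB8: "\u017e",  # Z and z with caron
    0xBC: "\u0152", 0xBD: "\u0153",  # OE and oe ligatures
    0xBE: "\u0178",                  # Y with diaeresis
}

decode_latin9 = make_decoder(LATIN9_OVERRIDES)
```

A shared "encoding file" would then amount to little more than the override table itself.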