Comments on: UTF-16 auto-detect for READ/string
R3 A92 adds UTF-16 detection and decoding to READ/string on files. This is mainly an ease-of-use enhancement. You can read Unicode files with minimal code.
For example, now you can write:
doc: read/string %that-unicode.doc ; a 16 bit Unicode file
and then process it as a normal REBOL text string.
When you use READ/string to read an entire file and the file begins with a Unicode byte order mark (BOM), the BOM determines the encoding used to decode the file text.
Currently, these are supported:
- UTF-16BE (big endian)
- UTF-16LE (little endian)
If no BOM is found, then UTF-8 (hence also ASCII) is assumed.
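To make the detection rule concrete, here is a minimal sketch of the same logic written in Python rather than REBOL. It is only an illustration of the rule described above, not the actual R3 implementation, and the function name `decode_text` is made up for this example. One difference to keep in mind: Python's UTF-16 decoders do handle surrogate pairs.

```python
import codecs

def decode_text(data: bytes) -> str:
    """Choose a decoder from the leading BOM, falling back to UTF-8.

    Hypothetical helper: it mirrors the rule in the text, not REBOL's code.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")    # skip the 2-byte BOM
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")    # skip the 2-byte BOM
    # No BOM found: assume UTF-8, which also covers plain ASCII.
    return data.decode("utf-8")
```

Calling `decode_text(b"\xff\xfeH\x00i\x00")` takes the little-endian branch, while a plain ASCII byte string falls through to the UTF-8 default.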
Note that surrogate pairs (which encode code points beyond the 16-bit Basic Multilingual Plane) are not currently supported. Hopefully, not many of you require those at this time.
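For readers unfamiliar with surrogate pairs, the following Python sketch shows the standard UTF-16 arithmetic that splits a code point above U+FFFF into two 16-bit units. It illustrates what a decoder without surrogate support cannot reassemble; the function name `to_surrogate_pair` is chosen for this example only.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low
```

For example, U+1D11E (the musical G clef) becomes the pair 0xD834, 0xDD1E. A reader that treats every 16-bit unit as a complete character would see two separate values instead of one character.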
We will need to add an /as refinement to allow you to specify an encoding when no BOM is provided. This also gives us a way to read the common 8-bit Latin-1 encoding (as used in R2).
Similarly, WRITE will need an /as refinement in order to do the desired encoding. Currently, WRITE only outputs UTF-8 (and of course ASCII) for strings.
Maybe /as could also support a block! that would serve as an encoding table.
These tables could then be generated by anyone who needs them, for any of the derived encodings used by a small percentage of humanity.
This would make REBOL extremely friendly to those under-supported languages.
For example, some countries use Latin-1 with a (very) few differences in diacritics (sometimes only one or two letters).
They could thus create a simple encoding file and share it with others. A place could even be reserved on the REBOL web sites for these user-created encodings.
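As a rough illustration of the encoding-table idea, here is a Python sketch (not REBOL, and not a proposed /as design) that starts from Latin-1 and overrides a few byte values. The helper names `make_table` and `decode_with_table` are hypothetical, and the single override shown, byte 0xA4 mapped to the euro sign as in ISO-8859-15, is only an example entry.

```python
def make_table(overrides):
    """Build a 256-entry decoding table: Latin-1 plus a few replacements."""
    # Latin-1 maps byte value N directly to code point N.
    table = [chr(byte) for byte in range(256)]
    for byte, char in overrides.items():
        table[byte] = char               # apply the regional difference
    return table

def decode_with_table(data: bytes, table) -> str:
    """Decode 8-bit text by looking each byte up in the table."""
    return "".join(table[byte] for byte in data)

# Example table: Latin-1 with byte 0xA4 remapped to the euro sign.
euro_table = make_table({0xA4: "\u20ac"})
```

With this table, `decode_with_table(b"10 \xa4", euro_table)` yields the text "10 €", while every byte that is not overridden decodes exactly as it would in Latin-1. A shared encoding file would then only need to list the handful of overrides.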