UTF-16 auto-detect for READ/string

Carl Sassenrath, CTO
REBOL Technologies
22-Oct-2009 4:19 GMT

Article #0279

R3 A92 adds UTF-16 detection and decoding to READ/string on files. This is mainly an ease-of-use enhancement. You can read Unicode files with minimal code.

For example, now you can write:

doc: read/string %that-unicode.doc  ; a 16-bit Unicode file

and then process it as a normal REBOL text string.

When you use READ/string to read a full file and the file begins with a Unicode byte order mark (BOM), the BOM determines the encoding used to decode the file text.

Currently, these are supported:

  • UTF-8
  • UTF-16BE (big endian)
  • UTF-16LE (little endian)

If no BOM is found, then UTF-8 (hence also ASCII) is assumed.
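The detection rule above can be sketched in a few lines. This is an illustration in Python, not the R3 implementation; the function names detect_encoding and read_string are hypothetical, and the sketch assumes the BOM is skipped rather than kept in the decoded string.

```python
import codecs

def detect_encoding(data: bytes):
    """Return (encoding name, BOM length) for the raw bytes of a file.

    Hypothetical helper: mirrors the rule described in the article,
    not REBOL's actual code.
    """
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", 2
    # No BOM found: assume UTF-8, which also covers plain ASCII.
    return "utf-8", 0

def read_string(data: bytes) -> str:
    """Decode file bytes to text, skipping the BOM if one is present."""
    encoding, bom_length = detect_encoding(data)
    return data[bom_length:].decode(encoding)
```

For instance, the bytes FF FE 68 00 69 00 carry a little-endian BOM followed by "hi" in UTF-16LE, while the plain bytes 68 69 have no BOM and fall through to the UTF-8 default.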

Note that surrogate pairs (used in UTF-16 to encode code points beyond the 16-bit basic multilingual plane) are not currently supported. Hopefully, not many of you require those at this time.
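For readers unfamiliar with surrogate pairs, here is a small Python sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units. The function name to_surrogate_pair is hypothetical; the arithmetic is the standard UTF-16 rule and says nothing about how R3 may eventually handle such input.

```python
def to_surrogate_pair(code_point: int):
    """Split a code point above U+FFFF into (high, low) UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the basic multilingual plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

# U+1D11E (musical symbol G clef) lies beyond the 16-bit range.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))  # 0xd834 0xdd1e
```

A decoder that treats each 16-bit unit as a complete character would see two separate values here instead of one code point, which is the limitation described above.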

We will need to add an /as refinement to allow you to specify an encoding when no BOM is provided. This also gives us a way to read the common 8-bit Latin-1 encoding (as used in R2).
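To see why an explicit encoding is needed for files without a BOM, consider this Python sketch. The helper read_string_as is hypothetical and only stands in for the idea of naming the encoding; it is not the design of the proposed /as refinement.

```python
def read_string_as(data: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes with a caller-supplied encoding (hypothetical helper)."""
    return data.decode(encoding)

# "café" stored as Latin-1: the final byte E9 is not valid UTF-8 on its own.
latin1_bytes = b"caf\xe9"

# Naming the encoding gives the intended text.
text = read_string_as(latin1_bytes, "latin-1")

# Falling back to the UTF-8 default fails on the same bytes.
try:
    read_string_as(latin1_bytes)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Because Latin-1 files carry no BOM, the default UTF-8 assumption cannot decode them correctly, so the caller has to supply the encoding.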

Similarly, WRITE will need an /as refinement in order to do the desired encoding. Currently, WRITE only outputs UTF-8 (and of course ASCII) for strings.


Updated 26-May-2024 - Copyright REBOL Technologies