Comments on: UTF-16 auto-detect for READ/string

Carl Sassenrath, CTO
REBOL Technologies
22-Oct-2009 4:19 GMT

Article #0279

R3 A92 adds UTF-16 detection and decoding to READ/string on files. This is mainly an ease-of-use enhancement. You can read Unicode files with minimal code.

For example, now you can write:

doc: read/string %that-unicode.doc  ; a 16 bit Unicode file

and then process it as a normal REBOL text string.

When READ/string reads a whole file and that file begins with a Unicode byte order mark (BOM), the BOM determines the encoding used to decode the file text.

Currently, these are supported:

  • UTF-8
  • UTF-16BE (big endian)
  • UTF-16LE (little endian)

If no BOM is found, then UTF-8 (hence also ASCII) is assumed.
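The detection rule above can be sketched in a few lines. This is a language-neutral illustration written in Python, not REBOL's actual implementation; the helper name `decode_with_bom` is hypothetical, and unlike A92, Python's UTF-16 decoders also accept surrogate pairs.

```python
import codecs

# BOM byte sequences for the three encodings listed above.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def decode_with_bom(data: bytes) -> str:
    """Hypothetical helper: choose the decoder from a leading BOM, else assume UTF-8."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Drop the BOM itself, then decode the rest of the file.
            return data[len(bom):].decode(encoding)
    # No BOM found: fall back to UTF-8, which also covers plain ASCII.
    return data.decode("utf-8")
```

Because ASCII is a strict subset of UTF-8, the fallback branch handles plain ASCII files without any special case.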

Note that surrogate pairs (which encode code points beyond the 16-bit Basic Multilingual Plane) are not currently supported. Hopefully, not many of you require those at this time.
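For readers unfamiliar with surrogate pairs, here is a minimal sketch of how UTF-16 represents a code point above U+FFFF as two 16-bit units. It is written in Python purely for illustration; the function name `to_surrogate_pair` is hypothetical and is not part of REBOL.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not beyond the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))  # prints: 0xd834 0xdd1e
```

A decoder that treats every 16-bit unit as a complete character would see these two units as two separate, invalid characters, which is the limitation described above.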

We will need to add an /as refinement to allow you to specify an encoding when no BOM is provided. This also gives us a way to read the common 8-bit Latin-1 encoding (as used in R2).

Similarly, WRITE will need an /as refinement in order to do the desired encoding. Currently, WRITE only outputs UTF-8 (and of course ASCII) for strings.
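To make the idea of an explicit encoding concrete, here is a rough sketch in Python. It does not describe how the proposed /as refinement would behave; the names `write_as`, `read_as`, and `BOM_FOR` are hypothetical, and the choice to write UTF-8 without a BOM is an assumption.

```python
import codecs

# BOM written ahead of the text for each encoding (empty means no BOM).
BOM_FOR = {
    "utf-8": b"",                      # assumption: UTF-8 is written without a BOM
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "latin-1": b"",                    # 8-bit encoding, has no BOM
}

def write_as(text: str, encoding: str) -> bytes:
    """Hypothetical: encode text in the caller's chosen encoding, BOM first."""
    return BOM_FOR[encoding] + text.encode(encoding)

def read_as(data: bytes, encoding: str) -> str:
    """Hypothetical: decode with the caller's chosen encoding, skipping its BOM if present."""
    bom = BOM_FOR[encoding]
    if bom and data.startswith(bom):
        data = data[len(bom):]
    return data.decode(encoding)
```

With an explicit encoding, a Latin-1 byte such as E9 decodes to "é", whereas the UTF-8 default would reject it as an invalid sequence.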



Maxim Olivier-Adlhoch
22-Oct-2009 23:30:52
Maybe /as could also support a block!, which would serve as an encoding table.

These could then be generated by anyone who needs them for ANY of those derived encodings used by a small percentage of humanity.

This would make REBOL extremely friendly to those under-supported languages.

For example, some countries use Latin-1 with a (very) few differences in diacritics (one or two letters, sometimes).

They could thus create a simple encoding file and share it with others. A place could even be reserved on the REBOL web sites for these user-created encodings.



Updated 15-Jun-2024 - Copyright REBOL Technologies