Dimensions of REBOL Unicode Support

It is a goal to support unicode in REBOL 3.0.

The following text discusses some of the "dimensions" of unicode support. I am presenting these here to seed your thoughts, so that you can make suggestions and help us produce a smart and balanced unicode design for REBOL 3.0.

I want to note that some aspects of unicode are quite complicated, and it is not practical for REBOL to support every possible nuance related to unicode. (Or, stated another way, I will not allow REBOL to become 10 times larger just to support all possible unicode variations.)

A lot of information has been written about unicode. For a good place to start, browse the Unicode entry on Wikipedia. Unicode has its own special glossary of words (of which I openly admit not strictly adhering to). For example, unicode refers to "code points" as references to characters to provide an abstraction layer. Here is a good Unicode Tutorial that helps explain these terms.

Unicode! Datatype

A new REBOL datatype, unicode!, handles the internal storage for unicode string series and provides the standard series functions (make, to, find, next, last, etc.).

Open to Debate

It is still open to debate if unicode should be stored internally in 16 bit format (such as that used by Java and holds all standard characters), or if a 24/32 bit format makes more sense (allowing full range of unicode characters, but wastes memory). It may be possible to auto-size the datatype internally, storing unicode in 16 bit normally, but switching to 24 bit when needed.

Literal Direct Format

REBOL will support a new syntax for unicode strings. This is the format that appears directly within scripts and data files.

The simplest literal format is a hex (or perhaps base-64) encoded string similar to the binary data format. The advantage is that such a format does not cause problems when processed or transferred with normal 8-bit character systems. The disadvantage is that it is large and a user cannot view the actual string contents within the source code. For example:

##{005200450042004f004c} ; 16 bit, basic multilingual plane

###{00005200004500004200004f00004c} ; 32 bit, full unicode

When loaded, both of these strings would result in a unicode! series datatype.

It should be noted here that REBOL defines a byte-order for such literals. REBOL uses big-endian format, so no byte-order-marker need appear within the literal strings.

Literal Encoded Format

Another possibility would be to allow UTF-8 encoding within strings in the source code. The advantage is that you will be able to view the strings in the appropriate editor. The disadvantage is that the script would contain a range of odd looking characters.

Even if UTF-8 is not supported as a literal datatype of REBOL, we would still support conversion to and from UTF-8 format.

I've not reached a decision on this issue as of yet.

Script Encoding

I am thinking about allowing support for unicoded REBOL scripts. The load and do functions would accept scripts as ASCII and also as UTF-16. For scripts that include a REBOL header, the UTF-16 could also be automatically detected (because it would appear as "0R0E0B0O0L", where 0 is a null byte).

Supporting scripts in UTF-8 format would be more problematic, because the REBOL header would appear the same. Also, existing scripts that use the latin-1 encoding could cause false UTF-8 detection. More discussion is needed.

Note that if we do allow unicode for scripts themselves, only literal string! and char! datatypes will be allowed to contain the unicoded characters. Other datatypes, such as words, will remain as they are today. This raises some issues with regard to datatypes like unicoded file names and email addresses, which we should discuss in more detail.

Conversions

A few new functions will be provided to encode and decode unicode strings into a variety of formats. For example, we will provide functions to input and output UTF-8 and UTF-16 formats (including byte-order-marker for endian detection, allowing UTF-16-LE and UTF-16-BE).

Coercions

When functions combine both unicode and string datatypes, we will automatically provide conversion when it makes sense. For example:

insert a-unicode-string "REBOL"

will insert the "REBOL" string, converting from latin-1 to unicode bytes.

Casing and Sorting

Many programmers have requested that the unicode datatype be able to handle the upper- and lower-case conversions as we do today with normal strings. They also want a way to sort unicoded strings, just as we do today with other types of strings. Also, we must allow case-insensitive searching and sorting features.

Ports

I think it would make sense to allow ports (e.g. network connections, files) to operate in a number of unicode codec formats. For example, you may want to read data directly from a UTF-16 XML file without calling an extra conversion function.

Perhaps the best way to solve this requirement is to look at what solutions other languages, such as Java, provide.

Graphical Display and Input Events

In addition to being able to handle unicode as a datatype, we will want to be able to create displays that handle unicode characters and accept unicoded input.

For example, we've had many requests to support Chinese characters in REBOL applications, so we need to make this possible at some stage in REBOL 3.0 or perhaps 3.1.

Operating System Compatibility

And finally, I should mention that unicode is an important consideration over the wide range of operating systems supported by REBOL. Native APIs for Windows, OSX, BSD, Linux, and others have support for unicode, and REBOL must be able to interface to those APIs. This is true not only for making DLL calls, but for operations as standard as file and directory access.

23 Comments