Unicode is the Focus

With the core of R3 stable, it's time for the next stage of R3 development: proper Unicode support.

If you're not familiar with it (because you've been living in a cave or on Mars for many years), Unicode is a standard method for handling text (strings and characters). Details are found at Unicode on wikipedia and Unicode.org websites.

So, Unicode is the focus of our current development, and it must be clearly stated, this is a non-trivial project. Our goal is to have it ready for initial testing by the end of the month.

I'll admit that we under-estimated the magnitude of the Unicode project. Over the last year our assumption was that supporting UTF8, the 8-bit encoded format for Unicode, would be sufficient. After all, REBOL's syntax is already valid UTF8, with just a few minor accommodations, we'd be doing the Unicode dance. Wrong.

Our REBOL founding principles are about more than just accommodating a major concept like Unicode. Namely, code friendliness and performance should not just be footnotes in the bigger picture. They are essential to the REBOL foundation.

So, what do I really mean by friendliness and performance? Let's just take a short little example, beginning with this string:

string: "this is UTF8 string"

Now, UTF8 allows a single unit (a code point... think of it as the abstract definition of a character) to be encoded as multiple bytes within that string. Although, what I've written here is ANSI (chars are all less than 128), that's just for the example, and it could actually contain characters from around Europe, Asia, Africa, or elsewhere. Those would be encoded as multiple bytes.

Ok, so what's wrong with it? Well, just take this simple line:

char: pick string 10

Then, ask the question: should CHAR be the actual character or some "random byte" that is being used to encode who-knows-what character?

The right answer is: it should be the character, not the encoding. That's what we mean by code friendly. It's a trade off: make REBOL smarter to give users greater power with less effort.

Over the last year, we thought we'd solve this problem by introducing a new datatype: Unicode! Sounds good, right? No, sorry, wrong. That approach also makes things more difficult for users, because now you've got to add conversions such as:

string: to unicode! "this is UTF8 string"

as well as worry about what strings in your code are UTF8 (encoded) and which are unicode (decoded). Our scripts would start to fill up with these various conversions everywhere... because we forced the end solution onto the user, rather than solving it in the language.

So, to boil it all down, here are the R3 datatype definitions as they relate to Unicode:

	binary!	strings of bytes. Those bytes can be anything, and we (you) don't care what. They can be encoded text (such as UTF8) or for that matter, encoded images (such as JPG), or even sounds. To use them, you need to know how they are encoded. (This approach is further supported by the new R3 port model which defaults to binary rather than text strings for all I/O.)
	string!	Unicode text. Internally, you don't care what it is or how it is stored. You still write them as quoted strings in your scripts, and if you insert, change, or remove, you just expect the right thing to happen. No worries.

And, the general usage rules are:

Source code is a UTF8. Most scripts are ANSI-7, so they already qualify as UTF8. That is, most code will load as-is.
When code is loaded, literal strings are converted to STRING! datatypes and are Unicode internally.
All STRING! functions act on the Unicode. So, if you pick the 10th char, you get a char. If you insert "hello" into any string, the correct operation happens.
Files read or written are BINARY! unless you specify an encoding.
Various codecs (encoders and decoders) will convert BINARY! (raw bytes) to and from STRING! (unicode).
Special functions, such as TRIM and PARSE will work on BINARY! or STRING!, and do the right thing.
The CHAR! datatype is Unicode as well, so picking a char, finding a char, inserting a char, and other operations work as you would expect.
Console input/output is decoded/encoded as appropriate to the output device. For example, a console that allows unicode will get unicode, and a byte-oriented console will get UTF8 encoding (or a filtered ANSI encoding, if so required).
Graphical text output (GUI) is output as unicode to be displayed as appropriate for the fonts supported.

There are a few other issues to cover, but we will document those separately.

5 Comments