Comments on: Unicode is the Focus

With the core of R3 stable, it's time for the next stage of R3 development: proper Unicode support.

If you're not familiar with it (because you've been living in a cave or on Mars for many years), Unicode is a standard method for handling text (strings and characters). Details can be found in the Unicode article on Wikipedia and on the Unicode.org website.

So, Unicode is the focus of our current development and, to state it clearly, this is a non-trivial project. Our goal is to have it ready for initial testing by the end of the month.

I'll admit that we underestimated the magnitude of the Unicode project. Over the last year our assumption was that supporting UTF8, the 8-bit encoded format for Unicode, would be sufficient. After all, REBOL's syntax is already valid UTF8; with just a few minor accommodations, we'd be doing the Unicode dance. Wrong.

Our REBOL founding principles are about more than just accommodating a major concept like Unicode. Namely, code friendliness and performance should not just be footnotes in the bigger picture. They are essential to the REBOL foundation.

So, what do I really mean by friendliness and performance? Let's just take a short little example, beginning with this string:

string: "this is UTF8 string"

Now, UTF8 allows a single unit (a code point... think of it as the abstract definition of a character) to be encoded as multiple bytes within that string. Although what I've written here is ASCII (all chars are less than 128), that's just for the example; the string could actually contain characters from around Europe, Asia, Africa, or elsewhere. Those would be encoded as multiple bytes.

Ok, so what's wrong with it? Well, just take this simple line:

char: pick string 10

Then, ask the question: should CHAR be the actual character or some "random byte" that is being used to encode who-knows-what character?

The right answer is: it should be the character, not the encoding. That's what we mean by code friendly. It's a trade-off: make REBOL smarter to give users greater power with less effort.
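To make the difference concrete, here is a minimal sketch in Python rather than REBOL (the R3 behavior described here is still a plan, so this is only an analogy). Python's bytes stand in for UTF8-encoded data and str stands in for decoded text; the pick helper is hypothetical and only mimics REBOL's 1-based PICK.

```python
# Python analogy, not REBOL: bytes ~ UTF8-encoded data, str ~ decoded text.
text = "this is ünicode"          # one character (ü) is outside ASCII
encoded = text.encode("utf-8")    # the same text as raw UTF-8 bytes

def pick(series, index):
    """1-based pick, mimicking REBOL's PICK (hypothetical helper)."""
    return series[index - 1]

char_pick = pick(text, 10)        # the 10th character: "n"
byte_pick = pick(encoded, 10)     # the 10th byte: 0xBC, the second half of "ü"

print(len(text), len(encoded))    # 15 characters, 16 bytes
print(char_pick, hex(byte_pick))
```

Picking from the encoded bytes lands in the middle of the two-byte encoding of "ü", which is exactly the "random byte" problem.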

Over the last year, we thought we'd solve this problem by introducing a new datatype: Unicode! Sounds good, right? No, sorry, wrong. That approach also makes things more difficult for users, because now you've got to add conversions such as:

string: to unicode! "this is UTF8 string"

as well as worry about which strings in your code are UTF8 (encoded) and which are unicode (decoded). Our scripts would start to fill up with these various conversions everywhere... because we forced the end solution onto the user, rather than solving it in the language.

So, to boil it all down, here are the R3 datatype definitions as they relate to Unicode:

	binary!	strings of bytes. Those bytes can be anything, and we (you) don't care what. They can be encoded text (such as UTF8) or for that matter, encoded images (such as JPG), or even sounds. To use them, you need to know how they are encoded. (This approach is further supported by the new R3 port model which defaults to binary rather than text strings for all I/O.)
	string!	Unicode text. Internally, you don't care what it is or how it is stored. You still write them as quoted strings in your scripts, and if you insert, change, or remove, you just expect the right thing to happen. No worries.

And, the general usage rules are:

Source code is UTF8. Most scripts are 7-bit ASCII, so they already qualify as UTF8. That is, most code will load as-is.
When code is loaded, literal strings are converted to STRING! datatypes and are Unicode internally.
All STRING! functions act on the Unicode. So, if you pick the 10th char, you get a char. If you insert "hello" into any string, the correct operation happens.
Files read or written are BINARY! unless you specify an encoding.
Various codecs (encoders and decoders) will convert BINARY! (raw bytes) to and from STRING! (unicode).
Special functions, such as TRIM and PARSE, will work on BINARY! or STRING!, and do the right thing.
The CHAR! datatype is Unicode as well, so picking a char, finding a char, inserting a char, and other operations work as you would expect.
Console input/output is decoded/encoded as appropriate to the output device. For example, a console that allows unicode will get unicode, and a byte-oriented console will get UTF8 encoding (or a filtered ANSI encoding, if so required).
Graphical text output (GUI) is output as unicode to be displayed as appropriate for the fonts supported.
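The split between raw bytes and decoded text in the rules above can be sketched as follows. This is a Python analogy under stated assumptions, not R3 code: bytes play the role of BINARY!, str plays the role of STRING!, and Python's built-in encode/decode stand in for the codecs, whose actual R3 names are not given here.

```python
# Python analogy: bytes ~ BINARY!, str ~ STRING!, encode/decode ~ codecs.
raw = b"caf\xc3\xa9"                  # bytes as they might arrive from a file (UTF-8)
text = raw.decode("utf-8")            # decode: raw bytes -> Unicode text

print(len(raw), len(text))            # 5 bytes, but 4 characters

text = text + " au lait"              # string operations act on characters
out_utf8 = text.encode("utf-8")       # encode for a UTF-8 destination
out_latin1 = text.encode("latin-1")   # same text, different byte encoding

print(len(text), len(out_utf8), len(out_latin1))
```

The text has one length in characters, while its byte length depends on which encoding is chosen at the I/O boundary.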

There are a few other issues to cover, but we will document those separately.

5 Comments

Comments:

Chris
7-Jan-2008 21:40:46 Fantastic! What will this do for the char! datatype?
Carl Sassenrath
8-Jan-2008 2:56:34 Chris, char! datatype will be Unicode as well. I will add that to the notes above. Thanks for mentioning it.
Goldevil
8-Jan-2008 3:09:09 I'm very happy to see that Unicode will be completely integrated into Rebol, without tweaks, special datatypes and special instructions. It's still "KISS".
I'm working in Belgium (small country with 3 official languages) and my main customer is the European Commission. Every application I write must be multilingual (at least 3 languages but sometimes up to 21 languages). Some applications must integrate different languages inside the same screen.
Unicode support is our only real way of working and Rebol has zero chance of being used in many areas without Unicode support.
Now, I hope that R3 beta (or public alpha) will be released ASAP.
Great Job, Carl !
Andreas Bolka
8-Jan-2008 12:32:30 Three more things to consider:

Normalization

Collation

Case Conversion

In general, unless there is a specific objection to one of its shortcomings, using the ICU lib would leave you with a proper implementation of all of this and more. In REBOL's case, ICU's size might also be prohibitive. Further, ICU stores strings internally as UTF-16, which makes e.g. your pick example break again on strings that use characters beyond the BMP.
If you don't want to use the ICU:
Normalization. There are often many ways to specify a character as a sequence of characters. For example, take a C with acute and cedilla diacritics. You could have that as C-cedilla followed by an acute, or as C-acute followed by a cedilla (or as C followed by an acute followed by a cedilla, or ... you get the idea). That's not an issue of UTF-8/16 encoding; those are all valid sequences of Unicode characters. Obviously, a user should not be concerned with such matters, and comparing the string C followed by acute followed by cedilla with the string C followed by cedilla followed by acute should report them as equal.
That's where Unicode normalization comes in. Unicode Normalization Forms specify "normal forms" that are canonical representations for a combined character. Squeak, for example, uses Normalization Form D where all combined characters are fully decomposed into the regular character followed by combination marks (i.e. C followed by acute instead of the combined C-acute).
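As a rough illustration of the C-with-acute-and-cedilla example, here is a small Python sketch using the standard unicodedata module (Python rather than REBOL, purely as an analogy). The same_text helper is hypothetical, and choosing NFD for the comparison is an assumption; any one normalization form applied to both sides would do.

```python
import unicodedata

# Four spellings of "C with cedilla and acute", all canonically equivalent.
spellings = [
    "\u00c7\u0301",   # C-cedilla + combining acute
    "\u0106\u0327",   # C-acute + combining cedilla
    "C\u0301\u0327",  # C + combining acute + combining cedilla
    "C\u0327\u0301",  # C + combining cedilla + combining acute
]

def same_text(a, b):
    """Compare two strings after normalizing both to NFD (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Code point by code point, the spellings differ from each other ...
raw_equal = all(s == spellings[0] for s in spellings)
# ... but after normalization they all compare equal.
normalized_equal = all(same_text(s, spellings[0]) for s in spellings)

print(raw_equal, normalized_equal)
```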
Collation. Once you have Unicode, sorting becomes strenuous. You could always simply sort by code point value, which at least leads to a consistent sort order (but leaves you with Z coming before a). Users typically want some kind of "alphabetic" sorting, which varies between cultures. Unicode Technical Standard #10: Collation (UTS#10) specifies an algorithm for such "culturally aware" sorting.
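A quick Python sketch of the code point problem (an analogy, not REBOL). The word list is made up for illustration, and the case-folded key is only a crude stand-in: it is not the UTS#10 algorithm, and it still misplaces accented letters.

```python
# Hypothetical word list, chosen to show two code point ordering surprises.
words = ["apple", "Zulu", "école", "banana"]

# Plain sort compares code points: "Z" (90) sorts before "a" (97),
# and "é" (233) sorts after every unaccented ASCII letter.
by_code_point = sorted(words)

# A case-folded key fixes the upper/lower case problem only;
# "école" still lands after "Zulu", which most readers would not expect.
by_casefold = sorted(words, key=str.casefold)

print(by_code_point)
print(by_casefold)
```

Getting "école" to sort next to "e" needs real collation data of the kind UTS#10 describes.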
Case Conversion. Unicode specifies the upper/lower/title cases for all code points, along with a case conversion algorithm. There are some (annoying) special cases again: for example the German ß ("eszett", small sharp s) becomes SS when uppercased (which basically means that you simply can't uppercase some characters properly).
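The ß case can be seen in a few lines of Python, whose str methods apply the Unicode full case mappings (again an analogy; how REBOL's UPPERCASE should behave here is an open design question, not something this sketch settles).

```python
word = "straße"

upper = word.upper()          # "ß" expands to "SS", so the string grows
round_trip = upper.lower()    # lowercasing does not bring "ß" back

print(word, len(word))        # straße 6
print(upper, len(upper))      # STRASSE 7
print(round_trip)             # strasse

# Case folding gives a form suited to caseless comparison.
same_caseless = word.casefold() == upper.casefold()
print(same_caseless)
```

Because the length changes, uppercasing cannot always be done in place, one character at a time.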
Edoc
8-Jan-2008 12:53:47 With regard to Andreas' comment about sorting, here are a few words about employing a natural sort rather than the less-friendly array sort in most languages:
http://www.davekoelle.com/alphanum.html
