Comments on: Dimensions of REBOL Unicode Support

It is a goal to support unicode in REBOL 3.0.

The following text discusses some of the "dimensions" of unicode support. I am presenting these here to seed your thoughts, so that you can make suggestions and help us produce a smart and balanced unicode design for REBOL 3.0.

I want to note that some aspects of unicode are quite complicated, and it is not practical for REBOL to support every possible nuance related to unicode. (Or, stated another way, I will not allow REBOL to become 10 times larger just to support all possible unicode variations.)

A lot of information has been written about unicode. For a good place to start, browse the Unicode entry on Wikipedia. Unicode has its own special glossary of terms (to which I openly admit I do not strictly adhere). For example, unicode uses "code points" as an abstraction layer: numbers that refer to characters independently of how those characters are stored. Here is a good Unicode Tutorial that helps explain these terms.

Unicode! Datatype

A new REBOL datatype, unicode!, handles the internal storage for unicode string series and provides the standard series functions (make, to, find, next, last, etc.).

Open to Debate

It is still open to debate whether unicode should be stored internally in a 16 bit format (such as that used by Java, which holds every character in the basic multilingual plane), or whether a 24/32 bit format makes more sense (it allows the full range of unicode characters, but wastes memory). It may be possible to auto-size the datatype internally, storing unicode in 16 bit normally, but switching to 24 bit when needed.
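As a rough illustration of the auto-sizing idea, here is a minimal sketch in Python (used only because the REBOL 3 unicode API is still undecided). The function names are hypothetical; the sketch simply picks the narrowest fixed width that can hold every character of a string.

```python
def bytes_per_char(text):
    """Return the smallest fixed width (in bytes) that holds every character."""
    highest = max((ord(ch) for ch in text), default=0)
    if highest <= 0xFFFF:
        return 2   # everything fits in the 16 bit basic multilingual plane
    return 3       # code points stop at 0x10FFFF, which fits in 24 bits


def storage_size(text):
    """Total bytes for a fixed-width series holding the text (hypothetical layout)."""
    return len(text) * bytes_per_char(text)


bmp_only = "REBOL"
with_supplementary = "REBOL\U0001D11E"   # adds U+1D11E, outside the 16 bit range

print(bytes_per_char(bmp_only), storage_size(bmp_only))
print(bytes_per_char(with_supplementary), storage_size(with_supplementary))
```

The trade-off is visible in the second string: a single character outside the 16 bit range widens every character in the series.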

Literal Direct Format

REBOL will support a new syntax for unicode strings. This is the format that appears directly within scripts and data files.

The simplest literal format is a hex (or perhaps base-64) encoded string similar to the binary data format. The advantage is that such a format does not cause problems when processed or transferred with normal 8-bit character systems. The disadvantage is that it is large and a user cannot view the actual string contents within the source code. For example:

##{005200450042004f004c} ; 16 bit, basic multilingual plane

###{00005200004500004200004f00004c} ; 24 bit, full unicode

When loaded, both of these strings would result in a unicode! series datatype.

It should be noted here that REBOL defines a byte-order for such literals. REBOL uses big-endian format, so no byte-order-marker need appear within the literal strings.
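To make the byte layout concrete, the following Python sketch decodes hex literals of this kind as big-endian, fixed-width code units. The helper name is hypothetical. Note that the longer example literal carries six hex digits (three bytes) per character, so it is decoded here with a width of 3.

```python
def decode_hex_literal(hex_digits, width):
    """Decode big-endian fixed-width code units (width bytes each) into text."""
    raw = bytes.fromhex(hex_digits)
    if len(raw) % width:
        raise ValueError("literal length is not a multiple of the character width")
    return "".join(
        chr(int.from_bytes(raw[i:i + width], "big"))
        for i in range(0, len(raw), width)
    )


sixteen_bit = decode_hex_literal("005200450042004f004c", 2)
wide = decode_hex_literal("00005200004500004200004f00004c", 3)
print(sixteen_bit, wide)
```

Because the byte order is fixed, the decoder needs no byte-order-marker; it only needs to know the width.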

Literal Encoded Format

Another possibility would be to allow UTF-8 encoding within strings in the source code. The advantage is that you will be able to view the strings in the appropriate editor. The disadvantage is that the script would contain a range of odd looking characters.

Even if UTF-8 is not supported as a literal datatype of REBOL, we would still support conversion to and from UTF-8 format.

I've not reached a decision on this issue as of yet.

Script Encoding

I am thinking about allowing support for unicoded REBOL scripts. The load and do functions would accept scripts as ASCII and also as UTF-16. For scripts that include a REBOL header, the UTF-16 could also be automatically detected (because it would appear as "0R0E0B0O0L", where 0 is a null byte).
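The header-based detection could work along these lines. This is only a sketch in Python with hypothetical names, and it assumes the script's UTF-16 code units start on an even byte offset. The alignment check matters: in little-endian text, the bytes of "REBOL" shifted by one position look like the big-endian pattern.

```python
def find_aligned(data, pattern):
    """True if pattern occurs in data starting at an even byte offset."""
    start = data.find(pattern)
    while start != -1:
        if start % 2 == 0:
            return True
        start = data.find(pattern, start + 1)
    return False


def sniff_script_encoding(data):
    """Guess a script's encoding from how the word REBOL appears in its bytes."""
    if find_aligned(data, "REBOL".encode("utf-16-be")):
        return "utf-16-be"   # bytes look like 00 52 00 45 00 42 ...
    if find_aligned(data, "REBOL".encode("utf-16-le")):
        return "utf-16-le"   # bytes look like 52 00 45 00 42 00 ...
    if b"REBOL" in data:
        return "ascii"
    return "unknown"


print(sniff_script_encoding('REBOL [Title: "x"]'.encode("utf-16-be")))
print(sniff_script_encoding(b"REBOL []"))
```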

Supporting scripts in UTF-8 format would be more problematic, because the REBOL header would appear the same. Also, existing scripts that use the latin-1 encoding could cause false UTF-8 detection. More discussion is needed.
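The ambiguity can be demonstrated with a short Python sketch (the function name is hypothetical). A pure ASCII header validates as UTF-8, so it tells us nothing, and some latin-1 byte sequences also happen to be valid UTF-8.

```python
def could_be_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


header = b"REBOL []"                              # plain ASCII
latin1_cafe = "caf\u00e9".encode("latin-1")       # byte E9 on its own
latin1_pair = "\u00c3\u00a9".encode("latin-1")    # bytes C3 A9

print(could_be_utf8(header))        # ASCII is always valid UTF-8
print(could_be_utf8(latin1_cafe))   # a lone E9 byte is rejected
print(could_be_utf8(latin1_pair))   # C3 A9 is also the UTF-8 form of U+00E9
```

Validation alone can therefore reject some latin-1 scripts, but it cannot prove that a script which passes was really written as UTF-8.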

Note that if we do allow unicode for scripts themselves, only literal string! and char! datatypes will be allowed to contain the unicoded characters. Other datatypes, such as words, will remain as they are today. This raises some issues with regard to datatypes like unicoded file names and email addresses, which we should discuss in more detail.

Conversions

A few new functions will be provided to encode and decode unicode strings into a variety of formats. For example, we will provide functions to input and output UTF-8 and UTF-16 formats (including byte-order-marker for endian detection, allowing UTF-16-LE and UTF-16-BE).
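For comparison, here is how such conversions look in Python, using its standard codecs module. The two wrapper functions are hypothetical, and treating unmarked input as big-endian is an assumption of this sketch.

```python
import codecs


def encode_utf16(text, little_endian=False):
    """Encode text as UTF-16 with a leading byte-order-marker."""
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def decode_utf16(data):
    """Decode UTF-16, using the byte-order-marker to detect endianness."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")   # assumption: no marker means big-endian


big = encode_utf16("REBOL")
little = encode_utf16("REBOL", little_endian=True)
print(big.hex(), little.hex(), "REBOL".encode("utf-8").hex())
```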

Coercions

When functions combine both unicode and string datatypes, we will automatically provide conversion when it makes sense. For example:

insert a-unicode-string "REBOL"

will insert the "REBOL" string, converting from latin-1 to unicode bytes.
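This coercion is cheap because every latin-1 byte value is already its own unicode code point. A small Python sketch, with hypothetical names and a plain list of code points standing in for the unicode series:

```python
def widen_latin1(byte_string):
    """Each latin-1 byte value is already the matching unicode code point."""
    return list(byte_string)


def insert_string(series, byte_string, index=0):
    """Insert a latin-1 byte string into a list of code points, in place."""
    series[index:index] = widen_latin1(byte_string)


a_unicode_string = [0x4E2D, 0x6587]   # two CJK code points
insert_string(a_unicode_string, b"REBOL")
print([hex(cp) for cp in a_unicode_string])
```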

Casing and Sorting

Many programmers have requested that the unicode datatype be able to handle the upper- and lower-case conversions as we do today with normal strings. They also want a way to sort unicoded strings, just as we do today with other types of strings. Also, we must allow case-insensitive searching and sorting features.
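The following Python sketch shows the kind of behavior being requested. The helper name is hypothetical. It sorts by code point after case folding, which is not a language-aware sort order, and the index returned by the search refers to the folded text, which can differ in length from the original.

```python
def find_ignoring_case(haystack, needle):
    """Index of needle in the case-folded haystack, or -1 if absent."""
    return haystack.casefold().find(needle.casefold())


words = ["rebol", "Zebra", "\u00c9cole", "apple"]

by_code_point = sorted(words)                      # upper case sorts first
ignoring_case = sorted(words, key=str.casefold)    # case-insensitive order

print(by_code_point)
print(ignoring_case)
print("\u00c9cole".lower(), "stra\u00dfe".upper())
print(find_ignoring_case("Learn Rebol today", "REBOL"))
```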

Ports

I think it would make sense to allow ports (e.g. network connections, files) to operate in a number of unicode codec formats. For example, you may want to read data directly from a UTF-16 XML file without calling an extra conversion function.

Perhaps the best way to solve this requirement is to look at what solutions other languages, such as Java, provide.
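One common approach, sketched here in Python, is to wrap the raw byte stream in a decoding layer so that the reader sees characters rather than bytes. The `open_port` name is hypothetical, and an in-memory stream stands in for a file or network connection.

```python
import io


def open_port(byte_stream, encoding):
    """Wrap a byte stream so that reads return decoded text."""
    return io.TextIOWrapper(byte_stream, encoding=encoding)


# Stand-in for a UTF-16 XML file; the "utf-16" codec writes a byte-order-marker.
raw = io.BytesIO("<name>REBOL</name>".encode("utf-16"))

port = open_port(raw, "utf-16")
text = port.read()
print(text)
```

The caller never invokes a separate conversion function; the codec is a property of the port.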

Graphical Display and Input Events

In addition to being able to handle unicode as a datatype, we will want to be able to create displays that handle unicode characters and accept unicoded input.

For example, we've had many requests to support Chinese characters in REBOL applications, so we need to make this possible at some stage in REBOL 3.0 or perhaps 3.1.

Operating System Compatibility

And finally, I should mention that unicode is an important consideration over the wide range of operating systems supported by REBOL. Native APIs for Windows, OSX, BSD, Linux, and others have support for unicode, and REBOL must be able to interface to those APIs. This is true not only for making DLL calls, but for operations as standard as file and directory access.

23 Comments

Petr Krenzelok
8-Jun-2006 22:52 :) Just a small note to syntax. Some kind of syntax extension was discussed on Altme Binary Tools group, and it made sense:
#b{11111111} or b#{11111111}
as binary base 2, for example. So, instead of the cryptic ## or ### way, what about
ucd32#{005200450042004f004c} or u16#{005200450042004f004c}
Why be cryptic with symbols that are not self-explanatory when we have a better way, the REBOL way? :)
Sorry for being off-topic probably, but I know Carl cares for syntax/naming conventions, so just food for thought :)
Cheers, Petr
Brian Hawley
9-Jun-2006 0:31 Instead of worrying about whether latin-1 encoded scripts would be mistaken for UTF8, why not just specify that the encoding of REBOL 3 scripts is UTF8 by default, and then just let latin-1 be the strict subset of UTF8 that it is supposed to be. All valid latin-1 byte streams are valid UTF8 byte streams anyways.
Then we could specify UTF8 strings directly as ##"astring" (converted on load to unicode), and have the characters specified in "astring" (single-byte strings) be syntactically limited to the one-byte subset of UTF8 (latin-1), letting other single byte values be specified with escape codes like they are now.
Would this work?
Chris
9-Jun-2006 0:31 Brian, I've been thinking along the same lines -- after all, UTF8-based REBOL scripts work right now with some caveats (main problem is string functions that treat the latter 128 characters as ISO-8859-1, and of course char! and charsets are meaningless above 127).
Also, while UTF8 is lumpy, it is a good fit for a messaging language, at least when using languages that utilize the lower BMP codes.
Volker
9-Jun-2006 4:22 utf-8 in the sourcecode please. I am used to see/patch code and data in an editor. With ##{005200450042004f004c} i cant. With utf-8 i can, with the right editor. And even strange looking chars can be distinguished.
Cyphre
9-Jun-2006 4:42 I agree with Brian. We don't need Latin-1 scripts. Just make R3 scripts UTF-8 by default.
Goldevil
9-Jun-2006 5:08 Internal storage in 16 bits with automatic internal conversion to 24/32 bits when needed is a good idea. But, for performance reasons, it could be useful to force 24/32 bits at creation or with a conversion function. I suppose that internal conversion goes quicker if done on empty or small strings.
I also propose that all scripts are considered as UTF-8 because latin-1 is a subset of UTF-8. But information in the header can help rebol to detect the encoding type: REBOL [encoding: 'utf8] or REBOL [encoding: 'iso8859]
Totally agree with not permitting unicode in word! values. Unicode is for the user of the software, not the programmer. The programmer wants to manipulate it easily (direct utf-8 encoding of strings in a script) but can handle latin-1 encoding for variable names, which are rarely shown to the user.
The fact that UTF-8 (the most usual unicode usage in databases) has variable length characters is sometimes tricky with some languages. Some rebol examples :
print length? myutf8string
What's the result if the string contains some 2-bytes characters ?
myutf8string: next myutf8string
If the first character is 2 bytes long. Where is the pointer ?
print index? myutf8string
And now, what's the index position ?
I can accept no unicode display in the console but in VID it's more important not only for multilingual purposes but also because unicode contains many useful symbols.
Unicode is not so easy to implement but it's very useful... Even if the Unicode Technical Committee rejected the Klingon proposal in May 2001 :)
Artem Hlushko
9-Jun-2006 5:53 Carl, please, make utf-8 in scripts, resident storage and services.
I propose to use fixed size (16 or 24 or 32 bit) datatype for internal manipulations and second 8 bit (octet by RFC) datatype for serialization to utf-8 for external communication (scripts are communication too :). I think it is natural for the network language and distributed applications.
See how unicode was implemented in Plan 9 (http://cm.bell-labs.com/sys/doc/utf.html) by the inventors of utf-8 (http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt).
Volker
9-Jun-2006 9:01 if 16 bit by default and 24/32 extension is possible, 8 bit default and 16bit-extension would be too? Then we could use unicode all the time while keeping the 8bit-memory-advantage.
maximo
9-Jun-2006 9:33 utf-8 by default for code, seems like a good idea.
after all, the extended character will be sort of like escaped characters, with their own syntax.
AFAIK, a lot, if not most, xml messaging seems to take place in utf-8, cause its human readable and allows better internationalisation.
also, many text editors read and write utf-8 directly too.
so any extended character would be converted transparently.
Andrew
9-Jun-2006 9:49 As far as I see it, simplicity for the programmer is uppermost. Make scripts UTF-8. Make normal REBOL literal strings accept unicode characters, no need for a special syntax. Strings can be stored internally in any convenient format and you may want to allow the programmer access but a function like length? should return the number of characters, not bytes. Ensure that all functions that provide input or output have an /encoding refinement and that graphical output can display such strings. This way we'd have to learn very little new. That would make me happy :)
Chris
9-Jun-2006 9:49 Max hits the nail on the head: "the extended character will be sort of like escaped characters". If -- length? "^/^/^(F4)" == 3 -- then why can the same concept not apply to upper utf-8 characters?
Goldevil
11-Jun-2006 14:26 Just an idea with the /econdig refinement. Maybe, a function can change the default encoding of string manipulation functions. It permits less code writing but for redistributed code, explicit usage allows easy insertion in other script.
The bad thing : typical errors for people who learn Rebol : "Damn! Why my script is unable to read correctly this file ?!"
Goldevil
12-Jun-2006 4:07 I was talking about /encoding refinement. Sorry.
Cal
12-Jun-2006 14:53 I'm also voting for making scripts UTF8 by default. I don't care what restrictions are made on characters used for "word!" values, but I do think we need to allow unicode "file!" values, since at least on windows file names can use unicode characters.
I tend to favor 32 bit encoding for internal use since that is guaranteed to be enough going into the future, but I don't care much about the internal representation if it is made simple enough to use.
Andreas Bolka
12-Jun-2006 19:34 i would default string! to be a unicode string. personally i'm fine with limiting it to the unicode BMP (16b).
consequently, i'd like to have utf8 as default script encoding and therefore as default encoding for string literals as well: "foo" would therefore be a utf8-encoded unicode literal (datatype string!). additional support for utf16 encoded scripts would certainly be a bonus.
sorting may be a bit problematic due to many different sort orders (e.g. german sorts ä before z, swedish sorts z before ä) - maybe the unicode collation algorithm (UCA) may help here.
rebol already has literal binary (octet) strings: #{...} / binary!. conversion functions from unicode strings to the various unicode encoding schemes (string! to binary!: utf8, utf16, ..) and vice versa (binary! to string!) should be provided. this would provide full flexibility, also for working with ports; maybe add an additional layer (e.g. read/encoding %filename 'utf16) for convenience.
so, here you are: my personal unicode wishlist :)
Oldes
12-Jun-2006 19:34 To give my opinion, I would prefer just a 16 bit format internally, as I'm living in central Europe. I cannot imagine how many people in Asia would find Rebol as useful as I did.
About the issue with default utf-8 scripts, I can see a problem if Carl doesn't want to support unicode for words. It can be difficult for newbies to decide which chars are not supported, as now I can do, for example, this:
>> ěščřžýáíé: "ěščřžýáíé"
== "ěščřžýáíé"
With a unicode enabled editor the string would look like:
>> ucs2/encode "ěščřžýáíé"
== #{011B0161010D0159017E00FD00E100ED00E9}
>> to-string ucs2/encode "ěščřžýáíé"
== {^A^[^Aa^A^M^AY^A~^@ý^@á^@í^@é}
This is not a valid Rebol word at all, but in the editor it will look like a normal charset.
The issue with exotic charsets can be easily solved by downloadable code tables, as with tables for correct sorting.
What I need is support for displaying unicode strings in View (and in Windows console - Unix based terminals already support it)
Oldes
12-Jun-2006 19:34 And the ## and ### syntax seems to be fine to me. It's shorter than:
UCS2{005200450042004f004c} and even UCS4{00005200004500004200004f00004c}
and is consistent with current behaviour where:
>> [ucs2{234}#{0052}]
== [ucs2 "234" #{0052}]
Oldes
12-Jun-2006 19:34 And if everybody talks about UTF8 and UTF16 - please do not miss that this:
##{005200450042004f004c}
is not UTF-8!! It's UCS2! UTF-8 is compressed 16bit Unicode.
>> ucs2/encode "REBOL"
== #{005200450042004F004C}
>> utf-8/encode-2 ucs2/encode "REBOL"
== #{5245424F4C}
Edoc
14-Jun-2006 13:09 "I don't care what restrictions are made on characters used for "word!" values"
If there are restrictions on characters within word! values, then we lose support for UTF-8 chars in dialects (the word sequences, at least). That's a big loss, in my opinion.
Brian Hawley
14-Jun-2006 13:50 Oldes, UTF-8 isn't compressed 16bit Unicode, it is a compact encoding of full Unicode. It is UCS2 that is limited to a 16bit subset of Unicode (BMP).
Oldes
14-Jun-2006 13:50 Brian, sorry, to me "compact form" is almost the same as "compressed" (I'm not a native English speaker). Just wanted to say that it's not the same.
Brian Hawley
23-Jun-2006 22:32 Oldes, I'm sorry as well, as the compact/compressed distinction wasn't the main point I was trying to get across. My main point is that Unicode is not a 16-bit format, and hasn't been for a while now.
The main difference between UCS2 and UTF8 or UTF16 is that UCS2 can only handle a 16-bit subset of Unicode, while UTF* can handle the whole thing. The main advantage of UCS2 is that it has fixed-length characters, which does make certain operations faster. But if people are talking about having strings autoconvert from 8 bit to 16, 24 or 32 bit internal formats as needed, that's what UTF8 does well already.
You don't even really need a header to determine the encoding of REBOL scripts - the word REBOL in front of the script header can act like a byte order marker, and any script that validates as one of the Unicode encodings can be assumed to be that encoding.
Robert D
24-Jul-2007 12:23:55 Of course it has to be in words, and if it's not, then it's garbage. Programmer sentiment is that I already learned latin, and so I don't need it, only the user. If you don't open up programming to 'non programmers' then you have given up on the dream, and someone who hasn't quit can pick your pocket. That Rebol is actually planning to be non competitive is a shame.