Comments on: R3 and Unicode

There's a long document about Unicode on the R3 Wiki, but let me summarize the main points:

R3 source code is UTF-8. This is the most popular and most backward compatible Unicode format. That's why we use it.
Source code that is ASCII (0-127) is also valid UTF-8. You don't need to modify such files.
Files in other encodings that use byte values 128-255 will not work as-is. You need to convert them to UTF-8. We will provide examples.
Internally, a string! is Unicode. You don't need to worry about its encoding or storage format. Just use the normal series functions like insert, append, remove, etc.
The console on Win32 should work for all characters: Latin, Chinese, Greek, Cyrillic, Hiragana, etc. If you discover a problem, please mention it.
The console on non-Win32 (Linux, BSD, OS X) does not currently support Unicode. The cause is the R3 ReadLine() line editor. If you want to help fix that problem, contact me and I'll send you the source.
This is 2009 and most editors should be able to handle UTF-8. Even Notepad in XP handles UTF-8 (as well as UTF-16 LE and BE).
Data files can be binary, ASCII, UTF-8, UTF-16LE, and UTF-16BE. If the file contains a BOM, the encoding will be auto-detected when using read/string.
We still need to provide a method for reading LATIN-1 and other "codepage" encodings as data. This will be done with /as added to the read and write functions. Then, a small script can convert any codepage to UTF-8.
Internally, REBOL is smart about Unicode. It optimizes storage for strings. For example, ASCII and LATIN-1 strings take no more space than in R2.
R3 in its default configuration only supports the lowest Unicode plane (code points 0-64K, the Basic Multilingual Plane). That's nearly everything you can imagine. It is possible to compile with full 32-bit Unicode support, but that is not what we want for a default.
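
To make a few of the points above concrete, here is a minimal sketch of the encoding rules involved. It is written in Python purely for illustration and is not REBOL code or R3's implementation. The function names (decode_with_bom, codepage_to_utf8, in_bmp) are hypothetical, and the BOM handling is an assumption about how such detection generally works, not a description of what read/string does internally.

```python
import codecs

# Byte-order marks for the encodings listed above, mapped to Python codec names.
# (Assumption: only these three BOMs are of interest for this illustration.)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes, choosing the encoding from a leading BOM when one is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not end up in the decoded text.
            return data[len(bom):].decode(encoding)
    # No BOM found: fall back to the default encoding.
    return data.decode(default)


def codepage_to_utf8(data: bytes, codepage: str = "latin-1") -> bytes:
    """Re-encode bytes from a legacy codepage as UTF-8."""
    return data.decode(codepage).encode("utf-8")


def in_bmp(text: str) -> bool:
    """True when every code point is in the range 0-0xFFFF (the lowest plane)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Pure ASCII bytes are already valid UTF-8 and decode unchanged.
ascii_text = b"REBOL [Title: {Test}]".decode("utf-8")

# The Latin-1 byte 0xE9 (e with acute accent) becomes two bytes in UTF-8.
converted = codepage_to_utf8(b"caf\xe9")
```

The same Latin-1 bytes passed straight to a UTF-8 decoder raise an error, which is why files using 128-255 in another encoding need converting first.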

Adding Unicode to REBOL required a major development effort. It was non-trivial and very expensive to add. Internally, we found that in many cases adding Unicode does not make code twice as complicated; it makes it 5-10 times more complicated.

However, we've isolated nearly all of this complexity from REBOL programs. For the most part, programs can be just about as clean and simple as they were in R2 (which did not have Unicode). This is a significant accomplishment.

8 Comments

Comments:

Nick
31-Oct-2009 0:28:36 Congratulations Carl - that's a major hurdle to get past!
Edoc
31-Oct-2009 11:09:15 Yes, thank you very much for this important advancement.
Luke
31-Oct-2009 12:09:04 Well done. It's worth having if we are to convince new adopters that REBOL is not just a niche language.
Giuseppe Chillemi
31-Oct-2009 12:24:43 * Luke:
REBOL is a different language with paradigms that must be fully acknowledged.
Unicode opens REBOL to the rest of the world.
I hope it will have more success this round.
Jerry Tsai
1-Nov-2009 22:52:37 Can source files, which are in UTF-8, have BOM?
-pekr-
2-Nov-2009 11:54:30 Jerry - in fact, they are supposed to have BOM. R3 detects BOM, IIRC, so it should work imo ...
Carl Sassenrath
4-Nov-2009 14:34:50 I think Jerry is asking if .r files will skip the BOM.
That's a really good question... because DO and LOAD are different in how they validate REBOL source.
DO will scan for the REBOL header signature, but LOAD does not require that (for example when you are loading data.)
Please test this and post it as a bug if it's a problem.
Oldes
11-Nov-2025 0:05:07 Full Unicode support has been available since version 3.20: https://github.com/Oldes/Rebol3/releases/tag/3.20.0
