Comments on: R3 and Unicode
There's a long document about Unicode on the R3 Wiki, but let me summarize the main points:
- R3 source code is UTF-8. This is the most popular and most backward compatible Unicode format. That's why we use it.
- Source code that is ASCII (0-127) is also valid UTF-8. You don't need to modify such files.
- Files in other encodings that use byte values 128-255 will not work. You need to convert them. We will provide examples.
- Internally, a string! is Unicode. You don't need to worry about its encoding or storage format. Just use the normal series functions like insert, append, remove, etc.
- The console on Win32 should work for all characters: Latin, Chinese, Greek, Cyrillic, Hiragana, etc. If you discover a problem, please mention it.
- The console on non-Win32 (Linux, BSD, OS X) does not currently support Unicode. The cause is the R3 ReadLine() line-editor. If you want to help fix that problem, contact me and I'll send you the source.
- This is 2009 and most editors should be able to handle UTF-8. Even Notepad in XP handles UTF-8 (as well as UTF-16 LE and BE.)
- Data files can be binary, ASCII, UTF-8, UTF-16LE, or UTF-16BE. If the file contains a BOM, the encoding will be auto-detected when using read/string.
- We still need to provide a method for reading LATIN-1 and other "codepage" encodings as data. This will be done with /as added to the read and write functions. Then, a small script can convert any codepage to UTF-8.
- Internally, REBOL is smart about Unicode. It optimizes storage for strings. For example, ASCII and LATIN-1 strings take no more space than in R2.
- R3 in its default configuration only supports the lowest Unicode plane, the Basic Multilingual Plane (code points 0-65535). That's nearly everything you can imagine. It is possible to compile with full 32-bit Unicode support, but that is not what we want for a default.
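To make the encoding points above concrete, here is a minimal sketch. It is written in Python, not REBOL, and does not show how R3 implements any of this. The helper names (decode_with_bom, latin1_to_utf8, in_bmp) are hypothetical and exist only for illustration. It shows that ASCII bytes are already valid UTF-8, that LATIN-1 bytes in the 128-255 range are not, how a BOM can select the decoder, and what "lowest plane only" means for a single character.

```python
import codecs

# BOM byte sequences for the encodings named in the list above.
# UTF-32 is deliberately ignored in this sketch (its LE BOM starts with
# the same two bytes as the UTF-16LE BOM).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: pick the decoder from a leading BOM, if any."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)  # strip the BOM itself
    return data.decode(default)


def latin1_to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: re-encode LATIN-1 bytes as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def in_bmp(ch: str) -> bool:
    """True if the character lies in the lowest plane (0-65535)."""
    return ord(ch) <= 0xFFFF


# Pure ASCII source decodes identically as ASCII and as UTF-8.
ascii_source = b'print "hello"'
assert ascii_source.decode("utf-8") == ascii_source.decode("ascii")

# A LATIN-1 file using byte 0xE9 (e with acute accent) is not valid UTF-8.
latin1_source = b"caf\xe9"
try:
    latin1_source.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# After conversion the same character takes two bytes in UTF-8.
converted = latin1_to_utf8(latin1_source)
```

The conversion step is the part a codepage-to-UTF-8 script would have to perform; in R3 that is what the planned /as refinement is meant to enable.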
Adding Unicode to REBOL required a major development effort. It was non-trivial and very expensive to add. Internally, we found that in many cases adding Unicode does not make code twice as complicated; it makes it 5-10 times more complicated.
However, we've isolated nearly all this complexity from REBOL programs. For the most part, programs can be just about as clean and simple as they were in R2 (which did not have Unicode). This is a significant accomplishment.
Congratulations Carl - that's a major hurdle to get past!
Yes, thank you very much for this important advancement.
Well done. It's worth having if we are to convince new adopters that REBOL is not just a niche language.
REBOL is a different language with paradigms that must be fully acknowledged.
Unicode opens REBOL to the rest of the world.
Hopefully this time around it will have more success.
Can source files, which are in UTF-8, have a BOM?
Jerry - in fact, they are supposed to have a BOM. R3 detects the BOM, IIRC, so it should work imo ...
I think Jerry is asking if .r files will skip the BOM.
That's a really good question... because DO and LOAD are different in how they validate REBOL source.
DO will scan for the REBOL header signature, but LOAD does not require that (for example when you are loading data.)
Please test this and post it as a bug if it's a problem.