Unicode: READ and LOAD

Unicode changes some important usage patterns in REBOL. So, I want to talk about them here.

This is the first of several blog posts on the R3 Unicode topic.

First, you should know:

	READ	reads binary (bytes) and does not decode them. When you read from a file, the bytes are usually encoded in some format. It could be UTF8 or UTF16, but it could also be JPEG or WAVE. So, READ handles raw bytes, not content.
	LOAD	decodes binary into a specific datatype. Just as we have used LOAD before to decode REBOL blocks, JPEG, GIF, or PNG, it can decode other types too, such as UTF8 into a STRING!.

These imply a lot. Just think about it. If you write:

data: read %page.html

How is the data encoded? It could be LATIN1, UTF8, UTF16, or something else. If all we want to do is save it as a file or upload it, then we may not care what it is. We can write:

write %newpage.html data

The nice thing is that this works for any kind of file (including images and sounds), because it's just binary data.

Now, if we want to process the data, we need to take a different approach. We will need to decode the data. That is the purpose of LOAD.

We will need to write a line like:

page: load/as %page.html 'utf8

Now LOAD knows how to decode the raw bytes: we have told it they are UTF8 encoded. The result (page) is a proper STRING! type that we can process normally.
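To see why the decoding step matters, here is a small sketch of the difference between bytes and characters. It does not use the proposed LOAD/as. It assumes an R3 build in which TO STRING! decodes a BINARY! as UTF8 and TO BINARY! encodes a STRING! as UTF8, and the variable names are made up for illustration.

```rebol
; Assumes R3 semantics: TO STRING! decodes UTF8, TO BINARY! encodes UTF8.
bytes: #{636166C3A9}        ; UTF8 bytes for "café" (the é takes two bytes: C3 A9)
text: to string! bytes      ; decode the raw bytes into characters

print length? bytes         ; 5 bytes
print length? text          ; 4 characters

; Re-encoding the string gives back the same bytes.
print bytes = to binary! text
```

The binary has five elements and the string has four, so any processing done on the raw bytes (counting, indexing, searching) would be off as soon as a non-ASCII character appears.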

We could also write:

page: load/as %page.html 'latin-1

For latin-1 style encoding (used for REBOL2).
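The same bytes mean different things under different encodings, which is why the type has to be stated. As a rough sketch (this is not the proposed codec): in Latin-1 every byte is the code point of the same number, so a decoder fits in a few lines. The name latin1-to-string is hypothetical, and the sketch assumes R3, where iterating over a BINARY! yields integers.

```rebol
latin1-to-string: func [
    "Decode Latin-1 bytes to a string (hypothetical helper, assumes R3)."
    data [binary!]
    /local out
][
    out: make string! length? data
    foreach byte data [
        ; Each Latin-1 byte maps directly to the code point of the same value.
        append out to char! byte
    ]
    out
]

probe latin1-to-string #{636166E9}   ; c a f é, with é stored as the single byte E9
```

Read as UTF8, those same four bytes would be malformed, because E9 marks the start of a multi-byte sequence.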

To write it, we would need to re-encode the page. That might look like this:

save/as %page.html page 'utf8

Note that this method of encoding and decoding is generalized in REBOL. There can be many different codecs (encoder/decoders). So, just as we handled UTF8 above, we may also write:

image: load/as %image.jpg 'jpg
save/as %image.jpg image 'jpg

A shortcut is to allow:

image: load %image.jpg

where we automatically detect the file encoding and apply the correct decoder.

How is the encoding detected? We have two hints: the suffix and the file header. Both of those can be used to tell us we have a JPG file.

Can we use something similar for text files? It is possible. Unicode files often provide a BOM, a byte order marker, to tell us what they are. That would allow:

page: load %page.html

Ah, but wait. Isn't that how we load REBOL code too?

In fact it is, so we'll need to be careful, and also support some refinements to help us know what is being requested.

For example, the line:

code: load %code.r

is easy because we can see it is a .r file. But what if we wrote:

code: load %code

That is valid to do, but what is being loaded?

We have two choices:

1. Use some easy-to-remember rules
2. Require explicit information

I think both should be allowed. For example, if %code contains a REBOL header, then we will load it as code. If the %code file contains a REBOL header, but also a UTF16 BOM, we will still load it as REBOL (converting it as required).

Now, if you want the file decoded as a string rather than REBOL, you'd need to state that:

text: load/string %code

In that case the BOM is used if one is present; if no BOM is found, the file is assumed to be UTF8 encoded. We can also do this:

text: load/string/as %code 'utf8

if we know it is UTF8 and we need to handle it that way.
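The "use the BOM, else assume UTF8" rule could be sketched roughly as follows. This is only an illustration of the rule, not how LOAD is implemented. The function name and the encoding words (utf16-le, utf16-be) are hypothetical, and UTF32 BOMs are left out to keep it short.

```rebol
guess-text-encoding: func [
    "Pick a text encoding from a leading BOM, defaulting to UTF8 (hypothetical helper)."
    data [binary!]
][
    case [
        #{EFBBBF} = copy/part data 3 ['utf8]        ; UTF8 BOM
        #{FFFE} = copy/part data 2   ['utf16-le]    ; little-endian UTF16 BOM
        #{FEFF} = copy/part data 2   ['utf16-be]    ; big-endian UTF16 BOM
        true                         ['utf8]        ; no BOM found: assume UTF8
    ]
]

probe guess-text-encoding #{FEFF0041}   ; big-endian UTF16 BOM, then "A"
probe guess-text-encoding #{41}         ; no BOM at all
```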

Returning to READ for a minute: should we allow some method of decoding within the READ function? Should it be similar to LOAD?

text: read/as %file 'utf8

I'm not so sure. Decoding is not the main purpose of READ, but perhaps this is a handy shortcut, although it is redundant to LOAD/as, and I generally like to avoid such things.

I should also point out that LOAD can work on binary strings as well. This usage is valid:

data: read %file
text: load/as data 'utf8

Just as is:

image: load/as data 'jpg

The reverse (SAVE) is also allowed:

data: make binary! 10000
save/as data text 'utf8
write %file data

FAQs (from feedback)

Q1: Are we overloading the LOAD function?

Not if you consider the design abstraction: LOAD "internalizes" external data sources. LOAD is used even today to load more than just code. So it is useful to provide this broad-based load function. Note that the variation lies in the decoding methods LOAD can use, not in its purpose.

Q2: Can other encodings be supported?

Yes. We would allow a run-time expandable table of encoder/decoders (codecs). This would let us support other text encodings, such as ISO code pages, etc.

Q3: How smart is the encoding type detection?

We will allow several methods of detection:

1. use of LOAD/as type spec
2. file suffix (via codec table)
3. encoding type detection (signatures)

So, codec modules would include not just the encode and decode functions, but also the signature algorithms. For example, downloaded JPG binary data includes a signature that identifies it as JPG.
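To make the three detection methods a bit more concrete, here is a minimal sketch of a codec table and a lookup that tries them in the order listed. Everything here (codec-table, detect-type, the layout of each entry) is hypothetical and only meant to show the idea. The real codec table may look quite different.

```rebol
; Hypothetical table: codec name, then its suffixes and leading signature bytes.
codec-table: [
    jpg [suffixes [%.jpg %.jpeg] signature #{FFD8FF}]
    png [suffixes [%.png]        signature #{89504E470D0A1A0A}]
    gif [suffixes [%.gif]        signature #{474946}]
]

detect-type: func [
    "Return a codec name, trying the three methods in order (hypothetical helper)."
    as-type [word! none!]   "explicit type, as would be given to LOAD/as"
    file    [file! none!]   "file name, used for its suffix"
    data    [binary! none!] "raw bytes, used for the signature"
    /local sig
][
    ; 1. An explicit type spec always wins.
    if as-type [return as-type]
    ; 2. File suffix, looked up in the codec table.
    if file [
        foreach [name spec] codec-table [
            if find select spec 'suffixes suffix? file [return name]
        ]
    ]
    ; 3. Signature bytes at the head of the data.
    if data [
        foreach [name spec] codec-table [
            sig: select spec 'signature
            if sig = copy/part data length? sig [return name]
        ]
    ]
    none
]

probe detect-type none %photo.jpeg none                 ; found by suffix
probe detect-type none none #{89504E470D0A1A0A}         ; found by signature
```

Keeping the suffixes and signature next to each codec name is one way a run-time expandable table could work: adding a codec means adding one entry.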

Q4: Can we override the assumed encoding type?

It would be necessary to do that. For example, given an HTML page that includes embedded REBOL code, we may want to load the code and not the HTML. It may look something like this:

load/as %page.html 'code   ; (preliminary)

Will this be enough? Perhaps. There is a lot of code that uses LOAD to load data that has no headers. Most of the time, it also uses a .r extension. But we may need to keep CODE as the default LOAD format and require /as for strings, to make it clear they are strings only. That approach seems fine.

Q5: What if inconsistent /as specification is used?

What happens in this case?

load/as %image.jpg 'gif

That would cause an error exception (the decoder must verify the signature prior to decoding the data).
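As a sketch of what "verify the signature prior to decoding" might mean, a decoder could begin with a check like the one below. The helper name verify-signature and the error message are made up, and it assumes R3, where DO of an ERROR! value raises it.

```rebol
verify-signature: func [
    "Return the data if it starts with the signature, else raise an error (hypothetical helper)."
    data [binary!]
    signature [binary!]
    name [word!]
][
    if signature <> copy/part data length? signature [
        do make error! rejoin ["data does not match the " name " signature"]
    ]
    data
]

gif-data: #{474946383961}     ; the bytes of "GIF89a"

; Asking the JPG check to accept GIF data fails before any decoding happens.
print error? try [verify-signature gif-data #{FFD8FF} 'jpg]
```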

Q6: Is LOAD a mezzanine?

Yes, LOAD is written in REBOL, allowing us to refine its behavior during alpha (and future) releases.
