Comments on: Unicode: READ and LOAD
Unicode changes some important usage patterns in REBOL. So, I want to talk about them here.
This is the first part of several blogs on the R3 Unicode topic.
First, you should know:
- READ reads binary (bytes) and does not decode them. When you read from a file, the bytes are often encoded. They could be UTF8 or UTF16, but they could also be JPEG or WAVE. So, READ handles raw bytes, not content.
- LOAD decodes binary into a specific datatype. Just as we've used LOAD before to decode REBOL blocks, JPEG, GIF, or PNG, it can decode other types too, such as UTF8 into a STRING!.
These imply a lot. Just think about it. If you write:
data: read %page.html
How is the data encoded? It could be LATIN1, UTF8, UTF16, or something else. If all we want to do is save it as a file or upload it, then we may not care what it is. We can write:
write %newpage.html data
The nice thing is that this works for all datatypes (including images and sounds), because it's just binary data.
Now, if we want to process the data, we need to take a different approach. We will need to decode the data. That is the purpose of LOAD.
We will need to write a line like:
page: load/as %page.html 'utf8
Now we know how to decode the raw bytes; they are UTF8 encoded. The result (page) is a proper STRING! type that we can process normally.
We could also write:
page: load/as %page.html 'latin-1
for Latin-1 style encoding (as used in REBOL2).
To write it, we would need to re-encode the page. That might look like this:
save/as %page.html page 'utf8
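The decode, process, re-encode cycle can be sketched outside REBOL. The Python below is only an illustration of the idea (raw bytes in, a string for processing, raw bytes out). The names `load_as` and `save_as` are hypothetical stand-ins that work on bytes in memory; they are not the REBOL functions.

```python
def load_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes into a string using the named text encoding."""
    return raw.decode(encoding)


def save_as(text: str, encoding: str) -> bytes:
    """Encode a string back into raw bytes using the named text encoding."""
    return text.encode(encoding)


# The bytes a raw read would hand back for the text "cafe" with an accented e.
raw_utf8 = b"caf\xc3\xa9"

# Decoding gives a string we can process character by character.
page = load_as(raw_utf8, "utf-8")

# Bytes and characters are not the same thing: 5 bytes, 4 characters.
assert len(raw_utf8) == 5
assert len(page) == 4

# Re-encoding with a different codec yields different bytes for the same text.
raw_latin1 = save_as(page, "latin-1")
assert raw_latin1 == b"caf\xe9"
```

The point of the sketch is that the same string can map to different byte sequences, which is why the encoding has to be named (or detected) at both ends.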
Note that this method of encoding and decoding is generalized in REBOL. There can be many different codecs (encoder/decoders). So, just as we handled UTF8 above, we may also write:
image: load/as %image.jpg 'jpg
save/as %image.jpg image 'jpg
A shortcut is to allow:
image: load %image.jpg
where we automatically detect the file encoding and apply the correct decoder.
How is the encoding detected? We have two hints: the suffix and the file header. Both of those can be used to tell us we have a JPG file.
Can we use something similar for text files? It is possible. Unicode files often provide a BOM, a byte order marker, to tell us what they are. That would allow:
page: load %page.html
Ah, but wait. Isn't that how we load REBOL code too?
In fact it is, so we'll need to be careful, and also support some refinements to help us know what is being requested.
For example, the line:
code: load %code.r
is easy because we can see it is a .r file. But what if we wrote:
code: load %code
That is valid to do, but what is being loaded?
We have two choices:
- Use some easy to remember rules
- Require explicit information
I think both should be allowed. For example, if %code contains a REBOL header, then we will load it as code. If the %code file contains a REBOL header, but also a UTF16 BOM, we will still load it as REBOL (converting it as required).
Now, if you want the file decoded as a string rather than REBOL, you'd need to state that:
text: load/string %code
and the BOM is used if present; if none is found, the file is assumed to be UTF8 encoded. We can also do this:
text: load/string/as %code 'utf8
if we know it is UTF8 and we need to handle it that way.
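The "use the BOM, otherwise assume UTF8" rule can be sketched as follows. This is Python for illustration only, not the REBOL implementation; `load_string` is a hypothetical name, and the set of BOMs checked is my assumption about what a detector would reasonably cover.

```python
import codecs

# Checked in this order because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def load_string(raw: bytes, default: str = "utf-8") -> str:
    """Decode bytes to text: honor a BOM if present, else use the default."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM so it does not appear in the decoded text.
            return raw[len(bom):].decode(encoding)
    # No BOM found: fall back to the assumed encoding (UTF-8 here).
    return raw.decode(default)
```

Passing an explicit `default` plays roughly the role of the /as refinement for BOM-less data: the caller states the encoding instead of relying on the fallback.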
Returning to READ for a minute: should we allow some method of decoding within the READ function? Should it be similar to LOAD?
text: read/as %file 'utf8
I'm not so sure. Decoding is not the main purpose of READ, but perhaps this is a handy shortcut, although it is redundant to LOAD/as, and I generally like to avoid such things.
I should also point out that LOAD can work on binary strings as well. This usage is valid:
data: read %file
text: load/as data 'utf8
Just as is:
image: load/as data 'jpg
The reverse (SAVE) is also allowed:
data: make binary! 10000
save/as data text 'utf8
write %file data
FAQs (from feedback)
Q1: Are we overloading the LOAD function?
Not if you consider the design abstraction: LOAD "internalizes" external data sources. LOAD is used even today to load more than just code. So it is useful to provide this broad-based load function. Note that the variation on the abstraction is in the decoding methods it can use, not in its purpose.
Q2: Can other encodings be supported?
Yes. We would allow a run-time expandable table of encoder/decoders (codecs). This would let us support other text encodings, such as ISO code pages, etc.
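A run-time expandable codec table might look something like the sketch below. It is written in Python purely for illustration; `Codec`, `CODECS`, `register_codec`, `load_as`, and `save_as` are hypothetical names, and the real table layout has not been decided.

```python
from typing import Callable, Dict, NamedTuple


class Codec(NamedTuple):
    """One table entry: a decoder and its matching encoder."""
    decode: Callable[[bytes], object]
    encode: Callable[[object], bytes]


# The codec table, keyed by the type name a caller would pass.
CODECS: Dict[str, Codec] = {}


def register_codec(name: str, decode, encode) -> None:
    """Add (or replace) a codec while the program is running."""
    CODECS[name] = Codec(decode, encode)


def load_as(raw: bytes, name: str):
    """Decode raw bytes with the named codec."""
    return CODECS[name].decode(raw)


def save_as(value, name: str) -> bytes:
    """Encode a value to raw bytes with the named codec."""
    return CODECS[name].encode(value)


# A codec that ships by default.
register_codec("utf8", lambda b: b.decode("utf-8"), lambda s: s.encode("utf-8"))

# A code page added later, at run time, without touching load_as or save_as.
register_codec("cp1252", lambda b: b.decode("cp1252"), lambda s: s.encode("cp1252"))
```

Because the loader only looks names up in the table, supporting a new code page means registering one more entry.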
Q3: How smart is the encoding type detection?
We will allow several methods of detection:
- use of LOAD/as type spec
- file suffix (via codec table)
- encoding type detection (signatures)
So, codec modules would include not just encode/decode functions, but also the signature algorithms. For example, downloaded JPG binary data includes a signature that identifies it as JPG.
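The three detection methods could be combined as in this Python sketch. The priority order (explicit type first, then suffix, then signature) is my assumption for illustration, as are the name `detect_type` and the small tables; the byte signatures themselves are the standard ones for JPEG, PNG, and GIF.

```python
import os
from typing import Optional

# Suffix table: file extension to codec name.
SUFFIXES = {".jpg": "jpg", ".jpeg": "jpg", ".png": "png", ".gif": "gif"}

# Signature table: leading bytes to codec name.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]


def detect_type(raw: bytes,
                filename: Optional[str] = None,
                as_type: Optional[str] = None) -> Optional[str]:
    """Pick a codec name, or return None if nothing matches."""
    # 1. An explicit type spec from the caller always wins.
    if as_type is not None:
        return as_type
    # 2. Otherwise try the file suffix, ignoring case.
    if filename is not None:
        suffix = os.path.splitext(filename)[1].lower()
        if suffix in SUFFIXES:
            return SUFFIXES[suffix]
    # 3. Finally, look at the leading bytes of the data itself.
    for magic, name in SIGNATURES:
        if raw.startswith(magic):
            return name
    return None
```

Signature detection is what makes the method work on downloaded data, where there may be no file name at all.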
Q4: Can we override the assumed encoding type?
It would be necessary to do that. For example, given an HTML page that includes embedded REBOL code, we may want to load the code and not the HTML. It may look something like this:
load/as %page.html 'code ; (preliminary)
Will this be enough? Perhaps. There is a lot of code that uses LOAD to load data that has no headers. Most of the time, it also uses a .r extension. But, it could be that we may need to keep the default LOAD format for CODE and require /as for strings to clarify they are strings only. That approach seems fine.
Q5: What if inconsistent /as specification is used?
What happens in this case?
load/as %image.jpg 'gif
That would cause an error exception (the decoder must verify the signature prior to decoding the data).
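A check of this kind can be sketched in Python as follows. This only illustrates "verify the signature before decoding"; `verify_signature` and `CodecMismatch` are hypothetical names, and the real decoder may report the error differently.

```python
# Leading bytes each codec expects to see (standard JPEG and GIF signatures).
SIGNATURES = {
    "jpg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


class CodecMismatch(ValueError):
    """Raised when the data does not carry the requested codec's signature."""


def verify_signature(raw: bytes, as_type: str) -> None:
    """Raise CodecMismatch unless raw starts with a signature of as_type."""
    # bytes.startswith accepts a tuple and matches any of its members.
    if not raw.startswith(SIGNATURES[as_type]):
        raise CodecMismatch(f"data does not look like {as_type}")


jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 8  # JPEG-like header, dummy payload

verify_signature(jpeg_bytes, "jpg")   # passes silently

try:
    verify_signature(jpeg_bytes, "gif")   # the inconsistent request
    mismatch_detected = False
except CodecMismatch:
    mismatch_detected = True

assert mismatch_detected
```

Failing early keeps a GIF decoder from ever being run over JPEG bytes.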
Q6: Is LOAD a mezzanine?
Yes, LOAD is written in REBOL, allowing us to refine its behavior during alpha (and future) releases.
I like it. If you do decide to do read/as and write/as, think of them as shortcuts that would conserve memory over their load/as read or write save/as counterparts. Will the encodings be extensible - will we be able to add new encodings, maybe at runtime?
Yes, there will be a way to add encodings at runtime... allowing folks to handle other code-pages.
Added FAQ section to above blog. Hit reload.
I have to say, I'm a little uncomfortable with this (perhaps I'm set in my R2 ways). I've always thought of 'read as reading text (or binary as an option), and 'load for loading something more structured -- data, images, etc.
My inclination would be for 'read to import a string as 'utf-8 by default with read/as 'binary, 'latin as refinements. I'd possibly retain 'read/binary too:
read/as (url) 'latin-1
; this would convert latin to a utf-8 string
This approach is both backward compatible and, IMO, most intuitive.
Chris, it is a good point, and making this definitional change in READ has come with some concerns, such as the one you mention. It is quite useful to be able to read text data that way.
I should mention that the change in READ was also made for the general case of binary data transfer. Take this simple script example:
write %image.jpg read %graphics/image.jpg
In R2 the image.jpg would get corrupted by the fact that it was treated as a text string (with bytes that look like line terminators being modified).
So, that's another reason we changed it... to reduce the likelihood of misuse.
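To make the corruption concrete, here is a small Python sketch. It models a text-mode round trip as a CRLF to LF conversion, which is an assumption standing in for whatever line-terminator handling a text read performs; the function names are hypothetical.

```python
def copy_as_text(raw: bytes) -> bytes:
    """Simulate a copy that treats the bytes as text and normalizes CRLF to LF."""
    return raw.replace(b"\r\n", b"\n")


def copy_as_binary(raw: bytes) -> bytes:
    """Simulate a copy that passes the bytes through untouched."""
    return raw


# Image-like data that happens to contain the byte pair 0D 0A.
image = b"\xff\xd8\xff\xe0" + b"\x12\r\n\x34" + b"\xff\xd9"

damaged = copy_as_text(image)
intact = copy_as_binary(image)

assert intact == image              # binary copy is byte-for-byte identical
assert damaged != image             # text copy silently changed the data
assert len(damaged) == len(image) - 1   # one byte (the CR) was dropped
```

Any binary format can contain those two bytes by chance, so only a raw byte copy is safe.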
In addition, even for TEXT, users must understand that READ in R3 is not identical to R2. READ in R2 takes bytes and assumes they are valid text. R3 does not assume that. It assumes that bytes are an encoding of text (or image or whatever).
Although this change may be somewhat uncomfortable for existing users, I think we need to make it. But, please let me know if you think there's a better approach.
Finally, a clear and concise differentiation of I/O layers,
just like you have done for the 2-step importing of modules.
read/binary always was a quirk which generated frustration for all novices.
REBOL is maturing in the opposite direction from all other languages: it's improving its consistency!