Comments on: Pruning down READ and WRITE

As you know, read and write functions in R3 will default to binary, rather than string. This is necessary because:

Strings are stored as encoded bytes, for example UTF-8. So in order to decode them, we must know how they are encoded. If the file has no BOM (byte order mark), then its encoding is unknown and must be provided to the function itself.
Running in binary mode will not accidentally corrupt a file. This is the ancient FTP binary/text transfer problem. If you transferred an image using FTP text mode, the file could be damaged if line terminators were found.
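
To make the BOM point concrete, here is a minimal sketch of how raw binary data could be inspected for a BOM. The name bom-encoding? is hypothetical (it is not an existing REBOL function), the encoding words mirror the ones used in this post, and UTF-32 BOMs are left out to keep it short.

```rebol
; Hypothetical helper, not part of REBOL: guess the encoding of raw
; file data from its BOM. UTF-32 BOMs are deliberately ignored here.
bom-encoding?: func [
    "Return utf-8, utf-16le, or utf-16be based on the BOM, or none if there is no BOM."
    data [binary!]
][
    case [
        #{EFBBBF} = copy/part data 3 ['utf-8]
        #{FFFE} = copy/part data 2 ['utf-16le]
        #{FEFF} = copy/part data 2 ['utf-16be]
        true [none]
    ]
]

probe bom-encoding? #{EFBBBF48656C6C6F}   ; UTF-8 BOM followed by "Hello"
probe bom-encoding? #{48656C6C6F}         ; no BOM: the caller has to supply the encoding
```

When the result is none, nothing in the data itself says how to decode it, which is why the encoding would have to be passed in.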

A few days ago, I proposed new read and write functions that add an /as refinement. This refinement allows you to specify the encoding:

data: read/as file 'utf-8
write/as file data 'utf-16le

This is useful because the encoding can be specified as part of the function call. However, this also makes this approach the standard method for reading and writing all encodings, even those for image files, etc. For example:

image: read/as file 'jpeg

Our plan was to add an intermediate layer in read and write to allow for codecs (encoding and decoding). They would be stream oriented to allow for partial transfers, and also "fragments" (when not enough data has been received to finish a well-aligned encoding or decoding process.)
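
To illustrate what a "fragment" means for text, here is a small sketch for UTF-8 only. It counts how many bytes at the tail of a received chunk belong to a character whose remaining bytes have not arrived yet. The function name utf8-fragment-length is hypothetical, and this is not the planned codec interface, just one way a stream decoder could decide what to hold back.

```rebol
; Hypothetical helper, not part of REBOL: count the trailing bytes of
; an unfinished UTF-8 sequence at the end of a chunk.
utf8-fragment-length: func [
    "Number of trailing bytes that belong to an incomplete UTF-8 sequence."
    data [binary!]
    /local len n byte need
][
    len: length? data
    n: 0
    while [all [n < 4  n < len]] [
        n: n + 1
        byte: pick data len - n + 1          ; n-th byte counted from the tail
        if byte < 128 [return 0]             ; ASCII byte: nothing is pending
        if byte >= 192 [                     ; lead byte of a multi-byte sequence
            need: case [byte >= 240 [4] byte >= 224 [3] true [2]]
            return either n < need [n] [0]   ; incomplete if fewer bytes than needed
        ]
        ; otherwise a continuation byte (10xxxxxx): keep scanning backward
    ]
    0
]

chunk: #{48C3}                               ; "H" plus the first byte of a 2-byte character
pending: utf8-fragment-length chunk
ready: copy/part chunk (length? chunk) - pending   ; bytes that can be decoded now
rest: skip chunk (length? chunk) - pending         ; bytes to keep for the next chunk
```

A stream-oriented codec would decode ready immediately and prepend rest to the next chunk it receives.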

Of course, all of that makes read and write more complicated.

The question is: do we want to do that?

Another factor is that the old R2 lower level I/O functions read-io and write-io have been eliminated from R3. The read and write functions have that capability now.

So, we can say that it all boils down to: are read and write lower-level or higher-level functions?

After some consideration, I think they should be lower-level functions.

This would mean that they should be as fast and efficient as possible. This also implies that they should have as few refinements as possible (because, as in any language, the more function arguments there are, even if optional or local, the more overhead the function call has, because those slots must be allocated in the function frame.)

OK, if we do that, a function like read can be defined as:

read file /part size /skip len

That's really pruned down compared to R2. We could prune it even more by adding a seek function to eliminate the /skip refinement.

So, what about the primary REBOL rule of keeping things simple?

This is important, and the solution would be to provide higher level mezzanine functions that provide the necessary encoding.

For example, we could have:

str: read-text file
write-text file str

This read-text function could be smart in many ways. For example, it could examine the BOM of the file to determine the encoding. It would also make the line termination corrections.

Note that because it is a mezzanine function, users can easily improve it over time.
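
As a rough sketch of what such a mezzanine might look like, the following handles only UTF-8, with or without a BOM. It assumes the R3 behavior described above, where read returns binary. The body is an illustration and not the actual implementation.

```rebol
; Sketch only: UTF-8 text with or without a BOM. Assumes that read
; returns binary! (the R3 default discussed in this post).
read-text: func [
    "Read a file as text: strip a UTF-8 BOM, decode, and normalize line endings."
    file [file!]
    /local data
][
    data: read file                      ; raw bytes
    if #{EFBBBF} = copy/part data 3 [    ; UTF-8 BOM at the head?
        data: skip data 3                ; step past it
    ]
    deline to string! data               ; decode UTF-8, then convert CRLF to LF
]

str: read-text %notes.txt                ; %notes.txt is a placeholder file name
```

Other encodings would need real decoders behind the BOM check, which is where the codec discussion comes back in.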

We would also provide an /as refinement:

str: read-text/as file 'utf-16le

and, even:

write-text/bom/as file str 'utf-16le

This indicates that we want the BOM inserted at the head of the file data.
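
The writing side could be sketched the same way. This version is an assumption-laden illustration: it writes UTF-8 only, implements just the /bom refinement, and leaves out /as because other encodings would need encoders that are not shown here.

```rebol
; Sketch only: UTF-8 output with an optional BOM. The /as refinement
; from the proposal is intentionally left out.
write-text: func [
    "Write a string to a file as UTF-8, optionally with a BOM."
    file [file!]
    text [string!]
    /bom "Insert the UTF-8 BOM at the head of the file data"
    /local data
][
    data: to binary! text              ; R3 converts strings to UTF-8 bytes
    if bom [insert data #{EFBBBF}]     ; the BOM goes in front of the text
    write file data
]

write-text/bom %notes.txt "Hello"      ; %notes.txt is a placeholder file name
```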

So, there you go. Let's do a quick survey of REBOLers and get some comments.

12 Comments

Comments:

Paul
16-Apr-2008 14:32:49 I agree with the lower level approach. Maybe we should have some higher level function to handle common tasks.
Brian Tiffin
16-Apr-2008 16:58:32 From reading this I think the only thing I'd miss is the behavioural difference between read/lines and read.
Also, can we add codecs (as REBOL functions) to the /as handler, to avoid pulling out gcc to add a new data stream type? read/as file :mystream or similar? Or system/codecs/...? and Root-Codec helpers.
Cheers
Jimeb
16-Apr-2008 20:17:33 Does the Read/Write properly handle little-endian and big-endian encoding?
Reisacher
17-Apr-2008 1:44:10 Will the new low level read and write provide all the error codes and information that the old read-io and write-io did?
The old write was flawed, as it did not throw an error or give a warning when it could not write all data.
DideC
17-Apr-2008 3:15:02 If they are low level, why not call them by a different name, as was done with read-io/write-io, and then keep the read/write names for higher level (mezzanine) stuff?
It's just the compatibility thingy that could stop us from doing that, but is it compatible with R2 anymore?
rebolek
17-Apr-2008 3:45:52 I think read/write should be low-level and for higher level we can use load/save (load/as instead of read/as and so on).
-pekr-
17-Apr-2008 5:29:07 E.g. Carl proposed read-text, write-text for higher level. I must say I don't like it, because then I don't know, what other things I might expect - read-image? Yes, I know, image is binary, but then I can see ppl do things like:
my-doc: read-text %my-file.doc
Thinking they will get MS Word text loaded :-)
Hmm, I think I prefer one and only read/as %filename.jpg 'jpg-codec so that codec would also tell us, what mode should read work in ...
Henrik
17-Apr-2008 6:01:14 From what I can see, throwing out READ-IO, and having READ be like READ-IO, and then introduce READ-TEXT is just shuffling function names around.
I would consider that a low-level function should defer its short name if a higher level function needs it. In this case READ should be a higher level function. Then invent a new function name for the low-level one, perhaps something like... READ-IO.
:-)
Henrik
17-Apr-2008 6:02:08 Sorry, "defer" is the wrong term. "give up" is more suitable.
-pekr-
17-Apr-2008 8:39:20 The more I think about the topic the more I don't like the route we are trying to take. With typical OOP environment, you get something called polymorphism, so e.g. services.send, services.open, services.wait. And that is something I don't like about REBOL - it might not be polymorphism related (as we are not using OOP principles much), but I don't like the fact we get open-service, wait-service, send-service, and ditto for read-text, write-text. I can easily imagine, that e.g. read-image will have tonnes of refinements anyway, because of various needs for various formats.
My proposition is to not use simple literal word for specification, but callback function instead:
my-image: read/as %my-file-even-without-jpg-suffix :codec-jpeg param1 param2
Simply put - I would free read from knowing anything about the format - it is responsibility of encoder/decoder. I can even imagine such encoders/decoders working in an asynchronous/streamed way - using device model. I think if devices and plug-ins have api, why not codecs?
IIRC even BeOS uses app servers for various such purposes, but I might be wrong. It might be also interesting to look at Amiga Datatypes model too ...
FVA
17-Apr-2008 16:31:20 -pekr- I agree with your suggestions, especially with "Simply put - I would free read from knowing anything about the format - it is responsibility of encoder/decoder. ...". This quote suggests a reminder, IMO, that delegating tasks to the appropriate levels/modules of the system/language is critical to maintain simplicity, consistency, maintainability, reliability, and readability. You want the system to be DECOUPLED and CONCURRENT as much as possible, e.g., interpret it as a network whose nodes work together to accomplish a given task. This will allow faster evolution/improvement of the system while minimizing changes to the way we interact with the system.
An aside question: How concurrent will REBOL 3 be (and what kind of concurrency model), e.g., as compared to Erlang http://www.erlang.org/ I hope Carl addresses the issue of concurrency in a different post. Thanks in advance Carl if you do.
Carl Sassenrath
21-Apr-2008 12:50:01 My reply is detailed. I will post a follow-up blog.
