Finalizing READ and WRITE

This article is important and worth reading: we need to finalize the read and write functions this month.

Update: please read my additional notes posted in the comments section.

Quick review

In R3, read and write are entirely different. Unlike R2, these functions no longer use the series-based access model. They use a more traditional stream model, as you'd find in most other languages. As a result, functions like read-io and write-io are no longer needed. You can do those actions with read and write.

One way of thinking of these changes is that read and write are lower level. They basically transfer raw bytes to and from I/O ports.

However, there are some very common usage patterns, especially for file I/O, that we want to support. For example, because we often read text files, we allow the read function to be a little higher-level:

text: read/string %document.txt

This is higher-level because it decodes the bytes into Unicode characters and converts CRLF to just LF. It also examines the data for a BOM (byte order mark) so that both UTF-16 encodings are handled properly.
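To make those steps concrete, here is a rough sketch in Python (not REBOL, and not how R3 implements it). The function name decode_text is hypothetical, and two things are assumptions rather than anything stated above: that data without a BOM is treated as UTF-8, and that a UTF-8 BOM is also recognized.

```python
import codecs

def decode_text(raw: bytes) -> str:
    """Decode raw file bytes to text, roughly as described for read/string."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # BOM FE FF: UTF-16 big-endian
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    elif raw.startswith(codecs.BOM_UTF16_LE):
        # BOM FF FE: UTF-16 little-endian
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF8):
        # Assumption: a UTF-8 BOM is stripped too
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    else:
        # Assumption: no BOM means UTF-8
        text = raw.decode("utf-8")
    # Convert CRLF line endings to just LF
    return text.replace("\r\n", "\n")
```

The point is only that the caller gets back characters, not bytes, and never sees the BOM or the CR.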

In addition, we also support:

write %document.txt text

When text is a string, the file data will automatically be encoded as UTF-8. In addition, the line terminators will be expanded to the local format, such as CRLF on Windows (but not on Linux and others).
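The write side can be sketched the same way, again in Python rather than REBOL. The name encode_text and its eol parameter are hypothetical. os.linesep stands in for "the local format", and the sketch assumes the string uses LF internally.

```python
import os

def encode_text(text: str, eol: str = os.linesep) -> bytes:
    """Encode a string for writing: local line terminators, then UTF-8."""
    # Normalize first so an existing CRLF is not expanded twice
    normalized = text.replace("\r\n", "\n")
    # Expand LF to the requested terminator, then encode as UTF-8
    return normalized.replace("\n", eol).encode("utf-8")
```

Calling encode_text("a\nb", eol="\r\n") gives the bytes a Windows system would write, while the default follows whatever platform the code runs on.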

Please note

These special actions are only done to make code easier for simple cases. For everything else, if you want to use read and write, you will be dealing with the byte streams in your own code.

For example, if you open a port to handle a file in chunks, your code is dealing with bytes directly. No encoding is implied. If you need to encode or decode it, you must call one of the codecs.

For example, if you write:

port: open %doc.txt
data: read/part port 100
...

The data is binary! It is a series of bytes up to 100 bytes long. It is not decoded text.

This approach to I/O is consistent. Both the /skip and /part refinements must indicate byte offsets and sizes.
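Why byte offsets matter is easy to see with a small sketch. This is Python, not REBOL: io.BytesIO stands in for an open file port, and the helper read_part is a hypothetical name that imitates a byte offset plus a byte count.

```python
import io

def read_part(port, skip: int, part: int) -> bytes:
    """Return at most `part` bytes starting `skip` bytes into the port."""
    port.seek(skip)         # offset is counted in bytes, not characters
    return port.read(part)  # result is raw bytes, never decoded text

# "naive cafe" with two accented letters, stored as UTF-8
port = io.BytesIO("na\u00efve caf\u00e9\n".encode("utf-8"))

data = read_part(port, 2, 4)   # 4 bytes
text = data.decode("utf-8")    # decoding is a separate, explicit step
```

Here the four bytes decode to only three characters, because the accented letter takes two bytes in UTF-8. A chunk that ends in the middle of such a character will not decode on its own, which is exactly why chunked code has to manage decoding itself.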

Finishing up

With that said, we need to decide what other ease-of-use actions read and write should support.

For example, I previously suggested the /as refinement that would allow:

text: read/as %doc.txt 'utf-16be

or the alternate form:

text: read/as %doc.txt 16

and:

write/as %doc.txt text 'utf-16le

or the alternate form:

write/as %doc.txt text 16

The /as refinement lets you specify the encoding of the string.

In addition, we must decide if the Unicode BOM should be written, and what line terminations are needed.

In summary, we must be able to specify:

UTF encoding
BOM present
line termination

That can be done with a function spec like:

write ... /as utf /bom /with eol
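To show how the three settings would combine, here is a sketch in Python (not REBOL). The function encode_as and its parameters utf, bom, and eol are hypothetical names that simply mirror the three refinements, and the encoding names are Python's codec names, not the words proposed above.

```python
import codecs

# BOM bytes for each encoding this sketch knows about
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
}

def encode_as(text: str, utf: str = "utf-8", bom: bool = False, eol: str = "\n") -> bytes:
    """Encode text with a chosen UTF encoding, optional BOM, and line terminator."""
    # Apply the line terminator while the data is still a string
    body = text.replace("\r\n", "\n").replace("\n", eol)
    # Prefix the BOM only when asked for
    prefix = BOMS[utf] if bom else b""
    return prefix + body.encode(utf)
```

For example, encode_as("hi\n", "utf-16-be", bom=True, eol="\r\n") produces the BOM followed by big-endian UTF-16 with a CRLF terminator. The defaults in this sketch are chosen for brevity and are not a proposal.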

But, of course we're adding two more refinements... to a function that we will often want to use at a low-level with high performance. It's probably not going to make much difference, but it's something we want to recognize.

If we don't want to add these refinements, the alternative isn't that pretty either. We'd need to accept a micro-dialect (non-reduced) that specifies the options:

write/as file text [utf-16be bom crlf]

and we'd probably also allow the variation where UTF is integer and the line terminator is a string or character:

write/as file text [16 bom "^m^j"]

And, I probably should mention that the write defaults for /as would be UTF-8, BOM, and local-style terminators.

Think about it

Again... none of this matters for custom I/O where you're handling the bytes yourself. This is only for the cases where the entire I/O is handled in a single call to read or write -- the high-level ease-of-use action.

Note also

The load and save functions do allow for other data encodings for ease-of-use file access, such as those used for images, sounds, etc.
