Comments on: Finalizing READ and WRITE

This article is important and worth reading: we need to finalize the read and write functions this month.

Update: please read my additional notes posted in the comments section.

Quick review

In R3, read and write are entirely different. Unlike R2, these functions no longer use the series-based access model. They use a more traditional stream model, as you'd find in most other languages. As a result, functions like read-io and write-io are no longer needed. You can do those actions with read and write.

One way of thinking of these changes is that read and write are lower level. They basically transfer raw bytes to and from I/O ports.

However, there are some very common usage patterns, especially for file I/O, that we want to support. For example, because we often read text files, we allow the read function to be a little higher-level:

text: read/string %document.txt

This is higher-level because it includes decoding of the bytes into Unicode characters and conversion of CRLF to just LF. It also examines the data to determine if a BOM (byte order mark) is present, so that both UTF-16 encodings are handled properly.
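
To make those steps concrete, here is a rough sketch of the same idea. It is written in Python rather than REBOL because the R3 API is still being decided, and it is not how read/string is implemented. The function name decode_text and the fallback to UTF-8 when no BOM is found are assumptions made for illustration.

```python
import codecs

def decode_text(raw: bytes) -> str:
    """Decode file bytes to text: honor a BOM if present, else assume UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    elif raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    else:
        # Assumption: data without a BOM is treated as UTF-8.
        text = raw.decode("utf-8")
    # Convert CRLF line terminators to a single LF.
    return text.replace("\r\n", "\n")
```

The BOM is only used to pick the decoder and is dropped from the result, so the caller sees plain text with LF line endings whichever of the three encodings the file used.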

In addition, we also support:

write %document.txt text

And, when text is a string, the file data will automatically be string encoded using UTF-8 standard encoding. In addition, the line terminators will be expanded to the local format, such as CRLF on Windows (but not on Linux and others.)
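
The write direction can be sketched the same way. Again this is Python used as neutral illustration, not the R3 implementation; encode_text and its eol parameter are hypothetical names.

```python
import os

def encode_text(text: str, eol: str = os.linesep) -> bytes:
    """Expand LF to the given line terminator, then encode the text as UTF-8."""
    # os.linesep is "\r\n" on Windows and "\n" on Linux and similar systems.
    return text.replace("\n", eol).encode("utf-8")
```

Passing eol explicitly, as in encode_text(text, eol="\r\n"), gives the Windows format on any platform.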

Please note

These special actions are only done to make code easier for simple cases. For everything else, if you want to use read and write, you will be dealing with the byte streams in your own code.

For example, if you open a port to handle a file in chunks, your code is dealing with bytes directly. No encoding is implied. If you need to encode or decode it, you must call one of the codecs.

For example, if you write:

port: open %doc.txt
data: read/part port 100
...

The data is binary! It is a series of bytes up to 100 bytes long. It is not decoded text.

This approach to I/O is consistent. Both the /skip and /part refinements must indicate byte offsets and sizes.
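
One consequence of byte-based sizes is that a chunk boundary can fall in the middle of a multi-byte character, so decoding each chunk on its own is not safe. The Python sketch below shows one way to deal with that, using an incremental decoder that carries partial characters over to the next chunk. The read_chunks helper is hypothetical, and the in-memory stream stands in for a file port.

```python
import codecs
import io

def read_chunks(stream, size):
    """Yield raw byte chunks of at most `size` bytes until the stream ends."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk

# "é" takes two bytes in UTF-8, so 1-byte chunks split it in half.
stream = io.BytesIO("é!".encode("utf-8"))
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in read_chunks(stream, 1)]
text = "".join(pieces) + decoder.decode(b"", final=True)
```

The first chunk decodes to an empty string because the decoder is still waiting for the second byte of the character.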

Finishing up

With that said, it needs to be decided what other ease-of-use actions we need for read and write.

For example, I previously suggested the /as refinement that would allow:

text: read/as %doc.txt 'utf-16be

or the alternate form:

text: read/as %doc.txt 16

and:

write/as %doc.txt text 'utf-16le

or the alternate form:

write/as %doc.txt text 16

The /as refinement lets you specify the encoding of the string.

In addition, we must decide if the Unicode BOM should be written, and what line terminations are needed.

In summary, we must be able to specify:

UTF encoding
BOM present
line termination

That can be done with a function spec like:

write ... /as utf /bom /with eol

But, of course we're adding two more refinements... to a function that we will often want to use at a low-level with high performance. It's probably not going to make much difference, but it's something we want to recognize.

If we don't want to add these refinements, the alternative isn't that pretty either. We'd need to accept a micro-dialect (non-reduced) that specifies the options:

write/as file text [utf-16be bom crlf]

and we'd probably also allow the variation where UTF is integer and the line terminator is a string or character:

write/as file text [16 bom "^m^j"]

And, I probably should mention that the write defaults for /as would be UTF-8, BOM, and local-style terminators.
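
To show how little processing such an options block needs, here is a Python sketch of the idea (not R3 code). Everything in it is an assumption made for illustration: the name encode_with_options, the mapping of integer 16 to big-endian and -16 to little-endian, and the simplification that a BOM is written only when bom is listed.

```python
import codecs
import os

ENCODINGS = {"utf-8": "utf-8", "utf-16be": "utf-16-be", "utf-16le": "utf-16-le",
             8: "utf-8", 16: "utf-16-be", -16: "utf-16-le"}
BOMS = {"utf-8": codecs.BOM_UTF8,
        "utf-16-be": codecs.BOM_UTF16_BE,
        "utf-16-le": codecs.BOM_UTF16_LE}
EOLS = {"lf": "\n", "crlf": "\r\n"}

def encode_with_options(text, options):
    """Encode text according to an options list such as ["utf-16be", "bom", "crlf"]."""
    encoding, bom, eol = "utf-8", False, os.linesep
    for opt in options:
        if opt in ENCODINGS:
            encoding = ENCODINGS[opt]
        elif opt == "bom":
            bom = True
        elif opt in EOLS:
            eol = EOLS[opt]
        elif isinstance(opt, str) and opt and set(opt) <= {"\r", "\n"}:
            eol = opt  # a literal terminator string, e.g. "\r\n"
        else:
            raise ValueError(f"unknown option: {opt!r}")
    data = text.replace("\n", eol).encode(encoding)
    return BOMS[encoding] + data if bom else data
```

Both forms shown above map onto this: ["utf-16be", "bom", "crlf"] and [16, "bom", "\r\n"] produce the same bytes.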

Think about it

Again... none of this matters for custom I/O where you're handling the bytes yourself. This is only for the cases where the entire I/O is handled in a single call to read or write -- the high-level ease-of-use action.

Note also

That the load and save functions do allow for other data encodings for ease-of-use file access... such as those used for images, sounds, etc.

25 Comments

Comments:

Peter Wood
11-Nov-2009 22:37:32 Understandably, you seem to be designing the ease-of-use for a Unicode environment. I still find the vast majority of text files "in the wild" are still not Unicode encoded. Most of them use one of the Windows Code Pages.
I believe that R3 moving to Unicode will highlight the text encoding issues that get ignored in R2. There will be a need for decoding and encoding for many different schemes. (These, of course, can easily be provided in a community written module).
I think that read and write should be kept as lean as possible with the minimum "ease-of-use" refinements. In fact, I would be happy with just two refinements:
read/unicode ;; read/string renamed for clarity
write/utf-8 ;; write a UTF-8 string with a BOM

I believe it would be better to rename read/string as read/unicode as many people do not equate the word string exclusively with Unicode.
All the other ease-of-use options can be provided in a text encoding module.
-pekr-
12-Nov-2009 1:22:53 My opinion is known :-) I am a purist. I can understand your ease-of-use arguments, yet I think that adding read/string was a mistake and I am not sure I like it. With your recent blog, it is not clear where it leads us - we are not able to keep the purpose of 'read/'write as low-level-only functions.
The worst ever thing is, that while you have to call read/string, 'write does not require /string refinement and the action is automatic. Confusing as hell.
'read should be low level - end of story. We are trying to mix responsibility of codecs and 'load into 'read territory. I can understand that codec acts upon readed data only = you have the data in memory twice, but then - right solutions lead to right conclusions. Change the codec model to work in streamed way.
My opinion is, that 'read should stay data format agnostic. The only option for read should be 'read/as
In fact, what is the difference between:
read/string %my-string-file.txt
read/as %my-string-file.txt 'string

... where 'string would be a codec, which would handle all the stuff you mention in your blog.
Given the above, it is clear the codec mechanism would probably need some changes. But once again - let's use the correct methods, even if those initially mean more work. Why not extend the codec API to allow read/write to hook into the codec interface and use it as a decoder/filter?
You are an Amiga guy, aren't you? Amiga has the Datatypes system - programs can read various formats without knowing anything about the format itself! The same should be true for 'read imo. Codecs should be our Amiga datatypes.
We already have the PORT API, we have the DEVICE API, we have got the COMMAND API (extensions), we need a CODEC API. A codec should offer you handlers/methods for what is needed and 'read should be able to hook into such an API. We can even have a dialect upon that, if needed, but please - don't mess it into 'read directly.
What will come next? Another user request for a specific format, which would be nice to handle by 'read directly? Adding more and more refinements, and ending up like R2 read/load with tonnes of refinements? :-)
Sorry if I don't understand the topic correctly, I just try ... but remember - if we do it incorrectly now, there will be no chance for later change!
Henrik
12-Nov-2009 2:54:41 I just want to say: Design for unicode by default, but allow everything else. I'm fine with the current refinements and the mentioned dialect. I don't think there should be any mention of a /unicode refinement, as we already assume the use of unicode for all defaults.
Philippe
12-Nov-2009 3:12:34 Maybe it's not the purpose, but what about the /lines refinements ? In R2, I use read/lines or write/append/lines (especially for csv datas) every day.
DideC
12-Nov-2009 4:35:23 IIRC, read/string was added for ease of use AND memory saving while reading text file.
But as Pekr pointed out, I'm concerned about the limit of this "high level" functionality!
IMO, having a generic and simple method would be better in this regard.
I know that unicode is not a "codec" in R3 as it's its natural string! encoding, but a unicode text file could have different encodings (utf8, utf16, be, le, with/without BOM...) and text in general could have other encodings (latin1, ascii, ebcdic...). And don't forget that 'read could be used to read other kinds of data (image, sound, exe...). So I tend to think that we must find a general way to specify the encoding of the file being read. Something like "read/as %file [encoding dialect]".
But I wonder if handling this kind of thing should not be left to 'load instead!
So what about devoting unicode/text file reading to load with some kind of refinement?
-pekr-
12-Nov-2009 5:32:39 This is how MorphOS does it: http://krashan.ppa.pl/reggae/
RonH
12-Nov-2009 11:39:14 Since I use Rebol mainly for parsing and processing text, I like the read/string form. I have used read/lines, but prefer to be able to read individual lines, since some of my files are rather large. I found this messy with R2. This proposal simplifies my life, and I like it.
On the other hand, I do process binary data, too, so having the default reading of raw binary is very handy.
PatrickP61
12-Nov-2009 11:58:45 I am with Pekr on this one.
IMO, Read & Write should be as simple as possible, (Unicode being the default) with the ability to "hook" into a separately defined codec to accommodate all of todays formats as well as those yet to be created.
It should be clean and straightforward by separating how data is encoded vs where to transit those bytes.
Mark Ingram
12-Nov-2009 12:16:53 My only comment at this point is, please, please, PLEASE, do not put out a BOM for UTF-8 files by default. You can handle one on input, that is OK, but UTF-8 files do not need (nor should they ever have) a BOM. Anyone who does that is making a big mistake.
To quote The Unicode Standard:
"Use of a BOM is neither required nor recommended for UTF-8"
Reisacher
12-Nov-2009 12:29:49 I would prefer READ/WRITE staying low level. The data encoding should be reserved for LOAD/SAVE.
Carl Sassenrath
12-Nov-2009 13:36:51 This is a fairly complex topic because we are trying to find the best balance between efficiency and usability.
But, let me clarify a few concepts:
In R3, read and write are low level, raw data functions. They are mainly byte-oriented I/O functions. This allows them to work for /seek and /part, but also for things like TCP streaming, etc.
In R3, load and save are high level, processed data functions (that can encode, decode, convert). These functions allow codecs for conversions and they can also recognize file suffixes, and later, MIME types. For example, you load an image! from a file, but if you read that file, you get binary!, not an image! datatype.
We added read/string simply for user convenience, because we tend to do that so often.

Of course, the problem is, that once we open that door (read/string), many other options start to flood in... allowing strings leads to encoding/decoding issues, various UTF formats, code pages, BOM, and CR-LF conversion. Then, we start asking about other codecs... and soon read and load become the same function. (And I've not even mentioned write, which has even more issues.)
However, and to reply to Pekr's comment, the "Amiga-like datatypes" model is part of load and save. Those are implemented as mezzanine functions, so they allow us to expand and improve these capabilities quite easily. That is where codec handling belongs.
I think it is a mistake to "blur" the line between read and load, and that will happen if we add too much capability to read. We want read to mean "get the raw data" and load to mean "get the decoded data".
So, the only amount of "blurring" we want is if it provides a great practical benefit. Originally, because R2 allows read of string data, I thought that R3 should also allow it, just for ease-of-use. However, I am beginning to think that it's problematic.
I will leave this topic open for a few more days... then make a decision. There are other things that need our attention for beta.
PS: Mark, correct, we don't write BOM for UTF-8. However, there is an option.
-pekr-
12-Nov-2009 14:17:57 Carl, if you are looking for a convenience solution, I agree with Henrik that dialect is a good solution then. We use such dialects even elsewhere. E.g. 'secure function. It provides us with isolation/flexibility, and it can be specified in one block, not in tonnes of possible refinements ...
Brian Hawley
14-Nov-2009 18:22:20 Pekr, "use a dialect" and "low-level function" are usually at odds with each other. Dialects have processing overhead.
Now, we need functions that do the same thing as read/string and write string! to be built into R3, since text processing is one of the most common things done in programming today. We could drop parse before we could drop Unicode I/O.
However, the arguments after that break down.

These functions don't have to be built into read and write. That is just proposed as a convenience thing. If it turns out to be not convenient, we can do this some other way.

We can build these functions outside of read and write without a loss in efficiency. However, don't expect the code to be a simple one-liner. We will likely have to implement these as low-level port model calls, with looping and buffer management. The code will likely still need to be native for efficiency, but could theoretically be mezzanine if the infrastructure is well designed.

Pekr, stop arguing about where this could lead to. This is programming, not heroin. We can specify exactly how far we will go with this, and stop there. We really don't have to take things to their "logical conclusion" every time :)

(While we're at it) Pekr, please stop acting like this feature is a threat to your beloved streamed codec proposal, and thus should be opposed automatically. It's not a threat to streamed codecs, and may eventually be implemented with streamed codecs :)

Peter Wood, there is nothing about this model that prevents you from using text encoded in legacy encodings. Carl's /as option, perhaps implemented with Pekr's codecs, would be a good solution to your problems. But don't complain about the use of Unicode: We (as a planet) need an interoperable character standard, and that is the one we (as an industry) decided on. Legacy data is just that.

Carl, read and write are not necessarily binary: That depends on the port scheme. They are definitely stream-oriented, but the stream is not necessarily of bytes. Some schemes are block oriented (such as directories), and some return arbitrary types (such as the clipboard, in theory). That these conversions only apply to byte streams is a good argument for taking them out of read and write and making them separate functions.

Brian Hawley
14-Nov-2009 18:26:05 Given all that, I suggest that we could add read-string and write-string functions and have them do all of these string conversions. The /lines and /as options could be moved to these functions from read and write, which would go back to being low-level efficient wrappers around the port model. The default behavior of read-string would be the Unicode autodetection, and of write-string would be UTF-8 with platform line encodings and no BOM. The /as option could initially be just the Unicode encodings, but could be later extended to other text encodings using Pekr's codec model once we have that - but only text encodings, line endings and BOM-or-not, no image codecs or such. And /as could take a dialect since these would no longer be as low-level as read and write.
These could even be mezzanines in theory, although they would be advanced, inscrutable mezzanines like load for efficiency reasons, at least until we adapt them to use streamed codecs - this would allow the rapid development and refinement seen in load with the low-level stuff being in infrastructure natives. We might want to go full native once the design is finalized and we have streamed codecs, because this code will be called a lot.
The names could be changed, but should not include the word Unicode or UTF (sorry, Peter).
Or, we could keep read /string and/or /lines and write string! or block of strings and just stop there (no "taking this to its logical conclusion"), and then still make the above functions (now optional) to cover the legacy and obscure cases.
Discuss! Choose! :)
Nick
14-Nov-2009 23:51:38 This seems clear cut: 'We want read to mean "get the raw data" and load to mean "get the decoded data"'. I don't have any problem using 'load and 'save for strings. I do agree that whatever the decision, the syntax should be consistent for both reading and writing.
-pekr-
15-Nov-2009 1:57:37 Yes, we can specify how far we want to go with it. You are doing so, and I am doing so. You are fine with adding read-string, and I am speculating about how far it can go. Even rebservices doubled all possible functions - open-service, do-service, etc.
So, I can't wait to have all those read-jpg, read-word2K, read-word2003, read-word2007 functions at hand, because - those are the most used data formats in the world :-)
All I am asking for is - let's think before acting. Because once some path is chosen, it will not be easy to revert (e.g. to remove some function later, if we change our mind, for compatibility reasons).
My initial suspicion was, that what is being proposed is only because the other way might be more difficult to do, and we are pressed for the time. If not, then everything is OK. I know Carl's line of thinking - if some feature covers 80% of user's usage, it might deserve an exception. And I am OK with that. I just want to be sure, it does not necessarily provide us with excuse to choose the same way for other minor formats ...
So - whatever you choose, be sure it is flexible and allows some other formats to plug-in seamlessly in the future.
Brian Tiffin
15-Nov-2009 22:42:45 Wait...
Are we talking a potential ability to LOAD text! (likely including non-REBOL) type data? In an ease of use mode?
I'll quit whining, if someone says "maybe". Cheers
Brian Hawley
16-Nov-2009 12:38:34 Brian, we are talking about being able to READ text. In various character encodings. Into a Unicode string!. Without further translation into REBOL types. Sorry.
Pekr, a collection of advanced conversion and import filters would be a great thing to add - as a set of add-on modules, maybe even extensions in some cases. The vast majority of people won't need all of these conversions, and having them built in would just be extreme bloat.
However, general infrastructure code for adding these codecs would be a great thing to build in, probably following your streaming codec model. The codecs themselves would be library modules/extensions, but the mechanism to load and use them would be mezzanine/native.
What we want to avoid at all costs though is the old-rebservice-style function name explosion. As Fork has demonstrated in R3 chat (with the append-block proposal), adding simple, overly specific, special-purpose functions will lead to an unworkable overload of nearly identical functions to remember. Adding read-jpg, read-word2K, read-word2003, read-word2007 and so on would be bad.
It would be better to add only the functions read-text and write-text (better names than read-string and write-string) which specify as a parameter what you are reading from. That way you would just include a couple functions to wrap the codec infrastructure, and then add codecs without adding function names. And codec autodiscovery would let you avoid having to specify the data format when using read-text if it can be guessed from context - something that wouldn't work as well with overly specific functions. You would have to specify the output format for write-text, but in both cases it would just be a word flag. This model would be simple to learn and easy to extend.
We shouldn't just add this functionality to load and save because their focus is on REBOL data, not text. Text processing would have a different, incompatible set of options. Those functions are complex enough already anyways. We could share the codec infrastructure though.
However, that doesn't mean that there isn't a value to including basic UTF conversions in read and write. Unicode processing is so basic that we already included a datatype for the purpose (string!). Unicode I/O is even needed to load and save scripts. We could drop the codec infrastructure altogether and we would still need to do Unicode I/O. I'm not promoting this, just providing a rationale that would justify this if we decided to keep this behavior in read and write.
meijeru
16-Nov-2009 15:00:50 The examples of a numerical encoding indicator do not make a difference between utf-16le and utf-16be. I suppose the convention there should be the same as for the result of the utf? native (+16 for be, -16 for le).
Maxim Olivier-Adlhoch
18-Nov-2009 10:16:13 I'd rather we use an explicit function set like encode/decode, and enhance them to support streams. we are encoding and decoding data.
Unicode, even if its "just" text, has to go through a codec to be usable in rebol, don't see why I'd treat it differently.
using a codec-specific dialect, we can support any datatype, including text, using a simple and consistent method.
the encode/decode funcs should support url, binary, or file paths, not just binary.
"reading" unicode is a fallacy. its not a stream of bytes, its a stream of encoded data representing text.
Just as we could read raw images, we could not read formatted images like jpg or png.
if someone wants to add 'READ-TEXT to their own function list, cool, but I definitely do not want this.
Ashley
18-Nov-2009 19:03:58 Agreed. Keep the functions atomic.
Chris
19-Nov-2009 15:48:52 Brian, from a naming perspective - as read-io/write-io are no longer in use, why not repurpose those words to cover the lowest-level read/write functions, then use read/write at the textual level? This would have the benefit of read/write taking on a more familiar form. 'read-io has a more succinct and purposeful meaning for low-level ops than read-text does for more common uses...
meijeru
25-Nov-2009 6:06:10 (at)Maxim, I fully agree. This is perhaps related to my Curecode ticket #1104 (dismissed) which asked for to-binary and to-char/to-string to NOT do automatic UTF-8 encoding/decoding.
Brian Hawley
25-Nov-2009 15:40:21 Meijeru, the consequence of having to-binary not UTF-8 encode chars and strings is that to-binary would not work on chars and strings at all. The to-* functions are the sensible default conversions. They are not supposed to be the only available method.
Maxim, what you are talking about is what I referred to as Pekr's streamed codec proposal, since he was the first to complain about it not being there. The read-text and write-text functions would be mezzanine wrappers around those streamed codecs that make choosing the conversions easier. The encode and decode functions are also mezzanine wrappers around the same codec infrastructure. We are in agreement (just to let you know).
meijeru
26-Nov-2009 7:04 I have to respect this point of view although I don't agree: to-binary char! could very well be equivalent to to-binary to-integer char! (possibly truncated to 24 or 32 bits). Still I understand that you agree that explicit codecs for UTF-8, -16BE and -16LE should exist, with to-binary being an implicit call of the UTF-8 codec.
