REBOL 3.0

Comments on: Defining binary datatype insert actions

Carl Sassenrath, CTO
REBOL Technologies
4-Mar-2008 23:58 GMT

Article #0118

First, keep in mind that bytes are not characters in R3. Characters may hold Unicode values, which may require more than one byte, and the form of those bytes depends on the Unicode encoding used.
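To make the byte/character distinction concrete, here is a minimal sketch in Python rather than REBOL. Python is used only because its encoding behavior is easy to check; it is an illustration, not a statement about the R3 API. It shows a single character taking a different number of bytes, in a different order, depending on the encoding.

```python
# Illustration only: Python, not REBOL. One character, several byte forms.
ch = "\u20ac"  # EURO SIGN, a single Unicode character

utf8 = ch.encode("utf-8")          # three bytes
utf16le = ch.encode("utf-16-le")   # two bytes, low byte first
utf16be = ch.encode("utf-16-be")   # two bytes, high byte first

print(len(ch), utf8.hex(), utf16le.hex(), utf16be.hex())
# prints: 1 e282ac ac20 20ac
```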

Now, what does it mean to INSERT or APPEND to a binary value? For example:

bin: make binary! 100
append bin 123
append bin 12:30
append bin "test"
append bin http://www.rebol.com

In R2, BINARY! was defined as raw data that, for actions such as those above, defaulted to text (meaning that CR LF line terminators were kept as-is, not converted to a single LF as is REBOL's convention). Because of that, we defined inserting an integer to mean: FORM the integer first, then insert the resulting characters into the binary.

In R3, BINARY! is defined as an encoded set of bytes. The encoding depends on what the binary holds. For example, text can be encoded in many ways, such as UTF8, UTF16(LE), UTF16(BE), and even Latin1 or other code page encodings. If we insert an integer, 123, what does it mean? Do we want to FORM it and insert it as UTF8 as the default? Or do we just insert the byte whose value is 123?
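The two readings of inserting 123 produce different bytes. A minimal sketch, again in Python for illustration only (the variable names are made up for this example, and nothing here reflects what R3 actually does):

```python
# Illustration only: Python, not REBOL. Two readings of "append bin 123".
n = 123

# Reading 1: FORM the integer, then insert the characters as UTF-8 text.
formed = bytearray()
formed += str(n).encode("utf-8")

# Reading 2: insert a single byte whose value is 123.
raw = bytearray()
raw.append(n)  # only valid for 0..255 in this sketch

print(formed.hex(), raw.hex())
# prints: 313233 7b
```

Note that the raw-byte reading as sketched only covers values from 0 to 255; what a larger integer should do is a further choice any definition would have to make.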

So, right now in R3, the action:

append bin "text"

is ambiguous. Does it mean to append the single bytes for "text", to use UTF8 encoding, or something else?
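For a sense of what the candidates look like at the byte level, here is a short Python sketch (illustration only, not REBOL; the choice of encodings follows the ones named earlier in this post):

```python
# Illustration only: Python, not REBOL. Candidate byte forms for "text".
s = "text"

candidates = {
    "utf-8": s.encode("utf-8"),
    "latin-1": s.encode("latin-1"),
    "utf-16-le": s.encode("utf-16-le"),
    "utf-16-be": s.encode("utf-16-be"),
}

for name, data in candidates.items():
    print(f"{name:10} {data.hex()}")
```

For a pure-ASCII string like this one, the UTF-8 and Latin-1 bytes are identical, so the choice only becomes visible with UTF-16 or with characters outside ASCII.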

We need to define it. We can pick any meaning we want, but it should be clearly stated. It can even be an error.

So, think about it and post your ideas.

12 Comments

Gregg Irwin
4-Mar-2008 19:38:55
Would it make sense to have an /as refinement, like LOAD and SAVE (assuming that's still the model)?

I don't remember if there's a default encoding though, which keeps things simple in homogeneous environments, but could cause problems otherwise. If so, should it work like R2 and FORM things--using the default encoding? Or should the default be INSERT/RAW (no mods to data)?

And, for the "normal" case, do we end up using /AS on every call, specifying a fixed encoding, which is redundant and non-REBOLish?

Unicode support will be a wonderful thing, but if I don't really need it, and it makes my life more painful, that's bad.

Jerry Tsai
4-Mar-2008 21:08:13
When the first argument is a BINARY, I suggest APPEND/INSERT accept only one of BINARY, ISSUE with hex digits, BITSET, IMAGE, and VECTOR as its second argument. Anything else should be converted explicitly before being APPENDed/INSERTed to a BINARY.

Brian Tiffin
4-Mar-2008 21:25:05
I've not thought through the string cases, but I never understood the FORM for inserting integer!. Changing it to accept 8-bit values would be sweet and, imho, would rid REBOL of an annoyance: the to char! hoop when treating an integer as an array of bits.

So if 123 is 8 bit, "123" is ??? #{313233} or something completely different? To be honest I hope it is not completely different, but I would, at the same time, praise the change to accept

append bin 123
as inserting a byte.

Best of skill (luck having very little to do with this one),
Brian

Maarten
5-Mar-2008 3:14:56
I go with Jerry here - conversion should be done outside the binary! (or datatype! in general) with default only "one" behaviour.

Then we "only" have to have easy conversion functions.

DideC
5-Mar-2008 3:42:52
+1 for Jerry's proposal.

It makes sense, if the series is binary, to handle the conversion before inserting/appending into it. But it makes the value datatype checking dependent on the series datatype argument: could be a problem, dunno? And it's not easy to explain that in the header of the append/insert func.

Gabriele
5-Mar-2008 4:24:18
I'm in favor of explicit conversions too, but having a default would not hurt - that is, strings could be encoded as UTF-8 by default; if you want anything else, use an explicit conversion.

I'm not sure I've found the FORM on appending to binary! useful...

Robert
5-Mar-2008 8:28:14
IIRC the default encoding in R3 will be UTF-8, so I don't expect a lot of change in normal use. Everything is just UTF-8 encoded. Only if I need ASCII as a result would I do an explicit conversion.

So, if this is the case, then the default encoding should be used and the bytes of this encoding inserted. If I need something else, I use explicit conversion.

Brian Hawley
5-Mar-2008 10:40:02
I have always disliked the FORM on insert to a binary, and would be happy for it to go away. There should be easy binary conversion functions (CONVERT perhaps), and APPEND and INSERT should probably have a refinement (/AS would do) that would allow you to specify a conversion method, accepting the same set of conversion methods as the CONVERT function.

I agree with Robert that strings should be UTF-8 encoded by default when being inserted into a binary. Other encodings should be able to be specified, with the exception of ASCII. ASCII is a binary-compatible strict subset of UTF-8, so as long as your characters are in the ASCII subset, UTF-8 will be the same thing.

There should be an ASCII? function that will check to see if all of the characters in a string, or individual characters, are in the ASCII subset.

Brian Hawley
5-Mar-2008 11:04:41
Jerry's proposal (second message above) sounds good, with some changes.

I would definitely want a separate CONVERT function or equivalent, but it can save a great deal of memory overhead to do in-place conversions with an /AS refinement being added to INSERT, APPEND and CHANGE.

You could make that refinement optional with the types listed above except image!, which has no default encoding, and possibly issue! because conversion success or failure would depend on the contents of the issue, not its type.

Other types would make the conversion refinement mandatory.

Brian Hawley
5-Mar-2008 11:07:16
Oh, and vector conversions might need to be explicit as well because of endianness.

Carl Sassenrath
5-Mar-2008 14:54:12
Thanks for the comments. My reply is the next blog.

Paul
6-Mar-2008 16:51:28
Inserts have to be as lean as possible because they are used so heavily, so I ask: what is the leanest solution?


Updated 29-Apr-2024 - Copyright REBOL Technologies - REBOL.net