Comments on: A98 changes to Binary!

The A98 release will begin finalizing changes to the binary datatype. As you know, this datatype was significantly disrupted due to the addition of Unicode.

Essentially in R3 you must always remember one thing: bytes are not characters.

In other words, if you convert a string of length 5 to binary, the resulting binary could be 5, 6, 7, 8, or more bytes long. That's because the string is encoded as UTF-8, which uses a variable number of bytes per character.
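
To make the length difference concrete, here is a small sketch in Python (not REBOL, and not R3's implementation; the sample strings are made up for illustration). Each string has 5 characters, yet the UTF-8 byte count differs:

```python
# Python analogy (not REBOL): 5-character strings and their UTF-8 byte lengths.
# The sample strings are illustrative assumptions, not taken from the post.
samples = [
    "hello",                           # ASCII only: 1 byte per character
    "h\u00e9llo",                      # one 2-byte character (e with acute accent)
    "h\u20acllo",                      # one 3-byte character (euro sign)
    "\u65e5\u672c\u8a9e\u3067\u3059",  # five 3-byte characters
]

char_lengths = [len(s) for s in samples]
byte_lengths = [len(s.encode("utf-8")) for s in samples]

for s, n_chars, n_bytes in zip(samples, char_lengths, byte_lengths):
    print(n_chars, "characters ->", n_bytes, "bytes:", s.encode("utf-8").hex())
```

The character count stays at 5 in every case, while the byte count depends on which characters the string contains.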

As we've discussed in several other articles here, this change has caused us to re-examine what the binary! datatype means. Perhaps most importantly, we've removed most of the "magic" that was being done in binary-related operations.

For example, if you insert an integer 8 into a binary string, it's inserted as a byte of value 8. Unlike R2, it is not converted to ASCII. The same is true with picking a single element from a binary. It's just an integer.
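
The difference between the two behaviours can be sketched in Python (an analogy only, not REBOL code; the variable names are hypothetical). The first buffer stores the integer as a raw byte, as described for R3. The second stores its ASCII text form, which is one reading of the R2-style conversion mentioned above:

```python
# Python analogy (not REBOL): two ways to put the integer 8 into a byte buffer.
raw = bytearray()
raw.append(8)                              # stored as one byte with value 8

ascii_form = bytearray()
ascii_form += str(8).encode("ascii")       # stored as the character "8" (byte value 56)

picked = raw[0]                            # picking one element yields a plain integer

print(raw.hex(), ascii_form.hex(), picked)
```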

Ok, with that being said, we do allow several special conversions. For example, if you attempt to insert a string into a binary, by default it will be UTF-8 encoded. If you want some other encoding, then you'll need to add an extra step to do that conversion first. (We will be providing some standard codecs too.) That is a reasonable approach.
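
As a rough illustration of "UTF-8 by default, other encodings as an explicit extra step", here is a Python sketch (not REBOL; the choice of Latin-1 as the alternative encoding is an assumption made purely for the example):

```python
# Python analogy (not REBOL): appending the same text to a byte buffer
# under two different encodings.
prefix = bytes([0x01, 0x02])
text = "caf\u00e9"                                  # 4 characters, one non-ASCII

as_utf8 = prefix + text.encode("utf-8")             # the "default" path described above
as_latin1 = prefix + text.encode("latin-1")         # explicit conversion step first

print(as_utf8.hex())
print(as_latin1.hex())
```

The two results differ only in the bytes produced for the non-ASCII character, which is why the encoding has to be decided before the string reaches the binary.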

Anyway, in A98 we'll be extending what you can do with binary, including fixing CureCode #1452 (limited binary usage). But, of course, we need you to test it really well. (Unicode doesn't just make the R3 internal code for string handling twice as complicated; it makes it ten times more complicated, mainly because UTF-8 encoding is variable length. So, we need you to do extensive testing on it.)

We will begin updating the REBOL 3 Binary Datatype documentation page to cover the definitions and details. (Be sure to reload the page to get newest changes.)

10 Comments

Comments:

Brian Hawley
30-Apr-2010 21:51:59 That's #1452, not #1453. Nonetheless, yay!
-pekr-
1-May-2010 2:02:44 What about 1575? ("To-integer used on binary values")
Carl Sassenrath
1-May-2010 12:17:34 Brian: corrected it, thanks.
Pekr: I'm inclined to say that 1575 is not a bug. Instead, we can offer a method of sign-extension.
-pekr-
1-May-2010 18:18:58 Carl, my problem (and amateur coder surprise :-) was that we consider e.g. #{FFFF} "left padded" (its integer value), but when using OR or AND, those functions apply it "right padded".
Carl Sassenrath
1-May-2010 19:10:45 That's correct. Binary is bytes. If you want to AND and OR integers, they must be loaded as integers, not as bytes.
For handling binary-as-structured data, I think we need to add back to R3 a struct-like datatype (or function.) It would allow you to map between the two, with the correct sizing and order. With such a function, all variations can be handled. This seems like a really useful capability.
Ladislav
2-May-2010 4:32:48 I want to mention that there is a discrepancy between the "network bit order" used by the TO-BINARY function and the "actual bit order" one obtains when examining the struct in R2.
meijeru
2-May-2010 5:37:02 You have a struct! datatype reserved already. But do you want it to work on byte-level or on bit-level? For certain very space-efficient encodings you may need the latter. And "actual bit order" is going to depend on the endianness as well, I fear.
Brian Hawley
2-May-2010 15:30:03 Between struct! and a function, I think that a function (or some functions) can be made to be more powerful and flexible than struct! when it comes to conversions.
The struct! type was always better for its original usage: native interop. When you try to use struct! for binary conversions, it gets really awkward and tricky - it just wasn't designed for that. It was a bad habit that we were forced into, but there's no reason we should continue to be forced into that when we can do it right this time.
Gabriele
3-May-2010 4:55:41 Well... my two cents on "conversions"... What about:
    parse "some string" [string parsing mode]
    parse [some block] [block parsing mode]
    ; new!
    parse #{ABCDEF} [binary parsing mode]

Binary parsing mode would then be specialized at doing all sort of useful conversions. On the opposite direction, we've been missing FORMAT for so many years already...
    format [format dialect] ; return string!
    ; binary formatting
    format/to [format dialect] some-binary

In the same way as PARSE, the dialect would be slightly different when doing a binary format. One could also write format/to [...] binary! when they don't want to provide an already existing binary value (though, just using copy #{} would be the same), or there could be something in the dialect to switch to binary mode if you prefer it that way.
I think this would cover 99% of the cases. The remaining 1% would be covered by the mythical host kit.
Maxim Olivier-Adlhoch
3-May-2010 10:47:57 Gab, correct me if I'm wrong, but in R3, there is no parse binary->string conversion anymore, so if you supply a binary, you actually can parse it differently and much more precisely than with R2.
ex:
    parse #{ffaa00ff} [
        any [
            #{FF} (print "up")
            | #{00} (print "down")
            | skip (print "ignored")
        ]
    ]
    up
    ignored
    down
    up
    == true
