Comments on: It's not your grandpa's BINARY anymore

Ok, this article could easily become the size of a book, so let me try to keep it short and to the point.

Start here...

Let me start by saying:

In R3, a BINARY! is not a STRING!, and a STRING! is not a BINARY!

That was easy to say, but the implications run deep:

To make a STRING! from a BINARY! requires decoding. Why? Because a binary series could be UTF-8, UTF-16, Latin-1, or something else.
To make a BINARY! from a STRING! requires encoding. Why? Because you will want the binary result to accurately represent the string; meaning, it must conform to a standard text encoding like UTF-8.
To INSERT, APPEND, or CHANGE a string (or any other FORMed value) in a BINARY! requires encoding. Why? For the same reason as the prior point.
BINARY! and STRING! are not equivalent. AS-BINARY and AS-STRING must go away because they assume that you can reference the same data without decoding or encoding. You must use TO-BINARY and TO-STRING as we did in pre-V2.6 REBOL.
Encoders and decoders such as ENBASE, DEBASE, COMPRESS, DECOMPRESS, ENCLOAK, and DECLOAK have new rules. If you ENBASE a STRING!, either you are implying that it must be converted to BINARY! first, or the function should throw an error because the operation is not directly valid.
Various functions may only work with STRING data or BINARY data, but not both. For example, LOWER-CASE and UPPER-CASE functions are not valid for BINARY, and AND, OR, XOR are not valid for STRING.
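
A rough sketch of the encoding point, written in Python rather than REBOL (an analogy only; it does not show what R3 does internally, and the sample text is made up for illustration). The same string produces a different byte sequence under each encoding, which is why turning a STRING! into a BINARY! has to name one:

```python
# Analogy in Python, not REBOL: text and bytes are distinct types,
# and going from text to bytes always passes through a named encoding.
text = "héllo"  # sample text chosen for illustration

utf8_bytes = text.encode("utf-8")        # é takes two bytes
latin1_bytes = text.encode("latin-1")    # é takes one byte
utf16_bytes = text.encode("utf-16-le")   # each character here takes two bytes

# Same text, three different byte sequences of three different lengths.
lengths = (len(utf8_bytes), len(latin1_bytes), len(utf16_bytes))
print(lengths)
```

None of the three results is "the" binary form of the string; each is one encoding of it.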

Wow. How's that for introducing the wonderful benefits of Unicode? Yes, you can love to hate it or hate to love it. It just depends on where you live. But that's the reality of the modern world of computing. Sorry.

Well, actually, all of this was bound to happen eventually. In the past, we've had the luxury of being a bit sloppy in our coding practices. We could throw binary and text around like they were different sides of the same coin. Now we must buckle down if we want the rest of the world to enlist in our REBOL forces. A lot of people live in Asia. A lot.

An important rule...

So, what's really truly going on here? Well, you know me, I like to summarize down to a nice little rule:

Rule:

In high level languages it is dangerous to make assumptions about low-level internal data representation.

What do I mean?

Here's a quick test of your understanding:

Q: What does this line do?

bin: to binary! "hello"

A: If you said that it converts the internal representation of the string "hello" into a standard binary encoded representation such as UTF-8, then you got it right. (What if you wanted it encoded into something different like UTF-16 or Latin-1? You must specify a function refinement for that.)

Q: What does it not do?

A: It does not give you the internal representation of the "hello" string (anymore).

Q: What does this line do?

str: to string! #{68656C6C6F}

A: If you said that it converts a standard binary encoded representation such as UTF-8 to an internally represented string, then you got it right.

Q: What does it not do?

A: It does not consider those bytes to be the internal representation of the string. They are an encoding of it.

Note: just because the binary literal looks like ANSI or Latin-1 here does not mean it is. In fact, the default is UTF-8.
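
To see why the note matters, here is a small Python analogy (not REBOL, and not a statement about R3 internals). The bytes of "hello" are plain ASCII, so they decode the same way as UTF-8 and as Latin-1; bytes above 0x7F do not. The second byte sequence is an assumed example, not one from the article:

```python
# Analogy in Python: the same bytes decoded under two assumptions.
ascii_only = bytes.fromhex("68656C6C6F")   # the literal from the article
as_utf8 = ascii_only.decode("utf-8")
as_latin1 = ascii_only.decode("latin-1")   # identical, because every byte is < 0x80

high = bytes.fromhex("C3A9")               # assumed example: UTF-8 encoding of é
high_utf8 = high.decode("utf-8")           # one character
high_latin1 = high.decode("latin-1")       # two unrelated characters

print(as_utf8 == as_latin1, len(high_utf8), len(high_latin1))
```

The ASCII-only case is the one that makes a binary literal "look like" Latin-1 even when it is being treated as UTF-8.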

A mistake...

A couple of years ago, I added the AS-BINARY and AS-STRING functions to REBOL. When I did it, I knew it was treading on an important rule of computer science: you cannot go around directly aliasing datatypes like that, because if either of the datatypes changes representation (such as to support Unicode) then you've got a problem. And, of course, the problem is made much worse by the fact that different CPUs store the data differently: in big or little endian.
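
The byte-order problem can be illustrated with a short Python sketch (an analogy only; it assumes a two-byte-per-character layout purely for demonstration and says nothing about how R3 stores strings). If code aliases the raw bytes of a wide string, the bytes it sees depend on the byte order in use:

```python
# Analogy in Python: one character, two byte orders.
text = "A"
little = text.encode("utf-16-le")   # low byte first
big = text.encode("utf-16-be")      # high byte first

# Reading little-endian bytes as if they were big-endian gives a
# completely different character instead of an error.
swapped = little.decode("utf-16-be")

print(little.hex(), big.hex(), hex(ord(swapped)))
```

Going through an explicit encode or decode step fixes the byte order by name, which raw aliasing cannot do.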

To continue...

If I write:

insert bin "example"

What does that do?

Since we do not know how the string is internally represented, we must either auto-encode the string into binary, or throw an error.

Which does R3 actually do? Currently, I assume it should encode the string and insert it, but that's not final. Give me some feedback.
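
The two options can be sketched in Python (an analogy, not R3 code). Python's own byte buffers take the "throw an error" route; the `append_text` helper below is hypothetical and shows the "auto-encode" route, assuming UTF-8 as the default:

```python
def append_text(buf: bytearray, text: str, encoding: str = "utf-8") -> bytearray:
    """Hypothetical helper: encode text, then append it to a byte buffer."""
    buf.extend(text.encode(encoding))
    return buf

buf = bytearray(b"\x01\x02")

# Option 1: refuse to mix the types (what Python itself does).
try:
    buf += "example"
    rejected = False
except TypeError:
    rejected = True

# Option 2: encode automatically before inserting (UTF-8 assumed).
append_text(buf, "example")

print(rejected, bytes(buf))
```

Auto-encoding is more convenient for scripts, at the cost of a default encoding that the caller never sees.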

Now let me give you something a bit more complex. What happens if I write:

data: enbase "this is an example line"

Well, ENBASE is a binary base encoder (which defaults to a BASE-64 encoding). Since it encodes binary, that implies that the string needs to be converted to binary; therefore, the string needs to be encoded, either automatically or explicitly. ENBASE is a "double encoder".

Are you with me? Guess what? There's more! ENBASE returns a STRING! because it is designed for inserting base encoded data into things like email or web CGI text. So, if I write:

out: make binary! 1000
append out enbase "this is an example line"

then the STRING! output of ENBASE must be encoded into the BINARY!. So, there's a triple encoding going on here.
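
The three steps can be spelled out in Python (an analogy only; R3 may fuse or skip these steps internally, and UTF-8 is an assumption here):

```python
import base64

text = "this is an example line"

raw = text.encode("utf-8")                         # 1. string -> binary
b64_text = base64.b64encode(raw).decode("ascii")   # 2. binary -> base-64 string
out = bytearray()
out.extend(b64_text.encode("utf-8"))               # 3. string -> binary, appended

# Undoing the steps in reverse order recovers the original text.
recovered = base64.b64decode(bytes(out)).decode("utf-8")
print(recovered == text)
```

Because base-64 output is pure ASCII, the third step is cheap: each character maps to exactly one byte in UTF-8.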

Fortunately, R3 is quite smart and efficient internally about how all of this is done. (All of this work is the main reason you've not seen me around much chatting online.) In theory, the above line should evaluate about as fast as it did in R2, and perhaps even a bit faster. I have yet to measure it.

In summary...

I should note that there are advantages and disadvantages to these new rules. For users who just want to write scripts and not worry about it, R3 does a lot of the hard work. However, for those who want to fiddle around with the bits in the bytes, it may be more difficult to make things work out. For that, we'll need to develop some smart and well-defined methods. Yep, those of you who do will need to buckle down. It's not your grandpa's BINARY anymore.

Got some comments on any of this? Please post them right away.

8 Comments

Comments:

Brian Hawley
18-Jan-2008 21:21:30 I am really happy to hear that you have thought this through. You seem to have covered just about everything. The Unicode changes seem to be coming along nicely.
We look forward to your return to the chat rooms - the new REBOL should be fun to build and work with.
Brian Hawley
18-Jan-2008 21:27:37 By the way, will binary-style types be taken out of the any-string! typeset? It seems to me that binaries are not going to be much like strings anymore, with different functions supported and different behavior. Still series though.
For that matter, will the encoding infrastructure for converting string to binary be extended to encoding other types, such as images? It seems like an opportunity for code sharing.
Pier Johnson
19-Jan-2008 2:53:20 Great, that is if I get it.
A string is UTF-8 text (a fancy way of saying an ASCII-preserving encoding within a single character set that lets us mix symbols from nearly any writing system of the world).
A binary is any user-defined sequence of symbols devised to encode something (MP3, PNG, FLV) that a user-written program can process because folks agree upon a standard start, end and length.
Why not ditch string! and use UNICODE!, ditch binary! and use ENCODED! instead?
After all, even a UTF-8 string is a binary sequence.
Pier Johnson
19-Jan-2008 3:06:03 Why not ditch string! and use UNICODE!, ditch binary! and use ENCODED! instead? ... that is, if you want to be the rebel.
Perhaps the biggest block to folks learning and mastering languages results from terminology overloading.
For years, when making new languages, Priests from Academia have co-opted the words used by previous language designers.
Often, they shade new meanings because they never understood what the original coiner intended.
This act does more harm and causes more confusion than any other.
You must let go of your old words to let go of an old way of thinking.
If you use the same words as other language designers, then when programmers of other languages discover yours, they become confused. Most cannot hold within their minds, variant meanings for the same words.
The first word you ought to lose is "object". There are others.
Brian Tiffin
20-Jan-2008 20:18:47 How will this impact data transfers between an R3 client and an R2 service (and vice versa)?
Cyphre
21-Jan-2008 9:44:04 All that sounds logical to me, so it looks like R3 is on a good track with Unicode. Keep up the great work!
Cate Dixon
21-Jan-2008 17:44:22 This sounds great, but I do have one request: encoding and decoding functions for unicode strings encoded as 8-bit values (not-UTF, just only allowing the first 256 code points).
maxim olivier-adlhoch
3-Feb-2008 15:26:28 while you're at it...
might want to provide better binary SCALAR conversion/insertion/extraction.
handling binary series in rebol is easy, inserting/extracting some of the obviously convertible values and meta series into binaries isn't always very fun. I even remember needing to use a struct for one type (can't remember which).
I say this out of experience implementing tcp servers and clients, which speak in binary. In R2 some of the types are very hard to interface. Ex: integers!
to-binary 13 == #{3133} # when we'd expect it to be: #{0D}

Endianness being an issue, it could be a refinement.