REBOL 3.0

Comments on: Important decision: is binary a string?

Carl Sassenrath, CTO
REBOL Technologies
20-Jun-2009 19:51 GMT

Article #0209

It's time to make an important decision: Is a binary sequence of bytes a string?

In R2 it was. But, in R3, we need to discuss it and reach a decision together.

Background

As you know, R3 adds Unicode. A string! (datatype) is defined as a sequence of Unicode code-points (just think "chars" for this discussion). Its meaning is "text".

A binary! datatype is a sequence of bytes. It may or may not be text. It could be an image, a sound, machine code, or whatever. So, binary is quite often an encoded datatype. It must be decoded to be useful.
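
A minimal sketch of the encode/decode step, assuming the R3 conversions to binary! and to string! use UTF-8 (the sample text is illustrative, not from the article):

```rebol
; A string! counts chars; a binary! counts bytes.
text: "^(E9)t^(E9)"            ; three chars; ^(E9) is e-acute
bytes: to binary! text         ; assumption: R3 encodes this as UTF-8

print length? text             ; chars in the string
print length? bytes            ; bytes in the encoding; e-acute takes two in UTF-8

; Decoding the bytes should give back the original text.
print equal? text to string! bytes
```

The two lengths differ as soon as the text leaves the ASCII range, which is why a binary has to be decoded before it can be treated as text.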

REBOL defines any-string! to include all datatypes that act like strings. But, how do we define string?

Question

R3 requires a more precise definition of "string". Specifically, we need to decide whether the any-string! typeset includes binary!. The answer affects a number of functions.

Update: The Decision

The decision has been made and implemented in the newer releases of R3: binary is no longer part of the string "superclass".

In R3, binary plays a different role, and we think this is an important distinction. It also makes many of the functions that deal with binary cleaner.
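
In a console, the new classification can be checked with the standard type predicates. This is a sketch that assumes a release with the change in place; the sample value is arbitrary:

```rebol
b: #{DECAFBAD}

print series? b            ; binary! still belongs to series!
print any-string? b        ; expected false once binary! leaves any-string!
print any-string? "text"   ; string! itself is unaffected
```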

For example, the relationship between binary and integers becomes really obvious:

>> append b: #{0504} 3
== #{050403}
>> pick b 2
== 4

This was not true in R2, because binaries were often used for character strings, so the above append had to insert #{33}, the ASCII code for the character 3. In R3, binary is binary. Just a sequence of bytes. No encoding is to be assumed.
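
If text bytes are what you want in R3, the conversion is explicit. A short sketch, assuming to binary! yields the UTF-8 (here ASCII) encoding of a string; the variable name is illustrative:

```rebol
b: #{0504}

poke b 1 255               ; elements are written as integers 0 to 255
print pick b 1             ; and read back as integers

append b to binary! "3"    ; explicit conversion: appends the byte #{33}
probe b
```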

Read the comments for the full discussion.

21 Comments

Comments:

Revolucent
20-Jun-2009 16:27:16
I'm trying to think of a use case for having any-string! contain binary! (regardless of what R2 does).

As you said, any-string! contains types that act like strings. With multi-byte characters in Unicode, I don't think binary! acts like a string anymore, because the one-to-one correspondence of byte to character is broken.

Brian Hawley
20-Jun-2009 16:53:32
In R2 binaries didn't really act like strings, but since the underlying data model was about the same it made a little sense to make binary! a member of any-string!. In R3 even the underlying data model is different.

It is time to stop trying to make binaries act like strings and vice-versa. We need to make binaries act in a way that is best for binaries, and strings act in a way that is best for strings. Clear conversions between them should be designed, but they are conversions, not aliases.

We should also declare that binaries are made up of bytes: unsigned integers from 0 to 255, not characters. All element-wise access (PICK, POKE, paths) would be in integers. This will also allow us to have a clear conversion model between characters and UTF-8 binaries that doesn't assume that the character fits into one binary element.

Once we get rid of the binary-string equivalence, we can finally make really good support for binaries in parse. We can do better than treating them like strings - binary parsing is often a different process, and we can do better at it.

Maxim Olivier-Adlhoch
20-Jun-2009 17:11:31
I can't say it any better than how Brian just put it.

R2's model created so many problems for me in building tcp servers, I wish binaries had never been part of strings in the first place.

Oldes
20-Jun-2009 17:27:19
I agree with Brian as well.

-pekr-
21-Jun-2009 1:22:53
I agree with Brian too ...

Anton Rolls
21-Jun-2009 1:58:28
A binary isn't a string, but a string is a binary.

Ladislav
21-Jun-2009 5:12:19
I am mainly worried about parsing binaries. I can even imagine it might be useful to parse string as a binary sometimes.

R2 was inconsistent here: it yielded integers as the elements of binaries, yet required chars when storing into them. On the other hand, it was quite handy to be able to use the #"" char notation in parsing binaries (at least sometimes).

Peter Wood
21-Jun-2009 7:25:47
I had a similar thought to Anton. Shouldn't the question "should string! be included in an any-binary! ?" also be discussed?

Andreas
21-Jun-2009 7:42:39
Couldn't state it more clearly than Brian did.

And no, a string is not a binary. A string is a sequence of Unicode code points (unsigned integers from 0x0 to 0x10FFFF). A binary is a sequence of bytes (unsigned integers from 0x00 to 0xFF).

Maarten
21-Jun-2009 9:10:54
I go with Brian. Plus, fast binary operators!

Ladislav
21-Jun-2009 10:44:19
One more note on binary parsing: of course, bitsets are useful, even more than for strings.

Ladislav
21-Jun-2009 10:50:07
CS definition from Wikipedia: "In computer programming and some branches of mathematics, a string is an ordered sequence of symbols. These symbols are chosen from a predetermined set or alphabet." According to this definition binary *is* a string - a sequence of bytes, so pretending it is not is just confusing.

Brian Hawley
21-Jun-2009 13:23:45
Ladislav, the set of symbols that a string! can contain is a different set of symbols than the set that a binary! can contain. And in theory every type is also a vector. Please don't let the theoretical distract you from the actual :)

"I am mainly worried about parsing binaries. I can even imagine it might be useful to parse string as a binary sometimes."

It would be useful, but to do so in R3 you have to convert the string! to a binary! first - otherwise you won't know what the binary encoding of the string! is. A string! in memory is a black box in R3: The internal encoding is supposed to be able to change for efficiency, with no noticeable semantic difference as far as code that uses them is concerned. That is why string! is defined as a sequence of codepoints, rather than a UTF or UCS stream.

The main reason I want binary! separate is so we can add real support for binaries to parse, rather than the "treat it as a string, poorly" support that R2 has. We are about to rewrite parse - don't handicap us with false equivalency.

Carl Sassenrath
21-Jun-2009 14:51:59
Great discussion so far. I think we are nearly there.

I also want to point out:

  • We have a "superclass" typeset definition that is the basis of REBOL, series!, which is common to both strings and binaries (and other datatypes as well.) They are "ordered sequences of values".

  • In R3, ports read and write binary! streams. It's very simple and extremely efficient, but we will need a nice mezzanine layer if we want the features found in R2 for simple and commonly used text-based actions.

  • R3 has a vector! datatype. The above definition of binary is equivalent to #[vector! [unsigned integer! 8]]. So, binary! can be thought of as a simplified front-end to such vectors.

Gregg Irwin
21-Jun-2009 17:44:31
I'm with the group, in support of Brian's view.

I also agree with Ladislav on the usefulness of using char! values in parse rules, but I don't think we're ruling that out.

On the subject of text mezzanines, yes, please make the most common things the most natural as well.

Something else that I've always missed in REBOL is support for fixed-length structures. Yes, I know, how 1980s. They were very handy, and still are when dealing with C structs and, admit it, BASIC's get/put in random access mode let you build dBase compatible stuff in a snap.

And if we're talking lower level work here, what about the concept of unions (as in C)? Many BASICs also let you LSET one fixed type into another, which was useful.

A binary is just a series of bytes, but the ability to "overlay a view" on that, ooohhh.

Carl Sassenrath
21-Jun-2009 21:22
Gregg, I consider that useful as well. It makes it much easier for REBOL to take almost any binary (sequence of bytes) and both extract and store data into it. Really, it's a struct. Low level, but saves a lot of tedious work in R code.

Ladislav
22-Jun-2009 3:46:16
Carl immediately spotted the weakness of my argument. Well, never mind, I am not against removing binary! from the any-string! typeset. I just wanted to represent the opposite side, since nobody else wanted to pick up that glove ;-)

Carl Sassenrath
23-Jun-2009 23:46:30
The conversion has been completed: try A62.

Steve, the eFishAnt
25-Jun-2009 20:13:12
One of the edges I hit in early R3 doing some real-time multi-media codecs was the bit-ness of handling flags for binary when it comes to some of the standards that are done in the philosophy of C. It was painful to do 32-bit flags from data in the standards without special cases since the MSBit was a sign bit, like C. I made it work, but the code was bigger than it needed to be.

I guess I should try my code again on A62 and see if it will shrink.

Sorry about the late reply. I am busy with the architecture that I am running on right now to do this input ... 0-o-o< (the eFishAnt)

Steve, the eFishAnt
25-Jun-2009 20:26:08
Ditto to Max. R2 binary was never intuitive to a binary bit-banger, but R2 was designed more for the ASCII-encoded hex of the age of MIME. I have built up some binary tools so perhaps porting them to A62 would be a beneficial exercise.

It would be cool for R3 to beat C for elegance in bit-banging. I believe it can be done. Vector seems to make sense.

Carl Sassenrath
6-Jul-2009 15:52:44
Good points. I think adding some native functions to do that can make it much easier... and faster too.

Updated 7-May-2024 - Copyright REBOL Technologies - REBOL.net