Comments on: General binary conversions

As you know, R3 includes various functions for conversion to and from binary values. For example,

>> to-binary 1
== #{0000000000000001}
>> to-integer #{0000000000000001}
== 1

However, it must be pointed out that this technique is not designed for general-purpose conversions. It's a specific conversion method that comes from how we extend the definition of the to function... but in a very restricted way.

Because many programmers want direct-to-bits conversions between datatypes and binary representations, a more extensive method is required. In R2, one method was the struct! datatype that mapped types to binary.

I suggest that we need something similar in R3. The method needs to be flexible enough to deal with binary sizes, byte order, and sign extensions. In this way, we can map to and from any binary series more directly.
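
To make those three knobs (size, byte order, sign) concrete, here is a rough sketch in Python rather than REBOL, since the R3 interface is still undecided. The helper names int_to_binary and binary_to_int are hypothetical and are not proposed R3 functions.

```python
def int_to_binary(value, size=8, byteorder="big", signed=True):
    """Encode an integer into exactly `size` bytes."""
    return value.to_bytes(size, byteorder=byteorder, signed=signed)


def binary_to_int(data, byteorder="big", signed=True):
    """Decode bytes into an integer; `signed` controls sign extension."""
    return int.from_bytes(data, byteorder=byteorder, signed=signed)


# 8 bytes, big endian: the same bytes as the to-binary 1 result shown earlier
be8 = int_to_binary(1)
# 4 bytes, little endian
le4 = int_to_binary(1, size=4, byteorder="little")
# The same two bytes read with and without sign extension
neg = binary_to_int(b"\xff\xfe", signed=True)
pos = binary_to_int(b"\xff\xfe", signed=False)
```

The same two input bytes decode to different integers depending only on the sign option, which is why the option has to be part of whatever spec we settle on.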

This is a much better solution than trying to piece together mezzanine functions that slice and dice to and from binaries to accomplish the same end.

If we want to come up with an adequate definition, which will no doubt be dialect-based like struct or vector definitions, I'd like to get it into A99 for you to test out.

11 Comments

Comments:

Brian Hawley
6-May-2010 2:23:53 I would suggest two functions:

That the reverse function apply to integers as well, doing a bswap from little endian to big endian.

A convert spec value function with a /to option.
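
For the first suggestion, a byte swap on an integer amounts to writing it out in one byte order and reading it back in the other. A minimal Python sketch, assuming unsigned values and a hypothetical helper named bswap with an assumed default width of 4 bytes:

```python
def bswap(value, size=4):
    """Reverse the byte order of an unsigned integer that occupies `size` bytes."""
    return int.from_bytes(value.to_bytes(size, "little"), "big")


swapped = bswap(0x12345678)        # bytes 12 34 56 78 become 78 56 34 12
restored = bswap(swapped)          # swapping twice gives the original back
```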

By going with a single spec for convert we would be able to save a spec, build a spec, or even return a spec from a function. If we make it a bunch of parameters or refinements, saving or passing on a set of options would require apply and a lot of weirdness.
The single convert/to function option would be simple:

convert spec value would convert a value to binary based on a spec.

convert/to spec binary would convert a binary to a value based on the exact same spec.

The spec could be a block containing a simple delect-style dialect, with some flexibility of options depending on the output type. Or the spec could be an object!, a map! or a block in object! syntax. We should pick one of those, or both delect-style blocks and object specs.
The advantage to a delect-style dialect would be less to write in the dialect code. For example, a binary to/from integer conversion spec:
[integer! 4 bytes le]
Or a bitfield conversion spec to/from object:
[object! [ a: integer! 3 bits b: integer! 5 bits c: integer! 24 bits ] be]
Or the same to/from a block:
[block! [ integer! 3 bits integer! 5 bits integer! 24 bits ] be]
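As a rough illustration of what the bitfield spec would have to do, here is a Python sketch of packing and unpacking the 3/5/24-bit layout. It assumes unsigned fields and that the first field occupies the most significant bits; the names FIELDS, pack_bits and unpack_bits are hypothetical.

```python
# Hypothetical field list mirroring [a: integer! 3 bits b: integer! 5 bits c: integer! 24 bits] be
FIELDS = [("a", 3), ("b", 5), ("c", 24)]


def total_bytes(fields):
    """Size of the packed bitfield in bytes; widths must fill whole bytes."""
    total = sum(width for _, width in fields)
    if total % 8:
        raise ValueError("bitfield widths must add up to a multiple of 8")
    return total // 8


def pack_bits(values, fields=FIELDS):
    """Pack named unsigned values into big-endian bytes, first field highest."""
    size = total_bytes(fields)
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word.to_bytes(size, "big")


def unpack_bits(data, fields=FIELDS):
    """Unpack the same layout back into a dict of field values."""
    size = total_bytes(fields)
    if len(data) < size:
        raise ValueError("missing binary data")
    word = int.from_bytes(data[:size], "big")
    shift = size * 8
    out = {}
    for name, width in fields:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


packed = pack_bits({"a": 5, "b": 17, "c": 0xABCDEF})
```

The same field list drives both directions, which is the main attraction of a single reusable spec.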

The default endianism would be network byte order unless specified. The default length to convert would be the same as to the type, unless specified. Length would be specified in bits or bytes, but a bitfield would probably have to add up to a multiple of 8. Objects and blocks could probably be nested. Any extra data in the binary would be ignored, because it is likely intended to be converted later. Any missing binary data would trigger an error.
If you are converting from a value to a binary, in theory the datatype could be omitted from the spec since it could be determined from the value itself. But it would be to your advantage to specify the type anyways because that would allow convert to typecheck the value, triggering an error like assert/type if the types don't match. Also, you will be able to reuse the spec to convert/to back from binary to the value.
Different types could have different supported options as well, or type-specific constraints like the length of a decimal! in binary (4 or 8 bytes). We can decide on the details.
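The rules above (one spec for both directions, network byte order by default, a type check on the way in, extra data ignored, missing data an error) can be sketched in Python with the standard struct module. The spec shape used here, a tuple of type name, size in bytes and optional byte order, is an assumption for illustration only, as are the names compile_spec, convert and convert_to.

```python
import struct

# Hypothetical spec shape: (type name, size in bytes[, byte order])
CODES = {("integer", 2): "h", ("integer", 4): "i", ("integer", 8): "q",
         ("decimal", 4): "f", ("decimal", 8): "d"}
PY_TYPES = {"integer": int, "decimal": float}


def compile_spec(spec):
    """Turn a spec tuple into (type name, precompiled struct layout)."""
    kind, size, *rest = spec
    order = rest[0] if rest else "be"          # network byte order unless specified
    prefix = ">" if order == "be" else "<"
    return kind, struct.Struct(prefix + CODES[(kind, size)])


def convert(spec, value):
    """Value to binary; the type named in the spec is checked first."""
    kind, layout = compile_spec(spec)
    if not isinstance(value, PY_TYPES[kind]):
        raise TypeError(f"expected {kind}, got {type(value).__name__}")
    return layout.pack(value)


def convert_to(spec, binary):
    """Binary to value, using the exact same spec."""
    _, layout = compile_spec(spec)
    if len(binary) < layout.size:
        raise ValueError("missing binary data")
    return layout.unpack_from(binary)[0]       # extra trailing bytes are ignored


spec = ("integer", 4, "le")
encoded = convert(spec, 1)
decoded = convert_to(spec, encoded + b"\xff\xff")   # trailing bytes left for later
```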
This approach could have a lot more flexibility than R2's struct! conversions.
Gabriele
6-May-2010 4:58:39 I still vote for parse and format to handle binary as well...
Brian Hawley
6-May-2010 6:22:24 Agreed, Gabriele, and format is overdue for a complete rethink anyways. This is not in opposition to the need for something like the convert proposal above though.
Brian Hawley
6-May-2010 6:48:34 Gabriele, could you give us an idea about which parse operations you think would be most valuable for binary mode? Perhaps you can give some kind of suggested list of operations, maybe in some cases borrowing operations from string or block mode that binary parsing doesn't already support? The exact semantics you want them to support would be helpful too. That would make it a lot easier to get started on discussing and implementing them.
On another topic, convert might be able to benefit from the /into option, bringing it insert-style chaining. I don't think that convert/to would get the same benefit from /into though, so we could limit it to binary! output arguments.
Maxim Olivier-Adlhoch
6-May-2010 14:45:44 I like Brian's idea, but I wonder about the speed, since we'd be continually processing the data as a series (in both ways) and breaking it up over and over. This might be pretty heavy when using DLLs.
let me expand on the idea (but using a different approach)
maybe if we built convert as an extension it could be super optimised using C, and we could JIT compile the conversion process so it becomes a linear piece of machine code without any conditions or jumps.
using a handle as the spec reference, the extension would be able to re-apply the conversion to new data very easily.
so we'd do:
; here struct is a handle, we can't modify it or create one from the interpreter.
struct: structure [... whatever ...]
; we go from binary to REBOL data
data: instruct #{...} struct
; we go from REBOL to binary data
bin: destruct data struct

Most of all, having it as an extension, the community can improve or adapt the spec parsing so we tailor to more platforms and specific binary requirements, like I/O stream processing, for which C code usually exists and can be used directly.
this could also be the basis for encoding/decoding, since instruct/destruct would just be looped on recognised patterns.
so for example:
; load a file which contains a header and a series of equally-sized segments.
data: read %/data.bin
; define & convert header
hdr: structure [... a spec which eats up 20 bytes ...]
header: instruct take/part data 20 hdr
; define & convert segments, using data in header.
seg: structure compose [ ... spec which uses (header/segment-size) ...]
blk: []
loop header/segment-count [
    append blk instruct take/part data header/segment-size seg
]

this would allow anyone to build codecs using REBOL syntax instead of having to go into C. but benefit from vastly superior processing speed, in a well controlled and pretty safe environment.
things like comparing struct size < binary length BEFORE running the JITed code are easily done and provide good sandboxing against memory violations.
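As a rough analogue of a precompiled spec plus the size check, here is a Python sketch using struct.Struct, which compiles a layout once and exposes its byte size. The 20-byte header layout (magic, version, flags, segment size, segment count) and the choice of 16-bit values inside each segment are invented for illustration, and the sketch assumes an even segment size.

```python
import struct

# Hypothetical 20-byte header: 4-byte magic, then version, flags,
# segment size and segment count as unsigned 32-bit big-endian integers
HEADER = struct.Struct(">4sIIII")


def decode(data):
    """Decode the header, then the segments it describes, checking sizes first."""
    if len(data) < HEADER.size:
        raise ValueError("binary shorter than header spec")
    magic, version, flags, seg_size, seg_count = HEADER.unpack_from(data, 0)

    # Hypothetical segment layout: unsigned 16-bit values filling seg_size bytes
    segment = struct.Struct(f">{seg_size // 2}H")
    needed = HEADER.size + seg_count * segment.size
    if len(data) < needed:
        raise ValueError("binary shorter than header plus segments")

    segments = [segment.unpack_from(data, HEADER.size + i * segment.size)
                for i in range(seg_count)]
    return (magic, version, flags), segments


sample = HEADER.pack(b"DATA", 1, 0, 4, 2) + struct.pack(">4H", 1, 2, 3, 4)
header, segments = decode(sample)
```

Both length checks run before any bytes are unpacked, so a truncated file fails with an error instead of reading past the end of the buffer.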
in fact using PARSE to control the instruct/destruct process would be blindingly fast, and might suit or complement Gabriele's requirements for binary parsing.
With the combination of PARSE and an optimised, purpose-built bit-crushing JIT, I wouldn't be surprised if REBOL would actually encode/decode some files faster than (or close to) some native loaders out there. Parse alone can already be faster than compiled code using Regexp on complex patterns.
Brian Hawley
7-May-2010 5:13:51 Maxim, good point, but lose the "structure", "instruct" and "destruct" naming - none of those words apply to the task at hand. We aren't talking about structures (noun for an object), we're talking about conversions (noun for an action). Unless you mean "structure" as a verb, which wouldn't work with your functions because the noun form for the compiled spec is the same word as the verb form for the action of applying the spec; "instruct" means something completely different, and the opposite word of the verb "structure" is "destructure", not "destruct".
Calling your spec compiler conversion makes it a noun form like "object", and it lets you use the verb convert for the application of the spec. Same semantics, different naming. You could also allow convert with a block spec to compile and use the spec on the fly, for those who need convenience more than speed. In either case, the spec dialect should be processed by native code for speed.
We could still use suggestions for parse operations for binary recognition - the existing manipulation operations should be fine as is. Perhaps we can get by with datatype recognition like block parsing, plus a bitfield [...] operation. I assume others have better ideas...
Ladislav
24-May-2010 16:59:53 When considering an R2 STRUCT! - like interface, I would suggest the following datatypes:
int32 - 32-bit signed integer, little endian
uint32 - 32-bit unsigned integer, little endian
int32be - 32-bit signed integer, big endian
int64 - 64-bit signed integer, little endian
Ladislav
24-May-2010 17:25:50 Moreover, we need a pointer, void* datatype (I am not sure, whether just 32-bit, or even 64-bit too), and we need to be able to specify repetition somehow too.
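For comparison, the four fixed-width types named above, plus a simple form of repetition, map onto Python's struct module roughly as follows. The TYPES table and the encode/decode helpers are hypothetical names for this sketch, and pointers are left out.

```python
import struct

# Type name -> (byte-order prefix, struct format code)
TYPES = {
    "int32":   ("<", "i"),   # 32-bit signed, little endian
    "uint32":  ("<", "I"),   # 32-bit unsigned, little endian
    "int32be": (">", "i"),   # 32-bit signed, big endian
    "int64":   ("<", "q"),   # 64-bit signed, little endian
}


def encode(type_name, *values):
    """Encode one or more values of one type; the value count is the repetition."""
    order, code = TYPES[type_name]
    return struct.pack(f"{order}{len(values)}{code}", *values)


def decode(type_name, data, count=1):
    """Decode `count` values of one type from exactly the right number of bytes."""
    order, code = TYPES[type_name]
    return struct.unpack(f"{order}{count}{code}", data)


little = encode("int32", -2)
big = encode("int32be", -2)
pair = encode("int32", 1, 2)          # repetition: two int32 values back to back
```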
Ladislav
24-May-2010 17:34:15 Which endianness should be the default? The little endian case looks more frequent (for C interfaces and little endian processor architectures).
Brian Hawley
3-Jun-2010 13:43:06 Pointer type in R3 is handle!, so we can use that. I would suppose that network byte order should be the default endianness, on principle, but would go with whatever solution would be appropriate. Agreed on the need for a way to specify repetition, especially for conversions to strings, blocks or vectors.
"When considering an R2 STRUCT! - like interface"
Please don't, at least not for conversions. The struct! method was perhaps the worst method conceivable (at least by a semi-sane person) for doing conversions. It was only used because R2 didn't have a better method; it would have been avoided if there was an even slightly better method available. We need something better than that.
Ladislav
8-Jun-2010 12:18:06 Opposing myself in re the void* datatype. Actually, I think, that a char* datatype with arithmetic would be preferable.
