Rethinking BITSET complements

Carl Sassenrath, CTO
REBOL Technologies
5-Feb-2008 19:05 GMT

Article #0114
Main page || Index || Prior Article [0113] || Next Article [0115] || 8 Comments || Send feedback

As I mentioned, the BITSET datatype implementation has been changed and expanded in R3 to account for Unicode.

However, an interesting problem arises for the new bitsets, when you perform a complement action on them. For example, in R2 you would often write something like:

no-nums: complement make bitset! "0123456789"

This creates a bitset that contains all characters except numeric digits. It works fine in R2 because character bitsets spanned only 256 valid characters. The bits themselves would be inverted within the array. Pretty easy.

But now, with Unicode and the new bitsets, this becomes problematic because to complement all of the characters of the bitset implies that the entire codepoint set of Unicode is mapped into the bitset. Of course, it is not, and as I mentioned earlier, bitsets are even more abbreviated than in R2, only requiring as many bits as needed to reach their highest bit value.

There are a few possible solutions to this complement problem:

1. Disallow the COMPLEMENT action. Instead, add to PARSE and FIND (and any other bitset function) the ability to specify a NOT action. For example in PARSE:

nums: make bitset! "0123456789"
parse data [some [not nums]]

and in FIND on a string:

find/not data nums

and of course the BITSET check itself:

find/not nums item

2. Add an internal flag to bitsets to indicate that they are complemented. So, the line:

no-nums: complement make bitset! "0123456789"

Does not actually complement the bits, it just sets a flag. The flag is checked for all bitset operations.

This would work fine, but now we require a special notation for MOLD/all of a complemented bitset, because they no longer can be represented with only a binary string. The flag must be indicated as well. For example, MOLD/all of a complemented bitset could look like:

#[bitset! not #{0000000000...00}]

Or, we could also move the NOT after the binary, but it may get lost for large bitsets.

(Note that I also want to discuss other possible MOLD formats for bitsets, but not right now in this article.)

So, there's something to think about. I prefer the second solution over the first because the complemented state of the bits is attached to the bitset data itself, not to the functions that use it.

Furthermore, we will need to consider the various results of logical operations performed on bitsets, including AND, OR, XOR, but also UNION, INTERSECT, DIFFERENCE, and EXCLUDE. These become just a bit more complicated internally to get the desired results. No pun intended.


Updated 23-Jun-2024 - Edit - Copyright REBOL Technologies -