REBOL 3.0

Comments on: Rethinking BITSET complements

Carl Sassenrath, CTO
REBOL Technologies
5-Feb-2008 19:05 GMT

Article #0114
Main page || Index || Prior Article [0113] || Next Article [0115] || 8 Comments || Send feedback

As I mentioned, the BITSET datatype implementation has been changed and expanded in R3 to account for Unicode.

However, an interesting problem arises for the new bitsets, when you perform a complement action on them. For example, in R2 you would often write something like:

no-nums: complement make bitset! "0123456789"

This creates a bitset that contains all characters except numeric digits. It works fine in R2 because character bitsets spanned only 256 valid characters. The bits themselves would be inverted within the array. Pretty easy.

But now, with Unicode and the new bitsets, this becomes problematic because to complement all of the characters of the bitset implies that the entire codepoint set of Unicode is mapped into the bitset. Of course, it is not, and as I mentioned earlier, bitsets are even more abbreviated than in R2, only requiring as many bits as needed to reach their highest bit value.

There are a few possible solutions to this complement problem:

1. Disallow the COMPLEMENT action. Instead, add to PARSE and FIND (and any other bitset function) the ability to specify a NOT action. For example in PARSE:

nums: make bitset! "0123456789"
parse data [some [not nums]]

and in FIND on a string:

find/not data nums

and of course the BITSET check itself:

find/not nums item

2. Add an internal flag to bitsets to indicate that they are complemented. So, the line:

no-nums: complement make bitset! "0123456789"

Does not actually complement the bits, it just sets a flag. The flag is checked for all bitset operations.

This would work fine, but now we require a special notation for MOLD/all of a complemented bitset, because they no longer can be represented with only a binary string. The flag must be indicated as well. For example, MOLD/all of a complemented bitset could look like:

#[bitset! not #{0000000000...00}]

Or, we could also move the NOT after the binary, but it may get lost for large bitsets.

(Note that I also want to discuss other possible MOLD formats for bitsets, but not right now in this article.)

So, there's something to think about. I prefer the second solution over the first because the complemented state of the bits is attached to the bitset data itself, not to the functions that use it.

Furthermore, we will need to consider the various results of logical operations performed on bitsets, including AND, OR, XOR, but also UNION, INTERSECT, DIFFERENCE, and EXCLUDE. These become just a bit more complicated internally to get the desired results. No pun intended.

8 Comments

Comments:

maxim oliver-adlhoch
5-Feb-2008 16:34:52
doesn't having a flag which complements a bitset render it incapable to support further logical ops?

how would you implement (something I have in my code):

union complement set1 set2
without making a bitset which holds ALL unicode bits?
maxim oliver-adlhoch
5-Feb-2008 16:47:39
throughout the years, the first option has been discussed/requested many times AFAIK, but not just for bitsets.

Many algorythms are vastly simplified by "looking for holes in the cheese".

in fact some things are downright complex (or just exhaustive) to parse without 'not when they could be a no brainer with it.

I am thinking that implementing 'not within parse can be quite tricky, and might lead to some slow rules, when used inapropriately, but its been requested many times, and discussed even on this channel too at least once.

I vote for option 1)

Steeve
5-Feb-2008 19:48:19
"Many algorythms are vastly simplified by "looking for holes in the cheese". so true, i can't remember how many times i was ashamed of not having a find/not refinement.

Karol
6-Feb-2008 3:19:05
I think we have more options: 3. complement invert bitset for only ascii characters and/or with refinement for provided characters 4. bitset internal format stores characters in compressed way to allow storing ranges of characters (mold format also needed)
Gabriele
6-Feb-2008 4:18:36
I prefer (2). PARSE needs enhancements, but that's a separate topic.

Max: the code:

union complement set1 set2
is equivalent to:
complement difference set1 set2
so there is not necessarily a problem with other logical ops; it just becomes a bit more complicated internally, because REBOL has to handle all such cases.
Carl Sassenrath
6-Feb-2008 13:42:41
Gabriele is correct. We can use high level logic.

Of course, it also depends on what your goal is. In R2, bitsets are bounded, so the complement is restricted to the size of the set. In effect, it is a complement mask of a given size.

With bitsets in R3, it would not be a mask, but an entire unbounded set. That should be good most of the time, but there may be cases were you really do want to mask a set of bits. For that, we would want to refine COMPLEMENT to allow some kind of range specification.

Carl Sassenrath
6-Feb-2008 13:47:35
Karol: yes, those are also possible options. Thanks.

For #3, I should note that we are trying to move away from the ASCII defaults and go with more of a Unicode mindset.

For #4, we have talked about some method of using ranges, like sparse arrays. Of course, performance is critical for bitsets because they are used in functions like PARSE.

maxim oliver-adlhoch
14-Feb-2008 14:43:16
although I agree high level logic can be used, any sane person does not want to try and play around with inverted logic when given the choice. its can be brain numming, in some cases.

for two sets, its already a think and wait. With a ruleset of many many bitsets it may in fact be almost impossible to wrap your mind around it, especially for novices.

although I agree that its possible, why not ALSO have a /not refinement on find and maybe just a select few series functions. its VERY (more?) usefull for other things than just bitsets too... and it WILL make rebol soooo much friendlier in MANY CASES.

Post a Comment:

You can post a comment here. Keep it on-topic.

Name:

Blog id:

R3-0114


Comment:


 Note: HTML tags allowed for: b i u li ol ul font span div a p br pre tt blockquote
 
 

This is a technical blog related to the above topic. We reserve the right to remove comments that are off-topic, irrelevant links, advertisements, spams, personal attacks, politics, religion, etc.

REBOL 3.0
Updated 8-Sep-2024 - Edit - Copyright REBOL Technologies - REBOL.net