Allowing lexical exceptions in LOAD

Compared to most languages, REBOL provides a large number of primitive (built-in) datatypes. This is an important aspect of the language.

Many of these datatypes have direct lexical (string) representations:

1234      ; integer
123.4     ; decimal
123,4     ; decimal (non-British)
$123.45   ; money
$123,45   ; money (NB)
12.34%    ; percent
1.2.3.4   ; tuple (version, color, ip-address)
1-2-3400  ; date
12:34     ; time
12x34     ; xy pair
"1234"    ; string
<1234>    ; markup tag
#1234     ; issue
#{1234}   ; binary string
#"(1234)" ; unicode character
etc.

As a result, the lexical analysis of REBOL source code is fairly intense; not something you want to write yourself. That is why functions like LOAD/next were created - to let you use REBOL to easily read REBOL at a string level.

However, many developers have expressed an interest in allowing REBOL's lexical scanner to handle non-REBOL symbols sequences. Having used REBOL in this manner myself (to load data from web tables, etc.), I understand the merits of the request.

For example, if you want REBOL to load the string:

this is 1st

the scanner will throw an error on "1st" because it is considered a malformed number (bad syntax).

It may be possible to tell REBOL's scanner to relax on its interpretation of invalid syntax. But, before adding such a feature, I wanted to poll the group to determine if it is worth doing. (In theory, only a few hours.)

But heed this warning...

Any such change would come with the additional warning that what is invalid REBOL today may be valid in the future. For example, today we do not allow a pair! to include floating point such as:

1.25x2.5

But, that is a desired quality, and we hope to add it in the future.

Any relaxation of errors in the scanner would come with the caveat that those exception cases may not be exceptions in the future. REBOL's syntax gets extended over time.

Then, also, there is the issue of how such exceptions would be flagged in the result of the scan. Since those invalid values by definition exist outside the value space of REBOL, how do you detect them in the resulting block? They are "out of band" in the lexical sense.

One possibility would be to return them as strings, but with a special flag attached (similar to the new-line flag that keeps source code line break information to allow nice-looking molding of code):

code: load/relax %data.txt
foreach value code [
    if invalid? value [print value]
]

Of course, the word invalid is too general purpose here, so a better word may be needed.

There is also the issue of what it would mean to form and mold those values? Would they be converted to strings, or would the invalid form be output?

There is also the issue of how the scanner recovers from the error. That is, where does it "get back in sync" with the source. For example, does it look for the next delimiter, whitespace, or end of line? This is not as obvious as it may seem, as can be found in the line:

"there is no end quote

The scanner detects the missing quote when it hits the end of the line. So the invalid value will be everything from the quote to the end of line. But, in the case of:

1st is better than last

The "1st" will be the invalid value, and the rest will be valid REBOL words. So, you can see my point there. The extent of the invalid value is determined by the internal state of the scanner.

This topic is now open to your comments.

26 Comments