Comments on: Allowing lexical exceptions in LOAD

Compared to most languages, REBOL provides a large number of primitive (built-in) datatypes. This is an important aspect of the language.

Many of these datatypes have direct lexical (string) representations:

1234      ; integer
123.4     ; decimal
123,4     ; decimal (non-British)
$123.45   ; money
$123,45   ; money (NB)
12.34%    ; percent
1.2.3.4   ; tuple (version, color, ip-address)
1-2-3400  ; date
12:34     ; time
12x34     ; xy pair
"1234"    ; string
<1234>    ; markup tag
#1234     ; issue
#{1234}   ; binary string
#"(1234)" ; unicode character
etc.

As a result, the lexical analysis of REBOL source code is fairly intense; not something you want to write yourself. That is why functions like LOAD/next were created - to let you use REBOL to easily read REBOL at a string level.

However, many developers have expressed an interest in allowing REBOL's lexical scanner to handle non-REBOL symbol sequences. Having used REBOL in this manner myself (to load data from web tables, etc.), I understand the merits of the request.

For example, if you want REBOL to load the string:

this is 1st

the scanner will throw an error on "1st" because it is considered a malformed number (bad syntax).

It may be possible to tell REBOL's scanner to relax its interpretation of invalid syntax. But, before adding such a feature, I wanted to poll the group to determine if it is worth doing. (In theory, it is only a few hours of work.)

But heed this warning...

Any such change would come with the additional warning that what is invalid REBOL today may be valid in the future. For example, today we do not allow a pair! to include floating point such as:

1.25x2.5

But, that is a desired quality, and we hope to add it in the future.

Any relaxation of errors in the scanner would come with the caveat that those exception cases may not be exceptions in the future. REBOL's syntax gets extended over time.

Then, also, there is the issue of how such exceptions would be flagged in the result of the scan. Since those invalid values by definition exist outside the value space of REBOL, how do you detect them in the resulting block? They are "out of band" in the lexical sense.

One possibility would be to return them as strings, but with a special flag attached (similar to the new-line flag that keeps source code line break information to allow nice-looking molding of code):

code: load/relax %data.txt
foreach value code [
    if invalid? value [print value]
]

Of course, the word invalid is too general purpose here, so a better word may be needed.

There is also the issue of what it would mean to form and mold those values. Would they be converted to strings, or would the invalid form be output?

There is also the issue of how the scanner recovers from the error. That is, where does it "get back in sync" with the source. For example, does it look for the next delimiter, whitespace, or end of line? This is not as obvious as it may seem, as can be found in the line:

"there is no end quote

The scanner detects the missing quote when it hits the end of the line. So the invalid value will be everything from the quote to the end of line. But, in the case of:

1st is better than last

The "1st" will be the invalid value, and the rest will be valid REBOL words. So, you can see my point there. The extent of the invalid value is determined by the internal state of the scanner.
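To make the resync question concrete, here is a minimal sketch of one possible policy, written as ordinary REBOL 2 code on top of LOAD/next rather than inside the scanner. The name relaxed-load is hypothetical (it is not an existing function), and the sketch assumes that LOAD/next raises an error on an invalid token and that recovery resumes at the next whitespace. Invalid tokens come back as plain strings, so this does not solve the flagging problem described above; it only illustrates where a scanner could get back in sync.

```rebol
REBOL [Title: "Relaxed load sketch (hypothetical helper, REBOL 2 assumed)"]

relaxed-load: func [
    "Load values one at a time; keep unloadable tokens as strings."
    src [string!]
    /local out res pos end ws
][
    out: copy []
    ws: charset " ^-^/"          ; space, tab, newline
    pos: src
    forever [
        ; skip leading whitespace so each token starts at pos
        while [all [not tail? pos  find ws first pos]] [pos: next pos]
        if tail? pos [break]
        either error? try [res: load/next pos] [
            ; assumption: resync at the next whitespace (or end of input)
            end: any [find pos ws  tail pos]
            append out copy/part pos end
            pos: end
        ][
            ; load/next returns [value rest-of-string]
            append/only out first res
            pos: second res
        ]
    ]
    out
]

probe relaxed-load "this is 1st"
```

With this whitespace policy, an unterminated quote would be cut at the first space rather than at the end of the line, so the extent of the invalid value differs from what the scanner's internal state would give. Comments and nested blocks are not handled specially in this sketch.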

This topic is now open to your comments.

26 Comments

Comments:

Edoc
3-Apr-2007 16:59:32 I'm interested if this feature could lead to more flexible and robust dialects via 'parse. My own experiments in this area suggest that it's a great deal of work to manage user input (e.g. typos or input not easily coerced to REBOL datatypes).
Maxim Olivier-Adlhoch
3-Apr-2007 17:16:59 PLEASE PLEASE PLEASE!
wow, this comes right in line with an implementation I am actually doing with antidote (a string! based rebol interpreter built with parse as the run-time and 100% re-implemented type classes, providing a 100% sand box. No values are actually loaded or evaluated by REBOL -directly-, I don't call load or do, I don't even use the types directly, I have my own datatype engine :-)
I have the same issue and I understand all your qualms, especially since some constructs are inherently greedier than others (like the string example).
The biggest caveat is when things like strings expect end terminators beyond normal separators like spaces. in this case, the parser will inadvertently go up to the end of the document or some other conflicting symbol before realizing the error, which can be costly in terms of performance, especially if such tokens occur repeatedly. Even worse is how to realign nested blocks, for example.
I can see (cause I am doing it myself) how some lexical contradictions could lead to useless assumptions depending on the nature of the input, the intent of supporting "lexical extensions", and the general structure of the source data. Because of all of this, by default I'd prefer you stop at the first whitespace, cause that is how most symbols are represented by humans. The whitespace is a powerful stop. If people have their own start and end terminators (a comma?), then code which supports these will be responsible for detecting and sorting through the various values anyway.
I've also thought often about how "lexical errors" could be stored within rebol and I admit, using the parser extensively, I'd treat them exactly as strings in all regards. Just having their own type name (and serialized form) would allow us to very easily integrate them within parse rules which just convert to other types of our choosing (in altered form?) or most probably custom types.
call it any! or unknown! or something of the like. The fact that lexical analysis evolves is already a consideration in REBOL so it might not be as big an issue for current developers as you think, it just allows us more options.
It might even allow us to implement custom datatypes with recognizable lexical forms... hell, we might even provide a few useful types which fit in the rest of the rebol family and help you implement future extensions, by giving actually implemented examples :-)
Brian Tiffin
3-Apr-2007 17:24:56 Yippee,
I do believe this will up the 'simple things simple' concept of REBOL. And open up all kind of possibilities for user input.
That is of course after you (Carl) work out all the technical bits regarding missing quotes and block brackets, etc...
I already have plans swimming in my head for a better forthish string/line editor, using parse and invalid? instead of a recursive try loop beastie I have now.
Unless one of the guru's thinks of a showstopper, I'm excited.
Thanks.
Gregg Irwin
3-Apr-2007 18:07 I've always been leery of this, feeling that if you need to parse non-REBOL data, use string parsing. Maybe my fear is that it's *so* seductive. I mean, come on, a loader *that* smart; it's too good to be true.
My main reason for not fighting it too hard is that I know someday I'll probably have a need for it myself, where I'd rather have REBOL do yet *more* work for me. :-)
Norm
3-Apr-2007 19:17:56 To let Rebol become the Mother of ad hoc dialect interpreters, Rebol will attract a strong interest in lexical analysis, and languages that stem from real use. Your suggestion not to constrain Rebol core with that is wise; alternating from contexts of interpretation is generally a corecursive reactive loop, so fussy simple, so easy to scrap, hard to do well, neither too much nor too little. The knowledgeable will be tempted by the mathematical analysis of the Liar Paradox by the late Jon Barwise with non-well-founded sets, and Nuel Belnap's work on corecursive definitions mediated by negation for another view of the logic of those structures. To toggle off regular interpretation may be the opportunity of a clever explicit toggle back-in Rebol explicit lexical token, better maybe than white space? And we could feed the interpreter a new behaviour to apply from the toggle-off cut-point. We would all like to devise our own ad-hoc _paragraph_ editor, right in Rebol Interpreter, a la Emacs or Vim. R3 could become a linguist's dream, a logician's desk, even a therapist conducting interviews. Giving a way to use the R console with some specialised dialect will nurture some amazing uses.
Maxim Olivier-Adlhoch
3-Apr-2007 19:26:30 for my part, a large part of what I am doing now could be complimented by REBOL, if it allowed REBOL-alien symbols within a rebol block.
hum an alien! type ? ;-)
this means I could at least load data (not actual REBOL source code) and then act on it without fear that REBOL won't choke on lexical extensions or out of bounds values, etc, which antidote might support.
some of the types are hard to recreate within the parser, so having REBOL giving me 90% of the work is already seductive.... right now 'LOAD is useless cause the slightest issue renders it 100% incapable.
the most notable symbol which crashes REBOL load is a stray comma (,) ! this means most text content cannot be loaded (where about 95% is loadable) but must be string parsed :-(
Brian Tiffin
3-Apr-2007 19:46:01 Suggested names...
junk! garbage! unknown! babel! syntax! alien! foreign! typo! mistake! lober!
If I had my druthers this datatype would be gettable but not expressable.
Sorry, I'm just having fun now...but I definitely liked Norm's comment. Lead the horses to water...
Oldes
3-Apr-2007 19:51:19 I would vote for whitespace as a delimiter. In my dialect, which is based on loaded block I have only problem with invalid tags. I cannot use << and <<< as valid words. So maybe just some way how to relax the scanner so it knows that such a sequence of chars is not invalid tag but a word would be enough for me. And maybe to let comma to be valid word as dot can be.
Anyway... I still would like to know if string based parsing (of some dialect), can be faster than parsing preloaded data.
And it would be really cool to be able to load for example ecmascript for parsing. I'm sure I don't need to load html pages to parse, for such a case string based parsing is good enough.
Maxim Olivier-Adlhoch
3-Apr-2007 19:59:18 obviously the real solution would be to be able to add actual parse rules right in the load call, this would allow dialect makers a chance to integrate custom types and data directly within the system. ex:
>> load/extension "coord: 0.3x23.4x44.1" [
    copy val [some decimal "x" some decimal "x" some decimal]
    (make utype! 'tripplet val)
]
== [coord: #[utype! tripplet [0.3 23.4 44.1]]]

(NOTE: I am just projecting a possible example user type api, I have no internal understanding of the R3 user type model)
for full extensibility, I'd parse the rules first, allowing any dialector the chance to override internal type construction, by what those symbols mean for his dialect.
This means some type values could even be interpreted and other values of the same type be loaded normally.
possible effects:
pre-reduced words
auto-capitalized strings
auto-(un)abbreviation
decryption
compression
paths which load images directly
denied urls
the list is limitless, :-)

why do so directly on the load you ask?
-cause you end up with your data in one pass
-you don't have to fiddle around trying to recursively browse blocks.
-your user types end up being fully integrated.
-you don't have to REPARSE 700MB of data to find and detect 3 potential occurrences.
-for security concerns too, you might want to ensure that any loaded data does NOT contain some data.
Carl Sassenrath
3-Apr-2007 23:17:05 Just a quick note regarding the official definition of a REBOL dialect: it must be lexically valid. Arbitrary string-based lexicons/grammars are not dialects.
I'll have more replies to your feedback soon.
Dave Cope
4-Apr-2007 3:15:27 I'm not sure I understand the full ramifications of the relaxed lexical analysis well enough to comment yet. But I would love to see decimal pairs which might also support negative numbers too.
I work with geographical data and being able to encode a 2D spatial point as 51.1234x-3.5678 (lat/lon) or 21234.5x34567.8 (metric easting/northing) would be wonderful ! I could see a block of these points being able to model polygons and polylines.
Cheers.
Rob Lancaster
4-Apr-2007 7:47:41 I quite like Maxim's example... Especially if you could override REBOLs default behaviour:
For example, when parsing linker Map files, I'd like 0x340 to represent hex and produce an Integer! Rather than a pair... and not to stop when it encounters 0x03A
----
Does Maxim seem to want user datatypes? Could these be defined as only existing inside a particular context?
Therefore, doing a load in that context would not need the /extension refinement. But merely use the lexical definition acquired from the local context...
Could userdata types override the lexical definition of native rebol datatypes, for that context?
Could it be possible for a programmer to explicitly specify that a limited set of datatypes is available within a context? Then, when REBOL is extended in the future, the explicit specification of available datatypes will ensure that a string that is loaded and currently produces an unknown! value shall continue to produce an unknown! value in the future....
This is probably not a few hours work? :-)

---
Edoc
4-Apr-2007 11:16:40
>> Arbitrary string-based lexicons/grammars are not dialects.
Sorry for any confusion I may have introduced. An interactive, dynamic interpreter is probably the term I should have used instead of dialect.
Brian Hawley
4-Apr-2007 16:41:13 There's a few things here that I don't understand:
Why are you using LOAD on non-REBOL data? I don't mean REBOL code, I mean the REBOL data format. Isn't that what PARSE is for? LOAD is for REBOL dialects - you use PARSE for other languages.
Why doesn't LOAD interpret 1st as a word? For that matter, why doesn't it interpret any string of word characters that doesn't match up to another type as being a word? Is it to reduce errors, or to keep REBOL code sane-looking?
Last, but not least, why is it possible to crash LOAD at all? LOAD might fail, maybe even throw a catchable error, but should never crash.
Maxim Olivier-Adlhoch
5-Apr-2007 1:32:35 brian, REBOL has such a rich lexical analyser within LOAD that trying to reimplement it all the time becomes tiring. moreover, when I look back pragmatically at many of the data sources I have needed to analyse in the past... about 90% of the raw data fit into REBOL LOADED form...
I am currently rebuilding much of what is internal to REBOL and its tedious, I know others have done it too, each with a flavor, but if the LOAD didn't throw an error I could rely on much of what load provides freely and MUCH faster.
(sorry if the word crash is used liberally by myself, but a load error is = to a crash from the user's stand point)
Carl: I understand what a dialect is (sorry if I sound blunt in saying it ;-) , but specifically the allowance for lexical error relaxing would allow us to extend A dialect to treat such symbols as valid data, thus it would BECOME a valid dialect, since load would pass it on to the program to allow or deny.
Maxim Olivier-Adlhoch
5-Apr-2007 2:08:31 I guess my overall stake on the whole subject is quite simple:
Do we ALLOW the people using REBOL to possibly implement hacks wrt REBOL's lexical analysis OR do we "protect them" from themselves of possible harm by various long/short term side effects.
I'd rather allow people to hang themselves and put a severe warning on the rope. "Do not place rope around neck, unless you understand the exact effect of the knot. Also note, rope length might shorten in time"
I understand Carl wishing to keep control over the language, but allowing people to extend something as fundamental as this... is a feature (not a bug), and a hellishly powerful one at that. provide an official api (however simple or basic) and control is retained. whatever people do with it... is their problem. as long as the api remains constant (like lexical relaxation remaining in future versions) we can work around this and make a good deal of use out of it.
The open (as in programable and public) port spec is a good example. an api was given to add port url schemas directly within the language. something which very few other languages can do at the lexical level.
Anyhow, I'll stop Being the canadian Pekr I can be with my critic posts now. I'll leave you all to decide on this topic I have long and often mused about in the last 8 years, (alone in my dark, dusty dungeon, crafting liquids and elixirs ;-)
Brian Tiffin
5-Apr-2007 9:16:32 Carl; I agree with keeping dialect pure.
I'm looking forward to using the relaxed parsing for CLI's. As a matter of fact, my plans include a CLI that uses supporting dialects to produce a dialect block to get the work done.
I guess I had envisioned a parse block rule that included a simple(r)
[ set command word! set params [string! | junk!] ]

Thanks again.
Volker
5-Apr-2007 18:12:44 I share Maxim's view. 'load knows a lot of useful things. After all, urls etc. were built into it for that reason: being very natural. Adding some exceptions for custom formats may go a long way. Another purpose may be to embed binary data. For now it is bloated by base64. The rule could also trigger to stop, i imagine 3 + 4 + 5, click on the 3 in an editor and get the rebol-result up to the "," (something "wrong")
Brian Hawley
5-Apr-2007 19:28:50 Maxim, I like to look at it the other way around. Instead of hacking LOAD to load non-REBOL data, why not add a load directive to the PARSE dialect to handle embedded REBOL data? I believe such a thing was proposed before, but in case not...
Proposal: A new PARSE operation called LOAD, to be used during string parsing.
Syntax: LOAD rule . The rule would be any directive, block of directives or reference to a block of directives in the PARSE block dialect.
Behavior: The processor would attempt to perform the equivalent of the REBOL function LOAD/NEXT at the current parse position. The parse position advances to after the REBOL value. It would then attempt to check the resulting REBOL value using the rule, which would be applied using block-parsing conventions. If either the load or the check fail, it would trigger a standard parse failure and backtrack appropriately. The SET operation should work with LOAD, or perhaps the COPY operation.
Conditions for success:

At the current parse position there must be a sequence of characters that would constitute a properly formatted REBOL value

That REBOL value, once loaded, is matched by the rule in accordance with the block parsing dialect.

Conditions for failure and backtracking:

If the character sequence at the current position does not correspond to a REBOL value, fail.

If the value doesn't match the rule, fail.

Example:
parse "Hello, World!" [some [
    set x load skip (print type? :x)
    | copy x [skip to " " | skip to end] (print ["Other:" x])
]]
Volker
5-Apr-2007 19:35:54 I prefer the other way around. Because Brians way needs some expertise with 'parse. Maxims way could predefine some extensions, and use [ load/extend data maxims-specials ]. But i would like a load-smart string-parser too, but then not with 'load, but datatypes. [ .. set x any-type! (..) ..] would do it. Want both^^
Oldes
5-Apr-2007 19:55:22 Some parse-load combination may be useful if it reduces one processing pass. For example in my dialect which is block based I have to first load the string and then use parse on such a loaded block. For a small dialect it's probably not a problem, but what if I will have a mature dialect one day:)
Brian Hawley
5-Apr-2007 20:05:27 It makes sense to be able to parse REBOL values in the middle of non-REBOL text, but not as much the other way around.
The problem with relaxing the lexical rules of the LOAD function is that REBOL can't handle natural language syntax, particularly punctuation. You would need to convert a word to a string to tell the similarity between "Hello" and "Hello," or the difference between HELLO and hello, at which point you are doing string parsing again, but slower.
I can see the advantage to having a comma! datatype that would be a syntactical noop, so you can put one anywhere in REBOL code and it would be ignored by the standard dialects (like DO and PARSE rules). You could even make a comma! a delimiter so that you could put it right next to a word or something and still distinguish it. Then we can give these commas meaning in our own dialects, or ignore them too.
Maxim Olivier-Adlhoch
5-Apr-2007 20:19 Brian, I feel strange realizing that I am here able to even participate in this discussion, when 'PARSE has been such a mystery to me for so long. :-)
Your load idea is actually very clever. I had thought of a new function which did something very similar to what you propose but was missing a universal way to apply it... your proposal is that idea. what I realise mainly is that with your idea, we can end up with load/extension just by putting the LOAD directive at the end.
ex:
src: {this: 'is an "unloadable" string! , really!}
blk: copy []
parse src [
    some [
        "," (append blk ",")
        | set x LOAD [NOT function!] (append blk x)
    ]
]
(note I didn't use parse/all)
(note: I slipped-in the most wanted parse feature request too, try to find what it is ;-)
here any unstrung comma is automatically appended as a string, any other unloadable data raises a parse error :-) . volker see how this even reads like the load/extension example I gave... only its inside a more flexible shell, the parser.
yep. I like brian's idea very much. in one of my (tedious ;-) posts, I wanted to suggest adding lexical relaxation as a separate function entirely. Adding LOAD directly within parse feels much more like a multiplication of features :-)
In fact that is exactly the goal which is desired, to let LOAD complement my own parse rules...
also, there is a definite chance that all current parsers which would retrofit this LOAD directive would become even faster.
I think we have a VERY powerful language extension here, parse made easy for any newbie.
I guess my tedious posts might have spurred some good ideas from others at least!
Brian Hawley
5-Apr-2007 20:49:19 Yeah, that NOT would be nice. For that kind of thing now I usually use FAIL, which I set to [END SKIP] since it will always fail. Example with LOAD:
fail: [end skip]
parse str [ set x load [any-function! fail | skip] (append c :x) ]
Tricks of the trade, I suppose.
Maxim Olivier-Adlhoch
5-Apr-2007 23:32:37 good trick!... but NOT would still be a bit easier to use since its on the left ;-)
Volker
8-Apr-2007 17:48:21 I would use datatypes directly in parse, instead of 'load. Looks very much like block-parser then. And in this cases i will have very specific requirements to the data. I want a 'number!, a 'date! etc. If i can get a block! or even function!, or write [ load [number!] ], i am close to using the old load/next-way.. [ number! any ["," number!] ] looks much better :) About usage, my main use will be be to load rebol out of other text. DME and Oberon 4 are still the smartest editor to me. If R3 gets the ability to build a real editor (gui-speed, colors), i will have a lot executable notes in my texts :)
