Comments on: Comma as a delimiter in some special cases?

As you know, REBOL does not use commas. This design decision is critical to the foundation of the language.

In REBOL we write:

[1 2 3]

not:

[1, 2, 3]

Why? Simple, because we want to write:

[1 + 2]

not:

[1, +, 2]

Think about that carefully. In REBOL data can be code, and code can be data. Commas would prevent that concept from working properly. This property makes REBOL quite special and unique in the world of programming languages. So, it's fundamental.

Now, ticket #537 requests a change to the lexical analyzer to make commas act more like delimiters. Several people have asked for it over the years. The idea is that if someone provides data in the form:

a, b, c

we could use the REBOL scanner to parse it without too much trouble.

I spent some time today looking into this request. It's possible, but non-trivial.

While comma doesn't have any special significance in the lexicon, it is used for non-English style decimal points... worldwide. That is, in REBOL:

1.23 = 1,23

People in the USA freak out on that, but if they travel to say, Paris, after the initial shock, they get the idea. I suppose it works the opposite direction as well.

Anyway, comma is not a delimiter in REBOL, it's just a special character used for numbers.

Now that all that's clear, can comma be made a bit stronger to act more like a delimiter? Perhaps. But, we'd need to resolve this ambiguous case:

[abc,123]

Which of these lines would be the correct result?

[abc 123]
[abc 0.123]
**error: comma found in expression

Now, what happens if we write:

[abc,123,456]

Hmmm. You begin to see the problem, and it's non-trivial because comma means two different things here.
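
To make the competing readings concrete, here is a minimal sketch in Python (not REBOL, and not how the REBOL scanner actually works). The simplified word and number patterns and both function names are assumptions invented for illustration only.

```python
import re

# Hypothetical, heavily simplified token shapes; real REBOL words and
# numbers allow many more characters than this.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")
NUMBER = re.compile(r"[0-9.,]+")

def comma_as_delimiter(text):
    """Reading 1: every comma separates two values."""
    return text.split(",")

def comma_as_decimal_point(text):
    """Reading 2: a comma is only legal as the decimal point of a number."""
    values, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        word = WORD.match(text, pos)
        if word:
            values.append(word.group())
            pos = word.end()
            continue
        number = NUMBER.match(text, pos)
        # Reject anything that is not a number, or has two decimal points.
        if number is None or sum(number.group().count(c) for c in ".,") > 1:
            raise ValueError(f"invalid input at position {pos}: {text[pos:]!r}")
        values.append(float(number.group().replace(",", ".")))
        pos = number.end()
    return values

print(comma_as_delimiter("abc,123"))        # ['abc', '123']
print(comma_as_decimal_point("abc,123"))    # ['abc', 0.123]
print(comma_as_delimiter("abc,123,456"))    # ['abc', '123', '456']
# comma_as_decimal_point("abc,123,456") raises ValueError at ',123,456'
```

Both readings are defensible for the two-value input, and they disagree. With three values the decimal-point reading has no sensible answer at all, which is the conflict described above.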

So, after some thought, I think we're better off leaving such lines to be processed as strings by parse and keeping the REBOL lexical analyzer as it is.
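
A minimal sketch of that string-level approach, written in Python rather than with REBOL's parse; the helper name split_fields is hypothetical. At the string level the comma has only one meaning, so the ambiguity never arises, and the caller decides afterwards how to convert each field.

```python
def split_fields(line):
    """Split a comma-separated line into trimmed string fields."""
    return [field.strip() for field in line.split(",")]

print(split_fields("a, b, c"))        # ['a', 'b', 'c']
print(split_fields("abc,123,456"))    # ['abc', '123', '456']
```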

18 Comments

Comments:

Gabriele
7-Feb-2010 6:05:21 People who ask for LOAD to work on "a, b, c" are just crazy. ;-)
Brian Hawley
7-Feb-2010 7:01:21 The request was not for load to work on "a, b, c". It was for transcode/next/error or transcode/only/error to generate the error after the "a" is recognized, at the position of the first ",", rather than at the "a," position. This would allow error recovery routines to be more useful and easier to write.
No sane person would dispute that "," should be an error in REBOL - the question is when that error is triggered :)
Edoc
7-Feb-2010 11:44:33 I'm indifferent to this, mainly because I have no need to support [1, 2, 3], but my thought is that if it were supported, the series [1, 2, 3] (with one or more spaces after each comma) would be valid but not [1,2,3] (no space between values).
Nick
7-Feb-2010 12:34:19 Perhaps I don't understand all the benefits of making such a change, but from my limited perspective, it seems like a very bad idea to make this fundamental change. Would the benefits really outweigh the potential problems?
Ladislav
7-Feb-2010 12:35:05 +1 to Gabriele's comment
Carl Sassenrath
7-Feb-2010 14:10:44 I should point out that Brian is correct. The request was related to how transcode deals with it as an error. It seems like a reasonable request, and there may be some method to do it.
The problem is, under the hood it's more complicated. REBOL's lexical analyzer is very fast. It gets that speed from a three layer token classification/processing mechanism. Any exceptions that we add to this mechanism will slow down processing.
Note that comma isn't the only non-delimiting invalid character within words. So are pound (#) and others. If I write abc#def,ghi where would we indicate the error stopping point? Probably at the #.
So, to make the requested change, we'd want to return the error position at the first invalid char. This position would be determined by the error handler, not by the scanner itself (which does not know about the invalid char error position.)
So, that's something to consider in the future, if we so desire.
Brian Hawley
7-Feb-2010 16:43:36 The real trick of the request is that this is the current behavior:
>> transcode/error to-binary "a,"
== [make error! [
    code: 200
    type: 'Syntax
    id: 'invalid
    arg1: "word"
    arg2: "a,"
    arg3: none
    near: "(line 1) a,"
    where: [transcode]
] #{}]

And here is the desired behavior:
>> transcode/error to-binary "a,"
== [a make error! [
    code: 200
    type: 'Syntax
    id: 'invalid
    arg1: "word"
    arg2: ","
    arg3: none
    near: "(line 1) a,"
    where: [transcode]
] #{}]

The current transcode/error behavior would work with more carefully written error handlers if the exact supported character set of R3 word values was documented. That would make it easier to write our own recovery parsers. It would also help if ticket #1457 was fixed (or implemented, if you prefer to think of it as a wish).
Brian Hawley
7-Feb-2010 17:01:09 The other effect of this request would be to change this current behavior:
>> transcode/error to-binary "a,123"
== [make error! [
    code: 200
    type: 'Syntax
    id: 'invalid
    arg1: "word"
    arg2: "a,123"
    arg3: none
    near: "(line 1) a,123"
    where: [transcode]
] #{}]

To this desired behavior:
>> transcode/error to-binary "a,123"
== [a 0.123 #{}]

And change this:
>> transcode/error to-binary "a,123,456"
== [make error! [
    code: 200
    type: 'Syntax
    id: 'invalid
    arg1: "word"
    arg2: "a,123,456"
    arg3: none
    near: "(line 1) a,123,456"
    where: [transcode]
] #{}]

To this:
>> transcode/error to-binary "a,123,456"
== [a make error! [
    code: 200
    type: 'Syntax
    id: 'invalid
    arg1: "decimal"
    arg2: ",123,456"
    arg3: none
    near: "(line 1) a,123,456"
    where: [transcode]
] #{}]
DideC
8-Feb-2010 5:27:29 I'm not for anything that is a special case (handling of comma like this is a special usage case IMO) and that would slow down Rebol if it is implemented.
So to me it's simple: if the parser knows the exact position where it finds an error (so at the comma position), then handle this correctly in the make error!. If the position is at the beginning of the word, then let it be as it is.
Another possibility is to make the error message (the one displayed) clearer by adding what sort of chars could cause the error, so that the developer can know about it.
And of course, let it be well documented in the 'transcode ('load ???) documentation page.
Rebol is not made to be a "Nostradamus" loader of any data. If a human cannot understand the meaning of some data, why would you want the computer to be smarter than the human?
So let the developer's parser handle this if his program needs to handle it!!
Brian Hawley
8-Feb-2010 16:01:40 It has been suggested that there be a place in system/catalog where standard charsets that would be used by many parse rules would be stored. If a word-char charset was one of those, including all of the chars that are recognized as being part of a word! (not including the special cases), then the request for the comma tweak to transcode would be unnecessary, and could be withdrawn.
Heck, even documenting the R3 syntax with an optional module filled with charsets and parse rules - kept current with the precise syntax - would be enough. That would allow us to write a whole variety of R3 syntax analysis tools, recovery parsers and such.
When it comes down to it, this request is a reaction to the lack of documentation about the exact syntax of REBOL. If the documentation of the syntax isn't sufficient to help us write our own REBOL parsers then whole classes of REBOL add-on development tools are just blocked. If we just knew what the delimiters that affect words were then we could write our own DWIM parsers for mostly-REBOL data and be happy.
We just don't have the time to reverse-engineer something that should be documented in the first place. If it's documented then we can write (and contribute) all sorts of tools without needing to slow down the real REBOL parser in the slightest.
-pekr-
9-Feb-2010 3:54:59 To BrianH's comment - AMEN :-) Carl? :-)
Henrik
9-Feb-2010 4:30:31 I very much agree with Brian Hawley.
Mark Ingram
9-Feb-2010 11:09:13 +100 Brian Hawley - the lack of a formal syntax specification for REBOL, even one written as rules for 'parse, has always been and will continue to be a huge stumbling block for professional software developers like myself.
With the (albeit partial) adoption of UTF-8 source code, this problem has gotten even worse than it was ten years ago, when it had the status of "must fix immediately" IMO.
Ingo
10-Feb-2010 5:01:40 Me too :-) +1 to BrianH
I once asked for default rules to parse Rebol.
Then we got the block parsing, which is helpful to parse valid Rebol, but not so much to parse something different (if even slightly).
Brian Hawley
15-Feb-2010 2:26:41 Ratio, there's no difference: Code is data, at runtime - it's data for the code interpreter. Even a function is data and contains references to other data, such as the spec and code blocks. The distinction you make has no meaning for REBOL, or for that matter a few of the Lisp dialects (but not the most popular ones). This can be a little confusing to users who are more familiar with less-powerful languages, though.
You are right that the "can be" phrasing didn't make much sense. It was really a bit of soft-pedaling so that we wouldn't have to explain how dialects and interpreters work. Replace "can be" with "is" and it makes more sense.
Brian Hawley
16-Feb-2010 1:40:45 We're not going to rehash your favorite flame war again. I know what you mean, understand your arguments in great detail, and disagree with them.
It is clear that you have a different idea about what the "basics" are, and that idea is derived from languages that are so different from REBOL that they aren't even comparable. You also refuse to recognize that there are situations that require languages that follow a different model than C. C is a great language for certain uses, including some of the implementation of REBOL. I use C when it's appropriate, and use REBOL or other languages when it's not. Which happens a lot, btw.
The knowledge of "Good developers" can be wrong if it's inapplicable, or if they aren't familiar with the subject, or if they refuse reason. You have proved that there is no point to arguing with you, and it's not because of any soundness of your arguments that you might imagine.
Please use another language.
-pekr-
16-Feb-2010 4:14:18 Ratio, you see? Brian Hawley does not agree with you. You know who Brian Hawley is? He is "BrianH", and he is the 2009 winner of the reboller of the year, he is our luminary, one of the few who contribute to the very core of the technology. I just hope that for any potential readers it is clear now whose arguments are right, and whose are not all that correct.
And Ratio - nothing against you personally, but - if you have any proposal for how to eventually make REBOL better, please use the DocBase wiki and write down some document. I did so for the REBOL marketing area, you can do so for the technical aspects of REBOL ....
xRatio
21-Feb-2010 18:03:26 (at)Brian Hawley,
Please explain exactly what you want to say or want to know about logical problems described earlier. Sorry - personal aggressions are no arguments.
It's very easy to write illogical things into code, Brian. ;-(
If on higher levels others try to correct lower failures they often introduce new failures. The results are megabytes of nonsens nobody understands.
We see this everywhere - beginning -sorry- also in REBOL which basically is an ingenious system.
Best thing for REBOL - people like you, Brian, help to stop illogical things. It's not easy, I know. But necessary.
Cheers, xRatio
