REBOL 3.0

Comments on: Improvements to the PARSE function

Carl Sassenrath, CTO
REBOL Technologies
5-Nov-2008 18:50 GMT

Article #0155
Main page || Index || Prior Article [0154] || Next Article [0156] || 53 Comments || Send feedback

It's long been suggested that some improvements to PARSE would be useful. I know that many of you want these. I want them too.

R3 is the place to make such changes, because we have allowed ourselves the luxury of breaking just about everything... in order to make improvements. For example, Unicode has forced changes in PARSE already.

So, I think we can make some improvements to PARSE, but to succeed in doing so, we must be quite strict about the project. After all, PARSE is not a function to be modified casually.

Specifically:

  1. All improvements should be gathered into a single list posted on DocBase where they can be openly added and edited. Each improvement should be given a reference identifier of some kind. (Note that a few developers have already started listing ideas, so they may want to contribute that to this project.)
  2. Someone should search RAMBO and gather all the related PARSE requests and post them to the DocBase list.
  3. We should agree (as much as possible) on the priority of changes. See my notes below.
  4. Each improvement will require test code be provided that would certify its correctness. No test code, no improvement. (Sorry... you often ask me what you can do to help. Please don't put the burden of testing such changes on me.)
  5. Any features that may be affected or changed as a result of the improvement should be noted and also must include tests.
  6. All changes need user documentation. (The DocBase list is probably a start, but more examples are necessary.)
  7. We need one person to volunteer to be in charge as the main editor/manager of this project. This is important. That does not mean you will have to write the docs, tests, etc. only that you will help urge other people to do so.

Priorities

Regarding priorities, it is always difficult for a group of people to agree on anything specific. So, here's how we will rank the priorities:

 HighChanges that are critical, but not highly complicated. For example, providing a NOT command seems easy enough, and it is now critical because using complemented charsets is problematic (due to the Unicode enhancements).
 MediumChanges that are essentially "easy" and provide much greater usability (or ease-of-use) in PARSE. For example a FAIL command to force a rule to fail.
 LowChanges that very few people would use, or they are quite difficult to implement. To recognize these, if you suggest an enhancement, but the algorithm seems mysterious or unclear, then it's probably of this rank.

There are also some priorities that fall in-between.

You can also post your suggestions in the comment section here... if you don't want them to be lost.

Links

53 Comments

Comments:

Peter Wood
5-Nov-2008 18:32:05
START

Purpose: Allow initialisation within a parse rule

Priority: Medium

Example:

  parse [a b c d] [
    any [
      start (acc: 0)
      |
      set inc integer! (acc: acc + inc)
      |
      end
    ]
  ]

Test:

 [
   a: 0
   parse [] [start (a: 1) end]
   1 = a
 ]
Peter Wood
5-Nov-2008 18:41:43
RETURN

Purpose: Allow a value to be returned from a parse rule

Priority: Medium

Example:

  total: [
    any [
      set inc integer! (acc: acc + inc)
      |
      end (return acc) 
    ]
  ]
  acc: 0
  print ["The total is" parse [a b c d] total]

Test:

 [
   acc: 0
   a: 1
   b: 2
   total: [
     any [
       set inc integer! (acc: acc + inc)
       |
       end (return acc) 
     ]
   ]
   3 = parse [a b] total
 ]
Brian Hawley
5-Nov-2008 22:00:09
Peter, your START is unnecessary because you can just put the paren at the start by itself and it will behave the same way. Your RETURN sounds interesting, though it would need to be a PARSE operation rather than something you run in the paren. Code in a paren is standard DO dialect code.
Steeve
6-Nov-2008 12:09:01
10 months ago, i tryed to develop an async parser (parsing of stream files). IMHO, a major improvement of parse would be this: to be able to parse data ports.

* Pending this day, i reposted my last version of async-parser here: "About the parse evolution" thread

http://www.digicamsoft.com/cgi-bin/rebelBB.cgi?thread=<6Nov200818052719372100>

* Currently, it can parse a file without the need to load the whole file in memory.

Brian Hawley
6-Nov-2008 15:07:56
Peter, your RETURN proposal has been broken into two proposals and submitted.
Steeve
6-Nov-2008 15:37:04
The RETURN command could be generalized and replaced by the EMIT command like in the COLLECT function. If only one EMIT command is encoutered during the parsing process, parse will return only one value. If several EMIT have been encountered, then parse will return a block of values.

Brian Hawley
7-Nov-2008 19:14:50
Steeve, you can use the existing COLLECT and KEEP (not EMIT) functions with PARSE now in parens. As many have noted lately, the COLLECT and KEEP model is only one of many that people want to do. We can't support all models in PARSE - native code tends to be inflexible. The wide variety of collection models would be better served by REBOL code.

We are being careful to maintain the simplicity of PARSE, only adding features that are really tricky or impossible to do well in PARSE now. The modification ops would only be added because they cover common operations that are commonly done wrong. Simple features with big payback in ease of use or power are preferred.

The main advantage to the proposed RETURN operation is that it would stop the parse altogether no matter how nested you are in the rules. BREAK doesn't do that - you have to work around with COPY, some vars and THROW and CATCH in parens. That's a little too tricky for my taste. Nonetheless, RETURN might not make it in - making BREAK/return work with PARSE might be sufficient.

Norman
8-Nov-2008 3:49:14
Are these logic-changes or addons to Parse?

I realy hope addons because nothing is more irritating then mastering a good language and then go back to school again...

Oldes
8-Nov-2008 12:31:20
Maybe some better binary support at least to get binary results if parsing binaries.

I really don't like the current behavior:

>> parse/all #{00FF} [any [copy x 1 skip (probe x)]]
"^(at)"
"ΓΏ"
== true
I would rather expect result:
#{00}
#{FF}

Brian Hawley
8-Nov-2008 16:51:43
Norman, even addons would break some code because the new keywords might be the names of existing rules. For that reason, most of the proposed operations have been named after functions in REBOL - less conflict and confusion that way.

The only proposal that would not be compatible with existing code is the INTO changes, but that change is too powerful to pass up, given the limits of the original.

One thing to keep in mind here is that these PARSE proposals are to be applied to REBOL 3, which is a deliberate break with backwards compatibility. We need this to be able to fix the really bad bugs and design changes. This is a major version change.

Brian Hawley
8-Nov-2008 16:54:07
Oldes, there will be changes when it comes to parsing binaries in R3 because they are not interchangeable with strings any more, as a result of the Unicode changes. I don't know what form those changes will take, but your suggestion sounds like a good start.
Paul
8-Nov-2008 22:47:43
I would like to see Parse be able to return true on Datatype! value.
Peta
9-Nov-2008 8:13:45
Brian: regarding the proposed INTO change: see the AND rule (an opposite of the NOT rule, see http://en.wikipedia.org/wiki/Parsing_expression_grammar).

Using the AND rule we can keep the "original" INTO rule and instead of

INTO type rule

we can write

AND type INTO rule

Peta
9-Nov-2008 8:45:24
CHANGE REMOVE and INSERT are usually so inefficient, that it is not advisable to implement them (if you want a new string, then you can build it "on the fly").

BREAK/RETURN, RETURN etc. rules are not necessary. What is necessary is to make the usual RETURN, BREAK, etc. work from within PARSE parens.

The START rule may be useful (not for initialization, though, that was an error), but for the start position detection, e.g. for reverse parsing etc.

Peta
9-Nov-2008 9:00:41
How about an EITHER rule like in GTDPL - see http://en.wikipedia.org/wiki/Top-down_parsing_language?

EITHER rule1 rule2 rule3

It tries to match rule1 and when the match was found, it matches rule2, otherwise it matches rule3

Norman
9-Nov-2008 10:03:56
Brian thanks.. I though I heard last year that especialy 'parse would not be touched in R3.. So i hope it wont be a complete renewal of 'parse...

But I must say .. I realy love the rebol way of parsing.. very very nice ,and readable dialect!

Anton Rolls
9-Nov-2008 21:47:27
Peta,

AND and EITHER are excellent suggestions - looking into them.

I defend CHANGE, REMOVE and INSERT:

When the modified input is short (eg. cleaning a user's entered field), the modifiers are efficient enough.

You may also have a very long input, which could be more than half your available memory. In this case, you do not want to be building a new string, because you may run out of memory.

It's the usual speed vs memory consumption.

CHANGE may also be efficient with very long input, when the replacement length exactly equals the matched length.

The input could be a block! (not a string!), in which case changing the length of smaller nested blocks may be efficient, so that processing the entire block is also efficient.

Both ways - building a new string, modifying in place - are good in different situations, hence: CHANGE, REMOVE and INSERT are good.

Steeve
10-Nov-2008 8:47:25
Peta, good idea of having an operator that force a rule to not consume data (instead of using index to repos) I have some cases like that in my script.

But for your EITHER proposal, i don't see the point. currently: [rule1 rule2 | rule3] acts like your either specs, no ?

Steeve
10-Nov-2008 8:54:46
correction: either acts more like that [rule1 opt rule2 | rule3]
Peta
10-Nov-2008 9:33:50
Steeve: that is not equivalent, the exact equivalent is:

[rule1 (cont: rule2) | (cont: rule3)] cont

Brian Hawley
10-Nov-2008 11:41:05
Peta, we can't keep the existing INTO rule and use AND to get the behavior of the INTO proposal. The existing INTO rule only recurses into block! types, and tests for that.

The main change of the INTO proposal is to extend INTO so that it will work with other series types, particularly string types. In order to do that INTO would need to know what type it would be recognizing, particularly since block parsing has a different set of operations than string parsing does (like INTO, for instance). Even with your proposal all code using INTO now would need to be changed, because INTO would not test for block! anymore.

Brian Hawley
10-Nov-2008 11:48:37
Peta, people use CHANGE, REMOVE and INSERT functions in REBOL all the time. These proposals wouldn't be any more inefficient than the REBOL versions. They still might not make it though, as modification operations push PARSE beyond its scope. The only reason they were suggested is because people have been trying to do this anyways, and usually mess up the attempt.
Brian Hawley
10-Nov-2008 12:01:24
Peta, the BREAK/return proposal is exactly what you suggest: to make the regular REBOL BREAK function work from inside PARSE parens. The RETURN function already works from PARSE parens. I'll add your name to the BREAK/return proposal.

As for your START proposal, that was already suggested as HEAD? (equivalent to [reverse end reverse]). So far it hasn't met the significance threshold since the equivalent is so simple and trivial.

Peta
10-Nov-2008 12:16:50
Brian, regarding the INTO ability to recurse only BLOCK! types: INTO currently recurses all ANY-BLOCK! types.

I guess, that many existing rules use additional type-checking anyway, when the intention was to recurse only the block! type.

String parsing: yes, that is a new functionality, but I don't suppose it to break many rules, while the proposed change surely does.

CHANGE, REMOVE and INSERT are used in REBOL, but in PARSE they are influencing the "current parsing position" as well as any optimizations PARSE could use to be fast. (see e.g. http://pdos.csail.mit.edu/~baford/packrat/thesis/). Since these operations "mess up" the "current parsing position" the most probable result would be that people messing up their current attempts will continue to mess up things anyway. (do you want to bet?)

Peta
10-Nov-2008 12:24:59
To keep it for the future reference: I propose the NOT and AND rules, since they make the Parse dialect compatible with PEGs as specified in the Wikipedia.
Peta
10-Nov-2008 12:27:05
The same reason holds for the FAIL rule, of course.
Peta
10-Nov-2008 12:42:47
It looks, that I was not specific enough regarding the (RETURN) and (BREAK) comments. The (return) and (break) currently (2.7.6.3.1) work OK. I don't ask for additional functionality.
Peta
10-Nov-2008 12:55:05
The FAIL rule would make the Parse dialect compatible with TDPL. Moreover, the above EITHER rule would make the Parse dialect compatible with GTDPL.
Brian Hawley
10-Nov-2008 15:20:50
Peta, these are proposed changes for REBOL 3, not REBOL 2. There will be breaking changes as there are in the rest of the language. Most string parsing rules will need to be rewritten because of the Unicode changes, for instance.

The most important part of the INTO proposal is being able to switch to string parsing for a value in the middle of block parsing - the rest is just a bonus. In theory, you could just extend INTO to string types and let PARSE figure out for itself which parser to use based on the type of the data it finds, but since string parsing and block parsing have different dialects that would lead to syntax errors later on, ones that would not make any sense because they are too far removed from the real error. To reduce errors it is better to require the expected type or typeset be specified - the old behavior could be done as [INTO any-block! rule].

We can't use the Packrat optimizations because rules can be changed at runtime during the parse process, so you can't assume that the result would be the same with the same arguments, or even that you are parsing the same data when you use get-word position setting, or in the same direction when you use the REVERSE operation. Packrat optimizations only work with parser generators or other fixed-rule parsers with no explicit position setting.

The BREAK function does not currently work in PARSE parens like it does in loops. The BREAK operation of PARSE is for breaking PARSE loops, not the entire PARSE process. BREAK/return certainly doesn't work. Hence the proposal. The alternative to this proposal is wrapping the PARSE call in a LOOP 1 [].

The AND and NOT operations of PEGs are interesting, though they wouldn't work as is, and the word NOT is taken by a proposal already likely to by accepted. We'll have to see if the language patterns they recognize are common enough to merit inclusion. There are workarounds for both operations using the usual explicit position setting lookahead:

  • [rule1 AND rule2] = [rule1 p: rule2 :p]
  • [rule1 NOT rule2] = [rule1 p: NOT rule2 :p]

Since the GTDPL EITHER operation would require dynamic rules it merits inclusion, but we'll have to see if its usage is common enough in comparison to all of the other language patterns that require dynamic rules. Dynamic rules are pretty common in advanced parsing.

Peta
10-Nov-2008 15:32:44
Brian: sorry, the AND rule (recursion safe) implementation I mean is rather:

[e (cont: none) | (cont: [end skip])] [end skip] | cont

Peta
10-Nov-2008 15:33:47
And the NOT rule implementation I mean is:

[e (cont: [end skip]) | (cont: none)] cont

Peta
10-Nov-2008 15:35:20
and, moreover, the syntax of both is just:

AND rule

and

NOT rule

Peta
10-Nov-2008 15:49:44
"The AND and NOT operations of PEGs are interesting, though they wouldn't work as is..." - sorry, I don't understand this, as it looks
Brian Hawley
10-Nov-2008 15:53:54
Carl's NOT implementation is supposed to be like:

[not rule] = [rule fail | skip]

He needs this because Unicode charsets in REBOL 3 can't effectively be complemented in the same way that 8-bit charsets are in REBOL 2 without using up all of your RAM (512 MB each). He could easily use alternation as an alternative but that would be much slower.

Those recursion-safe implementations are nice, but we already have recursion-safety handled with the USE proposal.

Brian Hawley
10-Nov-2008 16:02:37
"AND and NOT operations of PEGs ... wouldn't work as is"

You would need to state how those operations would interact with explicit position setting using get-words before they could work as a proposal.

Brian Hawley
10-Nov-2008 16:10:32
Sorry, I messed up. Using your algorithm, Carl's NOT would be:

[not rule] = [rule (cont: [end skip]) | skip (cont: none)] cont

Peta
10-Nov-2008 16:23:40
I checked Gabriele's proposal and it looks like being faster than the USE approach, since it does not need BIND/COPY, doesn't it?

The recursion-safe implementation is faster for sure, since it does not need any rebind operation.

Regarding the NOT implementation: actually, it does not make a big difference if the AND operation will be available, since then the look-ahead NOT can be simply:

AND NOT rule

Peta
10-Nov-2008 16:29:28
Regarding the implementation of AND. I expect

AND rule

to work exactly like the idiom:

[rule (cont: none) | (cont: [end skip])] [end skip] | cont

Peta
10-Nov-2008 16:34:43
Although I am not sure, what this means in case REMOVE is used, but that is a problem for the REMOVE implementation, not for the AND implementation...
Brian Hawley
10-Nov-2008 16:54:28
Gabriele was the editor of the proposals on the REP page, not the one who proposed them. I wrote every proposal that has workaround example code. Gabriele did propose the DO operation. You are right about your recursion-safe implementation being faster.

Any lookahead rule would have difficulties not only with the proposed modification operations but also the positional setting and REVERSE operations. I am going over your AND proposal to see if the conflicts can be resolved.

It's nice to see someone else adding proposals.

Peta
10-Nov-2008 17:32:38
BTW, as you can easily find out in the TDPL article, both NOT and AND operations can be easily defined using EITHER FAIL and NONE:

[NOT rule] = [EITHER rule fail none] [AND rule] = [NOT NOT rule]

Brian Hawley
10-Nov-2008 18:05:34
Peta, I've done the editor bit and tried to pump up the Importance sections of your proposals, as well as a general consistency check.

Knowing this crowd, mere compatibility with another parsing model is not important unless you can provide real world examples of language patterns that would be worth recognizing. We are results-oriented in the REBOL community, not theory-oriented. It's good to see some ideas proposed from the theoretical community though :)

Brian Hawley
10-Nov-2008 18:11:42
Please make any comments here or preferably in the AltME worlds. Discussions in the wiki will be removed when they are resolved to reduce confusion, but the history is useful to have elsewhere to help others. You will note that I have not added my name to any notes - this is because the notes are not a criticism of the proposals, but a request for discussion. The discussion happens elsewhere.
shadwolf
14-Nov-2008 1:35:59
Carl what is parse?
Peta
16-Nov-2008 4:41:21
Recursive/multitasking safety.

The RULE:

rule: [set i subrule (print i)]

is not recursively safe, due to the fact, that the subrule *may* eventually change the value of the 'i' variable.

As opposed to that, the RULE2:

rule2: [set i integer! (print i)]

is recursively safe, since there is no recursion that can change the value of the 'i' variable.

What worries me is, that my (recursively safe) proposal at the PARSE Project page was found to be unsafe WRT multitasking.

If that is true, then even the above RULE2 is unsafe WRT multitasking, since nobody can guarantee, that another task does not change the value of the 'i' variable after it was set and before it is printed.

Any further comments?

Peta
16-Nov-2008 11:54:30
Shadwolf: see the Wikipedia: REBOL page (the Dialects section), the above PARSE project page, the PARSE function help, the Parse section of REBOL/Core User's guide, the Parse section in the REBOL Programming Wikibook, ...
Peta
16-Nov-2008 11:57:37
BrianH: BTW, I wasn't the first one to notice, that REBOL is a member of the TDPL family. It looks, that the first one was Roland Hadinger mentioning that in the Wikipedia.
shadwolf
17-Nov-2008 0:47:16
recursive ??? multitasking ??? with rebol and parse???

MUUUUUUUUUUUUWAAHAHAHAHAHAHAHA Okay Peta your the most fun thing i read those days...

But same ask still goes Carl What is parse ?

Peta
17-Nov-2008 12:10:36
Shadwolf: Sorry for trying to answer your question. Before trying it, I had doubts whether my effort could be useful at all.
Brian Hawley
24-Nov-2008 5:03:59
"Any further comments?"

Strangely enough Peta, even your first rule could also be recursively safe since the SET is done after the subrule is done. The word passed to SET is not set until after the subrule has matched, and if the subrule fails the word isn't set at all. Of course if the subrule was expecting any of its changes to the word to persist, that won't happen unless the subrule doesn't match after it makes the changes.

As for multitasking-safety, that is what the USE proposal is for.

Interesting information about the theory about PARSE. I hadn't read about TDPL for years; it was interesting to revisit. Nice work on the NOT 2 and AT proposals - I like those a lot.

Shadwolf, we will have multitasking in R3, and we have recursion already in R2 - it will just be safer in R3. This has been known for a while now. Please try to keep up.

Peta
24-Nov-2008 11:56:27
Brian: thank you for straightening the recursive safety of my first example, it is fine to know that.
Peta
24-Nov-2008 12:10:58
Brian: let me try a more contrived example, then. I think, that the following is not recursively safe, since RULE can change the value of the 'cont' variable even when it does not succeed:

[(cont: [end skip]) opt [rule (cont: none)] [end skip] | cont]

As opposed to that, the:

[rule (cont: none) | (cont: [end skip])] [end skip] | cont]

is recursively safe, since the 'cont' variable cannot be changed after it is set and before it is used.

OTOH, if we take multitasking into account, then even the last example is not safe, since there is no guarantee, that the 'cont' variable will remain unchanged after being set and before being used.

I hope this time I succeeded to communicate the difference I see between recursive safety and safety when multitasking is used. So, generally, a recursively safe rule can still be unsafe when multitasking is used, which certainly is the case of my USE2 proposal, which is recursively safe in the same way the second example here is, but unsafe when multitasking unexpectedly changing the 'cont' variable can occur

Brian Hawley
24-Nov-2008 14:08:42
Yeah, multitasking safety is likely to be an ongoing concern in the R3 project. REBOL code often has a lot of single-tasking assumptions in it. My hope is that if we can make the basics safe, it will be easier to build safe code on that foundation. Everyone is going through this nowadays.

Post a Comment:

You can post a comment here. Keep it on-topic.

Name:

Blog id:

R3-0155


Comment:


 Note: HTML tags allowed for: b i u li ol ul font span div a p br pre tt blockquote
 
 

This is a technical blog related to the above topic. We reserve the right to remove comments that are off-topic, irrelevant links, advertisements, spams, personal attacks, politics, religion, etc.

REBOL 3.0
Updated 29-Mar-2017 - Edit - Copyright REBOL Technologies - REBOL.net