Comments on: To copy or not to copy, that is the question.

Ok, this subject is both deep and broad, but I want to get you started thinking about it...

One issue that we need to settle very soon is when to copy and when not to copy REBOL series (strings, blocks, etc.). This issue permeates all of REBOL, and it is important because the copy/non-copy rules must be firmly in your mind while you program in REBOL.

For example, there is the simple case of when you use literal strings in REBOL, such as in a function:

fun: func [str] [append "example " str]

Most of us recognize this issue. The first time we call the function:

print fun "test"
example test

and, the second time:

print fun "this"
example testthis

The literal string is being modified. To prevent this you must write:

fun: func [str] [append copy "example " str]

This is one of the first things a beginner learns, usually by being burned by it.

But, the issue gets deeper really quickly because it applies to all REBOL series including blocks, parens, filenames, URLs, etc.
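To make that concrete, here is a minimal sketch of the same trap with a literal block instead of a string, assuming the REBOL 2 behavior described above. The function names collect-item and collect-item-safe are made up for this illustration.

```rebol
; Hypothetical example: the literal block [] lives in the function body,
; so every call appends to the same block.
collect-item: func [item] [append [] item]

probe collect-item 1         ; [1]
probe collect-item 2         ; [1 2]

; Copying the literal first gives each call a fresh block.
collect-item-safe: func [item] [append copy [] item]

probe collect-item-safe 1    ; [1]
probe collect-item-safe 2    ; [2]
```

The rule is the same as for strings: the literal in the body is the series that gets modified unless you copy it first.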

For example, when you create an object in this manner:

obj-body: [a: 10 b: [print a]]
obj1: make object! obj-body
obj2: make object! obj-body

You get into the issue of whether the obj-body block should be copied before it is bound (binding attaches the words in the block to the object's fields). As it stands, the original obj-body is modified by make: its words are rebound in place. If you don't know that, you may get surprised. Surprise!
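A small sketch of how that surprise can show up, assuming REBOL 2 semantics where make object! binds the block it is given without copying it. The names obj3 and obj4 are only for illustration.

```rebol
obj-body: [a: 10 b: [print a]]
obj1: make object! obj-body
obj2: make object! obj-body

; Both objects hold the very same inner block from obj-body.
print same? obj1/b obj2/b    ; true

; That block was last bound to obj2, so obj1/b now sees obj2's a.
obj2/a: 20
do obj1/b                    ; prints 20, not 10

; Copying the body deeply gives each object its own blocks to bind.
obj3: make object! copy/deep obj-body
obj4: make object! copy/deep obj-body
obj4/a: 30
do obj3/b                    ; prints 10
```

The copy/deep version costs an extra copy of the body per object, which is exactly the trade-off in question.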

The same can be said about function specs, bodies, etc. etc.

So, this is an issue that is worth revisiting in REBOL 3.0. There are two opposing REBOL principles at work:

Simplicity: we want to minimize the side effects in normal code. Good for beginners and probably good for us experts too.
Optimization: minimize memory and maximize performance. If we have to copy strings and blocks in a lot of places, it takes time and memory to do so.
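The tension between the two is easy to see in a few lines. This is only a sketch; the words a, b and c are arbitrary.

```rebol
a: "abc"
b: a          ; no copy: b refers to the same string as a
c: copy a     ; explicit copy: a new string, costing time and memory

print same? a b     ; true
print same? a c     ; false
print equal? a c    ; true

append b "d"
print a             ; abcd  (changed through b)
print c             ; abc   (the copy is unaffected)
```

Sharing is cheap but lets a change made through one word show up through another; copying avoids that side effect at the price of allocating a new series.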

So, there are some issues that you should think about if you are deeply into REBOL programming. We need a stated solution. It can be any of:

1. Do nothing. Leave it like it was.
2. Change it where it is most appropriate and optimal.
3. Change it everywhere and ignore the extra memory and lost performance.

So, there's something to consider next time you relax at the pub, smoke your pipe in your easy chair, or scale the local mountain range. And, I invite you to share your insights with the group, either here in the comment section or via your own blog or web pages.

21 Comments

Comments:

Volker Nitsch
9-Apr-2006 16:09 I still opt for write-protection. And I would extend that: series would have owners, technically an integer. That could be useful for:
- multithreading: when you pass a message, you change the owner and pass a pointer, and most locking problems are gone.
- executing client-supplied code: it does not own the function body, so it cannot touch it.
- write-protection: function bodies are owned by "somebody else", and the copy problem goes away (you get notified early with this append).
(There would be owners for read and write, so bodies could be read, but not changed.)
Carl Sassenrath
9-Apr-2006 16:48 It is an interesting idea to think about. Such ideas require a deeper analysis to determine if the overhead is worth the gain.
The concept of ownership is "semantically deep" - meaning it adds another orthogonal dimension on top of the existing principles. Sometimes that can pay off, other times not. So that is what we would want to determine. (E.g. consider capabilities methods too.)
I am not saying "no", but I'm not jumping to "yes" immediately either. We need to talk more about it.
Maarten Koopmans
9-Apr-2006 16:57 To me, I'd like to see operations on series work on a reference of the series. But when I create something from it (such as an object or a function) it should use a copy.
Mostly how it works now, and that is easy to remember?
Volker Nitsch
9-Apr-2006 17:23 :) I agree about real ownership. My thinking was like this: I want read-only, so I need a check. It's no big difference to check for a bit or a whole int (memory is no big issue, as it would be only for series, not each value). If I have an int, what else could I do with it? Then I remembered Amiga messaging, which passed pointers, but declared the message memory as undefined. In a scripting language I would enforce that. This int could do the job too. Same to protect system functions against user code. Maybe "ownership" is not the right word, it implies too much.
maximo
9-Apr-2006 17:44 using other languages has made me appreciate the current rebol way of treating series. it's consistent, and although a little bit more work, it's more flexible. other scripting languages always copy and this really leads to working around it the other way ... e.g. you end up needing to propagate string values all over the place, and that's prone to other bugs. we would not be fixing an issue, we would be inverting an issue, just causing other bugs and making most current scripts buggy. I vote for adding new options to places where series copying IS an issue, like adding /init to func dialects, for example.
maximo
9-Apr-2006 17:52 the continuation discussion in the rebol3 world on altme was very enlightening about this very issue.
A general agreement about default values in functions, which is about 90% of the "copying issue", IMHO, would probably be a good starting point.
Brian Hawley
9-Apr-2006 18:00 How difficult would it be to add copy-on-write to the runtime? This could solve a lot of optimization problems, at least as far as memory is concerned.
Oldes
9-Apr-2006 18:00 I vote for not copying everything. I'm quite fine with how it's working now, maybe just some better way to deal with series when cloning objects. And I don't think we can say that memory is not a problem nowadays. If I have a Rebol server with a lot of people, I don't want to waste a lot of memory. Some locking, as was mentioned, seems to be an interesting idea and would be useful if I wanted to make a Rebol desktop running multiple Rebol scripts inside, which may be from different authors. But I'm not sure if I need to set owners on every string in a data file from a database with thousands of rows.
Petr Krenzelok
9-Apr-2006 22:50 :) I am not saying yes or no, but I want to point to another aspect. Someone said Rebol's approach is both consistent and memory savvy. But I have also seen even gurus make mistakes. I really don't like subobject sharing. It is not logical in any way and it has to be explicitly stated in the docs. There is no easy way for beginners to clone an existing object without deep inspection of its children.
I am not suggesting the change, it is for you gurus to consider though, because if 10 of 10 newbies are burned by this issue (and believe me they are), then it is worth thinking about, if we want to "keep it simple". Some ppl may find Rebol "unpredictable", because they feel it has "side effects" with series operations ...
Brian Hawley
10-Apr-2006 0:28 Petr, subobjects sharing is the best way to implement class-based objects in REBOL right now. Admittedly you do this using explicit delegation, but not having to have a copy of every member function in every instance object is a great memory saver, and all of that copying would be slow too. There are other ways that object delegation can be useful - just look at the ports system or View.
I tend to include a copying function with every object type I create that needs to support copying. When you change the structure, you change the function accordingly. Just because REBOL isn't class-based doesn't mean that you don't have to design your object classes.
Brian Hawley
10-Apr-2006 0:54 As for make, it would be nice if the copying behavior applied to the spec would match that applied to the prototype object, to be consistent.
I vote for choice 2, to make the changes where necessary, and to consider that most REBOL code hasn't been written yet. Make the best choices for ease of use and efficiency, be consistent where being inconsistent would be too confusing, and break backwards compatibility when the old behavior was bad. If you have to do some interesting things internally for efficiency (like copy-on-write), do them, and tell us about it - we're interested.
But make sure that whatever you choose, document it thoroughly. There have been many times when I have been tripped up by unexpected (often undocumented) behavior in REBOL, needing much fiddling with scratch code to figure out what's going on. If documentation consists of articles on some advanced section of the web site, fine, but don't let things go undocumented to avoid scaring the newbies. I'm sure you could talk some of the community into helping with this if you need to...
Artem Hlushko
10-Apr-2006 2:59 I vote for 2. I need multithreading.
Volker Nitsch
10-Apr-2006 8:00 :) About copy-on-write, we had that discussion long ago on the ML. Holger knocked me out by pointing to 'same? . Two series would be the same until one of them changes, while by definition they are different all the time or never.
About default copy, what if we do it the other way around: by default we copy, but we can request a reference? It could look like
f: func[i-never-change-this [ref!]]
or in code
;this needs performance, so..
obj: context ref [..]
Implementation idea: every reference can have an attribute auto-copy! (part of typeset). The interpreter handles everything as autocopy. Copied values are not autocopy!. Thus we get autocopy in 'do, 'func, but such copies are passed by value. Then
res: append append "" this that
would work: the "" is copied when touched, the copy is not auto-copy!, so the appends append to the same thing.
With "make object!" it's a half solution: no need to copy the spec, because series are copied when doing the spec. Still the spec must be bound to the new context, so bindings are changed. But that is only a problem when we deal at the meta-level, accessing the spec as a series later, and at that level knowledge should be deep enough.
Volker Nitsch
10-Apr-2006 8:02 :) Correction: according to my implementation-idea, this example is wrong: obj: context ref [..] No need for the ref. More like obj: context[val: ref [a big read-only-series] ]
Brian Hawley
10-Apr-2006 12:52 Volker, series are refs already. Every time you assign a series to a word, the value slot contains a pointer and an offset to the data, not the data itself. To implement copy-on-write you would just need to put a flag in the value slot that the writing actions would check before writing. For that matter this could work in conjunction with a read-only flag - every reference to read-only data could be made a copy-on-write reference so the original data wouldn't be changed.
It would all be internal, just an implementation detail. There would be no reason for same? to expose this detail to the outside world (nor should it), although internally same? would need to be aware of the flag so that it could say that the series are different when they point to the same data. The only change in behavior is that the overhead of copying is deferred to the first write (which could be never). The only reason to find out if a copy-on-write reference refers to the same data would be for debugging purposes (memory profiling).
Copy-on-write and read-only together would also enable portions of the runtime to be used as a shared resource, perhaps in a shared library or ROM. This would reduce the memory usage of REBOL on resource constrained devices - I first got the idea after using REBOL on WinCE.
Volker Nitsch
10-Apr-2006 16:07 :)
> a: ""
> b: copy-on-write a
> same? a b
must be false, because copy-on-write must look like a copy.
> c: copy-on-write a
> same? b c
must be false too.
> d: b
> same? b d
This must be true. But b is a copy-on-write of a. So is c. How to distinguish them?
Ryan Cole
10-Apr-2006 16:07 :) I am not fond of mucking around with current referencing behavior. I think the current way is slick and powerful, though admittedly not super beginner friendly.
I looked at a recently written rebol program, and tried to see what it would take to change it over to copy by default by adding 'by-ref functions and removing 'copy functions. In the 20k program there were surprisingly only 19 'copy's. I stopped trying to figure out the 'by-ref's, but it quickly got out of hand. I suppose a few hundred would be required.
Brian's suggestions are interesting. I am not sure of all the ramifications, but I think perhaps they could be best supported as datatypes. This might make it easier for beginners, sometimes. Clearly advantageous on certain devices. However I don't love it because I think we would lose some overall simplicity.
Gabriele
10-Apr-2006 18:25 :) Volker, I think your SAME? problem can be solved, some way or another. Copy on write is actually just an optimization, and not new semantics (at least if implemented correctly); I think that we still need "reference" by default unless we want all types to be immutable (which would probably make all current code stop working).
maximo
14-Apr-2006 14:21 I just want to state that although being newbie friendly is nice... it must not be at the cost of some of the core principles of REBOL.
being all REF is very powerful, and because Everything is referenced, there is no ambiguity. it's just one of those things which is not very well illustrated in documentation. the docs should always copy [] , for example, not just use [] directly.
Immutables NOOO we might as well code in python :-(
Jeff M
1-May-2006 13:53 :) I personally love the way REBOL does this now, primarily for performance reasons. Whenever you are going to program something which can cause massive performance problems (and copying of a series typically results in O(n^2) problems down the road), making the programmer have to explicitly copy something is good.
However, where REBOL's method gets a little frustrating is when it does this to "constant" data. When I write "example" in code, I don't expect that hard-coded data to ever change. But it can, and does. So I need to copy it. I imagine that 90+% of the time, this occurs when constructing a list or string of data, and so the copy is always desired.
So, I have a couple possible suggestions:
1. Make "constant" data just that - constant. Make the copy implicit when binding a word to constant data. This would be quite easy, and would solve a lot of headaches before they happen.
2. Have a series building operator (++ comes to mind from Haskell), which acts as a copy + append. This is a little scary, because I think most people would just use ++ instead of append, and without knowing that it does a copy, we're back to potential O(n^2) performance issues.
My 2 cents...
Scot Sutherland
25-Sep-2006 18:28:40 I spend a lot of time talking about REBOL code with people who don't know REBOL. I see the current way REBOL handles copies as a feature not a problem. Quite often I have written or analyzed some really slick implementations that take advantage of REBOL's referencing scheme. People that favor other languages just stare at me bewildered. Keep it the way it is, as far as I'm concerned.
I need async but not necessarily multi-threading.
