Comments on: More about Port layers - the OSI Model

This is a follow-up to my Pruning down READ and WRITE article and the comments posted there. Sorry, it is a bit long, but this topic requires a deeper discussion.

OSI Layer Model

Before I begin, please review the OSI layer model. This is the standard layered model of network architecture that has been around for about 30 years.

Ok, so what does it all mean?

When you design any type of "operating system" (and that can include a runtime system such as a TCP stack), you try to break it down into clean, well-isolated (decoupled) layers. At least, that is always the goal. However, it's not always possible to make it perfect, because the layers interact, and sometimes that interaction is complex or the need for performance requires closer coupling than is ideal. (In fact, that's why after 10 years of being a total object-oriented system advocate, I backed away from it. But I digress, that is the subject of a separate blog.)

Layers in Ports

REBOL ports are no exception. You can apply the OSI model to ports (even though we use them for more than just networking.)

For example, if you open a port, and you need to provide an awake function for it, then you are creating code at Layer-5, the session level. Layers 1-4 below it are handled by the operating system, its device drivers, and hardware.

If you don't need to write an awake function, then you are using a higher level port. For example, if you write:

data: read http://www.rebol.com

the HTTP port scheme knows about things like headers, transfer lengths, and content encoding. In this example, HTTP is working at layers 6 and 7. Part of HTTP is to send headers which must be interpreted. In other words, they obtain a context and a meaning. The data is no longer just a string of bytes.

In REBOL, we can create and use ports at different levels.

The lowest level is that of reading raw bytes. When you open a REBOL TCP port and start reading and writing data, you are sending bytes. The meaning of those bytes is not defined by the port itself, it is defined by your application. TCP I/O is a lower level operation.

However, once you begin to interpret the meaning of those bytes, you move to a higher level, and there can be multiple levels.

So, you want it decoded?

One of the first steps to "get the meaning" is to decode the bytes. To convert a stream of bytes into a string of characters, you need to know how the characters are encoded. Are they in UTF-8 or UTF-16, or maybe the data are in a compressed format?
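
To make that concrete, here is a minimal sketch in Python (not REBOL; the sample bytes are made up for illustration). It shows the same byte sequence turning into two different strings depending on which encoding the reader assumes:

```python
# The same five bytes, interpreted under two different encodings.
raw = b"caf\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # 0xC3 0xA9 is one character in UTF-8
as_latin1 = raw.decode("latin-1")  # in Latin-1 every byte is its own character

print(len(raw), len(as_utf8), len(as_latin1))
```

The bytes never change. Only the assumed encoding does, which is why the encoding has to come from somewhere.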

You can then ask, "how do we know the encoding?" Well, there are two main methods:

It is embedded in the data. Something about the data tells us its encoding. For example, Unicode text can begin with a BOM (byte order mark) that tells us its encoding. Many image formats include a "magic" signature to tell us what they are, e.g. JPEG files carry a "JFIF" marker near the start. An HTTP header tells us information about the encoding of the data.
It is specified separately from the data. The encoding is specified to a function that handles the data. This is the /as option we talked about before. For this to work, we must already know the encoding in advance. It came from somewhere external to the data. For example, if we noticed that the filename suffix was .jpg, then we can use that information for decoding the format. This is what MIME types are all about.
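
The two methods can be sketched side by side. This is a Python illustration under stated assumptions, not a REBOL design: the names SIGNATURES, SUFFIXES, sniff, from_suffix, and guess_type are hypothetical, and the tables list only a handful of well-known signatures (the JPEG entry matches the start-of-image bytes rather than the "JFIF" text).

```python
import codecs
import os

# Method 1: signatures embedded at the start of the data (a small, partial table).
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF8", "gif"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]

# Method 2: information that lives outside the data, here the filename suffix.
SUFFIXES = {".jpg": "jpeg", ".gif": "gif", ".png": "png", ".txt": "text"}


def sniff(data):
    """Return a label if the data starts with a known signature, else None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def from_suffix(filename):
    """Return a label based only on the filename suffix, else None."""
    return SUFFIXES.get(os.path.splitext(filename)[1].lower())


def guess_type(filename, data):
    """Mix both methods: trust the data first, then fall back to the suffix."""
    return sniff(data) or from_suffix(filename)


print(guess_type("im-a-gif.jpg", b"GIF89a"))
print(guess_type("notes.txt", b"plain words"))
```

Preferring the embedded signature over the suffix is one possible policy. The reverse order, or reporting a conflict, would be equally easy to write.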

In reality we often use a mix of these. For example, we may examine HTTP data to find the content-encoding, which we then specify to a decoding function.

So, you want content?

Decoding often means a lot more than just converting bytes to characters. An image file contains structures that give you information about the image, such as its width, height, colors, compression, and more. Even a text file may be in REBOL format that gets loaded into block format so we can interpret its content.

In the design of a system, we must ask ourselves where do we want this content decoding to happen, and how "automatic" is it?

An extreme case would be to ask: If you read bytes from a TCP port, does "the system" automatically determine the bytes are an image and return to you a decoded image datatype?

I think we can obviously say no to that. We want TCP as a stream of bytes. Any other decoding should be part of a separate layer.

This layer rule about bytes versus content applies to a lot more than just images. In R3 it also applies to text, because we now care about Unicode, and have become more text sensitive.

Building the Layers

Ok, so we are now at the fun part: What is the smartest way to handle these layers?

Well, REBOL already has a basic model. We know that the load function is supposed to take data and produce content. We load REBOL code and data, we load an image, and we load a sound. We also know that we read raw bytes from a network or a file.

That basic model provides the top and the bottom layers, but the question is, do we want to access the intermediate layers too? In the past, we would write:

insert port "some text data"

and expect:

probe copy port
"some text data"

But now, we support Unicode, so we have added an additional requirement. What is the encoding of the text? And, where does it get decoded?

In order to handle that, we invented the /as refinement to indicate the encoding. Using the R3 port read method, it was:

text: read/as tcp-port 'utf-8

However, applying this refinement at a lower level makes the lower level a lot more complicated! At first, it may not seem like it, but once you get into the details, you find some problems. For example, UTF-8 can use multiple bytes per character. What happens if we read from TCP, and we don't get all the bytes necessary to decode the last character? We must hold the partial character in a special buffer and insert it at the head of the next part of the data stream. The same goes for CRLF text line conversions.
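
The split-character problem can be shown with a short Python sketch (again an illustration, not REBOL code; the two chunks are invented to stand for two successive TCP reads). Python's standard incremental decoder keeps exactly the kind of "special buffer" described above:

```python
import codecs

# One two-byte character (0xC3 0xA9) split across two reads.
chunks = [b"caf\xc3", b"\xa9 ok"]

# Decoding the first chunk on its own fails on the dangling lead byte.
try:
    chunks[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder holds the incomplete byte until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))  # flush; errors if bytes remain
text = "".join(pieces)

print(naive_ok, pieces, text)
```

The state carried between reads is the extra complexity in question: something has to own that decoder object, and a raw byte-oriented port is an awkward place for it.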

In the past, the way we solved this problem was to use multi-level ports. For example, when you use the HTTP scheme and read from its port, it is not the actual TCP port, it is a virtual port. Within the HTTP scheme code itself is a hidden TCP port.

So, the HTTP port is doing whatever "magic decoding" it needs to do in order to create the result we need.

Using this approach, we can say that maybe we can allow read of an HTTP port to return properly decoded text. In such a case, read is returning a string datatype not a binary datatype. That may be fine.

Of course, we want to determine how far to go with that model. For example, if I read an image using HTTP, do we get an image returned from read? I'd say no. In that case it may return a byte stream, because we may not want the image fully decoded at this point. For example, if it is JPEG, we might want to write it as a JPG file.

So, it is conceivable that we can create a TCPS scheme, for TCP String, that provides a layer on top of the lower level TCP to deal with the necessary string encoding. The TCPS scheme can be implemented in REBOL itself, allowing it to be extended and improved without requiring native code changes. It would then be possible to write:

>> port: open tcps://example.com:8080
>> write port "a string"
>> print read port
   "got it"

And, because open now defines a concept of specification, it is possible to even provide information about the type of encoding we want to use. An example would be:

port: open [
       scheme: 'tcps
       host: "example.com"
       port-id: 8080
       encoding: 'utf8
]

Of course, the TCPS scheme would be implemented with TCP, with that port being embedded within it. And, that's the main point: The lower-level TCP layer does not need to deal with encoding and decoding. It just cares about bytes.
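
The layering idea can be sketched outside REBOL as well. The following Python sketch is only an analogy under loud assumptions: BytePort and TextPort are hypothetical classes, BytePort fakes the network with a list of pre-supplied chunks, and nothing here reflects how an actual TCPS scheme would be written.

```python
import codecs


class BytePort:
    """Stand-in for the low-level TCP port: it only ever sees bytes."""

    def __init__(self, incoming):
        self._incoming = list(incoming)  # chunks as they would arrive
        self.sent = []                   # everything written, for inspection

    def read(self):
        return self._incoming.pop(0) if self._incoming else b""

    def write(self, data):
        if not isinstance(data, bytes):
            raise TypeError("BytePort only accepts bytes")
        self.sent.append(data)


class TextPort:
    """Hypothetical 'tcps'-style layer: embeds a BytePort, owns the encoding."""

    def __init__(self, byte_port, encoding="utf-8"):
        self._port = byte_port
        self._encoding = encoding
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def read(self):
        # Partial characters stay buffered in the decoder until completed.
        return self._decoder.decode(self._port.read())

    def write(self, text):
        self._port.write(text.encode(self._encoding))


lower = BytePort([b"got \xc3", b"\xa9t\xc3\xa9"])
port = TextPort(lower, encoding="utf-8")
port.write("a string")
first = port.read()
second = port.read()
print(lower.sent, repr(first), repr(second))
```

All of the encoding state lives in the upper layer. The lower layer would not change at all if a different encoding were chosen.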

Back to the goal...

So, is this a good approach?

It depends on what we want, doesn't it? In REBOL, we have this objective:

Simple things should be simple.
Complex things should be possible (and as simple as possible).

So, we want to be able to simply get text from a file, or data from a REBOL script, or an image from an image file. We want to write code like:

data: get-the-contents-of %file
data: get-the-contents-of web-url

Earlier, people objected to the idea that we might provide helper functions such as:

data: read-text %afile

because it would mean enumerating that for all datatypes, such as read-image, etc. I agree.

But, if we want to avoid that, we need to say that we are going to use a much smarter function, one that can properly identify the content and decode it.

I do not think of that as the main purpose of read. To me, that seems more like what load does. And, we are allowed to make load as smart as we want. For example, load may use a system table that contains suffixes and take a MIME-style approach.

The result may be something like this:

image: load %photo.jpg
code: load %script.r
text: load %info.txt

where .jpg, .r, and even .txt are defined within a system/load-types table of some kind. In the case of .txt, the default would examine the BOM for UTF encoding. I think we can do a smart design job of keeping it simple.
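
A table-driven load might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: LOAD_TYPES stands in for the system/load-types table, JSON stands in for a structured format because Python has no REBOL loader, and falling back to UTF-8 when no BOM is found is just one possible default.

```python
import codecs
import json
import os
import tempfile


def decode_text(data):
    """Decode text using its BOM if present; UTF-8 is the assumed fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # this codec reads and strips the BOM
    return data.decode("utf-8")


# Hypothetical stand-in for a system/load-types table: suffix -> decoder.
LOAD_TYPES = {
    ".txt": decode_text,
    ".json": lambda data: json.loads(decode_text(data)),
}


def read(path):
    """Bottom layer: raw bytes, no interpretation."""
    with open(path, "rb") as handle:
        return handle.read()


def load(path):
    """Top layer: pick a decoder by suffix; unknown suffixes stay as bytes."""
    suffix = os.path.splitext(path)[1].lower()
    decoder = LOAD_TYPES.get(suffix)
    data = read(path)
    return decoder(data) if decoder else data


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "info.txt")
    with open(path, "wb") as handle:
        handle.write(codecs.BOM_UTF8 + "h\u00e9llo".encode("utf-8"))
    raw = read(path)   # bytes, BOM included
    text = load(path)  # decoded string, BOM removed

print(type(raw).__name__, repr(text))
```

Adding or removing a type is then a matter of editing the table, while read stays a plain byte operation.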

For things like network transfers, when we write more complex code such as handling our own transfers with our own port awake functions, I do not think it is a big problem to deal with a few more details, such as encoding.

And, of course, as we find patterns in more complex usage, we can create new schemes or add options to existing schemes, to help simple things remain simple.

3 Comments

Comments:

Henrik
22-Apr-2008 5:13:32
The more I think about it, the more it reminds me of how OSes are often poor at recognizing a specific file type, and how there are so many models for doing so. I always remember how Windows has been so poor at this, because its recognition model is a simple table that matches against the extension of the file, whereas good old AmigaOS allowed for precise control over this via tooltypes, but posed no model for file recognition at all. Only via datatypes is this possible, but that does not count for the initial identification. I remember that DOpus 5 for AmigaOS had a wonderful way of allowing you to model up a file type recognition mechanism. Are we building a new model like this?
Some things to contemplate:
- How to add new types or remove types
- Handling falsely identified data (a piece of text that starts with JFIF magic, a GIF file mistakenly named .BMP, or simply a text file that does not end in .txt, but .bin, .dat or nothing)
- Handling corrupt data. Do we want an error returned for a corrupt JPEG or do we still want to try to load it? Perhaps this is individual for each type and perhaps this should be exposed to you, rather than hidden inside LOAD.
A smart LOAD can be coerced to fail, so you need some way to force loading as a specific type, but only when necessary, such as in a MultiView like application. To me that makes it look like you have to default to something else:
>> if not image? load %im-a-gif.jpg [do-something-else]

or for cases where you would like to fall back to a specific other type, in case the first try does not work. Perhaps you want multiple fallback types in case the first ones don't work:
>> load/suggest %im-a-gif.jpg [gif bmp tiff png]

Sometimes the file won't load, because of misidentification, but you know it is of a specific type or it is one of a range of types. You know the data is fine, because that jpg loads just fine in Photoshop or in your browser. That is: The identification fails and LOAD won't load the file, but the data is otherwise fine. /suggest would thereby only try specific types, if the original identification attempt fails.
Another case is where you only want to identify the file, but not really load it, as that transformation from raw data (jpg) to REBOL data (image!) is not useful for identification. Perhaps it can also be used for cases where LOAD fails and you want to see how it fails. Can we use this system for that?
>> identify %im-a-gif.jpg
== gif
>> identify read/binary %im-a-gif.jpg
== gif
>> identify %gibberish.bin
== none
>> identify %package.zip
== zip

Goldevil
22-Apr-2008 15:01:55
I like the idea that the behaviour of 'read and 'load are clearly different.
But something important is managing file subtypes. It could be convenient for a function like 'identify to return more information than a simple word. Is this a progressively encoded GIF? What kind of compression algorithm is used in this zip? Is this text file an XML file? Is this an encrypted PDF?
About XML, do you think that creating a scheme for XML decoding that supports DTDs is a good idea? Or must the XML decoding and validation be defined elsewhere?

shadwolf
1-May-2008 1:09:50
Hmm, and how about the handling of lower-level protocols like, hmm, ICMP?
Many of us would be glad to be able to write some easy REBOL-style code for doing pings or traceroutes without having to call third-party programs or libraries, since in my opinion DLLs and third-party programs cause many integration difficulties that do not come from the REBOL code itself. A network monitoring interface is one example.
I took ICMP as an example, but many other things could be done if we had lower-level access to the network interface (like bandwidth monitoring, firewalls, etc.).
Let's be crazy and imagine we are a big company running 100 Linux servers with REBOL Cheyenne, a website with REBOL applets inside, and REBOL back-office things. It would be cool, in my opinion, to have software written in REBOL to monitor the state of each of those 100 servers. We could check hardware state, bandwidth allocation, web software and back-office state (and why not database states too), and then get a report on all of it simply by looking at a VID3 or web-friendly interface. (Yeah, I know: who monitors the monitor's state...)
In previous versions of REBOL, ports were intended to be the entry point for OS-specific interactions, like the systray on Windows for example. Will this intent be redesigned, or will it be extended?