Reading Delimited Text Files (Like CSV)

Author: Gregg Irwin
Return to REBOL Cookbook

One of the most common file formats is the delimited text file. The most common is line-oriented CSV (comma separated values) which can be created and handled by spreadsheet programs, like Microsoft Excel for instance. In most cases these files are small enough that you can read all the data into memory without issue so that's what we'll cover here. At the end I'll give you some clues about how to handle larger files.

Here is some sample data we'll use in our examples. Just put it into a file called %test.csv if you want to run the examples.

    rec 1 field 1,"rec 1, field 2",rec 1 field 3
    rec 2 field 2,"rec 2, field 2",rec 2 field 3
    rec 3 field 3,"rec 3, field 2",rec 3 field 3

The first thing we need to do is read the file into memory and split it up; one record per line.

    lines: read/lines %test.csv

To break each line up into fields we use the PARSE function, telling it what delimiter to use.

    foreach line lines [
        print mold parse/all line ","

Results in:

    ["rec 1 field 1" "rec 1, field 2" "rec 1 field 3"]
    ["rec 2 field 2" "rec 2, field 2" "rec 2 field 3"]
    ["rec 3 field 3" "rec 3, field 2" "rec 3 field 3"]

So, now you've converted the data, you can get individual fields using REBOL ordinal functions like FIRST, SECOND, etc. or path notation (e.g. record/3). You can also iterate over them, change them, and anything else you can do with a block of data.

There are two things to note in the use of PARSE shown above. First, we're using the /ALL refinement with PARSE which tells it to parse all characters, even spaces. Look what happens if we don't use /ALL:

    lines: read/lines %test.csv
    print mold parse lines/1 ","

Results in:

    ["rec" "1" "field" "1" "rec 1, field 2" "rec" "1" "field" "3"]

By default, PARSE will split a string at whitespace delimiters; using the /ALL refinement tells it to treat spaces as normal characters.

The second thing to note is that PARSE wasn't confused when it saw the second field was quoted and contained an embedded delimiter. There is an important, and very subtle, issue here however; there are no spaces between the quotes and the surrounding delimiters. If you have spaces between the delimiter and the quotes, on the "outside" of the quotes, it will confuse PARSE. For example:

    print mold parse/all {field 1, "field,2" ,field 3}  ","

Results in:

    ["field 1" { "field} {2" } "field 3"]

About Parse

PARSE is a very powerful function that can operate in two "modes". We're using the first mode here to do simple string splitting. In its more advanced mode, PARSE uses grammar definitions to parse blocks of data, allowing you to create your own custom languages with relative ease.

Note that using the /LINES refinement with READ causes it to split the file into a block of strings at each newline separator. If you need to split the lines at a different delimiter, you can use the /WITH refinement.

    lines: read/lines/with %test.zzz "%%"

If your file has a trailing newline character after the last record, you will have an empty item at the end of your block.

If you need to large files (larger than available memory) you can use OPEN/DIRECT on the file (/DIRECT prevents buffering) and then use COPY/PART to read data in fixed size chunks. When you've read all the data in the file, COPY/PART will return NONE.

2006 REBOL Technologies REBOL.com REBOL.net