Tiny HTML to Text Converter

Author: Carl Sassenrath

Here's a little script I use so often I thought it should be part of the cookbook. This script will take any HTML file (like a web page or REBOL document) and convert it to simple text. It does nothing special for formatting, but it does make for an easy way to get to the actual content of a page.

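    text: read http://www.rebol.com/rebolintro.html   ; fetch the page as a string
    data: load/markup text                             ; split it into tag! and string! values
    remove-each item data [tag? item]                  ; drop every tag
    text: rejoin data                                  ; join the remaining strings
    write %rebintro.txt text                           ; save the plain text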

This example reads a page from a web site and loads the text into a REBOL block. The block contains tags as tag! datatypes and text as string! datatypes. REMOVE-EACH removes all the tags, which leaves only the strings. REJOIN joins the strings into a single string, which is then written to a text file.
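
To see what LOAD/MARKUP produces without touching the network, here is a small sketch that runs the same steps on an inline string. The sample HTML and the word html are made up for illustration; they are not part of the script above.

    ; Hypothetical sample input, standing in for a downloaded page
    html: {<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>}

    data: load/markup html

    ; Show the datatype of each item: a mix of tag! and string! values
    foreach item data [print [type? item mold item]]

    ; Keep only the strings and join them
    remove-each item data [tag? item]
    print rejoin data   ; prints: TitleHello world

Note how "Title" and "Hello" run together in the result. That is what "nothing special for formatting" means: the tags are simply dropped, and nothing is inserted in their place.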

As with most REBOL code, the script at the top of this page can be simplified down to just:

    data: load/markup http://www.rebol.com/rebolintro.html
    remove-each item data [tag? item]
    write %rebintro.txt data

Note you don't need the separate READ (LOAD will do that for you) or the REJOIN (which is implicit in a WRITE of a block).
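
If you want the text back as a string rather than in a file, the same steps can be wrapped in a function. This is only a sketch: the name html-to-text and its argument are my own invention for illustration, and it keeps the explicit REJOIN because the result is returned instead of written.

    ; Hypothetical helper: returns the text of an HTML source with tags removed
    html-to-text: func [
        "Return the text of an HTML source with all tags removed."
        source [url! file! string!] "Page address, local file, or HTML string"
        /local data
    ][
        data: load/markup source            ; LOAD reads a url or file for us
        remove-each item data [tag? item]   ; drop every tag
        rejoin data                         ; join the remaining strings
    ]

    print html-to-text {<p>Hello <b>world</b></p>}   ; prints: Hello world

The same call should also accept a url! or file! value, for example html-to-text http://www.rebol.com/rebolintro.html, since LOAD handles the reading in those cases.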

2006 REBOL Technologies REBOL.com REBOL.net