Docs and Text Processing

Author: John M. Gabriele

    Overview

    Rewriting documents into a more flexible format is quite a pain in the neck. That is, you typically start out writing a document in plain text, maybe because you expect it to be very short and simple. But then you start adding things like tables to it. Then you break it up into sections with subsections. Then you want images and you're stuck. Plain text has run out of gas, and you need to rewrite your document in some markup format that allows images and can generate, say, HTML. Next you find you need to produce your doc as a PDF, and maybe the format you chose won't do that. And it goes on like that from there.

    So, I think it's important to pick a good doc format to start off with. Some requirements are:

    An important aspect of docs is the kind of format you write them in. That is, there are markup formats that use funny characters to indicate things like italics, bold, sections and subsections, and so on. Markdown, reST, and Textile fall into that category. Then there are formats that rely on commands for everything, such as Texinfo or DocBook (where the commands all consist of two <parts> </parts>). It's nice to have a doc markup format that's not too verbose, and that's not too painful to write and read in source form.

    The ASCII-art formats (the ones that use funny characters) are helpful when you're writing short docs. They are quick to write -- and read -- in your editor. But their limitation is that only the basic markup feels natural. The more markup you put into the format, the stranger your docs start to look, and the harder it gets to remember what all the funny symbols mean. It also gets harder to avoid accidentally writing markup when you're just writing content.

    The command-style markup formats suffer (IMO) from being too verbose, and sometimes from being hard to write and read in source form.

    IMO, a compromise between the two styles is optimal for most documentation purposes. And a good compromise is present in the new Perl 6 Pod format by Damian Conway. It's command-based, but the commands are short and easy to type. There are also a few shortcuts (it is Perl, after all) that make editing easier and the markup less verbose. A nice bonus is that you can easily read your docs on the command line using an equivalent of the perldoc command. Currently, though, Perl 5's perldoc doesn't yet understand Perl 6 Pod, so reading as plain text requires something like perldoc2text ./foo.pod6 | less. Support for converting to HTML is included.

    So, these days, for any substantial docs, I tend to use Perl 6 Perldoc. For the smallest files, meaning those used only for quick short docs (like READMEs) or for generating HTML, or those I expect to be edited by non-tech people, I often use Markdown.

    For writing anything containing mathematics, I'm looking forward to a perldoc2tex or perldoc2latex tool, though I don't know of any currently being worked on.

    Types of documentation

    When it comes to writing docs for software projects, recall that there are three kinds usually (hopefully?) written:

    Plain TeX

    If you find yourself needing a short simple doc containing lots of mathematics, Plain TeX might be just what you need.

    Some books on Plain TeX:

    When using Plain TeX, do also have a look at Eplain.

    Syntax highlighting code-to-html

    It's often useful to convert various bits of code to HTML for display on a webpage. A good way to do this is to use your editor to export the syntax-highlighted display straight to HTML. For Emacs, htmlize does a beautiful job. After installing it, use it like so:

    M-x load-library RET
    htmlize RET
    M-x htmlize-TAB
    

    For jEdit there's a plugin called Code2HTML which also works well.

    Depending upon your editor, you may be able to automate this process (so it can be scripted). For example, see Text::VimColor.
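    To give a feel for what a scripted code-to-HTML step does, here is a minimal sketch in Python. It is not htmlize, Code2HTML, or Text::VimColor, and it only handles Python source: it uses the standard library's tokenize module as a stand-in for an editor's highlighter. The function name highlight_python and the CSS class names kw, str, and com are made up for this illustration.

```python
import html
import io
import keyword
import tokenize


def highlight_python(source: str) -> str:
    """Wrap Python keywords, strings and comments in <span> tags inside a <pre>."""
    # Absolute character offset at which each source line starts.
    line_starts = [0]
    for line in io.StringIO(source):
        line_starts.append(line_starts[-1] + len(line))

    css_by_type = {tokenize.STRING: "str", tokenize.COMMENT: "com"}
    pieces = []
    pos = 0  # everything before this offset has already been emitted
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            css_class = "kw"
        else:
            css_class = css_by_type.get(tok.type)
        if css_class is None:
            continue  # unstyled tokens are emitted later as plain gap text
        start = line_starts[tok.start[0] - 1] + tok.start[1]
        end = line_starts[tok.end[0] - 1] + tok.end[1]
        pieces.append(html.escape(source[pos:start]))
        pieces.append(
            f'<span class="{css_class}">{html.escape(source[start:end])}</span>'
        )
        pos = end
    pieces.append(html.escape(source[pos:]))
    return "<pre>" + "".join(pieces) + "</pre>"


if __name__ == "__main__":
    print(highlight_python("if a < b:\n    pass  # nothing to do\n"))
```

    Because it is an ordinary function, it can be called from a build script or a Makefile rule, which is the point of automating the export. A real tool would cover many languages and emit a stylesheet as well.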

    Character sets and encoding

    Unicode is a giant organized set of written language characters with numbers associated with them (for example, 'a' is 97). UTF-8 is a set of instructions for how you encode these numbers into a plain computer file (i.e. a .txt file). UTF-8 Unicode is the worldwide standard for textual data. Older, legacy character sets/encodings include ASCII and ISO-8859-x. Note though, for files consisting only of 7-bit ASCII characters, UTF-8 Unicode and ASCII are identical.
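    The difference between a character's number (its code point) and its UTF-8 encoding can be seen in a few lines of Python. This is only an illustrative sketch; the helper names utf8_bytes and same_in_ascii_and_utf8 are invented here.

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the bytes UTF-8 uses to store the given text."""
    return ch.encode("utf-8")


def same_in_ascii_and_utf8(text: str) -> bool:
    """True if text encodes to the same bytes in ASCII and UTF-8.

    Raises UnicodeEncodeError if text contains non-ASCII characters.
    """
    return text.encode("ascii") == text.encode("utf-8")


if __name__ == "__main__":
    # 'a', e with acute accent, and the euro sign, written as escapes so the
    # script itself stays pure ASCII.
    for ch in ("a", "\u00e9", "\u20ac"):
        code_point = ord(ch)  # the Unicode number, e.g. 97 for 'a'
        encoded = utf8_bytes(ch).hex(" ")
        print(f"U+{code_point:04X}: code point {code_point}, UTF-8 bytes {encoded}")
    print(same_in_ascii_and_utf8("plain old text"))
```

    Code points below 128 take a single byte in UTF-8, and that byte is the same one ASCII uses. Higher code points take two to four bytes.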

    On Debian, to set your system up to use UTF-8, just run dpkg-reconfigure locales and select only xx_XX.UTF-8 (for English in the USA, that's en_US.UTF-8). That is, be sure to uncheck anything that doesn't have "UTF-8" in it. After that, you may wish to set up some of your programs to use UTF-8 fonts. Any font with encoding "iso10646-1" is what you want. For example, -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1.

    To insert Unicode characters using most Gnome/GTK+ apps, hit Shift-Ctrl-U, then type in some hex digits (the code point), and hit Enter.
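    The hex digits typed after Shift-Ctrl-U are simply the character's code point written in base 16. As a rough sketch, the same lookup can be done in Python. The helper name char_from_hex is made up for this example, and the names come from the standard unicodedata module.

```python
import unicodedata


def char_from_hex(hex_digits: str) -> str:
    """Return the character whose Unicode code point is given in hex."""
    return chr(int(hex_digits, 16))


if __name__ == "__main__":
    for digits in ("61", "e9", "20ac"):
        ch = char_from_hex(digits)
        # Print the official character name rather than the character itself,
        # so the output is readable even on a non-UTF-8 terminal.
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```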

    Some useful links:

    Software:

    See also: