
Docs and Text Processing
Author: John M. Gabriele
- Overview
- Types of documentation
- Plain TeX
- Syntax highlighting code-to-html
- Character sets and encoding
---
Overview
Rewriting documents into a more flexible format is quite a pain in the neck. That is, you typically start out writing a document in plain text, maybe because you expect it to be very short and simple. But then you start adding things like tables to it. Then you break it up into sections with subsections. Then you want images and you're stuck. Plain text ran out of gas, and you need to rewrite your document into some markup format that can generate, say, html, and which allows images. Next you find you need to create your doc as a pdf, and maybe the format you chose won't do that. And it goes on like that from there.
So, I think it's important to pick a good doc format to start off with. Some requirements are:
- can generate html and pdf output
- has good editor support
- provides for including images, links, and maybe even some mathematics
An important aspect of docs is the kind of format you write them in. That is, there are markup formats that use funny characters to indicate things like italics, bold, sections and subsections, and so on. Markdown, reST, and Textile fall into that category. Then there are formats that rely on commands for everything, like Texinfo or Docbook (where the commands all consist of two <parts> </parts>). It's nice to have a doc markup format that's not too verbose, and that's not too painful to write and read in source form.
The ascii-art ones are helpful when you're writing short docs. They are quick to write -- and read -- in your editor. But their limitation is that only the basic markup feels natural. The more markup you put into the format, the more strange your docs start to look, and the harder it gets to remember what all the funny symbols mean. It also gets harder to avoid accidentally writing markup when you're just writing content.
The command-style markup formats suffer (IMO) from being too verbose, and sometimes from not being very easy to write and read in source form.
IMO, a compromise between the two styles is optimal for most
documentation purposes. And a good compromise is present in the new
Perl 6 Pod format by
Damian Conway. It's command-based, but the commands are short and easy
to type. There are also a few shortcuts (it is Perl, after all)
that make editing easier and the markup less verbose. And, of course,
a nice bonus is that you can easily read your docs on the command line
using an equivalent of the perldoc command (though currently, Perl
5's perldoc doesn't yet understand Perl 6 Pod, so reading as plain
text requires something like perldoc2text ./foo.pod6 | less). Of
course, support for converting to HTML is included.
So, these days, for any substantial docs, I tend to use Perl 6 Perldoc. For the smallest files, those used exclusively for quick short docs (like READMEs) or for generating html, or that I expect to be edited by non-tech people, I often use Markdown.
For writing anything containing mathematics, I'm looking forward to a
perldoc2tex or perldoc2latex tool, though I don't know of any
currently being worked on.
Types of documentation
When it comes to writing docs for software projects, recall that there are three kinds usually (hopefully?) written:
- Docs for folks maintaining or extending your code. These implementation notes are usually just regular comments for folks you expect to be reading the code itself. You might format them in Markdown to make them more readable.
- Docs for programmers using your modules in their own programs. This is your Pod. Folks reading this generally won't be looking at the code itself.
- Various design docs, a manual, tutorials, etc. Use dedicated Pod (.pod) files for these. For larger docs, you might write them as one chapter per Pod file.
Plain TeX
If you find yourself needing a short simple doc containing lots of mathematics, Plain TeX might be just what you need.
Some books on Plain TeX:
- http://www.ctan.org/tex-archive/info/gentle/
- http://www.ntg.nl/doc/wilkins/
- http://www.ctan.org/tex-archive/info/impatient/
When using Plain TeX, do also have a look at Eplain.
Syntax highlighting code-to-html
It's often useful to convert various bits of code to html for display on a webpage. A good way to do this is to use your editor to export the syntax highlighted display straight to html. For Emacs, htmlize does a beautiful job. After installing it, use it like so:
M-x load-library RET htmlize RET M-x htmlize-TAB
For jEdit there's a plugin called Code2HTML which also works well.
Depending upon your editor, you may be able to automate this process (so it can be scripted). For example, see Text::VimColor.
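To give a feel for what a scripted code-to-html step does, here is a deliberately naive sketch in Python. It is not how htmlize, Code2HTML, or Text::VimColor work internally; the function name code_to_html and the "kw" CSS class are made up for illustration. It only wraps Python keywords in span tags, and it will also mark keywords that appear inside strings and comments, which a real highlighter would not do.

```python
import html
import keyword
import re

# Matches any Python keyword as a whole word.
KEYWORD_RE = re.compile(r"\b(" + "|".join(keyword.kwlist) + r")\b")


def code_to_html(source: str) -> str:
    """Return source as an HTML <pre> block with keywords wrapped in spans."""
    # re.split with one capture group alternates: text, keyword, text, ...
    pieces = KEYWORD_RE.split(source)
    out = []
    for i, piece in enumerate(pieces):
        escaped = html.escape(piece)  # turn <, >, & and quotes into entities
        if i % 2 == 1:
            # Odd positions hold the captured keywords.
            out.append('<span class="kw">' + escaped + "</span>")
        else:
            out.append(escaped)
    return "<pre>" + "".join(out) + "</pre>"


snippet_html = code_to_html("if a < b: return a")
```

The escaping step matters as much as the highlighting: without it, any "<" or "&" in the code would be interpreted by the browser as markup.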
Character sets and encoding
Unicode is a giant organized set of written language characters with
numbers associated with them (for example, 'a' is 97). UTF-8 is a
set of instructions for how you encode these numbers into a plain
computer file (i.e. a .txt file). UTF-8 Unicode is the worldwide
standard for textual data. Older, legacy character sets/encodings
include ASCII and ISO-8859-x. Note though, for files consisting only
of 7-bit ASCII characters, UTF-8 Unicode and ASCII are identical.
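To make the difference between a character's number and its encoding concrete, here is a small Python sketch (Python is just my choice for illustration, and the sample strings are arbitrary). It shows the number Unicode assigns to a character, the bytes UTF-8 produces for it, and that text made only of 7-bit ASCII characters comes out byte-for-byte the same either way.

```python
# Unicode assigns each character a number (its code point).
assert ord("a") == 97
assert ord("\u00e9") == 0xE9  # e with acute accent, outside 7-bit ASCII


def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8: one byte per ASCII character, more for others."""
    return text.encode("utf-8")


ascii_only = "plain text"
accented = "caf\u00e9"  # "cafe" with an acute accent on the e

# 7-bit ASCII text is identical whether encoded as ASCII or as UTF-8.
same_bytes = utf8_bytes(ascii_only) == ascii_only.encode("ascii")

# The accented e takes two bytes in UTF-8, so 4 characters become 5 bytes.
byte_count = len(utf8_bytes(accented))
```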
On Debian, to set your system up to use UTF-8, just run
dpkg-reconfigure locales and select only xx_XX.UTF-8 (for
English in the USA, that's en_US.UTF-8). That is, be sure to
uncheck anything that doesn't have "UTF-8" in it. After that, you may
wish to set up some of your programs to use UTF-8 fonts. Any font with
encoding "iso10646-1" is what you want. For example,
-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1.
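If you want to double-check a list of locale names for leftovers that aren't UTF-8, something like the following Python sketch could help. It assumes the usual language_TERRITORY.codeset@modifier naming; the helper names codeset_of and is_utf8_locale and the sample list are made up for illustration, and this only inspects names, it doesn't query or change your system.

```python
def codeset_of(locale_name: str) -> str:
    """Return the codeset part of a name like "en_US.UTF-8" ("" if none)."""
    # Drop an optional "@modifier" suffix, then take what follows the dot.
    base = locale_name.split("@", 1)[0]
    if "." not in base:
        return ""
    return base.split(".", 1)[1]


def is_utf8_locale(locale_name: str) -> bool:
    """True if the name's codeset is UTF-8, spelled "UTF-8" or "utf8"."""
    return codeset_of(locale_name).lower().replace("-", "") == "utf8"


# Hypothetical sample of locale names.
names = ["en_US.UTF-8", "en_US.ISO-8859-1", "de_DE.utf8", "C"]
utf8_only = [n for n in names if is_utf8_locale(n)]
```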
To insert Unicode characters using most Gnome/GTK+ apps, hit Shift-Ctrl-U, then type in some hex digits (the code point), and hit Enter.
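The hex digits you type there are simply the character's code point. As a rough illustration in Python (the helper name char_from_hex is made up), this maps the hex digits to the character and shows the bytes that would land in a UTF-8 file:

```python
def char_from_hex(hex_digits: str) -> str:
    """Return the character whose code point is given in hex, e.g. "20ac"."""
    return chr(int(hex_digits, 16))


euro = char_from_hex("20ac")      # EURO SIGN, U+20AC
euro_utf8 = euro.encode("utf-8")  # three bytes in UTF-8
```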
Some useful links:
- http://www.utf-8.com/
- http://www.unicode.org/
- http://eyegene.ophthy.med.umich.edu/unicode/ -- Quick Primer
- http://dev.mysql.com/tech-resources/articles/4.1/unicode.html
- http://www.cl.cam.ac.uk/~mgk25/unicode.html -- UTF-8 and Unicode FAQ
- http://www.unifont.org/fontguide/ -- Unicode fonts
Software:
- iconv -- comes with libc for converting between a vast number of different encodings. Run iconv -l to see the list.
- piconv -- arguably smarter alternative to iconv. Written in Perl.
- tcs -- convert various encodings (like 8859-1) to various encodings (like UTF-8 or ascii).
- uni2ascii (http://billposer.org/Software/uni2ascii.html)
- unifont
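What these converters do boils down to decoding bytes from one encoding and re-encoding them in another. Here is a minimal Python sketch of that idea, roughly what iconv -f ISO-8859-1 -t UTF-8 does for a single file's contents. The function name convert_encoding and the sample bytes are made up for illustration, and this is not how any of the tools above are implemented.

```python
def convert_encoding(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(from_enc).encode(to_enc)


# "cafe" with an acute accent on the e, stored as ISO-8859-1 (one byte, 0xE9).
latin1_bytes = b"caf\xe9"

# The same text in UTF-8: the accented e becomes the two bytes 0xC3 0xA9.
utf8_bytes = convert_encoding(latin1_bytes, "iso-8859-1", "utf-8")
```

Going the other way (UTF-8 to ISO-8859-1) raises an error for any character the target encoding can't represent, which is where tools differ in how they let you handle or substitute such characters.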
See also: