Palimpsest’s Book Group is reading two H.G. Wells books at the moment. Being a skinflint, I thought I would download them from Project Gutenberg, a library of free books available in text format, and sometimes HTML.
The two novels are:
The trouble is that often the HTML option isn’t there, and the text files are formatted with hard line breaks, which means that each line breaks at a fixed point whether it needs to or not. So if you load them into a word processor and change the font and text size to get the page count down for printing, the results look terrible.
Surely, I thought, it must be possible to automatically remove these line breaks, somehow? I asked in various places:
- Palimpsest’s Ono No Komachi suggested using EReader, but that costs money!
- The guys at South Cheshire LUG had a go at producing some Perl scripts to convert the text files – but they didn’t quite work, and I don’t have a Linux box to run them on yet
- Blixa on Palimpsest came up with some VBA script for MS Word to reformat the text, but it doesn’t quite work either, and it involves Word and VBA (but is very clever)…
All to no avail!
Until Carfilhiot suggested GutenMark, a command-line tool for Linux or Windows which takes the text file and reformats it nicely as HTML. It is released under the GPL, so it should be possible to have a look at the source and see if it can be persuaded to produce just text files, though it may also be possible to copy and paste from the browser into a text editor and see what results from that.
Carfilhiot has hosted the reformatted versions of the Wells texts:
Excellent – and the copy-and-paste to text file seems to work too!
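
For the curious, the basic idea of removing hard line breaks can be sketched in a few lines of Python. This is not how GutenMark works internally, and it is not one of the scripts mentioned above; it is only a minimal sketch that assumes paragraphs in the text file are separated by blank lines. The function name `unwrap` is my own invention.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines so that each paragraph becomes one long line.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalise Windows line endings so the splitting below is predictable.
    text = text.replace("\r\n", "\n")

    # A blank line (possibly containing stray spaces) marks a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())

    # Inside a paragraph, collapse every run of whitespace, including the
    # hard line breaks, into a single space.
    joined = [" ".join(paragraph.split()) for paragraph in paragraphs]

    # Put one blank line back between paragraphs.
    return "\n\n".join(joined)
```

Reading a downloaded file into a string, passing it through `unwrap` and writing the result back out should give text that reflows properly in a word processor. Because every paragraph is flattened, anything that relies on its line breaks, such as poetry or a table of contents, will be run together too, so the output still needs a look over by hand.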