Gutenberg formatting

Palimpsest’s Book Group is reading two H.G. Wells books at the moment. Being a skinflint, I thought I would download them from Project Gutenberg, a library of free books available in text format, and sometimes HTML.

The two novels are:

The trouble is that the HTML option often isn’t there, and the text files are formatted with hard line breaks, which means that each line breaks at a fixed point whether it needs to or not. So if you load them into a word processor and change the font and text size to get the page count down for printing, the results look terrible.

Surely, I thought, it must be possible to automatically remove these line breaks, somehow? I asked in various places:

All to no avail!

Until Carfilhiot suggested GutenMark, a command-line tool for Linux or Windows which takes the text file and reformats it nicely as HTML. It is released under the GPL, so it should be possible to have a look at the source and see if it can be persuaded to produce just text files. Failing that, it may be possible to copy and paste from the browser into a text editor and see what results from that.

Carfilhiot has hosted the reformatted versions of the Wells texts:

Excellent – and the copy-and-paste to text file seems to work too!
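For anyone who would rather skip the browser step, here is a rough Python sketch of the basic idea of removing the hard line breaks. It is not how GutenMark works, just my own guess at the simplest approach. It assumes paragraphs are separated by blank lines, which seems to be the usual layout in these text files. The function name `unwrap` is my own invention.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines inside each paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    Anything that relies on its line breaks (verse, tables, indented
    quotations) will be flattened too, so check the output by eye.
    """
    # Normalise Windows and old Mac line endings to plain newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split the text into paragraphs wherever there is a blank line.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within each paragraph, collapse every run of whitespace
    # (including the hard line breaks) into a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Read the downloaded file into a string, pass it to `unwrap`, and write the result back out. Each paragraph then sits on one long line, and the word processor can wrap it to suit whatever font and page size you pick.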
