Format Your Own Damned Book Part XIII -- Cool Tricks For Book Nerds
The tricks in this chapter are for those who are comfortable writing their own computer programs. The examples run on Linux and probably could be persuaded to run on Windows and the Mac, too. Doing this is left as an exercise for the student.
You don't need these tricks to publish your own books, but those who can take advantage of them will find that they save time and make the process less tedious.
If you look down on the art of computer programming and the practitioners thereof you should skip this chapter.
No warranty is offered for these code samples. They work for the author, and should work for the reader as well, but if they do not the reader will need to diagnose the problem himself. Remember the title of these posts: Format Your Own Damned Book. Not Let's Bug The Author.
Strip HTML Utility
The first utility should be copied into a file named striphtml.py. You will need Python 2 and Beautiful Soup 3 installed to run it. Beautiful Soup is a Python library that extracts information from the HTML in web pages. Badly formatted web pages are sometimes called "tag soup", hence the name of this library. A version of Beautiful Soup may be included in your Linux distribution.
You can read about Beautiful Soup here:
https://www.crummy.com/software/Beaut...
The utility takes an input HTML file and strips out anything that is not needed to create an XHTML file for an EPUB.
It strips out <pre> and <img> tags, because most word processors don't handle these well.
It also inserts a link to a style sheet, generates a table of contents, and inserts a special marker before each H1 tag that can be used later to automatically split up the file into multiple chapter files.
The generated table of contents only works if you split up the chapters into multiple files. If you used a table of contents in your source document the generated TOC will have "None" for all the chapter titles and will be useless. I mostly use this script for my novels, which do not include a Table of Contents.
#! /usr/bin/env python
from BeautifulSoup import BeautifulSoup

def _attr_name_whitelisted(attr_name, attr_value):
    if attr_name.lower() == "align" and attr_value.lower() == "center":
        return True
    else:
        return False

# remove these tags, complete with their contents.
blacklist = ["head", "img", "pre"]

# remove attributes from these tags, except
# those whitelisted.
striplist = ["p", "h1", "h2", "h3"]

# Anything not in this list will be replaced with <span>
# tags with no attributes.
whitelist = [
    "p", "br", "pre", "meta",
    "table", "tbody", "thead", "tr", "td",
    "blockquote", "h1", "h2", "h3",
    "ul", "li",
    "b", "em", "i", "strong", "u"
]

soup = BeautifulSoup(open("input.html"))

print "<html>\n<head>\n"
print "<meta http-equiv=\"CONTENT-TYPE\" content=\"text/html; "
print "charset=UTF-8\">"
print soup.title
print "<link href=\"../Styles/ebook.css\" rel=\"stylesheet\""
print " type=\"text/css\"/>"
print "\n</head>\n<body>"

print "<h1>Contents</h1>"
print "<ul>"
print "<li><a href=\"TOC_0001.xhtml\">Title Page</a></li>"
i = 1
for chapter in soup.findAll("h1"):
    i = i + 1
    print("<li><a href=\"TOC_" + str(i).zfill(4) + ".xhtml\">")
    print(chapter.string)
    print("</a></li>")
print "</ul>"

print "<hr class=\"sigilChapterBreak\" />"
print "<p class=\"title\">"
print soup.title.string
print "</p>"
print "<p class=\"author\">Author Name</p>"

for tag in soup.findAll():
    if tag.name.lower() in blacklist:
        # blacklisted tags are removed in their entirety
        tag.extract()
    elif tag.name.lower() in striplist:
        tag.attrs = [(a[0], a[1])
                     for a in tag.attrs if _attr_name_whitelisted(a[0], a[1])]
    elif tag.name.lower() not in whitelist:
        # not a whitelisted tag. I'd like to remove it from the tree
        # and replace it with its children. But that's hard. It's much
        # easier to just replace it with an empty span tag.
        tag.name = "span"
        tag.attrs = []

print(soup.renderContents("utf-8"))
print "</body></html>"
The code is available for download here:
https://raw.githubusercontent.com/sug...
You run it with this script:
./striphtml.py |
sed 's_<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">__' |
sed 's/<span>//g' |
sed 's_</span>__g' |
sed 's_<h1>_<hr class="sigilChapterBreak" /><h1>_' |
sed 's_<hr class="sigilChapterBreak" /><h1>Contents</h1>_<h1>Contents</h1>_' |
sed 's/<p align="CENTER">/<p style="text-align: center">/' > TOC.xhtml
The script actually goes all on one line, but I have split it up into multiple lines to make it fit better on a printed page. Place it in a file named genbook.sh and make that executable. Or download it here:
https://raw.githubusercontent.com/sug...
The script uses the sed utility to search and replace things in the HTML that the Python program can't remove itself. Sed is a stream editor that ships with every Linux system. Versions of it may be available for lesser operating systems.
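One sed detail worth knowing: the character right after the s command becomes the delimiter, so the script uses underscores (s_..._..._) whenever the pattern contains slashes, as in closing tags like </span>. A quick illustration of the same two substitutions the script performs:

```shell
# Underscores as delimiters let "/" appear unescaped in the pattern;
# together these strip every <span> and </span> tag from the input.
echo '<span>keep me</span> and <span>me too</span>' |
sed 's_</span>__g' | sed 's/<span>//g'
# prints: keep me and me too
```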
To run it, save your manuscript as input.html in the same directory where the Python program and genbook.sh live. Run
./genbook.sh
and it will create a file TOC.xhtml which you may import into Sigil.
For best results, if your printed book has a TOC page you should save a copy of your manuscript under a new name, delete the TOC page from that copy, and then generate your input.html file from that copy. The automatically generated TOC will not work well as an ebook TOC and will interfere with the Python program generating a TOC.
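If your system no longer has Python 2 or Beautiful Soup 3 (both long retired), the core whitelist idea is easy to reproduce with nothing but the Python 3 standard library. This is a simplified sketch, not the author's script: it drops all attributes rather than keeping align="center", and it does not build a table of contents.

```python
# Sketch of the whitelist idea using only html.parser from the
# Python 3 standard library: tags not in WHITELIST are rewritten
# as bare <span> tags; text content passes through unchanged.
from html.parser import HTMLParser

WHITELIST = {"p", "br", "h1", "h2", "h3", "b", "em", "i", "strong", "u"}

class SpanRewriter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # attrs are discarded; a whitelisted attribute check
        # could be added here as in the original script
        self.out.append("<%s>" % (tag if tag in WHITELIST else "span"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % (tag if tag in WHITELIST else "span"))

    def handle_data(self, data):
        self.out.append(data)

def strip_html(html):
    r = SpanRewriter()
    r.feed(html)
    return "".join(r.out)

print(strip_html('<div class="x"><p align="center">Hello</p></div>'))
# prints: <span><p>Hello</p></span>
```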
A Mystery Script
Have a look at this script and try to guess what it does:
enchant -l -d en myfile.txt |
sort | uniq -c | sort -nr > myfile-words.txt
I'll give you a hint: myfile.txt is a text file containing the full text of the Bhagavata Purana, an important Hindu scripture that I wanted to publish on CreateSpace. You can get that text file here:
http://www.gutenberg.org/ebooks/39442
Download the Plain Text version of the book and replace "myfile.txt" with that file, then run the script.
To understand what the script does, let's take it a step at a time:
1. Enchant is a front end to the command-line spell checkers provided by most Linux distributions. You can read about it at http://www.abisource.com/projects/enc.... The command enchant -l -d en myfile.txt means "scan myfile.txt using an English dictionary and list every misspelled word, one per line."
2. sort takes the list and sorts it into ascending sequence.
3. uniq -c collapses runs of duplicate words in the sorted list and prefixes each surviving word with a count of how many times it occurred.
4. sort -nr > myfile-words.txt sorts that list numerically in reverse order (highest count first) and writes the result to the file myfile-words.txt.
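You can watch steps 2 through 4 work without enchant installed by feeding the pipeline a hand-made word list (the names here are just stand-ins for enchant's output):

```shell
# Count and rank repeated words: sort groups duplicates together,
# uniq -c prefixes each word with its count, sort -nr ranks them.
printf 'Krishna\nArjuna\nKrishna\nUddhava\nKrishna\nArjuna\n' |
sort | uniq -c | sort -nr
# most frequent first: 3 Krishna, 2 Arjuna, 1 Uddhava
```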
So when we're done we have a list of all the misspelled words in the Bhagavata Purana, sorted from most frequently occurring to least. Can you guess why this would be useful?
The Bhagavata Purana is full of character names that don't appear in an English dictionary, and hence are flagged as misspellings by enchant. The names that appear most often in the list belong to the most important characters in the book, and vice versa. And it turns out that LibreOffice will generate an index for you if you supply it with a list of the words you want to index. Thus in a few minutes I was able to add a decent index to my CreateSpace edition of this book, something the original book never had.
Published on November 19, 2016 12:44