Hi everyone! Long-time reader, first-time poster.
I'm working on a free open-source ebook project called Standard Ebooks. Its goal is to bring classics that are free of copyright restrictions (i.e. public domain books) up to modern technological and editorial standards--in other words, to produce commercial-quality liberated ebooks for true book lovers.
Part of the "modern technological standards" bit is making an effort at supporting auto hyphenation in our ebooks. Ideally, since ebooks are basically just web pages, ereading software would simply use CSS's `hyphens` property to do that automatically. In reality, almost no ereading software has that ability right now. But lately a lot of ereading software has gained the ability to understand soft hyphens, and that's a small step in the right direction.
Of the major ereading platforms, at least Google Play Books and Kindle support soft hyphens. (This is Kindle's much-vaunted "enhanced typography": instead of doing it the smart way with CSS hyphens on the firmware level, it seems they're instead planning on somehow adding soft hyphens to their entire ebook catalog... ugh.)
I searched around for programs that could add soft hyphens automatically, and came across an excellent Calibre plugin that can do that via a GUI. But I needed to automate the process from the command line, so that we could automatically build compatible ebooks from our untainted epub3 sources. Browsing through that thread suggested that a few people were looking for a similar solution.
So I went ahead and created a Python script that automatically adds soft hyphens to the text of any xhtml file, and thus to ebooks. I thought I'd share it with you all in case someone finds it helpful.
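For the curious, the basic idea can be sketched in a few lines of Python. This is only a rough illustration and not the actual hyphenate script: `add_soft_hyphens` and `toy_syllables` are made-up names, and the toy splitter just chops words every three letters where a real tool would ask a hyphenation dictionary (with pyhyphen installed you could presumably pass something like `Hyphenator('en_US').syllables` instead). It also works on a plain string; a real tool would need to touch only the text nodes of the xhtml so tags and attributes are left alone.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible unless a line breaks at that spot

def toy_syllables(word):
    """Hypothetical stand-in for a dictionary-based splitter: break after every 3 letters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def add_soft_hyphens(text, syllables=toy_syllables, min_length=5):
    """Insert soft hyphens between the syllables of each sufficiently long word."""
    def hyphenate_word(match):
        word = match.group(0)
        if len(word) < min_length:
            return word
        parts = syllables(word)
        # Fall back to the untouched word if the splitter returns nothing
        return SOFT_HYPHEN.join(parts) if parts else word
    # [^\W\d_]+ matches runs of letters only, skipping digits and punctuation
    return re.sub(r"[^\W\d_]+", hyphenate_word, text)

hyphenated = add_soft_hyphens("A remarkable book")
# Show the otherwise invisible soft hyphens as "|"
print(hyphenated.replace(SOFT_HYPHEN, "|"))  # A rem|ark|abl|e book
```

Passing the splitter in as a parameter keeps the sketch free of third-party dependencies and makes it easy to swap in a per-language dictionary.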
To install on Ubuntu 16.04:
Code:
#Make sure you have pip3 installed
sudo apt install python3-pip
#Install some dependencies
sudo pip3 install pyhyphen beautifulsoup4
#Download the script and make it executable
wget https://raw.githubusercontent.com/standardebooks/tools/master/hyphenate
chmod +x hyphenate
Adding soft hyphens to an epub file
The script operates on single xhtml files, but since epub files are just zip files filled with xhtml files, you can hyphenate a whole ebook by unzipping the epub, running hyphenate on all of the xhtml files within, and re-zipping it:
Code:
#Blow up our epub file
unzip mybook.epub -d mybook-extracted
#Hyphenate all (x)html files
find mybook-extracted -iname "*htm*" -exec ./hyphenate "{}" \;
#Rebuild our epub file from inside the extracted folder, mimetype first and uncompressed (you may have to tweak the folder names a little)
cd mybook-extracted
zip -0 -X ../mybook-hyphenated.epub mimetype
zip -9 -X --no-dir-entries --recurse-paths ../mybook-hyphenated.epub META-INF OEBPS
cd ..
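If you'd rather do the rebuild step from Python, here is a minimal sketch using only the standard library. `build_epub` is a hypothetical helper name, not part of our tools. It assumes the extracted folder holds a `mimetype` file at its top level, and it writes that file first and uncompressed, which is what the epub container format expects. Everything else is added with paths relative to the epub root.

```python
import os
import zipfile

def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path with the mimetype entry first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as epub:
        # The mimetype file goes first and is stored without compression
        epub.write(os.path.join(source_dir, "mimetype"), "mimetype",
                   compress_type=zipfile.ZIP_STORED)
        for folder, subfolders, files in os.walk(source_dir):
            subfolders.sort()  # walk folders in a predictable order
            for name in sorted(files):
                full_path = os.path.join(folder, name)
                # Store paths relative to the epub root, always with forward slashes
                arcname = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                if arcname != "mimetype":
                    epub.write(full_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

With the file names from the example above, the call would be `build_epub("mybook-extracted", "mybook-hyphenated.epub")`. Keep the output file outside the extracted folder so it doesn't end up inside itself.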
Adding soft hyphens to Kindle ebook files
For those of you using Kindle devices or software, from my limited experiments it appears that only azw3 files support hyphenation right now. To hyphenate a Kindle ebook, you can use Calibre to convert it to epub first, perform the hyphenation, then convert it back to azw3:
Code:
#Use Calibre's command-line tools to convert your Kindle book to epub
ebook-convert mybook.azw3 mybook.epub
#Perform the steps for epub as listed above
#After you've done that, use Calibre to convert back to azw3
ebook-convert mybook-hyphenated.epub mybook-hyphenated.azw3
A note on languages
pyhyphen requires that you install dictionaries for each language you want to process. I believe it downloads a dictionary for your system's default language when it's installed, but there are instructions on downloading additional dictionaries in the pyhyphen documentation.
The script tries to guess the xhtml file's language by looking for a `lang` attribute on the `<html>` element. If your files don't have one, you can force the script to use a specific language like so:
Code:
./hyphenate --language="en-US" myfile.xhtml
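The language guess described above can be sketched roughly like this. It's an illustration under a few assumptions rather than the script's real code: `guess_language` is a hypothetical name, the input is assumed to be well-formed xhtml without named entities that a plain XML parser can't resolve, and checking `xml:lang` before `lang` is my own addition.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def guess_language(xhtml_source, override=None):
    """Return the override if given, else the root element's xml:lang or lang value, else None."""
    if override:
        return override
    root = ET.fromstring(xhtml_source)
    return root.get(XML_LANG) or root.get("lang")

page = '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB"><body/></html>'
print(guess_language(page))           # en-GB
print(guess_language(page, "en-US"))  # en-US
```

Returning `None` when nothing is found leaves it up to the caller to fall back to a default language or stop with an error.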
I hope someone finds this useful. We also have a few more command-line tools for processing ebooks that some of you might find helpful in our complete tools repository. This and all our tools are GPLv3 and contributions via GitHub are welcome. And if you'd like to volunteer at the Standard Ebooks project and bring a liberated classic up to our high standards, drop me a line! :)