Channel: MobileRead Forums

MobiPocket dictionary index extraction/format/method

Hi all,

I'm trying to do something a bit weird: I like the experience of looking up words on Kindle so much that I'd like to be able to do something a bit similar on the PC, rather than going to my favourite online dictionary. Therefore I envisage first of all a program which accepts a word and a dictionary file, and which should render the appropriate entry in the dictionary.

Now, I have unpacked the .mobi of my dictionary with KindleUnpack, had a look through the HTML and see that the entries are there and use tags like <idx:orth value="word"> and <idx:iform value="inflectedword"/> to describe the entries which are then given in ordinary HTML. I can, using any old XML parser, scan through the .html file and search for the given word, and return the relevant HTML: it would then be easy to use webkit or another renderer to render it.
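
For reference, a minimal sketch of the kind of linear scan described above, using only the standard library's html.parser. The class name HeadwordCollector and the sample markup are made up for illustration; only the idx:orth and idx:iform tags and their value attributes come from the unpacked file.

```python
from html.parser import HTMLParser


class HeadwordCollector(HTMLParser):
    """Collect idx:orth headwords and the idx:iform inflections nested under them."""

    def __init__(self):
        super().__init__()
        self.entries = {}     # headword -> list of inflected forms
        self._current = None  # headword whose inflections are being read

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <idx:iform .../> also arrive here.
        value = dict(attrs).get("value")
        if value is None:
            return
        if tag == "idx:orth":
            self._current = value
            self.entries.setdefault(value, [])
        elif tag == "idx:iform" and self._current is not None:
            self.entries[self._current].append(value)


# Hypothetical fragment in the style of KindleUnpack output.
sample_html = (
    '<idx:orth value="run">'
    '<idx:iform value="ran"/><idx:iform value="running"/>'
    '</idx:orth><div>to move quickly on foot</div>'
)

collector = HeadwordCollector()
collector.feed(sample_html)
entries = collector.entries
```

This still walks the whole file on every lookup, so it only shows where the time goes; it does nothing to speed things up.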

The problem is that the unpacked HTML is 50-odd megabytes, so if I enter a word beginning with Z I have to wait 10 seconds or so while it scans through everything. My little Kindle handles it faster than this, so I'd like to do better. Now first off, I'm looping through in Python, which is a slow language. I could do better if I wrote it in C++, but that's a pain. But presumably, when a .mobi file is created, all those indices are assembled together in some manner for quick lookup of the relevant locations.

I don't see any evidence of this in the unpacked file, but it's quite possible that KindleUnpack just discards this information. I'd like to know if it's possible to extract this information or, failing that, what format it takes so that I can create something similar. The idea would then be a method which I can use to generate my own lookup table for any KindleUnpack-extracted file, for rapid indexing into the XML. The trouble is that the most obvious way I can think of, just noting the byte position in the file where the relevant data starts, doesn't work well with my current method of using a standard (and therefore fast) XML parser to extract the info. I could skip the parser and find the correct location directly (in my example dictionary, the next <div> after the <idx:...> tag contains the entry, and there is no nesting etc., so I could get it with a simple regex), but this would discard all the ancestor elements, which might be useful, and might break with unusual dictionaries. Taking this approach, I may as well do something even more straightforward and just search the file for value="..." instances and not bother with XML parsing at all.
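
To make the byte-position idea concrete, here is a rough sketch of a one-off pass that records where each idx:orth tag starts, keyed by headword and by inflected form. It is not the index format stored inside the .mobi, which I don't know. The function names, the regex and the sample bytes are my own assumptions, and read_entry assumes the entry is the first <div> after the tag with no nested <div>, as in my example dictionary.

```python
import re

# Matches idx:orth / idx:iform tags in raw bytes and captures their value attribute.
TAG_RE = re.compile(rb'<idx:(?:orth|iform)\b[^>]*?\bvalue="([^"]*)"')


def build_offset_index(data):
    """Map every headword and inflected form to the byte offsets of its idx:orth tag."""
    index = {}
    orth_offset = None
    for match in TAG_RE.finditer(data):
        if data.startswith(b"<idx:orth", match.start()):
            orth_offset = match.start()
        if orth_offset is None:
            continue  # an idx:iform seen before any idx:orth; nothing to point at
        word = match.group(1).decode("utf-8", errors="replace")
        index.setdefault(word, []).append(orth_offset)
    return index


def read_entry(data, offset, end_marker=b"</div>"):
    """Return the markup from offset up to and including the next end_marker."""
    end = data.find(end_marker, offset)
    stop = len(data) if end == -1 else end + len(end_marker)
    return data[offset:stop].decode("utf-8", errors="replace")


# Hypothetical fragment in the style of KindleUnpack output.
sample = (
    b'<html><body>'
    b'<idx:orth value="cat"><idx:iform value="cats"/></idx:orth>'
    b'<div>a small feline</div>'
    b'<idx:orth value="dog"></idx:orth>'
    b'<div>a loyal canine</div>'
    b'</body></html>'
)

index = build_offset_index(sample)
dog_entry = read_entry(sample, index["dog"][0])
```

The index only has to be built once per dictionary and could be saved next to the HTML (as JSON, say). A lookup is then a dictionary access plus one read at the stored offset, and the fragment returned can still be handed to a proper parser or renderer. The ancestor elements are lost, as noted above.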

Any thoughts?
