MobiPocket dictionary index extraction/format/method

Hi all,

I'm trying to do something a bit weird: I like the experience of looking up words on Kindle so much that I'd like to be able to do something a bit similar on the PC, rather than going to my favourite online dictionary. Therefore I envisage first of all a program which accepts a word and a dictionary file, and which should render the appropriate entry in the dictionary.

Now, I have unpacked the .mobi of my dictionary with KindleUnpack, had a look through the HTML and see that the entries are there and use tags like <idx:orth value="word"> and <idx:iform value="inflectedword"/> to describe the entries which are then given in ordinary HTML. I can, using any old XML parser, scan through the .html file and search for the given word, and return the relevant HTML: it would then be easy to use webkit or another renderer to render it.

The problem is that the unpacked HTML is 50-odd megabytes, so if I enter a word beginning with Z I have to wait 10 seconds or so while the whole file is scanned. My little Kindle handles it faster than this, so I'd like to do better. First off, I'm looping through in Python, which is a slow language. I could do better if I wrote it in C++, but that's a pain. But presumably, when a .mobi file is created, all those indices are assembled together in some manner for quick lookup of the relevant locations.

I don't see any evidence of this in the unpacked file, but it's quite possible that KindleUnpack just discards this information. I'd like to know if it's possible to extract this information or, failing that, what format it takes so that I can create something similar. The idea would then be a method which I can use to generate my own lookup table for any KindleUnpack-extracted file, for rapid indexing into the XML. The trouble is that the most obvious way I can think of - just noting the byte position in the file where the relevant data starts - doesn't work well with my current method of using a standard (and therefore fast) XML parser to extract the info. I could skip the parser and find the correct location directly (in my example dictionary, the next <div> after the <idx:...> tag contains the entry, and there is no nesting etc., so I could get it with a simple regex), but this would discard all the ancestor elements, which might be useful, and it might break with unusual dictionaries. If I take this approach, I may as well do something even more straightforward and just search the file for value="..." instances and not bother with XML parsing at all.
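To make the byte-position idea concrete, here is a rough sketch of what I have in mind. It is not how the .mobi index itself works (I don't know that format); it just scans the unpacked HTML once for value="..." inside <idx:orth> and <idx:iform> tags and records where each entry starts. The function names build_index and read_entry are made up for illustration, and the sketch assumes that every <idx:iform> follows the <idx:orth> it belongs to and that an entry runs until the next <idx:orth> tag, which may not hold for unusual dictionaries.

```python
import re
from pathlib import Path

# Assumed markup: <idx:orth value="..."> and <idx:iform value="..."/>
TAG_RE = re.compile(rb'<idx:(orth|iform)\b[^>]*?\bvalue="([^"]*)"')


def build_index(html_path):
    """Map each headword or inflected form to the byte offsets of its entries."""
    data = Path(html_path).read_bytes()
    index = {}
    entry_offset = None  # offset of the most recent <idx:orth> tag
    for m in TAG_RE.finditer(data):
        kind = m.group(1)
        value = m.group(2).decode("utf-8", "replace").lower()
        if kind == b"orth":
            entry_offset = m.start()
        if entry_offset is None:
            continue  # an <idx:iform> before any <idx:orth>: skip it
        offsets = index.setdefault(value, [])
        if entry_offset not in offsets:
            offsets.append(entry_offset)
    return index


def read_entry(html_path, offset, max_bytes=65536):
    """Return the raw bytes from offset up to the next <idx:orth> tag.

    Falls back to max_bytes if no later <idx:orth> is found in the chunk.
    """
    with open(html_path, "rb") as f:
        f.seek(offset)
        chunk = f.read(max_bytes)
    nxt = chunk.find(b"<idx:orth", 1)  # start at 1 to skip the entry's own tag
    return chunk if nxt == -1 else chunk[:nxt]
```

The one-off scan still reads the whole file, but the resulting dictionary could be saved (for example with pickle) so that later lookups only need a seek and a short read. The fragment returned by read_entry has lost its ancestor elements and may contain stray closing tags, so it would have to go to a lenient HTML renderer rather than a strict XML parser.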

Any thoughts?
