Channel: MobileRead Forums

adjusting a function

Hi

Some months ago, you gave us a nice function that splits words that have been "glued" together. After a mistake of mine, I had the opportunity to use this function on a lot of words in a French EPUB. I had of course installed a French dictionary. Please read on... :)
The results were amazingly good and quick.

Spoiler:

Code:

>([^<]+)<
Code:

import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
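    def fix_word(m):
        word = m.group()
        # Leave words the dictionary already knows untouched.
        if dictionaries.recognized(word):
            return word
        # Try each split point; range() (rather than xrange) also runs on Python 3.
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word
    # Decode entities so the regex sees real characters; re-escape for XML afterwards.
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)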
    return '>' + text + '<'



Running the Calibre Editor spell checker before and after using this function, I could see that the number of words unknown to the dictionary went down from 1167 to 261. Even allowing for the fact that probably two thirds of the remaining ones were proper nouns ("noms propres"), I realized that a few words had not been split (probably 50 to 70).

The cause was related to a kind of elided form. Here are some examples. One can easily see the same pattern: a word followed by one letter and one curly apostrophe; these last two elements are characteristic of elided forms in French.

accompagnentn’auront
àl’origine
dansl’entrée
dem’expliquer
des’opposer
Etj’écrasai
ils’attendait
manueld’algèbre
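
A possible explanation, offered as a guess rather than a confirmed diagnosis: the curly apostrophe is not a word character, so the pattern `\b\w+\b` hands the text on each side of it to `fix_word` separately, and the loop never tries a split that leaves a single letter as the second half. The sketch below reproduces both effects with the standard `re` module, assuming it treats this pattern the same way as `regex` does; `tokens` and `candidate_splits` are hypothetical helper names used only for illustration.

```python
import re

def tokens(text):
    # Same word pattern as in the function above.
    return re.findall(r'\b\w+\b', text)

def candidate_splits(word):
    # Same bounds as the loop in fix_word: the second half always
    # keeps at least two characters.
    return [(word[:i], word[i:]) for i in range(1, len(word) - 1)]

print(tokens('dansl’entrée'))      # ['dansl', 'entrée']
print(candidate_splits('dansl'))   # [('d', 'ansl'), ('da', 'nsl'), ('dan', 'sl')]
```

So "dansl" is checked on its own, and the one split that would help ("dans" + "l") is never attempted.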

What makes me hope that the function could be improved to handle elided forms is that, for all of them, the first suggestion from the Calibre editor's dictionary is the correct split.
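
One way such an adjustment might look is sketched below. It is only a minimal illustration under stated assumptions, not the function's actual fix: `KNOWN` and `recognized` are hypothetical stand-ins for the `dictionaries.recognized()` call, and the pattern assumes that an elision is always exactly one letter followed by an apostrophe.

```python
import re

# Hypothetical stand-in for dictionaries.recognized(); inside the editor
# the real dictionaries object passed to replace() would be used instead.
KNOWN = {'dans', 'entrée', 'manuel', 'algèbre', 'de', 'expliquer'}

def recognized(word):
    return word.lower() in KNOWN

# A run of letters, then one elided letter plus an apostrophe, then the rest.
ELIDED = re.compile(r"\b(\w+)(\w[’'])(\w+)\b")

def fix_elided(m):
    head, elision, tail = m.group(1), m.group(2), m.group(3)
    # Split only when both the word before the elision and the word
    # after the apostrophe are known to the dictionary.
    if recognized(head) and recognized(tail):
        return head + ' ' + elision + tail
    return m.group()

def split_elided(text):
    return ELIDED.sub(fix_elided, text)

print(split_elided('dansl’entrée'))   # dans l’entrée
```

Correct forms such as "l’entrée" are left alone because the pattern requires at least one letter before the elided one, and a word like "aujourd’hui" stays intact as long as "aujour" is not in the dictionary. In the editor, this pass would presumably run on the text before the existing `regex.sub` call.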
