Python Docx: change name of font in w:cs? Converting font-encoding to Unicode

Question

Some writing systems (scripts) have been represented in "hacked fonts" by changing the glyphs of characters in ASCII or Arabic or other ranges. For example, the shape of the glyph for "a" could be changed to a Burmese character, "b" to another, etc. Applying the font while typing on a QWERTY keyboard let people create documents in non-standardized scripts. This works great as long as the font can be applied correctly to the text, as in a shared Word document. It has been used for many indigenous languages, especially by linguists and other academics for documenting languages around the world.

However, such font encoding is made obsolete when a writing system is added to Unicode and when fonts and keyboards become available for the script. It is then very desirable to convert older font-encoded text to Unicode text. For this, I use docx to find strings in a given hacked font in text runs, converting the characters to Unicode equivalents. I then set the font name to a new Unicode font: run.font.name = self.unicodeFont

This changes w:ascii and w:hAnsi values in document.xml of the converted .docx: <w:rFonts w:cs="Hacked Font" w:ascii="Unicode Font" w:hAnsi="Unicode Font"/>

However, this leaves the style in an ambiguous state when the new script is "complex", such as Myanmar script.

It's not clear which Unicode ranges count as "complex scripts". Is there any documentation on this?

Can Docx change the w:cs value in addition to w:ascii and w:hAnsi? Or can I remove the w:cs option?

And how can Docx update the fontTable.xml and styles.xml parts to include a new Unicode font so it appears in Word's font list?

Thanks. I want to create a version of a .docx with characters converted from a font hack into the equivalent Unicode characters. And then, I want to apply an appropriate Unicode font to each instance of the converted text.
– Sven Oly, Commented Aug 4, 2024 at 2:34
I'm starting with a .docx file and I want to create a new version of that .docx. Characters to be converted are in paragraph runs, identified by the name of the encoding font. I have written a Python routine using DOCX that does this conversion, saving the new file with converted paragraph runs containing the equivalent Unicode characters. The problem is setting the Unicode font for each converted text run. Because many font hacks use ASCII, the source is not a complex script. However, the output can be complex, e.g., Myanmar script. This requires a new font.
– Sven Oly, Commented Aug 4, 2024 at 2:42
I'm not an expert on this, but I think there might be two separate questions: (1) What does Word classify by default as "complex script"? I think the answer to that might be just about any script other than Latin, Greek, Cyrillic, Chinese, Japanese (kana) or Korean (Hangul). (2) When does a run use "complex script" formatting settings? Reading ECMA 376 (OOXML), the "complex script" formatting settings will be used whenever <w:cs/> is set on a run.
– Peter Constable, Commented Aug 5, 2024 at 12:09

Axel Richter · Accepted Answer · 2024-08-07 15:43:49Z

Well, you are a little late. Fonts that map ASCII code points to Chinese or other glyphs have been outdated for a long time. Unicode was introduced in the 1990s and had prevailed by the start of the 21st century at the latest. So your problem is not really a widely discussed issue in 2024, and such fonts are no longer widely available.

But there are still fonts that map ASCII code points to other glyphs. The font Wingdings, for example, maps ASCII (and ANSI) code points to picture symbols. So it can be used for tests.

Test case source_document_glyphed_font.docx :

The first paragraph contains the text "Abc def ghi 123 ?", formatted in the font Wingdings with various font settings.

Code:

from docx import Document
from docx.text.paragraph import Paragraph
from docx.text.run import Run
from docx.text.hyperlink import Hyperlink
from docx.table import Table
from docx.oxml.ns import qn

#glyphed_font_name = 'Wingdings'
glyphed_font_name = 'Webdings'
new_font_name = 'Meiryo'

def converting_the_characters_to_Unicode_equivalents(run_inner_content: str) -> str:
    #to cyrillic alphabet
    #translation_table = { ord('A'): 0x0410, ord('b'): 0x0431, ord('c'): 0x0432, 
    #                      ord('d'): 0x0433, ord('e'): 0x0434, ord('f'): 0x0435,
    #                      ord('g'): 0x0436, ord('h'): 0x0437, ord('i'): 0x0438 }
    #to Chinese; arbitrary code points, chosen only to show that the default font should work
    translation_table = { ord('A'): 0x4E10, ord('b'): 0x4E11, ord('c'): 0x4E12, 
                          ord('d'): 0x4E13, ord('e'): 0x4E14, ord('f'): 0x4E15,
                          ord('g'): 0x4E16, ord('h'): 0x4E17, ord('i'): 0x4E18 }
    new_run_inner_content = run_inner_content.translate(translation_table)
    return new_run_inner_content

document = Document('source_document_glyphed_font.docx')

body = document._body

for body_element in body.iter_inner_content():
    if isinstance(body_element, Paragraph):
        for run_element in body_element.iter_inner_content():
            if isinstance(run_element, Run):
                if run_element.font.name == glyphed_font_name:
                    # Translate the run's text in one step. Assigning run_element.text
                    # inside a loop over iter_inner_content() would overwrite the run
                    # with only the last text fragment.
                    run_element.text = converting_the_characters_to_Unicode_equivalents(run_element.text)
                    
                    # unset special font name, use the default font
                    # try this first                    
                    run_element.font.name = None    
                    
                    # set special font name, if really needed
                    # set ascii and hAnsi font name                 
                    run_element.font.name = new_font_name
                    # set eastAsia font name
                    run_element.element.rPr.rFonts.set(qn('w:eastAsia'), new_font_name)
                    # set cs (complex script) font name
                    run_element.element.rPr.rFonts.set(qn('w:cs'), new_font_name)
                
            elif isinstance(run_element, Hyperlink):
                print('ToDo')
            else:
                print('ToDo')
                

    elif isinstance(body_element, Table):
        print('ToDo')
    #elif isinstance(body_element, OtherBodyElementType):
        #print('ToDo')
    else:
        print('ToDo')
    
document.save('source_document_unicode.docx')

Result source_document_unicode.docx:

The first paragraph is the former "Abc def ghi 123 ?", converted to Chinese Unicode characters using an arbitrary, guessed mapping. The Chinese glyphs appear, and the different font settings are retained, even though the text is formatted in the default font. And, as you said in your question, you have already solved converting the characters to their Unicode equivalents.
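
To check which font slots actually ended up on the runs, word/document.xml in the saved file can be inspected directly. This is a minimal sketch using only the standard library; the function names list_rfonts and list_rfonts_in_docx are hypothetical, and only the four slots discussed here are read.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SLOTS = ("ascii", "hAnsi", "eastAsia", "cs")


def list_rfonts(document_xml):
    """Return one dict per w:rFonts element, mapping slot name to font name."""
    root = ET.fromstring(document_xml)
    found = []
    for rfonts in root.iter(f"{{{W_NS}}}rFonts"):
        found.append({slot: rfonts.get(f"{{{W_NS}}}{slot}") for slot in SLOTS})
    return found


def list_rfonts_in_docx(path):
    """Read word/document.xml from a .docx file and list its w:rFonts slots."""
    with zipfile.ZipFile(path) as package:
        return list_rfonts(package.read("word/document.xml"))
```

Calling list_rfonts_in_docx('source_document_unicode.docx') after the conversion shows at a glance whether any run still names the old font in w:cs.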

Conclusion: As always, stick to the defaults first before trying special settings. Microsoft Word, at least in current versions, can display Unicode text while the default font is applied. If the default font does not contain a glyph, Word takes it from a fallback font such as "Yu Gothic", "Meiryo", or "Arial Unicode MS", depending on the Windows system. So you may see "Aptos" or "Calibri" as the font in Word's GUI while the glyphs actually come from one of those fallback fonts. Therefore, the first thing to try is to unset the special font name with run_element.font.name = None, so that the run uses the document default font.

But if really needed, you can set the new font name using run_element.font.name = new_font_name. This sets only the ascii and hAnsi font names. To set the eastAsia font name too, you need to set the attribute at a lower level:

   # set eastAsia font name
   run_element.element.rPr.rFonts.set(qn('w:eastAsia'), new_font_name)

The same applies to the cs font name:

   # set cs (complex script) font name
   run_element.element.rPr.rFonts.set(qn('w:cs'), new_font_name)

For background on complex scripts, see Microsoft's Uniscribe documentation: About Uniscribe -> About Complex Scripts.

Thank you for the information on using the qn(...) values to set the fonts. This seems to work well! And yes, I'm quite aware that the old font encodings are mostly obsolete. This work is to update language documentation to Unicode for better archiving without the hacked fonts.
