Fixing Mojibake using Python and ftfy

Jun Choi
2 min read · Jan 20, 2021
Picture of Mojibake in Wikipedia

Have you ever seen text in a foreign language show up as gibberish?

Mojibake (文字化け) occurs when text has been encoded with one character encoding and decoded with a different one. The result is the garbled text that you see in the picture above. In Japanese, the word literally translates to “character changing”.
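To make the mechanism concrete, here is a minimal sketch that uses only Python's built-in codecs. The sample word and the choice of UTF-8 and Latin-1 are my own illustration, not something taken from the ftfy documentation.

```python
# Encode text as UTF-8, then decode the same bytes with the wrong codec.
original = "café"
utf8_bytes = original.encode("utf-8")    # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("latin-1")   # each byte becomes one character

print(garbled)   # cafÃ©

# Reversing the wrong step recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

The two bytes that UTF-8 uses for "é" are read as two separate Latin-1 characters, which is exactly the kind of damage shown in the picture.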

This was a common problem before Unicode was around, when different countries were using their own encoding systems. The problem got worse when East Asian countries started to develop multi-byte encoding systems to cover the thousands of characters in their writing systems.

Thankfully, there is a Python package that can easily fix Mojibake!

ftfy: fixes text for you

GitHub: https://github.com/LuminosoInsight/python-ftfy

Documentation: https://ftfy.readthedocs.io/en/latest/

According to the documentation, ftfy can understand text that was decoded as any of the following encodings:

  • Latin-1 (ISO-8859-1)
  • Windows-1252 (cp1252 — used in Microsoft products)
  • Windows-1251 (cp1251 — the Russian version of cp1252)
  • Windows-1250 (cp1250 — the Eastern European version of cp1252)
  • ISO-8859-2 (which is not quite the same as Windows-1250)
  • MacRoman (used on Mac OS 9 and earlier)
  • cp437 (used in MS-DOS and some versions of the Windows command prompt)
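As a rough illustration of how the same UTF-8 bytes get garbled differently under each of these encodings, the sketch below uses Python's standard codec names for them (for example iso8859_2 and mac_roman). The sample character and the helper names make_mojibake and undo_mojibake are hypothetical choices for this example, and it does not call ftfy at all.

```python
# Python codec names for the encodings listed above.
ENCODINGS = ["latin-1", "cp1252", "cp1251", "cp1250",
             "iso8859_2", "mac_roman", "cp437"]

def make_mojibake(text, wrong_encoding):
    """Encode as UTF-8, then decode with the wrong single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

def undo_mojibake(garbled, wrong_encoding):
    """Reverse the mistake: re-encode with the wrong encoding, decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")

sample = "é"  # UTF-8 bytes C3 A9, which every codec above can decode
for enc in ENCODINGS:
    garbled = make_mojibake(sample, enc)
    print(f"{enc:10} {garbled!r} -> {undo_mojibake(garbled, enc)!r}")
```

Undoing the damage by hand like this only works when you already know which wrong encoding was applied, and that guesswork is what a tool like ftfy is meant to take off your hands.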

Installing ftfy

To install ftfy, run the following pip command:

pip install ftfy

Using ftfy

The main function in ftfy is fix_text.

Documentation description: Given Unicode text as input, fix inconsistencies and glitches in it, such as mojibake.

To use fix_text, simply import ftfy and call the function on the text that you wish to ungarble!

import ftfy
print(ftfy.fix_text('This text should be in “quotesâ€\x9d.'))

And the line above will print:

This text should be in "quotes".
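In practice you often want to clean many strings at once, for example the lines of a text file. The following is a small sketch of one way to do that. The helper name fix_lines and its fixer parameter are hypothetical, introduced only for this example, and the default path assumes ftfy is installed.

```python
def fix_lines(lines, fixer=None):
    """Apply a text-fixing function to each line and return the results.

    fixer is a hypothetical hook for this sketch; when it is omitted,
    ftfy.fix_text is used, which requires ftfy to be installed.
    """
    if fixer is None:
        import ftfy  # imported lazily so the loop can be tested without ftfy
        fixer = ftfy.fix_text
    return [fixer(line) for line in lines]
```

With ftfy installed, calling fix_lines on a list of strings runs fix_text on each entry. Passing your own fixer makes it easy to test the loop on its own or to swap in a differently configured function.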

There is another function called explain_unicode that breaks a string down character by character and prints each character's code point, its category in the Unicode standard, and its Unicode name.

ftfy.explain_unicode('(╯°□°)╯︵ ┻━┻')

The line above will print:

U+0028  (       [Ps] LEFT PARENTHESIS
U+256F  ╯       [So] BOX DRAWINGS LIGHT ARC UP AND LEFT
U+00B0  °       [So] DEGREE SIGN
U+25A1  □       [So] WHITE SQUARE
U+00B0  °       [So] DEGREE SIGN
U+0029  )       [Pe] RIGHT PARENTHESIS
U+256F  ╯       [So] BOX DRAWINGS LIGHT ARC UP AND LEFT
U+FE35  ︵      [Ps] PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS
U+0020          [Zs] SPACE
U+253B  ┻       [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
U+2501  ━       [So] BOX DRAWINGS HEAVY HORIZONTAL
U+253B  ┻       [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
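For a feel of where this information comes from, the sketch below builds a similar table with Python's standard unicodedata module. It is only an approximation of the idea, not ftfy's actual implementation, and the function name describe_chars is made up for this example.

```python
import unicodedata

def describe_chars(text):
    """Return (code point, category, name) for each character in text."""
    rows = []
    for ch in text:
        code = f"U+{ord(ch):04X}"
        category = unicodedata.category(ch)       # e.g. 'Ps', 'So', 'Zs'
        name = unicodedata.name(ch, "<unknown>")  # fallback for unnamed characters
        rows.append((code, category, name))
    return rows

for code, category, name in describe_chars("(°)"):
    print(f"{code}  [{category}] {name}")
```

The two-letter category codes are the Unicode general categories, so Ps is an opening punctuation mark, So is a symbol, and Zs is a space separator.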

Enjoy fixing Mojibake!
