Fixing Mojibake using Python and ftfy

Picture of Mojibake in Wikipedia

Have you ever opened a file or web page and found text in another language displayed as gibberish?

Mojibake (文字化け) occurs when text has been encoded with one character encoding and decoded with a different one. The result is the garbled text that you see in the picture above. In Japanese, the word literally translates to “character changing”.
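To make the encode/decode mismatch concrete, here is a minimal sketch using only the Python standard library. The sample word and the choice of Latin-1 as the "wrong" encoding are my own assumptions for illustration.

```python
# A minimal sketch of how mojibake arises, using only the standard library.
original = "café"

# Encode the text with one standard (UTF-8) ...
raw_bytes = original.encode("utf-8")

# ... and decode the same bytes with a different one (Latin-1).
garbled = raw_bytes.decode("latin-1")
print(garbled)   # cafÃ©

# Undoing the wrong step recovers the text, but only if you already
# know which encoding was wrongly used.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

The manual repair only works here because we know the wrong encoding in advance, which is rarely the case with real-world data.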

This was especially common before Unicode was widely adopted, when different countries used their own encoding systems. The problem got worse when East Asian countries developed multi-byte encoding systems to cover their thousands of characters.

Thankfully, there is a Python package that can easily fix mojibake!

ftfy: fixes text for you

GitHub: https://github.com/LuminosoInsight/python-ftfy

Documentation: https://ftfy.readthedocs.io/en/latest/

According to the documentation, ftfy can understand text that was decoded as any of the following encodings:

  • Latin-1 (ISO-8859-1)
  • Windows-1252 (cp1252 — used in Microsoft products)
  • Windows-1251 (cp1251 — the Russian version of cp1252)
  • Windows-1250 (cp1250 — the Eastern European version of cp1252)
  • ISO-8859-2 (which is not quite the same as Windows-1250)
  • MacRoman (used on Mac OS 9 and earlier)
  • cp437 (used in MS-DOS and some versions of the Windows command prompt)
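To see why the wrong decoder matters, the sketch below decodes the same UTF-8 bytes with each encoding from the list. It uses only the standard library; the codec names are Python's names for these encodings as far as I know, and the sample character is my own choice.

```python
# The same UTF-8 bytes produce different mojibake under each wrong decoder.
original = "é"
raw_bytes = original.encode("utf-8")  # two bytes: 0xC3 0xA9

# Python codec names assumed to correspond to the encodings listed above.
codecs_to_try = [
    "latin-1",
    "cp1252",
    "cp1251",
    "cp1250",
    "iso-8859-2",
    "mac_roman",
    "cp437",
]

mojibake = {name: raw_bytes.decode(name) for name in codecs_to_try}

for name, text in mojibake.items():
    print(f"{name:12} {text}")
```

Each codec garbles the character differently, so a repair tool has to work out which wrong decoder was used before it can reverse the damage.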

Installing ftfy

To install ftfy, run the following pip command:

pip install ftfy

Using ftfy

The main function in ftfy is fix_text.

Documentation description: Given Unicode text as input, fix inconsistencies and glitches in it, such as mojibake.

To use this function, simply import ftfy and call fix_text on the string that you wish to ungarble!

import ftfy
print(ftfy.fix_text('This text should be in â€œquotesâ€\x9d.'))

The code above prints:

This text should be in "quotes".
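In practice you often have many strings to clean, not just one. Below is a minimal sketch of applying fix_text to a list. The helper name fix_all and the sample strings are hypothetical, and the import is wrapped in a fallback only so that the sketch still runs where ftfy is not installed.

```python
try:
    import ftfy
except ImportError:
    # ftfy is not installed; the helper below then returns text unchanged.
    ftfy = None


def fix_all(lines):
    """Apply ftfy.fix_text to every string in `lines` (hypothetical helper)."""
    if ftfy is None:
        return list(lines)
    return [ftfy.fix_text(line) for line in lines]


samples = [
    "âœ” No problems",             # UTF-8 text that was decoded as Windows-1252
    "plain ASCII is left alone",  # nothing to repair here
]

for fixed in fix_all(samples):
    print(fixed)
```

Text that is already clean passes through unchanged, so it should be safe to run the helper over mixed input.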

There is another function called explain_unicode that breaks a string down character by character, printing each character's code point, its category in the Unicode standard, and its name in the Unicode standard.

ftfy.explain_unicode('(╯°□°)╯︵ ┻━┻')

The line above will print:

U+0028  (       [Ps] LEFT PARENTHESIS
U+256F  ╯       [So] BOX DRAWINGS LIGHT ARC UP AND LEFT
U+00B0  °       [So] DEGREE SIGN
U+25A1  □       [So] WHITE SQUARE
U+00B0  °       [So] DEGREE SIGN
U+0029  )       [Pe] RIGHT PARENTHESIS
U+256F  ╯       [So] BOX DRAWINGS LIGHT ARC UP AND LEFT
U+FE35  ︵      [Ps] PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS
U+0020          [Zs] SPACE
U+253B  ┻       [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
U+2501  ━       [So] BOX DRAWINGS HEAVY HORIZONTAL
U+253B  ┻       [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
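If you are curious where the category codes and names come from, Python's built-in unicodedata module exposes the same Unicode data. The sketch below is not how ftfy implements explain_unicode; it is a rough standard-library approximation, and the describe helper is a hypothetical name.

```python
import unicodedata


def describe(text):
    """Return (code point, character, category, name) for each character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",                 # code point in U+XXXX form
            ch,
            unicodedata.category(ch),           # two-letter general category
            unicodedata.name(ch, "<unknown>"),  # official name, if one exists
        ))
    return rows


for code, ch, category, name in describe("(°)"):
    print(code, ch, f"[{category}]", name)
```

The two-letter codes are Unicode general categories: Ps is opening punctuation, Pe is closing punctuation, So is "symbol, other", and Zs is a space separator.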

Enjoy fixing mojibake!

Jun Choi