Python Shift_JIS and CP932 Report

Overview
We Can Forget About Shift_JIS
The Cheeky Behavior of Our Beloved VSCode
Well, Mac Also Sends Over Some Weird Character Encodings Once in a While
Why Should "Good Kids Use UTF-8"?

Overview

UTF-8 has become widely adopted, and nowadays, we hardly ever have to worry about character encoding issues. It's a good time to be alive. However, the harmony can sometimes be disturbed by Shift_JIS and CP932, character encodings that Windows tends to use. Today, let's dive into that topic.

We Can Forget About Shift_JIS

The reason is that Shift_JIS is actually a subset of CP932. So, by covering CP932, you're also essentially covering Shift_JIS.

Let's run some experiments.

# First, create a file with Shift_JIS encoding.
with open('shift_jis.txt', 'w', encoding='shift_jis') as f:
    f.write('りんご')

# Then, create a file with CP932 encoding.
# '①②③' falls outside of the Shift_JIS range but within CP932.
# You can find CP932 specific characters here↓
#     https://ja.wikipedia.org/wiki/Microsoftコードページ932
with open('cp932.txt', 'w', encoding='cp932') as f:
    f.write('①②③')

# Let's check if it's working.
# '①②③' is not in Shift_JIS so you can't write it with that encoding.
with open('cp932.txt', 'w', encoding='shift_jis') as f:
    f.write('①②③')
    # --> UnicodeEncodeError:
    #     'shift_jis' codec can't encode character '\u2460' in position 0:
    #     illegal multibyte sequence

And then let's read these files.

def detect_encoding(file_path: str) -> str:
    """
    A function that detects the encoding of the specified file.

    Parameters:
        file_path (str): The path of the file you want to detect the encoding for.

    Returns:
        str: A message based on the detected encoding.
    """
    with open(file_path, 'rb') as f:
        data = f.read()
    try:
        data.decode('cp932')
        return 'Readable with CP932'
    except UnicodeDecodeError:
        pass
    try:
        data.decode('shift_jis')
        return 'Readable with Shift_JIS'
    except UnicodeDecodeError:
        pass
    return 'Not readable with either CP932 or Shift_JIS'

print(detect_encoding('shift_jis.txt'))
# --> Readable with CP932
print(detect_encoding('cp932.txt'))
# --> Readable with CP932

See? Since CP932 can read Shift_JIS, we can forget about the latter. Just encode everything in CP932. But we won't, right? We'll use UTF-8 anyway!

The Cheeky Behavior of Our Beloved VSCode

As previously mentioned, there is a significant difference between Shift_JIS and CP932, a mistake in which can result in a UnicodeDecodeError. However, our dear VSCode shows some cheeky behavior regarding this.

Image showing VSCode correctly displaying the file's character encoding as Shift_JIS

Image showing VSCode displaying the file's character encoding as Shift_JIS even though it's actually CP932

It seems that Microsoft refers to both the pre-extension Shift_JIS and the post-extension CP932 as simply Shift_JIS. Probably because VSCode is a Microsoft product, it follows the same policy in displaying character encodings. I don't think that's a good thing, you know?

Well, Mac Also Sends Over Some Weird Character Encodings Once in a While

Like UTF-8-Mac, for instance.

(2022-11-04) Python Mysterious Characters, Again: UTF-8-MAC Edition

The conclusion for articles about character encoding is always the same. All the good kids out there should just stick to using UTF-8.

Why Should "Good Kids Use UTF-8"?

When tracing the history of character encoding, Shift_JIS and CP932 were originally developed for use in Japan. However, these character sets have limitations when it comes to international scalability.

Language Support Limitations: Shift_JIS and CP932 are specialized for Japanese. Therefore, they're inconvenient for projects that require multi-language support.
Non-Standard Extensions: As mentioned earlier, CP932 and the like have a lot of unique extensions. This can lead to compatibility issues.
Complexity of Encoding: UTF-8 uses variable-length encoding but is well-designed. On the other hand, while Shift_JIS and CP932 also use variable-length encoding, their older design can be incompatible with modern environments.

These limitations are why more "universal" character sets like UTF-8 are preferred. UTF-8 supports multiple languages and its encoding is efficient. That's why the saying goes, "Good kids use UTF-8."