hbllmutils.template.decode

Automatic text decoding utilities with emphasis on Chinese encodings.

This module provides a small, focused API for decoding byte strings when the source encoding is unknown. The decoding strategy prioritizes encodings commonly used on Chinese Windows systems and falls back to the system default encoding. The encoding guessed by chardet is also added to the list of candidates.

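The module contains two public components: windows_chinese_encodings, the ordered list of candidate encodings, and auto_decode(), the function that tries those candidates in turn.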

Note

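The detection order changes for short inputs to reduce mis-detection. For data shorter than 30 bytes, the module tries the common encodings and the system default encoding first, and tries the encoding detected by chardet last. For data of 30 bytes or more, the chardet result is tried first.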

Example:

>>> from hbllmutils.template.decode import auto_decode
>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK
>>> auto_decode(text_bytes)
'你好'

windows_chinese_encodings

hbllmutils.template.decode.windows_chinese_encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be']

Ordered list of encodings commonly found in Chinese Windows environments.

This list is used by auto_decode() to attempt decoding when the encoding is unknown. The order is chosen to prioritize modern and frequently used encodings.

Type:

list[str]
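
To illustrate why the order of this list matters, here is a minimal sketch of a first-match loop over it. The helper name first_matching_encoding is hypothetical and not part of the module, and the list is copied from the documented value above rather than imported. This is an illustration of the idea, not the module's implementation.

```python
from typing import Iterable, Tuple

# Copied from the documented value of windows_chinese_encodings.
WINDOWS_CHINESE_ENCODINGS = [
    'utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950',
    'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be',
    'utf-32', 'utf-32-le', 'utf-32-be',
]


def first_matching_encoding(data: bytes, encodings: Iterable[str]) -> Tuple[str, str]:
    """Hypothetical helper: return (encoding, text) for the first encoding that works."""
    for name in encodings:
        try:
            return name, bytes(data).decode(name)
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("no candidate encoding could decode the data")


# These GBK bytes are not valid UTF-8, so 'utf-8' is skipped and 'gbk' wins.
print(first_matching_encoding(b'\xc4\xe3\xba\xc3', WINDOWS_CHINESE_ENCODINGS))
```

Because the first successful candidate wins, bytes that happen to be valid in several encodings are always attributed to the one that appears earliest in the list.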

auto_decode

hbllmutils.template.decode.auto_decode(data: bytes | bytearray) -> str

Automatically decode bytes data by trying multiple encodings.

This function attempts to decode the input data using multiple encodings in the following order:

  1. The encoding detected by chardet (for inputs >= 30 bytes)

  2. Common Chinese encodings used in Windows

  3. The default system encoding

  4. The encoding detected by chardet (for inputs < 30 bytes)
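
The ordering above can be sketched as a function that builds the candidate list. Everything here is an assumption made for illustration: the name candidate_encodings, the shortened COMMON list, the use of sys.getdefaultencoding() for "the default system encoding", and the detector being passed in as a callable so the sketch runs without chardet installed.

```python
import sys
from typing import Callable, List, Optional

# Shortened stand-in for windows_chinese_encodings (illustration only).
COMMON = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5']
SHORT_INPUT_THRESHOLD = 30  # bytes, as stated in the documentation


def candidate_encodings(
    data: bytes,
    detect: Callable[[bytes], Optional[str]],
) -> List[str]:
    """Hypothetical helper: order candidates according to the input length."""
    detected = detect(data)
    fallback = [*COMMON, sys.getdefaultencoding()]
    if len(data) >= SHORT_INPUT_THRESHOLD:
        ordered = [detected, *fallback]   # trust the detector on longer inputs
    else:
        ordered = [*fallback, detected]   # detector is the last resort on short inputs

    # Drop None and duplicates while keeping the first occurrence of each name.
    seen = set()
    result = []
    for name in ordered:
        if name and name.lower() not in seen:
            seen.add(name.lower())
            result.append(name)
    return result
```

With chardet available, the detector could be supplied as `lambda d: chardet.detect(d)['encoding']`, which may return None when no encoding is recognized; the sketch filters that case out.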

The function tries each encoding in turn and returns the result of the first one that decodes the data successfully. If every encoding fails, it raises the UnicodeDecodeError from the attempt that decoded the longest prefix before failing.
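
One way to pick the error with the longest decoded prefix is to compare the start attribute of each UnicodeDecodeError, which holds the index of the first byte that could not be decoded. The sketch below assumes that approach; the helper name decode_or_best_error is hypothetical and the real implementation may differ.

```python
from typing import Optional, Sequence


def decode_or_best_error(data: bytes, encodings: Sequence[str]) -> str:
    """Hypothetical helper: try each encoding, re-raise the 'furthest' failure."""
    best_error: Optional[UnicodeDecodeError] = None
    for name in encodings:
        try:
            return bytes(data).decode(name)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte, so a
            # larger value means a longer prefix was decoded before failing.
            if best_error is None or err.start > best_error.start:
                best_error = err
    if best_error is None:
        raise ValueError("no encodings were supplied")
    raise best_error


# ASCII fails at offset 0, UTF-8 only at the trailing 0xff (offset 6),
# so the UTF-8 error is the one that propagates.
try:
    decode_or_best_error('你好'.encode('utf-8') + b'\xff', ['ascii', 'utf-8'])
except UnicodeDecodeError as err:
    print(err.encoding, err.start)
```

Reporting the furthest failure tends to point at the encoding that was closest to correct, which is usually the most useful one to debug.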

Parameters:

data (Union[bytes, bytearray]) – The bytes data to decode.

Returns:

The decoded string.

Return type:

str

Raises:

UnicodeDecodeError – If the data cannot be decoded with any of the attempted encodings.

Example:

>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'
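
Callers that cannot tolerate a UnicodeDecodeError may want a lossy fallback. The wrapper below is a sketch under stated assumptions: decode_with_fallback is a hypothetical name, and the strict decoder is passed in as a callable so the example runs without this package installed. In real use, auto_decode would be passed as the decoder argument.

```python
from typing import Callable


def decode_with_fallback(
    data: bytes,
    decoder: Callable[[bytes], str],
    fallback_encoding: str = 'utf-8',
) -> str:
    """Hypothetical wrapper: use a strict decoder, fall back to lossy decoding."""
    try:
        return decoder(data)
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising.
        return bytes(data).decode(fallback_encoding, errors='replace')


# Stand-in strict decoder for demonstration; replace with auto_decode in practice.
strict_ascii = lambda d: d.decode('ascii')
print(decode_with_fallback(b'a\xff', strict_ascii))
```

The fallback silently replaces bad bytes, so it suits logging or display more than data that must round-trip exactly.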