hbllmutils.template.decode

Automatic text decoding utilities with emphasis on Chinese encodings.

This module provides a small, focused API for decoding byte strings when the source encoding is unknown. The decoding strategy prioritizes encodings commonly used on Chinese Windows systems and falls back to the system default encoding. An additional heuristic uses chardet to detect likely encodings.

The module contains the following public components:

windows_chinese_encodings - Ordered list of commonly used Chinese encodings
auto_decode() - Robust decoder that tries multiple encodings

Note

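The detection order depends on input length, to reduce mis-detection on short inputs. For data shorter than 30 bytes, the common encodings and the system default are tried first and the chardet result is tried last. For data of 30 bytes or more, the chardet result is tried first.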

Example:

>>> from hbllmutils.template.decode import auto_decode
>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK
>>> auto_decode(text_bytes)
'你好'

windows_chinese_encodings

hbllmutils.template.decode.windows_chinese_encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be']

Ordered list of encodings commonly found in Chinese Windows environments.

This list is used by auto_decode() to attempt decoding when the encoding is unknown. The order is chosen to prioritize modern and frequently used encodings.

Type: list[str]
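
To illustrate why the order of this list matters, here is a minimal sketch that walks the documented list and reports the first encoding that decodes a sample without error. The helper name `first_successful_encoding` is hypothetical and is not part of the module; the list is copied from the value shown above.

```python
from typing import Optional, Sequence

# Copied from the documented value of windows_chinese_encodings.
ENCODINGS = [
    'utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz',
    'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le',
    'utf-32-be',
]


def first_successful_encoding(
    data: bytes, encodings: Sequence[str] = ENCODINGS
) -> Optional[str]:
    """Return the first encoding in `encodings` that decodes `data` strictly.

    Hypothetical helper for illustration only.
    """
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding rejects the bytes, try the next one
        return name
    return None


print(first_successful_encoding(b'\xc4\xe3\xba\xc3'))        # GBK bytes for "你好"
print(first_successful_encoding('你好'.encode('utf-8')))      # UTF-8 bytes for "你好"
```

The GBK sample is rejected by `utf-8` and accepted by `gbk`, the next entry. Note that a byte sequence can decode without error under more than one encoding, so the first success is not guaranteed to be the encoding the data was written in; this is why the ordering is significant.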

auto_decode

hbllmutils.template.decode.auto_decode(data: bytes | bytearray) → str

Automatically decode bytes data by trying multiple encodings.

This function attempts to decode the input data using multiple encodings in the following order:

1. The encoding detected by chardet (for inputs >= 30 bytes)
2. Common Chinese encodings used in Windows
3. The default system encoding
4. The encoding detected by chardet (for inputs < 30 bytes)

The function tries each encoding in turn and returns the result of the first successful decode. If every encoding fails, it raises the UnicodeDecodeError from the attempt that decoded the longest prefix before failing.

Parameters: data (Union[bytes, bytearray]) – The bytes data to decode.
Returns: The decoded string.
Return type: str
Raises: UnicodeDecodeError – If the data cannot be decoded with any of the attempted encodings.
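
Because the function can raise UnicodeDecodeError, callers that must always obtain some text may want a lossy fallback. The sketch below is one possible pattern, not something the module provides: `decode_or_replace` is a hypothetical wrapper that accepts any strict decoder (for example `auto_decode`) and, as an assumption made here, falls back to UTF-8 with replacement characters.

```python
from typing import Callable, Union

BytesLike = Union[bytes, bytearray]


def decode_or_replace(data: BytesLike, decoder: Callable[[BytesLike], str]) -> str:
    """Try `decoder` first; on UnicodeDecodeError, decode lossily as UTF-8.

    Hypothetical helper: `decoder` stands in for a strict decoder such as
    auto_decode. Undecodable bytes become U+FFFD in the fallback path.
    """
    try:
        return decoder(data)
    except UnicodeDecodeError:
        return bytes(data).decode('utf-8', errors='replace')


# Stand-in strict decoder so the example runs without the package installed.
def strict_ascii(raw: BytesLike) -> str:
    return bytes(raw).decode('ascii')


print(decode_or_replace(b'abc', strict_ascii))
print(repr(decode_or_replace(b'ab\xff', strict_ascii)))
```

With the package installed, `decode_or_replace(data, auto_decode)` would follow the same pattern.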

Example:

>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'
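
The decoding order and error selection described for auto_decode() can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's source: the name `sketch_auto_decode`, the pluggable `detect` callable (standing in for chardet), the use of `sys.getdefaultencoding()` as the system default, and the use of the exception's `start` offset to measure the decoded prefix are all assumptions.

```python
import sys
from typing import Callable, List, Optional, Sequence, Union

# Copied from the documented value of windows_chinese_encodings.
CANDIDATES = [
    'utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz',
    'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le',
    'utf-32-be',
]


def sketch_auto_decode(
    data: Union[bytes, bytearray],
    detect: Optional[Callable[[bytes], Optional[str]]] = None,
    candidates: Sequence[str] = CANDIDATES,
    short_threshold: int = 30,
) -> str:
    """Hypothetical re-creation of the documented strategy."""
    raw = bytes(data)
    detected = detect(raw) if detect is not None else None

    # Build the attempt order: detector first for long inputs, last for short.
    order: List[str] = []
    if detected and len(raw) >= short_threshold:
        order.append(detected)
    order.extend(candidates)
    order.append(sys.getdefaultencoding())  # assumption: the "system default"
    if detected and len(raw) < short_threshold:
        order.append(detected)

    best_error: Optional[UnicodeDecodeError] = None
    for name in order:
        try:
            return raw.decode(name)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte, so a
            # larger value means a longer prefix was decoded.
            if best_error is None or err.start > best_error.start:
                best_error = err
        except LookupError:
            continue  # the detector suggested a codec Python does not know

    if best_error is None:
        raise ValueError('no usable encoding was available')
    raise best_error


print(sketch_auto_decode(b'\xc4\xe3\xba\xc3'))  # GBK bytes for "你好"
```

To plug in chardet, one could pass `detect=lambda raw: chardet.detect(raw)['encoding']`, since `chardet.detect` returns a dictionary whose `'encoding'` entry may be `None`. Keeping the detector as a parameter is a choice made for this sketch so that it runs without third-party packages.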