hbllmutils.template.decode
Automatic text decoding utilities with emphasis on Chinese encodings.
This module provides a small, focused API for decoding byte strings when the
source encoding is unknown. The decoding strategy prioritizes encodings commonly
used on Chinese Windows systems and falls back to the system default encoding.
An additional heuristic uses chardet to detect likely encodings.
The module contains the following public components:
windows_chinese_encodings - Ordered list of commonly used Chinese encodings
auto_decode() - Robust decoder that tries multiple encodings
Note
The detection order changes for short inputs to reduce mis-detection. For data shorter than 30 bytes, the module tries the common encodings and the system default first, and uses the chardet result only as a last resort.
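As a rough illustration of the note above, the candidate order could be assembled as in the following sketch. The helper name `candidate_order`, the `SHORT_INPUT_THRESHOLD` constant, and the use of `sys.getdefaultencoding()` for the system default are assumptions made for this example; they are not taken from the module's source.

```python
import sys

# Threshold from the note above; the constant name is hypothetical.
SHORT_INPUT_THRESHOLD = 30


def candidate_order(data, detected, common, system_default=None):
    """Return the encodings to try, in order, for the given input.

    `detected` stands in for whatever a detector such as chardet reported
    (or None if it reported nothing). `common` is an ordered list of
    preferred encodings.
    """
    if system_default is None:
        # Assumption: the "system default" is Python's default encoding.
        system_default = sys.getdefaultencoding()

    is_short = len(data) < SHORT_INPUT_THRESHOLD
    candidates = []
    if detected and not is_short:
        # Long input: the detector's guess is tried first.
        candidates.append(detected)
    candidates.extend(common)
    candidates.append(system_default)
    if detected and is_short:
        # Short input: the detector's guess is tried last.
        candidates.append(detected)
    return candidates


short_order = candidate_order(b'x' * 10, 'big5', ['utf-8', 'gbk'], 'ascii')
long_order = candidate_order(b'x' * 30, 'big5', ['utf-8', 'gbk'], 'ascii')
```

With these inputs the detected encoding ends up last for the 10-byte input and first for the 30-byte input.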
Example:
>>> from hbllmutils.template.decode import auto_decode
>>> text_bytes = b'\xc4\xe3\xba\xc3' # "你好" in GBK
>>> auto_decode(text_bytes)
'你好'
windows_chinese_encodings
- hbllmutils.template.decode.windows_chinese_encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be']
Ordered list of encodings commonly found in Chinese Windows environments.
This list is used by auto_decode() to attempt decoding when the encoding is unknown. The order is chosen to prioritize modern and frequently used encodings.
- Type:
list[str]
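The order of the list matters because more than one encoding can accept the same bytes. The sketch below uses only the standard library to show which candidates decode a given input; the helper `decodable_with` and the shortened candidate list are hypothetical and only serve this illustration.

```python
def decodable_with(data, encodings):
    """Map each encoding that accepts `data` to the text it produces."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = bytes(data).decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Skip encodings that reject the bytes or are not available.
            continue
    return results


# A shortened candidate list, in the same relative order as the full one.
candidates = ['utf-8', 'gbk', 'gb18030', 'big5']
results = decodable_with(b'\xc4\xe3\xba\xc3', candidates)

# utf-8 rejects these bytes; gbk and gb18030 both accept them.
first_match = next(enc for enc in candidates if enc in results)
```

Here `first_match` is `'gbk'`, since `'utf-8'` fails and `'gbk'` is the next entry that succeeds.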
auto_decode
- hbllmutils.template.decode.auto_decode(data: bytes | bytearray) -> str
Automatically decode bytes data by trying multiple encodings.
This function attempts to decode the input data using multiple encodings in the following order:
1. The encoding detected by chardet (for inputs >= 30 bytes)
2. Common Chinese encodings used in Windows
3. The default system encoding
4. The encoding detected by chardet (for inputs < 30 bytes)
The function tries each encoding until successful decoding is achieved. If all encodings fail, it raises the UnicodeDecodeError that decoded the longest prefix before failing.
- Parameters:
data (Union[bytes, bytearray]) – The bytes data to decode.
- Returns:
The decoded string.
- Return type:
str
- Raises:
UnicodeDecodeError – If the data cannot be decoded with any of the attempted encodings.
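The failure behaviour described above, raising the error from the attempt that got furthest, can be approximated with the `start` attribute of `UnicodeDecodeError`, which holds the index of the first byte that could not be decoded. The following is a minimal sketch under that assumption; `decode_or_raise_best` is a hypothetical name, and the actual implementation may select the error differently.

```python
def decode_or_raise_best(data, encodings):
    """Try each encoding in order; on total failure, raise the error
    whose failing position is furthest into the data."""
    best_error = None
    for encoding in encodings:
        try:
            return bytes(data).decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte, so a
            # larger value means a longer prefix was decoded successfully.
            if best_error is None or err.start > best_error.start:
                best_error = err
    if best_error is None:
        raise ValueError("no encodings were supplied")
    raise best_error


# Six valid UTF-8 bytes followed by one byte that is invalid in both codecs.
broken = '你好'.encode('utf-8') + b'\xff'
try:
    decode_or_raise_best(broken, ['ascii', 'utf-8'])
except UnicodeDecodeError as err:
    failure_offset = err.start
```

In this example the ASCII attempt fails at offset 0 and the UTF-8 attempt fails at offset 6, so the UTF-8 error is the one raised.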
Example:
>>> text_bytes = b'\xc4\xe3\xba\xc3' # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'