Font name mojibake for Chinese/CJK fonts: GBK-encoded /BaseFont decoded as Latin-1/WinAnsi #1266

@xuse2008

Describe the bug

When extracting font names from PDFs that use common Chinese system fonts (such as 黑体 / SimHei, 宋体 / SimSun, 微软雅黑 / Microsoft YaHei), the returned font name strings are garbled (classic mojibake). For example:

  • Correct: 黑体
  • Extracted as: ºÚÌå (or similar variations like ºÚÌä)

This happens consistently when accessing:

  • Letter.FontName
  • Font resource dictionaries via page.Resources.Fonts
  • /BaseFont or /FontName values in the resolved font dictionaries

To Reproduce

  1. Use a PDF that references non-embedded Chinese fonts (very common in documents created on Windows with Office or WPS).
  2. Open it with PdfPig and extract the font names, for example:
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (var document = PdfDocument.Open("chinese-document.pdf"))
{
    var page = document.GetPage(1);
    
    // Simple way that often shows mojibake
    foreach (var letter in page.Letters)
    {
        Console.WriteLine(letter.FontName);   // → "ºÚÌå" instead of "黑体"
    }

    // Or traverse resources
    if (page.Resources?.TryGetValue(UglyToad.PdfPig.Tokens.NameToken.Font, out var fontToken) == true &&
        fontToken is UglyToad.PdfPig.Tokens.DictionaryToken fontDict)
    {
        foreach (var kv in fontDict.Data)
        {
            // Resolve reference and get BaseFont
            if (kv.Value is UglyToad.PdfPig.Tokens.IndirectReferenceToken refToken)
            {
                var fontObj = document.StructureResolver.Resolve(refToken) as UglyToad.PdfPig.Tokens.DictionaryToken;
                if (fontObj?.TryGet(UglyToad.PdfPig.Tokens.NameToken.BaseFont, out var baseFontTok) == true &&
                    baseFontTok is UglyToad.PdfPig.Tokens.NameToken nameTok)
                {
                    Console.WriteLine($"Alias {kv.Key}: {nameTok.Value}");  // → garbled
                }
            }
        }
    }
}

Expected behavior

Font names should be decoded into readable Chinese characters, e.g. "黑体", "宋体", "微软雅黑".

Actual behavior

The library appears to decode the raw bytes of the /BaseFont name as PDFDocEncoding / WinAnsiEncoding (Latin-1-like), while many Chinese PDFs store the name as raw GBK/GB18030 bytes with nothing in the file declaring that encoding.

Example byte sequence for "黑体" in GBK: BA DA CC E5
→ Misdecoded as Latin-1: º Ú Ì å → combined string "ºÚÌå"
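
The byte-level mechanism described above can be reproduced without PdfPig at all. The following is a minimal sketch, assuming .NET 5 or later (for Encoding.Latin1); the class and method names are hypothetical and exist only for illustration. It encodes a name to GB18030 bytes (which match GBK for these characters) and then decodes those bytes as Latin-1, which is the suspected failure path rather than confirmed PdfPig behavior.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of PdfPig.
public static class GbkMojibakeDemo
{
    static GbkMojibakeDemo()
    {
        // GB18030 is not available by default on .NET Core / .NET 5+;
        // the code-pages provider has to be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hex dump of the bytes a GBK-based producer would write for the name.
    public static string ToHex(string name)
    {
        byte[] bytes = Encoding.GetEncoding("GB18030").GetBytes(name);
        return BitConverter.ToString(bytes);
    }

    // Simulates a reader that treats those bytes as Latin-1 text.
    public static string MisdecodeAsLatin1(string name)
    {
        byte[] bytes = Encoding.GetEncoding("GB18030").GetBytes(name);
        return Encoding.Latin1.GetString(bytes);
    }
}
```

Calling GbkMojibakeDemo.ToHex("黑体") returns "BA-DA-CC-E5", and GbkMojibakeDemo.MisdecodeAsLatin1("黑体") returns "ºÚÌå", matching the garbled name reported above.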

Environment

  • PdfPig version: 0.1.13
  • .NET version: .NET 6 / .NET 8
  • OS: Windows (but the issue is cross-platform)

Additional context

This is a very common pattern in Chinese PDFs generated by Microsoft Office, WPS, or domestic software.
The PDF specification allows /BaseFont to contain non-ASCII bytes, but in practice many producers write raw GBK bytes rather than UTF-8.
Similar mojibake issues exist in other PDF libraries when handling CJK font names (e.g. seen in pdf.js, iText discussions).
Related older issue: #365 "messy code China font name" (from 2021, but seems not fully resolved for font names specifically).

Possible solutions / suggestions

I believe the cleanest fix could be one or more of the following (in order of preference):

  1. Add an optional font name post-processing / heuristic repair.
     Expose a property or ParsingOptions setting such as AttemptRepairCjkFontNames = true.
     When enabled, if the name contains high-byte characters (>127) and looks like Latin-1 mojibake, try re-encoding:
         byte[] bytes = Encoding.GetEncoding(1252).GetBytes(rawName);
         string repaired = Encoding.GetEncoding("GB18030").GetString(bytes); // fall back to GBK if needed
     On .NET Core / .NET 5+, both encodings are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
     Then apply a simple check (e.g. the result contains Han characters in U+4E00–U+9FFF) before accepting the repaired version.
  2. Expose the raw bytes of NameToken.
     Currently NameToken.Value is already a string (post-decoded). Add NameToken.Data or NameToken.GetBytes() so consumers can apply their own decoding logic.
  3. Auto-detect the encoding per font dictionary.
     More advanced: when resolving a font, check whether it is a CJK font (e.g. Subtype /Type0 + Encoding /Identity-H or CMap hints), then prefer GB18030 decoding for names. This might be overkill.
  4. Document the workaround.
     At minimum, add a section to the README or font-notes.md about common CJK font name mojibake and the byte re-encoding trick above as a recommended user-side fix.
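
As a user-side illustration of the heuristic repair suggested above, here is a minimal sketch. It assumes the garbled name came from a Latin-1 decode, so every character maps back to exactly one byte; if the library actually used Windows-1252, Encoding.GetEncoding(1252) would be the encoder to swap in. CjkFontNameRepair and TryRepair are hypothetical names, not PdfPig API, and the Han-range check is a deliberately simple acceptance test.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical user-side helper; not part of PdfPig.
public static class CjkFontNameRepair
{
    static CjkFontNameRepair()
    {
        // Needed on .NET Core / .NET 5+ before GB18030 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    public static bool TryRepair(string rawName, out string repaired)
    {
        repaired = rawName;
        if (string.IsNullOrEmpty(rawName)) return false;

        // Candidate mojibake: every char fits in one byte, at least one is > 127.
        if (rawName.Any(c => c > 0xFF) || !rawName.Any(c => c > 0x7F)) return false;

        // Assumption: the name was decoded as Latin-1, so this recovers the raw bytes.
        byte[] bytes = Encoding.Latin1.GetBytes(rawName);

        // Strict decoder: invalid GB18030 sequences throw instead of yielding '?'.
        Encoding strict = Encoding.GetEncoding(
            "GB18030",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        string candidate;
        try
        {
            candidate = strict.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return false;
        }

        // Accept only if the result contains at least one Han character.
        if (!candidate.Any(c => c >= '\u4E00' && c <= '\u9FFF')) return false;

        repaired = candidate;
        return true;
    }
}
```

With this sketch, CjkFontNameRepair.TryRepair("ºÚÌå", out var name) returns true with name set to "黑体", while a plain ASCII name such as "Arial" is left untouched. The strict decoder plus the Han check is meant to reduce false positives on genuine Western font names with accented letters, though a heuristic like this cannot rule them out entirely.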

Many .NET developers handling Chinese/Asian documents run into this exact issue — fixing it would make PdfPig much more usable in East Asian regions.
Thank you very much for maintaining this excellent lightweight PDF library — it's really valuable for many projects!
I would be happy to help test any proposed fix or provide sample PDFs (non-confidential ones exhibiting the issue).

Best regards

公路-河北公路项目招标文件.pdf
