Incorrect Encoding in Console Output with WinExe OutputType #111057

Open
lindexi opened this issue Jan 3, 2025 · 7 comments
Assignees
Labels
area-System.Console untriaged New issue has not been triaged by the area owner

Comments

@lindexi
Member

lindexi commented Jan 3, 2025

Description

When setting the OutputType to WinExe, the console output is encoded incorrectly, resulting in garbled text.

Reproduction Steps

  1. Create a console project and set the OutputType to WinExe.
  2. Perform console output.

You will notice that the encoding of StandardOutput in the console is set to CodePage=0, which causes other software attempting to capture the application's output to display garbled text for some Unicode content.

I have written a simple demo program to illustrate this issue. Here is my code:

using System.Diagnostics;
using System.Text;

var codePage = Console.OutputEncoding.CodePage; // The code page will be 0 when the OutputType is WinExe.

if (args.Length > 0)
{
    Console.WriteLine($"CodePage={codePage} Text=\u6797");
}
else
{
    var self = Path.Join(AppContext.BaseDirectory, "YemwearqufeballnoBayboqemli.exe");
    var processStartInfo = new ProcessStartInfo(self, "foo");
    processStartInfo.RedirectStandardOutput = true;
    processStartInfo.StandardOutputEncoding = Encoding.UTF8;
    var process = Process.Start(processStartInfo)!;
    var text = process.StandardOutput.ReadToEnd();
    // You can find the output text is "CodePage=0 Text=��"
    _ = text;
}

You can access the entire project code from https://github.com/lindexi/lindexi_gd/tree/0dc56dbf6f635a7cc9cbda295b1cbe40c2eab8d9/Workbench/YemwearqufeballnoBayboqemli

Expected behavior

When setting OutputType to WinExe, it should still be possible to obtain the correct output encoding.

Actual behavior

Currently, it results in garbled text. This directly affects debugging WinExe applications in Rider, and it is not possible to set Console.OutputEncoding to UTF-8.
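One possible stopgap on the child-process side, offered only as an untested sketch: encode the text as UTF-8 yourself and write the bytes to the raw standard output stream, so that Console.OutputEncoding is never consulted. The helper name WriteUtf8Line is hypothetical and not part of the original report, and this assumes the capturing process decodes the pipe as UTF-8, as the demo above does.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not from the original report): encode text as UTF-8
// without a BOM and write the bytes directly to the given stream.
static void WriteUtf8Line(Stream destination, string text)
{
    var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
    byte[] bytes = utf8.GetBytes(text + "\n");
    destination.Write(bytes, 0, bytes.Length);
    destination.Flush();
}

// Child-process side: bypass Console.Out and write to the raw stdout stream.
using Stream stdout = Console.OpenStandardOutput();
WriteUtf8Line(stdout, "Text=\u6797");
```

This does not fix Console.OutputEncoding itself; it only avoids depending on it for redirected output.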

Reference: dotnet-campus/dotnetCampus.Logger#32

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@hez2010
Contributor

hez2010 commented Jan 3, 2025

Why are you asking for console when you explicitly disabled console by using WinExe?

@tannergooding
Member

tannergooding commented Jan 3, 2025

> Why are you asking for console when you explicitly disabled console by using WinExe?

Not having a visible output window is not the same as not having output altogether.

System.Console ultimately defaults to a thin wrapper over the standard C input/output streams. On Windows, it additionally uses the Win32 Console APIs to try and query various information and ensure it behaves "better" in the default environment.

There are then many ways for an exe to not have a console window, such as by using the CreateProcess parameters that disable it. There are equally many ways for a winexe to have a console, such as by using AllocConsole.

The general issue here looks to be that Console.OutputEncoding on Windows calls GetConsoleOutputCP and then does not handle the failure result, which is 0. It just passes that value down, which defaults the encoding to ANSI.
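To make the failure case concrete, here is a minimal sketch of querying the code page and treating 0 as "no answer" rather than as a code page. GetConsoleOutputCP is the real Win32 API; the helper QueryConsoleOutputCodePage is hypothetical and is not how System.Console is written today.

```csharp
using System;
using System.Runtime.InteropServices;

// Win32 API: returns 0 when the call fails, for example when no console is attached.
[DllImport("kernel32.dll")]
static extern uint GetConsoleOutputCP();

// Hypothetical helper: report failure as null instead of letting 0
// flow onward as if it were a real code page.
static uint? QueryConsoleOutputCodePage()
{
    if (!OperatingSystem.IsWindows())
    {
        return null; // the Win32 console APIs do not exist elsewhere
    }

    uint reported = GetConsoleOutputCP();
    return reported == 0 ? (uint?)null : reported;
}

uint? codePage = QueryConsoleOutputCodePage();
Console.WriteLine(codePage is uint value
    ? $"console output code page: {value}"
    : "no console output code page available");
```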


The general console environment on Windows has changed a lot over recent years, while System.Console in .NET hasn't really changed to account for things like pseudo-consoles, virtual terminal sequences, and system code page differences. Many (but not all) of the Win32 Console* APIs are correspondingly no longer recommended for use and have better alternatives. Likewise, the mix of using some Console* APIs but abstracting the standard C input/output streams in others leads to various disconnects like the above.

I expect it's something that could be fixed, but it is not a trivial task and it has a high chance of impacting existing Windows console applications. -- Some of these nuances also show up on Linux, since the Linux environment for console/terminal handling is a bit different and doesn't "cleanly" map onto what .NET had exposed (which was largely oriented around the Windows APIs from 25 years ago).

@jeffhandley
Member

Assigned to @tannergooding to finish triaging this

@tannergooding
Member

@jkotas would you happen to have any context as to why we're simply passing GetConsoleOutputCP along instead of handling its error scenario?

The code was last touched back in 2016 (7b440e3) but much of it comes from .NET Framework and it doesn't appear as though the original authors are still on the team.

There appear to be various mismatches in the paths between what the current encoding is expected to be vs what it actually might be. For example, SetConsoleOutputEncoding only calls SetConsoleOutputCP if the Encoding is not Unicode.CodePage, which means that doing something like Encoding = UTF8 and then Encoding = Unicode won't actually change it back to Unicode. Additionally, many of the Win32 Console* APIs report failure if no console was actually allocated, but we're inconsistent about whether that failure is handled. For example, we simply pass the output of GetConsoleOutputCP down to GetSupportedConsoleEncoding and so even though the native call reported failure (0) the managed side then treats that as if we requested "default code page", typically CP_ACP (ANSI), which leads to issues like this one.
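The UTF8-then-Unicode mismatch can be illustrated with a small model. This is a simulation of the behavior described above, not the actual ConsolePal source: the variables nativeCodePage and cachedEncoding and the function SetOutputEncodingConditional are hypothetical stand-ins, and the starting code page 437 is an arbitrary assumption.

```csharp
using System;
using System.Text;

// Hypothetical stand-ins for the native console state and the managed cache.
int nativeCodePage = 437;                 // assumed starting console code page
Encoding cachedEncoding = Encoding.ASCII; // assumed starting managed value

// Models the described condition: the native call is skipped for UTF-16.
void SetOutputEncodingConditional(Encoding enc)
{
    cachedEncoding = enc;
    if (enc.CodePage != Encoding.Unicode.CodePage)
    {
        nativeCodePage = enc.CodePage; // stands in for SetConsoleOutputCP
    }
}

SetOutputEncodingConditional(Encoding.UTF8);
SetOutputEncodingConditional(Encoding.Unicode);

// prints "managed=1200 native=65001": the two sides now disagree
Console.WriteLine($"managed={cachedEncoding.CodePage} native={nativeCodePage}");
```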

This seems fixable, but also risky due to the code having been set up this way for so long.

@jkotas
Member

jkotas commented Jan 17, 2025

> we simply pass the output of GetConsoleOutputCP down to GetSupportedConsoleEncoding and so even though the native call reported failure (0) the managed side then treats that as if we requested "default code page"

I do not have context why it is done this way. It has been like that since .NET Framework 1.0: https://github.com/SSCLI/sscli_20021101/blob/77d46e0f04f52052a12ac40ce2cf96712c934b3c/clr/src/bcl/system/console.cs#L150

It is quite possible that the (accidental) fallback to CP_ACP when there is no console attached was considered the right behavior back in 2001 when .NET Framework 1.0 shipped. CP_ACP was the default encoding to use across Windows back then.

> This seems fixable, but also risky due to the code having been set up this way for so long.

What do you think the fix should be? Fall back to UTF-8? I do not see a problem with doing that.

@tannergooding
Member

> What do you think the fix should be? Fall back to UTF-8? I do not see a problem with doing that.

This was my initial thought: changing it to fall back to UTF-8 as the default instead, to better fit more modern code. This would require modifying ConsolePal.OutputEncoding and potentially EncodingHelper.GetSupportedConsoleEncoding.
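A rough sketch of what such a fallback could look like, assuming the reported code page arrives as a plain integer. ResolveOutputEncoding is a hypothetical name and does not reflect the actual shape of ConsolePal.OutputEncoding or EncodingHelper.GetSupportedConsoleEncoding.

```csharp
using System;
using System.Text;

// Hypothetical helper: map the value reported by the native query to an Encoding,
// treating the failure value 0 as "fall back to UTF-8" instead of "default code page".
static Encoding ResolveOutputEncoding(uint reportedCodePage)
{
    if (reportedCodePage == 0)
    {
        // No console answered the query; use UTF-8 without a byte order mark.
        return new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
    }

    // Note: on modern .NET, code pages outside the built-in set need
    // CodePagesEncodingProvider to be registered before this call succeeds.
    return Encoding.GetEncoding((int)reportedCodePage);
}

Console.WriteLine(ResolveOutputEncoding(0).WebName);
Console.WriteLine(ResolveOutputEncoding(1200).WebName);
```

Whether UTF-8 is the right default for every host is exactly the compatibility question raised in this thread; the sketch only shows where the check would sit.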

But I think a "better" fix would be to also more broadly clean up the caching that System.Console does in relation to how System.ConsolePal does the calls to native. This would also entail fixing calls to APIs like ConsolePal.SetConsoleOutputEncoding to be an unconditional P/Invoke rather than conditioning it if the target code page isn't Unicode. Instead, Console.set_OutputEncoding would condition the call to ConsolePal.SetConsoleOutputEncoding based on whether the new value is out of sync with the cached s_outputEncoding (which we already track).
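The control flow proposed here might look roughly like the following model. All names are illustrative stand-ins rather than the real System.Console internals: cachedCodePage plays the role of the cached s_outputEncoding, and SetNativeOutputEncoding stands in for an unconditional ConsolePal.SetConsoleOutputEncoding. Whether the native console accepts a UTF-16 code page is a separate question that this sketch does not answer.

```csharp
using System;
using System.Text;

int? cachedCodePage = null; // stand-in for the cached s_outputEncoding
int nativeCalls = 0;        // counts simulated P/Invokes
int nativeCodePage = 0;     // simulated native console state

// Stand-in for an unconditional native call: no special case for UTF-16.
void SetNativeOutputEncoding(Encoding enc)
{
    nativeCalls++;
    nativeCodePage = enc.CodePage;
}

// Stand-in for Console.set_OutputEncoding: only the cache decides
// whether the native call is needed.
void SetOutputEncoding(Encoding value)
{
    if (cachedCodePage == value.CodePage)
    {
        return; // already in sync with the cached value
    }

    SetNativeOutputEncoding(value);
    cachedCodePage = value.CodePage;
}

SetOutputEncoding(Encoding.UTF8);
SetOutputEncoding(Encoding.UTF8);    // skipped: matches the cache
SetOutputEncoding(Encoding.Unicode); // reaches the native side
Console.WriteLine($"nativeCalls={nativeCalls} nativeCodePage={nativeCodePage}");
```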

I believe it's fine for us to state that a user manually modifying the code page, such as by calling the Win32 APIs themselves, is undefined behavior and that they should only do it via System.Console. If they were to be in a scenario such as WinExe and then do AllocConsole to create a new console, they would likewise be responsible for ensuring the code page is set to match the one currently set by Console to ensure the state between them is "in sync". -- This gives them a path forward, but doesn't put any burden on us to make our own code more complex to support such niche edge cases.

@jkotas
Member

jkotas commented Jan 18, 2025

> conditioning it if the target code page isn't Unicode

I assume that the Win32 console did not support UTF-16 and that's why we decided to skip the call in that case. Is that no longer the case? If so, do we know when the Win32 console started supporting UTF-16?
