Unicode/EBCDIC decode fixes and validator functions #4874

Open
wants to merge 9 commits into
base: dev

Conversation

Rot127
Member

@Rot127 Rot127 commented Feb 1, 2025

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository.
  • I made sure to follow the project's coding style.
  • I've documented every RZ_API function and struct this PR changes.
  • I've added tests that prove my changes are effective (required for changes to RZ_API).
  • I've updated the Rizin book with the relevant information (if needed).

Detailed description

  • Fix UTF-32 decoding. The number of consumed bytes was incorrect; UTF-32 is a fixed-length encoding, so each code point consumes exactly 4 bytes.
  • Add validator functions for UTF-32 and EBCDIC.
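
For readers unfamiliar with the encoding, the fixed-length property can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `sketch_utf32_decode` and `sketch_utf32_valid` are hypothetical, and the validity rule used here (at most 0x10FFFF, no surrogates) is the general Unicode scalar-value rule, which may differ from what the PR's validators check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT Rizin's API: decode one UTF-32 code unit.
 * Returns the number of bytes consumed: always 4 on success, 0 on failure. */
int sketch_utf32_decode(const uint8_t *buf, size_t len, bool big_endian, uint32_t *out) {
	assert(buf && out);
	if (len < 4) {
		return 0; /* not enough bytes for one code unit */
	}
	uint32_t cp = big_endian
		? ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) | ((uint32_t)buf[2] << 8) | (uint32_t)buf[3]
		: ((uint32_t)buf[3] << 24) | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[1] << 8) | (uint32_t)buf[0];
	/* Unicode scalar values: 0..0x10FFFF, excluding surrogates 0xD800..0xDFFF. */
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
		return 0;
	}
	*out = cp;
	return 4; /* fixed length: never 1, 2, or 3 */
}

/* Hypothetical validator: true if the buffer starts with one valid code point. */
bool sketch_utf32_valid(const uint8_t *buf, size_t len, bool big_endian) {
	uint32_t cp;
	return sketch_utf32_decode(buf, len, big_endian, &cp) == 4;
}
```

A caller walking a buffer would advance by the returned count, so a decoder that reports anything other than 4 for a valid code point desynchronizes every subsequent read.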

Test plan

Added

Closing issues

closes #4697

@Rot127 Rot127 mentioned this pull request Feb 1, 2025
5 tasks
RZ_API int rz_utf32le_decode(const ut8 *ptr, int ptrlen, RzCodePoint *ch);
RZ_API int rz_utf32be_decode(const ut8 *ptr, int ptrlen, RzCodePoint *ch);
RZ_API bool rz_utf32_valid_cp(const ut8 *buf, size_t buf_len, bool big_endian);
Member

@wargio wargio Feb 2, 2025

Suggested change
-RZ_API bool rz_utf32_valid_cp(const ut8 *buf, size_t buf_len, bool big_endian);
+RZ_API bool rz_utf32_valid_codepoint(const ut8 *buf, size_t buf_len, bool big_endian);

Avoid abbreviating "codepoint" as "cp" in function names.

@notxvilka

These breaks are worrying:

[XX] 268 ms   db/formats/elf/strings iz (utf-32le)
RZ_NOPLUGINS=1 /usr/bin/rizin -escr.utf8=0 -escr.color=0 -escr.interactive=0 -eflirt.sigdb.load.system=false -eflirt.sigdb.load.home=false -N -qc 'iz~Hello
fl@F:strings
s sym.main
af
pdf~str.Hello
aar
fl@F:strings
' bins/elf/analysis/hello-utf-32le
-- stdout
--- expected
+++ actual
@@ -1,6 +1,6 @@
   0 0x000005e8 0x004005e8  12   52 .rodata utf32le Hello World\n
 0x004005e8 52 str.Hello_World
 0x00400620 32 str.S
-|           0x0040052e      mov   qword [var_10h], str.Hello_World     ; 0x4005e8 ; U"Hello World\n"
+|           0x0040052e      mov   qword [var_10h], str.Hello_World     ; 0x4005e8 ; U"\U00000048\U00000065\U0000006c\U0000006c\U0000006f\U00000020\U00000057\U0000006f\U00000072\U0000006c\U00000064\U0000000a"
 0x004005e8 52 str.Hello_World
 0x00400620 32 str.S
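
The twelve escaped code points in the actual output are all ASCII and spell the expected text, which suggests the difference lies in how the string is rendered rather than in which code points were decoded. A small check in C, with hypothetical helper names that are not part of Rizin:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Code points copied from the escaped string in the "actual" output. */
static const uint32_t escaped_cps[] = {
	0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
static const size_t escaped_cps_len = sizeof(escaped_cps) / sizeof(escaped_cps[0]);

/* Hypothetical helper: copy code points below 0x80 into a NUL-terminated
 * buffer. Returns the number of characters written, or 0 if a code point
 * is not ASCII or the buffer is too small. */
size_t sketch_cps_to_ascii(const uint32_t *cps, size_t n, char *out, size_t out_len) {
	assert(cps && out);
	if (out_len < n + 1) {
		return 0;
	}
	for (size_t i = 0; i < n; i++) {
		if (cps[i] >= 0x80) {
			return 0;
		}
		out[i] = (char)cps[i];
	}
	out[n] = '\0';
	return n;
}

/* True if the escaped code points spell the given expected string. */
bool sketch_escaped_matches(const char *expected) {
	char buf[32];
	if (sketch_cps_to_ascii(escaped_cps, escaped_cps_len, buf, sizeof(buf)) == 0) {
		return false;
	}
	return strcmp(buf, expected) == 0;
}
```

Calling `sketch_escaped_matches("Hello World\n")` compares the escaped sequence against the string from the expected output.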

Development

Successfully merging this pull request may close these issues.

[Bug] ps recognizes UTF8 as UTF16
3 participants