Skip to content

fix: replace os.path.join with safe_path_join to prevent path manipulation (CWE-22)#1369

Merged
bact merged 4 commits intodevfrom
copilot/bug-fix-pythainlp-path-manipulation
Mar 26, 2026
Merged

fix: replace os.path.join with safe_path_join to prevent path manipulation (CWE-22)#1369
bact merged 4 commits intodevfrom
copilot/bug-fix-pythainlp-path-manipulation

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 25, 2026

49 path manipulation vulnerabilities (CWE-22) remain in v5.3.2 despite the prior safe_path_join introduction, because many call sites still use os.path.join directly.

What do these changes do

Replace every remaining os.path.join call that constructs a file-system path with safe_path_join, which canonicalizes via os.path.abspath and validates the result is within the expected base directory before returning it.

What was wrong

safe_path_join existed in pythainlp.tools.path but was not consistently used. Affected sites spanned module-level path constants, archive extraction logic, and model-loading helpers — any of which could accept tainted input (archive member names, user-supplied model directories) and silently produce a path outside the intended base.

How this fixes it

All os.path.join calls that produce file paths are replaced with safe_path_join:

File Detail
pythainlp/corpus/__init__.py _CORPUS_PATH constant
pythainlp/tag/unigram.py 4 module-level model path constants
pythainlp/tag/perceptron.py 4 module-level model path constants
pythainlp/parse/transformers_ud.py local model sub-directory joins
pythainlp/translate/en_th.py _get_translate_path return value
pythainlp/spell/words_spelling_correction.py embeddings.npy and vocabulary.txt paths
pythainlp/tokenize/crfcut.py CRF model path
pythainlp/corpus/core.py get_hf_hub, tar/zip member validation, symlink target validation

For archive extraction (Python 3.9–3.11 fallback path), the previous os.path.join + _is_within_directory two-step is replaced with a single safe_path_join call that raises on traversal. Relative symlink targets are resolved by passing the archive-root-relative member dirname and link target as separate *parts directly to safe_path_join, which handles the join, canonicalization, and containment check in one step — eliminating all intermediate os.path.join calls:

# Before
member_path = os.path.join(path, member.name)
if not _is_within_directory(path, member_path):
    raise ValueError(...)
# ... and for symlinks:
rel_candidate = os.path.normpath(os.path.join(os.path.dirname(member.name), link_target))
safe_path_join(path, rel_candidate)

# After — canonicalizes and validates in one step, no intermediate os.path.join
try:
    safe_path_join(path, member.name)
except ValueError:
    raise ValueError(f"Attempted path traversal in tar file: {member.name}")
# ... and for symlinks:
try:
    safe_path_join(path, os.path.dirname(member.name), link_target)
except ValueError:
    raise ValueError(...)

Unused import os statements removed from files where os was only needed for os.path.join.

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

This section details on the original issue you should resolve

<issue_title>bug: PyThaiNLP version 5.3.2 still have issue about Path Manipulation</issue_title>
<issue_description>### Description

Unfortunately, PyThaiNLP version 5.3.2 still have issue about Path Manipulation

pythainlp-v5.3.2 23MAR2026 DeveloperWorkbook.pdf

This article give suggestion by Clean Path
Fortify keeps track of the parts that came from the original input. If you use any of the original input, you may still get the error.
https://www.experts-exchange.com/articles/30499/Fortify-Path-Manipulation-Issues.html

This article suggest that
Fortify will flag the code even if the path/file doesn't come from user input like a property file. The best way to handle these is to canonicalize the path first, then validate it against a white list of allowed paths.
https://stackoverflow.com/questions/12690652/how-to-fix-path-manipulation-vulnerability-in-some-java-code

Image

Parse the input for a whitelist of acceptable characters. Reject from the input, any character you don't want in the path.

Image

Expected results

If possible, security issue about Path Manipulation should be fixed.

Current results

There are 49 issues found about Path Manipulation.

Steps to reproduce

Run source code scan tool, such as Fortify.

PyThaiNLP version

5.3.2

Python version

3.12

Operating system and version

Windows

More info

No response

Possible solution

This article give suggestion by Clean Path
Fortify keeps track of the parts that came from the original input. If you use any of the original input, you may still get the error.
https://www.experts-exchange.com/articles/30499/Fortify-Path-Manipulation-Issues.html

This article suggest that
Fortify will flag the code even if the path/file doesn't come from user input like a property file. The best way to handle these is to canonicalize the path first, then validate it against a white list of allowed paths.
https://stackoverflow.com/questions/12690652/how-to-fix-path-manipulation-vulnerability-in-some-java-code

Image

Parse the input for a whitelist of acceptable characters. Reject from the input, any character you don't want in the path.

Image

Files

No response</issue_description>

Comments on the Issue (you are @copilot in this section)


⚡ Quickly spin up Copilot coding agent tasks from anywhere on your macOS or Windows machine with Raycast.

…ation (CWE-22)

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/a3eefdf3-b5f2-4c0c-a887-e8ee039b38bb
Copilot AI changed the title [WIP] Fix path manipulation issue in PyThaiNLP 5.3.2 fix: replace os.path.join with safe_path_join to prevent path manipulation (CWE-22) Mar 25, 2026
Copilot AI requested a review from wannaphong March 25, 2026 15:46
@bact bact added the security label Mar 25, 2026
Copy link
Copy Markdown
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot there are few instances of os.path.join introduced. do they necessary? do another review.

…ng safe_path_join *parts

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/3c4510c4-5975-4c56-93be-a20c10f6b56e
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 25, 2026

@copilot there are few instances of os.path.join introduced. do they necessary? do another review.

Done in 7214d5c. The last two os.path.join instances (inside the symlink validation helpers for tar and zip) are eliminated. safe_path_join accepts *parts and canonicalises/validates everything internally, so passing the archive-relative dirname and link target as separate parts replaces the intermediate os.path.join entirely. No os.path.join calls remain in any of the modified files.

@bact bact marked this pull request as ready for review March 25, 2026 17:51
Copilot AI requested a review from bact March 25, 2026 17:52
Copilot finished work on behalf of bact March 25, 2026 17:52
Updated changelog to summarize security improvements and issue #1369.
@sonarqubecloud
Copy link
Copy Markdown

@bact bact added this to the 5.3.3 milestone Mar 25, 2026
@bact bact added this to PyThaiNLP Mar 25, 2026
@bact bact moved this to In progress in PyThaiNLP Mar 25, 2026
@bact bact merged commit 7f55388 into dev Mar 26, 2026
29 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in PyThaiNLP Mar 26, 2026
@bact bact deleted the copilot/bug-fix-pythainlp-path-manipulation branch March 26, 2026 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

bug: PyThaiNLP version 5.3.2 still have issue about Path Manipulation

3 participants