feat: support unicode in object names#875

Open
ferhatelmas wants to merge 26 commits into master from ferhat/name

ferhatelmas (Member) commented Feb 26, 2026

What kind of change does this PR introduce?

feature

What is the current behavior?

Only a limited set of characters is permitted in object names.

What is the new behavior?

Unicode is supported across the board.

Additional context

#133

coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: e5c96d57-5ea8-43b2-9cf5-52bf67b8e633

📥 Commits

Reviewing files that changed from the base of the PR and between cadda24 and c473769.

📒 Files selected for processing (3)
  • src/storage/object.ts
  • src/storage/protocols/s3/s3-handler.ts
  • src/test/s3-protocol.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/storage/protocols/s3/s3-handler.ts

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Wider Unicode support for object names, optional source-version copy, XML numeric-entity decoding, webhook disable filtering, and safer S3 CopySource encoding. Signing/URL routines tightened.
  • Bug Fixes

    • Clearer "invalid object name" errors, rejection of disallowed Unicode/non-characters, fixes to continuation-token and signing edge cases (including double-encoded paths).
  • Tests

    • Large expansion of Unicode, signed-URL, copy/multipart, webhook-filtering, TUS and test-hygiene coverage.

Walkthrough

Adds a tenant SQL migration enforcing object-name constraints; decodes XML numeric entities in the XML parser; tightens key validation to reject XML-invalid and Unicode non-characters; centralizes S3 CopySource encoding/parsing and adds validation for versionId; introduces versioned continuation tokens and robust decoding; changes signed URL signing APIs and adds request-path matching utilities; adjusts DB error messaging for the new constraint; and expands tests across Unicode keys, CopySource, signed URLs, continuation tokens, and related flows.
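
The key-validation rule mentioned in the walkthrough (reject XML-invalid characters and Unicode noncharacters) might look roughly like the sketch below. The helper names are hypothetical, and the set of code points the PR actually rejects may differ; this version simply follows the XML 1.0 `Char` production and the Unicode definition of noncharacters.

```typescript
// Hypothetical helpers; the PR's validator may accept a different set.

// XML 1.0 "Char" production: tab, LF, CR, and everything from U+0020 up,
// excluding surrogates, U+FFFE and U+FFFF.
function isXmlChar(cp: number): boolean {
  return (
    cp === 0x9 ||
    cp === 0xa ||
    cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
  )
}

// Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
function isNoncharacter(cp: number): boolean {
  return (cp >= 0xfdd0 && cp <= 0xfdef) || (cp & 0xfffe) === 0xfffe
}

function isValidObjectKey(key: string): boolean {
  if (key.length === 0) return false
  for (let i = 0; i < key.length; ) {
    // codePointAt returns the surrogate value itself for an unpaired half,
    // which isXmlChar then rejects.
    const cp = key.codePointAt(i) as number
    if (!isXmlChar(cp) || isNoncharacter(cp)) return false
    i += cp > 0xffff ? 2 : 1
  }
  return true
}
```

Iterating by code point rather than by UTF-16 code unit keeps astral characters (emoji, for instance) valid while still catching lone surrogates.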

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTP as S3 Handler
    participant Parser as parseCopySource
    participant Storage as Storage Service
    participant DB as Database
    participant Backend as Object Backend

    Client->>HTTP: COPY request (CopySource header, Destination)
    HTTP->>Parser: parseCopySource(CopySource)
    alt malformed
        Parser-->>HTTP: throw InvalidParameter / MissingParameter
        HTTP-->>Client: 400
    else parsed
        Parser-->>HTTP: {bucket, key, sourceVersion?}
        HTTP->>Storage: copyObject(srcBucket, srcKey, srcVersion?, destBucket, destKey)
        Storage->>DB: SELECT version WHERE bucket=srcBucket AND name=srcKey
        DB-->>Storage: {version, metadata} or not found
        alt sourceVersion provided
            Storage->>Storage: compare provided sourceVersion vs DB.version
            alt mismatch
                Storage-->>HTTP: throw NoSuchKey / InvalidParameter
                HTTP-->>Client: 404/400
            else match
                Storage->>Backend: CopyObject with encoded CopySource (includes versionId)
                Backend-->>Storage: {eTag, lastModified}
                Storage-->>HTTP: success
                HTTP-->>Client: 200 OK
            end
        else no sourceVersion
            Storage->>Backend: CopyObject using DB.version in CopySource
            Backend-->>Storage: {eTag, lastModified}
            Storage-->>HTTP: success
            HTTP-->>Client: 200 OK
        end
    end
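
A minimal sketch of the `parseCopySource` step from the diagram, paired with an encoder for the backend call. Only the function name and the `{bucket, key, sourceVersion?}` result shape come from the diagram; `CopySourceError`, `encodeCopySource`, and the exact validation rules are assumptions made for illustration.

```typescript
// Stand-in for the project's error factories (hypothetical shape).
class CopySourceError extends Error {
  readonly code: 'MissingParameter' | 'InvalidParameter'
  constructor(code: 'MissingParameter' | 'InvalidParameter', message: string) {
    super(message)
    this.code = code
  }
}

interface ParsedCopySource {
  bucket: string
  key: string
  sourceVersion?: string
}

// Percent-encode each path segment so "/" stays a separator.
function encodeCopySource(bucket: string, key: string, versionId?: string): string {
  const path = [bucket, ...key.split('/')].map((part) => encodeURIComponent(part)).join('/')
  return versionId ? `${path}?versionId=${encodeURIComponent(versionId)}` : path
}

function parseCopySource(header: string | undefined): ParsedCopySource {
  if (!header) throw new CopySourceError('MissingParameter', 'CopySource is required')

  // Split off the query before decoding, so an encoded "?" (%3F) in the key survives.
  const q = header.indexOf('?')
  const rawPath = (q === -1 ? header : header.slice(0, q)).replace(/^\//, '')
  const rawQuery = q === -1 ? '' : header.slice(q + 1)

  const slash = rawPath.indexOf('/')
  if (slash <= 0 || slash === rawPath.length - 1) {
    throw new CopySourceError('InvalidParameter', 'CopySource must be bucket/key')
  }

  let rawVersion: string | undefined
  if (rawQuery !== '') {
    const m = /^versionId=([^&]+)$/.exec(rawQuery)
    if (!m) throw new CopySourceError('InvalidParameter', 'Unsupported CopySource query')
    rawVersion = m[1]
  }

  try {
    const parsed: ParsedCopySource = {
      bucket: decodeURIComponent(rawPath.slice(0, slash)),
      key: decodeURIComponent(rawPath.slice(slash + 1)),
    }
    if (rawVersion !== undefined) parsed.sourceVersion = decodeURIComponent(rawVersion)
    return parsed
  } catch {
    // decodeURIComponent throws URIError on malformed percent-encoding.
    throw new CopySourceError('InvalidParameter', 'CopySource is not valid percent-encoding')
  }
}
```

Keeping the encoder and parser next to each other makes the round trip easy to test: whatever `encodeCopySource` produces, `parseCopySource` should return unchanged.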

@ferhatelmas ferhatelmas force-pushed the ferhat/name branch 7 times, most recently from 3dd9e78 to 60b9538 Compare March 5, 2026 10:20
@ferhatelmas ferhatelmas marked this pull request as ready for review March 5, 2026 13:15
@ferhatelmas ferhatelmas requested a review from a team as a code owner March 5, 2026 13:15
@ferhatelmas ferhatelmas changed the title feat: support utf8 in object names feat: support unicode in object names Mar 5, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/storage/object.ts (1)

876-894: ⚠️ Potential issue | 🟠 Major

Backward compatibility break in continuation token decoding corrupts pagination cursors

Line 889 unconditionally decodes all token values with decodeURIComponent(). Old tokens stored raw unescaped values (e.g., a%2Fb as a literal string), but the new decoder treats %2F as an encoded /, silently transforming cursor positions and causing list operations to skip or duplicate results.

Proposed fix (version-prefix encoded tokens)
 function encodeContinuationToken(tokenInfo: ContinuationToken) {
   const result: string[] = []
   for (const [k, v] of Object.entries(CONTINUATION_TOKEN_PART_MAP)) {
     const value = tokenInfo[v]
     if (value) {
-      result.push(`${k}:${encodeURIComponent(value)}`)
+      result.push(`${k}:e:${encodeURIComponent(value)}`)
     }
   }
   return Buffer.from(result.join('\n')).toString('base64')
 }

 function decodeContinuationToken(token: string): ContinuationToken {
@@
-    let value = partMatch[2]
-    try {
-      value = decodeURIComponent(value)
-    } catch {
-      // Backward compatibility: previously cursor values were stored unescaped.
-    }
+    let value = partMatch[2]
+    if (value.startsWith('e:')) {
+      value = decodeURIComponent(value.slice(2))
+    }
     result[CONTINUATION_TOKEN_PART_MAP[partMatch[1]]] = value
   }
   return result
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/object.ts` around lines 876 - 894, The decoder
decodeContinuationToken currently always runs decodeURIComponent on every part
which corrupts legacy unescaped tokens; update decodeContinuationToken to
support a version prefix: detect and strip a token version prefix (e.g., "v1:"
vs legacy/no-prefix), and only apply decodeURIComponent for the new version (v1)
while preserving legacy behavior by leaving values raw (no decoding) when the
token has no version prefix; keep using CONTINUATION_TOKEN_PART_MAP and the same
part parsing logic, and ensure the function throws the same Invalid continuation
token error for malformed parts.
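
One way the token-level version prefix described in the prompt above could work, shown on the token payload only. The base64 wrapping from the original `encodeContinuationToken` is left out to keep the focus on versioning. The field names, the part map, and the `v:1` marker line are all hypothetical; the marker only works as long as `v` is never used as a part key.

```typescript
// Hypothetical token shape and part map; the real ones live in src/storage/object.ts.
interface ContinuationToken {
  startAfter?: string
  sortColumnAfter?: string
}

const PART_MAP: Record<string, keyof ContinuationToken> = {
  l: 'startAfter',
  a: 'sortColumnAfter',
}

// Assumed marker line for the new format; legacy payloads never start with it.
const VERSION_LINE = 'v:1'

function encodeTokenPayload(token: ContinuationToken): string {
  const lines = [VERSION_LINE]
  for (const [k, field] of Object.entries(PART_MAP)) {
    const value = token[field]
    if (value) lines.push(`${k}:${encodeURIComponent(value)}`)
  }
  return lines.join('\n')
}

function decodeTokenPayload(payload: string): ContinuationToken {
  const lines = payload.split('\n')
  const isV1 = lines[0] === VERSION_LINE
  const result: ContinuationToken = {}
  for (const line of isV1 ? lines.slice(1) : lines) {
    const match = /^([a-z]):(.*)$/.exec(line)
    if (!match || !Object.prototype.hasOwnProperty.call(PART_MAP, match[1])) {
      throw new Error('Invalid continuation token')
    }
    let value = match[2]
    if (isV1) {
      // Only v1 values are percent-encoded; legacy values stay raw.
      try {
        value = decodeURIComponent(value)
      } catch {
        throw new Error('Invalid continuation token')
      }
    }
    result[PART_MAP[match[1]]] = value
  }
  return result
}
```

With this split, a legacy payload such as `l:a%2Fb` keeps the literal cursor `a%2Fb`, while a v1 payload built from the same cursor also round-trips to `a%2Fb` instead of turning into `a/b`.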
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/internal/errors/codes.ts`:
- Line 327: The error message in ERRORS.InvalidKey uses encodeURIComponent(key)
which can throw for unpaired UTF-16 surrogates; update ERRORS.InvalidKey (where
the message is constructed) to produce a safe-encoded key: prefer using
key.toWellFormed() if available, otherwise normalize/replace unpaired surrogates
(or fall back to a try/catch around encodeURIComponent and sanitize the key to
replace invalid code units with U+FFFD) and then call encodeURIComponent on the
sanitized string before embedding it in the message so URIError cannot escape
and turn a 400 into a 500.
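
A sketch of the sanitizing fallback described above, for runtimes where `String.prototype.toWellFormed()` is not available. `replaceLoneSurrogates` and `safeEncodeKey` are hypothetical names, not identifiers from the PR.

```typescript
// Replace every unpaired UTF-16 surrogate with U+FFFD, leaving valid pairs intact.
function replaceLoneSurrogates(input: string): string {
  let out = ''
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i)
    const isHigh = unit >= 0xd800 && unit <= 0xdbff
    const isLow = unit >= 0xdc00 && unit <= 0xdfff
    if (isHigh) {
      const next = input.charCodeAt(i + 1) // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += input.charAt(i) + input.charAt(i + 1)
        i++ // skip the low half of the valid pair
      } else {
        out += '\uFFFD'
      }
    } else if (isLow) {
      // A low surrogate reached here was not consumed as part of a pair.
      out += '\uFFFD'
    } else {
      out += input.charAt(i)
    }
  }
  return out
}

// Percent-encode a key for an error message without risking a URIError.
function safeEncodeKey(key: string): string {
  return encodeURIComponent(replaceLoneSurrogates(key))
}
```

Because the input is well-formed by the time it reaches `encodeURIComponent`, building the error message cannot itself throw.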

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 46444ed6-2080-41d2-80b4-c88c86a791d1

📥 Commits

Reviewing files that changed from the base of the PR and between d8dba53 and cde444c.

📒 Files selected for processing (21)
  • migrations/tenant/57-unicode-object-names.sql
  • package.json
  • src/http/plugins/xml.ts
  • src/internal/database/migrations/types.ts
  • src/internal/errors/codes.ts
  • src/scripts/migrations-types.ts
  • src/storage/backend/s3/adapter.ts
  • src/storage/database/knex.ts
  • src/storage/limits.ts
  • src/storage/object.ts
  • src/storage/protocols/s3/s3-handler.ts
  • src/test/bucket.test.ts
  • src/test/common.ts
  • src/test/limits.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/object.test.ts
  • src/test/s3-adapter.test.ts
  • src/test/s3-protocol.test.ts
  • src/test/test-hygiene.test.ts
  • src/test/tus.test.ts
  • src/test/xml-plugin.test.ts

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/test/xml-plugin.test.ts`:
- Around line 3-15: Add tests to ensure decodeXmlNumericEntities rejects
XML-forbidden but in-range numeric entities (so they remain un-decoded). In the
existing test suite for decodeXmlNumericEntities add negative cases asserting
that numeric entities for XML-forbidden code points, surrogate halves, and
noncharacters are returned unchanged; reference the
decodeXmlNumericEntities function to locate the implementation to validate
behavior and ensure both hex (&#x...) and decimal (&#...;) forms are covered.
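
To make the requested behavior concrete, here is a sketch of a decoder that leaves forbidden references untouched. The function name matches the one under test, but the body is an assumption and not the code in `src/http/plugins/xml.ts`; the accepted range follows the XML 1.0 `Char` production minus Unicode noncharacters.

```typescript
// True only for code points that are legal XML 1.0 characters and not
// Unicode noncharacters (assumed policy; the PR may draw the line elsewhere).
function isDecodableCodePoint(cp: number): boolean {
  const isXmlChar =
    cp === 0x9 ||
    cp === 0xa ||
    cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
  const isNoncharacter = (cp >= 0xfdd0 && cp <= 0xfdef) || (cp & 0xfffe) === 0xfffe
  return isXmlChar && !isNoncharacter
}

function decodeXmlNumericEntities(input: string): string {
  // Group 1 captures hex digits, group 2 captures decimal digits.
  return input.replace(
    /&#(?:x([0-9a-fA-F]+)|([0-9]+));/g,
    (match: string, hex?: string, dec?: string) => {
      const cp = hex !== undefined ? parseInt(hex, 16) : parseInt(dec as string, 10)
      // Forbidden or out-of-range references are returned exactly as written.
      return isDecodableCodePoint(cp) ? String.fromCodePoint(cp) : match
    }
  )
}
```

Negative tests would then feed the hex and decimal forms of a NUL reference, a surrogate half, and a noncharacter through the function and assert that the output equals the input.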

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: af25095c-594e-4a8f-afaa-599e92b3eee5

📥 Commits

Reviewing files that changed from the base of the PR and between cde444c and fc16aa6.

📒 Files selected for processing (21)
  • migrations/tenant/57-unicode-object-names.sql
  • package.json
  • src/http/plugins/xml.ts
  • src/internal/database/migrations/types.ts
  • src/internal/errors/codes.ts
  • src/scripts/migrations-types.ts
  • src/storage/backend/s3/adapter.ts
  • src/storage/database/knex.ts
  • src/storage/limits.ts
  • src/storage/object.ts
  • src/storage/protocols/s3/s3-handler.ts
  • src/test/bucket.test.ts
  • src/test/common.ts
  • src/test/limits.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/object.test.ts
  • src/test/s3-adapter.test.ts
  • src/test/s3-protocol.test.ts
  • src/test/test-hygiene.test.ts
  • src/test/tus.test.ts
  • src/test/xml-plugin.test.ts
🚧 Files skipped from review as they are similar to previous changes (9)
  • src/test/test-hygiene.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/common.ts
  • migrations/tenant/57-unicode-object-names.sql
  • src/storage/object.ts
  • src/storage/backend/s3/adapter.ts
  • src/internal/errors/codes.ts
  • src/http/plugins/xml.ts
  • src/test/limits.test.ts

@ferhatelmas ferhatelmas force-pushed the ferhat/name branch 3 times, most recently from ff9670e to bf8efff Compare March 6, 2026 09:41
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/storage/protocols/s3/s3-handler.ts`:
- Around line 1448-1501: Change decodeLegacyContinuationToken and
decodeContinuationToken to accept a parameter name (e.g.
decodeContinuationToken(token: string, parameterName: string)) and replace all
plain throws like throw new Error('Invalid continuation token') with the
standardized request error from your ERRORS collection, passing the parameter
name (e.g. throw ERRORS.InvalidArgument(parameterName) or the appropriate
ERRORS.* factory used elsewhere). Update callers (e.g. where
KeyMarker/UploadIdMarker are decoded) to pass the parameter name string such as
'KeyMarker' or 'UploadIdMarker' when calling
decodeContinuationToken/decodeLegacyContinuationToken so malformed client tokens
produce a 4xx request error instead of a generic exception.
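
The pattern being asked for, reduced to a sketch: any failure while decoding a client-supplied marker is rethrown as a request error that names the offending parameter. `RequestError` and `decodeWithRequestError` are hypothetical stand-ins; the project's own `ERRORS.*` factories would take their place.

```typescript
// Hypothetical stand-in for the project's ERRORS.* request-error factories.
class RequestError extends Error {
  readonly httpStatusCode: number
  readonly parameter: string
  constructor(parameter: string) {
    super(`Invalid value for ${parameter}`)
    this.httpStatusCode = 400
    this.parameter = parameter
  }
}

// Run a decoder and convert any thrown error into a 4xx that names the parameter.
function decodeWithRequestError<T>(
  token: string,
  parameterName: string,
  decode: (token: string) => T
): T {
  try {
    return decode(token)
  } catch {
    throw new RequestError(parameterName)
  }
}
```

A caller would pass the parameter name alongside the raw value, for example `decodeWithRequestError(rawMarker, 'KeyMarker', decodeContinuationToken)`, where `rawMarker` is whatever the request supplied.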

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 773df860-6e8f-4ac9-bba0-96890cba39d8

📥 Commits

Reviewing files that changed from the base of the PR and between 0121419 and ff9670e.

📒 Files selected for processing (20)
  • migrations/tenant/57-unicode-object-names.sql
  • src/http/routes/object/getSignedObject.ts
  • src/http/routes/object/getSignedUploadURL.ts
  • src/http/routes/render/renderSignedImage.ts
  • src/http/routes/signed-url.ts
  • src/storage/backend/s3/adapter.ts
  • src/storage/events/lifecycle/webhook-filter.ts
  • src/storage/events/lifecycle/webhook.ts
  • src/storage/object.ts
  • src/storage/protocols/s3/s3-handler.ts
  • src/test/limits.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/object.test.ts
  • src/test/render-routes.test.ts
  • src/test/s3-adapter.test.ts
  • src/test/s3-protocol.test.ts
  • src/test/signed-url-route.test.ts
  • src/test/tus.test.ts
  • src/test/webhook-filter.test.ts
  • src/test/webhooks.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/storage/backend/s3/adapter.ts
  • src/test/limits.test.ts

mlatief and others added 11 commits March 6, 2026 14:42
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
2 participants