[JSON transformer] Mask individual object keys in arrays while preserving length #545

@dp-dandy

Describe the bug
The JSON transformer cannot mask an individual object key inside a JSON array while preserving the original length of each value. When the #.field path syntax is used to target the same key in every object of an array, the transformer either:

  1. Returns a template execution error due to receiving []interface{} instead of individual string values
  2. Concatenates all masked values into a single string and applies it to every object in the array

Why this should be considered a bug rather than a limitation:

  • sjson is working correctly: When using #.field path, sjson correctly extracts all matching values and returns them as an array, which is the expected behavior per sjson documentation
  • pgstream's template system is incomplete: pgstream's template execution context passes .GetValue containing the array []interface{} but doesn't provide proper array handling capabilities
  • Other tools solve this: Tools like jq can process arrays element by element with expressions like .[] | .field = mask(.field) (mask here stands for a user-defined function, not a jq builtin)
  • Template capabilities exist: pgstream uses Go templates, which support iteration through the built-in range action, together with Sprig functions, but doesn't expose them in a way that works for this use case
  • The data is available: The array of values is correctly extracted by sjson - pgstream just needs better template context handling

To Reproduce
Steps to reproduce the behavior:

  1. Use this configuration:
transformations:
  validation_mode: relaxed
  table_transformers:
    - schema: my_schema
      table: my_table
      column_transformers:
        json_column:
          name: json
          parameters:
            operations:
              - operation: set
                path: "#.sensitive_field"
                value_template: '{{ masking "default" .GetValue }}'
                error_not_exist: false
                skip_not_exist: true
  2. Run pgstream with a JSONB column containing:
[
  {"sensitive_field": "ABC123DEF456GHI789", "type": "system_a"},
  {"sensitive_field": "XYZ789UVW012RST345QWE678", "type": "system_b"}
]
  3. Perform replication or snapshot operation

  4. See error:

error executing template: template: op[0] set #.sensitive_field:1:21: executing "op[0] set #.sensitive_field" at <.GetValue>: wrong type for value; expected string; got []interface {}

Expected behavior
The JSON transformer should be able to:

  1. Target individual object keys within JSON arrays using path syntax like #.field
  2. Apply masking functions to each individual value while preserving the original length
  3. Maintain the JSON array structure with each object having its own masked field

Expected output:

[
  {"sensitive_field": "******************", "type": "system_a"},
  {"sensitive_field": "*************************", "type": "system_b"}
]

Current Workaround Limitations

  • Using indexed paths (0.field, 1.field, etc.) requires knowing the array length in advance, so it cannot handle variable-length arrays
  • Using literal replacement values (value: "***MASKED***") doesn't preserve original length
  • The template transformer doesn't support JSONB data types
  • Template approaches that attempt to process the array return concatenated results applied to all objects

Potential Solutions
A fix in pgstream's JSON transformer could take one of these approaches:

  1. Detect when .GetValue returns an array and provide iteration helpers in the template context
  2. Provide template functions that can work with arrays (e.g., {{ range .GetValue }}{{ masking "default" . }}{{ end }})
  3. Change the approach to process array elements individually before template execution

Setup (please complete the following information):

  • pgstream version: v0.8.3
  • Postgres version: 15
  • Postgres environment: Docker container
  • Column type: JSONB

Additional context
This limitation significantly impacts the ability to anonymize JSON data containing arrays of objects with sensitive identifiers. The current JSON transformer works well for simple key-value pairs but struggles with array iteration and individual element processing, despite having all the necessary underlying capabilities through sjson and Go templates.

Labels: enhancement (New feature or request), transformers (Transformer related work)