fix: query default user missing group files #64

@AhmedBoutaraa57

Issue #64: Query Default User Missing Group Files

Problem

When querying as the default user (username=None), data stored in the group path (tmp/group/{username}/...) is not accessible, so query results are incomplete. This happens because resolve_table_dir() checks only one path, chosen by the username parameter.

Current Code Behavior

  • Default user queries: Only check tmp/data/{db}/{table}/ (never checks group path)
  • Group user queries: Only check tmp/group/{username}/{db}/{table}/ (never checks default path)
  • cloud_fetch_parquet(): Always writes to group path (line 288)
  • cloud_sink_parquet(): Merges local (default path) with cloud, but only reads from default path (line 242: build_files_list(..., None))
  • cloud_sync_parquet(): Uploads from default path to cloud, downloads from cloud to group path, but never updates the default path with merged data
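
The exclusive lookup described in the list above can be sketched as follows. This is a minimal illustration, not the project's actual code: the signature of resolve_table_dir(), the `root` parameter, and the exact directory layout are assumptions inferred from the paths quoted in this issue.

```python
from pathlib import Path
from typing import Optional


def resolve_table_dir(db: str, table: str, username: Optional[str] = None,
                      root: str = "tmp") -> Path:
    """Hypothetical sketch: return exactly one directory, chosen by username."""
    if username is None:
        # Default user: only the default path is ever considered.
        return Path(root) / "data" / db / table
    # Group user: only that user's group path is ever considered.
    return Path(root) / "group" / username / db / table
```

Because the function returns a single directory, a caller has no way to learn that the other path holds data for the same table.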

Data Flow Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                        DATA FLOW & PATH SEPARATION                       │
└─────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────┐
                    │   Cloud Storage     │
                    │   (S3)              │
                    │   [A, B, C, D, E]   │
                    └──────────┬──────────┘
                               │
                               │ cloud_fetch_parquet(username="ahmed_test")
                               │ Line 288: Always writes to group path
                               ▼
        ┌──────────────────────────────────────────────┐
        │  tmp/group/ahmed_test/{db}/{table}/           │
        │  Files: activitydetails_2025-02-10.parquet   │
        │  Contains: Records A, B, C, D, E            │
        │  ✅ Accessible by: Group user queries        │
        │  ❌ NOT accessible by: Default user queries  │
        └──────────────────────────────────────────────┘
                               │
                               │ insert() writes new data
                               │ Line 436: Always writes to default path
                               ▼
        ┌──────────────────────────────────────────────┐
        │  tmp/data/{db}/{table}/                      │
        │  partition_date=2025-02-10/data.parquet      │
        │  Contains: Records F, G                      │
        │  ✅ Accessible by: Default user queries      │
        │  ❌ NOT accessible by: Group user queries    │
        └──────────────────────────────────────────────┘
                               │
                               │ cloud_sink_parquet()
                               │ Line 242: build_files_list(..., None)
                               │         → Only reads from tmp/data/...
                               │ Lines 388-400: Merges local [F, G] with cloud [A-E]
                               │
                    ┌──────────┴──────────┐
                    │                     │
        ┌───────────▼──────────┐ ┌────────▼────────────┐
        │ Cloud Storage (S3)   │ │ Query Resolution    │
        │ After cloud_sink:    │ │ (resolve_table_dir) │
        │ [A, B, C, D, E, F, G]│ │                     │
        │ ✅ Complete (merged) │ │                     │
        └──────────────────────┘ └────────┬────────────┘
                    ┌─────────────────────┤
                    │                     │
        ┌───────────▼──────────┐ ┌────────▼────────────┐
        │ Default User Query   │ │ Group User Query    │
        │ username=None        │ │ username="user"     │
        │ Returns: tmp/data/   │ │ Returns: tmp/group/ │
        │ Sees: F, G           │ │ Sees: A, B, C, D, E │
        │ ❌ Missing: A-E      │ │ ❌ Missing: F, G    │
        └──────────────────────┘ └─────────────────────┘

Key Issues:
- cloud_sink_parquet() merges default path with cloud ✅
- BUT: cloud_sink_parquet() ignores group path data ❌
- Query resolution is EXCLUSIVE (only one path checked)
- Local paths remain split even though cloud is complete

Summary: Complete vs Incomplete Data

✅ Complete Data - Default User:
   - All inserts go to tmp/data/... (default path)
   - No data in tmp/group/...
   - Query sees all data ✅

✅ Complete Data - Group User:
   - cloud_fetch_parquet() downloads to tmp/group/...
   - No new inserts yet (or all inserts synced)
   - Query sees all data ✅

❌ Incomplete Data - Default User:
   - cloud_fetch_parquet() downloads to tmp/group/... [A, B, C]
   - User inserts new data → goes to tmp/data/... [D, E]
   - Query as default user → Only sees [D, E], missing [A, B, C] ❌
   
   OR
   
   - cloud_sync_parquet() downloads complete data to tmp/group/...
   - tmp/data/ still has old data (not updated)
   - Query as default user → Sees old data, missing new data from group path ❌

❌ Incomplete Data - Group User:
   - cloud_sync_parquet() downloads complete data to tmp/group/... [A, B, C]
   - User inserts new data → goes to tmp/data/... [D, E]
   - Query as group user → Only sees [A, B, C], missing [D, E] ❌
   - New inserts missed until next cloud_sync_parquet()
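
The incomplete-data scenarios above can be reproduced without the real functions. The sketch below is an assumption-laden simulation: `visible_files()` and `reproduce()` are hypothetical helpers, the database name "db" is made up, and empty placeholder files stand in for real parquet data. It only mimics where cloud_fetch_parquet() and insert() are said to write.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def visible_files(root: Path, db: str, table: str,
                  username: Optional[str]) -> List[str]:
    """Hypothetical stand-in for a query's file listing: one path only."""
    if username is None:
        table_dir = root / "data" / db / table
    else:
        table_dir = root / "group" / username / db / table
    return sorted(p.relative_to(table_dir).as_posix()
                  for p in table_dir.rglob("*.parquet"))


def reproduce() -> Dict[str, List[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Mimics cloud_fetch_parquet(): the fetched file lands in the group path.
        group_dir = root / "group" / "ahmed_test" / "db" / "activitydetails"
        group_dir.mkdir(parents=True)
        (group_dir / "activitydetails_2025-02-10.parquet").touch()
        # Mimics insert(): the new partition lands in the default path.
        part_dir = (root / "data" / "db" / "activitydetails"
                    / "partition_date=2025-02-10")
        part_dir.mkdir(parents=True)
        (part_dir / "data.parquet").touch()
        return {
            "default": visible_files(root, "db", "activitydetails", None),
            "group": visible_files(root, "db", "activitydetails", "ahmed_test"),
        }
```

Calling `reproduce()` returns two disjoint listings: the default user sees only the inserted partition, and the group user sees only the fetched file.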

Root Cause:
- insert() always writes to tmp/data/... (default path)
- cloud_fetch_parquet() always writes to tmp/group/...
- Query checks only ONE path (default OR group, never both)
- cloud_sync_parquet() updates group path but NOT default path
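
One possible direction, given the root cause above, is to let a query list files from both locations instead of one. The sketch below is only an illustration under assumptions: `collect_parquet_files()` and its `group_usernames` parameter are hypothetical, and which group directories a default-user query should include is a design decision this issue does not settle.

```python
from pathlib import Path
from typing import Iterable, List


def collect_parquet_files(root: Path, db: str, table: str,
                          group_usernames: Iterable[str] = ()) -> List[Path]:
    """Hypothetical: list parquet files from the default path and given group paths."""
    search_dirs = [root / "data" / db / table]
    search_dirs += [root / "group" / name / db / table
                    for name in group_usernames]
    files: List[Path] = []
    for table_dir in search_dirs:
        # A path that was never written to is simply skipped.
        if table_dir.is_dir():
            files.extend(sorted(table_dir.rglob("*.parquet")))
    return files
```

Listing both paths is not sufficient on its own. After cloud_sync_parquet() the group path is described as holding the complete data, so the two paths may contain the same records, and a real fix would likely need to deduplicate rows or merge into a single location.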

Labels: bug, help wanted
