Open
Labels
bug (something isn't working), help wanted (extra attention is needed)
Description
Issue #64: Query Default User Missing Group Files
Problem
When querying as the default user (username=None), data stored in the group path (tmp/group/{username}/...) is not accessible, leading to incomplete query results. This occurs because resolve_table_dir() only checks one path based on the username parameter.
Current Code Behavior
- Default user queries: Only check tmp/data/{db}/{table}/ (never checks group path)
- Group user queries: Only check tmp/group/{username}/{db}/{table}/ (never checks default path)
- cloud_fetch_parquet(): Always writes to group path (line 288)
- cloud_sink_parquet(): Merges local (default path) with cloud, but only reads from default path (line 242: build_files_list(..., None))
- cloud_sync_parquet(): Uploads from default path to cloud, downloads from cloud to group path, but never updates the default path with merged data
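The exclusive lookup described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the name resolve_table_dir_sketch, its signature, and the root argument are assumptions made for this example; only the two path layouts come from the issue.

```python
from pathlib import Path
from typing import Optional


def resolve_table_dir_sketch(
    root: Path, db: str, table: str, username: Optional[str] = None
) -> Path:
    """Hypothetical stand-in for resolve_table_dir().

    Returns exactly one directory, chosen only by whether a username
    was supplied. The other location is never consulted.
    """
    if username is None:
        # Default user: tmp/data/{db}/{table}/
        return root / "data" / db / table
    # Group user: tmp/group/{username}/{db}/{table}/
    return root / "group" / username / db / table


if __name__ == "__main__":
    print(resolve_table_dir_sketch(Path("tmp"), "db", "table"))
    print(resolve_table_dir_sketch(Path("tmp"), "db", "table", "ahmed_test"))
```

Because the function returns a single path, any data written to the other location is invisible to the caller, which is the behavior reported in this issue.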
Data Flow Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW & PATH SEPARATION │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────┐
│ Cloud Storage │
│ (S3) │
│ [A, B, C, D, E] │
└──────────┬──────────┘
│
│ cloud_fetch_parquet(username="ahmed_test")
│ Line 288: Always writes to group path
▼
┌──────────────────────────────────────────────┐
│ tmp/group/ahmed_test/{db}/{table}/ │
│ Files: activitydetails_2025-02-10.parquet │
│ Contains: Records A, B, C, D, E │
│ ✅ Accessible by: Group user queries │
│ ❌ NOT accessible by: Default user queries │
└──────────────────────────────────────────────┘
│
│ insert() writes new data
│ Line 436: Always writes to default path
▼
┌──────────────────────────────────────────────┐
│ tmp/data/{db}/{table}/ │
│ partition_date=2025-02-10/data.parquet │
│ Contains: Records F, G │
│ ✅ Accessible by: Default user queries │
│ ❌ NOT accessible by: Group user queries │
└──────────────────────────────────────────────┘
│
│ cloud_sink_parquet()
│ Line 242: build_files_list(..., None)
│ → Only reads from tmp/data/...
│ Lines 388-400: Merges local [F, G] with cloud [A-E]
│
┌──────────┴──────────┐
│ │
┌───────────▼──────────┐ ┌───────▼────────────┐
│ Cloud Storage (S3) │ │ Query Resolution │
│ After cloud_sink: │ │ (resolve_table_dir)│
│ [A, B, C, D, E, F, G]│ │ │
│ ✅ Complete (merged) │ │ │
└──────────────────────┘ │ │
│ │
┌────────────┴──────────┐ │
│ │ │
┌───────────▼──────────┐ ┌───────────▼───────▼┐
│ Default User Query │ │ Group User Query │
│ username=None │ │ username="user" │
│ │ │ │
│ Returns: tmp/data/ │ │ Returns: │
│ Sees: F, G │ │ tmp/group/... │
│ ❌ Missing: A-E │ │ Sees: A, B, C, D, E│
│ │ │ ❌ Missing: F, G │
└─────────────────────┘ └───────────────────┘
Key Issues:
- cloud_sink_parquet() merges default path with cloud ✅
- BUT: cloud_sink_parquet() ignores group path data ❌
- Query resolution is EXCLUSIVE (only one path checked)
- Local paths remain split even though cloud is complete
Summary: Complete vs Incomplete Data
✅ Complete Data - Default User:
- All inserts go to tmp/data/... (default path)
- No data in tmp/group/...
- Query sees all data ✅
✅ Complete Data - Group User:
- cloud_fetch_parquet() downloads to tmp/group/...
- No new inserts yet (or all inserts synced)
- Query sees all data ✅
❌ Incomplete Data - Default User:
- cloud_fetch_parquet() downloads to tmp/group/... [A, B, C]
- User inserts new data → goes to tmp/data/... [D, E]
- Query as default user → Only sees [D, E], missing [A, B, C] ❌
OR
- cloud_sync_parquet() downloads complete data to tmp/group/...
- tmp/data/ still has old data (not updated)
- Query as default user → Sees old data, missing new data from group path ❌
❌ Incomplete Data - Group User:
- cloud_sync_parquet() downloads complete data to tmp/group/... [A, B, C]
- User inserts new data → goes to tmp/data/... [D, E]
- Query as group user → Only sees [A, B, C], missing [D, E] ❌
- New inserts missed until next cloud_sync_parquet()
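The incomplete-data scenarios above can be reproduced in isolation with placeholder files. The sketch below is an assumption-laden illustration: list_visible_files and reproduce_split are hypothetical helpers, the database name "db" is made up, and the files are empty placeholders rather than real Parquet data. It only mimics the path layout described in this issue.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def list_visible_files(
    root: Path, db: str, table: str, username: Optional[str] = None
) -> List[str]:
    """Hypothetical query-side lookup: scans only one of the two locations."""
    if username is None:
        base = root / "data" / db / table
    else:
        base = root / "group" / username / db / table
    if not base.is_dir():
        return []
    return sorted(p.relative_to(root).as_posix() for p in base.rglob("*.parquet"))


def reproduce_split(root: Path) -> Dict[str, List[str]]:
    """Lay out files the way the issue describes, then query as both users."""
    # Stand-in for what cloud_fetch_parquet() leaves in the group path.
    group_dir = root / "group" / "ahmed_test" / "db" / "activitydetails"
    group_dir.mkdir(parents=True)
    (group_dir / "activitydetails_2025-02-10.parquet").touch()

    # Stand-in for what insert() leaves in the default path.
    default_dir = root / "data" / "db" / "activitydetails" / "partition_date=2025-02-10"
    default_dir.mkdir(parents=True)
    (default_dir / "data.parquet").touch()

    return {
        "default": list_visible_files(root, "db", "activitydetails"),
        "group": list_visible_files(root, "db", "activitydetails", "ahmed_test"),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for user, files in reproduce_split(Path(tmp)).items():
            print(user, files)
```

Each user sees exactly one file and neither sees the other's, which matches the split described in the two incomplete-data cases.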
Root Cause:
- insert() always writes to tmp/data/... (default path)
- cloud_fetch_parquet() always writes to tmp/group/...
- Query checks only ONE path (default OR group, never both)
- cloud_sync_parquet() updates group path but NOT default path
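One possible direction, sketched under stated assumptions, is a lookup that returns every existing location instead of exactly one. The helper names candidate_table_dirs and list_parquet_files are hypothetical, and how a default-user query would decide which usernames to include is not specified in this issue, so the caller passes them explicitly here. The sketch also does not deduplicate records that exist in both locations, which a real fix would need to consider.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def candidate_table_dirs(
    root: Path, db: str, table: str, usernames: Iterable[str] = ()
) -> List[Path]:
    """Hypothetical lookup: default path plus each requested group path.

    Only directories that actually exist are returned, default path first.
    """
    candidates = [root / "data" / db / table]
    candidates.extend(root / "group" / name / db / table for name in usernames)
    return [path for path in candidates if path.is_dir()]


def list_parquet_files(dirs: Iterable[Path]) -> List[Path]:
    """Collect *.parquet files from every directory, in the order given."""
    files: List[Path] = []
    for directory in dirs:
        files.extend(sorted(directory.rglob("*.parquet")))
    return files


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data" / "db" / "table").mkdir(parents=True)
        (root / "group" / "ahmed_test" / "db" / "table").mkdir(parents=True)
        (root / "data" / "db" / "table" / "new.parquet").touch()
        (root / "group" / "ahmed_test" / "db" / "table" / "fetched.parquet").touch()
        dirs = candidate_table_dirs(root, "db", "table", ["ahmed_test"])
        print([p.name for p in list_parquet_files(dirs)])
```

Returning a list keeps the path layout unchanged and moves the decision about merging into the query layer. Whether that is preferable to having cloud_sync_parquet() also refresh the default path is a design question this sketch does not settle.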