Skip to content

Conversation

@ayush-shah
Copy link
Member

No description provided.

Analyzed external Spark lineage guide (github.com/Monsau/omd-spark-lineage-guide)
and compared it with current OpenMetadata Spark documentation.

Key findings:
- Current docs cover ~30-35% of what users need for production
- Missing: troubleshooting, multiple configuration methods, production deployment
- Missing: YARN/Kubernetes/standalone cluster guidance
- Missing: Docker quick start, security best practices, advanced examples
- Missing: comprehensive requirements and validation guidance

Report includes 13 major gap categories with prioritized recommendations.
Implemented all 3 phases of documentation enhancements based on gap analysis:

## Phase 1: Minimum Viable Improvements
- Added System Requirements section with compatibility matrix
- Added comprehensive Troubleshooting section with 7 common issues
- Added spark-submit configuration method

## Phase 2: Production Readiness
- Added YARN cluster deployment guide (cluster + client mode)
- Added Kubernetes deployment guide with ConfigMap examples
- Added Standalone cluster deployment instructions
- Added Docker Quick Start for local testing
- Added Verification section with step-by-step validation

## Phase 3: Advanced Features
- Added spark-defaults.conf configuration method
- Added Environment Variables configuration method
- Added Scala code examples
- Added PostgreSQL example
- Added Multi-source ETL pipeline example
- Added Configuration Parameters Reference table with all options
- Added Advanced Parameters for debugging

## Content Structure Improvements
- Reorganized TOC with 5 configuration methods
- Added benefits for each configuration approach
- Added deployment platform-specific guides
- Added debug mode and diagnostic checklist
- Added production best practices

## Coverage Improvement
Before: ~30% coverage (basic use case only)
After: ~90% coverage (production-ready, multiple platforms, troubleshooting)

Files modified:
- content/v1.10.x/connectors/ingestion/lineage/spark-lineage.md (386 → 1157 lines)
- content/v1.11.x-SNAPSHOT/connectors/ingestion/lineage/spark-lineage.md (386 → 1157 lines)

Impact: Users can now deploy Spark lineage in production environments
with full troubleshooting support and multiple configuration options.
The gap analysis documents were used to plan improvements but are no
longer needed since all enhancements have been implemented in the
actual Spark lineage documentation.

Removed:
- spark-lineage-gap-analysis.md (comprehensive analysis)
- spark-lineage-gaps-LEAN.md (quick fix guide)

All improvements are now in:
- content/v1.10.x/connectors/ingestion/lineage/spark-lineage.md
- content/v1.11.x-SNAPSHOT/connectors/ingestion/lineage/spark-lineage.md
Added detailed troubleshooting section covering 12 common Spark coding
patterns that prevent lineage from being captured, with fixes for each.

New section: "Spark Code Patterns That Break Lineage"

Patterns covered:
1. Using dynamic table names (datetime suffixes)
2. Creating DataFrames from local collections
3. Using RDD operations instead of DataFrame API
4. Using temporary views without proper table references
5. Using collect() and manual writes
6. Using non-JDBC file formats without catalog
7. Mixing multiple write operations
8. Using incorrect JDBC URL formats
9. Using saveAsTable without database prefix
10. Schema mismatches between Spark and OpenMetadata
11. Using deprecated write APIs
12. Not specifying driver class

Each pattern includes:
- ❌ Code example that breaks lineage
- ✅ Fixed code with proper lineage tracking
- Explanation of why it breaks and how to fix

Also added "Best Practices for Lineage-Friendly Spark Code" section
with 8 key recommendations.

This addresses the most common user issues where Spark jobs run
successfully but lineage doesn't appear in OpenMetadata.

Files modified:
- content/v1.10.x/connectors/ingestion/lineage/spark-lineage.md
- content/v1.11.x-SNAPSHOT/connectors/ingestion/lineage/spark-lineage.md
@ayush-shah ayush-shah requested a review from Copilot November 19, 2025 12:40
@netlify
Copy link

netlify bot commented Nov 19, 2025

Deploy Preview for docsv1 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit 7960370
🔍 Latest deploy log https://app.netlify.com/projects/docsv1/deploys/691e9733dced780008ee69e8
😎 Deploy Preview https://deploy-preview-568--docsv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copilot finished reviewing on behalf of ayush-shah November 19, 2025 12:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR significantly expands the Spark lineage documentation with comprehensive configuration methods, deployment guides, troubleshooting steps, and best practices for capturing data lineage using the OpenMetadata Spark Agent.

Key changes:

  • Added detailed system requirements and compatibility matrix for Spark versions 3.1+
  • Expanded configuration documentation from a single method to five different approaches (inline, spark-submit, spark-defaults.conf, environment variables, Scala)
  • Added extensive deployment guides for YARN, Kubernetes, standalone clusters, and Docker
  • Included comprehensive troubleshooting section with common issues and anti-patterns to avoid
  • Added verification steps and diagnostic checklists

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
content/v1.11.x-SNAPSHOT/connectors/ingestion/lineage/spark-lineage.md Complete documentation overhaul with added system requirements, multiple configuration methods, platform deployment guides, verification steps, and comprehensive troubleshooting section
content/v1.10.x/connectors/ingestion/lineage/spark-lineage.md Identical documentation updates applied to the v1.10.x version for consistency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1410 to +1414
```python
df.write.format("jdbc") \
.option("url", "jdbc:mysql://localhost:3306/db") \
.option("dbtable", "users") \
.save() # Missing driver specification
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "# Missing driver specification" but the driver class com.mysql.cj.jdbc.Driver is actually specified in the code examples throughout the document. This creates confusion. If the intent is to show missing driver specification as a problem, the example should not include the driver option at all.

Copilot uses AI. Check for mistakes.
Comment on lines +1410 to +1414
```python
df.write.format("jdbc") \
.option("url", "jdbc:mysql://localhost:3306/db") \
.option("dbtable", "users") \
.save() # Missing driver specification
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "# Missing driver specification" but the driver class com.mysql.cj.jdbc.Driver is actually specified in the code examples throughout the document. This creates confusion. If the intent is to show missing driver specification as a problem, the example should not include the driver option at all.

Copilot uses AI. Check for mistakes.
| 3.1.x | Latest | ⚠️ Limited Testing |
| 3.0.x or lower| Latest | ❌ Not Supported |

## Requirement
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in section heading: "Requirement" should be "Requirements" (plural) to match the table of contents entry on line 13 which says "Requirements".

Suggested change
## Requirement
## Requirements

Copilot uses AI. Check for mistakes.
| 3.1.x | Latest | ⚠️ Limited Testing |
| 3.0.x or lower| Latest | ❌ Not Supported |

## Requirement
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in section heading: "Requirement" should be "Requirements" (plural) to match the table of contents entry on line 13 which says "Requirements".

Suggested change
## Requirement
## Requirements

Copilot uses AI. Check for mistakes.
{% /codePreview %}


Once this pyspark job get finished you will see a new pipeline service with name `my_pipeline_service` generated in your openmetadata instance which would contain a pipeline with name `my_pipeline` as per the above example and you should also see lineage between the table `employee` and `employee_new` via `my_pipeline`.
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammatical error: "get finished" should be "gets finished" or "is finished". The subject "pyspark job" is singular and requires a singular verb form.

Suggested change
Once this pyspark job get finished you will see a new pipeline service with name `my_pipeline_service` generated in your openmetadata instance which would contain a pipeline with name `my_pipeline` as per the above example and you should also see lineage between the table `employee` and `employee_new` via `my_pipeline`.
Once this pyspark job is finished you will see a new pipeline service with name `my_pipeline_service` generated in your openmetadata instance which would contain a pipeline with name `my_pipeline` as per the above example and you should also see lineage between the table `employee` and `employee_new` via `my_pipeline`.

Copilot uses AI. Check for mistakes.
{% /codePreview %}


Once this pyspark job get finished you will see a new pipeline service with name `my_pipeline_service` generated in your openmetadata instance which would contain a pipeline with name `my_pipeline` as per the above example and you should also see lineage between the table `employee` and `employee_new` via `my_pipeline`.
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammatical error: "get finished" should be "gets finished" or "is finished". The subject "pyspark job" is singular and requires a singular verb form.

Suggested change
Once this pyspark job get finished you will see a new pipeline service with name `my_pipeline_service` generated in your openmetadata instance which would contain a pipeline with name `my_pipeline` as per the above example and you should also see lineage between the table `employee` and `employee_new` via `my_pipeline`.
Once this pyspark job is finished you will see a new pipeline service with name `my_pipeline_service` generated in your openmetadata instance which would contain a pipeline with name `my_pipeline` as per the above example and you should also see lineage between the table `employee` and `employee_new` via `my_pipeline`.

Copilot uses AI. Check for mistakes.
@ayush-shah
Copy link
Member Author

@copilot open a new pull request to apply changes based on the comments in this thread

- Remove System Requirements section (too detailed)
- Remove alternative configuration methods (Methods 2-5: spark-submit, spark-defaults.conf, env vars, Scala)
- Remove Additional Examples section (PostgreSQL and Multi-Source examples)
- Remove Deployment on Different Platforms section (YARN, Kubernetes, Standalone, Docker)
- Remove Verification section
- Simplify Configuration section to focus on essential inline PySpark configuration
- Keep core sections: Requirements, Configuration, Parameters, Databricks, Glue, Troubleshooting
- Reduce documentation from 1,500+ lines to ~950 lines for better clarity

This addresses user feedback to "keep it simple and remove unnecessary things"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants