Hi Modin team,
Thank you for making it easier to scale pandas-style workloads. I see Modin used more and more in pipelines that prepare large corpora, logs, and features which eventually feed into vector stores and LLM / RAG systems.
I maintain an MIT-licensed project called WFGY Problem Map, a 16-question checklist for debugging real-world RAG / LLM pipelines. It focuses on data and retrieval failure modes that appear when moving from a notebook prototype to a production system.
Why this could matter for Modin users:
- Modin is often used when the volume of documents or events is large enough that mistakes in preprocessing become very expensive to debug later.
- Several of the 16 failure modes describe exactly this kind of “worked on a sample, broke at scale” issue that surfaces once an LLM and a retriever are added.
- The checklist is framework-agnostic and applies regardless of which execution backend Modin is running on.
External references for WFGY Problem Map include:
- Harvard MIMS Lab ToolUniverse
- QCRI LLM Lab Multimodal RAG Survey
- Rankify (University of Innsbruck)
Suggestion:
If you think this would be useful for your community, one option would be to add a small “Further reading” link in the docs for users who combine Modin with vector stores and LLMs:
“RAG / LLM debugging checklist: WFGY Problem Map (16 failure modes)”
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Project home: https://github.com/onestardao/WFGY
Thank you for your time and for all your work on Modin.
Best,
PSBigBig