Commit 12c8fe2 (1 parent: 647581f)
Showing 5 changed files with 187 additions and 11 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,19 +1,63 @@ | ||
# Docs | ||
# RAG system blocks | ||
--- | ||
|
||
## RAG system blocks | ||
## Stages | ||
|
||
- Knowledge-base | ||
- Local Source System | ||
- Content ETL+T | ||
|
||
- Pre-retrieval | ||
- Indexing | ||
- Query Manipulation | ||
- Data Modification | ||
|
||
- Retrieval Step | ||
- Search & Rank | ||
- Strategy | ||
|
||
- Post-retrieval | ||
- Re-ranking | ||
- Filtering | ||
|
||
- Generation | ||
- Enhancing | ||
- Customization | ||
|
||
### Knowledge-base | |
|
||
**Local Source System** | ||
|
||
- **Folder Structure**: The `knowledge/` folder, with 4 venues. | |
- **Agnostic Source Catalog File**: `source_catalog.json`. | ||
- **Content has been downloaded**: yes (72 articles from 3 different venues). | ||
|
||
**Content ETL+T** | ||
|
||
- Extract: | ||
- v0.1.0 : filter fields, read a PDF file, split in pages, python compatible. | ||
- (Rust) define a meta-data filter to read pdf. | ||
- (Rust) read 1 pdf file (using the previously defined filter). | ||
- (Rust) split the content page by page. | ||
- (Rust) pass the result in a python-compatible dictionary. | ||
|
||
- Transform: | ||
- v0.1.0 : Select only text, only python-compatible dict format, a single implementation of a tokenizer. | ||
- Part 1 : Remove and Regroup | ||
- (Rust) drop anything that is not text. | ||
- (Rust) re-group into common sections (abstract, introduction, etc.). | ||
- (Rust) define filter to apply into text. | ||
- (Rust) filter the grouped text. | ||
- (Rust) pass the result in a python-compatible dictionary. | ||
- Part 2 : Tokenize | ||
- (Python) tokenize section-wise. | ||
|
||
- Load: | ||
- v0.1.0 : Not yet defined. | ||
- (Shell - Rust) Create a vector DB with chroma. | |
- (Rust) inject into vector DB. | ||
|
||
- Test: | ||
- v0.1.0 : Not yet defined. | ||
- (Rust) corresponding functionality and integration tests. | ||
- (Python) corresponding functionality and integration tests. | |
|
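The "Tokenize" step of the Transform stage can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the Rust side hands over a dictionary that maps section names to plain text (that shape is an assumption), and it uses a simple regex tokenizer as a stand-in for whichever tokenizer v0.1.0 ends up using. The names `TOKEN_PATTERN` and `tokenize_sections` are hypothetical.

```python
import re

# Stand-in tokenizer (assumption): words, or single punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_sections(sections: dict[str, str]) -> dict[str, list[str]]:
    """Tokenize each section independently, keeping the section names as keys."""
    return {
        name: TOKEN_PATTERN.findall(text.lower())
        for name, text in sections.items()
    }


if __name__ == "__main__":
    # Hypothetical input, shaped like the assumed python-compatible dictionary.
    grouped = {
        "abstract": "RAG systems retrieve, then generate.",
        "introduction": "Retrieval comes first.",
    }
    for section, tokens in tokenize_sections(grouped).items():
        print(section, tokens)
```

Keeping the section name as the key means later stages (indexing, filtering) can still tell which part of the article a token came from.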
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
# Knowledge Base | ||
|
||
Only Journal Articles with OpenAccess rights. | ||
|
||
## Folder structure | ||
|
||
There should be a `knowledge` folder in which each sub-folder is a source (e.g. one per journal), and each | |
sub-folder contains that source's files (any naming convention, as long as it is consistent). | |
|
||
```bash | ||
molina/ | |
└── knowledge/ | |
    ├── conference_icml/ | |
    │   ├── alon22a.pdf | |
    │   ├── ... | |
    │   └── zrnic24a.pdf | |
    ├── journal_fb/ | |
    │   ├── fbloc-02-00014.pdf | |
    │   ├── ... | |
    │   └── fbloc-07-1455070.pdf | |
    └── journal_jifmim/ | |
        ├── 1-s2.0-S1042443117302858-main.pdf | |
        ├── ... | |
        └── 1-s2.0-S1544612324012200-main.pdf | |
``` | ||
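
A quick way to check that a local copy follows this layout is to count the PDF files per source folder. The sketch below is only illustrative: the function name `count_pdfs_per_source` is hypothetical, and it assumes the script runs from the directory that contains `knowledge/`.

```python
from pathlib import Path


def count_pdfs_per_source(knowledge_dir: Path) -> dict[str, int]:
    """Map each source sub-folder of `knowledge_dir` to its number of PDF files."""
    counts: dict[str, int] = {}
    for source in sorted(knowledge_dir.iterdir()):
        if not source.is_dir():
            continue  # only sub-folders count as sources
        counts[source.name] = sum(
            1
            for f in source.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        )
    return counts


if __name__ == "__main__":
    root = Path("knowledge")  # assumed location, relative to the working directory
    if root.is_dir():
        for name, n in count_pdfs_per_source(root).items():
            print(f"{name}: {n} PDF file(s)")
```
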
## Source Catalog | ||
|
||
A file, as language-agnostic as possible, that represents the folder structure and metadata about the files' contents. | |
|
||
```json | |
{ | |
  "publisher": { | |
    "elsevier": { | |
      "journal": { | |
        "jifmim": { | |
          "abv": "jifmim", | |
          "name": "Journal of International Financial Markets, Institutions and Money", | |
          "isOpenAccess": false, | |
          "hasOpenAccess": true | |
        }, | |
        "pss": { | |
          "abv": "pss", | |
          "name": "Planetary and Space Science", | |
          "isOpenAccess": false, | |
          "hasOpenAccess": true | |
        } | |
      } | |
    }, | |
    "frontiers": { | |
      "journal": { | |
        "fb": { | |
          "abv": "fb", | |
          "name": "Frontiers in Blockchain", | |
          "isOpenAccess": true, | |
          "hasOpenAccess": true | |
        } | |
      } | |
    }, | |
    "icml": { | |
      "conference": { | |
        "icml": { | |
          "abv": "icml", | |
          "name": "International Conference on Machine Learning", | |
          "isOpenAccess": true, | |
          "hasOpenAccess": true | |
        } | |
      } | |
    } | |
  } | |
} | |
``` | ||
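
Because the catalog is plain JSON, any language can consume it. As a hedged illustration, the Python sketch below flattens the `publisher -> kind -> venue` nesting and derives the folder name each venue would use. The `<kind>_<abv>` naming rule is an assumption inferred from the folder tree above (`journal_fb`, `conference_icml`), and `list_venues`, `expected_folders` and `SAMPLE_CATALOG` are hypothetical names.

```python
import json

# Trimmed-down sample with the same nesting as `source_catalog.json`.
SAMPLE_CATALOG = json.loads("""
{
  "publisher": {
    "frontiers": {
      "journal": {
        "fb": {"abv": "fb", "name": "Frontiers in Blockchain",
               "isOpenAccess": true, "hasOpenAccess": true}
      }
    },
    "icml": {
      "conference": {
        "icml": {"abv": "icml", "name": "International Conference on Machine Learning",
                 "isOpenAccess": true, "hasOpenAccess": true}
      }
    }
  }
}
""")


def list_venues(catalog: dict) -> list[tuple[str, str, str]]:
    """Flatten the catalog into (publisher, kind, abbreviation) tuples."""
    rows = []
    for publisher, kinds in catalog["publisher"].items():
        for kind, venues in kinds.items():
            for venue in venues.values():
                rows.append((publisher, kind, venue["abv"]))
    return rows


def expected_folders(catalog: dict) -> list[str]:
    """Folder names under `knowledge/`, assuming the `<kind>_<abv>` convention."""
    return [f"{kind}_{abv}" for _, kind, abv in list_venues(catalog)]


if __name__ == "__main__":
    print(expected_folders(SAMPLE_CATALOG))
```

Note that `json.loads` rejects trailing commas, so the catalog file has to be strict JSON for a check like this to run.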
|
||
## Downloaded content | ||
|
||
### Publisher: [Elsevier](https://www.sciencedirect.com) | ||
|
||
- Some are Open Access. | ||
- [copyright](https://www.elsevier.com/open-access) | ||
- Publications: | ||
- **PSS** (Journal): Planetary and Space Science. | ||
- **JIFMIM** (Journal): Journal of International Financial Markets, Institutions and Money. | |
|
||
### Publisher : [Frontiers](https://www.frontiersin.org/articles) | ||
|
||
- All content is Open Access. | |
- [copyright](https://www.frontiersin.org/about/open-access) | ||
- Publications: | ||
- **FiB** (Journal): Frontiers in Blockchain | ||
|
||
### Publisher: [ICML](https://icml.cc/) | ||
|
||
- All content is Open Access. | |
- [copyright](https://icml.cc/FAQ/Copyright) | ||
- Publications: | ||
- **PMLR** (Conference Proceedings): 2024:2021 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
{ | |
  "publisher": { | |
    "elsevier": { | |
      "journal": { | |
        "jifmim": { | |
          "abv": "jifmim", | |
          "name": "Journal of International Financial Markets, Institutions and Money", | |
          "isOpenAccess": false, | |
          "hasOpenAccess": true | |
        }, | |
        "pss": { | |
          "abv": "pss", | |
          "name": "Planetary and Space Science", | |
          "isOpenAccess": false, | |
          "hasOpenAccess": true | |
        } | |
      } | |
    }, | |
    "frontiers": { | |
      "journal": { | |
        "fb": { | |
          "abv": "fb", | |
          "name": "Frontiers in Blockchain", | |
          "isOpenAccess": true, | |
          "hasOpenAccess": true | |
        } | |
      } | |
    }, | |
    "icml": { | |
      "conference": { | |
        "icml": { | |
          "abv": "icml", | |
          "name": "International Conference on Machine Learning", | |
          "isOpenAccess": true, | |
          "hasOpenAccess": true | |
        } | |
      } | |
    } | |
  } | |
} |