Stage 0: Knowledge-base
IFFranciscoME committed Oct 15, 2024
1 parent 647581f commit 12c8fe2
Showing 5 changed files with 187 additions and 11 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -2,7 +2,6 @@
# -- Local Files ------------------------------------------------------------ #
# -- ----------- ------------------------------------------------------------ #

knowledge/
*.pdf
*.gguf
*.pdf
13 changes: 6 additions & 7 deletions README.md
@@ -23,13 +23,12 @@ Challenges of the use of Large Language Models (LLMs), as an academic research a
- Bias in Training Data.
- Interpretability and Response Attribution.

## General structure
- pre-retrieval
- retrieval
- post-retrieval
- generation
Those seem very closely related to what an expert human researcher should also avoid.

## Install

## Example 1: Knowledge Summarization

## Copyright disclaimer

The purpose of this tool is not to promote or condone the unauthorized downloading or use of copyrighted materials. All users of this tool should only use legally obtained documents, self-archived documents and materials in their projects.
50 changes: 47 additions & 3 deletions docs/README.md
@@ -1,19 +1,63 @@
# Docs
# RAG system blocks
---

## RAG system blocks
## Stages

- Knowledge-base
    - Local Source System
    - Content ETL+T

- Pre-retrieval
    - Indexing
    - Query Manipulation
    - Data Modification

- Retrieval Step
    - Search & Rank
    - Strategy

- Post-retrieval
    - Re-ranking
    - Filtering

- Generation
    - Enhancing
    - Customization

### Knowledge-base

**Local Source System**

- **Folder Structure**: The `knowledge/` folder, with 4 venues.
- **Agnostic Source Catalog File**: `source_catalog.json`.
- **Content has been downloaded**: yes (72 articles from 3 different venues).

**Content ETL+T**

- Extract:
    - v0.1.0 : filter fields, read a PDF file, split into pages, Python-compatible.
        - (Rust) define a metadata filter to read a PDF.
        - (Rust) read 1 PDF file (using the previously defined filter).
        - (Rust) split the content page by page.
        - (Rust) pass the result in a Python-compatible dictionary.

- Transform:
    - v0.1.0 : Select only text, only Python-compatible dict format, a single implementation of a tokenizer.
        - Part 1 : Remove and Regroup
            - (Rust) drop anything that is not text.
            - (Rust) re-group into common sections (abstract, introduction, etc.).
            - (Rust) define a filter to apply to the text.
            - (Rust) filter the grouped text.
            - (Rust) pass the result in a Python-compatible dictionary.
        - Part 2 : Tokenize
            - (Python) tokenize section-wise.

- Load:
    - v0.1.0 : Not yet defined.
        - (Shell - Rust) Create a vector DB with chroma.
        - (Rust) inject into vector DB.

- Test:
    - v0.1.0 : Not yet defined.
        - (Rust) corresponding functionality and integration tests.
        - (Python) corresponding functionality and integration tests.
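
The Transform step (regroup into sections, then tokenize section-wise) could look roughly like the sketch below. It is written entirely in Python for illustration, although the plan assigns Part 1 to Rust. The input shape (a list of page strings), the heading list in `SECTION_NAMES`, the `preamble` bucket and the naive regex tokenizer are all assumptions, not part of the v0.1.0 design.

```python
import re

# Hypothetical heading list; the plan only names "abstract, introduction, etc."
SECTION_NAMES = ("abstract", "introduction", "conclusion")


def regroup_by_section(pages):
    """Group page texts into sections keyed by the last heading seen."""
    sections = {}
    current = "preamble"  # assumed bucket for text before the first known heading
    for page in pages:
        for line in page.splitlines():
            text = line.strip()
            if text.lower() in SECTION_NAMES:
                current = text.lower()  # a heading line switches the section
                continue
            if text:
                sections.setdefault(current, []).append(text)
    # Join lines so a section that spans a page break becomes one string.
    return {name: " ".join(lines) for name, lines in sections.items()}


def tokenize_sections(sections):
    """Tokenize each section separately with a naive lowercase word tokenizer."""
    return {
        name: re.findall(r"[a-z0-9]+", text.lower())
        for name, text in sections.items()
    }


# Two made-up pages; the introduction continues across the page break.
pages = [
    "Abstract\nWe study retrieval.\nIntroduction\nLarge models",
    "need grounded context.",
]
sections = regroup_by_section(pages)
tokens = tokenize_sections(sections)
print(tokens)
```

Keeping the result as a plain `dict` of section name to token list matches the "python-compatible dictionary" hand-off described in the list; a real tokenizer would replace the regex.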

94 changes: 94 additions & 0 deletions knowledge/README.md
@@ -0,0 +1,94 @@
# Knowledge Base

Only Journal Articles with OpenAccess rights.

## Folder structure

A `knowledge` folder should exist. Each of its sub-folders is a source (e.g. one per journal), and each sub-folder
should contain the source's files (any naming convention, as long as it is consistent).

```bash
molina/
└─ knowledge/
   ├── conference_icml/
   │   ├── alon22a.pdf
   │   ├── ...
   │   └── zrnic24a.pdf
   ├── journal_fb/
   │   ├── fbloc-02-00014.pdf
   │   ├── ...
   │   └── fbloc-07-1455070.pdf
   └── journal_jifmim/
       ├── 1-s2.0-S1042443117302858-main.pdf
       ├── ...
       └── 1-s2.0-S1544612324012200-main.pdf
```
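
As a quick check of this layout, a small script can map each source sub-folder to the PDF files it holds. The following is a minimal sketch, not part of the repository: the function name `list_sources` is hypothetical, and the demo builds a throwaway copy of the layout (including an invented `notes.txt`) so that it runs without the real articles.

```python
import tempfile
from pathlib import Path


def list_sources(knowledge_dir):
    """Map each source sub-folder name to the sorted PDF file names inside it."""
    root = Path(knowledge_dir)
    return {
        sub.name: sorted(p.name for p in sub.glob("*.pdf"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


# Demo on a temporary copy of the layout; file names are taken from the tree above,
# except notes.txt, which is made up to show that non-PDF files are ignored.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "knowledge"
    for rel in (
        "conference_icml/alon22a.pdf",
        "journal_fb/fbloc-02-00014.pdf",
        "journal_fb/notes.txt",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = list_sources(root)

for source, files in found.items():
    print(f"{source}: {len(files)} PDF file(s)")
```

Pointing `list_sources` at the real `knowledge/` folder would give the per-source counts for the downloaded content.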
## Source Catalog

A file, as agnostic of any programming language as possible, that represents the folder structure and the metadata about the files' contents.

```json
{
  "publisher": {
    "elsevier": {
      "journal": {
        "jifmim": {
          "abv": "jifmim",
          "name": "Journal of International Financial Markets, Institutions and Money",
          "isOpenAccess": false,
          "hasOpenAccess": true
        },
        "pss": {
          "abv": "pss",
          "name": "Planetary and Space Science",
          "isOpenAccess": false,
          "hasOpenAccess": true
        }
      }
    },
    "frontiers": {
      "journal": {
        "fb": {
          "abv": "fb",
          "name": "Frontiers in Blockchain",
          "isOpenAccess": true,
          "hasOpenAccess": true
        }
      }
    },
    "icml": {
      "conference": {
        "icml": {
          "abv": "icml",
          "name": "International Conference on Machine Learning",
          "isOpenAccess": true,
          "hasOpenAccess": true
        }
      }
    }
  }
}
```
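
Because the catalog is plain JSON, any language can walk it. Below is a minimal Python sketch that separates venues by their access flags. It uses a trimmed copy of the catalog written as strict JSON, since `json.loads` rejects trailing commas. The helper name `iter_venues` is hypothetical, and reading `isOpenAccess` as "the whole venue is Open Access" and `hasOpenAccess` as "at least some articles are" is an assumption inferred from the publisher notes.

```python
import json

# Trimmed, strictly valid JSON copy of the catalog, for illustration only.
CATALOG_TEXT = """
{
  "publisher": {
    "elsevier": {
      "journal": {
        "pss": {"abv": "pss", "name": "Planetary and Space Science",
                "isOpenAccess": false, "hasOpenAccess": true}
      }
    },
    "frontiers": {
      "journal": {
        "fb": {"abv": "fb", "name": "Frontiers in Blockchain",
               "isOpenAccess": true, "hasOpenAccess": true}
      }
    }
  }
}
"""


def iter_venues(catalog):
    """Yield (publisher, kind, venue metadata) for every venue in the catalog."""
    for publisher, kinds in catalog["publisher"].items():
        for kind, venues in kinds.items():
            for meta in venues.values():
                yield publisher, kind, meta


catalog = json.loads(CATALOG_TEXT)

# Assumed meaning: fully open venues vs. venues with only some open articles.
fully_open = [m["abv"] for _, _, m in iter_venues(catalog) if m["isOpenAccess"]]
partly_open = [
    m["abv"]
    for _, _, m in iter_venues(catalog)
    if m["hasOpenAccess"] and not m["isOpenAccess"]
]
print("fully open:", fully_open)
print("partly open:", partly_open)
```

The lookup depends on exact key spelling, so the flag names need to be cased identically for every venue.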

## Downloaded content

### Publisher: [Elsevier](https://www.sciencedirect.com)

- Some are Open Access.
- [copyright](https://www.elsevier.com/open-access)
- Publications:
- **PSS** (Journal): Planetary and Space Science.
- **JIFMIM** (Journal): Journal of International Financial Markets, Institutions and Money.

### Publisher: [Frontiers](https://www.frontiersin.org/articles)

- All content is Open Access.
- [copyright](https://www.frontiersin.org/about/open-access)
- Publications:
- **FiB** (Journal): Frontiers in Blockchain.

### Publisher: [ICML](https://icml.cc/)

- All content is Open Access.
- [copyright](https://icml.cc/FAQ/Copyright)
- Publications:
- **PMLR** (Conference Proceedings): 2024:2021

40 changes: 40 additions & 0 deletions knowledge/source_catalog.json
@@ -0,0 +1,40 @@
{
  "publisher": {
    "elsevier": {
      "journal": {
        "jifmim": {
          "abv": "jifmim",
          "name": "Journal of International Financial Markets, Institutions and Money",
          "isOpenAccess": false,
          "hasOpenAccess": true
        },
        "pss": {
          "abv": "pss",
          "name": "Planetary and Space Science",
          "isOpenAccess": false,
          "hasOpenAccess": true
        }
      }
    },
    "frontiers": {
      "journal": {
        "fb": {
          "abv": "fb",
          "name": "Frontiers in Blockchain",
          "isOpenAccess": true,
          "hasOpenAccess": true
        }
      }
    },
    "icml": {
      "conference": {
        "icml": {
          "abv": "icml",
          "name": "International Conference on Machine Learning",
          "isOpenAccess": true,
          "hasOpenAccess": true
        }
      }
    }
  }
}
