English | 简体中文
- Requirements
- News
- Data
- Models
- LLaMA-series
- ChatGLM
- MOSS
- Baichuan
- CPM-Bee
- GPT-series
- Case 1: Information Extraction with LLMs English | Chinese
- Case 2: Data Augmentation with LLMs English | Chinese
- Case 3: CCKS2023 Instruction-based KG Construction with LLMs English | Chinese
- Case 4: Unleash the Power of Large Language Models for Few-shot Relation Extraction English | Chinese
- Case 5: CodeKGC-Code Language Models for KG Construction English | Chinese
- Methods
- Citation
In the era of large models, DeepKE-LLM utilizes a completely new environment dependency.
```bash
conda create -n deepke-llm python=3.9
conda activate deepke-llm
cd example/llm
pip install -r requirements.txt
```
Please note that the `requirements.txt` file is located in the `example/llm` folder.
- [November 2023] The weights of knowlm-13b-ie have been updated. This update mainly fixed NaN outputs, shortened the inference length, and added support for instructions without a specified schema. It also adds support for fast inference with vLLM, as detailed in 6.1.3 VLLM prediction acceleration.
- [October 2023] A new bilingual (Chinese-English) text-based topic information extraction (IE) instruction dataset named InstructIE was released. More information can be found at InstructIE.
- [August 2023] A specialized version of KnowLM for information extraction (IE), named knowlm-13b-ie, was launched.
- [July 2023] Some of the instruction datasets used for training were released, including knowlm-ke and KnowLM-IE.
- [June 2023] The first version of pre-trained weights, knowlm-13b-base-v1.0, and the first version of zhixi-13b-lora were released.
Name | Download | Quantity | Description |
---|---|---|---|
InstructIE-train | Google drive / HuggingFace / Baidu Netdisk | 300,000+ | InstructIE training set, constructed via weak supervision; it may contain some noisy data |
InstructIE-valid | Google drive / HuggingFace / Baidu Netdisk | 2,000+ | InstructIE validation set |
InstructIE-test | Google drive / HuggingFace / Baidu Netdisk | 2,000+ | InstructIE test set |
train.json, valid.json | Google drive | 5,000 | Preliminary training and test sets for the task "Instruction-Driven Adaptive Knowledge Graph Construction" in the CCKS2023 Open Knowledge Graph Challenge, randomly selected from instruct_train.json |
`InstructIE-train` contains two files: `InstructIE-zh.json` and `InstructIE-en.json`, each of which contains the fields `'id'` (unique identifier), `'cate'` (text category), `'entity'`, and `'relation'` (triples). The extraction instructions and outputs can be freely constructed from `'entity'` and `'relation'`.
`InstructIE-valid` and `InstructIE-test` are the validation set and test set, respectively, and both include Chinese (`zh`) and English (`en`) data.
`train.json`: Same fields as `KnowLM-IE.json`; `'instruction'` and `'output'` have only one format, and extraction instructions and outputs can also be freely constructed from `'relation'`.
`valid.json`: Same fields as `train.json`, but with more accurate annotations obtained through crowdsourcing.
Here is an explanation of each field:
Field | Description |
---|---|
id | Unique identifier |
cate | Text topic of the input (12 topics in total) |
input | Model input text (all triples involved need to be extracted from it) |
instruction | Instruction for the model to perform the extraction task |
output | Expected model output |
entity | Entities involved, given as (entity, entity_type) pairs |
relation | Relation triples (head, relation, tail) involved in the input |
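To make the field layout concrete, here is an illustrative record in the spirit of the schema above. The values are made up and the exact serialization of `output`, `entity`, and `relation` in the released files may differ.

```python
# Illustrative example of the field layout described above (values are made up;
# the exact serialization in the released InstructIE files may differ).
example_record = {
    "id": "example_0001",
    "cate": "Person",
    "input": "Alan Turing was born in London.",
    "instruction": "Extract all relation triples contained in the input text.",
    "output": "(Alan Turing, place of birth, London)",
    "entity": [("Alan Turing", "Person"), ("London", "Place")],
    "relation": [("Alan Turing", "place of birth", "London")],
}
```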
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. Based on KnowLM, we also provide a bilingual LLM for knowledge extraction named ZhiXi (智析), which means intelligent analysis of data for knowledge extraction.
ZhiXi follows a two-step approach: (1) it performs further full pre-training of LLaMA (13B) on Chinese/English corpora to enhance the model's Chinese comprehension and knowledge while preserving its English and code capabilities as much as possible; (2) it fine-tunes the model on an instruction dataset to improve its understanding of human instructions. For detailed information about the model, please refer to KnowLM.
Case 1: LoRA Fine-tuning of ChatGLM for CCKS2023 Instruction-based KG Construction English | Chinese
Case 1: OpenDelta Fine-tuning of Moss for CCKS2023 Instruction-based KG Construction English | Chinese
Case 1: OpenDelta Fine-tuning of Baichuan for CCKS2023 Instruction-based KG Construction English | Chinese
Case 1: OpenDelta Fine-tuning of CPM-Bee for CCKS2023 Instruction-based KG Construction English | Chinese
Case 4: Unleash the Power of Large Language Models for Few-shot Relation Extraction English | Chinese
To better address the Relational Triple Extraction (RTE) task in knowledge graph construction, we design code-style prompts to model the structure of relational triples and use Code-LLMs to generate more accurate predictions. The key step of code-style prompt construction is to transform (text, output triples) pairs into semantically equivalent programs written in Python, as sketched below.
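A minimal sketch of what such a code-style prompt could look like, assuming simple `Entity` and `Triple` classes; the class names and example sentences are illustrative and not taken from the CodeKGC paper.

```python
# A minimal, illustrative code-style prompt for relational triple extraction.
# A (text, triples) demonstration is rewritten as Python code; a Code-LLM is
# then asked to complete the `triples` list for a new input text.

class Entity:
    def __init__(self, name: str):
        self.name = name

class Triple:
    def __init__(self, head: Entity, relation: str, tail: Entity):
        self.head, self.relation, self.tail = head, relation, tail

def build_code_prompt(text: str) -> str:
    """Wrap the input text as a docstring and leave the triple list open
    for the Code-LLM to complete."""
    demo = (
        '""" Bill Gates co-founded Microsoft. """\n'
        'triples = [\n'
        '    Triple(Entity("Bill Gates"), "founder of", Entity("Microsoft")),\n'
        ']\n\n'
    )
    return demo + f'""" {text} """\ntriples = [\n'

print(build_code_prompt("Alan Turing was born in London."))
```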
In-Context Learning is an approach to guide large language models to improve their performance on specific tasks. Rather than fine-tuning the model, it places a task description and a few demonstrations directly in the prompt, so that the model can better understand and address the requirements of a particular domain without any weight updates. Through In-Context Learning, we can enable large language models to perform tasks such as information extraction, data augmentation, and instruction-driven knowledge graph construction; a minimal prompt-construction sketch is shown below.
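The sketch below assembles a few-shot prompt for triple extraction. The demonstrations are made up, and the resulting string would be sent to whichever LLM backend is in use; nothing here is DeepKE-LLM's own API.

```python
# A minimal sketch of in-context learning for triple extraction: a few
# (text, triples) demonstrations are placed in the prompt so the model can
# imitate the format on a new input. No model weights are updated.

demonstrations = [
    ("Paris is the capital of France.", "(Paris, capital of, France)"),
    ("Tim Cook is the CEO of Apple.", "(Tim Cook, CEO of, Apple)"),
]

def build_icl_prompt(query: str) -> str:
    parts = ["Extract relation triples from the text."]
    for text, triples in demonstrations:
        parts.append(f"Text: {text}\nTriples: {triples}")
    parts.append(f"Text: {query}\nTriples:")
    return "\n\n".join(parts)

print(build_icl_prompt("Alan Turing was born in London."))
```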
LoRA (Low-Rank Adaptation of Large Language Models) reduces the number of trainable parameters by learning low-rank decomposition matrices while freezing the original weights. This significantly reduces the storage requirements of large language models for specific tasks and enables efficient task switching during deployment without introducing inference latency. For more details, please refer to the original paper LoRA: Low-Rank Adaptation of Large Language Models.
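As a rough illustration of the idea (a sketch, not DeepKE-LLM's implementation), a frozen linear layer can be augmented with a trainable low-rank update so that the effective weight becomes W + (alpha / r) * B @ A:

```python
# Minimal LoRA sketch: the pretrained weight is frozen and only the
# low-rank factors A and B are trained.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the original weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # base output plus the scaled low-rank update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # 2 * 8 * 1024 instead of 1024 * 1024
```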
The PT (P-Tuning) method, as referred to in the official code of ChatGLM, is a soft-prompt method specifically designed for large models. P-Tuning introduces new parameters only to the embeddings of large models. P-Tuning-V2 adds new parameters to both the embeddings and the preceding layers of large models.
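A minimal sketch of the embedding-level soft prompt behind P-Tuning (illustrative only; P-Tuning-V2 additionally inserts such trainable vectors into the deeper layers, and the sizes below are placeholders):

```python
# Soft-prompt sketch: a small number of trainable "virtual token" embeddings
# are prepended to the input embeddings while the backbone model stays frozen.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, num_virtual_tokens: int, hidden_size: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_size)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft_prompt = SoftPrompt(num_virtual_tokens=20, hidden_size=4096)
dummy = torch.randn(2, 128, 4096)   # stand-in for frozen token embeddings
print(soft_prompt(dummy).shape)     # torch.Size([2, 148, 4096])
```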
If you use this project, please cite the following paper:
```bibtex
@misc{knowlm,
  author = {Ningyu Zhang and Jintian Zhang and Xiaohan Wang and Honghao Gui and Kangwei Liu and Yinuo Jiang and Xiang Chen and Shengyu Mao and Shuofei Qiao and Yuqi Zhu and Zhen Bi and Jing Chen and Xiaozhuan Liang and Yixin Ou and Runnan Fang and Zekun Xi and Xin Xu and Lei Li and Peng Wang and Mengru Wang and Yunzhi Yao and Bozhong Tian and Yin Fang and Guozhou Zheng and Huajun Chen},
  title = {KnowLM Technical Report},
  year = {2023},
  url = {http://knowlm.zjukg.cn/},
}
```