Hebrew NLP Models, Tools, Commercial and Online Services

Contents

  • Yonti Levin's Hebrew Tokenizer [Python] {MIT} - A very simple Python tokenizer for Hebrew text. No batteries included - no dependencies needed!
  • Hebrew Tokenizer {?} - Eyal Gruss's Hebrew tokenizer. A field-tested Hebrew tokenizer for dirty texts (ben-yehuda project, bible, cc100, mc4, opensubs, oscar, twitter) focused on multi-word expression extraction.
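To illustrate the kind of work these tokenizers do, here is a minimal stdlib-only sketch that splits undotted Hebrew text into Hebrew words (including quote-marked acronyms), numbers, and punctuation. It is an invented stand-in for illustration, not the actual API of either library:

```python
import re

# Illustrative stand-in for a Hebrew tokenizer (not the actual API of the
# libraries above): Hebrew words (incl. quote-marked acronyms like צה"ל),
# numbers, and single punctuation symbols.
TOKEN_RE = re.compile(
    r"[\u05D0-\u05EA]+(?:[\"'][\u05D0-\u05EA]+)*"  # Hebrew word / acronym
    r"|\d+(?:\.\d+)?"                              # integer or decimal number
    r"|\S"                                         # any other non-space symbol
)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize('שלום, עולם! 42'))  # ['שלום', ',', 'עולם', '!', '42']
```

Real tokenizers additionally handle mixed scripts, dates, URLs, and multi-word expressions, which is where the libraries above earn their keep.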
  • RFTokenizer [Python] {Apache License 2.0} - A highly accurate morphological segmenter to break up complex word forms
  • The MILA Morphological Analysis Tool [?] {GPLv3} - Takes as input undotted Hebrew text (formatted either as plain text or as tokenized XML following MILA's standards). The Analyzer then returns, for each token, all the possible morphological analyses of the token, reflecting part of speech, transliteration, gender, number, definiteness, and possessive suffix. Free for non-commercial use. (temporarily down)
  • The MILA Morphological Disambiguation Tool [?] {GPLv3} - Takes as input morphologically-analyzed text and uses a Hidden Markov Model (HMM) to assign scores for each analysis, considering contextual information from the rest of the sentence. For a given token, all analyses deemed impossible are given scores of 0; all n analyses deemed possible are given positive scores. Free for non-commercial use. (temporarily down)
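As a toy illustration of the HMM scoring described above (with invented tags and probabilities, not MILA's actual model or data): each token contributes only its possible analyses, and the best-scoring tag sequence is recovered with Viterbi-style dynamic programming.

```python
def viterbi(tokens, candidates, start, trans, emit):
    """Return (score, tags) for the best-scoring analysis sequence.

    candidates maps each token to its possible tags; analyses deemed
    impossible simply do not appear, mirroring the score-0 behavior above.
    """
    first = tokens[0]
    paths = {t: (start.get(t, 0.0) * emit.get((t, first), 0.0), [t])
             for t in candidates[first]}
    for tok in tokens[1:]:
        new = {}
        for tag in candidates[tok]:
            prev = max(paths, key=lambda p: paths[p][0] * trans.get((p, tag), 0.0))
            score = (paths[prev][0] * trans.get((prev, tag), 0.0)
                     * emit.get((tag, tok), 0.0))
            new[tag] = (score, paths[prev][1] + [tag])
        paths = new
    return max(paths.values(), key=lambda sp: sp[0])

# A two-token example; the first token is ambiguous between two analyses.
# All probabilities are invented for the demonstration.
tokens = ['הספר', 'יפה']
candidates = {'הספר': ['NOUN', 'VERB'], 'יפה': ['ADJ']}
start = {'NOUN': 0.6, 'VERB': 0.4}
emit = {('NOUN', 'הספר'): 0.5, ('VERB', 'הספר'): 0.3, ('ADJ', 'יפה'): 0.9}
trans = {('NOUN', 'ADJ'): 0.5, ('VERB', 'ADJ'): 0.1}
score, tags = viterbi(tokens, candidates, start, trans, emit)
print(tags)   # ['NOUN', 'ADJ']
```

The contextual information enters through the transition probabilities: the VERB analysis is possible in isolation but loses once the following adjective is taken into account.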
  • BGU Tagger: Morphological Tagging of Hebrew [Java] {?} - Morphological Analysis, Disambiguation.
  • AlephBERT {Apache License 2.0} - A large pre-trained language model for Modern Hebrew, publicly available, pre-trained on OSCAR, texts of Hebrew tweets, and all of Hebrew Wikipedia, published by the OnlpLab team. The model obtains state-of-the-art results on segmentation, Part of Speech tagging, Named Entity Recognition, and Sentiment Analysis. Github: https://github.com/OnlpLab/AlephBERT
  • AlephBERTGimmel {CC0 1.0} - a new Hebrew pre-trained language model, trained on the same dataset as the previous SOTA Hebrew PLM AlephBERT, consisting of approximately 2 billion words of text but with a substantially increased vocabulary of 128,000 word pieces. Published as a collaboration of the OnlpLab team and Dicta. Github: https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
  • TavBERT {MIT} - a BERT-style masked language model over character sequences, published by Omri Keren, Tal Avinari, Prof. Reut Tsarfaty and Dr. Omer Levy.
  • Verb Inflector [Java] {Apache License 2.0} - A generation mechanism, created as part of Eran Tomer's ([email protected]) Master thesis, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base-form and an indication of which pattern the verb follows.
  • HebPipe [Python] {Apache License 2.0} - End-to-end pipeline for Hebrew NLP using off the shelf tools, including morphological analysis, tagging, lemmatization, parsing and more.
  • YAP morpho-syntactic parser [Go] {Apache License 2.0} - Morphological analysis, disambiguation, and dependency parsing. The morphological analyzer relies on the BGU Lexicon. [original repository] Demo
  • SPMRL to UD {Apache License 2.0} - Converts YAP's output from the SPMRL scheme to UD v2.
  • HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.
  • Hspell [?] {AGPL-3.0} - Free Hebrew linguistic project including spell checker and morphological analyzer. HspellPy [Python] {AGPL-3.0} - Python wrapper for Hspell.
  • DictaBERT-morph {CC BY 4.0} - A fine-tuned model for the morphological tagging task.
  • OtoBERT {CC BY 4.0} - Designed specifically for identifying suffixed verbal forms in Modern Hebrew.
  • Shtey Shekel {MIT} - Wikiproject for correcting grammar mistakes. (Heuristic) positive annotations can be derived from the query.
  • Legal-HeBERT {?} - A BERT model for Hebrew legal and legislative domains, intended to improve legal NLP research and tool development in Hebrew. Avichay Chriqui, Dr. Inbal Yahav Shenberger and Dr. Ittai Bar-Siman-Tov released two versions of Legal-HeBERT: the first is a fine-tuned version of HeBERT applied to legal and legislative documents; the second uses HeBERT's architecture guidelines to train a BERT model from scratch.
  • NeMo-text-processing {Apache License 2.0} - Verbit extended the NeMo-text-processing Python package with WFST-based Hebrew inverse text normalization (ITN). ITN is part of the Automatic Speech Recognition (ASR) post-processing pipeline and converts spoken form to written form to improve text readability.
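For intuition, a toy inverse-text-normalization step might rewrite spoken-form number words as digits. The mapping below is invented and covers only a handful of words; the WFST-based package handles full grammars (dates, currency, ordinals, and so on):

```python
# Toy ITN: rewrite spoken-form Hebrew number words as digits.
# The vocabulary here is a tiny invented sample, not NeMo's grammar.
HE_NUMBERS = {'אפס': '0', 'אחת': '1', 'שתיים': '2', 'שלוש': '3', 'ארבע': '4',
              'חמש': '5', 'שש': '6', 'שבע': '7', 'שמונה': '8', 'תשע': '9',
              'עשר': '10'}

def itn(text):
    return ' '.join(HE_NUMBERS.get(word, word) for word in text.split())

print(itn('פגישה בשעה שלוש'))  # פגישה בשעה 3
```

A real WFST composes such rewrites with context rules, so that, e.g., number words inside idioms are left untouched.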
  • HeRo {?} - RoBERTa based language model for Hebrew; Fine-tuned for sentiment analysis, named entity recognition and question answering.
  • LongHeRo {?} - State-of-the-art Longformer language model for Hebrew.
  • Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew - The weights (i.e., a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model. Trained on the Hebrew Wikipedia corpus.
  • Hebrew Psychological Lexicons {Apache License 2.0} - Easy-to-use Python interface for Hebrew clinical psychology text analysis. Useful for various psychology applications such as detecting emotional state, well-being, or relationship quality in conversation, identifying topics (e.g., family, work), and many more.
  • HebTTS [Python] {Apache License 2.0} - A diacritics-free (niqqud-free) language-modeling approach to Hebrew Text-To-Speech (TTS). The model operates on discrete speech representations and is conditioned on a word-piece tokenizer.
  • Text-Fabric [Python] {CC BY-NC 4.0} - A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.
  • Hebrew OCR with Nikud [Python] {?} - A program to convert Hebrew text files (without Nikud) to text files with the correct Nikud. Developed by Adi Oz and Vered Shani.
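Niqqud marks are Unicode combining characters, so the reverse direction (removing niqqud to obtain the undotted text that many of the tools in this list expect as input) fits in a few stdlib lines. This is a generic sketch, not part of the tool above:

```python
import unicodedata

def strip_nikud(text):
    """Remove Hebrew points (niqqud), which are combining marks (category Mn)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

# שָׁלוֹם written with explicit point characters, for clarity:
dotted = '\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd'
print(strip_nikud(dotted))  # שלום
```

The hard direction, which the OCR tool addresses, is restoring the points: that requires a language model, since the same undotted string can correspond to several dotted readings.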
  • Hebrew-Mistral-7B-200K {Apache License 2.0} - An open-source Large Language Model (LLM) pretrained on Hebrew and English, created by Yam Peleg. It has 7 billion parameters and a 200K context length, and is based on Mistral-7B-v0.1 from Mistral. It has an extended Hebrew tokenizer with 64,000 tokens and is continuously pretrained from Mistral-7B on tokens in both English and Hebrew. The resulting model is a powerful general-purpose language model suitable for a wide range of natural language processing tasks, with a focus on Hebrew language understanding and generation.
  • Dicta-LM 2.0 Collection {Apache License 2.0} - Generative language models specifically optimized for Hebrew.
  • word2word {Apache License 2.0} - Easy-to-use Python interface for accessing top-k word translations and for building a new bilingual lexicon from a custom parallel corpus.
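The lexicon-building idea can be sketched with plain co-occurrence counting over a tiny, invented parallel corpus. word2word itself uses stronger statistics than raw counts, so this is only the core intuition:

```python
from collections import Counter, defaultdict

# Invented three-pair English-Hebrew parallel corpus for illustration.
parallel = [
    ('the book', 'הספר'),
    ('the boy', 'הילד'),
    ('a book', 'ספר'),
]

# Count how often each source word co-occurs with each target word.
cooc = defaultdict(Counter)
for en, he in parallel:
    for e in en.split():
        for h in he.split():
            cooc[e][h] += 1

# 'book' co-occurs once each with הספר and ספר; ranking such counts
# (with better statistics and much more data) yields the lexicon.
print(dict(cooc['book']))
```

Raw co-occurrence over-weights frequent function words like 'the', which is exactly the problem the package's scoring is designed to correct.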
  • HeArBERT {?} - A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.
  • HeBERT {MIT} - A Hebrew pretrained language model for polarity analysis and emotion recognition, published by Dr. Inbal Yahav Shenberger and Avichay Chriqui. It is based on Google's BERT architecture in the BERT-Base configuration. HeBERT was trained on three datasets: OSCAR, a Hebrew dump of Wikipedia, and emotion User Generated Content (UGC) data collected for the purpose of the study. The model was evaluated on downstream tasks: HebEMO (an emotion recognition model) and sentiment analysis. (https://huggingface.co/avichr/heBERT)
  • BEREL {?} - BERT Embeddings for Rabbinic-Encoded Language - DICTA's pre-trained language model (PLM) for Rabbinic Hebrew.
  • DictaBERT {CC BY 4.0} - A base model pretrained with the masked-language-modeling objective.
  • Criminal Sentence Classification {OpenRAIL} - This project classifies key aspects of criminal cases within the Israeli legal framework. The project leverages a few-shot learning approach for accurate sentence classification relevant to sentencing decisions.
  • MsBERT {CC BY 4.0} - A dedicated pretrained BERT model, dubbed MsBERT (short for Manuscript BERT), designed from the ground up to handle Hebrew manuscript text. MsBERT substantially outperforms existing Hebrew BERT models at predicting missing words in fragmentary Hebrew manuscript transcriptions across multiple genres, as well as at differentiating between quoted passages and exegetical elaborations. The authors provide MsBERT for free download and unrestricted use, together with an interactive, user-friendly website that lets manuscript scholars apply it to reconstructing fragmentary Hebrew manuscripts.
  • Hebrew GPT neo {MIT} - Doron Adler's Hebrew text generation model based on EleutherAI's gpt-neo.
  • DICTA {CC-BY-SA 4.0} - Analytical tools for Jewish texts. They also have a GitHub organization.
  • wordfreq 3.0.3 {MIT} - A Python library for looking up the frequencies of words in 44 languages, including Hebrew. The Hebrew data is based on Wikipedia, OPUS OpenSubtitles 2018, SUBTLEX, Google Books Ngrams 2012, web text from OSCAR, and Twitter.
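wordfreq reports word frequencies on the Zipf scale, defined as the base-10 logarithm of the word's frequency per billion words, which is simple enough to verify by hand:

```python
import math

def zipf_from_per_million(count_per_million):
    """Zipf value = log10 of the word's frequency per billion words."""
    return math.log10(count_per_million * 1000)

# A word appearing 1,000 times per million words has Zipf 6.0;
# one appearing once per million words has Zipf 3.0.
print(zipf_from_per_million(1000), zipf_from_per_million(1))  # 6.0 3.0
```

The logarithmic scale keeps common and rare words on comparable footing: each Zipf point is a tenfold change in frequency.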
  • Eyfo - A commercial engine for search and entity tagging in Hebrew.
  • Melingo's ICA (Intelligent Content Analysis) - A text analysis and textual categorized entity extraction API for Hebrew, Arabic and Farsi texts.
  • Genius - Automatic analysis of free text in Hebrew.
  • AlmaReader - Online text-to-speech service for Hebrew.
  • Amnon The Transcriber - a WhatsApp bot that receives a voice note and transcribes it to text.
  • Callee - a WhatsApp bot that receives a voice note, transcribes it to text, and also summarizes it (as text).
  • verbit.ai - Transcription.
  • Text Analytics for health containers
  • Hebrew-Nlp