Note: This project is still in an early stage of development, so one should expect significant changes in the future, including backward incompatible ones. That said, the general concepts and design principles should remain the same or be extended, not changed or limited. Thus, the package is suitable for experimental usage.
Segram is a software implementation of a framework for automated semantics-oriented grammatical analysis of text data. It is implemented in Python and based on the excellent spacy package, which is used to solve core NLP tasks such as tokenization, lemmatization, dependency parsing and coreference resolution.
- Automated grammatical analysis in terms of phrases/clauses focused on detecting actions as well as subjects and objects of those actions.
- Flexible filtering and matching with queries expressible in terms of properties of subjects, verbs, objects, prepositions and descriptions applicable at the levels of individual phrases and entire sentences.
- Semantic-oriented organization of analyses in terms of stories and frames.
- Data serialization framework allowing for reconstructing all segram data after an initial parsing without access to any spacy language model.
- Structured vector similarity model based on weighted averages of cosine similarities between different components of phrases/sentences (several algorithms based on somewhat different notions of what it means for sentences or phrases to be similar are available).
- Structured vector similarity model for comparing documents in terms of sequentially shifting semantics.
- Hypergraphical representation of grammatical structure of sentences.
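
The idea of a similarity score built from weighted averages of cosine similarities between components can be illustrated with a minimal sketch. This is not segram's API: the function names, the component names (`subject`, `verb`, `object`), the weights and the toy vectors are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def structured_similarity(x, y, weights):
    """Weighted average of per-component cosine similarities.

    `x` and `y` map hypothetical component names to vectors. Only components
    that have a weight and are present in both inputs are compared.
    """
    shared = [k for k in weights if k in x and k in y]
    total = sum(weights[k] for k in shared)
    if total == 0:
        return 0.0
    return sum(weights[k] * cosine(x[k], y[k]) for k in shared) / total


# Toy 2-dimensional "embeddings", purely illustrative
s1 = {"subject": [1.0, 0.0], "verb": [0.0, 1.0], "object": [1.0, 1.0]}
s2 = {"subject": [1.0, 0.0], "verb": [1.0, 0.0], "object": [1.0, 1.0]}
weights = {"subject": 1.0, "verb": 2.0, "object": 1.0}

# Subjects and objects match, the verbs are orthogonal and carry half
# of the total weight, so the score is close to 0.5
score = structured_similarity(s1, s2, weights)
print(score)
```

Changing the weights changes which parts of a sentence dominate the comparison, which is one way different notions of similarity could be expressed.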
| Package | Version |
|---|---|
| python | >=3.11 |
| spacy | >=3.4 |
The required Python version is not expected to change for the
foreseeable future, so by the time the package becomes fully
mature the dependency on python>=3.11 should not be too demanding
(although it may be bumped to >=3.12, as the new release is expected
soon as of the time of writing, 29.09.2023).
Segram comes with a coreference resolution component based on an
experimental model provided by the spacy-experimental package.
However, both at the level of segram and spacy this is currently
an experimental feature, which comes with a significant price tag attached.
Namely, the range of acceptable spacy versions is significantly limited
(see the table below). However, as spacy-experimental gets integrated
into the spacy core in the future, these constraints will be relaxed.
| Package | Version |
|---|---|
| spacy | >=3.4,<3.5 |
| spacy-experimental | 0.6.3 |
| en_coreference_web_trf | 3.4.0a2 |
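
A constraint such as spacy `>=3.4,<3.5` can be checked before loading any models. The sketch below uses only the standard library. The helper names are hypothetical, and the version parsing is deliberately simplified: it compares only the leading numeric release parts and ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Parse the leading numeric release parts, e.g. '3.4.0a2' -> (3, 4, 0)."""
    parts = []
    for chunk in text.split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version, lower, upper):
    """True if lower <= version < upper, comparing numeric release tuples."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


def check_installed(package, lower, upper):
    """Return True/False for an installed package, or None if it is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return in_range(installed, lower, upper)


print(in_range("3.4.4", "3.4", "3.5"))  # inside the allowed range
print(in_range("3.5.0", "3.4", "3.5"))  # outside the allowed range
print(check_installed("spacy", "3.4", "3.5"))  # depends on the local environment
```

For anything beyond a quick sanity check, a dedicated version-handling library is a more robust choice than hand-rolled parsing.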
Currently, only English is supported and segram was tested on the following models:

- en_core_web_trf>=3.4.1 (transformer-based model for general NLP)
- en_core_web_lg>=3.4.1 (used for context-free word vectors)
- en_coreference_web_trf==3.4.0a2 (for coreference resolution)
Install from PyPI:

```bash
pip install segram
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg  # skip if word vectors are not needed
```

With coreference resolution and GPU support:

```bash
pip install "segram[coref,gpu]"
# Just one of the two options can also be selected
# And language models
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg
# The last one is a special model for the coref component
pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl
```

Development version from GitHub:

```bash
pip install git+ssh://git@github.com/sztal/segram.git
# + downloading language models
```

```bash
pip install "segram[gpu,coref] @ git+ssh://git@github.com/sztal/segram.git"
```

Requirements for the examples:

```bash
pip install -r requirements/examples.txt
```

Basic usage:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("segram", config={
    "vectors": "en_core_web_lg"
})
nlp.add_pipe("segram_coref")

# Get standard 'spacy' document
doc = nlp(
    "The merchants travelled a long way to buy spices "
    "and rest in our taverns."
)
# Convert it to segram 'grammar' document
doc = doc._.segram
doc
```

The code above parses the text using spacy and additionally applies
further processing pipeline components defined by segram. They inject
many additional functionalities into standard spacy tokens.
In particular, Doc instances are enhanced with a special extension
property ._.segram, which converts them to segram grammar documents.
Note that the printed result is different now: the output is colored!
The colors denote the partition of the document into components, which are groups of related tokens headed by a syntactically and/or semantically important token. Components are divided into four distinct types, which are marked with different colors when printing to the console. The default color scheme is:
- $\text{\color{orange}\bf Noun components}$
- $\text{\color{red}\bf Verb components}$
- $\text{\color{violet}\bf Description components}$
- $\text{\color{limegreen}\bf Preposition components}$
Components are further organized into phrases, which are higher-order and more semantically-oriented units. Crucially, while components are non-overlapping and form a partition of the sentence, the phrases can be nested in each other and form a directed acyclic graph (DAG).
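
The two structural properties described above (components partition the tokens, phrases form a DAG) can be checked on plain Python data. The sketch below does not use segram objects: the token indices, the component groups and the phrase links are hand-written, hypothetical examples rather than actual segram output.

```python
from graphlib import CycleError, TopologicalSorter


def is_partition(components, n_tokens):
    """True if the token-index groups do not overlap and cover 0..n_tokens-1."""
    seen = [i for group in components for i in group]
    return len(seen) == len(set(seen)) and set(seen) == set(range(n_tokens))


def is_dag(children):
    """True if the parent -> children mapping contains no directed cycle."""
    try:
        # static_order() raises CycleError when the graph has a cycle;
        # edge direction does not matter for cycle detection
        list(TopologicalSorter(children).static_order())
    except CycleError:
        return False
    return True


# Hypothetical grouping of 6 tokens into 4 components
components = [[0, 1], [2], [3, 4], [5]]

# Hypothetical phrase nesting: "merchants" is shared by several parent
# phrases, so the structure is a DAG rather than a tree
phrases = {
    "travelled": ["merchants", "buy", "rest"],
    "buy": ["merchants", "spices"],
    "rest": ["merchants", "taverns"],
}

print(is_partition(components, 6))
print(is_dag(phrases))
```

The shared `merchants` node shows why a tree is not enough here: one phrase can be nested in several others without creating a cycle.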
Examples are Jupyter notebooks with some sample analyses and tutorials. Below are instructions for setting up an environment sufficient for running the notebooks.
```bash
git clone git@github.com:sztal/segram.git
cd segram
conda env create -f environment-coref.yml  # default env name is 'segram'
# In this case the versions of all language models are fixed
# so they are installed automatically with the rest of the dependencies
conda activate segram
pip install --editable .
# OR to allow for GPU acceleration (quoted so the shell does not expand the brackets):
pip install --editable ".[gpu]"
# Finally, install some extra dependencies used in the notebooks
pip install -r requirements/docs.txt
```

See development and contributing guidelines.

If you have any suggestions or questions about segram, feel free to email
me at <[email protected]>.
If you encounter any errors or problems, please let me know by opening an issue in the GitHub repository.
- Szymon Talaga, [email protected]

