
How to use the UD_Portuguese-Bosque model with CoreNLP to extract relation info? #416

Open
alvieirajr opened this issue Jul 30, 2024 · 3 comments


@alvieirajr

I don't want to hand-write rules for extracting relation triples, as we do with spaCy in the example below (the reason is that there are many of them and I don't have the proficiency to write them all):

            # (...)
            # Extract relations based on nouns and prepositions
            if token.dep_ == "prep":
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                obj = [w for w in token.rights if w.dep_ == "pobj"]
                if subject and obj:
                    relations.append((subject[0].text, token.text, obj[0].text))

            # Extract relations based on nouns and their predicatives
            if token.dep_ == "attr":
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subject:
                    relations.append((subject[0].text, token.head.lemma_, token.text))
            # (...)

Because of this, I want to use CoreNLP + Universal Dependencies to extract the relations. I'm using pt_bosque_models. Below are some details:

To start the server I'm using this command:

 java  -cp "stanford-corenlp-4.5.7.jar"  edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-portuguese.properties -port 9000 -timeout 15000 

My StanfordCoreNLP-portuguese.properties file content is:

annotators = tokenize,ssplit,pos,lemma,depparse
#tokenize.language = pt
ssplit.eolonly = true
# Dependency model
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Portuguese-Bosque.gz

The following files are in UD_Portuguese-Bosque.gz:

LICENSE.txt
pt_bosque-ud-dev.conllu
pt_bosque-ud-dev.txt
pt_bosque-ud-test.conllu
pt_bosque-ud-test.txt
pt_bosque-ud-train.conllu
pt_bosque-ud-train.txt
README.md
stats.xml
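(For reference, the .conllu files above are plain-text CoNLL-U: one token per line with 10 tab-separated columns, where column 7 is the head index and column 8 is the dependency relation. A minimal reader sketch, using an invented three-token Portuguese sentence as sample data:)

```python
def read_conllu(lines):
    """Yield sentences as lists of (id, form, lemma, upos, head, deprel) tuples."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # a blank line ends a sentence
            if sent:
                yield sent
                sent = []
        elif not line.startswith("#"):    # skip comment lines like "# text = ..."
            cols = line.split("\t")
            if cols[0].isdigit():         # skip multiword ranges like "1-2"
                sent.append((int(cols[0]), cols[1], cols[2], cols[3],
                             int(cols[6]), cols[7]))
    if sent:
        yield sent

# Invented sample sentence laid out in CoNLL-U columns
sample = [
    "# text = O gato dorme",
    "1\tO\to\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tgato\tgato\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tdorme\tdormir\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
sentences = list(read_conllu(sample))
```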

This is my example Python request script:

import requests

# Stanford CoreNLP server URL
url = 'http://[::1]:9000'

# Example sentence
sentence = "Qual é a opinião de Carl Sagan sobre a possibilidade de formas de vida baseadas em elementos diferentes do carbono e água?"

# Request parameters
params = {
    'annotators': 'depparse,ner',
}

# Request data
data = {
    'data': sentence
}
# Send the request to the CoreNLP server
response = requests.post(url, params=params, data=data)

# Check whether the request succeeded
print(response)
if response.status_code == 200:
    result = response.json()
    for sentence in result['sentences']:
        for triple in sentence['openie']:
            print("Extracted relation:", triple['subject'], triple['relation'], triple['object'])
else:
    print("Error calling the Stanford CoreNLP server.")
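(Aside: the CoreNLP server's documented request shape puts the text in the raw POST body and the annotator settings in a JSON-encoded `properties` query parameter. A minimal sketch of building such a request with `requests`, shown prepared but not sent, since it needs a running server:)

```python
import json
import requests

# Annotator settings travel as one JSON-encoded "properties" query parameter
props = {"annotators": "tokenize,ssplit,pos,lemma,depparse",
         "outputFormat": "json"}

# The sentence itself goes in the raw POST body, UTF-8 encoded
req = requests.Request(
    "POST",
    "http://localhost:9000",
    params={"properties": json.dumps(props)},
    data="Qual é a opinião de Carl Sagan?".encode("utf-8"),
).prepare()

# With a server listening on port 9000, this would send it:
# response = requests.Session().send(req)
```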

The Problem:

If "depparse" is present in the params I get this error:

java.lang.NumberFormatException: For input string: "MDA4MTs2NmE2ZTExZDtDaHJvbWU7"
  java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  java.base/java.lang.Integer.parseInt(Integer.java:668)
  java.base/java.lang.Integer.parseInt(Integer.java:786)
  edu.stanford.nlp.parser.nndep.DependencyParser.loadModelFile(DependencyParser.java:539)
  edu.stanford.nlp.parser.nndep.DependencyParserCache$DependencyParserSpecification.loadModelFile(DependencyParserCache.java:53)
  edu.stanford.nlp.parser.nndep.DependencyParserCache.loadFromModelFile(DependencyParserCache.java:76)
  edu.stanford.nlp.parser.nndep.DependencyParser.loadFromModelFile(DependencyParser.java:498)

If only "ner" is present in the params, the request returns without errors but with no relation info: I get only entities and tokens, and the result has no "openie" key, which raises an error on the line "for triple in sentence['openie']:".

Any suggestions?

@AngledLuffa

There are multiple issues here:

  • This is effectively a Stanford NLP question, and should be posted there, not here
  • It'll be me answering that question anyway, so let's take a stab at some of the other items
  • CoreNLP doesn't have PT models, and even if we made PT CoreNLP models there wouldn't be an equivalent OpenIE annotator unless you wrote it, so effectively you'll want to use the Python ecosystem
  • To use the Stanford Python library, either for searching dependencies or for parsing text and then searching the dependencies, you likely want to use Stanza, not the older StanfordNLP. It says that in huge font when you go to the StanfordNLP GitHub, but admittedly we could go even further and put some kind of user-friendly go-away message when someone pip installs stanfordnlp
  • The next question is, do you want to process the existing dependencies in the treebank, or do you want to parse new text and process that with models trained from that dependency treebank?
  • If you are parsing existing trees from the Bosque treebank, you can search them using the semgrex interface.
  • If you want to parse new text, you first run depparse, then use semgrex

To be entirely honest, I'm not familiar with the SpaCy dependency graph. But for the first relation, it looks kind of like you want a head with 2 children, nsubj and prep, and the prep child itself has a pobj child. That's quite easy to find with semgrex:

{} >nsubj {}=first >prep ({}=second >pobj {}=third)
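(To make that pattern concrete: it asks for a head with an nsubj child and a prep child that itself has a pobj child. The same search, hand-rolled in Python over toy (head, relation, dependent) edges, with an invented English example since Bosque uses different relation names:)

```python
# Toy dependency edges (head_index, relation, dependent_index) for the
# invented sentence "cat eats on mat"; index 0 is the head word "eats"
words = ["eats", "cat", "on", "mat"]
edges = [(0, "nsubj", 1), (0, "prep", 2), (2, "pobj", 3)]

def children(head, rel):
    """Return the dependents of `head` attached via relation `rel`."""
    return [d for h, r, d in edges if h == head and r == rel]

# Mirror of the semgrex pattern {} >nsubj {}=first >prep ({}=second >pobj {}=third)
matches = []
for head in range(len(words)):
    for first in children(head, "nsubj"):
        for second in children(head, "prep"):
            for third in children(second, "pobj"):
                matches.append((words[first], words[second], words[third]))
```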

A couple of caveats: there are no prep or pobj relations in the Bosque treebank, so I'll leave it to you to figure out which triple you're actually trying to extract. Other relation patterns, and the constraints you can put on the words matched inside the {}, are documented in the SemgrexPattern Javadoc.

There are other dependency extraction toolkits, such as grew, and perhaps someone else here can walk you through using that if semgrex isn't satisfactory.

@alvieirajr
Author

alvieirajr commented Jul 30, 2024

Hi @AngledLuffa. My real problem is extracting dependencies from sentences without writing a bank of dependency rules myself (which is the spaCy use case). As a newbie in this area, I asked ChatGPT to suggest dependency-extraction rules in Portuguese to use with spaCy, but I don't know whether those suggestions are accurate or whether the rules will actually work for Portuguese. So I will try a Universal Dependencies model called PT_BOSQUE in CoNLL-U format, where some dependency rules already exist. The idea is to extract subj, obj, and rel automatically from short sentences.

I will consider your suggestions. Thanks a lot.

@AngledLuffa

AngledLuffa commented Jul 30, 2024 via email
