move workbench to dev
arademaker committed Oct 9, 2023
1 parent 8abc5c9 commit d122d83
Showing 4,005 changed files with 304,397 additions and 0 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
140 changes: 140 additions & 0 deletions not-to-release/HISTORY.txt
@@ -0,0 +1,140 @@
This is an improved version of Bosque 7.5 UD, originally available
from Linguateca at the following URLs:

http://www.linguateca.pt/Floresta/ficheiros/bosque_CP.udep.conll.gz
http://www.linguateca.pt/Floresta/ficheiros/bosque_CF.udep.conll.gz

Issues found in the original Bosque 7.5 UD have been opened as issues
in this project, with the prefix [bosque-ud]; they have either been
fixed at the source (PALAVRAS and/or UD conversion) or worked around
by the script "fix-errors.sh".

Manually editing the file is discouraged and should only be done as a
last resort.

For reference:

1. bosque_CP.udep.conll.gz: Bick's version of the European Portuguese
part of Bosque 7.5, UD-annotated, available at
http://www.linguateca.pt/Floresta/ficheiros/bosque_CP.udep.conll.gz

2. bosque_CF.udep.conll.gz: Bick's version of the Brazilian Portuguese
part of Bosque 7.5, UD-annotated, available at
http://www.linguateca.pt/Floresta/ficheiros/bosque_CF.udep.conll.gz

3. Dan Zeman's version of Bosque CoNLL (7.3), available at
https://github.com/UniversalDependencies/UD_Portuguese

4. Linguateca Version of Bosque CoNLL (7.3),
http://www.linguateca.pt/floresta/CoNLL-X/

The CG-converted UD Portuguese treebank is originally based on an
improved and enriched version of the 7.4 dependency version of the
revised Bosque part of the Floresta Sintá(c)tica treebank
(cf. Linguateca.pt). In 2006-2008, version 7.4 was aligned with a new
live run of the PALAVRAS parser in order to propagate morphological features
from unambiguous to ambiguous words, and to add what the Floresta team
called "searchables", i.e. tags for features distributed across
several tokens, such as NP definiteness and complex tenses. The public
treebank only used this for the constituent version, which was the one
actively revised by the Floresta team until 2008 (Linguateca.pt
version 8.0).

Since 2008 Eckhard Bick has maintained an experimental version of the
dependency Bosque for semantic and other research, and made further
revisions to it, which were not aligned with either the constituent
version or the published 7.4 dependency version. At the beginning of
2016, Eckhard Bick wrote UD conversion rules for Constraint Grammar
input, and applied these to the updated version of the dependency
Bosque (Linguateca.pt version 7.5 of March 2016).

In a team effort in October 2016, Alexandre Rademaker, Cláudia
Freitas, Fabricio Chalub, Valeria de Paiva and Livy Maria Real Coelho,
aiming at full compatibility with CoNLL UD specifications,
consistency-checked and discussed the 7.5 UD Bosque, leading to a
further round of manual treebank corrections and conversion rule
changes by Eckhard Bick. The conversion grammar ultimately used
contained some 530 rules. Of these, 70 were simple feature mapping
rules, and 130 were local MWE splitting rules, assigning internal
structure, POS and features to MWEs from Bosque. The remainder of the
rules handle UD-specific dependency and function label changes in a
context-dependent fashion, the main issues being raising of copula
dependents to subject complements, inversion of prepositional
dependency and a change from syntactic to semantic verb chain
dependency.

The new UD treebank retains the additional tags for NP definiteness
and complex tenses, as well as the original syntactic functions tags
and secondary morphological tags. This way, the treebank retains its
original linguistic focus in addition to the machine learning uses
targeted by the CoNLL UD format. For instance, conjuncts and roots
still feature a direct function tag (e.g. a verb complement role for a
conjunct or "question" for a root). In cases where UD does not
distinguish between form and function, e.g. n/np adverbial modifiers,
where UD "duplicates" noun POS as 'nmod' function, the Bosque function
tag for free adverbial, adject or adverbial object is retained in
field 4 (@tags). Finally, some lost valency relations may be recovered
from an underspecified UD tag, e.g. the core clause arguments
"prepositional object" ('gostar de ARG') and valency-bound adverbial
('morar em ARG').

CONTRIBUTORS

The conversion was implemented by Eckhard Bick and revised by:

- Claudia Freitas
- Eckhard Bick
- Fabricio Chalub
- Alexandre Rademaker
- Livy Real
- Valeria Paiva

CHANGELOG

2016-10-31 v1.4
* Initial UD release.

LICENSE

See file LICENSE.txt

REFERENCES

- https://github.com/own-pt/bosque-UD (development of this corpus)

- http://www.linguateca.pt/Floresta/ (Floresta Treebank repository)

- http://visl.sdu.dk/tagset_cg_general.pdf (non-UD tags in field 4)

- http://visl.sdu.dk/constraint_grammar.html (cg3 compiler used for
the conversion grammar)

- http://visl.sdu.dk/visl/pt/parsing/automatic/ (PALAVRAS parser used
to create input trees for the manually revised Bosque treebank)

- Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos (2002),
Floresta sintá(c)tica: a treebank for Portuguese
<http://visl.sdu.dk/%7Eeckhard/pdf/AfonsoetalLREC2002.ps.pdf>, In
/Proceedings of LREC'2002, Las Palmas/. pp. 1698-1703, Paris: ELRA

- Freitas, Cláudia & Rocha, Paulo & Bick, Eckhard (2008), "Floresta
Sintá(c)tica: Bigger, Thicker and Easier", in: António Teixeira et
al. (eds.) /Computational Processing of the Portuguese Language/
(Proceedings of PROPOR 2008, Aveiro, Sept. 8th-10th, 2008),
pp.216-219. Springer

- Bick, Eckhard (2014). PALAVRAS, a Constraint Grammar-based Parsing
System for Portuguese. In: Tony Berber Sardinha & Thelma de Lurdes
São Bento Ferreira (eds.), /Working with Portuguese Corpora/, pp.
279-302. London/New York: Bloomsbury Academic. ISBN
978-1-4411-9050-5

--- Machine readable metadata ---
Documentation status: partial
Data source: semi-automatic
Data available since: UD v1.4
License: CC BY-SA 4.0
Genre: news blog
Contributors: Freitas, Claudia; Bick, Eckhard; Chalub, Fabricio; Rademaker, Alexandre; Real, Livy; Paiva, Valeria
Contact: [email protected]
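
HISTORY.txt states that the original Bosque function tags are retained in "field 4 (@tags)". In the sample file of this commit, those tags sit in the fifth CoNLL-U column (XPOS, index 4 when counting from zero) as a pipe-separated list, which appears to be what the text refers to. The following is a minimal sketch of pulling them out of one token line. It assumes the columns are tab-separated, as the CoNLL-U format requires (the diff view above shows them as spaces), and the function name `bosque_function_tags` is hypothetical, not part of this repository's scripts.

```python
def bosque_function_tags(token_line: str) -> list[str]:
    """Return the '@'-prefixed Bosque function tags of one CoNLL-U token line.

    Assumption: the line has the 10 tab-separated CoNLL-U columns and the
    Bosque tags are stored pipe-separated in the XPOS column (index 4).
    """
    cols = token_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    xpos = cols[4]
    if xpos == "_":  # underscore marks an empty field in CoNLL-U
        return []
    return [tag for tag in xpos.split("|") if tag.startswith("@")]


# Token 4 of sentence CF1-1 ("governo"), rebuilt here with tab separators.
example = "\t".join([
    "4", "governo", "governo", "NOUN", "<np-def>|N|M|S|@P<",
    "Gender=Masc|Number=Sing", "1", "nmod", "_", "_",
])
print(bosque_function_tags(example))
```

Tokens whose XPOS column is empty (`_`), such as `Pesquisa` in sentence CF1-3, simply yield an empty list.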

27 changes: 27 additions & 0 deletions not-to-release/README.org
@@ -0,0 +1,27 @@
#+Title: README
#+Author: Alexandre Rademaker

* files

- scripts/ :: contains all the scripts used to process, validate, and
convert Bosque files.

- documents/ :: contains all the individual documents of the Bosque
corpus. All changes to the corpus should be made to these files.

* Current Team

- Alexandre Rademaker
- Leonel Figueiredo de Alencar

* Previous Contributors

- Claudia Freitas
- Eckhard Bick
- Fabricio Chalub
- Henrique Muniz
- Isabela Soares Bastos
- Livy Real
- Luísa Rocha
- Valeria de Paiva
- Wellington Silva
7 changes: 7 additions & 0 deletions not-to-release/documents/.dir-locals.el
@@ -0,0 +1,7 @@
;;; Directory Local Variables
;;; For more information see (info "(emacs) Directory Variables")

((conllu-mode
  (conllu-flycheck-on? . yes)
  (conllu-flycheck-lang . "pt")))

153 changes: 153 additions & 0 deletions not-to-release/documents/CF0001.conllu
@@ -0,0 +1,153 @@
# newdoc_id = CF1
# text = PT no governo
# sent_id = CF1-1
# source = CETENFolha n=1 cad=Opinião sec=opi sem=94a
1 PT PT PROPN PROP|M|S|@NPHR Gender=Masc|Number=Sing 0 root _ _
2-3 no _ _ _ _ _ _ _ _
2 em em ADP <sam->|PRP|@N< _ 4 case _ _
3 o o DET <-sam>|<artd>|ART|M|S|@>N Definite=Def|Gender=Masc|Number=Sing|PronType=Art 4 det _ _
4 governo governo NOUN <np-def>|N|M|S|@P< Gender=Masc|Number=Sing 1 nmod _ _

# text = BRASÍLIA Pesquisa Datafolha publicada hoje revela um dado supreendente: recusando uma postura radical, a esmagadora maioria (77%) dos eleitores quer o PT participando do Governo Fernando Henrique Cardoso.
# sent_id = CF1-3
# source = CETENFolha n=1 cad=Opinião sec=opi sem=94a &W
1 BRASÍLIA Brasília PROPN PROP|F|S|@ADVL> Gender=Fem|Number=Sing 6 parataxis _ _
2 Pesquisa Pesquisa PROPN _ ExtPos=PROPN|Gender=Fem|Number=Sing 6 nsubj _ _
3 Datafolha Datafolha PROPN _ Number=Sing 2 flat:name _ _
4 publicada publicar VERB <mv>|V|PCP|F|S|@ICL-N< Gender=Fem|Number=Sing|VerbForm=Part 2 acl _ _
5 hoje hoje ADV ADV|@<ADVL _ 4 advmod _ _
6 revela revelar VERB <mv>|V|PR|3S|IND|@FS-STA Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
7 um um DET <arti>|ART|M|S|@>N Definite=Ind|Gender=Masc|Number=Sing|PronType=Art 8 det _ _
8 dado dado NOUN <np-idf>|N|M|S|@<ACC Gender=Masc|Number=Sing 6 obj _ _
9 supreendente surpreendente ADJ ADJ|M|S|@N< Gender=Masc|Number=Sing|Typo=Yes 8 amod _ CorrectForm=surpreendente|SpaceAfter=No
10 : : PUNCT PU|@PU _ 26 punct _ _
11 recusando recusar VERB <mv>|V|GER|@ICL-ADVL> VerbForm=Ger 26 advcl _ _
12 uma um DET <arti>|ART|F|S|@>N Definite=Ind|Gender=Fem|Number=Sing|PronType=Art 13 det _ _
13 postura postura NOUN <np-idf>|N|F|S|@<ACC Gender=Fem|Number=Sing 11 obj _ _
14 radical radical ADJ ADJ|F|S|@N< Gender=Fem|Number=Sing 13 amod _ SpaceAfter=No
15 , , PUNCT PU|@PU _ 26 punct _ _
16 a o DET <artd>|ART|F|S|@>N Definite=Def|Gender=Fem|Number=Sing|PronType=Art 18 det _ _
17 esmagadora esmagador ADJ ADJ|F|S|@>N Gender=Fem|Number=Sing 18 amod _ _
18 maioria maioria NOUN <np-def>|N|F|S|@SUBJ> Gender=Fem|Number=Sing 26 nsubj _ _
19 ( ( PUNCT PU|@PU _ 21 punct _ SpaceAfter=No
20 77 77 NUM <card>|NUM|M|P|@>N NumType=Card 21 nummod _ SpaceAfter=No
21 % % SYM <np-def>|N|M|P|@N<PRED _ 18 appos _ SpaceAfter=No
22 ) ) PUNCT PU|@PU _ 21 punct _ _
23-24 dos _ _ _ _ _ _ _ _
23 de de ADP <sam->|PRP|@N< _ 25 case _ _
24 os o DET <-sam>|<artd>|ART|M|P|@>N Definite=Def|Gender=Masc|Number=Plur|PronType=Art 25 det _ _
25 eleitores eleitor NOUN <np-def>|N|M|P|@P< Gender=Masc|Number=Plur 18 nmod _ _
26 quer querer VERB <mv>|V|PR|3S|IND|@FS-N<PRED Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 8 acl:relcl _ _
27 o o DET <artd>|ART|M|S|@>N Definite=Def|Gender=Masc|Number=Sing|PronType=Art 28 det _ _
28 PT PT PROPN PROP|M|S|@<ACC Gender=Masc|Number=Sing 29 nsubj _ _
29 participando participar VERB <mv>|V|GER|@ICL-<OC VerbForm=Ger 26 ccomp _ _
30-31 do _ _ _ _ _ _ _ _
30 de de ADP <sam->|PRP|@<PIV _ 32 case _ _
31 o o DET <-sam>|<artd>|ART|M|S|@>N Definite=Def|Gender=Masc|Number=Sing|PronType=Art 32 det _ _
32 Governo governo NOUN <prop>|<np-def>|N|M|S|@P< Gender=Masc|Number=Sing 29 obj _ _
33 Fernando Fernando PROPN _ ExtPos=PROPN|Gender=Masc|Number=Sing 32 nmod _ _
34 Henrique Henrique PROPN _ Number=Sing 33 flat:name _ _
35 Cardoso Cardoso PROPN _ Number=Sing 33 flat:name _ SpaceAfter=No
36 . . PUNCT PU|@PU _ 6 punct _ _

# text = Tem sentido -- aliás, muitíssimo sentido.
# sent_id = CF1-4
# source = CETENFolha n=1 cad=Opinião sec=opi sem=94a &D
1 Tem ter VERB <mv>|V|PR|3S|IND|@FS-STA Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
2 sentido sentido NOUN <np-idf>|N|M|S|@<ACC Gender=Masc|Number=Sing 1 obj _ _
3 -- -- PUNCT PU|@PU _ 1 punct _ _
4 aliás aliás ADV <kc>|ADV|@<ADVL _ 1 advmod _ SpaceAfter=No
5 , , PUNCT PU|@PU _ 7 punct _ _
6 muitíssimo muitíssimo DET <quant>|<SUP>|DET|M|S|@>N Gender=Masc|Number=Sing|PronType=Ind 7 det _ _
7 sentido sentido NOUN <np-idf>|N|M|S|@N<PRED Gender=Masc|Number=Sing 1 parataxis _ SpaceAfter=No
8 . . PUNCT PU|@PU _ 1 punct _ _

# text = Muito mais do que nos tempos na ditadura, a solidez do PT está, agora, ameaçada.
# sent_id = CF1-5
# source = CETENFolha n=1 cad=Opinião sec=opi sem=94a
1 Muito muito ADV <quant>|ADV|@>A _ 2 advmod _ _
2 mais mais ADV <quant>|<KOMP>|<COMP>|ADV|@ADVL> _ 22 advmod _ _
3-4 do _ _ _ _ _ _ _ _
3 de de ADP <sam->|PRP|@COM _ 8 case _ _
4 o o PRON <dem>|<-sam>|DET|M|S|@P< Gender=Masc|Number=Sing|PronType=Dem 3 fixed _ _
5 que que PRON <rel>|INDP|M|S|@N< Gender=Masc|Number=Sing|PronType=Rel 3 fixed _ _
6-7 nos _ _ _ _ _ _ _ _
6 em em ADP <sam->|<first-cjt>|PRP|@KOMP< _ 8 case _ _
7 os o DET <-sam>|<artd>|ART|M|P|@>N Definite=Def|Gender=Masc|Number=Plur|PronType=Art 8 det _ _
8 tempos tempo NOUN <first-cjt>|<np-def>|N|M|P|@P< Gender=Masc|Number=Plur 2 obl _ _
9-10 na _ _ _ _ _ _ _ _
9 em em ADP <sam->|PRP|@N< _ 11 case _ _
10 a o DET <-sam>|<artd>|ART|F|S|@>N Definite=Def|Gender=Fem|Number=Sing|PronType=Art 11 det _ _
11 ditadura ditadura NOUN <np-def>|N|F|S|@P< Gender=Fem|Number=Sing 8 nmod _ SpaceAfter=No
12 , , PUNCT PU|@PU _ 2 punct _ _
13 a o DET <artd>|ART|F|S|@>N Definite=Def|Gender=Fem|Number=Sing|PronType=Art 14 det _ _
14 solidez solidez NOUN <np-def>|N|F|S|@SUBJ> Gender=Fem|Number=Sing 22 nsubj _ _
15-16 do _ _ _ _ _ _ _ _
15 de de ADP <sam->|PRP|@N< _ 17 case _ _
16 o o DET <-sam>|<artd>|ART|M|S|@>N Definite=Def|Gender=Masc|Number=Sing|PronType=Art 17 det _ _
17 PT PT PROPN PROP|M|S|@P< Gender=Masc|Number=Sing 14 nmod _ _
18 está estar AUX <mv>|V|PR|3S|IND|@FS-STA Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 22 cop _ SpaceAfter=No
19 , , PUNCT PU|@PU _ 20 punct _ _
20 agora agora ADV <kc>|ADV|@<ADVL _ 22 advmod _ SpaceAfter=No
21 , , PUNCT PU|@PU _ 20 punct _ _
22 ameaçada ameaçar VERB <mv>|V|PCP|F|S|@ICL-<SC Gender=Fem|Number=Sing|VerbForm=Part 0 root _ SpaceAfter=No
23 . . PUNCT PU|@PU _ 22 punct _ _

# text = Nem Lula nem o partido ainda encontraram um discurso para se diferenciar.
# sent_id = CF1-6
# source = CETENFolha n=1 cad=Opinião sec=opi sem=94a
1 Nem nem CCONJ <parkc-1>|KC|@CO _ 2 cc _ _
2 Lula Lula PROPN <first-cjt>|PROP|M|S|@SUBJ> Gender=Masc|Number=Sing 7 nsubj _ _
3 nem nem CCONJ <co-subj>|<parkc-2>|KC|@CO _ 5 cc _ _
4 o o DET <artd>|ART|M|S|@>N Definite=Def|Gender=Masc|Number=Sing|PronType=Art 5 det _ _
5 partido partido NOUN <cjt>|<np-def>|N|M|S|@SUBJ> Gender=Masc|Number=Sing 2 conj _ _
6 ainda ainda ADV ADV|@ADVL> _ 7 advmod _ _
7 encontraram encontrar VERB <mv>|V|PS/MQP|3P|IND|@FS-STA Mood=Ind|Number=Plur|Person=3|VerbForm=Fin 0 root _ _
8 um um DET _ Definite=Ind|Gender=Masc|Number=Sing|PronType=Art 9 det _ _
9 discurso discurso NOUN <np-idf>|N|M|S|@<ACC Gender=Masc|Number=Sing 7 obj _ _
10 para para SCONJ _ _ 12 mark _ _
11 se se PRON PERS|M|3S|ACC|@ACC>-PASS Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs 12 expl _ _
12 diferenciar diferenciar VERB _ VerbForm=Inf 9 acl _ SpaceAfter=No
13 . . PUNCT PU|@PU _ 7 punct _ _

# text = Eles se dizem oposição, mas ainda não informaram o que vão combater.
# sent_id = CF1-7
# source = CETENFolha n=1 cad=Opinião sec=opi sem=94a
1 Eles eles PRON PERS|M|3P|NOM|@SUBJ> Case=Nom|Gender=Masc|Number=Plur|Person=3|PronType=Prs 3 nsubj _ _
2 se se PRON PERS|M|3P|ACC|@ACC>-PASS Case=Acc|Gender=Masc|Number=Plur|Person=3|PronType=Prs 3 expl _ _
3 dizem dizer VERB <first-cjt>|<mv>|<se-passive>|V|PR|3P|IND|@FS-STA Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
4 oposição oposição NOUN <np-idf>|N|F|S|@<OC Gender=Fem|Number=Sing 3 xcomp _ SpaceAfter=No
5 , , PUNCT PU|@PU _ 9 punct _ _
6 mas mas CCONJ <co-fcl>|KC|@CO _ 9 cc _ _
7 ainda ainda ADV ADV|@>A _ 8 advmod _ _
8 não não ADV _ Polarity=Neg 9 advmod _ _
9 informaram informar VERB <cjt>|<mv>|V|PS/MQP|3P|IND|@FS-STA Mood=Ind|Number=Plur|Person=3|VerbForm=Fin 3 conj _ _
10 o o PRON _ Gender=Masc|Number=Sing|PronType=Dem 9 obj _ _
11 que que PRON <interr>|INDP|M|S|@ACC> Gender=Masc|Number=Sing|PronType=Rel 13 obj _ _
12 vão ir AUX <aux>|V|PR|3P|IND|@FS-<ACC Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 13 aux _ _
13 combater combater VERB <mv>|V|INF|@ICL-AUX< VerbForm=Inf 10 acl:relcl _ SpaceAfter=No
14 . . PUNCT PU|@PU _ 3 punct _ _

# text = Muitas das prioridades do novo governo coincidem com as prioridades do PT.
# sent_id = CF1-8
# source = CETENFolha n=1 cad=Opinião sec=opi sem=94a
1 Muitas muito PRON <quant>|DET|F|P|@SUBJ> Gender=Fem|Number=Plur|PronType=Ind 9 nsubj _ _
2-3 das _ _ _ _ _ _ _ _
2 de de ADP <sam->|PRP|@N< _ 4 case _ _
3 as o DET <-sam>|<artd>|ART|F|P|@>N Definite=Def|Gender=Fem|Number=Plur|PronType=Art 4 det _ _
4 prioridades prioridade NOUN <np-def>|N|F|P|@P< Gender=Fem|Number=Plur 1 nmod _ _
5-6 do _ _ _ _ _ _ _ _
5 de de ADP <sam->|PRP|@N< _ 8 case _ _
6 o o DET <-sam>|<artd>|ART|M|S|@>N Definite=Def|Gender=Masc|Number=Sing|PronType=Art 8 det _ _
7 novo novo ADJ ADJ|M|S|@>N Gender=Masc|Number=Sing 8 amod _ _
8 governo governo NOUN <np-def>|N|M|S|@P< Gender=Masc|Number=Sing 4 nmod _ _
9 coincidem coincidir VERB <mv>|V|PR|3P|IND|@FS-STA Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
10 com com ADP PRP|@<PIV _ 12 case _ _
11 as o DET <artd>|ART|F|P|@>N Definite=Def|Gender=Fem|Number=Plur|PronType=Art 12 det _ _
12 prioridades prioridade NOUN <np-def>|N|F|P|@P< Gender=Fem|Number=Plur 9 obj _ _
13-14 do _ _ _ _ _ _ _ _
13 de de ADP <sam->|PRP|@N< _ 15 case _ _
14 o o DET <-sam>|<artd>|ART|M|S|@>N Definite=Def|Gender=Masc|Number=Sing|PronType=Art 15 det _ _
15 PT PT PROPN PROP|M|S|@P< Gender=Masc|Number=Sing 12 nmod _ SpaceAfter=No
16 . . PUNCT PU|@PU _ 9 punct _ _
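
Range lines such as `2-3 no` or `13-14 do` in CF0001.conllu are CoNLL-U multiword tokens: the surface form (`no`) covers the syntactic words whose IDs fall in the range (`em`, `o`). The sketch below groups the words of one sentence under their surface tokens. It assumes tab-separated columns, and `surface_tokens` is a hypothetical helper written for illustration, not one of the scripts in this repository.

```python
def surface_tokens(sentence_block: str) -> list[tuple[str, list[str]]]:
    """Map each surface token of a CoNLL-U sentence to its syntactic words.

    Returns (surface_form, [word_forms]) pairs in sentence order.
    Assumption: columns are tab-separated, as CoNLL-U requires.
    """
    result: list[tuple[str, list[str]]] = []
    covered_until = 0  # last word ID covered by the current multiword token
    for line in sentence_block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Multiword token range, e.g. "2-3": open a new surface token.
            covered_until = int(tok_id.split("-")[1])
            result.append((form, []))
        elif "." in tok_id:
            continue  # empty nodes (e.g. "8.1") have no surface form
        elif int(tok_id) <= covered_until:
            result[-1][1].append(form)  # word inside the open range
        else:
            result.append((form, [form]))  # ordinary one-word token
    return result


# Sentence CF1-1 ("PT no governo"), rebuilt here with tab separators.
rows = [
    ["1", "PT", "PT", "PROPN", "PROP|M|S|@NPHR", "Gender=Masc|Number=Sing", "0", "root", "_", "_"],
    ["2-3", "no", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["2", "em", "em", "ADP", "<sam->|PRP|@N<", "_", "4", "case", "_", "_"],
    ["3", "o", "o", "DET", "<-sam>|<artd>|ART|M|S|@>N", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art", "4", "det", "_", "_"],
    ["4", "governo", "governo", "NOUN", "<np-def>|N|M|S|@P<", "Gender=Masc|Number=Sing", "1", "nmod", "_", "_"],
]
block = "# sent_id = CF1-1\n" + "\n".join("\t".join(r) for r in rows)
print(surface_tokens(block))
```

Joining the surface forms (and honouring `SpaceAfter=No` in the last column, which this sketch ignores) is one way to check a sentence against its `# text` line.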
