IL Feedback #883
If I understand correctly, "לכולם" is a surface word that corresponds to two syntactic words "ל" and "כולם".
(the join=right is because a comma follows). A similar case from the IT sample would be ParlaMint/Samples/ParlaMint-IT/2018/ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.xml Line 490 in f9a0b6a
This gets converted to CoNLL-U like: ParlaMint/Samples/ParlaMint-IT/2018/ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.conllu Lines 197 to 199 in f9a0b6a
and to vert like ParlaMint/Samples/ParlaMint-IT/2018/ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.vert Line 202 in f9a0b6a
Note that vert (exactly for cases like this) has multi-valued attributes on norm, lemma etc. Not ideal, but the best we can do with vertical files. |
random person check דורון אביטל
I have checked a random person: https://github.com/GiliGoldin/ParlaMint/blob/8040ae5cd6579b7e4f414517a766ce5ce8b93f74/Samples/ParlaMint-IL/ParlaMint-IL-listPerson.xml#L241-L254
<person xml:id="person.18990">
<persName>
<forename>דורון</forename>
<surname>אביטל</surname>
</persName>
<sex value="M"/>
<birth when="1959-01-22">
<placeName>ישראל</placeName>
</birth>
<affiliation ref="#org.122" role="member" from="2011-03-18" to="2013-02-05"/>
<affiliation ref="#ParlaMint-IL-KNESS" role="member" from="2011-03-18" to="2013-02-05"/>
<affiliation ref="#ParlaMint-IL-GOV" role="member" from="2011-03-18" to="2013-02-05"/>
<affiliation ref="#ParlaMint-IL-GOV" role="minister" from="2011-03-18" to="2013-02-05"/>
</person>
His parliamentary group status at the time of membership:
<relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2009-03-31" to="2012-05-08"/>
<relation name="coalition" mutual="#org.122" from="2012-05-08" to="2012-07-17"/>
<relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2012-07-17" to="2013-02-05"/>
There are some oddities:
- he is at the same time a member of government and in the opposition
- the government membership has the same timespan as parliament membership (in Czechia, it takes some time (weeks to months) to become a minister after becoming a parliament member)
- wiki does not say he was a minister
Not sure if you understand the concept of members of the government in ParlaMint. It seems that all parliament members who are affiliated with the parliamentary group in the coalition are members of the government. |
INVALID @matyaskopp fault:
|
Sorry I don't understand. What is the repetition? Shouldn't there be a
meeting for each term? It's written in Hebrew and in English. How should it
be then?
…On Mon, Nov 25, 2024 at 8:36 AM Matyáš Kopp ***@***.***> wrote:
<meeting> element in teiCorpus
- non-unique meeting element in teiCorpus
<meeting> element should be unique within the file, there are repetitions
in a corpus root file:
https://github.com/GiliGoldin/ParlaMint/blob/8040ae5cd6579b7e4f414517a766ce5ce8b93f74/Samples/ParlaMint-IL/ParlaMint-IL.xml#L13-L36
<meeting n="14"
corresp="#ParlaMint-IL-KNESS"
ana="#parla.uni #parla.term #period_14"
xml:lang="he">הכנסת ה-14</meeting>
<meeting n="14"
corresp="#ParlaMint-IL-KNESS"
ana="#parla.uni #parla.term #period_14"
xml:lang="en">14th Knesset</meeting>
<meeting n="18"
corresp="#ParlaMint-IL-KNESS"
ana="#parla.uni #parla.term #period_18"
xml:lang="he">הכנסת ה-18</meeting>
<meeting n="18"
corresp="#ParlaMint-IL-KNESS"
ana="#parla.uni #parla.term #period_18"
xml:lang="en">18th Knesset</meeting>
<meeting n="24"
corresp="#ParlaMint-IL-KNESS"
ana="#parla.uni #parla.term #period_24"
xml:lang="he">הכנסת ה-24</meeting>
<meeting n="24"
corresp="#ParlaMint-IL-KNESS"
ana="#parla.uni #parla.term #period_24"
xml:lang="en">24th Knesset</meeting>
|
Of course, you are right. |
extra files
remove from repository:
|
You are right, it seems that I inserted the time of his faction membership
as the time in the government instead of the time in the coalition. I will
fix this.
I did assign each coalition member as a government member and as a
minister. I see now that this is a mistake, I will remove the minister role
since I don't have the information regarding the roles of the ministers and
government positions. In our corpus we do consider all the people in the
coalition to be government members.
…On Sun, Nov 24, 2024, 23:49 Matyáš Kopp ***@***.***> wrote:
random person check דורון אביטל
- government affiliation of דורון אביטל (
https://he.wikipedia.org/wiki/%D7%93%D7%95%D7%A8%D7%95%D7%9F_%D7%90%D7%91%D7%99%D7%98%D7%9C
)
I have checked random person:
https://github.com/GiliGoldin/ParlaMint/blob/8040ae5cd6579b7e4f414517a766ce5ce8b93f74/Samples/ParlaMint-IL/ParlaMint-IL-listPerson.xml#L241-L254
<person xml:id="person.18990">
<persName>
<forename>דורון</forename>
<surname>אביטל</surname>
</persName>
<sex value="M"/>
<birth when="1959-01-22">
<placeName>ישראל</placeName>
</birth>
<affiliation ref="#org.122" role="member" from="2011-03-18" to="2013-02-05"/>
<affiliation ref="#ParlaMint-IL-KNESS" role="member" from="2011-03-18" to="2013-02-05"/>
<affiliation ref="#ParlaMint-IL-GOV" role="member" from="2011-03-18" to="2013-02-05"/>
<affiliation ref="#ParlaMint-IL-GOV" role="minister" from="2011-03-18" to="2013-02-05"/>
</person>
His parliamentary group status at the time of membership:
<relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2009-03-31" to="2012-05-08"/>
<relation name="coalition" mutual="#org.122" from="2012-05-08" to="2012-07-17"/>
<relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2012-07-17" to="2013-02-05"/>
There are some oddities:
- he is at the same time a member of government and in the opposition
- the government membership has the same timespan as parliament
membership (in Czechia, it takes some time(weeks-months) to become a
minister after becoming a parliament member)
- wiki does not say he was a minister
Not sure if you understand the concept of members of the government in
ParlaMint. It seems that all parliament members who are affiliated with the
parliamentary group in the coalition are members of the government.
https://clarin-eric.github.io/ParlaMint/#sec-affiliation
A member of government is someone who has some position in government (not
everyone from the coalition)
|
<langUsage>
<language ident="he">עברית</language>
<language ident="en">אנגלית</language>
<language ident="he">Hebrew</language>
<language ident="en">English</language>
</langUsage>
should be:
<langUsage>
<language ident="he" xml:lang="he">עברית</language>
<language ident="en" xml:lang="he">אנגלית</language>
<language ident="he" xml:lang="en">Hebrew</language>
<language ident="en" xml:lang="en">English</language>
</langUsage> |
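A check along these lines could catch the missing attribute automatically. This is only a sketch, assuming the header is parsed with Python's standard `xml.etree.ElementTree` and that the elements live in the TEI namespace; the helper name `languages_missing_lang` and the two-line sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_missing_lang(root):
    """Return (ident, text) for every <language> element without xml:lang."""
    return [
        (lang.get("ident"), lang.text)
        for lang in root.iter(f"{TEI}language")
        if XML_LANG not in lang.attrib
    ]


# Hypothetical sample: the second <language> lacks xml:lang.
sample = """
<langUsage xmlns="http://www.tei-c.org/ns/1.0">
  <language ident="he" xml:lang="he">עברית</language>
  <language ident="en">English</language>
</langUsage>
"""
print(languages_missing_lang(ET.fromstring(sample)))
```

The check only looks at the attribute on the element itself; it does not consider an xml:lang inherited from an ancestor.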
There are still some taxonomies which are IL-specific or not linked:
I guess they can be removed |
Thanks for the great progress; I have ticked what has been resolved so far. If anything is unclear, please ask.
Well, you made more changes than just removing ministers and fixing the beginnings of timespans in 199b869; see Netanyahu: Some removals seem to be correct (e.g. Netanyahu was not in government with Bennett). I hope you are aware of these changes and that they were a bugfix, not an accidental removal. The government beginnings seem to be okay (if the start of the coalition is the start of the government), but now you most probably have time spans without a government, because you have shifted only the beginnings (the old government still works after new MPs take the parliamentary oath). We have a script for enriching TEI with TSV data: |
Url should contain the proper source of the transcription (if available), so everyone can see the source that you have transformed to corpus.
The sources can be found online but I don't have this specific URL information since we didn't process the files directly from the website, we received them in email directly from the Knesset archivists. |
Okay, it's a shame. You can add it to your checklist for improving your source corpus. |
I made sure to use the coalition dates rather than the faction membership dates. This caused all the mentioned changes which are correct now. The start of the coalition membership is the start of the government membership, not the parliamentary oath, but yes the end will be the end of the coalition membership. |
This was fixed according to what TomazErjavec suggested.
This was fixed
This was fixed. There are no IL-specific taxonomies anymore
This was fixed
Those are factions that are made of only one independent MP, but they are considered as a regular faction/political party in the parliament. I don't see why to save them differently. |
@GiliGoldin, it was fixed only partially, see my previous comment: #883 (comment)
|
If this reflects the reality in Knesset, then do it this way - I am ok with it. |
join attribute
There are too many joins, so the raw TEI and annotated (TEI.ana) versions are different
and then you can compare folders (I use meld):
|
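One way to see whether the joins are right is to rebuild the surface text of each annotated sentence and compare it with the raw TEI text. The following is a minimal sketch of that idea, not the project's own tooling: it assumes tokens are direct children of `<s>` in the TEI namespace, and the function name `sentence_text` and the sample are illustrative only.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def sentence_text(s):
    """Rebuild the surface text of one <s> from its top-level tokens.

    A token with join="right" is glued to the token that follows it;
    every other token is followed by a single space.
    """
    parts = []
    for tok in s:
        if tok.tag not in (f"{TEI}w", f"{TEI}pc"):
            continue  # skip <linkGrp> and other non-token children
        # For a surface word with nested syntactic words, .text is the
        # surface form that precedes the nested <w> elements.
        parts.append((tok.text or "").strip())
        if tok.get("join") != "right":
            parts.append(" ")
    return "".join(parts).strip()


# Hypothetical, shortened sentence: only the verb is joined to the "?".
sample = """
<s xmlns="http://www.tei-c.org/ns/1.0">
  <w>איפה</w>
  <w>המכינה<w norm="ה"/><w norm="מכינה"/></w>
  <w join="right">תוקם</w>
  <pc>?</pc>
  <linkGrp/>
</s>
"""
print(sentence_text(ET.fromstring(sample)))
```

If a join is misplaced, the rebuilt string differs from the raw paragraph text (for instance two words run together, or a space appears before the question mark).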
@GiliGoldin, you removed your comment before I could react, so there are probably still some doubts. I can give you an example https://github.com/GiliGoldin/ParlaMint/blob/4571733fe48a9d200c92fd1ba7b02807bfc7ccfb/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21-18ptv139208.ana.xml#L484-L519 of how this sentence should be encoded.
<s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0">
<w lemma="איפה"
pos="ADV"
msd="UPosTag=ADV|PronType=Int"
xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t1">איפה</w>
<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3" join="right">המכינה<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2"
norm="ה"
lemma="ה"
pos="DET"
msd="UPosTag=DET|Definite=Def|PronType=Art"/>
<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t3"
norm="מכינה"
lemma="מכינה"
pos="NOUN"
msd="UPosTag=NOUN|Gender=Fem|Number=Sing"/>
</w>
<w lemma="הוקם"
pos="VERB"
msd="UPosTag=VERB|Gender=Fem|HebBinyan=HUFAL|Number=Sing|Person=3|Tense=Fut|Voice=Pass"
xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4">תוקם</w>
<pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5"
msd="UPosTag=PUNCT"
join="right">?</pc>
<linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
</s>
It should be:
<s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0">
<w lemma="איפה"
pos="ADV"
msd="UPosTag=ADV|PronType=Int"
xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t1">איפה</w>
<!--
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3) removing join="right"
because the token(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4)
on the right(=following) in this file is not joined
-->
<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3">המכינה<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2"
norm="ה"
lemma="ה"
pos="DET"
msd="UPosTag=DET|Definite=Def|PronType=Art"/>
<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t3"
norm="מכינה"
lemma="מכינה"
pos="NOUN"
msd="UPosTag=NOUN|Gender=Fem|Number=Sing"/>
</w>
<!--
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4) added join="right"
because the punctuation(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5) is joined with this token
-->
<w lemma="הוקם"
join="right"
pos="VERB"
msd="UPosTag=VERB|Gender=Fem|HebBinyan=HUFAL|Number=Sing|Person=3|Tense=Fut|Voice=Pass"
xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4">תוקם</w>
<!--
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5) removing join="right"
because the token is at the end of the sentence
-->
<pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5"
msd="UPosTag=PUNCT">?</pc>
<linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
</s>
I am sorry if I have written it ambiguously in #882. I hope this example helps |
Yes, I removed the comment since I noticed more problems that needed fixing. |
I have spotted one easy-fix join issue: Make sure that the last token in a sentence does not contain the join attribute:
<s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1">
<!-- SKIPPING -->
<pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t12"
msd="UPosTag=PUNCT"
join="right">-</pc>
<pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t13"
msd="UPosTag=PUNCT"
join="right">-</pc> <!-- REMOVE THIS JOIN -->
<linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
</s>
I believe your pipeline will be ready to run on all data when you fix this. @GiliGoldin, thanks for the exceptional work! @TomazErjavec, just for your update, ParlaMint-IL sample is close to being ready. |
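The rule about the last token could be applied as a post-processing step. The sketch below is only an illustration under stated assumptions: tokens are direct children of `<s>` in the TEI namespace, and `drop_final_join` is a hypothetical helper, not part of the ParlaMint scripts.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def drop_final_join(root):
    """Remove join="right" from the last top-level token of every <s>.

    Returns the number of attributes removed.
    """
    removed = 0
    for s in root.iter(f"{TEI}s"):
        tokens = [t for t in s if t.tag in (f"{TEI}w", f"{TEI}pc")]
        if tokens and tokens[-1].get("join") == "right":
            del tokens[-1].attrib["join"]
            removed += 1
    return removed


# Hypothetical, shortened version of the sentence ending in "- -".
sample = """
<s xmlns="http://www.tei-c.org/ns/1.0">
  <pc join="right">-</pc>
  <pc join="right">-</pc>
  <linkGrp/>
</s>
"""
root = ET.fromstring(sample)
print(drop_final_join(root))
```

Only the second `<pc>` loses its attribute; the first one keeps join="right" because another token follows it.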
That's great, thank you so much! |
Hi, the full data is located here: Happy holidays! |
@GiliGoldin, thanks for letting us know. Will have a look and try to process it soon. |
I transferred your corpus and had a look & tried to process it. Many of the errors found are surprising, as things are
Many elements in the TEI headers (esp. in the corpus root file) do not mark their content as being English,
I tried processing the complete corpus with our scripts (cf. Build/ directory) but it turns out your corpus is too
The main errors seem to be:
I think the intention was to preserve the name of the parent session but this is not the right way to do it. Maybe simply:
As for the .ana files:
If you could fix (as many as you can of) these mistakes and post a new version, we could then take it from there. One other thing, it would be nice to localise the common taxonomies, i.e. translate their category descriptions into |
@TomazErjavec, @GiliGoldin added translations to samples, but I forgot to update the common taxonomies. I am doing it manually with: make translateTaxonomies-IL it is now included in e4a0f26 |
Okay, I will look into these problems and try to solve as many as possible. I will upload a new version soon. |
Okay I tried to fix most of the problems:
|
@GiliGoldin, I have tried to process your files:
<classDecl>
<xi:include href="ParlaMint-taxonomy-parla.legislature.xml"/>
<xi:include href="ParlaMint-taxonomy-politicalOrientation.xml"/>
<xi:include href="ParlaMint-taxonomy-speaker_types.xml"/>
<xi:include href="ParlaMint-taxonomy-subcorpus.xml"/>
<xi:include href="ParlaMint-taxonomy-NER.ana.xml"/> <!-- should not be present in TEI version -->
<xi:include href="ParlaMint-taxonomy-UD-SYN.ana.xml"/> <!-- should not be present in TEI version -->
</classDecl>
I will fix it manually, but @GiliGoldin please fix your pipeline too.
<s xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3">
<w lemma="הצבעה" pos="NOUN" msd="UPosTag=NOUN|Gender=Fem|Number=Sing" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t1">הצבעה</w>
<w lemma="מס'" pos="NOUN" msd="UPosTag=NOUN|Abbr=Yes|Gender=Masc|Number=Sing" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t2">מס'</w>
<w lemma="6" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t3">6</w>
<w lemma="בעד" pos="ADP" msd="UPosTag=ADP" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t4">בעד</w>
<w lemma="סעיף" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t5">סעיפים</w>
<w lemma="7-1" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t6">7-1</w>
<!-- should be PUNCT -->
<w lemma="" pos="SYM" msd="UPosTag=SYM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t7">–</w>
<w lemma="4" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t8">4</w>
<w lemma="נגד" pos="ADP" msd="UPosTag=ADP" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t9">נגד</w>
<pc xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t10" msd="UPosTag=PUNCT">–</pc>
<w lemma="אין" pos="VERB" msd="UPosTag=VERB|Polarity=Neg" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t11">אין</w>
<w lemma="נמנע" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t12">נמנעים</w>
<pc xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t13" msd="UPosTag=PUNCT">–</pc>
<w lemma="אין" pos="VERB" msd="UPosTag=VERB|Polarity=Neg" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t14">אין</w>
<w lemma="סעיף" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t15">סעיפים</w>
<w lemma="7-1" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t16">7-1</w>
<w lemma="נתקבל" pos="VERB" msd="UPosTag=VERB|Gender=Masc|HebBinyan=NITPAEL|Number=Plur|Person=3|Tense=Past|Voice=Pass" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t17" join="right">נתקבלו</w>
<pc xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t18" msd="UPosTag=PUNCT">.</pc>
Can I fix these situations in ParlaMint/Scripts/parlamint2release.xsl Lines 852 to 880 in 2de4c7c
<xsl:when test="@lemma='' and @msd = 'UPosTag=SYM' ">
<pc>
<xsl:attribute name="msd">UPosTag=PUNCT</xsl:attribute>
<xsl:apply-templates mode="comp" select="@*[name() != 'lemma' and name() != 'pos' and name() != 'msd']"/>
<xsl:apply-templates mode="comp"/>
</pc>
</xsl:when>
@GiliGoldin, @TomazErjavec, Can I do it? |
@matyaskopp, sorry, I also ran the corpus (with just the first and last year of transcripts) through my build process soon after @GiliGoldin released it but then didn't find the time to comment on it. In short, there are some other problems apart from the ones you reported (although the situation is much better than for the first release), so I'd suggest another revision. If @GiliGoldin doesn't fix the errors you report there, then, sure, pls. feel free to upgrade the release scripts to implement the fixes you suggest. For manually editing the root files, I'd say not, as there are other problems there as well, and all should be fixed. First, now that we have names, I could mount the sample corpus on our dev concordancer and you can test it at https://www.clarin.si/ske-beta/#dashboard?corpname=parlamint50_il (you need username "dev" and password "alfabetagama"). The log files of the build are, as before, on https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D For the point by point response:
Indeed, and we all have this situation. However, we decided that it is better to cheat and move the first day of the next affiliation so that there is no overlap, as with this we introduce a very small error. On the other hand, if we wanted to support two affiliations on the same date, this would mean that we need to cater for multi-valued attributes, which introduces a large overhead in the processing, makes computing various statistics more difficult etc. So, I'd suggest you do the same - it might even be done automatically. Maybe @matyaskopp also has some thoughts here.
Yes, both fixed, thanks.
Indeed, it could be considered correct, but the schema does not allow for it, so, thanks for changing it.
OK, great. There is one other point that I made but you haven't addressed:
This is still the case, many elements with English content both in the root and component headers are not marked as English. And a minor plea: on HuggingFace your tarred files decompress into the current directory, it would help us a bit if they could decompress into the directory with the same name as the tar file, i.e. into ParlaMint-IL.TEI/ and ParlaMint-IL.TEI.ana/ (everything else stays the same). |
@TomazErjavec @matyaskopp
I fixed this automatically by incrementing one day to the start of the new affiliation, when it's the same as the last day of the previous one. I hope this solves the problem.
Sorry, I missed this one. Now I fixed it in these elements.
Ok, I think you can also do it with a parameter in the decompressing command, but I now compressed it so it will automatically keep the container folder. The new files are in the same link: |
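The one-day shift described in this comment could look roughly like the sketch below. It is not the code used for the corpus: the function name `shift_clashing_starts` is hypothetical, each span is assumed to have both a start and an end date, and the spans are sorted by start date first because the comparison only makes sense between consecutive spans of one person.

```python
from datetime import date, timedelta


def shift_clashing_starts(spans):
    """Return spans sorted by start date, with clashing starts moved forward.

    `spans` is a list of (start, end) date pairs for one person. When a span
    starts on the very day the previous span ends, its start is moved one
    day later so the two no longer share a date.
    """
    fixed = []
    for start, end in sorted(spans):
        if fixed and start == fixed[-1][1]:
            start += timedelta(days=1)
        fixed.append((start, end))
    return fixed


# Dates taken from the relation example earlier in the thread,
# deliberately given out of order.
spans = [
    (date(2012, 7, 17), date(2013, 2, 5)),
    (date(2009, 3, 31), date(2012, 5, 8)),
    (date(2012, 5, 8), date(2012, 7, 17)),
]
for start, end in shift_clashing_starts(spans):
    print(start.isoformat(), end.isoformat())
```

The sketch only handles the exact same-day clash; spans that overlap by more than one day would need a separate decision.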
I again processed your latest corpus @GiliGoldin (first and last year), and the log files are, as before, on https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D. Apart from the linguistic annotation errors that @matyaskopp has taken or will take care of, there are still date clashes in coalition/opposition and party membership; it would be great if you could take another look at this and fix them so they don't overlap. You write that you have fixed the missing
etc.
Please do fix this both in .TEI and .TEI.ana. But all the rest seems ok! |
Yes, the problem was that the dates weren't sorted by start_date for each person, so my changes weren't applied correctly. I think it should be ok now.
For the missing ones: I previously thought that if I write the element only in English, I don't need to add the
Ok I tried to make the same fixes here too. This will affect all the files so I currently only did it to one protocol so you can check and make sure it's okay before I run it on all the files. Thanks! |
I am sorry I missed this language bug while checking the sample. <persName xml:lang="he"/> <change when="2025-01-23">Initial conversion to TEI format.</change> and also in root file: <change when="2025-01-23">Initial release.</change>
Well, I have fixed SYMbols with an empty lemma, but there are still many bugs:
1. character content of element "pc"
grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.CONTENT-PC-ERROR.txt
2. element "w" incomplete; missing required element "w"
grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.MISSING-W-ERROR.txt
3. value of attribute "lemma" is invalid
grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.LEMMA-ERROR.txt
4. value of attribute "msd" is invalid
grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.MSD-ERROR.txt
These errors are not good. The question is, what else can be affected but not reported?
<w lemma="(" pos="X" msd="UPosTag=1 –|PUNCT=" xml:id="ParlaMint-IL_2001-11-26-15ptv496918.u0.p0.s152.t25">1 –</w>
<w lemma="(א" pos="X" msd="UPosTag=1 -|NUM=" xml:id="ParlaMint-IL_2010-10-14-18ptv161765.u588.p0.s24.t20" join="right">1 -</w>
<w lemma="(א)" pos="X" msd="UPosTag=5 -|NUM=" xml:id="ParlaMint-IL_2010-10-14-18ptv161765.u589.p0.s2.t6">5 -</w>
<w lemma="(א" pos="X" msd="UPosTag=5א -|NUM=" xml:id="ParlaMint-IL_2010-10-14-18ptv161765.u589.p0.s5.t6" join="right">5א -</w>
<w lemma="(א" pos="X" msd="UPosTag=20 –|NUM=" xml:id="ParlaMint-IL_2010-11-02-18ptv163061.u87.p0.s2.t3" join="right">20 –</w>
<w lemma="(א" pos="X" msd="UPosTag=1א -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u302.p0.s0.t6" join="right">1א -</w>
<w lemma="(א)" pos="X" msd="UPosTag=3 -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u302.p0.s10.t6">3 -</w>
<w lemma="(א)" pos="X" msd="UPosTag=41 -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u857.p0.s1.t6">41 -</w>
<w lemma="(א" pos="X" msd="UPosTag=51 -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u859.p0.s0.t6" join="right">51 -</w>
<w lemma="(א" pos="X" msd="UPosTag=2 –|NUM=" xml:id="ParlaMint-IL_2010-11-15-18ptv163615.u1.p0.s7.t23" join="right">2 –</w>
<w lemma="(א" pos="X" msd="UPosTag=88 –|NUM=" xml:id="ParlaMint-IL_2010-11-16-18ptv164162.u1.p0.s2.t6" join="right">88 –</w>
<w lemma="(א" pos="X" msd="UPosTag=19א -|NUM=" xml:id="ParlaMint-IL_2010-11-18-18ptv164514.u37.p0.s0.t6" join="right">19א -</w>
<w lemma="(א" pos="X" msd="UPosTag=42 -|NUM=" xml:id="ParlaMint-IL_2010-11-18-18ptv164514.u133.p0.s0.t6" join="right">42 -</w>
<w lemma="(א" pos="X" msd="UPosTag=10 -|NUM=" xml:id="ParlaMint-IL_2010-11-23-18ptv164401.u45.p0.s0.t44" join="right">10 -</w>
<w lemma="(א" pos="X" msd="UPosTag=9 -|NUM=" xml:id="ParlaMint-IL_2010-11-23-18ptv164564.u457.p0.s1.t39" join="right">9 -</w>
<w lemma="(א)" pos="X" msd="UPosTag=75 -|NUM=" xml:id="ParlaMint-IL_2010-11-25-18ptv165046.u578.p0.s3.t6">75 -</w>
<w lemma="(א" pos="X" msd="UPosTag=94 -|NUM=" xml:id="ParlaMint-IL_2010-11-25-18ptv165046.u985.p0.s0.t6" join="right">94 -</w>
<w lemma="(" pos="X" msd="UPosTag=2 -|PUNCT=" xml:id="ParlaMint-IL_2010-11-30-18ptv165043.u43.p0.s0.t28">2 -</w>
<w lemma="(א" pos="X" msd="UPosTag=28 -|NUM=" xml:id="ParlaMint-IL_2010-11-30-18ptv165043.u51.p0.s10.t6" join="right">28 -</w>
<w lemma="(א" pos="X" msd="UPosTag=2 -|NUM=" xml:id="ParlaMint-IL_2010-12-06-18ptv164426.u334.p0.s1.t8" join="right">2 -</w>
<w lemma="(א" pos="X" msd="UPosTag=1 -|NUM=" xml:id="ParlaMint-IL_2010-12-09-18ptv164890.u242.p0.s0.t36" join="right">1 -</w>
<w lemma="(א" pos="X" msd="UPosTag=80 –|NUM=" xml:id="ParlaMint-IL_2010-12-14-18ptv165546.u167.p0.s0.t7" join="right">80 –</w>
<w lemma="3ה(" pos="X" msd="UPosTag=3ה(1) -|NUM=" xml:id="ParlaMint-IL_2023-07-20-25ptv3418350.u1763.p0.s0.t5">3ה(1) -</w>
<w lemma="(1" pos="X" msd="UPosTag=6 -|NUM=" xml:id="ParlaMint-IL_2024-01-30-25ptv4142816.u146.p0.s1.t15" join="right">6 -</w>
<w lemma="24" pos="X" msd="UPosTag=11 –|NUM=" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4">11 –</w>
|
ok I think I fixed these bugs. I uploaded a new version of these files. let me know if there is anything else.
I can try to do my best to fix the problems but the corpus contains over 45K protocols and 35M sentences, I don't think we can expect the parser to be perfect on all of them. Especially in Hebrew, where the automatic parsers are in general not as good as models for Latin characters, and it won't be possible to manually fix all of the problems.
I thought I fixed this problem by removing pcs that contained spaces but I see that it didn't entirely work. I made a few changes in the code, hopefully it will catch most of these problems now.
I'm not sure what to do with this one. The protocols contain all kinds of weird "words" like these. Most of the time it's probably because they didn't complete the sentence or the writer of the protocol missed it. Words like "ב..." or "ל..." are like saying "in ..." or "to ..." in Hebrew, but these are prefixes that are part of the same word. The ones with only dots were probably supposed to be classified as pc but the parser didn't recognize them correctly. The only solutions I see here are to erase all words that contain more than one dot, which would also mean a loss of information, or to keep it like this.
From what I understand the problem is mostly spaces in the lemmas? I'll try to automatically remove spaces here too. I can also check if the lemma matches the regular expression "(\S)|(\S[\S ]*\S)" and if not replace with a "UNKNOWN" value. Will that be ok?
Here too I can remove spaces and replace with a fallback value if doesn't match regex. |
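The proposed lemma check could be sketched as follows. This is an assumption-laden illustration, not the pipeline's code: "remove spaces" is read here as stripping leading and trailing whitespace only, the pattern is the one quoted above applied to the whole string, and `normalise_lemma` is a made-up name.

```python
import re

# Pattern quoted in the discussion: a single non-space character, or a
# string that starts and ends with a non-space character and contains
# only non-space characters or plain spaces in between.
LEMMA_RE = re.compile(r"(\S)|(\S[\S ]*\S)")


def normalise_lemma(lemma, fallback="UNKNOWN"):
    """Strip surrounding whitespace; return `fallback` if still invalid."""
    cleaned = lemma.strip()
    if LEMMA_RE.fullmatch(cleaned):
        return cleaned
    return fallback


for raw in [" מס' ", "", "a b", "a\tb"]:
    print(repr(raw), "->", repr(normalise_lemma(raw)))
```

Note that a lemma such as "(א" passes this check, so the pattern by itself only catches empty values and stray whitespace, not lemmas that are wrong for other reasons.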
@GiliGoldin, I can give you an example of an error (type 4 on the list). I don't understand Hebrew, so I can only comment on anomalies in the corpus. My belief is that your scripts are getting the wrong results when the paragraph
This is the whole utterance from the corpus:
<u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904" who="#person.30772" ana="#regular">
<seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0">בעד סעיפים 1–11 – 24</seg>
</u>
and your annotation:
<u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904" who="#person.30772" ana="#regular">
<seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0">
<s xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0">
<w lemma="בעד" pos="ADP" msd="UPosTag=ADP" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t1">בעד</w>
<w lemma="סעיף" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2">סעיפים</w>
<w lemma="1" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t3">1–</w>
<w lemma="24" pos="X" msd="UPosTag=11 –|NUM=" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4">11 –</w>
<linkGrp targFunc="head argument" type="UD-SYN">
<link ana="ud-syn:case" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t1"/>
<link ana="ud-syn:root" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2"/>
<link ana="ud-syn:nmod_npmod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t3"/>
<link ana="ud-syn:dep" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4"/>
</linkGrp>
</s>
</seg>
</u>
where multiple bugs are present:
I tried to annotate this "sentence" in the online trankit demo tool http://nlp.uoregon.edu/trankit My guess is that the bug is not in the trankit annotation tool but in the pipeline, which processes the result. It would be nice to determine what is causing this problem and then decide how/if it can be fixed. |
Yes I see. I'll try to figure out what's causing this. |
Let me just comment on the latest TEI headers:
This is conceptually wrong, as CLG lab is not a person and hence should not be annotated as a person name. If you really want CLG to have the responsibility here (but I think it is better if you list people), you would have to change this to orgName, and we would have to change the schema to allow this, which it doesn't now. |
Thanks for the great work on the corpora!
Please do not be scared of a long task list (everyone received it). I hope it will help you improve your corpus. I am ready to help and discuss any ambiguities or doubts, so do not hesitate to ask.
Are component filenames really unique
The filenames (file IDs /TEI/@id) must be unique. I am not sure if multiple plenary/committee meetings can be held on the same day.
maintitle unique and also in Hebrew
The text value of the main title in component files has to be unique within the corpus and there should also be a Hebrew translation:
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2021/ParlaMint-IL_2021-12-21.xml#L9
so instead of reference corpus, you can place the date and some more info that makes it unique (because you encode committees too, I believe the date is not enough):
<meeting> element in plenaries
- parla.term
- parla.session?
- parla.meeting
- parla.sitting
Values from the <meeting> elements are used in concordancers for filtering transcriptions, so the correct encoding is really important. See documentation: https://clarin-eric.github.io/ParlaMint/#exa-titleStmtComp
and also the taxonomy:
I believe this plenary hearing file: https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-03-12.xml#L12
should be encoded this way
<meeting> element in committees
- parla.term
- parla.meeting
- parla.sitting
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2021/ParlaMint-IL_2021-12-21.xml#L12
can be encoded this way:
It is a pity you do not have committee organizations and texts so they can be linked. ParlaMint-BE has committee meetings too (but no <org>anization). On the other hand CZ and HU have organizations but not corresponding texts. It would be great to have one corpus that has both :-)
<meeting> element in teiCorpus
- parla.term
There should be a list of terms in the <meeting> elements in corpus root files, like this:
ParlaMint/Samples/ParlaMint-AT/ParlaMint-AT.xml
Lines 10 to 17 in f9a0b6a
annotation of the file TEI/@ana
- #parla.sitting into TEI/@ana
Add #parla.sitting into TEI/@ana if one file corresponds to one sitting, or the #parla.meeting value can be used if sitting is one to one to meeting.
bibliography
- idno URL - texts available online, but the source is a different corpus that does not preserve this information
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-03-12.xml#L52-L53
should contain correct single day when="2009-03-12" - the day of making text public or meeting date.
Url should contain the proper source of the transcription (if available), so everyone can see the source that you have transformed to corpus.
settingDesc
<setting>
Take a look at examples from other corpora:
ParlaMint/Samples/ParlaMint-AT/2022/ParlaMint-AT_2022-10-12-027-XXVII-NRSITZ-00178.xml
Lines 111 to 119 in 9946040
ParlaMint/Samples/ParlaMint-CZ/2023/ParlaMint-CZ_2023-07-26-ps2021-071-07-000-000.xml
Lines 106 to 114 in 9946040
ParlaMint/Samples/ParlaMint-SI/2022/ParlaMint-SI_2022-04-06-SDZ8-Izredna-99.xml
Lines 95 to 101 in 9946040
ID format
u/@id, seg/@id, s/@id, w/@id and pc/@id
I know that the ID value is just for technical purposes, but consider changing the IDs to the way most corpora do it, something like {file_id}.u{utteranceN}.p{paragraphN}.s{sentenceN}.w{tokenN} (the Czech style of creating ids).
changed ids in annotated version
For technical reasons, we want to preserve utterance and segment ids in annotated versions (they should be equal in both). When you annotate the corpus, you are only enriching it, not changing existing content.
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.xml#L97-L100
vs annotated version:
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml#L103-L106
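Both ID points above can be checked mechanically. The following is only a minimal sketch, not part of the ParlaMint tooling: `make_id`, `structural_ids` and `ids_preserved` are hypothetical helper names, and it assumes that files use the TEI namespace and carry ids in `xml:id`.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def make_id(file_id, u, p=None, s=None, w=None):
    """Build a hierarchical id such as '<file_id>.u3.p1.s2.w5' (hypothetical helper)."""
    parts = [file_id, f"u{u}"]
    for prefix, n in (("p", p), ("s", s), ("w", w)):
        if n is None:
            break  # stop at the first level that was not given
        parts.append(f"{prefix}{n}")
    return ".".join(parts)


def structural_ids(root):
    """Return xml:id values of all <u> and <seg> elements in document order."""
    return [
        el.get(XML_ID)
        for el in root.iter()
        if el.tag in (TEI + "u", TEI + "seg")
    ]


def ids_preserved(plain_xml, ana_xml):
    """True when the annotated text keeps exactly the <u>/<seg> ids of the plain text."""
    plain = structural_ids(ET.fromstring(plain_xml))
    ana = structural_ids(ET.fromstring(ana_xml))
    return plain == ana
```

For example, `make_id("ParlaMint-IL_2009-10-21", 3, 1)` gives an id for paragraph 1 of utterance 3, and `ids_preserved` can be called with the contents of a plain file and its `.ana.xml` counterpart.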
syntactic vs orthographic words
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml#L107-L130
annotation with udpipe for easier illustration:
it should be encoded this way:
Or the ways documented here: https://clarin-eric.github.io/ParlaMint/#sec-ana-norm
Not sure... "לכולם" does not have a lemma....
@TomazErjavec please help me here. We need to be able to convert it into conllu and vert. It would also be great if users could search for it as one word...
named entities
I have found only single-word named entities, which were adjacent, like this:
taxonomies
There are two types of taxonomies:
ParlaMint-taxonomy- prefix, where no changes are allowed and only translation is required (except UD-SYN)
ParlaMint-IL-taxonomy-
You have changed the content of the common taxonomies and also the filenames, so the taxonomies do not match the ParlaMint ones.
You can initialize common taxonomies with this command. Run it in the repository root folder:
It creates taxonomies in Samples/ParlaMint-IL and puts placeholders where the translations should appear (it overwrites existing files if the filename is the same).
If you have the correct filenames and IDs, you can use this sequence to prefill your translations:
languages
<langUsage>
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL.ana.xml#L143-L146
there should be information about both languages (@ident) stored in both languages (@xml:lang), like this:
ParlaMint/Samples/ParlaMint-CZ/ParlaMint-CZ.ana.xml
Lines 147 to 152 in f9a0b6a
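The "both languages in both languages" rule can be read as a small cross product of @ident and @xml:lang values. A minimal sketch of such a check follows; the function name `missing_language_entries` and the default language codes `he` and `en` are assumptions for illustration, not something taken from the ParlaMint schema or tools.

```python
import xml.etree.ElementTree as ET
from itertools import product

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def missing_language_entries(lang_usage_xml, langs=("he", "en")):
    """Return the (ident, xml:lang) pairs that have no <language> element.

    The language codes in `langs` are assumed values for the IL corpus.
    """
    root = ET.fromstring(lang_usage_xml)
    present = {
        (el.get("ident"), el.get(XML_LANG))
        for el in root.iter()
        # match <language> with or without a namespace prefix
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "language"
    }
    expected = set(product(langs, repeat=2))
    return sorted(expected - present)
```

An empty list means every language is described in every language; any returned pair names the description that still has to be added.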
invalid label content
org/listEvent/event/label content
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L420-L427
This appears in multiple organizations; the above is just a sample.
abbreviated form is longer than full
org/orgName
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L647-L648
This appears in multiple organizations; the above is just a sample.
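Since the problem occurs in multiple organizations, a script can list all of them. This is a rough sketch only: it assumes that full names are marked with full="yes" and abbreviations with full="abb" on orgName, and that names are compared within the same @xml:lang; `suspicious_abbreviations` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def local(el):
    """Element name without its namespace (empty for comments and the like)."""
    return el.tag.rsplit("}", 1)[-1] if isinstance(el.tag, str) else ""


def suspicious_abbreviations(list_org_xml):
    """Return (org id, language) pairs where the abbreviation is longer than the full name."""
    root = ET.fromstring(list_org_xml)
    problems = []
    for org in root.iter():
        if local(org) != "org":
            continue
        names = [el for el in org if local(el) == "orgName"]
        for abb in names:
            if abb.get("full") != "abb":
                continue
            lang = abb.get(XML_NS + "lang")
            for full in names:
                # assumption: compare only names given in the same language
                if full.get("full") != "yes" or full.get(XML_NS + "lang") != lang:
                    continue
                if len((abb.text or "").strip()) > len((full.text or "").strip()):
                    problems.append((org.get(XML_NS + "id"), lang))
    return problems
```

Running it on the contents of ParlaMint-IL-listOrg.xml would give the ids of the organizations whose two name forms are probably swapped.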
independent MP forms parliamentary group
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L2045-L2051
approx 30 occurrences.
This solution makes it possible to assign a political orientation to an independent MP, but it is really strange. We probably have to find a better solution. (@TomazErjavec ??)
only member affiliations
The corpus contains only member roles. Is there a possibility to add various roles? See https://clarin-eric.github.io/ParlaMint/#sec-affiliation
unknown person name
It is not necessary to fill in both forename and surname if they are unknown. If the person is completely unknown, then he/she shouldn't have a person record in listPerson (you can also skip the @who attribute in the utterance).