
Commit

In this form I will send it to Samar for review.
dan-zeman committed Oct 30, 2009
1 parent b9ec72f commit 914de08
Showing 2 changed files with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions papers/2009-icon-hyderabad/paper.tex
@@ -113,7 +113,7 @@ \section{System Description}
%\microsection{MST Parser}
\subsection{MST Parser}
\label{sec:mst}
- The Maximum Spanning Tree (MST) parser \citep{mst} views the sentence as an directed complete graph with edges weighted by a feature scoring function. It finds for the graph the spanning tree that maximizes the weights of the edges. A multi-class classification algorithm called MIRA is used to compute the scoring function.
+ The Maximum Spanning Tree (MST) parser \citep{mst} views the sentence as a complete directed graph with edges weighted by a feature scoring function. It then finds the spanning tree of this graph that maximizes the total edge weight. A multi-class classification algorithm called MIRA is used to learn the scoring function.
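
For concreteness, here is a minimal Python sketch of the Chu-Liu/Edmonds search that such a parser builds on, over a plain score dictionary. The function names are ours and the MIRA-trained feature scoring is abstracted into the score input, so this illustrates the algorithm rather than the parser's actual code.

    def find_cycle(head):
        """Return one cycle (as a list of nodes) in a head map, or None."""
        color = {}
        for start in head:
            if start in color:
                continue
            path, v = [], start
            while v in head and v not in color:
                color[v] = "open"
                path.append(v)
                v = head[v]
            cycle = path[path.index(v):] if color.get(v) == "open" else None
            for u in path:
                color[u] = "done"
            if cycle:
                return cycle
        return None

    def max_arborescence(score, root=0):
        """Chu-Liu/Edmonds maximum spanning arborescence (sketch).

        score[h][d] is the weight of the arc from head h to dependent d;
        the graph is assumed connected and free of self-arcs (in the
        parser it is complete).  Returns {dependent: head} for every
        node except the root.
        """
        deps = {d for arcs in score.values() for d in arcs if d != root}
        # Greedy step: every non-root node keeps only its best incoming arc.
        head = {d: max((h for h in score if d in score[h]),
                       key=lambda h: score[h][d]) for d in deps}
        cycle = find_cycle(head)
        if cycle is None:
            return head
        # Contract the cycle into a fresh meta-node c; an arc entering the
        # cycle is rescored by its gain over the cycle arc it would replace.
        cyc, c = set(cycle), object()
        new_score, enter, leave = {}, {}, {}
        for h, arcs in score.items():
            for d, w in arcs.items():
                if (h in cyc) == (d in cyc):
                    if h not in cyc:            # arc entirely outside: keep
                        new_score.setdefault(h, {})[d] = w
                    continue
                hh, dd = (c, d) if h in cyc else (h, c)
                if dd is c:
                    w -= score[head[d]][d]
                if w > new_score.get(hh, {}).get(dd, float("-inf")):
                    new_score.setdefault(hh, {})[dd] = w
                    if dd is c:
                        enter[hh] = (h, d)      # remember the real arc
                    else:
                        leave[dd] = h
        sub = max_arborescence(new_score, root)
        # Expand c: cycle arcs are kept except where the chosen entering
        # arc breaks the cycle; arcs out of c revert to their real heads.
        tree = {d: (leave[d] if h is c else h)
                for d, h in sub.items() if d is not c}
        tree.update({d: head[d] for d in cyc})
        real_head, broken = enter[sub[c]]
        tree[broken] = real_head
        return tree

    # Example: the cycle between nodes 1 and 2 is broken in favour of the
    # heavier arcs, giving {1: 0, 2: 1} with total weight 5 + 11 = 16.
    print(max_arborescence({0: {1: 5, 2: 1}, 1: {2: 11}, 2: {1: 10}}))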

MST Parser achieved the best unlabeled attachment scores (UAS) for 9 of the 13 languages of CoNLL-X, and the second-best scores for two others. Parsing is fast, but training the parser takes many hours on large treebanks. On small data, however, multiple quick experiments with different settings are still feasible. The parser is implemented in Java and freely available for download.\footnote{\url{http://sourceforge.net/projects/mstparser/}}

@@ -135,7 +135,7 @@ \subsection{Voting Superparser}
\label{sec:voting}
The three parsers are combined using a simple weighted-voting approach similar to \citet{biblio:ZeZaImprovingParsing2005}, except that the output is guaranteed to be cycle-free. We start by evaluating every parser separately on the development data. The UAS of each parser is then used as the weight of that parser's vote. Dependencies are parent-child relations, and for every node there are up to three candidates for its parent (if all three parsers disagree). Candidates get weighted votes -- e.g., if parsers with weights $w_1 = 0.8$ and $w_2 = 0.7$ agree on a candidate, the candidate gets 1.5 votes. Since we have only three parsers, in practice this means that the candidate of the best parser loses only if (1) the other two parsers agree on a different candidate, or (2) attaching the child to this candidate would create a cycle.
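
A minimal sketch of this vote counting (hypothetical helper names; inputs are the heads proposed for one node by each of the parsers and the parsers' development-set UAS weights):

    def vote_head(proposals, weights):
        """Pick one node's parent by weighted voting.

        proposals: head proposed for this node by each parser, e.g. [4, 4, 7];
        weights: each parser's UAS on the development data.
        """
        votes = {}
        for head, w in zip(proposals, weights):
            votes[head] = votes.get(head, 0.0) + w
        return max(votes, key=votes.get)

    # The two weaker parsers outvote the strongest one when they agree:
    # 0.8 + 0.7 = 1.5 > 0.9, so the winner is head 4.
    print(vote_head([4, 4, 7], [0.8, 0.7, 0.9]))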

- The tree is constructed from the root down. We repeatedly add nodes whose winning parent candidates are already in the tree. If none of the remaining nodes meet this condition, we have to break a cycle. We do so by examining all unattached nodes. At each node we notice the votes of its current winning parent. Then we remove the least-scoring winner and go on with adding nodes until all nodes are attached or there is another cycle to break.
+ The tree is constructed from the root down. We repeatedly add nodes whose winning parent candidates are already in the tree. If none of the remaining nodes meet this condition, we have to break a cycle. We do so by examining all unattached nodes. At each node we note the votes of its current winning parent. Then we remove the least-scoring winner and go on with adding nodes until all nodes are attached or there is another cycle to break.
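
The same procedure in compact form, as our own sketch of the construction loop just described; it assumes node 0 is the artificial root and that every node always retains at least one viable candidate to fall back on:

    def build_tree(n, cand_votes):
        """Attach nodes top-down; break cycles by dropping weak winners.

        n: number of word nodes (1..n); 0 is the artificial root.
        cand_votes: for each node, a dict {candidate head: total votes}
        from the voting step.  Sketch only; assumes every node always
        retains at least one candidate (e.g. the root) to fall back on.
        """
        head, attached = {}, {0}
        unattached = set(range(1, n + 1))
        winner = {d: max(cand_votes[d], key=cand_votes[d].get)
                  for d in unattached}
        while unattached:
            ready = [d for d in unattached if winner[d] in attached]
            if ready:
                for d in ready:
                    head[d] = winner[d]
                    attached.add(d)
                unattached -= set(ready)
            else:
                # Cycle: remove the least-scoring current winner; its node
                # then falls back to its next-best candidate.
                loser = min(unattached, key=lambda d: cand_votes[d][winner[d]])
                del cand_votes[loser][winner[loser]]
                winner[loser] = max(cand_votes[loser], key=cand_votes[loser].get)
        return head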

\section{Experiments}
\label{sec:experiments}
@@ -216,7 +216,7 @@ \subsection{Morphology}
bn & 85.70 & 84.71 & 54.38 & \textbf{86.19}\\
te & 79.85 & 80.89 & 45.78 & \textbf{82.37}\\
\end{tabular}
- \caption{UAS on mixed data: MST and DZ use POS+case+vibhakti for all languages, Malt uses that for Hindi and POS only elsewhere. MST is now the best parser for hi and bn, which changes the voting weights.}
+ \caption{UAS on mixed data: MST and DZ use POS+case+vibhakti for all languages, Malt uses that for Hindi only, elsewhere it uses just POS.}
\label{tab:posmix2}
\end{centering}
\end{table}
@@ -246,7 +246,7 @@ \subsection{Error Patterns}

The accuracy of the dependencies is relatively high and it is difficult to identify recurring error patterns. In Hindi, many wrong attachments seem to be long-distance, and verbs, conjunctions, root and NULL nodes are frequently involved. Frequent words should perhaps be made available to the parsers as parts of tag strings: for instance, Hindi \hi{कि} \translit{ki} ``that'' or \hi{तो} \translit{to} are wrongly attached because the parser only sees the general CC tag. On a similar note, problems with coordination, also observed e.g. by \citet{dzparser}, occur here too: \hi{भाई और भाभी} \translit{bhāī aura bhābhī} ``brother and his wife'' is correctly recognized as a coordination headed by the conjunction \hi{और}; however, the conjunction node lacks the information about its noun children and fails to attach as the subject of the verb.

- The tag string should contain both the chunk label and the POS tag. So far we wrongly assumed that POS always determins the chunk label. It is often so but not always, as exemplified in the Bangla chunk sequence \bn{তবে সুদীপ ওকে একদিন আড়ালে ডেকে বলেছিল কৌতূহল দেখালে তুমি উঁচুতে উঠতে অনিমেষ} \translit{tabe sudīpa oke ekadina āṛāle ḍeke balechila kautūhala dekhāle tumi um̃cute uṭhate animeša}. The words \bn{ডেকে} and \bn{দেখালে} are tagged VGNF|VM while \bn{বলেছিল} and \bn{উঠতে} are VGF|VM. The parser gets them wrong and it could be caused by it seeing only VM in the tag.
+ The tag string should contain both the chunk label and the POS tag. So far we have wrongly assumed that POS always determines the chunk label. This is often the case but not always, as exemplified in the Bangla chunk sequence \bn{তবে সুদীপ ওকে একদিন আড়ালে ডেকে বলেছিল কৌতূহল দেখালে তুমি উঁচুতে উঠতে অনিমেষ} \translit{tabe sudīpa oke ekadina āṛāle ḍeke balechila kautūhala dekhāle tumi um̃cute uṭhate animeša}. The words \bn{ডেকে} and \bn{দেখালে} are tagged VGNF|VM while \bn{বলেছিল} and \bn{উঠতে} are VGF|VM. The parser misattaches them, possibly because it sees only VM in the tag.
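
In code form, the point is simply that collapsing the tag string to the POS erases the finite/non-finite chunk distinction (an illustrative snippet, not the contest data format):

    chunks = [("VGNF", "VM"), ("VGF", "VM")]      # (chunk label, POS)
    print([pos for _, pos in chunks])             # ['VM', 'VM']: distinction lost
    print(["|".join(pair) for pair in chunks])    # ['VGNF|VM', 'VGF|VM']: kept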

In Telugu, an extraordinary number of sentences follow the SOV order so strictly that the last node (the verb) is almost always attached to the root and most other nodes are attached directly to the last node. An example chunk sequence where this rule would lead to 100~\% accuracy follows: \te{రాష్ట్రంలొ రంగారెడ్డి మెదక్ నిజామాబాద్ జిల్లాలలొ పంటను గొప్పొ పండిస్తున్నారు} \translit{rāšṭraṁlo raṁgāreḍḍi medak nijāmābād jillālalo paṁṭanu goppo paṁḍistunnāru}. In light of such examples it seems reasonable to provide the parsers with an additional feature telling whether a particular dependency follows the ``naïve Telugu'' structure. Note however that this will not help with the other two languages: while 73.75~\% of Telugu dependencies follow this rule, it is only 39.52~\% in Bangla and 35.71~\% in Hindi.
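
The proposed feature could be as simple as the following check (a hypothetical sketch; root and last denote the artificial root and the sentence-final node):

    def naive_telugu(head, dep, last, root=0):
        """True iff the arc head -> dep follows the 'naive Telugu' pattern:
        the final node hangs on the root, every other node on the final node."""
        return head == (root if dep == last else last)

Counting the gold-standard arcs for which this returns True is what yields the 73.75 % rate for Telugu quoted above.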

@@ -311,7 +311,7 @@ \section{Related and Future Work}
\section{Conclusion}
\label{sec:concl}

- We have described our system of voting parsers, as applied to the ICON 2009 NLP Tools Contest task. We showed that case and vibhakti are important features at least for parsing Hindi while their usability in Bangla and Telugu is limited by data sparseness. We also discussed several error patterns that could lead to further improvements of the parsing system in future.
+ We have described our system of voting parsers, as applied to the ICON 2009 NLP Tools Contest task. We showed that case and vibhakti are important features at least for parsing Hindi, while their usability in Bangla and Telugu is limited by data sparseness. Providing these features to MST and DZ in all languages, and to Malt in Hindi only, yielded the best combined parser. We also discussed several error patterns that could lead to further improvements of the parsing system in the future.

\section*{Acknowledgements}

Binary file added papers/2009-icon-hyderabad/submitted1.pdf
Binary file not shown.
