<script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega@5"></script>
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-embed@6"></script>

<span class="firstcharacter">T</span>agalog is my native language. It's spoken
by 76 million Filipinos and has been the country's official language since the
1930s. It's a **text-rich** language, but unfortunately, a **low-resource**
one. In the age of big data and large language models, building NLP pipelines
for Tagalog is still difficult.

In this blog post, I'll talk about how I built a named-entity recognition (NER)
pipeline for Tagalog. I'll discuss how I came up with a gold-standard dataset,
my benchmarking results, and my hopes for the future of Tagalog NLP.

> I don't recommend using this pipeline for production purposes yet. See [caveats](#caveats).
## <a id="corpora"></a>Tagalog NER data is scarce [↵](#toc)

Even though Tagalog is text-rich, the amount of annotated data is scarce. We
usually call such languages **low-resource**. This problem isn't unique to
Tagalog. Out of the approximately 7,000 languages worldwide, only 10 have
adequate NLP resources ([Mortensen, 2017](#mortensen) and [Tsvetkov,
2017](#tsvetkov2017opportunities)). However, we can circumvent the data scarcity
problem by bootstrapping the data we have.

### <a id="circumvent"></a> We can circumvent the data scarcity problem... [↵](#toc)

There are many clever ways to circumvent the data scarcity problem. They usually
involve taking advantage of a high-resource language and transferring its
capacity to a low-resource one. The table below outlines some techniques:

| Approach | Data* | Prerequisites | Description |
|---|---|---|---|
{:style="text-align: center;"}

In this blog post, I will focus on **supervised** and **few-shot learning**.
Because most of these methods require a substantial amount of data, we need to
take advantage of existing corpora. One way is to use *silver-standard data*.
Silver-standard annotations are usually generated by a statistical model trained
on a similar language, or from a knowledge base. They may not be accurate or
trustworthy, but they're faster and cheaper to produce.
### <a id="bootstrapping"></a> ...by bootstrapping the data we have [↵](#toc)

The best way to work with silver-standard data is to use them for bootstrapping
the annotations of a much larger and more diverse dataset, producing
*gold-standard annotations*. By bootstrapping the annotations, we reduce the
cognitive load of labeling and focus more on correcting the model's outputs
rather than labeling from scratch. The figure below illustrates the workflow I'm
following:

*(Figure: the annotation bootstrapping workflow.)*
> By bootstrapping the annotations, we reduce the cognitive load of labeling
> and focus more on correcting the model's outputs rather than labeling from scratch.

The only major NER dataset for Tagalog is **WikiANN**. It is a silver-standard
dataset based on an English Knowledge Base (KB). [Pan, Zhang, et al.
(2017)](#pan2017wikiann) created a framework for tagging entities based on
Wikipedia and extended it to 282 other languages, including Tagalog. However,
it's not perfect. For example, the [first few entries of the validation
set](https://huggingface.co/datasets/wikiann/viewer/tl/validation) have glaring
errors:
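You can eyeball these entries yourself. Here's a minimal sketch using the
Hugging Face `datasets` library (assuming the `wikiann` dataset with its `tl`
config, as linked above):

```python
from datasets import load_dataset

# Load the Tagalog validation split of WikiANN
wikiann = load_dataset("wikiann", "tl", split="validation")
label_names = wikiann.features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

# Print the first few entries as (token, tag) pairs
for row in wikiann.select(range(5)):
    print([(tok, label_names[t]) for tok, t in zip(row["tokens"], row["ner_tags"])])
```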
Also, the texts themselves aren't complete sentences. A model trained on this
data might translate poorly to longer documents as the *context* of an entity is
lost. We can't rely solely on a model trained from WikiANN. However, it can still
be useful: we can use it to train a model that bootstraps our annotations.

> ...the texts [in WikiANN] aren't complete sentences. A model trained on this
> data might translate poorly to longer documents...so we can't [just] rely [on it].
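To make the bootstrapping step concrete, here's a rough sketch that runs a
hypothetical spaCy model trained on WikiANN over raw TLUnified texts and saves
its predictions as pre-annotations for manual correction later (the model and
file paths are placeholders, not the project's actual layout):

```python
import json

import spacy

# Placeholder path: a pipeline trained on WikiANN silver-standard data
nlp = spacy.load("training/wikiann-silver/model-best")

with open("tlunified_sample.txt") as f:
    texts = [line.strip() for line in f if line.strip()]

# One JSONL record per document, in a shape annotation tools can ingest
with open("preannotated.jsonl", "w") as f:
    for doc in nlp.pipe(texts):
        spans = [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ]
        f.write(json.dumps({"text": doc.text, "spans": spans}) + "\n")
```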
Fortunately, there are corpora that capture the **diversity of the Filipino
language**. For example, there is the
[CommonCrawl](https://commoncrawl.org/) repository that contains web-crawled
data for any language. We also have TLUnified ([Cruz and Cheng,
2022](#cruz2022tlunified)) and WikiText TL-39 ([Cruz and Cheng,
2019](#cruz2019wikitext)). For my experiments, I will use the TLUnified dataset
as it's more recent, and one of its subdomains (news) resembles that of standard
NER benchmarks like CoNLL.

> I will be using the TLUnified dataset as it's more recent, and one of its
> subdomains resembles that of standard NER benchmarks like CoNLL.
Piece of cake, right?

However, *labeling thousands of samples is not the hardest part.* As the sole
annotator, I can easily introduce my own biases and errors into the dataset. In
practice, you'd want three or more annotators and an inter-annotator agreement.
Unfortunately, this is the limitation of this work. In the next section,
I'll outline some of my attempts to be more objective when annotating. Of
course, the ideal case is to have multiple annotators, so let me know if you
want to help out!
I named this dataset `tl_tlunified_gold` (`tl` - language, `tlunified` - data
source, `gold` - dataset type).

For the past three months, I corrected annotations produced by the WikiANN model.
I learned that as an annotator, it's easier to fix annotations than to label them
from scratch. I also devised
[annotation guidelines](https://github.com/ljvmiranda921/calamanCy/tree/master/datasets/tl_calamancy_gold_corpus/guidelines) ([Artstein, 2017](#artstein2017inter)) to make the annotation process more objective. Professor Nils
Reiter has an [excellent guide](https://sharedtasksinthedh.github.io/2017/10/01/howto-annotation/) for
developing these. I also took inspiration from [*The Guardian*'s
work](https://github.com/JournalismAI-2021-Quotes/quote-extraction/blob/28f429b260fc30dd884cd4d0a8ff0cb9047f0fe4/annotation_rules/Quote%20annotation%20guide.pdf),
which uses [Prodigy for quotation
detection](https://explosion.ai/blog/guardian).

Since there are still gaps in my annotation process, the annotations produced in
v1.0 of `tl_tlunified_gold` are **not ready** for production. Getting multiple
annotations and developing an inter-annotator agreement for several iterations
is the ideal case.

> ...the annotations produced in v1.0 of `tl_tlunified_gold` are **not ready**
> for production.
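For intuition, inter-annotator agreement is usually measured with a
chance-corrected statistic such as Cohen's kappa. A toy sketch, assuming two
annotators' token-level labels are already aligned (the labels below are made
up):

```python
from sklearn.metrics import cohen_kappa_score

# Token-level NER labels from two hypothetical annotators on the same tokens
annotator_a = ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "B-ORG"]
annotator_b = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG"]

# 1.0 means perfect agreement; values near 0 mean chance-level agreement
print(cohen_kappa_score(annotator_a, annotator_b))
```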
Nevertheless, I produced some annotations for around 7,000 documents. I split
them among training, development, and test partitions and uploaded the v1.0 of
the raw annotations to the cloud. You can access the raw annotations and
replicate the preprocessing step by checking out the [GitHub repository of this
project](https://github.com/ljvmiranda921/calamanCy/tree/master/datasets/tl_calamancy_gold_corpus).
The table below shows some dataset statistics:

| Tagalog Data | Documents | Tokens | PER | ORG | LOC |
|---|---|---|---|---|---|
I want to see how standard NER approaches fare with `tl_tlunified_gold`. **My
eventual goal is to set up training pipelines to produce decent Tagalog
models from this dataset.** I made two sets of experiments, one involving word
vectors and the other using transformers. I aim to identify the best training
setup for this Tagalog corpus. I'm not pitting one against the other; I want to
set up training pipelines for both in the future.
Then, I will examine if adding word vectors (also called [*static
vectors*](https://spacy.io/usage/embeddings-transformers#static-vectors) in
spaCy) can improve performance. Finally, I will investigate if
[*pretraining*](https://spacy.io/usage/embeddings-transformers#pretraining)
can help push performance further:

| Approach | Setup | Description |
|---|---|---|
I trained the vectors with a dimension of $$200$$, and a subword minimum
(`minn`) and maximum size (`maxn`) of $$3$$ and $$5$$, respectively.
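Those subword settings (`minn`, `maxn`) match [fastText](https://fasttext.cc)'s
hyperparameters, so here's a minimal sketch of how such vectors could be
trained with the `fasttext` Python package (the corpus path is a placeholder):

```python
import fasttext

# Train subword-aware vectors on raw text, one pre-processed document per line
model = fasttext.train_unsupervised(
    "tlunified_for_vectors.txt",  # placeholder path to the raw corpus
    dim=200,  # vector dimension, as described above
    minn=3,   # minimum character n-gram length
    maxn=5,   # maximum character n-gram length
)
model.save_model("tl_vectors.bin")
```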
Lastly, I also removed the annotated texts from TLUnified during training to
ensure that no overlaps influence our benchmark results. These results can be
seen in the table below:

| Word Vectors | Unique Vectors* | Precision | Recall | F1-score |
|---|---|---|---|---|
Using the language model as a **drop-in replacement for our token-to-vector**
embedding layer is much faster. Previously, we slotted a [`tok2vec`](https://spacy.io/api/tok2vec)
embedding layer that downstream components like
[`ner`](https://spacy.io/api/entityrecognizer) use. Here, we effectively replace
that with a transformer model. For example, the English transformer model
[`en_core_web_trf`](https://spacy.io/models/en#en_core_web_trf) uses RoBERTa
([Liu, et al., 2019](#liu2019roberta)) as its base. We want transformers because
of their dense and context-sensitive representations, even if they have higher
compute costs.
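In spaCy's training config, that swap amounts to pointing the `ner` component's
internal `tok2vec` at a shared transformer through a listener layer. Here's a
minimal excerpt following the pattern in spaCy's embeddings-and-transformers
documentation (the `roberta-base` checkpoint is only an illustration; a Tagalog
pipeline would use a different one):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
# Illustrative checkpoint; any Hugging Face model name can go here
name = "roberta-base"

[components.ner.model.tok2vec]
# The NER component listens to the shared transformer instead of a tok2vec layer
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```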
### <a id="final-thoughts"></a> Final thoughts [↵](#toc)

In Tagalog, we have this word called *diskarte*. There is no direct translation
in English, but I can describe it loosely as resourcefulness and creativity.
It's not a highly cognitive trait: smart people may be bookish, but not
*madiskarte*. It's more practical, a form of street smarts, even. *Diskarte* is
a very Filipino trait, born of our need to solve things creatively in the
presence of constraints. I mention this because working in Tagalog, or any
low-resource language, requires a little *diskarte*, and I enjoy it!

There are many exciting ways to tackle Tagalog NLP. Right now, I'm taking the
standard labeling, training, and evaluation approach. However, I'm interested in
exploring model-based techniques like cross-lingual transfer learning and
multilingual NLP to "get around" the data bottleneck. After three months (twelve
weekends, to be specific) of labeling, I realized how long and costly the
process was. I still believe in getting gold-standard annotations, but I also
want to balance this approach with short-term solutions.

I wish we had more consolidated efforts to work on Tagalog NLP. Right now,
research progress at each institution seems disconnected from the others. I
definitely like what's happening in [Masakhane](https://www.masakhane.io/) for
African languages and [IndoNLP](https://indonlp.github.io/) for Indonesian; I
think they are good community models to follow. Lastly, Tagalog is not the only
language in the Philippines, so being able to work on the country's languages
one at a time would be nice.

Right now, I'm working on
[calamanCy](https://github.com/ljvmiranda921/calamanCy), my attempt to create
spaCy pipelines for Tagalog. Its name is based on kalamansi, a citrus fruit
common in the Philippines. Unfortunately, it's something I've been working on in
my spare time, so progress is slower than I'd like! This blog post contains my
experiments on building the NER part of the pipeline. I plan to add a dependency
parser and a POS tagger trained on Universal Dependencies in the future.

That's all for now. Feel free to hit me up if you have any questions or want to
collaborate! Maraming salamat!
- <a id="pan2017wikiann">Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji.</a> 2017. [Cross-lingual Name Tagging and Linking for 282 Languages](https://aclanthology.org/P17-1178). In *Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
- <a id="tsvetkov2017opportunities">Yulia Tsvetkov.</a> 2017. Opportunities and Challenges in Working with Low-Resource Languages. Language Technologies Institute, Carnegie Mellon University. [[Slides]](https://www.cs.cmu.edu/~ytsvetko/jsalt-part1.pdf).