Add few grammar fixes and add conclusion
ljvmiranda921 committed Jan 20, 2023
1 parent 5634a55 commit ff0c48e
Showing 2 changed files with 81 additions and 68 deletions.
7 changes: 3 additions & 4 deletions _drafts/tagalog.md
noticed that research efforts at each institution are disconnected from one
another. I definitely like what's happening in
[Masakhane](https://www.masakhane.io/) for African languages and
[IndoNLP](https://indonlp.github.io/) for Indonesian. I think they are good
community models to follow. Lastly, Tagalog is not the only language in the
Philippines, and being able to solve one Filipino language at a time would be
nice.

Right now, I'm working on
[calamanCy](https://github.com/ljvmiranda921/calamanCy), my attempt to create
142 changes: 78 additions & 64 deletions notebook/_posts/2023-02-04-tagalog-pipeline.md
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega@5"></script>
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-embed@6"></script>

<span class="firstcharacter">T</span>agalog is my native language. It's spoken
by 76 million Filipinos and has been the country's official language since the
1930s. It's a **text-rich** language, but unfortunately, a
**low-resource** one. In the age of big data and large language models, building
NLP pipelines for Tagalog is still difficult.

In this blog post, I'll talk about how I built a named-entity recognition (NER)
pipeline for Tagalog. I'll discuss how I came up with a gold-standard dataset,
my benchmarking results, and my hopes for the future of Tagalog NLP.

> I don't recommend using this pipeline for production purposes yet. See [caveats](#caveats).
## <a id="corpora"></a>Tagalog NER data is scarce [&crarr;](#toc)

Even if Tagalog is text-rich, the amount of annotated data is scarce. We
usually refer to these types of languages as **low-resource**. This problem isn't
unique to Tagalog. Out of the approximately 7000 languages worldwide, only
10 have adequate NLP resources ([Mortensen, 2017](#mortensen) and [Tsvetkov,
2017](#tsvetkov2017opportunities)). However, we can circumvent the data scarcity problem
by bootstrapping the data we have.


> We can circumvent the data scarcity problem by bootstrapping the data
> we have.
### <a id="circumvent"></a> We can circumvent the data scarcity problem... [&crarr;](#toc)

There are many clever ways to circumvent the data scarcity problem. They usually
involve taking advantage of a high-resource language and transferring its
capacity to a low-resource one. The table below outlines some techniques:


| Approach | Data* | Prerequisites | Description |
{:style="text-align: center;"}


In this blog post, I will focus on **supervised** and **few-shot learning**.
Because most of these methods require a substantial amount of data, we need to
take advantage of existing corpora. One way is to use *silver-standard data*.
Silver-standard annotations are usually generated by a statistical model trained
on a similar language or by a knowledge base. They may not be accurate or
trustworthy, but they're faster and cheaper.
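For instance, a quick way to get silver annotations for Tagalog is to run
spaCy's multilingual NER model over raw text. Here's a minimal sketch, assuming
the model has been downloaded (`python -m spacy download xx_ent_wiki_sm`):

```python
# a minimal sketch of silver annotation with a multilingual model
import spacy

nlp = spacy.load("xx_ent_wiki_sm")  # multilingual NER trained on Wikipedia corpora
doc = nlp("Si Jose Rizal ay ipinanganak sa Calamba, Laguna.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., Jose Rizal PER, Calamba LOC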

### <a id="bootstrapping"></a> ...by bootstrapping the data we have [&crarr;](#toc)

The best way to work with silver-standard data is to use them for bootstrapping
the annotations of a much larger and more diverse dataset, producing
*gold-standard annotations*. By bootstrapping the annotations, we reduce the
cognitive load of labeling and focus more on correcting the model's outputs
rather than doing it from scratch. The figure below illustrates the workflow I'm
following:

![](/assets/png/tagalog-gold-standard/silver_standard_framework.png){:width="650px"}
> By bootstrapping the annotations, we reduce the cognitive load of labeling
> and focus more on correcting the model's outputs rather than labeling from scratch.
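To make this workflow concrete, here's a minimal sketch of the pre-annotation
step. It assumes a hypothetical silver model saved at `./silver_model` and raw
TLUnified texts in `texts.txt`; the output is JSONL with character-offset
spans, a format that annotation tools like Prodigy can take in:

```python
# a minimal sketch of pre-annotating raw text with a silver model
# (./silver_model and texts.txt are hypothetical paths)
import json

import spacy

nlp = spacy.load("./silver_model")

with open("texts.txt") as texts, open("preannotated.jsonl", "w") as out:
    for line in texts:
        doc = nlp(line.strip())
        spans = [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ]
        # each record can be loaded into an annotation tool for correction
        out.write(json.dumps({"text": doc.text, "spans": spans}) + "\n")
```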
The only major NER dataset for Tagalog is **WikiANN**. It is a silver-standard
dataset based on an English Knowledge Base (KB). [Pan, Zhang, et al.
(2017)](#pan2017wikiann) created a framework for tagging entities based on
Wikipedia and extended it to 282 other languages, including Tagalog. However, it's not perfect. For example, the [first few entries of the validation
set](https://huggingface.co/datasets/wikiann/viewer/tl/validation) have glaring
errors:
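You can peek at these entries yourself. Here's a minimal sketch using the
Hugging Face `datasets` library (assuming it's installed):

```python
# a minimal sketch for inspecting WikiANN's Tagalog validation split
from datasets import load_dataset

wikiann = load_dataset("wikiann", "tl", split="validation")
label_names = wikiann.features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

for row in wikiann.select(range(3)):
    tagged = [(tok, label_names[tag]) for tok, tag in zip(row["tokens"], row["ner_tags"])]
    print(tagged)
```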


Also, the texts themselves aren't complete sentences. A model trained on this
data might translate poorly to longer documents as the *context* of an entity is
lost. We can't rely solely on a model trained on WikiANN. However, it can still
be useful: we can use it to train a model that bootstraps our annotations.
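As a rough illustration, here's a minimal sketch of how WikiANN can be
converted into spaCy's binary format to train such a bootstrap model (assuming
spaCy v3 and the `datasets` library; the output path is hypothetical):

```python
# a minimal sketch: convert WikiANN (tl) into spaCy's training format
import spacy
from datasets import load_dataset
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("tl")  # spaCy ships a blank Tagalog pipeline
wikiann = load_dataset("wikiann", "tl", split="train")
label_names = wikiann.features["ner_tags"].feature.names

db = DocBin()
for row in wikiann:
    iob_tags = [label_names[tag] for tag in row["ner_tags"]]
    doc = Doc(nlp.vocab, words=row["tokens"], ents=iob_tags)
    db.add(doc)

db.to_disk("wikiann_tl_train.spacy")  # can then be passed to `spacy train`
```

A model trained on this file is what generates the silver annotations for
correction.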

> ...the texts [in WikiANN] aren't complete sentences. A model trained on this
> data might translate poorly to longer documents...so we can't [just] rely [on it].
[CommonCrawl](https://commoncrawl.org/) repository that contains web-crawled
data for any language. We also have TLUnified ([Cruz and Cheng,
2022](#cruz2022tlunified)) and WikiText TL-39 ([Cruz and Cheng,
2019](#cruz2019wikitext)). For my experiments, I will use the TLUnified
dataset as it's more recent, and one of its subdomains (news) resembles that of
standard NER benchmarks like CoNLL.
2019](#cruz2019wikitext)) that are much more recent. For my experiments, I will
use the TLUnified dataset as it's more recent, and one of its subdomains (news)
resembles that of standard NER benchmarks like CoNLL.

> I will be using the TLUnified dataset as it's more recent, and one of its
> subdomains resembles that of standard NER benchmarks like CoNLL.
Piece of cake, right?

However, *labeling thousands of samples is not the hardest part.* As the sole
annotator, I can easily introduce my own biases and errors into the dataset. In
practice, you'd want three or more annotators and a measure of inter-annotator
agreement. Unfortunately, this is a limitation of this work. In the next section,
I'll outline some of my attempts to be more objective when annotating. Of
course, the ideal case is to have multiple annotators, so let me know if you
want to help out!

For the past three months, I corrected annotations produced by the WikiANN model.
I learned that as an annotator, it's easier to fix annotations than label them
from scratch. I also devised
[annotation guidelines](https://github.com/ljvmiranda921/calamanCy/tree/master/datasets/tl_calamancy_gold_corpus/guidelines) ([Artstein, 2017](#artstein2017inter)) to make the annotation process more objective. Professor Nils
Reiter has an [excellent guide](https://sharedtasksinthedh.github.io/2017/10/01/howto-annotation/) for
developing these. I also took inspiration from [*The Guardian*'s
work](https://github.com/JournalismAI-2021-Quotes/quote-extraction/blob/28f429b260fc30dd884cd4d0a8ff0cb9047f0fe4/annotation_rules/Quote%20annotation%20guide.pdf),
which uses [Prodigy for quotation
detection](https://explosion.ai/blog/guardian).


> ...the annotations produced in v1.0 of `tl_tlunified_gold` are **not ready**
> for production...my annotations are still in the pre-pilot phase.
Since there are still gaps in my annotation process, the annotations produced in
v1.0 of `tl_tlunified_gold` are **not ready** for production. Getting multiple
annotations and developing an inter-annotator agreement for several iterations
is the ideal case.
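For instance, agreement between two annotators over the same tokens can be
measured with Cohen's kappa. Here's a minimal sketch using scikit-learn; the
two label sequences are made up for illustration:

```python
# a minimal sketch of measuring inter-annotator agreement with Cohen's kappa
# (the two label sequences below are hypothetical)
from sklearn.metrics import cohen_kappa_score

annotator_a = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
annotator_b = ["B-PER", "I-PER", "O", "O", "O", "O"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 is perfect agreement; ~0 is chance level
```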


Nevertheless, I produced some annotations for around 7,000 documents. I split
them into training, development, and test partitions and uploaded v1.0 of the
raw annotations to the cloud. You can access the raw annotations and replicate
the preprocessing step by checking out the [GitHub repository of this
project](https://github.com/ljvmiranda921/calamanCy/tree/master/datasets/tl_calamancy_gold_corpus). The table below shows some dataset statistics:


| Tagalog Data | Documents | Tokens | PER | ORG | LOC |
I want to see how standard NER approaches fare with `tl_tlunified_gold`. **My
eventual goal is to set up training pipelines to produce decent Tagalog
models from this dataset.** I ran two sets of experiments, one involving word
vectors and the other using transformers. I aim to identify
the best training setup for this Tagalog corpus. I'm not pitting
one against the other; I want to set up training pipelines for both in the
future.
Then, I will examine if adding word vectors (also called [*static
vectors*](https://spacy.io/usage/embeddings-transformers#static-vectors) in
spaCy) can improve performance. Finally, I will investigate if
[*pretraining*](https://spacy.io/usage/embeddings-transformers#pretraining)
can help push performance further:


| Approach | Setup | Description |
with a dimension of $$200$$, and a subword minimum (`minn`) and maximum size
(`maxn`) of $$3$$ and $$5$$ respectively.
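For reference, here's a minimal sketch of how vectors with these
hyperparameters can be trained using the `fasttext` package (the corpus path is
hypothetical):

```python
# a minimal sketch of training fastText vectors with the hyperparameters above
# (tlunified.txt is a hypothetical path to the plain-text corpus)
import fasttext

model = fasttext.train_unsupervised(
    "tlunified.txt",
    model="skipgram",  # skip-gram with subword information
    dim=200,           # vector dimension
    minn=3,            # minimum character n-gram size
    maxn=5,            # maximum character n-gram size
)
model.save_model("tl_fasttext.bin")
```

The resulting vectors can then be added to a spaCy pipeline, e.g., by exporting
them to the textual `.vec` format and running the `spacy init vectors` CLI.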

Lastly, I also removed the annotated texts from TLUnified during training to ensure
no overlaps will influence our benchmark results. These results can be seen
in the table below:

| Word Vectors | Unique Vectors* | Precision | Recall | F1-score |
language model as a **drop-in replacement for our token-to-vector** embedding layer
is much faster. Previously, we slotted a [`tok2vec`](https://spacy.io/api/tok2vec)
embedding layer that downstream components like
[`ner`](https://spacy.io/api/entityrecognizer) use. Here, we effectively replace
that with a transformer model. For example, the English transformer model
[`en_core_web_trf`](https://spacy.io/models/en#en_core_web_trf) uses RoBERTa
([Liu, et al., 2019](#liu2019roberta)) as its base. We want transformers because
of their dense and context-sensitive representations, even if they have higher
annotating.

### <a id="final-thoughts"></a> Final thoughts [&crarr;](#toc)

In Tagalog, we have this word called *diskarte*. There is no direct translation
in English, but I can describe it loosely as resourcefulness and creativity.
It's not a highly cognitive trait: smart people may be bookish, but not
*madiskarte*. It's more practical, a form of street smarts, even. *Diskarte* is
a distinctly Filipino trait, born of our need to solve things creatively in the
presence of constraints. I mention this because working in Tagalog, or any
low-resource language, requires a little *diskarte*, and I enjoy it!

There are many exciting ways to tackle Tagalog NLP. Right now, I'm taking the
standard labeling, training, and evaluation approach. However, I'm interested in
exploring model-based techniques like cross-lingual transfer learning and
multilingual NLP to "get around" the data bottleneck. After three months (twelve
weekends, to be specific) of labeling, I realized how long and costly the
process was. I still believe in getting gold-standard annotations, but I also
want to balance this approach with short-term solutions.

I wish we had more consolidated efforts to work on Tagalog NLP. Right now, I've
noticed that research efforts at each institution are disconnected from one
another. I definitely like what's happening in
[Masakhane](https://www.masakhane.io/) for African languages and
[IndoNLP](https://indonlp.github.io/) for Indonesian. I think they are good
community models to follow. Lastly, Tagalog is not the only language in the
Philippines, and being able to solve one Filipino language at a time would be
nice.

Right now, I'm working on
[calamanCy](https://github.com/ljvmiranda921/calamanCy), my attempt to create
spaCy pipelines for Tagalog. Its name is based on *kalamansi*, a citrus fruit
common in the Philippines. Unfortunately, it's something that I've been working
on in my spare time, so progress is slower than usual! This blog post contains
my experiments on building the NER part of the pipeline. I plan to add a
dependency parser and POS tagger from Universal Dependencies in the future.

That's all for now. Feel free to hit me up if you have any questions or want to
collaborate! Maraming salamat!



- <a id="pan2017wikiann">Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji.</a> 2017. [Cross-lingual Name Tagging and Linking for 282 Languages](https://aclanthology.org/P17-1178). In *Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
- <a id="tsvetkov2017opportunities">Yulia Tsvetkov</a>, 2017. Opportunities and Challenges in Working with Low-Resource Languages. Language Technologies Institute, Carnegie Mellon University. [[Slides]](https://www.cs.cmu.edu/~ytsvetko/jsalt-part1.pdf).

