
Commit

Add final thoughts for Tagalog on draft mode
ljvmiranda921 committed Jan 17, 2023
1 parent b57c0d9 commit 5634a55
Showing 2 changed files with 38 additions and 39 deletions.
38 changes: 38 additions & 0 deletions _drafts/tagalog.md
@@ -0,0 +1,38 @@
<!--
In Tagalog, we have a word called diskarte. There is no direct translation in
English, but I can describe it loosely as resourcefulness and creativity. It's
not a highly cognitive trait: smart people may be bookish, but not madiskarte.
It's more practical, a form of street smarts, even. Diskarte is a
highly Filipino trait, born from our need to solve things creatively in the
presence of constraints. I mention this because working in Tagalog, or any
low-resource language, requires a little diskarte, and I enjoy it!

There are many exciting ways to tackle Tagalog NLP. Right now, I'm taking the
standard labeling, training, and evaluation approach. However, I'm interested in
exploring model-based techniques like cross-lingual transfer learning and
multilingual NLP to "get around" the data bottleneck. After three months (twelve
weekends, to be specific) of labeling, I realized how long and costly the
process was. I still believe in getting gold-standard annotations, but I also
want to balance this approach with short-term solutions.

I wish we had more consolidated efforts to work on Tagalog NLP. Right now, I
notice that research progress at each institution is disconnected from the
others. I definitely like what's happening in
[Masakhane](https://www.masakhane.io/) for African languages and
[IndoNLP](https://indonlp.github.io/) for Indonesian. I think they are good
community models to follow. In the future, wouldn't it be great if [Komisyon sa
Wikang Filipino](https://kwf.gov.ph/) had a dedicated computational linguistics
group? Tagalog is not the only language in the Philippines, and being able to
tackle one Filipino language at a time would be nice.

Right now, I'm working on
[calamanCy](https://github.com/ljvmiranda921/calamanCy), my attempt to create
spaCy pipelines for Tagalog. Its name is based on kalamansi, a citrus fruit
common in the Philippines. Unfortunately, it's something that I've been working
on in my spare time, so progress is slower than usual! This blog post contains
my experiments on building the NER part of the pipeline. I plan to add a
dependency parser and POS tagger from Universal Dependencies in the future.

That's all for now. Feel free to hit me up if you have any questions or want to
collaborate! Maraming salamat!
-->
39 changes: 0 additions & 39 deletions notebook/_posts/2023-02-04-tagalog-pipeline.md
@@ -675,45 +675,6 @@ annotating.

TODO

<!--
In Tagalog, we have a word called diskarte. There is no direct translation in
English, but I can describe it loosely as resourcefulness and creativity. It's
not a highly cognitive trait: smart people may be bookish, but not madiskarte.
It's more practical, a form of street smarts, even. Diskarte is a
highly Filipino trait, born from our need to solve things creatively in the
presence of constraints. I mention this because working in Tagalog, or any
low-resource language, requires a little diskarte, and I enjoy it!

There are many exciting ways to tackle Tagalog NLP. Right now, I'm taking the
standard labeling, training, and evaluation approach. However, I'm interested in
exploring model-based techniques like cross-lingual transfer learning and
multilingual NLP to "get around" the data bottleneck. After three months (twelve
weekends, to be specific) of labeling, I realized how long and costly the
process was. I still believe in getting gold-standard annotations, but I also
want to balance this approach with short-term solutions.

I wish we had more consolidated efforts to work on Tagalog NLP. Right now, I
notice that research progress at each institution is disconnected from the
others. I definitely like what's happening in
[Masakhane](https://www.masakhane.io/) for African languages and
[IndoNLP](https://indonlp.github.io/) for Indonesian. I think they are good
community models to follow. In the future, wouldn't it be great if [Komisyon sa
Wikang Filipino](https://kwf.gov.ph/) had a dedicated computational linguistics
group? Tagalog is not the only language in the Philippines, and being able to
tackle one Filipino language at a time would be nice.

Right now, I'm working on
[calamanCy](https://github.com/ljvmiranda921/calamanCy), my attempt to create
spaCy pipelines for Tagalog. Its name is based on kalamansi, a citrus fruit
common in the Philippines. Unfortunately, it's something that I've been working
on in my spare time, so progress is slower than usual! This blog post contains
my experiments on building the NER part of the pipeline. I plan to add a
dependency parser and POS tagger from Universal Dependencies in the future.

That's all for now. Feel free to hit me up if you have any questions or want to
collaborate! Maraming salamat!
-->



## References