boostcampaitech7/level1-semantictextsimilarity-nlp-05

Level 1. STS project


🔎 Project Overview

| Overview | Description |
| --- | --- |
| Topic | Semantic Textual Similarity (STS): an NLP task that quantifies how semantically similar two sentences are |
| Dataset | Training set: 9,324 examples, validation set: 550 examples, evaluation set: 1,100 examples. 50% of the evaluation data is used to compute the Public score shown on the live leaderboard; the remaining 50% is used to compute the Private result |
| Evaluation | Predict a similarity score between 0 and 5. The metric is the Pearson Correlation Coefficient (PCC) |
| Development environment | GPU: 4 Tesla V100 servers, IDE: VS Code, Jupyter Notebook |
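The Pearson correlation coefficient used as the metric can be computed as in the minimal sketch below. The function name `pearson` and the sample values are illustrative assumptions, not taken from the project code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up gold labels and predictions on the 0 to 5 scale.
labels = [0.0, 1.2, 2.4, 3.6, 5.0]
preds = [0.3, 1.0, 2.9, 3.4, 4.6]
score = pearson(labels, preds)
```

Because the metric is invariant to shifting and positive scaling of the predictions, it rewards getting the ordering and relative spacing of scores right rather than their absolute values.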

👨🏻‍💻 Team Members and Roles

| Member | Role |
| --- | --- |
| 권유진 | Data preprocessing, model research and experiments, model ensembling |
| 박무재 | Data augmentation, loss function experiments, model research and experiments |
| 박정미 | EDA, data preprocessing and augmentation, model experiments and ensembling |
| 이용준 | Project structure setup, data preprocessing and augmentation, model implementation and experiments |
| 정원식 | Model research, loss function experiments, model implementation and experiments, fine-tuning |

📊 Exploratory Analysis and Preprocessing

Exploratory analysis

  • Analysis of data distribution and characteristics

Preprocessing

  • Spelling: ran spell checking with the hanspell library (also removing characters such as ~ and ^^) and removed repeated characters
  • Data balancing: after examining the label distribution binned into integer units from 0 to 5, applied drop/duplication/swap so that each bin other than 0 holds about 2,000 examples
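The cleanup and binning steps above can be sketched roughly as follows. This is a sketch under stated assumptions: the helper names, the regular expressions, and the choice to keep at most two repeated characters are all hypothetical, and the hanspell spell-checking step is left out.

```python
import re
from collections import Counter


def collapse_repeats(text, max_run=2):
    """Shorten any run of one character longer than max_run to max_run copies."""
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


def strip_decorations(text):
    """Remove '~' and '^' decorations, then normalize whitespace (assumed rule)."""
    without_marks = re.sub(r"[~^]+", "", text)
    return re.sub(r"\s+", " ", without_marks).strip()


def label_bin_counts(labels):
    """Count examples per integer bin; a label such as 4.8 falls into bin 4."""
    return Counter(int(label) for label in labels)


cleaned = strip_decorations(collapse_repeats("sooooo goood!!!! ~ ^^"))
counts = label_bin_counts([0.0, 0.4, 1.2, 4.8, 5.0])
```

The per-bin counts show which bins need rows dropped and which need rows added through duplication or swapping to reach the target size.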

Data augmentation techniques

  • Swap sentence: the two sentences are concatenated around the [SEP] token, so simply swapping their order can produce a different tokenized input; swapping therefore yields new training examples
  • Duplication: using part of the data dropped while balancing label 0, copied sentence_1 into the sentence_2 position to form identical sentence pairs and assigned them label 5
  • Adverb insertion, Random insertion: replaced adverbs with results from searching for adverbs in the Daum Korean dictionary, or used a BERT-based model to randomly insert or substitute semantically natural tokens
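The swap and duplication augmentations can be illustrated with a small sketch. The column names `sentence_1`, `sentence_2`, and `label` follow the description above; the function names and the list-of-dicts row format are assumptions made for illustration.

```python
def swap_sentences(rows):
    """Return new rows with the two sentences exchanged and the label unchanged."""
    return [
        {
            "sentence_1": row["sentence_2"],
            "sentence_2": row["sentence_1"],
            "label": row["label"],
        }
        for row in rows
    ]


def duplicate_as_identical(rows):
    """Return new rows pairing sentence_1 with itself, labeled 5.0 (identical)."""
    return [
        {
            "sentence_1": row["sentence_1"],
            "sentence_2": row["sentence_1"],
            "label": 5.0,
        }
        for row in rows
    ]


rows = [{"sentence_1": "A", "sentence_2": "B", "label": 0.0}]
swapped = swap_sentences(rows)
identical = duplicate_as_identical(rows)
```

Both helpers build new rows and leave the input untouched, so the augmented rows can be appended to the original data without aliasing.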

Data versions

| Data Version | Description |
| --- | --- |
| v1 | Data preprocessing + Swap Sentence + Duplication |
| v2 | v1 + adverb insertion |
| v2.1 | v1 + random insertion |
| v2.2 | Early-stage preprocessing + adverb_insertion |
| v2.3 | Early-stage preprocessing + random_insertion |
| v3 | label_balancing (random seed = 42) |
| v3.1 | label_balancing (random seed = 123) |
| v3.2 | v3 + adverb insertion |
| v3.3 | v3 + random insertion |

⚒️ Development

Project structure

Model analysis and performance improvement
  • Per-model analysis of [UNK] tokenization

  • Models used
    • Roberta
      • klue/roberta-large, klue/roberta-base, klue/roberta-small
    • Electra
      • monologg/koelectra-base-discriminator, monologg/koelectra-base-v3-discriminator
      • snunlp/KR-ELECTRA-discriminator
      • beomi/KcELECTRA-base-v2022
    • deberta
      • team-lucid/deberta-v3-base-korean, team-lucid/deberta-v3-xlarge-korean

Learning Rate Scheduler implementation

  • Used cosine annealing with warm-up restarts: the learning rate rises quickly at the start to speed up convergence, then decreases as epochs progress so that later updates fine-tune the model.
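The shape of such a schedule can be sketched as a pure function of the training step. This is a minimal sketch, not the project's scheduler: the function name, the linear warm-up, the fixed cycle length, and the example values are all assumptions.

```python
import math


def cosine_warmup_restart_lr(step, cycle_steps, warmup_steps, max_lr, min_lr=0.0):
    """Learning rate at a step: linear warm-up, cosine decay, restart each cycle."""
    if not 0 < warmup_steps < cycle_steps:
        raise ValueError("require 0 < warmup_steps < cycle_steps")
    pos = step % cycle_steps  # position inside the current cycle
    if pos < warmup_steps:
        # Linear ramp from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * pos / warmup_steps
    # Cosine decay from max_lr back down toward min_lr.
    progress = (pos - warmup_steps) / (cycle_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [
    cosine_warmup_restart_lr(s, cycle_steps=100, warmup_steps=10, max_lr=2e-5)
    for s in range(200)
]
```

With these example values the rate peaks at step 10, decays until step 99, and restarts from the minimum at step 100.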

Early Stopping implementation

  • Monitored val_pearson and stopped training when it dropped, to prevent overfitting.
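A framework-independent sketch of this rule is shown below. The class name, the `patience` and `min_delta` parameters, and the sample scores are assumptions; the description above only states that training stops when val_pearson drops.

```python
class EarlyStopping:
    """Signal a stop once the monitored score fails to improve for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_pearson):
        """Record one epoch's score and return True when training should stop."""
        if val_pearson > self.best + self.min_delta:
            self.best = val_pearson
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation scores: improvement stalls after the second epoch.
history = [0.80, 0.85, 0.84, 0.83, 0.90]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, score in enumerate(history):
    if stopper.step(score):
        stopped_at = epoch
        break
```

In this run training stops before the last epoch, which would have improved the score; a larger `patience` trades extra training time for tolerance of such temporary dips.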

Adding a token

  • Names in the data are replaced with a special placeholder token for anonymization. So that the model recognizes this placeholder, the token was explicitly added to the tokenizer.

Setting the tokenizer max_length

  • Explicitly set the tokenizer max_length to match the data length, which reduces memory use during training and allows a larger batch size when training large models.

Multi-task Learning

  • To make use of the binary-label during training, the model's output [CLS] embedding was fed into a regression head that predicts the label and, at the same time, into a classification head that predicts the binary-label. A loss was computed for each of the two outputs, and their sum was used as the final loss to train the whole model. As a result, the gain in Pearson score was marginal, but training converged faster.

  • Expecting that lower layers carry word-level detail that helps similarity measurement, we averaged each layer's sequence embeddings and combined them as a weighted sum using a CNN, then used the result as input to the head. Fine-tuning with the lower layers reduced train and val loss and gave a small score gain. Because the BERT encoder already encodes detailed information, the gain was not large, but using lower-layer information did raise the score slightly.
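The summed loss from the first bullet can be written out in plain Python to make the arithmetic concrete. This is a sketch under assumptions: MSE for the regression head and binary cross-entropy on logits for the classification head are plausible choices rather than confirmed ones, and `cls_weight` is a hypothetical knob whose default of 1.0 gives the plain sum.

```python
import math


def mse_loss(preds, targets):
    """Mean squared error for the regression head."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def bce_with_logits_loss(logits, targets):
    """Binary cross-entropy on raw logits, in a numerically stable form."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def multitask_loss(reg_preds, labels, cls_logits, binary_labels, cls_weight=1.0):
    """Regression loss plus (optionally weighted) classification loss."""
    reg = mse_loss(reg_preds, labels)
    cls = bce_with_logits_loss(cls_logits, binary_labels)
    return reg + cls_weight * cls


loss = multitask_loss(
    reg_preds=[1.0, 3.0],
    labels=[1.0, 1.0],
    cls_logits=[0.0, 0.0],
    binary_labels=[1, 0],
)
```

Since both heads read the same [CLS] embedding, gradients from both terms flow into the shared encoder.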

💥 Experiments and Results

Per-model loss graph and test Pearson

  • The training loss logged with WandB is as follows.

  • The Pearson scores per model are as follows. The top two models use multi-task learning, so their classification accuracy on the binary-label is also available.

Test set graph per model on a subset of the data

  • The models mostly predict similar values; for the examples where their predictions diverge, ensembling is expected to improve the result.

Model performance and selection

  • For each model, the output from its best-performing run was used in the ensemble. For roberta, which we judged to perform best based on submission scores, two versions trained with different losses were used.

    | Model name | Batch_size | Learning Rate | Epoch | Data version | Test Score | Model | Loss |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | team-lucid/deberta-v3-xlarge-korean_2 | 16 | 2e-5 | 5 | v3.1 | 0.96516 | deberta | MSE |
    | beomi/KcELECTRA-base-v2022_1 | 128 | 5e-5 | 30 | v3.3 | 0.9348 | kr-electra | MSE |
    | team-lucid/deberta-v3-base-korean_1 | 64 | 2e-5 | 20 | v1 | 0.9309 | deberta | MSE |
    | Roberta-large_11 | 16 | 5e-6 | 20 | v1 | 0.9291 | roberta | L1Loss |
    | Roberta-large_12 | 64 | 5e-6 | 20 | v1 | 0.9286 | roberta | MSE |
    | snunlp-KR-ELECTRA-discriminator_w. Multi-task_9 | 32 | 2e-5 | 20 | v2.1 | 0.92785 | kr-electra | MSE |
  • For the ensemble, we tried two approaches: averaging all output.csv files, which is easy to implement, and stacking, which uses the output.csv files as input data. The final results are as follows; both approaches received the same evaluation score.

    • The score on the partial evaluation data was 0.9284, lower than the first-place score of 0.94, but the final score appears to have risen because we neither overfit to the interim score nor ensembled excessively.
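The averaging variant of the ensemble can be sketched as below. The function name and the in-memory list format are assumptions (the project averaged output.csv files), and clipping to the 0 to 5 label range is an added assumption, not something the description states.

```python
def average_ensemble(prediction_sets, low=0.0, high=5.0):
    """Average per-example predictions across models, clipped to [low, high]."""
    if not prediction_sets:
        raise ValueError("need at least one set of predictions")
    n = len(prediction_sets[0])
    if any(len(preds) != n for preds in prediction_sets):
        raise ValueError("all prediction sets must have the same length")
    # zip(*...) groups the predictions that belong to the same example.
    averaged = [sum(column) / len(prediction_sets) for column in zip(*prediction_sets)]
    return [min(max(value, low), high) for value in averaged]


# Made-up predictions from two models on three examples.
model_a = [1.0, 5.2, 2.5]
model_b = [3.0, 5.4, 2.5]
ensembled = average_ensemble([model_a, model_b])
```

Stacking would instead train a second-stage model on these per-model predictions, which needs held-out labels and is not shown here.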

Results

  • Leaderboard [interim ranking]

  • Leaderboard [final ranking]

About

level1-semantictextsimilarity-nlp-05 created by GitHub Classroom
