U.S. Patent Phrase to Phrase Matching
The dataset has now been released as a public, general-purpose benchmark dataset! You can find the related paper here
Ranked 132nd out of 1889 teams (bronze medal)
Adversarial training lets us train networks with significantly improved resistance to adversarial attacks, thus improving the robustness of neural networks.
FGM (Fast Gradient Method) implementation in PyTorch:
```python
import torch


class FGM():
    """Fast Gradient Method: adds an adversarial perturbation to the embedding weights."""

    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='word_embeddings'):
        # perturb the embedding matrix along the normalized gradient direction
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='word_embeddings'):
        # restore the original embedding weights after the adversarial step
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}
```
Adding FGM to the training loop:
```python
fgm = FGM(model)
for batch_input, batch_label in data:
    # normal forward/backward pass to accumulate the clean gradients
    loss = model(batch_input, batch_label)
    loss.backward()
    # adversarial training: perturb the embeddings, backprop the adversarial
    # loss on top of the clean gradients, then restore the embeddings
    fgm.attack()
    loss_adv = model(batch_input, batch_label)
    loss_adv.backward()
    fgm.restore()
    optimizer.step()
    model.zero_grad()
```
AWP (Adversarial Weight Perturbation) was also used for training the DeBERTa-v3-large models.
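The writeup does not include the AWP code; below is a minimal sketch in the same style as the FGM class above. The `adv_param`, `adv_lr`, and `adv_eps` hyperparameters and the eps-ball projection are illustrative assumptions, not the exact implementation used here.
```python
import torch


class AWP:
    """Adversarial Weight Perturbation (sketch): perturb the weights, not just embeddings."""

    def __init__(self, model, adv_param='weight', adv_lr=1.0, adv_eps=0.01):
        self.model = model
        self.adv_param = adv_param   # only parameters whose name contains this are perturbed
        self.adv_lr = adv_lr
        self.adv_eps = adv_eps
        self.backup = {}

    def attack(self):
        # step each weight along its gradient (scaled relative to the weight norm),
        # then clamp the result to an eps-ball around the original weights
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None and self.adv_param in name:
                self.backup[name] = param.data.clone()
                grad_norm = torch.norm(param.grad)
                weight_norm = torch.norm(param.data.detach())
                if grad_norm != 0 and not torch.isnan(grad_norm):
                    r_at = self.adv_lr * param.grad / (grad_norm + 1e-6) * (weight_norm + 1e-6)
                    param.data.add_(r_at)
                    eps = self.adv_eps * (weight_norm + 1e-6)
                    lower, upper = self.backup[name] - eps, self.backup[name] + eps
                    param.data = torch.min(torch.max(param.data, lower), upper)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}
```
It is used like FGM: call `attack()` after the clean backward pass, run an extra forward/backward on the perturbed weights, then `restore()` before `optimizer.step()`.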
Exponential Moving Average (EMA) is similar to a Simple Moving Average (SMA): both measure trend direction over a period of time. However, whereas an SMA simply averages the data, an EMA applies more weight to the most recent data.
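In this setting, the EMA is applied to the model weights rather than to prices: a shadow copy of the parameters is updated as an exponentially decayed average and swapped in for validation/inference. A minimal sketch (the decay value and method names are illustrative):
```python
class EMA:
    """Keeps an exponential moving average of the model weights."""

    def __init__(self, model, decay=0.999):
        self.model = model
        self.decay = decay
        self.shadow = {}   # EMA weights
        self.backup = {}   # raw weights saved while the EMA weights are swapped in

    def register(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        # call after each optimizer.step(): shadow = decay * shadow + (1 - decay) * weight
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = (self.decay * self.shadow[name]
                                     + (1.0 - self.decay) * param.data)

    def apply_shadow(self):
        # swap the EMA weights in for validation / inference
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.data.clone()
                param.data = self.shadow[name].clone()

    def restore(self):
        # swap the raw training weights back
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                param.data = self.backup[name]
        self.backup = {}
```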
- microsoft/deberta-v3-large
- anferico/bert-for-patents
- ahotrod/electra_large_discriminator_squad2_512
- Yanhao/simcse-bert-for-patent
- funnel-transformer/large
Most of the magic from high scorers came from enriching the input by building additional text out of the existing features.
For example, group the targets per "anchor + context" and attach them to the end of each input:
```python
from tqdm import tqdm  # `train` (a pandas DataFrame) and `tokenizer` are defined elsewhere

# collect every target that shares the same "context + anchor" key
train['group'] = train['context'] + " " + train['anchor']
allres = {}
for text in tqdm(train["group"].unique()):
    tmpdf = train[train["group"] == text].reset_index(drop=True)
    allres[text] = ",".join(tmpdf["target"])
train["target_gp"] = train["group"].map(allres)

# final input: anchor [SEP] target [SEP] title [SEP] grouped targets
train["input"] = (train.anchor + " " + tokenizer.sep_token + " " + train.target
                  + " " + tokenizer.sep_token + " " + train.title
                  + " " + tokenizer.sep_token + " " + train.target_gp)
```
Other grouping variants:
- Group the targets from the same anchor, such as 'target1, target2, target3, …', then add them to the context (a sketch follows this list).
- Group the targets from the same anchor and context. This brings in more relevant targets.
- Group the targets from the same anchor, group the anchors from the same context, and add them to the context in turn.
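A sketch of the first variant, reusing the column names from the snippet above (the `context_ext` column name is hypothetical):
```python
# join all targets sharing the same anchor into one string: 'target1, target2, ...'
anchor_targets = train.groupby("anchor")["target"].apply(", ".join)
# append the grouped targets to the context text (the patent section title)
train["context_ext"] = (train.title + " " + tokenizer.sep_token + " "
                        + train["anchor"].map(anchor_targets))
```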
It is well known that training a neural network with a suitable loss function can enhance the performance of the model.
One of the hottest debates in this competition was whether to treat this as a classification problem (classifying the score) or a regression problem (regressing the similarity between the target and anchor texts).
For the classification formulation, people used BCE loss; for the regression formulation, MSE loss.
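Both formulations fit the data, since the labels take values in {0, 0.25, 0.5, 0.75, 1.0}. A toy illustration of the two losses (the tensor values are made up):
```python
import torch
import torch.nn as nn

logits = torch.randn(8)                                    # stand-in for raw model outputs
scores = torch.tensor([0, .25, .5, .5, .75, 1, 0, .25])    # similarity labels in [0, 1]

bce = nn.BCEWithLogitsLoss()(logits, scores)               # classification view: score as soft label
mse = nn.MSELoss()(torch.sigmoid(logits), scores)          # regression view: predict the score directly
```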
Using the Pearson correlation coefficient as a loss function was adopted by many competitors, since the USPPPM competition used the Pearson correlation as its scoring metric.
This Kaggle discussion contains a good explanation of using PCC as a loss function.
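A sketch of such a loss, minimizing 1 - r over a batch (this is one common formulation, not necessarily the exact one from the linked discussion):
```python
import torch

def pearson_loss(preds, targets, eps=1e-8):
    """1 - Pearson correlation between predictions and targets."""
    preds = preds - preds.mean()
    targets = targets - targets.mean()
    r = (preds * targets).sum() / (preds.norm() * targets.norm() + eps)
    return 1.0 - r
```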
Most participants in this competition tried both the DeBERTa-v3-large and BERT-for-patents models.
Training notebook for anferico/bert-for-patents
Training notebook for anferico/bert-for-patents with masked language model
Training notebook for RoBERTa large
Training notebook for CoCoLM large
Training notebook for albert xxl v2
Using smoothed focal loss rather than BCE loss did not work.
Training the Hugging Face BART large model did not work with my code.
Training the DistilBERT model did not work for ensembling. I also tried the XLM-RoBERTa model with the same code, but it did not work either.
Pretraining with Masked Language Model
Pretrained on this dataset
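A minimal sketch of MLM pretraining with the Hugging Face `Trainer` (the model name, hyperparameters, and placeholder corpus are illustrative; the actual notebook may differ):
```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "anferico/bert-for-patents"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# placeholder corpus; the real run used text from the competition/benchmark dataset
texts = ["a phrase from the patent data", "another phrase"]
encodings = tokenizer(texts, truncation=True, max_length=128)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# the collator randomly masks 15% of tokens and builds the MLM labels
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="mlm_pretrained", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=5e-5)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```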
Since the scoring metric used for this competition was the PCC (Pearson correlation coefficient), which scores predictions by how well they correlate with the labels rather than by their absolute values, post-processing tricks such as winsorizing, clipping, or discretizing will not work.
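As a quick illustration of why, the PCC is invariant under positive linear transformations of the predictions (a toy example, not competition data):
```python
import numpy as np
from scipy.stats import pearsonr

labels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
preds = np.array([0.1, 0.3, 0.45, 0.8, 0.9])

r_raw, _ = pearsonr(labels, preds)
r_scaled, _ = pearsonr(labels, 2.0 * preds + 3.0)   # positive linear transform
print(r_raw, r_scaled)                              # identical: rescaling cannot change the PCC
```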