Slow training speed for textcat pipeline #4633
Replies: 4 comments
Sparse categories are definitely a problem with the current training format, and I haven't tried training with this many categories myself. It's hard to guess what's going on from the code provided here. I would suggest profiling the training run to see where the time is actually being spent.
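For reference, a generic way to profile a Python training run is the standard library's `cProfile`. This is a minimal sketch, not from the original comment; `run_training` is a hypothetical stand-in for whatever function drives the update loop:

```python
import cProfile
import pstats

def run_training():
    ...  # hypothetical: the nlp.update loop being investigated

profiler = cProfile.Profile()
profiler.enable()
run_training()
profiler.disable()

# Print the 25 most expensive call sites by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```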
@adrianeboyd Hi! Thanks for your reply! I would need to investigate the default model's implementation further to see which options can be passed in. I will update the ticket later with the profiling results. Thanks for your suggestion!
Hi @adrianeboyd. I got some profiling results. My train script is really simple; the train loop looks roughly like the sketch below. Also, it seems like the process is still mostly using the CPU rather than the GPU. Thank you for your time!
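The original snippet was lost in the page formatting; below is a hypothetical reconstruction of the kind of loop described, based on the spaCy v2.x API (`minibatch`, `nlp.update`) and the batch size of 64 mentioned in the original post. `all_labels`, `corpus`, and `N_EPOCHS` are stand-ins, not names from the thread:

```python
import random

import spacy
from spacy.util import minibatch

# Returns True if a GPU was activated; helps confirm whether training
# is actually running on the V100 or silently falling back to CPU.
print("GPU active:", spacy.prefer_gpu())

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat)
for label in all_labels:  # hypothetical: the ~6000 category names
    textcat.add_label(label)

optimizer = nlp.begin_training()
train_data = list(corpus.train_examples())  # hypothetical streaming corpus

for epoch in range(N_EPOCHS):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=64):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
    print(epoch, losses["textcat"])
```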
That's an interesting analysis. I am not an expert on the thinc internals, so I think @honnibal might need to take a look to see if he knows what might be going on. |
Hi!
I am trying to train a textcat pipeline with over 6000 classes. The training data consists of around 300k documents. I tried to convert my training data to the correct `jsonl` format, but that would result in a file size of over 100G, and the initialization of `GoldCorpus` would take forever writing the message packs. Therefore I wrote the following `TextcatGoldCorpus` class:
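The class definition itself was lost from the page. As a rough illustration only, a streaming corpus along the lines described might look like the sketch below; the `.jsonl` field names `text` and `labels` are assumptions, not the original code:

```python
import json
from pathlib import Path

class TextcatGoldCorpus:
    """Streams (text, annotations) pairs straight from a .jsonl file,
    avoiding the huge up-front GoldCorpus/msgpack conversion."""

    def __init__(self, path, labels):
        self.path = Path(path)
        self.labels = list(labels)

    def train_examples(self):
        with self.path.open(encoding="utf8") as f:
            for line in f:
                record = json.loads(line)
                # Dense one-hot "cats" dict over all labels, in the
                # shape nlp.update expects for textcat annotations
                cats = {label: 0.0 for label in self.labels}
                for label in record["labels"]:
                    cats[label] = 1.0
                yield record["text"], {"cats": cats}
```

Note that even this sketch builds a dense ~6000-entry `cats` dict per example, which is one place a profile might show per-example overhead.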
Then I write a regular train loop that calls `nlp.update` with a batch size of 64. However, the training is very slow: on an Nvidia V100 GPU the average update speed is around 2-3 documents/second, which would take around two days to train one epoch for my task. I also notice that GPU training gains no significant speedup over CPU. I previously trained a convolutional model (with PyTorch) on the exact same task and each epoch took around 3 to 4 hours; I also fine-tuned a BERT Base model on a classification task and the entire training finished in around one day with 3 epochs.
I have almost no idea about the potential cause of this slowdown. Please give me some suggestions. Thanks!