Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use more parallelization in training, other speedups #79

Open
trifle opened this issue Jan 22, 2019 · 2 comments
Open

Use more parallelization in training, other speedups #79

trifle opened this issue Jan 22, 2019 · 2 comments

Comments

@trifle
Copy link

trifle commented Jan 22, 2019

Hi Matt, Dan,

thanks for this wonderful library.

While training some augmented models, I noticed that there are some steps in the process which could benefit a lot from parallelization.
There are also small corners where expanding the interface a bit would streamline processing in some cases.

I could submit one or several PRs, but want to ask whether you would be willing to have them.

I have a fork that I'm collecting changes in.

So far I'd propose to:

  • Use a multiprocessing.Pool for data pre-processing in data_processing.py:prepare_all_data(This gives a near-linear speedup, saving a couple of minutes on my 4-core).
  • Use n_jobs=-1 where possible (e.g. in GridSearchCV). Substantial speedup by default that grows with the number of grid parameters.
  • Let the Blockifier.blockify interface accept pre-parsed etree instances. I understand that this may be a niche use case, but I'm processing a lot of html where I need to parse the tree before extraction for some pre-processing. Skipping the duplicate parsing saves about 30% overall time:
    Parsing 1000 entries with pre-existing trees, three run average: 28.9 seconds (user time)
    Parsing 1000 entries from string, three run average: 37.7 seconds (user time)
    (Note that this includes the first parsing pass and some trivial overhead for loading the data. Since the time includes python startup and model loading, the 30% saving is a lower bound estimate).
  • Finally, I've looked at some profiling data and saw that casting the blocks (str_block_list_cast in blocks.pyx: https://github.com/dragnet-org/dragnet/blob/master/dragnet/blocks.pyx#L860) presents a non-trivial overhead. There may be some potential here: I tried simply skipping the entire casting step, only decoding the text to unicode for the regex to work. The extraction seemed to still work fine. So I'm not quite sure in which cases the blocks would be in an unknown (bytes, str) state.

Thanks, best,
Pascal

@trifle trifle changed the title Use more parallelization in training Use more parallelization in training, other speedups Jan 22, 2019
@matt-peters
Copy link
Collaborator

Hi Pascal,

All of those changes sound great to me! Let's split them up into several PRs, it will be much easier to review and merge if it is broken up. I'd suggest breaking it up into 4 PRs following your bullet points.

For the n_jobs=-1: it's fine to train with this, but I found large performance slowdown with using it for prediction as sklearn was creating and destroying a thread pool for test instance. This was a few versions of sklearn ago, I don't know if it has since been fixed. So I had a hack to always run n_jobs=1 for prediction unless otherwise explicitly overridden.

@trifle
Copy link
Author

trifle commented Jan 26, 2019

Hi @matt-peters, sounds like a good plan.

It's a good idea to actually benchmark each of the changes separately.
I guess I'll find the time to split the PRs and document them the week after the next.

Best, Pascal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants