Multiprocessing is broken (at least on MacOS arm64) #14
`--parallel` slows down processing extremely, e.g. `--parallel 8` drops throughput from 10k tokens/s to about 1.5k tokens/s. It isn't clear whether this is a problem with the `multiprocessing` functionality itself or with something the SoMeWeTa engine does, but there appears to be massive synchronisation overhead.

MacOS 12.5.1 arm64 (M1)
Anaconda Python v3.9.12 with current SoMeWeTa from PyPI

Comments
Thank you for reporting this! I’ll just summarize our off-GitHub discussion here for future reference. The root of the problem is how worker processes are created on different platforms. On Linux, the default method is 'fork' which is very fast and is able to reuse the memory of the parent process. On MacOS (starting with Python 3.8) and Windows, the default method is 'spawn', which requires all objects loaded in the parent process to be pickled and to be unpickled in the worker processes (duplicating the amount of memory used). For the larger model files, e.g. the German web and social media model, this can take a substantial amount of time. The solution in the commit above is to prefer the 'fork' method over the default method, if it is supported by the operating system. This includes Linux and MacOS, but not Windows. Note that according to this issue, the 'fork' method might lead to crashes of the worker processes on MacOS (which is why the 'spawn' method is the default now). If this turns out to be the case for SoMeWeTa, please comment again or open a new issue.
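For readers who want to see the gist of the workaround: preferring 'fork' where the platform supports it can be done with the standard `multiprocessing` API. The snippet below is an illustrative sketch with a stand-in worker function, not necessarily the exact code from the commit referenced above:

```python
import multiprocessing

def tag_sentence(sentence):
    """Stand-in for the real per-sentence tagging work."""
    return [(token, "X") for token in sentence]

if __name__ == "__main__":
    # Prefer the copy-on-write "fork" start method where the OS supports it
    # (Linux, macOS); otherwise fall back to the platform default, e.g.
    # "spawn" on Windows.
    if "fork" in multiprocessing.get_all_start_methods():
        ctx = multiprocessing.get_context("fork")
    else:
        ctx = multiprocessing.get_context()

    sentences = [["This", "is", "an", "example", "."]] * 100
    with ctx.Pool(processes=4) as pool:
        tagged = pool.map(tag_sentence, sentences)
    print(len(tagged), "sentences tagged")
```

With 'fork', the worker processes inherit the already loaded model through copy-on-write memory, so it does not have to be pickled and unpickled at startup, which is where the large startup delay under 'spawn' comes from.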
Pickling/unpickling the model file creates a large startup delay, but that can't be the only issue. Processing is also extremely slow in the long run, i.e. pickling/unpickling a sentence would have to take much longer than actually tagging it. The workaround should be fine, though. AFAIK problems with fork are related to use of UI code and other MacOS frameworks and shouldn't affect standalone Python scripts.
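One rough way to sanity-check this claim is to time a pickle round-trip of a tokenized sentence against tagging the same sentence. The sketch below is assumption-laden: it presumes the `ASPTagger()`/`load()`/`tag_sentence()` interface described in the SoMeWeTa README and uses a placeholder model file name:

```python
import pickle
import timeit

from someweta import ASPTagger  # assumed API, cf. the SoMeWeTa README

tagger = ASPTagger()
tagger.load("german_web_social_media.model")  # placeholder model file name

sentence = ["Das", "ist", "ein", "Beispielsatz", "."]
n = 1000

# What a 'spawn'/queue-based setup pays per work item: a pickle round-trip ...
pickle_s = timeit.timeit(lambda: pickle.loads(pickle.dumps(sentence)), number=n)

# ... versus the actual tagging work for the same sentence.
tag_s = timeit.timeit(lambda: tagger.tag_sentence(sentence), number=n)

print(f"pickle round-trip: {pickle_s / n * 1e6:8.1f} µs/sentence")
print(f"tagging:           {tag_s / n * 1e6:8.1f} µs/sentence")
```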
I did a few timing experiments and this is what I found:
I suspect that the synchronization overhead is unavoidable if SoMeWeTa should also be able to do parallel tagging when reading from STDIN. However, if you have a large number of files that you want to tag in parallel, you could use the `somewe-tagger-multifile` script in the `utils` directory of this repository in combination with GNU parallel to achieve higher efficiency: `parallel -j 8 -X "somewe-tagger-multifile --tag <model> --output-prefix tagged/ --output-suffix ''" ::: tokenized/*.txt`
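If staying within Python is preferred over GNU parallel, file-level parallelism can be sketched roughly as below. Each worker loads its own copy of the model once and then tags whole files independently, so there is no per-sentence synchronisation. The `ASPTagger` calls, the one-sentence-per-line input format, and the file names are assumptions for illustration, not code from this repository:

```python
import glob
import os
from concurrent.futures import ProcessPoolExecutor

from someweta import ASPTagger  # assumed API, cf. the SoMeWeTa README

_tagger = None

def init_worker(model_path):
    """Load the model once per worker process."""
    global _tagger
    _tagger = ASPTagger()
    _tagger.load(model_path)

def tag_file(path):
    """Tag one file; assumes one tokenized sentence per line."""
    out_path = os.path.join("tagged", os.path.basename(path))
    with open(path, encoding="utf-8") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            tokens = line.split()
            if not tokens:
                outfile.write("\n")
                continue
            for word, tag in _tagger.tag_sentence(tokens):
                outfile.write(f"{word}\t{tag}\n")
            outfile.write("\n")
    return out_path

if __name__ == "__main__":
    os.makedirs("tagged", exist_ok=True)
    files = glob.glob("tokenized/*.txt")  # placeholder input location
    with ProcessPoolExecutor(max_workers=8, initializer=init_worker,
                             initargs=("german_web_social_media.model",)) as pool:
        for done in pool.map(tag_file, files):
            print("tagged", done)
```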