Good job on improving TensorFlow on Kubernetes to make developing large scale training systems easy. :-D

After reading some tutorials, we found that ElasticDL introduces a new PS architecture and distributed framework, and we would like the ElasticDL team to clarify a few more design considerations.

A large scale recommendation system requires several features from the training system:

- efficiently handle large scale embeddings under distributed training; this requires the parameter servers and the DL framework to support sparse SGD updates. (What are ElasticDL's features for large scale embedding training?)
- stay compatible with the DL framework API for large scale embeddings, so that most models in the model zoo work well.

How about ElasticDL?
ElasticDL can handle very large models using its general-purpose parameter server written in Go, which is based on the design we presented at Google Developer Day 2019, but with many performance improvements.
@QiJune I think @backyes 's question is a very inspiring hint -- we should add a benchmark showing the capability of ElasticDL in supporting large models.
ElasticDL supports large embedding tables as well as sparse SGD updates.

An embedding table is sharded across several PS instances. In the forward pass, workers pull embedding vectors from the PS. In the backward pass, workers push embedding gradients (as the IndexedSlices data structure) to the PS, which then applies the sparse gradients to its shard of the embedding table.
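To make the pull/push flow above concrete, here is a minimal numpy sketch of a sharded embedding table with sparse SGD updates. All names (`PSShard`, `pull`, `push_sparse`, `shard_of`) are hypothetical and only illustrate the idea; ElasticDL's real PS is implemented in Go, and the `(indices, values)` pair stands in for TensorFlow's `IndexedSlices`.

```python
import numpy as np

class PSShard:
    """One parameter-server shard holding a slice of the embedding table.
    (Hypothetical sketch, not ElasticDL's actual Go implementation.)"""
    def __init__(self, dim, lr=0.1):
        self.table = {}   # item id -> embedding vector
        self.dim = dim
        self.lr = lr

    def pull(self, ids):
        # Forward pass: lazily initialize unseen ids, then return their vectors.
        return np.stack([self.table.setdefault(i, np.zeros(self.dim))
                         for i in ids])

    def push_sparse(self, indices, values):
        # Backward pass: sparse SGD -- only the rows named in `indices`
        # are touched, never the whole (potentially huge) table.
        for i, g in zip(indices, values):
            self.table[i] -= self.lr * g

def shard_of(item_id, num_shards):
    # Hash-partition ids across PS instances (one common sharding scheme).
    return item_id % num_shards

# Worker side: route a batch of ids to the right shards.
shards = [PSShard(dim=4) for _ in range(2)]
batch_ids = [0, 3, 5]
vecs = {i: shards[shard_of(i, 2)].pull([i])[0] for i in batch_ids}

# Pretend the backward pass produced a gradient of ones for each looked-up row,
# then push each (indices, values) pair back to the owning shard.
grads = {i: np.ones(4) for i in batch_ids}
for i, g in grads.items():
    shards[shard_of(i, 2)].push_sparse([i], [g])
```

After the push, only the three looked-up rows have moved (here from zeros to `-lr * grad`); rows never referenced in the batch are untouched, which is what makes sparse updates cheap for huge embedding tables.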