Good job on improving TensorFlow on Kubernetes to make developing large scale training systems easy. :-D

After reading some tutorials, we found that ElasticDL introduces a new PS architecture and distributed framework, and we would like the ElasticDL team to clarify a few more design considerations.

A large scale recommendation system requires several features from the training system:

- efficiently handle large scale embeddings under distributed training; this requires the parameter servers and the DL framework to support sparse SGD updates. (What are ElasticDL's features for large scale embedding training?)
- stay compatible with the DL framework API for large scale embeddings, so that most models in the model zoo work well.

How about ElasticDL?
ElasticDL can handle very large models using its general-purpose parameter server written in Go, which is based on the design we presented at Google Developer Day 2019, but with many performance improvements.
@QiJune I think @backyes 's question is a very inspiring hint -- we should add a benchmark showing the capability of ElasticDL in supporting large models.
ElasticDL supports large embedding tables as well as sparse SGD updates.

An embedding table is sharded across several PS instances. In the forward pass, workers pull embedding vectors from the PS. In the backward pass, workers push embedding gradients (as the IndexedSlices data structure) to the PS, which then applies the sparse gradients to its shard of the embedding table.
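To make the pull/push flow above concrete, here is a minimal numpy sketch of a sharded embedding table with sparse SGD updates. All names (`PSShard`, `pull`, `push_sparse`, `shard_of`) are hypothetical and only illustrate the idea; ElasticDL's real PS is implemented in Go, and the `(indices, values)` pair stands in for TensorFlow's `IndexedSlices`.

```python
import numpy as np

class PSShard:
    """One parameter-server shard holding a slice of the embedding table.
    (Hypothetical sketch, not ElasticDL's actual Go implementation.)"""
    def __init__(self, dim, lr=0.1):
        self.table = {}   # item id -> embedding vector
        self.dim = dim
        self.lr = lr

    def pull(self, ids):
        # Forward pass: lazily initialize unseen ids, then return their vectors.
        return np.stack([self.table.setdefault(i, np.zeros(self.dim))
                         for i in ids])

    def push_sparse(self, indices, values):
        # Backward pass: sparse SGD -- only the rows named in `indices`
        # are touched, never the whole (potentially huge) table.
        for i, g in zip(indices, values):
            self.table[i] -= self.lr * g

def shard_of(item_id, num_shards):
    # Hash-partition ids across PS instances (one common sharding scheme).
    return item_id % num_shards

# Worker side: route a batch of ids to the right shards.
shards = [PSShard(dim=4) for _ in range(2)]
batch_ids = [0, 3, 5]
vecs = {i: shards[shard_of(i, 2)].pull([i])[0] for i in batch_ids}

# Pretend the backward pass produced a gradient of ones for each looked-up row,
# then push each (indices, values) pair back to the owning shard.
grads = {i: np.ones(4) for i in batch_ids}
for i, g in grads.items():
    shards[shard_of(i, 2)].push_sparse([i], [g])
```

After the push, only the three looked-up rows have moved (here from zeros to `-lr * grad`); rows never referenced in the batch are untouched, which is what makes sparse updates cheap for huge embedding tables.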