add argument for wikidump path + use IndexedRowMatrix to track docIds #108
About path argument: I found it easier to be able to pass the path to the wikidump file as an argument instead of recompiling every time I want to use another dump.
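For illustration, reading the dump path from the command line could look roughly like the sketch below. The object name `DumpPathArg`, the helper `dumpPath`, the argument position and the fallback path are all assumptions made for this example, not the code in this PR.

```scala
object DumpPathArg {
  // Hypothetical fallback, used only when no argument is given.
  val DefaultDumpPath = "wikidump.xml"

  /** Returns the first non-empty command-line argument, or the fallback path. */
  def dumpPath(args: Array[String]): String =
    args.headOption.filter(_.nonEmpty).getOrElse(DefaultDumpPath)

  def main(args: Array[String]): Unit = {
    val path = dumpPath(args)
    println(s"Reading Wikipedia dump from: $path")
  }
}
```

With something like this, switching to another dump only means changing the argument passed at launch, with no recompilation.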
About docIds: In chapter 6, it says
Another way to keep track of docIds is to use `IndexedRowMatrix` instead of `RowMatrix`. This way, document ids are embedded in the SVD model and no longer depend on the partitioning. This technique has many advantages, one of which is that it is now possible to save the SVD model for later use.
To generate doc ids, I still use `zipWithUniqueId`, which is available for RDDs only. A better way would be to use the SQL function `monotonically_increasing_id`, but this generates huge ids (about 10 digits long), which are harder to read. Hence the `addNiceRowId` method.
(By the way, loved your book, nice work!)
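To make the docId idea concrete, here is a minimal sketch using the MLlib `IndexedRowMatrix` API. It is not the code from this PR: the object name `IndexedSvdSketch`, the toy vectors and `k = 2` are assumptions chosen for the example, and it assumes a local Spark installation.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.SparkSession

object IndexedSvdSketch {
  /** Runs a small SVD and returns (docId, concept vector) pairs taken from U. */
  def docConcepts(spark: SparkSession): Array[(Long, Vector)] = {
    val sc = spark.sparkContext

    // Made-up vectors standing in for the TF-IDF document vectors.
    val docVectors = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 1.0),
      Vectors.dense(4.0, 1.0, 0.0),
      Vectors.dense(2.0, 2.0, 2.0)
    ), numSlices = 2)

    // zipWithUniqueId pairs each element with a distinct Long.
    // The ids are unique but not guaranteed to be consecutive.
    val indexedRows = docVectors.zipWithUniqueId().map {
      case (vec, docId) => IndexedRow(docId, vec)
    }

    val matrix = new IndexedRowMatrix(indexedRows)
    val svd = matrix.computeSVD(2, computeU = true)

    // U is itself an IndexedRowMatrix, so every row still carries its docId.
    svd.U.rows.map(row => (row.index, row.vector)).collect().sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IndexedSvdSketch")
      .master("local[2]")
      .getOrCreate()
    try {
      docConcepts(spark).foreach { case (docId, concepts) =>
        println(s"docId=$docId concepts=$concepts")
      }
    } finally {
      spark.stop()
    }
  }
}
```

Because the ids travel with the rows of `U`, a lookup from document to concept vector no longer relies on the order in which partitions are processed. On the id size: `monotonically_increasing_id` places the partition id in the upper bits of the 64-bit value, which is why ids outside the first partition become very large.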