
add argument for wikidump path + use IndexedRowMatrix to track docIds #108

Open

derlin wants to merge 1 commit into master
Conversation

derlin commented Jun 10, 2017

About the path argument: I found it easier to pass the path to the wikidump file as an argument instead of recompiling every time I want to use another dump.
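
For illustration, a minimal sketch of what the argument handling could look like (the object name and default path here are placeholders, not this PR's actual diff):

    object RunLSA {
      def main(args: Array[String]): Unit = {
        // Take the dump location from the command line; fall back to a
        // default so existing invocations keep working without an argument.
        val wikidumpPath = if (args.nonEmpty) args(0) else "hdfs:///user/ds/wikidump.xml"
        println(s"Loading wiki dump from $wikidumpPath")
      }
    }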

About docIds: In chapter 6, it says

"creating a mapping of row IDs to document titles is a little more difficult. To achieve it, we can use the zipWithUniqueId function ..."

Another way to keep track of docIds is to use IndexedRowMatrix instead of RowMatrix. This way, document ids are embedded in the SVD model and no longer depend on the partitioning. This technique has several advantages, one of which is that it is now possible to save the SVD model for later use.
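
A rough sketch of the idea, assuming vecRdd is an RDD[Vector] of document term vectors and k is the number of singular values to keep (both names are placeholders):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // Attach each document's id to its row vector before building the matrix,
    // so the id travels with the row through the SVD computation.
    val indexedRows = vecRdd.zipWithUniqueId.map {
      case (vec, id) => IndexedRow(id, vec)
    }
    val mat = new IndexedRowMatrix(indexedRows)
    val svd = mat.computeSVD(k, computeU = true)
    // svd.U is itself an IndexedRowMatrix: every row of U still carries the
    // id of the document it corresponds to, regardless of partitioning.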

To generate doc ids, I still use the zipWithUniqueId function, which is available for RDDs only. A better way would be to use the SQL function monotonically_increasing_id:

    import org.apache.spark.sql.functions._
    docTermMatrix.withColumn("id", monotonically_increasing_id)

but this generates huge ids (about 10 digits long), which are harder to read. Hence the addNiceRowId method.
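
The addNiceRowId method itself isn't shown here; as a hypothetical sketch, such a helper could append a compact id column by going through the RDD API:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField}

    // Hypothetical sketch of an addNiceRowId-style helper: it assigns compact
    // unique ids via zipWithUniqueId rather than the sparse 10-digit values
    // that monotonically_increasing_id produces.
    def addNiceRowId(df: DataFrame, colName: String = "id"): DataFrame = {
      val rowsWithId = df.rdd.zipWithUniqueId.map {
        case (row, id) => Row.fromSeq(row.toSeq :+ id)
      }
      val schema = df.schema.add(StructField(colName, LongType, nullable = false))
      df.sparkSession.createDataFrame(rowsWithId, schema)
    }

docTermMatrix could then be passed through such a helper before the rows are turned into an IndexedRowMatrix.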

(By the way, I loved your book, nice work!)

srowen (Collaborator) commented Jun 10, 2017

This looks like a good suggestion. The book has just gone to press though, so I'm not sure we can add this for the 2nd edition. But it can stay here as a note and suggestion.
