
add argument for wikidump path + use IndexedRowMatrix to track docIds #108

Open

derlin wants to merge 1 commit into master
Conversation

derlin commented Jun 10, 2017

About the path argument: I found it easier to pass the path to the wikidump file as an argument instead of recompiling every time I want to use another dump.
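
For illustration, a minimal sketch of what the argument handling could look like (the object name and default path here are placeholders, not this PR's actual diff):

    object RunLSA {
      def main(args: Array[String]): Unit = {
        // Take the dump location from the command line; fall back to a
        // default so existing invocations keep working without an argument.
        val wikidumpPath = if (args.nonEmpty) args(0) else "hdfs:///user/ds/wikidump.xml"
        println(s"Loading wiki dump from $wikidumpPath")
      }
    }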

About docIds: In chapter 6, it says

"creating a mapping of row IDs to document titles is a little more difficult. To achieve it, we can use the zipWithUniqueId function ..."

Another way to keep track of docIds is to use IndexedRowMatrix instead of RowMatrix. This way, document ids are embedded in the SVD model and no longer depend on the partitioning. This technique has several advantages, one of which is that it is now possible to save the SVD model for later use.
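
A rough sketch of the idea, assuming vecRdd is an RDD[Vector] of document term vectors and k is the number of singular values to keep (both names are placeholders):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // Attach each document's id to its row vector before building the matrix,
    // so the id travels with the row through the SVD computation.
    val indexedRows = vecRdd.zipWithUniqueId.map {
      case (vec, id) => IndexedRow(id, vec)
    }
    val mat = new IndexedRowMatrix(indexedRows)
    val svd = mat.computeSVD(k, computeU = true)
    // svd.U is itself an IndexedRowMatrix: every row of U still carries the
    // id of the document it corresponds to, regardless of partitioning.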

To generate doc ids, I still use the zipWithUniqueId function, which is available for RDDs only. A better way would be to use the SQL function monotonically_increasing_id:

    import org.apache.spark.sql.functions._
    docTermMatrix.withColumn("id", monotonically_increasing_id)

but this generates huge ids (about 10 digits long), which are harder to read. Hence the addNiceRowId method.
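
The addNiceRowId method itself isn't shown here; as a hypothetical sketch, such a helper could append a compact id column by going through the RDD API:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField}

    // Hypothetical sketch of an addNiceRowId-style helper: it assigns compact
    // unique ids via zipWithUniqueId rather than the sparse 10-digit values
    // that monotonically_increasing_id produces.
    def addNiceRowId(df: DataFrame, colName: String = "id"): DataFrame = {
      val rowsWithId = df.rdd.zipWithUniqueId.map {
        case (row, id) => Row.fromSeq(row.toSeq :+ id)
      }
      val schema = df.schema.add(StructField(colName, LongType, nullable = false))
      df.sparkSession.createDataFrame(rowsWithId, schema)
    }

docTermMatrix could then be passed through such a helper before the rows are turned into an IndexedRowMatrix.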

(By the way, I loved your book, nice work!)

srowen (Collaborator) commented Jun 10, 2017

This looks like a good suggestion. The book has just gone to press though, so I'm not sure we can add this for the 2nd edition. But it can stay here as a note and suggestion.
