Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for NUTCH-2455 more efficient usage of hostdb in generate #254

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

okedoki
Copy link
Contributor

@okedoki okedoki commented Dec 8, 2017

Three questions/modification left open:

  1. In several places we use url.getHost() in the Nutch code, in other we use url.getHost().toLower(). Why?
  2. public static class ScoreHostKeyComparator extends WritableComparator should Implement Raw comparator. If you know how to do it you are welcome to do.
  3. The whole Generator file is to big, it should be spread to several files. Again, if you know how to fix it in a good way, you are welcome.

@okedoki
Copy link
Contributor Author

okedoki commented Dec 13, 2017

Please review only fix for NUTCH-2455 more efficient usage of hostdb in generate(c1ce018)

The "added id to output files" is not correct commit, I have reverted it.

Copy link
Member

@lewismc lewismc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @okedoki can you please format this entire patch according to the eclipse-codeformatter thank you

@okedoki
Copy link
Contributor Author

okedoki commented Dec 28, 2017

I found a bug with partitioned that prevents to get correct hostdb data to the correct reducer. It is fixed.
The second, I have applied the Eclipse auto-formatting as suggested by @lewismc .

For some reasons, I have a conflict with Generator from master. I assume it happened because of autoformating, so instead of correct comparison it shows that the whole code of Generator is replaced.

What is the rule for fixing in this case?

@lewismc
Copy link
Member

lewismc commented Dec 30, 2017

Mmmm OK @okedoki we need to resolve this conflict. The issue here is that you have indented everything by 4 spaces by the looks of it. This is incorrect as indenting accoridng to the code formatting template is 2 space indents. Please update the ppull request again if you could. Thanks

@okedoki
Copy link
Contributor Author

okedoki commented Jan 18, 2018

@lewismc
Finally, I managed to solve the merging conflict.
Please review before Generator will be modified again.

@lewismc
Copy link
Member

lewismc commented Jan 18, 2018

@okedoki thank you very much, this is a big patch and we need to test it out.

@okedoki
Copy link
Contributor Author

okedoki commented Jan 25, 2018

There was a silly bug that didnt copy hostdb correctly in reducer because of copy-by-reference.
My bad, fixed with clone.

hostDomainCounts.put(key.second.toString(), new MutablePair<HostDatum, int[]>((HostDatum) hostDatum.clone(), new int []{1,0})); at line 484

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants