Heritrix issue with an external location #702

Closed
liuqingli opened this issue Feb 12, 2017 · 3 comments

Comments

@liuqingli

liuqingli commented Feb 12, 2017

Hi, I am Liuqing from VT. I changed the default data location of your powerful tool from /sfm-data to /home/sfm-data. All services work fine except Heritrix, so I cannot crawl URLs at the moment. Only 1 web harvester and 1 Heritrix container are in use at present.
The following are some logs for the issue:

$ docker logs sfm_webharvester_1
...
heritrix:8443 not available after wait.
...

$ docker logs sfm_heritrix_1
Oracle Corporation OpenJDK Runtime Environment 1.7.0_95-b00
Using ad-hoc HTTPS certificate with fingerprint...
SHA1:EB:3A:4D:95:96:41:4A:41:59:91:1B:88:70:CB:48:F1:53:C1:D1:2A
Verify in browser before accepting exception.
java.lang.IllegalStateException: java.io.IOException: Failed to create directory: /sfm-data/containers/91e060938deb/jobs
at org.archive.crawler.framework.Engine.<init>(Engine.java:69)
at org.archive.crawler.Heritrix.instanceMain(Heritrix.java:335)
at org.archive.crawler.Heritrix.main(Heritrix.java:188)
Caused by: java.io.IOException: Failed to create directory: /sfm-data/containers/91e060938deb/jobs
at org.archive.util.FileUtils.ensureWriteableDirectory(FileUtils.java:677)
at org.archive.crawler.framework.Engine.<init>(Engine.java:67)
... 2 more
Heritrix version: 3.3.0-LBS-2016-02
Exception in thread "main" java.lang.IllegalStateException: java.io.IOException: Failed to create directory: /sfm-data/containers/91e060938deb/jobs
at org.archive.crawler.framework.Engine.<init>(Engine.java:69)
at org.archive.crawler.Heritrix.instanceMain(Heritrix.java:335)
at org.archive.crawler.Heritrix.main(Heritrix.java:188)
Caused by: java.io.IOException: Failed to create directory: /sfm-data/containers/91e060938deb/jobs
at org.archive.util.FileUtils.ensureWriteableDirectory(FileUtils.java:677)
at org.archive.crawler.framework.Engine.<init>(Engine.java:67)
... 2 more

Could anyone give me some advice on this issue? Thanks very much!
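For anyone reproducing this, the trace above shows Heritrix failing while creating a jobs directory under /sfm-data/containers/. A minimal sketch of the same kind of check follows, assuming Python is available wherever you run it. The function name `ensure_writable_directory` is hypothetical and this is not Heritrix's actual implementation; it only creates the directory if missing and then tries to write a throwaway file in it.

```python
import os
import tempfile


def ensure_writable_directory(path):
    """Create `path` if needed and confirm a file can be written inside it.

    Hypothetical helper for troubleshooting; not part of Heritrix or SFM.
    Returns (True, "") on success or (False, reason) on failure.
    """
    try:
        # Create the directory and any missing parents.
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        return False, f"failed to create directory: {path} ({exc})"
    try:
        # Writing a throwaway file is a more direct probe than os.access,
        # which can be misleading on some mounted volumes.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"directory not writable: {path} ({exc})"
    return True, ""
```

Calling it with the path from the trace, as the same user and in the same environment as the Heritrix process, should indicate whether the failure comes from the mounted data location or from something else.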

@justinlittman
Contributor

Heritrix is really finicky. I've added some troubleshooting guidance to the docs: http://sfm.readthedocs.io/en/latest/troubleshooting.html#web-harvesting-heritrix-problems. Let me know if that helps.

See you at AU, @liuqingli.

@liuqingli
Author

Hi Justin,
Thanks for your quick reply. I followed your guidance to remove the containers, but unfortunately the problem still remains.

@justinlittman justinlittman added this to the 1.6 milestone Feb 13, 2017
@justinlittman justinlittman self-assigned this Feb 13, 2017
@justinlittman
Contributor

@liuqingli and I worked offline to get Heritrix working, so I'm going to close this ticket. I'm going to add a note in #408, as Heritrix not scaling is problematic for him.
