Clarification of warcprox usage #1066

Open
SvenLieber opened this issue Apr 29, 2021 · 1 comment

Comments

@SvenLieber
Contributor

I'm trying to get my head around some details of the warcprox usage, but I can't find the information I need in the docs and have trouble understanding what happens in the code (even though the code is well structured and nicely documented).

My question is basically about these lines. Within the warced class a warcprox instance is started, but how is a harvester configured to actually use this proxy?

The only thing I have found so far is that warced sets the HTTP_PROXY and HTTPS_PROXY environment variables. There seems to be no other way a harvester could know about the proxy. But looking at https://github.com/gwu-libraries/sfm-twitter-harvester/, I don't see Twarc being configured to use a proxy.

My confusion stems from the fact that when I use warcprox outside SFM, I usually have to tell the client explicitly to use the proxy, e.g. curl -Lk --proxy <my-running-warcprox-ip-and-port> https://google.com.

@SvenLieber
Contributor Author

Never mind, I found the answer: it was simply a misunderstanding of the warcprox usage on my side.

HTTP_PROXY and HTTPS_PROXY are not SFM-specific environment variables but conventional ones that many HTTP clients honor, including those on the Debian system SFM uses. Setting them therefore causes the HTTP traffic of such clients to run through the proxy, including Twarc's requests.
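
To illustrate why no harvester-specific configuration is needed, here is a minimal sketch using only the Python standard library. It shows that proxy settings are discovered from the process environment. The address http://localhost:8000 and the helper name discover_proxies are assumptions for illustration, not anything taken from SFM. The Python requests library, which Twarc builds on, performs a similar environment lookup by default.

```python
import os
import urllib.request
from unittest import mock


def discover_proxies(proxy_url):
    """Return the proxies urllib detects when only the two variables are set.

    The environment is replaced temporarily so the demonstration neither
    depends on nor alters the caller's real settings.
    """
    env = {"HTTP_PROXY": proxy_url, "HTTPS_PROXY": proxy_url}
    with mock.patch.dict(os.environ, env, clear=True):
        # Scans the environment for *_proxy variables (case-insensitive)
        # and maps each URL scheme to its proxy URL.
        return urllib.request.getproxies_environment()


# Placeholder proxy address (assumption for illustration only).
proxies = discover_proxies("http://localhost:8000")
print(proxies)
```

Any client that consults the environment in this way picks up the proxy without being told about it in its own code.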

For my curl example above, the working solution looks like this:

HTTP_PROXY="<my-running-warcprox-ip-and-port>" HTTPS_PROXY="<my-running-warcprox-ip-and-port>" REQUESTS_CA_BUNDLE="path-to-warcprox-certificate" curl -L https://google.com

With REQUESTS_CA_BUNDLE pointing to the warcprox certificate (as SFM also does), the curl example no longer needs the insecure connection flag -k.
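
The same idea can be sketched in Python: build an environment containing the three variables and hand it to a child process, which inherits them. This is only an illustration of the pattern, not SFM's actual code. The helper name proxied_env, the proxy address and the certificate path are placeholders. REQUESTS_CA_BUNDLE is a variable read by the Python requests library, so this sketch only assumes that the child is a requests-based client such as Twarc.

```python
import os
import subprocess
import sys


def proxied_env(proxy_url, ca_bundle_path, base_env=None):
    """Return a copy of the environment with proxy and CA settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["HTTP_PROXY"] = proxy_url
    env["HTTPS_PROXY"] = proxy_url
    # Read by the requests library to locate the CA certificate to trust.
    env["REQUESTS_CA_BUNDLE"] = ca_bundle_path
    return env


# Placeholder values (assumptions for illustration only).
env = proxied_env("http://localhost:8000", "/path/to/warcprox-ca.pem")

# The child process sees the variables without any extra arguments.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['HTTPS_PROXY'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

Copying the parent environment rather than starting from an empty one keeps variables such as PATH available to the child.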
