Clarification of warcprox usage #1066

SvenLieber · 2021-04-29T17:52:22Z

I'm trying to get my head around some details of how warcprox is used, but I can't find the information I need in the docs, and I have trouble following what happens in the code (even though the code is well structured and nicely documented).

It's basically around these lines. Within the warced class a warcprox instance is started, but how is a harvester configured to actually use this proxy?

The only thing I could see so far is that warced sets the HTTP_PROXY and HTTPS_PROXY environment variables. There seems to be no other way a harvester could know about the proxy. But when looking at https://github.com/gwu-libraries/sfm-twitter-harvester/, I don't see Twarc being configured to use a proxy.

My confusion stems from the fact that when I use warcprox outside SFM, I usually have to tell the crawl explicitly to use the proxy, e.g. curl -Lk --proxy <my-running-warcprox-ip-and-port> https://google.com.

SvenLieber · 2021-05-06T09:01:06Z

Never mind, I found the answer: I had simply misunderstood how warcprox is used.

HTTP_PROXY and HTTPS_PROXY are not SFM-specific environment variables but standard proxy environment variables that HTTP clients on the Debian-based system honor. Setting them therefore routes the HTTP traffic of those clients through the proxy, including Twarc's requests.
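To illustrate how a client can pick up the proxy without any explicit configuration, here is a minimal sketch using only the Python standard library. It is not SFM or Twarc code; the helper name proxies_seen_by_client and the address http://localhost:8000 are assumptions made up for the example. It only shows that the proxy map is derived from the environment.

```python
import os
import urllib.request


def proxies_seen_by_client(env):
    """Return the proxy map the standard library derives from `env`.

    The real environment is swapped out temporarily and restored afterwards,
    so the result depends only on the variables passed in.
    """
    saved = dict(os.environ)
    try:
        os.environ.clear()
        os.environ.update(env)
        # Collects every *_proxy variable, matched case-insensitively.
        return urllib.request.getproxies_environment()
    finally:
        os.environ.clear()
        os.environ.update(saved)


# Hypothetical address of a running warcprox instance.
WARCPROX = "http://localhost:8000"

proxies = proxies_seen_by_client(
    {"HTTP_PROXY": WARCPROX, "HTTPS_PROXY": WARCPROX}
)
print(proxies)
```

With both variables set, the map contains an entry for the http and the https scheme, which is why no proxy option has to appear in the harvester's own code.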

For the harvesting example I gave above, the working solution looks like this:

HTTP_PROXY="<my-running-warcprox-ip-and-port>" HTTPS_PROXY="<my-running-warcprox-ip-and-port>" REQUESTS_CA_BUNDLE="path-to-warcprox-certificate" curl -L https://google.com

With REQUESTS_CA_BUNDLE pointing at the warcprox certificate (as SFM also does), the curl example no longer needs the insecure connection flag -k.
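The same per-command pattern can be sketched in Python: a wrapper copies its environment, adds the three variables, and starts a child process that inherits them. This is only an illustration under assumptions, not how warced is actually implemented; run_behind_proxy, the proxy address, and the certificate path are hypothetical. REQUESTS_CA_BUNDLE is the variable read by the Python requests library.

```python
import os
import subprocess
import sys


def run_behind_proxy(cmd, proxy, ca_bundle):
    """Run `cmd` with proxy and CA bundle variables added to its environment."""
    env = dict(os.environ)               # start from the current environment
    env["HTTP_PROXY"] = proxy
    env["HTTPS_PROXY"] = proxy
    env["REQUESTS_CA_BUNDLE"] = ca_bundle
    return subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )


# Hypothetical values; replace with the real warcprox address and certificate.
PROXY = "http://localhost:8000"
CA_BUNDLE = "/tmp/warcprox-ca.pem"

# The child only reports what it inherited; a real harvester would make
# HTTP requests here instead.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['HTTPS_PROXY']); "
    "print(os.environ['REQUESTS_CA_BUNDLE'])",
]
result = run_behind_proxy(child, PROXY, CA_BUNDLE)
seen = result.stdout.splitlines()
print(seen)
```

The child process never receives a proxy argument; it sees the settings only because it inherits the environment of the process that launched it.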

lwrubel added the documentation label Sep 30, 2021
