I'm trying to get my head around some details of how warcprox is used, but I can't find the information I need in the docs, and I'm having trouble understanding what happens in the code (even though the code is well structured and nicely documented).
It's basically around these lines. Within the warced class a warcprox is started, but how is a harvester configured to actually use this proxy?
The only thing I could see so far is that within warced the HTTP_PROXY and HTTPS_PROXY environment variables are set. There seems to be no other way a harvester could know about the proxy. But when investigating https://github.com/gwu-libraries/sfm-twitter-harvester/, I don't see that Twarc uses a proxy.
My confusion stems from the fact that when I use warcprox outside SFM, I usually have to tell the crawl explicitly to use the proxy, e.g. curl -Lk --proxy <my-running-warcprox-ip-and-port> https://google.com
Never mind, I found the issue: it was simply a misunderstanding of warcprox usage on my side.
HTTP_PROXY and HTTPS_PROXY are not SFM-specific settings. They are conventional environment variables that many HTTP clients read on their own, including the Python requests library that Twarc uses. Setting them in the harvester's environment therefore routes the HTTP traffic of every client that honors them through the proxy, including Twarc's requests.
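To make the mechanism concrete, here is a minimal sketch using only the Python standard library. It sets the two variables and then asks urllib which proxies a client would pick up from the environment. The function name proxies_seen_by_client and the proxy address are made up for illustration; this is not SFM's or Twarc's code.

```python
import os
import urllib.request


def proxies_seen_by_client(proxy_url):
    """Set the proxy variables, then report what urllib reads back.

    proxy_url is a placeholder for wherever warcprox is listening.
    """
    # Hypothetical stand-in for what a parent process does before
    # starting a harvester: export the two proxy variables.
    os.environ["HTTP_PROXY"] = proxy_url
    os.environ["HTTPS_PROXY"] = proxy_url

    # urllib scans the environment for variables ending in "_proxy"
    # (case-insensitive) and returns a scheme -> proxy URL mapping.
    # Lowercase variants, if also set, take precedence.
    return urllib.request.getproxies_environment()


if __name__ == "__main__":
    # Assumed address; replace with the real warcprox host and port.
    print(proxies_seen_by_client("http://localhost:8000"))
```

No client code mentions the proxy explicitly here; the lookup happens through the environment, which is why nothing proxy-related shows up in the harvester source.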
And in my provided "harvesting example" the working solution looks like this:
And with REQUESTS_CA_BUNDLE and the warcprox certificate in use (as SFM also does), the curl example no longer needs the insecure connection flag -k.
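A setup along these lines could be sketched as follows. This is only an illustration under assumptions: the helper names build_proxied_env and curl_command, the proxy address, and the certificate path are all hypothetical, and the curl command hands the certificate over explicitly via --cacert rather than relying on any environment variable.

```python
import os


def build_proxied_env(proxy_url, ca_bundle_path, base_env=None):
    """Return a copy of the environment with proxy and CA variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HTTP_PROXY"] = proxy_url
    env["HTTPS_PROXY"] = proxy_url
    # REQUESTS_CA_BUNDLE is read by the Python requests library and
    # tells it which CA file to trust for HTTPS connections.
    env["REQUESTS_CA_BUNDLE"] = ca_bundle_path
    return env


def curl_command(url, proxy_url, ca_bundle_path):
    """Build a curl call that trusts the given CA file instead of using -k."""
    return ["curl", "-L", "--proxy", proxy_url, "--cacert", ca_bundle_path, url]


if __name__ == "__main__":
    # Assumed values; replace with the real warcprox address and CA file.
    proxy = "http://localhost:8000"
    ca_file = "/path/to/warcprox-ca.pem"
    print(build_proxied_env(proxy, ca_file, base_env={})["REQUESTS_CA_BUNDLE"])
    print(" ".join(curl_command("https://google.com", proxy, ca_file)))
```

The returned dictionary could be passed as the env argument of subprocess.run to start a harvester-like process, and the command list could be run the same way.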