
Proxy Errors #101

Open
jj2018jj opened this issue Dec 13, 2024 · 8 comments


jj2018jj commented Dec 13, 2024

If I do a scrape without the -proxies flag, it works as expected.

Also, if I use curl with the proxy URL against the Google Maps URL, it works as expected.

However, when I add the -proxies flag I see the errors below in the output and the scrape does not work.

This is with everything installed directly on Ubuntu 24.04, not using Docker.

{"level":"info","component":"scrapemate","time":"2024-12-13T17:55:43.067587274Z","message":"starting scrapemate"}
{"level":"info","component":"scrapemate","numOfJobsCompleted":0,"numOfJobsFailed":0,"lastActivityAt":"0001-01-01T00:00:00Z","speed":"0.00 jobs/min","time":"2024-12-13T17:56:43.068652477Z","message":"scrapemate stats"}
{"level":"info","component":"scrapemate","error":"inactivity timeout: 0001-01-01T00:00:00Z","time":"2024-12-13T17:56:43.068682503Z","message":"exiting because of inactivity"}
{"level":"info","component":"scrapemate","job":"Job{ID: 71d978af-8a91-45ee-9d56-dd630099dc31, Method: GET, URL: https://www.google.com/maps/search/test, UrlParams: map[hl:en]}","error":"context canceled","status":"failed","duration":60899.47342,"time":"2024-12-13T17:56:43.967269068Z","message":"job finished"}
{"level":"error","component":"scrapemate","error":"context canceled","time":"2024-12-13T17:56:43.967304912Z","message":"error while processing job"}
{"level":"info","component":"scrapemate","time":"2024-12-13T17:56:43.967340977Z","message":"scrapemate exited"}

Any tips on what I could be doing wrong?

Update 1: I tested a 2nd proxy service, and I also tried setting the proxy with export http_proxy="proxy url here" and export https_proxy="proxy url here" while not using the -proxies flag, but it still didn't work.

Update 2: I tested a 3rd proxy service using socks5 with the -proxies flag and got this error:

{"level":"error","component":"scrapemate","error":"playwright: Browser does not support socks5 proxy authentication","time":"2024-12-14T22:00:49.630242703Z","message":"error while processing job"}
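(For reference, a quick way to isolate whether the proxy or the scraper is at fault is to test the same proxy with curl outside the scraper. The host, port, and credentials below are placeholders, not values from this thread:)

```shell
# Test an authenticated HTTP proxy against the target URL; prints the HTTP status code
curl -x "http://user:pass@proxy.example.com:8080" -sS -o /dev/null -w "%{http_code}\n" \
  "https://www.google.com/maps/search/test?hl=en"

# Test a socks5 proxy; the socks5h:// scheme also resolves DNS through the proxy
curl --proxy "socks5h://user:pass@proxy.example.com:1080" -sS -o /dev/null -w "%{http_code}\n" \
  "https://www.google.com/maps/search/test?hl=en"
```

If curl succeeds here but the scraper still times out, the problem is likely in how the browser (Playwright) uses the proxy rather than in the proxy itself.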

gosom (Owner) commented Dec 15, 2024

I just tested the functionality using a socks5 proxy:

ssh -D 1080 -q -C -N myserver

and then used this:

 go run main.go -input example-queries.txt -results demo.csv  -proxies 'socks5://127.0.0.1:1080'

I also tried it from the web interface.

The traffic went through the proxy.

The difference here is that I use a socks5 proxy without authentication.

I don't have access to a proxy with authentication at the moment.

I tried an HTTP proxy with authentication back when this feature was implemented, and it worked.
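(Since Playwright's browsers reject socks5 proxies that require authentication, one possible workaround — not verified in this thread — is to run a local, unauthenticated socks5 listener that forwards to the authenticated upstream proxy. gost is a third-party tool, and the upstream address and credentials are placeholders:)

```shell
# gost listens on 127.0.0.1:1080 without auth and forwards everything
# to the authenticated upstream socks5 proxy
gost -L socks5://:1080 -F "socks5://user:pass@upstream.example.com:1080" &

# point the scraper at the local, credential-free listener
go run main.go -input example-queries.txt -results demo.csv -proxies 'socks5://127.0.0.1:1080'
```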

gosom (Owner) commented Dec 16, 2024

@jj2018jj this is confirmed. I believe something may have changed in Playwright.

I will investigate and get back to you.

EdwinUK commented Jan 9, 2025

@gosom Any update on this?

I'm also getting this error when trying to use HTTPS proxies:

{"level":"info","component":"scrapemate","time":"2025-01-09T08:41:10.0880056Z","message":"starting scrapemate"}
{"level":"info","component":"scrapemate","numOfJobsCompleted":0,"numOfJobsFailed":0,"lastActivityAt":"0001-01-01T00:00:00Z","speed":"0.00 jobs/min","time":"2025-01-09T08:42:10.0895125Z","message":"scrapemate stats"}
{"level":"info","component":"scrapemate","error":"inactivity timeout: 0001-01-01T00:00:00Z","time":"2025-01-09T08:42:10.0895125Z","message":"exiting because of inactivity"}
{"level":"info","component":"scrapemate","job":"Job{ID: e8d7abd1-4a27-4453-99f8-a3a640ff5daa, Method: GET, URL: https://www.google.com/maps/search/roofing+in+Worcester+Park/@0,0,15z, UrlParams: map[hl:en]}","error":"context canceled","status":"failed","duration":62254.0603,"time":"2025-01-09T08:42:12.343207Z","message":"job finished"}
{"level":"error","component":"scrapemate","error":"context canceled","time":"2025-01-09T08:42:12.343207Z","message":"error while processing job"}
{"level":"info","component":"scrapemate","time":"2025-01-09T08:42:12.343207Z","message":"scrapemate exited"}

robinroloff commented

@gosom Is there an update yet? I have a client who needs this urgently, and I have to tell him whether it will be fixed and how long it will take.

SparksIRL commented Jan 29, 2025

@gosom I'm having issues with a rotating proxy as well. No matter how I set up the proxy (using the proxy domain or the proxy IP in the appropriate format), I get this error:

{"level":"info","component":"scrapemate","time":"2025-01-28T15:00:43.445946977Z","message":"starting scrapemate"}
{"level":"info","component":"scrapemate","numOfJobsCompleted":0,"numOfJobsFailed":0,"lastActivityAt":"0001-01-01T00:00:00Z","speed":"0.00 jobs/min","time":"2025-01-28T15:01:43.444493455Z","message":"scrapemate stats"}
{"level":"info","component":"scrapemate","error":"inactivity timeout: 0001-01-01T00:00:00Z","time":"2025-01-28T15:01:43.444538295Z","message":"exiting because of inactivity"}
{"level":"info","component":"scrapemate","job":"Job{ID: f5fbfcc5-40eb-4bc6-84e6-6c4977e2db1e, Method: GET, URL: https://www.google.com/maps/search/OBFUSCATED+OBFUSCATED+OBFUSCATED, UrlParams: map[hl:en]}","error":"context canceled","status":"failed","duration":90804.082496,"time":"2025-01-28T15:02:14.252046946Z","message":"job finished"}
{"level":"error","component":"scrapemate","error":"context canceled","time":"2025-01-28T15:02:14.25207078Z","message":"error while processing job"}
{"level":"info","component":"scrapemate","time":"2025-01-28T15:02:14.252090701Z","message":"scrapemate exited"}

...or similar. What happens is that the proxy never loads, and the script hangs until it exits due to inactivity.

gosom (Owner) commented Feb 8, 2025

I tested two commercial proxies, and neither of them worked.
However, when I created a custom HTTP proxy on a VPS, as well as a socks5 proxy, both worked.

It looks like the issue is a combination of Playwright and commercial proxies — something like anti-bot protection or similar.

BjoernRave commented

May I ask how one could run this project for a bigger workload in the current situation? Aren't proxies a hard requirement, or am I missing something?

gosom (Owner) commented Feb 9, 2025

> may I ask how one could run this project for a bigger workload in the current situation. arent proxies a hard requirement or am I not aware of something?

  1. Proxies are not a hard requirement.
  2. You can route the traffic from the Docker container through a VPN if you like.
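(As a rough sketch of option 2 — the container names are illustrative, gluetun is just one example of a VPN client image and is not part of this project, and the VPN provider settings are omitted; see the VPN image's own documentation:)

```shell
# run a VPN client container (provider credentials/settings omitted here)
docker run -d --name vpn --cap-add=NET_ADMIN --device /dev/net/tun qmcgaw/gluetun

# share the VPN container's network stack, so all scraper traffic exits via the VPN
docker run --rm --network container:vpn \
  -v "$PWD":/data gosom/google-maps-scraper \
  -input /data/example-queries.txt -results /data/results.csv
```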


6 participants