regression testing? #2

Open
BigBaIIs opened this issue Feb 19, 2025 · 14 comments

Comments

@BigBaIIs
Member

yo @Montana

anything planned for this? lmk we can zoom it after 6PM EST.

@Montana
Member

Montana commented Feb 19, 2025

Hey @BigBaIIs,

Regression testing for this dod_spending.py project would involve verifying that new changes—such as the addition of pagination, the --use-curl option, or fixes like resolving the syntax error in setup_session()—do not break existing functionality while ensuring the script continues to accurately search for and collect DoD spending PDFs.

This process would include re-running a suite of test cases that check core features: executing default queries to fetch PDF links, validating output file generation (e.g., test_output.txt), confirming verbose logging works, and ensuring custom query parsing remains intact.

For example, after adding pagination, regression tests would confirm that the script still retrieves the expected number of PDFs from a single page of results (15 by default) and doesn’t crash with duplicate URLs, while also verifying that the --use-curl flag doesn’t disrupt HTTP requests compared to the original requests-based approach.

Automated tests could use mock search results and HTTP responses to simulate network behavior, ensuring stability across Python versions and environments, with manual spot-checks to validate real-world PDF discovery remains consistent with prior versions.
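A minimal sketch of what one of those automated tests could look like, assuming the script exposes something like a `collect_pdf_links(fetch_page, query, pages)` function. That name, its signature, and the `fake_fetch_page` mock are hypothetical stand-ins for illustration; dod_spending.py may be structured differently.

```python
import unittest

RESULTS_PER_PAGE = 15  # default page size mentioned above


def collect_pdf_links(fetch_page, query, pages=1):
    """Hypothetical stand-in for the script's collection logic.

    fetch_page(query, page) returns a list of URLs for one results page.
    Only .pdf links are kept, and duplicates are dropped in order.
    """
    seen = set()
    links = []
    for page in range(1, pages + 1):
        for url in fetch_page(query, page):
            if url.lower().endswith(".pdf") and url not in seen:
                seen.add(url)
                links.append(url)
    return links


def fake_fetch_page(query, page):
    """Mock search backend: 15 PDF results per page, plus some noise."""
    start = (page - 1) * RESULTS_PER_PAGE
    urls = [f"https://example.gov/doc{start + i}.pdf" for i in range(RESULTS_PER_PAGE)]
    # Add a duplicate and a non-PDF link to exercise the edge cases.
    return urls + [urls[0], "https://example.gov/index.html"]


class RegressionTests(unittest.TestCase):
    def test_single_page_returns_default_count(self):
        links = collect_pdf_links(fake_fetch_page, "dod spending", pages=1)
        self.assertEqual(len(links), RESULTS_PER_PAGE)

    def test_pagination_has_no_duplicates(self):
        links = collect_pdf_links(fake_fetch_page, "dod spending", pages=3)
        self.assertEqual(len(links), len(set(links)))
        self.assertEqual(len(links), 3 * RESULTS_PER_PAGE)
```

Saved as a test module, this would run with `python -m unittest`. Because the network layer is passed in as a function, the same tests work unchanged whether the real script fetches with requests or with curl.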

6PM EST works. Please call on personal Zoom.

@BigBaIIs
Member Author

was going to suggest we use ngrok

ngrok typically tunnels to a local server, but for dod_spending.py we'll proxy HTTP requests outbound. The script makes requests to external sites (not a local server), so we'll need a proxy server to route traffic through ngrok.

@Montana
Member

Montana commented Feb 19, 2025

NB @BigBaIIs,

You and I can adapt it for dod_spending.py to proxy outbound HTTP requests instead.

Unlike its usual inbound use case, where ngrok forwards external traffic to a local service, this script makes requests to external sites (e.g., Google search, government domains) rather than hosting a local server.

@BigBaIIs
Member Author

this adds context about why government servers (*.gov) pose unique challenges—security policies, IP restrictions, and SSL quirks.

good idea dude

@Montana
Member

Montana commented Feb 19, 2025

Hey @BigBaIIs,

Made this table to leave here. I don't think it belongs in the official DoD Spend documentation. Essentially we'd want to run:

ngrok authtoken YOUR_AUTH_TOKEN && ngrok http 8080 --log=stdout > ngrok.log &

This would help debug issues with government server responses (e.g., 429 rate limits). Pair with mitmproxy and run python3 dod_spending.py --proxy http://abc123.ngrok.io -v for the verbose ngrok output. For all intents and purposes it works:

Image

Check the table out below:

| Aspect for ngrok | Description |
| --- | --- |
| Purpose | Tunnels outbound HTTP requests from dod_spending.py to external sites via a public URL, reversing ngrok’s typical use. |
| Typical Use Case | Exposes a local server (e.g., localhost:8080) to the internet (e.g., https://abc123.ngrok.io). |
| Adapted Use Case | Proxies outbound requests to external sites (e.g., *.gov) through a local proxy, forwarded by ngrok. |
| Installation | Download from ngrok.com, unzip, add to PATH (e.g., /usr/local/bin/ngrok on Linux/macOS). |
| Authentication | Run ngrok authtoken YOUR_AUTH_TOKEN with token from ngrok dashboard to unlock features. |
| Start Command | ngrok http 8080 creates tunnel from https://abc123.ngrok.io to localhost:8080. |
| Local Proxy Setup | Use mitmproxy -p 8080 (install via pip install mitmproxy) to forward requests to ngrok. |
| Script Modification | Add --proxy arg, update setup_session() with session.proxies = {"http": "http://abc123.ngrok.io", "https": ...}. |
| Government Servers | *.gov (e.g., defense.gov) may have rate limits, IP whitelisting, strict SSL; ngrok aids debugging these issues. |
| Benefits | Monitor/debug requests, bypass local restrictions, test against government server policies externally. |
| Limitations | Needs local proxy, adds latency, free tier has limits; government IPs may still block. |
| Run Example | python dod_spending.py --output test.txt --proxy http://abc123.ngrok.io routes requests via ngrok. |
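For the "Script Modification" row, here is a rough sketch of how the --proxy argument could be wired into the session. The function bodies are assumptions rather than the actual dod_spending.py code, and reusing the same URL for the "https" entry is my guess, since the table leaves that value elided.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in this thread (sketch, not the real parser)."""
    parser = argparse.ArgumentParser(description="dod_spending.py (sketch)")
    parser.add_argument("--output", default="test_output.txt")
    parser.add_argument("--proxy", default=None,
                        help="proxy URL, e.g. http://abc123.ngrok.io")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser.parse_args(argv)


def proxies_for(proxy_url):
    """Build the mapping that requests expects in session.proxies."""
    if not proxy_url:
        return {}
    # Assumption: the same proxy URL handles both schemes.
    return {"http": proxy_url, "https": proxy_url}


def setup_session(session, proxy_url=None):
    """Apply the proxy to an existing session, e.g. requests.Session()."""
    session.proxies.update(proxies_for(proxy_url))
    return session


# Typical wiring, assuming requests is installed:
#   import requests
#   args = parse_args()
#   session = setup_session(requests.Session(), args.proxy)
```

Taking the session as a parameter keeps the proxy logic testable without a network connection, which fits the regression-testing goal of this issue.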

One thing to note: ngrok flags like --region us or --log=stdout can help with *.gov quirks (e.g., IP restrictions, rate limits), but we still have to keep an eye on the pagination so we don't exceed the rate limit. I'm hoping ngrok can help us with this.
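On the rate-limit side, a small retry helper could keep paginated runs from hammering a server that has started returning 429s. This is only a sketch: `fetch_with_backoff` and the `fetch(url)` callable it takes are hypothetical names, and the retry counts and delays are placeholder values.

```python
import time


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry on HTTP 429 with exponential backoff.

    fetch(url) must return (status_code, headers, body). A numeric
    Retry-After header, when present, overrides the computed delay.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        if attempt == max_retries:
            break  # out of retries, give up below
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.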

@BigBaIIs
Member Author

BigBaIIs commented Feb 19, 2025

started with the nginx-reverse-proxy first

Image

anything from ngrok?

@Montana
Member

Montana commented Feb 19, 2025

Hi @BigBaIIs,

Yeah I ran cursory tests on ngrok, whilst using nginx-reverse-proxy, here's what they looked like:

Image


I enjoyed making a file called extreme-tunnels.yml. Applying the nginx-reverse-proxy:

Image


Most of this stuff is just placeholder data. Point being, I was able to get the PoC working with nginx-reverse-proxy and ngrok fairly simply. It also goes without saying that MFA is enabled.
The one thing I had a little trouble with was modifying the server block:

server {
    listen 443 ssl;
    server_name dogegov.com;
    ssl_certificate /etc/letsencrypt/live/dogegov.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/dogegov.com/privkey.pem;

    location / {
        proxy_pass http://localhost:3000;
    }
}
server {
    listen 80;
    server_name dogegov.com;
    return 301 https://$host$request_uri; 
}

I also set up IPv4/IPv6 deny rules and user-agent blocking, to go along with rate limiting:

location / {
    deny 192.168.1.100;  # Block specific IP
    if ($http_user_agent ~* "badbot|spider") {
        return 403;
    }
    proxy_pass http://localhost:3000;
}

The final thing was caching static responses, which is basically like Argo trafficking:

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=dogecache:10m inactive=24h;

server {
    listen 80;
    server_name doge.gov;

    location / {
        proxy_cache dogecache;
        proxy_cache_valid 200 301 302 1h; 
        proxy_pass http://localhost:3000;
    }
}

You may need to edit the proxy headers @BigBaIIs, depending on your nginx setup.

Cheers,
Michael

@shubcodes

@Montana @BigBaIIs you can use the ngrok python SDK with this to make the testing faster without having to download the binary and running CLI commands: ngrok-python-sdk
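For reference, a minimal sketch of what that could look like with the SDK, assuming the package is installed with `pip install ngrok` and the token is available in the NGROK_AUTHTOKEN environment variable. The `open_tunnel` wrapper is a hypothetical helper, not part of the SDK.

```python
def open_tunnel(port=8080):
    """Start an ngrok listener forwarding to localhost:<port>; return its URL."""
    # Imported lazily so the rest of the script still runs without the SDK.
    import ngrok

    # authtoken_from_env=True reads the token from NGROK_AUTHTOKEN.
    listener = ngrok.forward(port, authtoken_from_env=True)
    return listener.url()
```

Calling `open_tunnel(8080)` would replace the separate `ngrok http 8080` CLI step in a test harness.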

@Montana
Member

Montana commented Feb 19, 2025

Hey @shubcodes,

Thank you for the recommendation. It may be something I do in a sandbox. As far as I know, though, the CLI and binary offer full access to ngrok’s features: HTTP/HTTPS/TCP tunnels, custom domains, edge configurations, and diagnostics like ngrok diagnose.

I think ultimately the Python SDK (ngrok-python), while slick for programmatic ingress (e.g., ngrok.forward(8080)), skips some advanced stuff. For instance, serving directory files (e.g., ngrok http --dir) isn’t supported in the SDK.

I think one thing @BigBaIIs and I talked about was dependency bloat: installing ngrok-python-sdk via pip pulls in a Rust toolchain, cmake, and libssl-dev (since it likely uses ngrok-rust). This jumps the install size to 50MB+ compared to, say, a standalone ngrok CLI binary at ~10MB.

In other terms, ngrok loses some of its elegance.

@BigBaIIs
Member Author

@Montana

def need less deps on this laptop, 8TB isn't doing it anymore even

@GeXnY

GeXnY commented Feb 19, 2025

should be able to use nginx-reverse-proxy fairly easily

server {
    listen 80;
    server_name localhost;

    location / {
        proxy_pass http://localhost:3000; 
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

once the configuration is right (if using sites-available), then symlink

sudo ln -s /etc/nginx/sites-available/myapp /etc/nginx/sites-enabled/

@fastbaII

@Montana or @BigBaIIs is there no option to make the search recursive, once it’s done to restart?

@Montana
Member

Montana commented Feb 20, 2025

You throwing a curveball at @BigBaIIs and me, @fastbaII?

@BigBaIIs
Member Author

hit a grand slam, oh just in time for brekky


5 participants