HTML base element #46

Open
Thyra opened this issue Sep 15, 2017 · 1 comment
Comments

@Thyra

Thyra commented Sep 15, 2017

I found another thing that has to be considered when crawling a website: the HTML base element (<base>). It changes the base URL that relative hrefs are resolved against.
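To illustrate the effect, here is a minimal Python sketch that resolves the same relative href with and without a base URL. The page URL, base href, and link below are hypothetical values chosen for the example, not taken from the crawler.

```python
from urllib.parse import urljoin

# Hypothetical values, for illustration only.
page_url = "https://example.com/blog/post.html"
base_href = "https://example.com/assets/"  # as in <base href="https://example.com/assets/">
link = "img/logo.png"

# Without a <base> element, the relative href resolves against the page URL.
without_base = urljoin(page_url, link)

# With a <base> element, the same href resolves against the base URL instead.
with_base = urljoin(base_href, link)

print(without_base)  # https://example.com/blog/img/logo.png
print(with_base)     # https://example.com/assets/img/logo.png
```

A crawler that ignores the base element would follow the first URL, while a browser would request the second.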

@vezaynk
Owner

vezaynk commented Sep 15, 2017

This is an interesting case of which I was not aware.

This line currently uses the parent URL to resolve relative URLs. A simple regex that attempts to extract the base URL should be easy enough.

But like with all things that seem easy, we get a bunch of edge cases!

"Absolute and relative URLs are allowed."

I can't fathom why someone would use a relative URL for this. I will probably handle the absolute case first and open a new issue for the relative one after.
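As a rough sketch of how both cases could be covered at once, the Python below resolves the base href against the page URL before using it. An absolute base href passes through unchanged and a relative one is resolved first. It uses the standard library's HTML parser rather than the regex suggested above, and the names `BaseHrefFinder` and `effective_base` are made up for this example, not part of the project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the first non-empty href found on a <base> tag."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base> with a usable href counts; later ones are ignored.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base(page_url, html):
    """Return the URL that relative hrefs on this page should resolve against."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href is None:
        return page_url  # no <base>: fall back to the page URL
    # urljoin leaves an absolute href as-is and resolves a relative one
    # against the page URL, so both cases go through the same call.
    return urljoin(page_url, finder.base_href)


page = "https://example.com/a/b.html"
absolute_case = effective_base(page, '<head><base href="https://cdn.example.org/x/"></head>')
relative_case = effective_base(page, '<head><base href="/docs/"></head>')
no_base_case = effective_base(page, "<head><title>t</title></head>")
```

Links on the page would then be resolved with `urljoin(effective_base(page, html), href)` instead of against the parent URL directly.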
