Imagine a website with millions of endpoints that serve functionally identical pages and differ only by a UUID, a sequential number, or another pattern that a regex can easily match. Katana currently treats each such path as unique, and there is no option to change this behavior. The result can be an overwhelming number of requests, which makes crawling tedious.
I want a flag to limit crawling to this section:
example.org/blogpost/dd414f7e-2293-49f4-9252-014aec97f787
example.org/blogpost/dd414f7e-2293-49f4-9252-014aec97f787/comments
example.org/blogpost/dd414f7e-2293-49f4-9252-014aec97f787/comments?recent=true
example.org/blogpost/dd414f7e-2293-49f4-9252-014aec97f787/like
This is redundant:
example.org/blogpost/0bddaf77-6c4e-4db9-afd8-8b3483a7e47e
example.org/blogpost/0bddaf77-6c4e-4db9-afd8-8b3483a7e47e/comments
example.org/blogpost/0bddaf77-6c4e-4db9-afd8-8b3483a7e47e/comments?recent=true
example.org/blogpost/0bddaf77-6c4e-4db9-afd8-8b3483a7e47e/like
This is redundant:
example.org/blogpost/9ad491cb-f6a8-4b26-b440-bd1d2c0fdc32
example.org/blogpost/9ad491cb-f6a8-4b26-b440-bd1d2c0fdc32/comments
example.org/blogpost/9ad491cb-f6a8-4b26-b440-bd1d2c0fdc32/comments?recent=true
example.org/blogpost/9ad491cb-f6a8-4b26-b440-bd1d2c0fdc32/like
In its current form, Katana sends all of those requests, and there is no way to restrict it to just the first block. I would like an option to disable this behavior, so that Katana stops crawling when it encounters similar paths that differ only by a subpath, such as /uuid/, /123/, or a custom regex match. In most use cases, there is little benefit in crawling beyond the first UUID, especially if the core functionality is unlikely to change.
Proposed Flag:
-nsp, -no-similar-paths
# Disables crawling for paths that are nearly identical, differing only by defined subpaths. This flag marks such paths as duplicates.
Options: uuid, sequential_number, regex (default: uuid)
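To make the request concrete, here is a rough sketch in Python of the deduplication I have in mind. It is not Katana code and does not reflect how Katana works internally; the function names (`path_signature`, `filter_similar`) and the placeholder format are made up for illustration. The idea is to replace every path segment that matches a chosen pattern with a placeholder and to treat URLs with the same resulting signature as duplicates.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern table mirroring the proposed options (uuid, sequential_number).
# A custom regex would be passed in the same way.
PATTERNS = {
    "uuid": re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
    "sequential_number": re.compile(r"\d+"),
}

SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def path_signature(url, patterns):
    """Return the URL with every matching path segment replaced by a placeholder."""
    # urlsplit only detects the host when the URL has a scheme or starts with "//".
    if not SCHEME_RE.match(url):
        url = "//" + url
    parts = urlsplit(url)
    segments = []
    for segment in parts.path.split("/"):
        for name, pattern in patterns.items():
            # fullmatch: the whole segment must be the identifier, not just contain it.
            if pattern.fullmatch(segment):
                segment = "{" + name + "}"
                break
        segments.append(segment)
    query = "?" + parts.query if parts.query else ""
    return parts.netloc + "/".join(segments) + query


def filter_similar(urls, patterns=None):
    """Keep only the first URL seen for each signature, preserving order."""
    if patterns is None:
        patterns = {"uuid": PATTERNS["uuid"]}  # default: uuid, as in the proposal
    seen = set()
    kept = []
    for url in urls:
        signature = path_signature(url, patterns)
        if signature not in seen:
            seen.add(signature)
            kept.append(url)
    return kept
```

Run over the twelve example URLs above with the default uuid pattern, `filter_similar` keeps only the first block of four, because the second and third blocks produce the same signatures (`example.org/blogpost/{uuid}`, `example.org/blogpost/{uuid}/comments`, and so on). A real crawler would apply the check before queueing a request rather than filtering a finished list.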