Fixed issue 82 - added noindex functionality #83
base: master
Conversation
sitemap.functions.php
Outdated
global $scanned, $deferredLinks, $file_stream, $freq, $priority, $enable_priority, $enable_frequency, $max_depth, $depth, $real_site, $indexed;
global $scanned, $deferredLinks, $file_stream, $freq, $priority, $enable_priority, $enable_frequency, $max_depth, $depth, $real_site, $indexed, $noindex;
$pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
$priority = $pribydepth[$depth];
How would the code handle depths greater than the number of entries in the list? Maybe it should default to the last one when exceeded?
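One way to do that is to clamp the lookup index. A minimal standalone sketch, reusing the `$pribydepth` array from the diff above (the `$depth` value here is just an example):

```php
<?php
// The priority-by-depth array from the PR diff.
$pribydepth = array(1, .8, .64, .5, .32, .25, .1);
$depth = 9; // example: a depth past the end of the array

// Clamp the index so any depth beyond the list falls back to the last entry,
// instead of reading past the end of the array.
$priority = $pribydepth[min($depth, count($pribydepth) - 1)];
// here $priority ends up as .1, the final element
```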
Good points in both comments.... Please forgive these mistakes, as I'm new to contributing to stuff. I'll make some changes and get this re-submitted. Thanks!
No need to apologize. You're already doing more than I have.
sitemap.functions.php
Outdated
@@ -365,6 +367,11 @@ function scan_url($url)
return $depth--;
}

if ($noindex && preg_match('/content="noindex"/', $html)) {
This is not a reliable way to check the HTML.
Possible false positive: the attribute may appear on some random tag, when we only care about meta tags. It also doesn't apply to all meta tags.
Possible false negative: HTML allows a lot of variety. Here are some examples:
// Single quotes
content='noindex'
// Spaces around equality
content = "noindex"
// capitals
CONTENT="noindex"
We do need to account for those.
I'm okay merging this despite the false negatives, but false positives should be fixed before I can do so.
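A more tolerant check could look something like the sketch below. It first isolates `<meta>` tags, then matches `noindex` inside the `content` attribute of a `name="robots"` tag, case-insensitively and with either quote style or unquoted values. The function name is hypothetical, and `$html` is assumed to hold the fetched page source:

```php
<?php
// Sketch: detect a robots noindex directive without false positives from
// "noindex" appearing in arbitrary page text or non-meta tags.
function is_noindex($html) {
    // Collect all <meta ...> tags (case-insensitive, dot matches newlines).
    if (!preg_match_all('/<meta\b[^>]*>/is', $html, $tags)) {
        return false;
    }
    foreach ($tags[0] as $tag) {
        // Only robots meta tags count; tolerate spaces around "=",
        // single/double/no quotes, and any capitalization.
        if (preg_match('/\bname\s*=\s*["\']?robots["\']?/i', $tag)
            && preg_match('/\bcontent\s*=\s*["\']?[^"\'>]*noindex/i', $tag)) {
            return true;
        }
    }
    return false;
}

// Example calls covering the variants listed above:
var_dump(is_noindex("<meta name='robots' content = 'noindex, nofollow'>"));
var_dump(is_noindex('<meta NAME="ROBOTS" CONTENT="NOINDEX">'));
var_dump(is_noindex('<div content="noindex">not a meta tag</div>'));
```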
…ty for greater depths
I think I got this stuff resolved. Tested a couple of use cases and it seemed to work.
You have a very creative approach to these problems.
This codebase doesn't have proper testing in place, so please do test your changes to make sure it doesn't break.
I see that this is your first contribution on GitHub, congrats! I started this project many years ago and don't use it anymore personally. As a result, the codebase is kind of garbage quality. The depth by priority idea is definitely going to be included when I re-write the project. It's a really good idea, I'm surprised I never saw it before.
Anyways, there's a couple more changes to make.
@@ -334,8 +334,12 @@ function get_links($html, $parent_url, $regexp)
function scan_url($url)
{
global $scanned, $deferredLinks, $file_stream, $freq, $priority, $enable_priority, $enable_frequency, $max_depth, $depth, $real_site, $indexed, $noindex;
$pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
$priority = $pribydepth[$depth];
if ($depth > 6) {
Don't hard-code the length of the array. Use a function to get it dynamically.
if ($depth > 6) {
$priority = .1;
} else {
$pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
IMHO this belongs in the configuration file. Different priorities depending on the user.
$pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
$priority = $pribydepth[$depth];
if ($depth > 6) {
$priority = .1;
It should use the last element of the array. Hard-coding a constant is not ideal. Consider that some users may want a constant priority across all pages.
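Deriving both the bound and the fallback from the array itself removes the hard-coded `6` and `.1` in one step. A small sketch (the `$depth` value is just an example):

```php
<?php
// Sketch: the fallback comes from the array, so changing the array (even to
// a single constant priority for all pages) changes the behavior with it.
$pribydepth = array(1, .8, .64, .5, .32, .25, .1);
$depth = 8; // example depth beyond the array

$priority = ($depth < count($pribydepth))
    ? $pribydepth[$depth]   // in range: use the depth's own priority
    : end($pribydepth);     // out of range: reuse the last element
```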
@@ -367,7 +371,7 @@ function scan_url($url)
return $depth--;
}

if ($noindex && preg_match('/content="noindex"/', $html)) {
if ($noindex && (preg_match_all('/\<meta.*?\>/mis',$html,$ar) and strstr(join(',',$ar[0]),'noindex'))) {
This is a really interesting way of approaching the problem. Has room for improvement, but I'm a fan of it.
These are all great ideas, and thanks for the "welcome to contributing" note. :) It'll take a little bit of time for me to think about them and get them into place. Job priorities interfere with this fun stuff sometimes.
It works as expected, but it should still follow the links from noindex pages and add them to the sitemap.xml file. Example: if Page A is noindex and links to Pages B and C, Page A should be omitted but Pages B and C should be added to the sitemap.xml file.
@wcmohler @knyzorg what is the status of this PR? Do let me know if any work is needed to close this. I have some bandwidth to spare. @mylselgan, what you want to achieve is pretty simple, instead of
Just a minor change.
@@ -365,6 +371,11 @@ function scan_url($url)
return $depth--;
}

if ($noindex && (preg_match_all('/\<meta.*?\>/mis',$html,$ar) and strstr(join(',',$ar[0]),'noindex'))) {
logger("This page is set to noindex.", 1);
return $depth--;
Instead of returning here, can you set a flag $is_noindex_url to skip fwrite($file_stream, $map_row); so that child links can still be processed?
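A self-contained sketch of that flag-based flow, using the detection expression already in the PR. In the real `scan_url()`, `$noindex`, `$html`, `$file_stream`, and `$map_row` come from the crawler state; the stub values and the `$is_noindex_url` name below are only illustrative:

```php
<?php
// Stub values standing in for scan_url()'s state (illustrative only).
$noindex = true;
$html = '<html><head><meta name="robots" content="noindex"></head></html>';
$file_stream = fopen('php://memory', 'w');
$map_row = "<url><loc>https://example.com/a</loc></url>\n"; // hypothetical row

// Same detection expression as in the PR, stored in a flag instead of
// triggering an early return.
$is_noindex_url = $noindex
    && preg_match_all('/\<meta.*?\>/mis', $html, $ar)
    && strstr(join(',', $ar[0]), 'noindex');

// Skip only the sitemap entry. Because we do not return, the links on this
// page are still extracted afterwards, so its children still get crawled.
if (!$is_noindex_url) {
    fwrite($file_stream, $map_row);
}
```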
@cbsiddharth your patch works well. This PR now skips "noindex" pages and follows links from "noindex" pages. Thank you all.
Hi @sidcha, Sorry I've been real busy with life, work and whatever else lately. GitHub sends notifications to the wrong e-mail for this repository so I end up missing them. It's been nearly a year, so I doubt you're still looking for an answer but for anyone who wanders here: The status of the PR is incomplete. It does what it sets out to do but still has unaddressed comments. I would love for someone to pick up this PR (fork it), finish it up and send it in for merging. There is also a lot of room for improvement for how the noindex pages are both detected and processed. I would like to see something on that front before merging it in.
I also updated the priority functionality to tie the priority level to the depth, like other sitemap generators.
Added a space in the "Sitemap has been generated in" logger part of sitemap.php