Fixed issue 82 - added noindex functionality #83

Open · wcmohler wants to merge 3 commits into base: master

Conversation

wcmohler commented:

I also updated the priority functionality to tie the priority level to the depth, like other sitemap generators.

Added a space in the "Sitemap has been generated in" logger part of sitemap.php

- global $scanned, $deferredLinks, $file_stream, $freq, $priority, $enable_priority, $enable_frequency, $max_depth, $depth, $real_site, $indexed;
+ global $scanned, $deferredLinks, $file_stream, $freq, $priority, $enable_priority, $enable_frequency, $max_depth, $depth, $real_site, $indexed, $noindex;
+ $pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
+ $priority = $pribydepth[$depth];
vezaynk (Owner) commented:

How would the code handle depths that are greater than the number of entries in the list? Maybe it should default to the last one if exceeded?
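
(For illustration: a minimal sketch of the clamping being suggested here, reusing the PR's variable names; this code is not part of the diff.)

// Clamp the depth so lookups past the end of the table
// fall back to the last (lowest) priority.
$pribydepth = array( 1, .8, .64, .5, .32, .25, .1 );
$index = min($depth, count($pribydepth) - 1);
$priority = $pribydepth[$index];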

wcmohler (Author) replied:

Good points in both comments... Please forgive these mistakes, as I'm new to contributing to stuff. I'll make some changes and get this re-submitted. Thanks!

vezaynk (Owner) replied:

No need to apologize. You're already doing more than I have.

@@ -365,6 +367,11 @@ function scan_url($url)
return $depth--;
}

if ($noindex && preg_match('/content="noindex"/', $html)) {
vezaynk (Owner) commented:

This is not a reliable way to check the HTML.
Possible false positive: the content attribute may appear on some random tag, but we're only looking for meta tags. It's also not applicable to all meta tags.

Possible false negative: HTML allows a lot of variety. Here are some examples:

// Single quotes
content='noindex'
// Spaces around the equals sign
content = "noindex"
// Capitals
CONTENT="noindex"

We do need to account for those.

I'm okay merging this despite the false negatives, but false positives should be fixed before I can do so.
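
(For illustration, a regex along these lines would tolerate the variations listed above while only matching inside meta tags; this is a sketch, not code from the PR, and it still doesn't check which meta tag carries the directive.)

// Case-insensitive; allows single or double quotes and spaces around '=',
// and only matches within a <meta ...> tag.
if (preg_match('/<meta[^>]*content\s*=\s*["\']?[^"\'>]*noindex/i', $html)) {
    // page would be treated as noindex
}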

wcmohler (Author) commented:

I think I got this stuff resolved. Tested a couple of use cases and it seemed to work.

vezaynk (Owner) left a review:

You have a very creative approach to these problems.

This codebase doesn't have proper testing in place, so please do test your changes to make sure they don't break anything.

I see that this is your first contribution on GitHub, congrats! I started this project many years ago and don't personally use it anymore. As a result, the codebase is kind of garbage quality. The priority-by-depth idea is definitely going to be included when I re-write the project. It's a really good idea; I'm surprised I never saw it before.

Anyways, there's a couple more changes to make.

@@ -334,8 +334,12 @@ function get_links($html, $parent_url, $regexp)
function scan_url($url)
{
global $scanned, $deferredLinks, $file_stream, $freq, $priority, $enable_priority, $enable_frequency, $max_depth, $depth, $real_site, $indexed, $noindex;
$pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
$priority = $pribydepth[$depth];
if ($depth > 6) {
vezaynk (Owner) commented:

Don't hard-code the length of the array. Use a function to get it dynamically.

if ($depth > 6) {
$priority = .1;
} else {
$pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
vezaynk (Owner) commented:

IMHO this belongs in the configuration file. Different users may want different priorities.

$pribydepth = array ( 1, .8, .64, .5, .32, .25, .1 );
$priority = $pribydepth[$depth];
if ($depth > 6) {
$priority = .1;
vezaynk (Owner) commented:

It should use the last element of the array. Hard-coding a constant is not ideal. Consider that some users may want a constant priority across all pages.
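
(Putting the three suggestions above together might look something like this; a sketch only, where $config['priority_by_depth'] is a hypothetical option loaded from the configuration file.)

// Priority table comes from user configuration rather than being
// hard-coded; out-of-range depths fall back to the table's own
// last element instead of a constant.
$pribydepth = $config['priority_by_depth']; // e.g. array( 1, .8, .64, .5, .32, .25, .1 )
if ($depth >= count($pribydepth)) {
    $priority = end($pribydepth);
} else {
    $priority = $pribydepth[$depth];
}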

@@ -367,7 +371,7 @@ function scan_url($url)
return $depth--;
}

- if ($noindex && preg_match('/content="noindex"/', $html)) {
+ if ($noindex && (preg_match_all('/\<meta.*?\>/mis',$html,$ar) and strstr(join(',',$ar[0]),'noindex'))) {
vezaynk (Owner) commented:

This is a really interesting way of approaching the problem. It has room for improvement, but I'm a fan of it.
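
(For anyone picking this up: one more robust direction, sketched with PHP's built-in DOMDocument rather than a regex; this is not part of the PR. Parsing the meta tags properly sidesteps the quoting and casing pitfalls discussed earlier and checks that the directive sits on a robots tag.)

// Parse the fetched page and look for <meta name="robots" content="...noindex...">.
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from imperfect real-world HTML
$is_noindex = false;
foreach ($doc->getElementsByTagName('meta') as $meta) {
    $name = strtolower($meta->getAttribute('name'));
    $content = strtolower($meta->getAttribute('content'));
    if ($name === 'robots' && strpos($content, 'noindex') !== false) {
        $is_noindex = true;
        break;
    }
}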

wcmohler (Author) commented:

These are all great ideas, and thanks for the "welcome to contributing" note. :) It'll take a little bit of time for me to think about them and get them into place. Job priorities interfere with this fun stuff sometimes.

mylselgan commented:
@wcmohler

It works as expected, but it should still follow the links from noindex pages and add them to the sitemap.xml file.

Example:
Page A has a "noindex" meta tag.
Page A links to Page B and Page C.
Page B and Page C don't have a "noindex" meta tag.

Result: Page A should be omitted, but Pages B and C should be added to the sitemap.xml file.

sidcha commented Jan 9, 2020

@wcmohler @knyzorg what is the status of this PR? Do let me know if any work is needed to close this. I have some bandwidth to spare.

@mylselgan, what you want to achieve is pretty simple: instead of return $depth--;, you set a flag $is_noindex_url, skip the $map_row build/write sequence, and allow the rest of the method to execute. I created a patch to demonstrate this. See https://gist.github.com/cbsiddharth/27902a169a0f72a27d549653d1a3c47b. Beware, I haven't tested this change; let me know if you face any issues.
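
(A rough sketch of that flag approach, based only on the fragments of scan_url() shown in the diffs above, not on the gist itself.)

// Mark noindex pages instead of returning early, so their child
// links are still crawled.
$is_noindex_url = false;
if ($noindex && (preg_match_all('/\<meta.*?\>/mis', $html, $ar)
        and strstr(join(',', $ar[0]), 'noindex'))) {
    logger("This page is set to noindex.", 1);
    $is_noindex_url = true; // omit this page from the sitemap
}
// ...later, where the sitemap entry would normally be written:
if (!$is_noindex_url) {
    fwrite($file_stream, $map_row);
}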

sidcha left a review:

Just a minor change.

@@ -365,6 +371,11 @@ function scan_url($url)
return $depth--;
}

if ($noindex && (preg_match_all('/\<meta.*?\>/mis',$html,$ar) and strstr(join(',',$ar[0]),'noindex'))) {
logger("This page is set to noindex.", 1);
return $depth--;
sidcha commented:

Instead of returning here, can you set a flag $is_noindex_url to skip fwrite($file_stream, $map_row); so that child links can still be processed?

mylselgan commented:
@cbsiddharth your patch works well. This PR now skips "noindex" pages and follows links from "noindex" pages. Thank you all.

vezaynk (Owner) commented Nov 5, 2021:

Hi @sidcha,

Sorry, I've been real busy with life, work, and whatever else lately. GitHub sends notifications to the wrong e-mail for this repository, so I end up missing them. It's been nearly a year, so I doubt you're still looking for an answer, but for anyone who wanders here: the status of the PR is incomplete. It does what it sets out to do but still has unaddressed comments.

I would love for someone to pick up this PR (fork it), finish it up, and send it in for merging. There is also a lot of room for improvement in how noindex pages are detected and processed. I would like to see something on that front before merging it in.
