fscrawler ignores exclusion folder for subdirectories #1974

Open
TonySoderbergRMT opened this issue Nov 20, 2024 · 4 comments · May be fixed by #2012

@TonySoderbergRMT

Describe the bug

I have a folder structure where only files in folders named "publicerat" should be indexed, so I want to exclude the other folders (arbets, original, historik, attachments). These folders appear in multiple locations, including subfolders.
Despite the exclusions, everything inside /arbets/, /original/, /historik/ and /attachments/ is getting indexed.

Job Settings

---
name: "rmt_view_doc"
fs:
  url: "G:\\dokument"
  update_rate: "15m"
  includes:
  - "*.docx"
  - "*.xlsx"
  - "*.pptx"
  - "*.pdf"
  excludes:
  - "*/historik/*"
  - "*/attachments/*"
  - "*/arbets/*"
  - "*/original/*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: false
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
  username: elastic
  password: xxx

Logs

18:27:43,059 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from G:\dokument
18:27:43,105 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 33 local files found
18:27:43,106 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(G:\dokument, G:\dokument\arbets) = \arbets
18:27:43,108 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [true], filename = [\arbets], includes = [[*.docx, *.xlsx, *.pptx, *.pdf]], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,110 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,111 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking exclusion for filename = [\arbets], matches = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,113 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets], includes = [[*.docx, *.xlsx, *.pptx, *.pdf]]
18:27:43,114 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking inclusion for filename = [\arbets], matches = [[*.docx, *.xlsx, *.pptx, *.pdf]]
18:27:43,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,120 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking exclusion for filename = [\arbets], matches = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,123 DEBUG [f.p.e.c.f.FsParserAbstract] [\arbets] can be indexed: [true]
18:27:43,126 DEBUG [f.p.e.c.f.FsParserAbstract]   - folder: arbets
18:27:43,129 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [G:\dokument\arbets] content
18:27:43,130 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from G:\dokument\arbets
18:27:44,147 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1512 local files found
18:27:44,149 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(G:\dokument, G:\dokument\arbets\0001.ppt) = \arbets\0001.ppt
18:27:44,150 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [\arbets\0001.ppt], includes = [[*.docx, *.xlsx, *.pptx, *.pdf]], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:44,151 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets\0001.ppt], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:44,151 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking exclusion for filename = [\arbets\0001.ppt], matches = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]

Expected behavior

fscrawler should not index folders that match an exclusion pattern.

Versions:

  • OS: Windows Server 2022
  • fscrawler: fscrawler-distribution-2.10-20241120.045907-436

TonySoderbergRMT added the check_for_bug (Needs to be reproduced) label Nov 20, 2024
@dadoonet
Owner

Let me check this one.

In the meantime, an "easy" way to exclude dirs is by adding a .fscrawlerignore in each dir you'd like to exclude. In your example: historik/.fscrawlerignore.

But I'm looking into the issue.
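
With excluded folders scattered across many subdirectories, the .fscrawlerignore workaround above could be scripted rather than done by hand. The following is only a minimal sketch, not part of fscrawler: the function name `add_ignore_markers` is hypothetical, the folder names are taken from the job settings in this issue, and it assumes that an empty .fscrawlerignore file is enough to make a directory skipped.

```python
import os
from pathlib import Path

# Folder names from the issue's "excludes" list (assumption: matched case-insensitively).
EXCLUDED_DIR_NAMES = {"historik", "attachments", "arbets", "original"}


def add_ignore_markers(root, excluded=EXCLUDED_DIR_NAMES, marker=".fscrawlerignore"):
    """Create an empty marker file in every directory whose name is in `excluded`.

    Returns the list of marker paths that were touched.
    """
    touched = []
    for dirpath, dirnames, _filenames in os.walk(root):
        # Iterate over a copy so that dirnames can be pruned in place.
        for name in list(dirnames):
            if name.lower() in excluded:
                target = Path(dirpath) / name / marker
                target.touch(exist_ok=True)
                touched.append(target)
                # The whole directory is marked, so there is no need to descend into it.
                dirnames.remove(name)
    return touched


if __name__ == "__main__":
    # Hypothetical usage with the crawl root from the issue.
    for path in add_ignore_markers(r"G:\dokument"):
        print(path)
```

Running it again is harmless, since existing marker files are left in place.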

@dadoonet
Owner

I think it's somewhat related to Windows path names vs Linux path names.
I checked the code and I do have an integration test that checks exclusions...

@dadoonet
Owner

Could you try the same operation with the following path instead?

fs:
  url: "G:/dokument"

dadoonet added the bug (For confirmed bugs) label and removed the check_for_bug (Needs to be reproduced) label Feb 10, 2025
dadoonet self-assigned this Feb 10, 2025
@dadoonet
Owner

I think I found a patch for this. PR is coming ;)

dadoonet added a commit that referenced this issue Feb 10, 2025
There were 2 issues here:

* We are comparing a folder name like `*/foo/*` with a virtual dir name which is something like `/foo` or `/bar/foo`. It's missing the `/` at the end when it's a directory.
* On windows, the exclusion for a dir named `\foo\arbets` does not match the exclusion `*/arbets/*` because of the `/` vs `\` mismatch

This commit fixes this behavior.

Closes #1974.
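
The two problems named in the commit message can be reproduced outside fscrawler. The sketch below is an illustration only and is not the code from the patch: it uses Python's `fnmatch.fnmatchcase` as a stand-in for fscrawler's wildcard matching, and the helper names `to_virtual_path` and `is_excluded` are hypothetical. It shows that a directory name such as `\arbets` matches none of the patterns as-is, and that it does match once backslashes are converted to `/` and a trailing `/` is appended for directories.

```python
from fnmatch import fnmatchcase

# Exclusion patterns from the job settings in this issue.
EXCLUDES = ["*/historik/*", "*/attachments/*", "*/arbets/*", "*/original/*"]


def to_virtual_path(path, is_dir):
    """Normalize a virtual path before matching (hypothetical helper).

    - Converts Windows separators to '/'.
    - Appends a trailing '/' to directories so that '*/foo/*' can match '/foo/'.
    """
    normalized = path.replace("\\", "/")
    if is_dir and not normalized.endswith("/"):
        normalized += "/"
    return normalized


def is_excluded(path, is_dir, excludes=EXCLUDES):
    """Return True if the normalized path matches any exclusion pattern."""
    candidate = to_virtual_path(path, is_dir)
    return any(fnmatchcase(candidate, pattern) for pattern in excludes)


if __name__ == "__main__":
    # Unnormalized comparison, as seen in the debug logs: nothing matches.
    print(any(fnmatchcase("\\arbets", p) for p in EXCLUDES))
    # After normalization, the directory and the files below it are excluded.
    print(is_excluded("\\arbets", is_dir=True))
    print(is_excluded("\\arbets\\0001.ppt", is_dir=False))
```

Note that separator normalization alone is not sufficient for the directory case: `/arbets` still fails against `*/arbets/*` until the trailing `/` is added, which is the first of the two issues listed.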
dadoonet added a commit that referenced this issue Feb 11, 2025