Empty Folder not deleted from folder index on Windows server (works correctly on Linux) #2014

Open
newschapmj1 opened this issue Feb 11, 2025 · 0 comments
Labels
check_for_bug Needs to be reproduced

newschapmj1 commented Feb 11, 2025

Describe the bug
This is the simplest example of the Windows empty-folder problem.

The url contains only one folder, sub1 (no other folders or documents).
Scan 1 adds the folder to the folder index.
Delete the folder sub1 from the url in Windows. Make no other changes to the contents of the url.
Delete _status.json.
Scan 2 does not remove the folder from the folder index.

The same steps work correctly on a Linux machine: the folder is removed from the folder index.

In both cases the data is on a local drive (not a mapped drive, etc.).
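
The file-system side of the steps above can be scripted so that both machines start from the same state. This is only a minimal sketch: the function names are made up for illustration, and the paths for the crawled directory and for _status.json are placeholders to be replaced with the real fs.url and job directory. The scans themselves still have to be run with FSCrawler.

```python
import shutil
import tempfile
from pathlib import Path


def create_layout(root: Path) -> Path:
    """Create the crawled directory containing a single empty folder, sub1."""
    sub1 = root / "sub1"
    sub1.mkdir(parents=True, exist_ok=True)
    return sub1


def reset_for_second_scan(root: Path, status_file: Path) -> None:
    """Delete sub1 and the job's _status.json, leaving the root in place."""
    shutil.rmtree(root / "sub1", ignore_errors=True)
    status_file.unlink(missing_ok=True)


if __name__ == "__main__":
    # Placeholder locations: a temp dir stands in for fs.url and the job dir.
    base = Path(tempfile.mkdtemp())
    root = base / "folderurl14"
    status = base / "_status.json"
    status.write_text("{}")

    create_layout(root)
    # ... run scan 1 with FSCrawler here ...
    reset_for_second_scan(root, status)
    # ... run scan 2 with FSCrawler here ...
    print(sorted(p.name for p in root.iterdir()))
```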

Job Settings
The only difference between the Windows and Linux settings is the url.

Windows

name: "metadata_index_url14"
fs:
  url: "C://data3/folderurl14"
#url: "T:\\folderurl13"

  update_rate: "600m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  indexed_chars: "1"
  attributes_support: true
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "ocr_and_text"
  follow_symlinks: true
elasticsearch:
  nodes:
  - url: "https://localhost:9200"
  username: "elastic"
  password: "x"
  index: "metadata_index_url14"
  index_folder: "metadata_index_url14_folder"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "1mb"
  ssl_verification: false
#  pipeline: "metadata_index_pipeline"

Logs

The Linux fscrawler.log has this section:

14:51:22,541 TRACE [f.p.e.c.f.c.ElasticsearchClient] Calling POST https://localhost:9200/metadata_index_url14_folder/_search with params [version=true]
14:51:22,572 TRACE [f.p.e.c.f.c.ElasticsearchClient] POST https://localhost:9200/metadata_index_url14_folder/_search gives {"took":6,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
14:51:22,574 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting metadata_index_url14_folder/694ca77db71937d803f96050584060

There is no such section in the Windows fscrawler.log:

21:58:59,006 DEBUG [f.p.e.c.p.FsCrawlerPluginsManager] Loading plugins
21:58:59,069 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1.9gb/2gb=97.51%], RAM [1.9gb/11.9gb=16.55%], Swap [1.7gb/13.7gb=12.73%].
21:58:59,069 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
21:58:59,069 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
21:58:59,069 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [metadata_index_url14]...
21:58:59,475 WARN  [f.p.e.c.f.s.Elasticsearch] username is deprecated. Use apiKey instead.
21:58:59,475 WARN  [f.p.e.c.f.s.Elasticsearch] password is deprecated. Use apiKey instead.
21:58:59,475 INFO  [f.p.e.c.f.c.FsCrawlerCli] attributes_support is set to true but getting group is not available on [windows server 2019].
21:58:59,475 DEBUG [f.p.e.c.p.FsCrawlerPluginsManager] Starting plugins
21:58:59,522 DEBUG [f.p.e.c.p.FsCrawlerPluginsManager] Found FsCrawlerExtensionFsProvider extension for type [http]
21:58:59,522 DEBUG [f.p.e.c.p.FsCrawlerPluginsManager] Found FsCrawlerExtensionFsProvider extension for type [local]
21:58:59,522 DEBUG [f.p.e.c.p.FsCrawlerPluginsManager] Found FsCrawlerExtensionFsProvider extension for type [s3]
21:58:59,537 INFO  [f.p.e.c.f.FsCrawlerImpl] attributes_support is set to true but getting group is not available on [windows server 2019].
21:58:59,537 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [metadata_index_url14] for [C://data3/folderurl14] every [10h]
21:58:59,537 DEBUG [f.p.e.c.f.FsParserAbstract] We are running on Windows without Server settings so we use the separator in accordance with fs.url
21:58:59,537 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
21:58:59,537 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
21:58:59,803 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
21:58:59,866 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
21:59:01,444 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 8.17.2 and 8 as the major version number
21:59:01,444 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.17.2
21:59:01,444 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Semantic search is enabled and we are running on a version of Elasticsearch 8.17.2 which is 8.17 or higher. We will try to use the semantic search features.
21:59:01,444 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get license
21:59:01,459 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get license returns basic
21:59:01,459 WARN  [f.p.e.c.f.c.ElasticsearchClient] Semantic search is enabled but we are running Elasticsearch with a basic license although we need either an enterprise or trial license.We will not be able to use the semantic search features ATM. We might switch later to a vector embeddings generation.
21:59:01,475 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service started
21:59:01,475 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
21:59:01,475 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
21:59:01,631 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 8.17.2 and 8 as the major version number
21:59:01,631 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.17.2
21:59:01,631 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Semantic search is enabled and we are running on a version of Elasticsearch 8.17.2 which is 8.17 or higher. We will try to use the semantic search features.
21:59:01,631 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get license
21:59:01,647 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get license returns basic
21:59:01,647 WARN  [f.p.e.c.f.c.ElasticsearchClient] Semantic search is enabled but we are running Elasticsearch with a basic license although we need either an enterprise or trial license.We will not be able to use the semantic search features ATM. We might switch later to a vector embeddings generation.
21:59:01,647 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service started
21:59:01,647 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Creating/updating component templates
21:59:01,647 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_alias]
21:59:01,678 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_settings_shards]
21:59:01,678 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_settings_total_fields]
21:59:01,692 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_mapping_attributes]
21:59:01,700 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_mapping_file]
21:59:01,709 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_mapping_path]
21:59:01,709 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_mapping_attachment]
21:59:01,725 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_mapping_content]
21:59:01,725 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push component template [fscrawler_mapping_meta]
21:59:01,741 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Creating/updating index templates
21:59:01,741 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push index template [fscrawler_docs_metadata_index_url14]
21:59:01,741 DEBUG [f.p.e.c.f.c.ElasticsearchClient] push index template [fscrawler_folders_metadata_index_url14_folder]
21:59:01,757 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [metadata_index_url14] for [C://data3/folderurl14] every [10h]
21:59:01,757 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [metadata_index_url14] is now running. Run #1...
21:59:01,757 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C://data3/folderurl14, C://data3/folderurl14) = /
21:59:01,819 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C://data3/folderurl14] content
21:59:01,819 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C://data3/folderurl14
21:59:01,819 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 0 local files found
21:59:01,819 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C://data3/folderurl14]...
21:59:01,851 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [C://data3/folderurl14]...
21:59:01,882 DEBUG [f.p.e.c.f.FsParserAbstract] Updating job metadata after run for [metadata_index_url14]: lastrun [2025-02-11T21:58:59.757186500], indexed [0], deleted [0]
21:59:01,882 INFO  [f.p.e.c.f.FsParserAbstract] Closing FS crawler file abstractor [FileAbstractorFile].
21:59:01,882 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 10h
21:59:06,632 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 1 actions
21:59:06,647 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [1] documents to the Elasticsearch service
21:59:06,647 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 393 characters
21:59:06,694 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 1 actions
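
To compare the two logs without reading them line by line, the "Deleting ..." debug lines can be extracted with a short script. This is a sketch based only on the log format shown in this issue; the function name is made up, and the pattern may need adjusting for other log layouts.

```python
import re

# Colour codes such as ESC[36m ... ESC[m appear in the Linux log. The ESC
# byte sometimes shows up as U+FFFD after copy/paste, so both are accepted.
ANSI = re.compile(r"[\x1b\ufffd]\[[0-9;]*m")

# Matches lines like:
# 14:51:22,574 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting <index>/<id>
DELETING = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3}\s+DEBUG\s+"
    r"\[[^\]]*FsParserAbstract\]\s+Deleting\s+(\S+)"
)


def deletion_targets(lines):
    """Return the '<index>/<id>' part of every 'Deleting ...' debug line."""
    targets = []
    for line in lines:
        match = DELETING.match(ANSI.sub("", line).strip())
        if match:
            targets.append(match.group(1))
    return targets
```

For example, `deletion_targets(open("win_fscrawler.log", errors="replace"))` should return an empty list for the Windows log above, while the Linux log should yield the folder document that was deleted.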

Expected behavior

Indexes: metadata_index_url14 (documents) and metadata_index_url14_folder (folders).
The url contains a single folder (sub1) and no documents.

1 Scan 1. Outcome: folder sub1 is added to the folder index.

2 Delete folder sub1 (no other changes to the contents of the url).
Delete _status.json.

3 Scan 2. Outcome: folder sub1 is removed from the folder index.
sub1 is removed on Linux but not on Windows.

Windows

Scan 1: metadata_index_url14_folder count 2, metadata_index_url14 count 0
Delete folder, delete _status.json
Scan 2: metadata_index_url14_folder count 2, metadata_index_url14 count 0

Linux

Scan 1: metadata_index_url14_folder count 2, metadata_index_url14 count 0
Delete folder, delete _status.json
Scan 2: metadata_index_url14_folder count 1, metadata_index_url14 count 0
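
The counts above can be read with the Elasticsearch `_count` API after each scan. The following is a rough sketch using only the Python standard library; the function names are made up for illustration, the node URL and credentials are taken from the job settings in this issue, and certificate verification is switched off only because the job itself sets ssl_verification: false.

```python
import base64
import json
import ssl
import urllib.request


def count_request(node_url, index, username, password):
    """Build a GET request for the _count API of one index."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{node_url.rstrip('/')}/{index}/_count",
        headers={"Authorization": f"Basic {token}"},
    )


def parse_count(body):
    """Extract the document count from a _count response body."""
    return json.loads(body)["count"]


def fetch_count(node_url, index, username, password):
    """Query a live node. Certificate checks are disabled, as in the job."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = count_request(node_url, index, username, password)
    with urllib.request.urlopen(request, context=context) as response:
        return parse_count(response.read())
```

Calling `fetch_count("https://localhost:9200", "metadata_index_url14_folder", "elastic", "x")` after scan 1 and again after scan 2 gives the two numbers to compare on each machine.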

lin_fscrawler.log

win_fscrawler.log

win_settings.txt

lin_settings.txt

Versions:
Elastic 8.17
FSCrawler v462

  • OS: Windows Server, Ubuntu
  • Version: 2019 (Windows Server), 24.04 (Ubuntu)


newschapmj1 added the check_for_bug (Needs to be reproduced) label on Feb 11, 2025