fscrawler jobs with Elastics pipeline setting errors Can't find stored field name to check existing filenames in path [/]. Please set store: true on field [file.filename] #1238 #1240
-
The error message shows that the mapping for folders and/or documents is wrong. Did you manually create the indices? Could you remove both indices and start FSCrawler again?
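For reference, a minimal sketch of that cleanup. It assumes the job is named `qp` (as in the settings posted in this thread), that FSCrawler uses the `<job>` and `<job>_folder` index names seen in this discussion, and a placeholder Elasticsearch URL. The helper name is made up, and the sketch only builds the DELETE requests; sending them (with authentication) is left commented out.

```python
import urllib.request


def build_delete_requests(base_url: str, job_name: str) -> list:
    """Build DELETE requests for the documents index and the folders index."""
    indices = [job_name, f"{job_name}_folder"]
    return [
        urllib.request.Request(f"{base_url.rstrip('/')}/{index}", method="DELETE")
        for index in indices
    ]


# Placeholder URL taken from the settings in this thread; adjust to your cluster.
delete_requests = build_delete_requests("https://servername:9200", "qp")
for req in delete_requests:
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually delete (auth not shown)
```

After both indices are gone, restarting FSCrawler should let it recreate them with its own mapping.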
-
So I tried some tests locally. TL;DR: I cannot reproduce the behavior you saw. Here is my setup. I started Elasticsearch and Kibana locally with docker-compose. Then I ran from dev tools:
Then I downloaded fscrawler-2.7 from Maven Central, unzipped it and ran:
bin/fscrawler --config_dir ./config pipeline
I edited the job settings:
---
name: "pipeline"
fs:
  url: "/tmp/es"
elasticsearch:
  username: "elastic"
  password: "changeme"
  pipeline: "fscrawler"
Then I ran fscrawler again:
bin/fscrawler --config_dir ./config pipeline
Which fails with:
Which is expected as I don't have this. But the goal for me was to check the mapping generated when a pipeline is defined. So
{
"pipeline" : {
"mappings" : {
"dynamic_templates" : [
{
"raw_as_text" : {
"path_match" : "meta.raw.*",
"mapping" : {
"fields" : {
"keyword" : {
"ignore_above" : 256,
"type" : "keyword"
}
},
"type" : "text"
}
}
}
],
"properties" : {
"attachment" : {
"type" : "binary"
},
"attributes" : {
"properties" : {
"group" : {
"type" : "keyword"
},
"owner" : {
"type" : "keyword"
}
}
},
"content" : {
"type" : "text"
},
"file" : {
"properties" : {
"checksum" : {
"type" : "keyword"
},
"content_type" : {
"type" : "keyword"
},
"created" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"extension" : {
"type" : "keyword"
},
"filename" : {
"type" : "keyword",
"store" : true
},
"filesize" : {
"type" : "long"
},
"indexed_chars" : {
"type" : "long"
},
"indexing_date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"last_accessed" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"last_modified" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"url" : {
"type" : "keyword",
"index" : false
}
}
},
"meta" : {
"properties" : {
"altitude" : {
"type" : "text"
},
"author" : {
"type" : "text"
},
"comments" : {
"type" : "text"
},
"contributor" : {
"type" : "text"
},
"coverage" : {
"type" : "text"
},
"created" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"creator_tool" : {
"type" : "keyword"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"description" : {
"type" : "text"
},
"format" : {
"type" : "text"
},
"identifier" : {
"type" : "text"
},
"keywords" : {
"type" : "text"
},
"language" : {
"type" : "keyword"
},
"latitude" : {
"type" : "text"
},
"longitude" : {
"type" : "text"
},
"metadata_date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"modifier" : {
"type" : "text"
},
"print_date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"publisher" : {
"type" : "text"
},
"rating" : {
"type" : "byte"
},
"relation" : {
"type" : "text"
},
"rights" : {
"type" : "text"
},
"source" : {
"type" : "text"
},
"title" : {
"type" : "text"
},
"type" : {
"type" : "text"
}
}
},
"path" : {
"properties" : {
"real" : {
"type" : "keyword",
"fields" : {
"fulltext" : {
"type" : "text"
},
"tree" : {
"type" : "text",
"analyzer" : "fscrawler_path",
"fielddata" : true
}
}
},
"root" : {
"type" : "keyword"
},
"virtual" : {
"type" : "keyword",
"fields" : {
"fulltext" : {
"type" : "text"
},
"tree" : {
"type" : "text",
"analyzer" : "fscrawler_path",
"fielddata" : true
}
}
}
}
}
}
}
}
}
So
{
"pipeline_folder" : {
"mappings" : {
"properties" : {
"file" : {
"properties" : {
"content_type" : {
"type" : "keyword"
},
"filename" : {
"type" : "keyword",
"store" : true
}
}
},
"path" : {
"properties" : {
"real" : {
"type" : "keyword",
"fields" : {
"fulltext" : {
"type" : "text"
},
"tree" : {
"type" : "text",
"analyzer" : "fscrawler_path",
"fielddata" : true
}
}
},
"root" : {
"type" : "keyword"
},
"virtual" : {
"type" : "keyword",
"fields" : {
"fulltext" : {
"type" : "text"
},
"tree" : {
"type" : "text",
"analyzer" : "fscrawler_path",
"fielddata" : true
}
}
}
}
}
}
}
}
}
Which is all expected. Could you try to reproduce the steps I did and tell me whether you still get a different mapping from the one I have shown here?
-
When we run the fscrawler jobs with an Elasticsearch pipeline added to the _settings.yaml file, we get errors like the ones below.
16:25:25,859 WARN [f.p.e.c.f.FsParserAbstract] Can't find stored field name to check existing filenames in path [/mnt/folder1/document/files/qp]. Please set store: true on field [file.filename]
16:25:25,859 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling /mnt/folder1/document/files/qp: Mapping is incorrect: please set stored: true on field [file.filename].
We need the pipeline to replace part of the file.url field's content, so that the search results display the file.url content differently. When we add the pipeline setting to the job's settings file, the index is created, but sometimes the document count doesn't match the contents of the folder. Crawling smaller content finishes, but when crawling larger content the crawl errors out, stops, and no longer indexes documents into the Elasticsearch index. We also noticed that even when the crawl completes, adding new documents to the crawl location or removing documents doesn't update the index.
FYI: The folder locations (the paths where the content is being crawled) are mounted shares from a Windows operating system. I'm not sure if that's causing an issue. The fscrawler job is running on a RHEL 8 server.
Here's the Elasticsearch pipeline script:
PUT _ingest/pipeline/fscrawler
{
"processors":
[
{
"script": {
"source": "ctx.file.url = ctx.file.url.replace('file:///mnt/folder1/document/','http://testsite/document/')",
"if": "ctx?.file?.url != null"
}
}
]
}
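To sanity-check what this processor is meant to do, here is a rough Python equivalent of the Painless `replace` call. It is only an emulation for illustration, not how Elasticsearch runs the script; the function name and the sample document are made up.

```python
def rewrite_file_url(doc: dict) -> dict:
    """Mimic the pipeline: rewrite file.url only when the field is present."""
    file_part = doc.get("file") or {}
    url = file_part.get("url")
    if url is not None:  # same guard as the processor's "if" condition
        file_part["url"] = url.replace(
            "file:///mnt/folder1/document/", "http://testsite/document/"
        )
    return doc


# Hypothetical document, shaped like what FSCrawler sends for a file.
sample = {"file": {"url": "file:///mnt/folder1/document/files/qp/origqp.pdf"}}
print(rewrite_file_url(sample)["file"]["url"])
# prints: http://testsite/document/files/qp/origqp.pdf
```

Documents without a file.url field pass through unchanged, which is what the "if" condition in the processor is for.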
We added the pipeline setting to the job settings:
elasticsearch:
  pipeline: "fscrawler"
Here's the index field mapping we see on the index:
{
"qp" : {
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"file" : {
"properties" : {
"content_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"created" : {
"type" : "date"
},
"extension" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"filename" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"filesize" : {
"type" : "long"
},
"indexing_date" : {
"type" : "date"
},
"last_accessed" : {
"type" : "date"
},
"last_modified" : {
"type" : "date"
},
"url" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"meta" : {
"properties" : {
"author" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"created" : {
"type" : "date"
},
"creator_tool" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"date" : {
"type" : "date"
},
"format" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"language" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"path" : {
"properties" : {
"real" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"root" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"virtual" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
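A quick way to compare this mapping with what the warning asks for is to check whether file.filename has store: true. A minimal sketch, assuming the mapping JSON has the shape shown above (the helper name is made up, and the dict is a trimmed copy of the mapping):

```python
def filename_is_stored(mapping_response: dict, index: str) -> bool:
    """Return True when file.filename is mapped with store: true."""
    props = mapping_response[index]["mappings"].get("properties", {})
    filename = props.get("file", {}).get("properties", {}).get("filename", {})
    return filename.get("store") is True


# Trimmed copy of the "qp" mapping above: only file.filename is kept.
qp_mapping = {"qp": {"mappings": {"properties": {"file": {"properties": {
    "filename": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }
}}}}}}

print(filename_is_stored(qp_mapping, "qp"))  # prints: False
```

The text-plus-keyword shape of every string field here looks like Elasticsearch's default dynamic mapping, so store is never set on file.filename, which matches the warning.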
Here's the _settings.yaml file:
name: "qp"
fs:
  url: "/mnt/folder1/document/files/qp"
  update_rate: "1m"
  excludes:
  - "/~"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  username: "elastic"
  password: "password"
  pipeline: "fscrawler"
  ssl_verification: false
  nodes:
  - url: "https://servername:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
105 origqp.pdf