jbroll/fts

fts is a command line interface exposing the full text search capabilities of sqlite3.

It indexes a directory tree of documentation and can extract text by executing
external filter programs.  Open source filters for extracting text from .doc,
.docx, .xls, .xlsx, .ppt and .pdf are included in the distribution.



	fts check 			- check that all the documents in the index
					  still exist.  Remove any that do not exist.

	fts index [<files>]	 	- create or update the search index
	    
	    If no additional arguments are given, index a set of directories
	    indicated in the configuration file by the index-path directives.
	    
	    Or index the files (and directories) given on the command line.
	    The files given must be included within the paths covered by
	    index-path directives in the configuration file.

	    The full text index includes three columns of text: a title, a
	    description and the body of text extracted from the file itself.
	    By default the title is the name of the file with "+" and "_"
	    replaced with space and the description is empty.  Alternate values
	    for these columns can be provided by calling a group proc declared
	    in the index-path directive of the config file.
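
	    The default title rule described above can be sketched in Tcl.  This
	    is a minimal illustration, not the code fts uses: the proc name
	    defaultTitle is hypothetical, and it assumes that "the name of the
	    file" means the file tail with its extension kept.

```tcl
# Hypothetical helper, not part of fts: derive a default title from a path.
proc defaultTitle {path} {
    # Assumption: only the last path component is used, extension included.
    set name [file tail $path]
    # Replace every "+" and "_" with a space.
    return [string map {+ " " _ " "} $name]
}

puts [defaultTitle /docs/manuals/user_guide+v2.pdf]
```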

	fts excludes 			- display the exclude patterns from <conf>
	fts filters  			- display the filter  patterns from <conf>
	fts list 			- display a table of documents in the index.
	fts search [-t tmpl]  <query>	- search the index for query

	    The search command produces a table of search results...

	    The option -t selects an optional template.  Two templates
	    are included in the source, text and html.  The default is text.

	fts rm docid <docids ..>	- remove documents by docid.
	fts rm file  <files  ..>	- remove documents by file path.

	Finding the config file:

	  The full path to the configuration file may be specified on the
	  command line as the first argument, prefixed with the "@" symbol.  If
	  this is not specified the name of the executable will be used as the
	  name of the config file by suffixing ".conf" to it.
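
	  The lookup rule above might be expressed roughly as follows.  The
	  proc name findConfig and its argument names are hypothetical, and
	  the sketch makes no claim about how fts itself resolves the
	  executable name.

```tcl
# Hypothetical sketch of the config file lookup; not taken from fts.
proc findConfig {exe args} {
    set first [lindex $args 0]
    # A first argument prefixed with "@" names the config file explicitly.
    if {[string index $first 0] eq "@"} {
        return [string range $first 1 end]
    }
    # Otherwise suffix ".conf" to the name of the executable.
    return $exe.conf
}

puts [findConfig /usr/local/bin/fts @/etc/docs.conf search tcl]
puts [findConfig /usr/local/bin/fts search tcl]
```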

	Config file commands:

	  set tmp   <temporary-directory>

	  set wTitle  <weight of title match>	# Weight is a positive real number
	  set wDescip <weight of description match>
	  set wBody   <weight of body match>

	  database  <sqlite3-database-file>
	  stopwords <stop-words-file>

	  filter <pattern> <extraction-command>

	    Any indexing file candidate that matches the glob style pattern will have
	    text extracted from it by executing the extraction command.  The "%f" and 
	    "@F" tokens in the extraction command string will be replaced with the file
	    name matching the pattern.  The extraction command will be executed and its
	    standard output used as the text to index.

	    If the replacement token "@F" is found in the extraction command string and
	    the pattern is of the form "*.xxx" the rule is chained.  The matched extension
	    will be removed from the file name and the result will be matched against the
	    list of extraction filters again.
	   
	    File extension patterns of the form "*.xxx" are not case sensitive.
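
	    As an illustration, filter directives in a config file could look
	    like the fragment below.  The extraction commands are assumptions
	    chosen for the example, not the filters shipped with fts: pdftotext
	    is assumed to be installed (its "-" argument writes the text to
	    standard output), and the "*.gz" rule is a hypothetical chained
	    rule.  Bracing the command so that it is passed as one argument is
	    also an assumption.

```tcl
# Hypothetical config fragment; the commands are not the bundled filters.

# Extract text from PDF files. %f is replaced with the matching file name.
filter *.pdf {pdftotext %f -}

# Chained rule: @F marks the rule as chained, so for "guide.pdf.gz" the
# ".gz" extension is removed and "guide.pdf" is matched against the
# filter list again.
filter *.gz  {gzip -dc @F}
```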

	  exclude <glob-pattern> ...

	  index-path <tag> <path> [url] [regexp]
      
	    Index all files in the path, recursing to subdirectories.
	    A url entry for the database is generated by calling:

		set url [regsub $regexp $filepath $url]
		
	    The default values for url and regexp are {\1} and {%p(.*)}, where
	    %p is substituted with the indexed path.  This generates the 
	    file tail as the default url entry in the search results.
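
	    The substitution above can be tried in a plain Tcl shell.  In this
	    sketch the proc name makeUrl, the directory /home/docs and the host
	    docs.example.com are made up for illustration; only the regsub call
	    and the default values come from the description above.

```tcl
# Hypothetical wrapper around the regsub call shown above.
proc makeUrl {path filepath {url {\1}} {regexp {%p(.*)}}} {
    # Substitute %p in the regexp with the indexed path.
    set regexp [string map [list %p $path] $regexp]
    return [regsub $regexp $filepath $url]
}

# Default values: the url is the part of the file path below the indexed path.
puts [makeUrl /home/docs /home/docs/manuals/guide.pdf]

# A custom url value: prefix the same remainder with a (made up) web root.
puts [makeUrl /home/docs /home/docs/manuals/guide.pdf {http://docs.example.com\1}]
```

	    Since %p is inserted into a regular expression, an indexed path
	    containing regexp metacharacters would presumably need care.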

	    The tag can be used to associate different directory trees of documents with
	    a proc to provide title and description text to index.  If a proc with the
	    same name as the tag is found it will be called with two args to retrieve this
	    text.

	        $tag title   <file>
	        $tag descrip <file>

	    The result of this call will be indexed in the associated column.  The value of 
	    the title column is available for use in the search results template.
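
	    A tag proc might look like the sketch below.  The tag "manuals",
	    the directory and the returned strings are hypothetical; the only
	    parts taken from the description above are the proc being named
	    after the tag and being called with "title" or "descrip" plus the
	    file name.  Defining the proc in the config file is an assumption.

```tcl
# Hypothetical tag proc for an index-path declared with the tag "manuals",
# for example:   index-path manuals /home/docs/manuals
proc manuals {what file} {
    switch -- $what {
        title   {
            # Use the file name without directory or extension as the title.
            return [file rootname [file tail $file]]
        }
        descrip {
            # Describe the document by the directory it was found in.
            return "Manual from [file dirname $file]"
        }
    }
    return ""
}

puts [manuals title   /home/docs/manuals/guide.pdf]
puts [manuals descrip /home/docs/manuals/guide.pdf]
```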

	  template name { header rows footer }

	    Declare a template whose name may be used with the -t option to
	    search.  The template is a list of three strings that will be
	    expanded with subst to produce the results of the search.  The
	    first string is expanded before the search; it represents the
	    header of the result.  When the header string is expanded with
	    subst, the value $query is available.

	    The second string is expanded once for each row. The values
	    $rowid, $tag, $mtime, $fsize, $url, $file, and $title associated
	    with the search result document are available when the string is
	    expanded.

	    The third string is expanded after the search results have been
	    generated and represents the footer of the search results.

	    If the result of any individual template expansion is an empty
	    string the result is ignored.  If the search results need to be
	    returned in an order different from the search ranking, the parts
	    of a template can be used as callbacks: search results are
	    accumulated in calls to the row template, then transformed and
	    returned in the footer.
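
	    A small template declaration could look like the fragment below.
	    The template name "brief" and its layout are made up for this
	    example, and writing the three strings as braced list elements is
	    an assumption about the config file syntax.

```tcl
# Hypothetical template: one header line, tab separated rows, no footer.
template brief {
    {Results for "$query":}
    {$rowid\t$title\t$url}
    {}
}
```

	    With such a declaration in the config file the template would be
	    selected with "fts search -t brief <query>".  The empty footer
	    string expands to an empty string and is therefore ignored.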
    

Full text search w/sqlite fts-v4