Skip to content

This project implements a simplified search engine using C++ programming and data structures like maps and tries. It is designed to efficiently store and retrieve text documents based on keyword queries.

Notifications You must be signed in to change notification settings

bhavesh1409/Simplified-Search-Engine-Cpp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 

Repository files navigation

Search-Engine-C++

Welcome to this Search-Engine! This project implements a simplified search engine using C++ programming and data structures like maps and tries. It is designed to efficiently store and retrieve text documents based on keyword queries.

Following contents to be followed

Introduction

Inputs

Data structures

Relevance score

Application interface

Project Overview:

This project is focused on creating a mini search engine. It utilizes data structures and methodologies that are a simplified version of what major search engines like Google or Bing use.

Inverted Index:

In the context of this project, an inverted index is a fundamental data structure. It serves as a key component for efficiently storing and retrieving text documents.

Keyword Queries:

The search engine is designed to handle keyword queries. A keyword query is a user's request for information expressed as a set of keywords or search terms. For example, a user might enter the query "brown fox" to find documents related to brown foxes.

Top-K Results:

The search engine's primary task is to find and return the top-K documents that are most relevant to a given query. The term top-K means that it doesn't return all matching documents but a specified number (K) of the most relevant ones. This concept is crucial for presenting users with a manageable list of results, making it easier to find what they're looking for.

Efficiency:

Efficiency is a key consideration in this project. Search engines must handle vast amounts of data and respond quickly to user queries. To achieve this, the project employs strategies and data structures that ensure fast and efficient searching.

The docfile (or any other file name) is a file that includes all the documents that the application saves and processes.
Each line of this file is a document with words. For example, if the contents of the file are:

0 The quick brown fox leaped over the lazy lazy dog
1 Quick brown foxes leaped over lazy dogs for fun

It means that there are two documents, one for each line. The first word in each line is the ID of the specific document.

There are some demo datasets provided (in the datasets folder), as well as some queries (in the queries folder) for them and their corresponding results (in the results folder).

The document IDs must be given in order (starting from 0). An appropriate check is performed and in the case that an error is found the application is terminated.

  • A dynamically allocated array that stores all the text documents. It is used as a map, mapping <doc_IDs> to documents' text (standard array indexing is used for document identification).

    docID text
    0 The quick brown fox leaped over the lazy lazy dog
    1 Quick brown foxes leaped over lazy dogs for fun
    ... ...
  • The inverted index is essentially a matching of every word in the documents collection to these documents. For example, for the two texts above, a part of the inverted index is represented as follows:

    word [ <doc_ID>, <word_freq>]
    brown [0, 1] [1, 1]
    lazy [0, 2] [1, 1]
    for [1, 1]
    fun [1, 1]
    ... ...

    On the left side there is the key (every word), while on the right side there is a list of document IDs, as well as how many times each word appears in each document. This list is called a postings list. For example, the word "lazy" appears twice in document 0 ( thus [ 0, 2 ] ) and once in document 1.

  • A Trie is used in order to efficiently find the postings lists and thus achieve fast searches in the index. It stores the documents' words. In its leaves each posting list is included. For example, in the following figure, a part of the Trie and postings lists is depicted.

Trie

Given a query (e.g. brown fun), the answer is calculated as follows:
For each word, the posting list from Trie is found. In the particular example [0,1] [1,1] and [1,1] respectively.
For each document ID D that appears in either one list or the other, the relevance of this particular document to the query (i.e. its relevance score) is calculated, using the BM25 type:

BM25 type

  • qi ​: a word part of the query.

  • f(qi, D)​ : the qi's term frequency in the document.

  • ​|D| ​: the number of document's words.

  • avgdl​ : the average number of words from all the documents in the index.

  • k1, b : the parameters that allow us to optimize the search engine (default values: k1 = 1.2, b = 0.75).

  • IDF(qi) : the inverse document frequency (it represents in how many texts the word qi exists) and is calculated as follows:

    IDF type

    • N : the number of documents in the index.
    • n(qi) : the number of documents that contain the word qi.

After calculating the score for each document the K higher scores are found and the corresponding documents are presented to the user.

A makefile is provided. You can use the command $ make for easy compilation and $ make clean to delete all *.o auto generated files.

To run the application use the command:
$ ./minisearch -i <docfile> -k <K>

Command line arguments

  • -k : indicates how many results the application returns for each query.
    e.g. For K=5 the application will return 5 results (at most) for each query. If a query has less than, K will only return these.
  • -i : the input file name (docfile by default).

Application commands

When the application finishes creating the index, it will wait for the user's input. The user will be able to give the following commands:

  • /search q1 q2 q3 q4 ... q10: The user searches for documents that contain one or more of the qi words after the command keyword. These words consist the query. For example, for the command /search brown lazy both of the above demo documents are returned. The command expects at least one word and at most 10. If more than 10 words are given, only the first 10 are used. Once the command /search is given, the application finds the documents that contain the words, calculates a score based on how relevant each document is with the query, sorts these documents based on this relevance score and prints the top-K.

    The user sees on his screen the serial number of the result, the ID of the document, the relevance score, the document, as well as where the query words are located in the document. More specifically for the query /search brown lazy for the two documents above, the user will see the following:

    /search brown lazy
    1.( 1)[0.0341] Quick brown foxes leaped over lazy dogs for fun
                         ^^^^^                   ^^^^
    2.( 0)[0.0320] The quick brown fox leaped over the lazy lazy dog
                             ^^^^^                     ^^^^ ^^^^
    
  • /df: The document frequency vector is printed, i.e. each word from the input file along with the number of documents in which it appears. In the example of the two documents above, the following will appear:

    /df
    the 1
    quick 1
    over 2
    leaped 2
    lazy 2
    fun 1
    foxes 1
    fox 1
    for 1
    dogs 1
    dog 1
    brown 2
    The 1
    Quick 1
    

    If there is a word after the command, then the document frequency is printed only for the particular word. For example:

    /df leaped
    leaped 2
    
  • /tf id q_word: Prints the term frequency for the particular word in the particular document, i.e. how many times the word "q_word" appears in the document with the given ID. For example:

    /tf 0 lazy
    0 lazy 2
    
  • /exit: Τhe application terminates.

About

This project implements a simplified search engine using C++ programming and data structures like maps and tries. It is designed to efficiently store and retrieve text documents based on keyword queries.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published