- HashMap<String, String> allMisspelledWords
This hashmap stores the misspelled words from the error dataset; it is used when calculating emission probabilities.
- HashMap<String, ArrayList> misspelledWordWithOneEditDistanceCorrectWord
If the minimum edit distance between a misspelled word in the error dataset and a correct word in the cleaned dataset is 1, the misspelled word and its possible corrections are stored in this hashmap.
- HashMap<String, Integer> soundEventForDeletion
This hashmap stores the counts of the deletion dictionary.
- HashMap<String, Integer> soundEventForInsertion
This hashmap stores the counts of the insertion dictionary.
- HashMap<String, Integer> soundEvenetForSubstitution
This hashmap stores the counts of the substitution dictionary.
- HashMap<String, Integer> unigramCount
This hashmap stores the count of each single word.
- HashMap<String, Integer> bimapCount
This bigram hashmap stores the count of each pair of consecutive words.
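The unigram and bigram count hashmaps described above could be filled roughly as in the following sketch. It is only an illustration: the class name NgramCountSketch, the method names, and the choice of joining the two words of a bigram with a single space as the key are assumptions, not taken from the application.

```java
import java.util.HashMap;
import java.util.Map;

class NgramCountSketch {
    /** Counts how often each single word occurs. */
    static Map<String, Integer> countUnigrams(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Counts adjacent word pairs; the key format "first second" is an assumption. */
    static Map<String, Integer> countBigrams(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            counts.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = "the cat saw the cat".split(" ");
        System.out.println(countUnigrams(tokens).get("the"));    // prints 2
        System.out.println(countBigrams(tokens).get("the cat")); // prints 2
    }
}
```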
- There are 4 classes in this application. The Bigram class represents the bigram language model, and the Unigram class represents the unigram language model. The SpellCorrection class handles the main operations of the homework, such as Viterbi, spelling correction and the language model. The Main class is used to trigger the spelling correction operation.
- The language model is created as the incorrect dataset is cleaned. In other words, the cleaned dataset is not written to another file before the language model is created, which makes the application faster.
- Regex is used extensively to clean the dataset.
- A table/matrix is not used to calculate the minimum edit distance, because the application takes too long to run with a table/matrix. Instead, two words are compared character by character, which makes the application faster and more interactive.
- The misspelled words are compared with the correct words in the cleaned dataset by using the unigram hashmap. The emission probabilities are calculated by using the deletion, insertion and substitution dictionaries.
- Punctuation is removed from the end of a word or sentence by using regex.
- All words are converted to lowercase by using the toLowerCase method in Java.
- The "#" character represents the word boundary in the deletion, insertion and substitution dictionaries. The count of "#" equals the total count of all words in the cleaned dataset.
- A stack is used for the backtrace in the Viterbi algorithm.
- Initial probabilities are calculated by using the sentence boundary.
- Log probabilities are used to prevent the underflow problem.
- The emission probabilities are calculated by using edit distance.
- If an initial or transition probability is zero, an infinity problem occurs in the Viterbi algorithm. In other words, log probabilities are used to prevent underflow in Viterbi, but if the argument of the log is zero, the result is negative infinity. So zero initial and transition probabilities are replaced with Double.MIN_VALUE.
- When a transition probability is zero, it is set to Double.MIN_VALUE.
- Likewise, if a misspelled word has no candidate word, its emission probability is set to Double.MIN_VALUE.
- Double.NEGATIVE_INFINITY is used to initialize the max variable that tracks the maximum probability for the current word in Viterbi.
- Double.NEGATIVE_INFINITY is also used to detect the last word in the sentence. The last word is very important because the backtrace starts from it.
- Sentences that do not contain any misspelled words are not passed to the Viterbi algorithm.
- The output file contains the sentences generated by Viterbi and the evaluation value.
- The running time of this application is approximately 6 seconds for all operations.
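The note about avoiding a table/matrix for the edit distance can be illustrated with a character-by-character check for an edit distance of exactly 1. This is a minimal sketch of one way to do such a comparison, not the application's actual code; the class name EditDistanceOneSketch and the method isOneEditApart are hypothetical.

```java
class EditDistanceOneSketch {
    /** True when a and b differ by exactly one insertion, deletion or substitution. */
    static boolean isOneEditApart(String a, String b) {
        int la = a.length();
        int lb = b.length();
        if (Math.abs(la - lb) > 1) {
            return false;
        }
        if (la > lb) {
            return isOneEditApart(b, a); // keep a as the shorter (or equal) word
        }
        // Skip the common prefix.
        int i = 0;
        while (i < la && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        if (la == lb) {
            if (i == la) {
                return false; // identical words, distance 0
            }
            // Substitution at position i: the remaining characters must match.
            return a.regionMatches(i + 1, b, i + 1, la - i - 1);
        }
        // b has one extra character at position i: the rest of a must follow it.
        return a.regionMatches(i, b, i + 1, la - i);
    }

    public static void main(String[] args) {
        System.out.println(isOneEditApart("cat", "cut"));  // prints true
        System.out.println(isOneEditApart("cat", "cats")); // prints true
        System.out.println(isOneEditApart("cat", "dog"));  // prints false
    }
}
```

Each call touches every character at most a constant number of times, so no distance table is allocated. A swap of two adjacent letters (for example "ab" and "ba") counts as two edits here and is rejected.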
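The Viterbi notes above (log probabilities, replacing zero with Double.MIN_VALUE, initializing the max with Double.NEGATIVE_INFINITY, and a stack for the backtrace) can be put together as in the sketch below. It is an illustration under assumptions, not the SpellCorrection class itself: the names ViterbiSketch, safeLog and decode, the way probabilities are passed in, and the use of ArrayDeque as the stack are all hypothetical, and every position is assumed to have at least one candidate.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.ToDoubleBiFunction;
import java.util.function.ToDoubleFunction;

class ViterbiSketch {
    /** log(p), with a zero probability replaced by Double.MIN_VALUE so the result stays finite. */
    static double safeLog(double p) {
        return Math.log(p == 0.0 ? Double.MIN_VALUE : p);
    }

    /**
     * candidates.get(t) lists the candidate words at position t (assumed non-empty);
     * emission.get(t)[j] is the emission probability of candidate j at position t.
     */
    static List<String> decode(List<List<String>> candidates,
                               List<double[]> emission,
                               ToDoubleFunction<String> initial,
                               ToDoubleBiFunction<String, String> transition) {
        List<String> result = new ArrayList<>();
        int n = candidates.size();
        if (n == 0) {
            return result;
        }
        double[][] score = new double[n][];
        int[][] back = new int[n][];
        for (int t = 0; t < n; t++) {
            List<String> current = candidates.get(t);
            score[t] = new double[current.size()];
            back[t] = new int[current.size()];
            for (int j = 0; j < current.size(); j++) {
                double emit = safeLog(emission.get(t)[j]);
                if (t == 0) {
                    score[t][j] = safeLog(initial.applyAsDouble(current.get(j))) + emit;
                    back[t][j] = -1;
                    continue;
                }
                double max = Double.NEGATIVE_INFINITY; // any finite score beats this
                int arg = -1;
                List<String> previous = candidates.get(t - 1);
                for (int i = 0; i < previous.size(); i++) {
                    double s = score[t - 1][i]
                            + safeLog(transition.applyAsDouble(previous.get(i), current.get(j)));
                    if (s > max) {
                        max = s;
                        arg = i;
                    }
                }
                score[t][j] = max + emit;
                back[t][j] = arg;
            }
        }
        // Pick the best candidate for the last word, then follow the back pointers.
        int best = 0;
        for (int j = 1; j < score[n - 1].length; j++) {
            if (score[n - 1][j] > score[n - 1][best]) {
                best = j;
            }
        }
        Deque<String> stack = new ArrayDeque<>();
        for (int t = n - 1; t >= 0; t--) {
            stack.push(candidates.get(t).get(best));
            best = back[t][best];
        }
        while (!stack.isEmpty()) {
            result.add(stack.pop()); // popping restores left-to-right order
        }
        return result;
    }
}
```

Because safeLog never returns negative infinity, every path keeps a finite score and the comparison against Double.NEGATIVE_INFINITY always selects some predecessor.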