diff --git a/papers/jyotika_singh/banner.png b/papers/jyotika_singh/banner.png deleted file mode 100644 index c5dd028e26..0000000000 Binary files a/papers/jyotika_singh/banner.png and /dev/null differ diff --git a/papers/jyotika_singh/main.log b/papers/jyotika_singh/main.log deleted file mode 100644 index c5945021ab..0000000000 --- a/papers/jyotika_singh/main.log +++ /dev/null @@ -1,6 +0,0 @@ -This is pdfTeX, Version 3.141592653-2.6-1.40.22 (TeX Live 2021) (preloaded format=pdflatex 2021.12.27) 30 MAY 2024 18:00 -entering extended mode - restricted \write18 enabled. - %&-line parsing enabled. -**main.tex -(./main.tex \ No newline at end of file diff --git a/papers/jyotika_singh/main.tex b/papers/jyotika_singh/main.tex deleted file mode 100644 index 8d4598d953..0000000000 --- a/papers/jyotika_singh/main.tex +++ /dev/null @@ -1,280 +0,0 @@ -\begin{abstract} - For many natural language processing (NLP) tasks today, several AI tool options exist. Which tool one chooses for a task depends on many factors, such as previous success of different tools on similar types of data, latency requirements, and a fair amount of experimentation and research to actually test the different options. Sometimes, for an application, one may need to perform a particular NLP task but still require the use of different tools. For example, named entity recognition, where you need to extract multiple entities of different types, may be accomplished using numerous methods or tools, rather than one tool being used for the extraction of all entities. This paper walks through decision making on open-source tools for popular NLP tasks like named entity recognition, sentiment analysis, text similarity, and more. The paper also introduces nlprw\_toolkit. The nlprw\_toolkit \footnote{\url{https://github.com/jsingh811/NLP-in-the-real-world/tree/toolkit/nlprw_toolkit}} offers a solution to this issue by simplifying the process of selecting the most suitable tool or set of tools for a given task. It achieves this by integrating the various tools necessary for completing complex tasks when a single tool doesn't suffice. Moreover, it assists in selecting the appropriate tool by considering factors like the language style of the data and the desired outcome. The toolkit also helps you run multiple methods for a task so you can compare their outcomes and make an informed decision in your tool choice. - -\end{abstract} - -\section{Introduction}\label{introduction} - -NLP, or Natural Language Processing \citep{Jones1994}, is a field of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. Natural language includes any language humans use to communicate with one another, including audio/speech \citep{audioIntro} and text. More commonly, NLP refers to working with text data. Audio data processing and analysis have a bigger overlap with the domain of digital signal processing and involve different processing techniques \citep{audio2022} compared to text. - -NLP encompasses a wide range of tasks aimed at enabling computers to understand, interpret, and generate human language in a manner that is both meaningful and contextually relevant. Some common tasks in NLP include: -\begin{itemize} - \item \textbf{Text Classification:} Assigning categories or labels to text documents based on their content, such as spam detection, sentiment analysis, or topic categorization. - \item \textbf{Named Entity Recognition (NER):} Identifying and categorizing named entities (such as names of people, organizations, locations, etc.)
within text documents. - \item \textbf{Sentiment Analysis:} Determining the sentiment expressed in a piece of text, such as whether it is positive, negative, or neutral. - \item \textbf{Text Summarization:} Generating concise summaries of longer pieces of text, preserving the most important information. -\end{itemize} - -Other examples of NLP tasks include question answering on given text, language translation, language generation, and more \citep{tdsnlp} \citep{mklearn}. These tasks and many others in NLP play a crucial role in various applications, including search engines, virtual assistants, social media analysis \citep{social2021}, language translation, and more. - -In today's landscape, there's a plethora of tool options available for various tasks. Selecting the most suitable tool hinges on numerous factors, including past successes with similar data types, latency considerations, and extensive experimentation and research to evaluate the different options. - -In certain scenarios, a single NLP task may necessitate the utilization of multiple tools. Take named entity recognition, for instance, where extracting diverse entities might require employing various methods or tools, rather than relying solely on one tool for extracting all types of entities. Named entity recognition, also known as entity extraction, classifies named entities present in a text into pre-defined categories like “individuals”, “companies”, “places”, “organizations”, “cities”, “dates”, and “product terminologies”. It adds a wealth of semantic knowledge to your content and helps you promptly understand the subject of any given text \citep{kdnner} \citep{nersd}. Several tools such as SpaCy \citep{spacy}, NLTK \citep{nltk}, Stanford NER \citep{stannlp}, and others can be used for this purpose. - -Additionally, when faced with the task of selecting from numerous tool options for a particular task, one often needs to write the necessary code and exercise each candidate tool for every option under consideration. For example, in a text-similarity-based recommendation system, which embedding model should one choose, given that there are hundreds of such models available today? One will likely choose a couple of models and evaluate them solely based on a general understanding of the type of data each model is expected to perform better with. This knowledge also comes with experience, making the process more cumbersome for individuals less familiar with or new to NLP and the vast world of NLP model and tool options. - -The toolkit nlprw\_toolkit \footnote{\url{https://github.com/jsingh811/NLP-in-the-real-world/tree/toolkit/nlprw_toolkit}} is an attempt to help with this problem and draws learnings from the book Natural Language Processing in the Real-World, published by CRC Press/Taylor and Francis \citep{Singh2023}, which contains real-world NLP applications across 15+ industry verticals and solutions for the most prominent NLP use cases, and the accompanying code NLP-in-the-Real-World \footnote{\url{https://github.com/jsingh811/NLP-in-the-real-world}}, which serves as a reference guide for executing a large number of NLP applications, including end-to-end implementations of real-world use cases, all using open-source tools. - - -\section{Methods and Results}\label{methods-and-results} - - -\subsection{Tool selection}\label{tool-selection} - -NLP has gained a lot of popularity over the last decade.
This has been accompanied by several useful open-source tools and models that can be leveraged for a variety of tasks. Several paid services from cloud providers have been launched in this space as well, creating a plethora of options for businesses to easily plug their data into models without putting in the work to create them from scratch. There still exist a lot of use cases requiring custom model building, but the availability of many great options has led to the creation of a standard workflow for approaching solution building in NLP. First, the data science / machine learning developer gauges the problem statement, understands the available data, and defines evaluation based on business goals. Then, when it comes to building the solution, the developer first explores existing tools and models. If existing tools have gaps, further work can be done to fill those gaps. If the existing solutions don't work, training a custom model is the next step. - -Data scientists spend at least 20\% of their time in model selection and training \footnote{\url{https://businessoverbroadway.com/2019/02/19/how-do-data-professionals-spend-their-time-on-data-science-projects}}. This time is even greater when there is a new/unfamiliar problem to solve. The problem may have existed before, but it is the developer's first time trying to build a solution for it that fits their data. However, with so many available tools to choose from, how can the choice be made? For this, the developer needs to read up a lot on the available options, find patterns similar to what might work on their data, and then proceed with trial-and-error experimentation. The process of tool selection is a gap in the market today and can be made quicker. Usually, as a developer gains experience, this process gets easier and less time-consuming; thus, the problem remains bigger for early-career professionals. To address this, this section explores some common NLP tasks and how you can short-list tools based on your data. The nlprw\_toolkit integrates this knowledge and shares a way to make this assessment using the toolkit directly. This section sheds more light on decision making while selecting tools and shares code samples of doing so using the toolkit. - - -\subsubsection{Based on desired task}\label{based-on-desired-task} - -Let’s consider named entity recognition (NER) for example. NER is a very popular task, finding applications across industry verticals as well as research and academia projects. Some popular open-source tools for this task include SpaCy and NLTK. If you want to extract email IDs from text, then tools like SpaCy NER are not very helpful, and a regular expression will be useful instead. This is because entities like email IDs and phone numbers have predictable patterns that compose them, which makes pattern-matching techniques a more viable option. It is also computationally less expensive to implement pattern matching using regex. Example regex for emails: - -\begin{lstlisting} -import re - -# text holds the input string to search -re.finditer(r"\S+@\S+\.\S+", text) # email pattern -\end{lstlisting} - -But if you want to extract dates, then NER tools like SpaCy, NLTK, or other transformer-based NER models will likely yield better results. SpaCy comes with several models of different sizes that you can choose from for NER.
- -\begin{lstlisting} -import spacy - -nlp = spacy.load("en_core_web_sm") # small model trained on web-based data -doc = nlp(text) # text holds the input string -for word in doc.ents: - print(word.text, word.label_) -\end{lstlisting} - -Furthermore, for cleaner data, you can opt for smaller SpaCy models for NER and may not see much lift in response accuracy with larger models. For more complex and noisy data, larger models may do better, including transformer models. Note that larger models will lead to higher latency as well. Thus, keep this trade-off in mind while making the choice. - -If you want to extract an entity not offered by any of these existing models, then you may need to train your own model, which you can do using SpaCy itself, building on top of its existing models. The code in \footnote{\url{https://github.com/jsingh811/NLP-in-the-real-world/blob/toolkit/section5/training-ner-spacy.ipynb}} shows an example of doing so, building a custom NER model using SpaCy, and \footnote{\url{https://github.com/jsingh811/NLP-in-the-real-world/blob/toolkit/section5/transformers-ner-fine-tuning.ipynb}} shows code to build a custom NER model using transformer-based LLMs. - -Thus, based on the desired task within the same NLP application, the tool choice can vary. - -Using the nlprw\_toolkit, you can specify details on the entities you are interested in, and the tool selects the model it needs behind the scenes and gives you all the extractions you want. Internally, the toolkit may use multiple tools or a single tool, depending on the task, but the user experience remains singular. An example with SpaCy and regex is shared below. - -\begin{lstlisting} -from nlprw_toolkit import infoextractor - -doc = "please write me at fejfow@iejf.com tomorrow about MOM Mission statement by 12.30 pm." -entities = ['email', 'DATE', 'TIME'] - -extracted_entities, model_selection = infoextractor.run(doc, entities=entities) - -print(extracted_entities) -# >> {'email': [('fejfow@iejf.com', 19, 34)], 'DATE': [('tomorrow', (5, 6))], 'TIME': [('12.30 pm', (11, 13))]} - -print(model_selection) -#>> {'regex': ['email'], 'spacy': ['DATE', 'TIME']} -\end{lstlisting} - -\subsubsection{Based on type of data (quality/source)}\label{Based-on-type-of-data} - -To exemplify, let’s consider the task of sentiment analysis. Sentiment analysis is a very popular task and is very important for several business applications across e-commerce, social media, finance, and other verticals. Popular use cases include identifying sentiment from customer reviews about a product or service. This helps inform businesses on how well something is doing and impacts their action strategies. - -There are many open-source pre-trained models that can be used for a majority of data types for sentiment analysis. The different models are trained using various sources of data. Choosing the model that is likely to be better on your type of data gives you the best chance of desirable results. For instance, VADER \citep{vader} may be preferable if your data contains informal language, whereas TextBlob \citep{textblob} may be better if the language in your text is more formal. VADER is trained on a variety of data. This data includes customer reviews, but also informal language data that is likely to contain typos and emoticons, very similar to the kind of language you see and expect on social media (e.g., :-), LOL, nah, meh.)
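For reference, both libraries can be called directly in only a few lines. The minimal sketch below (assuming the vaderSentiment and textblob packages are installed; the example sentence is the same one reused in the toolkit example later in this section) shows the raw scores each library returns, which you would then map to positive/negative/neutral labels yourself.

\begin{lstlisting}
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

text = "Show was really good it ws soooo fun."

# VADER returns neg/neu/pos proportions plus a normalized
# compound score in [-1, 1]
vader_scores = SentimentIntensityAnalyzer().polarity_scores(text)
print(vader_scores)  # dict with 'neg', 'neu', 'pos', 'compound' keys

# TextBlob returns polarity in [-1, 1] and subjectivity in [0, 1]
blob = TextBlob(text)
print(blob.sentiment.polarity, blob.sentiment.subjectivity)
\end{lstlisting}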
Several studies \citep{nepai} have also reported VADER doing better than TextBlob on informal language, where TextBlob does not appear to understand informal language nuances as well and does better on text with more structure. - -If choosing between TextBlob and VADER, based on the aggregated knowledge from several studies, VADER will likely be a better choice if dealing with informal language, data with many terms that can't be found in a dictionary, or data from sources like social media. TextBlob is likely to be a better choice if you have more formal and reasonably structured language, fewer terms that can't be found in a dictionary, or language found in review comments like hotel reviews. VADER may also do well for review comments from sources like movie reviews and social media reviews. - -Overall, many factors play into this choice, including the following. \\ - -STYLE={"formal", "informal", "mixed"} \\ -TYPOS={"many\_nondict\_terms", "some\_nondict\_terms", "mostly\_clean"} \\ -SOURCE={"social\_media", "review\_comments", "articles"} \\ - -Using the tool, you can specify the kind of language contained in your data. Based on the information provided, the nlprw\_toolkit makes a recommendation for the model and returns the computed sentiment using the recommended model. - -\begin{lstlisting} -from nlprw_toolkit import sentiment - -# example 1 -sentences = ["i love you", "you dislike me or what?", "you hate ice cream", ""] -sentiments, model_choice = sentiment.get_sentiment( - sentences, - style={"mixed": True}, - typos={"mostly_clean": True}, - source={"review_comments": True} -) -print(sentiments) -# >> ['positive', 'neutral', 'negative', 'negative'] - -print(model_choice) -# >> 'textblob' - -# example 2 -sentences = ["Show was really good it ws soooo fun."] -sentiments, model_choice = sentiment.get_sentiment( - sentences, - style={"informal": True}, - typos={"many_nondict_terms": True}, - source={"social_media": True, "review_comments": True} -) -print(sentiments) -# >> ['positive'] - -print(model_choice) -# >> 'vader' -\end{lstlisting} - - -If you have labeled data available for any task, but not in big enough quantities to train a model, you can leverage it to test existing tools' performance and find which tool may be the better choice for your data. This evaluation can help shortlist tools with stronger corroboration. For example, there are multiple sentiment analysis libraries with pre-trained models that work well for many data sources and types. Passing in the libraries you want to test, along with the test data, will enable you to compute and compare evaluations. If you don't know which tools may be options for a task like that, you can let the toolkit run through its default model options and recommend a model for you.
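This kind of check takes only a small amount of code. As a concrete illustration, here is a minimal sketch (not part of the toolkit; the three labeled reviews are hypothetical, and the compound-score cutoffs are the commonly used VADER defaults) that scores two candidate libraries against a small labeled sample using scikit-learn:

\begin{lstlisting}
from sklearn.metrics import accuracy_score
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# tiny hypothetical labeled sample
texts = [
    "Loved the service, super friendly staff!",
    "Room was dirty and smelled bad.",
    "It was fine, nothing special.",
]
labels = ["positive", "negative", "neutral"]

analyzer = SentimentIntensityAnalyzer()

def vader_label(text):
    c = analyzer.polarity_scores(text)["compound"]
    return "positive" if c >= 0.05 else "negative" if c <= -0.05 else "neutral"

def textblob_label(text):
    p = TextBlob(text).sentiment.polarity
    return "positive" if p > 0 else "negative" if p < 0 else "neutral"

for name, labeler in [("vader", vader_label), ("textblob", textblob_label)]:
    preds = [labeler(t) for t in texts]
    print(name, accuracy_score(labels, preds))
\end{lstlisting}

The same pattern extends to any task where a handful of labeled examples exist: run each candidate tool over the sample and compare a metric that matches your business goal.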
 - -\subsection{Full project examples}\label{project-examples} - -\subsubsection{Recommendation system} -The core of text-based recommendation systems is finding similarity between a reference text and a corpus of text documents. Documents from the corpus which exhibit the highest similarity are likely good candidates to show as recommendations. The key is finding numerical representations of the text and computing similarity metrics.\\ - -Let's look at an example of accessing and evaluating multiple tools via a single interface, based on desired task components and type of data (quality/source). - -Let’s take text similarity for instance. There is a corpus of text samples and one piece of text for which one wants to find the most similar text samples in the corpus. There are many embedding models that can be used to compute numeric representations of text, followed by using cosine similarity to find semantic similarity between two pieces of text. You can also use methods other than pre-trained embedding models, such as creating your own embedding model. - -Consider a scenario where you possess a corpus containing lengthy sentences. Traditional numerical representations like one-hot encoding or TF-IDF result in sparse vectors, where many elements are zeros. An alternative approach involves using dense representations, such as word embeddings. Word embeddings offer a method to represent words numerically within a corpus. This results in a vector for each term in the corpus, where each vector is of uniform size, typically much smaller compared to TF-IDF or one-hot encoded vectors. Examples of word embeddings include word2vec \footnote{\url{https://radimrehurek.com/gensim/models/word2vec.html}} (tools: gensim, SpaCy), fastText \footnote{\url{https://ai.facebook.com/tools/fasttext/}}, Doc2Vec \footnote{\url{https://radimrehurek.com/gensim/models/doc2vec.html}}, GloVe embeddings \footnote{\url{https://nlp.stanford.edu/projects/glove/}}, ELMo \footnote{\url{https://paperswithcode.com/method/elmo}}, universal sentence encoder \citep{articleuse}, transformers \footnote{\url{https://www.sbert.net/docs/pretrained_models.html}}, and the ever-growing set of transformer-based models/LLMs. Here is a curated list of pros and cons of these embedding models, providing more context \cite{Singh2023}; a short sketch comparing a pre-trained embedding with a corpus-trained model follows the list. - -\begin{itemize} -\item The main disadvantage of Word2Vec is that you will not have a vector representing a word that does not exist in the corpus. For instance, if you trained the model on biological articles only, then that model will not be able to return vectors for unseen words, such as "curtain" or "cement". -\item The advantage of fastText over Word2Vec is that you can get a word representation for words not in the training data/vocabulary with fastText. Since fastText uses character-level details of a word, it is able to compute vectors for unseen words containing characters it has seen before. One disadvantage of this method is that unrelated words containing similar characters may end up close in the vector space without semantic closeness. For example, words like "love", "solve", and "glove" share many characters ("l", "o", "v", "e") and may all be close together in vector space. -\item Doc2Vec is based on Word2Vec, except it is suitable for larger documents. -\item GloVe vectors treat each word as one, without considering that the same word can have multiple meanings. The word "bark" in "a tree bark" will have the same representation as in "a dog bark". Since it is based on co-occurrence, which needs every word in the corpus, GloVe vectors can be memory intensive depending on corpus size. -\item ELMo can handle words used in different contexts across sentences, which GloVe is unable to do. Thus, the same word with multiple meanings can have different embeddings. -\item Transformer-based models are larger; however, they have more understanding of language in general. -\end{itemize}
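To make the two main routes concrete, here is a minimal sketch (not the toolkit's code; it assumes the SpaCy en\_core\_web\_md model, which ships with word vectors, and scikit-learn are installed, and the corpus and query strings are placeholders) that scores a query against a corpus once with a pre-trained embedding and once with a TF-IDF model trained on the corpus itself:

\begin{lstlisting}
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["first text document ...", "second text document ..."]
query = "text to find similar documents for"

# Route 1: pre-trained word embeddings (SpaCy averages word vectors per doc)
nlp = spacy.load("en_core_web_md")  # medium model includes word vectors
query_doc = nlp(query)
embedding_scores = [query_doc.similarity(nlp(doc)) for doc in corpus]

# Route 2: a TF-IDF model trained on the corpus itself
vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])
tfidf_scores = cosine_similarity(query_vector, corpus_vectors)[0]

print(embedding_scores)
print(tfidf_scores)
\end{lstlisting}

Ranking the corpus by either score list gives the recommendation candidates; which list to trust depends on the corpus and query characteristics discussed next.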
For a text similarity task, the representativeness of the text corpus matters a lot. If the text for which you want to find similar pieces in the corpus does not have representation in the corpus, then you need to opt for models with a more general understanding; thus, pre-trained models would be useful. However, if your data is domain specific and/or representative of the text you want to pass through the model, then a model with generic understanding may not be advantageous and may actually hurt the performance. In this case, training your own model would be preferable, and starting with the simplest TF-IDF can be highly beneficial. TF-IDF will also be the least computationally expensive option and will often suffice for the task at hand. Opting for more complex models should be done after establishing a baseline with the smaller and simpler models first. - -For instance, let’s say you have a corpus that is specific to text related to ‘Python’, both the programming language as well as the snake.\\ - -The text you want to find similar documents for is \\ -\begin{quote} - Spotted rattle skin in the field -\end{quote} - -Since your dataset is specific to Python, the corpus may not contain data representative of the sample you are analyzing. Thus, choosing a model with generic understanding is likely to yield better results. \\ - -Using a generic pre-trained embedding model from SpaCy, the top result returned for the above text is: - -\begin{quote} - Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I've been bit by a snake -\end{quote} - -Using a custom-trained embedding model instead (TF-IDF based) yields the following top result (a consequence of the wrong tool choice). -\begin{quote} - Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like. Subscribe for more Python tutorials like … -\end{quote} - -By passing details about the data to the toolkit (in this case setting ‘sample\_likely\_represented\_in\_corpus’ to False), the toolkit makes this determination automatically, recommends the tool of choice, and returns text similarity scores using the recommended tool. Similarly, whether the data is domain specific or not and other details about the data influence the tool choice as well. - -\begin{lstlisting} -from nlprw_toolkit import rec_sys - -corpus = [ - "Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like. Subscribe for more Python tutorials like ...", - "Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I've been bit by a snake" -] -sample = "spotted rattle skin in the field" - -print("'sample_likely_represented_in_corpus': False") -recs = rec_sys.run_rec_system( - corpus, - [sample], - top_n=2, - data={'corpus_domain_specific': True, 'sample_likely_represented_in_corpus': False} -) -print(recs) -#>> [[("Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I've been bit by a snake", 0.731253418677792), ('Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like.
Subscribe for more Python tutorials like ...', 0.6951807783413815)]] - -print("'sample_likely_represented_in_corpus': True (consequence of wrong tool choice)") -recs = rec_sys.run_rec_system( - corpus, - [sample], - top_n=2, - data={'corpus_domain_specific': True, 'sample_likely_represented_in_corpus': True} -) -print(recs) -#>> [[('Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like. Subscribe for more Python tutorials like ...', 0.17008208798133495), ("Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I've been bit by a snake", 0.0)]] -\end{lstlisting} - - -\subsubsection{Comment review analysis} - -Let’s say you have hotel review comments and want to understand them better: analyze sentiment and understand the complaints reported in the data. Let’s see how you can get started on this quickly using the toolkit. -Once you run the review analysis with a list of comments, the following information prints to give you stats about the data, as well as sentiment computed with the model the toolkit recommends based on your description of the data (as described in the section ``Based on type of data (quality/source)'' above), as seen in \ref{sentdis}.\\ -\begin{quote} - Total no. of reviews: 1837\\ - Shortest review length: 10 chars\\ - Longest review length: 19846 chars\\ - Mean review length: 980.9229311433986 chars\\ - Median review length: 793.0 chars\\ -\end{quote} -\begin{figure}[] - \includegraphics[width=0.5\textwidth]{sentimentdistn.png} - \caption{Sentiment breakdown in the corpus of user review comments. \label{sentdis}} -\end{figure} - -Now, viewing each comment and its sentiment manually could be time consuming. The tool leverages popular visualizers like matplotlib \citep{matplotlib} and wordcloud \cite{wordcloud} and puts them together on top of your data to easily visualize the words that make up each sentiment. Top words, mainly nouns, found in positive sentiment comments can be seen in \ref{pos_noun_wc}. The bigger the word, the more commonly it was found. - -\begin{figure}[] - \includegraphics[width=0.5\textwidth]{pos_noun_wc.png} - \caption{Positive comments - word cloud. \label{pos_noun_wc}} -\end{figure} - -Top words, mainly nouns, found in negative sentiment comments can be seen in \ref{neg_noun_wc}. - -\begin{figure}[] - \includegraphics[width=0.5\textwidth]{neg_noun_wc.png} - \caption{Negative comments - word cloud. \label{neg_noun_wc}} -\end{figure} - -This gives the analyst an analytical understanding of the underlying data. For instance, people with more negative reviews talked more about night time, bed, staff, and desk. People with positive reviews spoke about time, staff, location, bathroom, etc. It shows that some people had a positive experience with the staff, while others may have had a negative experience. There are other visualization tools that may be preferred over word clouds and can be leveraged for analytics as well.
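Under the hood, this kind of view takes only a few lines. A minimal sketch (not the toolkit's own code; it assumes SpaCy, wordcloud, and matplotlib are installed, and that positive\_comments is a list holding the positive-sentiment reviews) looks like this:

\begin{lstlisting}
import matplotlib.pyplot as plt
import spacy
from wordcloud import WordCloud

nlp = spacy.load("en_core_web_sm")

# keep mainly nouns from the positive-sentiment comments
nouns = []
for comment in positive_comments:
    nouns.extend(tok.text.lower() for tok in nlp(comment) if tok.pos_ == "NOUN")

# word size in the cloud reflects how often each noun occurs
cloud = WordCloud(background_color="white").generate(" ".join(nouns))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
\end{lstlisting}

The same few lines run over the negative-sentiment comments produce the second word cloud.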
 - -Next, let’s say you want to create a classification model on the negative comments data, to understand whether the negative comment was about a staff member or about the property itself. How can you do this without any labeled data? This isn’t the most intuitive stage for figuring out the next set of steps, especially for early-career or new professionals. This is where techniques that require no labeled data, like zero-shot classification \citep{zscbench}, come in handy. Zero-shot text classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes \citep{zsc}. Here, you can use a pre-trained model with a general understanding of text and words, and see which class/category a comment is closer to in semantic space. This can be a great starting point and can also act as a labeler for your data, which you can verify and then use to train a custom classification model. You can also try relatively smaller LLMs for this task. The example below uses a BERT-based model for zero-shot classification. - - -\begin{lstlisting} -from transformers import pipeline - -# categories you want your data classified into -categories = ["staff person", "hotel property"] -classifier = pipeline( - "zero-shot-classification", model="typeform/distilbert-base-uncased-mnli" -) - -# one review comment from the data -sentence = "I love everything. The front desk was helpful." - -classifier(sentence, candidate_labels=categories) -\end{lstlisting} - -\begin{quote} - {'sequence': "My experience at check-in counter was terrible. The staff was kind of rude and didn't want to help out much. No complaints otherwise.", 'labels': ['staff person', 'hotel property'], 'scores': [0.7497782707214355, 0.25022172927856445]} -\end{quote} -\begin{quote} - {'sequence': "The carpet in the room was quite stained. I am surprised they didn't replace it given its condition.", 'labels': ['hotel property', 'staff person'], 'scores': [0.6811559796333313, 0.3188440203666687]} -\end{quote} -\begin{quote} - {'sequence': 'I love everything. The front desk was helpful.', 'labels': ['staff person', 'hotel property'], 'scores': [0.6326051950454712, 0.3673948347568512]} -\end{quote} - -In the absence of labeled data, the toolkit also suggests an appropriate classification method like the above based on your data and returns the results for you. - - - -\section{Limitations and Future work}\label{Limitations-and-Future-work} - -There are several factors that play an important role in tool selection, and optimization of this process is an ongoing and evolving effort. This paper presents information to help with decision making around tool selection for popular NLP tasks. It also presents a toolkit that represents an early effort to simplify tool selection and facilitate the use of multiple tools through a single interface, incorporating practical logic for tool selection. Numerous updates can further enhance the toolkit's offerings. Considering the dynamic nature of NLP in AI, several functionality and software updates will aid in keeping pace with the rapid advancements in the field. - - -\section{Conclusion}\label{conclusion} - -In this paper, considerations and decision making for tool selection for popular NLP tasks are shared with examples. The goal is to make the tool choice process faster and easier, which is especially helpful for individuals new to the field. NLP is a growing field, with more and more people trying to leverage this technology for many tasks. The nlprw\_toolkit is introduced, an early attempt to integrate this decision making into an open-source tool. Popular NLP tasks such as text classification, summarization, named entity recognition, and sentiment analysis can be done faster with informed tool choices using the toolkit. Full real-world use cases, including recommendation systems and customer review analysis, were presented as examples that can be built using informed tool choices.
- - - diff --git a/papers/jyotika_singh/mybib.bib b/papers/jyotika_singh/mybib.bib deleted file mode 100644 index 1e270ecaa5..0000000000 --- a/papers/jyotika_singh/mybib.bib +++ /dev/null @@ -1,259 +0,0 @@ -@inbook{Jones1994, - title = {Natural Language Processing: A Historical Review}, - ISBN = {9780585359588}, - url = {http://dx.doi.org/10.1007/978-0-585-35958-8_1}, - DOI = {10.1007/978-0-585-35958-8_1}, - booktitle = {Current Issues in Computational Linguistics: In Honour of Don Walker}, - publisher = {Springer Netherlands}, - author = {Jones, Karen Sparck}, - year = {1994}, - pages = {3–16} -} - -@article{nepai, - url = {https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair}, - title={Sentiment Analysis in Python: TextBlob vs Vader Sentiment vs Flair vs Building It From Scratch}, - publisher={Neptune.ai}, - author={Shahul ES}, - year = {2023}, - month = aug, - day={30} - -} - -@article{textblob, - title={textblob Documentation}, - author={Loria, Steven}, - journal={Release 0.15}, - volume={2}, - year={2018} -} - -@article{articleuse, - author = {Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Sung, Yun-Hsuan and Strope, Brian and Kurzweil, Ray}, - year = {2018}, - month = {03}, - pages = {}, - title = {Universal Sentence Encoder} -} - -@article{vader, - title={VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text}, - volume={8}, url={https://ojs.aaai.org/index.php/ICWSM/article/view/14550}, - DOI={10.1609/icwsm.v8i1.14550}, - number={1}, - journal={Proceedings of the International AAAI Conference on Web and Social Media}, - author={Hutto, C. and Gilbert, Eric}, - year={2014}, - month={May}, - pages={216-225} -} - -@book{Singh2023, - title = {Natural Language Processing in the Real World: Text Processing, Analytics, and Classification}, - ISBN = {9781003264774}, - url = {http://dx.doi.org/10.1201/9781003264774}, - DOI = {10.1201/9781003264774}, - publisher = {Chapman and Hall/CRC}, - author = {Singh, Jyotika}, - year = {2023}, - month = may -} - -@article{tdsnlp, - title = {Natural Language Processing Tasks}, - publisher = {Towards Data Science}, - author = {Meyer, Patrick}, - year = {2021}, - month = oct, - url={https://towardsdatascience.com/natural-language-processing-tasks-3278907702f3}, - urldate={2024-05-31} -} - -@article{audioIntro, - title={An introduction to audio processing and machine learning using Python}, - url={https://opensource.com/article/19/9/audio-processing-machine-learning-python}, - publisher={Opensource.com}, - author={Singh, Jyotika}, - year = {2019}, - month=Sep , - urldate={2024-05-31} -} - -@article{mklearn, - title={11 NLP Applications & Examples in Business}, - publisher={Monkey Learn}, - author={Wolff, Rachel}, - year={2020}, - month=may, - day={20}, - url={https://monkeylearn.com/blog/natural-language-processing-applications/}, - urldate={2024-05-31} -} - -@InProceedings{audio2022, - author={Singh, J}, - title={py{A}udio{P}rocessing: {A}udio {P}rocessing, {F}eature {E}xtraction, and {M}achine {L}earning {M}odeling}, - booktitle={{P}roceedings of the 21st {P}ython in {S}cience {C}onference}, - pages={152-158}, - year={2022}, - doi={10.25080/majora-212e5952-017} -} - - -@article{nersd, - title={Named Entity Recognition}, - publisher={Science Direct}, - author={Science Direct}, - 
url={https://www.sciencedirect.com/topics/computer-science/named-entity-recognition#featured-authors}, - urldate={2024-05-31} -} - -@InProceedings{social2021, - author={Singh, J}, - title={{S}ocial {M}edia {A}nalysis using {N}atural {L}anguage {P}rocessing {T}echniques}, - booktitle={{P}roceedings of the 20th {P}ython in {S}cience {C}onference}, - pages={74-80}, - year={2021}, - doi={10.25080/majora-1b6fd038-009} -} - -@article{kdnner, - author={Banerjee, Suvro}, - title={Introduction to Named Entity Recognition}, - url= {https://www.kdnuggets.com/2018/12/introduction-named-entity-recognition.html}, - year= {2018}, - month=dec, - urldate={2024-05-31} -} - -@article{attn, - doi = {10.48550/ARXIV.1706.03762}, - url = {https://arxiv.org/abs/1706.03762}, - author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia}, - keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences}, - title = {Attention Is All You Need}, - publisher = {arXiv}, - year = {2017}, - copyright = {arXiv.org perpetual, non-exclusive license} -} - -@inproceedings{jupyter, - author = {Kluyver, Thomas and Ragan-Kelley, Benjamin and Pérez, Fernando and Granger, Brian and Bussonnier, Matthias and Frederic, Jonathan and Kelley, Kyle and Hamrick, Jessica and Grout, Jason and Corlay, Sylvain and Ivanov, Paul and Avila, Damián and Abdalla, Safia and Willing, Carol and {Jupyter development team}}, - editor = {Loizides, Fernando and Schmidt, Birgit}, - location = {Netherlands}, - publisher = {IOS Press}, - url = {https://eprints.soton.ac.uk/403913/}, - booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas}, - year = {2016}, - pages = {87--90}, - title = {Jupyter Notebooks - a publishing format for reproducible computational workflows}, -} - -@inproceedings{stannlp, - url = {http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf}, - author = {Finkel, Jenny Rose and Grenager, Trond and Manning, Christopher}, - title = {Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling}, - booktitle = {Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005)}, - year = {2005}, - pages = {363--370}, - doi = {http://dx.doi.org/10.3115/1219840.1219885} -} - -@article{spacy, - author = {Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane}, - title = {spaCy: Industrial-strength Natural Language Processing in Python}, - year = {2020}, - doi = {10.5281/zenodo.1212303} -} - -@article{nltk, - doi = {10.48550/ARXIV.CS/0205028}, - url = {https://arxiv.org/abs/cs/0205028}, - author = {Loper, Edward and Bird, Steven}, - keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, D.2.6; I.2.7; J.5; K.3.2}, - title = {NLTK: The Natural Language Toolkit}, - publisher = {arXiv}, - year = {2002}, - copyright = {Assumed arXiv.org perpetual, non-exclusive license to distribute this article for submissions made before January 2004} -} - -@article{matplotlib, - author = {Hunter, J.
D.}, - publisher = {IEEE COMPUTER SOC}, - year = {2007}, - doi = {https://doi.org/10.1109/MCSE.2007.55}, - journal = {Computing in Science \& Engineering}, - number = {3}, - pages = {90--95}, - title = {Matplotlib: A 2D graphics environment}, - volume = {9}, -} - -@article{wordcloud, - title={WordCloud: a Cytoscape plugin to create a visual semantic summary of networks}, - author = {Oesper, Layla and Merico, Daniele and Isserlin, Ruth and Bader, Gary D}, - journal = {Source code for biology and medicine}, - volume = {6}, - number = {1}, - pages = {7}, - year = {2011}, - doi = {http://dx.doi.org/10.1186/1751-0473-6-7}, - publisher = {Springer} -} - -@article{zsc, - url = {https://huggingface.co/tasks/zero-shot-classification}, - title = {Zero-Shot Classification}, - publisher = {Hugging Face}, - author = {Hugging Face}, - urldate = {2024-05-31} -} - -@article{zscbench, - title = {Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach}, - author = {Wenpeng Yin and Jamaal Hay and Dan Roth}, - journal = {ArXiv}, - year = {2019}, - volume = {abs/1909.00161}, - url = {https://api.semanticscholar.org/CorpusID:202540839}, - doi = {http://dx.doi.org/10.18653/v1/D19-1404} -} - -@inbook{2011, - ISBN = {9780387301648}, - url = {http://dx.doi.org/10.1007/978-0-387-30164-8_832}, - DOI = {10.1007/978-0-387-30164-8_832}, - booktitle = {Encyclopedia of Machine Learning}, - publisher = {Springer US}, - year = {2011}, - pages = {986–987} -} - - -@misc{cnn, - doi = {10.48550/ARXIV.1511.08458}, - url = {https://arxiv.org/abs/1511.08458}, - author = {O'Shea, Keiron and Nash, Ryan}, - keywords = {Neural and Evolutionary Computing (cs.NE), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences}, - title = {An Introduction to Convolutional Neural Networks}, - publisher = {arXiv}, - year = {2015}, - copyright = {arXiv.org perpetual, non-exclusive license} -} - -@article{lda, - author = {Blei, David M. and Ng, Andrew Y. and Jordan, Michael I.}, - title = {Latent dirichlet allocation}, - year = {2003}, - issue_date = {3/1/2003}, - publisher = {JMLR.org}, - volume = {3}, - number = {null}, - issn = {1532-4435}, - journal = {J. Mach. Learn. 
Res.}, - month = {mar}, - pages = {993–1022}, - numpages = {30}, - doi={http://dx.doi.org/10.7551/mitpress/1120.003.0082} -} diff --git a/papers/jyotika_singh/myst.yml b/papers/jyotika_singh/myst.yml deleted file mode 100644 index 6c5316e92a..0000000000 --- a/papers/jyotika_singh/myst.yml +++ /dev/null @@ -1,45 +0,0 @@ -version: 1 -project: - # Update this to match `scipy-2024-` the folder should be `` - id: scipy-2024-jyotika_singh - title: Navigating Model Selection for NLP tasks - subtitle: Considerations for decision making and nlprw_toolkit - # Authors should have affiliations, emails and ORCIDs if available - authors: - - name: Jyotika Singh - email: singhjyotika811@gmail.com - orcid: 0000-0002-5442-3004 - affiliations: - - Independent - keywords: - - NLP - - Natural Language Processing - - Language data - # Add the abbreviations that you use in your paper here - abbreviations: - NLP: Natural Language Processing - # It is possible to explicitly ignore the `doi-exists` check for certain citation keys - error_rules: - - rule: doi-exists - severity: ignore - keys: - - jupyter - - audioIntro - - tdsnlp - - mklearn - - kdnner - - zsc - - nersd - - nepai - - textblob - - articleuse - # A banner will be generated for you on publication, this is a placeholder - banner: banner.png - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-07-10 -site: - template: article-theme diff --git a/papers/jyotika_singh/neg_noun_wc.png b/papers/jyotika_singh/neg_noun_wc.png deleted file mode 100644 index dfef1477e6..0000000000 Binary files a/papers/jyotika_singh/neg_noun_wc.png and /dev/null differ diff --git a/papers/jyotika_singh/pos_noun_wc.png b/papers/jyotika_singh/pos_noun_wc.png deleted file mode 100644 index 7b9637fbe3..0000000000 Binary files a/papers/jyotika_singh/pos_noun_wc.png and /dev/null differ diff --git a/papers/jyotika_singh/sentimentdistn.png b/papers/jyotika_singh/sentimentdistn.png deleted file mode 100644 index 82c0587ba8..0000000000 Binary files a/papers/jyotika_singh/sentimentdistn.png and /dev/null differ