Live demo hosted on AWS EC2: http://52.34.250.57/
This app is a basic notes app with a smart organizer feature that automatically groups related notes together. It calls an embeddings API to compute a vector representation of each note's text in a latent space, then groups the notes with DBSCAN, a clustering algorithm. You can click and drag to recategorize notes. For future notes, you can use "Classify New Notes" to assign new notes to the existing categories.
See the instructions for deployment.
This app's frontend is built with React and TypeScript.
Notes are not persisted on the server; they are saved to localStorage, part of the browser's Web Storage API.
The backend handles client requests to reorganize the notes.
I experimented with a few embedding providers: OpenAI and NLP Cloud. Different providers return vectors in different latent spaces, so the distances between vectors may be on different scales, the dimensionality may differ, and the overall quality of the embeddings may vary. Downstream tasks that depend on the embeddings need to be tuned around these properties.
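As a rough sketch of the embedding step, assuming a Python backend and the OpenAI Python SDK (the model name below is only illustrative, not necessarily what this project uses; NLP Cloud offers a similar embeddings endpoint):

```python
# Sketch of fetching embeddings, assuming a Python backend and the OpenAI
# Python SDK. The model name is illustrative, not necessarily what this
# project uses.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Return one embedding vector per input text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # example model; dimensionality varies by model
        input=texts,
    )
    return [item.embedding for item in response.data]
```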
Clustering is an ML problem: given a list of points, group ("cluster") them together based on similarity.
This project uses DBSCAN as the underlying clustering algorithm. I considered k-means clustering, but k-means requires specifying the number of clusters up front, which neither the client nor the server knows. Instead I took a density-based approach, which forms clusters based on how close points are to each other in the latent space.
As mentioned in the Embeddings section, the latent space of the embeddings influences how the clustering algorithm's hyperparameters are tuned.
The DBSCAN algorithm takes a parameter known as `eps`, which acts as a neighborhood radius: two points whose distance is less than or equal to `eps` are treated as neighbors, and chains of such neighbors grow into a single cluster. I tuned this value to 3.3.
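A minimal sketch of the clustering step, assuming a Python backend with scikit-learn (`eps=3.3` comes from the tuning described above; `min_samples` is not discussed here, so the value below is just an illustrative assumption):

```python
# Minimal sketch of the clustering step, assuming a Python backend with
# scikit-learn. eps=3.3 is the tuned value described above; min_samples
# is an illustrative assumption, not a value taken from this project.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_notes(embeddings: np.ndarray) -> list[int]:
    """Group note embeddings into clusters; -1 marks notes left uncategorized (noise)."""
    labels = DBSCAN(eps=3.3, min_samples=2).fit_predict(embeddings)
    return labels.tolist()
```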
Clustering alone has a drawback: it is a global process, so adding new notes requires reclustering everything, which can unexpectedly shuffle previously categorized notes into different categories.
Since users will want to categorize notes incrementally, I added a classification step using KNN (k-nearest neighbors). This assigns new notes to the existing categories: the currently categorized notes are held fixed, and each new note takes the category of the categorized note with the nearest embedding.
Now users can continuously and incrementally add notes.
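A minimal sketch of this classification step, assuming a Python backend with scikit-learn and a 1-nearest-neighbor rule (which matches the "nearest embedding" description above):

```python
# Minimal sketch of classifying new notes into existing categories, assuming
# a Python backend with scikit-learn. n_neighbors=1 matches the "nearest
# embedding" rule described above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_new_notes(
    categorized_embeddings: np.ndarray,   # embeddings of already-categorized notes
    categories: list[int],                # their category labels
    new_embeddings: np.ndarray,           # embeddings of the new, uncategorized notes
) -> list[int]:
    """Assign each new note the category of its nearest categorized note."""
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(categorized_embeddings, categories)
    return knn.predict(new_embeddings).tolist()
```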
I created this project to explore how NLP methods can improve existing text-based products.
If you liked this idea, want to provide feedback, or just chat, feel free to reach out to me at [email protected].