GitHub - CatClawed/Clustering-Wikipedia: Old grad school project.

An old grad school project.

1% of Wikipedia is clustered using SOM (Self-organizing map, a neural network).

Non article pages were removed as best as possible, as well as stubs. Each word was lemmatized (reduced to a base form), and certain common words were removed.

The SOM and GSOMs were implemented by hand because it sounded fun. Euclidian distance and cosine similarity were used to determine similarity of words in articles. The results were pretty decent for SOM but not amazing for GSOM; which is to say articles that are similar tended to be in the same cluster and organized decently throughout the map in SOM.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
GSOM		GSOM
SOM		SOM
Wiki		Wiki
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Uh oh!

Releases

Packages

Languages

CatClawed/Clustering-Wikipedia

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages