Mmunabau/Preparing-TF-IDF-with-spark-and-EMR-studio

Preparing TF-IDF with spark and EMR studio

Description

In this lab, we work with a raw dataset from Wikipedia to build a custom TF-IDF (Term Frequency-Inverse Document Frequency) model. After cleaning the data and training the model, we apply it to perform a pseudo-search, retrieving meaningful results for a given term. The lab runs on EMR (Elastic MapReduce), utilizing:

  • EMR Workspace
  • EMR Notebook running PySpark
  • An EMR Studio instance, supported by an EMR cluster

The integration of these tools enables us to process and analyze the data efficiently at scale.
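As a rough illustration of what a TF-IDF model computes, the sketch below weights each term by its frequency within a document multiplied by the log of the inverse fraction of documents containing it. It is plain Python rather than the PySpark code used in the lab, and the function name `tf_idf`, the sample documents, and the unsmoothed `log(N / df)` weighting are all assumptions chosen for illustration; the lab's own model may weight terms differently.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} for a dict of doc_id -> list of tokens."""
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency (count / document length) times
        # inverse document frequency (natural log of N / df).
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical toy corpus, already tokenized.
docs = {
    "a": ["spark", "cluster", "spark"],
    "b": ["cluster", "notebook"],
}
weights = tf_idf(docs)
```

A term that appears in every document (here `cluster`) gets a weight of zero, which is why TF-IDF favors terms that are distinctive for a document.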

Languages and Utilities Used

  • EMR Studio
  • Spark
  • AWS Console
  • Python

Environments Used

  • Windows 11 (21H2)

Program walk-through:

Launch an EMR Workspace running PySpark:
TF-IDF Steps

Loading raw data from Wikipedia into the custom TF-IDF model:
TF-IDF Steps
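Raw Wikipedia text usually needs cleaning before terms can be counted. The following is a minimal sketch of one possible cleaning step, written in plain Python; the `clean` function, the tiny stop-word list, and the sample sentence are hypothetical and are not taken from the lab notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def clean(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = clean("The [[Spark]] engine is used in EMR Studio!")
```

Markup such as `[[` and `]]` and punctuation are discarded because only runs of letters are kept.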

Processing and analyzing data efficiently, running each notebook cell with Shift+Enter:
TF-IDF Steps

TF-IDF data model complete:
TF-IDF Steps
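Once the weights exist, a pseudo-search for a single term can be as simple as ranking documents by that term's weight. The sketch below assumes the weights are available as a nested dictionary; the `search` function, the `top_n` parameter, and the sample weights are hypothetical values made up for illustration, not results from the lab.

```python
def search(term, weights, top_n=3):
    """Rank documents by the TF-IDF weight of a single term, highest first."""
    scored = [
        (doc_id, doc_weights[term])
        for doc_id, doc_weights in weights.items()
        if term in doc_weights
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical precomputed weights: {document title: {term: weight}}.
weights = {
    "Apache Spark": {"spark": 0.42, "cluster": 0.10},
    "Amazon EMR": {"spark": 0.15, "emr": 0.50},
    "Python": {"language": 0.30},
}
results = search("spark", weights)
```

Documents that do not contain the term are left out of the ranking, so a term absent from the corpus returns an empty list.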
