Preparing TF-IDF with Spark and EMR Studio

Description

In this lab, you'll work with a raw dataset from Wikipedia to build a custom TF-IDF (Term Frequency-Inverse Document Frequency) model. After cleaning the data and training the model, you'll apply it to perform a pseudo-search that retrieves meaningful results for a given term. The lab runs on EMR (Elastic MapReduce), utilizing:

- EMR Workspace
- EMR Notebook running PySpark
- An EMR Studio instance, supported by an EMR cluster

The integration of these tools makes it possible to process and analyze the data efficiently at scale.
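To make the idea concrete, here is a minimal sketch of TF-IDF weighting and a pseudo-search in plain Python. It is not the notebook code from this lab (which runs on PySpark); the function names `tokenize`, `build_tfidf`, and `search`, the three sample documents, and the choice of `tf = count / document length` with `idf = ln(N / df)` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a stand-in for the cleaning step).
    return re.findall(r"[a-z]+", text.lower())


def build_tfidf(docs):
    """Map each doc id to {term: tf * idf}, given docs as {doc_id: raw_text}."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def search(weights, term, top_n=3):
    """Return up to top_n (doc_id, weight) pairs with a positive weight for term."""
    term = term.lower()
    hits = [(doc_id, w[term]) for doc_id, w in weights.items() if w.get(term, 0.0) > 0.0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical sample documents, not the Wikipedia dataset used in the lab.
docs = {
    "spark": "Apache Spark is an engine for large scale data processing",
    "emr": "Amazon EMR runs Spark clusters for data processing",
    "cats": "Cats are small domesticated animals",
}
weights = build_tfidf(docs)
results = search(weights, "spark")
```

With these sample documents, a term that appears in every document gets a weight of zero (its IDF is `ln(1)`), so it never shows up in search results. That is the property that lets TF-IDF surface distinctive terms rather than common ones.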

Languages and Utilities Used

EMR Studio
Spark
AWS Console
Python

Environments Used

Windows 11 (21H2)

Program walk-through:

Launch an EMR Workspace running PySpark:

Loading raw data from Wikipedia into the custom TF-IDF model:
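The cleaning that precedes this step might look roughly like the sketch below. The exact format of the raw Wikipedia dataset is not described here, so the markup patterns handled (HTML-style tags, `[[target|label]]` links, simple `{{...}}` templates) and the function name `clean_wiki_text` are assumptions, not the lab's actual code.

```python
import re


def clean_wiki_text(raw):
    """Strip common wiki markup and return a list of lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)  # drop HTML-style tags
    # Replace [[target|label]] with label, and [[target]] with target.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # drop simple, non-nested templates
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters and whitespace only
    return text.split()


# Hypothetical input line, for illustration only.
tokens = clean_wiki_text(
    "'''Spark''' is a [[cluster computing|cluster]] engine.<br/>{{citation needed}}"
)
```

In a PySpark notebook a function like this would typically be applied to each record of the dataset before term counts are computed. The regular expressions here are deliberately simple and will not handle nested templates.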

Processing and analyzing the data efficiently, running each notebook cell with Shift+Enter:

TF-IDF data model complete:
