In this lab, you'll work with a raw dataset from Wikipedia to build a custom TF-IDF (Term Frequency-Inverse Document Frequency) model. After cleaning the data and training the model, you'll apply it to perform a pseudo-search that retrieves meaningful results for a given term. The lab runs on EMR (Elastic MapReduce), using:
- EMR Workspace
- EMR Notebook running PySpark
- An EMR Studio instance, supported by an EMR cluster
The integration of these tools enables us to process and analyze the data efficiently at scale.
Languages and Utilities Used
EMR Studio
Spark
AWS Console
Python
Environments Used
Windows 11 (21H2)
Program walk-through:
Launch an EMR Workspace running PySpark:
Loading raw data from Wikipedia into the custom TF-IDF model:
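The lab's notebook code is not reproduced here, so the following is only a minimal plain-Python sketch of what a custom TF-IDF computation can look like. It assumes one common weighting, term frequency `count / total` times inverse document frequency `log(N / df)`; the lab may use a different variant. The `tokenize` and `tf_idf` helpers and the toy `docs` corpus are hypothetical names and data for illustration, not part of the lab. In the lab itself the same steps would run in PySpark across the cluster.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return {doc_id: {term: weight}} for a dict of {doc_id: raw_text}."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for words in tokens.values() for term in set(words))
    weights = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        # Assumed weighting: (count / total) * log(N / df).
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical toy corpus standing in for the cleaned Wikipedia articles.
docs = {
    "a": "Spark runs on a cluster",
    "b": "A cluster runs notebooks",
    "c": "Wikipedia articles mention Spark",
}
weights = tf_idf(docs)
```

With this weighting, a term that appears in only one document (such as "on" in document "a") scores higher than a term of equal frequency that appears in several documents (such as "spark").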
Processing and analyzing data efficiently, running each notebook cell with Shift+Enter:
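The pseudo-search step can be sketched as a simple ranking over TF-IDF weights. This plain-Python example assumes the weights are already available as a nested dict of the form `{doc_id: {term: weight}}`. The `search` function, its `top_n` parameter, and the numbers below are hypothetical and made up for illustration; they are not results from the lab.

```python
def search(weights, term, top_n=5):
    """Rank documents by the TF-IDF weight of a single term, highest first."""
    term = term.lower()
    hits = [
        (doc_id, doc_weights[term])
        for doc_id, doc_weights in weights.items()
        if term in doc_weights
    ]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_n]


# Made-up weights for illustration only; they are not results from the lab.
weights = {
    "doc1": {"spark": 0.30, "cluster": 0.10},
    "doc2": {"spark": 0.05},
    "doc3": {"notebook": 0.40},
}
results = search(weights, "Spark")
```

Documents that do not contain the term are left out of the ranking, so a term absent from the corpus returns an empty list.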