-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathselect_tokens.py
35 lines (30 loc) · 1.63 KB
/
select_tokens.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
import pandas as pd
# Extract the 50 highest-scored verbs for each document (romance and non-romance)
# given this verb scores 0 in the other document
# Making dataframe
df = pd.read_csv("output/tf_idf_romance_matrix_v4.csv")
# 50 highest-scored verbs for Romance, while scoring 0 for Non-romance
df_romance = df.loc[df.non_romance == 0.0]
select_romance_verbs = df_romance.nlargest(54, ['romance'])
export_selected_romance = select_romance_verbs.to_csv(f'output/highest_tf_idf_romance_verbs_x50.csv', header=True)
# 50 highest-scored verbs for Non-romance, while scoring 0 for Romance
df_non_romance = df.loc[df.romance == 0.0]
select_non_romance_verbs = df_non_romance.nlargest(50, ['non_romance'])
export_selected_non_romance = select_non_romance_verbs.to_csv(f'output/highest_tf_idf_non_romance_verbs_x50.csv', header=True)
"""
First attempt of selecting 'romance' word resulted in 04 words LSA does not recognize:
For romance corpus:
Can't find any terms from text: 'contort'
Can't find any terms from text: 'dishevel'
Can't find any terms from text: 'enrol'
Can't find any terms from text: 'appal'
So, in second attempt, 54 highest were extract, and these 4 words were mannually deleted that LSA space.
"""
"""
Further on working with dataframe through Pandas:
https://www.geeksforgeeks.org/python-pandas-dataframe-where/
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html
https://datatofish.com/select-rows-pandas-dataframe/
This can also be achieved using NumPy:
https://stackoverflow.com/questions/41298073/how-to-get-the-most-representative-features-in-the-following-tfidf-model
"""