Run machine learning locally on all target URLs and filter them before asking questions #204
base: master
Changes from 12 commits
f633abd
00273b5
43a6899
7103095
39c4eaa
e32d37f
f30eecb
6c321d0
d117cf3
6219d30
fe8899f
87475b8
4e6f14f
49aa81b
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
#!/bin/bash | ||
set -e | ||
|
||
|
||
# Dependency libraries | ||
. ./lib/url-helper.sh | ||
. ./lib/string-helper.sh | ||
|
||
get_text_by_url() { | ||
set +e | ||
url=$1 | ||
res=$(wget -q -O - --tries=1 --timeout=5 --dns-timeout=5 --connect-timeout=5 --read-timeout=5 "$url") | ||
|
||
if [ $? -ne 0 ]; then | ||
return 1 | ||
fi | ||
title=$(get_title_by_res "$res"|sed "s/\"/ /g") | ||
desc=$(get_desc_by_res "$res" | remove_newline_and_comma | sed "s/\"/ /g") | ||
set -e | ||
echo "$title $desc" | ||
} | ||
|
||
|
||
echo "" > ./data/eval.csv | ||
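while IFS= read -r line; do | ||
echo "$line" | ||
md5=$(echo "$line" | cut -d',' -f 1) | ||
# Take everything after the first comma so that URLs containing commas stay intact | ||
url=$(echo "$line" | cut -d',' -f 2-) | ||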
if [[ $url == "" ]]; then | ||
continue | ||
fi | ||
# Under set -e, a failed fetch would otherwise abort the whole script | ||
text=$(get_text_by_url "$url") || continue | ||
if [[ $text == "" ]]; then | ||
continue | ||
fi | ||
echo "$md5,$text" >> ./data/eval.csv | ||
done < ./data/urls-md5.csv | ||
Fetch all 40,000 URLs listed in urls-md5.csv with wget, extract the text, and shape it into a CSV that can be used for machine learning.
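The fetch, extract, and CSV-shaping steps described in the comment above can be sketched offline as follows. This is a minimal sketch, not the PR's implementation: `extract_title`, `sanitize_csv_field`, and `build_row` are hypothetical names, and the real `get_title_by_res` and `get_desc_by_res` helpers in lib/url-helper.sh may behave differently (for example, this sketch only matches a lowercase `<title>` tag and ignores the description).

```shell
#!/bin/bash

# Hypothetical helper (not from lib/url-helper.sh):
# pull the text of the first <title> element out of an HTML string.
extract_title() {
  printf '%s' "$1" | tr '\n' ' ' | sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p'
}

# Hypothetical helper: make a value safe for a naive comma-separated file
# by replacing newlines, commas, and double quotes with spaces.
sanitize_csv_field() {
  printf '%s' "$1" | tr '\n,"' '   '
}

# Build one "md5,text" row from an md5 and an HTML string,
# without touching the network.
build_row() {
  local md5=$1 html=$2 title
  title=$(extract_title "$html")
  printf '%s,%s\n' "$md5" "$(sanitize_csv_field "$title")"
}
```

Keeping the row-building logic separate from the `wget` call makes it possible to check the CSV shaping against a fixed HTML string before running it over the full URL list.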
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
#!/bin/bash | ||
set -e | ||
|
||
# Concatenate the files into one | ||
# Sort | ||
# Remove duplicates | ||
cat ./tmp/grep_コロナ_*.txt.tmp | sort | uniq > ./tmp/grep_aggregate.txt |
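The concatenate, sort, and de-duplicate pipeline above can also be written with a single `sort -u`. The following is a minimal sketch under assumptions: `aggregate_lines` is a hypothetical name, and `LC_ALL=C` is an added choice that forces byte-wise ordering, which may differ from the ordering the original script produces under the system locale.

```shell
#!/bin/bash

# Hypothetical helper: merge the given files, sort the lines,
# and drop duplicates. `sort -u` does the work of `cat | sort | uniq`
# in a single process. LC_ALL=C (an assumption of this sketch) makes
# the comparison byte-wise, so only identical lines are collapsed.
aggregate_lines() {
  LC_ALL=C sort -u -- "$@"
}
```

A call such as `aggregate_lines ./tmp/*.txt.tmp > ./tmp/out.txt` would then replace the three-stage pipeline; the file names in that call are placeholders.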
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,13 +16,9 @@ set -e | |
# Dependency libraries | ||
. ./lib/url-helper.sh | ||
|
||
# Concatenate the files into one | ||
# Sort | ||
# Remove duplicates | ||
cat ./tmp/grep_コロナ_*.txt.tmp | sort | uniq > ./tmp/results.txt | ||
|
||
# Extract only the URLs from result.txt | ||
urls=$(cat ./tmp/results.txt | cut -d':' -f 1 | sed -z 's/\.\/www-data\///g') | ||
urls=$(cat ./tmp/grep_aggregate.txt | cut -d':' -f 1 | sed -z 's/\.\/www-data\///g') | ||
Comment on lines 20 to +21
With this change
|
||
|
||
echo "" > ./tmp/urls.txt | ||
|
||
|
@@ -34,7 +30,7 @@ done | |
# Sort, then remove duplicates with uniq | ||
sort < ./tmp/urls.txt | uniq > ./tmp/urls-uniq.txt | ||
|
||
echo "" > ./urls-md5.csv | ||
echo "" > ./data/urls-md5.csv | ||
|
||
for domain_and_path in `cat ./tmp/urls-uniq.txt`; do | ||
# domain=example.com | ||
|
@@ -51,5 +47,5 @@ for domain_and_path in `cat ./tmp/urls-uniq.txt`; do | |
# url=https://example.com/foo/bar.html | ||
url="$schema//$domain/$path" | ||
md5=`get_md5_by_url $url` | ||
echo "$md5,$url" >> ./urls-md5.csv | ||
echo "$md5,$url" >> ./data/urls-md5.csv | ||
done |
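The loop above writes one `md5,url` row per URL. The `get_md5_by_url` helper lives in lib/url-helper.sh and is not shown in this diff, so the following is only a sketch of how such a key could be computed. `url_md5` is a hypothetical name, and the sketch assumes the `md5sum` command from GNU coreutils is available.

```shell
#!/bin/bash

# Hypothetical stand-in for get_md5_by_url (the real helper is not shown).
# printf '%s' is used instead of echo so that no trailing newline
# is fed into the hash; a newline would produce a different digest.
url_md5() {
  printf '%s' "$1" | md5sum | cut -d' ' -f1
}
```

Whatever the real helper does, the same function has to be used wherever the md5 is recomputed; otherwise rows in urls-md5.csv and eval.csv cannot be joined on that key.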
Is the file path tmp/eval.csv intentional? The other references to eval.csv appear to use data/eval.csv.