How to Scrape Tweets From Twitter (updated)
For initial setup, create a virtual environment and activate it:
$ python3 -m venv venv
$ . venv/bin/activate
Then install requirements:
$ pip install -r requirements.txt
In later sessions, run only the second command above to reactivate the virtual environment.
The first scraper (Scraper.py) used snscrape, which can no longer retrieve data from Twitter.
Scraper2 uses Tweepy to access the Twitter API.
NOTE: Scraper2 has not been tested, because Twitter API access costs $100/mo.
To run it, type python followed by the file name:
$ python Scraper2.py
Archived
NOTE: Reddit Scraper Broken as of 12/14/22.
Scraper uses snscrape to scrape Twitter and Reddit posts. Thanks to Martin Beck for his write-up How to Scrape Tweets With snscrape, which got me started. See his files under /TwitterScraper. The files in that directory are not necessary to run Scraper.py.
- Type python3 Scraper.py
- Choose:
- To search on Twitter. This accepts the same advanced search operators as the Twitter search box.
- To search on Reddit.
- To search for a subreddit matching the term you entered. Results should show posts from that subreddit if it exists. These results aren't complete: they show each post's URL but not the post itself.
- Type the maximum number of results to receive.
- Type a search term or terms.
- Type a filename prefix (random numbers and the count will be appended to this name).
- Output is a .csv file with the full name shown in the console.
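The last two steps above (naming the output file and writing the .csv) could be sketched roughly as follows. This is an illustration, not the code in Scraper.py: the function names build_filename and write_rows, the four-digit random suffix, and the underscore-separated pattern are all assumptions. The README only states that random numbers and the count are appended to the prefix.

```python
import csv
import random


def build_filename(prefix, count, rng=random):
    """Return '<prefix>_<random digits>_<count>.csv' (hypothetical naming pattern)."""
    # A random suffix keeps repeated runs with the same prefix from overwriting each other.
    suffix = rng.randint(1000, 9999)
    return f"{prefix}_{suffix}_{count}.csv"


def write_rows(path, header, rows):
    """Write a header line and a list of rows to a CSV file; return the row count."""
    # newline="" lets the csv module control line endings, as its documentation recommends.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

A call such as write_rows(build_filename("tweets", 2), ["url", "content"], rows) would then produce one file per run, with the column names chosen by the caller.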
- make first choice a function
- make if statements into functions
- make it so that you can go back and make a different choice
- twitter from:username search
- twitter from:username since: until: options
- twitter search -- choose from straight up search to username, since, until
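The from:, since: and until: items above amount to assembling a search string from optional parts. A minimal sketch of that idea, assuming a hypothetical helper named build_twitter_query (it does not exist in Scraper.py) and dates given as YYYY-MM-DD:

```python
def build_twitter_query(terms="", username=None, since=None, until=None):
    """Combine search terms with optional from:/since:/until: operators."""
    parts = []
    if terms.strip():
        parts.append(terms.strip())
    if username:
        # Accept the name with or without a leading @.
        parts.append(f"from:{username.lstrip('@')}")
    if since:
        parts.append(f"since:{since}")  # expected as YYYY-MM-DD
    if until:
        parts.append(f"until:{until}")  # expected as YYYY-MM-DD
    return " ".join(parts)


print(build_twitter_query("python", username="@example_user",
                          since="2022-01-01", until="2022-12-01"))
# prints: python from:example_user since:2022-01-01 until:2022-12-01
```

Keeping the query construction in one function would also make it easier to offer the plain search and the username search from the same menu choice.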
- print to csv: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
- https://www.w3schools.com/python/ref_func_input.asp
- https://www.freecodecamp.org/news/python-convert-string-to-int-how-to-cast-a-string-in-python/
- https://docs.python.org/3/library/json.html#basic-usage
- https://stackoverflow.com/questions/68453165/python-can-you-refresh-a-variable-to-re-initialize-with-new-sub-variables
- https://docs.python.org/3/library/csv.html
- https://stackoverflow.com/questions/18791882/how-to-make-program-go-back-to-the-top-of-the-code-instead-of-closing -- for the if/elif/else example
- https://stackoverflow.com/questions/39612262/how-to-convert-a-large-json-file-into-a-csv-using-python -- convert json lines to csv (a lifesaver)
- https://github.com/JustAnotherArchivist/snscrape/blob/master/snscrape/modules/twitter.py
- https://github.com/JustAnotherArchivist/snscrape
- https://stackoverflow.com/questions/44287011/valueerror-expected-object-or-value-when-reading-json-as-pandas-dataframe
- https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas -- dead end, ended up using csv.writer
- https://www.statology.org/valueerror-trailing-data/ -- same as above
- https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html#min-tut-02-read-write -- future to do, use pandas for data manipulation
- https://www.w3schools.com/python/gloss_python_elif.asp
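Several of the links above concern turning JSON Lines output (one JSON object per line) into a .csv with csv.writer. A rough sketch of that approach, reading one line at a time so the file is never parsed as a single JSON document. The function name jsonl_to_csv and the field names url and content are assumptions for illustration, not the code or columns used by Scraper.py:

```python
import csv
import io
import json


def jsonl_to_csv(lines, out, fields):
    """Read JSON Lines from `lines` and write the chosen fields to `out` as CSV."""
    writer = csv.writer(out)
    writer.writerow(fields)  # header row
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # each line is its own JSON object
        # Missing keys become empty cells instead of raising KeyError.
        writer.writerow([record.get(field, "") for field in fields])
        count += 1
    return count


sample = [
    '{"url": "https://example.com/1", "content": "hello, world"}',
    '{"url": "https://example.com/2"}',
]
buffer = io.StringIO()
written = jsonl_to_csv(sample, buffer, ["url", "content"])
print(written)
# prints: 2
```

With real data, `lines` can be an open file object and `out` a file opened with newline="".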
Reddit search help: https://www.reddit.com/wiki/search/
- Summary:
author:name
flair:flairname
- Show text posts only
self:true
- The body of the post:
selftext:term
- The domain of the submitted URL:
site:domain
- The submission's subreddit:
subreddit:name
- The submission title:
title:term
- The submission's URL (the website's address):
url:address
- Combined search:
author:name subreddit:name searchterm
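The combined search above can also be assembled programmatically. A small sketch, assuming a hypothetical helper build_reddit_query (not part of Scraper.py) that accepts only the field names listed in this section:

```python
REDDIT_FIELDS = ("author", "flair", "self", "selftext", "site",
                 "subreddit", "title", "url")


def build_reddit_query(term="", **filters):
    """Build a 'field:value ... searchterm' string from keyword filters."""
    unknown = set(filters) - set(REDDIT_FIELDS)
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    # Emit filters in a fixed order so the same inputs always give the same string.
    parts = [f"{field}:{filters[field]}" for field in REDDIT_FIELDS if field in filters]
    if term:
        parts.append(term)
    return " ".join(parts)


print(build_reddit_query("searchterm", author="name", subreddit="name"))
# prints: author:name subreddit:name searchterm
```

Rejecting unknown field names early is a design choice here: a typo such as subredit would otherwise be sent to Reddit as plain search text.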