Use Twitter API v2 #1075
Comments
Fyi, some work is done on the branch https://github.com/sebastian-nagel/sfm-twitter-harvester/tree/twarc2:
Backward-incompatible changes in the v2 API:
TODOs are
So far, we'll likely not be able to do all the remaining upgrade work in the next days/weeks. In case the work is useful, feel free to pick whatever you want. Thanks! |
I've done some preliminary work on integrating @sebastian-nagel's harvester code side-by-side with existing Twitter v1 REST API harvests and wired some of the harvest types up to sfm-ui. See:
Assumptions include:
So far, on these branches you can:
Remaining work on this includes:
Note that twitter_rest_warc_iter.py still contains I occasionally get 400 errors from the API if I use the exact same seed twice in a search. I haven't figured out what might be prompting that, but a new seed will run just fine. Just wanted to report that so you're aware. |
Thank you, @lwrubel ! |
Hi @lwrubel, thanks a lot! I think we'll switch to your branch. We'll share all experiences - but for now: happy holidays! |
Hi @sebastian-nagel, we're planning to work on implementing support for Twitter v. 2 this summer, probably starting in July. (We did a previous sprint this semester to identify impacts on the UI and the database models, especially as concerns the new filter stream API.) Since you had already started on this work, we're wondering if you'd be interested in, and have the bandwidth for, collaborating with us this summer. If so, we could coordinate work on a sprint. If not, you have our gratitude for getting us started! Best, |
Hi @dolsysmith, great to hear! I cannot promise that I can take part in the sprint. But I'm happy to have a look at the specification or implementation. In June, we plan to bring one SFM Twitter harvester using the v2 API into production, in order to harvest longer user timelines. But this would not include any new features. |
Thanks, @sebastian-nagel -- totally understandable. And we'd definitely be interested to hear about your experience bringing the v2 timeline into production. |
I've also tested the changes in the branch twitter-v2.
Fixed in gwu-libraries/sfm-twitter-harvester#55
Some kind of auto-detection would be helpful here. |
@dolsysmith, the roadmap document isn't publicly readable. Is this intended? We've brought the SFM Twitter v2 harvester into production:
|
Sorry about that, @sebastian-nagel. Our enterprise version of Google Docs doesn't allow public sharing, but I'll post it in a different format once the team here has had a chance to discuss the roadmap (probably later today). And thanks for your update! |
@sebastian-nagel A couple of questions:
|
1 - yes, 2 - no (only had a look - it seems that Twarc's json2csv.py does not yet support v2 API results) |
Thanks, @sebastian-nagel. We'll be working on the exporter during this sprint; in twarc2, the JSON-to-CSV utility has been separated into a new library, but in my testing at the command line, it works quite well. |
I added a condensed version (without all the working notes) of our roadmap for v. 2 to the repo's wiki. |
Initial observations on testing the twitter-v2 branch with one of the PRs contributed by @sebastian-nagel:
|
That's added by the Twarc client. But SFM captures the HTTP traffic between the client and the Twitter server. Because there's no |
@kerchner @adhithyakiran @sebastian-nagel I pushed a new commit to t1103-exporter-v2 on the sfm-twitter-harvester repo. This version uses the twarc_csv code more efficiently than in my first attempt, so exporting to CSV/Excel/TSV seems to take time comparable to exports for v. 1. For further discussion/testing:
|
Update on the above: latest code in the branch now implements a Note that for CSV and TSV files, the app will respect the segment size selected by the user for creating the number of files (e.g., 250K, 1M, etc.) by doing an append operation. But the current Excel engine in use in the app doesn't allow appending, so the segment size for Excel exports is limited to the maximum DataFrame size. (We'll need to change the options in the UI.) |
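For anyone following along, here is a minimal sketch of the append-based segmenting described above. It is not the code in the branch: it uses only the standard-library csv module rather than pandas or twarc_csv, and the function name `write_segments`, the `segment_NNN.csv` file naming, and the batch shape (lists of row dicts) are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_segments(batches, out_dir, fieldnames, segment_size):
    """Append batches of row dicts to CSV files of at most segment_size rows.

    Hypothetical helper: each batch stands in for one converted chunk of
    Tweets. Rows are appended to the current segment file until it is
    full, and only then is a new file (with its own header) started.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    rows_in_current = 0
    for batch in batches:
        remaining = list(batch)
        while remaining:
            if not paths or rows_in_current == segment_size:
                # Start a new segment file and write its header once.
                path = out_dir / f"segment_{len(paths) + 1:03d}.csv"
                with path.open("w", newline="", encoding="utf-8") as f:
                    csv.DictWriter(f, fieldnames=fieldnames).writeheader()
                paths.append(path)
                rows_in_current = 0
            # Fill whatever room is left in the current segment.
            room = segment_size - rows_in_current
            chunk, remaining = remaining[:room], remaining[room:]
            with paths[-1].open("a", newline="", encoding="utf-8") as f:
                csv.DictWriter(f, fieldnames=fieldnames).writerows(chunk)
            rows_in_current += len(chunk)
    return paths
```

Because each batch is appended to a file that is already on disk, the number of rows per file is decoupled from the size of any single in-memory batch. An Excel writer that cannot append would need the whole segment in memory at once, which is the limitation noted above.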
Exporter performance comparison on large dataset (~1 million Tweets):
I suspect that this relative slowness arises from either the overhead of converting the Tweet JSON to a pandas DataFrame (twarc-csv) or from flattening the original JSON response (twarc2). We can try increasing the |