Add simple tool for processing status availability files #41
The `twcc check-existence` tool takes a list of tweet IDs on standard input and outputs a CSV file in which `0` indicates that the tweet is unavailable (either deleted or from a suspended or locked account) and `1` indicates that it's live.

We've published this output for large batches of tweet IDs at `s3://twitter-metadata/status-availability/`. The output there includes rechecks, where we check the same IDs multiple times (for example, to get a fresh view after several months).

It can be useful to combine these results into a single list that indicates the most recently known status for each tweet ID, which is what this tool does. You point it to a directory of CSV files in the format described above, where the filename order corresponds to chronological order, and it outputs a sorted list of each unique tweet ID and its most recent status.
Note that it takes about four minutes to run on our current availability files (representing a hundred-million-ish tweets and retweets, most of which have been checked at least twice).
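
For readers who want the merge logic spelled out, here is a minimal Python sketch of the "latest status wins" idea described above. It is not the tool's actual implementation. It assumes each file holds headerless two-column rows of `tweet_id,status`, that files end in `.csv`, and that sorting filenames lexicographically gives chronological order; the function name `latest_statuses` is hypothetical.

```python
import csv
from pathlib import Path


def latest_statuses(directory):
    """Return sorted (tweet_id, status) pairs with the most recent status per ID."""
    latest = {}
    # Assumption: lexicographic filename order matches chronological order.
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                # Assumption: column 0 is the tweet ID, column 1 is 0 or 1.
                tweet_id, status = int(row[0]), int(row[1])
                # Later files overwrite earlier ones, so the newest check wins.
                latest[tweet_id] = status
    # Sort numerically by tweet ID.
    return sorted(latest.items())
```

Holding every ID in an in-memory dictionary keeps the sketch short, but it would likely need a more compact structure to handle input on the scale mentioned above.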