Add simple tool for processing status availability files #41

Merged
merged 1 commit into from
Oct 14, 2021

Conversation

travisbrown
Owner

The twcc check-existence tool reads a list of tweet IDs from standard input and outputs CSV like this, where 0 indicates that the tweet is unavailable (either deleted or from a suspended or locked account) and 1 indicates that it's live:

9142291,1
10643701,1
18984851,1
34661512,1
48826972,1
72706082,1
75507232,1
102425362,0
119141862,1
120506682,0

We've published this output for large batches of tweet IDs at s3://twitter-metadata/status-availability/. The output there includes rechecks, where we check the same IDs multiple times (to get a fresh view after months, etc.).

It can be useful to combine these results into a single list that indicates the most recently known status for each tweet ID, which is what this tool does. You point it to a directory of CSV files like the example above, where the filenames correspond to chronological order, and it outputs a sorted list of each unique tweet ID and its most recent status.
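
For illustration only, here is a minimal sketch of that "latest status wins" merge. It is not the code in src/bin/availability.rs; the function names (parse_line, merge_latest, merge_dir) are hypothetical, and it assumes that sorting the file paths lexicographically gives chronological order and that everything fits in memory.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// Parse one `id,status` row. Returns None for anything that is not
/// a numeric ID followed by 0 or 1. (Hypothetical helper.)
fn parse_line(line: &str) -> Option<(u64, bool)> {
    let (id, status) = line.trim().split_once(',')?;
    let id = id.parse::<u64>().ok()?;
    match status {
        "0" => Some((id, false)),
        "1" => Some((id, true)),
        _ => None,
    }
}

/// Merge the contents of several availability files, given oldest first.
/// A later row for the same ID overwrites the earlier one, and the
/// BTreeMap keeps the result sorted by tweet ID.
fn merge_latest<'a, I>(files: I) -> BTreeMap<u64, bool>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut latest = BTreeMap::new();
    for contents in files {
        for line in contents.lines() {
            if let Some((id, available)) = parse_line(line) {
                latest.insert(id, available);
            }
        }
    }
    latest
}

/// Read every file in `dir`, ordered by path (assumed to be chronological),
/// and merge them. Reading whole files into memory is a simplification.
fn merge_dir(dir: &Path) -> io::Result<BTreeMap<u64, bool>> {
    let mut paths: Vec<_> = fs::read_dir(dir)?
        .collect::<Result<Vec<_>, _>>()?
        .into_iter()
        .map(|entry| entry.path())
        .collect();
    paths.sort();

    let mut contents = Vec::new();
    for path in &paths {
        contents.push(fs::read_to_string(path)?);
    }
    Ok(merge_latest(contents.iter().map(String::as_str)))
}
```

Iterating over the map returned by merge_dir and printing each entry as id,0 or id,1 would give output in the same shape as the input files. At the scale of the published files a streaming approach is probably a better fit than holding every row in one map.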

Note that it takes about four minutes to run on our current availability files (representing a hundred-million-ish tweets and retweets, most of which have been checked at least twice).

@codecov-commenter

Codecov Report

Merging #41 (7012473) into main (5afe23f) will decrease coverage by 0.19%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main      #41      +/-   ##
==========================================
- Coverage   26.63%   26.44%   -0.20%     
==========================================
  Files          44       45       +1     
  Lines        2775     2795      +20     
==========================================
  Hits          739      739              
- Misses       2036     2056      +20     
Impacted Files Coverage Δ
src/bin/availability.rs 0.00% <0.00%> (ø)

@travisbrown travisbrown merged commit e3d6a9c into main Oct 14, 2021
@travisbrown travisbrown deleted the topic/availability-tool branch October 14, 2021 13:21