How to deal with images that cannot be downloaded? #3
On my machine, ~94% of images can be downloaded successfully.
Thanks for your nice advice! Since the raw image (…
Hi, I advise you to set up a proper DNS resolver to increase your success rate.
@rom1504 Thanks, I will give it a try. What proportion of downloads succeed on your machine?
Hey @vtddggg. I'm also getting a 94-95% success rate currently. I suspect this difference will matter little in terms of the accuracy of the trained models. In many experiments we found that changing the size of a dataset drawn from the same distribution has little impact on performance. For example, in Figure 3 we show that using only 50% of the unfiltered pool performs very closely to using the entire pool (see https://arxiv.org/abs/2304.14108).
We understand that releasing the tars directly would make things simpler for participants. However, our dataset is designed to be an index to public images on the internet, which means that if any image is deleted from its original source, it will also be deleted from our dataset. Releasing the tars directly would mean creating public copies of the images, which is problematic. For those reasons, we won't be able to share the data directly, and we hope you understand the decision.
Thanks for sharing your success rate.
Hi, I believe the success rate will keep dropping over time. We downloaded the 45TB set and our rates look as follows: Success: 89.9%. So ~10% of the images are already missing.
Hello @pfischer-nvidia,
We did change the DNS servers to the Google ones (8.8.8.8 and 8.8.4.4), but we did not change the resolver to bind9. Instead we used dnspython.
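For reference, a minimal dnspython sketch of the kind of setup described above; the domain queried is only an example:

```python
import dns.resolver

# Point dnspython at Google's public DNS servers instead of the system resolver.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8", "8.8.4.4"]

# Sanity-check that resolution works through the new servers.
answer = resolver.resolve("example.com", "A")
print([rr.to_text() for rr in answer])
```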
I found that I can increase the success rate by increasing the number of retries. When I set retries=2, I got a ~95% success rate. Just add the retries=2 setting on line 136 of download_upstream.py, like this:
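A sketch of what the img2dataset call inside download_upstream.py might look like with the retry setting added; the surrounding arguments and paths are illustrative assumptions, and only retries=2 is the actual change:

```python
from img2dataset import download

download(
    url_list="metadata/",        # placeholder path to the url list
    output_folder="shards/",     # placeholder output location
    processes_count=16,
    thread_count=128,
    output_format="webdataset",
    retries=2,                   # retry failed downloads up to 2 times
)
```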
Updated download_upstream!
Hi, we are getting a success rate of ±86% (downloading the small, 12-million-image dataset):

15it [44:26, 19.65s/it]
worker - success: 0.867 - failed to download: 0.128 - failed to resize: 0.005 - images per sec: 4 - count: 10000
total - success: 0.865 - failed to download: 0.130 - failed to resize: 0.005 - images per sec: 56 - count: 150000

We are using the knot DNS resolver (8 instances), the default 16 processes, and just 16 threads (instead of the default 128); we did this to slow down the DNS resolve requests. We use an e2-highcpu-32 (Efficient Instance, 32 vCPUs, 32 GB RAM) instance on GCP. 86% is the best we could get, but due to the low number of threads the downloading is very slow. Can you share any more details on how you got the 95% success rate (configuration, machine type, number of threads and processes, GCP/AWS/Azure, DNS resolver, etc.) and any advice to improve speed and success rate? Many thanks!
My current download success rate is around 94%. My suggestion is to use wandb to monitor the download process; this way, you can analyze the reasons for download failures more effectively. I've shared the analysis of the download process recorded on my end using wandb; my wandb report link is here. It's evident from the analysis that the primary reason for failures is invalid download links (2% of the links are invalid), while 0.8% are due to IP bans. Your current download speed appears to be very slow; bandwidth and CPU utilization don't seem to be fully occupied, and it doesn't look like a DNS server problem either. I believe the main reason for the failures could be related to your IP. You could try downloading the dataset from an IP address in a different region. If your bandwidth and CPU utilization aren't maxed out, increasing the number of processes and threads might not significantly impact your download success rate. My thread-to-process ratio config is 4:1.
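A sketch of how wandb logging can be enabled in an img2dataset-style download call; the paths and project name are placeholders, and the 4:1 thread-to-process ratio mirrors the config mentioned above:

```python
from img2dataset import download

download(
    url_list="metadata/",               # placeholder path to the url list
    output_folder="shards/",            # placeholder output location
    processes_count=16,
    thread_count=64,                    # 4:1 thread-to-process ratio
    retries=2,
    enable_wandb=True,                  # stream per-shard success/failure stats
    wandb_project="datacomp-download",  # placeholder project name
)
```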
Answered in #39 (comment).
This is the wandb output we are getting (https://api.wandb.ai/links/alexander-remmerie/n1kikknk for the full report). As you can see, most of the errors are network-unreachable errors and network-timeout errors. Knot CPU usage is 2-5%, so knot is indeed being used as the DNS resolver. resolv.conf file:
nameserver 127.0.0.1
Ok, I see. Yeah, the network error is about something lower level. It could be a kernel setting, e.g. a limit on the number of open files, a local network card limit, or a limit on your local router.
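A small sketch for checking one such limit from Python, assuming a Linux/macOS host; this inspects and raises the per-process open-file limit, which parallel downloaders can exhaust:

```python
import resource

# Inspect the per-process open-file limit (soft and hard values).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Raise the soft limit to the hard ceiling for this process.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```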
We managed to get 92% finally. For future reference: we didn't have a public IP (for security reasons) on GCP. By giving the instance a public IP we were able to get much faster downloads. Now using 8 instances of the knot DNS resolver, 88 cores, 88 processes, and 128 threads, we can download the small dataset (13 million images, 450 GB) in about 2 hours.
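Back-of-the-envelope arithmetic on those numbers (an estimate implied by the figures above, not a measurement):

```python
# 13M images in ~2 hours implies roughly:
images = 13_000_000
seconds = 2 * 3600
print(f"{images / seconds:.0f} images/sec")  # ~1806 images/sec across 88 processes
```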
@rom1504 @Vaishaal Can this not be solved without the DNS (or other networking) setup? Even 92% is not the full download, and how do I know my 92% is the same as @alexanderremmerie's 92%? Is there a way we can move towards more 'immutability' for these open vision-language datasets?
The reality of the web is that it's ever-changing, and a number of laws restrict redistribution. That's even more true once you go beyond the web and towards world mutability. All that said, if you really want a few billion images that you can redistribute to everyone and that should stay up for many years, then I think the only way is to build a service that hosts only perpetually-granted public-domain images, and then probably incentivize a lot of people to put content on it. TLDR: it's an interesting topic, and there are a lot of possible solutions. However, tuning a downloader tool while keeping the same collection of image links extracted from the web is unlikely to achieve immutability.
Thanks for your great and meaningful competition.
When I run
python download_upstream.py --scale $scale --data_dir $data_dir
only around 91% of the images can be downloaded successfully. This means the actual pool size will be smaller than the given pool size (< 12.8M). How should we deal with that? I think participants with more candidate samples will definitely benefit more.