
How to deal with images that cannot be downloaded? #3

Closed
vtddggg opened this issue May 5, 2023 · 20 comments

vtddggg commented May 5, 2023

Thanks for your great and meaningful competition.

When I run python download_upstream.py --scale $scale --data_dir $data_dir, only around 91% of the images download successfully. This means the actual pool size will be smaller than the stated pool size (< 12.8M).

How should we deal with that? I think participants who end up with more candidate samples will clearly benefit more.

zzzzzero commented May 6, 2023

On my machine, ~94% of the images download successfully.
I'm also wondering if there's a way to re-download the failed parts. I've noticed that some downloads fail because of broken links or network issues, and a considerable portion of the failed links can be downloaded successfully on a second attempt. If no such method exists, we may have to write a Python script ourselves.
Here are two things that can improve the download success rate: 1. Reduce the number of download threads and processes. 2. Change the network or machine used for downloading.
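
A minimal sketch of such a retry script, assuming img2dataset's per-shard output parquets contain url, caption, and status columns (the column names, the shards/ path, and the retry output location are assumptions for illustration, not something specified in this thread):

# Sketch: collect URLs that failed in a previous img2dataset run and retry them.
# Assumes each shard's .parquet written by img2dataset has "url", "caption",
# and "status" columns; verify the names against your own output first.
from pathlib import Path

import pandas as pd
import img2dataset

SHARD_DIR = Path("shards")               # output_folder of the original run (placeholder)
RETRY_LIST = Path("failed_urls.parquet")

frames = []
for parquet_file in SHARD_DIR.glob("*.parquet"):
    df = pd.read_parquet(parquet_file, columns=["url", "caption", "status"])
    frames.append(df.loc[df["status"] != "success", ["url", "caption"]])

if not frames:
    raise SystemExit("no shard parquets found under " + str(SHARD_DIR))

failed = pd.concat(frames, ignore_index=True).drop_duplicates(subset="url")
failed.to_parquet(RETRY_LIST)

# Re-run the download on the failed subset only.
img2dataset.download(
    url_list=str(RETRY_LIST),
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_folder=str(SHARD_DIR / "retry"),
    output_format="webdataset",
    retries=2,
)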

vtddggg (Author) commented May 6, 2023

Thanks for the helpful advice!

Since the raw image data (tars) is not very large for the small-scale track (450 GB), would the organizers consider releasing the tar data directly? It would help us conduct a more rigorous academic exploration of the image-text data filtering problem.

rom1504 commented May 6, 2023

Hi, I advise setting up the right DNS resolver to increase your success rate.
See the img2dataset README for details.

vtddggg (Author) commented May 8, 2023

@rom1504 Thanks, I will give it a try. What proportion of downloads succeeds on your machine?
Just as a reference for what a reasonable success rate looks like.

@gabrielilharco (Contributor)

Hey @vtddggg. I'm also getting a 94-95% success rate currently. I suspect this difference will matter little in terms of the accuracy of the trained models. In many experiments we found that changing the size of datasets drawn from the same distribution has little impact on performance. For example, in Figure 3 of our paper we show that using only 50% of the unfiltered pool performs very closely to using the entire pool (see https://arxiv.org/abs/2304.14108).

Would the organizers consider releasing the tar data directly?

We understand that releasing the tars directly would make things simpler for participants. However, our dataset is designed to be an index of public images on the internet, which means that if an image is deleted from its original source, it is also deleted from our dataset. Releasing the tars directly would mean creating public copies of the images, which is problematic. For those reasons, we won't be able to share the data directly, and we hope you understand the decision.

vtddggg (Author) commented May 10, 2023

Thanks for sharing your success rate.
I agree with your point. Since changing the size of datasets drawn from the same distribution has little impact on performance, losing some images during downloading is perfectly acceptable.

pfischer-nvidia commented Jun 5, 2023

Hi, I believe the success rate will become lower and lower over time. We downloaded the 45 TB set and our rates look as follows:

Success: 89.9%
Failed to download: 9.5%
Failed to resize: 0.6%

So about 10% is already missing.
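
For reference, you can compute these rates yourself by aggregating the per-shard *_stats.json files that img2dataset writes alongside the shards. A minimal sketch, assuming the stats files expose count, successes, failed_to_download, and failed_to_resize keys (the key names and the shards/ path are assumptions to check against your own run):

# Sketch: aggregate img2dataset's per-shard *_stats.json files into overall rates.
# Key names ("count", "successes", "failed_to_download", "failed_to_resize")
# are assumed; inspect one stats file from your own run first.
import json
from pathlib import Path

totals = {"count": 0, "successes": 0, "failed_to_download": 0, "failed_to_resize": 0}

for stats_file in Path("shards").glob("*_stats.json"):  # "shards" is the output_folder
    with open(stats_file) as f:
        stats = json.load(f)
    for key in totals:
        totals[key] += stats.get(key, 0)

if totals["count"]:
    for key in ("successes", "failed_to_download", "failed_to_resize"):
        print(f"{key}: {100 * totals[key] / totals['count']:.1f}%")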

@afang-story (Contributor)

Hello @pfischer-nvidia,
Sorry for the late reply - can you confirm whether you have tried this: #3 (comment)?

@pfischer-nvidia

We did change the DNS servers to the Google ones (8.8.8.8 and 8.8.4.4), but we did not change the resolver to bind9. Instead we used dnspython.
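
If it helps to quantify how much DNS itself contributes to the failures, here is a rough sketch using dnspython to measure the resolution failure rate on a sample of hostnames from the metadata; the parquet path and the url column name are placeholders:

# Sketch: estimate the DNS resolution failure rate over a sample of hostnames.
# Uses dnspython; "metadata/sample.parquet" and the "url" column are placeholders.
from urllib.parse import urlparse

import dns.resolver
import pandas as pd

urls = pd.read_parquet("metadata/sample.parquet", columns=["url"])["url"].head(1000)
hosts = {urlparse(u).hostname for u in urls if urlparse(u).hostname}

resolver = dns.resolver.Resolver()
resolver.lifetime = 2.0  # give each query at most 2 seconds

failures = 0
for host in hosts:
    try:
        resolver.resolve(host, "A")
    except Exception:
        failures += 1

print(f"DNS failure rate: {100 * failures / len(hosts):.1f}% over {len(hosts)} hosts")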

rom1504 commented Jun 19, 2023

zzzzzero commented Jul 5, 2023

I found that I can increase the success rate by increasing the number of retries: with retries=2, I currently get a ~95% success rate.

Just add the retries=2 setting to the img2dataset.download call on line 136 of download_upstream.py, like this.

img2dataset.download(
    url_list=str(metadata_dir),
    image_size=args.image_size,
    output_folder=str(shard_dir),
    processes_count=args.processes_count,
    thread_count=args.thread_count,
    resize_mode=args.resize_mode,
    resize_only_if_bigger=not args.no_resize_only_if_bigger,
    encode_format=args.encode_format,
    output_format=args.output_format,
    input_format='parquet',
    url_col='url',
    caption_col='text',
    bbox_col=bbox_col,
    save_additional_columns=['uid'],
    number_sample_per_shard=10000,
    oom_shard_count=8,
    retries=2,
)

Vaishaal (Contributor) commented Aug 1, 2023

Updated download_upstream!

Vaishaal closed this as completed Aug 1, 2023
@alexanderremmerie

Hi, we are getting a success rate of about 86% (downloading the small dataset, 12 million images):

15it [44:26, 19.65s/it]worker - success: 0.867 - failed to download: 0.128 - failed to resize: 0.005 - images per sec: 4 - count: 10000 total - success: 0.865 - failed to download: 0.130 - failed to resize: 0.005 - images per sec: 56 - count: 150000
16it [44:49, 20.77s/it]worker - success: 0.860 - failed to download: 0.134 - failed to resize: 0.006 - images per sec: 4 - count: 10000 total - success: 0.864 - failed to download: 0.131 - failed to resize: 0.005 - images per sec: 60 - count: 160000

We are using the knot DNS resolver (8 instances), the default 16 processes, and just 16 threads (instead of the default 128). We did this to slow down the DNS resolution requests. We use an e2-highcpu-32 instance (32 vCPUs, 32 GB RAM) on GCP.

86% is the best we could get, but due to the low number of threads the downloading is very slow. Can you share any more details on how you got the 95% success rate (configuration, machine type, number of threads and processes, GCP/AWS/Azure, DNS resolver, etc.) and any advice to improve speed and success rate? Many thanks!

zzzzzero commented Aug 9, 2023

My current download success rate is around 94%. My suggestion is to use wandb to monitor the download process; that way you can analyze the reasons for download failures more effectively. I've shared the wandb analysis of the download process recorded on my end; my wandb report link is here. It's evident from the analysis that the primary reason for failures is invalid download links (2% of the links are invalid), while 0.8% are due to IP bans.

Your current download speed appears to be very slow, and bandwidth and CPU utilization don't seem to be fully used. It also doesn't look like a DNS server problem. I believe the main reason for the failures could be related to your IP; you could consider downloading the dataset from an IP address in a different region. If your bandwidth and CPU utilization aren't maxed out, increasing the number of processes and threads might not significantly affect your download success rate. My thread-to-process ratio is 4:1.
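
For anyone who wants to reproduce this kind of monitoring: img2dataset has built-in wandb support; if I remember its parameters correctly, something like the sketch below enables it (the enable_wandb and wandb_project parameter names should be verified against the img2dataset version pinned by the repo, and the paths and project name are placeholders):

# Sketch: enable wandb monitoring for an img2dataset run.
# enable_wandb / wandb_project parameter names are assumptions to verify
# against the img2dataset version used by download_upstream.py.
import img2dataset

img2dataset.download(
    url_list="metadata",                # placeholder metadata directory
    input_format="parquet",
    url_col="url",
    caption_col="text",
    output_folder="shards",             # placeholder output directory
    output_format="webdataset",
    retries=2,
    enable_wandb=True,                  # stream per-shard success/failure stats to wandb
    wandb_project="datacomp-download",  # hypothetical project name
)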


rom1504 commented Aug 9, 2023

Answered there: #39 (comment)

alexanderremmerie commented Aug 10, 2023

[wandb dashboard screenshot]

This is the wandb output we are getting (see https://api.wandb.ai/links/alexander-remmerie/n1kikknk for the full report). As you can see, most of the errors are network-unreachable and network-timeout errors. Knot CPU usage is 2-5%, so knot is being used as the DNS resolver. Our resolv.conf file:

(base) jupyter@datacomp-ubuntu:/etc$ cat /etc/resolv.conf
nameserver 127.0.0.1

rom1504 commented Aug 10, 2023 via email

@alexanderremmerie

We finally managed to get 92%. For future reference: we didn't have a public IP on the GCP instance (for security reasons). By giving the instance a public IP we got much faster downloads. Now, using 8 instances of the knot DNS resolver, 88 cores, 88 processes, and 128 threads, we can download the small dataset (13 million images, 450 GB) in about 2 hours.

@nahidalam

@rom1504 @Vaishaal Can this not be solved without the DNS (or other networking) setup? Even 92% is not the full download, and how do I know my 92% is the same as @alexanderremmerie's 92%? Is there a way we can move towards more 'immutability' for these open vision-language datasets?
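
On the question of whether two 92% downloads contain the same samples: since download_upstream.py saves the uid column with each sample, one way to check is to compare the sets of successfully downloaded uids across two runs. A rough sketch, assuming the output parquets expose uid and status columns and using placeholder paths for the two runs:

# Sketch: compare which samples two independent downloads actually captured,
# using the "uid" column saved by download_upstream.py. The "status" column
# name and the run_a/run_b paths are assumptions/placeholders.
from pathlib import Path

import pandas as pd

def downloaded_uids(shard_dir: str) -> set:
    """Collect the uids of successfully downloaded samples from one run's shards."""
    uids = set()
    for parquet_file in Path(shard_dir).glob("*.parquet"):
        df = pd.read_parquet(parquet_file, columns=["uid", "status"])
        uids.update(df.loc[df["status"] == "success", "uid"])
    return uids

a = downloaded_uids("run_a/shards")
b = downloaded_uids("run_b/shards")

overlap = len(a & b)
union = len(a | b) or 1
print(f"run A: {len(a)} uids, run B: {len(b)} uids")
print(f"overlap: {overlap} samples ({100 * overlap / union:.1f}% of the union)")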

rom1504 commented Jan 4, 2024

The reality of the web is that it's ever changing, and a number of laws restrict redistribution.
So I would argue that instead of trying to make the web immutable, it makes more sense to accept its mutability and adapt training and evaluation recipes to it.
For example, being able to estimate that two collections contain the same distribution, in some sense, would go a long way.
Going further, continual training and on-demand dataset collection would be true adaptation to the web.

That's even more true going beyond the web and towards world mutability.

All that said, if you really do want a few billion images that you can redistribute to everyone and that will stay up for many years, then I think the only way is to build a service that hosts only perpetually granted public-domain images, and then probably to incentivize a lot of people to put content on it.

TL;DR: it's an interesting topic, and there are a lot of possible solutions. However, tuning a downloader tool while keeping the same collection of image links extracted from the web is unlikely to achieve immutability.
