
How to deal with images that cannot be downloaded? #3

Closed
vtddggg opened this issue May 5, 2023 · 20 comments

vtddggg commented May 5, 2023

Thanks for your great and meaningful competition.

When I run python download_upstream.py --scale $scale --data_dir $data_dir, only around 91% of the images download successfully. This means the actual pool size will be smaller than the stated pool size (< 12.8M).

How should we deal with that? I think participants who end up with more candidate samples will clearly benefit more.

zzzzzero commented May 6, 2023

On my machine, ~94% of the images download successfully.
I'm also wondering if there's a way to re-download the failed parts. I've noticed that some downloads fail because of broken links or network issues, and a considerable portion of the failed links can be downloaded successfully on a second attempt. If no such method exists, we may have to write a Python script ourselves.
Here are two things that can improve the download success rate: 1. Reduce the number of download threads and processes. 2. Change the network or machine used for downloading.
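
A minimal sketch of such a retry script, assuming img2dataset's per-shard output parquets contain url, caption, and status columns (the column names, the shards/ path, and the retry output location are assumptions for illustration, not something specified in this thread):

# Sketch: collect URLs that failed in a previous img2dataset run and retry them.
# Assumes each shard's .parquet written by img2dataset has "url", "caption",
# and "status" columns; verify the names against your own output first.
from pathlib import Path

import pandas as pd
import img2dataset

SHARD_DIR = Path("shards")               # output_folder of the original run (placeholder)
RETRY_LIST = Path("failed_urls.parquet")

frames = []
for parquet_file in SHARD_DIR.glob("*.parquet"):
    df = pd.read_parquet(parquet_file, columns=["url", "caption", "status"])
    frames.append(df.loc[df["status"] != "success", ["url", "caption"]])

if not frames:
    raise SystemExit("no shard parquets found under " + str(SHARD_DIR))

failed = pd.concat(frames, ignore_index=True).drop_duplicates(subset="url")
failed.to_parquet(RETRY_LIST)

# Re-run the download on the failed subset only.
img2dataset.download(
    url_list=str(RETRY_LIST),
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_folder=str(SHARD_DIR / "retry"),
    output_format="webdataset",
    retries=2,
)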

vtddggg (Author) commented May 6, 2023

Thanks for the helpful advice!

Since the raw image data (tars) is not very large for the small-scale track (450 GB), would the organizers consider releasing the tar data directly? It would help us conduct a more rigorous academic exploration of the image-text data filtering problem.

rom1504 commented May 6, 2023

Hi, I advise setting up the right DNS resolver to increase your success rate.
See the img2dataset README for details.

vtddggg (Author) commented May 8, 2023

@rom1504 Thanks, I will give it a try. What proportion of downloads succeeds on your machine?
Just as a reference for what a reasonable success rate looks like.

@gabrielilharco (Contributor)

Hey @vtddggg. I'm also getting a 94-95% success rate currently. I suspect this difference will matter little in terms of the accuracy of the trained models. In many experiments we found that changing the size of datasets drawn from the same distribution has little impact on performance. For example, in Figure 3 of our paper we show that using only 50% of the unfiltered pool performs very closely to using the entire pool (see https://arxiv.org/abs/2304.14108).

Would the organizers consider releasing the tar data directly?

We understand that releasing the tars directly would make things simpler for participants. However, our dataset is designed to be an index of public images on the internet, which means that if an image is deleted from its original source, it is also deleted from our dataset. Releasing the tars directly would mean creating public copies of the images, which is problematic. For those reasons, we won't be able to share the data directly, and we hope you understand the decision.

vtddggg (Author) commented May 10, 2023

Thanks for sharing your success rate.
I agree with your point. Since changing the size of datasets drawn from the same distribution has little impact on performance, losing some images during downloading is perfectly acceptable.

pfischer-nvidia commented Jun 5, 2023

Hi, I believe the success rate will become lower and lower over time. We downloaded the 45 TB set and our rates look as follows:

Success: 89.9%
Failed to download: 9.5%
Failed to resize: 0.6%

So about 10% is already missing.
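
For reference, you can compute these rates yourself by aggregating the per-shard *_stats.json files that img2dataset writes alongside the shards. A minimal sketch, assuming the stats files expose count, successes, failed_to_download, and failed_to_resize keys (the key names and the shards/ path are assumptions to check against your own run):

# Sketch: aggregate img2dataset's per-shard *_stats.json files into overall rates.
# Key names ("count", "successes", "failed_to_download", "failed_to_resize")
# are assumed; inspect one stats file from your own run first.
import json
from pathlib import Path

totals = {"count": 0, "successes": 0, "failed_to_download": 0, "failed_to_resize": 0}

for stats_file in Path("shards").glob("*_stats.json"):  # "shards" is the output_folder
    with open(stats_file) as f:
        stats = json.load(f)
    for key in totals:
        totals[key] += stats.get(key, 0)

if totals["count"]:
    for key in ("successes", "failed_to_download", "failed_to_resize"):
        print(f"{key}: {100 * totals[key] / totals['count']:.1f}%")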

@afang-story (Contributor)

Hello @pfischer-nvidia,
Sorry for the late reply - can you confirm whether you have tried this: #3 (comment)?

@pfischer-nvidia

We did change the DNS servers to the Google ones (8.8.8.8 and 8.8.4.4), but we did not change the resolver to bind9. Instead we used dnspython.
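
If it helps to quantify how much DNS itself contributes to the failures, here is a rough sketch using dnspython to measure the resolution failure rate on a sample of hostnames from the metadata; the parquet path and the url column name are placeholders:

# Sketch: estimate the DNS resolution failure rate over a sample of hostnames.
# Uses dnspython; "metadata/sample.parquet" and the "url" column are placeholders.
from urllib.parse import urlparse

import dns.resolver
import pandas as pd

urls = pd.read_parquet("metadata/sample.parquet", columns=["url"])["url"].head(1000)
hosts = {urlparse(u).hostname for u in urls if urlparse(u).hostname}

resolver = dns.resolver.Resolver()
resolver.lifetime = 2.0  # give each query at most 2 seconds

failures = 0
for host in hosts:
    try:
        resolver.resolve(host, "A")
    except Exception:
        failures += 1

print(f"DNS failure rate: {100 * failures / len(hosts):.1f}% over {len(hosts)} hosts")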

rom1504 commented Jun 19, 2023

zzzzzero commented Jul 5, 2023

I found that I can increase the success rate by increasing the number of retries: with retries=2, I currently get a ~95% success rate.

Just add the retries=2 setting to the img2dataset.download call on line 136 of download_upstream.py, like this.

img2dataset.download(
    url_list=str(metadata_dir),
    image_size=args.image_size,
    output_folder=str(shard_dir),
    processes_count=args.processes_count,
    thread_count=args.thread_count,
    resize_mode=args.resize_mode,
    resize_only_if_bigger=not args.no_resize_only_if_bigger,
    encode_format=args.encode_format,
    output_format=args.output_format,
    input_format='parquet',
    url_col='url',
    caption_col='text',
    bbox_col=bbox_col,
    save_additional_columns=['uid'],
    number_sample_per_shard=10000,
    oom_shard_count=8,
    retries=2,
)

Vaishaal (Contributor) commented Aug 1, 2023

Updated download_upstream!

Vaishaal closed this as completed Aug 1, 2023
@alexanderremmerie

Hi, we are getting a success rate of about 86% (downloading the small dataset, 12 million images):

15it [44:26, 19.65s/it]worker - success: 0.867 - failed to download: 0.128 - failed to resize: 0.005 - images per sec: 4 - count: 10000 total - success: 0.865 - failed to download: 0.130 - failed to resize: 0.005 - images per sec: 56 - count: 150000
16it [44:49, 20.77s/it]worker - success: 0.860 - failed to download: 0.134 - failed to resize: 0.006 - images per sec: 4 - count: 10000 total - success: 0.864 - failed to download: 0.131 - failed to resize: 0.005 - images per sec: 60 - count: 160000

We are using the knot DNS resolver (8 instances), the default 16 processes, and just 16 threads (instead of the default 128). We did this to slow down the DNS resolution requests. We use an e2-highcpu-32 instance (32 vCPUs, 32 GB RAM) on GCP.

86% is the best we could get, but due to the low number of threads the downloading is very slow. Can you share any more details on how you got the 95% success rate (configuration, machine type, number of threads and processes, GCP/AWS/Azure, DNS resolver, etc.) and any advice to improve speed and success rate? Many thanks!

zzzzzero commented Aug 9, 2023

My current download success rate is around 94%. My suggestion is to use wandb to monitor the download process; that way you can analyze the reasons for download failures more effectively. I've shared the wandb analysis of the download process recorded on my end; my wandb report link is here. It's evident from the analysis that the primary reason for failures is invalid download links (2% of the links are invalid), while 0.8% are due to IP bans.

Your current download speed appears to be very slow, and bandwidth and CPU utilization don't seem to be fully used. It also doesn't look like a DNS server problem. I believe the main reason for the failures could be related to your IP; you could consider downloading the dataset from an IP address in a different region. If your bandwidth and CPU utilization aren't maxed out, increasing the number of processes and threads might not significantly affect your download success rate. My thread-to-process ratio is 4:1.
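
For anyone who wants to reproduce this kind of monitoring: img2dataset has built-in wandb support; if I remember its parameters correctly, something like the sketch below enables it (the enable_wandb and wandb_project parameter names should be verified against the img2dataset version pinned by the repo, and the paths and project name are placeholders):

# Sketch: enable wandb monitoring for an img2dataset run.
# enable_wandb / wandb_project parameter names are assumptions to verify
# against the img2dataset version used by download_upstream.py.
import img2dataset

img2dataset.download(
    url_list="metadata",                # placeholder metadata directory
    input_format="parquet",
    url_col="url",
    caption_col="text",
    output_folder="shards",             # placeholder output directory
    output_format="webdataset",
    retries=2,
    enable_wandb=True,                  # stream per-shard success/failure stats to wandb
    wandb_project="datacomp-download",  # hypothetical project name
)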


rom1504 commented Aug 9, 2023

Answered there: #39 (comment)

alexanderremmerie commented Aug 10, 2023

[wandb dashboard screenshot]

This is the wandb output we are getting (see https://api.wandb.ai/links/alexander-remmerie/n1kikknk for the full report). As you can see, most of the errors are network-unreachable and network-timeout errors. Knot CPU usage is 2-5%, so knot is being used as the DNS resolver. Our resolv.conf file:

(base) jupyter@datacomp-ubuntu:/etc$ cat /etc/resolv.conf
nameserver 127.0.0.1

rom1504 commented Aug 10, 2023 via email

@alexanderremmerie

We finally managed to get 92%. For future reference: we didn't have a public IP on the GCP instance (for security reasons). By giving the instance a public IP we got much faster downloads. Now, using 8 instances of the knot DNS resolver, 88 cores, 88 processes, and 128 threads, we can download the small dataset (13 million images, 450 GB) in about 2 hours.

@nahidalam

@rom1504 @Vaishaal Can this not be solved without the DNS (or other networking) setup? Even 92% is not the full download, and how do I know my 92% is the same as @alexanderremmerie's 92%? Is there a way we can move towards more 'immutability' for these open vision-language datasets?
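
On the question of whether two 92% downloads contain the same samples: since download_upstream.py saves the uid column with each sample, one way to check is to compare the sets of successfully downloaded uids across two runs. A rough sketch, assuming the output parquets expose uid and status columns and using placeholder paths for the two runs:

# Sketch: compare which samples two independent downloads actually captured,
# using the "uid" column saved by download_upstream.py. The "status" column
# name and the run_a/run_b paths are assumptions/placeholders.
from pathlib import Path

import pandas as pd

def downloaded_uids(shard_dir: str) -> set:
    """Collect the uids of successfully downloaded samples from one run's shards."""
    uids = set()
    for parquet_file in Path(shard_dir).glob("*.parquet"):
        df = pd.read_parquet(parquet_file, columns=["uid", "status"])
        uids.update(df.loc[df["status"] == "success", "uid"])
    return uids

a = downloaded_uids("run_a/shards")
b = downloaded_uids("run_b/shards")

overlap = len(a & b)
union = len(a | b) or 1
print(f"run A: {len(a)} uids, run B: {len(b)} uids")
print(f"overlap: {overlap} samples ({100 * overlap / union:.1f}% of the union)")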

rom1504 commented Jan 4, 2024

The reality of the web is that it's ever changing, and a number of laws restrict redistribution.
So I would argue that instead of trying to make the web immutable, it makes more sense to accept its mutability and adapt training and evaluation recipes to it.
For example, being able to estimate that two collections contain the same distribution, in some sense, would go a long way.
Going further, continual training and on-demand dataset collection would be true adaptation to the web.

That's even more true going beyond the web and towards world mutability.

All that said, if you really do want a few billion images that you can redistribute to everyone and that will stay up for many years, then I think the only way is to build a service that hosts only perpetually granted public-domain images, and then probably to incentivize a lot of people to put content on it.

TL;DR: it's an interesting topic, and there are a lot of possible solutions. However, tuning a downloader tool while keeping the same collection of image links extracted from the web is unlikely to achieve immutability.
