Inference testing using random data #199
Conversation
@rom1504 could you take a look? The next step would be to determine a good approach to storing and recalling test data across tags or commits.
Can you explain why you want to generate test data multiple times?
The 50 MB is with a single sample for each model config; it is the total across all configs.
@pytest.mark.skip('storing and recalling of test data not ready')
@pytest.mark.parametrize('model_name, pretrained, precision, jit', util_test.model_config_permutations)
def test_inference_with_data(
How slow is this?
Let's try to keep tests running in less than 5 min.
Either remove redundant tests or use the (matrix) parallel feature of GH actions
(and also possibly the parallel feature of pytest).
======================================================= test session starts =======================================================
platform linux -- Python 3.9.2, pytest-7.0.1, pluggy-1.0.0 -- /REDACTED/open_clip/.venv/bin/python
cachedir: .pytest_cache
rootdir: /REDACTED/open_clip
plugins: xdist-2.5.0, forked-1.4.0
collected 62 items
tests/test_inference.py::test_inference_with_data[RN50-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-cc12m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-cc12m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN101-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-quickgelu-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-quickgelu-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN101-quickgelu-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x4-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x4-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50x16-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x16-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50x64-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x64-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_e16-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_e16-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_s34b_b79k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_s34b_b79k-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion2b_s32b_b82k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion2b_s32b_b82k-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-336-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-336-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-H-14-laion2b_s32b_b79k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-H-14-laion2b_s32b_b79k-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-g-14-laion2b_s12b_b42k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-g-14-laion2b_s12b_b42k-fp32-True] PASSED
tests/test_simple.py::test_inference[False] PASSED
tests/test_simple.py::test_inference[True] PASSED
======================================================== warnings summary =========================================================
tests/test_inference.py::test_inference_with_data[RN50-openai-fp32-True]
/REDACTED/open_clip/tests/test_inference.py:38: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:600.)
image_features = model.encode_image(prepped)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================ 62 passed, 1 warning in 374.16s (0:06:14) ============================================
0:06:14 on i7-4790K
batch size 1, single sample
Testing the configs in list_models(), all unit tests (training, inference, hf, ...) now take about 8:30-12:00 minutes; with setup overhead, 10-14 minutes.
Seems it's now 3 min, what changed?
The output should also be stored only once. Is it really useful to test every single model config?
The outputs are different for each model, so we need to store them. Regarding the configurations, correct me if I'm wrong, but some pretrained models are treated differently within the same model architecture, e.g. openai.
Regarding input data: since the data is random, having a larger pool of samples increases robustness by reducing the probability of encountering only false positives. Overall I tried to design this to be an integration test. I'll go ahead and split input and output data, so we only need one set of inputs per input size across all models and inflection points.
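For illustration only, the split could look roughly like the sketch below; the helper name, the file names under tests/data, and the list of image sizes are assumptions, not the actual util_test.py implementation.

import os
import torch

def create_shared_inputs(data_dir='tests/data', sizes=(224, 240, 288, 336), seed=0,
                         overwrite=False):
    """Create one random input image per resolution and one text sample,
    shared by all model configs (reference outputs are still stored per model)."""
    os.makedirs(data_dir, exist_ok=True)
    torch.manual_seed(seed)  # fixed seed so regenerated inputs stay identical
    for size in sizes:
        path = os.path.join(data_dir, f'random_image_{size}.pt')
        if overwrite or not os.path.exists(path):
            torch.save(torch.rand(1, 3, size, size), path)
    text_path = os.path.join(data_dir, 'random_text.pt')
    if overwrite or not os.path.exists(text_path):
        # 77 is the usual CLIP context length; the token ids here are arbitrary
        torch.save(torch.randint(0, 49408, (1, 77)), text_path)

The point is that inputs depend only on the resolution and the seed, while reference outputs remain per model config.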
Could you remove the skip of the test so we could see if it works and how long it takes to run?
I can, but the GH action will fail due to missing test data.
(Quote from OP.) For these steps I still need feedback from you.
It's ok, it just needs to pass before merging.
I don't think this is necessary. Test data should be introduced manually for new models. I think it would make sense to make it so test data is < 5 MB (so that means much less for each model) and then simply store it in the repo.
We could also consider storing test data in releases if we truly need to make it big, but if not needed it's best to keep it small and in the repo.
Test data is already minimal per model.
I'll refactor tomorrow to generate one input image per size and one input string for all models.
Sounds good.
Changed test data creation to only create one image per size, and one string for all models.
EDIT: never mind, I misread the warning as the error. Maybe data should be created by the CI as well. I'm not certain this will alleviate the problem though.
Do you have a suggestion for an acceptable maximum precision for the output comparison?
@rom1504 tests pass now (it failed before because of either a timeout or a memory constraint; all tests up to ViT-g passed, and ViT-g itself was interrupted during download).
I'm reconsidering running tests with pretrained weights. No matter what, it will either take too long or, if done in parallel, hit the memory constraints, both due to weight downloads. Instead, running tests against randomly initialized non-openai models (as listed by list_models()), then a separate test for openai models. I'm not sure yet how to do this while retaining a blackbox testing mentality, e.g. not having internal knowledge of how openai models differ from non-openai in terms of config.
The config of models is unrelated to openai/non-openai.
The model is fully specified by the config.
I agree testing random weights sounds good. Using a fixed seed should work well.
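As a rough illustration of the seeded-random-weights idea (the wrapper name is made up; create_model_and_transforms without a pretrained tag should leave the weights randomly initialized, but treat the exact call as an assumption):

import torch
import open_clip

def build_seeded_random_model(model_name: str, seed: int = 0):
    """Instantiate a model config with randomly initialized weights,
    seeded so every run (and every CI machine) gets the same parameters."""
    torch.manual_seed(seed)
    model, _, preprocess = open_clip.create_model_and_transforms(model_name)
    model.eval()
    return model, preprocess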
Testing all models as listed by list_models().
To generate test data:
# generate test data for all models in tests/data
# will only overwrite old data with --overwrite
python tests/util_test.py --all
# for a specific model
python tests/util_test.py --model RN50
Total runtime for all tests (inference, training, hf, ...) is ~10-14 min including setup. Ready to merge?
I think it would be cool to use the GH action infra for parallel testing to run N (probably 8) tests in parallel (it uses N actual containers, so no RAM issue).
I'm thinking of something like this: a GH action matrix job that exposes its index as an environment variable;
then in the inference test, you use that number (0 to 7) to pick only 1/8 of the models if the environment variable exists, otherwise run all. How does that sound?
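Something along these lines could do the slicing; the CI_SHARD variable name and the 8-way split are illustrative placeholders for whatever the GH action matrix ends up exporting.

import os
import open_clip

def models_for_this_job(num_jobs: int = 8):
    """Pick this CI job's slice of the model list; run everything when the
    sharding variable is unset (e.g. in local runs)."""
    models = open_clip.list_models()
    job = os.environ.get('CI_SHARD')  # hypothetical variable set by the GH action matrix
    if job is None:
        return models
    return [m for i, m in enumerate(models) if i % num_jobs == int(job)]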
Another (simpler) option is to simply put the names of the models to test in the GH action matrix and use that.
Overall nice work! It's almost ready now.
Using autocast?
It's an i7-4790K and does not support AVX-512-BF16 or AVX-512-FP16. I would like to keep it without autocast, to make the test usable regardless of the instruction set supported by the CPU and to keep implicit changes to behaviour to a minimum (e.g. the output being bfloat16 instead of float32 with autocast).
Did you disable jit? It tends to make things slow.
Yes, jit is disabled for all tests.
OK, I guess let's not use autocast then. Any thoughts regarding GH action parallelism?
Looks like tests took only 3 min?! Did anything change?
I'm on parallelizing it right now. I've got it working with vit-g when running in parallel. EDIT: yes, I mean ViT-G.
It's ViT-G which was failing.
I think we could merge now really; let's maybe do that and do another PR for parallel?
This is a parallel run on my staging branch, if you want to take a look.
Ok, parallel can wait. Let me squash the commits first before merging, please.
It'll all be squashed during the GitHub merge.
Ah, sorry, too late, it's already done.
It's ok, I guess 2 commits make sense here.
Merged.
Parallel with GH action would still be appreciated.
And the next big step would be the same but with one step of training (see the simple training test for a place to start), so we make sure training keeps doing the same thing.
If you want to continue working on this, it would be pretty helpful. There is active development in this repo, and keeping things working will be important.
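A very rough sketch of what such a training-step check might look like; the forward signature, the stand-in loss, and the tolerance are assumptions for illustration, and the reference loss would come from stored test data rather than being hardcoded.

import torch
import torch.nn.functional as F

def check_training_step(model, images, texts, expected_loss, seed=0, rtol=1e-4):
    """Run one seeded optimizer step and compare the loss against a stored
    reference, so silent changes to the training path get caught."""
    torch.manual_seed(seed)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model.train()
    # assumed forward signature: (image_features, text_features, logit_scale)
    image_features, text_features, logit_scale = model(images, texts)
    # stand-in contrastive-style loss, not the repo's actual CLIP loss
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.shape[0])
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    assert torch.allclose(loss.detach(), expected_loss, rtol=rtol)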
I have started to work on integration tests for with random data.
This test runs on all pretrained models at fp32 with JIT True/False where applicable.
Related issue: #198
list_models()
To generate test data:
populates the
tests/data
folder with one torch pt per model config, to be used by the test.