
Remove Column class for Parameter #151

Open · wants to merge 1 commit into base: profile-optimization-db-clean

Conversation

glatterf42
Member

This PR removes the Column class for Parameter: it is not necessary to keep it around, and removing it is even slightly faster. See the following benchmarks:

Benchmarks with `Column`
tests/profiling/test_profiling_parameters.py ......                                                                                                                                    [100%]


---------------------------------------------------------------------------------------------------- benchmark: 6 tests ----------------------------------------------------------------------------------------------------
Name (time in ms)                                           Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_parameter_add_data_insert[sqlite-small]            45.7072 (1.0)      108.1139 (1.0)       52.2688 (1.0)      19.6248 (1.0)       45.9778 (1.0)       0.6958 (1.0)           1;1  19.1319 (1.0)          10           1
test_parameter_add_data_upsert_all[sqlite-small]        54.6769 (1.20)     136.7937 (1.27)      71.5041 (1.37)     30.9461 (1.58)      55.8460 (1.21)     11.0482 (15.88)         2;2  13.9852 (0.73)         10           1
test_parameter_add_data_upsert_half[sqlite-small]       57.9691 (1.27)     122.9973 (1.14)      65.2775 (1.25)     20.2952 (1.03)      58.5939 (1.27)      1.4961 (2.15)          1;1  15.3192 (0.80)         10           1
test_parameter_add_data_insert[sqlite-medium]          142.7757 (3.12)     208.8990 (1.93)     150.6410 (2.88)     20.4936 (1.04)     144.2870 (3.14)      1.9785 (2.84)          1;1   6.6383 (0.35)         10           1
test_parameter_add_data_upsert_all[sqlite-medium]      181.2061 (3.96)     254.5545 (2.35)     192.6365 (3.69)     22.0582 (1.12)     185.4630 (4.03)      3.7982 (5.46)          1;2   5.1911 (0.27)         10           1
test_parameter_add_data_upsert_half[sqlite-medium]     251.3761 (5.50)     427.4622 (3.95)     279.5083 (5.35)     54.8663 (2.80)     254.5343 (5.54)     38.5086 (55.35)         1;1   3.5777 (0.19)         10           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
============================================================================= 6 passed, 13 deselected in 21.86s ==============================================================================
Benchmarks without `Column`
tests/profiling/test_profiling_parameters.py ......                                    [100%]


---------------------------------------------------------------------------------------------------- benchmark: 6 tests ----------------------------------------------------------------------------------------------------
Name (time in ms)                                           Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_parameter_add_data_insert[sqlite-small]            43.5097 (1.0)      105.0430 (1.53)      50.5152 (1.0)      19.1710 (4.03)      44.4000 (1.0)       0.6198 (1.0)           1;2  19.7960 (1.0)          10           1
test_parameter_add_data_upsert_all[sqlite-small]        52.9452 (1.22)      68.6140 (1.0)       55.2633 (1.09)      4.7614 (1.0)       53.6298 (1.21)      1.3119 (2.12)          1;1  18.0952 (0.91)         10           1
test_parameter_add_data_upsert_half[sqlite-small]       55.7147 (1.28)     125.8547 (1.83)      64.4476 (1.28)     21.8285 (4.58)      56.5543 (1.27)      2.1717 (3.50)          1;2  15.5165 (0.78)         10           1
test_parameter_add_data_insert[sqlite-medium]          135.4159 (3.11)     208.4362 (3.04)     146.6560 (2.90)     21.7741 (4.57)     140.3144 (3.16)      1.5109 (2.44)          1;2   6.8187 (0.34)         10           1
test_parameter_add_data_upsert_all[sqlite-medium]      172.6628 (3.97)     241.9307 (3.53)     182.8954 (3.62)     20.9766 (4.41)     175.4558 (3.95)      4.0788 (6.58)          1;1   5.4676 (0.28)         10           1
test_parameter_add_data_upsert_half[sqlite-medium]     242.7693 (5.58)     437.8403 (6.38)     284.0225 (5.62)     61.7876 (12.98)    250.1303 (5.63)     70.3974 (113.57)        1;1   3.5208 (0.18)         10           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
============================= 6 passed, 13 deselected in 19.70s ==============================

It remains to be seen whether the rest of the normalization (see #143) is beneficial for performance. That may not be urgent, though, as performance already seems to be better than ixmp's.

TODO:

  • Add DB migration

@glatterf42 glatterf42 added the enhancement New feature or request label Jan 16, 2025
@glatterf42 glatterf42 requested a review from meksor January 16, 2025 14:46
@glatterf42 glatterf42 self-assigned this Jan 16, 2025

codecov bot commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.9%. Comparing base (d18873e) to head (a763e80).

Additional details and impacted files
@@                      Coverage Diff                      @@
##           profile-optimization-db-clean    #151   +/-   ##
=============================================================
  Coverage                           86.9%   86.9%           
=============================================================
  Files                                230     230           
  Lines                               8144    8164   +20     
=============================================================
+ Hits                                7082    7102   +20     
  Misses                              1062    1062           
| Files with missing lines | Coverage Δ |
| --- | --- |
| ixmp4/core/optimization/parameter.py | 93.2% <100.0%> (-0.1%) ⬇️ |
| ixmp4/data/abstract/optimization/parameter.py | 82.0% <100.0%> (+0.4%) ⬆️ |
| ixmp4/data/api/optimization/parameter.py | 94.0% <100.0%> (ø) |
| ixmp4/data/db/base.py | 92.3% <100.0%> (+<0.1%) ⬆️ |
| ixmp4/data/db/optimization/base.py | 100.0% <100.0%> (ø) |
| ixmp4/data/db/optimization/indexset/repository.py | 98.1% <100.0%> (ø) |
| ixmp4/data/db/optimization/parameter/model.py | 100.0% <100.0%> (ø) |
| ixmp4/data/db/optimization/parameter/repository.py | 96.1% <100.0%> (-0.2%) ⬇️ |
| ixmp4/data/db/optimization/utils.py | 100.0% <100.0%> (ø) |
| ixmp4/data/types.py | 100.0% <100.0%> (ø) |
| ... and 1 more | |

@glatterf42
Member Author

Of course, one question remains open: what about the rest of the "normalization" PRs? What about the .data attribute? Might it not be faster to also normalize that, giving it its own table?

For future reference: I tried (though the branch is not on GitHub). I followed the same approach as in #143 and added a float "values" column as well as a relationship to Unit named "units" to the ParameterData table. This worked fine except for one part: the UPSERT functionality.
With our current implementation, we load the JSON dict into a pd.DataFrame and then use combine_first() to achieve an operation that is very similar (if not identical) to UPSERT. With a dedicated ParameterData table, we would not need to load the data into memory, but could instead use SQLite's and Postgres's INSERT ... ON CONFLICT DO UPDATE statement, which is fully supported by SQLAlchemy. This kind of statement requires us to specify the UniqueConstraint or the columns that should be checked for a conflict (a sketch after the list below illustrates the attempted setup). And this is where the problems come in:

  • As in TableData, I defined 15 columns for data linked to IndexSets. All of these plus parameter__id were grouped in the UniqueConstraint. However, when I ran a test that only added data to two of these columns and specified just those two columns for the conflict check, SQLAlchemy told me it couldn't find a UniqueConstraint fitting them.
  • We could alternatively specify the name of the UniqueConstraint, but that would most likely not work either, because most of the columns are NULL. NULLs are treated as distinct, so even rows whose non-NULL values are identical would be seen as distinct. Postgres offers a NULLS NOT DISTINCT clause, but SQLite doesn't support it, and the SQL standard generally treats NULLs as always distinct.
  • We could work around this problem by using a special default value instead of NULL, such as "___". However, since such a sentinel is not intended behaviour, it would not make our DB model more stable.
  • Another idea was to use partial indexes. These can even be used as partial unique indexes, which sounds close to what we need: every Parameter with X columns would get a partial unique index over its rows that is only defined for those X columns. This would likely lead to thousands of indexes on this table, though, all of which would be checked on every query to figure out whether they apply, considerably slowing the table down over time.
  • One way around that might be partitioning. We could define categories based on the number of non-NULL columns and create one partition per category. Each partition might then get a (partial) unique index covering the whole partition, and we would only need to ensure that each incoming ParameterData row is stored in the right partition. I have neither tried this nor checked whether partitions generally support distinct indexes, so I don't know how well this might work.
  • Lastly, @meksor suggested putting a UniqueConstraint on each column. Then, specifying two columns would trigger both constraints, but we are not sure how they would be evaluated: would a conflict arise only when we want it to, i.e. when both constraints fail, or already when just one constraint fails, which we would consider premature?
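
To make the failure mode concrete, here is a minimal sketch of the attempted setup. All names are illustrative (two indexset columns instead of 15, a trimmed-down Parameter); this is not the actual ixmp4 schema.

```python
# Hypothetical sketch of the attempted normalization; illustrative names,
# two indexset columns instead of 15, not the actual ixmp4 schema.
#
# Current approach, for contrast (simplified):
#     incoming_df.combine_first(existing_df)  # pandas' UPSERT-like merge
from sqlalchemy import Float, ForeignKey, String, UniqueConstraint
from sqlalchemy.dialects.sqlite import insert
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Parameter(Base):
    __tablename__ = "parameter"

    id: Mapped[int] = mapped_column(primary_key=True)


class ParameterData(Base):
    __tablename__ = "parameter_data"
    # One UniqueConstraint spanning parameter__id plus all (mostly NULL)
    # indexset columns, analogous to the 15 columns used for TableData.
    __table_args__ = (
        UniqueConstraint("parameter__id", "indexset_0", "indexset_1"),
    )

    id: Mapped[int] = mapped_column(primary_key=True)
    parameter__id: Mapped[int] = mapped_column(ForeignKey("parameter.id"))
    indexset_0: Mapped[str | None] = mapped_column(String, nullable=True)
    indexset_1: Mapped[str | None] = mapped_column(String, nullable=True)
    values: Mapped[float] = mapped_column(Float)


# The UPSERT that the normalized layout would enable:
stmt = insert(ParameterData).values(
    parameter__id=1, indexset_0="foo", values=3.14
)
stmt = stmt.on_conflict_do_update(
    # Naming only the columns the incoming data actually uses does not match
    # the declared three-column constraint, so the statement is rejected;
    # this is the failure described in the first bullet above.
    index_elements=["parameter__id", "indexset_0"],
    # Subscript access because ColumnCollection also has a values() method.
    set_={"values": stmt.excluded["values"]},
)
```

Whichever layer raises the error, the effect is the same: ON CONFLICT needs a matching unique index, and a mostly-NULL composite constraint cannot provide one for arbitrary column subsets.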

Last but not least, even if we figured this problem out, we would be left with a collection that firmly falls into SQLAlchemy's "large" category, as we would frequently want several thousand ParameterData rows in .data. SQLAlchemy offers a few techniques for handling such large collections, but the most prominent seems to be the WriteOnlyCollection, which is not what we need: we often want to read all of the data stored in a Parameter.
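
For reference, the write-only pattern looks roughly like the following sketch (extending the illustrative models from the sketch above), which also shows why it doesn't fit here:

```python
# Sketch of SQLAlchemy 2.0's write-only collection, extending the
# illustrative Parameter/ParameterData models from the sketch above.
from sqlalchemy.orm import Mapped, WriteOnlyMapped, mapped_column, relationship


class Parameter(Base):
    __tablename__ = "parameter"

    id: Mapped[int] = mapped_column(primary_key=True)
    # A write-only collection is never loaded implicitly ...
    data: WriteOnlyMapped["ParameterData"] = relationship()


# ... so reading .data back always requires an explicit SELECT:
# rows = session.scalars(parameter.data.select()).all()
```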
SQLAlchemy also offers various relationship loading techniques. Generally, lazy loading is the default and selectin loading is the most efficient. These can be set per query, which I admittedly did not try; instead, I set the whole relationship (in the table definition) from IndexSet to IndexSetData to selectin (since loading it seems to take most of the time in our profiles), and our benchmarks were actually a little slower than with lazy loading.
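
For completeness, the per-query variant (which was not tried) would look roughly like this; the import path and the `data` attribute name on IndexSet are assumptions, not verified against the code base:

```python
# Hedged sketch: setting the loading strategy per query instead of in the
# table definition. Import path and `data` attribute name are assumptions.
from sqlalchemy import select
from sqlalchemy.orm import selectinload

from ixmp4.data.db.optimization.indexset.model import IndexSet  # assumed path

stmt = select(IndexSet).options(selectinload(IndexSet.data))
# indexsets = session.scalars(stmt).all()
```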

Note

One thing that might still be worth a try is to revert indexset.data to a JSON list and see whether that improves our performance.
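
A minimal sketch of what that revert could look like, with an illustrative model rather than the actual ixmp4 schema:

```python
# Illustrative only: indexset.data as a single JSON-typed list column.
from sqlalchemy import JSON
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class IndexSet(Base):
    __tablename__ = "indexset"

    id: Mapped[int] = mapped_column(primary_key=True)
    # The whole list round-trips as one JSON value, so reading it involves
    # no join and no per-row loading strategy.
    data: Mapped[list] = mapped_column(JSON, default=list)
```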


For the time being, I will not continue working on this. The profiles generated for #150 show that combine_first() makes up at most 10% of the runtime, almost negligible compared to validate_data(). And since we already seem to surpass ixmp in performance, this work can wait, if it is continued at all (given the problems outlined above).
