fix: global code update #225

Draft: wants to merge 56 commits into main

Conversation

jjmaynard
Collaborator

Description

Updated global functions:

  • list_soils_global
  • rank_soils_global
  • sg_list

Member

@paulschreiber paulschreiber left a comment

Please:

  • fix the highlighted issues
  • add some test lat/lng points
  • provide a copy of the SQL data(base) needed for testing

Member

@shrouxm shrouxm left a comment

🎉 🎉 🎉 big milestone, congrats & thank you so much! left a couple very minor comments while skimming, and some general comments:

  • fantastic that it has the exact same API as the US version
  • we'll need to get the US tests working again (not sure why they're not working 🤔) and add some smoke tests for global before merging
  • it seems the global code relies on having a database available at runtime. we'll need to figure out how to get that working from static files like US does
    • or if that's somehow completely impossible, there'll be a lift to get that DB set up in our dev, CI, and prod environments
  • we'll need a way to determine whether to call the US or global functions

@@ -2060,7 +2060,6 @@ def rank_soils(
Score_Data_Loc = (D_final_loc["Score_Data_scale"] + D_final_loc["distance_score_scale"]) / (
D_final_loc["data_weight"] + location_weight
)
Score_Data_Loc /= np.nanmax(Score_Data_Loc)
Member

question: why this small change to us_soil.py?

Collaborator Author

Yeah, this change shouldn't be in this branch and should be ignored


# Call SoilGrids API

def sg_list(lon, lat):
Member

suggestion: rename for clarity

Suggested change
def sg_list(lon, lat):
def soil_grids_list(lon, lat):

@shrouxm shrouxm changed the title Fix/global code update fix: global code update Feb 5, 2025
@jjmaynard jjmaynard force-pushed the fix/global-code-update branch from 898207a to d9ba6ab Compare February 6, 2025 01:40
@shrouxm
Member

shrouxm commented Mar 11, 2025

hi @jjmaynard! i had a chance to set up the bulk testing harness today! here's some questions and what i found from a first pass of running it over the test data you sent (global_pedon_test_dataset.csv):

  • i was able to get the existing test_global.py function to run successfully after setting up a local postgres database from the DB dump you uploaded in the google drive! :)
  • list_soils_global runs successfully on around 80% of the pedons i've tested so far. there are two distinct tracebacks i get when it fails, which are copied below.
  • on all pedons where list_soils_global runs successfully, rank_soils_global fails with the same traceback, copied below. it looks like a quick fix and i may just not be invoking the function correctly (if i try to add measurement values to the test in test_global.py it fails with the same traceback)
  • 80% of queries take over 20s to run on my machine
  • i've updated this PR with:
    • the scripts for the bulk testing
    • i added a coordinate for each distinct crash traceback i found to test_global.py
    • i added some arbitrary measurement values to the rank_soils_global call in test_global.py
    • so you should be able to just work off of trying to get test_global.py to run successfully again, and then if you have time try running the bulk testing scripts to see if any new crashes emerge

question: once we get it running successfully, which column of the test CSV should i compare against the output soil name to determine how well the algorithm ranked the result? there are four columns that seem like they could be relevant:

| Description_74 | Description_90 | Description_fao74 | Description_fao90 |
|----------------|----------------|-------------------|-------------------|
| Cambisols      | Calcisols      | Calcic Cambisol   | Luvic Calcisol    |

and lastly here are the 3 distinct tracebacks i found from the algorithm crashes:

Traceback #1, occurred 204 times. Example pedon: AO SOTER_P.101c/63, lat: -10.950086, lon: 17.573093
  Traceback (most recent call last):
    File "/home/roux/terraso/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 56, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/global_soil.py", line 662, in rank_soils_global
      pedon_LAB = pedon_color(lab_Color, horizonDepth)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/utils.py", line 1798, in pedon_color
      if None in (pedon_top, pedon_bottom, pedon_l, pedon_a, pedon_b):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/.venv/lib/python3.12/site-packages/pandas/core/generic.py", line 1577, in __nonzero__
      raise ValueError(
  ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
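
Traceback #1 comes from `None in (...)`: tuple membership falls back to `==`, and comparing a pandas Series with `==` yields another Series whose truth value is ambiguous. A minimal sketch of an identity-based check follows. `has_missing` is a hypothetical helper, and `FakeSeries` is a stand-in that only mimics the Series behaviour so the example runs without pandas.

```python
class _AmbiguousTruth:
    """Mimics the result of `Series == None`: it cannot be reduced to one bool."""

    def __bool__(self):
        raise ValueError("The truth value of a Series is ambiguous.")


class FakeSeries:
    """Hypothetical stand-in for pandas.Series (assumption for this sketch)."""

    def __eq__(self, other):
        return _AmbiguousTruth()


def has_missing(*values) -> bool:
    """True if any value is literally None; uses identity, never ==."""
    return any(v is None for v in values)


def membership_test_raises(*values) -> bool:
    """Show that `None in (...)` falls back to == and hits the ambiguity error."""
    try:
        None in values
    except ValueError:
        return True
    return False
```

This only detects arguments that are `None` themselves. NaN values inside a Series would need a separate check.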

Traceback #2, occurred 43 times. Example pedon: AF0001, lat: 34.5, lon: 69.16667
  Traceback (most recent call last):
    File "/home/roux/terraso/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 54, in <module>
      list_result = list_soils_global(lat=lat, lon=lon)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/global_soil.py", line 90, in list_soils_global
      mucompdata_pd = hwsd2_data[["hwsd2", "fao90_name", "distance", "share", "compid"]]
                      ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  TypeError: 'NoneType' object is not subscriptable
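
Traceback #2 suggests the lookup that produces `hwsd2_data` returned `None` for this point. A guard along these lines would stop before the column selection. This is a sketch, not the project's fix: `select_component_columns` is a hypothetical name, and a plain dict of columns stands in for the DataFrame.

```python
REQUIRED_COLUMNS = ["hwsd2", "fao90_name", "distance", "share", "compid"]


def select_component_columns(hwsd2_data):
    """Return the needed columns, or None when the lookup produced nothing."""
    if hwsd2_data is None:
        # The caller decides how to report "no data at this location".
        return None
    return {col: hwsd2_data[col] for col in REQUIRED_COLUMNS}
```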

Traceback #3, occurred 23 times. Example pedon: AO SOTER_P.104c/62, lat: -10.07856, lon: 15.107436
  Traceback (most recent call last):
    File "/home/roux/terraso/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 54, in <module>
      list_result = list_soils_global(lat=lat, lon=lon)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/global_soil.py", line 473, in list_soils_global
      reordered_lists = [[lst[i] for i in mucomp_index] for lst in lists_to_reorder]
                          ~~~^^^
  IndexError: list index out of range
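
Traceback #3 means at least one entry in `mucomp_index` points past the end of one of the lists being reordered. The sketch below does not fix the underlying mismatch, which is not visible from the traceback. It only shows how validating the indices first would report which list is too short. `reorder_lists` is a hypothetical wrapper around the comprehension from line 473.

```python
def reorder_lists(lists_to_reorder, mucomp_index):
    """Reorder every list by mucomp_index, failing early with a clear message."""
    for pos, lst in enumerate(lists_to_reorder):
        # Negative indices are rejected too, although Python would accept them.
        bad = [i for i in mucomp_index if not 0 <= i < len(lst)]
        if bad:
            raise ValueError(
                f"list {pos} has {len(lst)} items but the index contains {bad}"
            )
    return [[lst[i] for i in mucomp_index] for lst in lists_to_reorder]
```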

fixed the 3 traceback issues @shrouxm identified. code now runs under 2s. SoilGrids API returns are sometimes taking 20-30 s.
jjmaynard and others added 13 commits April 13, 2025 23:55
Co-authored-by: Paul Schreiber <[email protected]>
-fixed distance calculation errors associated with coordinate system transformation.
-added top depth input to rank function
When no map data is available at a location - returns "Data_unavailable"
-modified sql code to query all home mapunit components and all adjacent map units and components.
-fixed component name indexing.
…requirements.txt requirements/dev.in setup.cfg from main
@paulschreiber paulschreiber force-pushed the fix/global-code-update branch from 6c9d5d5 to d54c625 Compare April 14, 2025 03:55
layer_name=None,
buffer_size=0.5,
)
def list_soils_global(lon, lat):
Member

issue: this should be called list_soils() to match the us version. this module is namespaced and doesn't need a prefix.


##############################################################################################
# rankPredictionGlobal #
##############################################################################################
def rankPredictionGlobal(
lon, lat, soilHorizon, horizonDepth, rfvDepth, lab_Color, bedrock, cracks, plot_id=None
def rank_soils_global(
Member

issue: this should be called rank_soils() to match the us version. this module is namespaced and doesn't need a prefix.

@jjmaynard
Collaborator Author

@shrouxm when you list the postgis table indexes, do all of these appear?
SELECT * FROM pg_indexes WHERE tablename = 'hwsdv2';

| schemaname | tablename | indexname             | tablespace | indexdef                                                                 |
|------------|-----------|------------------------|------------|--------------------------------------------------------------------------|
| public     | hwsdv2    | hwsdv2_pkey            |            | CREATE UNIQUE INDEX hwsdv2_pkey ON public.hwsdv2 USING btree (fid)       |
| public     | hwsdv2    | hwsdv2_geom_geom_idx   |            | CREATE INDEX hwsdv2_geom_geom_idx ON public.hwsdv2 USING gist (geom)     |
| public     | hwsdv2    | hwsdv2_geom_idx        |            | CREATE INDEX hwsdv2_geom_idx ON public.hwsdv2 USING gist (((geom)::geography)) |

I was able to run your sql query after creating a new spatial index and adding 'geom' to the selected columns:

point = f"ST_SetSRID(ST_Point({lon}, {lat}), 4326)"
main_query = f"""
    SELECT
        geom,
        hwsd2,
        ST_Distance(
            geom::geography,
            {point}::geography
        ) AS distance,
        ST_Intersects(geom, {point}) AS pt_intersect
    FROM {table_name}
    WHERE ST_DWithin(geom::geography, {point}::geography, {buffer_dist});
"""

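The query above interpolates `lon`, `lat`, and `buffer_dist` directly into the SQL text. A possible variant binds them as parameters instead. This sketch assumes a DB-API driver that uses `%s` placeholders (psycopg2 does). `build_buffer_query` and `ALLOWED_TABLES` are hypothetical names, and the table name is checked against an allowlist because identifiers cannot be bound as parameters.

```python
ALLOWED_TABLES = {"hwsdv2"}  # assumption: the only table this query targets


def build_buffer_query(table_name: str, lon: float, lat: float, buffer_dist: float):
    """Return (sql, params) with %s placeholders for the driver to bind."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table_name!r}")
    point = "ST_SetSRID(ST_Point(%s, %s), 4326)"
    sql = f"""
        SELECT
            geom,
            hwsd2,
            ST_Distance(geom::geography, {point}::geography) AS distance,
            ST_Intersects(geom, {point}) AS pt_intersect
        FROM {table_name}
        WHERE ST_DWithin(geom::geography, {point}::geography, %s);
    """
    # The point expression appears three times, so lon/lat repeat in order.
    params = (lon, lat, lon, lat, lon, lat, float(buffer_dist))
    return sql, params
```

With such a driver the call would look like `cursor.execute(*build_buffer_query("hwsdv2", lon, lat, buffer_dist))`. Because the comparison is on `geography`, `buffer_dist` is in meters.
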
Here are the results using your version:

Garo's SQL query version
# Total results: 2667

# Result proportions:

result
1          17.547807
2           9.223847
3           6.224222
4           4.911886
5           4.011999
6           1.949756
7           1.349831
8           0.862392
9           0.637420
10          0.449944
11          0.262467
12          0.224972
crash       0.524934
missing    51.818523
Name: pedon_key, dtype: float64
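
Since about half of the pedons are `missing`, it can help to read the rank rows both as a share of all pedons and as a share of the pedons that were ranked at all. The sketch below recomputes both from the percentages above. It assumes the integer rows are the rank at which the expected soil appeared and that `missing` means it was not in the returned list. `top_k_share` is a hypothetical helper.

```python
# Percentages copied from the "Result proportions" table above.
proportions = {
    1: 17.547807, 2: 9.223847, 3: 6.224222, 4: 4.911886,
    5: 4.011999, 6: 1.949756, 7: 1.349831, 8: 0.862392,
    9: 0.637420, 10: 0.449944, 11: 0.262467, 12: 0.224972,
}
crash = 0.524934
missing = 51.818523

# Share of all pedons that received any rank.
ranked_total = sum(proportions.values())


def top_k_share(k: int, conditional: bool = False) -> float:
    """Percent of pedons ranked in the top k; optionally among ranked pedons only."""
    share = sum(value for rank, value in proportions.items() if rank <= k)
    return 100 * share / ranked_total if conditional else share


print(f"ranked at all: {ranked_total:.2f}%")
print(f"top 3 of all pedons: {top_k_share(3):.2f}%")
print(f"top 3 of ranked pedons: {top_k_share(3, conditional=True):.2f}%")
```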

# Execution time deciles:

[ 1.15685061  2.08929843  2.36388511  2.61102933  2.91068624  3.27961859
  3.82248324  4.32113001  4.81062495  5.42749154 11.16506981]

# Unique crash tracebacks (5 unique, 14 total):

Traceback #1, occurred 5 times. Example pedon: BR0675, lat: -7.28333, lon: -34.33333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 732, in rank_soils_global
      soilIDRank_output_pd = pd.read_csv(io.StringIO(list_output_data.rank_data_csv))
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  AttributeError: 'str' object has no attribute 'rank_data_csv'
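
The `'str' object has no attribute 'rank_data_csv'` error is consistent with the list step returning a plain string for locations without map data (one of the commits above mentions returning "Data_unavailable"). A caller-side guard might look like the sketch below. `rank_data_or_none` is a hypothetical helper, and `SimpleNamespace` stands in for the real list result object.

```python
import io
from types import SimpleNamespace


def rank_data_or_none(list_output_data):
    """Return a CSV text buffer, or None when the list step returned a message."""
    if isinstance(list_output_data, str):
        # A string here is a status message, not a result object.
        return None
    return io.StringIO(list_output_data.rank_data_csv)


# Stand-in for a successful list result (assumption for this sketch).
example = SimpleNamespace(rank_data_csv="compname,score\nA,0.9\n")
buffer = rank_data_or_none(example)
```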

Traceback #2, occurred 4 times. Example pedon: BR0632, lat: -21.35, lon: -47.78333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(3, 0)) while a minimum of 1 is required by check_pairwise_arrays.
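
The `shape=(3, 0)` in this error says the matrix has rows but no feature columns, so there is nothing to compute a distance over. A sketch of a guard in front of the distance call is shown below. `has_features` and `distances_or_none` are hypothetical helpers. They rely only on a `.shape` attribute, which NumPy arrays provide, and `SimpleNamespace` stands in for the array here.

```python
from types import SimpleNamespace


def has_features(matrix) -> bool:
    """True when a 2-D array-like has at least one row and one column."""
    n_rows, n_cols = matrix.shape
    return n_rows > 0 and n_cols > 0


def distances_or_none(matrix, distance_fn):
    """Call distance_fn only when the matrix has something to compare."""
    if not has_features(matrix):
        return None
    return distance_fn(matrix)


# Stand-in with the shape from the traceback (assumption for this sketch).
empty_matrix = SimpleNamespace(shape=(3, 0))
```

What the caller should do with `None` (skip the data score, fall back to location only) is a design decision this sketch does not make.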

Traceback #3, occurred 3 times. Example pedon: IN0137, lat: 23.16667, lon: 79.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(4, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #4, occurred 1 times. Example pedon: CM0030, lat: 8.51667, lon: 13.16667
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #5, occurred 1 times. Example pedon: AO0022, lat: -11.96667, lon: 14.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(6, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Here are the results from a modified version of my previous SQL query that uses the more complicated spatial filter (a WKT polygon buffer):

Jon's revised SQL query version
main_query = f"""
            WITH 
            inputs AS (
                SELECT
                    ST_GeomFromText('{buffer_wkt}', 4326) AS buffer_geom,
                    ST_SetSRID(ST_Point({lon}, {lat}), 4326) AS pt_geom
            ),

            valid_geom AS (
                SELECT
                    hwsd2,
                    geom
                FROM {table_name}, inputs
                WHERE geom && inputs.buffer_geom
                AND ST_Intersects(geom, inputs.buffer_geom)
            )

            SELECT
                vg.hwsd2,
                ST_AsEWKB(vg.geom) AS geom,
                ST_Distance(
                    ST_ClosestPoint(vg.geom::geography, inputs.pt_geom::geography),
                    inputs.pt_geom::geography
                ) AS distance,
                ST_Intersects(vg.geom, inputs.pt_geom) AS pt_intersect
            FROM valid_geom vg, inputs;
        """
Results:

# Total results: 2667

# Result proportions:

result
1          17.660292
2           9.073866
3           6.224222
4           4.949381
5           4.086989
6           1.912261
7           1.312336
8           0.862392
9           0.637420
10          0.449944
11          0.262467
12          0.224972
crash       0.524934
missing    51.818523
Name: pedon_key, dtype: float64

# Execution time deciles:

[0.24824    0.98612169 1.21317366 1.43319294 1.68784062 1.9345772
 2.27300433 2.61946906 3.03584662 3.64735029 6.57143089]
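
To compare the two timing summaries directly, the deciles can be divided element by element. The sketch below uses the two arrays exactly as printed in this comment and assumes each holds the 0th through 100th percentile in steps of 10, so index 5 is the median.

```python
# Execution time deciles in seconds, copied from the two result blocks.
garo_deciles = [1.15685061, 2.08929843, 2.36388511, 2.61102933, 2.91068624,
                3.27961859, 3.82248324, 4.32113001, 4.81062495, 5.42749154,
                11.16506981]
jon_deciles = [0.24824, 0.98612169, 1.21317366, 1.43319294, 1.68784062,
               1.9345772, 2.27300433, 2.61946906, 3.03584662, 3.64735029,
               6.57143089]

# Ratio > 1 means the first query was slower at that percentile.
ratios = [g / j for g, j in zip(garo_deciles, jon_deciles)]
median_ratio = ratios[5]

for pct, ratio in zip(range(0, 101, 10), ratios):
    print(f"p{pct:<3} {ratio:.2f}x")
```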

# Unique crash tracebacks (5 unique, 14 total):

Traceback #1, occurred 5 times. Example pedon: BR0675, lat: -7.28333, lon: -34.33333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 732, in rank_soils_global
      soilIDRank_output_pd = pd.read_csv(io.StringIO(list_output_data.rank_data_csv))
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  AttributeError: 'str' object has no attribute 'rank_data_csv'

Traceback #2, occurred 4 times. Example pedon: BR0632, lat: -21.35, lon: -47.78333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(3, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #3, occurred 3 times. Example pedon: IN0137, lat: 23.16667, lon: 79.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(4, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #4, occurred 1 times. Example pedon: CM0030, lat: 8.51667, lon: 13.16667
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #5, occurred 1 times. Example pedon: AO0022, lat: -11.96667, lon: 14.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(6, 0)) while a minimum of 1 is required by check_pairwise_arrays.

@shrouxm
Member

shrouxm commented Apr 22, 2025

@jjmaynard i have both those indexes yeah :/

ok well i will stop beating the query time horse until i'm able to run it on our staging DB, since i'm not sure how to debug further why some of these queries are much slower on my machine and i'm willing to just move on from that if it runs ok everywhere else

so i guess the main thing to look into at this point is the high missingness

@jjmaynard
Collaborator Author

@shrouxm Not sure if you've already tried running these commands, but they can help improve the performance of an existing index:

-- Rebuild the index for efficiency:
REINDEX INDEX hwsdv2_geom_geom_idx;

-- Reorder the table based on the spatial index:
CLUSTER hwsdv2 USING hwsdv2_geom_geom_idx;

-- Vacuum and analyze the table:
VACUUM ANALYZE hwsdv2;
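
If these maintenance steps were scripted for dev, CI, or staging databases, the statements could be generated from the table and index names. This is a sketch only: `maintenance_statements` is a hypothetical helper, and it builds the SQL without executing it. Two facts are worth keeping in mind when running them: `VACUUM` cannot run inside a transaction block, so the connection needs autocommit, and `CLUSTER` takes an exclusive lock on the table while it rewrites it.

```python
import re

# Plain unquoted identifiers only; anything else is rejected.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def maintenance_statements(table: str, index: str) -> list[str]:
    """Return the REINDEX, CLUSTER, and VACUUM ANALYZE statements in order."""
    for name in (table, index):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    return [
        f"REINDEX INDEX {index};",
        f"CLUSTER {table} USING {index};",
        f"VACUUM ANALYZE {table};",
    ]
```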

@shrouxm
Member

shrouxm commented May 6, 2025

@jjmaynard ooh i've never even heard of the postgres cluster command! i will try it out

@shrouxm
Member

shrouxm commented May 6, 2025

@jjmaynard i think the only thing we need to get this merged is to fix the US tests that got broken at some point along the way in this PR, are you able to look into that?

3 participants