fix: global code update #225

jjmaynard · 2025-02-04T07:15:08Z

Description

Updated global functions:

list_soils_global
rank_soils_global
sg_list

- updated code to query postgres database
- fixed random bugs

paulschreiber

Please:

- fix the highlighted issues
- add some test lat/lng points
- provide a copy of the SQL data(base) needed for testing

soil_id/db.py

soil_id/utils.py

soil_id/tests/us/test_us.py

soil_id/global_soil.py

garobrik

🎉 🎉 🎉 big milestone, congrats & thank you so much! left a couple very minor comments while skimming, and some general comments:

- fantastic that it has the exact same API as the US version
- we'll need to get the US tests working again (not sure why they're not working 🤔) and add some smoke tests for global before merging
- it seems the global code relies on having a database available at runtime. we'll need to figure out how to get that working from static files like US does
  - or if that's somehow completely impossible, there'll be a lift to get that DB set up in our dev, CI, and prod environments
- we'll need a way to determine whether to call the US or global functions

garobrik · 2025-02-05T18:19:25Z

soil_id/us_soil.py

@@ -2060,7 +2060,6 @@ def rank_soils(
        Score_Data_Loc = (D_final_loc["Score_Data_scale"] + D_final_loc["distance_score_scale"]) / (
            D_final_loc["data_weight"] + location_weight
        )
-        Score_Data_Loc /= np.nanmax(Score_Data_Loc)


question: why this small change to us_soil.py?

jjmaynard · 2025-02-25

Yeah, this change shouldn't be in this branch and should be ignored.

garobrik · 2025-02-05T18:20:44Z

soil_id/global_soil.py


-    # Call soildgrids API
+
+def sg_list(lon, lat):


suggestion: rename for clarity

Suggested change

-def sg_list(lon, lat):
+def soil_grids_list(lon, lat):

garobrik · 2025-03-11T23:51:29Z

hi @jjmaynard! i had a chance to set up the bulk testing harness today! here's some questions and what i found from a first pass of running it over the test data you sent (global_pedon_test_dataset.csv):

- i was able to get the existing test_global.py function to run successfully after setting up a local postgres database from the DB dump you uploaded in the google drive! :)
- list_soils_global runs successfully on around 80% of the pedons i've tested so far. there are two distinct tracebacks i get when it fails, which are copied below.
- on all pedons where list_soils_global runs successfully, rank_soils_global fails with the same traceback, copied below. it looks like a quick fix and i may just not be invoking the function correctly (if i try to add measurement values to the test in test_global.py it fails with the same traceback)
- 80% of queries take over 20s to run on my machine
- i've updated this PR with:
  - the scripts for the bulk testing
  - i added a coordinate for each distinct crash traceback i found to test_global.py
  - i added some arbitrary measurement values to the rank_soils_global call in test_global.py
  - so you should be able to just work off of trying to get test_global.py to run successfully again, and then if you have time try running the bulk testing scripts to see if any new crashes emerge

question: once we get it running successfully, which column of the test CSV should i compare against the output soil name to determine how well the algorithm ranked the result? there are four columns that seem like they could be relevant:

| Description_74 | Description_90 | Description_fao74 | Description_fao90 |
|----------------|----------------|-------------------|-------------------|
| Cambisols      | Calcisols      | Calcic Cambisol   | Luvic Calcisol    |
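
Whichever column ends up being the reference, the comparison itself could look roughly like the sketch below. This is only an illustration: `rank_of_expected` is a hypothetical helper, the predicted names are made up, and it assumes the algorithm output can be reduced to an ordered list of soil names. It returns a 1-based rank, or the string "missing" when the expected name is not in the list.

```python
from typing import Sequence, Union


def rank_of_expected(predicted: Sequence[str], expected: str) -> Union[int, str]:
    """Return the 1-based position of `expected` in `predicted`, or "missing"."""
    target = expected.strip().lower()
    for position, name in enumerate(predicted, start=1):
        if name.strip().lower() == target:
            return position
    return "missing"


# Hypothetical ranked output for one pedon, best match first.
predicted_names = ["Luvisols", "Calcisols", "Cambisols"]

# Compare against the example values from three of the candidate columns.
rank_vs_desc_90 = rank_of_expected(predicted_names, "Calcisols")
rank_vs_desc_74 = rank_of_expected(predicted_names, "Cambisols")
rank_vs_fao90 = rank_of_expected(predicted_names, "Luvic Calcisol")
```

The choice of column matters: a group-level name such as "Calcisols" can match a group-level prediction, while a qualified name such as "Luvic Calcisol" only matches if the algorithm returns names at that level of detail.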

and lastly here are the 3 distinct tracebacks i found from the algorithm crashes:

Traceback #1, occurred 204 times. Example pedon: AO SOTER_P.101c/63, lat: -10.950086, lon: 17.573093
  Traceback (most recent call last):
    File "/home/roux/terraso/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 56, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/global_soil.py", line 662, in rank_soils_global
      pedon_LAB = pedon_color(lab_Color, horizonDepth)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/utils.py", line 1798, in pedon_color
      if None in (pedon_top, pedon_bottom, pedon_l, pedon_a, pedon_b):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/.venv/lib/python3.12/site-packages/pandas/core/generic.py", line 1577, in __nonzero__
      raise ValueError(
  ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Traceback #2, occurred 43 times. Example pedon: AF0001, lat: 34.5, lon: 69.16667
  Traceback (most recent call last):
    File "/home/roux/terraso/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 54, in <module>
      list_result = list_soils_global(lat=lat, lon=lon)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/global_soil.py", line 90, in list_soils_global
      mucompdata_pd = hwsd2_data[["hwsd2", "fao90_name", "distance", "share", "compid"]]
                      ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  TypeError: 'NoneType' object is not subscriptable

Traceback #3, occurred 23 times. Example pedon: AO SOTER_P.104c/62, lat: -10.07856, lon: 15.107436
  Traceback (most recent call last):
    File "/home/roux/terraso/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 54, in <module>
      list_result = list_soils_global(lat=lat, lon=lon)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/roux/terraso/soil-id-algorithm/soil_id/global_soil.py", line 473, in list_soils_global
      reordered_lists = [[lst[i] for i in mucomp_index] for lst in lists_to_reorder]
                          ~~~^^^
  IndexError: list index out of range
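
Traceback #1 comes from the membership test `None in (...)`: Python evaluates it with `==`, and when an element is a pandas Series the comparison result cannot be reduced to a single bool. A minimal sketch of the failure and an identity-based alternative follows. `SeriesLike` is a stand-in so the snippet runs without pandas, and `any_is_none` is a hypothetical helper, not necessarily the fix that was applied in `pedon_color`.

```python
class _ElementwiseResult:
    """Mimics the result of `series == value`: it has no single truth value."""

    def __bool__(self):
        raise ValueError("The truth value of a Series is ambiguous.")


class SeriesLike:
    """Stand-in for a pandas Series; `==` returns an element-wise result."""

    def __eq__(self, other):
        return _ElementwiseResult()


def any_is_none(*values) -> bool:
    """True if any argument is literally None; identity checks never call `==`."""
    return any(value is None for value in values)


series_like = SeriesLike()

# The membership test falls back to `==`, which triggers the ambiguous-truth error.
try:
    None in (series_like, 0, 10)
    membership_raised = False
except ValueError:
    membership_raised = True

# The identity-based check handles the same inputs without raising.
safe_result = any_is_none(series_like, 0, 10)
```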

fixed the 3 traceback issues @shrouxm identified. code now runs in under 2 s, though SoilGrids API responses sometimes take 20-30 s.

- fixed distance calculation errors associated with coordinate system transformation
- added top depth input to rank function

When no map data is available at a location, the code returns "Data_unavailable".

- modified sql code to query all home mapunit components and all adjacent map units and components
- fixed component name indexing

paulschreiber · 2025-04-14T04:07:19Z

soil_id/global_soil.py

-        layer_name=None,
-        buffer_size=0.5,
-    )
+def list_soils_global(lon, lat):


issue: this should be called list_soils() to match the us version. this module is namespaced and doesn't need a prefix.

paulschreiber · 2025-04-14T04:07:35Z

soil_id/global_soil.py


 ##############################################################################################
 #                                   rankPredictionGlobal                                     #
 ##############################################################################################
-def rankPredictionGlobal(
-    lon, lat, soilHorizon, horizonDepth, rfvDepth, lab_Color, bedrock, cracks, plot_id=None
+def rank_soils_global(


issue: this should be called rank_soils() to match the us version. this module is namespaced and doesn't need a prefix.

jjmaynard · 2025-04-15T19:28:54Z

@shrouxm when you list the postgis table indexes, do all of these appear?
SELECT * FROM pg_indexes WHERE tablename = 'hwsdv2';

| schemaname | tablename | indexname             | tablespace | indexdef                                                                 |
|------------|-----------|------------------------|------------|--------------------------------------------------------------------------|
| public     | hwsdv2    | hwsdv2_pkey            |            | CREATE UNIQUE INDEX hwsdv2_pkey ON public.hwsdv2 USING btree (fid)       |
| public     | hwsdv2    | hwsdv2_geom_geom_idx   |            | CREATE INDEX hwsdv2_geom_geom_idx ON public.hwsdv2 USING gist (geom)     |
| public     | hwsdv2    | hwsdv2_geom_idx        |            | CREATE INDEX hwsdv2_geom_idx ON public.hwsdv2 USING gist (((geom)::geography)) |

I was able to run your sql query after creating a new spatial index and adding 'geom' to the selected columns:

point = f"ST_SetSRID(ST_Point({lon}, {lat}), 4326)"
main_query = f"""
    SELECT
        geom,
        hwsd2,
        ST_Distance(
            geom::geography,
            {point}::geography
        ) AS distance,
        ST_Intersects(geom, {point}) AS pt_intersect
    FROM {table_name}
    WHERE ST_DWithin(geom::geography, {point}::geography, {buffer_dist});
"""

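The query above interpolates `lon`, `lat` and `buffer_dist` directly into the SQL text. A possible variant, sketched here under the assumption that the project uses a DB-API driver with `%s` placeholders (psycopg2 works this way), passes the values as parameters instead. `build_hwsd_query` and `ALLOWED_TABLES` are hypothetical names, and the buffer value is only illustrative.

```python
from typing import Tuple

# Table name taken from the index listing above; identifiers are checked, not bound.
ALLOWED_TABLES = {"hwsdv2"}


def build_hwsd_query(
    table_name: str, lon: float, lat: float, buffer_dist: float
) -> Tuple[str, tuple]:
    """Build the distance query with placeholders instead of f-string values."""
    if table_name not in ALLOWED_TABLES:
        # Identifiers cannot be sent as query parameters, so validate them explicitly.
        raise ValueError(f"unexpected table name: {table_name!r}")
    point = "ST_SetSRID(ST_Point(%s, %s), 4326)"
    sql = f"""
        SELECT
            geom,
            hwsd2,
            ST_Distance(geom::geography, {point}::geography) AS distance,
            ST_Intersects(geom, {point}) AS pt_intersect
        FROM {table_name}
        WHERE ST_DWithin(geom::geography, {point}::geography, %s);
    """
    # The point expression appears three times, so lon/lat repeat in placeholder order.
    params = (lon, lat, lon, lat, lon, lat, float(buffer_dist))
    return sql, params


sql, params = build_hwsd_query("hwsdv2", 69.16667, 34.5, 10000)
```

With a DB-API cursor this would then be run as `cursor.execute(sql, params)`.
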
Here are the results using your version:

Garo's SQL query version

# Total results: 2667

# Result proportions:

result
1          17.547807
2           9.223847
3           6.224222
4           4.911886
5           4.011999
6           1.949756
7           1.349831
8           0.862392
9           0.637420
10          0.449944
11          0.262467
12          0.224972
crash       0.524934
missing    51.818523
Name: pedon_key, dtype: float64

# Execution time deciles:

[ 1.15685061  2.08929843  2.36388511  2.61102933  2.91068624  3.27961859
  3.82248324  4.32113001  4.81062495  5.42749154 11.16506981]

# Unique crash tracebacks (5 unique, 14 total):

Traceback #1, occurred 5 times. Example pedon: BR0675, lat: -7.28333, lon: -34.33333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 732, in rank_soils_global
      soilIDRank_output_pd = pd.read_csv(io.StringIO(list_output_data.rank_data_csv))
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  AttributeError: 'str' object has no attribute 'rank_data_csv'

Traceback #2, occurred 4 times. Example pedon: BR0632, lat: -21.35, lon: -47.78333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(3, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #3, occurred 3 times. Example pedon: IN0137, lat: 23.16667, lon: 79.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(4, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #4, occurred 1 times. Example pedon: CM0030, lat: 8.51667, lon: 13.16667
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #5, occurred 1 times. Example pedon: AO0022, lat: -11.96667, lon: 14.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(6, 0)) while a minimum of 1 is required by check_pairwise_arrays.
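
Both crash types above look like they could be caught by guards before the failing calls. The sketch below rests on two assumptions that are not confirmed in this thread: that the `'str' object has no attribute 'rank_data_csv'` error arises when the "Data_unavailable" string from the list step is passed on to the rank step, and that the zero-feature error arises when no property column is left for a depth slice. All helper names are hypothetical, and plain lists stand in for the arrays passed to `gower_distances`.

```python
from typing import Optional, Sequence

# Sentinel string mentioned earlier in this PR.
DATA_UNAVAILABLE = "Data_unavailable"


def has_rank_data(list_output) -> bool:
    """True only when the list step produced an object carrying `rank_data_csv`."""
    return not isinstance(list_output, str) and hasattr(list_output, "rank_data_csv")


def feature_count(matrix: Sequence[Sequence[float]]) -> int:
    """Number of columns in a row-major matrix; 0 when there are no rows."""
    return len(matrix[0]) if len(matrix) > 0 else 0


def safe_distance_input(matrix) -> Optional[Sequence[Sequence[float]]]:
    """Return the matrix if it has at least one feature column, else None."""
    return matrix if feature_count(matrix) >= 1 else None


class FakeListOutput:
    """Hypothetical stand-in for a successful list result."""

    rank_data_csv = "compname,score\nA,0.9\n"


checks = {
    "sentinel": has_rank_data(DATA_UNAVAILABLE),
    "real": has_rank_data(FakeListOutput()),
    "empty": safe_distance_input([[], [], []]),  # same shape as (3, 0)
    "ok": safe_distance_input([[0.1, 0.2], [0.3, 0.4]]),
}
```

A caller could skip ranking, or return an explicit "no data" result, whenever `has_rank_data` is false or `safe_distance_input` returns `None`.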

Here are the results from a modified version of my previous sql query that uses the more complicated spatial filter (WKT polygon buffer):

Jon's revised SQL query version

main_query = f"""
            WITH 
            inputs AS (
                SELECT
                    ST_GeomFromText('{buffer_wkt}', 4326) AS buffer_geom,
                    ST_SetSRID(ST_Point({lon}, {lat}), 4326) AS pt_geom
            ),

            valid_geom AS (
                SELECT
                    hwsd2,
                    geom
                FROM {table_name}, inputs
                WHERE geom && inputs.buffer_geom
                AND ST_Intersects(geom, inputs.buffer_geom)
            )

            SELECT
                vg.hwsd2,
                ST_AsEWKB(vg.geom) AS geom,
                ST_Distance(
                    ST_ClosestPoint(vg.geom::geography, inputs.pt_geom::geography),
                    inputs.pt_geom::geography
                ) AS distance,
                ST_Intersects(vg.geom, inputs.pt_geom) AS pt_intersect
            FROM valid_geom vg, inputs;
        """

Results:

# Total results: 2667

# Result proportions:

result
1          17.660292
2           9.073866
3           6.224222
4           4.949381
5           4.086989
6           1.912261
7           1.312336
8           0.862392
9           0.637420
10          0.449944
11          0.262467
12          0.224972
crash       0.524934
missing    51.818523
Name: pedon_key, dtype: float64

# Execution time deciles:

[0.24824    0.98612169 1.21317366 1.43319294 1.68784062 1.9345772
 2.27300433 2.61946906 3.03584662 3.64735029 6.57143089]

# Unique crash tracebacks (5 unique, 14 total):

Traceback #1, occurred 5 times. Example pedon: BR0675, lat: -7.28333, lon: -34.33333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 732, in rank_soils_global
      soilIDRank_output_pd = pd.read_csv(io.StringIO(list_output_data.rank_data_csv))
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  AttributeError: 'str' object has no attribute 'rank_data_csv'

Traceback #2, occurred 4 times. Example pedon: BR0632, lat: -21.35, lon: -47.78333
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(3, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #3, occurred 3 times. Example pedon: IN0137, lat: 23.16667, lon: 79.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(4, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #4, occurred 1 times. Example pedon: CM0030, lat: 8.51667, lon: 13.16667
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required by check_pairwise_arrays.

Traceback #5, occurred 1 times. Example pedon: AO0022, lat: -11.96667, lon: 14.95
  Traceback (most recent call last):
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/tests/global/generate_bulk_test_results.py", line 60, in <module>
      result_record["rank_result"] = rank_soils_global(
                                     ^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/global_soil.py", line 847, in rank_soils_global
      D = gower_distances(slice_mat)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 882, in gower_distances
      X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/soil_id/utils.py", line 829, in check_pairwise_arrays
      X = validation.check_array(X, accept_sparse="csr", dtype=dtype, estimator=estimator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mnt/c/Users/jmaynard/Documents/GitHub/soil-id-algorithm/env/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1096, in check_array
      raise ValueError(
  ValueError: Found array with 0 feature(s) (shape=(6, 0)) while a minimum of 1 is required by check_pairwise_arrays.
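
To compare the two runs above on equal footing, the listed percentages can be accumulated into top-k shares. A small sketch, assuming the numeric `result` values are the rank at which the reference soil appeared, and that the eleven timing values are the 0th to 100th percentiles in steps of ten (so the sixth is the median); neither is stated explicitly in the output.

```python
from itertools import accumulate

# Percentages copied from the two "Result proportions" listings above (ranks 1-12).
garo_rank_pct = [17.547807, 9.223847, 6.224222, 4.911886, 4.011999, 1.949756,
                 1.349831, 0.862392, 0.637420, 0.449944, 0.262467, 0.224972]
jon_rank_pct = [17.660292, 9.073866, 6.224222, 4.949381, 4.086989, 1.912261,
                1.312336, 0.862392, 0.637420, 0.449944, 0.262467, 0.224972]

# Sixth of the eleven decile values from each run, in seconds.
garo_median_s = 3.27961859
jon_median_s = 1.9345772


def top_k(rank_pct, k):
    """Cumulative share of pedons whose reference soil was ranked in the top k."""
    return list(accumulate(rank_pct))[k - 1]


garo_top3 = top_k(garo_rank_pct, 3)
jon_top3 = top_k(jon_rank_pct, 3)
median_speedup = garo_median_s / jon_median_s
```

Under those assumptions both queries place the reference soil in the top three for about a third of the pedons, so the revised query mainly changes run time rather than ranking.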

garobrik · 2025-04-22T14:41:45Z

@jjmaynard i have both those indexes yeah :/

ok well i will stop beating the query time horse until i'm able to run it on our staging DB, since i'm not sure how to debug further why some of these queries are much slower on my machine and i'm willing to just move on from that if it runs ok everywhere else

so i guess the main thing to look into at this point is the high missingness

jjmaynard · 2025-04-22T18:48:13Z

@shrouxm Not sure if you've already tried running these commands, but they can help improve the performance of an existing index:

-- Rebuild the index for efficiency:
REINDEX INDEX hwsdv2_geom_geom_idx;

-- Reorder the table based on the spatial index:
CLUSTER hwsdv2 USING hwsdv2_geom_geom_idx;

-- Vacuum and analyze the table:
VACUUM ANALYZE hwsdv2;

format global test dataset gitignore global testing

garobrik · 2025-05-06T18:39:55Z

@jjmaynard ooh i've never even heard of the postgres cluster command! i will try it out

garobrik · 2025-05-06T20:16:30Z

@jjmaynard i think the only thing we need to get this merged is to fix the US tests that got broken at some point along the way in this PR, are you able to look into that?

garobrik · 2025-05-27T23:25:46Z

@jjmaynard adventures from trying to run global on our staging environment:

even on non-trivially beefy machines ($250/mo in compute) the request times out after 30s

i was still seeing that the DB query was the bottleneck on my machine, so i looked into optimizing it by segmenting the polygons. i ran the following commands:

    CREATE TABLE IF NOT EXISTS hwsd2_segment AS
    SELECT fid as parent_id, hwsd2 as hwsd2_id, ST_Subdivide(ST_Transform(geom, 4326), 10)::geography AS shape
    FROM hwsdv2;
    
    CREATE INDEX IF NOT EXISTS hwsd2_segment_shape_idx ON hwsd2_segment USING gist(shape);
    
    CLUSTER hwsd2_segment USING hwsd2_segment_shape_idx;
    
    point = f"ST_SetSRID(ST_Point({lon}, {lat}), 4326)::geography"
    main_query = f"""
        SELECT
            hwsd2_id as hwsd2,
            MIN(ST_Distance(
                shape,
                {point}
            )) AS distance,
            BOOL_OR(ST_Intersects(shape, {point})) AS pt_intersect
        FROM hwsd2_segment
        WHERE ST_DWithin(shape, {point}, {buffer_dist})
        GROUP BY hwsd2_id;
    """

- now seeing much faster queries (an order of magnitude or even 2)
- now some queries are succeeding on staging, but some are still timing out after 30s
- if i pull down the buffer distance from 100km to 10km, all queries are succeeding on staging after 2 seconds or so, with an accuracy trade-off
- locally, i was trying to figure out why things are still so slow. i did some good old print statement benchmarking, and found that for e.g. the coordinate latitude: 34.55, longitude: 69.16667, on my machine:
  - the extract_hwsd2_data function takes under a second
  - the rest of the list function takes 65 seconds
  - the rank function takes 2 seconds
- i know from a previous test run that even without the optimized DB query 99% of coordinates ran in less than 17s, so it does seem like these issues are from particular pathological coordinates. the one above for example returns 302 rows from the main query, so maybe there's a sneaky n² somewhere? my understanding was that after the extract_hwsd2_data function there's no more geospatial business going on.
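
The print-statement benchmarking described above can be made repeatable with a small timing helper. This is a generic sketch, not code from the PR: `timed` and `quadratic_pairs` are hypothetical names, and the quadratic loop is only a stand-in for whichever step in the list function turns out to scale badly with the number of rows.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Accumulate wall-clock seconds spent inside the `with` block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


def quadratic_pairs(n):
    """Deliberately O(n^2) stand-in for a suspected slow step."""
    return sum(1 for i in range(n) for j in range(n) if i < j)


with timed("query"):
    rows = list(range(302))  # 302 mirrors the row count reported above
with timed("post_processing"):
    pair_count = quadratic_pairs(len(rows))
```

Wrapping successive stages of the list function in `timed(...)` blocks and comparing `timings` for coordinates with few and many rows would show which stage grows faster than linearly.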

so, all in all, good progress on getting things running on staging. maybe some accuracy/cost/performance trade-offs to consider. and wondering if you're able to shed any light on the long execution time in the list function!
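
For reference, the `GROUP BY hwsd2_id` in the segmented query collapses many segment rows back into one row per map unit, keeping the minimum distance and whether any segment contains the point. The same aggregation is sketched below in plain Python. The rows are made-up values and `aggregate_segments` is a hypothetical helper, shown only to illustrate what `MIN(...)` and `BOOL_OR(...)` compute.

```python
# Hypothetical rows from hwsd2_segment: (hwsd2_id, distance_m, intersects_point)
segment_rows = [
    (101, 0.0, True),
    (101, 850.0, False),
    (202, 1200.5, False),
    (202, 430.25, False),
]


def aggregate_segments(rows):
    """Collapse segment rows to one (min distance, any intersect) entry per map unit."""
    result = {}
    for hwsd2_id, distance, intersects in rows:
        if hwsd2_id in result:
            prev_distance, prev_intersects = result[hwsd2_id]
            result[hwsd2_id] = (min(prev_distance, distance), prev_intersects or intersects)
        else:
            result[hwsd2_id] = (distance, intersects)
    return result


per_map_unit = aggregate_segments(segment_rows)
```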

garobrik · 2025-06-26T00:48:38Z

this code was merged to main in #266

jjmaynard added 7 commits December 20, 2024 07:03

fix: update global soilid code

f0ac653

fix: update global soilid code

64cf160

fix: postgres database integration

93af38f

- updated code to query postgres database
- fixed random bugs

Merge branch 'fix/global-code-update' of https://github.com/techmatte…

f5a11fb

…rs/soil-id-algorithm into fix/global-code-update

fix: created .env file and updated .gitignore

9ff26ea

fix: revision of global functions

f8bfc57

fix: remove test files

37561eb

jjmaynard requested review from paulschreiber and garobrik February 4, 2025 07:15

jjmaynard and others added 3 commits February 4, 2025 09:56

fix: lint format changes

b273b3d

fix: make format changes

51cf8c8

style: use LF line endings

ddab942

paulschreiber requested changes Feb 4, 2025

This was referenced Feb 4, 2025

Jon reviews Global Soil ID code and implements version for new app #150

Closed

Plan Engineering Work for Global Soil ID techmatters/terraso-product#1237

Closed

garobrik reviewed Feb 5, 2025

garobrik changed the title ~~Fix/global code update~~ fix: global code update Feb 5, 2025

DerekCaelin mentioned this pull request Feb 5, 2025

Global Soil ID (Soil ID works in non-US locations) techmatters/terraso-product#993

Open

21 tasks

Normalize line endings to LF

d9ba6ab

jjmaynard force-pushed the fix/global-code-update branch from 898207a to d9ba6ab February 6, 2025 01:40

jjmaynard and others added 6 commits February 11, 2025 17:12

Update soil_id/db.py

0b6c5c9

Co-authored-by: Paul Schreiber <[email protected]>

fix: new psql color tables

418ea47

Merge branch 'fix/global-code-update' of https://github.com/techmatte…

082568b

…rs/soil-id-algorithm into fix/global-code-update

fix: update psql functions

1299c7e

fix: HWSDv2 postgres integration

959add1

fix: global color calculation

605b369

feat: update tests for global algorithm

587cc7e

garobrik mentioned this pull request Mar 18, 2025

feat: integrate global with terraso #240

Closed

fix: global bugs

2fc46f0

fixed the 3 traceback issues @shrouxm identified. code now runs in under 2 s, though SoilGrids API responses sometimes take 20-30 s.

garobrik and others added 6 commits April 13, 2025 23:55

test: update tests

65152ab

fix: traceback errors-geo

4e02a14

- fixed distance calculation errors associated with coordinate system transformation
- added top depth input to rank function

fix: no map data return

8d6e0b1

When no map data is available at a location, the code returns "Data_unavailable".

fix: make black/lint format

6a5ef1e

fix: sql code update

0951ee4

- modified sql code to query all home mapunit components and all adjacent map units and components
- fixed component name indexing

build: update Makefile README.md pyproject.toml requirements-dev.txt …

d54c625

…requirements.txt requirements/dev.in setup.cfg from main

paulschreiber force-pushed the fix/global-code-update branch from 6c9d5d5 to d54c625 April 14, 2025 03:55

paulschreiber added 3 commits April 14, 2025 00:01

style: run make format

ab9d207

fix: fix imports in config

caebd71

fix: remove unsued variables

22eccb5

paulschreiber reviewed Apr 14, 2025

jjmaynard added 2 commits April 22, 2025 12:07

fix: global test dataset and testing

6a947a2

format global test dataset gitignore global testing

Merge branch 'fix/global-code-update' of https://github.com/techmatte…

e16183e

…rs/soil-id-algorithm into fix/global-code-update

garobrik added 4 commits May 6, 2025 11:50

test: update tests

d657f9a

dx: add docker file for dev database

e727737

fix: lint

a4fef55

test: skip global test by default (no DB in CI)

9cc216a

fix: pick a postgres version that matches our backend

f7c48f4

Merge branch 'main' into fix/global-code-update

6d4a70f

garobrik mentioned this pull request Jun 19, 2025

feat: final aggregation of global code #266

Merged

garobrik closed this Jun 26, 2025

garobrik deleted the fix/global-code-update branch June 26, 2025 00:48
