Skip to content

Fix Hierarchy calculation and comparison#3021

Draft
Jongmassey wants to merge 3 commits into
mainfrom
Jongmassey/new-style-hierarchy-comparison
Draft

Fix Hierarchy calculation and comparison#3021
Jongmassey wants to merge 3 commits into
mainfrom
Jongmassey/new-style-hierarchy-comparison

Conversation

@Jongmassey
Copy link
Copy Markdown
Contributor

@Jongmassey Jongmassey commented Apr 9, 2026

This PR covers three problems found as likely causative of #3038

  1. Calculation of hierarchies in _check_version_by_hierarchy() is only ever done old-style
  2. Calculation of hierarchies in _create_version_with_codes() is only ever done old-style
  3. Deletion of a search (using the delete_search() action) does not trigger a hierarchy cache refresh

I will address the role of each of these in its own comment for clarity.

This PR contains fixes for these three problem, and therefore fixes #3038

TODO:

  • rewrite TestUpdateCodelistVersionCompatibilityWithSearchButMismatchedHierarchy to work with these fixes as it previously relied on bugs
  • add failing tests for each of the above

Use same hierarchy calculation methods for cached hierarchy
and that calculated during compatibility checks
CodelistVersion._calculate_new_style_hierarchy() creates a hierarchy from a codelistversion's code_objs.
_create_version_with_codes() followed the logic of CodelistVersion._calculate_old_style_hierarchy() which creates a hierarchy from a codelistversion's codes.

The latter includes only included codes, the former includes excluded codes that are hierarchically related to the included.
Where the latter and the former are not equal then this can produce different hierarchies.

By no longer passing the old-style-calculated hierarchy (which is the only style of hierarchy we can calculate before we've got code_objs, and we need it to create the code_objs) to cache_hierarchy we force it to calculate a new-style one for us before caching it.

This ensures that the cached hierarchy is calculated in a manner consistent with the model's calculate_hierarchy()
Deleting a search can remove unresolved codes from a CodelistVersion's code_objs
These code_objs are used to calculate the CodelistVersion's Hierarchy
Therefore, if we change the code_objs, we should recalculate the cached hierarchy
@Jongmassey Jongmassey marked this pull request as draft April 9, 2026 16:58
@Jongmassey
Copy link
Copy Markdown
Contributor Author

Jongmassey commented Apr 21, 2026

1. Calculation of hierarchies in _check_version_by_hierarchy() is only ever done old-style

CodelistVersion.calculate_hierarchy() will call either _calculate_old_style_hierarchy() or _calculate_new_style_hierarchy() depending on whether the CodelistVersion is "old" (has csv_data) or "new" style.

_calculate_old_style_hierarchy() instantiates a Hierarchy by passing the CodelistVersion's codes to Hierarchy.from_codes().

_calculate_new_style_hierarchy() instantiates a Hierarchy by passing the CodelistVersion's code_objs to Hierarchy.from_codes().

_check_version_by_hierarchy() only uses the "old style" method for constructing a hierarchy against the incoming coding system release. This newly-constructed hierarchy is compared against a CodelistVersion's cached hierarchy.

Old-style and new-style hierarchy calculation can produce different results.

If a CodelistVersion has a new-style cached hierarchy, calculating an old-style hierarchy with a new coding system release may produce a dissimilar hierarchy where calculating a new-style one would produce an identical one. This incorrect choice of hierarchy calculation style may therefore lead to an incorrect assertion of incompatibility between a CodelistVersion and a CodingSystemRelease.

@Jongmassey
Copy link
Copy Markdown
Contributor Author

Jongmassey commented Apr 21, 2026

2. Calculation of hierarchies in _create_version_with_codes() is only ever done old-style

_create_version_with_codes() is used to create CodelistVersions from uploaded csv files.

It builds, then caches, a hierarchy from the uploaded codes in a manner identical to _calculate_old_style_hierarchy():

hierarchy = Hierarchy.from_codes(coding_system, codes)

even when the resultant CodelistVersion object is considered "new style" - i.e. the uploaded csv data has been translated into code objects instead of being stored raw in csv_data.

As a new-style CodelistVersion is being constructed, new-style hierarchies should be constructed using these codeobjs.

As it happens, the problem in point 1. means that for uploaded codelists, the "incorrect" behaviour of only calculating an old style codelist for hierarchy comparison purposes means this behaviour may not result in under-ascertainment of CodelistVersion-CodingSystemRelease compatibility for uploaded codelists, but if point 1 is fixed it will for the same reasons as is currently the case for other new-style codelists.

@Jongmassey
Copy link
Copy Markdown
Contributor Author

Jongmassey commented Apr 21, 2026

3.Deletion of a search (using the delete_search() action) does not trigger a hierarchy cache refresh

Codelists built with the builder are always "new style". New-style hierarchy calculation uses a CodelistVersion's CodeObjs.

When a Search is added to a draft CodelistVersion in the builder, CodeObjs for each of its SearchResults are added to the CodelistVersion, each with an Undecided status (unless they are already present with a different status by virtue of having been also found in a previous search).

Therefore, create_search() calls cache_hierarchy() (without passing a pre-calculated hierarchy as the hierarchy kwarg, thus telling cache_hierarchy to calculate a fresh hierarchy before caching it).

Deleting a search from a CodelistVersion in the builder will remove any Undecided CodeObjs that resulted from that search from the CodelistVersion.

However, delete_search() does not call cache_hierarchy(). This leaves nodes for these deleted CodeObjs in the cached hierarchy.

This means that CodelistVersions that have deleted searches can have cached hierarchies that do not reflect their current state.

@Jongmassey
Copy link
Copy Markdown
Contributor Author

Jongmassey commented Apr 21, 2026

Finding some bugs

This investigation originated from an observation that CodelistVersion-CodingSystemRelease compatibility rates were extremely low (via @opensafely-core/team-rsi 's Codelist Quality Reports) and that creating a new version of an CodelistVersion that was apparently incompatible with the latest release of a coding system resulted in a draft with no unresolved codes - indicating that both its searches and its hierarchy were actually compatible with the latest release after all.

This led to further inspection of _check_version_by_hierarchy() - where a previous bug had recently been found and fixed (#2911 ). Initial thoughts were that this fix may have introduced a new bug, but since the historical record of compatibility had not yet been rebuilt following this fix, and re-checking after checking out the commit before this fix, it was decided that this was not the case.

Comparison of the hierarchy comparison and hierarchy calculation logic revealed problem 1, which is asserted to be prima facie incorrect due to the inconsistent methodologies between these two places.

A core finding driving the investigation of this issue was the fact that some CodelistVersions don't have hierarchies that "round-trip" i.e. their cached hierarchy is mismatched to one calculated afresh using the coding system release with which they were created.

Finding these bugs was started by comparing CodelistVersions with round-tripping hierarchies to those without.

matched = []
mismatched = []
for cv in CodelistVersion.objects.all():
    if cv.hierarchy == cv.calculate_hierarchy(cv.coding_system):
        matched.append(cv)
    else:
        mismatched.append(cv)

No clean time-based, coding-system-based, or creation-method-based differences were observed between these two sets. The exceptions to this were that cloned codelists inherited the round-trippiness of their source CodelistVersion, and BNF-to-dm+d codelists were completely unaffected (see #3031 )

Instead, example codelists were created by stepping through the debugger for both built and uploaded codelists and observing changes in the codelistversion state and cached hierarchy at each stage.

This revealed problem 2 and 3 above.

Proving that problem 2 and 3 were indeed the cause of the all the non-round-tripping cases was more challenging as only some CodelistVersions were affected.

@Jongmassey
Copy link
Copy Markdown
Contributor Author

Jongmassey commented Apr 21, 2026

Proving problem 2

Problem 2 regards uploaded codelists, and only affects some. This presents two challenges: robustly identifying uploaded new-style codelists, and understanding why only some are affected.

Old-style uploaded CodelistVersions have a populated 'csv_data' property, new-style ones do not.

The first step in building a CodelistVersion is to add some searches. Therefore any CodelistVersion without searches (and without any clone or converted characteristic strings in its Methodology statement) was considered to be most likely uploaded.

It was theorised that affected uploaded CodelistVersions should have differing results for _calculate_old_style_hierarchy() and _calculate_new_style_hierarchy(), and unaffected should have identical results.

Calculating all these hierarchies was an extremely resource intensive and exercise.

Filtering for just codelists supposed to be uploaded and for cases where this theorised condition didn't hold completely true given these results:

>>> {k:len(v) for k,v in hierarchies_uploaded.items()}
{
'bnf_matched_old_new_hierarchy_not_equal': 12
'bnf_mismatched_old_new_hierarchy_equal': 4,
'dmd_matched_old_new_hierarchy_not_equal': 842,
'dmd_mismatched_old_new_hierarchy_equal': 5,
'icd10_matched_old_new_hierarchy_not_equal': 40,
'icd10_mismatched_old_new_hierarchy_equal': 11,
'snomedct_matched_old_new_hierarchy_not_equal': 455,
'snomedct_mismatched_old_new_hierarchy_equal': 39,
}

these are just the examples that violate the proposed theory, there are many thousands that align with it

Clearly either the establishment of uploaded status or my theory weren't 100% correct (in the context of the many thousands of CodelistVersions these are a small percentage, but nonetheless there's something not quite right here).

BNF

Focussing on BNF since it's easy to see the hierarchy within the codes themselves, the four cases where the hierarchy did not round trip but had equal old and new style hierarchies were examined.
In all these cases, neither the old nor new hierarchies were equal to the cache.
Therefore, it can not be the case that the cached hierarchy was created by _create_version_with_codes(). This was validated by checking out the commit immediately prior to each CodelistVersion's creation to ensure there were no logic changes that could affect this.

In all four cases these also only had included ("+","(+)") codeobjs, which might also suggest they were uploaded but
were found to be CodelistVersions which had actually been created using the builder and then had all their searches deleted.

To give the example of one such CodelistVersion with hash 5b45576f - a codelist of GLP-1 RAs - the hierarchy state was successfully reconstructed by building a codelist and deleting its only search:

  • Create new draft BNF Codelist
  • Search for code = '0601023'
  • Include '0601023AB', '0601023AZ', '0601023AQ', '0601023Y0'
  • Compare cached hierarchy to 5b45576f - equal
  • Compare codeobjs to 5b45576f - not equal, new version has lots more unresolved codes
  • Delete search
  • Compare codeobjs to 5b45576f - equal
  • Compare cached hierarchy to 5b45576f - equal
  • Compare hierarchy calculated by calculate_hierarchy() to one freshly calculated for 5b45576f - equal
  • Compare old and new calculated hierarchies to 5b45576f's cached - not equal

For the 12 CodelistVersions with successfully round-tripping hierarchies but unequal old and new hierarchies, these were also misidentified uploaded codelists that had actually been built with deleted searches but had a few key differences to the previous case. The first difference are the codeobjs in these 12 contain both explicit and implicit exclusions, and crucially explicit exclusions further up the hierarchy to their included codes. This scenario cannot happen with uploaded codelists so these must also be built.

A worked example for recreation of 1ce8a2d1 - Drugs used in diabetes (BNF section 6.1) is given here:

  • Create new draft BNF Codelist
  • Search for code '0601'
  • Include "060102", "060101"
  • Exclude "0601", "060106"
  • Leave "060104" unresolved
  • Delete the search

Search deletion as part of a codelist development strategy is a reasonable optimisation/shortcut for users to seek to employ, and we should ensure that hierarchies support this whilst considering the implications of this strategy on reproducibility.

dm+d

The high numbers of observed incongruent dmd cases feels improbable to attribute to use of the builder and deletion of searches, given the relatively recent implementation of this feature (#2704 ). Indeed, almost all of "dmd_matched_old_new_hierarchy_not_equal" date from before this PR was merged.

Some of the earlier examples from this set are actually converted from BNF in a roundabout download/upload fashion before this conversion was possible in-line and thus miss the diagnostic statements in their Methodologies.

Many of them also have codes unknown to the coding system, as they were uploaded before validation of such things was put in place:

>>> Counter(any(set(cv.codes) - set(cv.coding_system.lookup_names(cv.codes))) for cv in hierarchies_uploaded['dmd_matched_old_new_hierarchy_not_equal'])
Counter({False: 515, True: 327})

where "True" means there are codes present not found in the coding system

I'm going to ignore this set as they are an historical, low-data-quality artifact.

This still leaves some historic out-of-band BNF converted codelists. They were crudely excluded by mention of "BNF" in the methodology:

>>> cvs = [cv for cv in hierarchies_uploaded['dmd_matched_old_new_hierarchy_not_equal'] if not any(set(cv.codes) - set(cv.coding_system.lookup_names(cv.codes))) ]
>>> cvs = [cv for cv in cvs if "bnf" not in (cv.codelist.methodology or "").lower()]
>>> len(cvs)
137

This also still leaves some "old style" older uploaded codelists which need to be excluded. The dm+d coding system currently supports new style hierarchies but previously did not and all older dm+d codelists were not ported over to new style (there exists a management command for this) when #2704 was merged.
Let's exclude them from this analysis, but we may wish to consider batch conversion of them in the future.

>>> cvs = [cv for cv in cvs if not cv.csv_data]
>>> len(cvs)
0

We have effectively shown that the 847 apparent counter-examples are all just historical data artifacts.

As for the 5 in "'dmd_mismatched_old_new_hierarchy_equal'", these are plausibly caused by search deletion. Indeed they are all created after the merge of #2704

I am satisfied by this point that problem 2 is adequately proven to cause mismatched hierarchies for uploaded codelists.

TODO: create a test fixture which displays the old/new hierarchy difference to support a failing test for this problem as current fixtures all produce the same hierarchy for both old and new style calculation (I think, need to double-check this).

@Jongmassey
Copy link
Copy Markdown
Contributor Author

Jongmassey commented Apr 21, 2026

Proving problem 3

This has already been substantially proven within the proof for problem 2 - where all searches for a built codelist are deleted.

Let's try to find some examples where only some of the searches have been deleted and see if this proof still holds.
Our candidate versions have searches, but have some CodeObjs without corresponding SearchResults:

>>> bnf_has_searches_codeobjs_not_in_results = [cv for cv in CodelistVersion.objects.filter(codelist__coding_system_id="bnf") if cv.searches.count() >0 and any([co for co in cv.code_objs.all() if co.results.count()==0])]
>>> len(bnf_has_searches_codeobjs_not_in_results)
78
>>> Counter([cv.hierarchy == cv.calculate_hierarchy() for cv in bnf_has_searches_codeobjs_not_in_results])
Counter({True: 75, False: 3})

Those 75 which are in the candidate problem set but don't have mismatched/non-round-tripping hierarichies suggest that either this problem assessment isn't correct or the methodology to find partial search deletes is incorrect.
Let's investigate:

>>> bnf_has_searches = [cv for cv in CodelistVersion.objects.filter(codelist__coding_system_id="bnf") if cv.searches.count() >0]
>>> bnf_has_searches_mismatched = []
>>> bnf_has_searches_matched = []
>>> for cv in bnf_has_searches:
...     if cv.hierarchy == cv.calculate_hierarchy():
...         bnf_has_searches_matched.append(cv)
...     else:
...         bnf_has_searches_mismatched.append(cv)
...         
...     
... 
>>> len(bnf_has_searches_mismatched)
206
>>> len(bnf_has_searches_matched)
1094
>>> len([cv for cv in bnf_has_searches_mismatched if cv not in bnf_has_searches_codeobjs_not_in_results])
203

Clearly there are many more builder-created BNF codelists with mismatched hierarchies than the number of "partial search deleted" codelists. This suggests the existence of another problem re: hierarchies for built codelists, or maybe some of these are actually "partial search deletes" but we missed them.

Picking 0afcc5cc as an example:

>>> cv.hierarchy.nodes
{'0212000B0AAAMAM', '0803043N0AAACAC', '0212000Y0AAAAAA', '0212000Y0BBACAD', '1310020U0BEAAAG', '0212000B0AAAPAP', '0212000B0AAAIAI', '0212000AABBAAAA', '0702020H0BBADAF', '0212000Y0BF', '0502030B0AAACAC', '1201', '1310020U0AAADAD', '0212000B0AAANAN', '0502', '0502030B0AA', '0702020H0', '0803043N0AAANAN', '0212000C0BBAAA[...]}
>>> cv.codes
()

In the browser this shows "None of your searches match any concepts".
The presence of a non-empty hierarchy strongly suggests the presence of a previously-deleted search term that would have had results. It didn't match the "partial delete" logic because it returned no results.

Another such example is 425193ca

>>> cv = bnf_mismatched_full_search_results[-2]
>>> cv.hierarchy.nodes
{None, '04', '0407010AD', '0407010', '0407010ADBC', '0407010ADBBAAAA', '0407010ADBB', '0407010ADAAABAB', '0407010ADBCAAAB', '0407010ADAA', '0407010ADAAAAAA', '040701', '0407'}
>>> cv.codes
()

38686baa, however, is suggestive of some other issue. There are many such like it.

>>> cv.calculate_hierarchy().nodes - cv.hierarchy.nodes
{'0401020K0AACLCL'}
>>> cv.hierarchy.nodes - cv.calculate_hierarchy().nodes
set()
>>> cv.code_objs.filter(code="0401020K0AACLCL")
<QuerySet []>
>>> [s.term or s.code for s in cv.searches.all()]
['diazepam', '040102', '0401']
>>> "0401020K0AACLCL" in [dr[1] for dr in cv.coding_system.descendant_relationships(["0401"])]
True

One would certainly expect this "missing" code (i.e. present in hierarchy but not in code_objs or codes) to have been found by either of the code searches.

This probably warrants further investigation in its own issue.

Nonetheless I am satisfied that problem 3 is indeed causative in some cases and thus warrants fixing.
The majority of BNF CodelistVersions that have non-roundtripping hierarchies and have searches do have nodes in their cached hierarchies that are not present in their code_objs which is highly suggestive of deleted search terms :

>>> Counter([any(cv.hierarchy.nodes - set(cv.code_objs.all().values_list("code",flat=True))) for cv in bnf_has_searches_mismatched])
Counter({True: 167, False: 39})

@Jongmassey
Copy link
Copy Markdown
Contributor Author

Stray thoughts regarding search deletion and a possible argument to not re-build the hierarchy after deletion (but arguably if someone deleted one search then added another it'd get rebuilt anyway).

  • does adding then deleting searches not because they were wrong but because you did so as to clear up unwanted results make the codelist unreproducable?

  • If there were new search results in a release for those searches that you might have chosen to include had you seen them when you first built it does that mean that this codelist is now incompatible with that release

  • Should we allow deletion of searches from which selections have been made?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug report: Many Cached hierarchies often do not reflect the current CodelistVersion state

1 participant