Fix Hierarchy calculation and comparison#3021
Conversation
Use same hierarchy calculation methods for cached hierarchy and that calculated during compatibility checks
CodelistVersion._calculate_new_style_hierarchy() creates a hierarchy from a codelistversion's code_objs. _create_version_with_codes() followed the logic of CodelistVersion._calculate_old_style_hierarchy() which creates a hierarchy from a codelistversion's codes. The latter includes only included codes, the former includes excluded codes that are hierarchically related to the included. Where the latter and the former are not equal then this can produce different hierarchies. By no longer passing the old-style-calculated hierarchy (which is the only style of hierarchy we can calculate before we've got code_objs, and we need it to create the code_objs) to cache_hierarchy we force it to calculate a new-style one for us before caching it. This ensures that the cached hierarchy is calculated in a manner consistent with the model's calculate_hierarchy()
Deleting a search can remove unresolved codes from a CodelistVersion's code_objs These code_objs are used to calculate the CodelistVersion's Hierarchy Therefore, if we change the code_objs, we should recalculate the cached hierarchy
1. Calculation of hierarchies in _check_version_by_hierarchy() is only ever done old-style
Old-style and new-style hierarchy calculation can produce different results. If a CodelistVersion has a new-style cached hierarchy, calculating an old-style hierarchy with a new coding system release may produce a dissimilar hierarchy where calculating a new-style one would produce an identical one. This incorrect choice of hierarchy calculation style may therefore lead to an incorrect assertion of incompatibility between a CodelistVersion and a CodingSystemRelease. |
2. Calculation of hierarchies in _create_version_with_codes() is only ever done old-style
It builds, then caches, a hierarchy from the uploaded codes in a manner identical to hierarchy = Hierarchy.from_codes(coding_system, codes)even when the resultant CodelistVersion object is considered "new style" - i.e. the uploaded csv data has been translated into code objects instead of being stored raw in As a new-style CodelistVersion is being constructed, new-style hierarchies should be constructed using these As it happens, the problem in point 1. means that for uploaded codelists, the "incorrect" behaviour of only calculating an old style codelist for hierarchy comparison purposes means this behaviour may not result in under-ascertainment of CodelistVersion-CodingSystemRelease compatibility for uploaded codelists, but if point 1 is fixed it will for the same reasons as is currently the case for other new-style codelists. |
3.Deletion of a search (using the delete_search() action) does not trigger a hierarchy cache refreshCodelists built with the builder are always "new style". New-style hierarchy calculation uses a CodelistVersion's CodeObjs. When a Search is added to a draft CodelistVersion in the builder, CodeObjs for each of its SearchResults are added to the CodelistVersion, each with an Undecided status (unless they are already present with a different status by virtue of having been also found in a previous search). Therefore, Deleting a search from a CodelistVersion in the builder will remove any Undecided CodeObjs that resulted from that search from the CodelistVersion. However, This means that CodelistVersions that have deleted searches can have cached hierarchies that do not reflect their current state. |
Finding some bugsThis investigation originated from an observation that CodelistVersion-CodingSystemRelease compatibility rates were extremely low (via @opensafely-core/team-rsi 's Codelist Quality Reports) and that creating a new version of an CodelistVersion that was apparently incompatible with the latest release of a coding system resulted in a draft with no unresolved codes - indicating that both its searches and its hierarchy were actually compatible with the latest release after all. This led to further inspection of Comparison of the hierarchy comparison and hierarchy calculation logic revealed problem 1, which is asserted to be prima facie incorrect due to the inconsistent methodologies between these two places. A core finding driving the investigation of this issue was the fact that some CodelistVersions don't have hierarchies that "round-trip" i.e. their cached hierarchy is mismatched to one calculated afresh using the coding system release with which they were created. Finding these bugs was started by comparing CodelistVersions with round-tripping hierarchies to those without. matched = []
mismatched = []
for cv in CodelistVersion.objects.all():
if cv.hierarchy == cv.calculate_hierarchy(cv.coding_system):
matched.append(cv)
else:
mismatched.append(cv)No clean time-based, coding-system-based, or creation-method-based differences were observed between these two sets. The exceptions to this were that cloned codelists inherited the round-trippiness of their source CodelistVersion, and BNF-to-dm+d codelists were completely unaffected (see #3031 ) Instead, example codelists were created by stepping through the debugger for both built and uploaded codelists and observing changes in the codelistversion state and cached hierarchy at each stage. This revealed problem 2 and 3 above. Proving that problem 2 and 3 were indeed the cause of the all the non-round-tripping cases was more challenging as only some CodelistVersions were affected. |
Proving problem 2Problem 2 regards uploaded codelists, and only affects some. This presents two challenges: robustly identifying uploaded new-style codelists, and understanding why only some are affected. Old-style uploaded CodelistVersions have a populated 'csv_data' property, new-style ones do not. The first step in building a CodelistVersion is to add some searches. Therefore any CodelistVersion without searches (and without any clone or converted characteristic strings in its Methodology statement) was considered to be most likely uploaded. It was theorised that affected uploaded CodelistVersions should have differing results for Calculating all these hierarchies was an extremely resource intensive and exercise. Filtering for just codelists supposed to be uploaded and for cases where this theorised condition didn't hold completely true given these results: >>> {k:len(v) for k,v in hierarchies_uploaded.items()}
{
'bnf_matched_old_new_hierarchy_not_equal': 12
'bnf_mismatched_old_new_hierarchy_equal': 4,
'dmd_matched_old_new_hierarchy_not_equal': 842,
'dmd_mismatched_old_new_hierarchy_equal': 5,
'icd10_matched_old_new_hierarchy_not_equal': 40,
'icd10_mismatched_old_new_hierarchy_equal': 11,
'snomedct_matched_old_new_hierarchy_not_equal': 455,
'snomedct_mismatched_old_new_hierarchy_equal': 39,
}these are just the examples that violate the proposed theory, there are many thousands that align with it Clearly either the establishment of uploaded status or my theory weren't 100% correct (in the context of the many thousands of CodelistVersions these are a small percentage, but nonetheless there's something not quite right here). BNFFocussing on BNF since it's easy to see the hierarchy within the codes themselves, the four cases where the hierarchy did not round trip but had equal old and new style hierarchies were examined. In all four cases these also only had included ("+","(+)") codeobjs, which might also suggest they were uploaded but To give the example of one such CodelistVersion with hash
For the 12 CodelistVersions with successfully round-tripping hierarchies but unequal old and new hierarchies, these were also misidentified uploaded codelists that had actually been built with deleted searches but had a few key differences to the previous case. The first difference are the codeobjs in these 12 contain both explicit and implicit exclusions, and crucially explicit exclusions further up the hierarchy to their included codes. This scenario cannot happen with uploaded codelists so these must also be built. A worked example for recreation of
Search deletion as part of a codelist development strategy is a reasonable optimisation/shortcut for users to seek to employ, and we should ensure that hierarchies support this whilst considering the implications of this strategy on reproducibility. dm+dThe high numbers of observed incongruent dmd cases feels improbable to attribute to use of the builder and deletion of searches, given the relatively recent implementation of this feature (#2704 ). Indeed, almost all of "dmd_matched_old_new_hierarchy_not_equal" date from before this PR was merged. Some of the earlier examples from this set are actually converted from BNF in a roundabout download/upload fashion before this conversion was possible in-line and thus miss the diagnostic statements in their Methodologies. Many of them also have codes unknown to the coding system, as they were uploaded before validation of such things was put in place: >>> Counter(any(set(cv.codes) - set(cv.coding_system.lookup_names(cv.codes))) for cv in hierarchies_uploaded['dmd_matched_old_new_hierarchy_not_equal'])
Counter({False: 515, True: 327})where "True" means there are codes present not found in the coding system I'm going to ignore this set as they are an historical, low-data-quality artifact. This still leaves some historic out-of-band BNF converted codelists. They were crudely excluded by mention of "BNF" in the methodology: >>> cvs = [cv for cv in hierarchies_uploaded['dmd_matched_old_new_hierarchy_not_equal'] if not any(set(cv.codes) - set(cv.coding_system.lookup_names(cv.codes))) ]
>>> cvs = [cv for cv in cvs if "bnf" not in (cv.codelist.methodology or "").lower()]
>>> len(cvs)
137This also still leaves some "old style" older uploaded codelists which need to be excluded. The dm+d coding system currently supports new style hierarchies but previously did not and all older dm+d codelists were not ported over to new style (there exists a management command for this) when #2704 was merged. >>> cvs = [cv for cv in cvs if not cv.csv_data]
>>> len(cvs)
0We have effectively shown that the 847 apparent counter-examples are all just historical data artifacts. As for the 5 in "'dmd_mismatched_old_new_hierarchy_equal'", these are plausibly caused by search deletion. Indeed they are all created after the merge of #2704 I am satisfied by this point that problem 2 is adequately proven to cause mismatched hierarchies for uploaded codelists. TODO: create a test fixture which displays the old/new hierarchy difference to support a failing test for this problem as current fixtures all produce the same hierarchy for both old and new style calculation (I think, need to double-check this). |
Proving problem 3This has already been substantially proven within the proof for problem 2 - where all searches for a built codelist are deleted. Let's try to find some examples where only some of the searches have been deleted and see if this proof still holds. >>> bnf_has_searches_codeobjs_not_in_results = [cv for cv in CodelistVersion.objects.filter(codelist__coding_system_id="bnf") if cv.searches.count() >0 and any([co for co in cv.code_objs.all() if co.results.count()==0])]
>>> len(bnf_has_searches_codeobjs_not_in_results)
78
>>> Counter([cv.hierarchy == cv.calculate_hierarchy() for cv in bnf_has_searches_codeobjs_not_in_results])
Counter({True: 75, False: 3})Those 75 which are in the candidate problem set but don't have mismatched/non-round-tripping hierarichies suggest that either this problem assessment isn't correct or the methodology to find partial search deletes is incorrect. >>> bnf_has_searches = [cv for cv in CodelistVersion.objects.filter(codelist__coding_system_id="bnf") if cv.searches.count() >0]
>>> bnf_has_searches_mismatched = []
>>> bnf_has_searches_matched = []
>>> for cv in bnf_has_searches:
... if cv.hierarchy == cv.calculate_hierarchy():
... bnf_has_searches_matched.append(cv)
... else:
... bnf_has_searches_mismatched.append(cv)
...
...
...
>>> len(bnf_has_searches_mismatched)
206
>>> len(bnf_has_searches_matched)
1094
>>> len([cv for cv in bnf_has_searches_mismatched if cv not in bnf_has_searches_codeobjs_not_in_results])
203Clearly there are many more builder-created BNF codelists with mismatched hierarchies than the number of "partial search deleted" codelists. This suggests the existence of another problem re: hierarchies for built codelists, or maybe some of these are actually "partial search deletes" but we missed them. Picking >>> cv.hierarchy.nodes
{'0212000B0AAAMAM', '0803043N0AAACAC', '0212000Y0AAAAAA', '0212000Y0BBACAD', '1310020U0BEAAAG', '0212000B0AAAPAP', '0212000B0AAAIAI', '0212000AABBAAAA', '0702020H0BBADAF', '0212000Y0BF', '0502030B0AAACAC', '1201', '1310020U0AAADAD', '0212000B0AAANAN', '0502', '0502030B0AA', '0702020H0', '0803043N0AAANAN', '0212000C0BBAAA[...]}
>>> cv.codes
()In the browser this shows "None of your searches match any concepts". Another such example is >>> cv = bnf_mismatched_full_search_results[-2]
>>> cv.hierarchy.nodes
{None, '04', '0407010AD', '0407010', '0407010ADBC', '0407010ADBBAAAA', '0407010ADBB', '0407010ADAAABAB', '0407010ADBCAAAB', '0407010ADAA', '0407010ADAAAAAA', '040701', '0407'}
>>> cv.codes
()
>>> cv.calculate_hierarchy().nodes - cv.hierarchy.nodes
{'0401020K0AACLCL'}
>>> cv.hierarchy.nodes - cv.calculate_hierarchy().nodes
set()
>>> cv.code_objs.filter(code="0401020K0AACLCL")
<QuerySet []>
>>> [s.term or s.code for s in cv.searches.all()]
['diazepam', '040102', '0401']
>>> "0401020K0AACLCL" in [dr[1] for dr in cv.coding_system.descendant_relationships(["0401"])]
TrueOne would certainly expect this "missing" code (i.e. present in hierarchy but not in code_objs or codes) to have been found by either of the code searches. This probably warrants further investigation in its own issue. Nonetheless I am satisfied that problem 3 is indeed causative in some cases and thus warrants fixing. >>> Counter([any(cv.hierarchy.nodes - set(cv.code_objs.all().values_list("code",flat=True))) for cv in bnf_has_searches_mismatched])
Counter({True: 167, False: 39}) |
|
Stray thoughts regarding search deletion and a possible argument to not re-build the hierarchy after deletion (but arguably if someone deleted one search then added another it'd get rebuilt anyway).
|
This PR covers three problems found as likely causative of #3038
_check_version_by_hierarchy()is only ever done old-style_create_version_with_codes()is only ever done old-styledelete_search()action) does not trigger a hierarchy cache refreshI will address the role of each of these in its own comment for clarity.
This PR contains fixes for these three problem, and therefore fixes #3038
TODO:
TestUpdateCodelistVersionCompatibilityWithSearchButMismatchedHierarchyto work with these fixes as it previously relied on bugs