This repository has been archived by the owner on Jan 25, 2023. It is now read-only.

Handling wildcards #223

Open
teemukataja opened this issue Oct 29, 2018 · 8 comments

Comments

@teemukataja
Contributor

teemukataja commented Oct 29, 2018

This is more a feature request / question.

Now that #221 has been preliminarily accepted, we have implemented it into the Beacon API. We have thus arrived at a new complication described in CSCfi/beacon-python#24.

In some cases the reference bases may have multiple alternate alleles. The specification doesn't offer a direct solution to this issue, and the quickest solution we came up with is to add the matched referenceBases and alternateBases values to the info key of the datasetAlleleResponses object.

Should we carry on with this solution, or should these fields be added into the response object, as they are quite important in all wildcard queries?

Choices

  1. Feature request: Update specification to support differentiation of wildcard responses, namely the datasetAlleleResponses should contain these two values that reflect the wildcard results (RECOMMENDED);
  2. Keep specification as it is and handle wildcard differentiation in the info key.
@mbaudis
Member

mbaudis commented Oct 29, 2018

+1 for a specific response (not overloading info).

@teemukataja
Contributor Author

teemukataja commented Oct 29, 2018

Proposed update to the specification: add two new fields (referenceBases and alternateBases) to the datasetAlleleResponses key in the BeaconDatasetAlleleResponse response object.

The response for a query is currently:

[
  {
    "beaconId": "string",
    "apiVersion": "string",
    "exists": true,
    "alleleRequest": {
      "referenceName": "1",
      "start": 0,
      "end": 0,
      "startMin": 0,
      "startMax": 0,
      "endMin": 0,
      "endMax": 0,
      "referenceBases": "string",
      "alternateBases": "string",
      "variantType": "string",
      "assemblyId": "GRCh38",
      "datasetIds": [
        "string"
      ],
      "includeDatasetResponses": "ALL"
    },
    "datasetAlleleResponses": [
      {
        "datasetId": "string",
        "exists": true,
        "error": {
          "errorCode": 0,
          "errorMessage": "string"
        },
        "frequency": 0,
        "variantCount": 0,
        "callCount": 0,
        "sampleCount": 0,
        "note": "string",
        "externalUrl": "string",
        "info": [
          {
            "key": "string",
            "value": "string"
          }
        ]
      }
    ],
    "error": {
      "errorCode": 0,
      "errorMessage": "string"
    }
  }
]

Proposed format:

[
  {
    "beaconId": "string",
    "apiVersion": "string",
    "exists": true,
    "alleleRequest": {
      "referenceName": "1",
      "start": 0,
      "end": 0,
      "startMin": 0,
      "startMax": 0,
      "endMin": 0,
      "endMax": 0,
      "referenceBases": "string",
      "alternateBases": "string",
      "variantType": "string",
      "assemblyId": "GRCh38",
      "datasetIds": [
        "string"
      ],
      "includeDatasetResponses": "ALL"
    },
    "datasetAlleleResponses": [
      {
        "datasetId": "string",
        "referenceBases": "string",
        "alternateBases": "string",
        "exists": true,
        "error": {
          "errorCode": 0,
          "errorMessage": "string"
        },
        "frequency": 0,
        "variantCount": 0,
        "callCount": 0,
        "sampleCount": 0,
        "note": "string",
        "externalUrl": "string",
        "info": {}
      }
    ],
    "error": {
      "errorCode": 0,
      "errorMessage": "string"
    }
  }
]
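
As a rough illustration of the proposed format, the sketch below builds one datasetAlleleResponses entry per matched allele, so that each entry carries its own referenceBases and alternateBases. The function name, the shape of the `matches` input, and the sample values are assumptions made for illustration; they are not part of the specification or of beacon-python.

```python
def build_dataset_allele_responses(dataset_id, matches):
    """Return one response entry per matched (referenceBases, alternateBases) pair.

    `matches` is a hypothetical list of dicts, one per variant that matched
    the wildcard query in this dataset.
    """
    responses = []
    for match in matches:
        responses.append({
            "datasetId": dataset_id,
            # The two proposed fields: the bases actually matched.
            "referenceBases": match["referenceBases"],
            "alternateBases": match["alternateBases"],
            "exists": True,
            "variantCount": match.get("variantCount", 0),
            "info": {},
        })
    return responses


# Hypothetical wildcard result: reference "A" matched two alternate alleles.
example = build_dataset_allele_responses(
    "dataset-1",
    [
        {"referenceBases": "A", "alternateBases": "C", "variantCount": 3},
        {"referenceBases": "A", "alternateBases": "T", "variantCount": 1},
    ],
)
```

With this shape a client could tell the two matches apart by reading the fields directly, without parsing info.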

@mbaudis
Member

mbaudis commented Oct 29, 2018

Parsing this I see 2 differences

  • the added "referenceBases": "string", "alternateBases": "string",
  • the removal of the placeholder key/value texts from info

So: The ref/alt values in the response then would correspond to a match each, i.e. different matches to a wildcard ... would lead to multiple datasetAlleleResponses, each w/ their own values, right? Seems sensible. But: This then has to be extended for other attributes.
Examples:

  • a (proposed) "BRK" variantType could match specified "BRK" values, but also the edges of other structural events (e.g. start and end of "DUP" or "DEL")
  • positional fuzziness would lead to different start, end values of the matched variants

There is an argument to be made to use a handoff scenario for this (we have this e.g. in Beacon+ - handoff, where one just loads all the data of the matched variants.). But this then requires a specified handover response format, too.


I don't get the change in the info field - these are just placeholders telling users to stick to some kind of "key" : "value" format for additional data.

@blankdots

Building on what @teemukataja mentioned, we made this suggestion based on our experience loading and analysing the data from the 1000 Genomes Project.
We are tackling one issue at a time, and it happens to be that the wildcard was one of the first.

We would like the user to be able to differentiate between wildcard results in the UI (we have an example here: CSCfi/beacon-python#24); however, the API specification does not provide any fields we can utilise for this purpose.

Thus we made a suggestion how this could be implemented, and we rely on the people defining the specification for the solution.

Regarding "BRK", we had not fully tackled variantTypes yet, but understanding that we might encounter such a use case is beneficial for us.


Regarding the info key, the current Beacon 1.0.0 specification recommends that we implement something like this:

"info": [
  {"key": "accessType", "value": "PUBLIC"},
  {"key": "filterAlgorithm", "value": "CUSTOM"},
  {"key": "other", "value": null}
]

However that is equivalent to:

"info": {
  "accessType": "PUBLIC",
  "filterAlgorithm": "CUSTOM",
  "other": null
}

This second option is easier to parse and work with, and we have not encountered any use cases for the first option.
We are aware that this is being tackled in issue #168 and is awaiting resolution.
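
To make the comparison concrete, here is a minimal sketch of collapsing the key/value list form into the object form. The helper name is hypothetical, and the conversion assumes each key appears at most once in the list.

```python
def info_list_to_dict(info_list):
    """Collapse [{"key": k, "value": v}, ...] into {k: v, ...}.

    Assumes every key appears at most once; a repeated key would
    overwrite the earlier value.
    """
    return {item["key"]: item["value"] for item in info_list}


info = [
    {"key": "accessType", "value": "PUBLIC"},
    {"key": "filterAlgorithm", "value": "CUSTOM"},
    {"key": "other", "value": None},
]
info_as_object = info_list_to_dict(info)
```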

@mbaudis
Member

mbaudis commented Oct 30, 2018

@blankdots @teemukataja As said, I think the concept of returning the different matched alleles looks good to me, but needs some work/discussion about the implementation:

  • There is a difference between the returned variant and the query representation - so this is better solved with a proper variant model instead of using the query attributes w/ the variant values. We have started to have a variant schema for this kind of representation, which could be a good starting place (we use this behind the Beacon+ test implementation).
  • Some of the nesting of response objects has to be figured out (e.g. getting responses from multiple datasets, and then multiple alleles in datasetAlleleResponses - does one just have each different variant/allele represented there w/ the originating dataset as a value, or do we represent each dataset's responses w/ all embedded alleles?).
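
The two nesting options in the second point can be sketched side by side. Everything here is illustrative: the sample matches, the function names, and in particular the `alleles` key are assumptions, not part of any agreed schema.

```python
from collections import defaultdict

# Hypothetical matches for one wildcard query across two datasets.
matches = [
    {"datasetId": "ds1", "referenceBases": "A", "alternateBases": "C"},
    {"datasetId": "ds1", "referenceBases": "A", "alternateBases": "T"},
    {"datasetId": "ds2", "referenceBases": "A", "alternateBases": "C"},
]


def flat_per_allele(matches):
    """Option A: one entry per matched allele, with the dataset as a value."""
    return [dict(m, exists=True) for m in matches]


def nested_per_dataset(matches):
    """Option B: one entry per dataset, with its matched alleles embedded."""
    by_dataset = defaultdict(list)
    for m in matches:
        by_dataset[m["datasetId"]].append({
            "referenceBases": m["referenceBases"],
            "alternateBases": m["alternateBases"],
        })
    return [
        {"datasetId": ds, "exists": True, "alleles": alleles}
        for ds, alleles in by_dataset.items()
    ]


flat = flat_per_allele(matches)
nested = nested_per_dataset(matches)
```

Option A keeps the current list shape but repeats datasetId; option B keeps one entry per dataset but needs a new embedded list.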

I'm highly in favour of doing some rapid development here - so suggestions, discussions, PRs welcome (IMO)!

This was referenced Nov 1, 2018
teemukataja added a commit that referenced this issue Nov 1, 2018
@mbaudis
Member

mbaudis commented Nov 6, 2018

@blankdots Btw., those (list vs. object) are not equivalent since only a list allows repeated keys (making it harder to parse, but better as a wrapper).
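
A small sketch of the point about repeated keys: converting a key/value list with a duplicated key into a plain object keeps only the last value, while grouping the values into lists preserves them. The sample data and the helper name are assumptions for illustration.

```python
from collections import defaultdict


def group_info_values(info_list):
    """Group values by key so that repeated keys are preserved as lists."""
    grouped = defaultdict(list)
    for item in info_list:
        grouped[item["key"]].append(item["value"])
    return dict(grouped)


# Hypothetical info list in which the same key occurs twice.
repeated = [
    {"key": "alternateBases", "value": "C"},
    {"key": "alternateBases", "value": "T"},
]

# A naive conversion to an object silently drops the first value.
as_object = {item["key"]: item["value"] for item in repeated}

# Grouping keeps both values.
as_grouped = group_info_values(repeated)
```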

@blankdots

blankdots commented Nov 12, 2018

@mbaudis I probably should have mentioned that I was focusing on content (and how it could be used), not structure.

@mbaudis
Member

mbaudis commented Nov 16, 2018

I have written up a page about proposed range matches and wildcards, which also demonstrates a handover [H->O] variant object, which in turn can be analysed for its variant flavours.

This should be considered a (working) prototype, which may look different in the implementation brought forward by the dev team (@sdelatorrep?).


3 participants