
Aurora Pipeline Fails when NCEDC is down #159

Open
kkappler opened this issue Mar 19, 2022 · 8 comments

kkappler commented Mar 19, 2022

Parkfield tests fail on the GitHub Actions runner because the data and metadata cannot be retrieved while NCEDC is suffering an outage.

First observed on 17 Mar, 2022.

Since NCEDC is not a stakeholder at this point, we cannot expect them to be concerned about this issue.

We could:

  1. Make a copy of the data on an IRIS-hosted server so that these tests can query IRIS instead of NCEDC.
  2. Alternatively, revisit the mth5_test_data repo, tidy it up, and place test data there. Note that make_mth5 is already tested by mth5, i.e. the access/transmission of data and metadata is being tested outside aurora.
  • If revisiting mth5_test_data, the Pooch package may be of interest:
@kkappler
Collaborator Author

This happened again on April 23 at 0900 Pacific time.

The error is:

Traceback (most recent call last):
    streams = dataset_config.get_data_via_fdsn_client(data_source="NCEDC")
  File "/home/kkappler/software/irismt/aurora/aurora/sandbox/io_helpers/fdsn_dataset_config.py", line 78, in get_data_via_fdsn_client
    self.endtime,
  File "/home/kkappler/anaconda2/envs/py37/lib/python3.7/site-packages/obspy/clients/fdsn/client.py", line 830, in get_waveforms
    raise ValueError(msg)
ValueError: The current client does not have a dataselect service.

I have attached the hz data from PKD for the time interval that we use for the tests ...
ex, ey, hx, hy are already archived at IRIS.
hz_pkd.csv

@timronan Can you or Laura look at adding this hz data to the IRIS archive? Then we can set up the tests to use IRIS (or try NCEDC and catch exception use IRIS).
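The try-NCEDC-then-fall-back-to-IRIS idea could be sketched like this (a minimal sketch; `get_fdsn_client` is a hypothetical helper, not part of aurora, and `factory` stands in for `obspy.clients.fdsn.Client`):

```python
def get_fdsn_client(factory, primary="NCEDC", fallback="IRIS"):
    """Try the primary FDSN data center first; fall back on any failure.

    `factory` is expected to behave like obspy.clients.fdsn.Client:
    it takes a base_url string and raises if the service is unusable
    (e.g. "The current client does not have a dataselect service.").
    """
    try:
        return factory(primary)
    except Exception:
        # NCEDC outage or misconfigured service: retry against IRIS
        return factory(fallback)
```

In make_parkfield_mth5 one could then pass obspy's `Client` as the factory, so an NCEDC outage degrades to an IRIS query instead of failing the test suite.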

@kkappler
Collaborator Author

kkappler commented May 6, 2022

In tests/parkfield/, calling python make_parkfield_mth5.py creates the mth5 file locally, with both the data and the metadata (from NCEDC).

This file could actually be used as a source of data and metadata that we could push to IRIS, see issue 99 in mth5:
https://github.com/kujaku11/mth5/issues/99

@kkappler
Collaborator Author

kkappler commented Sep 3, 2022

Here's a new one, Sept 2, 2022:
```python
from obspy.clients.fdsn import Client

Client(base_url="NCEDC")
```

Client(base_url="NCEDC")
Traceback (most recent call last):
  File "/home/kkappler/software/pycharm-community-2019.1.1/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<string>", line 1, in <module>
  File "/home/kkappler/anaconda2/envs/py38/lib/python3.8/site-packages/obspy/clients/fdsn/client.py", line 276, in __init__
    self._discover_services()
  File "/home/kkappler/anaconda2/envs/py38/lib/python3.8/site-packages/obspy/clients/fdsn/client.py", line 1531, in _discover_services
    wadl_parser = WADLParser(wadl)
  File "/home/kkappler/anaconda2/envs/py38/lib/python3.8/site-packages/obspy/clients/fdsn/wadl_parser.py", line 28, in __init__
    doc = etree.parse(io.BytesIO(wadl_string)).getroot()
  File "src/lxml/etree.pyx", line 3536, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1893, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1800, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Space required after the Public Identifier, line 1, column 50

@kkappler
Collaborator Author

kkappler commented Sep 8, 2022

The lxml error is due to NCEDC changing their URLs; see obspy issue 3134: https://github.com/obspy/obspy/issues/3134

kkappler added a commit that referenced this issue Dec 23, 2022
This does not solve the larger issue, but it is intended to allow all
tests to pass on github when PKD and SAO are unavailable due to NCEDC
communications issues.

[Issue(s): #159]
@kkappler
Collaborator Author

Here is a new one again, Dec 2022.
Symptoms:

  • Python 3.6, 3.7 fail due to no inventory returned by NCEDC
  • Python 3.8, 3.9 fail in run_ts_obj.from_obspy_stream(streams_dict[station_id], run_metadata)
    at the end of the method, when calling self.validate_metadata()
    The message is:
    mt_metadata.base.metadata.run.add_channel - ERROR: component cannot be empty

Note the mth5.timeseries.run_ts.RunTS calls self.validate_metadata() twice. The first time through it passes, but not the second.

The first time through is in the set_dataset method of RunTS. There is a check of the condition:
self.run_metadata.id not in self.station_metadata.runs.keys()
which is False, because self.run_metadata.id = '0' and self.station_metadata.runs.keys() = ['0',], so the
self.station_metadata.runs[0].update(self.run_metadata) is ignored.

After set_data() a check is made:

if run_metadata is not None:
    self.run_metadata.update(run_metadata)

This metadata update is what triggers the failure, because after the metadata update:
self.run_metadata.id = '001'
and
self.station_metadata.runs.keys() = ['0',]
i.e. the run_metadata.id changed, but the station_metadata.runs.keys did not. Because of this inconsistency, the next time self.validate_metadata() executes, the condition
self.run_metadata.id not in self.station_metadata.runs.keys() returns True, which triggers
self.station_metadata.runs[0].update(self.run_metadata)
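That skip-then-fire sequence can be reproduced in miniature (an illustrative sketch only: a plain dict stands in for station_metadata.runs, and `validate` mimics the guard in validate_metadata, not mth5's actual API):

```python
# runs keyed by id, as in station_metadata.runs
runs = {"0": {"id": "0"}}

def validate(run_id, runs):
    """Mimic the guard: only sync station metadata when the run id is new."""
    if run_id not in runs.keys():
        runs[run_id] = {"id": run_id}  # stand-in for runs[0].update(run_metadata)
        return "updated"
    return "skipped"

assert validate("0", runs) == "skipped"    # first pass: id matches, update skipped
# run_metadata.update(...) then changes the id to '001' while runs.keys() stays ['0']
assert validate("001", runs) == "updated"  # second pass: inconsistency triggers the update path
```

It is on that second, late-triggered update path that the empty-component error surfaces.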

I followed the trail for a while; the error occurs when an auxiliary channel is encountered.
@kujaku11 do we want to force component on auxiliary channels? Also, we might need to track down why there is an aux channel here at all.

if channel_obj.component is None:
    if not isinstance(channel_obj, Auxiliary):  # adding this condition seems to fix the 3.8/3.9 issue
        msg = "component cannot be empty"
        self.logger.error(msg)
        raise ValueError(msg)

@kkappler
Collaborator Author

kkappler commented Dec 30, 2022

Regarding the second flavor of failure: this might be related to the obspy version.
Note that obspy v1.2.2 still contains python2 code.
The long-awaited python3-only obspy (v1.3) was released in 2022 and updated to v1.3.1 in October 2022; it requires python >= 3.7.
So we should probably require the same.

Only a month after v1.3.1, v1.4 was released in November 2022. This version requires python >= 3.8. The value of maintaining python 3.7 compatibility is unclear.

In any case, to fix the python 3.7 issue, one need only replace the kwarg:
data_source="NCEDC"
with
data_source='https://service.ncedc.org/'
in make_parkfield_mth5.

This argument is passed as base_url to the obspy Client.

To reproduce the error:

from obspy.clients.fdsn import Client
client = Client(base_url="NCEDC", force_redirect=True)

but replacing it with

from obspy.clients.fdsn import Client
client = Client(base_url="https://service.ncedc.org/", force_redirect=True)

works.

This is discussed in comment by alexhutko.
It has to do with hardcoded URL lookup tables, and the fact that NCEDC is only available via https, not http. This may get fixed in obspy, but if we want to support py37 we can just use the explicit URL (for now).

To fix the py38 issue, one only needs to be using obspy v1.4
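A guard for that version requirement could be sketched as follows (an assumption-laden sketch: `meets_minimum_version` is a hypothetical helper that compares only the leading numeric fields of a dotted version string such as `obspy.__version__`):

```python
def meets_minimum_version(version_string, minimum=(1, 4)):
    """Return True if a dotted version string is at least `minimum`.

    Only leading numeric fields are compared, so '1.4.0' and '1.4'
    both satisfy minimum=(1, 4); suffixes like 'rc1' are ignored.
    """
    parts = []
    for field in version_string.split("."):
        if not field.isdigit():
            break  # stop at non-numeric suffixes
        parts.append(int(field))
    return tuple(parts) >= minimum

# possible guard in test setup (names assumed):
# import obspy
# assert meets_minimum_version(obspy.__version__), "obspy >= 1.4 required"
```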

kkappler added a commit that referenced this issue Jan 2, 2023
- [x] simplify FDSNDatasetConfig so it doesn't need additional methods from xml_sandbox.
- [x] Drop methods from xml_sandbox now in FDSNDatasetConfig
- [x] rename FDSNDatasetConfig to FDSNDataset
- [x] Place client method in FDSNDataset
- [x] move describe_inventory_stages from xml_sandbox to inventory_review
- [x] tidy comments in make_mth5_helpers

Issues: [#159]
@kkappler
Collaborator Author

kkappler commented Jan 15, 2023

Now that these tests are working again, there are a couple of things that can be done to simplify the parkfield tests:

  1. We don't really need to support the creation of separate PKD, SAO, and PKDSAO h5 files.
  • Replace all references to h5 files with pkd_sao_test_00.h5.
  • A single method called ensure_data_exists() can be placed in /test_utils/parkfield/make_parkfield_mth5.py, and all the try/except logic that is replicated across several methods can be consolidated in that one spot.

@kkappler kkappler mentioned this issue Jan 15, 2023
kkappler added a commit that referenced this issue Jan 23, 2023
- replaced all pkd_test_00 with pkd_sao_test_00
- result is that there is only one h5 file built for parkfield tests

[Issue(s): #159]
kkappler added a commit that referenced this issue Jan 23, 2023
- Added ensure_h5_exists() method to make_parkfield_mth5
- Deprecated config_path from ConfigCreator()
    - removed any references to config_path
[Issue(s): #159]
@kkappler
Collaborator Author

kkappler commented Jan 23, 2023

I pushed an h5 of the combined PKD and SAO data to mth5_test_data.
in mth5_test_data/mth5/parkfield/pkd_sao_test_00.h5
It should be possible from this file to extract the metadata and the data-streams and archive these somewhere at IRIS.

When this is done, I suggest that building the PKD data, when using IRIS, be done via make_mth5, instead of the NCEDC kludge we implemented to work around their non-FDSN-compliant nomenclature.
