This repository was archived by the owner on Oct 20, 2020. It is now read-only.

For running and preparing input to the ALA Taxonomy builder and nameindexer

bioatlas/taxonomy-merging

taxonomy-merging

Scripts for running and preparing input to the ALA Taxonomy builder and nameindexer. These tools may be used to merge the GBIF taxonomy backbone with the Genome Taxonomy DataBase (GTDB) taxonomy, and with a list of Swedish Amplicon Sequence Variants (ASVs), to improve taxonomic coverage of prokaryotes in the BioAtlas.

Overview of files

get-bb-subset.sh, prep-gtdb-for-r.sh
make subsets of the GBIF backbone / GTDB taxonomy -> manageable size & structure for further editing in R

prep-xxx-for-merge.R
edits data, and adds parent taxa -> file suitable for ALA Taxonomy builder

merge-taxonomy.sh
applies ALA Taxonomy builder, and outputs zip file to be transferred to the BioAtlas

Taxonomy implementation in the BioAtlas

The steps below describe how to implement the merged taxonomy in a test version of the BioAtlas. They assume that scp and ssh connection options (host, user, RSA key file) are specified in a .ssh/config file, so that a host alias (here: cloud) can be used. I use the dyntaxa label and dyntaxa-index directory for convenience only.
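
For orientation, a minimal sketch of what such a .ssh/config entry could look like. Only the alias cloud comes from the steps below; the host name, user and key path are placeholders (assumptions) to be replaced with your own values.

```text
# ~/.ssh/config
# HostName, User and IdentityFile values are hypothetical placeholders.
Host cloud
    HostName server.example.org
    User myuser
    IdentityFile ~/.ssh/id_rsa
```

With an entry like this in place, `ssh cloud` and `scp <file> cloud:<path>` pick up the host, user and key automatically.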

  1. Copy taxonomy-dwca (and if needed, the java tool nameindexer.zip) to server
scp ~/data/lucene/runs/xxx/dyntaxa.dwca.zip cloud:repos/ala-docker/dyntaxa-index
  1. Login to cloud server, and navigate to ala-docker dir
ssh cloud
cd repos/ala-docker
  1. Build nameindex image [name:tag - edit as needed], using the nameindexer tool in the dyntaxa-index directory.
docker build --no-cache -t bioatlas/ala-dyntaxaindex:xxx dyntaxa-index
  1. Test search index (by searching for taxon zzz)
docker run --rm -it bioatlas/ala-dyntaxaindex:xxx nameindexer -testSearch zzz

     Should output something similar to:

...
Classification: "null",Bacteria,Acidobacteriota,Acidobacteriae,Acidobacteriales,Acidobacteriaceae,Edaphobacter
Scientific name: zzz
...
Match type: exactMatch
  1. Setup nameindex service to start from newly created index image
nano docker-compose.yml

     Use Ctrl+w to search for e.g. 'dynt'. Comment out current image and add new image, like so:

nameindex:
#image: bioatlas/ala-nameindex:v0.4
#image: bioatlas/ala-dyntaxaindex:v0.4
image: bioatlas/ala-dyntaxaindex:xxx
command: /bin/ash
container_name: nameindex
...

     Use Ctrl+x to exit, and confirm with y to save

  1. Clean-up data volumes (will remove indices, and all data from ingested datasets)
docker-compose stop nameindex biocachebackend biocacheservice specieslists
docker rm -vf solr cassandradb nameindex biocachebackend biocacheservice specieslists
docker volume rm ala-docker_data_solr ala-docker_db_data_cassandra ala-docker_data_nameindex
  1. Restart services (will create new nameindex service, as configured in docker-compose.yml)
docker-compose up -d
docker-compose restart webserver
  1. Add a new data resource (at least add a name), and upload your occurrence dwca.zip in Collectory

  2. Map records against nameindex (will update the Solr index for Occurrence search [core: biocache])

docker-compose run --rm biocachebackend biocache

# List available data resources
biocache> list

# Fetch DwCA from collectory and write to cassandra database
biocache> load drX

# Match records against nameindex, and update in cassandra
biocache> process -dr drX

# Write occurrence records from cassandra to SOLR index -> generates the ALA-hub Occurrence search index
biocache> index -dr drX

# Quit
exit
  1. Restart services
docker-compose restart biocacheservice biocachehub
  1. Prepare files for creating Solr index for Taxonomic search [cores: bie-offline / bie]
# Move old taxonomy dwca to backup folder
docker exec -it bieindex bash
mv /data/bie/import/dwc-a /data/bie/import/dwca-maria-XXXXXX 
exit

# Unzip new taxonomy dwca
cd dyntaxa-index/
mkdir dwc-a
unzip dyntaxa.dwca.zip -d dwc-a

# Copy into running bieindex container
docker cp dwc-a bieindex:/data/bie/import/
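
Before copying, it can be worth checking that the archive unpacked the way the import expects. The sketch below assumes the archive has a meta.xml descriptor at its top level, as Darwin Core Archives normally do; the function name check_dwca_dir is made up for this example.

```shell
#!/bin/sh
# Return 0 if the directory contains a meta.xml descriptor at its top level.
# Otherwise print a hint to stderr and return 1.
# Assumption: the taxonomy DwCA ships meta.xml at the archive root.
check_dwca_dir() {
    dir="$1"
    if [ -f "$dir/meta.xml" ]; then
        echo "ok: $dir/meta.xml found"
    else
        echo "missing: $dir/meta.xml (was the zip unpacked into $dir?)" >&2
        return 1
    fi
}

# Usage, from the dyntaxa-index directory:
# check_dwca_dir dwc-a && docker cp dwc-a bieindex:/data/bie/import/
```

If the check fails, the zip may contain an extra top-level folder, in which case the files would end up one level too deep for the import.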

Some background: a Solr core is a running instance of a Lucene index, and is needed to perform indexing. The taxonomic index in BAS has two alternative cores with the same schema (structure): bie and bie-offline. Swapping cores means swapping the file pointers (including file names) between the two. This makes it possible to run the long, resource-intensive taxonomy index generation offline (producing bie-offline) without blocking the search functionality, and then swap the result with bie.

  1. Import taxonomy to bie-offline index
    Go to Admin | BIE Web services
    Click DwCA Import - Import taxon data in Darwin Core Archive form
    Check Clear existing taxonomic data, to clear up old stuff if any
    Click /data/bie/import/dwc-a Import DwCA

  2. Swap cores
    In SOLR admin, click Core Admin | Swap
    Make sure it reads "this: bie" and "and: bie-offline"
    Click Swap Cores

  3. If needed, restart services

make up
docker-compose restart webserver
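
As an alternative to clicking through Core Admin, the core swap described above can be requested through Solr's CoreAdmin API (action=SWAP with the core and other parameters). The sketch below only builds and prints the request URL. The base URL is an assumption, since Solr may not be reachable on localhost:8983 in your setup, and solr_swap_url is a helper name made up for this example.

```shell
#!/bin/sh
# Build the CoreAdmin SWAP URL for two cores.
# $1 = Solr base URL, $2 = first core, $3 = second core.
solr_swap_url() {
    printf '%s/admin/cores?action=SWAP&core=%s&other=%s\n' "$1" "$2" "$3"
}

# SOLR_URL is an assumption; adjust host and port to your deployment.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print the URL for review. Only send it once the DwCA import has finished:
# curl -s "$(solr_swap_url "$SOLR_URL" bie bie-offline)"
solr_swap_url "$SOLR_URL" bie bie-offline
```

Printing the URL first keeps the swap a deliberate action; uncomment the curl line to send the request.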

References

Parks, D. H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology, 36: 996-1004.
