Commit

documentation updates
Maxim Moinat committed Aug 13, 2020
1 parent abee076 commit d950ab7
Showing 3 changed files with 26 additions and 18 deletions.
23 changes: 15 additions & 8 deletions docs/RabbitInAHat.html
@@ -413,7 +413,14 @@ <h2>Scope and purpose</h2>
</div>
<div id="process-overview" class="section level2">
<h2>Process Overview</h2>
<p>The typical sequence for using this software to generate documentation of an ETL: 1. Scanned results from WhiteRabbit completed 2. Open scanned results; interface displays source tables and CDM tables 3. Connect source tables to CDM tables where the source table provides information for that corresponding CDM table 4. For each source table to CDM table connection, further define the connection with source column to CDM column detail 5. Save Rabbit-In-a-Hat work and export to a MS Word document.</p>
<p>The typical sequence for using this software to generate documentation of an ETL:</p>
<ol style="list-style-type: decimal">
<li>Scanned results from WhiteRabbit completed</li>
<li>Open scanned results; interface displays source tables and CDM tables</li>
<li>Connect source tables to CDM tables where the source table provides information for that corresponding CDM table</li>
<li>For each source table to CDM table connection, further define the connection with source column to CDM column detail</li>
<li>Save Rabbit-In-a-Hat work and export to a MS Word document.</li>
</ol>
</div>
<div id="installation-and-support" class="section level2">
<h2>Installation and support</h2>
@@ -436,8 +443,8 @@ <h2>Open an Existing Document</h2>
<h2>Selecting Desired CDM Version</h2>
<p>Rabbit-In-a-Hat allows you to select which CDM version (v4, v5 or v6) you’d like to build your ETL specification against.</p>
<p>See the graphic below for how to select your desired CDM: <img src="http://i.imgur.com/LOqhp7H.gif" alt="Switching between CDMv4 and CDMv5" /></p>
<p>The CDM version can be changed at any time, but beware that some of your existing mappings may be lost in the process. By default, Rabbit-In-a-Hat will attempt to pereserve as many mappings between the source data and the newly selected CDM as possible. When a new CDM is selected, Rabbit-In-a-Hat will drop any mappings if the mapping’s CDM table or CDM column name no longer exist</p>
<p>For instance, switching from CDMv4 to CDMv5, a mapping from source to CDM person.person_source_value will be kept because the person table has person_source_value in both CDMv4 and CDMv5. However, person.associated_provider_id exists only in CDMv4 and has been renamed to <a href="https://github.com/OHDSI/CommonDataModel/wiki/PERSON">person.provider_id in CDMv5</a> and so that mapping will not be kept when switching between these two CDMs.</p>
<p>The CDM version can be changed at any time, but beware that you may lose some of your existing mappings in the process. By default, Rabbit-In-a-Hat will attempt to preserve as many mappings between the source data and the newly selected CDM as possible. When a new CDM is selected, Rabbit-In-a-Hat will drop any mappings <strong>without warning</strong> if the mapping’s CDM table or CDM column name no longer exists.</p>
<p>For instance, switching from CDMv4 to CDMv5, a mapping to <code>person.person_source_value</code> will be kept because the person table has <code>person_source_value</code> in both CDMv4 and CDMv5. However, <code>person.associated_provider_id</code> exists only in CDMv4 (it was renamed to <em>provider_id</em> in CDMv5) and will <strong>not</strong> be kept when switching between these two CDMs.</p>
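<p>The preservation rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Rabbit-In-a-Hat's actual code: the <code>Mapping</code> structure, the <code>preserve_mappings</code> function, the source field names and the small CDMv5 excerpt are all hypothetical.</p>

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """One source-field to CDM-field arrow (hypothetical structure)."""
    source_field: str
    cdm_table: str
    cdm_field: str


def preserve_mappings(mappings, new_cdm):
    """Split mappings into (kept, dropped) for a newly selected CDM.

    new_cdm maps each CDM table name to the set of its field names.
    A mapping is kept only if both its table and its field still exist.
    """
    kept, dropped = [], []
    for m in mappings:
        if m.cdm_field in new_cdm.get(m.cdm_table, set()):
            kept.append(m)
        else:
            dropped.append(m)
    return kept, dropped


# Illustrative excerpt of the CDMv5 person table, not the full data model.
cdm_v5 = {"person": {"person_id", "person_source_value", "provider_id"}}

# Hypothetical mappings made against CDMv4; the source field names are made up.
mappings = [
    Mapping("patient_id", "person", "person_source_value"),
    Mapping("doctor_id", "person", "associated_provider_id"),
]

kept, dropped = preserve_mappings(mappings, cdm_v5)
print("kept:   ", [m.cdm_field for m in kept])
print("dropped:", [m.cdm_field for m in dropped])
```

<p>Because mappings are dropped without warning, a check of this kind (or simply saving a copy of the specification first) may be worth doing before switching CDM versions.</p>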
</div>
<div id="loading-in-a-custom-cdm" class="section level2">
<h2>Loading in a Custom CDM</h2>
@@ -463,7 +470,7 @@ <h2>Stem table</h2>
</div>
<div id="concept-id-hints-v0.9.0" class="section level2">
<h2>Concept id hints (<em>v0.9.0</em>)</h2>
<p>A number of CDM fields have a limited number of standard concept_id(s) that can be used. Examples are: <code>gender_concept_id</code>, <code>_type_concept_id</code>‘s, <code>route_concept_id</code> and <code>visit_concept_id</code>. To help choosing the right concept_id during ETL design, Rabbit-In-a-Hat shows the list of possible concept ids of a CDM field when clicking on a target field. Note that all standard and non-standard target concepts with the right domain are shown, but the OMOP conventions only allow for standard concepts (flagged with an ’S’ in the panel).</p>
<p>A number of CDM fields have a limited number of standard concept_id(s) that can be used. Examples are: <code>gender_concept_id</code>, <code>_type_concept_id</code>‘s, <code>route_concept_id</code> and <code>visit_concept_id</code>. To help choose the right concept_id during ETL design, Rabbit-In-a-Hat shows the list of possible concept ids of a CDM field when clicking on a target field. Note that all standard and non-standard target concepts with the right domain are shown, but the OMOP conventions only allow for standard concepts (flagged with an ’S’ in the panel).</p>
<p><img src="images/riah_concept_id_hints.png" /></p>
<p>The concept id hints are stored statically in <a href="https://github.com/OHDSI/WhiteRabbit/blob/master/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/CDMConceptIDHints_v5.0_MAR-18.csv">a csv file</a> and are not automatically updated. The <a href="https://github.com/OHDSI/WhiteRabbit/blob/master/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/concept_id_hint_select.sql">code used to create the aforementioned csv file</a> is also included in the repo.</p>
</div>
@@ -483,23 +490,23 @@ <h1>Table to Table Mappings</h1>
<h1>Field to Field Mappings</h1>
<p>Double-clicking an arrow connecting a source and CDM table opens a <em>Fields</em> pane below the selected arrow. The <em>Fields</em> pane lists all the source table and CDM fields and is meant for making the specific column mappings between tables. Hovering over a source table will generate an arrowhead that can then be selected and dragged to its corresponding CDM field. For example, in the <em>drug_claims</em> to <em>drug_exposure</em> table mapping example, the source data owners know that <em>patient_id</em> is the patient identifier and corresponds to the <em>CDM.person_id</em>. Also, just as before, the arrow can be selected and <em>Logic</em> and <em>Comments</em> can be added.</p>
<p><img src="images/rabbitinahat-fields.png" /></p>
<p>If you select the source table orange box, Rabbit-In-a-Hat will expose values the source data has for that table. This is meant to help in the process in understanding the source data and what logic may be required to handle the data in the ETL. In the example below <em>ndcnum</em> is selected and raw NDC codes are displayed starting with most frequent (note that in the WhiteRabbit scan a “Min cell count” could have been selected and values below that frequency will not show).</p>
<p>If you select the source table orange box, Rabbit-In-a-Hat will expose values the source data has for that table. This is meant to help in the process in understanding the source data and what logic may be required to handle the data in the ETL. In the example below <em>ndcnum</em> is selected and raw NDC codes are displayed starting with most frequent (note that in the WhiteRabbit scan a “Min cell count” could have been selected and values smaller than that count will not show).</p>
<p><img src="images/rabbitinahat-fieldex.png" /></p>
<p>Continue this process until all source columns necessary in all mapped tables have been mapped to the corresponding CDM column. Not all columns must be mapped into a CDM column and not all CDM columns require a mapping. One source column may supply information to multiple CDM columns and one CDM column can receive information from multiple columns.</p>
</div>
<div id="generating-an-etl-document" class="section level1">
<h1>Generating an ETL Document</h1>
<p>To generate an ETL MS Word document use <em>File –&gt; Generate ETL document –&gt; Generate ETL Word document</em> and select a location to save. The ETL document can also be exported to markdown or html. In this case, a file per target table is created and you will be prompted to select a folder. Regardless of the format, the generated document will contain all mappings and notes from Rabbit-In-a-Hat.</p>
<p>Once the information is in the document, if an update is needed you must either update the information in Rabbit-In-a-Hat and regenerate the document or update the document. If you make changes in the document, Rabbit-In-a-Hat will not read those changes and update the information in the tool. However it is common to generate the document with the core mapping information and fill in more detail within the document.</p>
<p>Once the information is in the document, if an update is needed you must either update the information in Rabbit-In-a-Hat and regenerate the document or update the document. If you make changes in the document, Rabbit-In-a-Hat will not read those changes and update the information in the tool. However, it is common to generate the document with the core mapping information and fill in more detail within the document.</p>
<p>Once the document is completed, it should be shared with the individuals who plan to implement the code to execute the ETL. The markdown and html formats enable easy publishing as a web page on e.g. GitHub. A good example is the <a href="https://ohdsi.github.io/ETL-Synthea/">Synthea ETL documentation</a>.</p>
</div>
<div id="generating-a-testing-framework" class="section level1">
<h1>Generating a Testing Framework</h1>
<p>To make sure the ETL process is working as specified, it is highly recommended to create <a href="https://en.wikipedia.org/wiki/Unit_testing">unit tests</a> that evaluate the behavior of the ETL process. To efficiently create a set of unit tests Rabbit-in-a-Hat can <a href="riah_test_framework.html">generate a testing framework</a>.</p>
<p>To make sure the ETL process is working as specified, it is highly recommended creating <a href="https://en.wikipedia.org/wiki/Unit_testing">unit tests</a> that evaluate the behavior of the ETL process. To efficiently create a set of unit tests Rabbit-in-a-Hat can <a href="riah_test_framework.html">generate a testing framework</a>.</p>
</div>
<div id="generating-a-sql-skeleton-v0.9.0" class="section level1">
<h1>Generating a SQL Skeleton (<em>v0.9.0</em>)</h1>
<p>The step after documenting your ETL process is to implement it in an ETL framework of your choice. As many implementations involve SQL, Rabbit-In-a-Hat provides a convenience function to export your design to a SQL skeleton. This contains all field to field mappings, with logic/descriptions as comments, as non-functional pseudo-code. This saves you copying names into your SQL code, but still requires you to implement the actual logic. The general format of the skeleton is:</p>
<p>The step after documenting your ETL process is to implement it in an ETL framework of your choice. As many implementations involve SQL, Rabbit-In-a-Hat provides a convenience function to export your design to an SQL skeleton. This contains all field to field mappings, with logic/descriptions as comments, as non-functional pseudo-code. This saves you copying names into your SQL code, but still requires you to implement the actual logic. The general format of the skeleton is:</p>
<pre class="sql"><code>INSERT INTO &lt;target_table&gt; (
&lt;target_fields&gt;
)
1 change: 1 addition & 0 deletions docs/RabbitInAHat.md
@@ -11,6 +11,7 @@ Rabbit-In-a-Hat generates documentation for the ETL process it does not generate

## Process Overview
The typical sequence for using this software to generate documentation of an ETL:

1. Scanned results from WhiteRabbit completed
2. Open scanned results; interface displays source tables and CDM tables
3. Connect source tables to CDM tables where the source table provides information for that corresponding CDM table
20 changes: 10 additions & 10 deletions docs/WhiteRabbit.html
@@ -418,8 +418,8 @@ <h2>Process Overview</h2>
<ol style="list-style-type: decimal">
<li>Set working folder, the location on the local desktop computer where results will be exported.</li>
<li>Connect to the source database or delimited text file and test connection.</li>
<li>Select the tables of interest for the scan and scan the tables.</li>
<li>WhiteRabbit creates an export of information about the source data.</li>
<li>Select the tables to be scanned and execute the WhiteRabbit scan.</li>
<li>WhiteRabbit creates a ‘ScanReport’ with information about the source data.</li>
</ol>
<p>Once the scan report is created, this report can then be used in the Rabbit-In-A-Hat tool or as a stand-alone data profiling document.</p>
</div>
@@ -460,7 +460,7 @@ <h3>Source Data</h3>
<div id="delimited-text-files" class="section level4">
<h4>Delimited text files</h4>
<ul>
<li><strong><em>Delimiter:</em></strong> specifies the delimiter that separates columns, default is ‘,’ and you can write ‘tab for tab delimited.</li>
<li><strong><em>Delimiter:</em></strong> specifies the delimiter that separates columns. Enter <code>tab</code> for a tab delimited file.</li>
</ul>
<p>WhiteRabbit will look for the files to scan in the same folder you set up as a working directory.</p>
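<p>As a rough illustration of the delimiter setting, the sketch below reads delimited text with Python's standard <code>csv</code> module and translates the word <code>tab</code> into an actual tab character. The function names and sample data are hypothetical, and this is not how WhiteRabbit itself parses files.</p>

```python
import csv
import io


def resolve_delimiter(setting):
    """Translate the delimiter setting into a character; 'tab' means a tab."""
    return "\t" if setting == "tab" else setting


def read_rows(text, setting=","):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    reader = csv.reader(io.StringIO(text), delimiter=resolve_delimiter(setting))
    return list(reader)


# Made-up sample data: one header row and two data rows, tab delimited.
sample = "person_id\tgender\n1\tF\n2\tM\n"
rows = read_rows(sample, "tab")
print(rows)
```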
</div>
@@ -529,7 +529,7 @@ <h4>Google BigQuery</h4>
<ul>
<li><em><strong>Server location:</strong></em> name of GBQ ProjectID</li>
<li><em><strong>User name:</strong></em> OAuth service account email address</li>
<li><em><strong>Password:</strong></em> OAuth private key path (file location of private key JSON file). Must be a valid full file pathname</li>
<li><em><strong>Password:</strong></em> OAuth private key path (full path to the private key JSON file)</li>
<li><em><strong>Database name:</strong></em> data set name within ProjectID named in Server location field</li>
</ul>
</div>
@@ -553,12 +553,12 @@ <h3>Performing the Scan</h3>
<ul>
<li>Checking the “Scan field values” box tells WhiteRabbit that you would like to investigate raw data items within tables selected for a scan (i.e. if you select Table A, WhiteRabbit will review the contents in each column in Table A).
<ul>
<li>“Min cell count” is an option when scanning field values. By default this is set to 5, meaning values in the source data that appear less than 5 times will not appear in the report.</li>
<li>“Min cell count” is an option when scanning field values. By default, this is set to 5, meaning values in the source data that appear less than 5 times will not appear in the report.</li>
<li>“Rows per table” is an option when scanning field values. By default, WhiteRabbit will randomly sample 100,000 rows in the table. There are other options to review 500,000, 1 million or all rows within the table.</li>
<li>“Max distinct values” is an option when scanning field values. By default this is set to 1,000, meaning that a maximum of 1,000 distinct values per field will appear in the scan report. This option can be set to 100, 1,000 or 10,000 distinct values.</li>
<li>“Max distinct values” is an option when scanning field values. By default, this is set to 1,000, meaning a maximum of 1,000 distinct values per field will appear in the scan report. This option can be set to 100, 1,000 or 10,000 distinct values.</li>
</ul></li>
<li>Unchecking the “Scan field values” tells WhiteRabbit to not review or report on any of the raw data items.</li>
<li>Checking the “Numeric stats” box will include numeric statistics. See the section on <a href="#numeric-statistics">Numerical Statistics</a>.</li>
<li>Checking the “Numeric stats” box will include numeric statistics. See the section on <a href="#numerical-statistics">Numerical Statistics</a>.</li>
</ul>
<p>Once all settings are completed, press the ‘Scan tables’ button. After the scan is completed the report will be written to the working folder.</p>
</div>
@@ -600,14 +600,14 @@ <h3>Table Overview</h3>
<li>Column E: the number of fields in the table</li>
<li>Column F: the number of empty fields</li>
</ul>
<p>The “Description” column for both the field and table overview was added in v0.10.0. These cells are not populated by WhiteRabbit (with the exception when scanning sas7bdat files that contain labels). Rather, this field provides a way for the data holder to add descriptions to the fields and tables. These descriptions are displayed in Rabbit-In-A-Hat when loading the scanreport. This is especially usefull when the fieldnames are not descriptives or in a foreign language.</p>
<p>The “Description” column for both the field and table overview was added in v0.10.0. These cells are not populated by WhiteRabbit (with the exception when scanning sas7bdat files that contain labels). Rather, this field provides a way for the data holder to add descriptions to the fields and tables. These descriptions are displayed in Rabbit-In-A-Hat when loading the scan report. This is especially useful when the fieldnames are abbreviations or in a foreign language.</p>
</div>
<div id="value-scans" class="section level3">
<h3>Value scans</h3>
<p>If the values of the table have been scanned (described in <a href="#performing-the-scan">Performing the Scan</a>), the scan report will contain a tab for each scanned table. An example for one field is shown below.</p>
<p><img src="images/wr_scan_report_value_freq_v0.10.1.PNG" /></p>
<p>The field names from the source table will be across the columns of the Excel tab. Each source field will generate two columns in the Excel. One column will list all distinct values that have a “Min cell count” greater than what was set at time of the scan. Next to each distinct value will be a second column that contains the frequency, or the number of times that value occurs in the data. These two columns (distinct values and frequency) will repeat for all the source columns in the profiled table.</p>
<p>If a list of unique values was truncated, the last value in the list will be <code>&quot;List truncated...&quot;</code>; this indicates that there are one or more additional unique source values that appear less than the number entered in the “Min cell count”.</p>
<p>If a list of unique values was truncated, the last value in the list will be <code>&quot;List truncated...&quot;</code>; this indicates that there are one or more additional unique source values that have a frequency lower than the “Min cell count”.</p>
<p>The scan report is powerful in understanding your source data by highlighting what exists. For example, the example above was retrieved for the “GENDER” column within one of the tables scanned; we can see that there were two common values (1 &amp; 2) that appeared 104 and 96 times respectively. WhiteRabbit will not define “1” as male and “2” as female; the data holder will typically need to define source codes unique to the source system. However, these two values (1 &amp; 2) are not the only values present in the data because we see this list was truncated. These other values appear with very low frequency (defined by “Min cell count”) and often represent incorrect or highly suspicious values. When generating an ETL we should not only plan to handle the high-frequency gender concepts “1” and “2” but also the other low-frequency values that exist within this column.</p>
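<p>The way “Min cell count” and “Max distinct values” shape a value list can be approximated with the Python sketch below. It is a simplified model and not WhiteRabbit's implementation: the function name is hypothetical, row sampling is not modelled, the marker row carries no frequency here, and appending the marker when the “Max distinct values” cap is hit is an assumption.</p>

```python
from collections import Counter

TRUNCATED_MARKER = "List truncated..."


def value_frequencies(values, min_cell_count=5, max_distinct_values=1000):
    """Return (value, frequency) pairs, most frequent first.

    Values occurring fewer than min_cell_count times are left out, at most
    max_distinct_values values are listed, and a marker row is appended
    whenever at least one distinct value was omitted.
    """
    counts = Counter(values)
    # most_common() orders the distinct values by descending frequency.
    frequent = [(v, n) for v, n in counts.most_common() if n >= min_cell_count]
    shown = frequent[:max_distinct_values]
    if len(shown) < len(counts):
        shown.append((TRUNCATED_MARKER, None))
    return shown


# Made-up GENDER column: two common codes plus a few rare, suspicious values.
gender = ["1"] * 104 + ["2"] * 96 + ["M"] * 2 + ["9"]
report = value_frequencies(gender)
print(report)
```

<p>With these made-up values, the two codes that meet the default threshold of 5 are listed and the rare ones are hidden behind the truncation marker, which matches the situation described for the “GENDER” column.</p>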
<div id="numerical-statistics" class="section level4">
<h4>Numerical Statistics</h4>
@@ -638,7 +638,7 @@ <h2>Generating Fake Data</h2>
<p>The following options are available for generating fake data:</p>
<ul>
<li>“Max rows per table” sets the number of rows of each output table. By default, it is set to 10,000.</li>
<li>By checking the “Uniform Sampling” box will generate the fake data in a uniform way. The frequency of each of the values will be treated as being 1, but the value sampling will still be random. This increases the chance that each of the values in the scan report is at least once represented in the output data.</li>
<li>By checking the “Uniform Sampling” box will generate the fake data uniformly. The frequency of each of the values will be treated as being 1, but the value sampling will still be random. This increases the chance that each of the values in the scan report is at least once represented in the output data.</li>
</ul>
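<p>The difference between the default sampling and “Uniform Sampling” can be sketched as follows. This is an illustrative approximation and not WhiteRabbit's generator: the function name, its parameters and the example counts are hypothetical. By default each value is drawn in proportion to its frequency in the scan report; with uniform sampling every frequency is treated as 1.</p>

```python
import random


def sample_fake_values(value_counts, n_rows, uniform=False, seed=None):
    """Draw n_rows fake values for one field.

    value_counts maps each scanned value to its frequency. When uniform is
    True every value gets weight 1, so rare values are drawn as often as
    common ones; the draws themselves remain random.
    """
    rng = random.Random(seed)
    values = list(value_counts)
    if uniform:
        weights = [1] * len(values)
    else:
        weights = [value_counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n_rows)


# Made-up scan counts: two common codes and one rare code.
counts = {"1": 104, "2": 96, "U": 5}
weighted = sample_fake_values(counts, n_rows=10, seed=1)
uniform = sample_fake_values(counts, n_rows=10, uniform=True, seed=1)
print(weighted)
print(uniform)
```

<p>Under uniform sampling the rare code <code>U</code> has roughly a one-in-three chance per row instead of about 5 in 205, which is why it is more likely to appear at least once in the output.</p>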
</div>
</div>