This is an example of a simple data structure and data dictionary that worked for a WCWH lab. It is a table in a spreadsheet that becomes a database table when uploaded. The lab used a single table with participant ID as the first column, and multiple additional columns to accommodate measured variables.
participant_id | sample_date | string_metric | integer_metric | column_name_E |
---|---|---|---|---|
P0001T | 5/1/23 | test-string | 42 | null |
P0002T | 5/2/23 | test-string | 42 | null |
P0004T | 5/2/23 | test-string | 42 | null |
P0006T | 5/2/23 | test-string | 42 | null |
P0007T | 5/1/23 | test-string | 42 | null |
P0008T | 5/2/23 | test-string | 41 | null |
P0009T | 5/3/23 | test-string | 42 | null |
P0011T | 5/3/23 | test-string | 42 | null |
P0014T | 5/1/23 | test-string | 42 | null |
P0015T | 5/3/23 | test-string | 42 | null |
The table above requires a data dictionary to define what the table itself is for, the meaning of each column heading:
Table Name | Table Definition | Field | Field Definition |
---|---|---|---|
sample table name | explanation of your table's data | participant_id | participant identifier |
-- | -- | sample_date | date of sample collection |
-- | -- | string_metric | explanation of the text in this field |
-- | -- | integer_metric | answer to the ultimate question of life |
-- | -- | column_name_E | some other description for column E |
-
All uploaded data should be deidentified
-
Your first column should be participant ID. This is a unique identifier for each participant.
-
Use the csv file format
-
Use an underscore instead of a space in column headings. For example, “column_name” instead of “column name”
-
Column names should be descriptive and include units wherever possible
-
If gender is included as a variable and is numerically coded, then use 0 for female and 1 for male to maintain consistency with other database values.
-
Create a data dictionary explaining the meaning of each column so that other researchers understand the data.
-
Include only minimal demographic info. Adult and child demographic information is available in a separate database table.
-
Formatting (highlights, bold or italicized fonts, conditional formatting) won't be visible to others in the database.
-
Do not use merged cells. The top row (row 1) should have only the column headings of each column and no additional information. Participant ID should be the first column, and any number of additional columns to accommodate other variables.
-
Use “null” or a similar placeholder for missing values in spreadsheet rather than leaving a value blank. When downloaded from the database later, blank cells in a spreadsheet can be interpreted to be zero by some software.
-
Standardize values of none/NA/missing to a single value. Use NULL for all missing values if you don’t already have a preference. It may make sense for your group to use a combination of null and a separate identifier to make a distinction between data that is missing vs data not currently available.
-
Rename duplicate column headings in your spreadsheet (e.g. visit_one and visit_two). It may be appropriate for your group to have multiple tables (e.g. visit_one_table and vistit_two_table) rather than many near duplicate column headings in a single table.