Identification of multivariate phenotypes most influenced by mutation: Drosophila serrata wings as a case study 
Cara Conradsen and Katrina McGuigan
Published in Evolution: 01 August 2025
This repository contains all the code and data used in the manuscript.
The history of the experiment can be found in Causes of variability in estimates of mutational variance from mutation accumulation experiments in Genetics. 
This SAS script analyses the Drosophila wing trait data from our mutation reduction experiment where the 12 populations were analysed separately. The outcome is the multivariate (MVN) unstructured (UN) mutation variance - covariance matrix (M) for six wing traits. Here, populations are labelled first by treatment and then sampled generation (e.g., treatment 1_1 small treatment, generation 1). The steps include:
- Sorting the Data
The dataset
mrwings
is sorted byTrait
andGen
. - Standardising Scores
The
Score
variable is standardised (mean = 0, std = 1) within eachTrait
group and saved asstdmrwings
. - Linear Mixed Model using the mixed procedure (
PROC MIXED
) A mixed-effects model is fit using REML, applied to treatments seperately, where multivaraite outliers are excluded (e.g.,Treat=1
,Gen=1
, excludingmultiout=1
):
- _Fixed effect:
Trait
- _Random effects:
Trait
withinLine
,Trait
withinVial
, nested inTreat
andLine
- _Repeated measures:
Animal
nested inTreat
,Line
, andVial
- Output Tables
The following outputs are saved for downstream analysis (within the folder
un_sas_output
):
- _
Covparms1_1
: Covariance parameters - _
FitStat1_1
: Fit statistics - _
Converge1_1
: Convergence diagnostics - _
Asycov1_1
: Asymptotic covariance matrix
There are four parts to randomise_wing_data.R
. In the first part, Part 1. Prepare the data set to be shuffled, we create a unique ID for a line. This section also separates the experimental component of Generation + Treatment + Line (population_info
) and the individual fly trait information (with a newly assigned unique line ID, line_trait_info
) into two data frames.
In the second section (Part 2. Generate a data frame of sets of randomised unique line IDs), here are the steps where we randomly sample the list of unique line IDs, and then generate a data frame of the sampled line sets (all_sets
, N = 1000), each column is a single set of sampled lines and each row, within a column, is a randomly sampled line. This is where you can set the number of data sets needed and whether you want to sample with or without replacement.
In the third part (Part 3. Write a function to merge randomised line data to trait info and save to .csv), we create the function create_randomise_wing_data
that uses the data frames generated in parts 1 and 2. It takes a single column from all_sets
(one randomised sample of lines), and does several things:
- adds the randomised unique line IDs by pasting the
all_sets
column onto the data frame of the experimental component,population_info
- The unique lines are now randomised across the 12 experimental populations of treatment X generation and 42 experimental lines.
- then populates with the flies and their wing information by appending the assigned unique line ID in
line_trait_info
to the randomised unique line ID
- This keeps replicate vials and individual flies intact within line, and at this step, this is our complete randomised dataset
- we retain the column with the experimental design lines, numbered from 1 to 42, for SAS and remove the column with the unique line IDs, numbered from 1 to 493 (see Note below)
- calculates the Mahalanobis Distance and determines multivariate outliers for this new data set to use in SAS
- adds a column of the treatments as numbers for SAS, using the original notation: small is 1 and big is 3 (SB crosses were 2)
- converts the new data set into long format for SAS
- re-orders the data columns
- sorts the data by Gen, Treatment, Line, Vial, Individual
- saves the dataset as a .csv in the 'generated_datasets' subfolder
Finally, in the last section (Part 4. Implement function on sets of randomised lines) we implement the function created above on the data frame all_sets
. In this section, to speed up the time to generate all 1000 data sets, the function is run in parallel using the foreach
function (%dopar%).
Note
Our dataset had fewer 'unique' lines than the full experimental design (i.e., 6 gens X 2 treat X 42 lines = 504, Observed data = 493). Our code is written to keep the same number of 'unique' lines (N = 493), where shuffled genetic data is extended onto this. If sample with replacement equals:
- False; same number of individuals (N = 5135)
- True; the number of total individuals will vary