Data processing
In the manuscript, we have used data from the SMArtCARE registry on patients with SMA. As the data cannot be made publicly available, we provide functions for simulating data with a similar structure.
SMArtCARE data
LatentDynamics.SMATestData — Type
mutable struct SMATestData
Struct to serve as a container for the SMArtCARE data, consisting of the following fields:
test: name of the motor function test for which the data is collected
xs: vector of matrices (nitems x ntimepoints) of the item scores across time of the chosen test for each patient
xs_baseline: vector of vectors of baseline variable measurements for each patient
tvals: vector of vectors of follow-up time points for each patient
ids: vector of patient IDs
LatentDynamics.get_SMArtCARE_data — Function
get_SMArtCARE_data(test::String, baseline_df, timedepend_df; extended_output::Bool=false)
Function to preprocess the SMArtCARE data for a specific test. The function returns an SMATestData struct with the extracted information on time-dependent and baseline variables, follow-up time points and IDs of all patients for whom the chosen test was conducted.
From the provided input dataframes, the function first filters the time-dependent dataframe for patients for whom the selected test was conducted. The dataframe is then subset to the item variables of that test, and the baseline dataframe is subset to the same patients. For each patient, outlier time points are filtered out. A time point is classified as an outlier if its difference to the previous time point is larger than 2 times the interquartile range of all differences between subsequent time points for that patient. Additionally, the variance of the sum score of the test is calculated for each patient, to allow for further filtering downstream.
Arguments
test: name of the motor function test for which the data is collected
baseline_df: DataFrame containing the baseline variables for all patients
timedepend_df: DataFrame containing the time-dependent variables for all patients
extended_output: if true, the function also returns the calculated variances of the sum score for each patient and the time point masks that show which time points were filtered out for each patient.
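The outlier rule described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name outlier_mask is hypothetical, and the sketch assumes that the rule is applied to the sum score and that differences are compared by absolute value.

```julia
using Statistics

# Hypothetical helper, not part of LatentDynamics.
# Returns a mask with `false` at every time point whose change from the
# previous time point exceeds `factor` times the IQR of all successive changes.
function outlier_mask(sumscores::AbstractVector{<:Real}; factor::Real=2)
    keep = trues(length(sumscores))
    length(sumscores) < 3 && return keep      # too few differences for an IQR
    diffs = diff(sumscores)                   # changes between subsequent time points
    iqr = quantile(diffs, 0.75) - quantile(diffs, 0.25)
    for i in eachindex(diffs)
        if abs(diffs[i]) > factor * iqr       # assumption: absolute differences
            keep[i + 1] = false               # flag the later of the two time points
        end
    end
    return keep
end

sumscores = [10, 12, 15, 16, 40]
mask = outlier_mask(sumscores)
filtered = sumscores[mask]
```

In this toy series only the final jump from 16 to 40 exceeds twice the interquartile range of the successive differences, so only the last time point is dropped.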
LatentDynamics.recode_SMArtCARE_data — Function
recode_SMArtCARE_data(testdata::SMATestData)
Recodes the time-dependent item values in an SMATestData struct to be between 0 and 1. Original item levels are integers between 0 and 2 for all items except item a, which has values between 0 and 6. Each item is separately mapped to numbers between 0 and 1 and the values are subsequently logit-transformed. A new SMATestData struct is returned, where the recoded values are stored in the xs field.
Arguments
testdata::SMATestData: the test data to be recoded
recoding_dict: Dictionary specifying the numbers item levels should be recoded to for all items except a; default is Dict(0 => 0.1, 1 => 0.5, 2 => 0.9) and this is what has been used for all experiments.
recoding_dict_itema: Dictionary specifying the numbers item levels should be recoded to for item a; default is Dict(0 => 0.1, 1 => 0.2, 2 => 0.3, 3 => 0.5, 4 => 0.7, 5 => 0.8, 6 => 0.9) and this is what has been used for all experiments.
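To illustrate the recoding step, here is a minimal sketch that applies the two default dictionaries to a single item score matrix and then takes the logit. The helper name recode_matrix and the itema_row keyword are hypothetical, and the sketch assumes that item a is stored in the first row; the package function operates on a whole SMATestData struct instead.

```julia
logit(p) = log(p / (1 - p))

# Hypothetical helper, not part of LatentDynamics.
# Recodes one patient's (nitems x ntimepoints) matrix of integer item levels.
function recode_matrix(x::AbstractMatrix{<:Integer};
                       itema_row::Int=1,   # assumption: item a is the first row
                       recoding_dict=Dict(0 => 0.1, 1 => 0.5, 2 => 0.9),
                       recoding_dict_itema=Dict(0 => 0.1, 1 => 0.2, 2 => 0.3, 3 => 0.5,
                                                4 => 0.7, 5 => 0.8, 6 => 0.9))
    out = Matrix{Float64}(undef, size(x))
    for i in axes(x, 1), j in axes(x, 2)
        dict = i == itema_row ? recoding_dict_itema : recoding_dict
        out[i, j] = logit(dict[x[i, j]])   # map to (0, 1), then logit-transform
    end
    return out
end

x = [0 3 6;    # item a, levels 0 to 6
     0 1 2]    # another item, levels 0 to 2
z = recode_matrix(x)
```

Because the lowest and highest levels are mapped to 0.1 and 0.9 rather than 0 and 1, the logit stays finite, and the middle level (0.5) is mapped to 0.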
Simulated data
LatentDynamics.simdata — Type
mutable struct simdata
Struct to serve as a container for simulated data, consisting of the following fields:
xs: vector of matrices (nvariables x ntimepoints) of simulated values of the different variables across time for each patient
xs_baseline: vector of vectors of baseline variable measurements for each patient
tvals: vector of vectors of follow-up time points for each patient
group1: vector of indices of patients belonging to group 1
group2: vector of indices of patients belonging to group 2
LatentDynamics.generate_xs — Function
generate_xs(n, p, true_u0, sol_group1, sol_group2;
    t_start=1.5, t_end=10, maxntps=10, dt=0.1, σ_var=0.1, σ_ind=0.5)
Generates simulated data for n individuals with p variables each, observed at between 1 and maxntps time points. For each individual, one of the two true underlying ODE solutions, sol_group1 or sol_group2, is selected at random. Its values are taken at a randomly sampled number (between 1 and maxntps) of randomly sampled time points, and variable-specific and individual-specific errors are added to the values of the true trajectory. The variance of the error terms is controlled by σ_var and σ_ind.
Arguments:
n: number of individuals to simulate
p: number of time-dependent variables to simulate. It should be divisible by the number of dimensions of the true underlying trajectory, so that the first (p/ntruedimensions) variables can be noisy versions of the first dimension of the true dynamics, and so on.
true_u0: vector stating the initial condition of the ground-truth underlying ODE systems from which to simulate the data
sol_group1: true ODE solution of the first group
sol_group2: true ODE solution of the second group
Optional keyword arguments:
t_start: earliest time point possible for follow-up measurements, start of the interval from which to sample the subsequent measurement time point(s). Default = 1.5
t_end: latest time point possible for follow-up measurements, end of the interval from which to sample the subsequent measurement time point(s). Default = 10
maxntps: maximum number of time points per individual after the baseline time point. Default = 10
dt: time steps at which to solve the ODE. Needed to ensure correct array sizes. Default = 0.1
σ_var: variance with which to sample the variable-specific error terms. Default = 0.1
σ_ind: variance with which to sample the individual-specific error terms. Default = 0.5
Returns:
xs: vector of length n = nindividuals, where the ith element is a (nvars=p x n_timepoints) matrix containing the simulated values of the time-dependent variables of the ith individual in the dataset
tvals: vector of length n = nindividuals, where the ith element is a vector of length 1 (or more generally ntimepoints_i) containing the simulated time point of the ith individual's second measurement (or all the time points after the baseline visit)
group1: indices of all individuals in group1
group2: indices of all individuals in group2
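The sampling scheme for a single individual can be sketched as below. This is only an illustration under several assumptions: the closed-form function true_traj is a hypothetical stand-in for a true ODE solution, the helper simulate_individual is not part of the package, σ_var and σ_ind are used as standard deviations of Gaussian errors, and the baseline time point and the group assignment are left out.

```julia
using Random

# Hypothetical stand-in for a true two-dimensional ODE solution.
true_traj(t) = [exp(-0.2 * t), 1 - exp(-0.2 * t)]

# Hypothetical helper, not part of LatentDynamics.
function simulate_individual(rng, p; t_start=1.5, t_end=10, maxntps=10, dt=0.1,
                             σ_var=0.1, σ_ind=0.5)
    grid = collect(t_start:dt:t_end)              # admissible follow-up time points
    ntps = rand(rng, 1:maxntps)                   # number of follow-up visits
    tvals = sort(shuffle(rng, grid)[1:ntps])      # distinct, ordered time points
    ndims = length(true_traj(t_start))
    reps = p ÷ ndims                              # p should be divisible by ndims
    ind_err = σ_ind * randn(rng)                  # assumption: one offset per individual
    var_err = σ_var .* randn(rng, p)              # assumption: one offset per variable
    x = Matrix{Float64}(undef, p, ntps)
    for (j, t) in enumerate(tvals)
        # the first `reps` variables follow dimension 1, the next `reps` dimension 2, ...
        x[:, j] = repeat(true_traj(t), inner=reps) .+ var_err .+ ind_err
    end
    return x, tvals
end

rng = MersenneTwister(1)
x, tvals = simulate_individual(rng, 4)
```

The returned matrix has one row per variable and one column per sampled time point, matching the (nvars=p x n_timepoints) layout of the elements of xs.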
LatentDynamics.generate_baseline — Method
generate_baseline(n, q, q_info, group1; σ_info=1, σ_noise=1)
Generates simulated baseline data by sampling n observations of q baseline variables, of which only the first q_info are informative and the remaining ones are pure noise variables, based on the group membership information. This information is given by group1, the indices of all individuals in group1, from which the indices in group 2 can be inferred, since union(group1, group2) = {1,...,n}. Baseline measurements are simulated by encoding group membership as 1 or -1 and drawing from N(0, σ_info) or N(1, σ_info), respectively. For the noise variables, data are simulated by drawing from N(0, σ_noise).
Arguments:
n: number of individuals to simulate
q: number of baseline variables to simulate
q_info: number of informative baseline variables
group1: indices of all individuals in group1. Since [group1, group2] = {1,...,n}, the group2 indices can be inferred from that.
Optional keyword arguments:
σ_info: variance with which to sample from the group membership information in the informative baseline variables. Default = 1
σ_noise: variance with which to sample the noise baseline variables. Default = 1
Returns:
x_params: vector of length n = nindividuals, where the ith element is a vector of length (nbaselinevars=q) containing the baseline information for the ith individual in the dataset
LatentDynamics.generate_baseline — Method
generate_baseline(n, q, q_info, group1, true_odeparams_group1, true_odeparams_group2; σ_info=0.1, σ_noise=0.1)
Generates simulated baseline data by sampling n observations of q baseline variables, of which only the first q_info are informative and the remaining ones are pure noise variables, based on the true ODE parameters passed as true_odeparams_group1 and true_odeparams_group2. Baseline measurements are simulated by sampling from the true parameters with a standard deviation of σ_info. For the noise variables, data are simulated by drawing from N(0, σ_noise).
Arguments:
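n: number of individuals to simulate
q: number of baseline variables to simulate
q_info: number of informative baseline variables
group1: indices of all individuals in group1. Since [group1, group2] = {1,...,n}, the group2 indices can be inferred from that.
true_odeparams_group1: true ODE parameters of the first group
true_odeparams_group2: true ODE parameters of the second group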
Optional keyword arguments:
σ_info: variance with which to sample from the group membership information in the informative baseline variables. Default = 0.1
σ_noise: variance with which to sample the noise baseline variables. Default = 0.1
Returns:
x_params: vector of length n = nindividuals, where the ith element is a vector of length (nbaselinevars=q) containing the baseline information for the ith individual in the dataset
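As a rough illustration of the parameter-based variant, the sketch below draws the informative baseline variables around the true ODE parameters of each individual's group and fills the remaining variables with pure noise. The helper name sketch_baseline is hypothetical, and the sketch assumes that there is exactly one informative variable per ODE parameter and that σ_info and σ_noise act as standard deviations; the package function may map parameters to variables differently.

```julia
using Random

# Hypothetical helper, not part of LatentDynamics.
function sketch_baseline(rng, n, q, group1, params_group1, params_group2;
                         σ_info=0.1, σ_noise=0.1)
    q_info = length(params_group1)     # assumption: one informative variable per parameter
    in_group1 = falses(n)
    in_group1[group1] .= true          # everyone not in group1 belongs to group2
    return map(1:n) do i
        params = in_group1[i] ? params_group1 : params_group2
        informative = params .+ σ_info .* randn(rng, q_info)
        noise = σ_noise .* randn(rng, q - q_info)
        vcat(informative, noise)       # informative variables first, then noise
    end
end

rng = MersenneTwister(42)
n, q = 6, 5
group1 = [1, 3, 5]
group2 = setdiff(1:n, group1)          # inferred from group1, as described above
x_baseline = sketch_baseline(rng, n, q, group1, [0.5, -0.5], [-1.0, 1.0])
```

Each element of x_baseline is a vector of length q, so the result has the same shape as the x_params return value described above.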