Data processing

In the manuscript, we have used data from the SMArtCARE registry on patients with SMA. As the data cannot be made publicly available, we provide functions for simulating data with a similar structure.

SMArtCARE data

LatentDynamics.SMATestData — Type

mutable struct SMATestData

Struct to serve as a container for the SMArtCARE data, consisting of the following fields:

test: name of the motor function test for which the data is collected
xs: vector of matrices (nitems x ntimepoints) of the item scores across time of the chosen test for each patient
xs_baseline: vector of vectors of baseline variable measurements for each patient
tvals: vector of vectors of follow-up time points for each patient
ids: vector of patient IDs
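
To make the layout of these fields concrete, the following is a minimal sketch using a hypothetical stand-in struct, ToyTestData, which is not part of LatentDynamics. The field types and the example values are assumptions; the sketch further assumes that each matrix in xs has one column per entry of the corresponding tvals vector.

```julia
# Hypothetical stand-in mirroring the documented fields; all field types are assumptions.
mutable struct ToyTestData
    test::String
    xs::Vector{Matrix{Float64}}
    xs_baseline::Vector{Vector{Float64}}
    tvals::Vector{Vector{Float64}}
    ids::Vector{Int}
end

toy = ToyTestData(
    "toy_test",
    [Float64[0 1 2 2; 1 1 2 2; 0 0 1 1],   # patient 1: 3 items x 4 time points
     Float64[2 2; 1 0; 0 0]],              # patient 2: 3 items x 2 time points
    [[1.0, 0.0], [0.0, 1.0]],              # baseline variables per patient
    [[0.0, 0.5, 1.0, 1.5], [0.0, 0.8]],    # time points per patient
    [101, 102],                            # patient IDs
)

# Check that every patient has one column of item scores per time point.
consistent = all(size(toy.xs[i], 2) == length(toy.tvals[i]) for i in eachindex(toy.ids))
```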

LatentDynamics.get_SMArtCARE_data — Function

get_SMArtCARE_data(test::String, baseline_df, timedepend_df; extended_output::Bool=false)

Function to preprocess the SMArtCARE data for a specific test. The function returns an SMATestData struct with the extracted information on time-dependent and baseline variables, follow-up time points and IDs of all patients for whom the chosen test was conducted.

From the provided input dataframes, the function first filters the time-dependent dataframe for patients for whom the selected test was conducted. The dataframe is then subset to the item variables of that test, and the baseline dataframe is subset to the same patients. For each patient, outlier time points are filtered out. A time point is classified as an outlier if its difference to the previous time point is larger than 2 times the interquartile range of all differences between subsequent time points for that patient. Additionally, the variance of the sum score of the test is calculated for each patient, to allow for further filtering downstream.
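
The outlier rule described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the helper name outlier_mask is hypothetical, the rule is applied to a patient's sum scores, the absolute difference to the immediately preceding observation is used, and series with fewer than three time points are left unfiltered.

```julia
using Statistics

# Hypothetical helper (not part of LatentDynamics): returns a Boolean mask
# that is false at time points flagged as outliers.
function outlier_mask(sumscores::AbstractVector{<:Real}; factor=2)
    n = length(sumscores)
    keep = trues(n)
    n < 3 && return keep                  # assumption: too few points to estimate an IQR
    diffs = diff(sumscores)               # differences between subsequent time points
    q25, q75 = quantile(diffs, [0.25, 0.75])
    iqr = q75 - q25
    for i in 2:n
        # flag time point i if it differs too strongly from its predecessor
        if abs(diffs[i - 1]) > factor * iqr
            keep[i] = false
        end
    end
    return keep
end

sumscores = [10, 12, 11, 14, 40, 15, 17, 16]
mask = outlier_mask(sumscores)
filtered = sumscores[mask]
```

Under this literal reading of the rule, both the spike at the fifth time point and the return from it at the sixth are flagged, since each differs strongly from its predecessor.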

Arguments

test: name of the motor function test for which the data is collected
baseline_df: DataFrame containing the baseline variables for all patients
timedepend_df: DataFrame containing the time-dependent variables for all patients
extended_output: if true, the function also returns the calculated variances of the sum score for each patient and the time point masks that show which time points were filtered out for each patient.

LatentDynamics.recode_SMArtCARE_data — Function

recode_SMArtCARE_data(testdata::SMATestData)

Recodes the time-dependent item values in an SMATestData struct. Original item levels are integers between 0 and 2 for all items except item a, which has values between 0 and 6. Each item is separately mapped to numbers between 0 and 1, and the mapped values are subsequently logit-transformed. A new SMATestData struct is returned, where the recoded values are stored in the xs field.

Arguments

testdata::SMATestData: the test data to be recoded
recoding_dict: Dictionary specifying the numbers item levels should be recoded to for all items except a; default is Dict(0 => 0.1, 1 => 0.5, 2 => 0.9) and this is what has been used for all experiments.
recoding_dict_itema: Dictionary specifying the numbers item levels should be recoded to for item a; default is Dict(0 => 0.1, 1 => 0.2, 2 => 0.3, 3 => 0.5, 4 => 0.7, 5 => 0.8, 6 => 0.9) and this is what has been used for all experiments.
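
The two-step recoding (dictionary lookup, then logit transform) can be sketched for a single patient's item matrix as follows. The dictionaries are the documented defaults; the helper name recode_items and the assumption that item a is stored in the first row are hypothetical and only serve the illustration.

```julia
# Documented default recodings.
const RECODING = Dict(0 => 0.1, 1 => 0.5, 2 => 0.9)
const RECODING_ITEMA = Dict(0 => 0.1, 1 => 0.2, 2 => 0.3, 3 => 0.5,
                            4 => 0.7, 5 => 0.8, 6 => 0.9)

logit(p) = log(p / (1 - p))

# Hypothetical helper: x is an (nitems x ntimepoints) matrix of integer item levels.
# Assumption: item a is stored in row `itema_row`.
function recode_items(x::AbstractMatrix{<:Integer}; itema_row=1)
    out = Matrix{Float64}(undef, size(x))
    for i in axes(x, 1), j in axes(x, 2)
        dict = i == itema_row ? RECODING_ITEMA : RECODING
        out[i, j] = logit(dict[x[i, j]])   # map to (0, 1), then logit-transform
    end
    return out
end

x = [6 3 0;    # item a, levels 0 to 6
     2 1 0]    # another item, levels 0 to 2
recoded = recode_items(x)
```

Because the defaults are symmetric around 0.5, the middle level of each item is mapped to 0 after the logit transform, and the lowest and highest levels are mapped to values of equal magnitude and opposite sign.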

Simulated data

LatentDynamics.simdata — Type

mutable struct simdata

Struct to serve as a container for simulated data, consisting of the following fields:

xs: vector of matrices (nvariables x ntimepoints) of simulated values of the different variables across time for each patient
xs_baseline: vector of vectors of baseline variable measurements for each patient
tvals: vector of vectors of follow-up time points for each patient
group1: vector of indices of patients belonging to group 1
group2: vector of indices of patients belonging to group 2

LatentDynamics.generate_xs — Function

generate_xs(n, p, true_u0, sol_group1, sol_group2; 
    t_start=1.5, t_end=10, maxntps = 10, dt=0.1, σ_var=0.1, σ_ind=0.5)

Generates simulated data for n individuals with p variables each. For each individual, one of the true underlying ODE solutions given by sol_group1 and sol_group2 is selected at random, and a random number of time points between 1 and maxntps is sampled. The values of the selected solution at these time points are then perturbed by variable-specific and individual-specific error terms, whose variances are controlled by σ_var and σ_ind, respectively.

Arguments:

n: number of individuals to simulate
p: number of time-dependent variables to simulate - should be divisible by the number of the true underlying trajectory dimensions, so the first (p/ntruedimensions) variables can be noisy versions of the first dimension of the true dynamics, and so on.
true_u0: vector stating the initial condition of the ground-truth underlying ODE systems from which to simulate the data
sol_group1: true ODE solution of the first group
sol_group2: true ODE solution of the second group

Optional keyword arguments:

t_start: Earliest time point possible for follow-up measurements, start of the interval from which to sample the subsequent measurement time point(s). Default = 1.5
t_end: Latest time point possible for follow-up measurements, end of the interval from which to sample the subsequent measurement time point(s). Default = 10
maxntps: maximum number of time points per individual after the baseline timepoint. Default = 10
dt: time steps at which to solve the ODE. Needed to ensure correct array sizes. Default = 0.1
σ_var: variance with which to sample the variable-specific error terms. Default = 0.1
σ_ind: variance with which to sample the individual-specific error terms. Default = 0.5

Returns:

xs: vector of length n = nindividuals, where the ith element is a (nvars=p x n_timepoints) matrix containing the simulated values of the time-dependent variables of the ith individual in the dataset
tvals: vector of length n = nindividuals, where the ith element is a vector of length ntimepoints_i (between 1 and maxntps) containing the simulated time points of the ith individual's measurements after the baseline visit
group1: indices of all individuals in group1
group2: indices of all individuals in group2
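
The sampling scheme can be illustrated with a self-contained sketch. It is not the package's implementation: the helper simulate_individual is hypothetical, closed-form functions stand in for the true ODE solutions, σ_var and σ_ind are treated as standard deviations of Gaussian noise, the individual-specific error is taken to be a single offset per individual, and time points are drawn without replacement from the grid t_start:dt:t_end. All of these are assumptions.

```julia
using Random

# Stand-ins for the two true ODE solutions (assumption: 2-dimensional dynamics).
toy_sol_group1(t) = [exp(-0.2t), 1 - exp(-0.2t)]
toy_sol_group2(t) = [1 - exp(-0.1t), exp(-0.1t)]

# Hypothetical helper: simulate one individual's noisy observations.
function simulate_individual(rng, p, sol; t_start=1.5, t_end=10, maxntps=10,
                             dt=0.1, σ_var=0.1, σ_ind=0.5)
    grid = collect(t_start:dt:t_end)
    ntps = rand(rng, 1:maxntps)                  # number of follow-up time points
    tvals = sort(shuffle(rng, grid)[1:ntps])     # time points, sampled without replacement
    ndims_true = length(sol(first(grid)))
    @assert p % ndims_true == 0 "p should be divisible by the number of true dimensions"
    reps = p ÷ ndims_true
    ind_err = σ_ind * randn(rng)                 # one offset shared by all of this individual's values
    x = Matrix{Float64}(undef, p, ntps)
    for (j, t) in enumerate(tvals)
        truth = repeat(sol(t), inner=reps)       # first p/ndims variables follow dimension 1, and so on
        x[:, j] = truth .+ ind_err .+ σ_var .* randn(rng, p)
    end
    return x, tvals
end

rng = MersenneTwister(42)
n, p = 20, 4
in_group1 = rand(rng, Bool, n)                   # random group assignment
sims = [simulate_individual(rng, p, in_group1[i] ? toy_sol_group1 : toy_sol_group2)
        for i in 1:n]
xs = first.(sims)
tvals = last.(sims)
group1 = findall(in_group1)
group2 = findall(.!in_group1)
```

The resulting xs, tvals, group1 and group2 have the same shapes as the documented return values, which makes the sketch convenient for checking downstream code against the expected data layout.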

LatentDynamics.generate_baseline — Method

generate_baseline(n, q, q_info, group1; σ_info=1, σ_noise=1)

Generates simulated baseline data by sampling n observations of q baseline variables based on the group membership information. Only the first q_info variables are informative; the remaining ones are pure noise variables. The group membership is given by group1, the indices of all individuals in group 1, from which the indices in group 2 can be inferred, since union(group1, group2) = {1,...,n}. Baseline measurements are simulated by encoding group membership as 1 or -1 and drawing from N(0, σ_info) or N(1, σ_info), respectively. For the noise variables, data are simulated by drawing from N(0, σ_noise).

Arguments:

n: number of individuals to simulate
q: number of baseline variables to simulate
q_info: number of informative baseline variables.
group1: indices of all individuals in group1 - since union(group1, group2) = {1,...,n}, the group2 indices can be inferred from that

Optional keyword arguments:

σ_info: variance with which to sample the informative baseline variables from the group membership information. Default = 1
σ_noise: variance with which to sample the noise baseline variables. Default = 1

Returns:

x_params: vector of length n = nindividuals, where the ith element is a vector of length (nbaselinevars=q) containing the baseline information for the ith individual in the dataset
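
A minimal sketch of group-informative baseline variables is given below. It is an illustration, not the package's implementation: the helper toy_baseline and its means keyword are hypothetical, the default means follow the 1 / -1 group encoding (the description also mentions N(0, σ_info) and N(1, σ_info), which can be obtained by passing means=(0.0, 1.0)), and σ_info and σ_noise are treated as standard deviations.

```julia
using Random

# Hypothetical helper: the first q_info variables carry group information,
# the remaining q - q_info variables are pure noise.
function toy_baseline(rng, n, q, q_info, group1;
                      σ_info=1, σ_noise=1, means=(1.0, -1.0))
    map(1:n) do i
        μ = i in group1 ? means[1] : means[2]           # group-specific mean (assumption)
        informative = μ .+ σ_info .* randn(rng, q_info)
        noise = σ_noise .* randn(rng, q - q_info)
        vcat(informative, noise)
    end
end

rng = MersenneTwister(1)
n, q, q_info = 200, 5, 2
group1 = 1:100                                          # group 2 is then 101:200
x_params = toy_baseline(rng, n, q, q_info, group1; σ_info=0.1)
```

With a small σ_info, the informative variables separate the two groups clearly, while the noise variables have the same distribution in both groups.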

LatentDynamics.generate_baseline — Method

generate_baseline(n, q, q_info, group1, true_odeparams_group1, true_odeparams_group2; σ_info=0.1, σ_noise=0.1)

Generates simulated baseline data by sampling n observations of q baseline variables based on the true ODE parameters passed as true_odeparams_group1 and true_odeparams_group2. Only the first q_info variables are informative; the remaining ones are pure noise variables. Informative baseline measurements are simulated by sampling around the true parameters of the individual's group with a standard deviation of σ_info. For the noise variables, data are simulated by drawing from N(0, σ_noise).

Arguments:

n: number of individuals to simulate
q: number of baseline variables to simulate
q_info: number of informative baseline variables.
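group1: indices of all individuals in group1 - since union(group1, group2) = {1,...,n}, the group2 indices can be inferred from that
true_odeparams_group1: true ODE parameters of the first group
true_odeparams_group2: true ODE parameters of the second group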

Optional keyword arguments:

σ_info: standard deviation with which to sample the informative baseline variables around the true ODE parameters. Default = 0.1.
σ_noise: variance with which to sample the noise baseline variables. Default = 0.1.

Returns:

x_params: vector of length n = nindividuals, where the ith element is a vector of length (nbaselinevars=q) containing the baseline information for the ith individual in the dataset
