BrainChart calibration tutorial

Introduction

The code in this repository extends the original BrainChart codebase, and therefore can’t be assembled into a standard R package without restructuring that code.

First, load the libraries we’ll be using later:

library(tidyverse)
library(pracma)

The functionality in this repository is accessed via the following code snippet.

BrainChartFolder <- dirname(normalizePath("calibrate_braincharts_centiles.R"))
wd <- setwd(BrainChartFolder)
source("calibrate_braincharts_centiles.R")
setwd(wd)

IMPORTANT NOTE: A full path to calibrate_braincharts_centiles.R must be provided if your code is outside this repository, e.g.

BrainChartFolder <- dirname(normalizePath("~/Projects/brainchartscentiles/calibrate_braincharts_centiles.R"))
source(file.path(BrainChartFolder, 'calibrate_braincharts_centiles.R'))

Using the calibration functions requires the following steps:

  1. Loading brain IDPs
  2. Attaching demographic and site data
  3. Renaming columns
  4. Calibrating the large population sample
  5. Calibrating the small population sample with repeated measures in the large population sample

This tutorial illustrates calibration using a combination of data from the healthy brain network and synthetic data. The code can be used as a template to calibrate your own data.

Note that some output from the code snippets has been suppressed to make cleaner documentation. The code used to create this tutorial is available in the file calibration_example_hbn.R.

Loading FreeSurfer and demographic data

Helper function and loading FreeSurfer data

First, load the study FreeSurfer data. These tables are collated from the individual study statistics using the FreeSurfer asegstats2table and aparcstats2table utilities, as illustrated in the bash script freesurfer_make_stats.sh.

readFS <- function(fname) {
  df <- read_csv(fname, name_repair = "universal", show_col_types = FALSE)
  df <- rename(df, ID = 1)
  return(df)
}

asegDF <- readFS(file.path('HBN', 'stats_aseg.csv'))
lhCortThicknessDF <- readFS(file.path('HBN', 'stats_aparc_lh_thickness.csv'))
lhCortSurfaceAreaDF <- readFS(file.path('HBN', 'stats_aparc_lh_area.csv'))
rhCortThicknessDF <- readFS(file.path('HBN', 'stats_aparc_rh_thickness.csv'))
rhCortSurfaceAreaDF <- readFS(file.path('HBN', 'stats_aparc_rh_area.csv'))

Set up IDP columns

Next, we select columns of interest and combine and rename them as required by the BrainChart calibration process. We then join the per-hemisphere surface-based metrics to the volume metrics and optionally discard columns not required by calibration.

# make the IDP columns
brainChartsDF <- select(asegDF, ID, 
                        CerebralWhiteMatterVol, 
                        SubCortGrayVol, CortexVol, 
                        EstimatedTotalIntraCranialVol, 
                        Left.Lateral.Ventricle, Right.Lateral.Ventricle, 
                        SupraTentorialVolNotVent, BrainSegVol.to.eTIV)
brainChartsDF <- mutate(brainChartsDF, 
                        Ventricles = Left.Lateral.Ventricle + 
                          Right.Lateral.Ventricle)

brainChartsDF <- rename(brainChartsDF, 
                        WMV = CerebralWhiteMatterVol,
                        sGMV = SubCortGrayVol,
                        GMV = CortexVol,
                        etiv = EstimatedTotalIntraCranialVol,
                        TCV = SupraTentorialVolNotVent,
                        )
# join surface area measures
brainChartsDF <- left_join(brainChartsDF, 
                           select(lhCortSurfaceAreaDF, ID, 
                                  lh_WhiteSurfArea_area), by="ID")
brainChartsDF <- left_join(brainChartsDF, 
                           select(rhCortSurfaceAreaDF, ID, 
                                  rh_WhiteSurfArea_area), by="ID")
# join cortical thickness measures
brainChartsDF <- left_join(brainChartsDF, 
                           select(lhCortThicknessDF, ID, 
                                  lh_MeanThickness_thickness), by="ID")
brainChartsDF <- left_join(brainChartsDF, 
                           select(rhCortThicknessDF, ID, 
                                  rh_MeanThickness_thickness), by="ID")

brainChartsDF <- mutate(brainChartsDF, 
                        SA = lh_WhiteSurfArea_area + rh_WhiteSurfArea_area)
brainChartsDF <- mutate(brainChartsDF, 
                        CT = ((lh_MeanThickness_thickness * 
                                 lh_WhiteSurfArea_area) + 
                                (rh_MeanThickness_thickness * 
                                   rh_WhiteSurfArea_area))/SA)

brainChartsDF <- select(brainChartsDF, -Right.Lateral.Ventricle, 
                        -Left.Lateral.Ventricle, 
                        -ends_with("area"), -ends_with("thickness"))

Set up other required columns

Here, we make the columns fs_version (the version of FreeSurfer used for processing), run (an identifier for the acquisition session), participant (the subject IDs), and country (acquisition country; unused).

brainChartsDF <- mutate(brainChartsDF, 
                        fs_version = 'Custom_T1T2',
                        run = 1,
                        participant = ID,
                        country = 'Australia',
)

# basic sanity filter: keep only records with mean cortical thickness above 2
brainChartsDF <- filter(brainChartsDF, CT > 2)

Set up the study column

The study column contains the identifier for the sample each set of measurements belongs to. It can be an arbitrary string, and all measurements belonging to the same sample must have the same value. It cannot be one of the studies in the Braincharts’ training set, which are as follows: 3R-BRAIN, ABCD, abide1, abide2, ADHD200, ADNI, AIBL, AOBA, AOMIC_ID1000, AOMIC_PIOP1, AOMIC_PIOP2, ARWIBO, BabyConnectome, BGSP, BHRCS, BSNIP, Calgary, CALM, CamCAN, CAMFT, Cornell_C1, Cornell_C2, cVEDA, devCCNP, dHCP, DLBS, EDSD, EMBARC, FemaleASD, FinnBrain, GUSTO, HABS, Harvard_Fetal1, HBN, HCP, HCP_lifespanA, HCP_lifespanD, IBIS, ICBM, IMAGEN, IMAP, IXI, LA5c, MCIC, Narratives, NHGRI, NIH, NIHPD, NSPN, NTB_Yale, OASIS3, Oulu, PCDC, PING, PNC, POND, PREVENTAD, RDB, SALD, SLIM, UCSD, UKB, VETSA, WAYNE.

In this example, we set the study column later via the demographic data.
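If you set the column directly, a small guard can catch an accidental clash with the training-set names. This is a minimal sketch; trainingStudies is abbreviated here and should hold the full list above:

    # minimal sketch: guard against reusing a training-set study name
    # (trainingStudies abbreviated; use the full list above)
    trainingStudies <- c("3R-BRAIN", "ABCD", "HBN", "HCP", "UKB")
    myStudy <- "HBNCBIC"
    stopifnot(!(myStudy %in% trainingStudies))
    brainChartsDF <- mutate(brainChartsDF, study = myStudy)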

Set up the diagnosis/group column

The dx column assigns participants either to the control group, CN (used for calibration), or to another group, notCN (excluded from calibration). Here, all subjects are controls, so we set it as follows:

brainChartsDF <- mutate(brainChartsDF, 
                        dx = "CN"
)

# for illustration, mark the last 5 subjects as non-controls
brainChartsDF$dx[(nrow(brainChartsDF)-4):nrow(brainChartsDF)] <- "notCN"

Derived values, such as centiles, are estimated for non-controls, but these measurements are not used for calibration.

Prepare demographic information

Preparing the demographic information is similar. In the HBN dataset, the demographic tables are provided per acquisition site, so we load the table for the CBIC site, create a study column named after the site (HBNCBIC), derive an age_days column, rename the Sex column to sex, and set up its factor levels.

demoDF <- read_csv(file.path('HBN', 'participants-CBIC.csv'), 
                     show_col_types = FALSE)

demoDF <- mutate(demoDF, study = 'HBNCBIC')

demoDF <- mutate(demoDF, age_days = Age * 365.25)
demoDF <- rename(demoDF, sex = Sex)

demoDF <- mutate(demoDF, 
                 sex = factor(sex, labels = c("Male", "Female"), 
                                      levels = c(0, 1)))

Combining IDP and demographic tables

Finally, we merge the demographic and IDP tables using the subject ID as a key, and exclude any incomplete records.

More complex studies may require merging by combinations of multiple columns, such as ID and time point or site.

brainChartsDF <- left_join(brainChartsDF, 
                           select(demoDF, participant_id, sex, age_days, study), 
                           by=join_by(ID==participant_id))

# remove missing values
brainChartsDF <- na.omit(brainChartsDF)
brainChartsDF <- mutate(brainChartsDF, study = factor(study, levels=c("HBNCBIC")))
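As an example of the multi-column case mentioned above, a longitudinal study might merge on both the subject ID and a session identifier. This is a hypothetical sketch, assuming a session column exists in both tables:

    # hypothetical multi-key join for a longitudinal study with a 'session' column
    brainChartsDF <- left_join(brainChartsDF,
                               select(demoDF, participant_id, session, sex, age_days, study),
                               by = join_by(ID == participant_id, session == session))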

Calibration

Study overview

Perform a quick check of study numbers and the breakdown by sex to ensure that the data is correctly loaded.

count(brainChartsDF, study)
# A tibble: 1 × 2
  study       n
  <fct>   <int>
1 HBNCBIC  1460
count(brainChartsDF, study, sex, sort = FALSE)
# A tibble: 2 × 3
  study   sex        n
  <fct>   <fct>  <int>
1 HBNCBIC Male     937
2 HBNCBIC Female   523

Calibration, Phase 1

Phase 1 of calibration is the standard, cross-sectional version proposed in the original BrainCharts paper. We use cortical thickness as the IDP of interest. We perform calibration on a per-study basis, since each study represents a single sample, and illustrate usage with a single site (CBIC).

# convert to a data frame for compatibility with non-tidyverse code
brainChartsDF <- data.frame(brainChartsDF)
row.names(brainChartsDF) <- brainChartsDF$ID

phenotype <- "CT"

# do the full sample calibrations per site
fullSampleCalibration <- calibrateBrainCharts(
  brainChartsDF, phenotype = phenotype)

The fields of the results structure will be described later.

Calibration, Phase 2

Phase 2 of calibration is the “small” sample calibration, which would typically form part of the study dataset. We generate a simulated small sample in this example.

subFakeDF <- simulateSmallSite(fullSampleCalibration, brainChartsDF)

Here, subFakeDF represents the “small” sample an investigator wishes to calibrate. We calibrate this sample with the centile-informed repeated-measures method, using the large-site calibration output computed previously:

subFakeQuantileCalibration <- 
  calibrateBrainChartsIDQuantilePenalty(subFakeDF, phenotype = "CT", 
                                        largeSiteOutput = fullSampleCalibration)

Results

fullSampleCalibration$DATA.PRED2$sample <- "large"
subFakeQuantileCalibration$DATA.PRED2$sample <- "small"

I <- which(fullSampleCalibration$DATA.PRED2$dx == "CN")
J <- which(fullSampleCalibration$DATA.PRED2$dx == "notCN")
S <- bind_rows(fullSampleCalibration$DATA.PRED2[I[1],], fullSampleCalibration$DATA.PRED2[J[1],])

S[, c('dx', 'age_days', 
      'PRED.l025.pop', 'PRED.l250.pop', 'PRED.m500.pop', 'PRED.u750.pop', 'PRED.u975.pop',
      'PRED.l025.wre', 'PRED.l250.wre', 'PRED.m500.wre', 'PRED.u750.wre', 'PRED.u975.wre',
      'meanCT2Transformed.q.wre', 'meanCT2Transformed.normalised')]
                    dx age_days PRED.l025.pop PRED.l250.pop PRED.m500.pop
sub-NDARAA396TWZ    CN 2578.625  0.0002544697  0.0002657019  0.0002712844
sub-NDARZZ598MH8 notCN 3133.375  0.0002520320  0.0002629325  0.0002683553
                 PRED.u750.pop PRED.u975.pop PRED.l025.wre PRED.l250.wre
sub-NDARAA396TWZ  0.0002766653  0.0002863981  0.0002621626  0.0002737731
sub-NDARZZ598MH8  0.0002735857  0.0002830546  0.0002596524  0.0002709200
                 PRED.m500.wre PRED.u750.wre PRED.u975.wre
sub-NDARAA396TWZ  0.0002795427  0.0002851035  0.0002951599
sub-NDARZZ598MH8  0.0002765246  0.0002819298  0.0002917136
                 meanCT2Transformed.q.wre meanCT2Transformed.normalised
sub-NDARAA396TWZ               0.03612907                   0.000255934
sub-NDARZZ598MH8               0.37301763                   0.000265776

The results data structure is described here purely for reference. In this example, fullSampleCalibration is the output of the large-sample calibration.

  • fullSampleCalibration$expanded: The model parameters
    • fullSampleCalibration$expanded$mu$ranef: The \(\mu\) site-effect estimates. The user-provided study name indexes the estimated effect for the sample; in this example it is fullSampleCalibration$expanded$mu$ranef[['HBNCBIC']]
    • fullSampleCalibration$expanded$sigma$ranef: The \(\sigma\) site-effect estimates. The user-provided study name indexes the estimated effect for the sample; in this example it is fullSampleCalibration$expanded$sigma$ranef[['HBNCBIC']]
  • fullSampleCalibration$DATA.PRED: The data with predicted values from the model before calibration
  • fullSampleCalibration$DATA.PRED2: The data with predicted values for parameters after calibration
    • fullSampleCalibration$DATA.PRED2$meanCT2Transformed: the transformed cortical thickness values
    • fullSampleCalibration$DATA.PRED2$mu.pop: the training data \(\mu\)
    • fullSampleCalibration$DATA.PRED2$mu.wre: the estimated \(\mu\) for each measurement
    • fullSampleCalibration$DATA.PRED2$sigma.pop: the training data \(\sigma\) for each measurement
    • fullSampleCalibration$DATA.PRED2$sigma.wre: the estimated \(\sigma\) for each measurement
    • fullSampleCalibration$DATA.PRED2$PRED.l250.pop: the training data 0.25 centile for each measurement
    • fullSampleCalibration$DATA.PRED2$PRED.m500.pop: the training data 0.5 (median) centile for each measurement
    • fullSampleCalibration$DATA.PRED2$PRED.u750.pop: the training data 0.75 centile for each measurement
    • fullSampleCalibration$DATA.PRED2$PRED.l250.wre: the estimated 0.25 centile for each measurement
    • fullSampleCalibration$DATA.PRED2$PRED.m500.wre: the estimated 0.5 (median) centile for each measurement
    • fullSampleCalibration$DATA.PRED2$PRED.u750.wre: the estimated 0.75 centile for each measurement
    • fullSampleCalibration$DATA.PRED2$meanCT2Transformed.q.wre: the estimated centile of the cortical thickness measurement in the altered site-corrected distribution. These values are in a consistent frame between samples.

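As a usage example, the calibrated centile of each measurement can be pulled out directly from the fields listed above:

    # calibrated per-measurement centiles, in a frame comparable across samples
    centiles <- fullSampleCalibration$DATA.PRED2$meanCT2Transformed.q.wre
    summary(centiles)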
Plotting results

This plot depicts a typical representation of the measured data and the estimated centile lines. The original data are plotted as points with the centile lines overlaid. The implication of calibration is that data points are characterised by their centiles. Thus, the centiles provide a common frame of reference for comparing timepoints between samples that is free of sample biases.

T <- bind_rows(fullSampleCalibration$DATA.PRED2, subFakeQuantileCalibration$DATA.PRED2)
M <- c(CN = 1, notCN = 5)

T$DXSize <- factor(M[T$dx], labels=c("CN", "notCN"))

ggplot(T, aes(x = age_days / 365.25)) +
  geom_point(aes(y = meanCT2Transformed * 10000, colour=sample, size = DXSize), alpha = 0.5) +
  geom_line(aes(y = PRED.l250.wre * 10000, colour=sample)) + 
  geom_line(aes(y = PRED.m500.wre * 10000, colour=sample), linewidth = 1) +
  geom_line(aes(y = PRED.u750.wre * 10000, colour=sample)) +
  labs(x = "Age (years)", y = "CT", colour = "Sample", size = "Diagnosis")