How much motion is too much motion? Determining motion thresholds by sample size for reproducibility in developmental resting-state MRI

A constant problem developmental neuroimagers face is in-scanner head motion. Children move more than adults, and this has led to concerns that developmental changes in resting-state connectivity measures may be artefactual. Furthermore, children are challenging to recruit into studies, and researchers have therefore tended to take a permissive stance when setting exclusion criteria on head motion. The literature is not clear regarding our central question: How much motion is too much? Here, we systematically examine the effects of multiple motion exclusion criteria at different sample sizes and age ranges in a large openly available developmental cohort (ABIDE; http://preprocessed-connectomes-project.org/abide). We checked 1) the reliability of resting-state functional magnetic resonance imaging (rs-fMRI) pairwise connectivity measures across the brain and 2) the accuracy with which we can separate participants with autism spectrum disorder from typically developing controls based on their rs-fMRI scans using machine learning. We find that reliability on average is primarily sensitive to the number of participants considered, but that increasingly permissive motion thresholds lower case-control prediction accuracy for all sample sizes.

© Leonard J et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Background
A constant problem developmental imagers face is in-scanner head motion (Poldrack et al. 2002, Raschle et al. 2012). Children move more than adults, and this has led to concerns that developmental changes in resting-state connectivity measures may be artefactual (Van Dijk et al. 2011, Satterthwaite et al. 2012). Furthermore, typically developing children and children with developmental disorders are challenging to recruit into studies, so researchers may engage in extensive mock scanner motion training with participants and/or may take a permissive stance when setting exclusion criteria on head motion (de Bie et al. 2010, Yerys et al. 2009). Yet no one has systematically examined what motion cutoffs should be used to make reliable inferences in developmental data, or how this might vary by both sample size and age range.
Here, we systematically examine the effects of multiple motion exclusion criteria at different sample sizes and age ranges in a large openly available developmental cohort (ABIDE; Di Martino et al. 2013, Cameron et al. 2013; http://preprocessed-connectomes-project.org/abide) on both the reliability of resting-state functional magnetic resonance imaging (rs-fMRI) pairwise connectivity and the accuracy of autism versus healthy control prediction.

Methods
In a cohort of 743 children (aged 6 to 18 years, 620 male), we varied motion cutoffs and sample size to explore how these variables affected both split-half reliability and prediction accuracy of autism diagnosis using machine learning. Specifically, we adjusted the sample size (from 10 to 100 participants) and the permitted percentage of volumes whose displacement from the previous volume exceeded 0.2 mm (from 0 to 100%; details at http://preprocessed-connectomes-project.org/abide/quality_assessment.html). The input data for all analyses were individual pairwise correlation matrices using the 116 regions of interest (ROIs) defined in the Automated Anatomical Labeling (AAL) atlas (Tzourio-Mazoyer et al. 2002). For both analyses described below, we selected two matched groups according to our sample size and motion criteria, and ensured they were balanced for age, sex, diagnosis, and scanning site. Data and all code to reproduce the analyses can be found on GitHub (Flournoy and Leonard 2017).
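The motion criterion described above can be illustrated with a minimal sketch. It assumes each participant comes with a list of volume-to-volume displacements in millimetres; the field names (`id`, `fd`), the helper names, and the toy values are hypothetical and are not taken from the ABIDE files or from the published analysis code. Matching on age, sex, diagnosis, and scanning site is not shown.

```python
import random

FD_LIMIT_MM = 0.2  # displacement limit stated in the text


def percent_high_motion(fd_series, limit=FD_LIMIT_MM):
    """Percentage of volumes whose displacement from the previous volume exceeds `limit`."""
    if not fd_series:
        raise ValueError("empty displacement series")
    n_bad = sum(1 for fd in fd_series if fd > limit)
    return 100.0 * n_bad / len(fd_series)


def eligible(participants, max_percent):
    """Keep participants whose share of high-motion volumes is at or below `max_percent`."""
    return [p for p in participants if percent_high_motion(p["fd"]) <= max_percent]


def draw_sample(participants, max_percent, n, seed=0):
    """Randomly draw `n` participants from those passing the motion cutoff."""
    pool = eligible(participants, max_percent)
    if len(pool) < n:
        raise ValueError("not enough participants pass this motion cutoff")
    return random.Random(seed).sample(pool, n)


# Hypothetical toy data: three participants with four volumes each.
participants = [
    {"id": "sub-01", "fd": [0.05, 0.10, 0.08, 0.12]},  # no volume above 0.2 mm
    {"id": "sub-02", "fd": [0.05, 0.30, 0.08, 0.12]},  # one of four volumes above 0.2 mm
    {"id": "sub-03", "fd": [0.25, 0.30, 0.08, 0.40]},  # three of four volumes above 0.2 mm
]

strict = eligible(participants, max_percent=0)     # most conservative cutoff
lenient = eligible(participants, max_percent=100)  # no exclusion at all
```

A cutoff of 0% keeps only participants with no high-motion volumes, while 100% keeps everyone, which corresponds to the two ends of the range explored here.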
For the split-half reliability analyses, we averaged the individual correlation matrices to give the average connection between each ROI-ROI pair in each group. We computed R-squared values for the fit between all the average pairwise correlations, assuming the two groups were equal (Fig. 1). For each sample size and motion cutoff, we ran 100 permutations to identify a median R-squared value, which gave us a measure of "reliability" between two samples for each motion threshold and sample size.
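As an illustration of the split-half procedure, the sketch below averages connectivity values within each half and scores their agreement against the identity line, which is one reading of "assuming the two groups were equal". The function names, the random (unmatched) split, and the simulated data are assumptions made for the example; the published code may differ. With 116 ROIs there are 116 × 115 / 2 = 6670 unique ROI-ROI pairs; the toy data use 50 for brevity.

```python
import random
import statistics


def mean_edges(group):
    """Average each ROI-ROI edge across the participants in a group."""
    n = len(group)
    return [sum(values) / n for values in zip(*group)]


def r_squared_identity(x, y):
    """R-squared of y against the identity line y = x (no fitted slope or intercept)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def split_half_reliability(edge_vectors, n_permutations=100, seed=0):
    """Median R-squared between the mean edges of two random halves."""
    if len(edge_vectors) < 2:
        raise ValueError("need at least two participants")
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = edge_vectors[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        group_a = mean_edges(shuffled[:half])
        group_b = mean_edges(shuffled[half:2 * half])
        scores.append(r_squared_identity(group_a, group_b))
    return statistics.median(scores)


# Simulated data: a shared connectivity pattern plus participant-level noise.
rng = random.Random(1)
true_pattern = [rng.uniform(-0.5, 0.8) for _ in range(50)]
toy = [[edge + rng.gauss(0, 0.3) for edge in true_pattern] for _ in range(40)]
reliability = split_half_reliability(toy, n_permutations=100)
print(f"median R-squared: {reliability:.3f}")
```

Scoring against the identity line penalises any systematic offset between the two halves, whereas a squared Pearson correlation would not; which of the two is appropriate depends on how "equal" is meant to be interpreted.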
Another measure of how motion thresholds change the replicability of an analysis is out-of-sample predictive accuracy. We used the participants' resting-state functional connectivity matrices as features to predict diagnostic category (autism spectrum disorder vs typically developing controls). We designated one half of the data as a training set and reserved the other half for testing our model. The training generated a support vector machine (SVM) classifier with an L1 penalty tuned using 10-fold cross-validation (Pedregosa et al. 2011). The classifier was then used to predict diagnosis labels in the test set, with classification accuracy as our outcome of interest. Both the train-test split and the 10-fold splits within the training data were stratified so that the proportions of cases and controls were roughly equivalent in each split. For each sample size and motion cutoff we ran 500 permutations. We compared the estimated prediction accuracy to the baseline rate that would be achieved by assigning every participant the most prevalent diagnostic category; for example, in a sample of 90 controls and 10 cases, one could achieve 90% accuracy by labeling every participant a control.

Figure 1. In order to investigate the effects of age range, motion exclusion threshold, and sample size on functional connectivity reliability, we split the data into two matched samples. For the reliability analysis we averaged all participants in each sample and then calculated how well aligned the two groups were in terms of each pairwise regional connectivity measure. For the out-of-sample prediction analysis we used one half of the data to train a model and then tested it on the other half.
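The baseline rate and the stratified train-test split described in the Methods can be sketched as follows. This is a simplified illustration rather than the published pipeline: the helper names and the label strings are hypothetical, and the SVM fitting and cross-validated tuning are omitted.

```python
import random
from collections import Counter, defaultdict


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most prevalent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def stratified_half_split(labels, seed=0):
    """Return (train_idx, test_idx) with each class divided roughly in half."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        half = len(indices) // 2
        train.extend(indices[:half])
        test.extend(indices[half:])
    return sorted(train), sorted(test)


# The example from the text: 90 controls and 10 cases.
labels = ["control"] * 90 + ["case"] * 10
print(baseline_accuracy(labels))  # 0.9

train_idx, test_idx = stratified_half_split(labels)
print(Counter(labels[i] for i in train_idx))  # 45 controls and 5 cases
```

Splitting within each class keeps the case-control ratio the same in the training and test halves, so the baseline rate is comparable across both.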

Results
The split-half reliability analysis showed that reliability is primarily sensitive to the number of participants considered, with more participants leading to higher reliability (Fig. 2). Motion cutoffs did not appear to have a strong effect on reliability. Although this is comforting, it is important to note that while some studies still average across subjects to look at group differences, many are moving towards predicting individual differences. Our results do not speak to the sensitivity of individual difference analyses to motion.
The results of the out-of-sample predictive accuracy analyses show that prediction accuracy depends not only on sample size but also on motion cutoffs. The best prediction was found in larger samples with lower motion thresholds (Fig. 3). In samples of 60 or more, median prediction accuracy is consistently above the baseline of a naive classifier that assumes that all participants share the modal diagnosis (in this case, non-ASD). However, out-of-sample prediction accuracy varies across the different permutations of the data within each combination of sample size and motion threshold, and a large proportion of classifiers perform worse than baseline. We tested only one machine learning strategy, and it is likely that the exact model will also affect the prescribed "best" motion cutoff and sample size. As expected, larger sample sizes improve both of our reliability measures (R-squared and prediction accuracy). We found that prediction accuracy decreased when the exclusion criteria for motion were made more lenient.
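To make the comparison with baseline concrete, the sketch below summarises a set of permutation accuracies by their median, their 5th and 95th percentiles, and the share of classifiers falling below the baseline. The function name and the toy accuracies are hypothetical and are not results from this study.

```python
import statistics


def summarise_accuracies(accuracies, baseline):
    """Median, 5th and 95th percentiles, and share of runs below baseline."""
    # n=20 yields cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(accuracies, n=20, method="inclusive")
    n_below = sum(1 for acc in accuracies if acc < baseline)
    return {
        "median": statistics.median(accuracies),
        "p05": cuts[0],
        "p95": cuts[-1],
        "share_below_baseline": n_below / len(accuracies),
    }


# Hypothetical accuracies from five permutations.
toy_accuracies = [0.4, 0.5, 0.6, 0.7, 0.8]
summary = summarise_accuracies(toy_accuracies, baseline=0.55)
print(summary["median"], summary["share_below_baseline"])  # 0.6 0.4
```

A median above baseline can coexist with a sizeable share of individual runs below it, which is why both quantities are worth reporting.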

Conclusions and future directions
While this project is far from complete, we have shown that motion cutoffs, sample sizes, and age ranges do affect reliability in developmental data. In future work, we would also like to explore how both motion thresholds and sample sizes might affect reliability differently by age range. Our end goal is to provide a tool for authors to check their own datasets against our findings, so that they can make informed decisions when designing future developmental neuroimaging studies.
In a larger sense, though, we have shown that bringing together people who work in a similar field (cognitive neuroscience) but come from diverse backgrounds (developmental psychology, psychiatry, computational modeling, developmental cognitive neuroscience) for a one-week hackathon can foster novel solutions to old problems. This cross-pollination of ideas brought a much-needed fresh, rigorous methodological approach to developmental imaging, and the week of fast learning inspired and prepared the next generation of cognitive neuroscientists to create thoughtful and reproducible work in the future.

Figure 3. Out-of-sample prediction accuracy of autism diagnosis using resting-state data as a function of sample size and motion-based exclusion criteria (percentage of whole-brain fMRI volumes exceeding threshold). The red line is a naive classifier that assumes that all participants share the modal diagnosis (in this case, non-ASD). The black line spans the 5th to 95th percentile accuracy across iterations using a linear SVM, with the black points at the median value. Code and output can be found on GitHub (Flournoy and Leonard 2017).

Figure 2. Split-half reliability results showing how sample size (N) has a large effect on R-squared (median R-squared from 100 permutations) while motion threshold does not. Error bars represent average 95% confidence intervals across 100 permutations. Code and output can be found on GitHub (Flournoy and Leonard 2017).