Item #18
“Proper data partitioning process” [1] (licensed under CC BY)
Explanation
“Whether the training-validation-test data split is done at the very beginning of the analysis pipeline, prior to any processing step. Data split should be random but reproducible (e.g., fixed random seed), preferably without altering outcome variable distribution in the test set (e.g., using a stratified data split). Moreover, the data split should be on the patient level, not the scan level (i.e., different scans of the same patient should be in the same set). Proper data partitioning should guarantee that all data processing (e.g., scaling, missing value imputation, oversampling or undersampling) is done blinded to the test set data. These techniques should be exclusively fitted on training (or development) data sets and then used to transform test data at the time of inference. If a single training-validation data split is not done and a resampling technique (e.g., cross-validation) is used instead, test data should always be handled separately from this.” [1] (licensed under CC BY)
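To make these requirements concrete, the following minimal sketch (not part of the METRICS definition, using entirely synthetic data) shows a patient-level, stratified, seed-fixed split with scikit-learn's StratifiedGroupKFold; the feature matrix, patient identifiers, and labels are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(42)
n_scans = 300
X = rng.normal(size=(n_scans, 20))                    # radiomic feature matrix (one row per scan)
patient_id = rng.integers(0, 120, size=n_scans)       # some patients contribute several scans
y = (rng.random(120) < 0.3).astype(int)[patient_id]   # one outcome label per patient

# Fixed random seed -> reproducible; groups -> all scans of a patient stay in the
# same set; stratification -> outcome distribution approximately preserved.
splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))  # one ~80/20 split

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# Every later step (scaling, imputation, resampling, feature selection) is fitted
# on X_train only and merely applied to X_test at inference time.
```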
Positive examples from the literature
Example #1: “Patients with PI-RADS 1, 2, 4, and 5 of centers 1–3 were used as a pretraining cohort (total N = 1144), and PI-RADS 3 patients from the same centers were used as the training cohort (total N = 238). For centers 4–6, PI-RADS 3 patients were used as external testing cohorts (total N = 185) (Fig. 1), and PI-RADS 1, 2, 4, and 5 patients (N = 439) were not applied in the present study. In each pretraining and training cohort, the patients were randomly divided into two datasets, including nine-tenth (i.e., the training dataset) and one-tenth patients (i.e., the tuning dataset), which were used to train the network weights and optimize the hyperparameters of the DL model, respectively. […]
The preprocessing of prostate MRI images included data de-identification, registration, data harmonization, and data augmentation (Supplementary Section 4) referring to our previous study. […]
The data harmonization included three steps: (1) all MRI images were resampled to a common voxel size of 0.46×0.46×3, which was the median value of voxel spacing in the training cohort; (2) for each index lesion, in the slice where the section of the lesion was the largest extent size, a 2D square ROI was produced to comprise the lesion with an additional 5-voxel margin. Then, this 2D ROI was reproduced in the slices containing the lesion with an additional 5-slice margin extending to the top and bottom. Finally, these 2D ROI consisted of a 3D ROI. (3) For all patients, the 3D ROIs were resampled into a common resolution of 112×112×16. (3) For each 3D ROI, the intensity of each voxel was converted to z-score, namely z-score = (x_i − x̄)/σ, where x_i is the original intensity value of the ith voxel, and x̄ and σ are the mean and standard deviation across all voxels of corresponding 3D ROI, respectively. In order to prevent models from over-fitting and further improve models’ generalization, the data of the pretraining and training cohorts (except for their respective tuning datasets) was augmented by the translations in random directions and rotations at random angles. The augmented pretraining and training cohorts, and not-augmented tuning datasets were employed to develop deep learning models. The remaining not-augmented testing cohorts were used to test models’ performance.” [2] (licensed under CC BY)
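As an illustration of the per-ROI z-score normalization described in this example, here is a minimal numpy sketch; it is an assumption of how such a step could be implemented, not the authors' code, and the ROI array is a random placeholder.

```python
import numpy as np

def zscore_roi(roi: np.ndarray) -> np.ndarray:
    """Standardize one 3D ROI (e.g., 112 x 112 x 16) to zero mean and unit variance."""
    return (roi - roi.mean()) / roi.std()

roi = np.random.default_rng(0).random((112, 112, 16))  # placeholder for a resampled lesion ROI
roi_z = zscore_roi(roi)                                 # per-ROI statistics: no cross-patient pooling
```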
Example #2: “Patients from institution 1 were allocated to the training cohort, and those from institutions 2 and 3 to the testing cohort. […]
The feature selection process in this study was conducted within the training cohort. Only features with an intraclass correlation coefficient (ICC) greater than 0.8 were retained. The training cohort was divided into an internal training set and an internal validation set in a 4:1 ratio, a procedure replicated across 100 iterations. In each iteration, the internal training set underwent analysis using the Mann–Whitney U-test and least absolute shrinkage and selection operator (LASSO) with 5-fold cross-validation. These methods were employed to generate a feature set for model construction. Ten algorithms were used to build classifiers: logistic regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), decision tree, random forest, extra trees, XGBoost, multi-layer perceptron (MLP), Naive Bayes, and light gradient boosting machine (LightGBM). The performance of these classifiers was tested on the internal validation set. The best-performing classifier and its feature set from each iteration were recorded. Features were ranked based on their frequency of selection. The top two features for each imaging modality (CT and MRI) and feature extraction method (traditional radiomics and deep learning) were selected to build single modality models using the ten algorithms. Then for each imaging modality, a combined model based on both traditional and deep-learning radiomics features was built. To enhance the model’s generalizability and reduce overfitting, another round of feature selection and model construction was performed, again over 100 iterations, based on the twelve features selected in the former procedure. This process aimed to select the top four features for building integrated models with the ten algorithms.” [3] (licensed under CC BY)
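The repeated-split feature selection loop described above could look roughly like the following hedged sketch, which uses scikit-learn stand-ins (train_test_split, StandardScaler, LassoCV) on synthetic data; the ICC filter, the Mann–Whitney U pre-filter, and the ten classifiers are omitted, and nothing here touches the external testing cohort.

```python
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training cohort; the external testing cohort is never used here.
X_train_cohort, y_train_cohort = make_classification(n_samples=200, n_features=50, random_state=0)

selection_counts = Counter()
for i in range(100):                                          # 100 repeated splits
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train_cohort, y_train_cohort, test_size=0.2,        # 4:1 internal training/validation split
        stratify=y_train_cohort, random_state=i)
    scaler = StandardScaler().fit(X_tr)                       # fitted on the internal training set only
    lasso = LassoCV(cv=5, random_state=i).fit(scaler.transform(X_tr), y_tr)
    selection_counts.update(np.flatnonzero(lasso.coef_))      # non-zero coefficients = selected features

# Features are then ranked by how often they were selected across the 100 iterations.
top_features = [feature for feature, _ in selection_counts.most_common(4)]
print(top_features)
```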
Example #3: “We used the stratified random sampling method to divide development set into a training and a validation set at a 7:3 ratio. The synthetic minority oversampling technique (SMOTE) was used in the training set to overcome data imbalance in the training set, and oversampled the number of patients with HER2-enriched and TNBC to twice their own. The same step-by-step process was used for feature standardization, feature selection, and model construction in the training set for each model.” [4] (licensed under CC BY)
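A minimal sketch of the leakage-safe pattern behind this example: with imblearn's Pipeline, SMOTE is applied only when fit() is called on the training data and is never applied to validation or test data. The data, split ratio, and classifier below are illustrative assumptions, not the authors' implementation.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)   # stratified 7:3 split, fixed seed

model = Pipeline([
    ("scale", StandardScaler()),          # mean/SD learned from training data only
    ("smote", SMOTE(random_state=42)),    # oversampling happens inside fit() only
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)               # SMOTE is applied here, to training data only
print(model.score(X_val, y_val))          # validation data are scaled, never oversampled
```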
Example #4: “We randomly split the data of center 1 and center 2 at the patient level into a training cohort (n = 762) and an internal testing (n = 327) cohort in a 7:3 ratio. The data from center 3 (n = 279) and center 4 (n = 248) were used as two separate external testing cohorts.
In order to remove the imbalance from the training data set, we performed up-sampling by repeating random cases to equal the number of positive/negative samples. The z-score was used to normalize each feature by subtracting the mean value and dividing it by the standard deviation. The dimension reduction was applied to the normalized feature. Pearson correlation coefficient (PCC) was calculated for each pair of two features, one of which was dropped if the PCC value was > 0.99. Analysis of variance (ANOVA) was used for feature selection, and the F-value of each feature was calculated based on the labels in the training cohort. The selected features for predicting csPCa are summarized in Table S2. Finally, the random forest (RF), support vector machine (SVM), logistic regression (LR), and linear discriminant analysis (LDA) models were trained on the selected features to build the radiomics model separately. We used 5-fold cross-validation on the training cohort to determine the hyper-parameters of the pipeline, including the number of selected features, the kernel, or the regularization parameter of the four classifications, after which the hyper-parameters that achieved the highest cross-validation performance were used to train the final model on the whole training cohort. The details of the pipeline of the machine models are shown in Figure S1. The prediction of the final model was used as the radiomics score (Rad-score) in the subsequent analysis.” [5] (licensed under CC BY)
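The tuning strategy described above, with preprocessing and feature selection wrapped inside cross-validation on the training cohort only, could be sketched as follows with scikit-learn; up-sampling, the Pearson-correlation filter, and the three additional classifiers are omitted for brevity, and all data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=400, n_features=100, random_state=0)

pipe = Pipeline([
    ("zscore", StandardScaler()),                  # mean/SD learned from the training folds only
    ("anova", SelectKBest(score_func=f_classif)),  # F-values computed on the training folds only
    ("svm", SVC()),
])
param_grid = {
    "anova__k": [5, 10, 20],                       # number of selected features is tuned as well
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)  # 5-fold CV on training cohort
final_model = search.best_estimator_               # refit on the whole training cohort
# final_model.predict(X_test) would be the only operation ever performed on test data.
```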
Hypothetical negative examples
Example #5: Our study included 400 patients (340 negative class; 60 positive class) with MRI data from two institutions. To address class imbalance, we applied synthetic minority oversampling (SMOTE), ensuring balanced classes. […] Feature standardization and selection were performed, and a support vector machine model was trained and tested.
Example #6: We used CT scans from 300 patients to develop a radiomics model. The dataset was split at the scan level into training (n=600 scans), validation (n=150 scans), and test (n=150 scans) sets in a 4:1:1 ratio. […] Preprocessing steps, including voxel resampling and intensity scaling, were applied. […] Cross-validation was used on the training set to select features and train a logistic regression model, and the test set was evaluated separately.
Example #7: We extracted 200 radiomic features from the images. […] To reduce dimensionality, we performed feature selection using a correlation-based method. The top 50 features were selected. […] The data was split into training, validation, and testing sets (70/15/15), and a model was trained and evaluated.
Example #8: In our study, we collected MRI scans from 500 patients across three centers. We performed data preprocessing, including z-score normalization and data augmentation to balance the classes. […] The data was split into a training set (70%, n=350), a validation set (15%, n=75), and a test set (15%, n=75) using random sampling. […]
Example #9: Prior to radiomic feature extraction, images underwent pixel resampling and discretization using the bin-width method. The optimal bin width for this dataset was determined to be 12. Subsequently, 1120 radiomic features were extracted from the processed images.
Example #10: Gathering radiomic feature sets from different publications, we observed that some features had missing values for a subset of patients. To address this, we imputed the missing values using the mean of each feature. […] The model was developed through 10-fold cross-validation using a support vector machine model and evaluated on the hold-out test set.
Importance of the item
Proper data partitioning is a cornerstone of radiomics studies, ensuring methodological rigor and reliable results. Incorrect data partitioning leads to substantial overestimation of model performance [6, 7]. This process involves splitting the dataset into training, validation, and test sets at the very start of the analysis pipeline. Random but reproducible splits (e.g., with a fixed random seed) allow the partitioning to be repeated exactly, while stratification preserves the outcome variable distribution across sets. Patient-level partitioning prevents data leakage by grouping all observations from the same patient into a single set, preserving the independence of the test data. Importantly, preprocessing steps such as scaling or imputation must be fitted solely on training data and only applied to test data at inference time. If cross-validation is used for testing purposes rather than for tuning, the test folds must remain completely independent to ensure unbiased evaluation [7]. By adhering to these principles, data leakage is avoided, model performance reflects true generalizability, and results remain robust and reproducible.
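As a brief illustration of the last point, the sketch below (synthetic data, hypothetical patient identifiers) wraps preprocessing in a pipeline and groups cross-validation folds by patient, so each held-out fold stays blinded to the fitted transforms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
patient_ids = np.repeat(np.arange(100), 3)     # hypothetical: three scans per patient

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=patient_ids)
# The scaler is re-fitted inside every training fold; held-out patients never
# contribute to the mean/SD used to transform their own scans.
```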
Specifics about the positive examples
In all examples, it was clearly stated that data splitting was performed before any processing step that could introduce information leakage. In Example #1, the authors clarified that the common voxel size was derived from the training cohort and that data augmentation was confined to the pretraining and training cohorts. Example #2 provided detailed reporting, explicitly confirming that feature selection was conducted within the training cohort, after data splitting. In Example #3, all potential leakage-prone steps, including oversampling, feature standardization, feature selection, and model construction, were restricted to the training set, with transparent documentation. Similarly, Example #4 confined up-sampling, normalization, feature selection, and hyperparameter tuning to the training cohort, clearly reporting the process and explicitly noting that data splitting was performed at the patient level.
Specifics about the negative examples
Example #5 lacks clarity regarding whether preprocessing steps, such as SMOTE oversampling and standardization, were restricted exclusively to the training set. Example #6 splits data at the scan level rather than at the patient level, risking patient-level data leakage; scans from the same patient should be confined to a single dataset partition. Example #7 performs feature selection prior to dataset splitting, improperly utilizing test set information. Example #8 does not clearly indicate whether normalization and augmentation were applied exclusively to the training set or inadvertently to the entire dataset. Example #9 conducts bin-width optimization for image discretization on the entire dataset before splitting, thus introducing data leakage, as such optimization should strictly be performed on the training set. In Example #10, the flow of the text suggests that missing value imputation, using the mean of each feature, was performed on the pooled data before the split into cross-validation folds and the hold-out test set.
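To make the imputation pitfall of Example #10 explicit, the following hedged sketch contrasts a leaky order of operations (imputer fitted on the pooled data before splitting) with the correct one (split first, fit the imputer on training data only); SimpleImputer is used as a stand-in for mean imputation on synthetic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 20))
X[rng.random(X.shape) < 0.1] = np.nan          # synthetic missing values
y = rng.integers(0, 2, size=100)

# Leaky (as implied in Example #10): feature means computed on ALL patients, then split.
X_leaky = SimpleImputer(strategy="mean").fit_transform(X)
X_tr_bad, X_te_bad, y_tr_bad, y_te_bad = train_test_split(X_leaky, y, random_state=42)

# Correct: split first, fit the imputer on training data only, then transform test data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
imputer = SimpleImputer(strategy="mean").fit(X_tr)
X_tr_ok, X_te_ok = imputer.transform(X_tr), imputer.transform(X_te)
```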
Recommendations for appropriate scoring
To appropriately score this item, evaluators should assess not only the data partitioning process but also the handling of preprocessing steps that could potentially introduce information leakage. It is essential to ensure that such preprocessing steps are either performed before the data split or restricted to the training set only.
Evaluators should be aware that not all preprocessing steps lead to information leakage, but some do. Special attention should be given to techniques whose parameters are estimated from the pooled dataset, such as oversampling, scaling, or imputation, as these are more likely to cause leakage if applied incorrectly. In contrast, preprocessing confined to individual patient data, such as voxel resampling, is generally less problematic.
If preprocessing steps involve the entire dataset, clear documentation should be provided to demonstrate that these steps were applied only to the training set, thus preventing data leakage into validation or test sets.
If the evaluator cannot find sufficient or satisfactory reporting regarding proper data splitting and preprocessing handling, the item should be scored negatively.
References
- Kocak B, Akinci D’Antonoli T, Mercaldo N, et al (2024) METhodological RadiomICs Score (METRICS): a quality scoring tool for radiomics research endorsed by EuSoMII. Insights Imaging 15:8. https://doi.org/10.1186/s13244-023-01572-w
- Bao J, Zhao L, Qiao X, et al (2025) 3D-AttenNet model can predict clinically significant prostate cancer in PI-RADS category 3 patients: a retrospective multicenter study. Insights Imaging 16:25. https://doi.org/10.1186/s13244-024-01896-1
- Liu Y, Wang Y, Hu X, et al (2024) Multimodality deep learning radiomics predicts pathological response after neoadjuvant chemoradiotherapy for esophageal squamous cell carcinoma. Insights Imaging 15:277. https://doi.org/10.1186/s13244-024-01851-0
- Huang G, Du S, Gao S, et al (2024) Molecular subtypes of breast cancer identified by dynamically enhanced MRI radiomics: the delayed phase cannot be ignored. Insights Imaging 15:127. https://doi.org/10.1186/s13244-024-01713-9
- Bao J, Qiao X, Song Y, et al (2024) Prediction of clinically significant prostate cancer using radiomics models in real-world clinical practice: a retrospective multicenter study. Insights Imaging 15:68. https://doi.org/10.1186/s13244-024-01631-w
- Strotzer QD, Wagner T, Angstwurm P, et al (2024) Limited capability of MRI radiomics to predict primary tumor histology of brain metastases in external validation. Neuro-Oncology Advances 6:vdae060. https://doi.org/10.1093/noajnl/vdae060
- Gidwani M, Chang K, Patel JB, et al (2023) Inconsistent Partitioning and Unproductive Feature Associations Yield Idealized Radiomic Models. Radiology 307:e220715. https://doi.org/10.1148/radiol.220715