Fig. 2

Study design. To develop classifiers that predict ALN status, the study was divided into three stages: discovery, training, and validation. In the discovery stage, genes with differential coverage were identified. In the training stage, different machine learning models were used to develop classifiers from the differential features. Feature importance was assessed with the R package sigFeature, and the top 100 features were selected for further classifier construction. To identify the gene combination with the largest area under the curve (AUC), a backward method was adopted, and the classifiers with the largest AUC were selected. In the validation stage, the predictive efficacy of the selected classifiers was assessed in an internal validation cohort. The detailed characteristics of the breast cancer patients are shown in Table 1. WGS whole-genome sequencing, ALN axillary lymph node, TSS transcriptional start site, SVM support vector machine, LR logistic regression, LDA linear discriminant analysis, LOOCV leave-one-out cross-validation
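
The backward selection step of the training stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`auc`, `loocv_auc`, `backward_select`) are hypothetical, the classifier is a toy centroid-difference scorer standing in for SVM, LR, or LDA, and the greedy one-feature-at-a-time removal is one plausible reading of "backward method". Each candidate subset is scored by its LOOCV AUC.

```python
from typing import List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loocv_auc(X: Sequence[Sequence[float]], y: Sequence[int],
              cols: Sequence[int]) -> float:
    """LOOCV AUC of a toy centroid-difference scorer restricted to `cols`.

    Assumption: this scorer is only a stand-in for a real classifier.
    """
    scores = []
    for i in range(len(X)):
        # Hold out sample i and fit class means on the remaining samples.
        train = [(X[j], y[j]) for j in range(len(X)) if j != i]
        score = 0.0
        for c in cols:
            pos = [row[c] for row, lab in train if lab == 1]
            neg = [row[c] for row, lab in train if lab == 0]
            mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
            # Project onto the difference of class means, centred at the midpoint.
            score += (X[i][c] - (mu_pos + mu_neg) / 2) * (mu_pos - mu_neg)
        scores.append(score)
    return auc(scores, y)


def backward_select(X: Sequence[Sequence[float]], y: Sequence[int],
                    cols: Sequence[int]) -> Tuple[List[int], float]:
    """Greedily drop one feature per step; return the best subset seen."""
    current = list(cols)
    best_cols, best_auc = list(current), loocv_auc(X, y, current)
    while len(current) > 1:
        # Score every subset obtained by removing a single feature.
        scored = [(loocv_auc(X, y, [c for c in current if c != drop]),
                   [c for c in current if c != drop]) for drop in current]
        step_auc, current = max(scored, key=lambda t: t[0])
        if step_auc > best_auc:
            best_auc, best_cols = step_auc, list(current)
    return best_cols, best_auc


if __name__ == "__main__":
    # Toy data (fabricated for illustration): column 0 separates the classes.
    X = [[0.0, 5.0], [0.1, 1.0], [0.2, 4.0],
         [0.9, 2.0], [1.0, 5.0], [1.1, 1.0]]
    y = [0, 0, 0, 1, 1, 1]
    print(backward_select(X, y, [0, 1]))
```

In practice the starting column list would be the top-ranked features (100 in the study), and the toy scorer would be replaced by the actual model being evaluated.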