This was a multicenter, retrospective cohort study, and the KIDS was developed and validated utilizing retinal images and clinical data from participants across 8 hospitals in China and 1 hospital in Somalia. The dataset collected from FAH and ZOC was used for model development and internal, prospective testing. External testing was performed using datasets from ZPH, FPH, and AHY. Additionally, small-scale real-world validation was conducted using multicenter and multi-ethnic datasets from FPHK, SPTCMI, SAH, and BH. The study was registered at ClinicalTrials.gov (identifier: NCT05223712) and approved by the Institutional Review Board/Ethics Committee of ZOC (2021KYPJ167), the ICE for Clinical Research and Animal Trials of FAH (2022097), the Ethics Committee for Clinical Research and Experimental Animals of ZPH (K2022-162-2), the Medical Ethics Committee of FPH (2022-90), the Medical Ethics Committee of AHY (YYFY-LL-2022-106), the Medical Ethics Committee of SPTCMI (2025Y-08001), the Medical Ethics Committee of FPHK (2024-69), the Medical Ethics Committee of SAH (2024-256), and the Research Committee of BH (BH/IRVD2024007/001). The study was conducted in accordance with the Declaration of Helsinki. Informed consent was obtained from participants prospectively enrolled at the FAH for model prospective testing. For all other cohorts, informed consent was waived by the respective institutional review boards due to the retrospective design and use of de-identified data. All ethics committees listed above reviewed and approved the study protocol and consent procedures. (Supplementary Fig. 1).
Diagnostic criteria
According to the international guidelines of the Kidney Disease: Improving Global Outcomes (KDIGO), CKD was defined as either an eGFR < 60 mL/min/1.73 m² or kidney damage (albuminuria) for 3 months or longer5. Non-CKD controls were defined as individuals with an eGFR ≥ 60 mL/min/1.73 m², absence of albuminuria, and no documented history of CKD. CKD was categorized into early CKD (eGFR ≥ 60 mL/min/1.73 m² with albuminuria), moderate CKD (eGFR 30–59 mL/min/1.73 m²), and advanced CKD (eGFR < 30 mL/min/1.73 m²)55.
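The categories above can be written as a small decision rule. The following is a minimal sketch, not the study's code: the function name `ckd_category` is hypothetical, values between 59 and 60 are assumed to fall in the moderate group, and the 3-month chronicity requirement and the "no documented history of CKD" condition for controls are assumed to have been checked beforehand.

```python
def ckd_category(egfr: float, albuminuria: bool) -> str:
    """Map eGFR (mL/min/1.73 m^2) and albuminuria status to a category.

    Assumption: chronicity (>= 3 months) and CKD history are verified
    upstream; this function only applies the eGFR/albuminuria cut-offs.
    """
    if egfr < 30:
        return "advanced CKD"
    if egfr < 60:
        return "moderate CKD"
    if albuminuria:
        # eGFR >= 60 with albuminuria
        return "early CKD"
    return "non-CKD"
```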
The pathological diagnosis was established through collaborative assessments by nephrology and pathology specialists, based on renal biopsy results (Supplementary Methods). We developed prediction models for identifying the five most common types of CKD, whose distribution is consistent with epidemiological surveys of the biopsy-proven spectrum of kidney diseases in China56,57: (1) IgAN: IgA nephropathy; (2) MN: idiopathic membranous nephropathy; (3) ANS: arterionephrosclerosis; (4) DN: diabetic nephropathy; and (5) MCD/FSGS: idiopathic minimal change disease (MCD) and focal segmental glomerulosclerosis (FSGS). MCD and FSGS are grouped as podocytopathies owing to their shared pathological characteristics and mechanisms involving podocytes58. The definitions and management principles of these 5 pathological classifications are described in the Supplementary Material. The PSG is a crucial prognostic marker for long-term implications that is used for assessing the severity of kidney diseases. In the pathological staging model, glomerular lesions with ≥75% sclerotic glomeruli are classified as having high PSG, indicating substantial glomerular damage and fibrosis59.
The composite renal endpoint was considered reached if a patient met any of the following criteria: I. deterioration in kidney function, defined as a ≥ 50% decline in eGFR from baseline (as measured by serum creatinine at admission and follow-up); II. development of end-stage renal disease (eGFR < 15 mL/min/1.73 m²); or III. initiation of renal replacement therapy (dialysis or kidney transplantation)60. Kidney survival time was calculated from the diagnostic renal biopsy to the renal endpoint.
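As an illustration of how the three criteria combine, the sketch below checks a single follow-up record. The function and argument names are hypothetical, and a real analysis would also need the event date to compute kidney survival time.

```python
def reached_renal_endpoint(baseline_egfr: float,
                           followup_egfr: float,
                           started_rrt: bool) -> bool:
    """Return True if any criterion of the composite endpoint is met."""
    # I. >= 50% decline in eGFR from baseline
    halved = followup_egfr <= 0.5 * baseline_egfr
    # II. end-stage renal disease (eGFR < 15 mL/min/1.73 m^2)
    esrd = followup_egfr < 15
    # III. renal replacement therapy (dialysis or transplantation)
    return halved or esrd or started_rrt
```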
Clinical and image datasets
Dataset for the CKD screening AI model
For CKD detection, we used retinal images of CKD patients and non-CKD participants from retrospective datasets. The development and internal test dataset consisted of non-CKD participants who had annual health examinations, including routine systemic (albuminuria and creatinine tests) and ophthalmic tests at the Health Examination Center, and CKD patients from the Department of Nephrology of FAH between January 3, 2009, and February 28, 2023. The external test set included participants who underwent an annual health check at ZOC and outpatients at ZPH, FPH, and AHY from December 23, 2016, to March 13, 2024.
Dataset for the pathological diagnosis and staging AI model
Patients who met the diagnostic criteria for CKD and underwent both ophthalmological examinations with retinal imaging and kidney specialist examinations with renal biopsy during the same hospitalization were included in the study. The development and internal test datasets were collected from patients of the Department of Nephrology of FAH between April 28, 2009, and August 8, 2022. Validation datasets were provided by a prospective cohort collected in the Department of Nephrology of FAH from August 8, 2022, to April 27, 2023.
For external testing, we obtained 3 independent retrospectively collected datasets: (1) Zhongshan City People’s Hospital [ZPH, Zhongshan City, China]: patients from December 23, 2016, to November 7, 2022; (2) First People’s Hospital of Foshan [FPH, Foshan City, China]: patients from January 13, 2021, to February 28, 2023; and (3) Affiliated Hospital of Youjiang Medical University for Nationalities [AHY, Youjiang City, China]: patients from May 7, 2022, to February 3, 2023. These hospitals are located in three cities across two provinces in southern China, each representing different economic and medical profiles. All these datasets consisted of retinal images and clinical data, including blood and urine test results, kidney ultrasound scans (as shown in Supplementary Table 2), pathological diagnosis, and PSGs derived from renal biopsies.
For real-world validation across multinational and multi-ethnic settings, we collected four datasets: (1) First People’s Hospital of Kashi [FPHK, Kashi, China], spanning December 14, 2020, to November 10, 2024; (2) Shanxi Provincial Traditional Chinese Medicine Institute [SPTCMI, Taiyuan, China], from December 26, 2018, to September 29, 2024; (3) The Second Affiliated Hospital of Xi’an Jiaotong University [SAH, Xi’an, China], covering November 11, 2020, to December 21, 2023; and (4) Banadir Hospital [BH, Mogadishu, Somalia], from July 20, 2023, to November 18, 2024. Due to constraints in local medical resources, the datasets from SPTCMI and BH comprised only clinical data.
Dataset for progression prediction models
For the CKD progression prediction AI model, CKD patients from the retrospective cohort who had not yet reached the kidney failure stage (eGFR ≥ 15 mL/min/1.73 m² and without dialysis) were included. Administrative and follow-up data from the retrospective cohort were collected for development and validation. Participants were monitored for disease outcomes, including renal dialysis, kidney transplantation, hospitalization, mortality, and cause of death, from May 12, 2009, to June 22, 2022. Follow-up information regarding the participants’ examinations after the initial assessment was obtained through electronic health records, phone interviews, or clinical visits. The drop-out time and the reasons for missed visits were recorded. Participants with preexisting renal dialysis or kidney transplantation at the time of administration were excluded from the cohort analysis (Supplementary Methods).
Image acquisition and data quality control
The retinal images in all the datasets consisted of one macula-centered fundus image per eye with a 45° primary field of view. For the datasets collected from FAH, the retinal images were captured with CR-2 AF (Canon) and RetiCam 3100 (SYSEYE). For the external testing dataset, retinal images from ZPH, FPH, and AHY were obtained via KOWA Nonmyd 7 (Kowa), AFC-330 (NIDEK), and CR-2 AF (Canon), respectively. The retinal images included in the analysis met the following criteria: integrity of the macula and optic disc, absence of artifacts, and sufficient resolution. For each subject, one retinal image per eye that met the specified criteria was selected. For subjects with only one eligible eye, the best image from that eye was retained. Out of a total of 15,127 images collected, we excluded 1,983 (13.1%) ineligible fundus images.
Model development
CKD screening AI model
The model was developed based on retinal images. The images were randomly assigned to training, validation, and test sets at a ratio of 8:1:1 at the participant level, so that no participant contributed samples to more than one set. The model’s input comprised retinal images from both the right and left eyes, which were processed separately. Predictions were made at the image level, and the image-level outputs were then averaged at the participant level to give the final prediction for each participant. For participants with only one eligible eye image, the model output for that image was taken as the participant-level prediction. To further evaluate the performance of the CKD screening DL model in populations with different prevalences of CKD, we conducted tests based on the internal and external test datasets, with simulated prevalences of 5%, 10%, and 20%. For each prevalence, we created 1000 simulated datasets by randomly sampling, with replacement, 1000 × (1 − prevalence) non-CKD samples and 1000 × prevalence CKD samples from each test dataset.
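The participant-level averaging and the prevalence resampling described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the function names are hypothetical, labels are assumed to be coded 1 for CKD and 0 for non-CKD, and the sample counts are rounded to the nearest integer.

```python
import numpy as np

def participant_level_scores(participant_ids, image_scores):
    """Average image-level probabilities per participant (one or two eyes)."""
    ids = np.asarray(participant_ids)
    scores = np.asarray(image_scores, dtype=float)
    unique_ids, inverse = np.unique(ids, return_inverse=True)
    sums = np.bincount(inverse, weights=scores)
    counts = np.bincount(inverse)
    # A participant with a single eligible image keeps that image's score.
    return unique_ids, sums / counts

def simulate_prevalence(labels, prevalence, n_total=1000, rng=None):
    """Return indices of one resampled dataset with the requested prevalence."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    ckd_idx = np.flatnonzero(labels == 1)
    non_idx = np.flatnonzero(labels == 0)
    n_ckd = int(round(n_total * prevalence))
    n_non = n_total - n_ckd
    # Sampling with replacement, as described in the text.
    return np.concatenate([
        rng.choice(ckd_idx, size=n_ckd, replace=True),
        rng.choice(non_idx, size=n_non, replace=True),
    ])
```

Calling `simulate_prevalence` 1000 times for each of the three prevalences and evaluating the metric of interest on each index set would give the distribution of that metric at the simulated prevalence.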
A noninvasive model for pathological diagnosis
To identify the presence of the five categories of pathological diagnosis, we trained a noninvasive model consisting of three submodels to perform five separate binary classification tasks: (1) a retinal image model; (2) a clinical data model; and (3) a hybrid model. In the retinal image-only model, the input comprised retinal images, specifically one macula-centered image per eye, and the output was the probabilities of the presence of the five pathological categories. The clinical data model was first built via a multilayer perceptron (MLP; 56, 256, 256, 5) with the full feature set. Then, the importance of the features was measured via the permutation feature importance method on the validation set, and features with 95% lower confidence limits > 0 were selected. Finally, the MLP (35, 256, 256, 5) model with the selected features and the highest accuracy on the validation set was taken as the final meta-model. In addition, we developed a simplified version of the meta-model for settings in which some tests and ultrasound examinations may be unavailable under local medical conditions. For the hybrid model, we developed an EfficientNet-B5-based multimodal model by integrating retinal images and clinical data as inputs. The average of the probabilities of each category for both eyes from a patient was calculated as the probability of the presence of the category at the participant level (Supplementary Fig. 2).
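The feature-selection step can be illustrated with a model-agnostic sketch of permutation feature importance. Several details here are assumptions not stated in the text: the function name is hypothetical, importance is measured as the drop in single-label accuracy (the study's models are multi-label), and the 95% lower confidence limit is approximated as the mean minus 1.96 standard errors over repeated permutations.

```python
import numpy as np

def permutation_importance_select(predict, X, y, n_repeats=30, seed=0):
    """Select features whose importance has a lower confidence limit > 0.

    predict: callable mapping a feature matrix to predicted labels.
    Returns (selected feature indices, mean importance, lower limits).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    baseline = np.mean(predict(X) == y)
    drops = np.empty((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column to break its link with the outcome.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - np.mean(predict(X_perm) == y)
    mean = drops.mean(axis=1)
    sem = drops.std(axis=1, ddof=1) / np.sqrt(n_repeats)
    lower = mean - 1.96 * sem  # assumed normal-approximation limit
    return np.flatnonzero(lower > 0), mean, lower
```

A feature that the model ignores yields zero drop on every repeat, so its lower limit is exactly 0 and it is not selected.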
Pathological staging model
The pathological staging model was built via XGBoost, with features selected via the least absolute shrinkage and selection operator (LASSO) with 10-fold cross-validation. The datasets used were the same as those used in the aforementioned clinical data model. The model requires 6 clinical features as inputs: hemoglobin, creatinine, uric acid, and renal ultrasound measurements (left renal length; whether the demarcation of the cortex is clear; and whether the medulla and hyperechoic medulla are greater). The output was the probability of a PSG score ≥ 75%.
Progression prediction models
The pathological types confirmed by renal biopsy, combined with clinical features, were used to develop models for predicting the renal endpoints. We also developed models with the same methods using the predicted probabilities of pathological classification from retinal images and the predicted probability of high glomerulosclerosis (≥75%) in place of renal biopsy results. CPH regression and 10-fold cross-validation were adopted to train and validate the models. For risk stratification, individual risk scores were calculated with the models. Following a previous study28, patients were then categorized into three groups by the lower and upper quartiles of the risk score: low-risk (risk score < Q1), medium-risk (Q1–Q3), and high-risk (risk score > Q3).
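The quartile-based grouping can be sketched as below. This is an illustration only: the function name is hypothetical, the quartiles are computed with NumPy's default linear interpolation, and the cut-offs are assumed to come from the same set of scores being grouped (in practice they may be fixed on the development cohort).

```python
import numpy as np

def stratify_by_quartiles(risk_scores):
    """Assign low / medium / high risk groups from Q1 and Q3 of the scores."""
    scores = np.asarray(risk_scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    groups = np.full(scores.shape, "medium", dtype=object)
    groups[scores < q1] = "low"    # risk score < Q1
    groups[scores > q3] = "high"   # risk score > Q3
    return groups, q1, q3
```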
Model visualization and explainability
As a graphical representation of the prognostic prediction model based on Cox regression, the nomogram consists of a set of axes representing the distribution of selected variables, with corresponding points assigned to different levels or values of each variable. The points for each variable are added to obtain a total score, which is then translated into a predicted probability of CKD progression at 1, 3, and 5 years on separate axes. Furthermore, to improve the interpretability of the models, a forest plot was adopted for the Cox regression models, and the importance of the features in the final meta-model was measured via the permutation feature importance method on the validation set. A SmoothGrad saliency map61 was used to draw heatmaps by highlighting pixels that strongly influenced the prediction of the models for the images in the test datasets.
The details of model development are provided in the Supplementary Methods.
Comparison between the KIDS and clinical nephrologists in a prospective multicenter validation
We designed an AI and clinical nephrologists’ comparison study to evaluate the performance of the KIDS compared with that of nephrologists in diagnosing pathological types of CKD and to assess the real-world applicability of the KIDS in a clinical setting. The test set prospectively enrolled 256 CKD patients, none of whom had been used in training and validation before, from multiple centers (FAH, FPH, ZPH, and AHY) with single or multiple pathological diagnoses confirmed by renal biopsy between May 1, 2023, and November 22, 2023. All the cases were subsequently assessed via the KIDS and individually by nine nephrologists at different levels of expertise in China: three at the resident level (X.L. [FPH], Q.H. [ZPH], Y.X. [AHY]), three at the senior level (L.J. [FAH], P.Y. [FPH], S.Y. [Dongguan TCM hospital]), and three at the expert level (N.H. [FAH], H.Y. [FAH], J.Y. [FAH]). Three nephrologists, I.M.A. (resident level), N.I.A. (senior level), and M.H.R.H. (expert level) from Banadir Hospital in Somalia, also conducted pathological diagnoses of CKD patients. Nephrologists were asked to determine the pathological type on the basis of the clinical data and retinal images provided for each case. We used ROC curves and accuracy (ACC) values to compare the pathological diagnostic performance of the KIDS and nephrologists.
$$\mathrm{Accuracy},\ \mathrm{ACC}=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|Y_i\cap Z_i\right|}{\left|Y_i\cup Z_i\right|}$$
The ACC for each patient is defined as the proportion of correctly predicted pathological diagnosis labels to the total number of labels (both predicted and actual) for that patient. The ACC for the test dataset is calculated as the average ACC across all 256 patients (n: total number of patients; \(Y_i\): actual label set, \(Y_i=(y_1,y_2,\ldots,y_k)\), \(y_j\in\{0,1\}\), \(1\le j\le k\), where k is the number of labels; \(Z_i\): predicted label set, \(Z_i=(z_1,z_2,\ldots,z_k)\), \(z_j\in\{0,1\}\), \(1\le j\le k\))62,63.
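A minimal sketch of this per-patient accuracy, read as the number of labels shared by the actual and predicted sets divided by the number of labels in either set, is given below. The function name is hypothetical, and scoring a patient with no actual and no predicted labels as 1.0 is an assumed convention that the text does not specify.

```python
import numpy as np

def multilabel_accuracy(y_true, y_pred):
    """Mean per-patient ratio |Y ∩ Z| / |Y ∪ Z| over binary label vectors."""
    Y = np.asarray(y_true).astype(bool)
    Z = np.asarray(y_pred).astype(bool)
    inter = np.logical_and(Y, Z).sum(axis=1)
    union = np.logical_or(Y, Z).sum(axis=1)
    # Assumed convention: empty actual and predicted sets count as correct.
    per_patient = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return float(per_patient.mean())
```

For example, a patient with actual labels {IgAN, MN} and predicted labels {IgAN, DN} shares one label out of three distinct labels and therefore scores 1/3.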
Statistical analysis
The performance of the models in CKD screening and in pathological diagnosis and staging, all binary classification tasks, was assessed via ROC curves, in which sensitivity was plotted against 1 − specificity at different probability thresholds. The AUCs are reported along with 95% DeLong confidence intervals (CIs). The paired DeLong test, continuous NRI, and IDI were employed to assess the improvement in pathological classification performance gained by adding retinal imaging information. Bonferroni correction was adopted for multiple AUC comparisons. Calibration plots and the Hosmer-Lemeshow goodness-of-fit test were used to assess the consistency between the observed probabilities and the probabilities predicted by the metadata model and hybrid model. Using the optimal operating thresholds determined with the Youden index, the sensitivities and specificities of the KIDS for each type of pathological diagnosis were measured and plotted on the ROC curves for comparison with those of clinical nephrologists. Kaplan‒Meier curves and log-rank tests were used to compare the overall survival probability of CKD patients across pathological types and groups. The concordance index (C-index) was used to assess the overall predictive accuracy of the CPH model. Time-dependent ROC curves with AUCs and 95% CIs at 1 year, 3 years, and 5 years were also used to measure model performance. All the statistical analyses were performed via Python (version 3.8.0) and R (version 4.1.1).
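The Youden index used to choose operating thresholds is J = sensitivity + specificity − 1, maximized over candidate thresholds. The sketch below is illustrative rather than the study's code: the function name is hypothetical, candidate thresholds are the observed scores, a case is called positive when its score is at or above the threshold, and ties in J are resolved in favor of the lowest threshold.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing sensitivity + specificity - 1."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    best_j, best_t = -np.inf, None
    for t in np.unique(s):           # sorted ascending
        pred = s >= t
        sens = np.mean(pred[y])      # true positive rate
        spec = np.mean(~pred[~y])    # true negative rate
        j = sens + spec - 1
        if j > best_j:               # strict: keep the lowest tied threshold
            best_j, best_t = j, t
    return best_t, best_j
```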
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
