This section details the datasets and machine learning algorithms employed to develop and evaluate our framework. The study was granted an IRB exemption due to its exclusive use of publicly accessible data.
Datasets for Skin Assessment Module Development
Our methodology development utilized several datasets, each playing a distinct role as outlined in Supplementary Fig. 3.
DermEducation Dataset: This dataset, compiled for educational purposes, served as a convenience image resource. It encompasses 2708 images, categorized into 461 non-skin and 2247 skin images; within the skin images, 1932 represent Fitzpatrick Skin Types (FST) I-IV and 315 represent FST V-VI. DermEducation was used to train our skin vs. non-skin image classifier and to validate our skin tone estimation method against ITA-based tone assessments. A medical student labeled the images for skin/non-skin classification and skin tone, with dermatologist review for accuracy.
SegmentedSkin Dataset: Comprising 22 open-source dermatology images from Wikimedia, this dataset was curated by a dermatologist, who also created segmentation masks of healthy skin for each image. SegmentedSkin was used specifically to validate the accuracy of our skin pixel segmentation technique.
Fitzpatrick17K Dataset: A publicly available dataset, Fitzpatrick17K 11, contains 16,577 clinical images sourced from online dermatology atlases. Dermatologists previously generated FST labels for this dataset. After preprocessing, we utilized 13,844 images of FST I-IV and 2168 images of FST V-VI. Fitzpatrick17K was crucial for training and validating our skin tone estimator model.
Medical Textbooks Dataset: To ensure real-world applicability and to enable external testing, we incorporated a dataset derived from four medical textbooks owned by the authors: Rook’s Textbook of Dermatology 26, Bolognia Dermatology 4e 27, Fitzpatrick Color Atlas 8e 28, and Fitzpatrick Dermatology in General Medicine 9e 29. Using a corpus conversion service, we extracted images and filtered out those smaller than 100 pixels in any dimension. Table 2 summarizes these datasets; note that the ratio of skin to non-skin images varies across textbooks. For the Medical Textbooks dataset, the authors manually labeled images as skin or non-skin, and non-dermatologist labelers, trained on examples, categorized skin images into FST I-IV and FST V-VI. Label distributions were comparable to domain expert reports on a subset of images, with agreement between domain experts and trained labelers of 0.887 for Fitzpatrick, 0.860 for Atlas, and 0.855 for Bolognia (Fig. 5). DermEducation labels were provided by a medical student, while Fitzpatrick17K labels were included with the dataset 11.
Table 2: Summary of Datasets for Skin Tone Assessment Learning Module Development
| Dataset | Description | Skin Images | Non-Skin Images | FST I-IV | FST V-VI | Purpose |
|---|---|---|---|---|---|---|
| DermEducation | Educational dermatology image set | 2247 | 461 | 1932 | 315 | Skin vs. non-skin classifier training, ITA validation |
| SegmentedSkin | Open-source dermatology images with masks | 22 | N/A | N/A | N/A | Skin pixel segmentation validation |
| Fitzpatrick17K | Public clinical dermatology image dataset | 16,012 | 565 | 13,844 | 2168 | Skin tone estimator training & validation |
| Medical Textbooks | Images extracted from dermatology textbooks | Varies | Varies | Varies | Varies | External testing, real-world application |
Fig. 5: Comparison of Skin Tone Labels by Domain Experts and Non-Dermatologists.
Distribution of labels assigned by trained non-dermatologists compared with domain-expert-reported values for the Bolognia and Atlas textbooks.
Machine Learning Pipeline for Skin Tone Assessment
The architecture of our proposed machine learning pipeline is depicted in Fig. 1. The key components of this pipeline are detailed below.
Document Ingestion for Image Extraction
We utilized the Corpus Conversion Service (CCS) 20, a cloud-based platform designed to ingest and process large volumes of academic material in scanned and programmatic PDF formats. CCS employs AI models 30 to convert PDF documents into structured JSON files 31. Beyond text extraction, CCS identifies and extracts tables and images, including their captions and positions within the document. This image extraction capability provided the raw image data for our research.
Skin Image Detection Methodology
For efficient skin image detection, we implemented the Histogram of Oriented Gradients (HOG) descriptor 32. HOG is widely used in object detection due to its robustness to geometric and photometric variations. The HOG feature vector $h_i$ for an image $I$ is calculated from magnitude-weighted histograms of direction bins derived from pixel intensity gradients in the horizontal ($G_x$) and vertical ($G_y$) directions. Gradients are computed as $G_x(r, c) = I(r, c + 1) - I(r, c - 1)$ and $G_y(r, c) = I(r + 1, c) - I(r - 1, c)$. The gradient angle $\theta_i(r, c) = \arctan(G_y / G_x)$ and magnitude $M_i(r, c) = \sqrt{G_y^2 + G_x^2}$ are then determined. Angle values are binned into $C = 32$ clusters based on sensitivity analysis, with each $\theta_i$ value mapped to the nearest cluster and weighted by its magnitude $M_i$. Additionally, we incorporated direct pixel intensity values from the CIE LAB color space ($L$, $a$, and $b$ channels), known for its device-agnostic properties. The intensity-based feature vector $p_i = [\mu_L, \mu_a, \mu_b, \sigma_L, \sigma_a, \sigma_b]$ contains the mean ($\mu$) and standard deviation ($\sigma$) of each channel. The final 38-dimensional feature vector for skin image detection is the concatenation of the HOG features $h_i$ and the intensity features $p_i$.
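For illustration, the following is a minimal sketch of this 38-dimensional feature extraction, assuming NumPy and scikit-image; the function name, the use of arctan2 for numerical stability, and the histogram normalization are our additions, not the authors' implementation.

```python
import numpy as np
from skimage.color import rgb2gray, rgb2lab

def skin_detection_features(rgb_image, n_bins=32):
    """Return the 38-D feature vector: 32 HOG-style bins + 6 LAB statistics."""
    gray = rgb2gray(rgb_image)
    # Central-difference gradients, matching the equations above
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    angle = np.arctan2(gy, gx)  # gradient direction in [-pi, pi]
    # Magnitude-weighted histogram over C = 32 direction bins
    h_i, _ = np.histogram(angle, bins=n_bins, range=(-np.pi, np.pi),
                          weights=magnitude)
    h_i = h_i / (h_i.sum() + 1e-8)  # normalization is our assumption
    # Mean and standard deviation of the L, a, b channels
    lab = rgb2lab(rgb_image)
    p_i = np.concatenate([lab.mean(axis=(0, 1)), lab.std(axis=(0, 1))])
    return np.concatenate([h_i, p_i])  # 32 + 6 = 38 dimensions

features = skin_detection_features(np.random.rand(64, 64, 3))  # toy input
```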
For classification, we validated both Support Vector Machine (SVM) 15 and XGBoost 16 algorithms using five-fold stratified cross-validation on the DermEducation dataset. The SVM used an RBF kernel, effective for non-linear feature relationships, with nu=0.01 (controlling the training error) and gamma=0.05 (controlling the RBF kernel radius to limit overfitting). XGBoost used cross-validation-based calibration (3 folds), with hyperparameters taken from the best-performing fold. Area Under the Receiver Operating Characteristic curve (AUROC) and F1 score served as performance metrics.
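A hedged sketch of this validation setup, run on synthetic stand-in features; the calibration wrapper is one plausible reading of "cross-validation-based calibration", and the per-fold hyperparameter search is omitted.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import NuSVC
from xgboost import XGBClassifier

# Synthetic stand-in for the 38-D DermEducation feature matrix
X, y = make_classification(n_samples=600, n_features=38,
                           weights=[0.17, 0.83], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
svm = NuSVC(nu=0.01, kernel="rbf", gamma=0.05)  # parameter values from the text
# One plausible implementation of 3-fold cross-validation-based calibration
xgb = CalibratedClassifierCV(XGBClassifier(eval_metric="logloss"), cv=3)

for name, clf in [("nu-SVM", svm), ("XGBoost", xgb)]:
    auroc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: AUROC {auroc.mean():.3f}, F1 {f1.mean():.3f}")
```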
Skin Pixel Segmentation Techniques
Skin segmentation methods range from threshold-based to model-based and region-based approaches. Region-based methods have been shown to be highly effective for color segmentation 33. Given our focus on binary skin tone classification, skin segmentation was prioritized over lesion segmentation. For finer granularity, lesion pixel analysis would be necessary. Our initial experiments used a combination of region-growing and color-based segmentation in HSV and YCbCr color spaces, with ranges based on prior research 33. Image clipping was followed by watershed and morphological operations.
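The color-thresholding stage of such a pipeline might look like the sketch below, assuming OpenCV; the HSV and YCbCr ranges are common literature values rather than the authors' exact ones, and the region-growing/watershed refinement is omitted.

```python
import cv2
import numpy as np

def segment_skin(bgr_image):
    """Return a binary mask of candidate skin pixels."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    # Threshold ranges are common literature values, not the authors' exact ones
    hsv_mask = cv2.inRange(hsv, (0, 15, 0), (17, 170, 255))
    ycrcb_mask = cv2.inRange(ycrcb, (0, 135, 85), (255, 180, 135))
    mask = cv2.bitwise_and(hsv_mask, ycrcb_mask)
    # Morphological opening and closing to remove speckle and fill small holes;
    # the watershed refinement described in the text is omitted here
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

mask = segment_skin(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))  # toy input
```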
Skin Tone Estimation and Classification
For skin tone classification, we leveraged the Fitzpatrick17K dataset for training and cross-validation, with external testing on the Medical Textbooks and DermEducation datasets (Table 2). Input data consisted solely of the skin pixels extracted in the segmentation step (Fig. 3B, C). We aimed to classify skin images into FST I-IV or FST V-VI categories, exploring both feature-engineered and deep learning methodologies.

Feature engineering involved concatenating HOG feature vectors with the mean and standard deviation of the Luminance (L) and Yellow (b) channels from the CIE LAB color space, together with ITA values; ITA is strongly correlated with melanin indices 9, 12, 23. These feature vectors were used with ensemble methods including Random Forest 34, Extremely Randomized Trees 35, AdaBoost 36, and Gradient Boosting 16 (Table 1). Random Forest and Extremely Randomized Trees performed comparably to the other methods while requiring less computation than AdaBoost and Gradient Boosting. Models were implemented using scikit-learn v0.24.2 37 and imbalanced-learn 38. We also assessed an ITA-based approach, mapping ITA values to Fitzpatrick skin tones according to Supplementary Table 1.

Finally, we evaluated deep learning models, specifically a ResNet-18 CNN with 11,689,512 parameters pretrained on over a million ImageNet images 18. We replaced the last layer for binary classification (FST I-IV vs. FST V-VI) and performed weighted retraining for 20 epochs using Stochastic Gradient Descent with weighted cross-entropy loss, a learning rate of 1e-3 with linear decay, and a batch size of 32. Implementation used the Scientific Python stack (v3.6.9) 39 and PyTorch v1.8.1 40. Results for all methods are detailed in Table 1 and Fig. 4.
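A minimal sketch of this retraining setup, assuming torchvision; the class-weight values (approximating the FST I-IV to FST V-VI ratio in Fitzpatrick17K) and the random stand-in batches are illustrative, not the authors' training code.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # ImageNet-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)   # FST I-IV vs. FST V-VI head

# Weighted cross-entropy; 6.4 roughly matches the 13,844:2168 class ratio (assumption)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 6.4]))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# Linear decay of the learning rate over the 20 epochs
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=lambda e: 1.0 - e / 20)

for epoch in range(20):
    # Stand-in for iterating a DataLoader of segmented skin images (batch size 32)
    images = torch.randn(32, 3, 224, 224)
    labels = torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
```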
The six Fitzpatrick skin tone indices were consolidated into two categories (FST I-IV and FST V-VI) for comparison with STAR-ED. Skin tone estimation across all methods was evaluated using a 70% training, 10% validation, and 20% testing split of the Fitzpatrick17K dataset. Other datasets were used solely for testing.
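For reference, the ITA baseline reduces to computing ITA from the L and b channels of segmented skin pixels and thresholding into the two consolidated categories. The formula below is the standard ITA definition; the -30° cutoff is an illustrative assumption standing in for the authors' exact mapping in Supplementary Table 1.

```python
import numpy as np

def ita_degrees(lab_pixels):
    """lab_pixels: (n, 3) array of CIE LAB values for segmented skin pixels."""
    L, b = lab_pixels[:, 0], lab_pixels[:, 2]
    return np.degrees(np.arctan2(L - 50.0, b))  # ITA = arctan((L - 50) / b)

def binary_fst_from_ita(lab_pixels, cutoff=-30.0):
    # Lower ITA corresponds to darker skin; the median adds robustness to outliers
    return "FST V-VI" if np.median(ita_degrees(lab_pixels)) < cutoff else "FST I-IV"

print(binary_fst_from_ita(np.array([[65.0, 10.0, 15.0]])))  # light skin -> FST I-IV
```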
Reporting Summary
Further details on research design are available in the Nature Research Reporting Summary linked to this article.