Classification of Visually Similar Scalp Diseases using Deep Learning: A Hybrid CNN-VIT Approach with Cross-Attention Fusion

Authors: Research Scholar Ayushi Dixit, Dr. Brij Mohan Singh

Abstract: Accurate automated diagnosis of visually similar scalp diseases is among the most challenging problems in clinical dermatology. Conditions such as Psoriasis, Seborrheic Dermatitis, Tinea Capitis, Alopecia Areata, Folliculitis, and Eczema share overlapping visual characteristics, including redness, scaling, and patchy hair loss, which makes misclassification both common and clinically dangerous, even among trained dermatologists. The global shortage of specialist dermatologists, particularly in rural and resource-limited settings in India, further amplifies the need for reliable automated diagnostic tools. This paper proposes ScalpViT, a novel hybrid deep learning architecture that combines a Vision Transformer (ViT) operating on 16×16-pixel patches with a Convolutional Neural Network (CNN) backbone; the two branches are connected by a bidirectional cross-attention fusion module. The ViT branch divides the scalp image into 256 non-overlapping 16×16-pixel patches, embeds each as a 768-dimensional token, and applies multi-head self-attention across the full token sequence to capture global spatial distribution and morphological patterns. Concurrently, the CNN branch extracts local texture details. The bidirectional cross-attention lets texture features query spatial features and vice versa, avoiding the pitfalls of simple feature concatenation. Trained on a curated multi-source dataset of approximately 7,000 dermoscopic and clinical scalp images drawn from DermNet NZ, ISIC 2018, HAM10000, and SD-198, ScalpViT achieves 94.3% accuracy, a macro F1-score of 0.93, and an AUC of 0.97. It outperforms conventional baselines such as ResNet-50 (83.1%), EfficientNet-B3 (87.4%), standard ViT-B/16 (90.8%), Swin-Tiny (91.2%), and DINOv2-B (93.5%).
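As an illustration of the tokenization arithmetic and the bidirectional cross-attention described above, the following is a minimal NumPy sketch and not the ScalpViT implementation. It assumes a square input (256 patches of 16×16 pixels then imply a 256×256 image), single-head attention, random stand-in features, and a hypothetical CNN feature map of 64 positions already projected to 768 dimensions. None of these details are stated in the abstract. How the two enriched streams are pooled into a classifier input is also not specified, so the sketch stops at the enriched token sets.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: `queries` attend over `context`."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)

# Patch arithmetic from the abstract: 256 patches of 16x16 pixels.
# Assuming a square image, that is a 16x16 patch grid over a 256x256 input.
patch, num_patches, dim = 16, 256, 768
grid = int(np.sqrt(num_patches))
image_side = grid * patch

# Random stand-ins for the two branches (assumption, for shape checking only).
vit_tokens = rng.standard_normal((num_patches, dim))  # global/spatial tokens
cnn_tokens = rng.standard_normal((64, dim))           # hypothetical 8x8 texture map

def random_projection():
    # Hypothetical learned projection, here just scaled random weights.
    return rng.standard_normal((dim, dim)) / np.sqrt(dim)

# Direction 1: texture (CNN) features query spatial (ViT) features.
cnn_enriched, w_c2v = cross_attention(
    cnn_tokens, vit_tokens,
    random_projection(), random_projection(), random_projection())

# Direction 2: spatial (ViT) features query texture (CNN) features.
vit_enriched, w_v2c = cross_attention(
    vit_tokens, cnn_tokens,
    random_projection(), random_projection(), random_projection())

print(image_side, cnn_enriched.shape, vit_enriched.shape)
```

Each direction uses its own projections, so each branch can weight the other's features per token instead of relying on a fixed concatenation.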
Furthermore, to bridge the interpretability gap for clinical deployment, ScalpViT uses Grad-CAM to produce heatmaps of the CNN texture features and Attention Rollout to map attention over ViT patches, giving clinicians two complementary visual explanations. The paper details the methodology, dataset construction, architectural innovations, and clinical relevance for point-of-care mobile deployment.
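For readers unfamiliar with Attention Rollout, the sketch below shows the commonly used formulation: average the heads in each layer, mix in the identity to account for residual connections, and multiply the resulting matrices across layers. It is an illustrative sketch only. The layer and head counts, the random attention maps, and the presence of a class token at index 0 are assumptions, not details taken from the paper.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll out attention across layers.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    where each row is a probability distribution over tokens.
    """
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)            # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)          # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = a @ rollout                  # compose with earlier layers
    return rollout

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 layers, 4 heads, 1 class token + 256 patch tokens.
layers, heads, tokens = 4, 4, 1 + 256
attentions = []
for _ in range(layers):
    raw = rng.random((heads, tokens, tokens))
    attentions.append(raw / raw.sum(axis=-1, keepdims=True))

rollout = attention_rollout(attentions)

# Relevance of each patch to the class token, arranged on the 16x16 patch grid.
heatmap = rollout[0, 1:].reshape(16, 16)
print(heatmap.shape)
```

In practice the resulting grid would be upsampled to the input resolution and overlaid on the scalp image, alongside the Grad-CAM map from the CNN branch.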

DOI: https://doi.org/10.5281/zenodo.20631579
