Where connoisseurship meets computation
A framework formalising what connoisseurs have done for centuries — examining the physical evidence a painter leaves on canvas — using self-supervised vision transformers, Bayesian inference, and cross-attention fusion.
Upload an artwork for style classification, artist attribution, forgery screening, workshop decomposition, and temporal dating.
Structure-tensor analysis and coherence clustering over patch-level DINOv2 embeddings reveal an artist’s characteristic facture.
Multi-axis classification along period, school, and genre using CLIP embeddings projected through independently trained linear heads.
Cosine-similarity ranking against a reference gallery with Bayesian confidence intervals and temporal cross-validation.
One-class anomaly detection via Mahalanobis distance with per-dimension z-scores, stress-tested against adversarial forgery.
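One-class screening of this kind can be sketched in a few lines: fit the mean and covariance of embeddings from securely attributed works, then score a query by its Mahalanobis distance, with per-dimension z-scores pointing at which features look off. This is a minimal illustration, not the project's implementation; function names and the regularisation constant are assumptions.

```python
import numpy as np

def fit_reference(embeddings, eps=1e-6):
    """Fit mean and regularised inverse covariance of authentic-work embeddings."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + eps * np.eye(embeddings.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x, mu, cov_inv):
    """Anomaly score: distance from the authentic cluster, scaled by its spread."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

def per_dim_z_scores(x, mu, sigma):
    """Which embedding dimensions deviate most from the reference distribution."""
    return (x - mu) / sigma
```

A query far from the reference cluster scores high; the z-scores localise the deviation to individual feature dimensions.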
A Dirichlet-process Gaussian mixture model (DPGMM) infers how many distinct hands contributed to a painting, without requiring that number as input.
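The core idea is expressible with scikit-learn's `BayesianGaussianMixture`: give it only an upper bound on components, let the Dirichlet-process prior shrink unused ones, and count the components that keep non-trivial weight. A sketch under assumed parameter values (the weight threshold and component cap are illustrative, not the project's settings):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def infer_hands(patch_embeddings, max_components=10, weight_threshold=0.02, seed=0):
    """Fit a Dirichlet-process GMM over patch embeddings and count active components."""
    dpgmm = BayesianGaussianMixture(
        n_components=max_components,  # an upper bound, not the answer
        weight_concentration_prior_type="dirichlet_process",
        random_state=seed,
    )
    labels = dpgmm.fit_predict(patch_embeddings)
    active = dpgmm.weights_ > weight_threshold  # components the prior did not prune
    return int(active.sum()), labels
```

The per-patch `labels` double as a workshop decomposition map: patches sharing a label are attributed to the same hand.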
Gaussian process regression over dated embeddings estimates when an undated work was most likely produced.
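Temporal dating by GP regression can be illustrated on a one-dimensional style coordinate (e.g. a PCA projection of the embedding, a simplification assumed here): regress year-of-creation onto the coordinate for dated works, then read off a posterior mean and uncertainty for an undated one. Kernel choices below are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_dating_gp(style_coord, years):
    """Regress year-of-creation onto a 1-D style coordinate from dated works."""
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.asarray(style_coord).reshape(-1, 1), years)
    return gp

def estimate_date(gp, x):
    """Posterior mean year and standard deviation for an undated work."""
    mean, std = gp.predict(np.array([[x]]), return_std=True)
    return float(mean[0]), float(std[0])
```

The standard deviation is the useful part: it distinguishes "confidently 1650s" from "anywhere in the century".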
In the 1870s, Giovanni Morelli proposed that attribution should focus on incidental details — the shape of fingernails, the curl of earlobes, the rendering of drapery folds. These peripheral passages, he argued, reveal the artist’s hand more reliably than any consciously composed focal passage.
ArtSleuth formalises the Morellian intuition as a feature-extraction problem: the “incidental details” correspond to the low-level textural features that self-supervised vision transformers encode.
DINOv2 learns visual representations without labelled data via self-distillation. The resulting feature space encodes texture, directionality, and granularity — the physical surface qualities that connoisseurs evaluate.
CLIP encodes images and text in a shared embedding space. Critical for style classification, where categories like “Baroque” are culturally constructed labels.
Standard ViT preprocessing: resize, centre-crop, ImageNet normalisation. Optional art-specific corrections: varnish correction, craquelure suppression, canvas texture normalisation.
Paintings are divided into patches via Grid, Salient, or Adaptive strategies.
Structure tensor eigenvalues yield orientation, coherence, and energy. DINOv2 patch embeddings are clustered to reveal homogeneous brushwork.
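The structure-tensor step can be sketched directly: average the gradient outer products over a patch, then read orientation, coherence, and energy from the 2×2 tensor's eigendecomposition. A minimal version for a grayscale patch (a simplified stand-in for the pipeline's implementation):

```python
import numpy as np

def structure_tensor_features(patch, eps=1e-9):
    """Orientation, coherence, and energy of a grayscale patch via its structure tensor."""
    iy, ix = np.gradient(patch.astype(float))
    jxx, jxy, jyy = (ix * ix).mean(), (ix * iy).mean(), (iy * iy).mean()
    trace = jxx + jyy                                   # energy: total gradient power
    disc = np.sqrt((jxx - jyy) ** 2 + 4 * jxy ** 2)     # eigenvalue gap of the 2x2 tensor
    l1, l2 = (trace + disc) / 2, (trace - disc) / 2
    orientation = 0.5 * np.arctan2(2 * jxy, jxx - jyy)  # dominant stroke direction
    coherence = (l1 - l2) / (l1 + l2 + eps)             # 1 = parallel strokes, 0 = isotropic
    return orientation, coherence, trace
```

High coherence indicates parallel, directional brushwork; low coherence indicates stippling or scumbling. Clustering these features alongside DINOv2 patch embeddings groups regions of homogeneous handling.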
CLIP embeddings are projected through three linear heads (period, school, genre) producing calibrated distributions.
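An independently trained linear head is just a learned projection followed by a temperature-scaled softmax; three of them share the same CLIP embedding but nothing else. A schematic sketch (weights, dimensions, and the temperature mechanism as the calibration step are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class LinearHead:
    """One independently trained projection: CLIP embedding -> class distribution."""
    def __init__(self, weight, bias, temperature=1.0):
        self.weight, self.bias, self.temperature = weight, bias, temperature

    def __call__(self, embedding):
        logits = embedding @ self.weight + self.bias
        return softmax(logits / self.temperature)  # temperature calibrates confidence
```

Because the period, school, and genre heads are trained separately, an error along one axis does not contaminate the others.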
Fused embedding compared via temperature-scaled cosine similarity. Mahalanobis distance for anomaly scoring.
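Temperature-scaled cosine ranking against a reference gallery reduces to normalising both sides, taking dot products, and sharpening with a softmax. A sketch with an illustrative temperature (the value 0.07 is a common default, not necessarily the project's):

```python
import numpy as np

def rank_gallery(query, gallery, temperature=0.07):
    """Rank gallery embeddings by temperature-scaled cosine similarity to a query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarities in [-1, 1]
    probs = np.exp(sims / temperature)
    probs /= probs.sum()              # softmax: low temperature sharpens the ranking
    return np.argsort(-sims), probs
```

A lower temperature concentrates probability mass on the nearest matches, which is what turns raw similarities into a usable attribution distribution.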
Cross-attention lets CLIP queries attend over DINOv2 patch tokens. GP with RBF kernel models temporal drift. Bayesian GMM with Dirichlet process prior infers workshop hands.
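The cross-attention fusion step, stripped to a single head, looks like this: project CLIP tokens to queries, DINOv2 patch tokens to keys and values, and let each query gather a weighted mixture of the patches. A single-head NumPy sketch with hypothetical weight matrices (the real module presumably uses multi-head attention in a deep-learning framework):

```python
import numpy as np

def cross_attention(queries, patch_tokens, wq, wk, wv):
    """Single-head cross-attention: query tokens attend over patch tokens."""
    q = queries @ wq            # (n_query, d)
    k = patch_tokens @ wk       # (n_patch, d)
    v = patch_tokens @ wv       # (n_patch, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # each query's weights over patches
    return attn @ v             # (n_query, d) fused representation
```

Each output row blends the semantic query with the textural patch evidence it attends to, which is the mechanism behind the fused embedding compared above.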
| Backbone | Style acc. | Style F1 | Artist top-1 | Artist top-5 | Genre acc. |
|---|---|---|---|---|---|
| DINOv2 ViT-B/14 | 57.5% | 0.553 | 64.7% | 90.9% | 71.0% |
| CLIP ViT-L/14 | 67.1% | 0.656 | 74.6% | 95.9% | 75.0% |
| Fusion (frozen) | 65.0% | 0.633 | 71.0% | 94.2% | 74.2% |
| Fusion (fine-tuned) † | 71.6% | 0.703 | 77.8% | 96.2% | 75.1% |
| Fusion (end-to-end) † | 72.7% | — | 79.0% | 96.9% | 76.6% |
WikiArt, 81,444 images; all metrics macro-averaged. The top three rows are reproducible via `benchmarks/wikiart.py`. † Separate training run; training code not included. Forgery validation across 125 artists: mean AUC 0.958 (CLIP), 0.873 (DINOv2), 0.897 (fused).