Plant Taxonomy Meets Plant Counting

A Fine-Grained, Taxonomic Dataset for
Counting Hundreds of Plant Species

Jinyu Xu, Tianqi Hu, Xiaonan Hu, Letian Zhou,
Songliang Cao, Meng Zhang, Hao Lu*
Huazhong University of Science and Technology, China *Corresponding Author
CVPR 2026 Oral

Abstract

Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and appearance variations across growth stages and environments.

To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom to species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy.

We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting.
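To make the annotation design concrete, a per-image record in such a dataset could pair instance-level points with an organ category and Linnaean ranks. The field names and values below are hypothetical illustrations, not the released annotation schema:

```python
# Hypothetical annotation record illustrating the labels TPC-268 pairs with
# each image: instance-level points, an organ category, and Linnaean ranks.
annotation = {
    "image": "example.jpg",                       # placeholder file name
    "points": [(412.0, 188.5), (430.2, 201.0)],   # one (x, y) per instance
    "organ": "fruit",                             # countable organ category
    "taxonomy": {                                 # kingdom -> species
        "kingdom": "Plantae",
        "species": "Malus domestica",
    },
}

# The ground-truth count is simply the number of annotated points.
count = len(annotation["points"])
```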

Experimental Results

1. Performance on TPC-268

Regression-based paradigms consistently outperform detection-based methods, as explicit object localization is severely hindered by the compact spatial arrangement and structural entanglement of plant organs. LOCA achieves the best test performance by effectively integrating local structural cues. In contrast, models relying primarily on global self-attention (e.g., CACViT, TasselNetV4) show strong validation results but exhibit significant generalization gaps on unseen test scenes, indicating a tendency to overfit validation distributions.

Table 2a: 3-Shot Setting

| Method | Venue | Backbone | Val MAE ↓ | Val RMSE ↓ | Val R² ↑ | Test MAE ↓ | Test RMSE ↓ | Test R² ↑ |
|---|---|---|---|---|---|---|---|---|
| FamNet | CVPR'21 | R50 | 28.87 | 52.51 | 0.58 | 30.43 | 65.62 | 0.62 |
| BMNet+ | CVPR'22 | R50 | 29.33 | 77.78 | 0.47 | 27.78 | 57.25 | 0.50 |
| C-DETR | ECCV'22 | R50 | 22.66 | 77.51 | 0.75 | 22.68 | 57.97 | 0.74 |
| SPDCNet | BMVC'22 | R18 | 25.66 | 72.49 | 0.52 | 23.70 | 47.53 | 0.64 |
| CountTR | BMVC'22 | Hybrid | 20.21 | 55.82 | 0.73 | 25.19 | 49.94 | 0.62 |
| SAFECount | WACV'23 | R18 | 22.57 | 63.65 | 0.64 | 25.70 | 52.30 | 0.58 |
| LOCA | ICCV'23 | R50 | 17.26 | 53.19 | 0.75 | 17.51 | 38.37 | 0.78 |
| DAVE | CVPR'24 | R50 | 16.47 | 52.87 | 0.76 | 17.61 | 40.06 | 0.75 |
| CACViT | AAAI'24 | ViT-B | 16.63 | 42.49 | 0.82 | 22.04 | 41.79 | 0.73 |
| CountGD | NeurIPS'24 | Swin-B | 18.32 | 54.55 | 0.74 | 19.52 | 50.51 | 0.61 |
| TasselNetV4 | ISPRS'26 | ViT-B | 13.20 | 43.93 | 0.83 | 22.95 | 51.36 | 0.60 |

Table 2b: 1-Shot Setting

| Method | Venue | Backbone | Val MAE ↓ | Val RMSE ↓ | Val R² ↑ | Test MAE ↓ | Test RMSE ↓ | Test R² ↑ |
|---|---|---|---|---|---|---|---|---|
| FamNet | CVPR'21 | R50 | 33.11±0.68 | 68.95±4.15 | 0.58±0.05 | 33.63±1.13 | 62.07±2.94 | 0.41±0.05 |
| BMNet+ | CVPR'22 | R50 | 29.33±0.13 | 77.50±0.31 | 0.48±0.05 | 27.84±0.10 | 56.98±0.12 | 0.50±0.01 |
| CountTR | BMVC'22 | Hybrid | 20.16±0.05 | 55.15±0.82 | 0.73±0.01 | 25.19±0.14 | 50.23±0.24 | 0.62±0.00 |
| LOCA | ICCV'23 | R50 | 17.19±0.31 | 48.14±2.19 | 0.80±0.02 | 21.47±0.29 | 42.36±0.72 | 0.73±0.01 |
| DAVE | CVPR'24 | R50 | 16.06±0.60 | 48.35±1.19 | 0.80±0.01 | 19.47±0.44 | 42.54±0.35 | 0.72±0.00 |
| CACViT | AAAI'24 | ViT-B | 17.96±0.16 | 43.38±0.47 | 0.83±0.00 | 22.06±0.11 | 42.97±0.81 | 0.71±0.01 |
| TasselNetV4 | ISPRS'26 | ViT-B | 13.49±0.02 | 41.30±0.46 | 0.85±0.00 | 22.20±0.11 | 48.70±0.26 | 0.67±0.00 |
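The MAE, RMSE, and R² columns above follow standard counting-evaluation definitions and can be computed from per-image predicted and ground-truth counts. A minimal NumPy sketch (the benchmark's official evaluation script may differ in detail):

```python
import numpy as np

def counting_metrics(pred, gt):
    """MAE, RMSE, and R^2 between predicted and ground-truth counts."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    err = pred - gt
    mae = np.mean(np.abs(err))              # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))       # root mean squared error
    ss_res = np.sum(err ** 2)               # residual sum of squares
    ss_tot = np.sum((gt - gt.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot              # coefficient of determination
    return mae, rmse, r2
```

Lower MAE/RMSE and higher R² are better, matching the ↓/↑ arrows in the table headers.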

2. Cross-Dataset Transfer Analysis

Generic models trained on FSC-147 suffer severe performance degradation on TPC-268 (MAE increases up to 225%). Conversely, models trained on plant data transfer more robustly to generic scenes, indicating that plant counting presents a more challenging representation problem due to morphological complexity.

| Method | FSC-147 → TPC-268 MAE | Δ vs Same-Domain | TPC-268 → FSC-147 MAE | Δ vs Same-Domain |
|---|---|---|---|---|
| CountTR | 38.62 | +225% | 26.53 | +5% |
| CACViT | 26.73 | +147% | 17.88 | -19% |
| LOCA | 24.70 | +130% | 15.16 | -13% |
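The Δ columns report the percentage change of cross-domain MAE relative to each model's same-domain MAE. A one-line sketch of that computation (the numbers in the usage comment are illustrative, not taken from the table):

```python
def relative_change(cross_mae, same_mae):
    """Percentage change of cross-domain MAE relative to same-domain MAE.

    Positive values mean degradation under domain shift; negative values
    mean the model transfers better than its same-domain baseline.
    """
    return 100.0 * (cross_mae - same_mae) / same_mae

# Illustrative usage: a model whose MAE rises from 10.0 to 30.0
# under domain shift degrades by +200%.
delta = relative_change(30.0, 10.0)
```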

3. Zero-Shot and Foundation Models

Current zero-shot methods (GroundingREC) and vision-language backbones (BioCLIP2) underperform relative to visual-exemplar methods. The low-resolution feature maps of ViT architectures, without specific adapter designs, are suboptimal for dense prediction, suggesting that explicit modeling of visual similarity remains more effective than text-only or off-the-shelf foundation features.

| Method / Paradigm | Test MAE ↓ | Test R² ↑ |
|---|---|---|
| LOCA (3-Shot Visual) | 17.51 | 0.78 |
| GroundingREC (Zero-Shot Text) | 24.14 | 0.53 |
| LOCA + BioCLIP2 Backbone | 34.75 | 0.29 |

4. Taxonomic Knowledge as Inductive Bias

Incorporating Linnaean taxonomy as textual prompts yields consistent error reduction (e.g., MAE drops from 19.52 to 16.90). This confirms that structured biological knowledge provides a practical and effective inductive bias for fine-grained counting tasks.

| Target Specification | MAE ↓ | RMSE ↓ | R² ↑ |
|---|---|---|---|
| 3 visual exemplars | 19.52 | 50.51 | 0.61 |
| + species name | 17.53 | 44.80 | 0.69 |
| + full taxonomy | 16.90 | 43.32 | 0.71 |
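One plausible way to serialize Linnaean ranks into a textual prompt for the "+ full taxonomy" setting is sketched below. The exact prompt template used in the paper is not specified here, so the `taxonomy_prompt` helper and its rank ordering are assumptions:

```python
def taxonomy_prompt(ranks):
    """Join available Linnaean ranks (kingdom -> species) into a text prompt.

    `ranks` maps rank names to labels; missing ranks are simply skipped,
    so the same helper covers both "+ species name" and "+ full taxonomy".
    """
    order = ["kingdom", "phylum", "class", "order",
             "family", "genus", "species"]
    parts = [f"{r}: {ranks[r]}" for r in order if r in ranks]
    return "; ".join(parts)
```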

t-SNE Feature Space

t-SNE visualization of the feature space, highlighting that visual features alone struggle to cluster deep biological taxa.

TPC-268 Showcase

Broad Coverage

TPC-268 diversity across scales. Multi-scale morphologies from microscopic tissues to canopy-level remote sensing.

Qualitative Predictions

Qualitative results on TPC-268. Predicted counting results from representative methods across diverse scenarios.

BibTeX

@article{xu2026plant,
  title={Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species},
  author={Xu, Jinyu and Hu, Tianqi and Hu, Xiaonan and Zhou, Letian and Cao, Songliang and Zhang, Meng and Lu, Hao},
  journal={arXiv preprint arXiv:2603.21229},
  year={2026}
}