Abstract
This study addresses critical industrial challenges in e-commerce product categorization — namely platform heterogeneity and the structural limitations of existing taxonomies — by developing and deploying a multimodal hierarchical classification framework. Using a dataset of 271,700 products from 40 international fashion e-commerce platforms, we integrate textual features (RoBERTa), visual features (ViT), and joint vision-language representations (CLIP).
We investigate early, late, and attention-based fusion strategies within a hierarchical architecture that uses dynamic masking to enforce taxonomic consistency. Results show that CLIP embeddings combined via an MLP-based late-fusion strategy achieve the highest hierarchical F1 (98.59%), outperforming unimodal baselines.
To address shallow or inconsistent categories, we further introduce a self-supervised "product recategorization" pipeline using SimCLR, UMAP, and cascade clustering, which discovers new, fine-grained categories (e.g., subtypes of "Shoes") with cluster purities above 86%. Cross-platform experiments reveal a deployment-relevant trade-off: complex late-fusion methods maximize accuracy when the training data are diverse, while simpler early-fusion methods generalize more effectively to unseen platforms. Finally, we demonstrate industrial scalability through deployment in EURWEB's commercial transaction intelligence platform via a two-stage inference pipeline.
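For readers unfamiliar with the cluster purity figure quoted above, the following is a minimal sketch of how per-cluster purity is commonly computed: the share of a cluster's members that carry its majority label. It assumes a small set of reference labels is available for the clustered products; the function name `cluster_purity` and the sample labels are hypothetical, and the abstract does not say which evaluation protocol the paper used.

```python
from collections import Counter


def cluster_purity(cluster_ids, labels):
    """Return {cluster_id: share of members carrying the majority label}.

    Assumption: `labels` are reference labels (e.g. from a hand-checked
    sample); this is an illustration, not the paper's evaluation code.
    """
    members = {}
    for cid, label in zip(cluster_ids, labels):
        members.setdefault(cid, []).append(label)
    return {
        cid: Counter(cluster_labels).most_common(1)[0][1] / len(cluster_labels)
        for cid, cluster_labels in members.items()
    }


# Hypothetical toy data: two clusters found inside a coarse "Shoes" category.
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
labels = ["Sneakers", "Sneakers", "Sneakers", "Boots",
          "Sandals", "Sandals", "Sandals"]
purity = cluster_purity(cluster_ids, labels)  # {0: 0.75, 1: 1.0}
```

Under this definition, a purity above 86% means that more than 86% of the products in a discovered cluster share one label.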
Key Contributions
- Multimodal Hierarchical Framework: Integrates RoBERTa (text), ViT (vision), and CLIP (vision-language) with early, late, and attention-based fusion strategies within a hierarchical architecture using dynamic masking for taxonomic consistency.
- State-of-the-Art Categorization: CLIP + MLP late-fusion achieves 98.59% hierarchical F1 on 271,700 products across 40 international fashion platforms, outperforming all unimodal baselines.
- Self-Supervised Recategorization Pipeline: SimCLR + UMAP + cascade clustering discovers new fine-grained product categories with cluster purities above 86%, enabling taxonomy enrichment without manual annotation.
- Cross-Platform Generalization Analysis: Reveals a practical trade-off — late-fusion maximizes accuracy with diverse training data; early-fusion generalizes better to unseen platforms.
- Industrial Deployment: Two-stage inference pipeline (lightweight RoBERTa + GPU-accelerated multimodal stage) deployed in EURWEB's commercial platform, balancing cost and accuracy at scale.
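The dynamic masking named in the first contribution can be illustrated with a small sketch. It assumes a two-level taxonomy and one common form of masking: once a parent category is predicted, child logits outside that parent's subtree are set to negative infinity before the softmax, so the predicted path is always valid. The taxonomy, the names `TAXONOMY`, `masked_softmax`, and `predict_path`, and the logit values are all hypothetical; the paper's actual masking mechanism may differ.

```python
import math

# Hypothetical two-level taxonomy (illustrative; not the paper's label set).
TAXONOMY = {
    "Shoes": ["Sneakers", "Boots"],
    "Clothing": ["Dresses", "Jackets"],
}
PARENTS = list(TAXONOMY)
CHILDREN = [child for kids in TAXONOMY.values() for child in kids]


def masked_softmax(logits, allowed):
    """Softmax over `logits`, giving disallowed positions probability 0."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    peak = max(masked)  # finite as long as at least one position is allowed
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def predict_path(parent_logits, child_logits):
    """Pick the top parent, then the top child among that parent's children."""
    parent_idx = max(range(len(PARENTS)), key=parent_logits.__getitem__)
    parent = PARENTS[parent_idx]
    allowed = [child in TAXONOMY[parent] for child in CHILDREN]
    probs = masked_softmax(child_logits, allowed)
    child = CHILDREN[max(range(len(CHILDREN)), key=probs.__getitem__)]
    return parent, child, probs


# "Dresses" has the largest raw child logit, but it is masked out because
# the predicted parent is "Shoes".
parent, child, probs = predict_path([2.0, 0.5], [0.1, 0.3, 5.0, 0.2])
# parent == "Shoes", child == "Boots"
```

Without the mask, the child head in this example would pick "Dresses" under a "Shoes" parent, which is exactly the kind of taxonomically inconsistent path the masking is meant to rule out.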
BibTeX
@inproceedings{gross2025crossplatform,
title={Cross-Platform E-Commerce Product Categorization and Recategorization: A Multimodal Hierarchical Classification Approach},
author={Gross, Lotte and Walter, Rebecca and Zoppi, Nicole and Justus, Adrien and Gambetti, Alessandro and Han, Qiwei and Kaiser, Maximilian},
booktitle={IEEE International Conference on Big Data (BigData)},
year={2025},
address={Macau, China},
url={https://arxiv.org/abs/2508.20013}
}