A Survey on Text-to-Image Person Re-identification: From CLIP to Fine-Grained Cross-Modal Alignment

Dongbin Chen; Xiaohang Su; Junjie Li; Yancheng Wu; Shirun Liang

doi:10.63313/AERpc.9097

Authors

Dongbin Chen, School of Electronics and Information Engineering, Tiangong University, Tianjin 300387, China
Xiaohang Su, School of Mechanical Engineering, Tiangong University, Tianjin 300387, China
Junjie Li, School of Electronics and Information Engineering, Tiangong University, Tianjin 300387, China
Yancheng Wu, School of Electronics and Information Engineering, Tiangong University, Tianjin 300387, China
Shirun Liang, School of Electronics and Information Engineering, Tiangong University, Tianjin 300387, China

Keywords:

Text-to-Image Person Re-identification, CLIP, Cross-Modal Alignment, Fine-Grained Retrieval, Vision-Language Pre-training

Abstract

Text-to-image person re-identification (TI-ReID) has become a critical task in intelligent surveillance, aiming to retrieve a target pedestrian from a large image gallery using a natural language description. Unlike image-based ReID, TI-ReID must bridge the semantic gap between modalities while capturing fine-grained, identity-discriminative details. The advent of Vision-Language Pre-training (VLP) models, particularly CLIP, has significantly advanced the field by providing robust pre-trained cross-modal representations. However, directly applying CLIP to TI-ReID is suboptimal due to a granularity gap: CLIP is effective at global scene-level matching, whereas TI-ReID demands fine-grained alignment of attributes like clothing, accessories, and actions. This survey systematically reviews the evolution of TI-ReID, from early unimodal backbone approaches to modern CLIP-based frameworks. We categorize existing methods into three main paradigms: 1) global feature learning with contrastive objectives, 2) part-level alignment using spatial attention or human parsers, and 3) token-level interaction and selection mechanisms. We analyze the core challenges of visual token redundancy, weak part-level correspondence, and noisy cross-modal data. Key representative methods such as IRRA, CFine, RASA, and RDE are discussed. Finally, we identify key open challenges and outline promising future directions, including multimodal large language models, query-aware alignment, and video-based ReID.

References

[1] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, "Person search with natural language description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 1970-1979.

[2] Y. Zhang and H. Lu, "Deep cross-modal projection learning for image-text matching," in Proc. Eur. Conf. Comput. Vis. (ECCV), Munich, Germany, Sep. 2018, pp. 686-701.

[3] N. Sarafianos, X. Xu, and I. A. Kakadiaris, "Adversarial representation learning for text-to-image matching," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, Korea, Oct. 2019, pp. 5814-5824.

[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2021, pp. 8748-8763.

[5] A. Zhu, Z. Wang, Y. Li, X. Wan, J. Jin, T. Wang, F. Hu, and G. Hua, "DSSL: Deep surroundings-person separation learning for text-based person retrieval," in Proc. 29th ACM Int. Conf. Multimedia, Chengdu, China, Oct. 2021, pp. 209-217.

[6] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, "DynamicViT: Efficient vision transformers with dynamic token sparsification," in Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 13937-13949.

[7] Y. Qin, Y. Chen, D. Peng, X. Peng, J. T. Zhou, and P. Hu, "Noisy-correspondence learning for text-to-image person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2024, pp. 27187-27196.

[8] D. Jiang and M. Ye, "Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Vancouver, BC, Canada, Jun. 2023, pp. 2787-2797.

[9] S. Yan, N. Dong, L. Zhang, and J. Tang, "CLIP-driven fine-grained text-image person re-identification," IEEE Trans. Image Process., vol. 32, pp. 6032-6046, 2023.

[10] Z. Wang, Z. Fang, J. Wang, and Y. Yang, "ViTAA: Visual-textual attributes alignment in person search by natural language," in Proc. Eur. Conf. Comput. Vis. (ECCV), Glasgow, UK, Aug. 2020, pp. 392-408.

[11] Y. Chen, G. Zhang, Y. Lu, Z. Wang, and Y. Zheng, "TIPCB: A simple but effective part-based convolutional baseline for text-based person search," Neurocomputing, vol. 494, pp. 171-181, 2022.

[12] Y. Bai, M. Cao, D. Gao, Z. Cao, C. Chen, Z. Fan, L. Nie, and M. Zhang, "RASA: Relation and sensitivity aware representation learning for text-based person search," in Proc. 32nd Int. Joint Conf. Artif. Intell. (IJCAI), Macau, China, Aug. 2023, pp. 555-563.

[13] S. He, H. Luo, W. Jiang, X. Jiang, and H. Ding, "VGSG: Vision-guided semantic-group network for text-based person search," IEEE Trans. Image Process., vol. 33, pp. 163-176, 2024.

[14] S. Yang, Y. Zhou, Z. Zheng, Y. Wang, L. Zhu, and Y. Wu, "Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark," in Proc. 31st ACM Int. Conf. Multimedia (MM), Ottawa, ON, Canada, Oct. 2023, pp. 4492-4501.

[15] S. Chen et al., "CMLFA: Cross-modal learning with feature aggregation for text-based person retrieval," J. Electron. Imag., vol. 34, no. 2, 2025.

[16] Y. Qin, Y. Bai et al., "Human-centered interactive learning via MLLMs for text-to-image person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Nashville, TN, USA, Jun. 2025.