Decoding multimodal text analytics: tasks, datasets, fusion models, and future frontiers

Nath, Tanusree, Gupta, Vedika ORCID: https://orcid.org/0000-0002-8109-498X, Gupta, Manjari and Sharma, Rajesh (2026) Decoding multimodal text analytics: tasks, datasets, fusion models, and future frontiers. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 16 (2): e70083. John Wiley & Sons Inc. ISSN 1942-4787. Available at: https://doi.org/10.1002/widm.70083

Text
Decoding multimodel text analysis.pdf - Published Version
Restricted to Repository staff only

Abstract

The volume of digital data is estimated to grow exponentially, reaching 180 zettabytes by 2025, with more than 90% of it unstructured. This growth has triggered a shift from unimodal to multimodal text analytics (MTA). Multimodal text analytics first appeared in scholarly literature and industrial use cases during the early 2010s. Since then, it has expanded into other sectors such as healthcare, e‐commerce, education, and public safety. This survey presents a task‐oriented, modality‐inclusive, and dataset‐aware synthesis of recent advancements in MTA, offering an in‐depth review of 10 core text analytics tasks through a multimodal lens. We systematically analyze over 160 research studies and categorize more than 120 state‐of‐the‐art models, spanning fusion strategies, representation learning, transformer architectures, and pretrained vision‐language frameworks (e.g., CLIP, ViLBERT). Across a variety of datasets, including CMU‐MOSI, CMU‐MOSEI, IEMOCAP, and MAViT‐Bangla, multimodal models achieve F1‐score improvements of up to 18%–25% over text‐only baselines, as captured in the standardized task‐wise comparison tables included in this survey. Moreover, the survey discusses seven under‐explored tasks, including personality detection, satire detection, and author profiling, and elaborates on research gaps in modality fusion, dataset diversity, and social inclusivity for these tasks. It not only fills gaps in the current literature by unifying knowledge from different fields, but also offers researchers working on MTA a path forward. It is the first survey to bring all the key tasks within multimodal text analytics into a single, consistent overview, in contrast to other surveys that either address multimodal computing at a general level or concentrate on a specific task.

Item Type:	Article
Uncontrolled Keywords:	Algorithmics | Emotion recognition | Fake news | Hate speech | Language model | Multi-modal | Sentiment analysis | Text analytics | Transformer architecture | Vision-language model
Subjects:	Physical, Life and Health Sciences > Computer Science
Vol/Issue no. published date:	7 June 2026
Depositing User:	Mr. Syed Anas Ali
Date Deposited:	16 Apr 2026 10:09
Last Modified:	21 May 2026 11:57
Official URL:	https://doi.org/10.1002/widm.70083
URI:	https://pure.jgu.edu.in/id/eprint/11209
