This study proposes a Multimodal Sentiment Analysis (MSA) framework for detecting mental health-related sentiment on the social media platform X. A total of 40,000 tweet-image pairs were collected from X and annotated through majority voting. To construct the FastText similarity corpus, 103,512 text documents (63,512 from Cable News Network (CNN) digital articles and 40,000 from X) were merged to strengthen semantic learning. The framework integrates multiple textual feature extraction techniques (RoBERTa, TF-IDF, and FastText) with visual features extracted using VGG-19. Classification is performed with a Long Short-Term Memory (LSTM) network for text, a Fully Connected Neural Network (FCNN) for images, and their fusion within a multimodal architecture. The best-performing configuration, a multimodal LSTM + FCNN model enhanced with an attention mechanism, achieved an accuracy of 78.56%, a 28.01% improvement over the image-only baseline. These findings underscore the importance of combining contextual language modeling with complementary visual features through adaptive fusion. The proposed MSA framework demonstrates potential for recognizing complex emotional signals and contributes to the advancement of AI-driven early detection tools for psychological distress on social media.
Keywords—Sentiment Analysis, RoBERTa, TF-IDF, FastText, VGG-19, LSTM, FCNN.
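To make the fusion described in the abstract concrete, the following is a minimal sketch (in PyTorch) of an attention-weighted LSTM + FCNN multimodal classifier. The feature dimensions (768 for text embeddings, 4096 for VGG-19 features), the hidden size, the number of sentiment classes, and the specific attention formulation are illustrative assumptions, not the exact configuration reported in this work.

```python
# Illustrative sketch of an attention-weighted LSTM + FCNN fusion model;
# dimensions, class count, and attention design are assumptions for demonstration only.
import torch
import torch.nn as nn

class MultimodalLSTMFCNN(nn.Module):
    def __init__(self, text_dim=768, img_dim=4096, hidden=256, n_classes=3):
        super().__init__()
        # Text branch: LSTM over per-token textual features (e.g., RoBERTa/FastText vectors)
        self.lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        # Image branch: fully connected layers over pooled VGG-19 features
        self.img_fc = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.ReLU(), nn.Dropout(0.3)
        )
        # Attention scores over the two modality representations
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, text_seq, img_feat):
        _, (h_n, _) = self.lstm(text_seq)                    # text_seq: (B, T, text_dim)
        text_vec = h_n[-1]                                   # (B, hidden)
        img_vec = self.img_fc(img_feat)                      # (B, hidden)
        stacked = torch.stack([text_vec, img_vec], dim=1)    # (B, 2, hidden)
        weights = torch.softmax(self.attn(stacked), dim=1)   # (B, 2, 1)
        fused = (weights * stacked).sum(dim=1)               # attention-weighted fusion
        return self.classifier(fused)

# Example forward pass with random tensors standing in for real features
model = MultimodalLSTMFCNN()
logits = model(torch.randn(8, 50, 768), torch.randn(8, 4096))
print(logits.shape)  # torch.Size([8, 3])
```

The key design point illustrated here is adaptive fusion: instead of concatenating the modalities with fixed importance, the attention layer learns per-sample weights, so noisy or uninformative images can be down-weighted relative to the text signal.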