Image analysis and visual recognition play a vital role in many modern applications, including face recognition, autonomous vehicles, and object detection. Fine-grained visual classification (FGVC) is a particularly challenging form of visual classification, because it requires distinguishing subtle details among visually highly similar objects. Recently, the Vision Transformer (ViT) has emerged as a prominent approach for visual tasks. Although ViTs have shown great potential across a range of visual tasks, they remain limited in capturing and extracting local features from images, which are essential for FGVC. In this thesis, we modify the Internal Ensemble Learning Transformer (IELT) method by revising its Multi-Head Voting (MHV) module: we integrate more suitable kernels and apply a masking technique through the Batch-based Dynamic Masking (BDMM) algorithm, improving the model's ability to capture and extract local features from input images.
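To make the batch-based dynamic masking idea concrete, the sketch below shows one possible way such a mechanism could operate on per-token importance scores (e.g., aggregated head-voting scores in an MHV-style module). This is only an illustrative sketch under assumed shapes and hypothetical names (batch_dynamic_mask, base_ratio, max_ratio); it is not the exact BDMM algorithm developed in this thesis.

```python
import torch

def batch_dynamic_mask(token_scores: torch.Tensor,
                       base_ratio: float = 0.3,
                       max_ratio: float = 0.6) -> torch.Tensor:
    """Illustrative batch-wise dynamic masking of patch tokens.

    token_scores: (B, N) importance scores per patch token, e.g. aggregated
    head-voting scores. Returns a boolean mask of shape (B, N) where True
    marks tokens that are kept for subsequent local-feature processing.
    """
    B, N = token_scores.shape
    # Normalize scores within each image so batches are comparable.
    norm = (token_scores - token_scores.mean(dim=1, keepdim=True)) / (
        token_scores.std(dim=1, keepdim=True) + 1e-6)
    # Adapt the masking ratio per batch (assumption): batches with flatter
    # score distributions (less confident voting) are masked less aggressively.
    spread = norm.abs().mean().clamp(0.0, 1.0).item()
    ratio = base_ratio + (max_ratio - base_ratio) * spread
    k = max(1, int(N * (1.0 - ratio)))  # number of tokens to keep
    # Keep the top-k highest-scoring tokens in every image of the batch.
    topk_idx = norm.topk(k, dim=1).indices
    keep = torch.zeros(B, N, dtype=torch.bool, device=token_scores.device)
    keep.scatter_(1, topk_idx, True)
    return keep

# Example: scores for a batch of 4 images with 196 patch tokens each.
scores = torch.rand(4, 196)
mask = batch_dynamic_mask(scores)
print(mask.shape, mask.sum(dim=1))  # kept-token count per image
```

The design intent illustrated here is that the fraction of masked tokens is decided per batch from the statistics of the voting scores rather than being fixed, so the masking adapts to how informative the current batch's local evidence is.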