Caichun Cen
College of Big Data and Software Engineering, Wuzhou University, Wuzhou, 543000, China.
Renaissance 2025, 1(05); https://doi.org/10.70548/ra142145
Submission received: 13 March April / Revised: 24 May 2025 / Accepted: 11 July 2025 / Published: 21 July 2025
Abstract: With the rapid development of educational informatization, student course evaluations have gradually gained widespread attention as an important tool for assessing teaching quality and optimizing course design. Traditional course evaluations often rely on manual analysis, which is time-consuming and prone to subjective bias. This study applies Natural Language Processing (NLP) techniques, using the Transformer-based BERT and RoBERTa models for sentiment analysis, to automate the processing of student course evaluation data and achieve fast, accurate sentiment classification of student feedback. The experimental results show that both BERT and RoBERTa perform excellently on the sentiment classification task, effectively predicting the sentiment categories positive, negative, neutral, and mixed.
Keywords: Student Course Evaluation, Sentiment Analysis, NLP, BERT, RoBERTa
1. INTRODUCTION
In modern education, student course evaluations have become an important means of assessing teaching quality and optimizing course design. These evaluations not only reflect students’ satisfaction with course content, teaching methods, and instructors, but also provide educational administrators with valuable suggestions for improvement and evidence for decision-making [1][2]. However, traditional course evaluations mainly rely on qualitative questionnaires and rating scales, making the analysis process both tedious and inefficient [3][4].
Sentiment Analysis (SA), a key task in Natural Language Processing (NLP), aims to automatically identify emotional tendencies in text [5]. The goal of sentiment analysis is to recognize and classify the polarity expressed in text, typically categorized as negative, neutral, positive, or mixed [6][7]. Existing sentiment analysis approaches can be broadly divided into the following categories:
- Rule-based approaches: perform sentiment analysis using predefined rules and sentiment lexicons [8].
- Machine learning-based approaches: train models on large amounts of labeled data; commonly used algorithms include Naive Bayes, Support Vector Machines (SVM), and decision trees [9][10].
- Deep learning-based approaches: automatically extract high-level features from data via neural networks; widely adopted models include Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), BERT, and RoBERTa [11][12][13].
With the rapid development of NLP technologies, sentiment analysis has gradually become an effective tool for processing student course evaluation data. Through automated sentiment analysis, educational administrators can quickly extract emotional information (e.g., positive, negative, or neutral) from student feedback, thereby significantly improving processing efficiency and providing data-driven support for educational decision-making [14]. This study employs two state-of-the-art deep learning models, BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT Approach), trained on student course evaluation data [15][16]. Both models are built upon the Transformer architecture and have been widely applied to text classification and sentiment analysis tasks [17][18]. BERT leverages bidirectional attention mechanisms to fully capture contextual information, thereby enhancing semantic understanding [19]. RoBERTa, built on BERT, achieves further performance improvements by training on larger datasets for longer periods and adopting optimized training strategies [20]. Although RoBERTa generally outperforms BERT on large-scale datasets, BERT tends to be more stable and performs better when training data is limited, making it particularly suitable for small-scale sentiment analysis tasks [21].
2. DATASET DESCRIPTION
2.1 Data Files
The dataset for this study was obtained from a Kaggle competition: https://www.kaggle.com/code/aslemimolu/nlp-sentiment-analysis-xm-99/input. It includes a training set of course evaluations and a test set for prediction, both comprising a unique ID for each evaluation, the student's text feedback on the course, and one of four sentiment classifications per evaluation: positive, negative, neutral, and mixed, as shown in Table 1. The data distributions of the training set and test set are illustrated in Figures 1 and 2, respectively.
Table 1. Information about the dataset

| Relevant information | Training set | Test set |
| --- | --- | --- |
| Number of samples | 1600 | 400 |
| ID | Unique identifiers | Unique identifiers |
| Text | Texts for sentiment analysis | Texts to be used for prediction |
| Category | Labeled sentiment categories (0, 1, 2, 3) | |

Fig. 1. Training set. Fig. 2. Test set.
2.2 Task Objectives
The objective of this task is to predict the sentiment category of each student course evaluation from its textual feedback using sentiment analysis. We train and evaluate the BERT and RoBERTa models separately on this task; the sentiment categories are negative, neutral, positive, and mixed. By analyzing these sentiment categories, we aim to gain a comprehensive understanding of students' overall perceptions of a course, enabling instructors to adjust and optimize course design in real time and promptly capture students' suggestions and opinions. The evaluation metric adopted in this study is accuracy: the percentage of instances where the model's predicted labels match the ground-truth labels.
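The accuracy metric described here is straightforward to compute from the predicted and ground-truth label lists; a minimal sketch:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))  # 3 of 4 correct -> 0.75
```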
3. DATA PROCESSING
3.1 Data Exploration
An initial exploration of the data was conducted in a Python environment: basic information extraction, class distribution analysis, and visualization were performed, with results shown in Figure 3. The dataset exhibits a highly balanced class distribution, with the following sample counts per category: negative (0), 394 samples; neutral (1), 404 samples; positive (2), 393 samples; mixed (3), 409 samples. These results indicate that the model can adequately represent each sentiment category, and no additional measures (such as oversampling, undersampling, or class weighting) are required to address class imbalance.
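The balance check described above can be reproduced in a few lines of Python; the hard-coded label list below simply mirrors the per-class counts reported in this section:

```python
from collections import Counter

# Label list reproducing the per-class counts reported above
# (0 = negative, 1 = neutral, 2 = positive, 3 = mixed)
labels = [0] * 394 + [1] * 404 + [2] * 393 + [3] * 409

counts = Counter(labels)
total = sum(counts.values())
for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} samples ({n / total:.1%})")

# A max/min ratio close to 1 confirms no resampling or class weighting is needed
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```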

Fig. 3. Data clustering visualization results
3.2 Data Preprocessing
The raw evaluation texts were cleaned as follows. First, the data were loaded using pandas, and text cleaning was performed via regular expressions (removing URLs, special characters, emojis, and redundant whitespace); all text was converted to lowercase. Subsequently, label encoding and dataset splitting were carried out: the evaluation texts were labeled according to their sentiment categories (positive, negative, neutral, and mixed), and the dataset was divided into training and validation sets. Finally, tokenization was performed using each model's corresponding tokenizer: the maximum sequence length was determined, and the texts were converted into PyTorch Dataset format suitable for input to BERT and RoBERTa.
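The cleaning steps described above (URL removal, special-character and emoji stripping, whitespace normalization, lowercasing) can be sketched with regular expressions; the exact patterns below are illustrative assumptions, not the study's verbatim code:

```python
import re

def clean_text(text: str) -> str:
    """Remove URLs, emojis/special characters, and redundant whitespace; lowercase."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-zA-Z0-9\s.,!?']", " ", text)    # strip emojis / special chars
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    return text.lower()

print(clean_text("Great course!!! 😊 See https://example.com   for NOTES"))
# -> "great course!!! see for notes"
```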
4. MODEL TRAINING AND EVALUATION
4.1 Model Training
The core idea of BERT (Bidirectional Encoder Representations from Transformers) is to understand contextual information in text through a bidirectional self-attention mechanism. To elucidate its principles, the explanation starts from the self-attention mechanism and extends to BERT's pre-training objectives. The goal of the self-attention mechanism is to compute the relationships between each token in the input sequence and all other tokens, thereby assigning a corresponding weight to each token. Given an input sequence, its projections can be represented as:
$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$

where $Q$ is the Query matrix, $K$ is the Key matrix, $V$ is the Value matrix, $X$ is the input word embedding matrix, and $W_Q$, $W_K$, $W_V$ are the learned weight matrices. Next, the attention scores are computed using the following formula:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

where $d_k$ is the dimension of the keys, and $\sqrt{d_k}$ is used to scale the results, preventing excessively large values. In the BERT model, the query ($Q$), key ($K$), and value ($V$) vectors are calculated bidirectionally for each word, which allows the model to consider both the left and right context of the word. This bidirectional context understanding is one of the key differences between BERT and traditional unidirectional language models (such as LSTM or GPT).
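The scaled dot-product attention formula above can be made concrete with a small NumPy sketch; the dimensions and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, embedding dimension 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8): one context-aware vector per token
```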
BERT’s pretraining tasks include Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The goal of MLM is to predict the masked words based on the surrounding context:

$P(w_i \mid \mathrm{context}) = \mathrm{softmax}(W h_i)$

where $w_i$ is the target word, $h_i$ is the hidden representation obtained through the self-attention mechanism, and $W$ is the output layer weight.
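The MLM output distribution can be sketched numerically; the hidden size and toy vocabulary below are illustrative assumptions:

```python
import numpy as np

def mlm_distribution(h, W):
    """P(w | context) = softmax(W h): a probability for every vocabulary word
    at a masked position."""
    logits = W @ h
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(1)
h = rng.normal(size=16)            # hidden representation of the [MASK] token
W = rng.normal(size=(100, 16))     # output layer weight, toy vocabulary of 100 words
p = mlm_distribution(h, W)
print(p.shape)                     # (100,); probabilities sum to 1
```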
The final output of the BERT model is a representation computed using the bidirectional self-attention mechanism, which captures the meaning of each word in its context. This representation is then used for downstream tasks such as sentiment analysis, question answering, and text classification. The pseudocode for the implementation process in this study is shown in Algorithm 1.
| Algorithm 1: BERT Model Training for Sentiment Analysis |
| Input: Training data |
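A fine-tuning loop of the kind summarized in Algorithm 1 can be sketched with PyTorch and the Hugging Face Transformers library. The checkpoint name ("bert-base-uncased"), batch size, learning rate, and epoch count are illustrative assumptions rather than the paper's reported configuration:

```python
LABELS = {"negative": 0, "neutral": 1, "positive": 2, "mixed": 3}

class ReviewDataset:
    """Map-style dataset of tokenized course evaluations (usable with a DataLoader)."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        import torch
        self.enc = tokenizer(list(texts), truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

def fine_tune_bert(texts, labels, epochs=3, lr=2e-5, batch_size=16):
    import torch
    from torch.utils.data import DataLoader
    from transformers import BertTokenizerFast, BertForSequenceClassification

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS))
    loader = DataLoader(ReviewDataset(texts, labels, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch).loss   # cross-entropy over the four classes
            loss.backward()
            optimizer.step()
    return tokenizer, model
```

RoBERTa training (Algorithm 2) follows the same loop with the tokenizer and model swapped for their RoBERTa counterparts.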
RoBERTa (A Robustly Optimized BERT Pretraining Approach) is an improved version of the BERT model, proposed by Facebook AI. RoBERTa optimizes the pretraining process of BERT by utilizing larger training datasets and more computational resources, further enhancing the model’s performance.
RoBERTa is based on the same Transformer architecture as BERT, particularly in the Encoder part. Like BERT, RoBERTa uses the self-attention mechanism to capture relationships between different words in the text. It also learns high-level textual features automatically, without the need for manual feature engineering. The key difference of RoBERTa is that it incorporates several important optimizations during training. For instance, it removes the Next Sentence Prediction (NSP) task, which was considered to have little impact on BERT’s model performance. Instead, RoBERTa relies solely on the Masked Language Model (MLM) for pretraining. Additionally, RoBERTa uses larger batch sizes and learning rate schedulers to improve training efficiency on large-scale datasets. The optimization formula used in RoBERTa during the training process is as follows:

$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$

where $\theta_t$ represents the current model parameters, $\eta$ is the learning rate, and $\nabla_\theta L(\theta_t)$ is the gradient of the current loss function $L$ with respect to the model parameters. The pseudocode for the implementation process in this study is shown as Algorithm 2.
| Algorithm 2: RoBERTa Model Training for Sentiment Analysis |
| Input: Training data |
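The parameter update rule used in training can be checked numerically on a toy one-parameter loss; the quadratic loss below is an illustrative assumption, chosen only because its gradient is easy to write by hand:

```python
# Toy check of the update θ_{t+1} = θ_t - η ∇L(θ_t) on L(θ) = (θ - 3)^2,
# whose gradient is 2(θ - 3)
theta, eta = 0.0, 0.1
for _ in range(50):
    grad = 2.0 * (theta - 3.0)    # gradient of the loss at the current θ
    theta = theta - eta * grad    # one gradient-descent step
print(theta)  # approaches the minimum at θ = 3
```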
4.2 Model Evaluation
By employing two advanced NLP models, BERT and RoBERTa, this study demonstrates how modern deep learning techniques can be applied to the educational field, particularly to the analysis of student course evaluations. The competition results show that, with optimization, the BERT model's score improved from 93.5% to 98.00%, as shown in Figure 4, while the RoBERTa model's performance further improved to 99.00%, as shown in Figures 5 and 6.

Fig. 4. Training results using BERT in the competition

Fig. 5. Training results using RoBERTa in the competition

Fig. 6. Training results using RoBERTa in the competition
5. CONCLUSION AND FUTURE WORK
This study uses the BERT and RoBERTa models for sentiment analysis of student course evaluations, automating the processing of evaluation data and successfully classifying the sentiment of student feedback. The experimental results show that RoBERTa slightly outperforms BERT in accuracy, while BERT performs more stably on smaller datasets. This research provides educational administrators with a precise course-feedback analysis tool and demonstrates the potential of deep learning technologies in education.
Future work could be optimized in the following areas:
- Data Processing and Cleaning: Further improving data quality by removing noise.
- Class Imbalance: Enhancing model performance through data augmentation or weighted loss functions.
- Model Selection and Training: Exploring more lightweight models and enhanced regularization techniques to avoid overfitting.
- Hyperparameter Tuning: Using automated tuning and multi-metric evaluation methods to improve model performance.
- Inference Speed: Improving real-time prediction capabilities through model optimization techniques.
- Multi-domain Applications: Expanding sentiment analysis to include areas such as professional evaluations and teacher assessments, providing more comprehensive data support.
ACKNOWLEDGMENT
This work was supported by the Guangxi University Young and Middle-aged Teachers' Basic Research Ability Enhancement Project (2024KY0695) and the Wuzhou Science and Technology Plan Project (2023B02028).
REFERENCES
- Wright, J., Smith, R. (2018). Enhancing Educational Quality through Student Feedback: A Review of Methods. International Journal of Educational Research, 98, 12-20.
- Sanchez, V., & Martínez, A. (2019). Using Student Course Evaluations for Quality Improvement in Higher Education. Journal of Educational Administration, 57(3), 316-331.
- Li, Y., & Zhou, Q. (2020). A Comparative Study of Traditional and Digital Feedback Systems in Higher Education. Computers & Education, 146, 103764.
- Chen, L., Liu, Z., & Wu, L. (2017). Exploring the Use of Questionnaires for Course Evaluation: A Case Study. Educational Technology Research and Development, 65(6), 1305-1321.
- Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135.
- Liu, B. (2012). Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies, 5(1), 1-167.
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
- Hu, M., & Liu, B. (2004). Mining and Summarizing Customer Reviews. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), 168-177.
- Yang, Y., & Liu, X. (2005). A Re-examination of Text Categorization Methods. Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), 392-399.
- Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the 10th European Conference on Machine Learning, 137-142.
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 521(7553), 436-444.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Proceedings of NeurIPS 2017, 30.
- Xue, X., & Zhang, C. (2020). Text Mining for Educational Feedback: A Sentiment Analysis Approach. Journal of Educational Data Mining, 12(2), 35-54.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
- Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. Proceedings of NeurIPS 2019.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
- Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. Proceedings of ACL 2018.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
- Zhou, Q., Li, Y., & Liu, W. (2017). Sentiment Analysis in Course Evaluation Systems: A Case Study. International Journal of Artificial Intelligence Education, 28(4), 484-505.