BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews

Mohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

2023-05-11 · Sentiment Analysis

Abstract

The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While the study of sentiment analysis has been widely explored in many popular languages, relatively little attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our code and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
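
The hand-crafted-feature baselines in the results are labelled by the n-grams they use (for example "word 2-gram + word 3-gram" or "char 2-gram + char 3-gram"). As a rough illustration of what such features look like, here is a minimal Python sketch. The function names (`word_ngrams`, `char_ngrams`, `ngram_counts`) are hypothetical, and whitespace tokenization is an assumption; this is not the authors' preprocessing pipeline.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return contiguous word n-grams, splitting on whitespace (an assumption)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return contiguous character n-grams.

    Note: this slices by Unicode code point, so a Bangla grapheme made of
    several code points may be split across n-grams.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, sizes=(2, 3), unit="word"):
    """Count n-grams of every size in `sizes`, e.g. 2-grams plus 3-grams."""
    extract = word_ngrams if unit == "word" else char_ngrams
    counts = Counter()
    for n in sizes:
        counts.update(extract(text, n))
    return counts


if __name__ == "__main__":
    # Toy review text, for illustration only.
    print(ngram_counts("বইটি খুব ভালো লেগেছে", sizes=(2, 3), unit="word"))
```

Count vectors like these would then feed a classical classifier such as an SVM or Random Forest.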

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.9331 | Bangla-BERT (large) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.9106 | Random Forest (word 2-gram + word 3-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.9064 | Bangla-BERT (base-uncased) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.9053 | SVM (word 2-gram + word 3-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.9043 | Random Forest (word 1-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.8978 | Logistic Regression (char 2-gram + char 3-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.8964 | Logistic Regression (word 2-gram + word 3-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.8723 | XGBoost (char 2-gram + char 3-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.8663 | Multinomial NB (word 2-gram + word 3-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.8651 | XGBoost (word 2-gram + word 3-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.8564 | Multinomial NB (BoW) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.8519 | SVM (word 1-gram) |
| Sentiment Analysis | BanglaBook | Weighted Average F1-score | 0.0991 | LSTM (GloVe) |
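
The metric reported for every model is the weighted average F1-score. Under its usual definition, it is the mean of the per-class F1 scores, with each class weighted by its share of the true labels. The sketch below shows that standard definition in plain Python. The function name `weighted_f1` and the toy labels are illustrative assumptions, and it is not taken from the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score


if __name__ == "__main__":
    # Toy labels for the three BanglaBook categories, for illustration only.
    y_true = ["positive", "positive", "positive", "negative", "neutral"]
    y_pred = ["positive", "positive", "negative", "negative", "neutral"]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, a model that does well on the most frequent class can score high even if it handles rarer classes poorly.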

Related Papers

AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles (2025-07-15)
DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)
SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning (2025-07-14)
GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation (2025-07-10)
FINN-GL: Generalized Mixed-Precision Extensions for FPGA-Accelerated LSTMs (2025-06-25)
Unpacking Generative AI in Education: Computational Modeling of Teacher and Student Perspectives in Social Media Discourse (2025-06-19)
Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings (2025-06-16)