BanglaBook

Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews

Texts; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International; Introduced 2023-05-11

This repository contains the code, data, and models of the paper titled "BᴀɴɢʟᴀBᴏᴏᴋ: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews" published in the Findings of the Association for Computational Linguistics: ACL 2023.

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Data Format

Each row consists of a book review sample. The table below describes what each column signifies.

Column Title | Description
------------ | -------------
id | The unique identification number of the sample
Book_Name | The title of the book that has been evaluated by the review
Writer_Name | The name of the book's author
Category | The genre to which the book belongs
Rating | A numerical value $r$ such that $1 \leq r \leq 5$<br>A score reflecting the reviewer's subjective assessment of the book's quality
Review | The review text written by the reviewer
Site | The name of the online bookshop
sentiment | The conveyed sentiment and class label of the review<br>For a review sample $i$ with rating $r_i$, the sentiment label $S_i$ is,<br>$S_i = \begin{cases}\text{Negative}, & \text{if } r_i \leq 2\\ \text{Neutral}, & \text{if } r_i = 3\\ \text{Positive}, & \text{if } r_i \geq 4\end{cases}$
label | The numerical representation of the sentiment label<br>For a review sample $i$ with sentiment label $S_i$, the numerical label is,<br>$label_i = \begin{cases}0, & \text{if } S_i = \text{Negative}\\ 1, & \text{if } S_i = \text{Neutral}\\ 2, & \text{if } S_i = \text{Positive}\end{cases}$
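The two piecewise rules in the table can be sketched in Python. This is a minimal illustration, not the authors' code: the function names `rating_to_sentiment` and `sentiment_to_label` are hypothetical, and only the rating thresholds and the 0/1/2 encoding are taken from the column descriptions.

```python
# Hypothetical helper names; thresholds and label encoding follow the table.
LABELS = {"Negative": 0, "Neutral": 1, "Positive": 2}


def rating_to_sentiment(rating: int) -> str:
    """Map a rating r with 1 <= r <= 5 to its sentiment class."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"


def sentiment_to_label(sentiment: str) -> int:
    """Map a sentiment class to its numerical label."""
    return LABELS[sentiment]
```

For example, `sentiment_to_label(rating_to_sentiment(4))` evaluates to `2`, the label for a positive review.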

Data Construction

Data Collection Process

To collect and prepare the BᴀɴɢʟᴀBᴏᴏᴋ dataset, we first compile a list of author URLs from online bookstores and, from these, obtain the URLs of the books. Using the book URLs, we scrape information such as book titles, author names, book categories, review texts, reviewer names, review dates, and ratings.

<img src="https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub1.png" alt="drawing" style="width:1000px;"/>

Labeling, Translation, and Validation of the Curated Samples

If a review does not have a rating, we deem it unannotated. Reviews with a rating of 1 or 2 are classified as negative, a rating of 3 is considered neutral, and a rating of 4 or 5 is classified as positive. After discarding the unannotated reviews, we curate a final dataset of 158,065 annotated reviews. Of these, 89,371 are written entirely in Bangla. The remaining 68,694 reviews are written in Romanized Bangla, English, or a mix of languages. They are translated into Bangla with Google Translator and a custom Python program using the googletrans library. The translations are then manually reviewed to confirm their accuracy.

<img src="https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub2.png" alt="drawing" style="width:1000px;"/>
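The filtering step described above (drop unannotated reviews, then separate reviews written entirely in Bangla from those that need translation) could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' pipeline: the function names are hypothetical, and the "entirely in Bangla" check is a simple heuristic based on the Unicode Bengali block (U+0980 to U+09FF), which the source does not specify. The reported counts are consistent with such a two-way split: 89,371 + 68,694 = 158,065.

```python
from typing import Iterable

# Unicode Bengali block; using it as the language test is an assumption.
BANGLA_START, BANGLA_END = 0x0980, 0x09FF


def is_entirely_bangla(text: str) -> bool:
    """Heuristic: every alphabetic character lies in the Bengali block."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        BANGLA_START <= ord(ch) <= BANGLA_END for ch in letters
    )


def split_reviews(reviews: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Drop unannotated rows, then split by whether translation is needed.

    Each row is a dict with a 'Rating' (int or None) and a 'Review' (str),
    mirroring the column names of the dataset.
    """
    annotated = [r for r in reviews if r.get("Rating") is not None]
    bangla = [r for r in annotated if is_entirely_bangla(r["Review"])]
    to_translate = [r for r in annotated if not is_entirely_bangla(r["Review"])]
    return bangla, to_translate
```

Reviews returned in the second list would then go to the translation and manual validation steps.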

Results

<img src="https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub3.png" alt="drawing" style="width:1000px;"/>

## Citation

If you find this work useful, please cite our paper:

```bib
@inproceedings{kabir-etal-2023-banglabook,
    title = "{B}angla{B}ook: A Large-scale {B}angla Dataset for Sentiment Analysis from Book Reviews",
    author = "Kabir, Mohsinul and
      Bin Mahfuz, Obayed and
      Raiyan, Syed Rifat and
      Mahmud, Hasan and
      Hasan, Md Kamrul",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.80",
    pages = "1237--1247",
    abstract = "The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While the study of sentiment analysis has been widely explored in many popular languages, relatively less attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our codes and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.",
}
```

Benchmarks

Sentiment Analysis: Weighted Average F1-score
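Weighted average F1 is commonly computed as the per-class F1 score averaged with weights proportional to each class's support (its number of true samples). The sketch below follows that common definition using only the standard library. The page does not spell out the exact formula used for the benchmark, so treat this as an assumption; the function name `weighted_f1` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of samples
    return score
```

With the dataset's encoding, `y_true` and `y_pred` would hold the numerical labels 0 (negative), 1 (neutral), and 2 (positive). Weighting by support means the majority class has the largest influence on the reported score.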

Statistics

Papers: 1
Benchmarks: 1

Tasks

Sentiment Analysis, Sentiment Classification, Text Classification