Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
This repository contains the code, data, and models of the paper titled "BᴀɴɢʟᴀBᴏᴏᴋ: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews" published in the Findings of the Association for Computational Linguistics: ACL 2023.
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Each row consists of a book review sample. The table below describes what each column signifies.
Column Title | Description
------------ | -------------
id | The unique identification number of the sample
Book_Name | The title of the book that has been evaluated by the review
Writer_Name | The name of the book's author
Category | The genre to which the book belongs
Rating | A numerical value $r$ such that $1 \leq r \leq 5$<br>A score reflecting the reviewer's subjective assessment of the book's quality
Review | The review text written by the reviewer
Site | The name of the online bookshop
sentiment | The conveyed sentiment and class label of the review<br>For a review sample with rating $r_i$, the sentiment label is<br>$S_i = \begin{cases}\text{Negative}, & \text{if } r_i \leq 2 \\ \text{Neutral}, & \text{if } r_i = 3 \\ \text{Positive}, & \text{if } r_i \geq 4\end{cases}$
label | The numerical representation of the sentiment label<br>For a review sample with sentiment label $S_i$, the numerical label is<br>$\text{label}_i = \begin{cases}0, & \text{if } S_i = \text{Negative} \\ 1, & \text{if } S_i = \text{Neutral} \\ 2, & \text{if } S_i = \text{Positive}\end{cases}$
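
The `sentiment` and `label` columns follow directly from `Rating`, so the mapping can be sketched in a few lines of Python. This is only an illustration of the rule stated in the table, assuming integer ratings from 1 to 5; the names `rating_to_sentiment` and `SENTIMENT_TO_LABEL` are hypothetical and do not come from the repository's code.

```python
# Hypothetical helper names; only the thresholds come from the table above.
SENTIMENT_TO_LABEL = {"Negative": 0, "Neutral": 1, "Positive": 2}


def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to its sentiment class."""
    if not 1 <= rating <= 5:
        # Assumption: ratings outside 1-5 are treated as invalid input.
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"


def rating_to_label(rating: int) -> int:
    """Map a 1-5 star rating straight to the numerical class label."""
    return SENTIMENT_TO_LABEL[rating_to_sentiment(rating)]
```

For example, `rating_to_label(3)` goes through the `Neutral` class and yields the label `1`.
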
To collect and prepare the BᴀɴɢʟᴀBᴏᴏᴋ dataset, we first compile a list of author URLs from online bookstores. From these, we obtain the URLs of the individual books. Using the book URLs, we scrape the book titles, author names, book categories, review texts, reviewer names, review dates, and ratings.
<img src="https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub1.png" alt="drawing" style="width:1000px;"/>
If a review does not have a rating, we deem it unannotated. Reviews with a rating of 1 or 2 are classified as negative, a rating of 3 is considered neutral, and a rating of 4 or 5 is classified as positive. After discarding the unannotated reviews, we curate a final dataset of 158,065 annotated reviews. Of these, 89,371 are written entirely in Bangla. The remaining 68,694 reviews are written in Romanized Bangla, English, or a mix of languages. These are translated into Bangla with Google Translator and a custom Python program that uses the googletrans library. The translations are then manually reviewed to confirm their accuracy.
<img src="https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub2.png" alt="drawing" style="width:1000px;"/>
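
The filtering described above (discard unannotated reviews, then separate reviews written entirely in Bangla from those that need translation) could be sketched as follows. This is a rough illustration, not the procedure used to build the dataset: the script check based on the Unicode Bengali block (U+0980 to U+09FF) is an assumed heuristic, and the function names and the `text`/`rating` dictionary keys are hypothetical.

```python
def is_entirely_bangla(text: str) -> bool:
    """Return True if every alphabetic character is in the Bengali Unicode block.

    Assumption: script membership is used as a stand-in for language;
    digits, punctuation, and whitespace are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all("\u0980" <= ch <= "\u09ff" for ch in letters)


def partition_reviews(reviews):
    """Drop unannotated reviews, then split the rest by script.

    Each review is a dict with hypothetical keys "text" and "rating";
    a rating of None marks an unannotated review.
    Returns (bangla_reviews, reviews_to_translate).
    """
    annotated = [r for r in reviews if r["rating"] is not None]
    bangla = [r for r in annotated if is_entirely_bangla(r["text"])]
    to_translate = [r for r in annotated if not is_entirely_bangla(r["text"])]
    return bangla, to_translate
```

Reviews in the second list would then go through the translation and manual verification steps.
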