ArSen-20

Texts | Apache-2.0 license | Introduced 2024-04-11

Sentiment detection remains a pivotal task in natural language processing, yet its development in Arabic lags due to a scarcity of training materials compared to English. Addressing this gap, we present ArSen-20, a benchmark dataset tailored to propel Arabic sentiment detection forward. ArSen-20 comprises 20,000 professionally labeled tweets sourced from Twitter, focusing on the theme of COVID-19 and spanning the period from 2020 to 2023. Beyond tweet content, the dataset incorporates metadata associated with the user, enriching the contextual understanding. ArSen-20 offers a comprehensive resource to foster advancements in Arabic sentiment analysis and facilitate research in this critical domain.

The ArSen-20 dataset statistics:

| Statistics | Num |
|:-------------:|:-----:|
| Training set size | 16000 |
| Validation set size | 2000 |
| Testing set size | 2000 |
| Neutral | 17262 |
| Positive | 878 |
| Negative | 1860 |
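The label counts above are heavily skewed toward the neutral class. As a minimal sketch of how that imbalance could be quantified, the snippet below computes each label's share and an inverse-frequency class weight from the counts in the table. The weighting formula is a common illustrative choice and an assumption here, not something taken from the ArSen-20 papers.

```python
# Label counts copied from the statistics table above.
label_counts = {"Neutral": 17262, "Positive": 878, "Negative": 1860}

total = sum(label_counts.values())

# Share of each label in the full dataset.
label_share = {label: count / total for label, count in label_counts.items()}

# Inverse-frequency ("balanced") weights: total / (num_classes * count).
# Illustrative assumption only; not a weighting described by the dataset authors.
class_weights = {
    label: total / (len(label_counts) * count)
    for label, count in label_counts.items()
}

for label in label_counts:
    print(f"{label}: share={label_share[label]:.3f}, weight={class_weights[label]:.3f}")
```

Weights like these could be passed to a weighted loss during training, so that errors on the rare positive and negative tweets count for more than errors on neutral ones.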

Features

The dataset has the following features:

| Field | Type | Description |
|:-----------:|:--------:|:----------------:|
| tweet id | string | The unique identifier of the requested Tweet. |
| label | string | Sentiment classification of this Tweet. |
| author id | string | The unique identifier of this user. |
| created_at | date | Creation time of the Tweet. |
| lang | string | Language of the Tweet, if detected by Twitter. |
| like_count | int | The number of likes on this Tweet. |
| quote_count | int | The number of times this Tweet has been quoted. |
| reply_count | int | The number of replies to this Tweet. |
| retweet_count | int | The number of retweets of this Tweet. |
| tweet | string | The actual UTF-8 text of the Tweet. |
| user_verified | boolean | Indicates if this user is a verified Twitter user. |
| followers_count | int | The number of followers of the author. |
| following_count | int | The number of accounts the author follows. |
| tweet_count | int | Total number of Tweets by the author. |
| listed_count | int | The number of public lists that this user is a member of. |
| name | string | The name of the user. |
| username | string | The Twitter screen name, handle, or alias. |
| user_created_at | date | The UTC datetime that the user account was created. |
| description | string | The text of this user's profile description (bio). |
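As a rough sketch of how the types in the table above might be applied when reading the CSV, the snippet below casts the count fields to `int` and `user_verified` to `bool`, and leaves identifiers and text as strings. It assumes the CSV column headers match the field names in the table and that booleans are stored as `True`/`False` text; neither has been checked against the published file, and `parse_row` and `load_rows` are hypothetical helper names.

```python
import csv

# Count fields listed as "int" in the feature table.
INT_FIELDS = (
    "like_count", "quote_count", "reply_count", "retweet_count",
    "followers_count", "following_count", "tweet_count", "listed_count",
)


def parse_row(row):
    """Cast one raw CSV row (all strings) to the types in the feature table."""
    parsed = dict(row)
    for field in INT_FIELDS:
        if parsed.get(field) not in (None, ""):
            # int(float(...)) tolerates values exported as "3.0".
            parsed[field] = int(float(parsed[field]))
    if "user_verified" in parsed:
        # Assumption: booleans are stored as the text "True"/"False".
        parsed["user_verified"] = str(parsed["user_verified"]).strip().lower() == "true"
    # IDs stay as strings: Tweet IDs do not fit safely in a 64-bit float.
    return parsed


def load_rows(path):
    """Read a CSV file and return a list of typed row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [parse_row(row) for row in csv.DictReader(handle)]
```

With the file downloaded locally, `rows = load_rows("ArSen-20_publish.csv")` would return one dictionary per tweet. The date fields are left as strings here because their exact format is not stated in this document.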

Download

You can download the dataset from here.

  • ArSen-20_publish.csv - Contains all features.

  • ArSen-20_id_only.csv - Contains only the tweet IDs and their authors' IDs.

Citation

If you use this dataset in your research, please cite the following papers:

@inproceedings{fang2024arsen,
    title = "{ArSen-20}: A New Benchmark for {A}rabic Sentiment Detection",
    author = "Fang, Yang  and
      Xu, Cheng",
    booktitle = "5th Workshop on African Natural Language Processing",
    year = "2024",
    url = "https://openreview.net/forum?id=GgsRUF5kJt"
}
@inproceedings{fang2024advancing,
    title = "Advancing {A}rabic Sentiment Analysis: {A}r{S}en Benchmark and the Improved Fuzzy Deep Hybrid Network",
    author = "Fang, Yang  and
      Xu, Cheng  and
      Guan, Shuhao  and
      Yan, Nan  and
      Mei, Yuke",
    editor = "Barak, Libby  and
      Alikhani, Malihe",
    booktitle = "Proceedings of the 28th Conference on Computational Natural Language Learning",
    month = nov,
    year = "2024",
    address = "Miami, FL, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.conll-1.39",
    pages = "507--516",
}

Contact

If you have any questions or comments about the dataset, please contact Yang Fang (20211209024@chnu.edu.cn).

Potential cooperation in related fields is also welcome. :)