Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language

Texts · CC BY 4.0 · Introduced 2024-01-15

The Vashantor dataset consists of 32,500 sentences covering five regional dialects of Bangla: Chittagong, Noakhali, Sylhet, Barishal, and Mymensingh. The regional text is provided in two formats, "Bangla" and "Banglish," while the core data additionally includes an English column. Each region and format combination has a fixed number of training, testing, and validation samples. The dataset details are as follows:

Specifics of the Core Data:

| Type | Bangla | Banglish | English |
|:----------:|:------:|:--------:|:-------:|
| Train | 1875 | 1875 | 1875 |
| Test | 375 | 375 | 375 |
| Validation | 250 | 250 | 250 |

Specifics of the Regional Data:

<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow">Region</th>
    <th class="tg-c3ow">Type</th>
    <th class="tg-c3ow">Train</th>
    <th class="tg-c3ow">Test</th>
    <th class="tg-c3ow">Validation</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-c3ow" rowspan="2">Chittagong</td>
    <td class="tg-0pky">Bangla</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">Banglish</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-c3ow" rowspan="2">Noakhali</td>
    <td class="tg-0pky">Bangla</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">Banglish</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-c3ow" rowspan="2">Sylhet</td>
    <td class="tg-0pky">Bangla</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">Banglish</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-c3ow" rowspan="2">Barishal</td>
    <td class="tg-0pky">Bangla</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">Banglish</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-c3ow" rowspan="2">Mymensingh</td>
    <td class="tg-0pky">Bangla</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">Banglish</td>
    <td class="tg-dvpl">1875</td>
    <td class="tg-dvpl">375</td>
    <td class="tg-dvpl">250</td>
  </tr>
</tbody>
</table>
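
The per-split counts in the two tables can be cross-checked against the stated total of 32,500 sentences. The following is a minimal Python sketch of that arithmetic. It assumes each table cell counts distinct sentences and that the core and regional tables do not overlap; the names `SPLITS`, `CORE_TYPES`, `REGIONS`, `REGIONAL_TYPES`, and `count_sentences` are illustrative and not part of any official Vashantor tooling.

```python
# Split sizes per (region or core) and language format, as listed in the tables.
SPLITS = {"train": 1875, "test": 375, "validation": 250}

# Language formats in the core data and in each regional subset.
CORE_TYPES = ["Bangla", "Banglish", "English"]
REGIONAL_TYPES = ["Bangla", "Banglish"]
REGIONS = ["Chittagong", "Noakhali", "Sylhet", "Barishal", "Mymensingh"]


def count_sentences():
    """Return sentence counts derived from the table values (assumed non-overlapping)."""
    per_format = sum(SPLITS.values())                            # train + test + validation
    core = per_format * len(CORE_TYPES)                          # core data, three formats
    regional = per_format * len(REGIONS) * len(REGIONAL_TYPES)   # five regions, two formats each
    return {
        "per_format": per_format,
        "core": core,
        "regional": regional,
        "total": core + regional,
    }


counts = count_sentences()
print(counts)
# {'per_format': 2500, 'core': 7500, 'regional': 25000, 'total': 32500}
```

Under these assumptions the core data contributes 7,500 sentences and the regional data 25,000, which matches the 32,500 total given in the description.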