Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Unigram Segmentation

Natural Language Processing | Introduced 2000 | 1 paper

Description

Unigram Segmentation is a subword segmentation algorithm based on a unigram language model. For a given input it can produce multiple candidate segmentations, each with a probability. Because the candidates come from a language model, the noise that arises when actual data is segmented can be emulated.
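To make "multiple segmentations with probabilities" concrete, here is a minimal sketch that enumerates every segmentation of a short string and scores each one as the product of its piece probabilities. The vocabulary `VOCAB`, its probabilities, and the helper names `segmentations` and `scored_segmentations` are all hypothetical, chosen only for illustration. Brute-force enumeration is exponential in the input length, so this is only suitable for toy inputs.

```python
from math import prod

# Hypothetical toy vocabulary: subword -> occurrence probability (values sum to 1).
VOCAB = {
    "un": 0.2, "i": 0.1, "gram": 0.2, "uni": 0.2,
    "g": 0.05, "ram": 0.1, "u": 0.05, "n": 0.1,
}


def segmentations(text, vocab):
    """Yield every way to split `text` into pieces that all belong to `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [piece] + rest


def scored_segmentations(text, vocab):
    """Return (segmentation, probability) pairs, most probable first."""
    scored = [
        (seg, prod(vocab[piece] for piece in seg))
        for seg in segmentations(text, vocab)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for seg, prob in scored_segmentations("unigram", VOCAB):
    print(seg, prob)
```

With these toy probabilities, the top candidate is `['uni', 'gram']` with probability 0.2 × 0.2 = 0.04. Less probable candidates such as `['un', 'i', 'gram']` remain available as alternative segmentations.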

The unigram language model makes an assumption that each subword occurs independently, and consequently, the probability of a subword sequence $\mathbf{x} = (x_1,\ldots,x_M)$ is formulated as the product of the subword occurrence probabilities $p(x_i)$:

$$P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i), \qquad \forall i\,\, x_i \in \mathcal{V},\,\,\, \sum_{x \in \mathcal{V}} p(x) = 1,$$

where $\mathcal{V}$ is a pre-determined vocabulary. The most probable segmentation $\mathbf{x}^*$ for the input sentence $X$ is then given by:

$$\mathbf{x}^{*} = \operatorname*{arg\,max}_{\mathbf{x} \in \mathcal{S}(X)} P(\mathbf{x}),$$

where $\mathcal{S}(X)$ is a set of segmentation candidates built from the input sentence $X$. $\mathbf{x}^*$ is obtained with the Viterbi algorithm.
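The Viterbi search for the most probable segmentation can be sketched as a dynamic program over end positions in the input. This is an illustrative sketch, not a reference implementation. The vocabulary `VOCAB`, its probabilities, and the function name `viterbi_segment` are hypothetical. It works in log space so that the product of probabilities becomes a sum and long inputs do not underflow.

```python
import math

# Hypothetical toy vocabulary: subword -> occurrence probability (values sum to 1).
VOCAB = {
    "un": 0.2, "i": 0.1, "gram": 0.2, "uni": 0.2,
    "g": 0.05, "ram": 0.1, "u": 0.05, "n": 0.1,
}


def viterbi_segment(text, vocab):
    """Return (best_segmentation, log_probability), or (None, -inf) if none exists."""
    n = len(text)
    max_len = max(map(len, vocab))
    # best[j]: highest log-probability of any segmentation of text[:j].
    best = [-math.inf] * (n + 1)
    # back[j]: start index of the last piece in that best segmentation.
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab and best[start] > -math.inf:
                score = best[start] + math.log(vocab[piece])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None, -math.inf  # no segmentation uses only vocabulary pieces
    # Follow the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    pieces.reverse()
    return pieces, best[n]


segmentation, log_prob = viterbi_segment("unigram", VOCAB)
print(segmentation, math.exp(log_prob))
```

With these toy probabilities the search returns `['uni', 'gram']`, whose probability is 0.2 × 0.2 = 0.04. The running time is proportional to the input length times the longest vocabulary entry, rather than to the number of candidates in $\mathcal{S}(X)$.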

Papers Using This Method

Benchmarking Azerbaijani Neural Machine Translation (2022-07-29)