Unigram Segmentation

Description

Unigram Segmentation is a subword segmentation algorithm based on a unigram language model. For a given input, it can produce multiple candidate segmentations, each with an associated probability. The language model makes it possible to emulate the noise generated during the segmentation of actual data.
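To make "multiple segmentations with probabilities" concrete, here is a minimal sketch that enumerates every segmentation of a short string and samples one of them. It scores each candidate by the product of its subword probabilities, as the unigram model prescribes. The vocabulary `TOY_VOCAB`, its probabilities, and all function names are hypothetical, chosen only for illustration, and renormalizing the scores over the enumerated candidates is an assumption of this sketch rather than something stated here.

```python
import math
import random

# Hypothetical toy vocabulary; the probabilities are made up and sum to 1.
TOY_VOCAB = {
    "unigram": 0.05, "uni": 0.20, "gram": 0.15, "un": 0.10, "ram": 0.10,
    "m": 0.10, "u": 0.05, "n": 0.05, "i": 0.05, "g": 0.05, "r": 0.05, "a": 0.05,
}


def enumerate_segmentations(text, vocab):
    """Yield every way to split `text` into subwords that are all in `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in enumerate_segmentations(text[end:], vocab):
                yield [piece] + rest


def segmentation_distribution(text, vocab):
    """Return (segmentation, probability) pairs, most probable first.

    Each candidate is scored by the product of its subword probabilities;
    scores are then renormalized over the candidates (an assumption here).
    """
    candidates = list(enumerate_segmentations(text, vocab))
    if not candidates:
        raise ValueError("text cannot be segmented with this vocabulary")
    scores = [math.prod(vocab[p] for p in seg) for seg in candidates]
    total = sum(scores)
    pairs = [(seg, score / total) for seg, score in zip(candidates, scores)]
    return sorted(pairs, key=lambda pair: -pair[1])


def sample_segmentation(text, vocab, rng=random):
    """Draw one segmentation at random, in proportion to its probability."""
    segmentations, probabilities = zip(*segmentation_distribution(text, vocab))
    return rng.choices(segmentations, weights=probabilities, k=1)[0]


if __name__ == "__main__":
    for seg, prob in segmentation_distribution("unigram", TOY_VOCAB)[:3]:
        print(seg, round(prob, 4))
    print(sample_segmentation("unigram", TOY_VOCAB))
```

Exhaustive enumeration grows exponentially with the input length, so this is only practical for toy inputs.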

The unigram language model assumes that each subword occurs independently. Consequently, the probability of a subword sequence $\mathbf{x} = (x_1,\ldots,x_M)$ is the product of the subword occurrence probabilities $p(x_i)$:

$$P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i), \qquad \forall i\;\, x_i \in \mathcal{V}, \quad \sum_{x \in \mathcal{V}} p(x) = 1,$$

where $\mathcal{V}$ is a pre-determined vocabulary. The most probable segmentation $\mathbf{x}^*$ for the input sentence $X$ is then given by:

$$\mathbf{x}^{*} = \operatorname*{arg\,max}_{\mathbf{x} \in \mathcal{S}(X)} P(\mathbf{x}),$$

where $\mathcal{S}(X)$ is the set of segmentation candidates built from the input sentence $X$. The most probable segmentation $\mathbf{x}^*$ is obtained with the Viterbi algorithm.
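The Viterbi search for $\mathbf{x}^*$ can be sketched as a dynamic program over character positions. This is a minimal illustration under stated assumptions, not a reference implementation: `TOY_VOCAB`, its probabilities, and the function name `viterbi_segment` are hypothetical. Working in log space turns the product $P(\mathbf{x})$ into a sum and avoids numerical underflow on long inputs.

```python
import math

# Hypothetical toy vocabulary; the probabilities are made up and sum to 1.
TOY_VOCAB = {
    "unigram": 0.05, "uni": 0.20, "gram": 0.15, "un": 0.10, "ram": 0.10,
    "m": 0.10, "u": 0.05, "n": 0.05, "i": 0.05, "g": 0.05, "r": 0.05, "a": 0.05,
}


def viterbi_segment(text, vocab):
    """Return (best segmentation, its probability P(x)) for `text`.

    best_score[k] holds the highest log-probability of any segmentation of
    text[:k]; back[k] holds the start index of the last subword on that path.
    """
    n = len(text)
    max_len = max(len(piece) for piece in vocab)
    best_score = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best_score[0] = 0.0  # log(1): the empty prefix has probability 1

    for end in range(1, n + 1):
        # Only consider subwords no longer than the longest vocabulary entry.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in vocab or best_score[start] == -math.inf:
                continue
            score = best_score[start] + math.log(vocab[piece])
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start

    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Follow the back-pointers from the end of the text to recover x*.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    pieces.reverse()
    return pieces, math.exp(best_score[n])


if __name__ == "__main__":
    print(viterbi_segment("ungram", TOY_VOCAB))
```

With this toy vocabulary, `"ungram"` has no single-token match, so the search has to compare paths such as `un + gram` against `u + n + gram` and `un + g + ram`. The running time is proportional to the input length times the longest subword length, rather than to the number of candidates in $\mathcal{S}(X)$.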
