Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

PCFG SET

Probabilistic Context Free Grammar String Edit Task

Texts | MIT licence | Introduced 2019-08-22

The Probabilistic Context Free Grammar String Edit Task (PCFG SET) is a dataset of sequence-to-sequence problems designed to test different aspects of compositional generalisation. In particular, it contains splits that test for systematicity, productivity, substitutivity, localism and overgeneralisation.

The input alphabet of PCFG SET contains three types of words: words for unary and binary functions that represent string edit operations (e.g. append, copy, reverse), elements that form the string sequences these functions can be applied to (e.g. A, B, A1, B1), and a separator (,) between the two arguments of a binary function. The input sequences formed with this alphabet describe how a series of such operations is to be applied to a string argument. For instance:

  • repeat A B C
  • echo remove_first D K , E F
  • append swap F G H , repeat I J

The input sequences are generated with a PCFG whose production probabilities are learned with expectation maximisation (EM) to match the depth and length distributions of a corpus of English sentences.
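As a rough illustration of the generation step, the sketch below samples input sequences from a toy PCFG. The production probabilities, the four-element alphabet, the depth cap and the names sample_argument and sample_sequence are all hypothetical placeholders chosen for this example; they are not the EM-learned values or the grammar used to build the dataset.

```python
import random

# Function words taken from the examples in this description.
UNARY = ["copy", "reverse", "repeat", "echo", "swap"]
BINARY = ["append", "remove_first"]
# A tiny stand-in for the real string alphabet (assumption).
ALPHABET = ["A", "B", "A1", "B1"]

# Hypothetical production probabilities (NOT the EM-learned values).
P_UNARY = 0.3
P_BINARY = 0.2  # the remaining 0.5 expands to a plain string argument
MAX_STRING_LEN = 5  # string arguments are limited to 5 elements


def sample_argument(rng, depth, max_depth):
    """Expand one argument: a unary call, a binary call, or a plain string."""
    r = rng.random()
    if depth >= max_depth or r >= P_UNARY + P_BINARY:
        length = rng.randint(1, MAX_STRING_LEN)
        return [rng.choice(ALPHABET) for _ in range(length)]
    if r < P_UNARY:
        return [rng.choice(UNARY)] + sample_argument(rng, depth + 1, max_depth)
    return (
        [rng.choice(BINARY)]
        + sample_argument(rng, depth + 1, max_depth)
        + [","]
        + sample_argument(rng, depth + 1, max_depth)
    )


def sample_sequence(rng, max_depth=4):
    """Sample one input sequence; the outermost symbol is always a function."""
    if rng.random() < P_UNARY / (P_UNARY + P_BINARY):
        tokens = [rng.choice(UNARY)] + sample_argument(rng, 1, max_depth)
    else:
        tokens = (
            [rng.choice(BINARY)]
            + sample_argument(rng, 1, max_depth)
            + [","]
            + sample_argument(rng, 1, max_depth)
        )
    return " ".join(tokens)


print(sample_sequence(random.Random(0)))
```

Passing a seeded random.Random instance keeps sampling reproducible. Matching the depth and length statistics of a natural-language corpus would require fitting the probabilities, which this sketch does not attempt.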

The output of a PCFG SET sequence, representing its meaning, is constructed by recursively applying the string edit operations specified in the sequence. For instance:

  • repeat A B C → A B C A B C
  • echo remove_first D K , E F → E F F
  • append swap F G H , repeat I J → H G F I J I J
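The recursive interpretation can be sketched as a small prefix-notation evaluator. This is a minimal sketch, not the dataset's reference implementation: it covers only the function words named on this page, and their semantics are inferred from the three examples above (repeat doubles the string, echo repeats the last element, swap exchanges the first and last elements, remove_first discards the first argument, append concatenates). The behaviour of copy and reverse is assumed from their names, and the function name interpret is hypothetical.

```python
# Semantics inferred from the examples; copy and reverse assumed from their names.
UNARY = {
    "copy": lambda s: list(s),
    "reverse": lambda s: s[::-1],
    "repeat": lambda s: s + s,
    "echo": lambda s: s + s[-1:],
    "swap": lambda s: s[-1:] + s[1:-1] + s[:1] if len(s) > 1 else list(s),
}
BINARY = {
    "append": lambda a, b: a + b,
    "remove_first": lambda a, b: list(b),
}


def interpret(sequence):
    """Evaluate a whitespace-tokenised PCFG SET input and return the output string."""
    tokens = sequence.split()
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        if tok in UNARY:
            pos += 1
            return UNARY[tok](parse())
        if tok in BINARY:
            pos += 1
            left = parse()
            if pos >= len(tokens) or tokens[pos] != ",":
                raise ValueError("expected ',' between binary arguments")
            pos += 1
            right = parse()
            return BINARY[tok](left, right)
        # Plain string argument: consume elements up to a separator or function word.
        start = pos
        while (
            pos < len(tokens)
            and tokens[pos] != ","
            and tokens[pos] not in UNARY
            and tokens[pos] not in BINARY
        ):
            pos += 1
        if start == pos:
            raise ValueError("expected a string argument")
        return tokens[start:pos]

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return " ".join(result)


print(interpret("append swap F G H , repeat I J"))
```

Because every function word has a fixed arity and string arguments end at a separator or the end of the input, a single left-to-right recursive pass is enough to recover the nesting without brackets.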

The string alphabet used to construct the dataset has 520 distinct elements, and the length of the string arguments to a function is limited to 5. The dataset contains around 100 thousand examples in total. A full description of the dataset can be found in Hupkes et al. (2020).

Statistics

Papers: 4
Benchmarks: 0

Tasks

Semantic Parsing, Systematic Generalization