Postediting dataset

This page describes the post-editing dataset we developed and explains how to obtain it.

Characteristics of the dataset

The dataset contains 260 pairs of sentences extracted from the English-Spanish part of the Autodesk Post-Editing Data corpus. The sentences were chosen so that together they contain approximately 3,000 words. This number represents a day’s work for the average professional translator and allowed us to emulate a real-world setting by asking the translators to complete the task in one day. In addition, the sentences were selected so that we can investigate four different scenarios:

  • translators are not provided with any automatic translation and have to translate from scratch
  • an automatic translation is provided, but no information about the quality of the translation is given to the translator. In this scenario, the translator has to decide whether to post-edit the given sentence or translate from scratch
  • an automatic translation is provided and the translator is informed that the quality of the translation is poor. In this scenario, the translator is advised to translate from scratch, but can decide to post-edit the given translation
  • an automatic translation is provided and the translator is informed that the quality of the translation is good. In this scenario, the translator is advised to post-edit the given translation

The selected sentences were equally distributed between these categories.

The experiment

For our experiments we enlisted the help of four professional Spanish translators with several years’ translating experience. All four translators had some experience with post-editing tools. They were asked to use the post-editing tool PET, which records all the operations performed by the translators.

The dataset is available as four XML files, which record the operations performed by the translators. The example below shows the format used to record these operations.

<unit Translate="From Scratch!" id="0BadQE" status="FINISHED" type="pe">
  <S producer="QETranslate">Exports AutoCAD Civil 3D data to an IMX file that can be imported into Autodesk Infrastructure Modeler</S>
  <R producer="QETranslate">Exporta datos de AutoCAD Civil 3D a un archivo IMX que puede importarse a Autodesk Infrastructure Modeler</R>
  <MT producer="QETranslate">Exporta datos de AutoCAD Civil 3D a un archivo IMX que se puede importar en Autodesk Infrastructure Modeler</MT>
  <annotations revisions="1">
    <annotation r="1">
      <PE producer="A1.QETranslate">Exporta datos de AutoCAD Civil 3D a un archivo IMX que se puede importar a  Autodesk Infrastructure Modeler</PE>
      <indicator id="unchanged" type="flag">false</indicator>
      <indicator elapsed=",0" id="assignment" length="108" offset="0" t0=",0" type="change">
Exporta datos de AutoCAD Civil 3D a un archivo IMX que se puede importar en Autodesk Infrastructure Modeler</indicator>
      <indicator id="substitution" type="wrap">
        <action elapsed=",0" id="deletion" length="2" offset="74" t0=",0" type="change">en</action>
        <action elapsed=",0" id="insertion" length="1" offset="74" t0=",0" type="change">a</action>
      </indicator>
      <indicator id="editing" type="time">40s,76</indicator>
      <indicator id="letter-keys" type="count">1</indicator>
      <indicator id="digit-keys" type="count">0</indicator>
      <indicator id="white-keys" type="count">0</indicator>
      <indicator id="symbol-keys" type="count">0</indicator>
      <indicator id="navigation-keys" type="count">1</indicator>
      <indicator id="erase-keys" type="count">2</indicator>
      <indicator id="copy-keys" type="count">0</indicator>
      <indicator id="cut-keys" type="count">0</indicator>
      <indicator id="paste-keys" type="count">0</indicator>
      <indicator id="do-keys" type="count">0</indicator>
      <events>
        <flow t="0">EDITING_START</flow>
        <change offset="0" t="30">
          <in>
Exporta datos de AutoCAD Civil 3D a un archivo IMX que se puede importar en Autodesk Infrastructure Modeler</in>
        </change>
        <cursor dot="0" mark="0" t="140"/>
        <cursor dot="76" mark="76" t="24569"/>
        <change offset="75" t="25817">
          <out>n</out>
        </change>
        <cursor dot="75" mark="75" t="25817"/>
        <command offset="75" t="25817">BACKSPACE</command>
        <change offset="74" t="26051">
          <out>e</out>
        </change>
        <cursor dot="74" mark="74" t="26051"/>
        <command offset="74" t="26051">BACKSPACE</command>
        <keystroke offset="74" t="26504">a</keystroke>
        <change offset="74" t="26504">
          <in>a</in>
        </change>
        <cursor dot="75" mark="75" t="26504"/>
        <navigation offset="75" t="27596">RIGHT</navigation>
        <cursor dot="76" mark="76" t="27596"/>
        <flow t="40076">EDITING_END</flow>
        <flow t="40076">EDITING_END</flow>
      </events>
    </annotation>
  </annotations>
</unit>
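
As a rough illustration of how a unit like the one above could be processed, here is a minimal Python sketch using only the standard library. It is not part of PET or of the dataset. The function names (key_counts, replay_changes, editing_span) and the trimmed-down SAMPLE unit are hypothetical and exist only for this example. The sketch assumes that <in> inserts text at the given offset, that <out> removes text at the given offset, and that the t attributes are milliseconds; these readings are inferred from the example above, not taken from PET documentation.

```python
import xml.etree.ElementTree as ET

# Trimmed-down unit in the same shape as the example above (illustrative data only).
SAMPLE = """<unit id="demo" status="FINISHED" type="pe">
  <annotations revisions="1">
    <annotation r="1">
      <indicator id="letter-keys" type="count">1</indicator>
      <indicator id="erase-keys" type="count">2</indicator>
      <events>
        <flow t="0">EDITING_START</flow>
        <change offset="0" t="30"><in>importar en Autodesk</in></change>
        <change offset="10" t="25817"><out>n</out></change>
        <change offset="9" t="26051"><out>e</out></change>
        <change offset="9" t="26504"><in>a</in></change>
        <flow t="40076">EDITING_END</flow>
      </events>
    </annotation>
  </annotations>
</unit>"""


def key_counts(annotation):
    """Return {indicator id: int} for every indicator with type="count"."""
    return {
        ind.get("id"): int(ind.text)
        for ind in annotation.findall("indicator")
        if ind.get("type") == "count"
    }


def replay_changes(annotation):
    """Rebuild the edited text by applying <change> events in document order.

    Assumption: <in> inserts its text at `offset`, <out> deletes its text
    starting at `offset`. Whitespace is kept as-is, because the offsets in
    the example appear to count the line break inside <in>.
    """
    text = ""
    for change in annotation.findall("events/change"):
        offset = int(change.get("offset"))
        for part in change:
            piece = part.text or ""
            if part.tag == "in":
                text = text[:offset] + piece + text[offset:]
            elif part.tag == "out":
                text = text[:offset] + text[offset + len(piece):]
    return text


def editing_span(annotation):
    """Span between the first and last <flow> timestamps (assumed to be ms)."""
    times = [int(flow.get("t")) for flow in annotation.findall("events/flow")]
    return max(times) - min(times) if times else 0


unit = ET.fromstring(SAMPLE)
annotation = unit.find("annotations/annotation")
print(key_counts(annotation))
print(replay_changes(annotation))
print(editing_span(annotation))
```

For one of the real files, iterating with ET.parse(path).getroot().iter("unit") should reach every unit whatever the enclosing element is called, since the sketch makes no assumption about the root element.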

Obtaining the dataset

To obtain the dataset, please fill in the following form.

See a translator in action

We are currently developing an interface for observing the operations carried out by our translators. You can see an early demo at http://dinel.org.uk/demos/postedit/