Improving Machine Translation of Educational Content via Crowdsourcing

Federico Gaspari
2018-01-01

Abstract

The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collections is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data, bearing in mind that the translations can be of lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of texts from MOOCs from English into eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. The results of our research indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over both general-domain baseline systems and systems fine-tuned with pre-existing in-domain corpora.
Year: 2018
ISBN: 9791095546009
Keywords: MOOCs; neural machine translation; crowdsourcing

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12078/27307