Improving Machine Translation of Educational Content via Crowdsourcing
Federico Gaspari
2018-01-01
Abstract
The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collections is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data, bearing in mind that the translations may be of lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of texts from MOOCs from English into eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. Our results indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over both general-domain baseline systems and systems fine-tuned with pre-existing in-domain corpora.