BOOK
Parida_ELRA_2020/IDIAP
OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation
Parida, Shantipriya
Dash, Satya Ranjan
Bojar, Ondrej
Motlicek, Petr
Pattnaik, Priyanka
Mallick, Debasish Kumar
EXTERNAL
https://publications.idiap.ch/attachments/papers/2020/Parida_ELRA_2020.pdf
PUBLIC
https://publications.idiap.ch/index.php/publications/showcite/Parida_Idiap-RR-08-2020
Related documents
Proceedings of the WILDRE5 - 5th Workshop on Indian Language Data: Resources and Evaluation
2020
European Language Resources Association (ELRA)
9 rue des Cordelières, 75013 Paris, France
979-10-95546-67-2
https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/WILDRE-5book.pdf
URL
The preparation of parallel corpora is a challenging task, particularly for languages that suffer from under-representation in the digital world. In a multilingual country like India, the need for such parallel corpora is pressing for several low-resource languages. In this work, we provide an extended English-Odia parallel corpus, OdiEnCorp 2.0, aimed particularly at Neural Machine Translation (NMT) systems that translate English↔Odia. OdiEnCorp 2.0 includes existing English-Odia corpora, and we extended the collection using several other methods of data acquisition: scraping parallel data from many websites, including Odia Wikipedia, and using optical character recognition (OCR) to extract parallel data from scanned images. Our OCR-based data extraction approach for building a parallel corpus is suitable for other low-resource languages that lack online content. The resulting OdiEnCorp 2.0 contains 98,302 sentences with 1.69 million English and 1.47 million Odia tokens. To the best of our knowledge, OdiEnCorp 2.0 is the largest Odia-English parallel corpus covering different domains, and it is freely available for non-commercial and research purposes.
REPORT
Parida_Idiap-RR-08-2020/IDIAP
OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation
Parida, Shantipriya
Dash, Satya Ranjan
Bojar, Ondrej
Motlicek, Petr
Pattnaik, Priyanka
Mallick, Debasish Kumar
EXTERNAL
https://publications.idiap.ch/attachments/reports/2020/Parida_Idiap-RR-08-2020.pdf
PUBLIC
Idiap-RR-08-2020
2020
Idiap
May 2020
In Proceedings of the LREC 2020 WILDRE5 - 5th Workshop on Indian Language Data: Resources and Evaluation