NDLI: ParaMed: a parallel corpus for English–Chinese translation in the biomedical domain

Content Provider	Springer Nature : BioMed Central
Author	Liu, Boxiang; Huang, Liang
Abstract	Biomedical language translation requires multi-lingual fluency as well as relevant domain knowledge. Such requirements make it challenging to train qualified translators and costly to generate high-quality translations. Machine translation represents an effective alternative, but accurate machine translation requires large amounts of in-domain data. While such datasets are abundant in general domains, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge, a parallel corpus does not exist for this language pair in the biomedical domain. We developed an effective pipeline to acquire and process an English-Chinese parallel corpus from the New England Journal of Medicine (NEJM). This corpus consists of about 100,000 sentence pairs and 3,000,000 tokens on each side. We showed that training on out-of-domain data and fine-tuning with as few as 4000 NEJM sentence pairs improve translation quality by 25.3 (13.4) BLEU for en $\rightarrow$ zh (zh $\rightarrow$ en) directions. Translation quality continues to improve at a slower pace on larger in-domain data subsets, with a total increase of 33.0 (24.3) BLEU for en $\rightarrow$ zh (zh $\rightarrow$ en) directions on the full dataset. The code and data are available at https://github.com/boxiangliu/ParaMed .
Related Links	https://bmcmedinformdecismak.biomedcentral.com/counter/pdf/10.1186/s12911-021-01621-8.pdf
Ending Page	11
Page Count	11
Starting Page	1
File Format	HTM / HTML
ISSN	14726947
DOI	10.1186/s12911-021-01621-8
Journal	BMC Medical Informatics and Decision Making
Issue Number	1
Volume Number	21
Language	English
Publisher	BioMed Central
Publisher Date	2021-09-06
Access Restriction	Open
Subject Keyword	Health Informatics; Information Systems and Communication Service; Management of Computing and Information Systems; Machine translation; Natural language processing; Text mining
Content Type	Text
Resource Type	Article
Subject	Health Informatics; Computer Science Applications; Health Policy
Journal Impact Factor	3.3/2023
5-Year Journal Impact Factor	3.9/2023

Similar Documents

Semantic biomedical resource discovery: a Natural Language Processing framework

Semantic text mining support for lignocellulose research

Natural language processing data services for healthcare providers

Detecting causality from online psychiatric texts using inter-sentential language patterns

SNOMED CT in a language isolate: an algorithm for a semiautomatic translation

Facilitating accurate health provider directories using natural language processing

A systematic review of natural language processing applied to radiology reports

A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish

ClotCatcher: a novel natural language model to accurately adjudicate venous thromboembolism from radiology reports

ParaMed: a parallel corpus for English–Chinese translation in the biomedical domain