PHRASE-BASED MACHINE TRANSLATION SYSTEM
for Breast Cancer Pathology Domain
Johanna Johnsi Rani G (a), Gladis D (b), Joy John Mammen (c), Manikandan Kabali (d) *
(a) Department of Computer Science, Madras Christian College, Chennai 600 059, India
(b) Department of Computer Science, Presidency College, Chennai 600 005, India
(c) Department of Immunohematology & Transfusion Medicine, Christian Medical College, Vellore 632 004, India
(d) Department of Computer Science, Madras Christian College, Chennai 600 059, India
Abstract
The proposed machine translation system from English to Tamil is for the Medical domain, female Breast cancer
Pathology in particular. The system translates textual content in a breast cancer pathology report from English to Tamil.
Medical documentation and reporting are mostly done in English by Medical professionals. While English is
an easy medium of communication between Medical practitioners, Clinicians and Pharmacists, the proposed system is
developed to cater to the need of the patient to understand her medical condition in her regional language. The
available Medical resources in Tamil cover a broad spectrum of Medicine but are scarce and inadequate for the domain
considered. Hence pre-processing Natural Language Processing steps and online selection methods are applied to build a
Lexicon of breast cancer pathology terms from multiple resources such as breast cancer pathology reports and protocols.
Through an intensive manual process, a bilingual dictionary is subsequently built by translating each term in the Lexicon
from English to Tamil. The Bilingual dictionary is used for machine translation using Phrase-based translation approach.
The translated breast cancer pathology reports are evaluated by Linguistic experts through human judgment and by
comparing them with the Google Translate output. The translation process by the proposed system performs better than
Google Translate using domain-specific resources and applying phrase-based translation approach by %.
* Corresponding author. Tel.: +91-9940259515; fax: -
E-mail address: [email protected].
International Journal of Pure and Applied Mathematics, Volume 119 No. 18 2018, 2071-2082. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/ Special Issue
1. Introduction
India has a multilingual population, and Tamil is an ancient language spoken by the natives of Tamil
Nadu. Most of the communication among Medical professionals is through the medium of English, which is
highly suited for exchange of information between Medical professionals namely Doctors, Clinical personnel
and Pharmacists. But only a small percentage of the local population can read and understand the details of a
Medical report written in English. Breast cancer is the predominant life-threatening disease among women in
India. The proposed work aims to build resources in Breast cancer pathology that can be used in the
translation of medical reports from English to Tamil. Tamil language is rich enough to adequately serve as a
medium of communication across all domains. Research efforts have been minimal until now with respect to
translation of Medical terms from English to Tamil and the proposed system is an effort to initiate work in this
area. Machine translation of reports handed over to patients into their regional language would enable
them to be aware of their Medical condition. With more such works in the future, the Medical domain would
also have a wealth of Medical terminological resources for various diseases.
A Breast cancer pathology report consists of textual content and hence Natural Language Processing (NLP)
techniques are applied before the machine translation is performed. NLP is defined as the automatic
manipulation of natural language, such as speech and text, by software. Machine Translation refers to the application
of a computer system to translate text or speech from one language to another. The source language for the
translation system is English and the target Language is Tamil. Existing Medical Dictionary / Thesauri in
Tamil include a wide spectrum of medical terms and are not adequate to translate reports in a particular
disease domain. Hence as preprocessing steps to translation, we collate resources for the Breast Pathology
domain through manual compilation, using existing resources. The resources generated are a Breast Pathology
Lexicon and a Bilingual Dictionary from English to Tamil for the terms in the Lexicon. The manual
compilation has been done in consultation with Language experts in the target language. The resources are
subsequently used in the machine translation process. The translation output is analyzed and evaluated through
manual scrutiny and comparison with the translation of the same content produced by Google Translate.
The source language English is widely spoken across the globe and serves as a common communication
medium across borders. In the Medical domain, reports are usually written in English, as the language has
adequate resources in the field and is better understood by Medical practitioners. Some of the Medical
resources in English are the “Dictionary of Medical terms” by A & C Black, London, “Merriam-Webster
Medical Dictionary”, “Oxford Medical Dictionary” and “Taber’s Cyclopaedic Medical Dictionary”. Besides
these, there are many online Medical resources such as Glossaries. These resources have terms associated
with all diseases and all medical conditions, while we focus only on Breast Cancer Pathology. To mention a
few characteristic features of the two languages, English and Tamil: English is syntactically a Subject-Verb-Object
(SVO) language in which the word order is rigid and fixed. In complex sentences, the Subordinate
clause follows the Main clause. Modern English is analytic and has Prepositions. Tense and Time in English
are indicated by Auxiliaries and are always placed before the Main verbs. In interrogative sentences,
Auxiliaries are shifted to the front position. Adjectives and Noun qualifiers always precede the Nouns they
qualify. The characteristic features of the Tamil language are as follows. Tamil is a Subject-Object-Verb (SOV)
language and, since it allows flexibility in word order, it can be called a “word-order free” language. Also, it is
not mandatory that a sentence have a Subject, Object and Verb. A single word in Tamil can show the
tense, the action performed and the gender.
Translation of text from English to Tamil is highly challenging because of the flexibility of the Tamil language.
Furthermore, handling the translation of gender-specific statements and ambiguous words is more complex than in
English. The proposed work escapes the challenge of gender-specific translations because the dataset used by
the proposed system consists of deidentified medical reports, and the reports contain statements
written about the disease and its associated conditions rather than gender-specific mentions of the patient. The
complexity of translation due to ambiguous words also need not be addressed here, as the textual content has
standard medical terms and phrases which are locally and globally used by Pathologists. In this context, an
essential resource for translation from English to Tamil is the Parallel corpora for the Breast Cancer Pathology
domain, which is generated through manual and automated efforts, as a pre-processing step to the actual
translation task. In generating a parallel corpus for the domain, we came across a few Medical resources in
Tamil. They are the “English-Tamil Glossary of Medicine” by Tamil Virtual University, a “Medical dictionary”
translated from English to Tamil and Telugu, compiled by D. Rambabu and translated along with V.V.
Rathnasree, and the “Thatcha Naayanaar Tamil Medical Dictionary”, which is a compilation of Ayurvedha
and Siddha medicine [7,8,9,10]. Since the above-listed
Tamil Medical resources were not adequate to use for a specific domain namely Breast cancer pathology
domain, we create a Lexicon and Bilingual Dictionary for the domain. In the process, a few online resources
were referred to either directly derive Tamil word-equivalents to the English terms or coin Tamil phrases for
their English equivalents. We then apply phrase-based approach to translation of breast cancer pathology
reports from English to Tamil with the resources generated. The automated machine translation enables us to
translate the source language to the target language without human intervention, because the translation
process uses relevant domain resources. [14]
The remaining content of this paper is organized as follows. Section 2 explains related work in translating
Textual content from English to Tamil. Section 3 gives the details of implementation of Phrase-based machine
translation and Section 4 explains the evaluation of the translated content and the results obtained. Section 5
presents the conclusions with scope for future work.
2. Related Work
Translations from English to Tamil were earlier performed manually by Linguistic experts in the languages,
but in recent times Machine translation to the Tamil language has been widely carried out by many researchers.
Machine Translation systems have been developed for Indian Languages since the early 1990s, a few of which are
AnglaBharti (1991), Anusaaraka (1995), Mantra (1999), Matra (2004), AnuBharti (2004), Shiva and Shakti
(2004), Anubaad (2004) and Sampark (2009). Anusaaraka translates children’s stories. Mantra was developed
for use in the Rajya Sabha Secretariat. Matra converts complex English sentences into simpler sentences and
translates the content into Hindi. AnglaBharti is a general-purpose machine translation system with provision
for domain customization. The Shiva and Shakti machine translation system caters to translation requirements in
Hindi, Marathi and Telugu [14].
Thenmozhi D and Aravindan C proposed a Statistical machine translation system that translates textual
content in the Agriculture domain from Tamil to English. Poornima C, Dhanalakshmi V, et al. proposed a
preprocessing tool that converts complex sentences to simple sentences in English, using a rule-based technique,
before translation to the target language Tamil [13]. R. Harshwardhan et al. proposed a framework for Phrase-based
translation from English to Tamil using Translation memory and concept labeling [6]. Dhanesh N
proposed a Conceptual Framework for an Automated English to Tamil rule-based machine translation system
[5]. S. Saraswathi et al. proposed a Bilingual Translation System for English and Tamil using a Hybrid approach
that uses Rule based Machine Translation and Knowledge based Machine Translation [15]. The works listed
here are in diverse fields but not specific to the Medical domain. The proposed work hence aims to apply machine
translation to a disease domain, namely Breast cancer, by generating the required linguistic resources and
translating with the phrase-based approach. The Breast cancer Pathology reports do not have gender-specific
statements. As proposed by S. Suganthi et al., our system also handles prepositions such as ‘of’, ‘in’,
‘to’, ‘on’, ‘by’, ‘from’ that occur in the reports [16]. The semantic rules derived by a manual process are based
on parts of speech tags at the sentence-level.
3. Machine Translation Implementation
Machine translation (MT) is automatic translation from one language to another using a computer. MT is
classified into major categories, namely Rule-based machine translation (RBMT), Data-driven machine
translation (DDMT) and Hybrid models. Rule-based models depend heavily on language theory and
require numerous man-hours to build rules for translation from one language to another. Once this initial
effort is made, rule-based MT systems are easy to maintain and extend to other languages. Rule-based translation can
be direct or indirect and is suitable for languages for which there are few or no parallel corpora. The Data-driven
model, also known as corpus-based translation, makes use of bilingual parallel corpora. The model has
two major approaches, namely Statistical machine translation (SMT) and Example-based machine translation
(EBMT). The Statistical machine translation approach chooses the string with the highest probability to
translate text from a source language to the target language. This approach requires a parallel corpus between
the source and the target languages at the sentence level. The Example-based approach is a data-driven approach
that translates by analogy (similarity in meaning and form) with examples from a database [4]. The Hybrid
model has the merits of both RBMT and DDMT; it is a combination of two or more translation approaches
and can be DDMT-guided or RBMT-guided.
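The statistical approach described above, which chooses the most probable target string, is commonly written in the noisy-channel form. The following is a sketch of that standard formulation; the symbols $s$ (source sentence), $t$ (candidate target sentence) and $\hat{t}$ (chosen translation) are notation introduced here for illustration and do not appear in the original text.

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid s)
        \;=\; \arg\max_{t} \frac{P(s \mid t)\, P(t)}{P(s)}
        \;=\; \arg\max_{t} P(s \mid t)\, P(t)
```

The denominator $P(s)$ can be dropped because it does not depend on $t$. In this formulation $P(s \mid t)$ is the translation model, typically estimated from a sentence-aligned parallel corpus, and $P(t)$ is a language model of the target language.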
Medical translation refers to translation of Pathology reports, Laboratory reports, Clinical reports, Scan
reports etc. [12]. The machine translation of Breast cancer pathology reports performed by the system is a
data-driven, Phrase-based translation approach. The proposed work used 150 Breast Cancer Pathology reports
obtained from a renowned hospital in Tamil Nadu, South India. Each report has five sections, namely
Specimen, Clinical, Gross, Micro and Impression. The Impression section is the summary portion of the
report, used for cancer staging. Hence, the initial Phrase-based translation is limited to the contents of the
Impression section alone. As is typical of a data-driven approach, the medical resources for translation, namely the
Breast Cancer Pathology Lexicon (BCP-LEX) and the Breast Cancer Pathology Bilingual Dictionary (BCP-BLD),
are generated first, through an intensive manual process.
3.1. Lexicon Building
A Lexicon of Breast Cancer Pathology terms (BCP-LEX) is generated by applying three approaches. The
first approach applies NLP steps to the dataset namely the Breast Cancer Pathology reports and the American
Joint Committee on Cancer (AJCC) protocol to collate the unigram Medical terms [1]. The second approach
is a collection of Breast cancer pathology Glossary terms extracted from an online resource. In addition to the
two, the system allows the user to select Medical Phrases from the dataset through online selection. The
Lexicon generated thus consists of a total of 2083 terms belonging to the domain and serves as the source to
build the bilingual dictionary. Table 1 gives the distribution of Lexicon terms generated using the three
approaches. After elimination of duplicates, the Lexicon BCP-LEX has 1124 terms.
Table 1. Breast Cancer Pathology Lexicon (BCP-LEX)
Source                                      No. of Terms
Dataset – Pathology reports (Using NLP)     1541
AJCC-TNM C Classification (Using NLP)       125
Glossary of Pathological terms              417
Total                                       2083
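The merge-and-deduplicate step that reduces the combined term lists to a single Lexicon can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalize` and `build_lexicon`, the case-insensitive matching rule, and the toy term lists are all hypothetical.

```python
def normalize(term: str) -> str:
    """Lower-case a term and collapse internal whitespace."""
    return " ".join(term.lower().split())


def build_lexicon(*sources: list[str]) -> list[str]:
    """Merge term lists from several sources, dropping duplicates.

    The first spelling seen for each normalized term is kept, and the
    order of first appearance is preserved.
    """
    seen: dict[str, str] = {}
    for source in sources:
        for term in source:
            key = normalize(term)
            if key and key not in seen:
                seen[key] = " ".join(term.split())
    return list(seen.values())


# Toy inputs (hypothetical, not the paper's data).
report_terms = ["Tumour", "carcinoma", "Margin", "tumour"]
ajcc_terms = ["Carcinoma", "pT", "pN"]
glossary_terms = ["margin", "Microcalcification"]

lexicon = build_lexicon(report_terms, ajcc_terms, glossary_terms)
total_before = len(report_terms) + len(ajcc_terms) + len(glossary_terms)
print(total_before, len(lexicon))  # term count before and after deduplication
print(lexicon)
```

Keeping the first spelling seen is one possible design choice; a real pipeline might instead prefer the spelling from the most authoritative source.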
3.2. Bilingual Dictionary
Tamil Dictionary and Thesauri of Medical terms are rare and online resources for Breast Cancer Pathology
are unavailable. The terms in the Breast Pathology Lexicon can be categorized into i. Translated words
(unigram), ii. Translated phrases (bigrams or more), iii. Transliterated items and iv. Non-translatable items.
Through an intensive manual process, the Medical terms in the Lexicon are translated into Tamil using
Google Translate and a few other online Tamil dictionaries [17, 18] and iteratively checked for correctness
with Tamil language experts. Transliteration is the conversion of a text from one script to another,
representing words from one language using phonetic or spelling equivalents of another language [11].
Medical terms that are not translatable without loss of meaning are transliterated. A small subset of terms
cannot be translated into Tamil, as they are standard Medical indicators. For example, Pathological
classification pTNM is manually derived from the contents of the pathology report, in which pT represents the
pathological classification of Tumour, pN represents the pathological classification of Lymph Node and pM
represents the pathological classification of Distant Metastasis. pT, pN and pM are grouped to determine the
stage of cancer of the patient. The pTNM classification and cancer stage representations cannot be translated,
in order to retain their Medical significance in the text.
Table 2 lists a sample of translated words, translated phrases, transliterated terms and non-translatable
terms in the domain, in both the source language English and the target language Tamil. The system provides
an editor for the Language experts in Tamil and Medical domain experts to check the preciseness of the
medical terms in the target language. The Language experts have so far validated a total of 1124
terms relating to the domain, of which 891 are Translated terms, 46 are Non-translatable terms, 36 are
Abbreviations and 151 are Transliterated terms. Further evaluation of the existing bilingual dictionary by
Medical experts would make the Breast Cancer Pathology domain-based Lexicon and Bilingual
Dictionary a precise and valuable resource for translation of documents in the domain.
Table 2. Sample Terms in the Bilingual Dictionary for Breast Cancer Pathology Domain
Category                  English                            Tamil
Translated Terms          Tumour / Tumor (unigram)
                          Cancer / Carcinoma (unigram)
                          Adjacent breast tissue (phrase)
                          Microcalcification (phrase)
Transliterated Terms      Progesterone
                          Amphophilic cytoplasm
Non-translatable Terms    Tis (DCIS)                         Tis (DCIS)
                          T1b                                T1b
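One way to represent the bilingual dictionary entries and their categories is sketched below. This is an assumption-laden illustration, not the system's actual data model: the names `Category`, `Entry`, `make_entry` and `bcp_bld` are hypothetical, and the `<ta:...>` strings are placeholders standing in for the real Tamil text.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TRANSLATED = "translated"
    TRANSLITERATED = "transliterated"
    NON_TRANSLATABLE = "non-translatable"
    ABBREVIATION = "abbreviation"


@dataclass(frozen=True)
class Entry:
    english: str
    tamil: str  # target-language form; placeholder strings in this sketch
    category: Category


def make_entry(english: str, category: Category, tamil: str = "") -> Entry:
    """Create an entry; non-translatable terms keep their English form."""
    if category is Category.NON_TRANSLATABLE:
        tamil = english
    return Entry(english, tamil, category)


# Placeholder Tamil values: "<ta:...>" stands in for the real Tamil text.
entries = [
    make_entry("Tumour", Category.TRANSLATED, "<ta:tumour>"),
    make_entry("Adjacent breast tissue", Category.TRANSLATED, "<ta:adjacent breast tissue>"),
    make_entry("Progesterone", Category.TRANSLITERATED, "<ta:progesterone>"),
    make_entry("Tis (DCIS)", Category.NON_TRANSLATABLE),
    make_entry("T1b", Category.NON_TRANSLATABLE),
]

# Case-insensitive lookup table keyed on the English term.
bcp_bld = {e.english.lower(): e for e in entries}

counts = Counter(e.category for e in entries)
print(bcp_bld["t1b"].tamil)  # a non-translatable term is returned unchanged
print(counts[Category.TRANSLATED])
```

Storing the category with each entry lets the translation step pass staging codes such as pTNM values through untouched while still reporting per-category counts.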
3.3. Translation approach
The manual preprocessing steps, namely building a Lexicon of 1124 breast cancer pathology terms from
multiple sources relating to the domain and building a Bilingual dictionary, provide the baseline resources
required for the translation process. The machine translation process requires a few more preprocessing steps to be
applied on the source text in English namely, Section Segregation, Sentence splitting, and Phrase splitting.
The Section Segregation divides the report into its constituent sections and separates the Impression section
for translation. The translation process is applied on individual sentences; hence the sentence splitting is
performed next. The sentences are POS-tagged and stored, after which the phrases in a sentence are split for
the translation process. The Penn Treebank tagset, which contains 36 tags, is used for tagging. A word in one language
may have an equivalent translation in another language with multiple words. Since Medical terms include phrases
or groups of words which cannot be split without losing their medical meaning, chunking is performed
before translation. For example, the term “Ductal carcinoma in situ” is more meaningful as a phrase than as
individual word components. Case-sensitivity is handled at the preprocessing stage to avoid errors in
Dictionary lookup. Abbreviations in the text are directly translated to their Tamil equivalents using the
Bilingual dictionary, while the non-translatable terms appear without changes in the translated output. The
algorithm given below shows the steps in the translation process.
Algorithm
1. Read input text in SL (source language)
2. Preprocess the text and store in database D
3. For report 1 to n
4.    For each sentence S
         Apply POS tagging and store in D
5.       Split into phrases
6.       For each phrase P
            Apply chunking
7.          For each term n to 1, in the chunk list
               If chunk-term ∈ Bilingual Dictionary
                  Replace chunk-term with target language term;
                  Go to 6
               Endif
            End
8. Read the precorrected translated output in TL (target language)
9. Apply morphological rules
10. Print corrected translated text
Fig. 1. Phrase-based Translation from English to Tamil
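The dictionary-lookup core of the algorithm above can be sketched in Python as a longest-match replacement over each sentence. This is a simplified illustration under stated assumptions rather than the system's code: POS tagging, chunking and the morphological rules are omitted, the regular-expression sentence splitter is naive, the names `BILINGUAL`, `split_sentences` and `translate_sentence` are hypothetical, and the `<ta:...>` strings are placeholders for Tamil text.

```python
import re

# Hypothetical mini-dictionary; "<ta:...>" marks placeholder Tamil text.
BILINGUAL = {
    "ductal carcinoma in situ": "<ta:ductal carcinoma in situ>",
    "carcinoma": "<ta:carcinoma>",
    "tumour": "<ta:tumour>",
    "deep resection margin": "<ta:deep resection margin>",
    "cm": "<ta:cm>",
}
MAX_PHRASE_LEN = max(len(k.split()) for k in BILINGUAL)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def translate_sentence(sentence: str) -> str:
    """Replace the longest dictionary phrase found at each position."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first (term n down to 1).
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in BILINGUAL:
                out.append(BILINGUAL[phrase])
                i += n
                break
        else:
            out.append(tokens[i])  # no entry: keep the source token
            i += 1
    return " ".join(out)


impression = "Ductal carcinoma in situ. Tumour is 1 cm from the deep resection margin."
for sent in split_sentences(impression):
    print(translate_sentence(sent))
```

Trying the longest phrase first keeps multi-word medical terms such as "Ductal carcinoma in situ" intact instead of translating "carcinoma" on its own. Function words without a dictionary entry (here "is", "from", "the") are left in place, which is the kind of residue a later rule-based correction step would need to handle.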
The initial translated output is not perfectly correct syntactically. Morphological analysis takes care of
the inflected forms of a verb or noun [5]. Anand Kumar M et al. note the difference between English and
Tamil: while English has a fixed word order, Tamil has a rich morphological structure and is a free word-order
language [2]. Hence, Morphological analysis is performed on the output to obtain a correct translation.
Rule based translation applies grammar rules to the first level translation and reorders the sentences using
linguistic information in the source and target languages [11]. The proposed system applies rules that are
derived using a data-driven approach on the corpus. The POS tagged report content was manually scrutinized
several times to derive the rules. The Morphological rules provide corrected translation output that adheres to
grammatical rules in Tamil language. An example of such a corrected output is given below.
English Sentence: Tumour is 1 cm from the deep resection margin.
Translation output before Morphological Analysis: is 1 . from the .
Final translation output after Morphological Analysis: 1 . .
Fig. 2 shows the translation output of the Impression section of a breast cancer pathology report.
Fig. 2. Impression Section Translation from English to Tamil
4. Evaluation of the translated output
The evaluation is performed using a human evaluation approach for the Bilingual Dictionary and the
translation output. In the first level, using an editor, the resources and the translation output are evaluated by a
Linguistic expert in both English and Tamil languages. The Linguistic expert checked the correctness of
translation of each term in the Bilingual Dictionary and also each sentence in the Impression section and
graded them as Good (G), Partially correct (PC) or Poor (P).
In evaluating the Bilingual Dictionary terms, the criterion for grading was twofold. Firstly, the Tamil
equivalent for the breast cancer pathology term must represent the domain without loss or deviation in
meaning, and secondly the coinage of phrases must be syntactically and semantically correct from the target
language perspective. With the above criteria, the experts graded every term and its Tamil equivalent in the
Bilingual Dictionary. Out of the 1124 terms identified so far, 1102 terms were graded ‘Good’, 7 were graded
‘Partially correct’ and 15 terms were graded ‘Poor’. In the same way, the translation output of the
proposed system was also evaluated by the Language expert. Out of the 150 breast cancer pathology report
translations, 142 translations were graded to be ‘Good’, 5 were graded to be ‘Partially correct’ and 3 were
graded to be ‘Poor’. The evaluation results are graphically summarized and shown in Fig. 3 (a) and Fig. 3 (b).
To fully authenticate the translation, the translation will be graded by the Medical experts.
Fig. 3 (a) Evaluation of Bilingual Dictionary terms; Fig. 3 (b). Evaluation of Translation output
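The grade counts reported above can be converted to percentage shares with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the helper name `grade_shares` is hypothetical.

```python
def grade_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each grade's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {grade: round(100 * n / total, 2) for grade, n in counts.items()}


# Counts as reported for the dictionary terms and the report translations.
dictionary_grades = {"Good": 1102, "Partially correct": 7, "Poor": 15}
report_grades = {"Good": 142, "Partially correct": 5, "Poor": 3}

print(sum(dictionary_grades.values()), grade_shares(dictionary_grades))
print(sum(report_grades.values()), grade_shares(report_grades))
```

With these counts, about 98.04% of the 1124 dictionary terms and about 94.67% of the 150 report translations fall in the ‘Good’ grade.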
5. Conclusion
The aim of the proposed system was to translate breast cancer pathology reports from English to Tamil,
using Phrase-based translation. The advantage of phrase-based translation is that Word-sense disambiguation
need not be considered in the translation process [6]. We found it appropriate to use the phrase-based
translation approach for the Medical domain at the initial stage of translation. Hence, in the proposed system,
valuable resources namely breast cancer pathology Lexicon and breast cancer pathology bilingual dictionary
were generated. The resources may not be immediately used by Medical experts in their reporting but
building such a resource for many other diseases would bring out the richness of the Tamil language and its
applicability to all domains. The work can be applied to all the sections of the breast cancer pathology report
and expanded further to Medical reports of all diseases in the future.
Acknowledgment
The authors thank the Department of Pathology, Christian Medical College and Hospital, Vellore for
providing them with the sample data for their study. Our special thanks to Dr. Marie Therese Manipadam and
Dr. Gunadala Ishitha, Department of Pathology, CMC, Vellore for sharing their domain expertise in the field
of Breast Cancer Pathology. We also thank the Tamil language expert from Madras Christian College, Dr.
David Prabhakar, for evaluating the Lexicon terms and the Bilingual Dictionary and providing valuable input.
References
[1] AJCC Cancer Staging Manual, Eighth Edition © The American College of Surgeons (ACS), Chicago, Illinois, 2017.
[2] Anandkumar M et al., 2013, “Improving the performance of English-Tamil Statistical Machine Translation System using Source-side preprocessing”, Proceedings of International Conference on Advances in Computer Science, AETACS, Elsevier, 2013.
[3] Antony P J, 2013, “Machine Translation Approaches and Survey for Indian Languages”, International Journal of Computational Linguistics and Chinese Language Processing, Volume 18, No. 1, pp. 47-78.
[4] Benson Kituku, Lawrence Muchemi, Wanjiku Nganga, “A Review on Machine Translation Approaches”, Indonesian Journal of Electrical Engineering and Computer Science, Vol. 1, No. 1, January 2016, pp. 182-190.
[5] Dhanesh N, 2016, “A Conceptual Framework for Automated English to Tamil Machine Translation System”, International Journal for Trends in Engineering & Technology, Volume 16 Issue 1.
[6] Harshwardhan R, Mridula Sara Augustine, K.P. Soman, 2011, “Phrase based English to Tamil Translation System by Concept Labeling using Translation memory”, International Journal of Computer Applications (0975-8887), Volume 20, No. 3.
[7] http://dictionary.tamilcube.com/
[8] https://glosbe.com
[9] http://siddhadreams.blogspot.com/2008/10/tamil-medical-dictionary.html
[10] http://www.tamilvu.org/library/technical_glossary/html/techindex.htm
[11] Kamaljeet Kaur, Parminder Singh, 2014, “Review of Machine Transliteration Techniques”, International Journal of Computer Applications (0975-8887), Volume 107, No. 20, December 2014.
[12] Marta R, Mireia Farrús, Jordi Serrano Pons, 2012, “Machine Translation in Medicine A quality analysis of statistical machine translation in the medical domain”, International Virtual Conference, Section 14, Information Technology, Advanced Research in Scientific Areas, pp. 1995-1998.
[13] Poornima C, Dhanalakshmi V, Anand Kumar M, Soman K P, 2011, “Rule-based Sentence Simplification for English to Tamil Machine Translation System”, International Journal of Computer Applications (0975-8887), Volume 20, No. 3.
[14] Sanjay Kumar Dwivedi, Pramod Premdas Sukhadeve, 2010, “Machine Translation System in Indian Perspectives”, Journal of Computer Science 6 (10): 1082-1087. ISSN 1549-3636, Science Publications.
[15] Saraswathi S, P. Kanivadhana, M. Anusiya, S. Sathiya, 2011, “Bilingual Translation System”, International Journal on Computer Science and Engineering, Vol. 3 No. 3, pp. 1168-1174.
[16] Suganthi S, K.G. Srinivasagan, P. Bama Ruckmani, M. Saravanan, 2013, “Rule based Approach for Prepositional Phrase Attachment in English-Tamil Translation”, International Journal of Computer Applications (0975-8887), Volume 63, No. 22.
[17] www.translate.google.com
[18] www.shabdkosh.com