
PHRASE-BASED MACHINE TRANSLATION SYSTEM

for Breast Cancer Pathology Domain

Johanna Johnsi Rani G (a), Gladis D (b), Joy John Mammen (c), Manikandan Kabali (d) *

(a) Department of Computer Science, Madras Christian College, Chennai 600 059, India

(b) Department of Computer Science, Presidency College, Chennai 600 005, India

(c) Department of Immunohematology & Transfusion Medicine, Christian Medical College, Vellore 632 004, India

(d) Department of Computer Science, Madras Christian College, Chennai 600 059, India

Abstract

The proposed English-to-Tamil machine translation system targets the medical domain, specifically female breast cancer pathology. The system translates the textual content of a breast cancer pathology report from English to Tamil. Medical documentation and reporting are mostly done in English by medical professionals. While English is an easy medium of communication between medical practitioners, clinicians and pharmacists, the proposed system is developed to help the patient understand her medical condition in her regional language. The available medical resources in Tamil cover a broad spectrum of medicine and are scarce and inadequate for the domain considered. Hence, Natural Language Processing (NLP) pre-processing steps and online selection methods are applied to build a Lexicon of breast cancer pathology terms from multiple resources such as breast cancer pathology reports and protocols. Through an intensive manual process, a bilingual dictionary is subsequently built by translating each term in the Lexicon from English to Tamil. The bilingual dictionary is then used for machine translation with a phrase-based translation approach. The translated breast cancer pathology reports are evaluated by linguistic experts through human judgment and by comparison with the Google Translate output. Using domain-specific resources and a phrase-based translation approach, the proposed system performs better than Google Translate by %.


* Corresponding author. Tel.: +91-9940259515; fax: -

E-mail address: [email protected].

International Journal of Pure and Applied Mathematics, Special Issue, Volume 119 No. 18 2018, 2071-2082. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/


1. Introduction

India has a multilingual population, and Tamil is an ancient language spoken by the natives of Tamil Nadu. Most communication among medical professionals is in English, which is well suited to the exchange of information between doctors, clinical personnel and pharmacists. However, only a small percentage of the local population can read and understand the details of a medical report written in English. Breast cancer is the predominant life-threatening disease among women in India. The proposed work aims to build resources in breast cancer pathology that can be used in the translation of medical reports from English to Tamil. The Tamil language is rich enough to serve as a medium of communication across all domains. Research efforts on the translation of medical terms from English to Tamil have been minimal until now, and the proposed system is an effort to initiate work in this area. Machine translation of the reports handed over to patients into their regional language would enable them to be aware of their medical condition. With more such work in the future, the medical domain would also gain a wealth of terminological resources for various diseases.

A breast cancer pathology report consists of textual content, and hence Natural Language Processing (NLP) techniques are applied before the machine translation is performed. NLP is defined as the automatic manipulation of natural language, such as speech and text, by software. Machine translation refers to the application of a computer system to translate text or speech from one language to another. The source language for the translation system is English and the target language is Tamil. Existing medical dictionaries and thesauri in Tamil cover a wide spectrum of medical terms and are not adequate for translating reports in a particular disease domain. Hence, as preprocessing steps to translation, we collate resources for the breast pathology domain through manual compilation, using existing resources. The resources generated are a Breast Pathology Lexicon and a Bilingual Dictionary from English to Tamil for the terms in the Lexicon. The manual compilation has been done in consultation with language experts in the target language. The resources are subsequently used in the machine translation process. The translation output is analyzed and evaluated through manual scrutiny and through comparison with the Google Translate output for the same content.

The source language, English, is widely spoken across the globe and serves as a common communication medium across borders. In the medical domain, reports are usually written in English, as the language has adequate resources in the field and is better understood by medical practitioners. Some of the medical resources in English are the “Dictionary of Medical terms” by A & C Black, London, the “Merriam-Webster Medical Dictionary”, the “Oxford Medical Dictionary” and “Taber’s Cyclopaedic Medical Dictionary”. Besides these, there are many online medical resources such as glossaries. These resources contain terms associated with all diseases and medical conditions, while we focus only on breast cancer pathology. A few characteristic features of the two languages are as follows. English is syntactically a Subject-Verb-Object (SVO) language in which the word order is rigid. In complex sentences, the subordinate clause follows the main clause. Modern English is analytic and has prepositions. Tense and time in English are indicated by auxiliaries, which are always placed before the main verbs. In interrogative sentences, auxiliaries are shifted to the front position. Adjectives and noun qualifiers always precede the nouns they qualify. Tamil, in contrast, is a Subject-Object-Verb (SOV) language, and since it allows flexibility in word order it can be called a “word-order free” language. It is also not mandatory for a sentence to have a subject, an object and a verb. A single word in Tamil can show the tense, the action performed and the gender.

Translation of text from English to Tamil is highly challenging because of the flexibility of the Tamil language. Furthermore, handling gender-specific statements and ambiguous words is more complex than in English. The proposed work escapes the challenge of gender-specific translation because the dataset consists of de-identified medical reports whose content describes the disease and its associated conditions rather than making gender-specific mentions of the patient. The complexity of translation due to ambiguous words also need not be addressed here, as the textual content has standard medical terms and phrases that are used locally and globally by pathologists. In this context, an essential resource for translation from English to Tamil is a parallel corpus for the breast cancer pathology domain, which is generated through manual and automated efforts as a pre-processing step to the actual translation task. In generating a parallel corpus for the domain, we came across a few medical resources in Tamil. They are the “English-Tamil Glossary of Medicine” by Tamil Virtual University, a “Medical dictionary” translated from English to Tamil and Telugu, compiled by D. Rambabu and translated along with V.V. Rathnasree, and the “Thatcha Naayanaar Tamil Medical Dictionary”, which is a compilation of Ayurveda and Siddha medicine [7,8,9,10]. Since the above-listed Tamil medical resources were not adequate for the specific domain of breast cancer pathology, we create a Lexicon and a Bilingual Dictionary for the domain. In the process, a few online resources were consulted either to directly derive Tamil word-equivalents of the English terms or to coin Tamil phrases for their English equivalents. We then apply a phrase-based approach to the translation of breast cancer pathology reports from English to Tamil with the resources generated. The automated machine translation enables us to translate the source language into the target language without human intervention, because the translation process uses relevant domain resources [14].

The remainder of this paper is organized as follows. Section 2 reviews related work on translating textual content from English to Tamil. Section 3 gives the details of the implementation of phrase-based machine translation, and Section 4 explains the evaluation of the translated content and the results obtained. Section 5 presents the conclusions and the scope for future work.

2. Related Work

Translation from English to Tamil was earlier performed manually by linguistic experts in the two languages, but in recent times machine translation to Tamil has been widely studied by many researchers. Machine translation systems have been developed for Indian languages since the early 1990s, a few of which are AnglaBharti (1991), Anusaaraka (1995), Mantra (1999), Matra (2004), AnuBharti (2004), Shiva and Shakti (2004), Anubaad (2004) and Sampark (2009). Anusaaraka translates children’s stories. Mantra was developed for use in the Rajya Sabha Secretariat. Matra converts complex English sentences into simpler sentences and translates the content into Hindi. AnglaBharti is a general-purpose machine translation system with provision for domain customization. The Shiva and Shakti machine translation systems cater to translation requirements in Hindi, Marathi and Telugu [14].

Thenmozhi D and Aravindan C proposed a statistical machine translation system that translates textual content in the agriculture domain from Tamil to English. Poornima C, Dhanalakshmi V, et al. proposed a preprocessing tool that uses a rule-based technique to convert complex English sentences into simple sentences before translation to the target language, Tamil [13]. R. Harshwardhan et al. proposed a framework for phrase-based translation from English to Tamil using translation memory and concept labeling [6]. Dhanesh N proposed a conceptual framework for an automated English-to-Tamil rule-based machine translation system [5]. S. Saraswathi et al. proposed a bilingual translation system for English and Tamil using a hybrid approach that combines rule-based and knowledge-based machine translation [15]. The works listed here cover diverse fields but are not specific to the medical domain. The proposed work hence aims to apply machine translation to a disease domain, namely breast cancer, by generating the required linguistic resources and translating with a phrase-based approach. The breast cancer pathology reports do not have gender-specific statements. As proposed by S. Suganthi et al., our system also handles prepositions such as ‘of’, ‘in’, ‘to’, ‘on’, ‘by’ and ‘from’ that occur in the reports [16]. The semantic rules, derived through a manual process, are based on part-of-speech tags at the sentence level.

3. Machine Translation Implementation

Machine translation (MT) is the automatic translation of text from one language to another using a computer. MT approaches are classified into three major categories, namely rule-based machine translation (RBMT), data-driven machine translation (DDMT) and hybrid models. Rule-based models depend heavily on language theory and require numerous man-hours to build rules for translation from one language to another. Once this initial effort is made, rule-based MT systems are easy to maintain and extend to other languages. Rule-based translation can be direct or indirect and is suitable for languages with little or no parallel corpora. The data-driven model, also known as corpus-based translation, makes use of bilingual parallel corpora. It has two major approaches, namely statistical machine translation (SMT) and example-based machine translation (EBMT). The statistical approach translates text by choosing the target-language string with the highest probability given the source text. This approach requires a parallel corpus between the source and target languages at the sentence level. The example-based approach is a data-driven approach that translates by analogy (similarity in meaning and form) with entries in a database of examples [4]. A hybrid model combines two or more translation approaches to draw on the merits of both RBMT and DDMT, and can be DDMT-guided or RBMT-guided.
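The highest-probability criterion described above for SMT is commonly written in noisy-channel form. The notation is introduced here only for illustration and is not taken from the proposed system: e denotes the English source sentence and t a candidate Tamil sentence.

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid e) \;=\; \arg\max_{t} P(e \mid t)\, P(t)
```

The second equality follows from Bayes' rule, since P(e) does not depend on t. In this reading, P(e | t) is a translation model estimated from the sentence-level parallel corpus and P(t) is a target-language model.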

Medical translation refers to the translation of pathology reports, laboratory reports, clinical reports, scan reports, etc. [12]. The machine translation of breast cancer pathology reports performed by the system follows a data-driven, phrase-based translation approach. The proposed work used 150 breast cancer pathology reports obtained from a renowned hospital in Tamil Nadu, South India. Each report has five sections, namely Specimen, Clinical, Gross, Micro and Impression. The Impression section is the summary portion of the report and is used for cancer staging. Hence, the initial phrase-based translation is limited to the contents of the Impression section alone. As is typical of a data-driven approach, the medical resources for translation, namely the Breast Cancer Pathology Lexicon (BCP-LEX) and the Breast Cancer Pathology Bilingual Dictionary (BCP-BLD), are generated first, through an intensive manual process.

3.1. Lexicon Building

A Lexicon of breast cancer pathology terms (BCP-LEX) is generated by applying three approaches. The first approach applies NLP steps to the dataset, namely the breast cancer pathology reports, and to the American Joint Committee on Cancer (AJCC) protocol to collate the unigram medical terms [1]. The second approach collects breast cancer pathology glossary terms from an online resource. In addition to the two, the system allows the user to select medical phrases from the dataset through online selection. The Lexicon thus generated consists of a total of 2083 terms belonging to the domain and serves as the source for building the bilingual dictionary. Table 1 gives the distribution of Lexicon terms generated using the three approaches. After elimination of duplicates, BCP-LEX has 1124 terms.

Table 1. Breast Cancer Pathology Lexicon (BCP-LEX)

Source                                      No. of Terms
Dataset – Pathology reports (Using NLP)     1541
AJCC-TNM C Classification (Using NLP)       125
Glossary of Pathological terms              417
Total                                       2083
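The merge-and-deduplicate step that reduces the 2083 collected terms to 1124 can be sketched as follows. This is a minimal illustration, assuming duplicates are detected after lower-casing and whitespace normalisation; the paper does not state the exact matching rule, and the function name and the three tiny input lists are hypothetical stand-ins for the real sources.

```python
def build_lexicon(*term_lists):
    """Merge term lists from several sources, keeping the first occurrence of each term."""
    seen = set()
    lexicon = []
    for terms in term_lists:
        for term in terms:
            # Assumed normalisation: lower-case and collapse internal whitespace.
            key = " ".join(term.lower().split())
            if key and key not in seen:
                seen.add(key)
                lexicon.append(key)
    return lexicon


# Hypothetical miniature inputs standing in for the three sources in Table 1.
report_terms = ["Carcinoma", "tumour", "Margin", "carcinoma"]
ajcc_terms = ["Tumour", "metastasis"]
glossary_terms = ["Ductal carcinoma in situ", "margin"]

lexicon = build_lexicon(report_terms, ajcc_terms, glossary_terms)
print(len(lexicon), "unique terms:", lexicon)
```

Keeping insertion order makes it easy to trace each surviving term back to the first source in which it appeared.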

3.2. Bilingual Dictionary

Tamil dictionaries and thesauri of medical terms are rare, and online resources for breast cancer pathology are unavailable. The terms in the Breast Pathology Lexicon can be categorized into (i) translated words (unigrams), (ii) translated phrases (bigrams or longer), (iii) transliterated items and (iv) non-translatable items. Through an intensive manual process, the medical terms in the Lexicon are translated into Tamil using Google Translate and a few other online Tamil dictionaries [17, 18] and iteratively checked for correctness with Tamil language experts. Transliteration is the conversion of text from one script to another, representing words from one language using phonetic or spelling equivalents in another language [11]. Medical terms that cannot be translated without loss of meaning are transliterated. A small subset of terms cannot be translated into Tamil, as they are standard medical indicators. For example, the pathological classification pTNM is manually derived from the contents of the pathology report, where pT represents the pathological classification of the tumour, pN that of the lymph nodes and pM that of distant metastasis. pT, pN and pM are grouped to determine the patient's cancer stage. The pTNM classification and cancer stage representations are left untranslated in order to retain their medical significance in the text.
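One possible in-memory representation of the four term categories is sketched below. The class and function names are illustrative assumptions rather than the system's actual schema, and the Tamil values are placeholders (written as `<TA:...>`), not real dictionary entries.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TRANSLATED_WORD = "translated word"
    TRANSLATED_PHRASE = "translated phrase"
    TRANSLITERATED = "transliterated"
    NON_TRANSLATABLE = "non-translatable"


@dataclass(frozen=True)
class Entry:
    english: str
    tamil: str
    category: Category


def make_entry(english, tamil, category):
    # Non-translatable items (e.g. staging codes) keep their English form.
    if category is Category.NON_TRANSLATABLE:
        tamil = english
    return Entry(english, tamil, category)


# Placeholder targets only; real Tamil strings would come from the expert-checked dictionary.
entries = [
    make_entry("tumour", "<TA:tumour>", Category.TRANSLATED_WORD),
    make_entry("adjacent breast tissue", "<TA:adjacent breast tissue>", Category.TRANSLATED_PHRASE),
    make_entry("progesterone", "<TA-translit:progesterone>", Category.TRANSLITERATED),
    make_entry("T1b", None, Category.NON_TRANSLATABLE),
]

# Keys are lower-cased so that lookup is case-insensitive.
dictionary = {e.english.lower(): e for e in entries}


def lookup(term):
    """Return the target-language form of a term, or None if it is not in the dictionary."""
    entry = dictionary.get(term.lower())
    return entry.tamil if entry else None
```

Storing the category alongside each entry lets a non-translatable code such as T1b pass through lookup unchanged while still being counted as a validated dictionary term.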

Table 2 lists a sample of translated words, translated phrases, transliterated terms and non-translatable terms in the domain, in both the source language English and the target language Tamil. The system provides an editor for Tamil language experts and medical domain experts to check the precision of the medical terms in the target language. The terms currently validated by the language experts total 1124, of which 891 are translated terms, 46 are non-translatable terms, 36 are abbreviations and 151 are transliterated terms. Further evaluation of the existing bilingual dictionary by medical experts would make the breast cancer pathology Lexicon and Bilingual Dictionary a precise and valuable resource for translating documents in the domain.

Table 2. Sample Terms in the Bilingual Dictionary for Breast Cancer Pathology Domain

Category                  English                            Tamil
Translated Terms          Tumour / Tumor (unigram)
                          Cancer / Carcinoma (unigram)
                          Adjacent breast tissue (phrase)
                          Microcalcification (phrase)
Transliterated Terms      Progesterone
                          Amphophilic cytoplasm
Non-translatable Terms    Tis (DCIS)                         Tis (DCIS)
                          T1b                                T1b

3.3. Translation approach

The manual preprocessing steps, namely building a Lexicon of 1124 breast cancer pathology terms from multiple sources relating to the domain and building a Bilingual Dictionary, provide the baseline resources required for the translation process. The machine translation process requires a few more preprocessing steps to be applied to the English source text, namely section segregation, sentence splitting and phrase splitting. Section segregation divides the report into its constituent sections and separates the Impression section for translation. The translation process is applied to individual sentences; hence sentence splitting is performed next. The sentences are POS-tagged and stored, after which the phrases in each sentence are split for the translation process. The Penn Treebank tagset, which contains 36 tags, is used for tagging. A word in one language may have an equivalent translation of multiple words in another language. Since medical terms include phrases or groups of words that cannot be split without losing their medical meaning, chunking is performed before translation. For example, the term “Ductal carcinoma in situ” is more meaningful as a phrase than as individual words. Case sensitivity is handled at the preprocessing stage to avoid errors in dictionary lookup. Abbreviations in the text are translated directly to their Tamil equivalents using the Bilingual Dictionary, while the non-translatable terms appear unchanged in the translated output. The algorithm given below shows the steps in the translation process.

Algorithm

1. Read Input text in SL
2. Preprocess the text and store in database D
3. For report 1 to n
4.     For each sentence S
           Apply POS Tagging and store in D
5.         Split into phrases
6.         For each phrase P
               Apply Chunking
7.             For each term n to 1, in the Chunklist
                   If chunk-term ∈ Bilingual Dictionary
                       Replace Chunk-term with Target language term;
                       Go to 6
                   Endif
               End
8. Read the precorrected translated output in TL
9. Apply Morphological Rules
10. Print correct translated text



Fig. 1. Phrase-based Translation from English to Tamil
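The dictionary-lookup core of the algorithm (steps 4 to 7) can be sketched as a greedy longest-match replacement. This is one reading of the "n to 1" loop, not the authors' implementation: POS tagging, the chunking rules and the morphological rules of step 9 are omitted because the paper does not specify them. The dictionary contents, the `max_len` limit and the function names are hypothetical, and the Tamil targets are placeholders.

```python
import re

# Hypothetical dictionary; values are placeholders, not real Tamil entries.
BILINGUAL = {
    "tumour": "<TA:tumour>",
    "deep resection margin": "<TA:deep resection margin>",
    "ductal carcinoma in situ": "<TA:ductal carcinoma in situ>",
}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def translate_sentence(sentence, dictionary, max_len=5):
    """Replace the longest dictionary phrase found at each position; keep other tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate chunk first, then shorter ones (n down to 1).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            chunk = " ".join(tokens[i:i + n]).lower()  # case-insensitive lookup
            if chunk in dictionary:
                out.append(dictionary[chunk])
                i += n
                break
        else:
            # No dictionary match starts here: pass the token through unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


report = "Tumour is 1 cm from the deep resection margin. Ductal carcinoma in situ is seen."
for s in split_sentences(report):
    print(translate_sentence(s, BILINGUAL))
```

Trying longer chunks first keeps multi-word terms such as "Ductal carcinoma in situ" intact instead of translating their words one by one. Tokens that are not in the dictionary remain in English, which corresponds to the uncorrected first-level output that the morphological rules are then expected to repair.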

The initial translated output is not syntactically correct. Morphological analysis takes care of the inflected forms of a verb or noun [5]. Anand Kumar M et al. note that while English has a fixed word order, Tamil has a rich morphological structure and is a free word-order language [2]. Hence, morphological analysis is performed on the output to obtain a grammatically correct translation. Rule-based translation applies grammar rules to the first-level translation and reorders the sentences using linguistic information from the source and target languages [11]. The proposed system applies rules that are derived from the corpus using a data-driven approach. The POS-tagged report content was manually scrutinized several times to derive the rules. The morphological rules yield corrected translation output that adheres to the grammatical rules of Tamil. An example of such a corrected output is given below.

English Sentence: Tumour is 1 cm from the deep resection margin.

Translation output before Morphological Analysis: is 1 . from the .

Final translation output after Morphological Analysis: 1 . .



Fig. 2 shows the translation output of the Impression section of a breast cancer pathology report.

Fig. 2. Impression Section Translation from English to Tamil

4. Evaluation of the translated output

The Bilingual Dictionary and the translation output are evaluated using a human evaluation approach. At the first level, using an editor, the resources and the translation output are evaluated by a linguistic expert in both English and Tamil. The linguistic expert checked the correctness of the translation of each term in the Bilingual Dictionary and of each sentence in the Impression section, and graded them as Good (G), Partially correct (PC) or Poor (P).

In evaluating the Bilingual Dictionary terms, the criterion for grading was twofold. Firstly, the Tamil equivalent of the breast cancer pathology term must represent the domain without loss of or deviation in meaning, and secondly, the coined phrases must be syntactically and semantically correct from the target-language perspective. With the above criteria, the experts graded every term and its Tamil equivalent in the Bilingual Dictionary. Out of the 1124 terms identified so far, 1102 terms were graded ‘Good’, 7 were graded ‘Partially correct’ and 15 were graded ‘Poor’. In the same way, the translation output of the proposed system was also evaluated by the language expert. Out of the 150 breast cancer pathology report translations, 142 were graded ‘Good’, 5 were graded ‘Partially correct’ and 3 were graded ‘Poor’. The evaluation results are summarized graphically in Fig. 3 (a) and Fig. 3 (b). To fully authenticate the translation, it will also be graded by medical experts.

Fig. 3 (a) Evaluation of Bilingual Dictionary terms; Fig. 3 (b). Evaluation of Translation output
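The grade counts reported above can be converted to percentage shares with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the function name is illustrative.

```python
def grade_shares(counts):
    """Convert grade counts to percentages of the total, rounded to two decimals."""
    total = sum(counts.values())
    return {grade: round(100 * n / total, 2) for grade, n in counts.items()}


dictionary_grades = {"Good": 1102, "Partially correct": 7, "Poor": 15}
translation_grades = {"Good": 142, "Partially correct": 5, "Poor": 3}

print(grade_shares(dictionary_grades))
print(grade_shares(translation_grades))
```

On these counts, about 98% of the dictionary terms and about 95% of the report translations were graded ‘Good’.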

5. Conclusion

The aim of the proposed system was to translate breast cancer pathology reports from English to Tamil using phrase-based translation. The advantage of phrase-based translation is that word-sense disambiguation need not be considered in the translation process [6]. We found the phrase-based translation approach appropriate for the medical domain at the initial stage of translation. Hence, in the proposed system, valuable resources, namely the breast cancer pathology Lexicon and the breast cancer pathology Bilingual Dictionary, were generated. The resources may not be used immediately by medical experts in their reporting, but building such resources for many other diseases would bring out the richness of the Tamil language and its applicability to all domains. The work can be applied to all the sections of the breast cancer pathology report and expanded further to medical reports of all diseases in the future.

Acknowledgment

The authors thank the Department of Pathology, Christian Medical College and Hospital, Vellore for

providing them with the sample data for their study. Our special thanks to Dr. Marie Therese Manipadam and

Dr. Gunadala Ishitha, Department of Pathology, CMC, Vellore for sharing their domain expertise in the field

of Breast Cancer Pathology. We also thank the Tamil language expert from Madras Christian College, Dr.

David Prabhakar, for evaluating the Lexicon terms and the Bilingual Dictionary and providing valuable input.



References

[1] AJCC Cancer Staging Manual, Eighth Edition © The American College of Surgeons (ACS), Chicago, Illinois, 2017.

[2] Anandkumar M et al., 2013, “Improving the performance of English-Tamil Statistical Machine Translation System using Source-side preprocessing”, Proceedings of International Conference on Advances in Computer Science, AETACS, Elsevier, 2013.

[3] Antony P J, 2013, “Machine Translation Approaches and Survey for Indian Languages”, International Journal of Computational Linguistics and Chinese Language Processing, Volume 18, No. 1, pp. 47-78.

[4] Benson Kituku, Lawrence Muchemi, Wanjiku Nganga, “A Review on Machine Translation Approaches”, Indonesian Journal of Electrical Engineering and Computer Science, Vol. 1, No. 1, January 2016, pp. 182-190.

[5] Dhanesh N, 2016, “A Conceptual Framework for Automated English to Tamil Machine Translation System”, International Journal for Trends in Engineering & Technology, Volume 16, Issue 1.

[6] Harshwardhan R, Mridula Sara Augustine, K.P. Soman, 2011, “Phrase based English to Tamil Translation System by Concept Labeling using Translation memory”, International Journal of Computer Applications (0975-8887), Volume 20, No. 3.

[7] http://dictionary.tamilcube.com/

[8] https://glosbe.com

[9] http://siddhadreams.blogspot.com/2008/10/tamil-medical-dictionary.html

[10] http://www.tamilvu.org/library/technical_glossary/html/techindex.htm

[11] Kamaljeet Kaur, Parminder Singh, 2014, “Review of Machine Transliteration Techniques”, International Journal of Computer Applications (0975-8887), Volume 107, No. 20, December 2014.

[12] Marta R, Mireia Farrús, Jordi Serrano Pons, 2012, “Machine Translation in Medicine A quality analysis of statistical machine translation in the medical domain”, International Virtual Conference, Section 14, Information Technology, Advanced Research in Scientific Areas, pp. 1995-1998.

[13] Poornima C, Dhanalakshmi V, Anand Kumar M, Soman K P, 2011, “Rule-based Sentence Simplification for English to Tamil Machine Translation System”, International Journal of Computer Applications (0975-8887), Volume 20, No. 3.

[14] Sanjay Kumar Dwivedi, Pramod Premdas Sukhadeve, 2010, “Machine Translation System in Indian Perspectives”, Journal of Computer Science 6 (10): 1082-1087. ISSN 1549-3636, Science Publications.

[15] Saraswathi S, P. Kanivadhana, M. Anusiya, S. Sathiya, 2011, “Bilingual Translation System”, International Journal on Computer Science and Engineering, Vol. 3, No. 3, pp. 1168-1174.

[16] Suganthi S, K.G. Srinivasagan, P. Bama Ruckmani, M. Saravanan, 2013, “Rule based Approach for Prepositional Phrase Attachment in English-Tamil Translation”, International Journal of Computer Applications (0975-8887), Volume 63, No. 22.

[17] www.translate.google.com

[18] www.shabdkosh.com

