
NATIONAL INSTITUTE OF TECHNOLOGY KURUKSHETRA

MAJOR PROJECT

REPORT

ON

FINDING SEMANTIC RELATIONSHIP AMONG ASSOCIATED MEDICAL TERMS

SUBMITTED TO: Dr. R.M. Sharma
SUBMITTED BY: Manisha Singh (111497), Sneha Bairagi (111717), Abhinav Rai (511004)

CONTENTS

1. Introduction
2. Motivation
3. Problem Statement
4. Description
   4.1 Steps involved
      4.1.1 Tokenization
      4.1.2 Stop Word Removal
      4.1.3 Stemming
      4.1.4 POS Tagging
      4.1.5 Annotating Corpora and Searching Patterns
5. Java
6. JDBC
7. Conclusion
8. Future Work
9. References

Acknowledgments

We express our profound gratitude and indebtedness to Prof. R.M. Sharma, Department of Computer Science and Engineering, NIT Kurukshetra, for supporting the present topic and for his inspiring intellectual guidance, constructive criticism, and valuable suggestions throughout the project work.

Date: 4/05/2015
Place: Kurukshetra

Manisha Singh
Sneha Bairagi
Abhinav Rai

ABSTRACT

The machine learning field has gained thrust in almost every domain of research and has recently become a reliable tool in the medical domain. Automatic learning is used in tasks such as medical decision support, medical imaging, protein-protein interaction, extraction of medical knowledge, and overall patient management. Machine learning is envisaged as a tool by which computer-based systems can be integrated into the healthcare field in order to provide better, well-organised medical care. This report describes an ML-based methodology for building an application that is capable of identifying and disseminating healthcare information. The application extracts sentences from published medical papers that mention diseases and treatments, and identifies the semantic relations that exist between diseases and treatments. Our evaluation results for these tasks show that the proposed methodology obtains reliable outcomes that could be integrated into an application to be used in the medical care domain. The potential value of this work stands in the ML settings that we propose and in the fact that we outperform previous results on the same data set.

1. Introduction

Because of the enormous increase in research in the medical domain, information extraction tools are becoming more and more important for practitioners. Finding relevant information in the medical domain is still very problematic because most of the data on the internet is poorly structured, amorphous, and hard to process algorithmically. Most of the data is contained in journals of medicine and biology, which makes this type of textual mining a central problem. In this project, we have focused on extracting disease-medicine co-occurrence relationships from the text of the literature. Automatic identification of the relationship between disease and treatment from medical records would be a very valuable contribution to public health, supporting the process of diagnosis. We present a methodology for extracting useful information from large volumes of medical data, applying data mining techniques to extract the treatments corresponding to a disease from a huge corpus. The system tries to identify the relationships of an active disease and extract the relevant medicines for the patient. With the growing number of medical theses, research papers, and articles, researchers face the difficulty of reading a large number of papers to gain knowledge in their field of interest. This system helps the user extract disease-treatment relationships without reading the whole document: the treatment of a particular disease is filtered from the extracted text and displayed to the user. The user thus gets only the required information, which saves time and improves the quality of the result. The text-mined output can be used in the healthcare domain, where a doctor can analyse the various kinds of treatment that can be given to a patient with a particular medical disorder and can update his or her knowledge about a particular disease or its treatment methodology. A large-scale and accurate list of drug-disease treatment pairs derived from published biomedical literature can also be used for drug repurposing [1]. First, the extracted pairs themselves contain many interesting drug-disease repurposing pairs with evidence from case studies or small-scale clinical studies. Second, these pairs can be used in network-based systems approaches for drug repurposing: for example, if drug 1 is similar to drug 2 and disease 1 can be treated by drug 1, we can hypothesize that disease 1 can also be treated by drug 2. Here drug-disease relationships are important to connect drugs to diseases.

2. Motivation

There is a huge volume of data growing on the internet in the form of research papers and web documents, and the amount of medical literature continues to grow and specialize. The traditional healthcare system is also becoming one that embraces the internet and the electronic world. Electronic Health Records (EHR) are becoming the standard in the healthcare domain. Research and studies show that the potential benefits of having an EHR system are:
- Health information recording and clinical data repositories: immediate access to patient diagnoses, allergies, and lab test results that enable better and time-efficient medical decisions;
- Medication management: rapid access to information regarding potential adverse drug reactions, immunizations, supplies, etc.;
- Decision support: the ability to capture and use quality medical data for decisions in the workflow of healthcare; and
- Treatments tailored to specific health needs: rapid access to information that is focused on certain topics.

In order to realize the benefits that the EHR system promises, we need better, faster, and more reliable access to information. New research discoveries enter the repository at a high rate, making the process of identifying and disseminating reliable information a very difficult task. A system can be effective only if it takes the needs of its users into account. There are two types of users: 1) those with interest and knowledge in the medical field, and 2) those with no background in the medical field. Both groups face problems when it comes to retrieving information about a disease. For people in the second group, this task is tedious and time-consuming: since they do not have knowledge of medical terms, it is hard for them to understand medical documents and to separate relevant data from irrelevant. People in the first group find it time-consuming to separate relevant documents from irrelevant ones; they want a system that provides them all the relevant information quickly and efficiently. This system solves the problems of both groups, and also gives them access to recent data.

3. Problem Statement

Problem: Finding semantic relationships among associated medical terms using patterns. A semantic relationship among terms refers to the hidden meaning between the terms; for example, between a drug and a disease the hidden meaning is treatment. We try to find out the treatments for diseases by processing the relevant documents using NLP (natural language processing) techniques. The results can be used by doctors to improve their knowledge of the latest treatments discovered, and can also be used in drug repurposing.

4. Description

In this project we build a system that identifies the various medicines available for a particular disease. The input is a disease name, and the system extracts the medicines available for that disease from text documents available in unstructured format. Essentially, we process the text documents to obtain the disease-treatment pairs they contain.

Proposed Algorithm:

Input: Disease, Rules.
Output: Medicine, Semantic Relationship.
1. For the given disease, extract papers from MEDLINE.
2. Tokenize the documents.
3. Remove all stopwords.
4. Perform stemming.
5. Perform POS tagging to separate the required parts of speech.
6. Convert the corpus to an annotated corpus.
7. From the annotated sentences, extract the sentences having at least one medicine and one disease.
8. Search for a pattern between the disease and the medicine.
9. Associate and rank the medicines based on frequency and superiority.
10. Present the semantic relationships to the user.
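The steps above can be organized as a simple pipeline. The following is a minimal Java sketch of that flow; every class and method name here (DiseaseTreatmentPipeline, removeStopwords, matchPatterns, etc.) is a hypothetical placeholder standing in for one step of the algorithm, not part of an existing library.

import java.util.ArrayList;
import java.util.List;

// Skeleton of the extraction pipeline described above. Each helper is a
// placeholder stub; the real implementations are discussed in section 4.1.
public class DiseaseTreatmentPipeline {

    public List<String> findTreatments(String disease, List<String> documents) {
        List<String> medicines = new ArrayList<>();
        for (String doc : documents) {                       // step 1: papers from MEDLINE
            List<String> tokens = tokenize(doc);             // step 2
            tokens = removeStopwords(tokens);                // step 3
            tokens = stem(tokens);                           // step 4
            List<String> tagged = posTag(tokens);            // step 5
            // steps 6-8: annotate, keep sentences containing a disease and a
            // medicine, and match the treatment patterns between them
            medicines.addAll(matchPatterns(tagged, disease));
        }
        return rankByFrequency(medicines);                   // steps 9-10
    }

    private List<String> tokenize(String doc) { return List.of(doc.split("\\s+")); }
    private List<String> removeStopwords(List<String> t) { return t; } // see 4.1.2
    private List<String> stem(List<String> t) { return t; }            // see 4.1.3
    private List<String> posTag(List<String> t) { return t; }          // see 4.1.4
    private List<String> matchPatterns(List<String> t, String d) { return new ArrayList<>(); } // see 4.1.5
    private List<String> rankByFrequency(List<String> m) { return m; }
}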

4.1 Steps involved:

4.1.1 Tokenization

Tokenization is the process of breaking up the given text into units called tokens. The tokens may be words, numbers, or punctuation marks. Tokenization does this by locating word boundaries: the ending point of a word and the beginning of the next word. Tokenization is also known as word segmentation.

Challenges in tokenization

The challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most of the words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. Tokenising unsegmented languages requires additional lexical and morphological information (root words, affixes, parts of speech). Tokenization is also affected by the writing system and the typographical structure of the words. The structures of languages can be grouped into three categories:
- Isolating: words do not divide into smaller units. Example: Mandarin Chinese.
- Agglutinative: words divide into smaller units. Example: Japanese, Tamil.
- Inflectional: boundaries between morphemes are not clear and are ambiguous in terms of grammatical meaning. Example: Latin.
In short, tokenization breaks a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining; in NLP we tokenize a large piece of text to generate smaller pieces (words, sentences, etc.) that are easier to work with.
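For a space-delimited language like English, a basic tokenizer can be written with a single regular expression. The sketch below treats words, numbers, and punctuation marks as separate tokens; it is a simplification, since full tokenizers also handle clitics ("don't"), abbreviations ("e.g."), and hyphenation.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal English tokenizer: runs of word characters and single
// punctuation marks each become one token.
public class SimpleTokenizer {
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Aspirin, is, used, to, treat, fever, .]
        System.out.println(tokenize("Aspirin is used to treat fever."));
    }
}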

4.1.2 Stop Word Removal

In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Before text analysis, a stop word list is developed for the removal of semantically insignificant words; these lists vary in size. For our technique we have a list of stop words including common words, phrases, and characters. A stopword list contains the high-frequency terms that are to be ignored in the text, as they do not give any useful information for our scenario. The most common stopwords in our case are 'a', 'the', 'of', etc. Stop words are basically a set of commonly used words in any language, not just English. The reason stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. For example, in the context of a search engine, if the search query is "how to develop information retrieval applications" and the engine looks for pages containing the terms "how", "to", "develop", "information", "retrieval", "applications", it will find far more pages containing "how" and "to" than pages about developing information retrieval applications, because those two terms are so common in English. If we disregard them, the engine can focus on retrieving pages that contain the keywords "develop information retrieval applications", which more closely brings up pages of real interest. This is the basic intuition for using stop words. Stop word removal is used in a whole range of tasks; these are just a few:
- Supervised machine learning: removing stop words from the feature space.
- Clustering: removing stop words prior to generating clusters.
- Information retrieval: preventing stop words from being indexed.
- Text summarization: excluding stop words from contributing to summarization scores, and removing stop words when computing ROUGE scores.

Types of stop words: stop words are generally thought of as a single set of words, but the term can mean different things to different applications. For some applications, removing all stop words, from determiners (e.g. the, a, an) to prepositions (e.g. above, across, before) to some adjectives (e.g. good, nice), is appropriate. For other applications this can be detrimental: in sentiment analysis, removing adjectives such as "good" and "nice", or negations such as "not", can throw algorithms off track. In such cases one can use a minimal stop list consisting of just determiners, determiners with prepositions, or just coordinating conjunctions, depending on the needs of the application. Examples of minimal stop word lists:
- Determiners: tend to mark nouns, where a determiner is usually followed by a noun. Examples: the, a, an, another.
- Coordinating conjunctions: connect words, phrases, and clauses. Examples: for, and, nor, but, or, yet, so.
- Prepositions: express temporal or spatial relations. Examples: in, under, towards, before.
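Filtering a token list against a stop word list is a set-membership test. Below is a minimal sketch using the search-query example above; the stop word set here is a tiny illustrative sample, not the full list used by the system.

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Remove every token that appears (case-insensitively) in the stop list.
public class StopWordFilter {
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "of", "to", "in", "is", "how");

    public static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList(
                "how", "to", "develop", "information", "retrieval", "applications");
        // prints [develop, information, retrieval, applications]
        System.out.println(removeStopWords(query));
    }
}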

4.1.3 Stemming

Stemming is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected words to their word stem, base, or root form (generally a written word form). The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Stemming is a pre-processing step in text mining applications as well as a very common requirement of natural language processing functions; in fact it is very important in most information retrieval systems. The main purpose of stemming is to reduce the different grammatical forms of a word (its noun, adjective, verb, adverb forms, etc.) to its root form [2]. We can say that the goal of stemming is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For example, 'reader' and 'reading' are reduced to 'read', so that the terms can lead to similarity detection. Stemming does not seem to depend on the domain, but it does depend on the language of the text. Our findings show that stemming affects the semantics of terms. Most of the time, the morphological variants of a word have similar semantic interpretations and can be considered equivalent for the purpose of IR applications. Since the meaning is the same but the word form is different, it is necessary to identify each word form with its base form. In stemming, the conversion of the morphological forms of a word to its stem is done assuming each one is semantically related. There are mainly two kinds of errors in stemming: over-stemming and under-stemming. Over-stemming is when two words with different stems are stemmed to the same root; this is also known as a false positive. Under-stemming is when two words that should be stemmed to the same root are not; this is also known as a false negative. Paice has shown that light stemming reduces over-stemming errors but increases under-stemming errors, while heavy stemmers reduce under-stemming errors at the cost of increased over-stemming errors.

Various stemming algorithms are available:

Truncate(n): The most basic stemmer is the Truncate(n) stemmer, which truncates a word at the nth symbol, i.e. keeps n letters and removes the rest. Words shorter than n are kept as they are. The chance of over-stemming increases when the word length is small.

S-Stemmer: An algorithm conflating singular and plural forms of English nouns, proposed by Donna Harman. The algorithm has rules to remove the suffixes of plurals so as to convert them to the singular forms.

Lovins Stemmer: The first popular and effective stemmer, proposed by Lovins in 1968. It performs a lookup on a table of 294 endings, 29 conditions, and 35 transformation rules, arranged on a longest-match principle [6]. The Lovins stemmer removes the longest suffix from a word; once the ending is removed, the word is recoded using a different table that makes various adjustments to convert the stem into a valid word. Being a single-pass algorithm, it always removes a maximum of one suffix from a word. Its advantages are that it is very fast, can handle removal of double letters (e.g. 'getting' transformed to 'get'), and handles many irregular plurals ('mouse' and 'mice', 'index' and 'indices', etc.). Drawbacks of the Lovins approach are that it is time- and data-consuming, and many suffixes are not available in its table of endings. It is sometimes highly unreliable and frequently fails to form words from the stems or to match the stems of like-meaning words, owing to the technical vocabulary used by the author.

Porter's Stemmer: Porter's stemming algorithm, proposed in 1980, is currently one of the most popular stemming methods; many modifications and enhancements to the basic algorithm have been suggested. It is based on the idea that the suffixes in the English language (approximately 1200) are mostly made up of a combination of smaller and simpler suffixes. It has five steps, and within each step, rules are applied until one of them passes its conditions. If a rule is accepted, the suffix is removed accordingly, and the next step is performed; the resultant stem at the end of the fifth step is returned. A rule has the form (condition) S1 -> S2. For example, the rule (m>0) EED -> EE means: if the word has at least one vowel-consonant pair (measure m > 0) before the ending EED, change the ending to EE. So 'agreed' becomes 'agree' while 'feed' remains unchanged. The algorithm has about 60 rules and is very easy to comprehend.
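The single rule quoted above can be expressed directly in code. The sketch below implements only the (m>0) EED -> EE rule, with the measure condition approximated by checking for a vowel-consonant pair in the stem (the full Porter measure also treats 'y' specially); it is an illustration of one rule, not a complete Porter stemmer.

// One Porter rule, (m>0) EED -> EE: if the stem before "eed" contains at
// least one vowel followed by a consonant, replace "eed" with "ee".
public class PorterRuleDemo {

    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // Crude check that the measure m of the stem is greater than zero,
    // i.e. the stem contains at least one vowel followed by a consonant.
    private static boolean measureGreaterThanZero(String stem) {
        for (int i = 0; i + 1 < stem.length(); i++) {
            if (isVowel(stem.charAt(i)) && !isVowel(stem.charAt(i + 1))) {
                return true;
            }
        }
        return false;
    }

    public static String applyEedRule(String word) {
        if (word.endsWith("eed")) {
            String stem = word.substring(0, word.length() - 3);
            if (measureGreaterThanZero(stem)) {
                return stem + "ee";
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(applyEedRule("agreed")); // agree
        System.out.println(applyEedRule("feed"));   // feed (unchanged, m = 0)
    }
}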

Paice/Husk Stemmer: The Paice/Husk stemmer is an iterative algorithm with one table containing about 120 rules indexed by the last letter of a suffix [14]. On each iteration, it tries to find an applicable rule by the last character of the word; each rule specifies either a deletion or a replacement of an ending. If there is no such rule, it terminates. It also terminates if a word starts with a vowel and only two letters are left, or if a word starts with a consonant and only three characters are left. Otherwise, the rule is applied and the process repeats. Its advantage is its simple form, with every iteration handling both deletion and replacement as per the rule applied. Its disadvantage is that it is a very heavy algorithm, and over-stemming may occur.

Dawson Stemmer: This stemmer is an extension of the Lovins approach, except that it covers a much more comprehensive list of about 1200 suffixes. Like Lovins it is a single-pass stemmer and hence is quite fast. The suffixes are stored in reversed order, indexed by their length and last letter; in fact they are organized as a set of branched character trees for rapid access. Its advantage is that it covers more suffixes than Lovins and is fast in execution; its disadvantage is that it is very complex and lacks a standard reusable implementation.

4.1.4 POS Tagging

Part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. In short, it is the process of assigning a part of speech to each word in a sentence.

E.g.: Come September, and the UJF campus is abuzz with new and returning students.

After POS tagging the sentence will be: Come_VB September_NNP ,_, and_CC the_DT UJF_NNP campus_NN is_VBZ abuzz_JJ with_IN new_JJ and_CC returning_VBG students_NNS ._.
These labels come from the Penn tagset, developed at the University of Pennsylvania, a well-known centre for natural language processing work. The foundation of statistical POS tagging is the noisy channel model.

Noisy Channel

The word sequence (wn, wn-1, ..., w1) is observed, and we must guess the correct tag sequence (tm, tm-1, ..., t1) that the noisy channel transformed:

t* = argmax over t of P(t|w)

Here the noisy channel is a metaphor for a computation where the input is subjected to noise at every stage of processing and an output is generated. On the input side we have the word sequence, and on the output side we have the tag sequence.

Argmax Computation:

Let y = f(x) be a function. Then y* = max of f(x) over all x is the maximum value of the function. Compare max with argmax: x* = argmax of f(x) over all x is the value of x at which f(x) attains its maximum.

Bayesian Decision Theory

Given the random variables A and B, Bayes' theorem states:
P(A|B) = P(A) * P(B|A) / P(B)
where
P(A|B) = posterior probability
P(A) = prior probability
P(B|A) = likelihood
Assumption: choose as the decision the value whose probability is highest.

A* = argmax over A of P(A|B) = argmax over A of P(A) * P(B|A)
Computing and using P(A) and P(B|A) both need:
- looking at the internal structures of A and B,
- making independence assumptions, and
- putting together a computation from smaller parts.

Best tag sequence t*:
t* = argmax over t of P(t|w)
After applying Bayes' theorem:
t* = argmax over t of P(t) * P(w|t)
Here P(w) can be ignored because it remains the same for all t. P(t) is the prior probability of the tag sequence t; it also acts as a filter for bad tag sequences. Some of the POS tags are:
NN - noun, e.g. dog_NN
VM - main verb, e.g. run_VM
VAUX - auxiliary verb, e.g. is_VAUX
JJ - adjective, e.g. red_JJ
PRP - pronoun, e.g. you_PRP
NNP - proper noun, e.g. John_NNP
CC - coordinating conjunction, e.g. jack and_CC jill
CD - cardinal number, e.g. four_CD children
MD - modal, e.g. you may_MD go

POS tag ambiguity: "I bank1 on the bank2 on the river bank3 for my transactions." Here bank1 is a verb; the other two occurrences of bank are nouns.

Process of POS tagging:
1. List all possible tags for each word in the sentence.
2. Choose the best possible tag sequence.

Example sentence: "people jump high"
- people: noun/verb. "People are the assets of a country." Here people is a noun. "The place was peopled with the members of the tribes." Here people is used as a verb.
- jump: noun/verb. "I jumped over the fence." Here jump is used as a verb. "This was a good jump." Here jump is a noun.
- high: noun/adjective. "High hills." Here high is used as an adjective. "After the win, he was on a high." Here high is used as a noun.

^ people jump high .

Each tag here is considered a state, with ^ (hat) the starting state and . (dot) the ending state. If there are N words in a sentence, we get tag sequences of N+2 states, because the hat and dot states are also counted. All possible tag sequences are tried, and the tag sequence with the maximum probability is assigned to the word sequence. Finding the tag sequence is thus reduced to a graph traversal.

Best tag sequence:
T* = argmax over T of P(T|W) = argmax over T of P(T) * P(W|T)

P(T) = P(t0) P(t1|t0) P(t2|t1 t0) P(t3|t2 t1 t0) ... P(tn|tn-1 tn-2 ... t0) P(tn+1|tn tn-1 ... t0)
     = P(t0) P(t1|t0) P(t2|t1 t0) P(t3|t2 t1) ... P(tn|tn-1 tn-2) P(tn+1|tn tn-1)   (trigram assumption)

P(tn|tn-1 tn-2) = (number of times the sequence tn-2 tn-1 tn occurs) / (number of times the sequence tn-2 tn-1 occurs)

P(W|T) = P(w0|t0 ... tn+1) P(w1|w0, t0 ... tn+1) P(w2|w1 w0, t0 ... tn+1) ... P(wn|w0 ... wn-1, t0 ... tn+1) P(wn+1|w0 ... wn, t0 ... tn+1)

Assumption: a word is completely determined by its tag (inspired by speech recognition):
P(W|T) = P(w0|t0) P(w1|t1) P(w2|t2) ... P(wn|tn) P(wn+1|tn+1)   (lexical probability)
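These probabilities are estimated by simple counting over a tagged corpus, as the worked example below illustrates. Here is a minimal Java sketch of the counting step, using a bigram rather than trigram model to match that example; the class and method names are ours, for illustration only.

import java.util.HashMap;
import java.util.Map;

// Estimating bigram tag-transition probabilities by counting:
// P(t2|t1) = count(t1 t2) / count(t1 followed by anything).
public class BigramTagModel {
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<String, Integer> unigramCounts = new HashMap<>();

    public void train(String[] tagSequence) {
        for (int i = 0; i + 1 < tagSequence.length; i++) {
            bigramCounts.merge(tagSequence[i] + " " + tagSequence[i + 1], 1, Integer::sum);
            unigramCounts.merge(tagSequence[i], 1, Integer::sum);
        }
    }

    public double probability(String t1, String t2) {
        int bigram = bigramCounts.getOrDefault(t1 + " " + t2, 0);
        int unigram = unigramCounts.getOrDefault(t1, 0);
        return unigram == 0 ? 0.0 : (double) bigram / unigram;
    }

    public static void main(String[] args) {
        BigramTagModel model = new BigramTagModel();
        // The tagged corpus from the example below.
        model.train("^ N V A N N . ^ N V N A R A .".split(" "));
        System.out.println(model.probability("N", "V")); // 0.4 (= 2/5)
        System.out.println(model.probability("^", "N")); // 1.0
    }
}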

Example of calculation from actual data.

Corpus: "^ Ram got many NLP books. He found them all very interesting." (each sentence is preceded by the start marker ^)
POS tagged: ^ N V A N N . ^ N V N A R A .

Recording numbers (bigram assumption), i.e. counts of tag bigrams (row = first tag, column = second tag):

      ^    N    V    A    R    .
^     0    2    0    0    0    0
N     0    1    2    1    0    1
V     0    1    0    1    0    0
A     0    1    0    0    1    1
R     0    0    0    1    0    0
.     1    0    0    0    0    0

Probabilities (each row normalized by its total):

      ^    N    V    A    R    .
^     0    1    0    0    0    0
N     0    1/5  2/5  1/5  0    1/5
V     0    1/2  0    1/2  0    0
A     0    1/3  0    0    1/3  1/3
R     0    0    0    1    0    0
.     1    0    0    0    0    0

P(ram|N) = P(wi = ram | ti = noun) = (number of times ram occurs as a noun) / (total number of nouns).

Lexical probabilities are tabulated in the same way for each word of the corpus (Ram, got, many, NLP, books, he, found, them, all, very, interesting) against each tag (^, N, V, A, R, .), each cell holding P(word|tag).

HIDDEN MARKOV MODEL

A very powerful tool for problem solving in statistical AI is the combination of the Markov assumption and Bayes' theorem. Consider a problem with three urns, each containing red, green, and blue balls. Balls are drawn randomly from the urns, and this sequence of draws is termed the observation sequence. We have to find the state sequence, i.e. the urn sequence.

The urns contain:

         Urn 1   Urn 2   Urn 3
Red        30      10      60
Green      50      40      10
Blue       20      50      30

Probability of transition to another urn after picking a ball:

      U1    U2    U3
U1    0.1   0.4   0.5
U2    0.6   0.2   0.2
U3    0.3   0.4   0.3

Probability of drawing a ball of each colour from each urn:

      R     G     B
U1    0.3   0.5   0.2
U2    0.1   0.4   0.5
U3    0.6   0.1   0.3

Let the observation sequence be R R G G B R G R. We have to find the state sequence. Many problems in AI fall into this class: predict the hidden from the observed.

Observations and states:

OBS:   ε(O0) R(O1) R(O2) G(O3) G(O4) B(O5) R(O6) G(O7) R(O8)
State: S0    S1    S2    S3    S4    S5    S6    S7    S8    S9

Here S0 and S9 are the initial and final states respectively. After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1. O0 is an ε-transition (no ball is drawn in the initial state).

Si = U1/U2/U3 (a particular state); S is the state sequence, O the observation sequence, and S* the best possible state (urn) sequence. Goal: maximize P(S|O) by choosing the best S.

S* = argmax over S of P(S|O) = argmax over S of P(S) * P(O|S)

P(S) = P(S0) P(S1|S0) P(S2|S0 S1) P(S3|S0 S1 S2) ... P(S9|S0 ... S8)
By the Markov assumption:
P(S) = P(S0) P(S1|S0) P(S2|S1) P(S3|S2) ... P(S9|S8)

P(O|S) = P(O0|S) P(O1|O0, S) P(O2|O0 O1, S) ... P(O8|O0 ... O7, S)
Assuming the ball drawn depends only on the urn chosen:
P(O|S) = P(O0|S0) P(O1|S1) P(O2|S2) ... P(O8|S8)

Putting these together:
P(S) * P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] ... [P(O8|S8) P(S9|S8)]
where P(S9|S8) = 1. Since we have 3 urns and 8 observations, the number of state sequences to evaluate by brute force is 3^8, i.e. |number of states|^(length of observation sequence).

To avoid this exponential computation, the Viterbi algorithm is used.

The medical terms from the above stage are provided to the POS tagger to correctly elaborate all syntactic categories, such as noun, verb, adjective, and pronoun, that can be used to identify parts of speech. For our task we use a POS tagger based on the Markov assumption [3] that uses the Viterbi algorithm [4]. From all the text we consider only four parts of speech: noun, pronoun, verb, and adjective. Nouns are used because each entity of our domain (e.g. dengue, malaria) is treated as a noun by the POS tagger. Pronouns are used because in most paragraphs a term initially appears as a named entity (a noun), and in the remainder of the paragraph it occurs as a pronoun; each pronoun is resolved by moving backward to the noun it points to. Verbs show the link of relationship among the nouns, and adjectives show the strength of a relation (e.g. severe, low, high).

In POS tagging, the best tag sequence is t* = argmax P(t|s) = argmax P(t) * P(s|t), where t is the tag sequence and s is the word sequence to be tagged. Computing the tags naively with a Hidden Markov Model is expensive, so the trigram assumption is taken into consideration (the current tag depends only on the previous two tags) and the Viterbi algorithm is applied to perform the tagging efficiently.

Viterbi Algorithm

Given: the HMM, which means:
a) the start state s1;
b) the alphabet A = {a1, a2, ..., ap};
c) the set of states S = {s1, s2, ..., sN};
d) the transition probabilities P(si to sj via ak) = P(sj, ak | si) for all i, j, k;
and the output string a1 a2 ... aT.
To find: the most likely sequence of states c1 c2 ... cT which produces the given output sequence, i.e. c1 c2 ... cT = argmax P(c | a1 a2 ... aT).
Data structures:
a) an N x T array called SEQSCORE to maintain the winner sequence (N = number of states, T = length of the output sequence);
b) another N x T array, BACKPTR, to recover the path.
There are three distinct steps in the Viterbi implementation: a) initialization, b) iteration, c) sequence identification.

Initialization:
SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
for i = 2 to N do
    SEQSCORE(i,1) = 0.0

Iteration:
for t = 2 to T do
    for i = 2 to N do
        SEQSCORE(i,t) = max over j = 1..N of [SEQSCORE(j, t-1) * P(sj to si via a_t)]
        BACKPTR(i,t) = the index j that gives the maximum above

Sequence identification:
C(T) = the i that maximizes SEQSCORE(i,T)
for i = T-1 down to 1 do
    C(i) = BACKPTR[C(i+1), i+1]
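The three steps map directly onto code. Below is a compact Java implementation of the algorithm, run on the three-urn HMM from the previous section using the transition and emission tables given there. The uniform start distribution is an assumption (the example does not specify one), and the integer encodings of states and observations are illustrative.

// Viterbi decoding: seqScore[i][j] is the best score of any state sequence
// ending in state i after observation j; backPtr records the predecessor.
public class Viterbi {

    public static int[] viterbi(double[] start, double[][] trans,
                                double[][] emit, int[] obs) {
        int n = trans.length;      // number of states
        int t = obs.length;        // length of observation sequence
        double[][] seqScore = new double[n][t];
        int[][] backPtr = new int[n][t];

        for (int i = 0; i < n; i++) {             // initialization
            seqScore[i][0] = start[i] * emit[i][obs[0]];
        }
        for (int j = 1; j < t; j++) {             // iteration
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double score = seqScore[k][j - 1] * trans[k][i] * emit[i][obs[j]];
                    if (score > seqScore[i][j]) {
                        seqScore[i][j] = score;
                        backPtr[i][j] = k;
                    }
                }
            }
        }
        int best = 0;                             // sequence identification
        for (int i = 1; i < n; i++) {
            if (seqScore[i][t - 1] > seqScore[best][t - 1]) best = i;
        }
        int[] path = new int[t];
        path[t - 1] = best;
        for (int j = t - 1; j > 0; j--) {
            path[j - 1] = backPtr[path[j]][j];
        }
        return path;
    }

    public static void main(String[] args) {
        double[] start = {1.0 / 3, 1.0 / 3, 1.0 / 3};        // assumed uniform
        double[][] trans = {{0.1, 0.4, 0.5}, {0.6, 0.2, 0.2}, {0.3, 0.4, 0.3}};
        double[][] emit = {{0.3, 0.5, 0.2}, {0.1, 0.4, 0.5}, {0.6, 0.1, 0.3}};
        int[] obs = {0, 0, 1, 1, 2, 0, 1, 0};                // R R G G B R G R (R=0, G=1, B=2)
        for (int state : viterbi(start, trans, emit, obs)) {
            System.out.print("U" + (state + 1) + " ");
        }
    }
}

Note that the triple loop makes the cost N * N * T instead of the N^T of brute-force enumeration.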

Understanding the Viterbi algorithm with an example:

Developing the tree:

The output sequence is a1 a2 a1 a2; we have to find the state sequence.

Probability table (SEQSCORE):

      ε     a1    a2    a1     a2
S1    1.0   0.1   0.09  0.012  0.0081
S2    0.0   0.3   0.06  0.027  0.0054

BACKPTR table:

      ε     a1    a2    a1     a2
S1    0     1     2     2      2
S2    -     1     2     1      2

Using the BACKPTR table, the state sequence is obtained by backtracking from the best final state. The best state sequence obtained is S1 S2 S1 S2 S1. This reduces the complexity to a great extent.

4.1.5 Annotating Corpora and Searching Patterns:

In this step the corpus is annotated with disease and medical terms. Sentences are tagged with disease entities from the clean disease lexicon and drug entities from the drug list. The tagging is based on case-insensitive exact string matching, for high precision and efficiency. A pattern is then searched for between a disease and a drug. The pattern can be 'drug <pattern> disease' if the drug entity precedes the disease entity, or 'disease <pattern> drug' if the disease precedes the drug. The patterns we use for 'drug <pattern> disease' are: in; in the treatment of; for; in patients with; for the treatment of; treatment of; therapy for; therapy in; for treatment of; against; in the management of; therapy of; treatment for; treatment in; in a patient with; in treatment of; in children with; to cure; is used to cure; is used for curing; is used to manage; reduces; in the treatment of patients; prevents; is used to prevent; to prevent; for the management of; to treat; can be used to control symptoms of; can be used as medication for; can be used to improve symptoms; can be used as an antibiotic for; can be used to relieve sign symptoms; can be used to relieve symptoms; can be used to reduce symptoms; can be effective for; may be effective in the treatment of; can be used to prevent. The patterns used for 'disease <pattern> drug' are: can be treated with; interventions to control disease are; symptoms can be improved with; symptoms can be controlled with; antibiotics for the disease are; antibiotics that can be used are; your doctor may recommend; symptoms can be reduced with; can be prevented with.
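A few of the 'drug <pattern> disease' patterns above can be matched against a sentence with a case-insensitive regular expression, as sketched below. The tiny drug and disease lexicons and the reduced pattern list here are illustrative placeholders; in the actual system the lexicons come from the drug list and the clean disease lexicon, and the full pattern list is used.

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Search a sentence for 'drug <pattern> disease' relations by trying each
// (drug, disease) pair with an alternation of treatment patterns.
public class TreatmentPatternMatcher {
    private static final List<String> DRUGS = List.of("chloroquine", "aspirin");
    private static final List<String> DISEASES = List.of("malaria", "fever");
    private static final String PATTERNS =
            "is used to cure|in the treatment of|for the treatment of|to treat|against";

    public static void findRelations(String sentence) {
        for (String drug : DRUGS) {
            for (String disease : DISEASES) {
                Pattern p = Pattern.compile(
                        drug + "\\s+(" + PATTERNS + ")\\s+" + disease,
                        Pattern.CASE_INSENSITIVE);
                Matcher m = p.matcher(sentence);
                if (m.find()) {
                    System.out.println(drug + " --[" + m.group(1) + "]--> " + disease);
                }
            }
        }
    }

    public static void main(String[] args) {
        // prints: chloroquine --[is used to cure]--> malaria
        findRelations("Chloroquine is used to cure malaria in endemic regions.");
    }
}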

How to check the quality of tagging

Three parameters are used, where A is the set of correct tags and O the set of tags obtained:
Precision P = |A ∩ O| / |O|: out of the tags obtained, the proportion that is correct.
Recall R = |A ∩ O| / |A|: out of the correct tags, the proportion actually obtained.
F-score = 2PR / (P + R): the harmonic mean of precision and recall.
If every word is given a tag and no word is left out, then |A| = |O| and therefore precision = recall = F-score.
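The three measures can be computed directly from the two tag sets. A minimal sketch, using made-up (word, tag) pairs for illustration:

import java.util.Set;

// Precision, recall, and F-score from the set of correct (word_tag) pairs
// and the set of pairs output by the tagger.
public class TaggingQuality {

    public static void score(Set<String> correct, Set<String> obtained) {
        long hits = obtained.stream().filter(correct::contains).count();
        double precision = (double) hits / obtained.size();
        double recall = (double) hits / correct.size();
        double fScore = 2 * precision * recall / (precision + recall);
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", precision, recall, fScore);
    }

    public static void main(String[] args) {
        Set<String> correct = Set.of("people_NN", "jump_VB", "high_JJ");
        Set<String> obtained = Set.of("people_NN", "jump_NN", "high_JJ");
        score(correct, obtained); // P=0.67 R=0.67 F=0.67
    }
}

Note that when both sets have the same size, as when every word receives exactly one tag, the three values coincide, matching the observation above.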

5. Java

Java is a programming language and a platform. Java is a high-level, robust, secure, object-oriented programming language. Any hardware or software environment in which a program runs is known as a platform; since Java has its own runtime environment (JRE) and API, it is called a platform. A simple Java example:

class Simple {
    public static void main(String args[]) {
        System.out.println("Hello Java");
    }
}

According to Sun, 3 billion devices run Java. Some of the places Java is currently used:
- Desktop applications such as Acrobat Reader, media players, antivirus software, etc.
- Web applications such as irctc.co.in, javatpoint.com, etc.
- Enterprise applications such as banking applications.
- Mobile, embedded systems, smart cards, robotics, games, etc.

Features of Java:
- Simple: Java's syntax is based on C++ (so it is easier for programmers to learn after C++), and it removed many confusing and/or rarely used features, e.g. explicit pointers and operator overloading. There is no need to remove unreferenced objects because Java has automatic garbage collection.
- Object-oriented: we organize our software as a combination of different types of objects that incorporate both data and behaviour. Object-oriented programming (OOP) is a methodology that simplifies software development and maintenance by providing rules. Its basic concepts are: object, class, inheritance, polymorphism, abstraction, encapsulation.
- Platform independent: a platform is the hardware or software environment in which a program runs; platforms may be software-based or hardware-based, and Java provides a software-based platform. The Java platform differs from most other platforms in that it is a software-based platform that runs on top of other, hardware-based platforms. It has two components: the runtime environment and the API (application programming interface). Java code can run on multiple platforms, e.g. Windows, Linux, Sun Solaris, Mac OS, etc. Java code is compiled by the compiler and converted into bytecode; this bytecode is platform-independent because it can run on multiple platforms, i.e. Write Once, Run Anywhere (WORA).
- Secure: Java has no explicit pointers, and programs run inside the virtual machine sandbox.

- Robust: robust simply means strong. Java uses strong memory management, the lack of explicit pointers avoids security problems, there is automatic garbage collection, and there are exception handling and type checking mechanisms. All these points make Java robust.
- Architecture neutral: there are no implementation-dependent features; e.g. the sizes of the primitive types are fixed.
- Portable: Java bytecode can be carried to any platform.
- High performance: Java is faster than traditional interpreted languages since bytecode is "close" to native code, though it is still somewhat slower than a compiled language (e.g. C++).
- Multithreaded: a thread is like a separate program, executing concurrently. We can write Java programs that deal with many tasks at once by defining multiple threads. The main advantage of multithreading is that threads share the same memory. Threads are important for multimedia, web applications, etc.
- Distributed: distributed applications can also be written in Java. RMI and EJB are used for creating distributed applications; we may access files by calling methods from any machine on the internet.

6. JDBC

JDBC is a Java API to connect to and execute queries against a database. The JDBC API uses JDBC drivers to connect to the database. Before JDBC, the ODBC API was used to connect to and query databases, but the ODBC driver is written in the C language (i.e. platform dependent and unsecured). That is why Java defined its own API (the JDBC API) that uses JDBC drivers (written in Java). An API (application programming interface) is a document that describes all the features of a product or software; it specifies classes and interfaces that software programs can use to communicate with each other. An API can be created for applications, libraries, operating systems, etc.

Steps to connect to the database

There are 5 steps to connect a Java application to a database using JDBC:
1. Register the driver class.
2. Create the connection.
3. Create the statement.
4. Execute queries.
5. Close the connection.

Register the driver class: the forName() method of the Class class is used to register the driver class; it dynamically loads the driver class.
Syntax of the forName() method:
public static Class<?> forName(String className) throws ClassNotFoundException

Create the Connection object: the getConnection() method of the DriverManager class is used to establish a connection with the database.
Syntax of the getConnection() method:
public static Connection getConnection(String url) throws SQLException
public static Connection getConnection(String url, String name, String password) throws SQLException

Create the Statement object: the createStatement() method of the Connection interface is used to create a statement. The statement object is responsible for executing queries against the database.
Syntax of the createStatement() method:
public Statement createStatement() throws SQLException

Execute the query: the executeQuery() method of the Statement interface is used to execute queries against the database. This method returns a ResultSet object that can be used to get all the records of a table.
Syntax of the executeQuery() method:
public ResultSet executeQuery(String sql) throws SQLException

Close the connection object: closing the connection object closes the statement and ResultSet automatically. The close() method of the Connection interface is used to close the connection.
Syntax of the close() method:
public void close() throws SQLException

Connectivity with Access using a DSN

Connectivity with a type-1 driver is not considered good. To connect a Java application with a type-1 driver, create a DSN first; here the DSN name is mydsn.

import java.sql.*;

class Test {
    public static void main(String ar[]) {
        try {
            String url = "jdbc:odbc:mydsn";
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            Connection c = DriverManager.getConnection(url);
            Statement st = c.createStatement();
            ResultSet rs = st.executeQuery("select * from login");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        } catch (Exception ee) {
            System.out.println(ee);
        }
    }
}

7. Conclusion

The task we tackle in this project has applications in information retrieval, information extraction, and text summarization. We identify potential improvements in results when more information is brought into the representation technique for the task of classifying short medical texts. Experimental results show that the technique used in the proposed work minimizes the time and workload of doctors in analysing information about a disease and its treatments in order to make decisions about patient monitoring and treatment. This system helps users, especially doctors, save time: they can easily learn about a disease, its treatment and symptoms, and can analyse the various treatments associated with a particular disease. The text-mined output can be used in the healthcare domain, where a doctor can analyse the various kinds of treatment that can be given to a patient with a particular medical disorder; can update his or her knowledge about a particular disease, its treatment methodology, or the medicines under research for it; and can learn which medicines are effective for some patients but cause side effects in patients with an additional medical disorder. Patients can also use the extracted output to get a clear understanding of a particular disease: its symptoms, side effects, medicines, and treatment methodologies.

8. Future Scope

A wide future scope exists for this project. We can make it more user-friendly by allowing the user to also extract information regarding the cure, symptoms, and prevention of a disease. The project can be expanded to find the root cause of a disease and then, by taking the patient's history or condition, provide the dose accordingly. A further idea is to view the composition of a medicine and, by applying it to the patient's report, identify whether it will suit the patient.

9. References

[1] Rong Xu and QuanQiu Wang, "Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing", 2013.
[2] Fadi Yamout, "Further Enhancement to the Porter's Stemming Algorithm", 2006.
[3] S. Ray and M. Craven, "Representing sentence structure in Hidden Markov Models for information extraction", Proceedings of IJCAI-2001.
[4] M. S. Ryan and G. R. Nudd, "The Viterbi Algorithm", Department of Computer Science, University of Warwick, Coventry, England, 1993.
[5] Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC Curves", Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, USA.
