New Quantitative Methodology for Identification of Drug Abuse Based on Feature-Based Context-Free Grammar Yuqing (Carrie) Wang

Upload: carrie-wang

Post on 26-Jun-2015




Abstract

• Representing the meaning of unstructured text in a computable form is an open problem in computational linguistics that hinders full use of the information in social media. Here, I develop a feature-based context-free grammar that can parse the semantics of a YouTube discourse about the recreational use of cough syrup and perform anaphora resolution. The resulting anaphora resolution allows the computer to understand the meaning of unstructured texts. This work makes it easier to mine the meaning as well as the statistics of textual data from social media. More broadly, it contributes to the development of a computational representation of spontaneous discourses. Such a representation has applications in

• (i) syndromic surveillance, by using novel semantic representations as a proxy for the emergence of novel substances, and

• (ii) clinical research, by more intelligently mining electronic medical records.

It may also contribute to the understanding of the computational structure of human language.


Introduction

Background information:

• Traditional means for learning about patterns of substance usage, such as the National Survey on Drug Use and Health, do not provide timely information on known drugs. Moreover, they provide no means for learning about new drugs.

• An analysis of social media provides an efficient and cheap way to investigate trends, opinions, and the flow of information in the general population. Such an analysis may identify the signs and symptoms associated with the use of novel drugs or novel use of existing drugs.

• Natural language processing refers to the recognition, analysis, and reproduction of human language by machines. In this project, the term “natural language processing” refers only to the analysis of texts from social media, such as tweets from Twitter or comments on YouTube videos.

• Context-free grammar: A context-free grammar (CFG) is a formal grammar in which every production rewrites one nonterminal symbol, V, into a (possibly empty) string of terminals and nonterminals, w:

• V -> w

• This grammar is context-free because the same rule applies to V no matter what symbols surround it.
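The definition of a context-free grammar can be made concrete with a minimal sketch. The toy grammar, its symbol names, and the helper functions below are illustrative assumptions; they are not the grammar developed in this project.

```python
# Toy context-free grammar: each nonterminal maps to its alternative right-hand sides.
# All symbols and words are illustrative assumptions, not the project's grammar.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["I"], ["it"], ["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["a"]],
    "N": [["drug"]],
    "V": [["found"], ["use"]],
}


def derives(symbol, words):
    """Return True if `symbol` can be rewritten into exactly the tuple `words`."""
    if symbol not in GRAMMAR:  # terminal: must match a single word
        return len(words) == 1 and words[0] == symbol
    return any(_match(rhs, words) for rhs in GRAMMAR[symbol])


def _match(rhs, words):
    """Try every way of splitting `words` among the symbols of `rhs`."""
    if not rhs:
        return not words
    head, rest = rhs[0], rhs[1:]
    # Every remaining symbol needs at least one word,
    # because this toy grammar has no empty productions.
    for i in range(1, len(words) - len(rest) + 1):
        if derives(head, words[:i]) and _match(rest, words[i:]):
            return True
    return False


print(derives("S", tuple("I found a drug".split())))  # True
print(derives("S", tuple("found I a drug".split())))  # False
```

The rule for NP is applied in the same way whether the noun phrase is a subject or an object, which is the sense in which the grammar is context-free.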


Introduction

Literature Review:

• Social media, such as Twitter and YouTube, provide a cheap and rapid venue for many aspects of public health. Tweets about influenza predict the CDC’s calculation of influenza-like illness [13]. Social media also provide a way to screen for illnesses. Tweets about patterns of smoking and exercise correlate with the annual CDC Behavioral Risk Factors Study. There is a positive correlation between high smoking rates and high Twitter message rates about cancer (r = .648), a negative correlation between exercise and obesity messages (r = −.201), and a negative correlation between good healthcare coverage and messages about ailments in general (r = −.253) [5]. These examples indicate that social media provide a source of data that agree with official sources, but can be acquired more rapidly and on a much larger scale.

• A major barrier to the use of data from social media is the extraction and representation of the meaning (semantics) of textual data, such as tweets from Twitter, comments about YouTube videos, or status updates on Facebook. The creation of a computational representation of semantics is, in fact, an outstanding problem in natural language processing.

• Feature structures are a general data structure for representing information of any kind. They can be simple, atomic values such as +AUX, which indicates the presence of an auxiliary verb in the sentence. However, they can also be complex values that contain features that are themselves feature structures. One representation of such complex feature structures is known as the attribute value matrix. An attribute value matrix groups together agreement features, such as person, gender, and number, as a distinguished part of a category, AGR.

• First-order logic contains two parts: non-logical expressions and logical connectives. Examples of non-logical expressions include constants, variables, function symbols, and n-place predicate symbols. Logical connectives are negation, and, or, and implication.
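A complex feature structure can be sketched as a nested dictionary, with unification as the operation that merges two compatible structures. This is a minimal sketch: the `unify` helper, the feature names (`CAT`, `PER`, `NUM`), and the use of `AGR` as the agreement attribute are assumptions made for illustration.

```python
def unify(a, b):
    """Unify two feature structures; return the merged structure, or None on conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for feature, value in b.items():
            if feature in merged:
                sub = unify(merged[feature], value)
                if sub is None:  # conflicting values somewhere below this feature
                    return None
                merged[feature] = sub
            else:
                merged[feature] = value
        return merged
    # Atomic values unify only when they are equal.
    return a if a == b else None


# Attribute value matrix for the pronoun "they": agreement features sit under AGR.
they = {"CAT": "NP", "AGR": {"PER": 3, "NUM": "pl"}}
needs_plural = {"AGR": {"NUM": "pl"}}    # constraint imposed by a plural verb form
needs_singular = {"AGR": {"NUM": "sg"}}  # constraint imposed by a singular verb form

print(unify(they, needs_plural))    # {'CAT': 'NP', 'AGR': {'PER': 3, 'NUM': 'pl'}}
print(unify(they, needs_singular))  # None
```

Grouping the agreement features under one attribute means that a rule can require two constituents to share that whole substructure instead of comparing person, gender, and number one by one.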


Introduction

Literature Review:

• The term lambda calculus refers to a set of rules for formulating and converting lambda expressions. In turn, lambda expressions represent statements in first-order logic. Lambda calculus expresses all statements in first-order logic as combinations of (i) functions that take one argument and (ii) variables that may be bound or free. A bound variable has a value defined within the scope of the function. A free variable does not. A lambda expression may have one of three forms: a variable (x), an abstraction (\x.M), or an application (M N).

• Discourse semantics refers to the analysis of the relationships among the sentences of a discourse (a sequence of sentences) by including abstract mental representations (discourse representation structures). The interpretation of a sentence in a discourse often depends on its context, that is, on the surrounding sentences. This dependence on context is most noticeable with anaphoric pronouns, such as he, she, and it.

• An anaphor is an expression whose meaning depends on another referential element. For instance, in the discourse {I like dogs. They are cute.}, they refers to dogs. An anaphoric expression usually involves a pronoun referring to its antecedent.

• The resolution of an anaphor refers to the binding of the anaphor to its proper antecedent. The accurate resolution of anaphors is important in computing the semantics of YouTube comments because those comments often take the form of highly referential discourses.

Research Problem and Hypothesis:

• If discussing certain topics online entails using idiosyncratic syntactic structures, then I may use those structures to identify and parse those concepts. Because these structures recur, they can be used to resolve anaphors.


Introduction

Significance:

• This project develops a new algorithm based on a feature-based context-free grammar and applies it to the computational analysis of the semantics of social media text.

• It demonstrates that the results of this development can provide insight into public health issues by using characteristic syntactic structures to identify potential new drugs or early symptoms of substance abuse.

• It contributes to clinical research by more intelligently mining electronic medical records.

• It may also contribute to the understanding of the computational structure of human language.


Materials and Methods

Development and Validation of Data:

• YouTube comments related to illicit drug discussions were collected and analyzed (Table 2). The sentences had between three (3) and twenty (20) words, with a median of six (6) words.
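Summary statistics of this kind (minimum, median, and maximum sentence length) can be computed as in the following sketch. The `length_summary` helper and whitespace tokenization are assumptions, and the sample sentences are the short examples quoted elsewhere in this document, not the corpus itself.

```python
from statistics import median


def length_summary(sentences):
    """Return (minimum, median, maximum) word counts, splitting on whitespace."""
    counts = [len(sentence.split()) for sentence in sentences]
    return min(counts), median(counts), max(counts)


# Illustrative sentences only; the actual corpus is not reproduced here.
sample = [
    "I do drugs",
    "They make me high",
    "I found a drug",
    "It is safe",
    "I use it",
]

print(length_summary(sample))  # (3, 3, 4)
```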


Materials and Methods

• In this corpus, the most frequently occurring parts of speech were determiners and verbal and adjectival forms of expletives.

• An example of a parsed comment in this corpus is shown below.


Materials and Methods

Acquisition of Data:

• The data were acquired by querying YouTube’s API through a Python wrapper (Figure 1).
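The transcript does not say which wrapper or API version was used. As a rough sketch only, the following assumes the `commentThreads` endpoint of the YouTube Data API v3, which may differ from what this project used. The helper names, `VIDEO_ID`, `API_KEY`, and the sample response are hypothetical, and the code builds a request URL and parses a response without making a network call.

```python
from urllib.parse import urlencode

# Assumed endpoint: YouTube Data API v3 comment threads (not confirmed by the source).
API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_request_url(video_id, api_key, page_token=None):
    """Build the URL for one page of top-level comments on a video."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return API_URL + "?" + urlencode(params)


def extract_comments(response):
    """Pull the text of each top-level comment out of a decoded JSON response."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]


# Hypothetical response fragment in the shape the endpoint returns.
sample_response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "I use it."}}}}
    ],
    "nextPageToken": "HYPOTHETICAL_TOKEN",
}

print(build_request_url("VIDEO_ID", "API_KEY"))
print(extract_comments(sample_response))  # ['I use it.']
```

Passing the `nextPageToken` value from one response as `page_token` in the next request would walk through all pages of comments.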


Materials and Methods

Analysis of Data:

• Some features in the grammar are λ-expressions, which allow the grammar to approximate a comment’s semantics. Other features represent syntactic concepts such as the number, tense, or aspect of words.

• I denote by NUM the feature that captures the number of a verb or noun. Similarly, I denote by TENSE the feature that captures the tense of the verb. In addition to grammatical functions and semantic constructions, a third class of features captures the ordering of words in the sentence, which approximates the comment’s pragmatics. For example, I denote by INV the boolean feature that indicates whether a sentence is inverted. I use + and - to indicate the truth values of such atomic features. For example, -INV indicates a statement or an indirect question, and +INV indicates that the verb and noun phrases are inverted, as in a question.
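The role of atomic features such as NUM, TENSE, and INV can be illustrated with a small sketch. The lexicon entries, the `agrees` and `inv_feature` helpers, and the auxiliary list are hypothetical, and the inversion test is a crude heuristic, not the rule used in the project's grammar.

```python
# Hypothetical lexicon: each word carries atomic features such as NUM and TENSE.
# Treating "make" as plural-only is a deliberate simplification.
LEXICON = {
    "drug": {"CAT": "N", "NUM": "sg"},
    "drugs": {"CAT": "N", "NUM": "pl"},
    "they": {"CAT": "PRO", "NUM": "pl"},
    "makes": {"CAT": "V", "NUM": "sg", "TENSE": "pres"},
    "make": {"CAT": "V", "NUM": "pl", "TENSE": "pres"},
    "made": {"CAT": "V", "TENSE": "past"},  # NUM left unspecified
}

AUXILIARIES = {"is", "are", "do", "does", "did", "can", "will"}


def agrees(subject, verb):
    """Check NUM agreement; an unspecified NUM is compatible with any value."""
    s_num = LEXICON[subject].get("NUM")
    v_num = LEXICON[verb].get("NUM")
    return s_num is None or v_num is None or s_num == v_num


def inv_feature(words):
    """Crude heuristic: a clause that opens with an auxiliary is marked +INV."""
    return "+INV" if words and words[0].lower() in AUXILIARIES else "-INV"


print(agrees("they", "make"))             # True
print(agrees("drug", "make"))             # False
print(agrees("drugs", "made"))            # True
print(inv_feature("Is it safe".split()))  # +INV
print(inv_feature("It is safe".split()))  # -INV
```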


Results

• This project resulted in the identification of the referents of pronouns in a discourse. For example, in the discourse {I do drugs. They make me high.}, the parser recognized that they refers to drugs. The example demonstrates anaphora resolution, or matching pronouns with their antecedents, based on the context of the discourse.

First-order Logic:

• I disregarded adverbs and modified sentences to approximate the semantics of a discourse. First-order logic cannot express complicated features such as predicate adverbials and predicate adverbial modifiers. It can only express whether a given statement is true or false.

Lambda Calculus:

• I represent the semantics of the linking verb is in the sentence It is safe. by the lambda expression <\x.(it(x) & safe(x))>. This expression indicates that the variable x is both it and safe. I represent the semantics of the transitive verb use in the sentence I use it. by the lambda expression <\X x.X(\y.use(x,y))>, which indicates that x uses y. It also indicates that both x and y are bound to the transitive verb use. Since a transitive verb takes two arguments, two variables are bound to use. In this case, I and it are bound to use. I also used lambda calculus to build the semantic value of a sentence at its root, which returns a sequence of readings that are also expressed in lambda calculus.
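How the two lambda expressions combine with their arguments can be sketched with curried Python functions that build formulas as strings. The verb entries mirror the expressions quoted in this section; the type-raised noun phrase entries `\P.P(I)` and `\P.P(it)` are assumptions, since the transcript does not show how noun phrases were encoded.

```python
# Curried Python functions standing in for lambda terms; formulas are plain strings.

# \x.(it(x) & safe(x)), the expression given for "It is safe."
is_safe = lambda x: f"(it({x}) & safe({x}))"

# \X x.X(\y.use(x,y)), the expression given for the transitive verb "use"
use = lambda X: lambda x: X(lambda y: f"use({x},{y})")

# Assumed type-raised noun phrases: \P.P(I) and \P.P(it)
np_i = lambda P: P("I")
np_it = lambda P: P("it")

# Applying the verb to its object beta-reduces to \x.use(x,it).
vp = use(np_it)

print(is_safe("z"))  # (it(z) & safe(z))
print(vp("z"))       # use(z,it)
print(np_i(vp))      # use(I,it)
```

Each function takes exactly one argument, so the two-place predicate use is reached through two successive applications: first to the object, then to the subject.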


Results

• The first lambda calculus expression contains three parts. The expression to the right, found(y,x), translates to y found x; the expression in the middle, drug(x), translates to x is a drug; and the expression to the left, I(y), translates to y is I. Thus, the entire lambda calculus expression can be translated to: I found a drug.

• The second lambda calculus expression also contains three parts. The expression to the right, use(y,x), translates to y uses x; the expression in the middle, drug(x), translates to x is a drug; and the expression to the left, I(y), translates to y is I. Thus, the entire lambda calculus expression can be translated to: I use a drug.
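The three-part expressions can be checked against a small model to see what it means for such a formula to be true. The expressions themselves are not reproduced in this transcript, so the sketch below assumes that x and y are existentially quantified; the domain, the predicate extensions, and the `holds` helper are all hypothetical.

```python
from itertools import product

# Hypothetical model: a domain of entities and the extension of each predicate.
DOMAIN = {"speaker", "dxm", "video"}
MODEL = {
    "I": {"speaker"},
    "drug": {"dxm"},
    "found": {("speaker", "dxm")},
    "use": {("speaker", "dxm")},
    "sell": set(),
}


def holds(verb):
    """True if some x, y in the domain satisfy I(y) & drug(x) & verb(y,x)."""
    return any(
        y in MODEL["I"] and x in MODEL["drug"] and (y, x) in MODEL[verb]
        for x, y in product(DOMAIN, repeat=2)
    )


print(holds("found"))  # True
print(holds("use"))    # True
print(holds("sell"))   # False
```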


Results

Anaphora resolution:

• Anaphora resolution refers to the identification of the referent of the anaphor and the binding of the anaphor to that referent.

• Using first-order logic and lambda calculus, I built a feature-based context-free grammar that was able to parse this discourse and return readings that suggest the relationship between the anaphoric pronoun it and the noun phrase a drug.

The leftmost matrix denotes the truth value for drug. The middle one suggests that the direct object that follows the verb found has the same truth value and scope as drug. This suggests that the direct object of found is drug. The rightmost matrix suggests that, for the transitive verb use, its direct object it has the same truth value as the direct object of the transitive verb found. This equality in the truth values of the direct objects of the verbs found and use indicates that the direct objects drug and it must refer to the same thing.
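The binding of a pronoun to its antecedent can be illustrated with a much simpler stand-in for the grammar-based method described here: a recency heuristic that picks the most recent earlier noun with the same NUM value. The word lists and the `resolve` helper are hypothetical, and this heuristic is not the algorithm developed in this project.

```python
# Hypothetical word lists with a NUM value for each entry.
PRONOUNS = {"it": "sg", "they": "pl"}
NOUNS = {"drug": "sg", "drugs": "pl", "syrup": "sg"}


def resolve(discourse):
    """Map (sentence index, pronoun) to the most recent earlier noun with the same NUM."""
    antecedents = []  # (noun, NUM) pairs collected from sentences already processed
    bindings = {}
    for index, sentence in enumerate(discourse):
        words = [word.strip(".,!?").lower() for word in sentence.split()]
        for word in words:
            if word in PRONOUNS:
                for noun, num in reversed(antecedents):
                    if num == PRONOUNS[word]:
                        bindings[(index, word)] = noun
                        break
        # Nouns become candidate antecedents only for later sentences.
        antecedents.extend((word, NOUNS[word]) for word in words if word in NOUNS)
    return bindings


print(resolve(["I found a drug.", "I use it."]))       # {(1, 'it'): 'drug'}
print(resolve(["I do drugs.", "They make me high."]))  # {(1, 'they'): 'drugs'}
```

The grammar-based approach reaches the same bindings by equating the direct objects of found and use in the semantic readings, which does not depend on recency.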


Discussion

Project Purpose:

• This project aimed to analyze the discourse semantics of YouTube comments discussing the recreational use of dextromethorphan. I built a feature-based context-free grammar to parse the semantics of those comments. Some features include λ-expressions, which allow the parser to approximate the semantics of the comments, and NUM, which captures the number of a noun or a verb. This work makes it easier to obtain the meaning as well as the statistics of textual data from social media. It contributes to the development of a computational representation of spontaneous discourse. This provides a model system for tackling an open problem in computational linguistics: the resolution of anaphoric pronouns in unstructured text.


Discussion

Discovery:

• I was able to use the feature-based context-free grammar to parse the semantics of a discourse, or a sequence of sentences. I developed a new algorithm based on feature-based context-free grammar that allows my parser to recognize the antecedent of a pronoun in a sentence based on the meaning of the previous sentences, and thus to resolve anaphora. This computational representation of spontaneous discourses allows me to identify the emergence of novel substances and potential substance abuse, and contributes to clinical research by more intelligently mining electronic medical records. It may also contribute to the understanding of the computational structure of human language.


Discussion

Extension:

• Future projects can focus on other sites, such as Facebook or Tumblr, to discover other syntactic structures common to discussions of a certain topic. Future work could increase the size of the data sets and the number of features included in the grammar, or extend the framework of the grammar beyond first-order logic. One possible future project is the detection of the sale of illegal substances based on characteristic syntactic structures in the text of Facebook statuses. Another possible project is predicting future drug usage based on the computational representation of tweets about people’s attitudes towards drugs.


References

[1] Steven Bird, Edward Loper, and Ewan Klein. Natural Language Processing with Python. O’Reilly Media Inc, 2009.

[2] P Blackburn and J Bos. Representation and Inference for Natural Language. CSLI Publications, 2005.

[3] Michael Chary, Nicholas Genes, Andrew McKenzie, and Alex Manini. Leveraging social networks for toxicovigilance. Journal of Medical Toxicology, 9(2):184–91, Jun 2013.

[4] Alonzo Church. The Calculi of Lambda Conversion (AM-6). Number 6. Princeton University Press, 1985.

[5] M. Dredze. How social media will change public health. Intelligent Systems, IEEE, 27(4):81–84, July-Aug. 2012.

[6] Gunther Eysenbach. Infodemiology and infoveillance: Framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the internet. J Med Internet Res, 11(1):e11, Mar 2009.

[7] John Hopcroft and Jeffrey Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.

[8] James E. Lange, Jason Daniel, Kestrel Homer, Mark B. Reed, and John D. Clapp. Salvia divinorum: Effects and use among YouTube users. Drug and Alcohol Dependence, 108(1–2):138–140, 4 2010.

[9] W. McCune. Prover9 and Mace4. http://www.cs.unm.edu/~mccune/prover9/, 2005–2010.

[10] Kathleen Merikangas and Vetisha McClair. Epidemiology of substance use disorders. Human Genetics, 131:779–789, 2012. doi:10.1007/s00439-012-1168-0.

[11] Eduardo Porter. Numbers tell of failure in drug war. In New York Times (Online Version), July 2012.

[12] Ivan Sag, Thomas Wasow, and Emily Bender. Syntactic Theory: A Formal Introduction. CSLI Publications, 2003.

[13] A Signorini, AM Segre, and PM Polgreen. The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE, 6(5):e19467, 2011.

[14] Zili Sloboda. Forging a relationship between drug abuse epidemiology and drug abuse prevention. In Zili Sloboda and William J. Bukoski, editors, Handbook of Drug Abuse Prevention, Handbooks of Sociology and Social Research, pages 245–264. Springer US, 2006.