mars target encyclopedia: rock and soil composition ......mars target encyclopedia: rock and soil...

6
Mars Target Encyclopedia: Rock and Soil Composition Extracted from the Literature Kiri L. Wagstaff 1 , Raymond Francis 1 , Thamme Gowda 1,2* , You Lu 1 , Ellen Riloff 3 , Karanjeet Singh 1* , and Nina L. Lanza 4 1 Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, {firstname.lastname}@jpl.nasa.gov 2 Information Sciences Institute, University of Southern California, Marina Del Rey, CA 90292, [email protected] 3 School of Computing, University of Utah, Salt Lake City, UT 84112, [email protected] 4 Los Alamos National Laboratory, Los Alamos, NM 87545, [email protected] Abstract We have constructed an information extraction system called the Mars Target Encyclopedia that takes in planetary science publications and extracts scientific knowledge about target compositions. The extracted knowledge is stored in a search- able database that can greatly accelerate the ability of scien- tists to compare new discoveries with what is already known. To date, we have applied this system to 6000 documents and achieved 41–56% precision in the extracted information. Introduction Scientists everywhere are overwhelmed by the stream of new information that is published by their disciplines’ con- ferences, workshops, and journals. It is increasingly difficult to come up to speed in a new area and to stay current with the latest discoveries. In planetary exploration, new discoveries can occur each time new data is transmitted. For example, our rovers on Mars have sent back compositional data for thousands of individual targets (e.g., rocks, soils), and some of those observations have transformed our understanding of past environments on the planet (Grotzinger et al. 2014). To interpret new observations correctly, it is necessary to be able to compare them with what is already known. For ex- ample, if we observe high manganese content at a particular location, we want to know whether it is consistent with pre- vious observations or it indicates an anomalous new discov- ery. However, no central database exists in which planetary scientists can quickly make that determination. We have created a system called the Mars Target Encyclo- pedia (MTE) that uses information extraction methods to an- alyze planetary science publications and identify stated com- positional relationships between Mars surface targets and el- ements or minerals. The extracted information is stored in a searchable database that allows users to ask questions such as “Which targets contain hematite?” or “What is known about target Dillinger?” It also enables entirely new kinds of information visualization, such as a map display of all loca- tions where the Mars rover Curiosity has detected hematite. Ultimately, the MTE may serve as a resource to inform de- cisions about the next steps in Mars exploration. * This work was done when the authors were at the Jet Propul- sion Laboratory. Copyright c 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. In this paper, we describe the MTE system and its compo- nent technologies, the empirical performance of the system on labeled data, and results from a large-scale evaluation on 6000 documents. The MTE is currently being integrated into a public website called the PDS (Planetary Data Sys- tem) Analyst’s Notebook for Mars scientists and the public to access. The automated pipeline can be used to ingest and analyze new publications as they become available. Related Work A variety of text analysis methods exist for extracting in- formation from text. Some methods focus on extracting meta-data such as the document title, authors, and publica- tion venue or analyzing and linking citations between pa- pers (Ronzano and Saggion 2016). However, understanding the content of a scientific publication requires a deeper anal- ysis. Information extraction (IE) of this nature is generally broken into two steps: (1) named entity recognition or con- cept extraction, to identify references to people, locations, concepts, etc., and (2) relation extraction, to identify rela- tionships between pairs of entities (Mooney and Bunescu 2005). Many of the recent advances in IE have been moti- vated by problems from the biomedical research world, such as the desire to identify protein-protein interactions (Tikk et al. 2010; Bui, Katrenko, and Sloot 2011) or chemical-protein and chemical-disease relations (Krallinger et al. 2017). Tsut- sui, Ding, and Meng (2016) used topic modeling and open IE, which does not require the prior identification of entities, to build a knowledge database about Alzheimer’s disease. To date, little such work has been done in the domain of planetary science. The closest existing work is the geology- based GeoDeepDive project, which performs text data min- ing on scientific publications about (Earth) rock formations and stratigraphy (Zhang et al. 2013). By applying and ex- tending information extraction methods to planetary sci- ence publications, we have the opportunity to benefit an en- tirely new population of scientists, researchers, and inter- ested members of the public. Machine Learning for Information Extraction The Mars Target Encyclopedia (MTE) is an information ex- traction system that takes in scientific publications in PDF format and extracts knowledge that is useful to scientists

Upload: others

Post on 29-Jan-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

  • Mars Target Encyclopedia:Rock and Soil Composition Extracted from the LiteratureKiri L. Wagstaff1, Raymond Francis1, Thamme Gowda1,2∗, You Lu1,

    Ellen Riloff3, Karanjeet Singh1∗, and Nina L. Lanza41Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, {firstname.lastname}@jpl.nasa.gov

    2Information Sciences Institute, University of Southern California, Marina Del Rey, CA 90292, [email protected] of Computing, University of Utah, Salt Lake City, UT 84112, [email protected]

    4Los Alamos National Laboratory, Los Alamos, NM 87545, [email protected]

    AbstractWe have constructed an information extraction system calledthe Mars Target Encyclopedia that takes in planetary sciencepublications and extracts scientific knowledge about targetcompositions. The extracted knowledge is stored in a search-able database that can greatly accelerate the ability of scien-tists to compare new discoveries with what is already known.To date, we have applied this system to ∼6000 documentsand achieved 41–56% precision in the extracted information.

    IntroductionScientists everywhere are overwhelmed by the stream ofnew information that is published by their disciplines’ con-ferences, workshops, and journals. It is increasingly difficultto come up to speed in a new area and to stay current with thelatest discoveries. In planetary exploration, new discoveriescan occur each time new data is transmitted. For example,our rovers on Mars have sent back compositional data forthousands of individual targets (e.g., rocks, soils), and someof those observations have transformed our understanding ofpast environments on the planet (Grotzinger et al. 2014).

    To interpret new observations correctly, it is necessary tobe able to compare them with what is already known. For ex-ample, if we observe high manganese content at a particularlocation, we want to know whether it is consistent with pre-vious observations or it indicates an anomalous new discov-ery. However, no central database exists in which planetaryscientists can quickly make that determination.

    We have created a system called the Mars Target Encyclo-pedia (MTE) that uses information extraction methods to an-alyze planetary science publications and identify stated com-positional relationships between Mars surface targets and el-ements or minerals. The extracted information is stored in asearchable database that allows users to ask questions suchas “Which targets contain hematite?” or “What is knownabout target Dillinger?” It also enables entirely new kinds ofinformation visualization, such as a map display of all loca-tions where the Mars rover Curiosity has detected hematite.Ultimately, the MTE may serve as a resource to inform de-cisions about the next steps in Mars exploration.

    ∗This work was done when the authors were at the Jet Propul-sion Laboratory.Copyright c© 2018, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

    In this paper, we describe the MTE system and its compo-nent technologies, the empirical performance of the systemon labeled data, and results from a large-scale evaluation on∼6000 documents. The MTE is currently being integratedinto a public website called the PDS (Planetary Data Sys-tem) Analyst’s Notebook for Mars scientists and the publicto access. The automated pipeline can be used to ingest andanalyze new publications as they become available.

    Related WorkA variety of text analysis methods exist for extracting in-formation from text. Some methods focus on extractingmeta-data such as the document title, authors, and publica-tion venue or analyzing and linking citations between pa-pers (Ronzano and Saggion 2016). However, understandingthe content of a scientific publication requires a deeper anal-ysis. Information extraction (IE) of this nature is generallybroken into two steps: (1) named entity recognition or con-cept extraction, to identify references to people, locations,concepts, etc., and (2) relation extraction, to identify rela-tionships between pairs of entities (Mooney and Bunescu2005). Many of the recent advances in IE have been moti-vated by problems from the biomedical research world, suchas the desire to identify protein-protein interactions (Tikk etal. 2010; Bui, Katrenko, and Sloot 2011) or chemical-proteinand chemical-disease relations (Krallinger et al. 2017). Tsut-sui, Ding, and Meng (2016) used topic modeling and openIE, which does not require the prior identification of entities,to build a knowledge database about Alzheimer’s disease.

    To date, little such work has been done in the domain ofplanetary science. The closest existing work is the geology-based GeoDeepDive project, which performs text data min-ing on scientific publications about (Earth) rock formationsand stratigraphy (Zhang et al. 2013). By applying and ex-tending information extraction methods to planetary sci-ence publications, we have the opportunity to benefit an en-tirely new population of scientists, researchers, and inter-ested members of the public.

    Machine Learning for Information ExtractionThe Mars Target Encyclopedia (MTE) is an information ex-traction system that takes in scientific publications in PDFformat and extracts knowledge that is useful to scientists

  • EntitiesFind

    Elements, Minerals, Targets

    Relations Classify pairs of Target +

    (Element or Mineral)

    MTE Database

    Automatedinformationextraction Userqueriesviaweb

    SEDIMENTOLOGY AND STRATIGRAPHY OF THE PAHRUMP HILLS OUTCROP, LOWER MOUNT SHARP, GALE CRATER, MARS. K. M. Stack1, J. P. Grotzinger2, S. Gupta3, L. C. Kah4, K. W. Lewis,5 M. J. McBride6, M. E. Minitti7, D. M. Rubin8, J. Schieber9, D. Y. Sumner10, L. M. Thompson11, J. Van Beek6, A. R. Vasavada1, R. A. Yingst7. 1Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109 ([email protected]), 2California Institute of Technology, Pasadena, CA, 3Imperial College, London, UK, 4University of Tennessee, Knoxville, TN, 5Johns Hopkins University, Baltimore, MD, 6Malin Space Science Systems, San Diego, CA, 7Planetary Science Institute, Tucson, AZ, 8UC Santa Cruz, Santa Cruz, CA, 9Indiana University, Bloomington, IN, 10UC Davis, Davis, CA, 11University of New Brunswick, Fredericton, NB, Canada.

    Introduction: In September 2014, the Mars Sci-

    ence Laboratory Curiosity rover arrived at the Pahrump Hills outcrop after an 8 km traverse from Yellowknife Bay. Geologic mapping of high-resolution orbital images from the HiRISE camera suggests that the Pahrump Hills outcrop is Curiosity’s first encounter with the Murray formation, the informal designation for strata recognized as lower Mount Sharp (Figure 1). This study presents an overview of the Cu-riosity rover team’s investigation of Pahrump Hills and provides the stratigraphic context and depositional interpretation for sedimentary facies and diagenetic textures observed at this outcrop.

    Figure 1. Location of the Pahrump Hills outcrop (yellow star) shown in HiRISE and on a HiRISE digital terrain model (inset).

    The Curiosity Rover Team’s Investigation at Pahrump Hills: After completing sample acquisition and analysis at the Confidence Hills drill site at the base of Pahrump Hills [1], Curiosity began the first of two traverses up the ~12 m thick Pahrump Hills sec-tion (Figure 2). During the first traverse, only the re-mote science instruments (ChemCam, Mastcam, and MARDI) were used to quickly and efficiently charac-terize the section [2-4]. Several outcrops were then examined during a second traverse using Curiosity’s dust removal tool (DRT) and contact science instru-ments (MAHLI and APXS) [5,6]. Using observations acquired during the two traverses from the Mastcam, MARDI, and MAHLI cameras localized to HiRISE

    DTM and Navcam stereo mesh data, a stratigraphic column was constructed for Pahrump Hills using ele-vations, lithologic, and sedimentary properties (Figure 3).

    Figure 2. Main outcrops visited by the Curiosity rover at the Pahrump Hills outcrop displayed on a Mastcam mosaic produced by MSSS. White dots = end of drive or mid-drive stops visited during traverse 1 only, red dots = outcrops examined during traverse 2, blue dot = Confidence Hill drill location.

    Sedimentary Facies at Pahrump Hills: Five main sedimentary facies were observed at Pahrump Hills (Figure 3):

    Recessively-weathering Massive Mud-stone/Siltstone. The most prevalent facies observed throughout the Pahrump Hills section is a slope-forming, very fine-grained rock that appears massive in Mastcam and MARDI images. Individual in-situ grains are not resolvable in MAHLI images of brushed exposures, which suggests that the grain size of this facies is less than ~50 µm, or 2.5x the maximum MAHLI resolution achieved at a 3.9 cm working dis-tance. Accordingly, this facies is likely composed of clay (

  • Table 1: Manual annotations for LPSC documents.2015 2016 Total

    Annotation (62 docs) (55 docs) (117 docs)Element 1195 1029 2224Mineral 748 708 1456Target 566 347 913Contains 434 262 696Total 2943 2346 5289

    Figure 2: Excerpt from document lpsc16-1155 showingcompositional annotations created with the brat web annota-tion tool.

    CorpusOur corpus consists of two-page extended LPSC abstractsin PDF format. We selected 62 documents from LPSC 2015and 55 documents from LPSC 2016 that mentioned “Chem-Cam”, and used the brat annotation tool (Stenetorp et al.2012) to manually label entities within these documents(see Table 1). This data set contains thousands of annota-tions, which are available here: https://doi.org/10.5281/zenodo.1048419. We estimate that it took an av-erage of 30 minutes to annotate each document, or a total ofmore than 58 hours of labor for the full corpus.

    The annotated relationships ranged from simple (e.g., apattern such as “X contains Y” within a sentence) to com-plex (e.g., relationships that crossed sentence boundaries orinvolved pronouns like “it” and other anaphora). Figure 2shows an excerpt from one document that contains sev-eral statements about the composition of the target Big Sky.The vocabulary used to indicate a compositional relationshipvaries, and the final relationship crosses a sentence bound-aries.

    Named Entity Recognition ResultsThe Named Entity Recognizer operates on individual wordsor tokens. We used the 2015 documents for training and di-vided the 2016 documents into validation (n = 20) and test-ing (n = 35) sets. As shown in Table 2, the baseline ap-proach of employing the known lists of elements, minerals,and targets achieved an F1 score of 0.76. Training a basicNER classifier using the CoreNLP system yielded an im-

    Table 2: Named entity recognition performance on LPSC2016 test documents. The best result for each metric isshown in bold.

    Prec. Recall F1Baseline: Lists only 0.831 0.699 0.760CoreNLP NER trained on:LPSC 2015 0.948 0.700 0.805LPSC 2015 + gazettes 0.945 0.777 0.853

    proved F1 score of 0.805. Virtually all of the improvementcame from increased precision (from 0.83 to 0.95). Recallwas highest (0.84) for the Element class, as expected; it was0.73 for Minerals and only 0.28 for Targets. The Target classis the most difficult one to recognize due to the lack of anaming convention and ambiguous names. In addition, theTarget class grows much faster than the set of known ele-ments or minerals, so there will always be new targets infuture documents that never appeared in the training set.

    However, we were able to improve NER recall as well byincluding the gazettes as described above. These term listsaugment the manually labeled documents and provide rel-evant domain knowledge. With the gazettes, the F1 scoreincreased to 0.853 by boosting recall to 0.777. Recall for theTarget class, in particular, more than doubled, to 0.67.

    Relation Extraction ResultsThe decision about whether or not a relation exists is madefor a given pair of entities. Processing all possible pairs ofentities in a document would be infeasible (and likely un-necessary). For simplicity, we adopted the strategy used inprevious work (Giuliano, Lavelli, and Romano 2006) of gen-erating all pairs of entities that occur within a single sen-tence. We used CoreNLP’s sentence splitter to divide thecorpus into sentences and the NER model trained above toidentify entities. For each (Target, Element) or (Target, Min-eral) pair, we generated a jSRE example that encoded thesentence content. If the pair of entities was connected by arelation in the manual annotations, we gave the example apositive label; otherwise, we gave it a negative label.

    To simulate how the system would be used in practice, wetrained and validated the relation classifier using text fromLPSC 2015 and tested it on LPSC 2016. We used the first42 LPSC 2015 documents for training and the remaining 20for validation. The number and distribution of the resultingjSRE examples (relationships) are given in Table 3.

    Table 3: Number and distribution of relationships betweenTargets and Elements or Minerals. The number in parenthe-ses is the percentage of positive relationships.

    Element Mineral MergedTrain 279 (38%) 150 (41%) 429 (39%)Validation 93 (27%) 70 (69%) 163 (45%)Test 111 (37%) 62 (50%) 173 (42%)

  • Table 4: Relation extraction performance on LPSC 2016(test) documents. The best result for each metric is shownin bold.

    Precision Recall F1Elements (n = 111)

    Baseline: All-yes 0.369 1.000 0.539jSRE-Elements 0.531 0.415 0.466

    Minerals (n = 62)Baseline: All-yes 0.500 1.000 0.667jSRE-Minerals 0.679 0.613 0.644

    Merged (n = 173)Baseline: All-yes 0.416 1.000 0.588jSRE-Indiv. 0.598 0.447 0.511jSRE-Merged 0.640 0.444 0.525

    We trained three different relation classifiers: one onTarget-Element relations only; one on Target-Mineral rela-tions only; and one on the merged set. We were curious as tohow a specialized model that was trained on less data wouldcompare to a more generic model trained on more data. Foreach model, we performed a grid search over the jSRE pa-rameters by trying each of the SVM kernels (LC, GC, SL)and window sizes within the set { 1, 2, 5, 10, 15, 20 }. Weselected the model parameters that led to the highest perfor-mance on the validation set in terms of precision. We foundthat the max-precision model did not employ the same pa-rameters across the three models. jSRE-Elements and jSRE-Merged used an LC kernel with a window of 5, while jSRE-Minerals used an SL kernel with a window of 5. Notably, theGC kernel that the original authors found to be most power-ful for the biomedical domain did not perform well in thiscorpus.

    We found that the individual models (“jSRE-Elements”and “jSRE-Minerals”) achieved much higher precision thana baseline approach that always predicted that a relationshipwas present (“All-yes”) (see Table 4). While this baselinealways achieves a recall of 1.00 and therefore appears supe-rior in terms of F-measure, this application domain valuesprecision much more than recall. Content included in theMTE must be of the highest reliability, even if this means itis not comprehensive. We also found that the merged model(“jSRE-Merged”) out-performed the baseline and the indi-vidual models when they were applied to the full (Merged)data set (“jSRE-Indiv.”).

    Large-scale EvaluationWe collected all LPSC documents that were published in2014, 2015, and 2016, omitting the training documents fromLPSC 2015, and ingested them into the MTE (n = 5897).

    It would be infeasible to ask humans to manually labelall 5897 documents to evaluate our results, so instead weperformed a manual review of only the extracted relations.This allows us to measure precision, but not recall. However,as noted above, precision is far more important than recallin this domain, as it captures the true utility of the extractedinformation when used in practice.

    Table 5: Manual review of 817 relations extracted from 5897documents.

    LPSC14 LPSC15 LPSC16 TotalCorrect 55% 57% 29% 41%Partial 9% 15% 14% 13%Irrelevant 19% 6% 9% 11%Wrong 11% 21% 2% 8%Unsure 6% 0% 47% 28%

    The manual review results are shown in Table 5. Our man-ual reviewer examined each extracted relation and its sourcesentence to judge the relation as Correct, Partial (e.g., onlyone word of a multi-word Target name was extracted), Ir-relevant (an appropriate extraction from the sentence, butthe content was not about Mars), Wrong, or Unsure (the re-viewer could not determine whether the relation was cor-rect).

    Overall, the fraction of Correct relations was 41%. Perfor-mance on the 2016 documents was significantly lower thanfor the preceding years. Since the system was trained on doc-uments from 2015, it is likely that targets mentioned in 2015would encompass those discovered in 2014 and 2015, whilethe documents from 2016 contain many newly discoveredtargets and therefore present a more difficult generalizationtask. Many of the Partial relations occurred due to limitedsupport for extracting multi-word entities. This is an areafor future improvement.

    The Irrelevant relations are in some ways quite interest-ing; there are several relations that express the compositionof meteorites that happen to have the same names as (real)Mars targets. The system correctly interpreted the sourcesentences, but the information does not (strictly) belong inthe MTE. For example, the system inferred that “Gibeon”contains “chromite” from this sentence: “Gibeon was foundin several studies to have both chromite and daubrelite inclu-sions.” Gibeon is the name of a Mars target and of a mete-orite. Disambiguating the two requires more context than asingle sentence. Most of the Unsure relations came from ta-bles whose formatting was lost in the conversion from PDFto text. A useful future direction might be to omit table con-tent or to capture its structure in some way, e.g., by usingthe Tabula tool4. If we omit these unparseable sections, weobtain 56% Correct relations, 18% Partial, 15% Irrelevant,and 11% Wrong.

    Deployment of the MTEWe created a simple web interface to allow users to query theMTE for information about targets, elements, or minerals.This interface is currently only available inside JPL, but weare in the process of integrating it with a public PDS websiteas discussed below.

    The MTE enables scientists to ask new questions that pre-viously could not be answered. For example, Figure 3 showsthe results of a query on “hematite.” Nine targets that con-

    4https://github.com/tabulapdf/tabula

  • Figure 3: MTE search results for “hematite.” Nine results(individual Mars targets) are returned, and a map displaysthe location of each hit, in red.

    tain hematite were returned. The user can click on any targetto see the extracted information and sentence excerpts thatsupport the conclusion about the presence of hematite. Be-low is a map of the Curiosity rover’s traverse on Mars, withthe locations of the matching targets marked in red. One canimmediately see whether hematite is localized or has beenidentified throughout the mission.

    LimitationsThe MTE is not comprehensive. There may be composi-tional information that was never written up in a scientificpublication and therefore would not be included in the MTE.Instead, the MTE extracts and indexes only the informationthat was judged by scientists to be worthy of publication tothe scientific community. The MTE leverages and mirrorsthis selection bias, and its holdings (like the source publi-cations) contain only the most valuable and salient informa-tion. This incompleteness is important to convey to the userso that they interpret results correctly.

    On the technical side, there are two important limitationsto the MTE content. First, the current MTE cannot generateoverlapping annotations. For example, the phrase “calciumsulfate” was manually labeled as “calcium” (Element), “sul-fate” (Mineral), and “calcium sulfate” (Mineral). However,the MTE’s NER model only classifies individual tokens, soit misses the “calcium sulfate” phrase.

    Second, the relation extraction module only generatescandidate relation pairs within a sentence. In this corpus,32% of the manually annotated relations cross sentenceboundaries. Therefore, the current system cannot yet retrievethose relations. One way to access sentence-crossing rela-tions would be to expand the number of candidate entitypairs to include all pairs within a paragraph. We plan to eval-uate that strategy in future work.

    Conclusions and Next StepsThis work lies at the intersection of information extraction,machine learning, and planetary science. The MTE uses cur-rent IE technology to provide the first database of Mars tar-get compositional knowledge as expressed in the scientificliterature. The pipeline is fully automated, and we can em-ploy web crawlers to seek out new (publicly accessible) pa-pers as they are published. We also plan to enable users tosubmit their own publications for analysis and augmentationof the database.

    We are in the process of integrating the MTE’s contentinto the MSL Analyst’s Notebook, an interactive web re-source for mission scientists and the interested public (Steinand Arvidson 2013). The Analyst’s Notebook allows usersto browse mission plans, targets discovered, data collected,and summaries of each mission day on Mars. The MTE con-tent will enable the Analyst’s Notebook to also connect tar-gets to publications.

    The MTE currently contains information about Chem-Cam targets that was extracted from three years of paperspublished at the Lunar and Planetary Science Conference. Anext logical step is to expand the MTE to encompass targetsidentified by other instruments on the Curiosity (MSL) roverand other missions such as the Mars Exploration Rovers(Spirit and Opportunity). In addition, we plan to extend theMTE to be able to ingest journal papers that have been pub-lished by MSL science team members and the broader com-munity. This information will carry more weight because itcomes from peer-reviewed sources; users will be able to re-strict their searches to journal papers only, or to obtain allpossible results.

    The automatic extraction of knowledge from scientificpublications can benefit many other areas of scientific in-quiry. In addition to biology and medicine, there are op-portunities at the intersection between fields such as plane-tary science and astronomy. For example, there are currently3,550 confirmed exoplanets (planets outside our solar sys-tem) as of November 2, 2017 (NASA 2017). Hundreds ofnew planet candidates are announced each year in new pub-lications. Desirable properties to extract and store for eachplanet include its radius, temperature, period, distance fromits host star, and more. Compositional relationships exist for

  • elements present in the host star and for constituents in exo-planet atmospheres, with implications for the possible pres-ence of life on other planets. In general, extracting informa-tion and relationships into a central, searchable database canhelp inform new hypotheses and direct future science inves-tigations.

    AcknowledgmentsThis research was carried out in part at the Jet Propul-sion Laboratory, California Institute of Technology, un-der a contract with the National Aeronautics and SpaceAdministration. We thank the Multimission Ground Sys-tem and Services (MGSS) program and the Mars ScienceLaboratory project for funding and enthusiastically sup-porting this work. We also thank Chris Mattmann for hissupport and partial funding as provided by the DARPAXDATA/Memex/D3M programs and NSF award numbersICER-1639753, PLR-1348450, and PLR-144562.

    ReferencesBui, Q.-C.; Katrenko, S.; and Sloot, P. M. A. 2011. A hy-brid approach to extract protein-protein interactions. Bioin-formatics 27(2):259–265.Finkel, J. R.; Grenager, T.; and Manning, C. 2005. Incor-porating non-local information into information extractionsystems by Gibbs sampling. In Proceedings of the 43nd An-nual Meeting of the Association for Computational Linguis-tics (ACL 2005), 363–370.Giuliano, C.; Lavelli, A.; and Romano, L. 2006. Exploitingshallow linguistic information for relation extraction frombiomedical literature. In Proceedings of the 11th Confer-ence of the European Chapter of the Association for Com-putational Linguistics (EACL 2006), 401–408.Grotzinger, J. P.; Sumner, D. Y.; Kah, L. C.; Stack, K.;Gupta, S.; Edgar, L.; Rubin, D.; Lewis, K.; Schieber, J.;Mangold, N.; Milliken, R.; Conrad, P. G.; DesMarais, D.;Farmer, J.; Siebach, K.; Calef, F.; Hurowitz, J.; McLennan,S. M.; Ming, D.; Vaniman, D.; Crisp, J.; Vasavada, A.; Ed-gett, K. S.; Malin, M.; Blake, D.; Gellert, R.; Mahaffy, P.;Wiens, R. C.; Maurice, S.; Grant, J. A.; Wilson, S.; An-derson, R. C.; Beegle, L.; Arvidson, R.; Hallet, B.; Sletten,R. S.; Rice, M.; Bell, J.; Griffes, J.; Ehlmann, B.; Ander-son, R. B.; Bristow, T. F.; Dietrich, W. E.; Dromart, G.;Eigenbrode, J.; Fraeman, A.; Hardgrove, C.; Herkenhoff,K.; Jandura, L.; Kocurek, G.; Lee, S.; Leshin, L. A.; Lev-eille, R.; Limonadi, D.; Maki, J.; McCloskey, S.; Meyer, M.;Minitti, M.; Newsom, H.; Oehler, D.; Okon, A.; Palucis, M.;Parker, T.; Rowland, S.; Schmidt, M.; Squyres, S.; Steele,A.; Stolper, E.; Summons, R.; Treiman, A.; Williams, R.;Yingst, A.; and Team, M. S. 2014. A habitable fluvio-lacustrine environment at Yellowknife Bay, Gale Crater,Mars. Science 343(6169).Krallinger, M.; Rabal, O.; Loureno, A.; Oyarzabal, J.;and Valencia, A. 2017. Information retrieval and textmining technologies for chemistry. Chemical Reviews117(12):7673–7761.

    Mattmann, C., and Zitting, J. 2011. Tika in Action. NewYork: Manning Publications.Maurice, S., and 70 others. 2012. The ChemCam instrumentsuite on the Mars Science Laboratory (MSL) rover: Scienceobjectives and mast unit description. Space Science Reviews170(1):95–166. doi:10.1007/s11214-012-9912-2.Mooney, R. J., and Bunescu, R. 2005. Mining knowledgefrom text using information extraction. ACM SIGKDD Ex-plorations Newsletter 7(1):3–10.NASA. 2017. Exoplanet archive. https://exoplanetarchive.ipac.caltech.edu/.Ronzano, F., and Saggion, H. 2016. Knowledge extractionand modeling from scientific publications. In Proceedingsof the Enhancing Scholarly Data Workshop, 11–25. Cham:Springer International Publishing.Stein, T. C., and Arvidson, R. E. 2013. PDS Analyst’s Note-book for MSL. In Proceedings of the 44th Lunar and Plan-etary Science Conference, Abstract 1570.Stenetorp, P.; Pyysalo, S.; Topić, G.; Ohta, T.; Ananiadou,S.; and Tsujii, J. 2012. brat: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstra-tions Session at EACL 2012.Tikk, D.; Thomas, P.; Palaga, P.; Hakenberg, J.; and Leser,U. 2010. A comprehensive benchmark of kernel methodsto extract protein-protein interactions from literature. PLoSComputational Biology 6(7).Tsutsui, S.; Ding, Y.; and Meng, G. 2016. Machine read-ing approach to understand Alzheimers disease literature. InProceedings of the Tenth International Workshop on Dataand Text Mining in Biomedical Informatics (DTMBIO).Zhang, C.; Govindaraju, V.; Borchardt, J.; Foltz, T.; and andShanan Peters, C. R. 2013. GeoDeepDive: Statistical infer-ence using familiar data-processing languages. In Proceed-ings of the 2013 ACM SIGMOD International Conferenceon Management of Data, 993–996.