deriving enhanced geographical representations via ... · survival curves can be reasonably...

22
Unnamed Journal, Vol. x, No. x, 1–22 1 Deriving Enhanced Geographical Representations via Similarity-based Spectral Analysis: Predicting Colorectal Cancer Survival Curves in Iowa Michael T. Lash Department of Computer Science, University of Iowa, Iowa City, IA, USA E-mail: [email protected] Min Zhang Interdisciplinary Graduate Program in Informatics, University of Iowa, Iowa City, IA, USA E-mail: [email protected] Xun Zhou Management Sciences Department, University of Iowa, Iowa City, IA, USA E-mail: [email protected] W. Nick Street Management Sciences Department, University of Iowa, Iowa City, IA, USA E-mail: [email protected] Charles F. Lynch Department of Epidemiology, University of Iowa, Iowa City, IA, USA E-mail: [email protected] Abstract: Neural networks are capable of learning rich, nonlinear feature representations shown to be beneficial in many predictive tasks. In this work, we use such models to explore different geographical feature representations in the context of predicting colorectal cancer survival curves for patients in the state of Iowa, spanning the years 1989 to 2013. Specifically, we compare model performance using area between the curves (ABC) to assess (a) whether Copyright © retained by authors, 2018. arXiv:1809.03323v1 [cs.LG] 6 Sep 2018

Upload: others

Post on 26-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Unnamed Journal, Vol. x, No. x, 1–22 1

Deriving Enhanced Geographical Representationsvia Similarity-based Spectral Analysis: PredictingColorectal Cancer Survival Curves in Iowa

Michael T. LashDepartment of Computer Science,University of Iowa,Iowa City, IA, USA

E-mail: [email protected]

Min Zhang

Interdisciplinary Graduate Program in Informatics,University of Iowa,Iowa City, IA, USA

E-mail: [email protected]

Xun Zhou

Management Sciences Department,University of Iowa,Iowa City, IA, USA

E-mail: [email protected]

W. Nick Street

Management Sciences Department,University of Iowa,Iowa City, IA, USA

E-mail: [email protected]

Charles F. Lynch

Department of Epidemiology,University of Iowa,Iowa City, IA, USA

E-mail: [email protected]

Abstract: Neural networks are capable of learning rich, nonlinear featurerepresentations shown to be beneficial in many predictive tasks. In this work,we use such models to explore different geographical feature representationsin the context of predicting colorectal cancer survival curves for patients inthe state of Iowa, spanning the years 1989 to 2013. Specifically, we comparemodel performance using area between the curves (ABC) to assess (a) whether

Copyright © retained by authors, 2018.

arX

iv:1

809.

0332

3v1

[cs

.LG

] 6

Sep

201

8

Page 2: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

2 MT. Lash et al.

survival curves can be reasonably predicted for colorectal cancer patients in thestate of Iowa, (b) whether geographical features improve predictive performance,(c) whether a simple binary representation, or a richer, spectral analysis-elicited representation perform better, and (d) whether spectral analysis-basedrepresentations can be improved upon by leveraging geographically-descriptivefeatures. In exploring (d), we devise a similarity-based spectral analysisprocedure, which allows for the combination of geographically relational andgeographically descriptive features. Our findings suggest that survival curves canbe reasonably estimated on average, with predictive performance deviating at thefive-year survival mark among all models. We also find that geographical featuresimprove predictive performance, and that better performance is obtained usingricher, spectral analysis-elicited features. Furthermore, we find that similarity-based spectral analysis-elicited representations improve upon the original spectralanalysis results by approximately 40%.

Keywords: Geographical representations; Spectral analysis; Deep learning;Spectral clustering; Neural networks; Colorectal cancer; Survival curve

Reference to this paper should be made as follows: Lash, M.T., Zhang, M. ,Street, W.N., Zhou, X., and Lynch, C.F. (2018) ‘Deriving Enhanced GeographicalRepresentations via Similarity-based Spectral Analysis: Predicting ColorectalCancer Survival Curves in Iowa’, arXiv preprint, 2018.

Biographical notes: Michael Lash is currently a PhD candidate in the Departmentof Computer Science at the University of Iowa, advised by W. Nick Street andAlberto M. Segre. His research interests are broadly in the areas of machinelearning and data mining methodology, with specific interests lying in causallearning, learning from graphs, geographical and spatial data mining, amongothers. He has published in top data mining conferences, such as SDM, topapplication venues, such as ICHI and BIBM, and top business journals, such asJMIS. He has also been the recipient of numerous awards, including the Universityof Iowa Graduate College Post-Comprehensive Research Fellowship, an NSFGRFP Honorable Mention, and a variety of student travel awards.Min Zhang is currently a graduate student in the Interdisciplinary GraduateProgram in Informatics at the University of Iowa. He received the bachelor degreein Business Analytics and Information Systems from the University of Iowa, in2017. His research interests include business data analytics, big data management,and Geographic Information Systems (GIS).Xun Zhou is currently an Assistant Professor in the Department of ManagementSciences at the University of Iowa. He received a PhD degree in ComputerScience from the University of Minnesota, Twin Cities in 2014. His researchinterests include big data management and analytics, spatial and spatio-temporaldata mining, and Geographic Information Systems (GIS). His works have beenpublished in top conferences and journals such as ACM SIGKDD, IEEE ICDM,ACM SIGSPATIAL, and IEEE TKDE. Xun has received three best paper awards.He was also a co-editor-in-chief of the Springer Encyclopedia of GIS, 2nd Edition.Nick Street is the Henry B. Tippie Research Professor and Departmental ExecutiveOfficer in the Management Sciences Department at the University of Iowa, withjoint appointments in Computer Science, Nursing, and Informatics. He is alsothe director of the interdisciplinary graduate program in Health Informatics. Hisresearch interests are in algorithmic approaches to machine learning and datamining, particularly the use of mathematical optimization in inductive learningtechniques. His recent work has focused on counterfactual reasoning, ensembleconstruction methods, and personalized health care decision making. He haspublished over 110 journal, conference and workshop papers, and has receivedan NSF CAREER award and an NIH INRSA postdoctoral fellowship.

Page 3: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 3

Charles Lynch is a Professor with a joint appointment in the Department ofEpidemiology in the College of Public Health and in the Department of Pathologyin the College of Medicine at The University of Iowa. He has been a facultymember at The University of Iowa since he completed his pathology training in1986. Since 1990, he has been Principal Investigator of the State Health Registryof Iowa, Iowa’s statewide cancer surveillance program. His primary researchinterests include cancer surveillance, cancer epidemiology, and environmentalepidemiology.

1 Introduction

As machine learning has become more prevalent, powerful new technologies such as deeplearning, which are capable of learning rich, non-linear representation, have also risen tothe forefront of the field. The domains of public health and medicine have particularlybenefited from these innovations; in this work we examine and propose deep learningmethodologies applied to these areas. The focus of this work, therefore, is to explore howdifferent geographical representations, learned through deep learning technologies, canimprove survival curve predictions for colorectal cancer patients in the state of Iowa.

Figure 1 demonstrates the urgency of the problem we are addressing, showing colorectalcancer (CRC) mortality rates for patients in Iowa spanning the years 1989 to 2013; theseare expressed in terms of a zipcode tabulation area (ZCTA) level of geography. In Figure

¯ Legend

Iowa

Mortality

0.0 - 0.145

0.145 - 0.307

0.307 - 0.447

0.447 - 0.714

0.714 - 1.0

0 40 8020 Miles

Colorectal Cancer Mortality Rate by ZCTAin Iowa: 1989 to 2013

created by: Michael T. Lash

Figure 1: Colorectal cancer mortality rate by ZCTA in the state of Iowa for the years 1989to 2013.

1 we first observe that numerous ZCTAs have CRC mortality rates that are at or above the

Page 4: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

4 MT. Lash et al.

30%, indicating the particularly nefarious nature of this disease, and highlighting the needfor accurate survival outlook predictions at the time of diagnosis to better inform treatmentdecisions Zhang et al. (2015). Furthermore, Figure 1 demonstrates the geographical diversityin which CRC mortality rates are manifested: locale seems to be related to survival outlook.

The relationship between geography and survival outlook isn’t unforeseen,unfortunately. Physical locale manifests pertinent health-based factors, such as access tohealth care, environmental factors, such as ground contaminants, among others, all of whichmay affect disease manifestation and survival outlook Wan et al. (2013).

Provided the spatially heterogeneous manifestation of colorectal cancer mortality, amajor challenge is to build spatially responsive models that can aid in accurate predictionof individual-specific colorectal cancer survival curves. For instance, rural areas may havea different variety of factors affecting colorectal cancer disease manifestation and survivalthan sprawling metropolitan cities. Therefore, we define and examine three geographicaldeep learning representation methods in this work: a simple binary representation (SBR),rich representation – spectral analysis (RR-SA), and rich representation –similarity-basedspectral analysis (RR-SSA); we additionally craft two sub-representation methods that areutilized in RR-SSA.

The contributions of this work, which expand upon the results obtained in Lash, Sun,Zhou, Lynch & Street (2017), are enumerated as follows:

1. We investigate a rich representation of spatial features through spectral analysis (RR-SA) of the underlying geographical relationship graph of the ZCTAs to address thespatial heterogeneity challenge.

2. Modifying our RR-SA representation procedure, we explore the use of geographicallydescriptive features, paired with the underlying adjacency graph, to further address thespatially heterogeneous nature of the problem.

3. We determine whether the simple binary representation (SBR) or richer, spectralanalysis representation (RR-SA), or similarity-based spectral analysis representation(RR-SSA) leads to more accurate survival curve predictions.

4. We determine whether RR-SA or RR-SSA representations lead to more accuratesurvival curve predictions and determine which sub-representation procedure – binary(bin) or full – produces more accurate survival curve predictions.

This works continues with a disclosure of the problem setting, followed by relation ofour three methods of geographic representation and two sub-representation methods; wealso present a graphic depicting the deep architecture of each method (Section 2). In Section3, we describe our colorectal cancer patient dataset, containing 46000 individuals residingin Iowa at the time of their diagnosis; the dataset spans the years 1989 to 2013. Furthermore,we relate our geographical feature dataset, along with our experiments. In Section 4 wedisclose works related to ours prior to concluding the paper in Section 5.

2 Learning Geographical Representations for Survival Curve Prediction

Prior to disclosing our methodology, we relate some preliminary notation, subsequentlydiscussing and mathematically formulating the problem setting. Following this disclosurewe reformulate the problem as one of Kaplan-Meier survival curve prediction before

Page 5: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 5

introducing and elaborating on our three methods of geographical representation learningand two sub-representations.

2.1 Preliminaries

Define {(x(i), e(i), t(i))}ni=1 to be a dataset of n instances, where feature vector x(i) ∈ Rm,event label e(i) ∈ {0, 1}, and time of event occurrence t(i) ∈ {0, 1, . . . , T}; where t(i)

represents a discrete time at which an event e(i) has occurred (i.e., e(i) = 1) or the lastdiscrete time instance i is observed, while an event has not occurred (i.e., e(i) = 0). Inthis latter case (e(i) = 0), when t(i) = T we know the event never occurs to the instanceduring the study period (spanning T discrete time periods). If, on the other hand, t(i) < Tthen we only know that the instance did not experience the event up to t(i), but don’tknow what happened during the T − t(i) remaining time. Representation of event-time datadescribed as such are called censored data and, even more specifically, right-censored data.A censoring of instance i occurs when e(i) = 0 and t(i) < T . We elaborate on the handlingof these censored data in a proceeding subsection.

To be more concrete, t ∈ {1, . . . , T} may represent (as is the case in our experiments)six-month patient follow-up periods, with t = 0 designating the entrance of a patient to thestudy. Study entrance occurs when a diagnosis of colorectal cancer is rendered. When aninstance (i.e., patient) i dies from colorectal cancer – e(i) = 1 – then t(i) designates a timein which this event occurred. Alternately, a patient may pass away from complications notrelated to their colorectal cancer disease, or may move elsewhere, switch doctors, or forsome other reason become untrackable prior to the conclusion of the study period, thent(i) < T and e(i) = 0, indicating a censoring.

Patient instance vectors x(i) represent quantified measurements of pertinent patient-based features. Later in this work, we will make reference to certain feature groups of whichthese instance vectors are composed. Therefore, we define notation that will convenientlyrelate to these groups. To such an end, let z denote the full set of index values that referencethe geographical features of x(i); further, denote a to be the full set of index values ofx(i) such that a = {1, . . . ,m}. We will use these index sets to make direct reference tothe feature grouping components of x(i); for instance, x(i)

z is the subvector of instance ihousing the geographical feature values. Furthermore, using set difference notation, x(i)

a\zrefers to the subvector of instance i containing feature values that are non-geographical.

For convenience, we provide the notation related in this and subsequent sections inTable 1.

2.2 Kaplan-Meier Re-representation

To begin elaborating on the censored nature of our data, as we mentioned in the previoussection, instance i has an event label e(i) and a discrete time of event occurrence t(i):provided this, the goal is to transform this two-valued representation to that of a Kaplan-Meier survival curve (KMSC) representation Kaplan & Meier (1958). A KMSC, simplyput, each temporal unit 1, . . . , T with a probability of the disease event e(i) not occurringat that particular temporal unit, dependent upon the probability of “not” event occurrenceof the preceding temporal unit, for each instance i.

More formally, the KMSC re-representation is in the form of a vector, denoted y(i) ∈[0, 1]T , where the index values t ∈ {1, . . . , T} express the temporal units and the indexedvalues y(i)

tdenote the respective probabilities.

Page 6: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

6 MT. Lash et al.

Notation Description

x(i) ∈ Rm Feature vector of instance i.e(i) ∈ {0, 1} Event label of instance i.t(i) ∈ {1, . . . , T} Discrete time of e(i).y(i) ∈ [0, 1]T Outcome vector of instance i.y(i) ∈ [0, 1]T Predicted outcome vector of instance i.

z Set of geographical feature index values.a Set of all feature index values.M A map.Γ(·) Function that determines discrete

geographic entity membership.

P (·) Calculation of a probability.g : Rm → [0, 1]T Neural network.L(·) An arbitrary loss function.Smooth Output smoothing function.

ZZZ Adjacency matrix constructed fromM.ZZZ SSA-elicited affinity matrix.AAA Design matrix for geographical

entities (descriptive geo feats).Common Function that determines whether two geographic entities in

M are adjacent.QQQspec Top k eigenvectors fromQQQ, selected based on largest

eigenvalues in λλλ.qlabel The result of applying kMeans clustering toQQQspec.Enrich Function that assigns values inQQQspec to an instance.ΘΘΘ SSA procedure to produce ZZZ.

Table 1 Notation used throughout this work.

Our KMSC re-representation scheme is originally outlined in Chi et al. Chi et al. (2007).To instantiate the vector y(i), the following is conducted:

y(i)

t=

1 if t < t(i)

0 if t ≥ t(i) & e(i) = 1

1− P (e(i)

t= 1|e(i)

t−1 = 0) if t ≥ t(i) & e(i) = 0

(1)

where P (e(i)

t= 1|e(i)

t−1 = 0) denotes the conditional probability of event e(i) occurring at tprovided that e(i) has not occurred at t− 1. As such, for patients whose CRC outcomes areknown, y(i) exhibits values that are strictly 0 and 1. On the other hand, a censored patient’svector exhibits estimation of survival probability beginning at the index position t = t(i);the ensuing values are conditional probability estimates.

2.3 Predicting Individual KMSC

Our goal in this work is to induce an optimal hypothesis g∗ ∈ G of some [presently]arbitrarily defined hypothesis classG, that is most apt at predicting instance-specific KMSCs.

Page 7: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 7

We formalize this problem as:

g∗ = arg ming∈G

{L(y(i), g(x(i))

): i = 1, . . . , n

}(2)

whereL(·) expresses some loss function that measures the divergence between the predictedy(i) (henceforth expressed y(i)) and the known y(i).

The hypothesis class G explored in this work is defined as both deep and shallow neuralnetwork architectures, the specifics of which are disclosed later in this section; we discussthe specific parameterizations employed across our experiments in the experiments section(Section 3). Deep neural network architectures are characterized by multiple hidden layers,and shallow architectures by a single hidden layer.

2.3.1 Output Smoothing

Construction of a neural network model is accomplished in layer-wise fashion, where aparticular layer is composed of nodes. The first layer in a neural network is designated as theinput layer, which is proceeded by any number of so-called hidden layers, the last of whichis connected to the output layer. The output layer is somewhat unique to our problem settingof predicting KMSCs. First, the predicted probability elicited from each of the t = 1, . . . , Toutput nodes are ordered. In other words, the output of nodeout

tis ordered before nodeout

t+1

because t is temporally occurs before t+ 1. Second, the ordered output probabilities of thesenodes should be strictly decreasing: i.e., output(i)

t≥ output(i)

t+1. The reasoning behind

this “strictly decreasing” expectation is intuitive: the probability of survival, of a diseaseor otherwise, even after disease recovery, is never expected to go up. The loss functionL(·) employed to induce mulitple-output networks, such as those in our problem setting,elicit a single loss value representing the loss across all nodes, meaning the desired strictlydecreasing output cannot be guaranteed. In light of this, we develop a smoothing operation,denoted Smooth(output(i)), formally expressed by

y(i)

t+1= min{output(i)

t, output

(i)

t+1} for t = 1, . . . , T (3)

guaranteeing that the post-processed (i.e., smoothed) model output is strictly decreasing.

2.4 Geographic Feature Representation

Although our primary concern is to elicit a g∗ that produces the most accurate predictions,the novelty of the work is to:

1. Demonstrate that geographic-based feature representations enhance the quality ofpredictions.

2. Explore whether simple binary representations or a richer representations (definedshortly) produce more accurate predictions.

3. Quantify the extent to which these representations improve predictive quality.

The details of our experiments and data are elaborated on in the next section where weexplore three different geographical representations: a simple binary representation (SBR), arich representation based on spectral analysis (RR-SA), and a rich representation employingsimilarity-based spectral analysis (RR-SSA).

Page 8: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

8 MT. Lash et al.

2.4.1 Simple Binary Representation

Our simple binary representation (SBR) is minimalist in nature, the procedure consistingonly of (a) determination of the discrete geographic entity membership of instance i and(b) such membership being binarily re-represented (referred to as one hot encoding), thuseliciting a sparse vector-based encoding with a 1 in the indexical location referring to thegeographic entity of which i is a member, and 0s in the remaining positions.

To devise a formulation that is aptly generalizable we assume that the geographic featuresof instance i, expressed as x(i)

z , are defined such that encoded values are capable of elicitingthe discrete geographic unit of which i is a member (e.g., coordinates). For instance, weemploy ZCTAs (zipcode tabulation area) as the discrete geographic unit in our experiments.

A formal procedure for eliciting discrete geographic unit membership can be expressedas

x(i)b = Γ(x(i)

z ,M) (4)

where the function Γ(·) performs a transformation on x(i)z , the geographic feature values

of the instance, to some identification (ID) value, which we denote as x(i)b . This x(i)b valuedenotes the unique, discrete geographic entity, belonging to map M (which we definemomentarily), of which instance i is a member. The values represented by x

(i)z , along with

the information expressed in map M, dictate the procedure used by Γ(·) to perform thetransformation.

The specific z geographics features employed in this work are (latitude,longitude)coordinate pairs and, as such, we specify a definition (referred to as Definition 2.1) of mapM using geography defined in terms of these coordinate pairs.

Definition 2.1: DefineM to be a map, given by

M = {(keyl, valuel)}pl=1 (5)

where keyl is the unique postal code of geographic unit l and valuel is an ordered set of(lat,lon) coordinate pairs denoting the bounding geographic region of l.

We characterize mapM as a continuous geographic region by{∀l∃l′ : valueql = valuejl′ for l, l′ ∈ {1, . . . , p} & l 6= l′

}(6)

where valueql = valuejl′ , (latql = latjl′) ∩ (lonql = lonjl′ ).

Provided our definition of mapM, expressed in Definition 2.1, we define Γ(·) to be afunction that takes (lat,lon) coordinate pairs x(i)

z and determines whether the point is on theinterior of each ZCTA. The zipcode ID, corresponding to the ZCTA having the point x(i)

z

on the interior, is subsequently initialized as the value of x(i)b (i.e., x(i)b = ZCTAtextID).Following this outlined Γ(·) procedure, and additional binarization procedure, often referredto as one-hot encoding, which we denote Bin, is employed to produce a spare vectorrepresentation consisting of a single 1 in the position referencing the ZCTA of which instancei belongs, and 0s in other positions.

Figure 2 illustrates the network architecture using the SBR methodology.We expect that non-geographic representations will perform worse than representations

employing the SBR representation

Page 9: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 9

Input node

Hidden nodeOutput node

(logistic)

SBR Feats

Figure 2: SBR neural network architecture.

While models elicited from employing the SBR representation may enjoy somepredictive performance improvement over hypotheses induced on strictly non-geogrpahicfeatures, representations that consist of richer geographic encodings, capable of modelingthe continuous nature of the geographic region of study promise to produce even betterresults.

2.4.2 Spectral Analysis Representation

To elicit richer geographical representations, we devise a spectral analysis approach, basedon on a well-known procedure referred to as spectral clustering. The method begins by firstcomputing an adjacency matrix among the discrete geographic entities represented inM.Subsequently, spectral analysis solves for the eigenvectors and eigenvalues of the adjacencyrepresentation, selecting the top k eigenvectors, based on the largest k eigenvalues. Theelicited representation is p× k matrix, where the p rows refer to the p discrete geographicentities (i.e., a single row refers to one of the p entities). The k values that compose eachrow are used as geographic predictive input features.

To more formally relate this spectral analysis procedure, defineZZZ = Adj (M) to be theaffinity (i.e., adjacency, similarity) matrix, in which the l, v-th entry relates the geographicadjacency relationship among the l-th and v-th entities. We express this by

[ZZZ]l,v =

1 if Common(valuesl, valuesv) = True& l 6= v

0 otherwise(7)

where the function Common(·) determines if valuesl and valuesv have a common element.ProvidedM, related by Definition 2.1, Common(·) computes whether or not valuesl andvaluesv share at least one coordinate pair.

Subsequently, spectral clustering is executed by performingkMeans clustering,qlabel =kMeans (QQQspec), where the function kMeans(·) assigns one of the k cluster labels to eachof the p entries ofQQQspec; where

QQQspec = Topk (QQQ,λλλ) . (8)

The function Topk(·) searches and finds the largest values in λλλ, selects the appropriatecolumns in the matrix QQQ, and creates the matrix QQQspec ∈ Rk×p. The matrix QQQ, composed

Page 10: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

10 MT. Lash et al.

of eigenvectors, and corresponding vector λλλ, composed of eigenvalues, are obtained bysolving the system of equations related by

ZZZQQQ = λλλQQQ. (9)

Here, the columns of QQQspec are used as the k geographical features when inducing ahypothesis g – we refer to this use of theQQQspec matrix as spectral analysis. The labels,qlabel,obtained from application of the clustering procedure, referred to as spectral clustering,are used to visualize the elicited representations obtained from our experiments, related inthe next section. Spectral analysis avoids making use of binarized label assignments of thespectral clustering procedure, instead using a subprocedure, termed spectral analysis, whichpreserves cluster composition and is a richer (i.e., non-sparse) representation.

In Algorithm 1, we relate the spectral clustering process, differentiating spectral analysisfrom spectral clustering via red highlighting; omission of this line produces the spectralanalysis procedure. Simply put, spectral analysis is a sub-procedure of the spectral clustering

Algorithm 1 Spectral Clustering

1: Obtain adjacency matrix ZZZ using (7).2: Solve (9) forQQQ and λλλ.3: ObtainQQQspec as outlined in (8).4: Apply kMeans clustering toQQQspec to obtain qlabel.

process, yielded by omitting the clustering step.Finally, for a test instance x, a process Enrich(xz,M,QQQspec) is executed to obtain the

appropriate k-valued column entry ofQQQspec that is associated with the particular geographicentity that the test instance belongs to. Algorithm 2 outlines this procedure. The deep

Algorithm 2 Enrich Geographic Features EnrichEnrichEnrich

Input: xz,M,QQQspec

1: xb = Γ(xz,M) From (4).2: Using xb find the l such that xb = keyl : l ∈ {1, . . . , p}.

Output: Return column vector [QQQspec]l

learning network architecture outlining the spectral analysis procedure in conjunction withlearning a hypothesis, is depicted in Figure 31.

2.4.3 Similarity-based Spectral Analysis Representation

While the above spectral analysis-based approach produces richer geographicrepresentations, the process is capable of leveraging only geographically relationalinformation, as the input affinity matrix must be square (i.e., p× p). It may, however,be beneficial to leverage encodings that are both relational and descriptive in nature, asgeographically-descriptive features, such as population demographics and types of land-use,may further enrich the spectral analysis-elicited representations.

To allow for such input matrices we adjust our spectral analysis process to first calculatethe pairwise similarity among geographic entities, thus producing a square affinity matrixon which the spectral analysis procedure can be performed.

Page 11: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 11

Input node

Hidden nodeOutput node

(logistic)

RR-SA Feats

Figure 3: RR-SA neural network architecture.

More formally, recall the previously discussed adjacency matrixZZZ and defineAAA ∈ Rp×h

to be a geographic entity design matrix whose rows represent geographic entities and whosecolumns represent features. Additionally, AAA is constructed such that the lth row of ZZZ andthe lth row of AAA refer to the same entity.

Using AAA and ZZZ, along with a similarity measure, we devise two sub-representationmethods from which a p× p affinity matrix can be derived.

The first sub-representation uses a single binary feature to indicate spatial adjacencybetween two geographic entities along with the geographically descriptive features of eachentity to yield two vectors zl and zv . These can be formally expressed as

zl =[ZZZl,v,AAAl] (10)zv =[ZZZv,l,AAAv]

where [·] represents the concatenation of a scalar or vector with another vector (in this case, itis scalar-vector concatenation). Also, note thatZZZl,v = ZZZv,l. We term this sub-representationmethod SSA (bin).

The second sub-representation uses full geographic entity adjacency vectors instead ofa definitive indicator of immediate spatial adjacency. This sub-representation method canbe expressed by

zl =[ZZZl,AAAl] (11)zv =[ZZZv,AAAv]

where, in this case [·] indicates two vectors being concatenated. We term this sub-representation method SSA (full).

Subsequently, using either SSA (bin) or SSA (full), the cosine similarity, denoted asφ(·), between l and v is computed, thus producing an affinity matrix.

Algorithm 3, which denotes the procedure as ΘΘΘ, fully discloses this process, using eitherSSA (bin) or SSA (full), while Figure 4 expresses the process in the context of the neuralnetwork architecture.

3 Predicting Colorectal Cancer Survival

We begin this section with an in-depth disclosure of the data employed in our experiments,subsequently outlining the technicalities involved in undertaking these experiments. Finally,

Page 12: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

12 MT. Lash et al.

Algorithm 3 Sub-rep to affinity ΘΘΘ

Input: AAA,ZZZ, REP ∈ {SSA (bin),SSA (full)}1: ZZZ← 0p×p

2: for l = 1, . . . , p− 1 do3: for v = l + 1, . . . , p− 1 do4: if REP =SSA (bin) then5: Define zl and zv according to (10).6: else7: Define zl and zv according to (11).8: end if9: ZZZ[i, j]← φ(zl, zv)

10: ZZZ[j, i]← φ(zl, zv)11: end for12: end forOutput: Return ZZZ

Input node

Hidden nodeOutput node

(logistic)

RR-SSA Feats

Figure 4: RR-SSA neural network architecture.

we provide a discussion of the results elicited from performing these experiments bycomparing the average predicted survival curve of each method against the average actualsurvival curve, leveraging a devised measure, referred to as area between the curves (ABC),discussed further on in this section.

3.1 Colorectal Cancer Survival Data for the State of Iowa

Our data were provided by the Iowa Cancer Registry (ICR), State Health Registry of Iowa(SHRI), and the Iowa Department of Public Health (IDPH). Each instance represents apatient who has been diagnosed with colorectal cancer and whose residence at the timeof diagnosis is in the state of Iowa. The dataset consists of n = 46116 patients and,initially, m = 71 features. After removing identifiers and features having a large numberof instances with missing values (% missing > 50%), we were left with m = 26 distinctfeatures (including unprocessed geographic coordinates). After binarizing discrete features,m = 386 (excluding geographic features). When using SBR geographical re-representation,m = 1364 (386 non geographic features and p = 978 binarized geographic features), andm = 386 + k when using the RR-SA geographic representation (where k is parameterized

Page 13: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 13

and therefore user-dependent). When the Kaplan-Meier re-representation is applied to thedataset, we obtain y(i) vectors having T = 53 elements, where each element representsthe patient’s current vital status (alive= 1 or dead= 0), or a probability of survival whenan instance becomes censored, as described by (1). Each t ∈ {1, . . . , 53} represents sixmonths.

The 24 distinct non-geographic features pertain to various patient-specificcharacteristics, which can be categorized as disease-based and demographic-based.Disease-based features include tumor grade, tumor histology and tumor marker; we show ahistogram of tumor grade in Figure 5. Demographic-based features include marital status,race, and age at diagnosis; we show a histogram of age at diagnosis in Figure 6. Theseselected features (age and tumor grade) have been shown to be indicative of not receivingtimely cancer treatment Ward et al. (2013), which we believe will help in predicting cancersurvival, although analysis of such factors is beyond the scope of this work.

Figure 5: Tumor grade at diagnosis for patients in the state of Iowa: Years 1989 to 2013.

Figure 6: Age of colorectal cancer diagnosis for patients in the state of Iowa: Years 1989to 2013.

Page 14: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

14 MT. Lash et al.

3.2 Geographically-descriptive Features for the State of Iowa

We obtain geographically-descriptive features for the state of Iowa at the ZCTA-level ofspatial granularity from the US Census Bureau’s American FactFinder 2 website. Threedifferent geographically-descriptive features were obtained for each of the 978 ZCTAS inIowa: population age demographics, land type, and median household income. Populationage demographics and land type are categorical features and were represented in terms ofproportional bins (e.g., “percentage of population aged 0-5 years”).

3.3 Predictive Setting, Paramaterization and Results

As outlined in the introduction, we wish to address the following:

1. On average, can colorectal cancer survival curves be reasonably predicted for patientsin the state of Iowa?

2. Do geographic features improve the quality of predicted colorectal cancer survivalcurves for patients in the state of Iowa?

3. Do richer geographical feature representations improve predictive performance morethan simpler representations?

4. Can predictive performance be further improved by altering the RR-SA procedure toaccommodate adjacency-descriptive geographical feature pairings (i.e., RR-SSA)?

5. Which RR-SSA representation improves predictive performance the most: binary (bin)or full?

To such an end, we propose to use 10-fold validation where, for each fold, we find a g∗

for each of the following types of model:

(i) A model constructed using no geographical features (No Geo).

(ii) A model constructed using SBR-derived geographical features, as outlined by Figure2 (SBR).

(iii) Models constructed using RR-SA-derived geographical features, as outlined by Figure3, where the values k = 10, 20, 30, 40 will be explored (RR-SA).

(iv) Models construct using RR-SSA-derived geographical features, as outlined by Figure4, using the binary-based adjacency representation, where the valuesk = 10, 20, 30, 40will be explored (RR-SSA (bin)).

(v) Models construct using RR-SSA-derived geographical features, as outlined by Figure4, using the full adjacency representation, where the values k = 10, 20, 30, 40 will beexplored (RR-SSA (bin)).

We then assess predictive performance by computing each model’s average survivalcurve prediction on the test set, taken over 10 folds, as compared to the actual averagesurvival curve, taken over all y(i), using a measure termed area between curves (ABC) thatmeasures the area-wise disparity between the actual and predicted curves Lash, Sun, Zhou,Lynch & Street (2017).

Page 15: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 15

3.3.1 Model Parameterization

Our models are constructed using Tensorflow, employing fully connected layers, trainedusing sigmoidal cross entropy as the loss function L(·). The logistic activation function isused for all nodes. Each model is trained using a maximum of 2500 epochs with batch sizeranging from 5% to 20%. While the connectedness of the architecture, activation function,epochs, and batch size are all tunable parameters, we elect to focus on finding the optimalnumber of hidden layers and corresponding hidden nodes for each layer (note that epochsof 1000, 1500, 2000, and 2500 were explored). Table A.1, in the Appendix section, showsthe average optimal architecture for each of the models, taken over the 10 folds.

3.3.2 Average Actual vs Average Predicted Survival

The results comparing the average actual survival curve against the average predictedsurvival curve, by model, are presented in Figure 7. Henceforth, these curves will simplybe referred to as actual and predicted. In these figures we also shade the region between theactual and predicted curves and provide a value representing the total area covered by thisregion. We will use this measure, developed in Lash et al., 2017 Lash, Sun, Zhou, Lynch& Street (2017), referred to as area between the curves (ABC for short), as a means ofcomparing the predictive quality of the 14 different models (where lower ABC is better).

Comparing Figure 7a with Figures 7b through 7n we first see that the addition ofgeographical features has uniformly improved the quality of the predictions, on average,as can be observed visually and by comparing ABC values. That is, with the exception ofRR-SSA (bin) k = 20, which suggests that it is important to tune the spectral analysis kvalue when using such representations.

Secondly, comparing Figure 7b with Figures 7c through 7f, we observe that models usingricher geographical representations (RR-SA) perform better (7c - 7f) than a model trainedusing a simple representation (7b). Furthermore, employing SSA-based representationsyield even better improvements over SBR, depending on the parameterized value of k.

However, there are also RR-SA model performance differences depending on theparameterized k value. Interestingly, there seems to exist a non-linear relationship betweenk and performance, with k = 10 outperforming k = 20, and k = 30 outperforming k = 10;k = 40 performs the best out of all models. We believe this nonlinear relationship may beaccounted for by the fact that higher values of k lead to more localized models, yet can alsoproduce sparse, disjointed clusters. This point is supported by our clustering visualizationsreported in Figure 8 and discussed in Section 3.3.3. These nonlinear response observationscan also be extended to RR-SSA (bin) and RR-SSA (full).

Comparing RR-SA with RR-SSA representations, we can see even greater improvementin our predictions, on average. In fact, by employing RR-SSA (full), we achieve a 38.2%relative improvement in ABC value when comparing the best RR-SSA (full) result (k = 20)with the best RR-SA result (k = 40). Interestingly, and perhaps not entirely unexpectedly,RR-SSA (bin) obtained less predictive improvement when compared with RR-SSA (full),but is able to improve upon the RR-SA result.

Curiously, however, depending upon the parameterized k value, RR-SSA (bin) performsworse than models induced without geographical features and those induced using SBR. Weconjecture that this may be attributable to the overly simple representation of geographicadjacency used in the sub-representation method of RR-SSA (bin). This is a reasonableconclusion as we can see that using a “fuller” representation (i.e., RR-SSA (full)) ofadjacency produces uniformly improved results.

Page 16: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

16 MT. Lash et al.

(a) No geo feats(ABC=14.32).

(b) SBR(ABC=12.60).

(c) RR-SA, k = 10(ABC=11.41).

(d) RR-SA, k = 20(ABC=12.31).

(e) RR-SA, k = 30(ABC=11.65).

(f) RR-SA, k = 40(ABC=10.77).

(g) RR-SSA (bin), k = 10(ABC=9.805).

(h) RR-SSA (bin), k = 20(ABC=14.555).

(i) RR-SSA (bin), k = 30(ABC=11.850).

(j) RR-SSA (bin), k = 40(ABC=11.205).

(k) RR-SSA (full), k = 10(ABC=9.820).

(l) RR-SSA (full), k = 20(ABC=6.657).

(m) RR-SSA (full), k = 30(ABC=11.724).

(n) RR-SSA (full), k = 40(ABC=7.818).

Figure 7: Actual vs. Predicted; k (when specified) denotes the parameterized k for spectralanalysis, and ABC represents the area between curves.

Page 17: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 17

In examining the different predicted survival curves we have a few observations,summarized as follows. First, we observe that predictive performance increases are mostlyrealized after the five-year mark. This is, on one hand, intuitive because predicting survivalat times closer to the diagnosis is easier than predicting survival at later times. On theother hand, noticeable deviation of the predicted curves uniformly occurs across all modelsat or around this five-year mark. Therefore, model improvement wrought by using richergeographical representations is realized, by-in-large, at times beyond the five-year mark.Explanation as to why such a deviation is present in all models requires further investigationbeyond the scope of this work.

In summary, we find that

1. On average, colorectal cancer survival curves can be reasonably predicted for patientsin the state of Iowa.

2. Geographic features do improve the quality of predicted colorectal cancer survivalcurves for patients in the state of Iowa by 53.5% (on average) (comparing modelsinduced without geographic features with RR-SSA (full) k = 20).

3. On average, RR-SA feature representations improve predictive performance by 15%over simple representations (SBR) and RR-SSA improve predictive performance by47.2% over SBR.

4. On average, RR-SSA feature representations improve predictive performance by 38.2%over RR-SA representations (comparing RR-SSA (full) k = 20 to RR-SA k = 40).

5. On average, RR-SSA (full) feature representations improve predictive performance by32.1% (comparing RR-SSA (full) k = 20 to RR-SSA (bin) k = 10).

3.3.3 Visualizing Geographic Cluster Assignment

Next, we briefly discuss the results of visualizing cluster assignment for k = 10, 20, 30, 40for RR-SA, RR-SSA (bin), and RR-SSA (full). These results can be observed in Figure 8,where each unique color represents a single cluster.

For RR-SA (row 1), we first note that as k increases, the elicited geographic regionsbecome more precise, yet maintain geographic continuity. However, we secondly observethat some ZCTAs are not adjacent to any other ZCTA having the same cluster assignment.This disjointedness stems from the use of an adjacency representation of the affinity matrixon which spectral clustering is performed and is not unexpected. As k increases it appearsthat the number of disjointed ZCTAs also increases. However, we see that the number ofcontinuous regions also increases. In other words, while disjointedness seems to increasewith k, the desired result of more localized continuous geographical regions is still achieved.Interestingly, when k = 40, larger Iowa cities such as Des Moines (central Iowa) and IowaCity (central-eastern Iowa) begin to emerge.

In examining row 2 of Figure 8, representing the RR-SSA (bin) results we see entirelydifferent geographic clusterings than that of RR-SA. First, we find that for smaller valuesof k (k = 10, 20, cluster membership is very skewed, with a single cluster dominatingthe majority of the state, and the remaining cluster assignments being composed of singleZCTAs. These single-ZCTA clusters are found in Des Moines area, the largest urban areaof Iowa. As k is increased (i.e., k = 30, 40), rural areas begin to decompose into clustersubsets – i.e., as the representation is allowed to become more specific (by increasing k),

Page 18: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

18 MT. Lash et al.

Similarity-based (bin) Spectral Clustering: k=10±

0 50 10025 Miles

Similarity-based (bin) Spectral Clustering: k=20±

0 50 10025 Miles

Similarity-based (bin) Spectral Clustering: k=30±

0 50 10025 Miles

Similarity-based (bin) Spectral Clustering: k=40±

0 50 10025 Miles

Similarity-based (full) Spectral Clustering: k=10±

0 50 10025 Miles

Similarity-based (full) Spectral Clustering: k=20±

0 50 10025 Miles

Similarity-based (full) Spectral Clustering: k=30±

0 50 10025 Miles

Similarity-based (full) Spectral Clustering: k=40±

0 50 10025 Miles

Figure 8: Spectral clustering results for k = 10, 20, 30, 40, where color denotes clustermembership. Row one represents RR-SA results, row two represents RR-SSA (bin) results,and row three represents RR-SSA (full) clustering results.

rural areas begin to become distinguished between. Urban areas, such as Des Moines andIowa City, are also ascribed membership to clusters composed of fewer geographic entities.

Looking at row 3 of Figure 8, which constitutes the cluster results obtained from RR-SSA(full) models, we observe different clustering results from that of the previous two models.First, we can see that clusters often form “ring-like” patterns (this is particularly observablefor k = 10), which is a particularly interesting artifact of this representation. Secondly,juxtaposing these results, with that of the previous two rows (i.e., RR-SA and RR-SSA(bin)), we observe that this representation is somewhat of a “compromise” between RR-SAand RR-SSA (full) in the sense that RR-SA produces mostly geographically contiguousclusterings and RR-SSA (bin) produces more geographically disparate clusterings. Thisis not unexpected, as the sub-representation method of RR-SSA (full) employs the fulladjacency representation used in RR-SA, which is not found in RR-SSA (bin). Interestingly,RR-SSA (full) has also discovered urban areas such as Des Moines and Iowa City, butdoes so at smaller values of k than RR-SSA (bin) (e.g., RR-SSA (bin) k = 10 is only ableto discern areas around Des Moines, whereas RR-SSA (full) is able to discern Iowa City,Des Moines, Waterloo/Cedar Falls, Mason City, etc.). Finally, as k is increased we observethat the representation is becomes more specific in terms of both urban and rural areas upto k = 30. When k = 40 we observe that the clusterings are more disparate, where thereappear to be approximately three different rural areas distinguished between (yellow, green,and white), and where urban ZCTAs are assigned to their own unique cluster. This maysuggest that urban areas are much more heterogeneous than are rural areas.

Page 19: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 19

4 Related Work

The topics related to and discussed throughout this work can best be categorized as diseaseand survival curve prediction and geographic-based predictions and representation.

There are many past works involving the prediction of diseases. These can be viewedas classification-based Belciug (2010), Belciug & Gorunescu (2013), Gupta et al. (2011),Khosravi et al. (2015), Ojha & Goel (2017), Puddu & Menotti (2012), Sandhu et al. (2015)and survival-based Chi et al. (2007), Cox (1992), Gupta et al. (2011), Katzman et al. (2016),Samundeeswari & Saranya (2016), Sharma et al. (2017). The focus of this work was onsurvival curve predictions. Such works can be examined by method, which include Coxproportional hazards model (CPH) Cox (1992), which has been historically used to makesuch predictions, decision trees Sharma et al. (2017), and neural network-based models Chiet al. (2007), Gupta et al. (2011), Katzman et al. (2016), Samundeeswari & Saranya (2016),which are a more recent development. However, as Laurentiis and Ravdin De Laurentiis& Ravdin (1994) point out, CPH has several caveats as compared to neural network-basedapproaches, including the naivety of the proportional hazards assumption and inability tocapture nonlinear feature interactions. Furthermore, decision trees are constructed usinggreedy methodology and do not have the architectural benefits of neural networks. Hence,this work employed neural networks.

There are also many works focusing on geographic-based prediction and representation.These works focus on incorporating geographical features into the predictive process. Onemethod of representing geography is by fine grain lattice (i.e., grid) Khezerlou et al. (2017),Lash, Slater, Polgreen & Segre (2017), Yuan et al. (2017). Such methods are akin to our SBRrepresentation and suffer from the same shortcomings. Spatially adaptive filters Tiwari &Rushton (2005), which can tie a single feature to geography when creatingM, which maybe beneficial when the selected feature is particularly indicative of survival. This methodwould, however, still produce a binary feature representation, having the accompanyingshortcomings discussed when disclosing SBR. Spectral clustering has been used to clusterboth social networks White & Smyth (2005) and for representing geo-spatial features Frias-Martinez & Frias-Martinez (2014), van Gennip et al. (2013), as in this work, and producesa rich (i.e., non-sparse) vector of features.

5 Conclusions and Future Work

In this work we explored the use of four different geographical feature representations –a simple binary representation (SBR) and a rich representation based on spectral analysis(which we term spectral analysis and methodologically refer to as RR-SA), and tworepresentations based on similarity-based spectral analysis (RR-SSA) – to predict colorectalcancer survival curves for patients in the state of Iowa. We show that (a) survival curvescan be reasonably estimated, although predictive performance deviates near the five-yearsurvival mark, (b) the use of geographical features generally lead to better predictions,(c) RR-SA trained models outperform those trained using SBR, (d) RR-SSA inducedmodels, generally, outperform RR-SA models, and (e) RR-SSA (full) representationsoutperform RR-SSA (bin) representations. Future work will involve exploration ofdifferent geographical representations, particularly those learned in conjunction with g∗.Additionally, continued exploration of domains and scenarios in which SBR, RR-SA, andRR-SSA geographic representations provide benefit should be undertaken.

Page 20: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

20 MT. Lash et al.

6 Acknowledgements

The authors would like to thank the Iowa Cancer Registry, State Health Registry of Iowa,and the Iowa Department of Public Health for the data. The authors would also like to thankGary Hulett and Jason Brubaker for their help in dataset construction and Prakash Nadkarnifor his help with both data acquisition and the IRB process.

References

Belciug, S. (2010), ‘A two stage decision model for breast cancer detection’, Annals of theUniversity of Craiova-Mathematics and Computer Science Series 37(2), 27–37.

Belciug, S. & Gorunescu, F. (2013), ‘A hybrid neural network/genetic algorithm applied tobreast cancer detection and recurrence’, Expert Systems 30(3), 243–254.

Chi, C.-L., Street, W. N. & Wolberg, W. H. (2007), Application of artificial neural network-based survival analysis on two breast cancer datasets, in ‘AMIA Annual SymposiumProceedings’, Vol. 2007, American Medical Informatics Association, p. 130.

Cox, D. R. (1992), Regression models and life-tables, in ‘Breakthroughs in Statistics’,Springer, pp. 527–541.

De Laurentiis, M. & Ravdin, P. M. (1994), ‘A technique for using neural network analysisto perform survival analysis of censored data’, Cancer Letters 77(2-3), 127–138.

Frias-Martinez, V. & Frias-Martinez, E. (2014), ‘Spectral clustering for sensing urban landuse using twitter activity’, Engineering Applications of Artificial Intelligence 35, 237–245.

Gupta, S., Kumar, D. & Sharma, A. (2011), ‘Data mining classification techniques appliedfor breast cancer diagnosis and prognosis’, Indian Journal of Computer Science andEngineering (IJCSE) 2(2), 188–195.

Kaplan, E. L. & Meier, P. (1958), ‘Nonparametric estimation from incomplete observations’,Journal of the American Statistical Association 53(282), 457–481.

Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T. & Kluger, Y. (2016), ‘Deepsurvival: A deep cox proportional hazards network’, arXiv preprint arXiv:1606.00931 .

Khezerlou, A. V., Zhou, X., Li, L., Shafiq, Z., Liu, A. X. & Zhang, F. (2017), ‘A trafficflow approach to early detection of gathering events: Comprehensive results’, ACMTransactions on Intelligent Systems and Technology (TIST) 8(6), 74:1–74:24.

Khosravi, B., Pourahmad, S., Bahreini, A., Nikeghbalian, S. & Mehrdad, G. (2015), ‘Fiveyears survival of patients after liver transplantation and its effective factors by neuralnetwork and cox proportional hazard regression models’, Hepatitis Monthly 15(9).

Lash, M. T., Slater, J., Polgreen, P. M. & Segre, A. M. (2017), A large-scale exploration offactors affecting hand hygiene compliance using linear predictive models, in ‘HealthcareInformatics (ICHI), 2017 IEEE International Conference on’, pp. 66–73.

Page 21: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

Deriving Enhanced Geographical Representations 21

Lash, M. T., Sun, Y., Zhou, X., Lynch, C. F. & Street, W. N. (2017), Learning richgeographical representations: Predicting colorectal cancer survival in the state of iowa, in‘Bioinformatics and Biomedicine (BIBM’17), 2017 IEEE International Conference on’,IEEE, pp. 778–785.

Ojha, U. & Goel, S. (2017), A study on prediction of breast cancer recurrence using datamining techniques, in ‘Cloud Computing, Data Science & Engineering-Confluence, 20177th International Conference on’, IEEE, pp. 527–530.

Puddu, P. E. & Menotti, A. (2012), ‘Artificial neural networks versus proportional hazardscox models to predict 45-year all-cause mortality in the italian rural areas of the sevencountries study’, BMC Medical Research Methodology 12(1), 100.

Samundeeswari, E. & Saranya, P. (2016), ‘An artificial neural network model for predictionof survival time of breast cancer dataset’, International Journal of Research in Engineeringand Applied Sciences 6(1), 161–168.

Sandhu, I. K., Nair, M., Shukla, H. & Sandhu, S. (2015), ‘Artificial neural network: Asemerging diagnostic tool for breast cancer’, International Journal of Pharmacy andBiological Sciences 5(3), 29–41.

Sharma, A., Karthik, G., Mittal, N., Sindhu, V. & Pradeep, K. (2017), A survey on predictiveanalysis of cancer survivability rate using machine learning algorithm, in ‘7th InternationalConference on Recent Trends in Engineering, Science, and Management’, pp. 271–278.

Tiwari, C. & Rushton, G. (2005), Using spatially adaptive filters to map late stage colorectalcancer incidence in iowa, in ‘Developments in Spatial Data Handling, Proceedings of the11th International Symposium on Spatial Data Handling. Springer, Berlin, Heidelberg’,Springer, pp. 665–676.

van Gennip, Y., Hunter, B., Ahn, R., Elliott, P., Luh, K., Halvorson, M., Reid, S., Valasik,M., Wo, J., Tita, G. E. et al. (2013), ‘Community detection using spectral clustering onsparse geosocial data’, SIAM Journal on Applied Mathematics 73(1), 67–83.

Wan, N., Zhan, F. B., Zou, B. & Wilson, J. G. (2013), ‘Spatial access to health careservices and disparities in colorectal cancer stage at diagnosis in texas’, The ProfessionalGeographer 65(3), 527–541.

Ward, M. M., Ullrich, F., Matthews, K., Rushton, G., Goldstein, M. A., Bajorin, D. F.,Hanley, A. & Lynch, C. F. (2013), ‘Who does not receive treatment for cancer?’, Journalof Oncology Practice 9(1), 20–26.

White, S. & Smyth, P. (2005), A spectral clustering approach to finding communities ingraphs, in ‘Proceedings of the 2005 SIAM international conference on data mining’,SIAM, pp. 274–285.

Yuan, Z., Zhou, X., Yang, T., Tamerius, J. & Mantilla, R. (2017), Predicting traffic accidentsthrough heterogeneous urban data: A case study, in ‘6th International Workshop on UrbanComputing (UrbComp 2017)’.

Zhang, R., Li, N., Yang, X. & Huang, Y. (2015), ‘Data mining technology and its applicationin diagnosis and treatment of clinical malignant tumor’, Journal of Medical Informaticspp. 50–54.

Page 22: Deriving Enhanced Geographical Representations via ... · survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical

22 MT. Lash et al.

Appendix

Model Avg Optimal Architecture

No Geo 1.5:[83,30]SBR 1.9:[260,122]RR-SA, k = 10 1.5:[82,36]RR-SA, k = 20 1.5:[102,44]RR-SA, k = 30 1.6:[87,45]RR-SA, k = 40 1.5:[80,44]RR-SSA (bin), k = 10 1.6:[82,50]RR-SSA (bin), k = 20 1.7:[87,50]RR-SSA (bin), k = 30 1.6:[70,33.33]RR-SSA (bin), k = 40 1.5:[75,42]RR-SSA (full), k = 10 1.6:[66,45]RR-SSA (full), k = 20 1.7:[91,50]RR-SSA (full), k = 30 1.7:[73,41.43]RR-SSA (full), k = 40 1.9:[78,42.22]

Table A.1 Average optimal architecture by model over the 10 folds (e.g., No geo had 1.5 hiddenlayers, on average, where the first layer had 83 nodes , on average, and the second layerhad 30 nodes, on average).

In Table A.1 we can see that, on average, the optimal architecture is relatively comparableamong all models with the exception of SBR (and to a degree RR-SA, k = 20). First,this suggests that the use of RR-SS and RR-SSA features do not affect the architecturalcomplexity of the model. However, SBR seems to significantly increase such complexity.This is somewhat expected, as SBR is represented as a large, sparse vector, which can becontrasted with the comparatively small vector of RR-SA and RR-SSA.