
warwick.ac.uk/lib-publications

Original citation: Triantafillou, Peter (2018) Towards intelligent distributed data systems for scalable efficient and accurate analytics. In: 38th IEEE International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2-5 Jul 2018. Published in: 38th IEEE International Conference on Distributed Computing Systems, ICDCS (In Press)

Permanent WRAP URL: http://wrap.warwick.ac.uk/101785

Copyright and reuse: The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable, the material made available in WRAP has been checked for eligibility before being made available. Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge, provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page, and the content is not changed in any way.

Publisher’s statement: © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

A note on versions: The version presented here may differ from the published version or version of record; if you wish to cite this item you are advised to consult the publisher’s version. Please see the ‘permanent WRAP URL’ above for details on accessing the published version, and note that access may require a subscription.
For more information, please contact the WRAP Team at: [email protected]



Towards Intelligent Distributed Data Systems for Scalable Efficient and Accurate Analytics

Peter Triantafillou
Department of Computer Science,

University of Warwick, UK

Email: [email protected]

Abstract—Large analytics tasks are currently executed over Big Data Analytics Stacks (BDASs), which comprise a number of distributed systems as layers for back-end storage management, resource management, distributed/parallel execution frameworks, etc. In the current state of the art, the processing of analytical queries is too expensive, accessing large numbers of data server nodes where data is stored, crunching and transferring large volumes of data, and thus consuming too many system resources, taking too much time, and failing scalability desiderata.

With this vision paper we wish to push the research envelope, offering a drastically different view of analytics processing in the big data era. The radical new idea is to process analytics tasks employing learned models of data and queries, instead of accessing any base data – we call this data-less big data analytics processing. We put forward the basic principles for designing the next-generation intelligent data system infrastructures realizing this new analytics-processing paradigm and present a number of specific research challenges that will take us closer to realizing the vision, which are based on the harmonic symbiosis of statistical and machine learning models with traditional system techniques. We offer a plausible research program that can address said challenges and offers preliminary ideas towards their solution. En route, we describe initial successes we have had recently with achieving scalability, efficiency, and accuracy for specific analytical tasks, substantiating the potential of the new paradigm.

I. Introduction

We have all heard the heavy declarations surrounding (big) data science, such as "data is the new oil", "data analytics is the new combustion engine", etc. Behind such hype lie two fundamental truths: First, humans and organizations are inundated with massive data volumes, of high complexity, which seriously impede their ability to manage them and make sense of them in a time-critical fashion. Second, it has been proved that intelligent analyses of data lead to profound novel insights, which bear great benefits to organizations and individuals, for practically all facets of our lives, including health, education, public governance, scientific knowledge generation, finances, business decision-making, etc.

Current trends indicate that big, complex data science will become even bigger and more complex. As one example, earth science sensors around the globe record 100s of thousands of unique correlations [1]. As another example, we expect that by 2025 2B human genomes will have been sequenced. All trends with respect to analytics in a large variety of applications, such as health informatics, astrophysics, particle physics, social media, finance, etc., reveal the same expectations. While complex data of massive volumes become omnipresent, their analyses present insurmountable challenges to the system infrastructures that will be called upon to process them. Referring back to human genome analysis, identifying genomic variants requires 4 orders of magnitude speedups (with 100,000 CPUs running in parallel) [2]!

The system infrastructures for big data analytics entail a number of distributed systems, organized in so-called Big Data Analytics Stacks (BDASs). A distributed storage back-end is used to store and manage data sets; this is in the form of a distributed file system, distributed SQL or NoSQL modern databases, or often a combination of these systems. Additionally, a distributed resource management system and a Big Data Engine (which implements distributed processing paradigms, such as MapReduce) are utilized for the processing of analytics tasks. Finally, a number of systems/applications are typically supported as higher layers, offering specialized functionality, such as Machine Learning libraries, graph analytics, etc.

Our recent research has revealed that the state-of-the-art methods for fundamental analytics queries over such infrastructures leave a lot to be desired in terms of efficiency, scalability, and accuracy. Armed with this knowledge, with this paper we put forward our vision for research in large-scale systems for big data analytics, which, based on our recent endeavours, shows great promise.

This vision, coined SEA (for Scalable, Efficient, Accurate), presents a novel set of principles, goals, and objectives, and a novel research program, which (individually and, especially, collectively) aim to push the research envelope in distributed data systems for scalable, efficient, accurate big data science and analytics, offering orders-of-magnitude improvements and much-needed, novel functionality.

II. The Context and Related Work

Scalable, efficient, and accurate (SEA) analytics is becoming increasingly important in the big data era. All research communities involved in big data science have


recognized and embraced the inherent challenges en route to achieving scalability, efficiency, and accuracy. New distributed systems, facilitating scalability with massive storage management and parallel/distributed processing, have been developed. The seminal Hadoop-based efforts [3]–[5] solved key scalability problems. Realising limitations for key applications (e.g., iterative and/or interactive), newer systems such as Spark [6] and Stratosphere/Flink [7], as well as extensions to Hadoop [8], [9], were proposed. In parallel, modern scalable data systems (e.g., [10]–[12]) were contributed.

Furthermore, the need for ensuring high efficiency, in addition to scalability, was becoming evident, and new systems for fast analytics [13], [14] emerged to satisfy such needs. Despite the huge success of the above efforts, it is increasingly becoming apparent that something more is required if massive volumes of data are to be analysed efficiently and scalably.

A distinct and promising research thread, within the data systems community, concerns approximate query processing (AQP). In AQP the accuracy of analysis results is sacrificed for increased efficiency, relying on data sampling (e.g., [15]) and/or data synopses (e.g., [16]). New systems (be it general-purpose, such as BlinkDB [17], or for specific applications, such as SciBORQ [18] for scientific analysis) emerged. More recently, systems such as DBL/VertexDB [19], which are built on top of AQP engines, leverage them and contribute ML models so the system can learn from past behavior and gradually improve performance.
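The core AQP trade-off can be sketched in a few lines: estimate an aggregate from a small uniform sample instead of scanning the full data, and quantify the accuracy loss with a confidence interval. The following is only a toy illustration (the data, sample size, and normality assumption are invented for the sketch, not taken from any of the cited systems):

```python
import numpy as np

rng = np.random.default_rng(2)
population = rng.normal(100.0, 15.0, size=1_000_000)  # stand-in base data

# AQP in miniature: estimate the mean from a small uniform sample rather
# than scanning all rows, and report a CLT-based 95% confidence interval.
n = 10_000
sample = rng.choice(population, size=n, replace=False)
estimate = sample.mean()
half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)

print(f"mean estimate = {estimate:.2f} +/- {half_width:.2f}")
```

This is also where the drawbacks discussed below originate: sample sizes must grow to keep the interval tight, and for skewed data or rarely touched subspaces the error can remain large.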

With respect to systems offering exact (and not approximate) answers, new methods enriching state-of-the-art data systems with efficient and scalable statistical analysis, such as Data Canopy [20], have recently been proposed. Data Canopy contributes a novel semantic cache with high success.

Interestingly, at the same time, leading researchers in ML were recognizing the scalability/efficiency constraints of known methods [21]–[23]. Michael Jordan, for instance, suggested time-data trade-offs using (efficient but) less robust algorithms, where larger datasets compensate for poorer accuracy [21], [23].

Hence, one can observe different data science research communities converging in their realizations regarding the need for new research and approaches towards scalable, efficient, and accurate analytics – the holy grail of modern big data science.

Alas, current solutions are very much lacking and, as mentioned, are predicted to fall way short of requirements [1], [2] for many data-intensive applications, for example earth science, particle physics, astronomy, genomics, social media analytics, etc. Even the most recent research from the data systems community, although promising, also leaves much to be desired if scalability, efficiency, and accuracy are to be ensured. For example, the storage required by Data Canopy [20] (as well as by traditional data system methods based on caches and materialized views) can grow prohibitively large. Also, such efforts typically only benefit previously seen queries. Similarly, state-of-the-art AQP engines, like BlinkDB [17], suffer from several disadvantages: First, sample sizes can become prohibitively large. Second, accuracy can be quite low for many tasks. Third, engines like BlinkDB arguably place their key functionality at the wrong place within the big data analytics stack: Samples are created and maintained over a distributed file system (e.g., HDFS) and are accessed through big data infrastructures (e.g., Hive on top of Hadoop, or Shark/Spark). This amounts to time-consuming and resource-hungry tasks; in the end it may attain scalability, but typically at the expense of efficiency (e.g., as large MapReduce-style tasks execute over all HDFS nodes). Finally, state-of-the-art learning approaches, like the DBL approach [19], albeit interesting and promising, are built upon an AQP engine, such as [17]. Thus, they inherit the aforementioned limitations in storage space and an initial (typically large) error (which they try to improve). Additionally, DBL requires large storage space to manage previous queries and answers (e.g., maintaining 1000s of answer items per executed query).

A. Where Have We Gone Wrong?

Figure 1 exemplifies the current state of affairs in analytical query processing over large distributed big data infrastructures.

A population of analysts submits (a large number of) analytical queries to the big data system. The data sets to be analyzed are typically massive in size and are stored, managed, and accessed using distributed big data analytics stacks (BDASs), encompassing distributed storage engines (e.g., HDFS, HBase/BigTable, Cassandra, MongoDB, or a combination), distributed resource manager systems (such as Yarn, Mesos, etc.), and a Big Data Engine (such as Spark, Hadoop, etc.).

Processing analytical queries incurs large delays and overheads for many reasons, including the following.

• First, each analytical query passes through many layers of the BDAS, with each layer adding extra overheads at all nodes engaged in task processing.

• Second, processing typically continues (e.g., using a MapReduce style of distributed/parallel processing) across a (potentially) large number of data nodes (e.g., running HDFS and/or HBase) over which the data is distributed.

• Third, the processing of complex tasks involves several such passes, with lots of data being transferred from node to node, etc.

• Fourth, for many tasks there are several alternative processing methods one could employ, and the system fails to itself choose, or to guide programmers to choose, the best possible method.

• Finally, as applications for emerging large-scale geo-distributed analytics proliferate, at such global scales current solutions’ requirements either exceed available resources or simply cost too much.

The end result of the above misses is that task processing becomes time-consuming (inefficient), resource-hungry (costly), and unscalable (the system cannot scale as query arrival rates increase). As mentioned, the state of the art offers some improvements, e.g., using caching, AQP, and learning; but they do not go far enough – we will exemplify and detail several such shortcomings incurred when processing important types of analytical tasks throughout the remainder of the paper.

Fig. 1: Traditional Big Data Analytics Processing.

A new, much bolder approach is required.

III. Vision Overview

We first describe in more detail the types of analytical queries we focus on in this research. Subsequently, we provide the big picture of our vision.

A. The Analytics

Consider Penny, an analyst visualizing the (typically multi-dimensional) data space which she is trying to explore and analyze. With the help of a GUI, Penny can capture a subspace of possible interest (e.g., drawing circles (hyper-spheres) or (hyper-)rectangles) and then issue an analytical query over this subspace to determine if it is of interest. For example, she can use aggregation operators such as count, mean, median, quartiles, etc., to derive descriptive statistics within this space (these are also referred to as roll-up operators in data warehouses and Online Analytical Processing). Alternatively, she can directly issue SQL(-like) queries (e.g., in Hive or Pig environments implemented on top of a BDAS) involving selection-projection-join queries with aggregations. In general, the above queries can be of different types and will consist of (a) selection operators, which identify a data subspace of interest, and (b) an analytical operator over the data items within this data subspace.

A variety of selection operators important for data analytics should be considered; for example: (i) range queries, which supply a range of values for each dimension of interest, defining (hyper-)rectangles in multi-dimensional data spaces; (ii) radius queries, which supply a central point in a multi-dimensional data space and a radius, defining (hyper-)spheres; and (iii) nearest-neighbour queries, which select a given number of data items that are closest to a given data point, given a distance definition.
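As a concrete, purely illustrative sketch (the data set, dimensionality, and parameters below are invented), the three selection operators can be expressed over an in-memory array as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((10_000, 2))  # toy 2-D data set in the unit square

def range_query(X, lo, hi):
    """(i) Range selection: items inside the (hyper-)rectangle [lo, hi]."""
    return X[np.all((X >= lo) & (X <= hi), axis=1)]

def radius_query(X, center, r):
    """(ii) Radius selection: items inside the (hyper-)sphere of radius r."""
    return X[np.linalg.norm(X - center, axis=1) <= r]

def knn_query(X, point, k):
    """(iii) Nearest-neighbour selection: the k items closest to point."""
    return X[np.argsort(np.linalg.norm(X - point, axis=1))[:k]]

# An analytical operator (here, count) applied to a selected subspace.
subspace = radius_query(data, center=np.array([0.5, 0.5]), r=0.1)
print(len(subspace))
```

Each query pairs one such selection with an analytical operator (count, mean, a regression, etc.) over the selected items, matching parts (a) and (b) above.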

Analytics over defined subspaces should cover both descriptive statistics (e.g., aggregations) and dependence (multivariate) statistics (e.g., regressions, correlations).

Finally, these analytical queries will in general involve different types of base data (e.g., SQL tables, .csv files and spreadsheets, graph data, etc.) and may in addition depend on other expensive data manipulation tasks, such as joins across tables or files.

In addition, we argue for the need for new functionality not supported by present-day data systems, and for SEA methods to implement it. For example, Penny should be empowered to perform analyses based on multivariate (dependence) statistics between attributes – e.g., correlation and regression analyses – and be informed of model coefficients for predictive analytics and visualizations (e.g., regression coefficients). These in turn may be used as building blocks for higher-level interrogations, such as "return the data subspaces where the correlation coefficient between attributes is greater than a threshold value".

Furthermore, we argue for a new class of functionality, based on the notion of explanations to be defined and delivered [24]. Consider Penny receiving the answer that the population (count) within a data subspace is 273. Such single-scalar answers, returned by present-day data systems, leave much to be desired. What is she to make of it? How would this value change if the selection operators defining the data subspace to be explored were more/less selective? Penny would have to issue a large number of queries, redefining in turn the size of the queried data subspace, to gain deeper understanding. We need systems that offer rich, compact, and accurate explanations, which will accompany answers and will empower Penny to understand the analyzed data subspaces better and faster – and approaches whereby said explanations can themselves be derived scalably and efficiently.
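One crude way to materialize such an explanation, purely for illustration (the data, the query, and the quadratic model are all invented here, not taken from [24]): alongside the scalar count, return a small fitted model describing how the count responds to the query's selectivity knob – here, a radius r.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.random((20_000, 2))  # toy 2-D data set

def count_in_radius(center, r):
    return int(np.sum(np.linalg.norm(data - center, axis=1) <= r))

center, r = np.array([0.5, 0.5]), 0.10
answer = count_in_radius(center, r)

# "Explanation": probe a few nearby radii and fit count ~ a + b*r^2 by
# least squares, so the analyst sees how the answer scales with
# selectivity without having to issue those queries herself.
radii = np.linspace(0.8 * r, 1.2 * r, 5)
counts = [count_in_radius(center, x) for x in radii]
b, a = np.polyfit(radii ** 2, counts, 1)

print(f"count = {answer}; locally, count(r) ~ {a:.0f} + {b:.0f} * r^2")
```

In a real SEA system the coefficients would come from the learned models themselves rather than from extra probe queries; the probes above merely stand in for that machinery.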

A key point to note is that this new functionality also benefits the system itself: Higher-level interrogations and explanations will allow analysts to understand data spaces faster, and enable their exploration without burdening the system with the large number of queries that would otherwise be required. Therefore, as systems will be facing fewer queries, the scalability, efficiency, and accuracy desiderata are achieved indirectly!


B. The Key Paradigm

Our vision revolves around a new paradigm we coined Data-less Big Data Analytics. Imagine a data system which can process analytical queries without having to access any base data, while providing accurate answers! This would achieve the ultimate in scalability (as query processing times become de facto insensitive to data sizes) and the ultimate in efficiency, as time-consuming, resource-hungry interactions within distributed and complex big data analytics stacks are completely avoided!

Figure 2 exemplifies the central paradigm. The key idea is to develop an intelligent agent and insert it between user queries and the system. This agent develops a number of statistical machine learning (SML) models, which learn from users the characteristics of the queried space and learn from the system the characteristics of the data-answer space per query, and using them can predict with high accuracy the answers to future, previously unseen queries.

Queries are submitted to the system as before. An initial subset of these queries is sent to the system as in Figure 1 (with the exception that the agent 'intercepts' them and their answers). These queries are in essence treated as 'training' queries. Once the models are trained, all future queries need not access any base data, and all answers are provided by the agent outside the BDAS.
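A minimal sketch of this train-then-answer loop, with all data, features, and the least-squares model invented for illustration: a few 'training' radius-count queries pass through to the (expensive) system; afterwards, the fitted model answers new queries without touching the base data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((50_000, 2))  # base data: only the "system" sees this

def system_answer(center, r):
    """The expensive path: a count query that scans all base data."""
    return int(np.sum(np.linalg.norm(data - center, axis=1) <= r))

# Phase 1: the agent intercepts a batch of training queries and records
# (query parameters, answer) pairs produced by the system.
train = []
for _ in range(200):
    c = rng.uniform(0.2, 0.8, size=2)
    r = rng.uniform(0.02, 0.15)
    train.append((r, system_answer(c, r)))
train = np.array(train)

# Phase 2: fit answer = f(query) by least squares. For (roughly uniform)
# data the count grows like r^2, so that is the feature we regress on.
X = np.column_stack([np.ones(len(train)), train[:, 0] ** 2])
coef, *_ = np.linalg.lstsq(X, train[:, 1], rcond=None)

# Phase 3: future queries are answered by the model alone -- data-less.
def agent_answer(center, r):
    # For uniform data the center barely matters; richer models would use it.
    return float(coef[0] + coef[1] * r ** 2)

c, r = np.array([0.5, 0.5]), 0.1
print(system_answer(c, r), round(agent_answer(c, r)))
```

The real SML models of the vision are of course far richer than this one-feature regression, but the division of labour – intercept, train, then answer outside the BDAS – is the same.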

But is this paradigm realistic? Can these ambitious goals be achieved? And with methods and models which themselves are efficient and scalable? And without sacrificing accuracy? And for which types of analytics tasks? And what are we to do in cases where data-less processing is not possible (e.g., for certain tasks)? In the rest of this paper we start a discussion of how to address these issues. Our vision is based on recent results, which are very encouraging. At their core, they rest upon ML advances employed within distributed data systems, which alongside distributed systems techniques (e.g., indexes, statistical structures, caches, query routing, load balancing, data placement, etc.) deliver the above desiderata.

IV. SEA: Principles, Goals, and Objectives

SEA is fuelled and driven by four fundamental principles, the validity of which has been verified and proven through our recent research, as will be explained. These principles are associated with goals and specific objectives, which will be delivered by a research program that we outline later.

The first principle, P1, recognizes that a human-in-the-loop approach can add the inherent intelligence of the human analysts to the models, structures, and algorithms that strive to introduce (artificial) intelligence towards scalable, efficient, and accurate analytics.

P1: Empower the human to empower the data analytics system to empower the human.

SEA will formulate a novel approach to realize this principle, viewed from a system's perspective. This rests on the following two goals: The first goal (G1) concerns understanding and leveraging analyst actions vis-a-vis the system's behaviour in response to analysts' queries. There are two central objectives here. On the one hand, (O1) SEA will derive novel algorithms, structures, and models which focus on and learn from what analysts are doing. What are their queries? Which data subspaces are they concentrating on? In essence, this will quantize the query space. On the other hand, (O2) SEA will also develop novel models, algorithms, and structures that learn from what the system does in response to analyst queries. What are the answers to analyst queries? How were they computed? Using which data subspaces? Furthermore, SEA's third objective (O3) is to unify the results of O1 and O2, associating specific query space quanta with methods, models, and answers used to predict results for future queries, depending on their position in the query space.

The second, equally important goal (G2) pertains to the formulation and development of novel notions of, and models for, explanations, which capture queried data subspaces and answers to analytical queries. The rationale here is not to simply throw data back at analysts in response to their queries! Instead, explain the characteristics of the queried data spaces vis-a-vis their queries. For instance, explain how query answers depend on key query parameters. Such explanations are crucial, for instance, during exploratory analytics: They will help analysts efficiently and scalably discover interesting data subspaces, exploring them while understanding them, further empowering them without the need to explicitly issue an inordinate number of specific queries to gain such understanding.

As a result, the human analyst is empowered by the system in her data analyses in a way that empowers the system to be scalable and efficient!

SEA’s second principle is:

P2: Data-less big data analytics.

This principle encapsulates the key new paradigm outlined earlier. The efficiency and scalability gains of the new paradigm would obviously be dramatic, as mentioned earlier. The viability of the principle rests on leveraging known and widely accepted workload characteristics, namely that queries define overlapping data subspaces [17]–[20], [25]. Using objectives O1, O2, and O3 above, SEA essentially injects an "intelligence" into the system that achieves its SEA goals.

SEA's third goal (G3), therefore, is to develop a suite of models, methods, and structures which prove the applicability and showcase the benefits of the data-less analytics principle across various analytics tasks (query types), data formats (text, graph, tabular), and system types (from cluster-based systems, to multi-datacentre, to geo-distributed systems).

Please note that P2 puts forward a game-changing approach. It is fundamentally different from the state of the art, such as caching approaches like Data Canopy [20], AQP approaches (e.g., based on stratified data sampling) like BlinkDB [17], and learning approaches built on top of AQP engines, like DBL [19]. As aforementioned, these make a good step forward, but are associated with several drawbacks and are not nearly as ambitious as SEA.

Fig. 2: The Vision for Scalable Efficient Accurate Big Data Analytics Processing.

In addition to its obvious efficiency and scalability advantages, the salient feature of P2 is that it proposes to learn gradually, focusing only on those data subspaces which are of interest to analysts, instead of trying to capture the whole data domain (which is time-consuming and error-prone).

Also note that data-less analytics is paramount for respecting security/privacy concerns. Organizations holding sensitive data may be wary of revealing the data itself, but will nonetheless typically reveal analytical query results (over sensitive data), such as those summarizing data. Note that the basis of SEA's ML models rests on observing queries and such analytical query answers, and not the data itself.

Our group was the first to formulate and advocate this principle and obtain the first research results which showcase its validity and potential (for classes of analytics based on count queries [26]–[28], and average and regression queries [29]).

SEA’s third principle is:

P3: Big-data-less big data analytics.

Depending on the analytics task at hand, it may not always be possible to apply data-less analytics processing, e.g., in cases where approximate answers are not satisfactory. For these tasks, SEA's third principle applies, with SEA's fourth goal (G4), which aims to develop algorithms, structures, and models which will process said analytics tasks via surgically accessing the smallest data subset that is required to compute the answers.

Alas, the state of the art is pervaded with approaches where key analytics tasks are processed "scalably" albeit having to access large portions (if not all) of the massive underlying base data and/or all or many data server nodes – recipes for performance disaster in the big data era! To be concrete, consider the following fundamental operators for various analytics tasks.

• Often, data analysis requires the joining of data sets spread across different tables/files, etc., and then ranking the items in the join result. State-of-the-art solutions for this (so-called rank-join) operator were based on algorithms which performed several MapReduce tasks (with Hadoop or Spark) over the base data sets. With our work in [30], we showed that by developing and leveraging appropriate statistical index structures, the (typically very small set of) necessary data items can be identified and only those are surgically accessed. This achieved up to 6 orders of magnitude performance improvements (in execution time, network bandwidth, and money costs)!

• k-Nearest-Neighbours (kNN) is another fundamental analytics operator. The state of the art for processing kNN queries also required processing an inordinate amount of the underlying base data, either using Hadoop [31] or Spark [32]. Our work [33] introduced performance improvements of three orders of magnitude, utilising novel indexes and appropriate distributed processing paradigms.

• Reaching further, subgraph matching is a fundamental operator for graph analytics. In [34], [35] novel subgraph-query semantic caches minimized back-end stored data accesses, ensuring performance improvements of up to 40X.

• The same holds for key tasks that are preparatory for analytical query answering, such as ensuring data quality. For example, our work on scalable missing-value imputation [36] showed big gains in performance and scalability compared to typical BDAS/MapReduce-style processing.

Therefore, SEA's fourth objective (O4) is to develop a suite of indexes, caches, and statistical structures which will introduce dramatic improvements in efficiency, scalability, and money costs during analytics processing, for a wide variety of analytics tasks across multiple data formats and system types.

SEA’s fourth and final principle is:

P4: Understand the alternatives and select optimal processing methods.


Our recent research on, and exposure to, the systems, algorithms, methods, and ML models employed for various big data analytics tasks has revealed that most tasks can be processed using a number of alternative approaches. Given the complex state-of-the-art big data analytics stacks (be they based on Hadoop or Spark), it is important to fully understand the alternatives and related trade-offs. A central question then emerges: For a specific metric (scalability, efficiency, accuracy, availability, money costs, etc.), how should analytics tasks be processed? Using which alternative algorithms, methods, or models?

To concretize the above: with our research on join queries over big data tables (based on [30]), we have found that sometimes applying a MapReduce-based algorithm is beneficial, while at other times a coordinator-cohort distributed processing model is more beneficial, depending on data distribution degrees and join selectivities. Similarly, for graph-pattern queries we have found that different algorithms [37] and different index types [38] are preferable for different graph patterns and graph databases. The same holds for other key analytics operators, such as kNN queries, depending on the value of k and the probability distribution of the data sets (as [33] showcased).

In all the above cases, the performance difference between the alternatives can be dramatic. P4 is associated with two goals. SEA's fifth goal (G5) is to understand the alternatives for processing fundamental operators and their impact on key performance metrics. Objective O5 pertains to the identification of key alternatives and their comprehensive experimental evaluation. SEA's sixth goal (G6) is to leverage what is learnt from O5. G6 is associated with objective O6, which concerns training, learning, and building optimising modules, which on the fly adopt the best execution method.

V. A Research Program Towards SEA

We now identify the key research challenges and their grouping into research themes, bringing to the surface specific open research problems that must be addressed, as well as some preliminary ideas for approaching them based on the aforementioned paradigm and its principles.

A. The Challenges

Methodologically, SEA research should encompass the previously outlined goals and objectives, which are permeated by the above four principles. Namely, to:

1) Derive novel algorithms, structures, and models, which focus on and learn from what analysts are doing, tracking their interests in data spaces as they shift with time.

2) Develop novel models and algorithms that learn from what the system does in response to queries.

3) Unify the results of 1 and 2, associating specific query-space quanta with methods, models, and answers. Then develop, train, and leverage associative-learning models to predict results for future unseen queries.

4) Formulate and develop novel notions of, and models for, query-answer explanations.

5) Develop a suite of models, methods, and structures, which prove the applicability and showcase the benefits of the data-less analytics principle across various analytics tasks (query types), data formats (text, graph, tabular), and system types.

6) Develop a suite of indexes, caches, and statistical structures, towards big-data-less analytics.

7) Identify and evaluate key alternative algorithms, methods, and models for key analytics tasks.

8) Train models which learn from past task executions and build optimising modules, which, on the fly, adopt the best execution method for the task at hand.

The above challenges constitute open problems to be solved en route to meeting the SEA desiderata. They are grouped into the following research themes (RTs).

B. Research Theme 1: Data-less Big Data Analytics

Understand and leverage analyst queries vis-à-vis the system's behaviour in response to analysts' queries, in order to predict answers to future, unseen queries without access to base data. This theme entails five core challenges:

1) Query-space Quantization: Derive novel algorithms and models to efficiently and scalably learn the structure of the query space, identifying analysts' current interests.
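One way to realise query-space quantization is to encode each query as a numeric vector and cluster the vectors; the cluster centroids then act as the query quanta. The sketch below is a simplification under assumed encodings, not the paper's actual method, using plain k-means over range-query vectors:

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's k-means with greedy farthest-point initialisation.
    Returns k centroids, interpreted here as query-space 'quanta'."""
    # Seed with the first point, then repeatedly add the point farthest
    # from all chosen centroids (deterministic; no randomness needed).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iters):
        # Assign each query vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids

# Each range query is encoded as a vector, e.g. (lo, hi) per attribute;
# the centroids summarise the analysts' current regions of interest.
queries = [(0.10, 0.20), (0.12, 0.22), (0.80, 0.90), (0.82, 0.91)]
quanta = kmeans(queries, k=2)
```

With the four example queries, the two centroids recover the two clearly separated interest regions.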

2) Answer-space Modelling: Derive novel algorithms and models to learn and understand how to describe the results provided by the system to analytical queries (i.e., model the answer data space).

3) Predictive Models: Combine 1) and 2) to predict results of new queries, using their position within the quantized query space and the association of query quanta with answer-space models. The models will be developed and trained to concurrently optimize query-space quantization and system-answer error. Develop error-estimation techniques, in order to accompany predicted answers with (accurate) error estimates, so that the system (or analyst) can choose to proceed with the predicted answer or to obtain an exact answer by accessing the base data.
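A minimal sketch of this combination (illustrative only; all names are hypothetical, and a trivial per-quantum statistic stands in for the actual trained models): each quantum accumulates the answers the system produced for queries that fell into it, predicts the per-quantum mean for an unseen query, and reports the per-quantum standard deviation as the error estimate.

```python
import math
from collections import defaultdict

class QuantumPredictor:
    """Associates each query quantum with the answers the system produced,
    then predicts new answers from the quantum a query falls into. The
    per-quantum standard deviation doubles as a crude error estimate."""

    def __init__(self, quanta):
        self.quanta = quanta              # centroids from query-space quantization
        self.answers = defaultdict(list)  # quantum index -> observed answers

    def _quantum(self, q):
        return min(range(len(self.quanta)),
                   key=lambda i: math.dist(q, self.quanta[i]))

    def observe(self, query, answer):
        """Record a (query, system-answer) pair seen during normal operation."""
        self.answers[self._quantum(query)].append(answer)

    def predict(self, query):
        """Return (predicted answer, error estimate) without touching base data."""
        ys = self.answers[self._quantum(query)]
        mean = sum(ys) / len(ys)
        err = math.sqrt(sum((y - mean) ** 2 for y in ys) / len(ys))
        return mean, err

p = QuantumPredictor(quanta=[(0.1,), (0.9,)])
for q, a in [((0.12,), 100.0), ((0.11,), 110.0), ((0.88,), 10.0), ((0.90,), 12.0)]:
    p.observe(q, a)
approx, err = p.predict((0.10,))  # answered from the model, not the base data
```

The error estimate is exactly what lets the system (or analyst) decide whether the predicted answer suffices or an exact query is warranted.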

4) Model Maintenance: Develop approaches which can ensure accuracy in the presence of updates in (i) query patterns, as analysts' interests drift, and (ii) base data, as data is inserted, deleted, and updated. (i) will rest on appropriate definitions of distance between a query and the query quanta. (ii) will rest on this and on the estimated error associated with the predicted answer.
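The distance-based side of (i) can be as simple as the following hypothetical check, flagging queries that fall far from every quantum; the threshold would have to be calibrated against the workload, and a real system would use a richer drift test:

```python
import math

def drifted(query, quanta, threshold):
    """Flag a query as outside the modelled interest regions (possible drift
    in the analysts' query patterns) when its distance to the nearest query
    quantum exceeds a calibrated threshold."""
    return min(math.dist(query, c) for c in quanta) > threshold
```

A sustained stream of flagged queries would trigger re-quantization and model retraining.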

5) Multi-system Analytics: Big data analytics is currently performed over big data analytics stacks. Within the same level, many alternative systems may be collaboratively employed. For instance, different types of NoSQL systems may coexist for different data types, e.g., for structured access to graphs, tables, and documents, alongside a distributed file system. Emerging applications involving analytics operators across data stored in such polystores typically wish to access data stored at different systems. Invariably this requires moving data from one system to the other, which is a time-consuming and resource-wasting process. Despite the promises of recent research on polystores (e.g., [39], [40]), these problems continue to be a major impediment in multi-system analytics.

The data-less data processing paradigm offers new insights and the potential to largely eliminate these costs. The central idea is to develop and deploy agents within each constituent system in a polystore. The agent, in essence, encapsulates the ML models and functionality for data-less processing. Therefore, instead of migrating large volumes of data between constituent systems, either: (i) only approximate results of performing operators on the local data are sent, or (ii) the models themselves are migrated and incorporated to produce the final approximate answers.

C. Research Theme 2: Big-Data-less Big Data Analytics

RT2 focuses on answering queries using surgical base-data accesses to only the (small) subsets of the voluminous data sets that are required (or simply suffice) to produce the answer. This will be achieved by developing access structures such as indexes, as well as specialized (semantic) caches. RT2 includes three main threads.

1) Data Manipulation Operations (Defining Subspaces of Interest): The key operations to be studied have been identified from our prior research as fundamental tasks for which state-of-the-art solutions are badly lacking. They include operations such as joins (especially in distributed settings, focusing on minimising the data movement from server node to server node during computation), kNN query processing (and its variants, such as reverse kNN, kNN joins, all-pairs and approximate kNN, etc.), spatial analytics operations (such as spatial joins, spatial (multi-dimensional) range queries, etc.), radius queries, and so on.
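As a concrete reference point, the baseline such access structures must beat is the brute-force formulation; for kNN this is a full scan (the helper below is illustrative only, not a method from the paper):

```python
import heapq

def knn(points, q, k):
    """Brute-force kNN baseline: scans every point. The access structures
    proposed in this thread aim to replace this full scan with surgical,
    index-guided reads of only the relevant data subsets."""
    return heapq.nsmallest(
        k, points,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, q)))

neighbours = knn([(0, 0), (1, 1), (5, 5), (6, 6)], q=(0, 0), k=2)
```

On billions of points the scan is exactly the kind of all-data access SEA seeks to avoid.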

2) Ad Hoc ML Tasks: This task focuses on traditional ML operations (such as classification, clustering, regression, etc.). Specifically, analysts are to define (using selection operators, such as those discussed above) subspaces of interest and ask for the data items within these subspaces to be clustered, classified, regressed upon, etc. Although a large body of research has tackled such tasks in isolation, performing these tasks efficiently and scalably on arbitrarily defined, ad hoc subspaces is an open problem. This thread will develop semantic caches and indexes to dramatically expedite such operations. Furthermore, it will target expediting other fundamental operations, namely kNN regression and kNN classification, exploiting insights gained.

3) Raw Data Analytics: Currently, data analytics is performed on cleaned data, fitted to given data models. This requires a resource-hungry and time-consuming data wrangling process and ETL (Extract-Transform-Load) procedures. As data sizes increase, the data-to-insight times can become too high. This thread will centre its attention on developing adaptive indexing and caching techniques that operate on raw data and facilitate efficient and scalable raw-data analyses.

RT2 will consider both exact and (appropriately defined) approximate solutions. It will exploit known properties of real-world data sets (e.g., their distributions) and known access patterns (workload characteristics), and will leverage statistical properties and ML results (such as data transformations and embeddings) to accomplish its goals.

D. Research Theme 3: Optimization – Alternatives, Evaluation, and Optimal Selection

The goal of RT3 is to derive an "on-the-fly optimized processing strategy" for the analytical query at hand. Assume we are given a performance metric (e.g., money-costs, task-processing time, bandwidth, etc.) to optimize. RT3 research aims to produce the optimal execution strategy for the given metric. There are several degrees of freedom here, which define the following main research items:

1) Access Structure/Method Selection: A variety of indexes (i.e., those developed in RT2) may already exist or may be worth building on the fly during query processing. Selecting the most appropriate access structure is the first degree of freedom, and a key challenge.

2) Distributed Processing Paradigm: Based on our experience, we differentiate between a MapReduce style of distributed task processing (on top of systems like Hadoop, Spark, Shark, Flink, etc.) and a coordinator-cohort paradigm, whereby a coordinating node directly accesses the storage engine. Different circumstances dictate the use of different paradigms. Assume there exist indexes for processing analytics queries. Then, having a coordinating node access the (typically distributed) index and use it to surgically read small subsets of base data directly from the back-end storage may be preferable to an all-out MapReduce pass over the data nodes.
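The choice can be framed as a cost-model comparison. The toy model below (all names and constants are hypothetical, chosen only to illustrate the shape of the trade-off) captures the intuition: surgical access wins when selectivity is low enough that an index probe plus targeted fetches undercut a full scan.

```python
def choose_paradigm(n_rows, selectivity,
                    scan_cost_per_row=1.0,
                    index_probe_cost=5_000.0,
                    fetch_cost_per_row=4.0):
    """Toy cost model: an all-out MapReduce-style scan touches every row,
    while a coordinator probes the distributed index once and then
    surgically fetches only the qualifying rows from back-end storage."""
    scan_cost = n_rows * scan_cost_per_row
    surgical_cost = index_probe_cost + selectivity * n_rows * fetch_cost_per_row
    return "coordinator-cohort" if surgical_cost < scan_cost else "mapreduce"
```

A highly selective query over a huge table favours the coordinator-cohort route; a near-full scan of a modest table favours MapReduce-style processing.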

3) Inference Model: SEA will contain a number of inference models used for predictive analytics. Furthermore, higher-level functionality, to be provided by SEA, may directly depend on using a number of different models. Even if said models derive from the same family (e.g., regression-based), different models have been found to be best for different data subspaces: e.g., when considering different regression base models or boosting-based ensemble models [41], [42]. Therefore, identifying the most appropriate inference model to employ is another key challenge.
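A basic form of this selection is to fit each candidate model family on training (query, answer) pairs and keep whichever generalises best on held-out pairs. A self-contained sketch with two deliberately simple, hypothetical candidates (real candidates would be the regression and boosting models cited above):

```python
def fit_mean(pairs):
    """Baseline candidate: predict the mean answer regardless of the query."""
    m = sum(y for _, y in pairs) / len(pairs)
    return lambda x: m

def fit_linear(pairs):
    """1-D ordinary least squares: answer ~ a + b * query_parameter."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return lambda x: a + b * x

def best_model(fitters, train, holdout):
    """Fit every candidate on `train`, keep the lowest held-out MSE."""
    def mse(model):
        return sum((model(x) - y) ** 2 for x, y in holdout) / len(holdout)
    return min((f(train) for f in fitters), key=mse)

train = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
holdout = [(3.0, 6.0), (4.0, 8.0)]
model = best_model([fit_mean, fit_linear], train, holdout)
```

Here the linear candidate wins, since the underlying relation is linear and the mean baseline extrapolates poorly.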

RT3 involves in-depth experimentation in order to identify the costs (for the aforementioned metrics) associated with the alternatives mentioned above. SEA will conduct such experiments, derive key data, and perform feature selection on this data, in order to develop and train ML models that accurately predict the best execution method.

E. Research Theme 4: New Functionality – Higher-level Queries and Explanations

RT4 consists of two investigation threads, defining appropriate (i) higher-level queries and (ii) query-answer explanations.


1) Higher-level Queries: Building upon RT2, this thread will facilitate more complex and in-depth analyses, proposing higher-level queries which encompass the more "basic" queries of RT2.

• The first challenge is to identify higher-level queries that can themselves be processed efficiently, scalably, and accurately.

• The second challenge pertains to the development of the processing mechanisms (indexes, semantic caches, etc.) to do so, ensuring higher performance compared to executing each constituent basic query in isolation.

• The third challenge is to define appropriate hierarchical or graph-structured spaces, showing how queries at lower levels can be combined to offer higher-level functionality.

• The final challenge is how to visualize and export this complex query capability set to analysts.

2) Query-Answer Explanations: A large number of analytical queries supported by present-day data systems return a single scalar value, which, as mentioned, leaves much to be desired. This thread will develop novel, rich, yet concise explanations for query-answer pairs that will facilitate deeper understanding of the queried data subspaces. These explanations are expected to be models/functions that show how the answer to the query depends on the query's parameters. As an example, an explanation can be a (piecewise) linear regression model showing how the count of (or the correlation coefficient between attributes within) a data subspace depends on the size of the subspace. This will facilitate result visualizations and provide the answers to many related exploratory queries the analyst may wish to issue, without issuing them; the analyst will be able to simply plug parameter values into the explanation models.
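Such an explanation can be sketched as an ordinary least-squares fit over observed (subspace size, count) pairs; the fitted coefficients, rather than any single scalar, are what is handed to the analyst (a minimal illustration, not the actual explanation models):

```python
def explain(observations):
    """Fit count ~ a + b * size over observed (subspace_size, count) pairs.
    The returned coefficients ARE the explanation: the analyst plugs in any
    subspace size instead of issuing further exploratory queries."""
    n = len(observations)
    ss = sum(s for s, _ in observations); sc = sum(c for _, c in observations)
    sss = sum(s * s for s, _ in observations)
    ssc = sum(s * c for s, c in observations)
    b = (n * ssc - ss * sc) / (n * sss - ss * ss)
    a = (sc - b * ss) / n
    return a, b

# Each pair is (subspace size, observed count) for one past query.
a, b = explain([(1.0, 12.0), (2.0, 22.0), (3.0, 32.0)])
```

A piecewise variant would fit one such pair of coefficients per segment of the subspace-size axis.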

The next challenges concern the ability to compute explanations in a SEA fashion, and deriving appropriate explanations for the more complex queries above.

F. Research Theme 5: Global-Scale Geo-Distributed SEA

RT5 is concerned with how to best deploy the developed SEA models in a wide-scale (geo-distributed) setting and the additional research problems that this setting introduces.

Big cloud computing providers are increasingly investing in federations of geo-distributed data centres, located around the world, to provide efficient data access to local clients as well as robust, reliable, and fault-tolerant data access to all clients worldwide. For example, the Google global cloud platform employs tens of such data centres [43] and the Microsoft counterpart involves more than 100 such data centres [44]. As a result, the massive data sets expected to be involved in big data analytics scenarios can be geo-distributed across such global cloud platforms. Providing analytics over such data sets is thus fraught with efficiency and scalability limitations, as has been well recognized, for example, in the Iridium project [45].

Research themes RT1–RT3 can be of real value in these environments. The target is to reduce WAN-based inter-datacentre communication, which can introduce high response times and unpredictability during geo-distributed analytics. Figure 3 exemplifies how SEA is envisaged to operate within such a geo-distributed setting. The essential idea is to place key functionality at the edge nodes of the global network, akin to the earlier discussion about multi-system analytics. The agents encapsulating the developed intelligence (from RT1 and RT2) can be deployed at edge nodes to localize query processing as much as possible.

In this way, the system (i.e., an agent at some edge node) accesses base data (stored at remote data centres) only when the expected error of the local models at the edge node is high. Conversely, analysts can be informed of the expected errors in the answers received from local agents and can issue an exact query against the remotely stored base data only if the estimated error of the local predictions is high.
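The decision logic at an edge node reduces to a simple policy; the sketch below assumes hypothetical interfaces (none of these names come from the paper), with the local model returning a prediction together with its estimated error:

```python
def answer_at_edge(query, local_model, error_budget, fetch_exact):
    """Edge-node policy sketch: serve the local model's approximate answer
    when its estimated error fits the analyst's budget; otherwise fall back
    to an exact (and expensive) query against remote base data."""
    approx, est_err = local_model(query)
    if est_err <= error_budget:
        return approx, "local-approximate"
    return fetch_exact(query), "remote-exact"

confident = lambda q: (42.0, 0.01)  # (prediction, estimated error)
uncertain = lambda q: (42.0, 0.30)
exact = lambda q: 40.0              # stand-in for a remote base-data scan
```

The error budget is exactly the knob the analyst (or the system) turns to trade answer accuracy against WAN traffic.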

RT5 brings to the surface several research tasks, discussed below.

1) Network System Architecture: We envisage the network to contain core nodes and edge nodes. The core nodes store the actual data. Additionally, as we shall see below, they can also participate in building models from the data. Therefore, they possess the option of either executing an exact-answer query (e.g., scanning the base data) or providing approximate answers by answering queries based on the built models. Conversely, edge nodes typically maintain only models of the base data and can provide only approximate answers to queries.

Given this separation of node functionality, the central question is how to organize the query and data flow between the nodes in the network. The issues here concern how to (i) organize the (intelligent agents at the) edges so that they can collaborate in knowledge sharing and query answering; and (ii) organize edge-to-core-node communication. A combination of peer-to-peer overlay network approaches and client-server communication is possible, depending on the number and churn of edge nodes.

2) Distributed Model Building: In an ideal setting, analytical queries from different edge nodes target different data subspaces. In this case, each edge node would build its own separate model. The initial queries would be treated as training queries, and the edge would gather enough (query, answer) pairs to locally train a model. Once the model is trained, the edge can then filter all queries from reaching the core, by using the local model to answer them.

In general, however, this will not be the case: the subspaces queried from different edge nodes will overlap. This affords the opportunity of distributed model building. As before, the initial training queries will reach the core nodes, but this time from different edge nodes. Said core nodes can then collaborate to train a model faster, by considering training queries from several different edge nodes. Subsequently, the core nodes can communicate the model to the edge nodes from which the relevant queries originated. From this point on, the edges can enter their query-answering phase.
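The training-then-filtering life cycle can be sketched as a toy protocol; a trivial mean model stands in for the properly trained one, and all class and method names are hypothetical:

```python
class CoreNode:
    """Pools training (query, answer) pairs arriving from several edge
    nodes and, once enough are seen, hands back one shared model."""

    def __init__(self, min_samples):
        self.pairs = []
        self.min_samples = min_samples

    def record(self, query, answer):
        self.pairs.append((query, answer))

    def model(self):
        if len(self.pairs) < self.min_samples:
            return None  # still in the training phase
        mean = sum(a for _, a in self.pairs) / len(self.pairs)
        return lambda q: mean  # stand-in for a properly trained model

class EdgeNode:
    """Forwards queries to the core until a model arrives, then answers
    locally, filtering further queries from ever reaching the core."""

    def __init__(self, core):
        self.core, self.local = core, None

    def query(self, q, exact_answer):
        if self.local is None:
            ans = exact_answer(q)           # must touch base data at the core
            self.core.record(q, ans)
            self.local = self.core.model()  # may now be trained
            return ans
        return self.local(q)                # answered entirely at the edge

core = CoreNode(min_samples=2)
e1, e2 = EdgeNode(core), EdgeNode(core)
exact = lambda q: 10.0
r1 = e1.query("q1", exact)  # training query from edge 1 reaches the core
r2 = e2.query("q2", exact)  # training query from edge 2 completes the pool
r3 = e2.query("q3", exact)  # served by e2's freshly installed local model
```

Note how training queries from both edges contribute to one shared pool, so e2 obtains a usable model after a single query of its own.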

An important related issue here is how to share the model state, defining which models have been built and for which data subspaces.

Fig. 3: The Vision for Wide-Scale Distributed SEA Big Data Analytics Processing.

3) Distributed Model Maintenance: The key issues here pertain to answering the following questions: (i) which models to build? (ii) where (i.e., at which edge nodes) should these models be located? (iii) how to efficiently keep the models consistent vis-à-vis changes in query patterns and data-state updates (data insertions, deletions, and updates)?

We expect that, for many applications, queries in typical workloads overlap and target certain subspaces which are of significantly smaller size than the total sizes of the stored data sets. Therefore, a key desideratum is that, given the massive size and number of stored data sets, only models for the (much smaller) data subspaces of interest are built. Hence, we expect core nodes to identify the data subspaces of (current) interest. How is this best accomplished? Certainly edge nodes can help, as they are 'query-facing', but as mentioned, core nodes will likely be involved, as queried subspaces from different edges overlap.

The methods for query quantization mentioned in RT1 can come in handy here. But we must ensure that the modelled subspaces are as small as possible in order to increase model accuracy, while also keeping model training time low, so this is a challenging task.

Given the need for model consistency and maintenance, said models should be carefully distributed across edge nodes so as to, on the one hand, increase the filtering power of edge nodes and, on the other, reduce the costs involved in maintaining model consistency and accuracy.

Shifts in user interests can be detected locally by edge nodes and globally by core nodes. This detection should lead to purging 'older' models referring to data subspaces which are no longer of interest, and/or to the redefinition of the subspaces of interest and the updating of models to account for the extension or shrinking of said subspaces.

4) Analytical Query Routing: Given an analytical query at some edge node, query routing refers to deciding where the query should be answered. Should it be answered at the local edge node? Should it be sent to another edge node? If so, which one, and how? Should it reach other nodes?

Depending on how the model state is shared by the nodes of the system, several options for answering the query are possible.

5) Model Error Maintenance: Given the distribution of models across the edge and core networks, a key requirement is the ability to accurately predict the model error. This may be extremely challenging in itself, depending on which ML models are being used. But even assuming that it is possible to predict the error associated with a deployed model, the different nodes where the model is actively used should maintain an accurate expectation of its accuracy (in the presence of data and query changes and model updates).

VI. Conclusions

Distributed systems play a key role in big data analytics, as data analytics stacks involve different distributed systems at their various layers (for distributed/parallel dataflow, distributed resource management, distributed file systems, distributed NoSQL systems, etc.). Furthermore, given the push for global-scale geo-distributed analytics, collections of such analytics stacks are working together, answering analytical queries. In these settings, unfortunately, the current state of the art often fails to address concerns with respect to efficiency, scalability, and accuracy. These three properties represent the holy grail for modern big data analytics. To overcome current limitations, a new paradigm is needed.

With this vision paper we attempt to provide this missing paradigm. It revolves around the novel notion of data-less data analytics. It proposes that current approaches fail to accomplish the SEA desiderata exactly because they access too many data-storing nodes and too much data during analytics processing. In contrast, the new paradigm attempts to answer analytical queries using models of the underlying data; said models are much lighter and can be highly accurate. This style of processing avoids time-consuming and resource-hungry processing over big data analytics stacks.

We have outlined the principles on which related research that will materialize this vision should be based. We have outlined successes so far from our recent work, which testify to the high potential of the vision. We have also put forth a research program that attempts to organize the main research threads that should be undertaken, offering initial suggestions that appear promising, and identifying challenges that need to be addressed.

References

[1] M. L. Williams, K. Fischer, J. Freymueller, B. Tikoff, et al., "Unlocking the secrets of the North American continent: An EarthScope science plan for 2010-2020," 2010.

[2] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, et al., "Big data: astronomical or genomical?" PLoS Biology, 2015.

[3] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of SOSP, 2003.

[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of OSDI, 2004.

[5] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies, 2010.

[6] M. Zaharia, M. Chowdhury, T. Das, et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of NSDI, 2012.

[7] A. Alexandrov, R. Bergmann, S. Ewen, et al., "The Stratosphere platform for big data analytics," The VLDB Journal, 2014.

[8] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce online," in Proceedings of NSDI, 2010.

[9] V. Vavilapalli, A. Murthy, C. Douglas, et al., "Apache Hadoop YARN: Yet another resource negotiator," in Proceedings of ACM SoCC, 2013.

[10] F. Chang, J. Dean, S. Ghemawat, et al., "Bigtable: A distributed storage system for structured data," in Proceedings of OSDI, 2006.

[11] L. George, HBase: The Definitive Guide. O'Reilly, 2011.

[12] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system," ACM SIGOPS Operating Systems Review, 2010.

[13] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive analysis of web-scale datasets," Communications of the ACM, 2011.

[14] C. Engle, A. Lupher, R. Xin, M. Zaharia, et al., "Shark: Fast data analysis using coarse-grained distributed memory," in Proceedings of ACM SIGMOD, 2012.

[15] S. Chaudhuri, G. Das, and V. Narasayya, "Optimized stratified sampling for approximate query processing," ACM Transactions on Database Systems (TODS), 2007.

[16] G. Cormode and S. Muthukrishnan, "An improved data stream summary: The count-min sketch and its applications," Journal of Algorithms, 2005.

[17] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, "BlinkDB: Queries with bounded errors and bounded response times on very large data," in Proceedings of ACM EuroSys, 2013.

[18] L. Sidirourgos, M. L. Kersten, and P. A. Boncz, "SciBORQ: Scientific data management with bounds on runtime and quality," in Proceedings of CIDR, 2011.

[19] Y. Park, A. S. Tajik, M. Cafarella, and B. Mozafari, "Database learning: Toward a database that becomes smarter every time," in Proceedings of ACM SIGMOD, 2017.

[20] A. Wasay, X. Wei, N. Dayan, and S. Idreos, "Data Canopy: Accelerating exploratory statistical analysis," in Proceedings of ACM SIGMOD, 2017.

[21] M. I. Jordan, "On statistics, computation and scalability," Bernoulli, 2013.

[22] L. Bottou and O. Bousquet, "The tradeoffs of large scale learning," in Proceedings of NIPS, 2007.

[23] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt, "Fast algorithms for segmented regression," in Proceedings of ICML, 2016.

[24] F. Savva, C. Anagnostopoulos, and P. Triantafillou, "Explaining analytical queries," in progress, 2018.

[25] S. Idreos, O. Papaemmanouil, and S. Chaudhuri, "Overview of data exploration techniques," in Proceedings of ACM SIGMOD, 2015.

[26] C. Anagnostopoulos and P. Triantafillou, "Learning set cardinality in distance nearest neighbours," in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2015.

[27] ——, "Learning to accurately count with query-driven predictive analytics," in Proceedings of the IEEE International Conference on Big Data, 2015.

[28] ——, "Efficient scalable accurate regression queries in in-DBMS analytics," in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2017.

[29] ——, "Query-driven learning for predictive analytics of data subspace cardinality," ACM Transactions on Knowledge Discovery from Data (TKDD), 2017.

[30] N. Ntarmos, Y. Patlakas, and P. Triantafillou, "Rank join queries in NoSQL databases," Proceedings of the VLDB Endowment (PVLDB), 2014.

[31] A. Eldawy and M. F. Mokbel, "SpatialHadoop: A MapReduce framework for spatial data," in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2015.

[32] D. Xie, F. Li, B. Yao, et al., "Simba: Efficient in-memory spatial analytics," in Proceedings of ACM SIGMOD, 2016.

[33] A. Cahsai, N. Ntarmos, C. Anagnostopoulos, and P. Triantafillou, "Scaling k-nearest neighbours queries (the right way)," in Proceedings of the 37th IEEE International Conference on Distributed Computing Systems (ICDCS), 2017.

[34] J. Wang, N. Ntarmos, and P. Triantafillou, "Indexing query graphs to speed up graph query processing," in Proceedings of the 19th International Conference on Extending Database Technology (EDBT), 2016.

[35] ——, "GraphCache: A caching system for graph queries," in Proceedings of the 20th International Conference on Extending Database Technology (EDBT), 2017.

[36] C. Anagnostopoulos and P. Triantafillou, "Scaling out big data missing value imputations," in Proceedings of the ACM SIGKDD Conference (KDD), 2014.

[37] F. Katsarou, N. Ntarmos, and P. Triantafillou, "Subgraph querying with parallel use of query rewritings and alternative algorithms," in Proceedings of the International Conference on Extending Database Technology (EDBT), 2017.

[38] ——, "Hybrid algorithms for subgraph pattern queries in graph databases," in Proceedings of the IEEE International Conference on Big Data (BigData), 2017.

[39] J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura, N. Polyzotis, and M. J. Carey, "MISO: Souping up big data query processing with a multistore system," in Proceedings of ACM SIGMOD, 2014, pp. 1591-1602. [Online]. Available: http://doi.acm.org/10.1145/2588555.2588568

[40] V. Gadepally et al., "The BigDAWG polystore system and architecture," arXiv:1609.07548, 2016.

[41] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001.

[42] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785-794.

[43] "Google data centre locations," https://www.google.com/about/data-centers/inside/locations/index.html.

[44] "Microsoft data centres," https://www.microsoft.com/en-us/cloud-platform/global-datacenters.


[45] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica, "Low latency geo-distributed data analytics," in Proceedings of ACM SIGCOMM, 2015, pp. 421-434. [Online]. Available: http://doi.acm.org/10.1145/2785956.2787505

[46] L. Bottou, F. Curtis, and J. Nocedal, "Optimization for large-scale machine learning," arXiv:1606.04838 [stat.ML], 2017.

[47] F. Katsarou, N. Ntarmos, and P. Triantafillou, "Performance and scalability of indexed subgraph query processing methods," Proceedings of the VLDB Endowment (PVLDB), 2015.

[48] Q. Ma and P. Triantafillou, "Query-driven regression model selection," in progress, 2018.