
Overview of Data Management in Sensor Networks

© Dr. Deepak Ganesan, edited by Dr. Robert Akl

Deepak Ganesan (UMass)

Data Management Basics

- Sensor networks are data-centric: a significant amount of data is generated within the network.
- Data management: how do you manage (store and process) the data in the network?
- Different data management approaches apply depending on:
  - Sensor: data rate or event rate.
  - Resource: local storage, processing, bandwidth, and power capacity.
  - Query: type, arrival rate, complexity, and latency requirement.


Key Challenges in Data Management

Where should the data be stored?

How should queries be routed to the stored data?

Where and how should aggregation be performed?

How should queries for sensor networks be expressed?

[Figure: information flow and command flow]


Data Management Challenges

- Where should data be stored and query processing be performed?
  - Inside: dealing with storage limitations, query processing overhead, and distributed query processing.
  - Outside: dealing with bandwidth, scheduling, reliability, and power issues.
- How should queries be routed to data?
  - Inside: flooding, geographic routing, gradient-based routing.
  - Outside: tree-based routing.
- Where and how should aggregation be performed?
  - Opportunistically along the routing path.
  - Cluster-based or gossip-based.
- How should queries on sensor data be expressed?
  - Declarative querying for users.
  - Macroprogramming for developers.

Where should data be stored?


Spectrum of Data Storage and Processing

[Figure: design points plotted against the axes "Local Storage" and "Communication for Data Storage"]
- Local Storage and Hierarchical Index
- Local Storage and Flooding or Geography-based Query Processing
- Multi-resolution Storage and Indexing
- Centralized Storage and Querying


Spectrum of Data Storage and Processing

[Figure: design points plotted against the axes "Communication for Data Storage" and "Communication for Query Processing"]
- Local Storage and Hierarchical Index
- Multi-resolution Storage and Indexing
- Local Storage and Flooding or Geography-based Query Processing



Centralized Storage and Querying

- Method: archive nothing locally; transmit everything of interest.
  - When a data item of interest is detected, send all useful information to the base station.
- Advantages:
  - Persistent centralized storage.
  - Intelligence resides at a more resource-rich node. Complicated signal processing can easily be done outside the network, while sensor nodes perform only very simple filtering of data.
- Disadvantages:
  - Power inefficient.
  - Not applicable to applications where a large amount of data is potentially useful.

[Figure labels: Query, Trigger]

When is centralized storage and querying appropriate?
- First-generation data collection/acquisition systems: James Reserve, Great Duck Island, structural monitoring (Wisden), etc.
- Scientific applications where users need all the data.


Multi-resolution Storage and Indexing

- Method: store data in a multi-resolution hierarchy.
  - Raw data at the leaves, processed summaries of data at clusterheads (which may be higher-power nodes).
- Advantages:
  - The root has a multi-resolution view of the data in the network. This can be used to make intelligent decisions about which nodes to query and to perform complicated processing.
  - Data is replicated at multiple devices.
  - Even if raw data is phased out, summaries can be stored.
- Disadvantages:
  - Processing and hierarchical storage require power, although not as much as centralized storage.

When is distributed storage and indexing appropriate?
- Second-generation data collection/acquisition systems.
- Scientific applications where data sizes are large and users need to find patterns in sensor data.
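The hierarchy described above can be sketched as follows. This is a minimal illustration, assuming each summary is a (min, max, sum, count) record and the hierarchy has only two levels; the names `Summary`, `summarize`, `merge`, `clusters_to_query`, and all node IDs and readings are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    lo: float
    hi: float
    total: float
    count: int

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(samples):
    """Summarize the raw samples held at a leaf node."""
    return Summary(min(samples), max(samples), sum(samples), len(samples))


def merge(summaries):
    """Combine child summaries into a coarser summary at a clusterhead."""
    summaries = list(summaries)
    return Summary(
        min(s.lo for s in summaries),
        max(s.hi for s in summaries),
        sum(s.total for s in summaries),
        sum(s.count for s in summaries),
    )


# Hypothetical raw temperature readings stored at four leaf nodes.
leaves = {
    "n1": [20.0, 21.0, 22.0],
    "n2": [19.0, 20.0],
    "n3": [30.0, 31.0, 35.0],
    "n4": [29.0, 30.0],
}
# Hypothetical assignment of leaves to clusterheads.
clusters = {"c1": ["n1", "n2"], "c2": ["n3", "n4"]}

leaf_summaries = {node: summarize(vals) for node, vals in leaves.items()}
cluster_summaries = {
    head: merge(leaf_summaries[n] for n in members)
    for head, members in clusters.items()
}
root_summary = merge(cluster_summaries.values())


def clusters_to_query(threshold):
    """Use the coarse summaries to pick clusters that may hold readings above threshold."""
    return [head for head, s in cluster_summaries.items() if s.hi > threshold]
```

In this sketch the root can rule out cluster `c1` for a query such as "readings above 25" without contacting any leaf, which is the kind of decision the multi-resolution view is meant to support.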


Local Storage and Distributed Indexing

- Method: store data locally at each node, and construct distributed index structures to make search efficient.
- Advantages:
  - Makes search efficient and requires low communication overhead.
- Disadvantages:
  - Data is lost if a node fails.
  - Index structures can only deal with specific attribute-based search, not with arbitrary signal processing functions over the data.

When is local storage and distributed indexing appropriate?
- When search can be effectively scoped using simple attributes. For example, if temperature is a good indicator of some other activity, it can be used to limit the scope of the search.

[Figure label: 1 <= Event Attribute <= 8]
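An attribute-scoped search such as 1 <= event attribute <= 8 can be sketched as below. This is only an illustration under simplifying assumptions: the index is held in a single dictionary rather than distributed across nodes, each index entry is just the (min, max) range of values a node holds, and the node IDs, values, and function names are hypothetical.

```python
# Hypothetical local stores: node ID -> event attribute values recorded at that node.
local_store = {
    "a": [2, 3, 5],
    "b": [12, 15],
    "c": [7, 9, 10],
    "d": [20, 25],
}


def build_index(store):
    """One index entry per node: the (min, max) range of attribute values it holds."""
    return {node: (min(vals), max(vals)) for node, vals in store.items() if vals}


def nodes_to_search(index, lo, hi):
    """Nodes whose stored range overlaps the query range [lo, hi]."""
    return sorted(n for n, (nlo, nhi) in index.items() if nlo <= hi and nhi >= lo)


def range_query(store, index, lo, hi):
    """Contact only the overlapping nodes and collect their matching values."""
    hits = {}
    for node in nodes_to_search(index, lo, hi):
        matches = [v for v in store[node] if lo <= v <= hi]
        if matches:
            hits[node] = matches
    return hits


index = build_index(local_store)
result = range_query(local_store, index, 1, 8)
```

Here only nodes `a` and `c` are contacted; `b` and `d` are skipped because their ranges cannot match. The sketch also shows the limitation noted above: the index answers range predicates on one attribute, not arbitrary signal processing over the stored data.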





Local Storage and Distributed Querying

- Method: store data locally at each node; the query is flooded out to the network or geographically routed. Query processing is performed on demand.
- Advantages:
  - Only on-demand processing, therefore energy efficient.
- Disadvantages:
  - Data is lost if a node fails.
  - Puts significant complexity into a network of very low-power devices.
  - Frequent queries incur high overhead.

When is local storage and distributed querying appropriate?
- When queries are simple and have limited scope.
- When schemes can deal with node failure.

How should queries be routed to stored data?


Data-centric routing techniques

[Figure: routing techniques plotted against the axes "Push-based query routing" and "Pull-based query routing"]
- Tree-based routing
- Query Flooding or Geographic Routing
- Gradient-based routing


Flooding queries into the network

- Flood the query throughout the network. Nodes with matching attributes/parameters respond to the query.
- Pros:
  - Very simple and reliable.
- Cons:
  - Inefficient if queries are posed frequently or the network is large-scale.
- When is it useful?
  - A large fraction of current deployments, and possibly future deployments, will be flooding-based simply because of its inherent simplicity and reliability.
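A minimal sketch of query flooding, assuming an idealized network with no packet loss in which every node rebroadcasts the first copy of the query it hears and ignores duplicates. The topology, the temperature readings, and the query predicate are hypothetical.

```python
from collections import deque


def flood(adjacency, source):
    """Flood a query from source; each node rebroadcasts the first copy it hears.

    Returns (set of nodes reached, number of broadcasts).
    """
    seen = {source}
    queue = deque([source])
    broadcasts = 0
    while queue:
        node = queue.popleft()
        broadcasts += 1  # the node broadcasts the query once to all neighbors
        for neighbor in adjacency[node]:
            if neighbor not in seen:  # duplicate suppression
                seen.add(neighbor)
                queue.append(neighbor)
    return seen, broadcasts


# Hypothetical 6-node topology with undirected links.
adjacency = {
    "base": ["a", "b"],
    "a": ["base", "c"],
    "b": ["base", "c", "d"],
    "c": ["a", "b", "e"],
    "d": ["b"],
    "e": ["c"],
}
# Hypothetical readings; the query asks for nodes with temperature > 30.
temperature = {"base": 0, "a": 25, "b": 31, "c": 28, "d": 35, "e": 22}

reached, broadcasts = flood(adjacency, "base")
responders = sorted(n for n in reached if temperature[n] > 30)
```

Every node broadcasts once per query regardless of whether it matches, so the cost grows with network size and query rate, which is the inefficiency listed under the cons.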


Geographic routing to known locations

- If the query explicitly specifies a location, selectively route the query to the particular locations of interest.
  - E.g., “Find the average temperature in the west corridor of the CS Building.”
- Pros:
  - Can reduce query routing overhead by selectively choosing nodes.
- Cons:
  - Complex routing strategy. Needs special mechanisms to route around communication holes.
  - Lack of redundancy might result in the query being lost.
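The sketch below illustrates one common form of geographic routing, greedy forwarding, in which each node hands the query to the neighbor closest to the target location. The slides do not name a specific algorithm, so treat this as an assumed stand-in; hole recovery is deliberately left out to show why it is needed. Positions and links are hypothetical.

```python
import math


def greedy_route(positions, adjacency, source, target_xy):
    """Forward to the neighbor closest to target_xy; stop when no neighbor is closer.

    Returns (path, delivered). delivered is False when the query is stuck at a
    local minimum (a communication hole); recovery is not modeled here.
    """
    def dist(node):
        return math.dist(positions[node], target_xy)

    path = [source]
    current = source
    while True:
        best = min(adjacency[current], key=dist, default=None)
        if best is None or dist(best) >= dist(current):
            break
        current = best
        path.append(current)
    closest = min(positions, key=dist)  # node nearest the target location
    return path, current == closest


# Hypothetical node coordinates.
positions = {"s": (0, 0), "a": (1, 0), "b": (2, 0), "t": (3, 0), "u": (0, 1), "v": (2, 1)}

# Fully connected corridor: greedy forwarding succeeds.
links_ok = {
    "s": ["a", "u"], "a": ["s", "b"], "b": ["a", "t"],
    "t": ["b", "v"], "u": ["s", "v"], "v": ["u", "t"],
}
# Same nodes with the a-b link missing: a detour via u and v exists,
# but greedy forwarding gets stuck at a.
links_hole = {
    "s": ["a", "u"], "a": ["s"], "b": ["t"],
    "t": ["b", "v"], "u": ["s", "v"], "v": ["u", "t"],
}

path_ok, delivered_ok = greedy_route(positions, links_ok, "s", (3, 0))
path_hole, delivered_hole = greedy_route(positions, links_hole, "s", (3, 0))
```

In the second topology the query stops at `a` even though a longer path exists, which is the situation that special hole-avoidance mechanisms are meant to handle.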


Gradient-based routing

- Set up gradients in the network that lead queries towards the areas of interest. Also called publish-subscribe schemes.
- Pros:
  - Resilient to failures and packet loss (similar to gossip-based schemes).
  - Not restricted to location-based queries. Can be used for any spatially correlated attribute.
- Cons:
  - Incurs more overhead than geo-routing schemes.
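One way to picture gradient setup is sketched below, assuming the gradient at each node is simply its hop distance to the nearest node that has advertised an event, and that a query follows strictly decreasing values. Real schemes differ in how gradients are established and refreshed; the topology and function names here are hypothetical.

```python
from collections import deque


def setup_gradients(adjacency, event_nodes):
    """Multi-source BFS: each node learns its hop distance to the nearest event node."""
    height = {n: 0 for n in event_nodes}
    queue = deque(event_nodes)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in height:
                height[neighbor] = height[node] + 1
                queue.append(neighbor)
    return height


def route_query(adjacency, height, start, failed=frozenset()):
    """Follow strictly decreasing heights, skipping failed neighbors.

    Returns (path, reached_event_node).
    """
    path = [start]
    current = start
    while height[current] > 0:
        candidates = [
            n for n in adjacency[current]
            if n not in failed and height[n] < height[current]
        ]
        if not candidates:
            return path, False
        current = min(candidates, key=lambda n: (height[n], n))
        path.append(current)
    return path, True


# Hypothetical topology with two disjoint paths from the base station towards node e.
adjacency = {
    "base": ["a", "b"],
    "a": ["base", "c"],
    "b": ["base", "c"],
    "c": ["a", "b", "e"],
    "e": ["c"],
}
height = setup_gradients(adjacency, ["e"])

path, ok = route_query(adjacency, height, "base")
detour, detour_ok = route_query(adjacency, height, "base", failed={"a"})
```

Because several neighbors may lie downhill, the query can still reach the event node when `a` fails, which illustrates the resilience listed under the pros. The extra cost is the gradient setup itself.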


Tree-based routing

- In push-based systems, the query can remain at the base station and the data can be routed to it.
- Pros:
  - Query processing can be complex. Decisions can be made at the intelligent node rather than the resource-constrained one.
  - Periodic push is synchronous, and can be optimized through better scheduling policies.
- Cons:
  - Pure push is rather inefficient, since decision making is solely at the central location. Usually a combination of push and pull is more appropriate.

Where and how should query results be aggregated?


General aggregation

- Let H_k be the information from k sources. It generally satisfies the following conditions:
  - It is non-decreasing with respect to k.
  - It is concave with respect to k.
- Uncorrelated sources: H_k = k
- Correlated sources: H_k = 1
- Intermediate correlation: curve between these two extremes

[Figure: aggregate information H_k versus number of sources k]
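The two conditions and the two extreme cases above can be written out as follows. The normalization $H_0 = 0$, $H_1 = 1$ (one source contributes one unit of information) is an assumption made here for illustration; it is consistent with the two stated extremes but is not given explicitly in the slides.

```latex
\begin{align*}
  H_{k+1} &\ge H_k
    && \text{(non-decreasing in } k\text{)} \\
  H_{k+1} - H_k &\le H_k - H_{k-1}
    && \text{(concave in } k\text{)} \\
  1 \;\le\; H_k &\;\le\; k
    && \text{(assuming } H_0 = 0,\ H_1 = 1\text{)}
\end{align*}
```

Under that assumption, the lower bound $H_k = 1$ is the fully correlated case (extra sources add nothing) and the upper bound $H_k = k$ is the uncorrelated case (every source adds a full unit); intermediate correlation falls between them.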


Aggregation of query results

- Opportunistic aggregation
  - Build shortest-path trees. Aggregate at the junction nodes.
- Cluster close to the source of data
  - Force query results to be aggregated close to the data.
- Query-optimized trees
  - Build trees that are optimized for particular kinds of queries.
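The effect of opportunistic aggregation on a shortest-path tree can be sketched as below. The sketch assumes a fully aggregatable query (for example MAX, where a junction forwards a single combined packet) and counts one transmission per packet per hop; the topology and function names are hypothetical.

```python
from collections import deque


def shortest_path_tree(adjacency, sink):
    """BFS from the sink; returns each node's parent on a shortest path to the sink."""
    parent = {sink: None}
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def path_edges(parent, source):
    """Tree edges (child, parent) on the path from source up to the sink."""
    edges = []
    node = source
    while parent[node] is not None:
        edges.append((node, parent[node]))
        node = parent[node]
    return edges


def transmissions(parent, sources):
    """Return (count without aggregation, count with aggregation at junctions)."""
    per_source = [path_edges(parent, s) for s in sources]
    without_aggregation = sum(len(p) for p in per_source)
    # With aggregation, each tree edge carries one combined packet.
    with_aggregation = len({edge for p in per_source for edge in p})
    return without_aggregation, with_aggregation


# Hypothetical topology: three sources share junction j, two hops from the sink.
adjacency = {
    "sink": ["r"],
    "r": ["sink", "j"],
    "j": ["r", "s1", "s2", "s3"],
    "s1": ["j"],
    "s2": ["j"],
    "s3": ["j"],
}
parent = shortest_path_tree(adjacency, "sink")
no_agg, agg = transmissions(parent, ["s1", "s2", "s3"])
```

In this topology the three sources need 9 transmissions when each result travels separately and 5 when results are combined at junction `j`. The saving depends on where paths happen to merge, which is why the cluster-based and query-optimized alternatives try to force merging closer to the data.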

Query processing language and optimization


Query Processing Challenges

- Intended audience:
  - Users who pose queries.
  - Application developers.
- How much complexity to expose?
  - Complex inter-resource constraints.
  - Distributed computation.
  - Data fusion/collaborative signal processing.
- How much run-time vs. compile-time query optimization?


Query Processing Language for Users

- Expressing spatio-temporal queries
  - When and where did an event occur?
  - Scoping the spatial or temporal region of interest.
- Addressing individual sensors
  - Able to specify which sensors and which sensor parameters you are interested in.
  - Confidence intervals or other measures of error tolerance.
- Addressing events
  - Able to specify “events” of interest transparently from the event processing.
  - Confidence interval, error tolerance.
- Specifying query processing constraints
  - Latency of result.

Hide the distributed nature of computation for naïve users. Enable extensive query expression of sensor data, events, and query constraints.


Programming Language for Developers

- Addressing groups of sensors and data fusion
  - Combine data from a motion detector, vibration sensor, and camera (that may not be co-located) into a “detection event”.
  - Aggregation: data of type ‘vibration signal’ can be combined by looking at the FFT and picking the 4 dominant frequencies.
- Specify the routing structure
  - Cluster the area into groups of nodes that observe correlated events.
- Allow user-defined signal processing definitions
  - Each application has different aggregation needs.
- Express resource constraints
  - Energy: do not expend more than J joules in trying to get the result.

Expose the distributed nature of computation, but provide a composable library of primitives for easier development.
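The vibration-signal aggregation mentioned above (take the FFT and keep the 4 dominant frequencies) might look like the sketch below. To stay self-contained it uses a naive O(N^2) discrete Fourier transform from the standard library instead of a real FFT routine, which gives the same spectrum for short windows. The sample rate, window length, and test signal are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of DFT bins 1 .. N/2 - 1 (naive DFT, adequate for short windows)."""
    n = len(signal)
    mags = {}
    for k in range(1, n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        mags[k] = abs(coeff)
    return mags


def dominant_frequencies(signal, sample_rate_hz, count=4):
    """Frequencies (Hz) of the `count` strongest DFT bins, in ascending order."""
    n = len(signal)
    mags = dft_magnitudes(signal)
    top_bins = sorted(mags, key=mags.get, reverse=True)[:count]
    return sorted(k * sample_rate_hz / n for k in top_bins)


# Hypothetical vibration window: 64 samples at 64 Hz, four strong tones plus a weak one.
SAMPLE_RATE_HZ = 64
tones = {3: 1.0, 7: 0.8, 12: 0.6, 20: 0.5, 25: 0.1}  # frequency (Hz) -> amplitude
signal = [
    sum(amp * math.sin(2 * math.pi * f * t / SAMPLE_RATE_HZ) for f, amp in tones.items())
    for t in range(64)
]

summary = dominant_frequencies(signal, SAMPLE_RATE_HZ)
```

A node running something like this would forward four numbers instead of 64 raw samples, which is the sense in which a user-defined signal processing step doubles as an aggregation operator.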


Runtime query optimization

- Energy constraints pose difficult query optimization requirements.
  - Every sensor sample incurs an energy cost, with different sensors incurring different overhead.
  - Processing and storage consume power as well.
- Consider the query: sample vibration and magnetometer, and report if vibration > Threshold1 and magnetic flux > Th2.
  - Vibration sampling requires less energy than magnetometer sampling, hence it should be done first.

The ordering of sampling, processing, and communication can matter for energy reasons. How should runtime query optimization be performed?
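The sampling-order argument can be made concrete with a small expected-energy model. The sketch assumes the two predicates are independent, that evaluation stops at the first predicate that fails, and that only sampling energy counts; the per-sample energies and pass probabilities are hypothetical numbers chosen for illustration.

```python
from itertools import permutations


def expected_energy(order, cost, pass_prob):
    """Expected sampling energy when sensors are read in `order` and evaluation
    stops at the first predicate that fails (predicates assumed independent)."""
    total = 0.0
    reach = 1.0  # probability that evaluation gets as far as this sensor
    for sensor in order:
        total += reach * cost[sensor]
        reach *= pass_prob[sensor]
    return total


def best_order(cost, pass_prob):
    """Exhaustively pick the sampling order with the lowest expected energy."""
    return min(permutations(cost), key=lambda o: expected_energy(o, cost, pass_prob))


# Hypothetical per-sample energy (mJ) and probability that each predicate holds.
cost_mj = {"vibration": 1.0, "magnetometer": 5.0}
pass_prob = {"vibration": 0.1, "magnetometer": 0.3}

plan = best_order(cost_mj, pass_prob)
vibration_first = expected_energy(("vibration", "magnetometer"), cost_mj, pass_prob)
magnetometer_first = expected_energy(("magnetometer", "vibration"), cost_mj, pass_prob)
```

With these numbers, sampling vibration first costs about 1.5 mJ in expectation against about 5.3 mJ for the reverse order, matching the slide's recommendation. Under this simple model the cheaper sensor does not always win, though: if the vibration predicate almost always passes and the magnetometer predicate almost always fails, reading the magnetometer first can cost less, which is one reason the choice may need to be made at run time.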
