Post on 27-Dec-2015
Efficiency and Reliability of the Transit Data Lifecycle
A study of multimodal migration, storage, and retrieval techniques
for public transit data
Presented by: Matthew Ahrens
Faculty Mentor: Dr. Uma Shama
Overview- Background
• GeoGraphics Lab
  • Maintains public transit data for Regional Transit Authorities (RTAs) in the Commonwealth of Massachusetts
• Services
  • Digitizing of static schedule data
  • Dynamic and real-time vehicle location data
  • Consultation and expert advice role
Overview- Background
• This project
  • Interdisciplinary between Mathematics and Computer Science
  • Focus on real-world / business applications of data analysis
• Time span
  • Spring 2013: exploratory analysis
  • Summer 2013 (ATP summer grant): modeling experiments
  • Fall 2013: implementation and integration
Overview- Background
• This project (cont.)
  • Evolved through several iterations
  • Original purpose: spatial analysis on ridership and vehicle location data
  • Four issues emerged that shifted the project's focus over time
    • 1. Concepts were unclear among authorities
    • 2. Data collection tools were inconsistent for historical analysis purposes
    • 3. Development on systems affected core features
    • 4. Documentation for systems was in code, with no clear point of injection
Overview- Outline
• Four sections
  • Abstraction and modeling of transit data
  • Analysis of design patterns and algorithms with comparison to existing systems
  • The design and implementation of a context-free data model
  • The design and implementation of a multimodal, application-level interface
Abstraction
• Research Questions
  • How can the different transit data protocols be described to compromise between conflicting definitions and structures?
  • Is there a compromise that can be reached that is still purposeful and clear?
• Purpose
  • Comparison of three authorities
    • GTFS / GTFS-realtime
    • TCIP
    • Proprietary (various)
Abstraction
• GTFS Example
• Pros:
  • Descriptive, data type or storage inclusive
  • Separates fields required for a definition from optional metadata
• Cons:
  • Perspective of transit user
  • Many definitions do not have explicit relationships
Abstraction
• GTFS-Realtime Example
• Pros:
  • Descriptive, data type or storage inclusive
  • Separates fields required for a definition from optional metadata
• Cons:
  • Defined as a feed, with no distinction or limitation of rate
  • Optional fields are not purposeful for a minimum definition or structure
Abstraction
• TCIP Example
• Pros:
  • Complete, covers every aspect of transit
• Cons:
  • Vague
  • Concerned with relationships between data systems
  • Specifies medium over message: requires XML/XSD format but does not clearly define data elements
Abstraction
• Proprietary Example - ESRI
• Pros:
  • Shows relationships between geospatial definitions
  • Standards leader for GIS protocols (GML, OpenGeo)
• Cons:
  • Concerned with GIS and use definitions over technical definitions
  • Missing most transit data concepts
Abstraction
• Methodology
  • Create an understandable, unambiguous definition for common transit concepts
  • Use as few primitives as possible to ease implementation
  • Use composition to aggregate data
• Two options considered
  • Define an object-method relationship
  • Define a set-theoretical model of transit data structures
Abstraction
• Methodology
  • Remove implementation-specific and use-specific context from transit data structures
  • Find the minimum required composition
  • Acknowledge commonly attributed metadata
  • Define data by the rate of its production mechanism
Abstraction
• Disambiguation
  • Real-time
    • Produced frequently in real-time
    • Best represented as a signal or a message stream
  • Dynamic
    • Infrequent but unknown rate of production
    • Best represented as a feed
  • Static
    • Infrequent, known interval rate of production
    • File system or other static resource
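The three categories above can be sketched as a small classifier. This is only an illustration: the names `ProductionRate`, `BEST_REPRESENTATION`, and `classify` are hypothetical, and the 60-second cut-off for "produced frequently" is an assumption, not a value from the study.

```python
from enum import Enum


class ProductionRate(Enum):
    REAL_TIME = "real-time"
    DYNAMIC = "dynamic"
    STATIC = "static"


# Representation suggested on the slide for each production rate.
BEST_REPRESENTATION = {
    ProductionRate.REAL_TIME: "signal or message stream",
    ProductionRate.DYNAMIC: "feed",
    ProductionRate.STATIC: "file system or other static resource",
}


def classify(mean_interval_s: float, interval_is_known: bool,
             frequent_below_s: float = 60.0) -> ProductionRate:
    """Classify a data source by how its data is produced.

    frequent_below_s is an assumed cut-off for "produced frequently".
    """
    if mean_interval_s < frequent_below_s:
        return ProductionRate.REAL_TIME
    # Infrequent data: a known interval makes it static, otherwise dynamic.
    return ProductionRate.STATIC if interval_is_known else ProductionRate.DYNAMIC
```

Under these assumptions, a vehicle position reported every few seconds classifies as real-time, while a schedule republished on a fixed quarterly interval classifies as static.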
Abstraction
• Results
  • Data flow model influenced the decision
Abstraction
• Results
  • Set-theoretical model
• Description
  • Define implementation-independent primitives
  • Compose transit data structures from those primitives
  • Define complex data structures as supersets of simple structures
Abstraction
• Commonly used examples
  • Primitives
    • Geolocation
    • Datetime
    • Unique, index-friendly ID (numeric, simple text)
  • Simple structures
    • Stop
    • Trip
  • Composite structures
    • AVL
    • ETA
Abstraction
• Composition Example
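One way to picture composition from primitives is with plain data classes. The field names below are assumptions chosen for illustration; the slides name the primitives (geolocation, datetime, ID), the simple structures (Stop, Trip), and the composites (AVL, ETA), but not their exact fields.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class Geolocation:          # primitive
    latitude: float
    longitude: float


@dataclass(frozen=True)
class Stop:                 # simple structure: ID + geolocation
    stop_id: str
    location: Geolocation


@dataclass(frozen=True)
class Trip:                 # simple structure: ID + ordered stop IDs
    trip_id: str
    stop_ids: Tuple[str, ...]


@dataclass(frozen=True)
class AVL:                  # composite: where a vehicle was, and when
    vehicle_id: str
    location: Geolocation
    observed_at: datetime


@dataclass(frozen=True)
class ETA:                  # composite built from other structures
    avl: AVL
    trip: Trip
    stop: Stop
    arrival_at: datetime
```

In this sketch an ETA contains everything an AVL record does plus a trip, a stop, and a predicted arrival, which mirrors the idea of complex structures being supersets of simpler ones.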
Data Migration
• Research Questions
  • What technologies, techniques, or models most efficiently and reliably move transit data from producer to consumer?
  • Which of those best embody the concepts of reuse and extensibility?
  • Which ones best resist the need for modification and internal maintenance?
Data Migration
• Purpose
  • Perform exploratory work to set standards for handling transit data
Data Migration
• Methodology
  • Study of BusLocator, the current data migration technology for AVL and route-specific data
  • Duplication of Timer-event concurrency model for real-time data
  • Pull design pattern vs. Push design pattern
  • Approximation algorithms
Data Migration
• BusLocator
  • C# Microsoft solution in two parts
    • Windows Service using Timer-event concurrency
      • Pulls AVL data every 30 minutes
      • Pulls route data every 5 minutes
      • Sends via SOAP to WCF service
    • WCF
      • Web service endpoint
      • Accepts data
      • Parses and stores in SQL tables
Data Migration
• Graphical Depiction
Data Migration
• Major bottlenecks
  • Event timer
  • Problems
    • Pulls too slowly to deliver real-time produced data for real-time consumption
    • Pulls over a timeframe and sends duplicates over the wire
    • Does not scale or load balance
  • SOAP XML message is large and metadata heavy
    • Not optimal for real-time
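To make the "metadata heavy" point concrete, the sketch below wraps one AVL record in a SOAP 1.1 envelope and compares its size with a compact JSON encoding of the same fields. The `SubmitAvl` operation, the `http://example.org/transit` namespace, and the field names are hypothetical; they are not BusLocator's actual message contract.

```python
import json
import xml.etree.ElementTree as ET

# One illustrative AVL record (values are made up).
record = {"vehicle_id": "bus-12", "lat": 41.6688, "lon": -70.2962,
          "time": "2013-07-01T12:00:00Z"}

# SOAP 1.1 envelope around a hypothetical SubmitAvl operation.
soap_message = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Header/>"
    "<soap:Body>"
    '<SubmitAvl xmlns="http://example.org/transit">'
    f"<VehicleId>{record['vehicle_id']}</VehicleId>"
    f"<Latitude>{record['lat']}</Latitude>"
    f"<Longitude>{record['lon']}</Longitude>"
    f"<Time>{record['time']}</Time>"
    "</SubmitAvl>"
    "</soap:Body>"
    "</soap:Envelope>"
).encode("utf-8")

# The same record as compact JSON.
json_message = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Confirm the envelope is well-formed XML and carries the same vehicle ID.
root = ET.fromstring(soap_message)
vehicle_id = root.find(".//{http://example.org/transit}VehicleId").text

overhead_bytes = len(soap_message) - len(json_message)
```

Even without a WSDL-generated header, the envelope and element tags cost extra bytes on every message, and that cost is paid again for each duplicate record sent over the wire.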
Data Migration
• Effort to duplicate for ETA
  • Pull from ETA feed as a REST service via XML
Data Migration
• Effort to duplicate for ETA
  • Purposes
    • Analytical use of AVL data as a static resource, not real-time
    • Made easier to organize by the set-theory model
    • Able to composite ETA from other sources
    • Able to automate analysis
Data Migration
• Effort to duplicate for ETA
  • Problems
    • AVL not complete for historical use
      • Led to development of a clear definition of AVL and other transit data structures
    • Showed need for a new system
      • Replace BusLocator
      • Define development framework for transit applications
      • Eliminate pull or approximate push design pattern
Data Migration
• Pull vs. Push
  • Pull design pattern
    • A.k.a. request-response, on-demand
    • Client (unknown) sends request to Server/Source (known)
    • Server processes and responds
  • Push design pattern
    • Subscription pattern
    • Client establishes connection to Server
    • Server pushes response to client upon local event
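The two patterns can be contrasted in a few lines. This is a minimal in-process sketch; `VehicleSource`, `get_latest`, `subscribe`, and `produce` are hypothetical names, and a real system would place a network connection between source and client.

```python
from typing import Callable, List, Optional


class VehicleSource:
    """Toy data source that supports both consumption patterns."""

    def __init__(self) -> None:
        self._latest: Optional[dict] = None
        self._subscribers: List[Callable[[dict], None]] = []

    # Pull: the client asks and the source answers with what it has now.
    def get_latest(self) -> Optional[dict]:
        return self._latest

    # Push: the client registers once and is called on every local event.
    def subscribe(self, callback: Callable[[dict], None]) -> None:
        self._subscribers.append(callback)

    def produce(self, record: dict) -> None:
        """Local event: new data arrives at the source."""
        self._latest = record
        for callback in self._subscribers:
            callback(record)
```

A subscriber sees every record as it is produced, while a client that pulls sees only whatever is latest at the moment it asks, so anything produced between two pulls is missed unless the source keeps a history.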
Data Migration
• Pull best use cases
  • When data is not consumed as a stream
  • When the most recent data is needed once or on demand
• Example
Data Migration
• Approximating push
  • Push is appropriate for data produced in real-time
  • Goal
    • Minimize time between production and availability for use
  • Problem
    • Push is not supported by all web communication
  • Solution
    • Pull approximation
Data Migration
• Approximation 1: timer event approximation
  • Goal
    • Predict the rate of production using historical data
  • Method
    • Exponential Moving Average
      • Use previous history and predictions to make future predictions
      • Keep track of the average interval between data updates
      • Take a proportion of history for accuracy
      • Take a proportion of predictions for smoothing
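A minimal sketch of the exponential moving average idea follows: each new estimate is `alpha * observed_interval + (1 - alpha) * previous_estimate`. The class name `IntervalPredictor`, the default `alpha`, and the initial interval are assumptions for illustration, not the study's parameters or its pseudocode.

```python
from typing import Optional


class IntervalPredictor:
    """Predicts the interval between data updates with an EMA."""

    def __init__(self, alpha: float = 0.5, initial_interval: float = 30.0) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha                  # weight given to observed history
        self.predicted = initial_interval   # current interval estimate (seconds)
        self.last_seen: Optional[float] = None

    def observe(self, timestamp: float) -> float:
        """Record the arrival time of new data and update the estimate."""
        if self.last_seen is not None:
            interval = timestamp - self.last_seen
            # History term for accuracy, previous prediction for smoothing.
            self.predicted = (self.alpha * interval
                              + (1.0 - self.alpha) * self.predicted)
        self.last_seen = timestamp
        return self.predicted

    def next_poll_time(self) -> float:
        """When the timer should fire next, given the current estimate."""
        if self.last_seen is None:
            raise RuntimeError("no data observed yet")
        return self.last_seen + self.predicted
```

A larger `alpha` reacts faster to a change in the production rate; a smaller one smooths over dropped or jittery updates. The polling timer would be rescheduled to `next_poll_time()` after each observation.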
Data Migration
• Exponential Moving Average example
  • Real data is hard to monitor, so a simulation was created
    • Simulate 10 vehicles
    • 10% chance of packet drop
  • Measurement criteria
    • Minimize difference between production time and consumption time
    • Minimize redundant data packets
    • Minimize dropped packets
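The simulation itself is not reproduced in the slides. The sketch below is an independent toy version with the same headline parameters (10 vehicles, 10% drop chance) that scores a fixed polling interval on the stated criteria. The production interval, duration, random per-vehicle phase, and the `simulate` function name are all assumptions.

```python
import random


def simulate(poll_interval: float, produce_interval: float = 10.0,
             vehicles: int = 10, drop_rate: float = 0.10,
             duration: float = 600.0, seed: int = 0) -> dict:
    """Poll a simulated fleet at a fixed interval and score the result."""
    rng = random.Random(seed)
    steps = int(duration // produce_interval)
    delivered = []   # per vehicle: production times that reached the source
    dropped = 0
    for _ in range(vehicles):
        phase = rng.uniform(0.0, produce_interval)  # vehicles are not in sync
        times = []
        for k in range(steps):
            if rng.random() < drop_rate:
                dropped += 1                        # packet lost in production
            else:
                times.append(phase + k * produce_interval)
        delivered.append(times)

    fresh = redundant = 0
    total_latency = 0.0
    last_consumed = [None] * vehicles
    for p in range(1, int(duration // poll_interval) + 1):
        now = p * poll_interval
        for v, times in enumerate(delivered):
            available = [t for t in times if t <= now]
            if not available:
                continue
            newest = available[-1]
            if newest == last_consumed[v]:
                redundant += 1                      # same packet pulled again
            else:
                fresh += 1
                total_latency += now - newest       # production-to-consumption gap
                last_consumed[v] = newest
    mean_latency = total_latency / fresh if fresh else 0.0
    return {"fresh": fresh, "redundant": redundant,
            "dropped": dropped, "mean_latency": mean_latency}
```

Polling faster than the production rate drives up redundant pulls, while polling slower increases the gap between production and consumption; a predicted interval is meant to sit between those two failure modes.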
Data Migration
• Exponential Moving Average example
  • A cache-free model was developed
    • Emulates the current system
    • Adaptable to batch queries and changing vehicle configuration
    • Measures the average previous interval
Data Migration
• Exponential Moving Average example
• Pseudocode
Data Migration
• Exponential Moving Average example
• Results
Implementation: GLaaS Model and API
• Goals
  • Taking the knowledge gained so far, implement and document a framework that exhibits best practices
    • Avoid anti-patterns
    • Choose the best medium for the job
    • Separate data, metadata, and implementation data
    • Keep business logic separate from data management
    • Migrate data near production rate
    • Multimodal retrieval and consumption mechanisms
Implementation: GLaaS Model and API
• Considerations
  • Security
    • Closed pipe vs. open pipe
    • Authentication
      • Access level
    • Differential privacy
    • Analysis protection
  • Reusability
  • Maintenance
  • Scalability
  • Documentation and training
GLaaS Model
• Database schema
  • Feature oriented
    • Consider transit data primitives as features
    • Make set-defined elements required fields
    • Make metadata optional fields
  • Design iterations
    • Trigger-based trickle-down model
      • Purpose
        • Fight over-index anti-pattern
        • Minimize select time purposefully
    • Output chain, batch-oriented
GLaaS Model
• Structure
  • Tables
    • Primary
      • Insert entry point
      • Guaranteed for analysis use
      • Acts as contract and definition of feature
    • Trigger
      • On insert, pushes and updates specific tables
    • Specific
      • Select / update point
      • Only accessible by stored procedure
    • Info
      • Metadata chainable by indexed fields
GLaaS Model
• Refactoring
  • Triggers did not work the way intended
  • Appearance
    • Separate files, separate queries
    • Resemble event handling
      • Simple and concurrent in imperative languages
  • Function
    • Appended to the insert query
      • Not concurrent
      • Artificial dependency
    • Traced
      • One failure invalidates the entire insert, including the original
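The failure mode described above can be reproduced in a few lines. This sketch uses SQLite from Python's standard library as a stand-in; the slides do not name the database engine, and the table, trigger, and function names are hypothetical. A trigger copies each inserted row into a "specific" table, and a constraint failure there undoes the original insert as well.

```python
import sqlite3

SCHEMA = """
CREATE TABLE avl_primary (
    vehicle_id TEXT NOT NULL, lat REAL NOT NULL, lon REAL NOT NULL);
CREATE TABLE avl_latest (
    vehicle_id TEXT PRIMARY KEY,
    lat REAL NOT NULL CHECK (lat BETWEEN -90 AND 90),
    lon REAL NOT NULL);
-- Trickle-down: every insert into the primary table updates the specific table.
CREATE TRIGGER avl_trickle AFTER INSERT ON avl_primary
BEGIN
    INSERT OR REPLACE INTO avl_latest (vehicle_id, lat, lon)
    VALUES (NEW.vehicle_id, NEW.lat, NEW.lon);
END;
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def insert_avl(conn: sqlite3.Connection, vehicle_id: str,
               lat: float, lon: float) -> bool:
    """Insert into the primary table; report whether the statement survived."""
    try:
        conn.execute(
            "INSERT INTO avl_primary (vehicle_id, lat, lon) VALUES (?, ?, ?)",
            (vehicle_id, lat, lon))
        return True
    except sqlite3.Error:
        # The trigger body runs as part of the same statement, so a failure
        # in avl_latest also undoes the row in avl_primary.
        return False


def row_count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

The primary table accepts any latitude on its own, yet a row with an out-of-range latitude never lands there, because the downstream insert in the trigger fails. That coupling is the "artificial dependency" the slide refers to.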
GLaaS Model
• Output variable
  • Represents inserted data similar to trigger
  • Called from and insert into primary stored procedures
  • Calls down the chain, separated by query delimiter
    • Enforces statically declared batching
    • Concurrent, lets the SQL environment make dependency decisions
    • Responsible for populating specific tables
GLaaS Model
• Results, integrity and protocol
GLaaS Model
• Explicit use of API and Stored Procedures
• No direct application level queries
• API only approved access point
• Explicit enforcement of authentication by function not by data type
• Eliminates need for application specific tables
• Fights SQL injection
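The protection comes from never building SQL out of caller-supplied text. The sketch below shows the same principle with a parameterized query behind a single access function, using SQLite from Python's standard library as a stand-in for a stored procedure. The table, the sample rows, and the `get_avl` function are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with two illustrative AVL rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE avl (vehicle_id TEXT, lat REAL, lon REAL)")
    conn.executemany(
        "INSERT INTO avl VALUES (?, ?, ?)",
        [("bus-1", 41.65, -70.28), ("bus-2", 41.70, -70.30)])
    return conn


def get_avl(conn: sqlite3.Connection,
            vehicle_id: str) -> List[Tuple[str, float, float]]:
    """The only approved read path: the query text is fixed, and the
    caller's value is bound as data, never spliced into the SQL."""
    cursor = conn.execute(
        "SELECT vehicle_id, lat, lon FROM avl WHERE vehicle_id = ?",
        (vehicle_id,))
    return cursor.fetchall()
```

An input such as `bus-1' OR '1'='1` is compared as a literal vehicle ID and matches nothing, whereas the same text concatenated into a query string would have returned every row.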
GLaaS API
• Multimodal approach to consumption
  • Mechanism for static, on-demand, and real-time consumption
• File system and known URI
  • Similar to GTFS-realtime implementation
  • Application-specific feed format
• Request-response
  • REST in several mediums
    • Binds to specific URI and HTTP verb
    • Eliminates need for expensive header
  • SOAP backwards compatibility
• Subscription model via push pattern
  • WebSocket
GLaaS API
• SOAP vs. REST
  • SOAP
    • XML-defined package
    • URIs are surrogates for endpoints
      • 1 URI per service
    • Message header contains definitions and method bindings
      • RPC
    • Message data contains payload
GLaaS API
• SOAP vs. REST
  • SOAP definition example for AVL
GLaaS API
• SOAP vs. REST
  • REST
    • URI multiplexing via routes
      • URI structure relative to root is bound to the request definition
      • Request object definition and HTTP verb bind to method and response
    • Request messages
      • Only contain data needed for functionality
      • No header, lightweight
      • JSON, XML, URI-embedded, or any custom data organization
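Route-based binding can be sketched without any web framework. In the toy dispatcher below, the pair (HTTP verb, URI template) selects the handler, and values embedded in the URI become handler arguments. The routes, handlers, and the `route`/`dispatch` helpers are hypothetical and are not the GLaaS API.

```python
import re
from typing import Optional

_ROUTES = []  # list of (verb, compiled URI pattern, handler)


def route(verb: str, template: str):
    """Bind a verb plus a URI template such as /avl/{vehicle_id} to a handler."""
    pattern = re.compile(
        "^" + re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template) + "$")

    def register(handler):
        _ROUTES.append((verb.upper(), pattern, handler))
        return handler
    return register


@route("GET", "/avl/{vehicle_id}")
def get_avl(vehicle_id: str) -> dict:
    # Placeholder response; a real handler would call the data layer.
    return {"vehicle_id": vehicle_id, "lat": 41.65, "lon": -70.28}


@route("PUT", "/avl/{vehicle_id}")
def put_avl(vehicle_id: str, body: dict) -> dict:
    return {"stored": vehicle_id, "fields": sorted(body)}


def dispatch(verb: str, path: str, body: Optional[dict] = None) -> dict:
    """Find the handler bound to this verb and URI, then call it."""
    for route_verb, pattern, handler in _ROUTES:
        match = pattern.match(path)
        if route_verb == verb.upper() and match:
            kwargs = match.groupdict()
            if body is not None:
                kwargs["body"] = body
            return handler(**kwargs)
    raise LookupError(f"no route for {verb} {path}")
```

The same URI serves different methods depending on the verb, and the request carries only the data the handler needs, with no envelope or method-binding header.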
GLaaS API
• SOAP vs. REST
  • REST
GLaaS API
• Goals
  • Maintenance
    • Dynamically generated usage documentation
    • Compartmentalized object definition
      • Requests
      • Response
      • Global entry point
    • Configuration
      • Application-level authentication
    • Service definition
GLaaS API
• Goals
  • Extensibility
    • Add data functionality to feature
      • Add specific tables
      • Add metadata-specific data columns
    • Add application-level functionality
      • Add request, response DTOs
      • Add service method bindings
  • Replication
    • Feature encapsulates protocol-defined parts
    • Replicate abstraction model and appropriate retrieval mechanisms for new feature
GLaaS API
• Results
  • Reusability of features and data mechanisms
    • Tools, algorithms, and methodologies reusable between applications
    • Persistent data
    • Design patterns built in for popular transit data techniques
  • Example
    • AVL as a service
    • Polyline encoding
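The slides do not say which polyline encoding is meant. Assuming it is the widely used Encoded Polyline Algorithm Format published by Google, a minimal encoder looks like the sketch below (function names are illustrative). It stores each coordinate as a delta from the previous point, scaled to five decimal places and packed into printable ASCII.

```python
from typing import Iterable, Tuple


def _encode_value(value: int) -> str:
    """Encode one signed integer delta as 5-bit chunks offset by 63."""
    # Shift left one bit; invert negatives so the sign ends up in the low bit.
    value = ~(value << 1) if value < 0 else (value << 1)
    chars = []
    while value >= 0x20:
        # Low 5 bits with the continuation flag (0x20) set.
        chars.append(chr((0x20 | (value & 0x1F)) + 63))
        value >>= 5
    chars.append(chr(value + 63))   # final chunk, no continuation flag
    return "".join(chars)


def encode_polyline(points: Iterable[Tuple[float, float]],
                    precision: int = 5) -> str:
    """Encode (latitude, longitude) pairs as a compact ASCII string."""
    factor = 10 ** precision
    encoded = []
    prev_lat = prev_lng = 0
    for lat, lng in points:
        lat_i, lng_i = round(lat * factor), round(lng * factor)
        # Deltas from the previous point keep the numbers small.
        encoded.append(_encode_value(lat_i - prev_lat))
        encoded.append(_encode_value(lng_i - prev_lng))
        prev_lat, prev_lng = lat_i, lng_i
    return "".join(encoded)
```

For a route shape, consecutive points are close together, so most deltas fit in one or two characters; that keeps a whole route geometry small enough to ship in a single lightweight response.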
Acknowledgements
• Thank you
  • Dr. Uma Shama, Larry Harman, and the GeoGraphics Lab for this research opportunity
  • Dr. Gross, my honors committee, and my proofreaders / co-workers for their advice and help
  • CCRTA, their vehicles, and their riders for their data mechanisms and the inspiration for this study
• Future work
  • Integration of these results and implementations into current GeoLab projects
  • Future service-oriented software design in my graduate career