mta analysis · 2013. 3. 7. · analysis for union square line4 figure:plot of the time intervals...
MTA ANALYSIS
HOW MANY SUBWAY STATIONS ARE IN NEW YORK CITY?
493
HOW ARE THESE DISTRIBUTED ACROSS THE FIVE BOROUGHS?
HOW DOES IT APPEAR AFTER NORMALIZING WITH RESPECT TO SURFACE AREA?
ANY BETTER WITH POPULATION FACTORED IN INSTEAD?
MTA DATA
Subway: StationEntrances
Main: calendar, calendar-dates, routes, stops, stop-times, transfers, trips
Turnstile: turnstile, RBS
Extracting details from data files
IMAGINE YOU WANTED TO FIND OUT THE ARRIVAL TIMES OF LINE B AT COLUMBUS CIRCLE.
YOU FOLLOW THESE SIMPLE STEPS:
1. GO TO THE FILE stops AND LOOK FOR THE stop-id OF COLUMBUS CIRCLE.
2. GO TO THE FILE trips AND LOOK UNDER COLUMN route-id FOR B.
3. EACH ROW WITH B CONTAINS A UNIQUE trip-id.
4. GO TO FILE stop-times AND LOCATE THE trip-id FROM 3. EACH MATCHING ROW CONTAINS A stop-id AND A TIME.
5. AMONG THOSE ROWS, LOCATE THE stop-id FROM 1 AND THE CORRESPONDING time.
6. REPEAT 3-5 FOR EVERY ROW WITH B.
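The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the column names (stop_id, stop_name, route_id, trip_id, arrival_time) follow the usual GTFS spelling with underscores, and the tiny tables below are made-up sample rows, not real MTA data.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dicts (one dict per row)."""
    return list(csv.DictReader(io.StringIO(text)))

def arrival_times(stops, trips, stop_times, stop_name, route_id):
    # Step 1: stop ids that carry the requested station name.
    stop_ids = {r["stop_id"] for r in stops if r["stop_name"] == stop_name}
    # Steps 2-3: every trip id that belongs to the requested route.
    trip_ids = {r["trip_id"] for r in trips if r["route_id"] == route_id}
    # Steps 4-6: keep stop-time rows that match both a trip and the stop.
    # Zero-padded HH:MM:SS strings sort correctly as plain text.
    return sorted(
        r["arrival_time"]
        for r in stop_times
        if r["trip_id"] in trip_ids and r["stop_id"] in stop_ids
    )

# Made-up sample rows, for illustration only.
stops = read_rows("stop_id,stop_name\nS1,Columbus Circle\nS2,Union Square\n")
trips = read_rows("route_id,trip_id\nB,T1\nB,T2\nD,T3\n")
stop_times = read_rows(
    "trip_id,arrival_time,stop_id\n"
    "T1,08:00:00,S1\nT1,08:10:00,S2\nT2,08:12:00,S1\nT3,08:05:00,S1\n"
)

times = arrival_times(stops, trips, stop_times, "Columbus Circle", "B")
print(times)
```

Building the two id sets first avoids repeating steps 3-5 by hand for every row: a single pass over stop-times is enough.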
ANALYSIS FOR UNION SQUARE LINE 4
Figure: Plot of the time intervals between arrivals of line 4 at Union Square on weekdays and weekends.
- Observe the oscillations in the time interval during the daytime on weekends.
- Observe also that the longest interval over the cycle is about 20 minutes and does not depend on whether it is a weekday or a weekend.
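For reference, the intervals plotted in such a figure can be derived from a list of arrival times roughly as follows. This is a minimal sketch, assuming the times are "HH:MM:SS" strings as in the stop-times file; the function names are hypothetical and the sample times are invented.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds after midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def headways_minutes(arrival_times):
    """Minutes between consecutive arrivals, after sorting by time."""
    secs = sorted(to_seconds(t) for t in arrival_times)
    return [(b - a) / 60 for a, b in zip(secs, secs[1:])]

# Invented sample times, for illustration only.
intervals = headways_minutes(["08:20:00", "08:00:00", "08:05:30"])
print(intervals)
```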
ANALYSIS: Turnstile Data At Station Stops.
WE WANT TO KNOW HOW THE USAGE OF THE SUBWAY VARIES OVER THE DAY
Figure: 28th Street - Line 1 (weekdays). Observe the remarkable consistency in the pattern across all weekdays.
Figure: 28th Street - Line 1 (weekends).
Figure: 34th Street Penn Station (weekdays). Again, notice how nearly identical the variation over 24 hours is for all weekdays.
Figure: 34th Street Penn Station (weekends).
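Usage curves like these can be obtained by differencing successive turnstile readings. The sketch below assumes each turnstile reports a cumulative entry counter at regular audit times (an assumption; the slides do not show the file layout). A negative difference is treated as a counter reset and marked as missing rather than counted. The function name and the sample readings are hypothetical.

```python
def entries_per_interval(readings):
    """Differences between consecutive cumulative counter readings.

    A negative difference is assumed to be a counter reset and is
    returned as None instead of a (meaningless) negative count.
    """
    deltas = []
    for prev, cur in zip(readings, readings[1:]):
        d = cur - prev
        deltas.append(d if d >= 0 else None)
    return deltas

# Invented readings: the counter resets between the 3rd and 4th audit.
deltas = entries_per_interval([100, 150, 210, 5, 40])
print(deltas)
```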
Neighborhoods Geographic Data
- I WANTED TO UNDERSTAND THE DISTRIBUTION OF STOPS ACROSS NEIGHBORHOODS.
- I OBTAINED GIS DATA FOR THE NEIGHBORHOODS OF NYC FROM nycopendata.
- EACH NEIGHBORHOOD IS DESCRIBED BY A SET OF POINTS THAT FORMS ITS BOUNDARY POLYGON.
Figure: DUMBO Polygon
Point in Polygon
- TO DETERMINE IF A STATION IS WITHIN A NEIGHBORHOOD, I WROTE A SHORT PIECE OF CODE TO FIND OUT IF A POINT LIES WITHIN THE BOUNDARY OF A POLYGON.
- THE MAIN IDEA IS TO COUNT THE CHANGES IN THE QUADRANT OF THE VECTOR FROM THE POINT IN QUESTION TO A VERTEX OF THE POLYGON AS WE GO AROUND IT.
- IF THE TOTAL CHANGE IS +4 OR -4 (THE SIGN GIVES THE DIRECTION) THEN THE POINT IS INSIDE THE POLYGON.
from random import random as rand  # rand() returns a float in [0, 1)

def PiP(pol, x, y):
    # pol is a closed list of (x, y) vertices (first vertex repeated at the end).
    q_count, quad_p = 0, -1
    quad = 0
    for ps in pol:
        dx = ps[0] - x
        dy = ps[1] - y
        # Shortcut from the original: a point sharing an x or y coordinate
        # with a vertex is reported as inside. Strictly, only
        # dx == 0 and dy == 0 (the point is a vertex) guarantees that.
        if dx == 0 or dy == 0:
            return True
        quad = quad_det(dx, dy)  # A simple function that determines
                                 # which quadrant the vector lies in.
        if quad_p >= 0:
            if (quad - quad_p) % 4 == 1:
                # Quadrant has shifted by +1 (counterclockwise)
                q_count += 1
            elif (quad - quad_p) % 4 == 3:
                # Quadrant has shifted by -1 (clockwise shift of 1)
                q_count -= 1
            elif (quad - quad_p) % 4 == 2:
                # Here we have to determine if the quadrant
                # changed in a clockwise or anticlockwise direction.
                det = False
                while det == False:
                    # Choose a point on the line joining the two vectors
                    r = rand()
                    p_x = dx_p + r * (dx - dx_p)
                    p_y = dy_p + r * (dy - dy_p)
                    quad_m = quad_det(p_x, p_y)
                    if (quad_m != quad) and (quad_m != quad_p):
                        if (quad_m - quad_p) % 4 == 1:
                            q_count += 2
                        else:
                            q_count -= 2
                        det = True
        dx_p = dx
        dy_p = dy
        quad_p = quad
    if abs(q_count) == 4:
        return True
    else:
        return False
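The listing relies on a helper quad_det that is not shown on the slide. The sketch below gives one possible definition together with a deterministic variant of the same quadrant-counting idea. It is not the author's code: the sign of a cross product replaces the random intermediate point when the quadrant jumps by two, and the helper's quadrant numbering (0 to 3, counterclockwise) is an assumption.

```python
def quad_det(dx, dy):
    """Quadrant index 0..3 (counterclockwise) of the vector (dx, dy).

    Each half-axis is assigned to exactly one quadrant. The zero vector
    is excluded by the caller.
    """
    if dx > 0 and dy >= 0:
        return 0
    if dx <= 0 and dy > 0:
        return 1
    if dx < 0 and dy <= 0:
        return 2
    return 3  # dx >= 0 and dy < 0

def point_in_polygon(pol, x, y):
    """True if (x, y) is inside or on the boundary of the simple polygon pol."""
    if any(px == x and py == y for px, py in pol):
        return True  # the point is a vertex
    q_count = 0
    n = len(pol)
    for i in range(n):
        # Vectors from the point to the two ends of edge i (wraps around).
        x0, y0 = pol[i][0] - x, pol[i][1] - y
        x1, y1 = pol[(i + 1) % n][0] - x, pol[(i + 1) % n][1] - y
        step = (quad_det(x1, y1) - quad_det(x0, y0)) % 4
        if step == 1:
            q_count += 1          # one quadrant counterclockwise
        elif step == 3:
            q_count -= 1          # one quadrant clockwise
        elif step == 2:
            cross = x0 * y1 - x1 * y0
            if cross == 0:
                return True       # the edge passes through the point
            q_count += 2 if cross > 0 else -2
    return abs(q_count) == 4

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
triangle = [(0, 0), (4, 0), (0, 4)]
print(point_in_polygon(square, 2, 2), point_in_polygon(square, 5, 2))
```

Because the edge loop wraps around with a modulo, the polygon does not need its first vertex repeated at the end.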
Figure: Heat Map for distribution of Stops across neighborhoods
TOP 10 NEIGHBORHOODS RANKED BY STOPS:
22 Midtown - Midtown South
17 SoHo-Tribeca-CivcCentr-LittleItaly
13 BatteryParkCity-LowerManhattan
12 HudsnYds-Chelsea-Flatirn-UnionSq
11 HuntersPt-Sunnyside-WstMaspeth
10 East New York (part A)
10 West Village
9 Bensonhurst West
9 Park Slope - Gowanus
Figure: Heat Map for distribution of stops across neighborhoods adjusted to area
TOP 10 NEIGHBORHOODS RANKED BY STOPS (ADJUSTED TO AREA)
13 BatteryParkCity-LowerManhattan
22 Midtown - Midtown South
17 SoHo-Tribeca-CivcCentr-LittleItaly
10 West Village
6 Fort Greene
12 HudsnYds-Chelsea-Flatirn-UnionSq
7 UpperEastSide - CarnegieHill
5 Central Harlem South
5 Chinatown
The Bigger Picture For Stops Distribution
Neighborhood stops and income inequality:
Figure: Compare this with the distribution of income inequality in New York City
Stop Clusters
k-MEANS ALGORITHM :
STEP 1: CHOOSE k CENTERS RANDOMLY ON THE 2D SURFACE
STEP 2: ASSIGN EVERY STOP TO ONE OF THE CENTERS BASED ON THE CRITERION OF MINIMUM DISTANCE
STEP 3: RELOCATE EVERY CENTER TO THE CENTROID OF THE STOPS ASSIGNED TO IT
STEP 4: REPEAT 2-3 UNTIL CONVERGENCE
A short code for k-means algorithm
# Excerpt. It assumes lons_x, lats_y (stop coordinates), x, y (centroid
# coordinates), stop_clus (cluster index per stop), clus_stop (list of stop
# indices per cluster), ct = 0 and conv = False are set up beforehand.
# dist(dx, dy) is taken to be the Euclidean length of the vector (dx, dy).
from math import hypot as dist

while conv == False:  # Loop until convergence
    ct += 1
    err = 0
    T_dst2 = 0   # Reset so it holds the sum of squared distances for this pass only
    conv = True  # Flag for convergence
    for j, pos in enumerate(zip(lons_x, lats_y)):  # Loop over each data point
        # A list with squared distances to the current cluster centroids.
        d_clus = [dist(pos[0] - x1, pos[1] - y1) ** 2 for x1, y1 in zip(x, y)]
        T_dst2 = T_dst2 + min(d_clus)
        re_clus = d_clus.index(min(d_clus))  # Index of the minimum-distance cluster
        if ct < 2:
            stop_clus[j] = re_clus
            clus_stop[stop_clus[j]].append(j)
            conv = False  # The first pass only sets the initial assignment.
        elif re_clus != stop_clus[j]:  # Re-assignment of nearest cluster.
            clus_stop[stop_clus[j]].remove(j)
            clus_stop[re_clus].append(j)
            stop_clus[j] = re_clus
            conv = False  # The minimal assignment has not been reached yet.
    for j, pos in enumerate(zip(x, y)):  # Loop over the cluster centroids
        d_x = [lons_x[k] for k in clus_stop[j]]
        d_y = [lats_y[k] for k in clus_stop[j]]
        if len(d_x) == 0:
            continue  # Leave an empty cluster's centroid where it is.
        x_n = sum(d_x) / len(d_x)  # Setting the new locations.
        y_n = sum(d_y) / len(d_y)
        x[j] = x_n
        y[j] = y_n
Representation of k-means clustering
Figure: Clusters = 5. The centroids are marked by a black star and the figure next to each is the number of stops attached to it.
Figure: Clusters = 6. The centroids are marked by a black star and the figure next to each is the number of stops attached to it. Notice that one of the cluster centroids is at the edge of Bay Ridge (Brooklyn).
Figure: Clusters = 7. The centroids are marked by a black star and the figure next to each is the number of stops attached to it. Notice that Staten Island and Far Rockaway are now independent clusters.
Significance of Clusters
- HOW DO WE DECIDE WHETHER THE NUMBER OF CLUSTERS IS OPTIMAL?
- IT IS A MATTER OF INDIVIDUAL DISCRETION. MY APPROACH HAS BEEN TO CONSIDER THE VARIABLE THIS ALGORITHM MINIMIZES, J: THE SUM OVER ALL POINTS OF THE SQUARED DISTANCE TO THE ASSIGNED CENTER.
- IF J DECREASES CONSIDERABLY WITH THE ADDITION OF AN EXTRA CLUSTER, THEN I TAKE IT THAT MORE CLUSTERS ARE NEEDED TO REPRESENT THE DATA CORRECTLY.
- WHEN J DOES NOT VARY MUCH ON ADDITION OF CLUSTERS, I STOP. (Of course, it will decrease monotonically until we reach as many clusters as there are points, when it is zero!)
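The stopping rule can be illustrated by computing J for increasing k on a small example. This is a sketch under assumptions, not the code used for the figures: it seeds the centers with a deterministic farthest-point rule instead of random positions so that the output is reproducible, and the nine sample points (three tight groups) are invented.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_J(points, k, max_iter=100):
    """Run k-means and return J, the sum of squared distances to assigned centers."""
    # Farthest-point seeding (an assumption; the slides choose centers randomly).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # Step 4: no assignment changed, so we have converged.
        labels = new_labels
        # Step 3: move each center to the centroid of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

# Invented sample: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
J = {k: kmeans_J(points, k) for k in (1, 2, 3, 4)}
print(J)
```

On this sample J falls sharply up to k = 3 and changes little afterwards, which is the point at which the rule above would stop adding clusters.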
PROBLEMS WITH THE DATA:
(A) INCONSISTENCIES:
E.g.: 34TH ST PENN STATION IS LISTED 3 TIMES IN THE RBS FILE; PATH TRAIN STOPS ARE INCLUDED IN ONE SET AND NOT ANOTHER.
(B) UNRELIABLE DATA:
E.g.: IN FILE turnstile-data, THERE ARE SEVERAL RECORDS UNDER THE CATEGORY OF 'DOOR OPEN', WHICH IS WHEN COMMUTERS CAN SKIP THE TURNSTILE ALTOGETHER.
ALSO, SOMETIMES THE TURNSTILES ARE SUDDENLY RESET (AT RANDOM TIMES!).
(C) SPECIAL AWARD FOR WORST DATA EVER:
NEW YORK CITY SHAPEFILES: THIS IS GROTESQUE!
THANK YOU