
Beyond Counting: New Perspectives on the Active IPv4 Address Space

Philipp Richter, TU Berlin

[email protected]

Georgios Smaragdakis, MIT

[email protected]

David Plonka, Akamai

[email protected]

Arthur Berger, Akamai / MIT

[email protected]

ABSTRACT

In this study, we report on techniques and analyses that enable us to capture Internet-wide activity at individual IP address-level granularity by relying on server logs of a large commercial content delivery network (CDN) that serves close to 3 trillion HTTP requests on a daily basis. Across the whole of 2015, these logs recorded client activity involving 1.2 billion unique IPv4 addresses, the highest ever measured, in agreement with recent estimates. Monthly client IPv4 address counts showed constant growth for years prior, but since 2014, the IPv4 count has stagnated while IPv6 counts have grown. Thus, it seems we have entered an era marked by increased complexity, one in which the sole enumeration of active IPv4 addresses is of little use to characterize recent growth of the Internet as a whole.

With this observation in mind, we consider new points of view in the study of global IPv4 address activity. First, our analysis shows significant churn in active IPv4 addresses: the set of active IPv4 addresses varies by as much as 25% over the course of a year. Second, by looking across the active addresses in a prefix, we are able to identify and attribute activity patterns to network restructurings, user behaviors, and, in particular, various address assignment practices. Third, by combining spatio-temporal measures of address utilization with measures of traffic volume and sampling-based estimates of relative host counts, we present novel perspectives on worldwide IPv4 address activity, including empirical observation of under-utilization in some areas, and complete utilization, or exhaustion, in others.

1. INTRODUCTION

The Internet continuously evolves as a result of the interaction among different stakeholders with diverse business strategies, operational practices, and access to network resources [8]. This evolution occurs in the context of the architecture of the Internet, which is based on a number of principles that guarantee basic connectivity and interoperability, including global addressing, realized with the Internet Protocol (IP). This has motivated a number of researchers to study address space utilization characteristics to assess the current state and expansion of the Internet [5, 9, 11, 26, 34].

Such assessment has become even more important recently, when the exhaustion of the readily available IPv4 address space puts increased pressure on both ISPs and policy makers around the world. The ISPs need to find new ways to accommodate the ongoing demand for IPv4 connectivity of their customers, e.g., by increasing the utilization efficiency of their respective address blocks. The policy makers need to establish regulatory guidelines for the emerging marketplace for IPv4 address space [28].

Recent studies that present Internet-wide statistics on IPv4 address space utilization have pushed the envelope in either measuring or estimating the total number of active IPv4 addresses [34] and address blocks [11] in the Internet by relying on a number of diverse data sources. However, the total number of active IPv4 addresses and blocks only partially captures address space utilization. Moreover, with the exhaustion in allocation of IPv4 addresses, the situation will likely be changing and will reflect the independent decisions of network operators, where the administration of the IP address space is under the control of the respective administrative domain (Autonomous System, of which about 51K can be found in the global routing table, currently). Varying resource demands and operational practices, as well as available supply of free and unused IP address space, blur the notion of an “active” IP address. Today, we face a situation in which individual addresses and address ranges vary in their periods of activity and in that activity’s nature and volume. For example, dynamic addressing, network reconfigurations, and users’ schedules dramatically affect periods of activity. When active, traffic characteristics and volumes range widely, from lightly used addresses and sparsely-populated blocks to individual addresses and blocks used by proxy gateways connecting thousands of devices to the Internet.

As a result, questions about the number of IP addresses active at a point of time, let alone their usage characteristics, are still difficult, if not impossible, to answer. It is even more difficult to comment on address space usage characteristics over time or on how to even choose the right time granularity to observe such activity. The lack of detailed measurement of address space activity is a major obstacle to our understanding of the current state of the IPv4 address space exhaustion. Tracking and understanding address space utilization on a detailed level is both important for ISPs, who need to make business-critical decisions such as how to adapt their address assignment practices to this situation, as well as for regulators, who have to rely on estimations and predictions when introducing new policies, which will ultimately affect what will be deployed in practice. A detailed understanding of address activity also serves as foundation for security-critical systems that rely on the notion of IP addresses, e.g., for client reputation [2, 16], as well as for systems that rely on active IP addresses to perform measurements, e.g., geolocation systems [10, 15, 19, 31] and network troubleshooting systems [14, 20, 26].

arXiv:1606.00360v1 [cs.NI] 1 Jun 2016

In this study, we provide an unprecedented, detailed, and longitudinal view of IPv4 address space activity, as seen through the lens of a large commercial CDN that serves almost 3 trillion requests per day. This unique vantage point enables us to measure Internet-wide IPv4 address activity at the granularity of individual IP addresses, over a period that spans one entire year. Our study provides a number of insights on the state and growth of the Internet in the face of increasing IPv4 scarcity.

Our main contributions can be summarized as follows:

(i) We find that, after years of constant linear growth, the total number of active IPv4 addresses has stagnated since 2014. Also, we find that state-of-the-art active measurement campaigns miss up to 40% of the hosts that contact the CDN.

(ii) We show that, despite the stagnation in active IPv4 addresses in 2014, the set of active addresses is far from constant. In fact, over the course of a year, more than 25% of the active IP address pool changes. Our analysis shows that most client networks contribute, with varying degrees, to this “address churn” and that this churn is hardly visible in the global routing table.

(iii) We identify a variety of address block activity patterns, and can attribute them to network restructuring, user behaviors, and, in particular, various address assignment practices. Based on our observations, we introduce metrics that allow us to quantify prevalent addressing practices at scale and comment on additional utilization potential within these (already active) address blocks.

(iv) We augment our address activity metrics with corresponding traffic volumes and relative host counts, which we derive from HTTP User-Agent samples, observing a trend of increasing traffic for addresses already heavily trafficked. Combining our three key metrics of address activity, we then derive Internet-wide demographics of the active IPv4 address space and discuss implications that our study has towards enhancement of current operational and measurement practices.

2. RETHINKING ADDRESS ACTIVITY

The study of the Internet’s growth has attracted the interest of the research community since its early days. One fundamental dimension of this growth is the utilization of the available address space. As initially envisioned, every device on the Internet needs a globally unique IP address to be part of the Internet. Thus, the number of active addresses is a natural metric to track growth of the Internet. Figure 1 shows the number of monthly total active IPv4 addresses, as seen by a large commercial CDN (footnote 1). For many years, an almost perfectly linear growth in terms of active IPv4 addresses was observed, conforming with our mental model of a steadily growing number of used IPv4 addresses and corresponding address blocks. The most compelling observation from this plot, however, is a sudden stagnation of the number of active addresses in 2014. This observation underlines a fundamental point in the history of the Internet: The growth of active IPv4 addresses has subsided.

[Figure 1: Unique active IPv4 addresses observed monthly by a large CDN. x-axis: date, 2008 to 2016 (ticks: January of each year); y-axis: unique IPv4 addresses (200M to 1B). Series: unique active IPv4 addresses per month; linear regression until 2014-01. Annotations: IANA, RIPE, ARIN, APNIC, and LACNIC exhaustion.]

Footnote 1: The values in Figure 1 are about 5% greater than those reported in Akamai’s State of the Internet Report [30], as the latter restricts to those addresses for which bandwidth is measured, which is also discussed in that report. As the present work is not concerned with bandwidth, we omit this condition.

IPv4 address space scarcity has recently come to the full attention of the research and operators community, as four out of the five Regional Internet Registries (RIRs) that manage global IP address assignments have exhausted their available IPv4 address space [28]. Figure 1 is annotated with the respective exhaustion dates for each RIR. The prospect of exhaustion fueled intense discussions about how to ensure unhindered growth of the Internet by introducing technical as well as political measures to satisfy the ongoing demand, until we reach sufficient IPv6 adoption [9]. A fundamental problem, however, is that even getting an accurate and detailed picture of the current state of IPv4 address space activity and how this activity evolves over time is difficult due to the Internet’s decentralized structure. Past studies (which we will discuss in detail in Section 3.1) typically relied on active or passive measurements to enumerate active addresses and blocks. Given that we have now entered a period of stagnation, we argue that a sole enumeration of active IPv4 addresses does not draw an accurate picture of address space utilization and will be of little help, be it as a basis for policy decisions or for network operators making business-critical choices on how to manage their address space. We pose the following questions:

Q1 How effectively and at what granularity (Autonomous System, Prefix, /24 equivalent, IP address) can activity of the IPv4 address space be measured? (Section 3)

Q2 At what timescales does activity manifest itself? What are the long- and short-term dynamics of IPv4 address space utilization? (Section 4)

Q3 Precisely, what operational practices contribute to these dynamics and which knobs could be adjusted to improve utilization? (Section 5)

Q4 What is the relationship between address space utilization, traffic volume, and the number of connected hosts? (Section 6)

Q5 Can we extract meaningful address space demographics when combining our various metrics of address space activity? (Section 7)

3. MEASURING ADDRESS ACTIVITY

In this section, we first introduce the various methods that have been used in the past to measure and capture IP address space activity. To this end, we discuss active and passive approaches used in related work. We then introduce our dataset, its collection methodology and its advantages. To assess the visibility of our dataset, we provide a comparison of our passive IP address activity logs with active probing.

3.1 Related Work

The most popular way of assessing IP address activity is by actively probing IP addresses (IPs), e.g., with ICMP queries. Heidemann et al. presented a survey of IP address activity by systematically probing a subset of 1% of the allocated IPv4 address space with ICMP ping requests as early as 2008 [17], which was followed by studies that also capture aspects of network management, e.g., diurnal activity patterns [5, 27] and Internet reliability [26]. Recent improvements in active scanning techniques were introduced by Durumeric et al. in ZMap [13], which enable scanning of the entire IPv4 address space within less than one hour or even in less than 5 minutes [1]: a milestone in Internet-wide active measurement. Note that a reply from an IP address does not necessarily indicate that this host is indeed active or even existent; tarpits, firewalls, and other middleboxes might send replies to probe traffic destined to other IP addresses, or even entire IP address ranges. Also, active measurements cannot capture activity at all timescales, as a reply might be dependent on many factors [27, 29]. For example, it is common that network administrators and home routers block ICMP traffic; thus, active measurements are not always successful in detecting active address blocks [12]. Advanced active measurement techniques that scan specific ports can also be utilized to increase the detection success of active IPs [13].

                              IP addresses      /24 blocks        ASes
Description                   total    avg.     total    avg.     total    avg.
Daily: 08/17/15 - 12/06/15    975M     655M     5.9M     5.1M     50.7K    47.9K
Weekly: Jan - Dec 2015        1.2B     790M     6.5M     5.3M     53.3K    47.8K

Table 1: Datasets: Totals and averages per snapshot.

Dainotti et al. [11, 12] used passive measurements of packet captures and network flow summaries recorded at various vantage points to infer address space activity at the /24 level, finding 4.5M active /24 blocks from passive measurements in 2013. They detect and remove spoofed traffic, which can otherwise lead to overestimation of address activity. To the best of our knowledge, only one related piece of work, by Zander et al. [34], estimates the number of active IPv4 addresses (in contrast to address blocks). Combining seven different passively captured datasets and two active datasets, they use a statistical capture/recapture model to account for invisible addresses and estimate the total number of active IPv4 addresses to be 1.2 billion as of 2014.

A number of studies proposed techniques to identify dynamically assigned IP addresses and uncover their dynamics. In [32], Xie et al. introduced a novel method, UDmap, that takes, as input, user-login traces (e-mail logins in their study) and identifies the dynamic IP addresses by associating the unique login information of each user with the set of IPs it utilizes. They concluded that IP dynamics exhibit a large variation across networks, ranging from hours to several days. Jin et al., in [18], proposed and evaluated a technique based on distinct traffic activity patterns of static and dynamic IP addresses, in part when encountering outside scanning traffic. Maier et al., in [21], used packet traces collected in a residential network and showed that residential addresses are regularly re-assigned by network operators. In a very recent measurement study, Moura et al., in [23], proposed an active ICMP-based method to scan the addresses of an ISP in search of blocks that rely on dynamic host configuration protocol (DHCP) to dynamically assign IPs to users and also to estimate DHCP churn rates. All the above mentioned works push the envelope in inferring dynamically assigned IP addresses, but either rely on user identification information, are an active measurement, or do not scale to the entire IPv4 address space. Plonka and Berger count active World-Wide Web (WWW) client addresses by passive measurement and develop temporal and spatial address classification methods [25]. Their work has similarities to ours, here, in its use of CDN server logs (in fact, the same logs we utilize) and in its spatio-temporal approach, but differs in that they study only IPv6 addresses.

3.2 The CDN as an Observatory

The foundation for this study is a set of server logs from one of the world’s largest CDNs. In the year 2015, the CDN operated more than 200,000 servers in 120 countries and 1,450 networks, serving content to end-users worldwide. Each time a client fetches a Web object from a CDN edge server, a log entry is created, which is then processed and aggregated through a distributed data collection framework. After processing, we have access to the exact number of requests (“hits”) issued by each single IP address. In this work, we rely on two datasets, which are shown in Table 1. For the year-long dataset, we have weekly aggregates of all IP addresses and for the daily dataset, we cover a period of 4 months. In the following, we refer to an IP address as active if the CDN saw requests from that IP address in the given time interval. Correspondingly, we refer to an IP address as inactive if there was not a request. Here, requests refer to successful WWW transactions, i.e., an IP address will only be associated with a request if the client initiated a successful TCP and HTTP/HTTPS connection and successfully fetched an object. Address activity is therefore directly evident from our log dataset, which is a major advantage compared to other passive measurements. The second advantage of our dataset is its granularity, in both space and time. The logs contain numbers of requests on a per-IP level, providing a detailed picture of address activity.

[Figure 2: Visibility into the IPv4 address space of the CDN compared with active measurements (Oct. 2015). (a) Visibility of IPv4 addresses, address blocks and networks: stacked fractions (0.0 to 1.0) for ASes (N=51k), BGP prefixes (N=460k), /24s (N=6m), and IPs (N=950m); categories: CDN only, CDN & ICMP, ICMP only. (b) Classification of elements in the red bars of Figure 2(a): stacked fractions (0.0 to 1.0) for ASes (N=2k), BGP prefixes (N=55k), /24s (N=495k), and IPs (N=77m); categories: server, server/router, router, unknown.]
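The definition of an active address can be illustrated with a minimal sketch. The log-entry layout used here, a tuple of `(ip, day, ok)`, and the function names are assumptions made for illustration; the CDN's actual log format and processing pipeline are not described in this paper.

```python
from datetime import date


def active_addresses(log_entries, start, end):
    """Return the set of IPs with at least one successful request in [start, end].

    log_entries: iterable of (ip, day, ok) tuples (hypothetical layout), where
    `ok` is True only if the client completed a successful transaction.
    """
    return {ip for ip, day, ok in log_entries if ok and start <= day <= end}


def is_active(ip, log_entries, start, end):
    """An address is active in an interval if the set above contains it."""
    return ip in active_addresses(log_entries, start, end)


# Illustrative entries using documentation addresses (RFC 5737).
entries = [
    ("192.0.2.1", date(2015, 8, 17), True),
    ("192.0.2.2", date(2015, 8, 17), False),  # failed fetch: does not count
    ("192.0.2.3", date(2015, 9, 1), True),    # outside the interval below
]
week = (date(2015, 8, 17), date(2015, 8, 23))
print(active_addresses(entries, *week))
```

Under this reading, an address whose only requests failed is treated as inactive, which mirrors the restriction to successful WWW transactions stated above.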

To assess the view from our vantage point, next, we provide a comparison of visible addresses from the CDN, as compared to the portion of the Internet which replies to ICMP queries. For this, we use the aggregated counts of CDN-observed active IP addresses and compare them to the union of all IP addresses that were seen in 8 ICMP scans, which we derive from the ZMap project [13] (footnote 2). Figure 2(a) shows this comparison where the green bars are entities seen by the CDN but not by ZMap, the blue bars are entities seen by both the CDN and ZMap, and the red bars are entities seen only by ZMap. As illustrated in Figure 2(a), over 40% of the 950 million IPv4 addresses show activity in the CDN logs, but do not appear to be active from ICMP probes. The main reason for this difference can likely be attributed to hosts that sit behind NAT gateways and/or firewalls that do not reply to external ICMP requests or hosts that are only responsive for a short period of time. While this pitfall is well-known [12, 14], we are not aware of any prior studies that quantify this effect at large scale. This incongruity is less pronounced when aggregating the address space to /24 prefixes and ASes (footnote 3). For routed prefixes and ASes, the number of (in)visible units is comparable for both methods, with ICMP outnumbering the CDN for the case of prefixes. Thus, when measuring address space activity on a per-prefix or even per-AS level, active measurements provide significant coverage. On the per-IP level, however, active measurements are insufficient. We acknowledge that there is a bias in favor of the CDN logs with respect to WWW clients since we compare an entire month’s worth of CDN logs against 8 snapshots of ICMP scans, which will naturally not capture hosts that are active for only a short period of time.

Footnote 2: We chose to show the comparison for October 2015 because the largest number of ICMP scans is available for this month.
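The three-way split behind Figure 2(a) can be sketched as plain set arithmetic. This is an illustration only: the function names and the sample addresses are hypothetical, and the paper does not state how the comparison was implemented. The optional `aggregate` argument maps each address to a coarser unit (here a /24), and a unit counts as seen if at least one of its addresses was seen.

```python
import ipaddress


def to_slash24(ip):
    """Map a dotted-quad IPv4 address to its enclosing /24 network."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def visibility(cdn_ips, icmp_ips, aggregate=None):
    """Fractions of units seen only by the CDN, by both, and only by ICMP."""
    agg = aggregate if aggregate is not None else (lambda ip: ip)
    cdn = {agg(ip) for ip in cdn_ips}
    icmp = {agg(ip) for ip in icmp_ips}
    total = len(cdn | icmp)
    if total == 0:
        return {"cdn_only": 0.0, "both": 0.0, "icmp_only": 0.0}
    return {
        "cdn_only": len(cdn - icmp) / total,
        "both": len(cdn & icmp) / total,
        "icmp_only": len(icmp - cdn) / total,
    }


# Illustrative inputs (documentation address ranges, not measurement data).
cdn_seen = {"198.51.100.1", "198.51.100.2", "203.0.113.5"}
icmp_seen = {"198.51.100.2", "198.51.100.9"}
print(visibility(cdn_seen, icmp_seen))              # per-IP view
print(visibility(cdn_seen, icmp_seen, to_slash24))  # per-/24 view
```

In this toy input the two methods disagree on three of four addresses but on only one of two /24 blocks, which is the same qualitative effect the text describes when aggregating to prefixes.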

3.3 Other Activity

Although much WWW content is hosted on the CDN platform and all successful connections are reported, our dataset has at least two limitations: (i) the platform typically does not receive requests from Internet “infrastructure” such as routers and servers (though some routers and servers do obtain software updates from the WWW, and some servers obtain content from the WWW to complete requests from their clients) and (ii) an IP may be assigned to a user who did not interact with the CDN platform. To assess these, we next examine the portion of IPs that do reply to an ICMP request, but are not present in our CDN dataset, i.e., the red bars on the right in Figure 2(a). While this is roughly only 8% of the IPv4 addresses in the combined CDN/ICMP dataset, we are interested in examining them further.

Figure 2(b) shows a classification of these IP addresses, prefixes, and ASes. Here, we use additional data to identify servers and router infrastructure. To identify servers, we rely on additional data from the ZMap project: IP addresses that replied to server connection requests using HTTP(S), SMTP, IMAP(S) or POP3(S). To identify router IP addresses, we use one month’s worth of the Ark [6] dataset and extract all router IP addresses that appeared on any of the traceroutes (N=490M), i.e., they replied with an ICMP TTL Exceeded error. Close to half of the addresses that did not connect to the CDN, indeed, can be attributed to server or infrastructure IP addresses. These fractions increase when aggregating to prefixes and ASes. We also note, however, that about half of these IP addresses did not show any server or infrastructure activity. These IP addresses might be (a) serving infrastructure that is not present in the Ark dataset, or is running other protocols than those probed by ZMap, (b) practically unused IP addresses, or (c) active IP addresses that simply do not connect to the CDN.

Footnote 3: Here, we count a prefix/AS as active if we see activity from at least one IP address within the respective prefix/AS.

[Figure 3: Activity and churn in active IPv4 addresses. (a) Daily active IPv4 addresses and up/down events. x-axis: days from 2015-08-17 to 2015-12-06 (0 to 112); y-axis: unique IPv4 addresses (0 to 600M); series: active IPv4 addresses, up events, down events. (b) Median up/down events between subsequent snapshots for different aggregation windows (union of active IPs within window). x-axis: aggregation window [days] (1 to 28); y-axis: % of active IPs per snapshot (0 to 15); series: up events [min, median, max], down events [min, median, max]. (c) Difference in active IPv4 addresses compared to first snapshot of our period (weekly). x-axis: time lag from 2015-01-01 (1 week to 52 weeks); y-axis: change in active IPv4 addresses (-200M to 200M, equivalently -25% to 25%); regions: appear, disappear.]

4. A MACROSCOPIC VIEW OF ACTIVITY

To bootstrap our analysis, we study IP address activity on a broad scale in this section. In particular, we focus on how many IP addresses our vantage point observes as well as how consistent the set of active IP addresses is over time. Then, we focus on spatial properties of the observed dynamics and compare our observations with what is visible from the global routing table.

4.1 Volatility of Address Activity

To assess IPv4 address activity over time, we show in Figure 3(a) the daily number of unique IPv4 addresses that contact the CDN over the course of 16 weeks. We observe about 650M unique active IPv4 addresses on a daily basis, and less pronounced usage on weekend days. Although Figure 3(a) shows a relatively constant number of active IPv4 addresses, the set of addresses can vary. To capture changes in the population of active addresses, we say that an up event occurs if an address is not seen in a given window of time, e.g., a day or 7 days, but then is seen in the subsequent window. Likewise, a down event occurs if an address is seen in a given window of time, but not seen in the subsequent window. Figure 3(a) shows an average of 55M daily up events, and likewise for down events. Hence, each day we see 55M addresses showing activity that were not active the day before. Another 55M addresses are active that day, but not on the next day.
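The up/down event definition reduces to two set differences between consecutive snapshots. The following is a minimal sketch under that reading; the function name is ours and the sample addresses are illustrative.

```python
def up_down_events(prev_active, cur_active):
    """Return (up, down) address sets between two consecutive windows.

    up:   active in the current window but not in the previous one
    down: active in the previous window but not in the current one
    """
    prev_active, cur_active = set(prev_active), set(cur_active)
    return cur_active - prev_active, prev_active - cur_active


monday = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
tuesday = {"192.0.2.2", "192.0.2.3", "192.0.2.4"}
up, down = up_down_events(monday, tuesday)
print(sorted(up), sorted(down))
```

Note that the number of active addresses is the same on both days in this example even though one address came and one went, which is the situation the text describes for Figure 3(a).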

We next assess whether this churn appears only on short timescales (i.e., due to short-term inactivity of certain IP addresses) and disappears on longer timescales, e.g., when comparing subsequent weeks to each other as opposed to days. In Figure 3(b) we partition the 112 days of Figure 3(a) into non-overlapping windows of a given size. For a window size of 7 days, for example, there would be 16 windows, or snapshots. In each window, we note the union of all active IP addresses. Then for windows i and i+1 we compute the percentage of addresses that had an up event as 100 * (the number of addresses in window i+1 that are not present in window i) divided by the number of addresses in window i+1. Hence, for a window size of 7 days, we obtain 15 such percentages. We then note the minimum, median, and maximum of these percentages. We do the analogous computation for down events. In Figure 3(b) the two red and green points at x = 1 on the x-axis show the min, median, and max of the percentage of addresses that had up/down events on a daily basis, corresponding to Figure 3(a). On an average day, about 8% of the active addresses “come”, another 8% “go”. We see that the maximum values for up/down events are as high as 14%, reflecting changes from weekdays to weekends and vice versa. The red/green dots at x = 7 show these statistics when we aggregate our dataset into weeks and compare subsequent weeks. The stunning observation from this figure is that, while churn is more apparent on short timescales (particularly for window sizes of 1 and 2 days, related to day-of-the-week effects), the dynamics in up/down events do not decay to zero for higher aggregates. Indeed, we observe that the churn level for aggregates larger than 7 days remains constant at roughly 5%. Thus, whichever aggregation level we choose (days, weeks, months), the set of active IP addresses is in constant change, both on short, as well as on long time scales.
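The windowed computation behind Figure 3(b) can be sketched as follows. The up-event percentage follows the formula given in the text. The down-event denominator (the size of window i) is our assumption, since the text only calls the computation "analogous". Function names and the toy input are ours.

```python
from statistics import median


def window_unions(daily_sets, window):
    """Partition daily active sets into non-overlapping windows and union each.

    Trailing days that do not fill a complete window are dropped.
    """
    count = len(daily_sets) // window
    return [set().union(*daily_sets[i * window:(i + 1) * window])
            for i in range(count)]


def churn_stats(daily_sets, window):
    """Return (min, median, max) of up and down percentages between windows."""
    snaps = window_unions(daily_sets, window)
    up, down = [], []
    for prev, cur in zip(snaps, snaps[1:]):
        if not prev or not cur:
            continue  # skip empty snapshots to avoid dividing by zero
        up.append(100.0 * len(cur - prev) / len(cur))
        # Assumption: down events are normalized by the earlier window's size.
        down.append(100.0 * len(prev - cur) / len(prev))
    if not up:
        raise ValueError("need at least two non-empty snapshots")
    return {"up": (min(up), median(up), max(up)),
            "down": (min(down), median(down), max(down))}


# Toy input: four "days" of active addresses, written as small integers.
days = [{1, 2}, {2, 3}, {3, 4}, {4, 5}]
print(churn_stats(days, 1))
print(churn_stats(days, 2))
```

With 112 daily sets and a window of 7, `window_unions` yields 16 snapshots and `churn_stats` compares 15 consecutive pairs, matching the counts given in the text.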

To highlight the long-term effects, we show in Figure 3(c) the number of weekly newly appearing and disappearing IP addresses as compared to the first week of 2015. That is, for each week in 2015 (x-axis), we show the number of addresses that were not active in the first week but were active in the given week (positive y-axis, appear), and also the number of IP addresses that were active in the first week but not in the given week (negative y-axis, disappear). In fact, the set of active addresses has changed by as much as 25% over the course of 2015. The sum of the addresses that disappeared (were not present in the last week anymore) and those that appeared (were not present in the first week, but in the last week) is almost 50% of all addresses seen in the first week.
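The baseline comparison of Figure 3(c) differs from the consecutive-window churn in that every week is compared against the same first snapshot. A minimal sketch follows; expressing the counts as a percentage of the first week's size is our assumption, based on the "almost 50% of all addresses seen in the first week" phrasing, and the function name is ours.

```python
def drift_from_first(weekly_sets):
    """For each week, count addresses that appeared or disappeared
    relative to the first week, as absolute numbers and as percentages
    of the first week's size (assumed normalization)."""
    first = set(weekly_sets[0])
    if not first:
        raise ValueError("first snapshot must not be empty")
    rows = []
    for week in weekly_sets:
        week = set(week)
        appeared = len(week - first)
        disappeared = len(first - week)
        rows.append((appeared, disappeared,
                     100.0 * appeared / len(first),
                     100.0 * disappeared / len(first)))
    return rows


# Toy input: three "weeks" of active addresses, written as small integers.
weeks = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}]
for row in drift_from_first(weeks):
    print(row)
```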

4.2 Dissecting Address Volatility

Having seen that the active portion of the IPv4 address space is highly volatile in nature, we next study some macroscopic features of the observed dynamics. In particular, we study (i) whether networks contribute similar levels of churn, (ii) the size of up/down events in terms of prefixes, and (iii) whether this churn is also reflected in the global routing table.

[Figure 4: Address churn properties. (a) CDF: Median % of up events per AS and snapshot (only ASes with > 1000 active IPs, N = 8.6K). x-axis: median % of IPs with up event (0.1 to 100.0); y-axis: CDF, per-AS median % up events < X (0.0 to 1.0); series: 1-day, 7-day, and 28-day windows. (b) Size distribution of up events for different time ranges. x-axis: event size [prefix notation] (>= /16, /20, /24, /28, /32); y-axis: fraction of up events (0.0 to 0.5); series: 1-day, 7-day, and 28-day windows. (c) % of up/down events that go together with a change in BGP for different aggregation windows. x-axis: aggregation window size (1 day, 7 days, 28 days); y-axis: % events correlated with BGP change (0.0 to 2.5); series: up events, down events, active (no change).]

A network view of churn: In Figure 4(a) we show per Autonomous System (AS) the median percentage of IP addresses with an up event for each snapshot. That is, we partition the set of addresses into ASes, repeat the calculation of Figure 3(b) for the addresses in each AS, and obtain a median percentage for each AS. Figure 4(a) shows the CDF of these medians. We only consider ASes for which we saw at least 1K active IP addresses during our observation period and we only show up events; the CDF for down events is similar. The takeaway from this figure is that high dynamics of IP address activity is not a phenomenon limited to a small number of ASes; rather, about 10% to 20% (depending on window size) of the ASes have a 10% or higher median percentage of IPs with an up event. About half of the ASes have a churn rate below 5%, the other half a higher one. We observe this for different aggregation windows, with a slight decrease in volatility for some ASes on higher aggregation levels. Thus, churn is a ubiquitous phenomenon, which we observe for a large number of networks.

A prefix view of churn: So far, we have considered up and down events on a per-address basis (for different time window sizes). Next, we are interested in whether up and down events really only affect single addresses, or rather entire address ranges. In particular, we are interested in entire prefixes that have been inactive and then some or all of the addresses become active, which we expect would likely indicate network operator actions as opposed to independent, individual user behavior.

To accomplish this, we find for each per-address up event the smallest prefix mask m (where a smaller mask corresponds to a prefix that contains more addresses), in which all addresses either had an up event or showed no activity in both snapshots. Figure 4(b) shows a histogram of the fraction of per-address up events, for a given window size, where we assign each up event to its tagged prefix mask m (the histogram for down events looks similar). For example, for a window size of 1 day, more than 70% of the per-address up events are associated only with a mask ≥ /31, indicating that these dynamics typically only affect individual IP addresses.
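One way to compute the mask m for each up event is sketched below. It rests on an observation that is ours, not the paper's: a prefix in which every address either had an up event or was inactive in both snapshots is exactly a prefix that contains no address active in the earlier snapshot. The smallest such mask is therefore one bit longer than the longest prefix the address shares with any earlier-active address, and that longest match is always found at one of the two neighbors in sorted order. All function names are hypothetical.

```python
import bisect
import ipaddress


def common_prefix_len(a, b):
    """Number of leading bits shared by two 32-bit integers."""
    return 32 - (a ^ b).bit_length()


def event_mask(addr, prev_sorted):
    """Smallest mask m whose prefix around addr holds no earlier-active address."""
    i = bisect.bisect_left(prev_sorted, addr)
    longest = -1  # with no earlier-active addresses at all, the result is /0
    for j in (i - 1, i):  # predecessor and successor in sorted order
        if 0 <= j < len(prev_sorted):
            longest = max(longest, common_prefix_len(addr, prev_sorted[j]))
    return longest + 1


def up_event_masks(prev_active, cur_active):
    """Map each address with an up event to its tagged prefix mask."""
    prev_sorted = sorted(int(ipaddress.IPv4Address(ip)) for ip in prev_active)
    prev_set = set(prev_sorted)
    masks = {}
    for ip in cur_active:
        addr = int(ipaddress.IPv4Address(ip))
        if addr not in prev_set:  # up event: active now, not before
            masks[ip] = event_mask(addr, prev_sorted)
    return masks


before = ["10.0.0.1"]
after = ["10.0.0.1", "10.0.0.0", "10.0.1.5", "192.0.2.1"]
print(up_event_masks(before, after))
```

Here `10.0.0.0` is tagged /32 because its /31 sibling `10.0.0.1` was already active, while `10.0.1.5` is tagged /24 because `10.0.1.0/24` held no earlier-active address but the enclosing /23 did. A histogram over the returned masks would correspond to one series in Figure 4(b).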

For larger aggregates (e.g., 28 days), we still see almost 50% of the up events in the ≥ /31 range; however, we also observe some up events spanning larger ranges of addresses, with more than 38% of month-to-month up events affecting larger address blocks with a mask ≤ /24. Thus, a key observation when studying churn across different time aggregates is that a significant proportion of long-term events (38% on a month-to-month aggregation) affect entire prefixes with masks ≤ /24, some of them as large as an entire /16 prefix. These “bulky” events hint towards changes in address assignment practice (e.g., network restructurings), as opposed to churn caused by individual ON/OFF activity of a single IP address. While this is an expected property and holds for some portion of the month-to-month churn, we also notice that it certainly does not hold for all events on larger timescales. In fact, even on a month-to-month scale, more than 36% of the events only affect prefixes of size /31 or even /32, i.e., single IP addresses.

A routing table view of churn: Given that the active IP address population changes by about 25% over the course of a year (per Figure 3(c)), an appropriate question is whether these dynamics are also reflected in the global Border Gateway Protocol (BGP) routing table. To assess this, we next associate each IP address with its origin AS using daily snapshots of the global routing table (footnote 4). We show in Figure 4(c) the fraction of up/down events that go together with a BGP change. Here, we consider route announcements, withdrawals, and origin AS changes as “BGP change” events. The green bars show the percentage of up events that go together with a BGP change, and the red bars show the percentage of down events. In addition, we also plot the fraction of steadily active (no up/down event) IP addresses and for what fraction of them we observe changes in the routing table. While we can clearly see that (i) IP addresses with up/down events are much more likely to correlate with events in the routing table when compared to steadily active addresses and (ii) that on higher aggregation levels, up/down events are more likely to correlate with BGP changes, reflecting network changes, we see that (iii) only a tiny minority of these events are visible in the global routing table (less than 2.5% for monthly aggregation levels). Thus, the vast majority of volatility in IP address activity is entirely hidden from the global routing table.

Footnote 4: We rely on daily snapshots from a RouteViews collector in AS6539. For larger window sizes, we determine the origin AS for a given IP address using a majority vote of all contained daily IP-to-AS mappings.

                               appear    disappear
total                          139M      129M
entire /24 prefix affected     65%       54%

BGP no change                  87.1%     90.4%
BGP origin change              3.3%      7.1%
BGP announce/withdraw          9.6%      2.5%

Table 2: IP addresses that appeared/disappeared comparing Jan/Feb 2015 and Nov/Dec 2015, percentage of those IP addresses where the entire containing /24 prefix appeared/disappeared, and corresponding BGP changes.
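The majority vote over daily IP-to-AS mappings and the three BGP outcomes reported in Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function names `origin_by_majority` and `classify_bgp_change` are hypothetical, the input is assumed to be one origin ASN per day with `None` for days on which no covering prefix was announced, and ties are resolved in favor of the value seen first, a detail the text does not specify.

```python
from collections import Counter
from typing import Optional, Sequence


def origin_by_majority(daily_origins: Sequence[Optional[int]]) -> Optional[int]:
    """Return the most frequent origin AS in a window (None means unrouted)."""
    if not daily_origins:
        return None
    # most_common(1) yields [(value, count)]; equal counts keep first-seen order.
    return Counter(daily_origins).most_common(1)[0][0]


def classify_bgp_change(before: Optional[int], after: Optional[int]) -> str:
    """Map the origin AS before and after an up/down event to a BGP outcome."""
    if before == after:
        return "no change"
    if before is None or after is None:
        # The address gained or lost a covering route between the windows.
        return "announce/withdraw"
    return "origin change"
```

For example, an address that was unrouted in one window and originated by some AS in the next would be counted under "announce/withdraw".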

4.3 Volatility During One Year

Next, we study in particular those IP addresses that were first inactive for a long period and then became active, as well as IP addresses that showed activity but then went inactive. For this, we pick the first two months of our observation period (January and February 2015) as well as the last two months (November and December 2015), and for each two-month snapshot we take the union of all IP addresses that were active within it. We then compare the two snapshots, and also the associated BGP activity. Table 2 summarizes our results. Continuing the trend shown in Figure 4(b) that churn becomes bulkier on longer time scales, we observe that more than half of the events, 65% and 54% respectively, affected entire address blocks, and are thus more likely to be caused by operational changes. However, another large chunk of long-term volatility affects smaller aggregates, down to single IP addresses. The main result of Figure 4(c), that only a small minority of these events coincide with BGP changes, also holds for the year-long time scale of Table 2. In fact, most of these IP addresses were, and still are, routed by the same AS.
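The snapshot comparison described above amounts to two set differences plus a per-/24 check. The following is a minimal sketch, assuming each snapshot is a set of dotted-quad address strings; `compare_snapshots` is a hypothetical helper, and "entire /24 affected" is interpreted here as "the containing /24 had no active address at all in the other snapshot", which is our reading of Table 2 rather than a definition given in the text.

```python
import ipaddress


def slash24(ip: str) -> ipaddress.IPv4Network:
    """Return the /24 network that contains the given IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def compare_snapshots(first: set, last: set):
    """Split addresses into appeared/disappeared and flag whole-/24 events."""
    appeared = last - first
    disappeared = first - last
    first_blocks = {slash24(ip) for ip in first}
    last_blocks = {slash24(ip) for ip in last}
    # An address counts as part of a "bulky" event if its /24 was entirely
    # inactive in the other snapshot (assumed interpretation).
    appeared_bulk = {ip for ip in appeared if slash24(ip) not in first_blocks}
    disappeared_bulk = {ip for ip in disappeared if slash24(ip) not in last_blocks}
    return appeared, disappeared, appeared_bulk, disappeared_bulk
```

The ratios `len(appeared_bulk) / len(appeared)` and `len(disappeared_bulk) / len(disappeared)` would then correspond to the "entire /24 prefix affected" row of Table 2.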

More than 30K ASes announce IP addresses that show long-term volatility in our dataset without any change in BGP configuration. The top 10 ASes in terms of IP addresses in the appear/disappear class contribute about 30% of the total addresses in each class. These top 10 ASes include major ISPs connecting both residential and cellular users. In fact, we find that ASes contributing the most IP addresses to the appear class are also those ASes contributing the most addresses to the disappear class. Focusing on our two sets of top 10 ASes, we find that 7 of those contributing to the appear class are also among the top 10 contributing to the disappear class. Thus, while these ASes contribute a large number of IP addresses with high volatility, their total number of active IP addresses varied only marginally, on the order of a few percent. Hence, we can attribute the majority of long-term volatility to AS-internal dynamics, as opposed to networks entering the market or going out of business.

5. A MICROSCOPIC VIEW OF ACTIVITY

Given observations of churn in the active IPv4 address space, we now drill down into their root causes. The way that IP addresses are allocated and used across network operators is not uniform. Many factors contribute to how a network operator assigns IPs to client hosts, e.g., address pool size, client population, type of clients (enterprise or residential), privacy considerations (residential users may have their address lease expire after a maximum duration), or other operational practices (static/dynamic address assignment). Thus, it is challenging to characterize the IP assignment strategies within a single network, let alone an entire address space.

To offer a glimpse of how activity typically manifests in different areas of the IPv4 address space, in Figure 5, we show examples of activity patterns in four address blocks. Here, we examine specific /24 prefixes, as this allows us to present a spatio-temporal view of activity in address-level detail. To generate these plots, we rely on our 4 months’ worth of daily IP address activity (x-axis). We then align all IP addresses within the selected /24 on the y-axis in increasing order. Having this “activity matrix” in place, we plot a red point for each day on which a given address was active. With these examples in mind, we introduce two root causes for churn in address activity:

Regular activity patterns: Address assignment practice. The four examples in Figure 5 show striking differences in daily address activity. While we see a non-uniform, light utilization in Figure 5(a), with a day-of-week pattern for a few active addresses, we see heavier utilization in Figures 5(b), 5(c), and 5(d), with a variety of activity patterns involving dynamic assignment from address pools. While Figure 5(b) shows a round-robin IP address assignment in an underutilized pool, Figure 5(c) shows dynamic addressing with a very long lease time (i.e., the duration for which a specific subscriber holds an IP address), with some IP addresses having almost continuous activity and others having infrequent activity. Figure 5(d) shows another mode of dynamic addressing, wherein the ISP’s lease time is set to a maximum of 24 hours, thus causing hosts to be frequently reassigned a different IP address. We refer to the activity patterns in Figure 5 as in situ activity, as they result from address assignment practice and its interplay with end-user behavior in one administratively-configured situation; that is, we have no evidence that the situation, nor the activity pattern, changed


[Figure 5 panels: IP address activity within a /24 (y-axis, addresses .0 to .255) over time (x-axis, months 0 to 4).]
(a) Statically assigned address block (German University, FD=29, STU=0.04).
(b) Dynamically assigned address block (US University, FD=254, STU=0.18).
(c) Dynamically assigned address block with residential users (US ISP, FD=175, STU=0.26).
(d) Dynamically assigned address block with residential users (German ISP, FD=254, STU=0.75).

Figure 5: Regular activity patterns: Interplay between address assignment practice and user behavior.

[Figure 6 panels: IP address activity within a /24 (y-axis, addresses .0 to .255) over time (x-axis, months 0 to 4).]
(a) German University, FD=256, STU=0.32.
(b) German University, FD=187, STU=0.38.

Figure 6: Modified assignment practice.

due to network reconfiguration. A key observation is that in situ activity varies significantly among address blocks that have different address assignment configurations.

Changed patterns: Modification of assignment practice. As shown in Figure 6, we also observe activity patterns that are temporally or spatially inconsistent. This is some evidence that the patterns’ dynamics are not the result of a constant address assignment policy, but rather the result of address (a) reallocation, (b) assignment reconfiguration, and/or (c) repurposing.

We next study address activity patterns at large scale. In particular, we are first interested in detecting which portions of the address space show a consistent address assignment pattern, as opposed to blocks that show major changes in their activity pattern. We then dive into the former, i.e., activity patterns that are the result of address assignment practice in conjunction with end-user behavior. Here, we put a particular emphasis on the resulting utilization of address blocks.

5.1 Block Activity Metrics

In order to comprehensively characterize IP address activity, it is imperative to use metrics that capture the activity spatially, i.e., over the IP address space of an address block, and temporally, i.e., across time. We experimented with several metrics to capture the range of activity patterns we observed and found two to be particularly useful:

IP address filling degree: this metric captures the number of active IPs within an address block within a window of time. Admittedly, there is no single address block size that is ideal, but we chose a /24 Classless Inter-Domain Routing (CIDR) prefix, i.e., the smallest globally routable entity. This is a compromise, since we recognize both that smaller prefixes are sometimes more appropriate, as in Figure 6(b), and that larger prefixes sometimes exhibit uniform patterns of activity. Values of this metric range from 1 to 256. We will later see that this metric is particularly helpful in separating static from dynamic addressing mechanisms.

Spatio-temporal utilization: this metric captures the aggregate activity of active IPs over time. We define utilization as the ratio of the observed spatio-temporal activity to the maximum possible spatio-temporal activity, for a given block and window of observation (time). Relying on our four months’ worth (112 days) of daily activity data, the spatio-temporal activity can range from 1, where one single IP address was active for one day, up to 112 × 256 = 28672, where all addresses in a block were active every day, which would be the maximum spatio-temporal activity.

Figures 5 and 6 are annotated with their respective values for filling degree (FD) and spatio-temporal utilization (STU). In these examples, the filling degrees vary from values as low as 29 to as high as 256. The spatio-temporal utilization varies from 0.04 up to 0.75.
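Both metrics follow directly from the per-block activity matrix. The sketch below is illustrative only and assumes a hypothetical representation in which each /24 is a dictionary mapping a host index (0 to 255) to the set of day indices on which that address was active; the function names are ours.

```python
def filling_degree(activity: dict) -> int:
    """Number of addresses in the block that were active on at least one day."""
    return sum(1 for days in activity.values() if days)


def spatio_temporal_utilization(activity: dict,
                                num_days: int = 112,
                                block_size: int = 256) -> float:
    """Observed address-days divided by the maximum possible address-days."""
    active_address_days = sum(len(days) for days in activity.values())
    return active_address_days / (num_days * block_size)
```

With the defaults, the denominator is 112 × 256 = 28672, so a block in which a single address was active on a single day has a utilization of 1/28672, and a block in which every address was active every day has a utilization of 1.0.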

5.2 Detecting Change

As a first-order partitioning of the active IPv4 address space, we are interested in identifying address blocks with a significant change in address assignment practice during our observation interval. Per Section 4.2, we know that some portion of address churn on longer timescales affects larger address ranges (“bulky changes”) than do short-term changes. To quantify changes in address assignment, we rely on our spatio-temporal utilization metric. In particular, Figure 7(a) shows the maximum change in spatio-temporal utilization on a month-to-month basis for each active /24 block. Here, we observe that the majority (90%) of the /24 blocks cluster around the origin, i.e., they do not show a major change in


[Figure 7 panels:]
(a) Maximum monthly change in spatio-temporal utilization per /24 block (x-axis: max. diff in monthly spatio-temporal utilization, -1.0 to 1.0; y-axis: CDF of active /24 blocks). We select 90% of the blocks in the stable region.
(b) Filling degree of active /24 blocks (x-axis: active IP addresses within /24, 1 to 256; y-axis: CDF of active /24 blocks; curves: static, all, dynamic), where we dissect some identifiable blocks to be static or dynamic using reverse DNS.
(c) Spatio-temporal utilization within /24 blocks with >250 active IP addresses (x-axis: % of max possible spatio-temporal utilization, 0 to 100; y-axis: active /24 blocks, 0 to 120K).

Figure 7: Spatio-temporal aggregate views of IP address activity.

their utilization. Another 10% of the active address blocks, on the other hand, are located closer to the tails of the CDF; these are blocks for which we observe significant changes in address activity.

To separate address blocks into major change and minor change blocks, we set a threshold at X = ±0.25. We decided to use this value, as it retains cases of heavy in situ change, e.g., Figure 5(b), but excludes cases of major configuration change (Figure 6). Based on this threshold, we find that as many as 9.8% of the active /24 blocks show major change in their address activity within our four-month period, while 90.2% of the blocks show no more than minor change. Thus, we separate blocks that likely underwent reallocation or change in address assignment practice (major change, Figure 6) from those that did not (Figure 5; see footnote 5).
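The partitioning into major and minor change blocks can be sketched as below. This is an illustration under explicit assumptions, not the authors' implementation: months are taken to be consecutive 28-day windows, the "maximum change" is taken to be the signed difference between consecutive months with the largest magnitude, and the input is assumed to be the number of active addresses in the /24 on each day. All function names are hypothetical.

```python
def monthly_stu(daily_active_counts, days_per_month=28, block_size=256):
    """Spatio-temporal utilization of one /24 for each 28-day window."""
    months = [daily_active_counts[i:i + days_per_month]
              for i in range(0, len(daily_active_counts), days_per_month)]
    return [sum(m) / (len(m) * block_size) for m in months]


def max_monthly_change(stu_per_month):
    """Signed month-to-month difference with the largest absolute value."""
    diffs = [b - a for a, b in zip(stu_per_month, stu_per_month[1:])]
    return max(diffs, key=abs) if diffs else 0.0


def is_major_change(daily_active_counts, threshold=0.25):
    """True if the block exceeds the +/-0.25 threshold used in the text."""
    change = max_monthly_change(monthly_stu(daily_active_counts))
    return abs(change) > threshold
```

A block that is unused for two months and then fully used for two months has monthly utilizations of 0, 0, 1, 1 and is flagged as a major change; a block with a steady 64 active addresses per day is not.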

5.3 Addressing Matters

Having culled out those blocks with major changes, we next focus on the activity characteristics of steady address blocks. Since we have observed that a block’s address assignment policy greatly influences its activity patterns, we would like to identify specific assignment practices. We pay particular attention to utilization characteristics associated with these practices. We argue that an address block’s utilization is determined by (a) its address assignment policy and (b) the behavior of its users.

Static vs. dynamic addressing: As a first cut, we are interested in how static and dynamic addressing mechanisms compare when it comes to address space utilization. In the static case, the ISP assigns a fixed IP address to each device/subscriber. Dynamic addressing, on the other hand, automatically assigns IP addresses from predefined ranges. In order to apply our metrics, we wanted an initial set of blocks that are likely to be statically or dynamically assigned. To this end, we used PTR (reverse DNS) records

Footnote 5: We acknowledge that some changes in address assignment might result in only minor activity changes and that other changes might result in larger changes in spatio-temporal utilization. We chose a threshold based on our anecdotal examination of activity patterns.

and tagged /24 blocks containing addresses with consistent names that suggest static (keyword static) as well as dynamic (keywords dynamic, pool) assignment, a well-known methodology [23, 27, 32]. In total, we find 456K dynamic /24 address blocks and 262K static address blocks. We then compare their activity based on our dataset. Figure 7(b) shows a CDF of the filling degree (active IPs per /24) for the two subsets of static or dynamic /24s, as well as for the entirety of our dataset. Comparing the curves for dynamically and statically assigned address blocks, we see a stark difference: While 75% of static /24s show a filling degree lower than 64 IPs, more than 80% of the dynamic /24s show a very high filling degree, i.e., higher than 250 IP addresses. When comparing these observations to our entire dataset, we observe that about 50% of the entire visible address space shows a very high filling degree (higher than 250). Another 30%, by contrast, shows filling degrees lower than 64. If our DNS-derived samples are representative, most sparsely populated /24 blocks are statically assigned and most dynamic pools cycle, i.e., have every address assigned at least once, during our observation window of 4 months, resulting in a high filling degree. However, the remaining 20% or so of the active /24s have varying filling degrees. These are either statically assigned blocks with higher utilization or dynamically assigned blocks with rather low utilization, e.g., those with long lease times as in Figure 5(c).

Dynamic address pools: We find that dynamically assigned /24 prefixes generally show a very high filling degree, with more than 250 active IP addresses in more than 80% of the cases. This heavily depends on the configured assignment policy, i.e., the pool size in relation to the number of connecting devices. Figures 5(b) and 5(d) both show dynamic addressing patterns; however, we see that their utilization is very different. To shed more light on such dynamic pools, we make use of our second metric, the spatio-temporal utilization. Focusing on those 1.2 million /24 blocks that have a very high filling degree (larger than 250, and hence likely dynamically assigned), Figure 7(c) shows their spatio-temporal utilization as a percentage of their maximum possible utilization. Here, we see that most of these address blocks have high utilization, with most blocks at more than 80%. In fact, we even see some 60K /24 blocks with 100% spatio-temporal utilization. This extraordinary utilization hints that they might contain shared proxy or gateway addresses; we will revisit these in Section 6. We also see more than 450K /24 prefixes with a utilization lower than 60% and 200K /24s with a utilization even lower than 20%.

[Figure 8 panels:]
(a) Median daily hits per IP address binned by activity (days). x-axis: days active (1 to 112); y-axis is log-scaled (1 to 10000); legend: 5/25 and 75/95 percentiles, 25/50 and 50/75 percentiles, median.
(b) Total number of IP addresses in each bin, total traffic contribution per bin. x-axis: days active (1 to 112); y-axis: CDF of the fraction of active IPs and of total traffic (0.0 to 1.0); legend: IP addresses, traffic contribution.
(c) Relative share of total traffic of top 10% IPs. x-axis: months of 2015 (01 to 12); y-axis: % traffic share of top 10% IPs (49 to 53); legend: weekly, moving average (4 weeks).

Figure 8: Activity time-range of IP addresses vs. their traffic contribution.
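The PTR-based tagging of /24 blocks used in this section can be sketched as a keyword match over reverse DNS names. The code below is a minimal illustration, not the method of [23, 27, 32]: `tag_block` is a hypothetical helper, "consistent names" is assumed to mean that every PTR name in the block carries a keyword and that all names agree, and the example hostnames are invented.

```python
from typing import Iterable, Optional


def tag_block(ptr_names: Iterable[str]) -> Optional[str]:
    """Return 'static', 'dynamic', or None for the PTR names of one /24."""
    labels = set()
    for name in ptr_names:
        lowered = name.lower()
        is_static = "static" in lowered
        is_dynamic = "dynamic" in lowered or "pool" in lowered
        if is_static == is_dynamic:
            # Either no keyword or conflicting keywords: do not tag the block.
            return None
        labels.add("static" if is_static else "dynamic")
    # Tag only if at least one name was seen and all names agree.
    return labels.pop() if len(labels) == 1 else None
```

Blocks for which this returns `None` would simply stay untagged and contribute only to the "all" curve of Figure 7(b).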

5.4 Potential Utilization

Figure 7(b) makes it clear that the spatio-temporal utilization of address blocks differs dramatically. We find that static vs. dynamic addressing mechanisms play an important first-order role, and now present some estimates of an address block’s maximum potential spatio-temporal utilization. We constrain this exercise to only those blocks known to be active, i.e., those that are known to be allocated, globally routed, and in operation. We argue that increasing utilization in these blocks is, in some instances, a mere configuration issue. Sometimes this means switching from static to dynamic assignment, but other times it means only reconfiguring an existing dynamic pool.

Specifically, we find that more than 30% of the active IP address blocks, more than 1.5M /24 blocks, have a filling degree lower than 64 active IP addresses. Our DNS PTR-based tagging method suggests that static address assignment practices are the main driver for low spatio-temporal utilization of IP address space. On the other hand, for the 50% of the active /24 address blocks that appear to be dynamically managed, we find that the majority have high spatio-temporal utilization, i.e., more than 80%. However, we also find that about one third of dynamic blocks show low spatio-temporal utilization; Figure 5(b) is a striking example. We argue that, as these address blocks are already dynamically assigned, reducing their pool sizes could instantly free significant portions of address space.

6. TRAFFIC & DEVICES

Until this point we studied the activity of an IP address over time, and with respect to neighboring addresses in a /24 prefix. We have seen a variety of IP address activity patterns and associated addressing mechanisms. We next take another dimension into account: traffic. In particular, we are interested in (i) how address activity correlates with traffic, (ii) whether we see a long-term trend with respect to the fraction of traffic associated with the heavy-hitter addresses, and (iii) how traffic contribution relates to the number of connected end hosts. In the following Section 7, we will combine traffic metrics and host estimates with the activity measurements of Section 5 to obtain a comprehensive perspective of the active IPv4 address space.

6.1 Activity vs. Traffic

Firstly, we are interested in how the binary notion of activity of IP addresses is connected to the actual traffic volume they fetch from the CDN. For this, we rely on our dataset that captures the number of daily HTTP requests issued by each individual IP address (see Section 3.2). To assess this, we group all IP addresses that were active during our 4-month (112 days) period into 112 bins, corresponding to the number of days each individual IP address was active. Figure 8(a) shows the median daily hits that were issued by all IP addresses in each bin, where we only consider days on which an IP address issued at least one hit. We also show the 5th, 25th, 75th and 95th percentiles for each bin (the y-axis is log-scaled). Note the strong correlation between temporal activity of IP addresses and their daily traffic contribution. While addresses that were only active for a few days issue a median of fewer than 100 requests per day, the traffic contribution is much higher for addresses that were active on more days. Indeed, we see that the traffic contribution significantly increases for IP addresses that were active almost every day (≥ 110 days), and those addresses that were active every day show an even higher median daily traffic contribution. This observation becomes clearer when looking at Figure 8(b), where we plot CDFs of the total number of IP addresses falling into each bin (red dots), and their relative total traffic contribution (all of the CDN’s traffic). While fewer than 10% of IP addresses were active every single day, these IP addresses account for more than 40% of


the CDN’s total traffic! The combination of continuous daily activity over the course of four consecutive months and the significantly larger contribution to overall traffic suggests that those 10% of the active IPv4 addresses include gateways, e.g., NAT routers and web proxies, aggregating the traffic of multiple users, as well as WWW client bots (e.g., employed by search engines).
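The binning behind Figures 8(a) and 8(b) can be sketched as follows. This is a minimal illustration under stated assumptions: `summarize_bins` is a hypothetical helper, the input maps each address to its list of daily hit counts (0 on inactive days), and the per-bin median is taken over all (address, active day) pairs in the bin, which is our reading of the description above.

```python
from collections import defaultdict
from statistics import median


def summarize_bins(daily_hits: dict) -> dict:
    """Group addresses by number of active days and summarize each bin."""
    ip_count = defaultdict(int)
    traffic = defaultdict(int)
    active_day_hits = defaultdict(list)
    for hits in daily_hits.values():
        active = [h for h in hits if h > 0]  # only days with at least one hit
        if not active:
            continue  # never active: not part of the active population
        days_active = len(active)
        ip_count[days_active] += 1
        traffic[days_active] += sum(active)
        active_day_hits[days_active].extend(active)
    total_ips = sum(ip_count.values())
    total_traffic = sum(traffic.values())
    return {
        b: {
            "ip_share": ip_count[b] / total_ips,
            "traffic_share": traffic[b] / total_traffic,
            "median_daily_hits": median(active_day_hits[b]),
        }
        for b in sorted(ip_count)
    }
```

Accumulating `ip_share` and `traffic_share` over increasing bin index gives the two CDFs of Figure 8(b); the entry for the last bin (112 days in the paper's dataset) corresponds to the always-active addresses discussed above.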

6.2 Traffic Consolidation

Given that we have reached a stage in which the number of active IPv4 addresses has stagnated, we were curious whether there is an observable trend over 2015 of increasing traffic concentration in the heavy-hitter addresses. To visualize this, we show in Figure 8(c) the traffic share of the 10% of addresses with the greatest traffic. (Note that the y-axis starts at 49%.) Here, we use our weekly dataset to show how this trend developed over the entirety of the year 2015. Figure 8(c) indeed shows a clear trend of traffic consolidation. While in January 2015 those IP addresses received a share between 49% and 50%, we see that their traffic share steadily increased over the course of the year. As of December 2015, the top 10% of the active IP addresses consume an additional 3% of the total traffic, which we believe is a notable increase over one year.
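The weekly quantity plotted in Figure 8(c) is the fraction of all hits issued by the 10% of addresses with the most hits. A minimal sketch, assuming a hypothetical per-week mapping from address to total hits and integer rounding of the top-10% cutoff (the text does not say how the cutoff is rounded):

```python
def top_share(hits_per_ip: dict, top_percent: int = 10) -> float:
    """Fraction of total hits issued by the top `top_percent`% of addresses."""
    totals = sorted(hits_per_ip.values(), reverse=True)
    grand_total = sum(totals)
    if grand_total == 0:
        return 0.0
    # Integer arithmetic avoids floating-point surprises in the cutoff.
    k = max(1, len(totals) * top_percent // 100)
    return sum(totals[:k]) / grand_total
```

Evaluating this once per week and smoothing with a 4-week moving average would reproduce the two curves named in the figure legend.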

6.3 Estimating Relative Host Counts

Having understood that the characteristics of activity of an IP address vary dramatically, both regarding its utilization as well as its volume of traffic, we are next interested in how many hosts reside in a given address block. Here we are particularly interested in cases where IP addresses generate huge amounts of traffic and show continuous activity, as seen in the previous section. While we do not have any data available that provides us with a definitive number of connected hosts per IP address, we use HTTP User-Agent strings as a proxy. Whenever a Web object is requested from a server, the respective client application identifies itself by providing a User-Agent string within the HTTP request header. We extended the CDN data-collection platform to store a random sample of HTTP User-Agent strings of connecting hosts. Due to the high volume of this data, we only store the User-Agent field for 1 out of 4K HTTP requests, and we restrict this analysis to the last month of our observation period.

In the canonical case, the User-Agent identifies the browser version, OS version, as well as the screen resolution. However, in more recent times, primarily driven by smartphone applications, which typically identify themselves and their version number with an individual User-Agent string, we see a much higher diversity in terms of User-Agent strings [33]. HTTP User-Agent strings have been used in the past to quantify host populations behind NAT devices in residential networks [22]. Here, we use them only as a relative measure of host counts per address block, i.e., we do not claim to be able to numerically quantify host populations. This is

[Figure 9 plot: unique User-Agent strings sampled per /24 (y-axis, log scale) versus User-Agent samples per /24 (x-axis, log scale, 1 to 10M); legend values: 1, 600, 160K.]

Figure 9: Diversity of User-Agent Strings per /24 block.

mainly for two reasons: (a) the coarse-grained sampling of this dataset and (b) the fact that a single device might introduce multiple User-Agent strings (e.g., a smartphone with different applications), while, on the other hand, multiple users can run the same software on the same device and thus emit a small number of unique User-Agent strings. The former can result in over-estimation and the latter in under-estimation of the host population.

Figure 9 shows for each active /24 block the number of User-Agent samples (x-axis) and the number of unique User-Agent strings (y-axis). Thus, the x-axis is an estimate of the traffic that hosts in this block issued (as we have sampled data here), and the y-axis is a relative measure of the number of hosts residing in this block. Overall, we see a strong correlation between traffic and hosts. Upon a closer look, we can divide the area of the plot into three groups. The first (and largest) group of /24 blocks ranges from the center of the figure to the lower left. Here we find the bulk of address blocks, e.g., from residential ISPs. Then, we have blocks that are shifted more towards the right, but show a low number of unique User-Agent strings (bottom right in the figure). Upon closer investigation of these blocks, we find that they are mainly related to crawling bots, which issue a large number of requests, but do so with very few (or one) User-Agent string(s). More interestingly, we see a third region, in the top right, with a huge number of requests and a very high diversity of User-Agent strings. A closer inspection of these blocks reveals that it is precisely those blocks that correspond to gateways, aggregating the traffic of thousands of end-users. We manually inspected the top 5K blocks in the top-right region of the plot. Using WHOIS information, we find that more than half of these blocks belong to ISPs located in Asia and that the majority are in use by cellular operators.
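The two per-block quantities plotted in Figure 9 can be derived from the sampled request log as sketched below. This is an illustration only: `user_agent_diversity` is a hypothetical helper, the input is assumed to be an iterable of (client IP, User-Agent string) pairs from the 1-in-4K sample, and the example strings in any usage are invented.

```python
import ipaddress
from collections import defaultdict


def user_agent_diversity(samples) -> dict:
    """Per /24: (number of sampled requests, number of unique User-Agents)."""
    sample_count = defaultdict(int)
    unique_agents = defaultdict(set)
    for ip, user_agent in samples:
        block = ipaddress.ip_network(f"{ip}/24", strict=False)
        sample_count[block] += 1
        unique_agents[block].add(user_agent)
    return {str(block): (sample_count[block], len(unique_agents[block]))
            for block in sample_count}
```

In these terms, a crawler block would show many samples but only one or a few unique strings, whereas a gateway block would be high on both counts.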

7. DERIVING DEMOGRAPHICS

In the previous sections we have studied different features of address blocks, namely, spatio-temporal utilization in Section 5, and traffic and (unique) User-Agents as a relative measure of host counts in Section 6. In this section we combine these metrics to provide a comprehensive perspective


[Figure 10 plot: 3D view with axes spatio-temporal utilization, traffic contribution, and relative host count.]

Figure 10: Characterization of the active IPv4 address space: Spatio-temporal activity, traffic contribution, relative host count per /24.

of the active IPv4 address space. Our three features are fundamentally different in nature, which also manifests itself in different scaling of our derived values per address block. Hence, to project our features onto a unified scale, we first need to normalize our measures of traffic and the relative host count. Our measure of spatio-temporal utilization is already normalized to the range (0, 1]. We normalize the traffic contribution as well as the relative host count by taking a log-transform of the value per /24 block and dividing it by the maximum log-transformed value over all active /24 blocks. Having these three normalized values per /24 block in hand, we next bin the resulting values into 10 intervals of length 0.1. This results in a 3-dimensional array with 1000 entries. We now assign each /24 block to one of these bins within our matrix (footnote 6).
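The normalization and binning steps above can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes hit counts and User-Agent counts of at least 1 (so that the logarithm is non-negative) and half-open bins [k/10, (k+1)/10) with the value 1.0 placed in the last bin, since the text does not state how interval boundaries are handled.

```python
import math
from collections import Counter


def log_normalize(value: float, max_value: float) -> float:
    """log(value) / log(max_value); the choice of log base cancels out."""
    if max_value <= 1:
        return 0.0
    return math.log(value) / math.log(max_value)


def bin_index(x: float, bins: int = 10) -> int:
    """Map a value in [0, 1] to a bin index 0..bins-1 (1.0 goes to the last)."""
    return min(bins - 1, int(x * bins))


def feature_histogram(blocks) -> Counter:
    """blocks: list of (stu, monthly_hits, unique_user_agents) per /24.

    Returns the number of /24 blocks per (stu, traffic, hosts) bin triple."""
    max_hits = max(hits for _, hits, _ in blocks)
    max_uas = max(uas for _, _, uas in blocks)
    cells = Counter()
    for stu, hits, uas in blocks:
        cell = (bin_index(stu),
                bin_index(log_normalize(hits, max_hits)),
                bin_index(log_normalize(uas, max_uas)))
        cells[cell] += 1
    return cells
```

The resulting counter has at most 10 × 10 × 10 = 1000 keys, matching the 3-dimensional array described above; only non-empty cells are stored.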

7.1 Internet-wide Demographics

Figure 10 shows a 3D visualization of our feature matrix, where we indicate the number of /24 address blocks falling into each bin by scaling the size of the respective sphere. We can make several observations from this plot: (i) We see a strong separation of address blocks along the spatio-temporal utilization axis. While one set of blocks is clustered towards values with a very small spatio-temporal utilization (less than 0.2), the other set is clustered towards very high spatio-temporal utilization. Recalling Section 5, this can mainly be attributed to varying addressing mechanisms. (ii) When taking the traffic contribution into account, we see that densely utilized address blocks typically have a higher traffic volume. However, this observation is not always true, as we also see significant portions of the address space with high traffic volume in sparsely populated areas.

Footnote 6: For traffic contribution, the median value of the 1st/5th/10th bin corresponds to 4/1.5M/44B monthly hits; the median value of the 1st/5th/10th bin for the relative host density corresponds to 2/2K/500K unique sampled User-Agent strings.

(iii) When relating these two features to our host count measure, we again see a higher diversity for highly utilized and traffic-intensive blocks. In particular, we see only a very tiny portion of /24 blocks that fall into the highest bin for the host count metric. These blocks typically also show maximum spatio-temporal utilization and maximum traffic contribution (small spheres on the top right). It is important to note that the blocks contained in these small spheres are responsible for a significant share of the CDN’s overall traffic.

7.2 Regional Characteristics

Lastly, we dissect the address space by Regional Registries. Recall that the address space is subject to management by 5 different organizations (RIRs, Section 2). Each RIR applies different management policies, and the current state of address exhaustion also varies per RIR. Thus, we believe that our grouping can assist in understanding the current status of the address space in each of these regions and support policy decisions when it comes to managing the last remaining blocks and/or re-allocations of already in-use address blocks. Figure 11 shows our address space categorization for the five RIRs. Here, we plot the spatio-temporal utilization and traffic contribution on the X and Y axes, and indicate the relative host counts by the color scale (grey: low relative host count, red: high relative host count). Again, we adjust the size of the circles to reflect the number of /24s falling into each bin.

We can see that about half of the active address space within the ARIN region clusters towards the left, i.e., shows low utilization and low traffic contribution. However, we note that there are some heavily active address blocks also in this region (small red dots at X = 0.2, Y = 0.8 and 0.9). We see that the other regions have more of their address space in the highly utilized region, which is especially true for LACNIC and AFRINIC. A possible explanation for this behavior is that LACNIC and AFRINIC were incorporated much later than the other RIRs and had address conservation as a primary goal from the very beginning [28]. Notably, we see for the APNIC and AFRINIC regions in particular a significant chunk of /24 blocks towards the top right of the figure (X = 1.0, Y = 0.7 and 0.8), which also show a very high relative host count. This hints towards increased proxying/gateway deployments, which are more pronounced in these regions when compared to, e.g., ARIN.

8. IMPLICATIONS

Implications to measurement practice: We count 1.2 billion active, globally unique IPv4 addresses, more than has been reported previously, except by statistical estimation [34], boding well for future use of such statistical models and techniques driven by sampled observation. Our address-counting results imply that remote active measurements are insufficient for a census or complete survey of the Internet, particularly at IP address-level granularity. Also, our passive


[Figure 11 panels: one plot per RIR (ARIN, RIPE, APNIC, LACNIC, AFRINIC), each showing traffic contribution (y-axis, 0.2 to 1.0) versus spatio-temporal utilization (x-axis, 0.2 to 1.0).]

Figure 11: Breakdown of IP address space characterization per RIR. Color encodes the relative host count.

measurements have shown extensive churn in IPv4 addresses on all timescales, which implies that any census needs to be qualified by the observation frequency and period.

Implications to Internet Governance: The 1.2 billion active addresses we count represent 42.8% of the possible unicast addresses that we see advertised in the global routing table. If we restrict our attention to the 6.5 million /24 prefixes in which we observed active WWW client addresses (Table 1), i.e., exclude blocks that may be dedicated to network infrastructure and services, we see that roughly 450 million addresses may have been unused. If some large subset of these is actually unused today, one could imagine reallocating them for use in IPv6 transition mechanisms that require IPv4 addresses, e.g., NAT64 and DNS64 [3, 4], or as a commodity whose supply might last years in a marketplace, based on past rates of growth in IPv4 address use (Figure 1).
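As a rough sanity check on the figures quoted above, the following sketch redoes the arithmetic from the rounded inputs given in the text (1.2 billion active addresses, a 42.8% share, 6.5 million active /24 prefixes). The constant and function names are illustrative assumptions of ours, and because the inputs are rounded, the outputs only approximate values derived from the unrounded data.

```python
# Rounded figures quoted in the text; all results are therefore approximate.
ACTIVE_ADDRESSES = 1.2e9      # active IPv4 addresses observed
ACTIVE_SHARE = 0.428          # active share of advertised unicast addresses
ACTIVE_SLASH24S = 6.5e6       # /24 prefixes with at least one active address
ADDRESSES_PER_SLASH24 = 256   # size of a /24 block


def implied_routed_addresses(active, share):
    """Advertised unicast addresses implied by the active count and its share."""
    return active / share


def unused_in_active_blocks(active, slash24s, per_block=ADDRESSES_PER_SLASH24):
    """Addresses inside active /24s for which no activity was observed."""
    return slash24s * per_block - active


routed = implied_routed_addresses(ACTIVE_ADDRESSES, ACTIVE_SHARE)
unused = unused_in_active_blocks(ACTIVE_ADDRESSES, ACTIVE_SLASH24S)

print(f"implied advertised unicast addresses: {routed / 1e9:.2f} billion")
print(f"inactive addresses within active /24s: {unused / 1e6:.0f} million")
```

With these rounded inputs, the second quantity comes out at about 464 million, which is consistent with the "roughly 450 million" stated above.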

IPv4 address markets are an operational reality, governed by the respective RIR policies [28]. A pertinent implication of our work for these markets is that our metrics, combined with the appropriate vantage points, are well suited to readily determine the spatio-temporal utilization of network blocks. This can aid RIRs in determining the current state of address utilization in their respective regions, in determining whether a transfer conforms with their transfer policy (e.g., four out of five RIRs require market transfer recipients to justify need for address space), and in identifying the likely buyers and sellers of addresses.

Implications to network management: It is feasible for any network to employ our metrics and perform our analysis on a continual basis, e.g., by monitoring traffic at its border. Measuring spatio-temporal utilization would enable an operator to more efficiently manage the IPv4 addresses they assign, especially in networks such as those discussed in Section 6. Networks that make gains in efficiency by discovering unnecessary address blocks may decide to become sellers in the IPv4 transfer marketplace. More generally, we believe that our measurements can serve as input for fruitful discussions on address assignment practices and their eventual effect on address space utilization.

Implications to network security: Our observations of many disparate rates of change in the assignment of IP addresses to users have consequences for maintaining host-based access controls and host reputations. A host’s IP address is often associated with a reputation that is subsequently used for network abuse mitigation, e.g., in the form of access control lists and application rate-limits that specifically use those IP network blocks or addresses as identifiers with which some level of trust is (or is not) associated. Unfortunately, in this way, addresses and network blocks become encumbered by their prior uses and the behavior of users within. This happens when reputation information is stale. The implication of our work here is that it can inform host-based access control and host reputation, e.g., by determining the spatial and temporal bounds beyond which an IP address's reputation should no longer be respected. Further, our change detection method (Section 5.2) could be used to trigger expiration of host reputation, avoiding security vulnerabilities when networks are renumbered or repurposed.

Implications to content delivery: Details about user activity at the address level are valuable in CDN operation. A key responsibility of CDNs is to map users to the appropriate server(s) based on criteria including performance and cost [24]. Details about active IP addresses and network blocks are increasingly important when the CDN uses end-user mapping [7], where client addresses are mapped to the appropriate server.
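To illustrate how an operator might compute a utilization metric from border traffic, as suggested under network management above, the following minimal sketch assumes that the spatio-temporal utilization of a /24 is the fraction of its (address, day) slots in which activity was observed; the precise definition used earlier in this paper may differ in detail. The function name, the input format (pairs of address string and zero-based day index), and the sample addresses are hypothetical.

```python
from collections import defaultdict
from ipaddress import IPv4Address, IPv4Network


def spatio_temporal_utilization(observations, num_days):
    """Return {"/24 prefix": utilization in [0, 1]}.

    observations: iterable of (ip_string, day_index) pairs, one per sighting.
    num_days: length of the observation window in days.
    Assumed definition: active (address, day) slots / (256 * num_days).
    """
    active_slots = defaultdict(set)
    for ip, day in observations:
        if not 0 <= day < num_days:
            continue  # ignore sightings outside the observation window
        addr = IPv4Address(ip)
        # Mask off the last octet to obtain the covering /24 block.
        block = IPv4Network((int(addr) & 0xFFFFFF00, 24))
        # A set de-duplicates repeated sightings of the same address and day.
        active_slots[str(block)].add((addr, day))
    return {
        block: len(slots) / (256 * num_days)
        for block, slots in active_slots.items()
    }


# Hypothetical sightings over a two-day window (documentation prefixes).
sample = [
    ("192.0.2.1", 0),
    ("192.0.2.1", 0),   # duplicate sighting, counted once
    ("192.0.2.1", 1),
    ("192.0.2.7", 0),
    ("198.51.100.5", 1),
]
utilization = spatio_temporal_utilization(sample, num_days=2)
for block, value in sorted(utilization.items()):
    print(block, value)
```

Blocks whose utilization stays near zero over a long window would be candidates for closer inspection, subject to the caveat that a single vantage point may not see all activity.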

9. CONCLUSION

In this paper, we study the Internet through the lens of IPv4 address-level activity as measured by successful connections to a large CDN. After many years of constant growth, active IPv4 address counts have stagnated, while IPv6 counts have grown [25]. In addition, we observe churn in the set of active addresses on time scales ranging from a day to a year. Simple address counts do not capture the increasingly complex situation of usage of the IPv4 address space. Instead, we use three metrics that our results show are helpful to understand what is happening now: spatio-temporal aspects of address activity, address-associated traffic volume, and relative host counts. Continued overall growth but lagging adoption of IPv6 have brought a reimagined IPv4 upon us, one that entails address sharing in both space and time. The Internet community is in a complex and costly resource-limited predicament, foreseen but unavoided. The prolonged tussle continues amongst operators about whether and when to implement incremental changes to IPv4, adopt IPv6, or both. Our study, as well as others that might adopt our metrics, can guide us in this tussle and better illuminate the condition of the IPv4 address space.


10. REFERENCES

[1] D. Adrian, Z. Durumeric, G. Singh, and J. A. Halderman. Zippier ZMap: Internet-wide Scanning at 10 Gbps. In 8th USENIX Workshop on Offensive Technologies, 2014.

[2] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. Building a Dynamic Reputation System for DNS. In USENIX Security Symposium, 2010.

[3] M. Bagnulo, P. Matthews, and I. van Beijnum. Stateful NAT64: Network Address and Protocol Translation from IPv6 Clients to IPv4 Servers. IETF RFC 6146, April 2011.

[4] M. Bagnulo, A. Sullivan, P. Matthews, and I. van Beijnum. DNS64: DNS Extensions for Network Address Translation from IPv6 Clients to IPv4 Servers. IETF RFC 6147, April 2011.

[5] X. Cai and J. Heidemann. Understanding Block-Level Address Usage in the Visible Internet. In ACM SIGCOMM, 2010.

[6] CAIDA. Ark Measurement Infrastructure. http://www.caida.org/projects/ark/.

[7] F. Chen, R. K. Sitaraman, and M. Torres. End-User Mapping: Next Generation Request Routing for Content Delivery. In ACM SIGCOMM, 2015.

[8] D. Clark, J. Wroclawski, K. Sollins, and R. Braden. Tussle in Cyberspace: Defining Tomorrow’s Internet. In ACM SIGCOMM, 2002.

[9] J. Czyz, M. Allman, J. Zhang, S. Iekel-Johnson, E. Osterweil, and M. Bailey. Measuring IPv6 Adoption. In ACM SIGCOMM, 2014.

[10] F. Dabek, R. Cox, F. Kaashoek, and R. Morris. Vivaldi: A Decentralized Network Coordinate System. In ACM SIGCOMM, 2004.

[11] A. Dainotti, K. Benson, A. King, k. claffy, M. Kallitsis, E. Glatz, and X. Dimitropoulos. Estimating Internet address space usage through passive measurements. ACM CCR, 44(1):42–49, 2014.

[12] A. Dainotti, K. Benson, A. King, k. claffy, E. Glatz, X. Dimitropoulos, P. Richter, A. Finamore, and A. Snoeren. Lost in Space: Improving Inference of IPv4 Address Space Utilization. Technical report, (to appear in IEEE JSAC Q2, 2016), Oct 2014.

[13] Z. Durumeric, E. Wustrow, and J. A. Halderman. ZMap: Fast Internet-Wide Scanning and its Security Applications. In USENIX Security Symposium, 2013.

[14] X. Fan and J. Heidemann. Selecting Representative IP Addresses for Internet Topology Studies. In ACM IMC, 2010.

[15] B. Gueye, A. Ziviani, M. Crovella, and S. Fdida. Constraint-Based Geolocation of Internet Hosts. IEEE/ACM Trans. Networking, 14(6):1219–1232, 2006.

[16] S. Hao, N. A. Syed, N. Feamster, A. G. Gray, and S. Krasser. Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine. In USENIX Security Symposium, 2009.

[17] J. Heidemann, Y. Pradkin, R. Govindan, C. Papadopoulos, G. Bartlett, and J. Bannister. Census and Survey of the Visible Internet. In ACM IMC, 2008.

[18] Y. Jin, E. Sharafuddin, and Z. L. Zhang. Identifying dynamic IP address blocks serendipitously through background scanning traffic. In CoNEXT, 2007.

[19] E. Katz-Bassett, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, and Y. Chawathe. Towards IP geolocation using delay and topology measurements. In ACM IMC, 2006.

[20] E. Katz-Bassett, H. Madhyastha, V. Adhikari, C. Scott, J. Sherry, P. van Wesep, A. Krishnamurthy, and T. Anderson. Reverse Traceroute. In NSDI, 2010.

[21] G. Maier, A. Feldmann, V. Paxson, and M. Allman. On Dominant Characteristics of Residential Broadband Internet Traffic. In ACM IMC, 2009.

[22] G. Maier, F. Schneider, and A. Feldmann. NAT usage in residential broadband networks. In PAM, 2011.

[23] G. C. M. Moura, C. Ganan, Q. Lone, P. Poursaied, H. Asghari, and M. van Eeten. How Dynamic is the ISPs Address Space? Towards Internet-Wide DHCP Churn Estimation. In Workshop on Research and Applications of Internet Measurements, 2015.

[24] E. Nygren, R. K. Sitaraman, and J. Sun. The Akamai Network: A Platform for High-performance Internet Applications. SIGOPS Oper. Syst. Rev., 44(3), 2010.

[25] D. Plonka and A. Berger. Temporal and Spatial Classification of Active IPv6 Addresses. In ACM IMC, 2015.

[26] L. Quan, J. Heidemann, and Y. Pradkin. Trinocular: Understanding Internet Reliability Through Adaptive Probing. In ACM SIGCOMM, 2013.

[27] L. Quan, J. Heidemann, and Y. Pradkin. When the Internet sleeps: correlating diurnal networks with external factors. In ACM IMC, 2014.

[28] P. Richter, M. Allman, R. Bush, and V. Paxson. A Primer on IPv4 Scarcity. ACM CCR, 45(2), 2015.

[29] A. Schulman and N. Spring. Pingin’ in the Rain. In ACM IMC, 2011.

[30] Akamai Technologies. State of the Internet Report. https://www.akamai.com/us/en/our-thinking/state-of-the-internet-report.

[31] B. Wong, I. Stoyanov, and E. Gun Sirer. Octant: A Comprehensive Framework for the Geolocalization of Internet Hosts. In NSDI, 2007.

[32] Y. Xie, F. Yu, K. Achan, E. Gillum, M. Goldszmidt, and T. Wobber. How Dynamic are IP Addresses? In ACM SIGCOMM, 2007.

[33] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman. Identifying Diverse Usage Behaviors of Smartphone Apps. In ACM IMC, 2011.

[34] S. Zander, L. Andrew, and G. Armitage. Capturing Ghosts: Predicting the Used IPv4 Space by Inferring Unobserved Addresses. In ACM IMC, 2014.

