
OpenStreetMap Project

Data Wrangling with MongoDB

By Edmond Chin Juin Fung

Map Area: Singapore

Content:

1. Problems Encountered in the Map

1.1 Nodes that are not relevant to Singapore

1.1.1 is_in:country

1.1.2 is_in

1.2 Problems related to the subtags of the nodes

1.3 Omission of “name” attribute

1.4 Auditing “street”

1.5 Auditing Post Code

1.5.1 Alphabet and whitespace in postcode

1.5.2 Cleaning data

Stage 1 – filtering with city

Stage 2 – filtering with street

Stage 3 – filtering with name

Stage 4 – filtering the rest of the documents

Final stage – cleaning

2. Data Overview

3. Additional Idea

Reference

1. Problems Encountered in the Map

After converting the Singapore.osm file into a JSON file and storing it in MongoDB (refer to conversion_original.py), I ran some queries in the mongo shell to analyse the data. I noticed multiple problems with the data, which are discussed below.

1.1 Nodes that are not relevant to Singapore

1.1.1 is_in:country

Basic querying revealed that some of the nodes have the attribute “is_in:country”. So I used a query such as find({"is_in:country": {"$exists": 1}}) together with .count() to print out a list of nodes that have this attribute and the total number of such nodes. I noticed that a lot of them are not from Singapore but from neighbouring countries such as Malaysia and Indonesia. Example as below:
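As an illustration, here is a minimal sketch of that query using PyMongo rather than the mongo shell. The database and collection names ("osm", "singapore") are assumptions, not taken from the project files:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    collection = client["osm"]["singapore"]

    # Total number of documents carrying an "is_in:country" attribute.
    query = {"is_in:country": {"$exists": 1}}
    print(collection.count_documents(query))

    # A few of the offending documents, for inspection.
    for doc in collection.find(query).limit(5):
        print(doc.get("is_in:country"), doc.get("name"))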


From the above example, we see that the place is stated to be located in Johor state, Malaysia. The name “Layang-Layang” is definitely a place in Malaysia, so this document should not belong in the Singapore data.

Unfortunately, we cannot generalize that every document with a non-Singapore value for the “is_in:country” attribute is located outside of Singapore, as it is possible for a user to make an input mistake. Yet, it is less likely to be an input mistake if the user entered two non-Singapore values into two different attributes. In the above example, “is_in:country” has the value “Malaysia” and “is_in:state” has the value “Johor”; both are non-Singapore values, so the document is less likely to be an input mistake and more likely to be non-Singapore data.

(The Python code used for the following paragraphs can be found in verification.py in the zip file)

Using this argument, I first found the total number of documents that have “is_in:country” and its unique values. Results as below:

I then found the unique values of “is_in:state” among the documents that do not have “Singapore” as the value of the “is_in:country” attribute. The results are as follows:
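A sketch of how these two verification queries might look in PyMongo (the report's actual code lives in verification.py; the collection variable is carried over from the previous sketch):

    # Unique "is_in:country" values and how many documents carry the attribute.
    print(collection.distinct("is_in:country"))
    print(collection.count_documents({"is_in:country": {"$exists": 1}}))

    # Unique "is_in:state" values among documents whose country is not Singapore.
    states = collection.distinct(
        "is_in:state",
        {"is_in:country": {"$exists": 1, "$ne": "Singapore"}},
    )
    print(states)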

I identified that all 6 of them are non-Singapore values, and thus suitable to be removed from the data. Since the next most common attribute shared among those documents is “name”, I had to verify them one by one. Luckily, I only had 39 documents left to go through, and it was not difficult to verify them, since Singapore locations are named in English whereas Malaysian and Indonesian locations are named in Malay. I ran some code and part of the results are as follows:


I identified all of them as non-Singapore values since the names are in Malay, so it is safe to remove them from the data.

From my analysis, it is now safe to conclude that we can eliminate all documents that have a non-Singapore value for the “is_in:country” attribute. I went back to the osm file and used Python code (function cleanup_is_in in conversion_revised.py) to filter out these documents before converting to JSON and importing into MongoDB. The results are as below:
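A minimal sketch of what a filter like cleanup_is_in might look like during the osm-to-json conversion; the real implementation is in conversion_revised.py, and the parsing details here are assumptions:

    import xml.etree.ElementTree as ET

    def has_foreign_is_in_country(element):
        # True if the element carries an "is_in:country" tag whose value
        # is anything other than "Singapore".
        for tag in element.iter("tag"):
            if tag.get("k") == "is_in:country" and tag.get("v") != "Singapore":
                return True
        return False

    def keep_elements(filename):
        # Yield only the nodes and ways that are not marked as foreign.
        for _, element in ET.iterparse(filename):
            if element.tag in ("node", "way") and not has_foreign_is_in_country(element):
                yield element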

1.1.2 is_in

(Additional cleaning; code can be found in verification.py)

After cleaning the data, further querying revealed that there are nodes that do not have the “is_in:country” attribute but have an “is_in” attribute that also points to locations that do not belong to Singapore. These nodes had been missed by the previous analysis, since it was based on the “is_in:country” attribute. Example below:

From the example, we see an “is_in” attribute that points to a non-Singapore location, without an “is_in:country” attribute. I then ran a query over the data. Result as below:

From the result above, apart from 3 values (“Singapore, , Singapore”, “Singapore”, “Sentosa”), the rest do not belong to Singapore. As in the previous analysis, to make sure they are not input mistakes, I ran the query along with the “name” attribute. If the “name” value is in Malay, we can assume the document does not belong in Singapore. Part of the results are as below:


I have identified them to be of Malay origin, so it is safe to remove the documents. The rest of the documents are those that do not have “name”. Example as below:

From the above example, we can see that there is no additional data that allows me to identify whether it is a non-Singapore location or simply an input mistake by a user. Therefore, I shall remove these documents as well (please refer to conversion_revised.py for the identification and removal functions). I have also replaced the “is_in” attribute value “Singapore,,Singapore” with “Singapore” in the code. The results are as below:

1.2 Problems related to the subtags of the nodes

While querying the number of nodes and ways, I found a discrepancy between the total number of documents and the numbers of nodes and ways.

The total number of documents:

The total number of nodes:

The total number of ways:

As you can see from the above, the values do not add up: there are 42 documents that are neither node nor way. This is odd, as during my conversion from the osm file to the json file I specifically programmed it so that only elements of type “node” and “way” would be inserted into the json file. So what are these 42 unknowns? Some querying reveals that the type attribute has been replaced by other information (refer to the diagram below).

The problem lies in the way I wrote my Python code and in the osm data itself (please refer to the green comments in conversion_original.py for the exact problem in the code). I assigned the “type” key early in my code. Yet, it is possible for tag.attrib["k"] to have the value “type” as well, thus replacing my previous assignment with the new one. I decided not to discard this information, but to assign it a new key, as it might be useful. I made some modifications and added a function as below:
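The function itself appeared as an image in the original report; below is a sketch of what such a guard might look like, reconstructed from the description that follows. The "sub-tag_" prefix matches the “sub-tag_type” attribute mentioned later, but the exact renaming scheme is an assumption:

    # Keys that the converter assigns itself and that must not be overwritten.
    RESERVED_KEYS = {"id", "type", "visible", "created", "pos", "address", "node_refs"}

    def safe_key(k):
        # Store a colliding tag key under a prefixed name instead,
        # e.g. "type" becomes "sub-tag_type".
        return "sub-tag_" + k if k in RESERVED_KEYS else k

    # Usage inside the tag loop of the converter:
    #     node[safe_key(tag.attrib["k"])] = tag.attrib["v"]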

I make sure that tag.attrib["k"] will be stored under a slightly different key if it matches one of the strings “id”, “type”, “visible”, “created”, “pos”, “address” or “node_refs”. This prevents the values of these predetermined attributes from being replaced. The results are as below:

The total number of documents:

The total number of nodes:

And the total number of ways:

Now they finally add up.

Furthermore, I have also created a new attribute “sub-tag_type” for the extra data. Results as below (code can be found in verification.py):

1.3 Omission of “name” attribute

In this data, “name” is one of the most common attributes among documents that carry meaningful information. Yet, not all documents with meaningful information have the “name” attribute. Example as below:

By including the “name” attribute in those documents, the data becomes more consistent, allowing easier coding and analysis in the future. I will work on the most common such documents, namely those that have one of the attributes “highway”, “network” or “amenity”. I used the function below for the editing (it can also be found in conversion_revised.py):
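The function appeared as an image in the original report; a minimal sketch consistent with the description below might look like this (the exact behaviour of the version in conversion_revised.py may differ):

    def add_missing_name(node):
        # Copy one of the descriptive attributes into "name" if it is absent.
        if "name" not in node:
            for key in ("highway", "network", "amenity"):
                if key in node:
                    node["name"] = node[key]
                    break
        return node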

In my code, I first identify those documents, then add a “name” attribute with the same value as the “highway”, “network” or “amenity” attribute. The results are as follows:

1.4 Auditing “street”

Some of the documents have the attribute “addr:street”, which tells us which street the particular node is on. Yet, the naming is a bit inconsistent: some of the names use abbreviations. For example, “Drive” is written as “Dr.” and “Avenue” is written as “Ave.”. Below are some abbreviated street examples that I found in my data:

(For coding details, please refer to the update_street(tag, mapping, node) function in conversion_revised.py)

To solve this, I first created a dictionary which stores all the abbreviations as keys and their original words as values. Then, I converted all the abbreviations to the original words. Part of the results are as below:
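A sketch of the approach (the real update_street(tag, mapping, node) is in conversion_revised.py; the mapping entries shown here are illustrative, not the full dictionary):

    mapping = {
        "Dr.": "Drive",
        "Ave.": "Avenue",
        "Rd.": "Road",
        "St.": "Street",
    }

    def expand_abbreviation(street_name, mapping):
        # Replace each abbreviated word with its full form.
        words = street_name.split()
        return " ".join(mapping.get(word, word) for word in words)

    print(expand_abbreviation("Tanjong Pagar Dr.", mapping))  # Tanjong Pagar Drive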

1.5 Auditing Post Code


Some querying reveals that the postcodes are not consistent. Singapore postcodes have 6 digits, but some of the postcodes in the data do not have 6 digits, and some contain letters. Example as below:

In the example, we see that the postcode is not 6 digits, and information such as “addr:city” also shows that it is not even a Singapore location. Yet, we cannot generalize that all documents whose postcode is not 6 digits do not belong to Singapore, as it might be an input mistake by a user. Therefore, we will also have to rely on other information, such as “addr:city”, to determine the location. On the other hand, we also need to consider postcodes that contain letters or whitespace. We shall first settle the letter and whitespace problem, then move on to cleaning the data.

1.5.1 Alphabet and whitespace in postcode

Some postcodes have letters or whitespace in them. Example as such:

From the example above, we see that it is a Singapore location, but the postcode has the word “Singapore” in it. I therefore removed all non-digit characters from these values so that only digits remain (please refer to the all_digit(node) function in conversion_revised.py). Some of the resulting changes are as follows:
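A sketch of the stripping step (the actual all_digit(node) function is in conversion_revised.py):

    import re

    def strip_non_digits(postcode):
        # Keep only the digits, e.g. "Singapore 408564" -> "408564".
        return re.sub(r"\D", "", postcode)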

1.5.2 Cleaning data


Since there are non-six-digit postcodes in the data, we will have to double-check them against other attributes to determine whether the documents are really not from Singapore. On the other hand, since not all documents have the same attributes, we will have to go through several stages to complete the whole auditing and cleaning process. Below is some information regarding the documents:

Following the information above, we shall filter the documents by their attributes in this order: city, street, name, then the rest of the documents.

Stage 1 - Filtering with city:

Running code that filters for postcodes whose length is not equal to 6 in documents with an “addr:city” attribute gives me a list of cities as such:
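A sketch of what this stage-1 query might look like as a PyMongo aggregation (MongoDB 3.6+ for $expr); the later stages swap "addr:city" for "addr:street", "name" and so on. The field paths assume the converter nests addr:* tags under an "address" sub-document, as the reserved-key list in section 1.2 suggests, which is itself an assumption:

    pipeline = [
        {"$match": {
            "address.postcode": {"$type": "string"},
            "address.city": {"$exists": 1},
            # Keep only documents whose postcode is not 6 characters long.
            "$expr": {"$ne": [{"$strLenCP": "$address.postcode"}, 6]},
        }},
        {"$group": {"_id": "$address.city", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]
    for row in collection.aggregate(pipeline):
        print(row)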

Apart from “Singapore”, the rest of the cities are not from Singapore, so it is safe to remove those documents. I had to dig deeper into why there are 53 documents with “Singapore” as the city value that do not have 6-digit postcodes. Querying these documents for their street info gets me the result below:

It appears all of them are indeed located in Singapore, but the postcodes are not in the correct format. I will thus remove the “addr:postcode” attribute from these documents.

Stage 2 - Filtering with street:

Running code that filters for postcodes whose length is not equal to 6 in documents without “addr:city” but with an “addr:street” attribute gives me a list of streets, part of which is shown below:

Apart from “Chancellor Drive”, the rest of the names are in Malay, so we can consider them to be from non-Singapore locations. Further querying into the “Chancellor Drive” documents reveals information as below:

Since all the details are in English and relate to a university, I typed the postcode 79200 into Google Maps and it returned a location in Johor, where the University of Southampton has a campus [1]. Therefore, I conclude that these are not Singapore-related documents either.

Stage 3 - Filtering with name:

Running code that filters for postcodes whose length is not equal to 6 in documents without “addr:city” and “addr:street” but with a “name” attribute gives me a list of names, part of which is shown below:

Apart from a few Malay names that I can recognize, there are several documents whose location I cannot verify. I will thus examine them together with the rest of the leftover documents in the next stage.

Stage 4 - Filtering the rest of the documents:

Running code that filters the rest of the documents by their “user” attribute, I get results such as below:

UTM stands for Universiti Teknologi Malaysia, so we can remove the documents whose user names start with “UTM”. That leaves 3 documents to analyse, which we shall go through one by one. After choosing some attributes to present, the results are as below:

As mentioned previously, postcode 79200 points to a university town in Malaysia. Since the names of these documents also seem to come from a university background, we shall assume they are not from Singapore and remove them. As for the “buffet”, we have no practical way to identify it, so we shall keep it in the database but remove the inconsistent postcode.

Final Stage - Cleaning (code may be found in the clean_postcode(node) function in conversion_revised.py)

I will thus remove all documents that do not have 6-digit postcodes, apart from the documents retained in stage 1 and stage 4. To verify that the cleaning has been done, the new total number of documents should be 1,015,776 (no. of docs before cleaning) − (5,668 (no. of docs without a 6-digit postcode) − 54 (no. of docs retained from stage 1 and stage 4)) = 1,010,162. The result is as below:
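A sketch of the cleaning rule described above (the real clean_postcode(node) is in conversion_revised.py; keep_ids stands in for the documents retained from stage 1 and stage 4, whose exact ids are not reproduced here):

    def clean_postcode(node, keep_ids):
        # Returns the (possibly modified) node, or None to discard it.
        postcode = node.get("address", {}).get("postcode")
        if postcode is None or len(postcode) == 6:
            return node                      # already consistent
        if node.get("id") in keep_ids:
            del node["address"]["postcode"]  # keep the document, drop the bad postcode
            return node
        return None                          # non-Singapore document: discard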


2. Data Overview

The file sizes of the osm and json files are as follows:

Singapore.osm - 188 MB

Singapore.osm.json - 284 MB

Number of documents

Number of nodes

Number of ways

Some fundamental percentages related to the data

(Code may be found in data_overview.py)

Top 20 appearing amenities

(Code may be found in data_overview.py)
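A sketch of the top-20 amenities query (the report's version is in data_overview.py):

    pipeline = [
        {"$match": {"amenity": {"$exists": 1}}},
        {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 20},
    ]
    for row in collection.aggregate(pipeline):
        print(row["_id"], row["count"])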


3. Additional Idea

It is quite disappointing to see that only around 10% of the data has any useful information in it (the percentage of nodes without a name or description is 90.45%). On the other hand, a whopping 68.7% of the data that does have useful information concerns highways and roads. It seems the users who input the data are very interested in cars and highways; unsurprisingly, the most commonly listed amenity is parking. This is quite puzzling, as the total number of motor vehicles in Singapore is around 900k [3] and the total population of Singapore is around 5.5 million [2], which means only around 16% of the people actually own a car. Yet, around 70% of the data that we get concerns highways or parking. It would be great if there were more information on businesses and public transport, as that information would be a lot more useful to the many people who do not drive.

Furthermore, the data does not follow any kind of uniform format, which makes analysis rather difficult. For example, documents regarding highways have the identification attribute “highway”, whereas documents regarding subways have the identification attribute “network”. I think it would be a lot more convenient to put these identifications into a single category, such as amenity or type of facility. This would make the data tidier and allow easier analysis, as we would then be able to identify every unique type of document just by querying that single category.

On the other hand, not all documents have the same attributes. For example, among the documents that have “addr:postcode”, not all have “addr:city” or “addr:street”. Such inconsistencies make analysis quite difficult. Therefore, it would be more efficient if users followed a certain data input method; for example, they would have to insert “addr:city” whenever they insert “addr:postcode”. Unfortunately, as this data depends on input from the public, it will be quite difficult to insist on a particular format. One way of encouraging it would be to show users a note about the appropriate format while they are inserting data. Another way is to make some input fields mandatory.

Still, insisting that users take certain actions might discourage them from contributing data at all. Yet, if osm is popular and useful, they might look past these restrictions. On the other hand, forcing them to fill in certain mandatory fields might also invite odd data input, as users might simply enter garbage into the fields. To overcome this, we can ensure the fields are filled in properly with drop-down selections, or by only allowing certain characters and formats in the fields.


References

1. https://www.google.com.sg/maps/place/Educity+Student+Village/@1.4303999,103.6127955,17.24z/data=!4m5!1m2!2m1!1s79200!3m1!1s0x31da0ba2bc9f9ec1:0x80b6846292a4d575

2. http://www.singstat.gov.sg/statistics/latest-data#8/

3. https://www.lta.gov.sg/content/dam/ltaweb/corp/PublicationsResearch/files/FactsandFigures/MVP01-1_MVP_by_type.pdf