apollo.humber.caapollo.humber.ca/~king/ceng251/datastructure diagram... · web viewan embedded...


Upload: lydien

Post on 07-Feb-2019



Course Notes: Data Structure Diagrams

A Data Structure Diagram (DSD) is a simple sketch showing the physical organization of data in memory. It makes no difference whether the data is in RAM, on a disk or being transmitted as a packet over a network: the organization is identical and the same diagram can be used! A DSD allows us to describe and design the data structures which we can use to manipulate the data. Except for the 1st image I’m using hand-drawn sketches here in order to encourage you to do the same. There are tools one can use to create a DSD, but our diagrams are going to be so simple that it’s not worth our time to use them.

Drawing a Single Scalar Value

Just draw a box and label it.

In some cases we’re interested in individual bits, in which case we draw lines, label the bit positions and the purpose of each field.


Figure 1¹

Drawing Arrays

The first convention is to show an array as a contiguous block of memory. A C string is usually drawn horizontally as a series of boxes with one letter in each box, and a diagonal line in the last box to show that the string is terminated by a null character.

Figure 2 - Datastructure Diagram for a C String

However, a string in a different language such as Pascal could be represented as a 2 or 4 byte length followed by the actual string.

Figure 3-How Strings are represented in Pascal. The string length comes first and there is no null termination.

In Windows/MS-DOS the end of a line of text is represented by carriage return + newline. When we need to document a blank, draw a b with a line through it. Other special characters are written using their escape sequences, such as \r for carriage return and \n for newline.

1 http://www.puntoflotante.net/FLOATING-POINT-FORMAT-IEEE-754.htm retrieved Feb 1, 2018.


On Unix systems only newline is used. This can create problems when transferring plain text files between systems. Unix provides two conversion programs to handle the difference: dos2unix and unix2dos.

Figure 5 - How Unix represents a line of text

The beginning of a data structure is called the base address. This is always labelled with the number 0, regardless of the actual physical address, which can change from one run of a program to another. The label appears on the line at the start of the diagram, not inside the box. Each of the boxes that appears afterwards is optionally labelled as well, using its relative distance from the base address, which is called the offset.

For other types of arrays the array is usually drawn vertically. Note that the first example shows that argv has a null value at the end because that is how it is implemented. The size of each offset is determined by the size of the datatype of the array. Pointers on 64 bit machines such as apollo use 8 bytes each. A 32 bit machine such as munro uses 4 bytes for a pointer. An embedded controller such as an Arduino might use 16 bit/2 byte pointers.

Figure 6 - Diagram showing argv and a command line

Figure 4 - How Windows represents a line of text


Often an array will have a special value at the end, known as a sentinel value, that is checked by a program to prevent it from running off the end into some other block of memory. Sentinel values are values that would not be confused with actual data. null is appropriate for pointers or character strings. For an array of integers representing ages we might use a value such as -1.

Optionally you can use square brackets to indicate special locations that are of interest. These are placed next to the space, not the line. The number inside the brackets is the index, not the offset.

For your 1st set of DSDs I want you to label every line. Later on you can telegraph the pattern by giving 2-3 labels then using ellipses … to show that the pattern is continued.

Drawing Structs

Like arrays, structs are contiguous data structures. Consider the following example. The offsets of each field were calculated by the program flight2.c. I’ve added the sizes of each field as well, though this usually isn’t done in a DSD.

The base address of a struct, like that of an array, is always zero. The label inside each box is the name of the field. The offset of each field depends on the sizes of the fields before it. Typically chars are 1 byte, but if you are using Unicode² (C has a special datatype wchar_t for this) it may be larger. short values are usually 2 bytes and ints are usually 4 bytes, but this is not well defined: in some environments an int may be 2 or 8 bytes. long ints are usually 4 or 8 bytes. There is also the type long long, which is usually 8 bytes and can represent integers as large as ±2^63. floats are usually 4 bytes, doubles 8 bytes. The datatype long double may also be supported.

Figure 7 - An array of short ints with a sentinel value at the end

struct FLIGHT {
    short flightNo;
    char airline[11];
    bool onTime;
    time_t departureTime;
    char destination[7];
    int capacity;          //maximum # of passengers
    double fuelLevel;
};

The C standard does not fix the size of most data types (only char is defined to have a size of 1). The rest depends on the compiler and the target machine, but we can check it by using the sizeof operator³. With some experience we can also make educated guesses. sizeof can be used with a scalar variable, an array or a type declaration. The result is of type size_t.

/* %zu is the conversion for a size_t value */
fprintf(stdout, "The size of an int is: %zu\n", sizeof(int));                /* see footnote 4 */
fprintf(stdout, "The size of x is: %zu\n", sizeof x);
fprintf(stdout, "The size of struct FLAG is: %zu\n", sizeof(struct FLAG));   /* see footnote 5 */
fprintf(stdout, "The size of flag is: %zu\n", sizeof(flag));
fprintf(stdout, "The size of myArray is: %zu\n", sizeof myArray);
fprintf(stdout, "The size of an element of myArray is: %zu\n", sizeof myArray[0]);

2 Unicode is a 2-4 byte standard for representing symbols and non-Latin characters such as Arabic, Korean and Chinese. Handling Unicode is a specialized topic that is outside the scope of our course and uses routines such as fwscanf and fwprintf for I/O. Personally I’ve never had to use it, but handling Unicode is an important topic nonetheless! The problem is handled more transparently in higher level languages such as Java.

3 sizeof is really a built-in operator like + * - % and /, in spite of it being written in letters and not as a special character! Parentheses are required when the operand is a type name, as in sizeof(int), and optional when it is a variable, as in sizeof x. With the parentheses it looks like a function, but it’s not, and you won’t find a man page entry for it!

4 Copying code from a Word document: don’t, or proceed with caution. There are two problems. The first is that Word uses a different encoding for both single and double quotes than is used for C. “ ” and ‘ ’ not only look different from " and ' but they are not recognized by the C compiler. The other is that Microsoft uses \r\n (carriage return, newline) to represent the end of a line but Unix uses only \n. As such your code may not compile properly.

5 Because struct FLAG is a type name (and 2 words, not one), parentheses are needed. They also make it more readable.

Figure 8-The DSD shows the offsets in memory for each of the member variables, relative to the base address which is always shown as zero.


Referring back to Figure 8, you may have noticed that the sum of all the field sizes is a bit less than the size of the whole. onTime is only 1 byte, but the next field, departureTime, starts at position 16, not position 14! There are 2 bytes that are unused, though it’s not immediately clear which of the 3 bytes is used for the boolean value. destination starts at position 24 and is only 7 bytes long, but capacity starts at position 32, so there’s another wasted byte (3 so far). capacity is 4 bytes long, but instead of starting at location 36 the last field, fuelLevel, starts 4 bytes later at location 40. The whole data structure takes up 48 bytes, 7 of which are wasted and unused! If you had 100 million records, each 48 bytes long, 14.6% of your disk space would be wasted, or 0.7 gigabytes of storage!

The reason for this is that the compiler is optimizing the code for speed. In assembly language the LOAD instruction likes to start on a word boundary, which depends on the word size of the machine. Apollo is a 64 bit machine, meaning the word size is 8 bytes long. If an int, long, double or float starts at a misaligned address (one that is not a multiple of its size) it can take 2 fetches to load the value into a register. That would mean that your code would be slightly bigger and would run slightly slower. By wasting a bit of space the compiler, by default, optimizes for speed.

But what if you are generating a lot of data and you don’t mind the execution performance hit? If the data is moving around on a network, smaller data packets might make your program faster. The answer is to use the gcc compiler option -fpack-struct. This optimizes the compiled code for space as opposed to speed. The data structure diagram now looks like this:

This is one of the reasons that data structure diagrams are so useful. The data structure declaration alone doesn’t tell us the whole story. The offsets of the various fields can be different depending on whether the code was compiled with a 32 bit or 64 bit compiler⁶, and whether the code was compiled with -fpack-struct.

6 Of significant interest, the datatype time_t is 4 bytes on a 32 bit machine like munro. That means that on Jan 18 2038 at 10:14:07 PM EST, less than 20 years from now, we would literally run out of time. Newer systems like apollo use 64 bits/8 bytes for time_t values. That gives us a bit more breathing space: about 292 billion years.

The year 2000 was noted for some major system reviews as many systems designed between 1960 and 1980, to save on storage only stored the last 2 digits of the year. This was called the “Y2K Problem”. The 1st result was a major surge in IT hires and hundreds of millions of $s spent checking old code, especially financial, navigation and military systems that depended on the date. Lots of people made a very good living at this. In many cases it was cheaper to write new software systems and purchase new hardware. Disasters were predicted but actual problems were few. The 2nd effect was an IT industry slump that lasted from 2001 to about 2007. The extra staff were let go and managers, who saw the preventative exercise as a cost with little return to the bottom line, cut costs by reducing their IT budgets.

Expect the same cycle to happen again starting around 2032.

Figure 9 - Compressed struct, optimized by the -fpack-struct compile option to be space efficient


Drawing Linked Values

A link is shown by drawing an arrow from a pointer to the data being pointed to. The first example we used in class illustrated the relationship between argv and the command line as it is loaded into memory whenever any program is executed. The second application is for linked lists, which are usually drawn as a series of data packets.

Concluding (?) Remarks

Is there more to DSDs? Just a little, but we’ve covered the essentials. One can also have arrays of structs, doubly linked lists, and pointers to different types of data, including pointers to pointers. DSD drawing notation is fairly simple, but the full diagram can get fairly complicated and one has to use one’s judgment as to how much detail to provide. With complex drawings do a rough sketch 1st, then decide how you would reposition things to make the outline clearer.

Afterword

Using od to look at the data

od (octal dump) is a Unix utility that allows you to look at data that’s been written to a file. The easiest option is -c, which displays the data as ASCII characters. Some special characters such as null, alert, backspace, tab, newline, vertical tab, formfeed (skip to the next page) and carriage return are displayed using \0, \a, \b, \t, \n, \v, \f and \r. Other non-printable characters are shown using their octal equivalent.

-bash-3.2$ od -c asciiDump #The characters 0-255 were placed in the file


0000000  \0 001 002 003 004 005 006  \a  \b  \t  \n  \v  \f  \r 016 017
0000020 020 021 022 023 024 025 026 027 030 031 032 033 034 035 036 037
0000040       !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
0000060   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
0000100   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
0000120   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
0000140   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
0000160   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ 177
0000200 200 201 202 203 204 205 206 207 210 211 212 213 214 215 216 217
0000220 220 221 222 223 224 225 226 227 230 231 232 233 234 235 236 237
0000240 240 241 242 243 244 245 246 247 250 251 252 253 254 255 256 257
0000260 260 261 262 263 264 265 266 267 270 271 272 273 274 275 276 277
0000300 300 301 302 303 304 305 306 307 310 311 312 313 314 315 316 317
0000320 320 321 322 323 324 325 326 327 330 331 332 333 334 335 336 337
0000340 340 341 342 343 344 345 346 347 350 351 352 353 354 355 356 357
0000360 360 361 362 363 364 365 366 367 370 371 372 373 374 375 376 377

The numbers on the left represent the offsets from the beginning of the file, measured in octal. You can change the display to decimal or hexadecimal by using the flag -A followed by one of o, d, x or n, meaning octal (the default), decimal, hex or none.

You can also specify the width of the line using the -w flag followed by a number, but the number should be a multiple of 4. The -c option is used for looking at text.

If the file consists only of numbers of a given type you can display that as well.

od -s shortData.dat
od -i intData.dat
od -t dL longData.dat
od -f floatData.dat
od -F doubleData.dat

One can also add x to the option to look at the file as hexadecimal values.

od -sx shortData.dat

Examining mixed data structures can be difficult, especially if -fpack-struct has been used. Usually one just looks for strings and checks that the other fields are the correct number of bytes. The options -j (skip bytes) and -N (number of bytes to process) can be used to decode individual fields.

The following bash shell script uses od to dump the first record of a file of unpacked flights stored in binary form, airline.dump. This data was generated using the program flight2.c.


Each line in the script displays an individual field. The -An flag suppresses the address and is followed by a flag reflecting the data type of the field. -j nn skips to the appropriate byte and -N nn limits the output to the size of the field.

#!/bin/bash
#File: flightDump
file=airline.dump

echo 'flightNo: ' `od -An -s -j0 -N2 $file`        #display the flightNo
echo 'airLine: ' `od -An -c -j2 -N11 $file`        #display the airline
echo 'ontime: ' `od -An -c -j13 -N1 $file`         #display onTime
when=`od -An -t dL -j16 -N8 $file`                 #extract the number for the date
echo 'departureTime: ' $when `date -d"Jan 1, 1970 EST + $when seconds"`
echo 'destination: ' `od -An -c -j24 -N7 $file`    #extract destination
echo 'capacity: ' `od -An -i -j32 -N4 $file`       #extract capacity
echo 'fuelLevel: ' `od -An -t f8 -j40 -N8 $file`   #extract fuelLevel

The script is very specific to an unpacked FLIGHT data structure. It would have to be rewritten to handle a packed FLIGHT. It only works for the first record; however, it could be modified so that the offset of any record is calculated.