arrays cs101 2012.1. chakrabarti (1-dimensional) array so far, each int, double or char variable...
TRANSCRIPT
Arrays
CS101 2012.1
Chakrabarti
(1-dimensional) array So far, each int, double or char variable
had a distinct name• Called “scalar” variables
An array is a collection of variables• They share the same name• But they have different indices in the array
Arrays are provided in C++ in two ways• “Native arrays” supported by language itself• The vector type (will introduce later)
Similar: understand one, understand both
Chakrabarti
Notation When writing series expressions in math, we
use subscripts like ai, bj etc.
For writing code, the subscript is placed in box brackets, as in a[i], b[j]
Inside […] can be an arbitrary integer expression
In C++, the first element is a[0], b[0] etc. If the array has n elements, the last element
is at position n-1, as in a[n-1] Watch against out-of-bound index errors
Chakrabarti
Why do we need arrays Print a running median of the last 1000
temperatures recorded at a sensor Tedious to declare and use 1000 scalar
variables of the form t0, t1, …, t99 Cannot easily express a computation like
“print differences between consecutive readings” as a loop computation
Want to write “ti+1 – ti”
I.e., want to access elements using index that is an integer expression
Chakrabarti
Declaring a native arraymain() { int vn = 9; int va[vn]; for (int vx = 0; vx < vn; ++vx) { va[vx] = vx * (vn - 1 - vx); } for (int vx = 0; vx < vn; ++vx) { cout << va[vx] << ", ";}cout << endl;
Number of elements in array to be created
Reserve memory for 9-element array
Lvalue: cell to which rhs value is to be written
Note: no size() or length; hang on to vn
Rvalue: access int in specified cell
Chakrabarti
Representation in memory Elements of array have
fixed size Laid out consecutively Compiler associates
array name va with memory address of first byte
To fetch va[ix], go to byte address A + ix*4 and fetch next four bytes as integer
A, A+4, A+8, A+12, …
One 32-bit int (0)
One 32-bit int (7)
One 32-bit int (12)
va starts here (address A)
va[1]
va[2]
va[0]
Chakrabarti
Sum and product of all elements
double arr[an]; // fill in suitably
double sum = 0, prod = 1;for (int ax = 0; ax < an; ++ax) { sum += arr[ax]; prod *= arr[ax];} In standard notation we would write these as
Chakrabarti
Dot-product of two vectors
double av[nn], bv[nn]; // filled in
double dotprod = 0;for (int ix = 0; ix < nn; ++ix) { dotprod += av[ix] * bv[ix];}
Chakrabarti
Cosine of the angle between two vectors Need to compute norms alongside dot proddouble av[nn], bv[nn]; // filled indouble dot=0, anorm=0, bnorm=0;for (int ix = 0; ix < nn; ++ix) { dot += av[ix] * bv[ix]; anorm += av[ix] * av[ix]; bnorm += bv[ix] * bv[ix];}double ans =dot/sqrt(anorm)/sqrt(bnorm);
Chakrabarti
Josephus problem numPeople people stand in a circle Say named 0, 1, …, numPeople-1 Start at some arbitrary person (say 0) Skip skip people Loop back from numPeople-1 to 0 if needed Throw out victim at next position Repeat until circle becomes empy Eviction order? Who survives to the end? Need to model only two states per person:
present (true) or absent (false)
Chakrabarti
Josephus solutionbool present[numPeople]; // initialize to all true (present)int victim = 0, numEvicted = 0;while (numEvicted < numPeople) { // Skip over skip present people // Evict victim} Irrespective of whether victim is present or not at
the beginning of the loop, ensure that after the skip step, victim is again a
present person
Chakrabarti
The skip stepfor (int toSkip = skip; ; victim=(victim+1)%numPeople)
{ if (present[victim]) { --toSkip; } if (toSkip == 0) { break; }} Note that present[victim] is true here If skip is large, may skip over the same
candidate victim many times Note use of for without condition to check
Chakrabarti
Evict step
present[victim] = false;++numEvicted;cout << "Evicted " << victim << endl;
Eviction takes constant time victim is now absent, but next iteration’s
skip step will take care of that
Chakrabarti
Alternative representation Are we wasting too much time skipping over
people who have already left? Alternative: instead of a bool array, maintain
a shrinking int array with person IDs On every eviction, squeeze out the evicted
ID and reduce array size by 1 (Allocated space remains same, we just don’t
use all of it)
Chakrabarti
Busy eviction
for (int cx=victim, cx<numPeople-1; ++cx) {
survivors[cx] = survivors[cx+1];}--numPeople; Cannot just swap, because order of people in survivors must not be disturbed
0 numPeoplevictim
Chakrabarti
Skipping is now trivial All person IDs in survivors is present by
construction So skipping is as simple as
victim = (victim + skip) % numPeople; Takes constant time Note that numPeople decreases by one after
each eviction
Chakrabarti
Time analysis First representation (bit vector)
• Skipping is messy• Evicting is trivial
Second representation (person ID vector)• Skipping is trivial• Evicting is messy (move data)
Which is better?• Depends on the goal of the computation
Chakrabarti
Smallest and largest elements Convention: for an empty array
• Largest element is (minus infinity)• Smallest element is + (plus infinity)• Will show how to instantiate for all numeric types
int array[an]; // suitably filled inint min = PLUS_INF, max = MINUS_INF;for(int ax=0; ax < an; ++ax){ int elem = array[ax]; min = (elem < min)? elem : min; max = (elem > max)? elem : max;}
Ties?
Chakrabarti
Position (index) of smallest element Given double array[0, …, an-1] Find index imin of smallest element Must remember current smallest too
double amin = PLUS_INF;int imin = -1; // illegal value to startfor (int ax = 0; ax < an; ++ax) { if (array[ax] < amin) { amin = array[ax]; imin = ax; }} // imin holds the answer
Chakrabarti
Swapping array elements Once we find imin, we can exchange the
element at imin with the element at 0double tmp = array[imin];array[imin] = array[0];array[0] = tmp;
This places the smallest element at slot 0 Or largest, if we wish We can keep doing this, will give us a sorted
array
Chakrabarti
Given array a[0,…,n-1] Compute array b[0,…,n-1] where
b[i] = a[0] + a[1] + … + a[i] Nested loop works but is naïve int a[n], b[n]; // vector a filledfor (int bx = 0; bx < n; ++bx) { b[bx] = 0; for (int ax = 0; ax <= bx; ++ax) { b[bx] += a[ax]; }}
Prefix (cumulative) sum
Time taken is proportional to n2
Chakrabarti
Given array a[0,…,n-1] Compute array b[0,…,n-1] where
b[i] = a[0] + a[1] + … + a[i]
int a[n], b[n];// vector a filledfor (int ix = 0; ix < n; ++ix) { b[ix] = ( b[ix-1]) + a[ix];}
Prefix sum: faster
a[0] a[1] a[2] a[3]
b[0]=a[0] a[0]+a[1] a[0]+a[1]+a[2]?
(ix == 0)? 0 :
Makes code inefficient
Chakrabarti
Applications of cumulative sum Running balance in bank account
• Deposits positive, withdrawals negative
Range sum: given array, proprocess it so that, given “query” i, j, can returna[i] + a[i+1] + … + a[j] very quickly• Answer is simply b[j] – b[i-1]
Chakrabarti
Merging sorted arrays Arrays double a[m], b[n]
• Repeated values possible
Each has been sorted in increasing order• Know sorting by repeated move-smallest-to-front
Output is array c[m+n]• Contains all elements of a[m] and b[n]• Including all repeated values• In increasing order
Example: a=(0, 2, 7, 8) b=(2, 3, 5, 8, 9) c=(0, 2, 2, 3, 5, 7, 8, 8, 9)
Chakrabarti
Merge approach Run two indexes ax on a[…] and bx on b[…]
• Also called cursors
Choose minimum among cursors, advance that cursor
Append chosen number to c[…]
0 2 7 8
2 3 5 8 90
ax
bx
2 2 3 5 7
Chakrabarti
Merge codedouble a[m], b[n], c[m+n];// a and b suitably initializedint ax = 0, bx = 0, cx = 0;while (ax < m && bx < n) { if (a[ax]<b[bx]) {c[cx++]=a[ax++];} else { c[cx++] = b[bx++];}}// one or both arrays are empty herewhile (ax < m) { c[cx++] = a[ax++]; }while (bx < n) { c[cx++] = b[bx++]; }
Chakrabarti
Another way to sort Earlier we sorted by finding the minimum
element and pulling it out to the leftmost unsorted position
Suppose array length is a power of 2 First sort segments of length 1 (already
done) Merge into sorted segments of length 2
• [0,1] [2,3] [4,5] [6, 7]
Merge into sorted segments of length 4• [0, 1, 2, 3] [4, 5, 6, 7]
Mergesort
Chakrabarti
Mergesort picture Recall that any
location of RAM can be read/written at approximately same speed
But access to hard disk best made in large contiguous chunks of bytes
Mergesort best for data larger than RAM
35 9 22 16 17 13 29 4
9 35 16 22 13 17 4 29
9 16 22 35 4 13 17 29
4 9 13 16 17 22 29 35
Chakrabarti
Time taken by mergesort Time to merge a[m] and b[n] is m+n Suppose array to be sorted is s[2p] 2p-1 merges of segments of size 1, 1 takes
time 2p
2p-2 merges of segments of size 2, 2 takes time 2p again
p merge phases each taking 2p time So total time is p 2p
Writing N=2p, total timeis N log N, optimal!
Index arithmetic and bookkeeping
slightly complicated
Chakrabarti
Searching a sorted array Given array a=(4, 9, 13, 16, 17, 22) Find the index/position of q=16 (answer: 3)
• Print -1 if q not found in a
Why do we care? E.g., income tax department database
• Array pan[numPeople] (sorted)• Array income[numPeople] (not sorted)• Each index is for one person
Find index of specific PAN Then access person’s income
Chakrabarti
Searching a sorted array Given array a=(4, 9, 13, 16, 17, 22) Find the index/position of q=16 (answer: 3)
• Print -1 if q not found in a Linear searchfor (int ax=0; ax<an; ++ax) { if (a[ax] == q) { cout << ax; break; }}if (ax==an) { cout << -1; }
More efficient to do binary search
No need to look farther because all values will be larger than q
Chakrabarti
Binary search Bracket unchecked array segment between
indexes lo and hi Bisect segment: mid = (lo + hi)/2 Compare q with a[mid] If q is equal to a[mid] answer is mid If q < a[mid], next bracket is lo…mid-1 If q > a[mid], next bracket is mid+1…hi
4 9 13 16 17 22
q=16lo hi
0 1 2 3 4 5
Chakrabarti
Binary search Terminate when lo == hi (?) Before first halving, bracket has n candidates After first halving, reduces to ~n/2 After second halving, … ~n/4 Number of halvings required is approximately
log n If array was not sorted, need to check all n To implement, need a little care with index
arithmetic• lo=2 hi=3 mid=2, lo=3 hi=4 mid=3
Chakrabarti
Binary search code
int lo = 0, hi = n-1, ans=-1;while (lo <= hi) { int mid = (lo + hi)/2; if (q < a[mid]) { hi = mid-1; } if (q > a[mid]) { lo = mid+1; } if (q == a[mid]) { ans = mid; break; }}if (ans == -1) { … }
If not found, common convention to return the negative form of the index where the
query should be inserted
Chakrabarti
Median Given a sorted array a with n elements By definition, median is a[n/2] or a[n/2+1] Similarly, percentiles What if the array is not sorted? Trivial: first sort then find middle element
• Have seen that this will take n2 or n log n time
Can we do better?• Turns out cn steps are enough for some constant
c• (Complicated)
Chakrabarti
Median Given two sorted arrays a[m] and b[n],
supposing their merge was c[m+n] Goal: find median of c without calculating it
(would take m+n time) I.e., find median in much less time
small 20
small 27 large
>
All items marked large are larger than all items
marked smallTherefore no large item
can be a medianEliminates ~(m+n)/2
elements
Lab!
a
b
m/2
n/2
Chakrabarti
Back to mergesort index arithmetic Assume array size N = 2pow
There are pow merge rounds In the pxth merge round, px=0,1,…,pow-1: There are 2pow-px sorted runs Numbered rx = 0, 1, …, rn = 2pow-px
Each run has 2px elements rxth run starts at index rx * 2px
And ends at index (rx+1)* 2px (excluded) Merge rxth and (rx+1)th runs for rx=0, 2, 4, …
Demo
Chakrabarti
Bottom up mergesort Start with
solutions to small subproblems, then combine them to form solutions of larger subproblems
Eventually solving the whole problem
35 9 22 16 17 13 29 4
9 35 16 22 13 17 4 29
9 16 22 35 4 13 17 29
4 9 13 16 17 22 29 35
Chakrabarti
Top down expression To mergesort array in positions [lo,hi) … If hi == lo then we are done, otherwise … Find m = (lo + hi)/2 Mergesort array in positions [lo, m)
• Recursive call
Mergesort array in positions [m, hi)• Recursive call
Merge these two segments
Chakrabarti
vector A vector can hold items of many types Need to tell C++ compiler which type Declare as vector<double> oneArray;vector<int> anotherArray;
Very similar to string in other respects• array.resize(10); // sets or resets size• array.size() // current number of elements• foo = array[index]; // reading a cell• array[index] = value; // writing a cell
Chakrabarti
Initialization
vector<int> array;const int an = 10;array.resize(an);for (int ax = 0; ax < an; ++ax) { array[ax] = ax * (an – ax);}
Usually, arrays are read from data files
Chakrabarti
Advantages of using vector Can store elements of any type in it E.g. vector<string> firstNames; Memory management is handled by C++
runtime library Can grow and shrink array after declaration But there’s more!
• Sorting is already provided• So is binary search• And many other useful algorithms
Demo
Chakrabarti
Sparse arrays Thus far, our arrays allocated storage for
each and every element Sometimes, can’t afford this E.g. documents Represented as vectors in
a very high-dimensional space One axis for each word in
the vocabulary, which couldhave billions of tokens
Most coordinates of most docs are zeros
cat dog
logic
Chakrabarti
Example A corpus with only two docs
• my care is loss of care• by old care done
Distinct tokens: 0=by 1=care 2=done 3=is 4=loss 5=my 6=of 7=old
Represented as token IDs, docs look like• 5 1 3 4 6 1• 0 7 1 2
In sparse notation ignoring sequence info• { 1:2, 3:1, 4:1, 5:1, 6:1 }• { 0:1, 1:1, 2:1, 7:1 }
Documents are similar if they frequently share words
Extreme compression: can instead store gaps
between adjacent token IDs
Chakrabarti
Storing sparse arrays Assign integer IDs to each axis/dimension Use two “ganged” vectors
• vector<int> dims storesdimension IDs
• vector<float> vals storescoordinate in corresponding dim
Some important operations• Find the L2 norm of a vector• Add two vectors• Find the dot product between two vectors
1
2
3
1
4
1
5
1
6
1
Chakrabarti
Normvector<int> dims; // suitably filledvector<float> vals; // suitably filledfloat normSquared = 0;for (int vx = 0, vn = vals.size();vx < vn; ++vx) {
normSquared += vals[vx] * vals[vx];}float norm = sqrt(normSquared);
Chakrabarti
Sparse sum aDims, aVals, bDims, bVals cDims, cVals Best to keep all dims in increasing order Conventionally written as
• a = { 0:2.2, 3:1.4, 11:35 } b = { 1: 5.2, 3: 1.4 }• Then c = { 0:2.2, 1:5.2, 11:35 }
Two parts• Given a sparse vector that is not ordered by dim,
how to clean it up• Given a and b sorted on dim, how to compute c
Second part first It’s just a merge!
Chakrabarti
Sum merge
0
2.2
3
-1.4
11
35
1
-5.2
3
1.4
aDims
aVals
bDims
bVals
ax
bx0
2.2
cDims
cVals
3
1
-5.2
3
11
35
Chakrabarti
Sum merge codevector<int> aDims, bDims, cDims;vector<float> aVals, bVals, cVals;// a and b filled, c emptyint ax=0, an=aDims.size(),bx=0, bn=bDims.size();
while (ax < an && bx < bn) { // three cases: // aDims[ax] < bDims[bx] // aDims[ax] > bDims[bx] // aDims[ax] == bDims[bx]}// append leftovers as before
Chakrabarti
Three cases aDims[ax] < bDims[bx]
• cDims.push_back(aDims[ax]);• cVals.push_back(aVals[ax]); ++ax;
aDims[ax] > bDims[bx]• cDims.push_back(bDims[bx]);• cVals.push_back(bVals[bx]); ++bx;
aDims[ax] == bDims[bx]• float newVal = aVals[ax] + bVals[bx];• If |newVal| is zero/small, discard, else• cDims.push_back(aDims[ax]);• cVals.push_back(newVal);• In either case, ++ax; ++bx;
Chakrabarti
Sparse dot product Again a merge Produce output to c only if dims matchvector<int> aDims, bDims; // filledvector<float> aVals, bVals; // filledfloat dotProd = 0;int ax=0, an=aDims.size(),bx=0, bn=bDims.size();
while (ax < an && bx < bn) { // again three cases}// nothing to clean up
Chakrabarti
Dot product merge: Three cases aDims[ax] < bDims[bx]
• ++ax aDims[ax] > bDims[bx]
• ++bx aDims[ax] == bDims[bx]
• dotProd += aVals[ax] * bVals[bx];• ++ax; ++bx;
Chakrabarti
Sorting sparse vectors on dim Return to the first problem Given aDims and aVals Still ganged, or meaningless But suppose aDims is not
sorted as needed whencomputing sum and dot product
Let us rewrite selection sort (move smallest to front) for aDims and aVals
“Smallest” now means in aDims, not aVals But move to front involves both
Chakrabarti
Sorting on dim, example
11 7 5 2 8
2.1 6.5 4.3 3.2 5.5
fx=0, minPos=3,dim[minPos]=2
2 7 5 11 8
3.2 6.5 4.3 2.1 5.5
fx=1, minPos=2,dim[minPos]=5
2 5 7 11 8
3.2 4.3 6.5 2.1 5.5
fx=2, minPos=2,dim[minPos]=7
fx=3, minPos=4,dim[minPos]=8
2 5 7 8 11
3.2 4.3 6.5 5.5 2.1
Chakrabarti
Sparse vector: selection sort on dimvector<int> aDims; vector<float> aVals;const int an = aDims.size();for (int fx=0; fx < an; ++fx) { int minDim = MAX_INT, minPos = -1; for (int mx=fx; mx < an; ++mx) { if (minDim > aDims[mx]) { minDim = aDims[mx]; minPos = mx; } } int tDim=aDims[minPos], tVal=aVals[minPos]; aDims[minPos]=aDims[fx]; aVals[minPos]=aVals[fx];
aDims[fx]=tDim; aVals[fx]=tVal;}
Swap minPos and fx on both aDims and aVals
Chakrabarti
Indirection/index arrays Suppose each element of names[dn]
consumes lots of bytes• E.g. it could be a string
Both sort methods we have seen involve copying elements
Implies copying all bytes representing element
Can we avoid this? Create pos[dn] such that names[pos[px]] is in increasing order as px increases
Chakrabarti
Example If names = [ z, p, a, y, b, m ] Want pos = [ 2, 4, 5, 1, 3, 0 ] In words, the ith cell of pos should contain the index
of the ith smallest element of names
i pos[i] names[pos[i]]0 2 a
1 4 b
2 5 m
3 1 p
4 3 y
5 0 z
Chakrabarti
Filling the indirection array pos[.]vector<string> names; // suitably filledvector<int> pos; // a permutationconst int nn = names.size();for (int nx=0; nx < nn; ++nx) { pos.push_back(nx); // identity}for (int fx=0; fx < nn; ++fx) { // find “smallest” string among // positions pos[fx] … pos[nn-1] // swap}
Demo
Chakrabarti
Use of indirection array Suppose each element in array data[cn]
occupies 1000 bytes• Name, home addr, employer, work addr, dob,…• How to pack all these into one record? “struct”
pos[cn] will sort without physically moving records within data[cn]
Suppose we do want to restructure data[cn] to be physically sorted by name
Can use another array tmpData[cn] What if this 2x space is not available?
Chakrabarti
Cycles in permutations
i pos[i] data[pos[i]]0 2 a
1 4 b
2 5 m
3 1 p
4 3 y
5 0 z
0
25
1
43
Chakrabarti
Moving data[…] as per pos[…]
First pretend tmpData[…] is available But we still choose to copy from data[…] to
tmpData[…] using a special schedule Restricted to the first cycle Then we realize that we don’t need tmpData!
0
25
1
43
tmpData data
z
p
a
y
b
m
a
m
z
Chakrabarti
Tracing the first cycle copy
2 4 5 1 3 00 1 2 3 4 5
z p a y b m
cycleBegin=0tmp = zcx = 0
data[0]=data[pos[0]]cx = pos[0]=2
pos
data[2]=data[pos[2]]cx = pos[2]=5
a p a y b m
a p m y b m
data[5]=tmp
pos[5]=cycleBegin
a p m y b z
Demo
Chakrabarti
Binary/range search through indirection Enumerate all elements in data
between “c” and “p”• I.e. “c” elem “p”
Answer: m, p Modify binary search to deal with
two situations• Query found in array• Query not found, between two
adjacent values in array
Status on “c”: not found, 2 And on “p”: found, 3 Sweep range between indices
data
z
p
a
y
b
m
pos
2
4
5
1
3
0
How to return two values from a
function?pair<bool, int>
Chakrabarti
Binary search extension Earlier:
while (lo <= hi) { int mid = (lo + hi)/2; if (q < a[mid]) { hi = mid-1; } if (q > a[mid]) { lo = mid+1; } if (q == a[mid]) {
ans = mid; break; } } Now, catch the lo == hi case separately
Demo
Chakrabarti
Example: names and marks Initially, given two arrays of same size cn
• names contains student names• marks contains marks awarded to student
Each index ix accesses names[ix] and his/her marks[ix]
names or marks not in any particular order Index the data by both names and marks so
that we can• Query fast for the marks obtained by a given
student• Report names of students who got a range of
marks
Chakrabarti
Solution: two indirection arrays namesPos
• names[namesPos[nx]] increases with nx• marks[namesPos[nx]] need not
marksPos• marks[marksPos[mx]] increases with mx• names[marksPos[mx]] need not
Demo