
Page 1: UTPA Computer Science

Dr. Zhixiang Chen

Professor and Chair

Department of Computer Science

UTPA Computer Science

Page 2: UTPA Computer Science

What Is Computer Science?

The systematic study of the feasibility, structure, expression, and mechanization of the methodical processes (or algorithms) that underlie the acquisition, representation, processing, storage, communication of, and access to information, whether such information is encoded in bits and bytes in a computer memory or transcribed in genes and protein structures in a human cell. The fundamental question underlying all of computing is: what computational processes can be efficiently automated and implemented? The core is about problem solving, efficient problem solving, and automated problem solving.

Page 3: UTPA Computer Science

UTPA Computer Science

Two undergraduate programs:
BSCS (broad-field major in CS), ABET accredited since Fall 2001
Computer Engineering (CMPE) program, ABET accredited since 2011, offered jointly with the EE department
Undergraduate students: over 440

Two master's graduate programs:
MS in Computer Science
MS in Information Technology
Graduate students: close to 100

Page 4: UTPA Computer Science

Career Placements

Graduates work with major companies and organizations: Microsoft, IBM, Xerox, Intel, WalMart, ExxonMobil, local businesses, independent school districts, the UTPA Computer Center, and IAG Houston.

All recent CS/CMPE graduates have found job placements or moved on to graduate school.

IBM, Microsoft, WalMart, Xerox, and other companies aggressively recruit CS students.

Page 5: UTPA Computer Science

Scholarships

O’Dell Memorial Scholarship
Computer Science Alumni Scholarship
Wal-Mart Scholarship
Xerox Scholarship
Engineering and Computer Science scholarships: Boeing and others
University scholarships: UTPA Excellence, University Scholar
Student Financial Services

Page 6: UTPA Computer Science

Graduate Studies

Many graduates move on to pursue MS or doctoral studies. For example, Laura Grabowski earned her Ph.D. at Michigan State and is now a professor in the department.

Student Research
15+ senior projects every semester
All graduate students must finish a research project or a thesis
A number of students have publications in journals and conference proceedings
The annual UTPA Computer Science Student Research Day conference (CSSRD)

Page 7: UTPA Computer Science

Best Jobs in America (www.CareerExplorer.net)

Network Systems and Data Communications Analysts: ten hottest careers rank 1; salary range $18,610 to $96,860
Computer Application Software Engineers: ten hottest careers rank 5; salary range $18,610 to $96,860
Database Administrators: ten hottest careers rank 8; salary range $16,460 to $74,390

Page 8: UTPA Computer Science

Top Five Fastest Growing IT Jobs, Now to 2014

1) Network systems and data communications analyst: expected growth rate 54.6%; middle 50 percent earned between $46,480 and $78,060.
2) Computer applications software engineer: expected growth rate 48.4%; middle 50 percent earned between $59,130 and $92,130.
3) Computer systems software engineer: expected growth rate 43%; middle 50 percent earned between $63,150 and $98,220.
4) Network and computer systems administrator: expected growth rate 38.4%; middle 50 percent earned between $63,150 and $98,220.
5) Database administrator: expected growth rate 38.2%; middle 50 percent earned between $46,260 and $73,620.

Source: U.S. Bureau of Labor Statistics

Page 9: UTPA Computer Science
Page 10: UTPA Computer Science

A Large Picture of CS

Theory
Systems and Architectures
Languages
Applications

Software = Program = Algorithms + Data Structures

Page 11: UTPA Computer Science

Sorting: Some Taste of Algorithms and Data Structures

Sorting is to arrange a list of data in either increasing or decreasing order.
Sorting is fundamental in computer science.
We shall examine quicksort.

Page 12: UTPA Computer Science

Quicksort

Quicksort is the fastest known sorting algorithm in practice. Its average running time is O(N log N). Quicksort is a divide-and-conquer recursive algorithm.

The algorithm:
If the number of elements in the input array S is 0 or 1, then return.
Pick any element v in S. This is called the pivot.
Partition S - {v} (the remaining elements in S) into two disjoint groups: S1 = {x ∈ S - {v} | x <= v} and S2 = {x ∈ S - {v} | x > v}.
Return {quicksort(S1) followed by v followed by quicksort(S2)}.
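
A minimal C++ sketch of these four steps (my illustration, not code from the slides): it builds S1 and S2 explicitly rather than partitioning in place, and it simply takes the first element as the pivot, a choice the later slides warn against.

#include <vector>
using std::vector;

// Direct transcription of the steps above: trivial base case, partition
// S - {v} into S1 (<= v) and S2 (> v), then recurse and concatenate.
vector<int> quicksort(const vector<int>& s) {
    if (s.size() <= 1) return s;            // 0 or 1 element: already sorted
    int v = s[0];                           // the pivot (illustration only)
    vector<int> s1, s2;
    for (size_t k = 1; k < s.size(); ++k) {
        if (s[k] <= v) s1.push_back(s[k]);  // S1 = {x in S - {v} | x <= v}
        else           s2.push_back(s[k]);  // S2 = {x in S - {v} | x > v}
    }
    vector<int> result = quicksort(s1);     // quicksort(S1)
    result.push_back(v);                    // followed by v
    vector<int> s2sorted = quicksort(s2);   // followed by quicksort(S2)
    result.insert(result.end(), s2sorted.begin(), s2sorted.end());
    return result;
}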

Page 13: UTPA Computer Science

Issues in Quicksort

How to choose the pivots? The choice of pivots affects the performance of quicksort a lot!

If the pivots being chosen are in sorted order, quicksort degenerates to selection sort. For instance, sort 13, 81, 43, 92, 31, 65, 57, 26, 75, 0, where the pivots we choose happen to be 0, 13, 26, 31, 43, 57, 65, 75, 81, 92.

We hope that by using the pivot, we can partition the array into two subarrays of nearly equal size.

13 81 43 92 31 65 57 26 75 0    partition using 0
0 13 81 43 92 31 65 57 26 75    partition using 13
0 13 81 43 92 31 65 57 26 75    partition using 26
0 13 26 81 43 92 31 65 57 75    partition using 31
0 13 26 31 81 43 92 65 57 75

Page 14: UTPA Computer Science

Quicksort Example

Sort 13, 81, 43, 92, 31, 65, 57, 26, 75, 0. The pivot is chosen by chance to be 65.

13 81 43 92 31 65 57 26 75 0    select the pivot (65)
13 81 43 92 31 65 57 26 75 0    partition around 65
13 43 31 57 26 0 65 81 92 75    quicksort the small part | quicksort the large part
0 13 26 31 43 57 65 75 81 92    after the recursive calls
0 13 26 31 43 57 65 75 81 92    final sorted array

Page 15: UTPA Computer Science

How to partition the array?

Quicksort recursively solves two subproblems and requires linear additional work (the partitioning step), but the subproblems are not guaranteed to be of equal size, which is potentially bad.

The reason that quicksort is fast in practice is that the partitioning step can actually be performed in place and very efficiently. This efficiency more than makes up for the lack of equal-sized recursive calls.

Page 16: UTPA Computer Science

Picking the Pivot: A Wrong Way

Using the first element as the pivot:
The choice is acceptable if the input is random. If the input is presorted or in reverse order, then the pivot provides a poor partition, because either all the elements go into S1 or they all go into S2. The poor partition happens consistently throughout the recursive calls if the input is presorted or in reverse order. The practical effect is that if the first element is used as the pivot and the input is presorted, then quicksort will take quadratic time to do essentially nothing at all.

92 81 75 65 57 43 31 26 13 0
81 75 65 57 43 31 26 13 0 92    partition using 92
81 75 65 57 43 31 26 13 0       left subarray, partition using 81
75 65 57 43 31 26 13 0 81

Page 17: UTPA Computer Science

Picking the Pivot: A Wrong Way (cont.)

Using the first element as the pivot (cont.):
Moreover, presorted input (or input with a large presorted section) is quite frequent, so using the first element as the pivot is an absolutely horrible idea and should be discarded immediately.

Choosing the larger of the first two distinct elements as the pivot:
This has the same bad properties as merely choosing the first element. Do NOT use that pivoting strategy either.

Page 18: UTPA Computer Science

Picking the Pivot: A Safe Maneuver

A safe course is merely to choose the pivot randomly. This strategy is generally perfectly safe, since it is very unlikely that a random pivot would consistently provide a poor partition. However, random number generation is generally an expensive commodity and does not reduce the average running time of the rest of the algorithm at all.

Using the median:
The median of a group of N numbers is the N/2-th largest number. The best choice of pivot would be the median of the array, since it gives two subarrays of nearly equal size (actually, their sizes can differ by at most 1). Theoretically, the median can be found in linear time and thus would not affect the running time of quicksort; but in practice, it would slow down quicksort considerably.

13 81 43 92 31 65 57 26 75 0    partition using the median, 57
13 43 31 26 0 57 81 92 65 75

Page 19: UTPA Computer Science

Picking the Pivot: Median-of-Three Partitioning

Pick three elements randomly and use the median of these three as the pivot. (But random numbers are expensive to generate.)

The common course is to use as pivot the median of the left, right, and center (position (left + right)/2) elements.

Using median-of-three partitioning clearly eliminates the bad case for sorted input (the partitions become equal in this case) and actually reduces the running time of quicksort by about 5%.

92 81 75 65 57 43 31 26 13 0
43 31 26 13 0 57 92 81 75 65    partition using 57 (the median of 92, 57, and 0)
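
A hedged C++ sketch of median-of-three selection. The helper name median3 and the convention of parking the pivot at a[right-1] (so the partition loop can skip it) follow the common textbook implementation and are assumptions, not necessarily the slides' exact code.

#include <utility>   // std::swap
#include <vector>

// Sort a[left], a[center], a[right] in place and return the median,
// which is left at position right-1 so the partition loop can skip it.
int median3(std::vector<int>& a, int left, int right) {
    int center = (left + right) / 2;
    if (a[center] < a[left])   std::swap(a[left], a[center]);
    if (a[right]  < a[left])   std::swap(a[left], a[right]);
    if (a[right]  < a[center]) std::swap(a[center], a[right]);
    std::swap(a[center], a[right - 1]);   // hide the pivot at a[right-1]
    return a[right - 1];
}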

Page 20: UTPA Computer Science

Partitioning Strategy

Objective: move all the small elements to the left part of the array and all the large elements to the right part. "Small" and "large" are, of course, relative to the pivot.

The basic idea:
Swap the pivot with the last element; thus the pivot is placed at the last position of the array.
i starts at the first element and j starts at the next-to-last element.
While i is to the left of j, we move i right, skipping over elements that are smaller than the pivot; we move j left, skipping over elements that are larger than the pivot.
When i and j have stopped, i is pointing at a large element and j is pointing at a small element.
If i is to the left of j, those elements are swapped. (The effect is to push a large element to the right and a small element to the left.)

8 1 4 9 0 3 5 2 7 6
i               j
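
A C++ sketch of this scanning step, assuming the pivot has already been swapped to the last position; the variable names and exact loop structure are illustrative.

#include <utility>   // std::swap
#include <vector>

// Partition a[left..right] around the pivot stored at a[right].
// Returns the final position of the pivot.
int partition(std::vector<int>& a, int left, int right) {
    int pivot = a[right];
    int i = left, j = right - 1;                 // i at the first element, j at the next-to-last
    while (true) {
        while (i < right && a[i] < pivot) ++i;   // skip elements smaller than the pivot
        while (j > left  && a[j] > pivot) --j;   // skip elements larger than the pivot
        if (i < j) std::swap(a[i++], a[j--]);    // push a large element right, a small one left
        else break;                              // i and j have met or crossed
    }
    std::swap(a[i], a[right]);                   // restore the pivot to its final place
    return i;
}

For the example traced on the next two slides (81 13 43 92 31 65 0 26 75 57, with the pivot 57 already at the end), partition(a, 0, 9) returns 5 and leaves 26 13 43 0 31 57 92 81 75 65.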

Page 21: UTPA Computer Science

Partitioning Strategy: Example

Input: 81 13 43 92 31 65 57 26 75 0, pivot 57.

81 13 43 92 31 65 57 26 75 0    original array (pivot 57)
81 13 43 92 31 65 0 26 75 57    swap the pivot with the last element; i starts at 81, j at 75
81 13 43 92 31 65 0 26 75 57    i stops at 81 (large); move j left
81 13 43 92 31 65 0 26 75 57    j stops at 26 (small); swap the two
26 13 43 92 31 65 0 81 75 57    move i right; i stops at 92
26 13 43 92 31 65 0 81 75 57    move j left

Page 22: UTPA Computer Science

Partitioning Strategy: Example (cont.)

Input: 81 13 43 92 31 65 57 26 75 0, pivot 57 (cont.)

26 13 43 92 31 65 0 81 75 57    j stops at 0 (small); swap with 92
26 13 43 0 31 65 92 81 75 57    move i right; i stops at 65
26 13 43 0 31 65 92 81 75 57    move j left; j stops at 31; i and j have crossed
26 13 43 0 31 57 92 81 75 65    swap the element at position i (65) with the pivot

Page 23: UTPA Computer Science

Small Arrays

For very small arrays (N <= 20), quicksort does not perform as well as insertion sort. Because quicksort is recursive, these cases occur frequently. A common solution is not to use quicksort recursively for small arrays, but instead to use a sorting algorithm that is efficient for small arrays, such as insertion sort. Using this strategy can actually save about 15% in running time; a good cutoff range is N = 10. It also avoids nasty degenerate cases, such as taking the median of three elements when there are only one or two.

Page 24: UTPA Computer Science

Quicksort C++ code

[Figure: median-of-three setup on the array a. The elements at positions left, center, and right are sorted in place, so the smallest of the three sits at left, the median at center, and the largest at right; the median is then swapped to position right-1, so a[right-1] holds the pivot.]

Page 25: UTPA Computer Science

Quicksort C++ code

[Figure: the partitioning step. With the pivot at a[right-1], i scans right from left and j scans left toward the center.] Partition the input array a[left..right] into two subarrays: one contains the small elements; the other contains the large elements.
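
The C++ code itself did not survive in this transcript, so what follows is a hedged reconstruction of what the two figures describe, in the style of the common textbook implementation; the helper names (median3, insertionSort), the cutoff value of 10, and the driver overload are assumptions rather than the slides' exact code.

#include <utility>   // std::swap
#include <vector>
using std::vector;

const int CUTOFF = 10;                            // assumed cutoff for small subarrays

// Insertion sort for a[left..right], used below the cutoff (assumed helper).
void insertionSort(vector<int>& a, int left, int right) {
    for (int p = left + 1; p <= right; ++p) {
        int tmp = a[p], j = p;
        for (; j > left && tmp < a[j - 1]; --j)
            a[j] = a[j - 1];
        a[j] = tmp;
    }
}

// Median-of-three (same idea as the earlier sketch): sort a[left], a[center],
// a[right] in place and park the median at a[right-1].
int median3(vector<int>& a, int left, int right) {
    int center = (left + right) / 2;
    if (a[center] < a[left])   std::swap(a[left], a[center]);
    if (a[right]  < a[left])   std::swap(a[left], a[right]);
    if (a[right]  < a[center]) std::swap(a[center], a[right]);
    std::swap(a[center], a[right - 1]);
    return a[right - 1];
}

// Quicksort a[left..right]: median-of-three pivot, in-place partition,
// recursion on both halves, insertion sort for small subarrays.
void quicksort(vector<int>& a, int left, int right) {
    if (left + CUTOFF <= right) {
        int pivot = median3(a, left, right);      // now a[left] <= pivot <= a[right]
        int i = left, j = right - 1;
        while (true) {
            while (a[++i] < pivot) {}             // find an element >= pivot
            while (pivot < a[--j]) {}             // find an element <= pivot
            if (i < j) std::swap(a[i], a[j]);
            else break;
        }
        std::swap(a[i], a[right - 1]);            // restore the pivot to its final place
        quicksort(a, left, i - 1);                // sort the small elements
        quicksort(a, i + 1, right);               // sort the large elements
    } else {
        insertionSort(a, left, right);
    }
}

void quicksort(vector<int>& a) { quicksort(a, 0, (int)a.size() - 1); }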

Page 26: UTPA Computer Science

Analysis of Quicksort

We will do the analysis for quicksort assuming a random pivot (no median-of-three partitioning) and no cutoff for small arrays. We take T(0) = T(1) = 1. The running time of quicksort is equal to the running time of the two recursive calls plus the linear time spent in the partition (the pivot selection takes only constant time). This gives the basic quicksort relation

T(N) = T(i) + T(N-i-1) + cN

where i = |S1| is the number of elements in S1.

Page 27: UTPA Computer Science

Analysis of Quicksort: Worst-Case Analysis

The pivot is the smallest element, all the time. Then i = 0, and if we ignore T(0) = 1, which is insignificant, the recurrence is

T(N) = T(N-1) + cN,  N > 1

We telescope, using the equation above repeatedly. Thus

T(N-1) = T(N-2) + c(N-1)
T(N-2) = T(N-3) + c(N-2)
...
T(3) = T(2) + c(3)
T(2) = T(1) + c(2)

Adding up all these equations yields

T(N) = T(1) + c * Σ_{i=2}^{N} i = O(N^2)

Page 28: UTPA Computer Science

Analysis of Quicksort: Best-Case Analysis

In the best case, the pivot is in the middle. To simplify the analysis, we assume that the two subarrays are each exactly half the size of the original. Although this gives a slight overestimate, it is acceptable because we are only interested in a Big-Oh answer.

T(N) = 2T(N/2) + cN

Divide both sides of the equation by N:

T(N)/N = T(N/2)/(N/2) + c

We will telescope using this equation:

T(N/2)/(N/2) = T(N/4)/(N/4) + c
T(N/4)/(N/4) = T(N/8)/(N/8) + c
...
T(2)/2 = T(1)/1 + c

Adding up all the equations, we have

T(N)/N = T(1)/1 + c log N

which yields T(N) = cN log N + N = O(N log N).

Page 29: UTPA Computer Science

Analysis of Quicksort: Average-Case Analysis

Recall that T(N) = T(i) + T(N-i-1) + cN.  (1)

The average value of T(i), and hence of T(N-i-1), is (1/N) Σ_{j=0}^{N-1} T(j). Equation (1) then becomes

T(N) = (2/N) Σ_{j=0}^{N-1} T(j) + cN.  (2)

If Equation (2) is multiplied by N, it becomes

N T(N) = 2 Σ_{j=0}^{N-1} T(j) + cN^2.  (3)

We telescope with one more equation:

(N-1) T(N-1) = 2 Σ_{j=0}^{N-2} T(j) + c(N-1)^2.  (4)

Subtracting (4) from (3), we obtain

N T(N) - (N-1) T(N-1) = 2 T(N-1) + 2cN - c.

We rearrange terms and drop the insignificant -c on the right, obtaining

N T(N) = (N+1) T(N-1) + 2cN.  (5)

Divide Equation (5) by N(N+1):

T(N)/(N+1) = T(N-1)/N + 2c/(N+1).  (6)

Page 30: UTPA Computer Science

Analysis of Quicksort: Average-Case Analysis (cont.)

T(N)/(N+1) = T(N-1)/N + 2c/(N+1)  (6)

Now we can telescope:

T(N-1)/N = T(N-2)/(N-1) + 2c/N
T(N-2)/(N-1) = T(N-3)/(N-2) + 2c/(N-1)
...
T(2)/3 = T(1)/2 + 2c/3

Adding them up yields

T(N)/(N+1) = T(1)/2 + 2c Σ_{i=3}^{N+1} 1/i.

The sum is (part of) the harmonic series, which is O(log N). Thus, we have

T(N)/(N+1) = O(log N),

and so T(N) = O(N log N).

Page 31: UTPA Computer Science

Huffman Codes: Some Taste of Data Compression and Decompression

Data compression and decompression are critical for fast data transmission, storage, and retrieval. Huffman coding is a classical technique for textual data compression and decompression.

Page 32: UTPA Computer Science

Huffman Codes

Consider a data file with 100K characters, which we want to store or transmit compactly. There are only 6 different characters in the file, with their frequencies shown below. We want to design binary codes for the characters to achieve maximum compression. Using a fixed-length code, we need 3 bits to represent the six characters.

         a    b    c    d    e    f
freq(K)  45   13   12   16   9    5
code 1   000  001  010  011  100  101

Storing the 100K characters requires 300K bits using this code. Can we do better?

Page 33: UTPA Computer Science

Huffman Codes

We can improve on this using variable-length codes. Motivation: use shorter codes for more frequent letters and longer codes for infrequent letters. An example is code 2 below.

         a    b    c    d    e     f
freq(K)  45   13   12   16   9     5
code 1   000  001  010  011  100   101
code 2   0    101  100  111  1101  1100

Using code 2, the file requires (1*45 + 3*13 + 3*12 + 3*16 + 4*9 + 4*5)K = 224K bits. The improvement is 25% over the fixed-length code. In general, variable-length codes can give 20-90% savings.

Page 34: UTPA Computer Science

Variable-Length Codes

In fixed-length coding, decoding is trivial. Not so with variable-length codes. Suppose 0 and 000 are the codes for x and y; what should the decoder do upon receiving 00000? We could insert special marker codes, but that reduces efficiency. Instead we consider prefix codes: no codeword is a prefix of another codeword. So 0 and 000 cannot both appear in a prefix code, but 0, 101, 100, 111, 1101, 1100 form a prefix code. To encode, just concatenate the codes for each letter of the file; to decode, extract the first valid codeword and repeat. Example: the code for "abc" is 0101100, and "001011101" uniquely decodes to "aabe".

         a    b    c    d    e     f
code 2   0    101  100  111  1101  1100
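
A small C++ sketch of prefix-code decoding with code 2; the table and the extract-the-first-valid-codeword rule come from the slide, while the map-based lookup is just one convenient way to express it.

#include <iostream>
#include <map>
#include <string>

int main() {
    // code 2 from the slide
    std::map<std::string, char> code = {
        {"0", 'a'}, {"101", 'b'}, {"100", 'c'},
        {"111", 'd'}, {"1101", 'e'}, {"1100", 'f'}};

    std::string bits = "001011101", decoded, buffer;
    for (char bit : bits) {
        buffer += bit;                        // grow the candidate codeword
        auto it = code.find(buffer);
        if (it != code.end()) {               // a prefix code: the first match is the codeword
            decoded += it->second;
            buffer.clear();
        }
    }
    std::cout << decoded << '\n';             // prints "aabe"
}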

Page 35: UTPA Computer Science

Tree Representation

Decoding is best represented by a binary tree, with the letters as leaves. The code for a letter is the sequence of bits on the path from the root to that leaf.

An optimal tree must be full: each internal node has two children. Otherwise we can improve the code.

Page 36: UTPA Computer Science

Measuring Optimality

Let C be the alphabet and let f(x) be the frequency of a letter x in C. Let T be the tree for a prefix code, and let d_T(x) be the depth of x in T. The number of bits needed to encode our file using this code is

B(T) = Σ_{x ∈ C} f(x) d_T(x)

We want a T that minimizes B(T).

Page 37: UTPA Computer Science

Huffman's Algorithm

Initially, each letter is represented by a singleton tree; the weight of the tree is the letter's frequency. Huffman's algorithm repeatedly chooses the two smallest trees (by weight) and merges them. The new tree's weight is the sum of the two children's weights. If there are n letters in the alphabet, there are n-1 merges.

Pseudo-code:
build a heap Q on C;
for i = 1 to n-1 do
    z = a new tree node;
    x = left[z] = DeleteMin(Q);
    y = right[z] = DeleteMin(Q);
    f[z] = f[x] + f[y];
    Insert(Q, z);
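
A hedged C++ sketch of the pseudo-code above, using std::priority_queue as the heap Q; the node layout and names are illustrative.

#include <queue>
#include <utility>
#include <vector>

struct Node {
    int freq;
    Node* left = nullptr;    // 0-branch
    Node* right = nullptr;   // 1-branch
    char letter = 0;         // meaningful only for leaves
};

struct ByFreq {              // comparator for a min-heap: smallest frequency on top
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Build the Huffman tree: n-1 merges for an alphabet of n letters.
Node* huffman(const std::vector<std::pair<char, int>>& freqs) {
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> q;
    for (const auto& [c, f] : freqs) q.push(new Node{f, nullptr, nullptr, c});
    while (q.size() > 1) {
        Node* x = q.top(); q.pop();                  // DeleteMin twice: the two lightest trees
        Node* y = q.top(); q.pop();
        q.push(new Node{x->freq + y->freq, x, y});   // merged tree, weight = sum of children
    }
    return q.top();                                  // root of the Huffman tree
}

On the example frequencies (a:45, b:13, c:12, d:16, e:9, f:5), the merges produce intermediate weights 14, 25, 30, 55, and finally 100.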

Page 38: UTPA Computer Science

Illustration

Show the steps of Huffman's algorithm on our example. [Figure: the successive merges of the two lowest-weight trees for the example frequencies.]

Page 39: UTPA Computer Science

Analysis of Huffman

The running time is O(n log n): initial heap building plus n heap operations. We now prove that the prefix code generated is optimal. Huffman's is a greedy algorithm, and we use the standard swapping argument.

Lemma: Suppose x, y are the two letters of lowest frequency. Then there is an optimal prefix code in which the codewords for x and y have the same (maximum) length and differ only in the last bit.

Page 40: UTPA Computer Science

Correctness of Huffman

Let T be an optimal tree and let b, c be two sibling letters at maximum depth. Assume f(b) <= f(c) and f(x) <= f(y). Then f(x) <= f(b) and f(y) <= f(c). Transform T into T' by swapping x and b. Since d_T(b) >= d_T(x) and f(b) >= f(x), the swap does not increase the frequency-times-depth cost; that is, B(T') <= B(T). Similarly, we next swap y and c to obtain T''. If T was optimal, so must be T''. Thus, the greedy merge done by Huffman's algorithm is correct.

Page 41: UTPA Computer Science

Correctness of Huffman

The rest of the argument follows by induction. When x and y are merged, we pretend a new character z arises, with f(z) = f(x) + f(y). Compute the optimal code/tree for these n-1 letters, C ∪ {z} - {x, y}. Then attach two new leaves to the node z, corresponding to x and y.

Page 42: UTPA Computer Science

RSA Public-Key Cryptosystem: Some Taste of Computer Security

The importance of computer security is obvious: no one wants an insecure computer system. A public-key cryptosystem is something of a miracle: even though the encryption key is public, nobody (besides the sender and the receiver) can decipher the encrypted message.

Page 43: UTPA Computer Science

Hard Problems

Some problems are hard to solve: no polynomial-time algorithm is known. Examples include NP-hard problems such as machine scheduling, bin packing, 0/1 knapsack, and finding the prime factors of an n-digit number.

Is this necessarily bad? No!

Data encryption relies on problems that are difficult to solve.

Page 44: UTPA Computer Science

Cryptography

[Figure: the sender runs the encryption algorithm on the message with the encryption key, the result travels over the transmission channel, and the receiver runs the decryption algorithm with the decryption key to recover the message.]

Page 45: UTPA Computer Science

Public Key Cryptosystem (RSA)

A public encryption method that relies on a public encryption algorithm, a public decryption algorithm, and a public encryption key.

Using the public key and encryption algorithm, everyone can encrypt a message.

The decryption key is known only to authorized parties.

Page 46: UTPA Computer Science

Public Key Cryptosystem (RSA)

p and q are two prime numbers.
n = pq
m = (p-1)(q-1)
a is such that 1 < a < m and gcd(m, a) = 1.
b is such that (ab) mod m = 1.
a is computed by generating random positive integers and testing gcd(m, a) = 1 using the extended Euclid's gcd algorithm.
The extended Euclid's gcd algorithm also computes b when gcd(m, a) = 1.
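
A C++ sketch of the extended Euclidean algorithm, which both checks gcd(m, a) = 1 and yields b with (ab) mod m = 1; the recursive formulation and function names are my own, not code from the slides.

#include <tuple>

// Returns (g, x, y) with g = gcd(a, b) and a*x + b*y = g.
std::tuple<long long, long long, long long> extendedGcd(long long a, long long b) {
    if (b == 0) return {a, 1, 0};
    auto [g, x, y] = extendedGcd(b, a % b);
    return {g, y, x - (a / b) * y};              // back-substitute
}

// If gcd(m, a) = 1, the decryption exponent b is the inverse of a modulo m.
long long modInverse(long long a, long long m) {
    auto [g, x, y] = extendedGcd(a, m);
    return g == 1 ? ((x % m) + m) % m : -1;      // -1 signals "no inverse; pick another a"
}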

Page 47: UTPA Computer Science

RSA Encryption And Decryption

Message M < n.
Encryption key = (a, n).
Decryption key = (b, n).
Encrypt: E = M^a mod n.
Decrypt: M = E^b mod n.
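
A toy end-to-end C++ example using square-and-multiply modular exponentiation. The concrete numbers (p = 61, q = 53, so n = 3233, m = 3120, a = 17, b = 2753, M = 65) are a standard small illustration I am adding, not values from the slides; real RSA uses primes that are hundreds of digits long.

#include <cstdint>
#include <iostream>

// Square-and-multiply modular exponentiation: base^exp mod n.
std::uint64_t modpow(std::uint64_t base, std::uint64_t exp, std::uint64_t n) {
    std::uint64_t result = 1;
    base %= n;
    while (exp > 0) {
        if (exp & 1) result = result * base % n;
        base = base * base % n;
        exp >>= 1;
    }
    return result;
}

int main() {
    const std::uint64_t n = 3233, a = 17, b = 2753;  // toy keys: (a,n) public, (b,n) private
    std::uint64_t M = 65;                            // message, must satisfy M < n
    std::uint64_t E = modpow(M, a, n);               // encrypt: E = M^a mod n
    std::uint64_t D = modpow(E, b, n);               // decrypt: M = E^b mod n
    std::cout << "E = " << E << ", decrypted = " << D << '\n';  // E = 2790, decrypted = 65
}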

Page 48: UTPA Computer Science

Breaking RSA

Factor n to determine p and q (n = pq).
Now determine m = (p-1)(q-1).
Now use Euclid's extended gcd algorithm to compute gcd(m, a); b is obtained as a byproduct.
The decryption key (b, n) has been determined!

Page 49: UTPA Computer Science

Security Of RSA

The security of RSA relies on the fact that prime factorization is computationally very hard. Let k be the number of bits in the binary representation of n. No algorithm polynomial in k is known to find the prime factors of n. Try to find the factors of a 100-bit number!

Page 50: UTPA Computer Science

PageRank: A Taste of Web Technology

The Web has become part of our daily life experience; it shows we are in the age of computing civilization. The success of the Web relies on an army of computer science technologies, and ranking web pages for any given query is one of the fundamental ones.

Page 51: UTPA Computer Science

What is PageRank

PageRank is the core technology behind Google. PageRank was proposed in order to measure the relative importance of web pages. It is a method for computing a ranking for every web page based on the graph of the web.

Page 52: UTPA Computer Science

Related Work and Problems

Backlink counts. Problem: for example, if a web page has a link off the Yahoo home page, it may be just one link, but it is a very important one. This page should be ranked higher than many pages with more backlinks from obscure places.

The ranks and numbers of backlinks. This covers both the case where a page has many backlinks and the case where a page has a few highly ranked backlinks.

Page 53: UTPA Computer Science

The Formula

Let u be any web page, let B(u) be the set of pages that point to u, let C(w) be the number of links out from a page w, and let d be a factor used for normalization. A simplified version of PageRank is

PR(u) = d * Σ_{w ∈ B(u)} PR(w) / C(w)

Page 54: UTPA Computer Science

Some Illustration

Page 55: UTPA Computer Science

Rank Sink

Problem: links may form a rank sink. Consider two web pages that point to each other but to no other page, and suppose some web page points to one of them. Then, during iteration, this loop will accumulate rank but never distribute any rank. The loop forms a sort of trap called a rank sink.

Page 56: UTPA Computer Science

Link Structure of the Web

Pages are nodes; links are edges (outedges and inedges).

Every page has some forward links (outedges) and backlinks (inedges). We can never know whether we have found all the backlinks of a particular page, but if we have downloaded it, we know all of its forward links at that time. PageRank handles both cases, and everything in between, by recursively propagating weights through the link structure of the web.

Page 57: UTPA Computer Science

Definition of PageRank

We assume page A has pages T1, ..., Tn which point to it. The parameter d is a damping factor which can be set between 0 and 1 (usually d is set to 0.85). Also, C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

PR(A) = (1-d) + d*(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Page 58: UTPA Computer Science

[Figure: pages T1 (PR = 0.5, 3 outgoing links), T2 (PR = 0.3, 4 outgoing links), and T3 (PR = 0.1, 5 outgoing links) all link to page A.]

PR(A) = (1-d) + d*(PR(T1)/C(T1) + PR(T2)/C(T2) + PR(T3)/C(T3))
      = 0.15 + 0.85*(0.5/3 + 0.3/4 + 0.1/5)
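
Carrying out the arithmetic: 0.5/3 ≈ 0.1667, 0.3/4 = 0.075, and 0.1/5 = 0.02, so PR(A) ≈ 0.15 + 0.85 * 0.2617 ≈ 0.372.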

Page 59: UTPA Computer Science

PageRank vs. Eigenvectors

Let A be a square matrix with the rows and columns corresponding to web pages. Let A_{u,v} = 1/N_u if there is an edge from u to v, and A_{u,v} = 0 if not, where N_u is the number of forward links of page u. If we treat R as a vector over web pages, then we have

R = d(AR + (1-d)E)

Here E is a uniform vector. Since ||R||_1 = 1, we can rewrite this as

R = d(A + (1-d)(E × 1))R

So R is an eigenvector of the matrix d(A + (1-d)(E × 1)), with eigenvalue 1.

Page 60: UTPA Computer Science

Dangling Links

Dangling links are simply links that point to a page with no outgoing links. They affect the model because it is not clear where their weights should be distributed, and there are a large number of them. Because they do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated. After all the PageRanks are calculated, they can be added back in without affecting things significantly.

Page 61: UTPA Computer Science

Implementation

1. Sort the link structure by parent ID.
2. Remove dangling links from the link database.
3. Make an initial assignment of the ranks.
4. Allocate memory for the weights of every page.
5. After the weights have converged, add the dangling links back in and recompute the rankings.
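
A minimal C++ sketch of the iteration these steps describe, on a toy four-page link structure; the adjacency list, the damping factor of 0.85, and the convergence threshold are illustrative, and the dangling-link handling of steps 2 and 5 is omitted for brevity.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // outlinks[u] lists the pages that page u points to (a toy 4-page web)
    std::vector<std::vector<int>> outlinks = {{1, 2}, {2}, {0}, {0, 2}};
    const int n = outlinks.size();
    const double d = 0.85;

    std::vector<double> pr(n, 1.0), next(n);        // step 3: initial assignment of the ranks
    for (int iter = 0; iter < 100; ++iter) {
        std::fill(next.begin(), next.end(), 1.0 - d);
        for (int u = 0; u < n; ++u)                 // distribute d * PR(u)/C(u) along each outlink
            for (int v : outlinks[u])
                next[v] += d * pr[u] / outlinks[u].size();
        double change = 0;
        for (int v = 0; v < n; ++v) change += std::fabs(next[v] - pr[v]);
        pr = next;
        if (change < 1e-10) break;                  // the weights have converged
    }
    for (int v = 0; v < n; ++v)
        std::cout << "PR(" << v << ") = " << pr[v] << '\n';
}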