Information Management

Lecture 3: Cataloging, Indexing, Searching

J. Michael Moshell

University of Central Florida

Original image* by Moshell et al.

-2 -

Cataloging and Indexing

Why are we discussing this?

www.joe-ks.com

I don't believe in memorizing a bunch of soon-obsolete facts.

I DO believe that many of you will have to solve info-management problems.

You will probably invent ways of doing it.

So you should "steal from the best" – not reinvent the wheelbarrow.

-3 -

How do we find things?

1) By starting in the neighborhood of similar things.

2) By using the name of the thing, and asking an "expert" or "resource".

-4 -




When reading a book:

Look in the table of contents, for an ARTICLE.

Look in the index, for a TOPIC.

-5 -




At the library:

Go to the relevant section, browse shelves.

Use the (card) catalog (really an index).

-6 -




On the Internet:

Follow links from trusted sources (like cnet).

Use the indexes, e.g.

• those provided by search engines

• those provided by vendors (eBay, Amazon...)

• those provided by facilitators (YouTube, craigslist)

-7 -

What's an index?

• An index is a system that serves to optimize speed in finding relevant documents in a search.

• An index is a system that, given one or more search terms from either metadata or essence, efficiently reports the location of the essence.

What's fast? What's efficient?

Here comes some math ... (how we all love it!)

-8 -

Order statistics

A document contains k records. (perhaps k=1000).

If you must examine EACH RECORD to find what you seek,

the search is Order-k (written as O(k).)

For ancient records, this is usually the only way.

For instance, the Archivo General de Indias in Seville, Spain

www.learningcurve.gov.uk

-9 -

On the average, you would look at 500 records (0.5*k) to

find the one you are seeking.

Let's say we seek a ship named Nuestra Senora de Atocha
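Here is a minimal Python sketch of that O(k) record-by-record scan. The archive contents are invented for illustration (only the ship name comes from the slides), and `linear_search` is a hypothetical helper name.

```python
def linear_search(records, target):
    """Scan records one by one; return (position, number of records examined)."""
    examined = 0
    for position, name in enumerate(records):
        examined += 1
        if name == target:
            return position, examined
    return None, examined  # not found: every record was examined


# Hypothetical archive of k = 1000 unsorted records (names are invented).
records = [f"Ship {i}" for i in range(1000)]
records[613] = "Nuestra Senora de Atocha"

position, examined = linear_search(records, "Nuestra Senora de Atocha")
print(position, examined)  # 613 614

# Average cost over every possible target: about 0.5 * k.
average = sum(linear_search(records, name)[1] for name in records) / len(records)
print(average)  # 500.5
```

The average comes out at about half of k because the target is equally likely to sit anywhere in the pile.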

-10 -

Indexing

To prepare an index of all ships' names, captains' names,

owners and dates in the archive, it would take O(k) time to read the records

(sorting the index costs extra; more on that later).

Why? Because every document would be visited.

Each index item contains a SEARCH TERM and a DOCUMENT NUMBER.

BUT now (if the index is sorted, which it is) we can

find S = Nuestra Senora de Atocha much faster, by playing

"binary search".

[Figure: a sorted index running from A to Z; at each step we ask "Is S > this entry?"]
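The idea of a sorted index of (search term, document number) items can be sketched in Python as follows. This is an illustration under stated assumptions, not the archive's real system: the document numbers and the other ship names are invented, and `lookup` is a hypothetical helper built on the standard-library `bisect` module.

```python
from bisect import bisect_left

# Hypothetical documents: (document number, ship name). The numbers and all
# names except the one used in the slides are invented.
documents = [
    (101, "Santa Margarita"),
    (102, "Nuestra Senora de Atocha"),
    (103, "San Jose"),
    (104, "Concepcion"),
]

# Each index item is (SEARCH TERM, DOCUMENT NUMBER), kept in sorted order.
index = sorted((name, doc_number) for doc_number, name in documents)
terms = [term for term, _ in index]


def lookup(term):
    """Binary-search the sorted index; return the document number or None."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return index[i][1]
    return None


print(lookup("Nuestra Senora de Atocha"))  # 102
print(lookup("Mary Celeste"))              # None
```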

-12 -

Indexing and binary Search

1 comparison distinguishes 2 records

2 comparisons distinguish 4 records

3 comparisons distinguish 8 records ...

10 comparisons distinguish 1024

20 comparisons distinguish over a million records.

[Figure: sorted index from A to Z] Each comparison cuts the search space in half.

-13 -


So binary search takes O(log k) comparisons.
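To see where the "10 comparisons for 1024 records" figure comes from, here is a small Python sketch that simply counts how many times a pile of k records can be halved. The function name `halvings` is made up for this example.

```python
def halvings(k):
    """Count how many times k records can be cut in half before one remains."""
    steps = 0
    while k > 1:
        k = (k + 1) // 2  # keep the larger half, to be pessimistic
        steps += 1
    return steps


for k in (2, 4, 8, 1024, 1024 * 1024):
    print(k, halvings(k))
# 2 1
# 4 2
# 8 3
# 1024 10
# 1048576 20
```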

-14 -

OMG, a Log? Puleeeeez ....

Yep, this is college and you are a

DIGITAL Media Major. So here goes.

2^0 = 1

2^1 = 2

2^2 = 2*2 = 4

2^3 = 2*2*2 = 8

...

2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand

O(log k)

-15 -




2^20 = 2*2*...*2 (twenty twos) = 1024 * 1024 = 1 meg, about a million

O(log k)

-16 -




2^30 = 2*2*...*2 (thirty twos) = 1024 * 1024 * 1024 = 1 gig, about a billion

O(log k)

-17 -




log2(k)    k
1          2
2          4
3          8
10         1024
20         1 meg
30         1 gig

-18 -




You need to be able to tell me log2(k) for any k that is a power of two between

1 and 1 meg.

Example: what is log2(256k)?

256 = 2^8, and 1k = 2^10 (that is, 1024).

So 256k = 2^8 * 2^10 = 2^18: that's 2*2*...*2 (eighteen twos).

log2(256k) = 18
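If you want to check your answers while practicing, a short Python sketch like this one will do it. It assumes "1k" means 1024, as in the example above; `log2_power_of_two` is a name invented here.

```python
def log2_power_of_two(k):
    """Return log2(k) for a k that is an exact power of two."""
    if k < 1 or (k & (k - 1)) != 0:
        raise ValueError("k must be a power of two")
    return k.bit_length() - 1


K = 2 ** 10    # "1k" in the slides' sense: 1024
MEG = 2 ** 20  # "1 meg"

print(log2_power_of_two(256))      # 8
print(log2_power_of_two(K))        # 10
print(log2_power_of_two(256 * K))  # 18, because 2^8 * 2^10 = 2^(8+10)
print(log2_power_of_two(MEG))      # 20
```

The trick to remember: multiplying powers of two means adding their exponents.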

-19 -


I will provide a Logarithm Practice Sheet on the website

to help you study and practice for the midterm exam.

-20 -


Linear Search        Binary Search
1000 items           10 steps
1 million items      20 steps
1 billion items      30 steps

Each comparison cuts the search space in half: O(log2 k)

-21 -

Sorting N Objects

We will discuss sorting a bit later, after you recover from Math Anxiety.

Slcc.edu

-22 -

Why not just keep books in order?

Could you do 'binary search' directly on the books ...?

Well, WHICH order? If they're on the shelf in that order, yes.

- by ship names?

- by captains' names?

- by year of construction?

- by year of sinking or decommissioning?

An index can be sorted on any data field, then searched.

(Sorting k objects takes O(k * log k) time.

So sorting a billion objects takes about 1 billion * log2(1 billion)

= 1 billion * 30 = 30 billion steps.)
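As a rough illustration of "an index can be sorted on any data field", here is a Python sketch that builds one sorted index per field from the same records. Every record value below is invented, and `build_index` and `sort_steps` are hypothetical helper names.

```python
import math

# Hypothetical records; every value here is invented for illustration.
ships = [
    {"ship": "Cormorant", "captain": "Reyes", "built": 1620},
    {"ship": "Albatross", "captain": "Vidal", "built": 1618},
    {"ship": "Blue Heron", "captain": "Acosta", "built": 1698},
]


def build_index(records, field):
    """Return (field value, record number) pairs, sorted by field value."""
    return sorted((record[field], number) for number, record in enumerate(records))


by_ship = build_index(ships, "ship")
by_captain = build_index(ships, "captain")
by_year = build_index(ships, "built")
print(by_year)  # [(1618, 1), (1620, 0), (1698, 2)]


def sort_steps(k):
    """Rough k * log2(k) estimate of the work needed to sort k items."""
    return k * math.log2(k)


print(round(sort_steps(10 ** 9) / 1e9, 1))  # 29.9 (billion steps)
```

The books stay on the shelf in one order; only the small index items get re-sorted for each field.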

-23 -

(This can be done overnight, when computers aren't busy)

BUT once sorted, inserting new information takes O(log k) time (given a suitable index structure, such as a balanced tree).

So, you can insert a new fact into our billion-item index in

about 30 steps. Fast!
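A small Python sketch of inserting into an already-sorted index, using the standard-library `bisect` module. The index terms are invented. One caveat worth stating plainly: finding the spot is O(log k), but a plain Python list still has to shift later items to make room, so a real billion-item index would more likely use a tree-like structure to keep the whole insert near O(log k).

```python
from bisect import bisect_left, insort

# Invented index terms, already in sorted order.
index = ["Albatross", "Blue Heron", "Cormorant", "Osprey"]

new_term = "Heron"
position = bisect_left(index, new_term)  # binary search for the right spot
insort(index, new_term)                  # finds the same spot, then inserts

print(position)  # 3
print(index)     # ['Albatross', 'Blue Heron', 'Cormorant', 'Heron', 'Osprey']
```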

-24 -

What terms shall we index?

- For text, the essence yields keyword search

- The dumbest but easiest kind of search, if essence=digital text.

-25 -

- This was not true for traditional libraries.

- Nobody had time to catalog every word of every book.

- Professional catalogers had to develop techniques:

- Author

- Title

- Publication Date

- Subject

(METADATA!)

And this last one, Subject, took more work than all the rest together.

-26 -

What's so hard about subject indexing?

- The problem: restricting the vocabulary.

Let's consider a fictional book:

The Skills of a Nineteenth Century Bartender.

Henry Macintosh, New York, 1889

How might someone seek this book?

Or: what metadata fields might the librarian use?

Occupations: bartender, barkeeper, barman, barkeep

(Are there others we forgot to search for?)

So catalogers established rules involving precedent

to restrict vocabularies and establish standards
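A restricted vocabulary can be pictured as a lookup table that sends every variant to one preferred term. The Python sketch below is only an illustration: choosing "bartender" as the preferred heading is an assumption made here, and `PREFERRED_TERM` and `normalize_subject` are invented names.

```python
# Hypothetical controlled vocabulary: every variant maps to one preferred term.
# Picking "bartender" as the preferred heading is an assumption for this example.
PREFERRED_TERM = {
    "bartender": "bartender",
    "barkeeper": "bartender",
    "barman": "bartender",
    "barkeep": "bartender",
}


def normalize_subject(term):
    """Map a searcher's or cataloger's word to the preferred subject heading."""
    key = term.strip().lower()
    if key not in PREFERRED_TERM:
        raise KeyError(f"{term!r} is not in the controlled vocabulary")
    return PREFERRED_TERM[key]


print(normalize_subject("Barkeep"))  # bartender
print(normalize_subject("barman"))   # bartender
```

A word that is not in the table raises an error, which is exactly the "are there others we forgot?" problem showing up in software.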

-27 -

Cataloging an Item for a Library

The card catalog at Yale University (of course, it's all computerized now)

-28 -

Cataloging an Item for a Library

Problem #1: What book (or other object) are we talking about?

- Each item has an accession number (that's easy to issue)

- Each title has a catalog number, shared with all instances

(sometimes separate copies are called .c1, .c3 etc.)

Problem #2: What catalog number should I give this item?

- Did someone else catalog it already? If so, use that.

- If not, follow the International Standard Bibliographic Description (ISBD)

-29 -

International Standard Bibliographic Description (ISBD)

• Title

• Statement of responsibility (author or editor)

• Edition

• Material-specific details (for example, the scale of a map)

• Publication and distribution

• Physical description (for example, number of pages)

• Series (e.g. this might be part 3 of a trilogy)

• Notes

• Standard number (ISBN)

-30 -

And then follow a complex set of rules.

Most English-language cataloging follows the Anglo-American Cataloging Rules (AACR2).

German catalogers follow the Regeln für die alphabetische Katalogisierung (RAK).

Etc.

-31 -

How to organize an index

- Step 1: Deciding what fields to include (the Ontology of the subject space).

- Step 2: Deciding if each metadata field is open or controlled (CV, for controlled vocabulary).

Open set: American family names

Closed set: Chinese family names

In software, CV fields are often presented as pull-down menus.

- Step 3: Establishing the controlled vocabulary, and rules for extending it.

- Step 4: Maintaining it (e.g. MIME types and subtypes).

http://www.kksou.com
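The four steps above can be sketched in Python as a tiny record checker. This is an illustration under assumptions, not a real cataloging schema: the field names, the starting vocabulary, and the helpers `validate` and `extend_vocabulary` are all invented here.

```python
# Hypothetical field definitions (Steps 1 and 2): None means an open field,
# a set means a controlled vocabulary (CV).
FIELDS = {
    "title": None,
    "media_type": {"text", "image", "audio", "video"},
}


def validate(record):
    """Return a list of problems found in a record (empty list means valid)."""
    problems = []
    for field, allowed in FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif allowed is not None and record[field] not in allowed:
            problems.append(f"{field}={record[field]!r} is not in the controlled vocabulary")
    return problems


def extend_vocabulary(field, new_term):
    """Steps 3 and 4: add a term to a controlled field (open fields need no list)."""
    if FIELDS[field] is None:
        raise ValueError(f"{field} is an open field")
    FIELDS[field].add(new_term)


print(validate({"title": "Evening news", "media_type": "video"}))     # []
print(validate({"title": "Evening news", "media_type": "hologram"}))  # one problem
```

A pull-down menu is just this same check done in advance: the user can only pick values that are already in the set.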

-32 -

Concept: "Low-hanging fruit"

- In any new domain, some ideas will come together

that present opportunities not previously possible

- Some of them will be easy to do.

- Get these first, and you may be rich.

The cataloging of dynamic media such as video can take advantage of techniques for Content Logging.

In this area, closed captions were a low-hanging fruit.

www.recipeforlowhangingfruit.com

-33 -

Closed Captions for Content Logging

- Originally for deaf viewers ... now also for bars, etc.

- "Closed" – not all viewers will see the captions

- But they are built into most TV broadcasts.

>> Indicates a new speaker has begun to talk.

www.recipeforlowhangingfruit.com

-34 -

Closed Captions for TV

But isn't speech recognition still hard?

- Yes, but there are SCRIPTS and TELEPROMPTERS behind most TV programming. Live news feeds are a mix of scripted and unscripted material.

The BBC developed a re-speak technology to maximize clarity.

Sound effects and music are shown by # or musical-note symbols.


-35 -

Closed Captions for TV

- Now that closed captions (CC) exist, you can index them to produce metadata.

- Services monitor captions in real time for significant stories.
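Indexing captions might look roughly like the Python sketch below: each word is mapped to the time codes at which it was spoken, so a search term leads straight to the right moment in the video. The caption text and time codes are invented, and `build_caption_index` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical caption lines: (time code, caption text). All text is invented.
captions = [
    ("00:00:05", ">> Good evening. Storm warnings tonight."),
    ("00:00:12", ">> The storm is moving north."),
    ("00:01:30", ">> In sports, the home team won."),
]


def build_caption_index(lines):
    """Map each lowercase word to the sorted list of time codes where it occurs."""
    index = defaultdict(set)
    for time_code, text in lines:
        for word in text.lower().split():
            word = word.strip(".,!?>")  # drop punctuation and the >> speaker mark
            if word:
                index[word].add(time_code)
    return {word: sorted(times) for word, times in index.items()}


index = build_caption_index(captions)
print(index["storm"])  # ['00:00:05', '00:00:12']
print(index["team"])   # ['00:01:30']
```

A monitoring service could run the same loop on live captions and raise an alert whenever a watched word appears.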


-36 -

Can you think of another TV "LHF"?

Where is another source of already-in-text-form metadata

about TV program contents? (I can think of two).


-37 -

• Electronic Program Guides, such as TiVo's TV programming schedule

• Broadcasters' websites (e.g. www.cbs.com)

-38 -

We've discussed third party logging

But what about in-house logging (by the material's own producers)?

Static metadata (exists independently of the essence)

• Production Notes, including original scripts

• Edit Decision List (part of production notes)

• Advanced Authoring Format (AAF)

• News Feed rundowns (cues for local broadcasters)

Media Object Server (MOS) format

-39 -

Dynamic metadata (sampled from or derived from the essence)

A hierarchy of proxy representations:

- time code (ties it all together)

- Proxy video (low res, maybe easier to scan – or harder!)

- Keyframes (still images for pattern recognition)

- Audio transcript

- annotation – added by staff

-40 -

Speech Analysis

- Phoneme: the smallest unit of sound that distinguishes one word from another. English has about 44.

- Phone: the 'rendering' of a phoneme by an individual speaker. There are infinitely many.

- Recognition of words: difficult under good conditions,

nearly impossible under noisy conditions

However, you don't need to get ALL the words to make the

document searchable. Even getting SOME of the words is better

than none.

www.nuance.com

-41 -

Indexing things that aren't words

- Built-in metadata (e.g. digital camera data, Adobe metadata)

- Image libraries – cataloged by human beings

(We will study some of the metadata standards used.)

- Automatic pattern recognition

- http://www.autonomy.com/content/Solutions/video-surveillance/index.en.html

- Assignment: Download ONE of the "Autonomy Virage" documents, read it, and be prepared to give a one-minute summary of its claims.

-42 -

Recognizing Faces

- FINDING a face in a scene is far easier than RECOGNIZING it.

- Nikon's cameras can now find faces and focus on them.

Face-priority AF in Nikon Coolpix Cameras

But it's a rough rough world out there. The website listed

below provides a list of vendors ... many of which are 'dead

links' as companies come and go.

http://www.face-rec.org/vendors//

-43 -

And ... where do we go from here?

Go back through these slides. Make a list of the important words.

If you can write a one-sentence explanation of every word on this list, AND answer logarithm questions, you're ready for the midterm. ...

at least with regard to Searching and Pattern Recognition.

But now let's go talk about SORTING.

-44 -

Sorting

Why are we discussing this?

It's a good example of DUMB vs. SMART algorithms.

What's an algorithm?

A systematic procedure for solving a problem.

Programs are built on the basis of algorithms. But so are

* carpentry

* medical diagnosis

* electronic repair ... etc.

-45 -

Sorting and Ignorance

Two thousand name-tags

Printed in NAME order

Needed in COMPANY order

So ... they put six temps to work ... for HOURS ...

Mnddc.org

-46 -

Sorting the Hard Way

Spread 'em all on a long table.

Insert each one into the ordered pile.

Problem: The pile gets bigger and bigger,

so the insertion goes more and more slowly.

-47 -

Sorting the Hard Way

This technique takes O(n^2) steps (that's n squared).

Walk down the row (pass up to n badges), insert one.

Do this n times. You have n * n distance to walk.

2000 * 2000 = 4 million operations!
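Here is a Python sketch of the "walk down the pile and insert" method, with a counter for how many badges get passed along the way. The badges are stand-in random numbers rather than real names, and the seed is arbitrary. On random input the count lands around n*n/4 rather than the full n*n, but it still grows with the square of n, which is the point.

```python
import random


def insertion_sort(badges):
    """Sort by inserting each item into an ordered pile; count badges passed."""
    pile = []
    steps = 0
    for badge in badges:
        position = 0
        # Walk down the pile until the right spot is found.
        while position < len(pile) and pile[position] <= badge:
            position += 1
            steps += 1
        pile.insert(position, badge)
    return pile, steps


random.seed(0)  # arbitrary seed, only so the run is repeatable
badges = [random.random() for _ in range(2000)]
pile, steps = insertion_sort(badges)
print(steps)  # roughly n*n/4 for random input; never more than n*(n-1)/2
```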

-48 -

Sorting, a smart way

1. Grab 20 badges, and sort them in a small group.

Create 100 small, sorted batches.

2. Combine the batches 2 by 2, like this:

[Figure: pairs of 20-badge batches merge into batches of 40, pairs of 40s merge into batches of 80, etc.]

-49 -

Sorting, a smart way

Reminds you of binary search? Yes.

Merging twice as many groups only takes one more step (layer).

4 groups: 2 layers (3 merge operations)

8 groups: 3 layers (7 merge operations), etc.

-50 -

Sorting by 'merge-sort'

Merge-Sort requires O(n log2 n) operations to sort n objects.

For 2000 name badges, log2(2000) = log2(1000) + 1.

You recognize log2(1000) ~= log2(1k) = 10,

so log2(2000) ~= 11.

So our total estimate for sorting 2000 name badges is

approximately 2000 * 11, or 22,000 steps,

compared to 4 million steps (2000 * 2000) if doing the job the BFI (Brute Force & Ignorance) way!
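A Python sketch of merge-sort with a comparison counter, so the 22,000-step estimate can be checked. Note the assumption: this is the usual split-in-half (top-down) version, not the batches-of-20 procedure from the earlier slides, though both rely on the same merging idea. The badges are stand-in random numbers and the seed is arbitrary.

```python
import random


def merge(left, right, counter):
    """Merge two sorted lists into one; counter[0] tallies comparisons."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two still has items
    merged.extend(right[j:])
    return merged


def merge_sort(items, counter):
    """Sort by splitting in half, sorting each half, and merging the halves."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    return merge(merge_sort(items[:middle], counter),
                 merge_sort(items[middle:], counter),
                 counter)


random.seed(0)  # arbitrary seed, only so the run is repeatable
badges = [random.random() for _ in range(2000)]
counter = [0]
ordered = merge_sort(badges, counter)
print(counter[0])  # stays under the 2000 * 11 = 22,000 estimate
```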

-51 -

The moral of this story:

1) Do a little research before you undertake a major project.

An hour's investigation might save WEEKS of work, and it might save your BUSINESS.

2) Ask an expert, if you have one. Become an expert, if you don't have one.

Usfamily.net

-52 -

<<Tell the story of the stuck sailboat>>

-53 -

Seattletimes.nwnews.com
