-1 -
Information Management
Lecture 3: Cataloging, Indexing, Searching
J. Michael Moshell
University of Central Florida
Original image* by Moshell et al.
-2 -
Cataloging and Indexing
Why are we discussing this?
www.joe-ks.com
I don't believe in memorizing a bunch of soon-obsolete facts.
I DO believe that many of you will have to solve info-management problems.
You will probably invent ways of doing it.
So you should "steal from the best" – not reinvent the wheelbarrow.
-3 -
How do we find things?
1) By starting in the neighborhood of similar things.
2) By using the name of the thing, and asking an "expert" or "resource".
-4 -
How do we find things?
1) By starting in the neighborhood of similar things.
2) By using the name of the thing, and asking an "expert" or "resource".
When reading a book:
Look in the table of contents, for an ARTICLE.
Look in the index, for a TOPIC.
-5 -
How do we find things?
1) By starting in the neighborhood of similar things.
2) By using the name of the thing, and asking an "expert" or "resource".
At the library:
Go to the relevant section, browse shelves.
Use the (card) catalog (really an index).
-6 -
How do we find things?
1) By starting in the neighborhood of similar things.
2) By using the name of the thing, and asking an "expert" or "resource".
On the Internet:
Follow links from trusted sources (like CNET).
Use the indexes, e.g.
• those provided by search engines
• those provided by vendors (eBay, Amazon...)
• those provided by facilitators (YouTube, craigslist)
-7 -
What's an index?
• An index is a system that serves to optimize speed in finding relevant documents in a search.
• An index is a system that, given one or more search terms from either metadata or essence, efficiently reports the location of the essence.
What's fast? What's efficient?
Here comes some math ... (how we all love it!)
-8 -
Order statistics
A document contains k records (perhaps k = 1000).
If you must examine EACH RECORD to find what you seek,
the search is Order-k (written as O(k)).
For ancient records, this is usually the only way.
For instance, the Archivo General de Indias in Seville, Spain
www.learningcurve.gov.uk
-9 -
Order statistics
A document contains k records (perhaps k = 1000).
If you must examine EACH RECORD to find what you seek,
the search is Order-k (written as O(k)).
For ancient records, this is usually the only way.
On average, you would look at 500 records (0.5 * k) to
find the one you are seeking.
Let's say we seek a ship named Nuestra Senora de Atocha.
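To make the O(k) idea concrete, here is a minimal sketch of a record-by-record search. The archive contents are made up for illustration (1000 hypothetical ship names, with the Atocha placed in the middle); only the counting of examined records matters.

```python
def linear_search(records, target):
    """Examine records one at a time; return (position, records examined)."""
    examined = 0
    for position, record in enumerate(records):
        examined += 1
        if record == target:
            return position, examined
    return None, examined


# Hypothetical archive: 1000 invented ship names, one of which is the Atocha.
archive = [f"Ship {i:04d}" for i in range(1000)]
archive[499] = "Nuestra Senora de Atocha"

position, examined = linear_search(archive, "Nuestra Senora de Atocha")
print(position, examined)  # 499 500
```

A record that is missing from the archive costs the full k = 1000 examinations.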
-10 -
Indexing
To prepare an index of all ships' names, captains' names,
owners and dates in the archive, it would take O(k) time.
Why? Because every document would be visited.
Each index item contains a SEARCH TERM and a DOCUMENT NUMBER.
(Sorting the index costs extra; we discuss sorting later.)
BUT now (if the index is sorted, which it is) we can
find S=Nuestra Senora de Atocha much faster, by playing
"binary search".
[Figure: sorted index running from A to Z. At each step we ask: is S > this entry?]
-11 -
Indexing
If someone prepared an index of all ships, captains' names,
owners and dates in the archive, this would take O(k) time.
Why? Because every document would be visited.
BUT now (if the index is sorted, which it is) we can
find S=Nuestra Senora de Atocha much faster, by playing
"binary search".
[Figure: sorted index running from A to Z. Is S > this entry? No.]
-12 -
Indexing and binary Search
1 comparison distinguishes 2 records
2 comparisons distinguish 4 records
3 comparisons distinguish 8 records ...
10 comparisons distinguish 1024 records
20 comparisons distinguish over a million records.
[Figure: sorted index running from A to Z.]
Each comparison cuts the search space in half.
-13 -
Indexing and binary Search
1 comparison distinguishes 2 records
2 comparisons distinguish 4 records
3 comparisons distinguish 8 records ...
10 comparisons distinguish 1024 records
20 comparisons distinguish over a million records.
[Figure: sorted index running from A to Z.]
Each comparison cuts the search space in half.
O(log k)
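The index-then-search idea can be sketched as follows. This is an illustration under assumptions, not the archive's actual system: the document numbers and most ship names are invented, and each index item is a (search term, document number) pair as described on the slides.

```python
def build_index(documents):
    """Visit every document once; return a sorted list of (search term, document number)."""
    index = [(ship_name, doc_number) for doc_number, ship_name in documents.items()]
    index.sort()
    return index


def binary_search(index, term):
    """Return (document number, comparisons); document number is None if absent."""
    low, high = 0, len(index) - 1
    comparisons = 0
    while low <= high:
        middle = (low + high) // 2
        comparisons += 1
        entry_term, doc_number = index[middle]
        if entry_term == term:
            return doc_number, comparisons
        if term > entry_term:  # the "S > this?" question from the slide
            low = middle + 1
        else:
            high = middle - 1
    return None, comparisons


# Hypothetical archive: document number -> ship name.
documents = {n: f"Ship {n:04d}" for n in range(1000)}
documents[612] = "Nuestra Senora de Atocha"

index = build_index(documents)
doc_number, comparisons = binary_search(index, "Nuestra Senora de Atocha")
print(doc_number, comparisons)
```

With 1000 entries the loop never runs more than 10 times, because each pass discards half of what remains.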
-14 -
OMG, a Log? Puleeeeez ....
Yep, this is college and you are a
DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
...
2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand
O(log k)
-15 -
OMG, a Log? Puleeeeez ....
Yep, this is college and you are a
DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
...
2^10 = 2*2*...*2 = 1024
2^20 = 2*2*...*2 (twenty twos) = 1024 * 1024 = 1 meg, about a million
O(log k)
-16 -
OMG, a Log? Puleeeeez ....
Yep, this is college and you are a
DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
...
2^10 = 2*2*...*2 = 1024
2^20 = 2*2*...*2 = 1024 * 1024 = 1 meg, about a million
2^30 = 2*2*...*2 (thirty twos) = 1024 * 1024 * 1024 = 1 gig, about a billion
O(log k)
-17 -
OMG, a Log? Puleeeeez ....
Yep, this is college and you are a
DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
...
2^10 = 2*2*...*2 = 1024
2^20 = 2*2*...*2 = 1024 * 1024 = 1 meg, about a million
2^30 = 2*2*...*2 = 1024 * 1024 * 1024 = 1 gig, about a billion

log2(k)    k
1          2
2          4
3          8
10         1024
20         1 meg
30         1 gig
-18 -
OMG, a Log? Puleeeeez ....
Yep, this is college and you are a
DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
...
2^10 = 2*2*...*2 = 1024
2^20 = 1 meg
2^30 = 1 gig
You need to be able to tell me what log2(k) is for any k (a power of two) between
1 and 1 meg.
Example:
256k?
256 = 2^8 and 1k = 2^10.
So 256k = 2^8 * 2^10 = 2*2*...*2 (18 twos) = 2^18,
and log2(256k) = 18.
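One way to check such answers is to count how many times k can be halved before reaching 1. The sketch below does exactly that; the function name is made up for illustration, and it only accepts powers of two, matching the exercise.

```python
def log2_power_of_two(k):
    """Count the twos in k, where k must be a positive power of two."""
    if k < 1 or k & (k - 1) != 0:
        raise ValueError("k must be a positive power of two")
    twos = 0
    while k > 1:
        k //= 2      # remove one factor of two
        twos += 1
    return twos


K = 2 ** 10                          # 1k = 1024
print(log2_power_of_two(256 * K))    # 18
print(log2_power_of_two(K * K))      # 20 (1 meg)
```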
-19 -
OMG, a Log? Puleeeeez ....
I will provide a Logarithm Practice Sheet on the website
to help you study and practice for the midterm exam.
-20 -
Indexing and binary Search
Linear Search          Binary Search
1000 items             10 steps
1 million items        20 steps
1 billion items        30 steps
[Figure: sorted index running from A to Z.]
Each comparison cuts the search space in half.
O(log2 k)
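The table can be reproduced with a few lines of arithmetic. This is a rough sketch: it takes the linear search's worst case as k steps and the binary search's as log2(k) rounded up, which is the approximation the slides use.

```python
import math


def worst_case_steps(k):
    """Return (linear search steps, binary search steps) for k sorted items."""
    return k, math.ceil(math.log2(k))


for k in (1_000, 1_000_000, 1_000_000_000):
    linear, binary = worst_case_steps(k)
    print(f"{k:>13,} items: linear {linear:>13,} steps, binary {binary} steps")
```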
-21 -
Sorting N Objects
We will discuss sorting a bit later,
after you recover from Math Anxiety.
Slcc.edu
-22 -
Why not just keep books in order?
Could you do 'binary search' directly on the books ...?
Yes, if they're on the shelf in the order you need. But WHICH order?
- by ship names?
- by captains' names?
- by year of construction?
- by year of sinking or decommissioning?
An index can be sorted on any data field, then searched.
(Sorting k objects takes O(k * log k) time,
so sorting a billion objects takes 1 billion * log2(1 billion)
= 1 billion * 30 = 30 billion steps.)
-23 -
Why not just keep books in order?
- An index can be sorted on any data field, then searched.
(Sorting k objects takes O(k log k) time,
so sorting a billion objects takes 1 billion * log2(1 billion)
= 1 billion * 30 = 30 billion steps.)
(This can be done overnight, when computers aren't busy.)
BUT – once sorted, inserting new information takes O(log k) time
(in a tree-structured index; a plain sorted list also finds the right spot
in O(log k) steps, but must then shift the later entries).
So, you can insert a new fact into our billion-item index in
about 30 steps. Fast!
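A small sketch of "one archive, several sorted indexes" follows. The records, field names and values are invented for illustration. It uses Python's standard `bisect.insort`, which locates the insertion point by binary search; note that on a plain list the insertion itself still shifts later entries, which is why large indexes are usually kept in tree structures instead.

```python
import bisect

# Hypothetical archive records; every value here is made up.
records = [
    {"doc": 1, "ship": "Coral", "captain": "Ruiz", "year": 1610},
    {"doc": 2, "ship": "Alba", "captain": "Mora", "year": 1622},
    {"doc": 3, "ship": "Brisa", "captain": "Diaz", "year": 1598},
]


def build_index(records, field):
    """Return a list of (field value, document number), sorted on the field."""
    return sorted((record[field], record["doc"]) for record in records)


by_ship = build_index(records, "ship")
by_year = build_index(records, "year")

# A new document arrives: place its entry without re-sorting the whole index.
bisect.insort(by_ship, ("Brava", 4))

print(by_ship)
print(by_year)
```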
-24 -
What terms shall we index?
- For text, the essence yields keyword search
- The dumbest but easiest kind of search, if essence=digital text.
-25 -
What terms shall we index?
- For text, the essence yields keyword search
- The dumbest but easiest kind of search, if essence=digital text.
- This was not true for traditional libraries.
- Nobody had time to catalog every word of every book.
- Professional catalogers had to develop techniques:
- Author
- Title
- Publication Date
- Subject
(METADATA!)
And this last one, Subject, took more work than all the rest together.
-26 -
What's so hard about subject indexing?
- The problem: restricting the vocabulary.
Let's consider a fictional book:
The Skills of a Nineteenth Century Bartender.
Henry Macintosh, New York, 1889
How might someone seek this book?
Or: what metadata fields might the librarian use?
Occupations: bartender, barkeeper, barman, barkeep
(Are there others we forgot to search for?)
So catalogers established rules involving precedent
to restrict vocabularies and establish standards.
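A restricted vocabulary can be sketched as a lookup from every accepted variant to one preferred term. The choice of "bartender" as the preferred heading is an assumption made for this example, not a rule taken from any real cataloging standard.

```python
# Hypothetical controlled vocabulary: variant -> preferred subject term.
PREFERRED_TERMS = {
    "bartender": "bartender",
    "barkeeper": "bartender",
    "barman": "bartender",
    "barkeep": "bartender",
}


def normalize_occupation(term):
    """Map a searcher's word to the catalog's preferred term, or fail loudly."""
    key = term.strip().lower()
    if key not in PREFERRED_TERMS:
        raise KeyError(f"{term!r} is not in the controlled vocabulary")
    return PREFERRED_TERMS[key]


print(normalize_occupation("Barkeep"))   # bartender
```

A term outside the table raises an error instead of silently creating a new heading, which is the point of having rules for extending the vocabulary.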
-27 -
Cataloging an Item for a Library
The card catalog at Yale University (of course, it's all computerized now)
-28 -
Cataloging an Item for a Library
Problem #1: What book (or other object) are we talking about?
- Each item has an accession number (that's easy to issue)
- Each title has a catalog number, shared with all instances
(sometimes separate copies are called .c1, .c3 etc.)
Problem #2: What catalog number should I give this item?
- Did someone else catalog it already? If so, use that.
- If not, follow the International Standard Bibliographic Description (ISBD)
-29 -
International Standard Bibliographic Description (ISBD)
• title,
• statement of responsibility (author or editor),
• edition,
• material specific details (for example, the scale of a map),
• publication and distribution,
• physical description (for example, number of pages),
• series (e.g. this might be part 3 of a trilogy),
• notes,
• standard number (ISBN).
-30 -
And then follow a complex set of rules.
Most English cataloging follows the
Anglo-American Cataloguing Rules (AACR2).
Germans follow the
Regeln für die alphabetische Katalogisierung.
Etc…
-31 -
How to organize an index
- Step 1: Deciding what fields to include
(the Ontology of the subject space).
- Step 2: Deciding if each metadata field is open or controlled (CV = controlled vocabulary).
Open set: American family names
Closed set: Chinese family names
In software, CV fields are often presented as pulldown menus.
- Step 3: Establishing the controlled vocabulary, and rules for extending it.
- Step 4: Maintaining it (e.g. MIME types, subtypes).
http://www.kksou.com
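The four steps above can be sketched as a tiny schema check. Everything here is illustrative: the field names are invented, and the controlled field lists only a handful of MIME top-level types rather than the full registry.

```python
# Steps 1-2: choose the fields and mark each as open or controlled.
OPEN_FIELDS = {"family_name"}
CONTROLLED_FIELDS = {
    # Step 3: the controlled vocabulary (a partial list of MIME top-level types).
    "media_type": {"text", "image", "audio", "video", "application"},
}


def validate(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    for field, value in record.items():
        if field in CONTROLLED_FIELDS:
            if value not in CONTROLLED_FIELDS[field]:
                problems.append(f"{field}: {value!r} is not in the controlled vocabulary")
        elif field not in OPEN_FIELDS:
            problems.append(f"{field}: unknown field")
    return problems


def extend_vocabulary(field, new_term):
    """Step 4: maintenance, adding a term once the rules for extension allow it."""
    CONTROLLED_FIELDS[field].add(new_term)


print(validate({"family_name": "Moshell", "media_type": "video"}))     # []
print(validate({"family_name": "Moshell", "media_type": "hologram"}))
```

A pulldown menu is the user-interface version of the same check: it simply never offers a value outside the set.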
-32 -
Concept: "Low-hanging fruit"
- In any new domain, some ideas will come together
that present opportunities not previously possible
- Some of them will be easy to do.
- Get these first, and you may be rich.
The cataloging of dynamic media such as
video can take advantage of techniques
for Content Logging.
In this area, closed captions were
low-hanging fruit.
www.recipeforlowhangingfruit.com
-33 -
Closed Captions for Content Logging
- Originally for deaf viewers ... now also for bars, etc.
- "Closed" – not all viewers will see the captions
- But they are built into most TV broadcasts.
>> Indicates a new speaker has begun to talk.
-34 -
Closed Captions for TV
- Originally for deaf viewers ... now also for bars, etc.
- "Closed" – not all viewers will see the captions
- But they are built into most TV broadcasts.
>> Indicates a new speaker has begun to talk.
But – isn't speech recognition still hard?
- Yes – but there are SCRIPTS and TELEPROMPTERS behind
most TV programming. Live news feeds are a mix of scripted
and unscripted material.
The BBC developed a re-speak technology to maximize clarity.
Sound effects and music are shown by # or notes.
-35 -
Closed Captions for TV
- Now that CC exists, you can index it to produce metadata.
- Services monitor it in real time for significant stories.
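Indexing caption text can be sketched as an inverted index from each word to the time codes where it was spoken. The caption lines and time codes below are invented, and real caption streams need more careful parsing than this simple split on whitespace.

```python
from collections import defaultdict


def index_captions(captions):
    """Map each word to the sorted list of time codes at which it appears."""
    index = defaultdict(set)
    for time_code, text in captions:
        for raw_word in text.lower().split():
            word = raw_word.strip(".,!?>\"'")   # drop punctuation and the >> marker
            if word:
                index[word].add(time_code)
    return {word: sorted(times) for word, times in index.items()}


# Hypothetical caption stream: (time code, caption text).
captions = [
    ("00:00:05", ">> Good evening, here is the news."),
    ("00:00:09", "A storm is approaching the coast."),
    ("00:01:12", ">> The storm may arrive tonight."),
]

caption_index = index_captions(captions)
print(caption_index["storm"])   # ['00:00:09', '00:01:12']
```

The time code is what ties a hit in the text back to a position in the video.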
-36 -
Can you think of another TV "LHF"?
Where is another source of already-in-text-form metadata
about TV program contents? (I can think of two).
-37 -
Can you think of another TV "LHF"?
Where is another source of already-in-text-form metadata
about TV program contents? (I can think of two).
• Electronic Program Guides, such as
TiVo's TV programming schedule
• Broadcasters' Websites (e.g. www.cbs.com)
-38 -
We've discussed third-party logging.
But what about in-house logging (by the materials' own producers)?
Static metadata (exists independently of the essence)
• Production Notes, including original scripts
• Edit Decision List (part of production notes)
• Advanced Authoring Format (AAF)
• News Feed rundowns (cues for local broadcasters)
Media Object Server (MOS) format
-39 -
We've discussed third-party logging.
But what about in-house logging (by the materials' own producers)?
Dynamic metadata (sampled from or derived from the essence)
A hierarchy of proxy representations:
- time code (ties it all together)
- Proxy video (low res, maybe easier to scan – or harder!)
- Keyframes (still images for pattern recognition)
- Audio transcript
- annotation – added by staff
-40 -
Speech Analysis
- Phoneme: the smallest unit of sound that distinguishes one word from another. English has about 44.
- Phone: the 'rendering' of a phoneme by an individual speaker. There are infinitely many.
- Recognition of words: difficult under good conditions,
nearly impossible under noisy conditions.
However, you don't need to get ALL the words to make the
document searchable. Even getting SOME of the words is better
than none.
www.nuance.com
-41 -
Indexing things that aren't words
- Built-in metadata (e.g. digital camera data, Adobe metadata)
- Image libraries – cataloged by human beings
(We will study some of the metadata standards used.)
- Automatic pattern recognition:
http://www.autonomy.com/content/Solutions/video-surveillance/index.en.html
- Assignment: Download ONE of the "Autonomy Virage" documents,
read it and be prepared to give a one-minute summary of its claims.
-42 -
Recognizing Faces
- FINDING a face in a scene is far easier than RECOGNIZING it.
- Nikon's cameras can now find faces and focus on them.
Face-priority AF in Nikon Coolpix Cameras
But it's a rough, rough world out there. The website listed
below provides a list of vendors ... many of which are 'dead
links', as companies come and go.
http://www.face-rec.org/vendors//
-43 -
And ... where do we go from here?
Go back through these slides. Make a list of the important words.
If you can write a one-sentence explanation of every word on this list, AND answer logarithm questions, you're ready for the midterm ...
at least with regard to Searching and Pattern Recognition.
But now let's go talk about SORTING.
-44 -
Sorting
Why are we discussing this?
It's a good example of DUMB vs. SMART algorithms.
What's an algorithm?
A systematic procedure for solving a problem.
Programs are built on the basis of algorithms. But so are
* carpentry
* medical diagnosis
* electronics repair ... etc.
-45 -
Sorting and Ignorance
Two thousand name-tags
Printed in NAME order
Needed in COMPANY order
So… they put six temps to work … for HOURS…
Mnddc.org
-46 -
Sorting the Hard Way
Spread 'em all on a long table.
Insert each one into the ordered pile.
Problem: The pile gets bigger and bigger,
so the insertion goes more & more slowly.
-47 -
Sorting the Hard Way
Spread 'em all on a long table.
Insert each one into the ordered pile.
This technique takes O(n^2) – that's n squared.
2000 * 2000 = 4 million operations!
Walk down the row (pass n badges), insert one.
Do this n times. You have n * n distance to walk.
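The "walk down the row, insert one" procedure can be sketched as an insertion sort that counts every badge it walks past. The badge labels are made up. The input is chosen so that every new badge belongs at the far end of the pile, which is the longest possible walk.

```python
def insertion_sort(badges):
    """Insert each badge into an ordered pile; return (pile, badges walked past)."""
    ordered = []
    steps = 0
    for badge in badges:
        position = 0
        # Walk down the row until we reach the spot where this badge belongs.
        while position < len(ordered) and ordered[position] <= badge:
            position += 1
            steps += 1
        ordered.insert(position, badge)
    return ordered, steps


# Hypothetical badges, labeled by company.
badges = [f"Company {i:04d}" for i in range(2000)]
ordered, steps = insertion_sort(badges)
print(steps)   # 1999000
```

The exact count is n * (n - 1) / 2, about half of the slide's n * n estimate. The constant factor does not change the order: doubling n still roughly quadruples the work.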
-48 -
Sorting, a smart way
1. Grab 20 badges, and sort them in a small group.
Create 100 small, sorted batches.
2. Combine the batches 2 by 2, like this:
20 + 20 -> 40
20 + 20 -> 40
40 + 40 -> 80, etc.
-49 -
Sorting, a smart way
2. Combine the batches 2 by 2, like this:
20 + 20 -> 40
20 + 20 -> 40
40 + 40 -> 80, etc.
Reminds you of binary search? Yes.
Merging twice as many groups only takes
one more step (layer).
4 groups – 2 layers (3 operations)
8 groups – 3 layers (7 operations) etc.
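The batch-and-merge procedure can be sketched as below. This is one possible rendering, not the only way to write merge-sort: the small batches are sorted with Python's built-in `sorted` as a stand-in for "sort them in a small group", and the function and variable names are invented.

```python
def merge(left, right):
    """Combine two sorted batches into one sorted batch."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def batch_merge_sort(items, batch_size=20):
    """Sort small batches, then merge them 2 by 2; return (sorted items, layers)."""
    batches = [sorted(items[start:start + batch_size])
               for start in range(0, len(items), batch_size)]
    layers = 0
    while len(batches) > 1:
        paired = []
        for i in range(0, len(batches), 2):
            if i + 1 < len(batches):
                paired.append(merge(batches[i], batches[i + 1]))
            else:
                paired.append(batches[i])   # odd batch out waits for the next layer
        batches = paired
        layers += 1
    return (batches[0] if batches else []), layers


badges = [(i * 7919) % 2000 for i in range(2000)]   # 2000 scrambled badge numbers
ordered, layers = batch_merge_sort(badges)
print(layers)   # 7
```

With 100 starting batches, the batch count goes 100, 50, 25, 13, 7, 4, 2, 1, so seven layers of merging finish the job.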
-50 -
Sorting by 'merge-sort'
Merge-Sort requires O(n log2 n) operations to sort n objects.
For 2000 name badges, log2(2000) = log2(1000) + 1.
You recognize log2(1000) ~= log2(1k) = 10,
so log2(2000) ~= 11.
So our total estimate for sorting 2000 name badges is
approximately 2000 * 11 or 22,000 steps,
compared to 4 million steps (2000 * 2000) if
doing the job the BFI (Brute Force & Ignorance) way!
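The same estimate, written out as arithmetic. This sketch rounds log2(n) up to a whole number, as the slide does; the function name is made up.

```python
import math


def estimated_steps(n):
    """Return (brute-force steps, merge-sort steps) for sorting n objects."""
    brute_force = n * n
    merge_sort = n * math.ceil(math.log2(n))
    return brute_force, merge_sort


brute_force, merge_sort = estimated_steps(2000)
print(brute_force, merge_sort, round(brute_force / merge_sort))   # 4000000 22000 182
```

For the name badges, the smart way is roughly 180 times less work.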
-51 -
Sorting by 'merge-sort'
The moral of this story:
1) Do a little research before you undertake
a major project.
An hour's investigation might save WEEKS of work,
and it might save your BUSINESS.
2) Ask an expert, if you have one.
Become an expert, if you don't have one.
Usfamily.net
-52 -
Sorting by 'merge-sort'
The moral of this story:
1) Do a little research before you undertake
a major project.
An hour's investigation might save WEEKS of work,
and it might save your BUSINESS.
2) Ask an expert, if you have one.
Become an expert, if you don't have one.
<<Tell the story of the stuck sailboat>>
Usfamily.net
-53 -
Seattletimes.nwnews.com