This week we learned about natural language and controlled vocabularies.  Natural language is the language that we use to speak and write.  It is typically used in three ways for information representation and retrieval:

  • Use terms taken from titles, topic sentences, and other important components of a document for information representation
  • Use terms derived from any part of a document for information representation
  • Use words or phrases extracted directly from people's questions for query representation

There are two kinds of words in natural language: significant words and function words (articles, prepositions, and conjunctions).  Function words are typically placed on a stop list, a list of terms that are too general to be suitable for representation.
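To make the stop list idea concrete, here is a minimal Python sketch of filtering function words out of a title.  The stop list contents and the function name are assumptions made for illustration; real systems use much longer lists.

```python
# Hypothetical stop list of function words (articles, prepositions, conjunctions).
STOP_LIST = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "or", "but"}


def significant_terms(text):
    """Return the lowercased words of `text` that are not on the stop list."""
    # Strip surrounding punctuation so "States." and "states" match.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_LIST]


print(significant_terms("The history of cataloging in the United States"))
```

Only the significant words survive, and those are the terms that would be used to represent the document.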

Controlled vocabulary is an artificial language in which syntax, semantics, and pragmatics are limited and defined.  Terms in controlled vocabularies are selected by using the principles of literary warrant or user warrant.  Literary warrant means that terms are chosen from the existing literature.  User warrant means that a term must have been used by users in the past in order to be included.

There are three major types of controlled vocabularies (CV):

  • Thesauri – most widely used in information representation and retrieval
  • Subject heading lists
  • Classification schemes – the first type of CV developed; built on an artificial framework of knowledge

The strengths of controlled vocabularies are the weaknesses of natural language:

  • Synonyms
  • Homographs
  • Syntax
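The way a controlled vocabulary handles synonyms and homographs can be sketched in a few lines of Python.  This is only an illustration: the terms, the preferred headings, and the `lookup` function are all hypothetical and do not come from any real thesaurus.

```python
# Hypothetical entry terms mapped to one preferred term (synonym control).
PREFERRED = {
    "car": "automobiles",
    "cars": "automobiles",
    "auto": "automobiles",
    "automobile": "automobiles",
}

# Hypothetical homograph, disambiguated with parenthetical qualifiers.
QUALIFIED = {
    "mercury": ["mercury (metal)", "mercury (planet)"],
}


def lookup(term):
    """Return the controlled terms that correspond to a natural language term."""
    key = term.lower()
    if key in QUALIFIED:
        return QUALIFIED[key]      # the searcher must pick one meaning
    if key in PREFERRED:
        return [PREFERRED[key]]    # all synonyms collapse to one heading
    return []                      # the term is not in the vocabulary


print(lookup("auto"))
print(lookup("Mercury"))
```

Every synonym leads to the same heading, and the homograph is split into separate headings, which is exactly where plain natural language searching tends to struggle.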

The weaknesses of controlled vocabularies are the strengths of natural language:

  • Accuracy
  • Updating
  • Cost
  • Compatibility

Metadata is “data about data”.  Common forms of metadata include:

  • Author
  • Date of publication
  • Source of publication
  • Document length
  • Document genre

Descriptive metadata is external to the meaning of a document and relates to how the document was created (see the common forms above), whereas semantic metadata characterizes the subject matter of the document's contents.
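One way to picture the difference is a record that keeps both kinds of metadata side by side.  The sketch below is an assumption for illustration only: the `DocumentRecord` class, its field names, and all of the values are placeholders, not a real cataloging format.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DocumentRecord:
    # Descriptive metadata: how the document was created.
    author: str
    date_of_publication: str
    source_of_publication: str
    length_pages: int
    genre: str
    # Semantic metadata: what the document is about.
    subjects: list = field(default_factory=list)


def descriptive_metadata(record):
    """Return only the descriptive fields of a record as a dict."""
    data = asdict(record)
    data.pop("subjects")
    return data


# Placeholder values, not a real publication.
record = DocumentRecord(
    author="A. Example",
    date_of_publication="2010",
    source_of_publication="Example Press",
    length_pages=320,
    genre="textbook",
    subjects=["information retrieval", "controlled vocabularies"],
)

print(descriptive_metadata(record))
print(record.subjects)
```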

There are two main techniques to organize data:

  • Taxonomies – classes organized hierarchically
  • Folksonomies – users freely choose keywords, called tags
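The two techniques lead to quite different data structures.  The Python sketch below assumes a tiny made-up taxonomy and a few made-up tagged items; the class names, item names, and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical taxonomy: each class points to its parent class.
PARENT = {
    "Science": None,
    "Biology": "Science",
    "Botany": "Biology",
}


def path_to_root(cls):
    """Return the hierarchy from the top class down to `cls`."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return list(reversed(path))


# Hypothetical folksonomy: a flat index from each tag to the items carrying it.
tag_index = defaultdict(set)


def tag(item, *tags):
    """Attach freely chosen tags to an item (no hierarchy, no approval step)."""
    for t in tags:
        tag_index[t.lower()].add(item)


tag("photo-1", "Orchid", "garden")
tag("photo-2", "garden", "summer")

print(path_to_root("Botany"))
print(sorted(tag_index["garden"]))
```

In the taxonomy every class has a fixed place in the hierarchy, while in the folksonomy the only structure is whatever emerges from the tags that users happen to choose.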

We learned about various types of markup languages:

  • SGML
  • HTML
  • XML
  • RDF
  • HyTime

Text compression is becoming more relevant in the digital age, and our text discussed various methods of compression, including:

  • Statistical methods
      o Modeling
          • Adaptive models
          • Static models
          • Semi-static models
      o Coding
          • Huffman codes
          • Byte-Huffman codes
          • Dense codes
  • Dictionary-based methods

Our text indicates that word-based semi-static methods are the best choice for introducing compression into modern information retrieval systems.
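To see what "word-based semi-static" means in practice, here is a minimal Python sketch of word-based Huffman coding.  It is semi-static because it makes two passes: the first pass counts word frequencies and builds the code table, and the second pass encodes the text with that fixed table.  The function names and the sample sentence are assumptions for illustration.  The sketch produces bit strings instead of the byte-oriented codes mentioned above, and it ignores punctuation and spacing, which a real system would have to encode as well.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(words):
    """First pass: count word frequencies and build a Huffman code table."""
    freq = Counter(words)
    if len(freq) == 1:  # degenerate case: only one distinct word
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode(words, codes):
    """Second pass: replace each word with its code."""
    return "".join(codes[w] for w in words)


def decode(bits, codes):
    """Rebuild the word sequence; works because Huffman codes are prefix-free."""
    inverse = {c: w for w, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


words = "to be or not to be that is the question to be".split()
codes = build_codes(words)
bits = encode(words, codes)
print(len(bits), "bits for", len(words), "words")
```

Frequent words such as "to" and "be" receive codes no longer than those of rare words, which is where the savings come from.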

I think a combination of natural language and controlled vocabulary works best for information retrieval, with the emphasis on natural language since that is what users are typically familiar with.  Controlled vocabularies are more specialized and tend to require some formal training before they can be used effectively and efficiently.  It is usually easier for a user to formulate a keyword search (using natural language) than to identify the right subject headings (controlled vocabulary).  That said, controlled vocabularies can be much more powerful and return more relevant information.

Metadata formatting is extremely important in the information retrieval field, particularly as it relates to library records.  Machine-Readable Cataloging (MARC) is the metadata format used for these records.  MARC formats are the international standard for the dissemination of bibliographic data.

With regard to markup languages, I am most familiar with HTML.  I created a blog several years ago in my very first class at USF.  Although blogging software lets you create posts in a WYSIWYG editor, you can also view your entries in HTML format (see screen shots below).  I would view the posts in HTML to learn more about how HTML coding works.