Why Keywording Matters

Arturo Montejo Ráez, Ralf Steinberger



Most information retrieval systems nowadays use full-text searching because algorithms are fast and very good results can be achieved. However, the use of full-text indexes has its limitations, especially in the multilingual context, and it is not a solution for further information access requirements such as the need for efficient document navigation, categorisation, abstracting and other means by which the document contents can be quickly understood and compared with others. We show that automatic indexing with controlled vocabulary keywords (descriptors) complements full-text indexing because it allows cross-lingual information access. Furthermore, controlled vocabulary indexing produces document representations that are useful for human users, for the Semantic Web, and for other application areas that require the linking and comparison of documents with each other. Due to its obvious usefulness, controlled vocabulary indexing has received increasing attention over the last few years. Aiming at a better understanding of the state of the art in this field, we discuss the various approaches to automatic keywording and propose a taxonomy for their classification.


We launch our web browser and, after clicking on a bookmark, a one-field form appears embedded in the page. Once a few words are typed inside the text field, we click on the 'submit' button expecting the answer to our question. A few seconds later the browser shows a page containing a list of items: those the system considers most suitable to answer our needs. The discrimination of results becomes a non-trivial operation due to the large number of entries returned. Sometimes we can get rid of some of them at a glance: the title or the text provided along with the item is enough to know we are not interested, but sometimes we have to click and check the real document to see whether it is the information we want, or not.

Many of us will recognize the sequence of steps performed above. We were looking for information using a full-text search engine. This operational mode in information searching and retrieval has populated almost every digital system which stores information. We can find forms like the one described when:

  • Searching for web pages: as in the example above. Search engines are amongst the biggest aggregators of information nowadays.

  • Searching for files: most operating systems come with tools supporting this feature. Thus, we can search for files containing certain words. Some systems even allow regular expressions, offering a more advanced form of the useful grep command on UNIX systems. Commonly used electronic mail clients also let the user look for a message containing a particular word in a collection of email messages.

  • Searching for books: libraries now offer their catalogues on-line and, in the case of electronic libraries, users can search for query words amongst the full text of the documents stored.

  • Searching for reports: some administrative tools integrate inverted files into their structure to make searching faster.

  • and more...

Though the usefulness of full-text search engines has been widely proven and, therefore, accepted, they are still not good enough in some cases and totally inappropriate in others. The first kind of less successful cases are those where the collection contains a huge range of subjects and documents: for example, the World Wide Web. Old approaches using purely full-text-based engines were abandoned, since the quality of results provided was declining with the growth of the collection. Therefore, new techniques arose with the aim of filtering and re-qualifying the rank (the PageRank algorithm is one of the most successful examples [1]). Such engines still index every word in a page so that they can perform full-text searches later. The problem with this approach is that language is complex, ambiguous and rich in variation, so the quality of the results is still not as good as we would like. But this technique of indexing is solving the big problem of searching for information on the web. It is an implementable solution in very general contexts.

The second field where full-text search techniques do not do so well is when textual information is not available. There are still some kinds of collections which are not suitable (yet) for this genre of engines. We refer here to pieces of information like images, sounds, etc. The current solution is to provide, beforehand, textual information related to every item (that is, enrich the data with text) so that later we can search using this related text as an access point. Many techniques have been developed in order to automate such a process by pattern recognition, clustering and so on.

Subject keys in traditional information systems

Imagine you had to organize your personal library. What sort of ideas do you think you would try in order to achieve well-organized shelves? Maybe one of your first ideas would be to group books by theme, then to label them and put their details in a kind of index. Later on you might find you have so many books that it would be better to arrange them by size (large repositories do so). Whatever method you used, in the end you would have to index them in one way or another. Now the question could be: which indexes should I use? It is not an easy task to define them because several considerations must be taken into account. Vickery already emphasizes this reality [2]:

The problem of subject representation is therefore far less straightforward than other aspects of document description.

In the beginning, the use of keywords for information storage and retrieval was due to two major needs: the need for classification and the need for retrieval. The former need had a double benefit: first, it let librarians organize physical volumes into logical clusters; second, the possibility to search within a defined cluster was regarded as a way to speed up the searching for information (as pointed out by the so-called 'cluster hypothesis' of van Rijsbergen [3]).

Hence, two major goals of indexing are to:

  1. Select records in a file that deal with a specific topic
  2. Group in proximity in a file records on similar subjects

Alphabetical terminologies and classification structures (known as 'thesauri') were thought of as tools to improve the two main measures in information retrieval: precision and recall. These refer to the quality of retrieved documents when compared to the search query. 'Precision' is the number of relevant documents retrieved over the total number of documents retrieved. 'Recall' is the number of relevant documents retrieved over the total number of relevant documents in the collection. These two measures show the problem of an antagonistic relationship: if we try to improve one of them, the other tends to decay. For example, if we retrieve the whole collection in answer to a given query, our recall will be 100%, but our precision will be so low that the result will be unusable. The challenge resides, then, in finding a method which shows a good performance for both measures.
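As a concrete illustration of the two measures defined above, the following Python sketch computes precision and recall for a single query. The document identifiers and the function name are hypothetical and chosen only for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids that are actually relevant
    """
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection of 1000 documents, 4 of them relevant.
collection = set(range(1000))
relevant = {3, 17, 42, 99}

# A focused answer: 5 documents, 3 of them relevant.
print(precision_recall({3, 17, 42, 500, 501}, relevant))  # (0.6, 0.75)

# Returning the whole collection: perfect recall, unusable precision.
print(precision_recall(collection, relevant))  # (0.004, 1.0)
```

The second call reproduces the extreme case mentioned in the text: every relevant document is found, but only 4 in 1000 retrieved documents are of any use.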

In earlier times, techniques were used to improve these two values for a defined retrieval system; i.e. the implementation of these techniques was oriented to the purpose and content of the retrieval system. The techniques traditionally used rely on setting relationships between words in a controlled vocabulary. Using those relations on a given query we can improve recall (by expanding to related terms) or precision (by narrowing with less generic terms). These are the reasons for the use of thesauri.
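A minimal Python sketch of how such relations can act on a query follows. The tiny vocabulary and the table names (`RELATED`, `NARROWER`) are invented for illustration and do not come from any real thesaurus.

```python
# Hypothetical controlled vocabulary: term -> related / narrower terms.
RELATED = {
    "penalty": ["sentence", "punishment"],
}
NARROWER = {
    "penalty": ["death penalty", "confiscation of property"],
}


def expand_for_recall(query_terms):
    """Add related terms so that more relevant documents can match."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in RELATED.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def narrow_for_precision(term):
    """Offer less generic terms so the user can restrict the query."""
    return NARROWER.get(term, [])


print(expand_for_recall(["penalty"]))
print(narrow_for_precision("penalty"))
```

Expansion trades precision for recall, while narrowing does the opposite, which is why a thesaurus usually offers both kinds of relation.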


There are several definitions for the word 'thesaurus'. In an old work of Vickery [2] we find a definition for thesaurus which summarizes in a few words the rationale associated with it:

''The thesaurus is a display of the terms in a retrieval language showing semantic relations between them.''

Here, Vickery shows, on the one hand, the main purpose of a thesaurus: it defines a retrieval language, whatever the retrieval method might be. On the other hand, he does not define the kind of relationships between entries (synonyms, broader terms...), specifying only that a set of semantic relations is defined. We will see that this brief definition fits perfectly with any type of existing thesaurus.

One of the earliest thesauri (and maybe the most famous one) is Roget's Thesaurus [4]. Dr. Peter Mark Roget's main idea behind this compilation was to create a system which would offer words to express a given meaning, whereas traditional dictionaries offer meanings for a given word. This would help writers to express their concepts in the most suitable form. Such users had the thesaurus as a reference book for the writing of texts. Thus, it was mostly intended to be useful in the document creation phase.

The power of reducing a language to its basic concepts has become more and more useful, especially since the "semantic network" has arisen in electronic form. WordNet [5] is an "on-line reference system" (as its authors state). English nouns, verbs, adverbs and adjectives are organized into synonym sets (also called synsets), each representing one underlying lexical concept. Nowadays we can assume that almost every thesaurus (specialized or not) is available in electronic form.

Thesaurus descriptors are normally unambiguous because they are clearly defined, whereas full-text indexing does not provide any differentiation for words such as 'plant' in 'power plant' versus 'green plant'.

There is even a multilingual thesaurus based on WordNet called EuroWordNet [6], which, using English as central node, maps synsets between different European languages. This work represents a milestone in multilingual information retrieval.

Both WordNet and Roget's Thesaurus are general reference sources, i.e. they don't focus on specialized terminologies. But the areas where thesauri become useful tools are specialized domains (Law, Medicine, Material Science, Physics, Astronomy...). One example is the INSPEC thesaurus [7], focused on technical literature; another is the ASIS thesaurus, specialized in Information Science [8]. NASA, the European Union, and other organizations produce their own specialized thesauri (like the multilingual EUROVOC thesaurus [9]).

Each thesaurus has its own organization, according to the purpose it needs to accomplish. But we can summarize any of them by the following components:

  • The set of items in the thesaurus. They are usually referred to as descriptors, index terms, keywords, key phrases, topics, concepts or themes. We will use ''keyword'' to name them.

  • The set of subsets of the set of terms. Each subset contains a group of terms which are interrelated by the synonym relationship (i.e. words with the same meaning). This relationship is important because the resulting subsets are elements in other relations.

  • A set of relations: keyword to keyword, keyword to meaning, meaning to keyword and meaning to meaning.

There are two relations which are commonly used among existing thesauri:

  • Hyponymy. This is a relationship between meanings. We say that x is a hyponym of y if x is a kind of y. This relation is reflexive, anti-symmetric and transitive, therefore it establishes a partial order between meanings. The inverse relation, hypernymy, also defines a partial order over the set of descriptors.

  • Meronymy. This can be split into three different (but closely related) relationships:

    1. x is part of y, e.g. branch is part of tree
    2. x is a member of y, e.g. citizen is a member of society
    3. x is constituent material of y, e.g. iron is constituent material of knife

Of course, depending on the purpose of the thesaurus, some of these relations may be ignored. New relations could also occur. WordNet, for example, includes all of the given relations. The INSPEC and Eurovoc thesauri condense meronym relations into the ''related'' relationship (see [8]; RT means ''related terms''). Synonymy is implemented by the application of the ''USE'' statement.

Figure 1: Excerpt from Eurovoc thesaurus
     [...]
     penalty
       NT1   alternative sentence
       NT1   carrying out of sentence
         NT2   barring of penalties by limitation
         NT2   reduction of sentence
           RT   repentance
           RT   terrorism     (0431)
         NT2   release on licence
         NT2   suspension of sentence
       NT1   conditional discharge
       NT1   confiscation of property
         RT   seizure of goods     (1221)
       NT1   criminal record
       NT1   death penalty
       NT1   deprivation of rights
         RT   civil rights     (1236)
     [...]

Usually in specialized thesauri either synonymy is neglected or a preferred word representing the meaning is given, since the purpose is to provide a list of controlled terms (and that ''control'' refers to the use of just one word for a given meaning). Nevertheless, most of them include synonymy in one way or another.

There are, however, some special cases of thesauri where there is more than just terms and relations. In some cases the thesaurus is a complex reference of specific relations, with specially defined rules to build a document's keywords. This is the case of the DESY thesaurus [10], specializing in high energy physics literature. With the entries given we can construct particle combinations, reaction equations and energy declarations among other examples.

These facts bring us to the conclusion of Vickery that there is a tight relationship between the thesaurus and its domain of retrieval.
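One possible in-memory representation of an entry like the one in Figure 1 is sketched below in Python. It parses lines in the NT1/NT2/RT notation into narrower-term and related-term maps. The input layout (one term per line), the rule that an RT line belongs to the most recent term, and all function names are assumptions made for this example; they do not describe the format in which Eurovoc is actually distributed.

```python
import re
from collections import defaultdict

# A shortened copy of the Figure 1 excerpt, one term per line.
EXCERPT = """\
penalty
NT1 alternative sentence
NT1 carrying out of sentence
NT2 barring of penalties by limitation
NT2 reduction of sentence
RT repentance
RT terrorism (0431)
NT2 release on licence
NT1 conditional discharge
NT1 confiscation of property
RT seizure of goods (1221)
"""


def parse_entry(text):
    """Parse one thesaurus entry into (narrower, related) dictionaries."""
    narrower = defaultdict(list)  # term -> directly narrower terms (NT)
    related = defaultdict(list)   # term -> related terms (RT)
    path = []                     # path[i] = most recent term at depth i
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        nt = re.match(r"NT(\d+)\s+(.*)", line)
        if nt:
            depth, term = int(nt.group(1)), nt.group(2)
            del path[depth:]                 # forget deeper terms
            narrower[path[-1]].append(term)  # parent is one level up
            path.append(term)
        elif line.startswith("RT "):
            # Assumption: an RT line refers to the most recent term.
            related[path[-1]].append(line[3:].strip())
        else:
            path = [line]                    # top-level descriptor
    return dict(narrower), dict(related)


def all_narrower(term, narrower):
    """Collect every term below `term` (transitive closure of NT)."""
    found = []
    for child in narrower.get(term, []):
        found.append(child)
        found.extend(all_narrower(child, narrower))
    return found


narrower, related = parse_entry(EXCERPT)
print(all_narrower("carrying out of sentence", narrower))
print(related["reduction of sentence"])
```

Keeping NT and RT in separate maps makes it easy to follow only the hierarchical relation when a partial order is needed, and to treat ''related'' links as a flat association.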

Applications of keywords

The construction of hand-crafted thesauri for use in computer applications dates back to the early 1950s with the work of H. P. Luhn, who developed a thesaurus for the indexing of scientific literature at IBM. The number of thesauri and systems is now growing steadily because manually or automatically keyworded documents have many advantages and offer additional applications over simple documents not linked to thesauri. Depending on whether people or machines make use of keywords assigned to documents, we distinguish the following uses:

Human manipulation of keywords

Human users mainly use keywords for browsing and searching of document collections.

Keywords are used to facilitate the browsing of document collections, either as part of a whole collection or the small subset returned by a search operation. Examples of how keywords can aid browsing:

  • Use of keywords as a document summary. Thesaurus descriptors are usually a small list of carefully chosen terms that represent the document contents particularly well. Depending on the thesaurus, they are of a summarising, conceptual nature. They often do not occur explicitly in text so that they are of a completely different nature from full text indexes. Descriptors function as a kind of abstract summary and give users a quick and rough idea of the document contents. This helps the users to quickly sieve out the most important or relevant documents from a large collection. The use of keywords as a means for automatic summarization is an interesting application already in practice in many digital libraries and on-line catalogues.

    For high energy physics documents this can speed up the search process in specialized collections which grow by hundreds of documents every week [11].

    If the thesaurus is multilingual, this summarising function also works across languages, i.e. a user will see a list of keywords (a summary) in their own language of a document written in another language [12].

  • Use of keywords for document navigation. If the database containing the full texts and their keywords offers hyperlinks based on the keywords, it is possible to navigate through the document collection by starting with one document and searching for similar documents by clicking on one or more of the keywords to see other documents indexed with the same descriptors.

  • Classification of documents. In some search engines the results of a search are classified ''on the fly'' into categories so that the browsing of documents is easier and more self-organized. The user can distinguish faster between interesting documents and irrelevant ones. Although many of these systems use words from documents to label automatically generated categories, others select them from a controlled vocabulary. An example of such a system is GRACE [13].

    However, the use of keywords for classification goes beyond the pure retrieval domain. The freedesktop.org project [14] promotes the use of keywords for the arrangement of icons (representing application launchers) in the main menu of the desktop, where applications are internally attached to a list of categories. This means that there is no predefined taxonomy into which program launchers are classified. Instead, programs are labelled with keywords from which menus are created in the graphical interface of an operating system (like the Gnome [15] desktop available for Linux and other operating systems).

  • Identifying the common subject of cited documents. An interesting feature that will be tested under the HEPindexer project is the use of keywords to help users in navigating through document references, allowing them to recognize the subject that is shared by the reference and the document the reference belongs to.


Keywords are helpful during the search phase. For example:

  • For query expansion. Some authors, like Vassilevskaya [16] among others, propose the use of controlled vocabulary for query expansion. In the query formulating process, the query is passed to the automatic assignment tool and some thesaurus keywords are suggested to the user. These can then either be chosen instead of, or in addition to, the query.

  • For cross-lingual searching. When the thesaurus used for indexing is multilingual, users can be given the option to use thesaurus descriptors as search terms. The search can then be carried out using search terms in another language to achieve cross-lingual document search and retrieval.

  • Descriptors provided by a thesaurus have a relevant advantage over single terms selected for automatic full-text indexing. Concepts which can be expressed in several synonymous ways, such as "Chemotherapy" and "Drug Therapy", are conflated to only one form ("Drug Therapy"), while phrases can be treated as single concepts ("stage IIIB breast cancer", for example). This is one of the conclusions found by using the MetaMap Indexer for medical documents [17].

  • When searching in highly specialized domains, keywords can be used as a directory or subject tree for the user who is not able to make his/her information needs explicit as a set of query terms. If documents in the collection have been labelled with keywords and the structure of the thesaurus is hierarchical, the user can drill down through the categories, narrowing the search space. This could be the electronic equivalent of browsing the Universal Decimal Classification (UDC) when we enter a library for the first time. Such a tree could profit from semantic relations between keywords, hence we may find related topics, general topics, and further relationships.
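The conflation of synonymous expressions mentioned in this list can be sketched in Python as a lookup from entry terms to preferred descriptors. The `USE` table below is a hypothetical, hand-made fragment that only borrows the ''Chemotherapy'' and ''Drug Therapy'' example from the text; it is not the method of the MetaMap Indexer [17].

```python
# Hypothetical entry-term table: non-preferred term -> preferred descriptor.
USE = {
    "chemotherapy": "Drug Therapy",
    "drug therapy": "Drug Therapy",
    "stage iiib breast cancer": "Stage IIIB Breast Cancer",
}


def to_descriptors(text):
    """Map the phrases found in `text` to preferred descriptors.

    Longer phrases are tried first so that a multi-word concept is
    reported as one unit. Matching is plain substring search, which is
    a simplification (no word boundaries, no morphology).
    """
    lowered = text.lower()
    found = []
    for phrase in sorted(USE, key=len, reverse=True):
        if phrase in lowered and USE[phrase] not in found:
            found.append(USE[phrase])
    return found


print(to_descriptors("Chemotherapy for stage IIIB breast cancer"))
print(to_descriptors("Outcome of drug therapy"))
```

Both inputs end up with the same descriptor for the treatment, so a search on ''Drug Therapy'' finds documents regardless of which synonym their authors used.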

Using keywords as a document representation for machine usage

The fact that keywords were traditionally developed for human readers does not necessarily mean that they can only be used by people. Several powerful applications have shown that descriptors can well be used to represent document contents for a number of automatic procedures:

  • Using descriptors from a hierarchically organised thesaurus allows searching by subject field. For instance, searching for "RADIOACTIVE MATERIALS" can automatically be extended to all individual instances of radioactive elements, such as 'uranium', 'plutonium', etc. This form of query expansion is being used for some domains like high energy physics [16].

  • Multilingual document similarity calculation. As the list of thesaurus descriptors of a text is a semantic representation of this text, texts can be compared with each other via these descriptor lists. The idea is that the higher the number of content descriptors two documents have in common, the more similar the documents are. For multilingual thesauri like Eurovoc [9], which currently exists in twenty-one language versions with one-to-one translations for each descriptor, the document similarity calculation is even possible for texts written in different languages. Steinberger, Pouliquen and Hagman have shown [18] that a translation of a given document can quite reliably be identified in a multilingual document collection because it is correctly identified as the most similar document to the original.

  • Multilingual clustering and multilingual document maps. The same cross-lingual document similarity measure can be used as input to further multilingual applications. These include multilingual clustering and classification of documents, as well as the visualisation of multilingual document collections in single document maps [19].

  • The Semantic Web. Our opinion is that all these technologies will play a key role in the development of the Semantic Web [20]. Considering that the whole structure of the Semantic Web depends on RDF (Resource Description Framework) and that there are already some projects to use thesauri like a schema (ontology) defining the terms used to represent the RDF version of WordNet, we can conclude that this is a promising area of research. A web of documents can be related via their associations between keywords.

  • The Semantic Grid. Not only documents can be linked using keywords: any type of service (for example, any web service) can be tagged with keywords, which could be assigned automatically using the description of the service as a basis. The objective of the Semantic Grid [21] can be reached faster using subject enhancement, i.e. keywording.

Therefore, we could imagine a scenario where we want to look for a service, e.g. a database of iron manufacturers. We get the keywords of the service, which may have been generated from the content of the top web pages in the portal of this service (the pages which let us access the database via web forms or any other web-based interaction). These keywords show us that there is another database which offers iron toys, since the thesaurus splits the keyword iron manufacturer into the subtopics iron-made toys manufacturer, naval manufacturer, etc. Thanks to the semantic network created from keyword relationships we are able to find the provider we need.
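The descriptor-based document similarity described in the list above can be sketched in Python as follows. Documents are represented only by sets of language-independent descriptor identifiers. The identifiers, the document names and the choice of the Jaccard coefficient as the overlap measure are assumptions of this example, and not necessarily the measure used in [18].

```python
def descriptor_similarity(a, b):
    """Jaccard overlap between two sets of descriptor identifiers."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query_descriptors, collection):
    """Return the name of the document with the highest overlap."""
    return max(
        collection,
        key=lambda name: descriptor_similarity(
            query_descriptors, collection[name]
        ),
    )


# Hypothetical descriptor ids assigned to documents in three languages.
collection = {
    "report_fr": {"0431", "1221", "1236"},
    "report_de": {"1221", "2841"},
    "report_it": {"5000", "5001"},
}
english_doc = {"0431", "1221", "1236", "2841"}

print(most_similar(english_doc, collection))  # report_fr
```

Because the comparison never looks at the words of the texts, it works unchanged when the documents are written in different languages, provided they were indexed with the same multilingual thesaurus.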

Automatic keyword assignment tools: a taxonomy

F. W. Lancaster [22] gives us the following definition for indexing:

''The main purpose of indexing and abstracting is to construct representations of published items in a form suitable for inclusion in some type of database.''

This is a very general description, but it still summarizes the goal of indexing: provide an abstraction of document contents for better storage and retrieval (which is the goal of any database). We find several types of indexing. For example, web search engines (like Google, Altavista and others) generally full-text-index web pages automatically, but for some specialized and popular subject areas, they ask teams of professional indexers to carry out careful manual indexing.

We distinguish two main types of indexing [22]:

  • Indexing by assignment. Most human indexing is assignment indexing, involving the representation of subject matter by means of terms selected from some form of controlled vocabulary. Due to the intellectual work involved, this manual task is very labour-intensive and therefore expensive. Fortunately, new automatic solutions are emerging with reasonable performances.

  • Indexing by extraction. Words or phrases appearing verbatim in a text are extracted and used to represent the contents of the text as a whole. Keyword extraction is thus less abstract and more limited than assignment indexing.
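To make the contrast concrete, here is a deliberately naive Python sketch of indexing by extraction: it returns the most frequent non-stopword tokens that appear verbatim in the text. The stopword list and the frequency criterion are assumptions for illustration; real extraction systems use richer features.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "for", "by"}


def extract_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens found in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


text = ("The penalty for the offence is a reduction of rights. "
        "The penalty is decided by the court, and the court may "
        "suspend the penalty.")
print(extract_keywords(text, k=2))  # ['penalty', 'court']
```

Every keyword returned occurs in the text itself. An assignment system could instead return a descriptor from the controlled vocabulary that never appears verbatim in the document.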

The taxonomy we propose focuses on those systems that do assignment of keywords instead of extraction.

Figure 2: Indexing by assignment

Figure 2 provides a graphical view of this process. The indexer reads the document and selects keywords from a thesaurus or a controlled vocabulary.

Although there has been extensive work done on automatic thesaurus generation, less work has been done on automatic descriptor assignment. Although research is advancing slowly in this area, it benefits from development in other IR (information retrieval) areas. We mention here some of the systems developed for automatic assignment, along with a taxonomy proposal for this kind of tool.

We can classify automatic keyword assignment systems (AKWAs) into two main categories, depending on their degree of automation:

  • Machine Aided Indexing (MAI): those systems supporting indexers in their manual task of finding good keywords for documents, like the NASA MAI System [23] or BIOSIS [24].

  • Fully Automatic Indexing (FAI): those systems intended for a fully automatic keywording assignment process without any human intervention. For FAI tools a document is used as input and the system automatically produces as output a list of keywords.

The following is a taxonomy that summarizes the various AKWA approaches developed so far:

  • Indexing by analogy to similar documents. This approach can be used if there is already a collection of pre-indexed documents with which a new document can be compared. The basic idea is that the document, for which indexing is to be carried out, should be indexed with the keywords of the most similar documents in the collection. For this purpose, an existing document retrieval engine will identify the most similar documents in the collection. In the following step, the most frequent keywords of the retrieved documents will be merged and assigned to the new document. The process of retrieving similar documents in the collection can be based on a lexical vector space model, on the references the documents have in common, on formatting features [25], or on any other similarity measure.

    Advantages of this approach are its relatively easy implementation and the fact that, depending on the kind of merge function used, training may not be necessary. A disadvantage is that a good merging function might be difficult to design.

    An example of this category is the system developed by Ezhela et al. [26], which is fully based on the references linked to a publication. Even with a quite simple approach, the results were acceptable, due to the high specialization of the subject domain (High Energy Physics). The merging strategy used was a pure intersection operation.

  • Indexing by classification. This approach also requires the existence of previously indexed documents. The idea is that each descriptor is treated like a class and there are as many classes as there are descriptors in the thesaurus. A document indexed with five descriptors is thus a document multiply classified into five classes. Each class is represented by all those documents that have been indexed with the corresponding descriptor. During the assignment phase, a new document will be classified into the most appropriate classes in order to identify the most appropriate descriptors. The quality of the assignment depends on the performance of the classification algorithm used.

    Most current systems fall under this category. One of them is the first version of the HEPindexer system [27], which uses a vector space model fully dedicated to the automatic assignment of descriptors.
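The two approaches in this taxonomy can be sketched side by side in Python. In the sketch below, indexing by analogy retrieves the k most similar pre-indexed documents and merges their keywords by frequency, while indexing by classification builds one class vector per descriptor and assigns the descriptors whose class is close enough to the new document. The cosine measure over raw word counts, the threshold, the parameter names and the toy collection are all assumptions of this sketch; they are not taken from the system of Ezhela et al. [26] (which compares references and merges by intersection) or from HEPindexer [27].

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Represent a text by its raw word counts (a simple lexical vector)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def index_by_analogy(text, indexed_docs, k=2, n_keywords=1):
    """Assign the most frequent keywords of the k most similar documents."""
    vec = vectorize(text)
    ranked = sorted(
        indexed_docs,
        key=lambda doc: cosine(vec, vectorize(doc[0])),
        reverse=True,
    )
    # Merge step: count how often each keyword occurs among the neighbours.
    votes = Counter(kw for _, keywords in ranked[:k] for kw in keywords)
    return [kw for kw, _ in votes.most_common(n_keywords)]


def train_classes(indexed_docs):
    """One class per descriptor: the summed vectors of its documents."""
    classes = defaultdict(Counter)
    for text, descriptors in indexed_docs:
        for descriptor in descriptors:
            classes[descriptor].update(vectorize(text))
    return classes


def index_by_classification(text, classes, threshold=0.5):
    """Assign every descriptor whose class vector is close to the text."""
    vec = vectorize(text)
    scores = {d: cosine(vec, cv) for d, cv in classes.items()}
    return sorted(
        (d for d, s in scores.items() if s >= threshold),
        key=lambda d: -scores[d],
    )


# Hypothetical pre-indexed collection: (text, manually assigned descriptors).
indexed_docs = [
    ("death penalty abolished by parliament", ["death penalty", "parliament"]),
    ("parliament votes on budget", ["parliament", "budget"]),
    ("court confirms death penalty", ["death penalty"]),
]
new_doc = "court upholds death penalty"

print(index_by_analogy(new_doc, indexed_docs))
print(index_by_classification(new_doc, train_classes(indexed_docs)))
```

Both functions need the same kind of input, a collection of previously indexed documents. The difference is where the comparison happens: analogy compares the new document with individual documents, classification compares it with one aggregate representation per descriptor.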

Finally, the last criterion for classifying AKWAs is based on training needs, i.e. on the amount of effort required to develop the system:

  • Machine learning systems always require automatic training using a model collection of documents that have previously been indexed. The systems can vary quite a lot depending on the type of document representation that is given to the text. This can range from a simple word frequency list for each document to a multi-faceted collection of document features. Typically, documents are represented by either all their words, or by all their lemmas, or by all their nouns and noun phrases.

    An example of this kind of approach is the European Commission's EUROVOC indexing tool [28], which represents each class by a combination of the most frequent lemmatized keywords of all of its documents, uses a large number of parameters and calculates various types of vector space similarities between the new document and all of its classes in order to determine the most appropriate descriptor. A more standard approach to Machine Learning was taken by Hulth [29], but her work is limited to keyword extraction rather than assignment.

    A general, positive feature of these systems is that they can rather easily be adapted to new thesauri and languages as they automatically optimise the selection of features from the feature set provided as input. Language-independent systems typically work with little linguistic input (e.g. only with a text's frequency list of word tokens). Better performance can be achieved using more linguistic processing such as lemmatization and parsing or using linguistic resources such as synonym lists and other dictionaries.

  • Rule-based systems typically make intensive use of linguistic resources and use language- and/or domain-dependent rules which are normally developed manually. The major problems with this approach are its development cost and the fact that the systems cannot easily be adapted to new domains and languages. The work done by Vassilevskaya [16] fits under this category. She proposed a system specialized in high energy physics, based on five types of rules with hundreds of rules introduced manually. A more extreme example of this labour-intensive approach is that of Hlava and Hainebach [30], who produced over 40,000 hand-crafted rules for English alone. The conditions were of various types, including conditions regarding the presence of text strings, the usage of synonym lists, vicinity operators and even the recognition and exploitation of legal references in texts.
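A toy Python sketch of a rule-based assigner, in the spirit of the conditions listed above, is given below. The two rules, their conditions (presence of a text string or of a listed synonym, and a vicinity test) and the descriptors are invented for this example and are far simpler than the hand-crafted rule sets of [16] or [30].

```python
import re


def near(text, first, second, window=5):
    """True if `first` and `second` occur within `window` words."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def mentions_any(text, phrases):
    """True if any of the phrases (e.g. a synonym list) occurs in the text."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


# Hypothetical rules: (condition on the text, descriptor to assign).
RULES = [
    (lambda t: mentions_any(t, ["death penalty", "capital punishment"]),
     "death penalty"),
    (lambda t: near(t, "seizure", "goods"), "seizure of goods"),
]


def assign_by_rules(text):
    """Return the descriptor of every rule whose condition holds."""
    return [descriptor for condition, descriptor in RULES if condition(text)]


print(assign_by_rules("The court ordered the seizure of all imported goods."))
print(assign_by_rules("Capital punishment was abolished in 1981."))
```

Each new descriptor needs at least one new hand-written rule, which illustrates why the development cost of this approach grows with the size of the thesaurus.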


We have shown that manual or automatic indexing of document collections with controlled vocabulary thesaurus descriptors is complementary to full-text indexing and that it provides both human users and machines with the means to analyse, navigate and access the contents of document collections in a way full-text indexing would not permit. Indeed, instead of being replaced by full-text searches in electronic libraries, a growing number of automatic keyword assignment systems are being developed that use a range of very different approaches.

In this paper, we have given an introduction to automatic keyword assignment, distinguishing it from keyword extraction, and proposing a classification of approaches, referring to sample implementations for each approach. This presentation will hopefully help researchers in the area to better understand and classify emerging approaches. We have also summarized some of the powerful applications that this kind of tool is offering in the field of information retrieval, and we have structured them into comprehensive categories and have shown real examples and working solutions for some of them.

As the number of different systems for automatic keyword assignment has been increasing over recent years, it was our aim to give some order to the state of the art in this interesting and promising field of research.


Arturo Montejo-Ráez
European Organization for Nuclear Research

Tel: +41 (0) 22 767 3833
URL: http://cern.ch/amontejo

Ralf Steinberger
European Commission
Joint Research Centre
T.P. 267, 21020 Ispra (VA)

Tel: +39 - 0332 78 6271
Email: ralf.steinberger@jrc.it
URL: www.jrc.it/langtech

