10 Working With a Thesaurus in Oracle Text
This chapter describes how to improve your query application with a thesaurus.
10.1 Overview of Oracle Text Thesaurus Features
Users of your query application looking for information on a given topic might not know which words have been used in documents that refer to that topic.
Oracle Text enables you to create case-sensitive or case-insensitive thesauruses that define synonym and hierarchical relationships between words and phrases. You can then retrieve documents that contain relevant text by expanding queries to include similar or related terms as defined in the thesaurus.
You can create a thesaurus and load it into the system.
Note:
Oracle Text thesaurus formats and functionality are compliant with both the ISO-2788 and ANSI Z39.19 (1993) standards.
10.1.1 Oracle Text Thesaurus Creation and Maintenance
Thesauruses and thesaurus entries can be created, modified, deleted, imported, and exported by all Oracle Text users with the CTXAPP role.
10.1.1.1 CTX_THES Package
To maintain and browse your thesaurus programmatically, you can use the PL/SQL package CTX_THES. With this package, you can browse terms and hierarchical relationships, add and delete terms, add and remove thesaurus relations, and import and export thesauruses to and from the thesaurus tables.
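As a minimal sketch of this kind of maintenance, the following block creates a thesaurus and adds two relations. The thesaurus name pets and the terms are hypothetical, chosen only for illustration, and the block assumes the calling user has the CTXAPP role.

```sql
-- Hypothetical thesaurus name and terms, used only for illustration.
begin
  -- Create an empty thesaurus (case-insensitive by default).
  ctx_thes.create_thesaurus('pets');
  -- Record "canine" as a synonym of "dog".
  ctx_thes.create_relation('pets', 'dog', 'SYN', 'canine');
  -- Record "terrier" as a narrower term of "dog".
  ctx_thes.create_relation('pets', 'dog', 'NT', 'terrier');
end;
/
```

CREATE_RELATION is used here because it creates the phrases if they do not yet exist, which keeps the sketch short.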
10.1.1.2 Thesaurus Operators
You can also use the thesaurus operators in the CONTAINS clause to expand query terms according to your loaded thesaurus. For example, you can use the SYN operator to expand a term such as dog to its synonyms as follows:
'syn(dog)'
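Placed in a full query, the operator might be used as in the following sketch. The table docs, its columns id and text, and the thesaurus name pets are assumptions made for illustration; the text column is assumed to have a CONTEXT index.

```sql
-- Hypothetical table DOCS with a CONTEXT-indexed column TEXT.
-- With no thesaurus name given, SYN uses the thesaurus named DEFAULT.
select id, score(1) as relevance
from   docs
where  contains(text, 'SYN(dog)', 1) > 0
order  by score(1) desc;

-- The same expansion against an explicitly named (hypothetical) thesaurus.
select id
from   docs
where  contains(text, 'SYN(dog, pets)') > 0;
```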
10.1.1.3 ctxload Utility
The ctxload utility can be used for loading thesauruses from a plain-text file into the thesaurus tables, as well as dumping thesauruses from the tables into output (or dump) files.
The thesaurus dump files created by ctxload can be printed out or used as input for other applications. The dump files can also be used to load a thesaurus into the thesaurus tables. This can be useful for using an existing thesaurus as the basis for creating a new thesaurus.
WARNING:
To ensure sound security practices, Oracle recommends that you enter the password for ctxload using the interactive mode, which prompts you for the user password. Oracle strongly recommends that you do not enter a password on the command line.
Note:
You can also programmatically import and export thesauruses to and from the thesaurus tables using the CTX_THES package procedures IMPORT_THESAURUS and EXPORT_THESAURUS.
Refer to Oracle Text Reference for more information about these procedures.
10.1.2 Using a Case-sensitive Thesaurus
In a case-sensitive thesaurus, terms (words and phrases) are stored exactly as entered. For example, if a term is entered in mixed case (using either the CTX_THES package or a thesaurus load file), the thesaurus stores the entry in mixed case.
Note:
To take full advantage of query expansions that result from a case-sensitive thesaurus, your index must also be case-sensitive.
When loading a thesaurus with ctxload, you can specify that the thesaurus be loaded case-sensitive using the -thescase parameter.
When creating a thesaurus with either CTX_THES.CREATE_THESAURUS or CTX_THES.IMPORT_THESAURUS, you can specify that the thesaurus created be case-sensitive.
In addition, when a case-sensitive thesaurus is specified in a query, the thesaurus lookup uses the query terms exactly as entered in the query. Queries that use case-sensitive thesauruses therefore allow a higher level of precision in the query expansion, but this precision improves retrieval only when the index is also case-sensitive.
For example, a case-sensitive thesaurus is created with different entries for the distinct meanings of the terms Turkey (the country) and turkey (the type of bird). Using the thesaurus, a query for Turkey expands to include only the entries associated with Turkey.
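The Turkey example could be set up roughly as follows. This is a sketch under assumptions: the thesaurus name geo_cs, the preference name mixed_lexer, the index name, the table docs, and the broader terms chosen for each entry are all hypothetical.

```sql
begin
  -- Passing TRUE as the second argument requests a case-sensitive thesaurus.
  ctx_thes.create_thesaurus('geo_cs', true);
  -- Separate entries for the two meanings, distinguished only by case.
  ctx_thes.create_relation('geo_cs', 'Turkey', 'BT', 'country');
  ctx_thes.create_relation('geo_cs', 'turkey', 'BT', 'bird');

  -- Lexer preference that preserves case so the index is case-sensitive too.
  ctx_ddl.create_preference('mixed_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('mixed_lexer', 'mixed_case', 'YES');
end;
/

-- Hypothetical table DOCS; the index uses the mixed-case lexer above.
create index docs_text_idx on docs (text)
  indextype is ctxsys.context
  parameters ('LEXER mixed_lexer');
```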
10.1.3 Using a Case-insensitive Thesaurus
In a case-insensitive thesaurus, terms are stored in all-uppercase, regardless of the case in which they were originally entered.
The ctxload program loads a thesaurus in case-insensitive mode by default.
When creating a thesaurus with either CTX_THES.CREATE_THESAURUS or CTX_THES.IMPORT_THESAURUS, the thesaurus is created as case-insensitive by default.
In addition, when a case-insensitive thesaurus is specified in a query, the query terms are converted to all-uppercase for thesaurus lookup. As a result, Oracle Text is unable to distinguish between terms that have different meanings when they are in mixed-case.
For example, a case-insensitive thesaurus is created with different entries for the two distinct meanings of the term TURKEY (the country or the type of bird). Using the thesaurus, a query for either Turkey or turkey is converted to TURKEY for thesaurus lookup and then expanded to include all the entries associated with both meanings.
See Also:
The "ctxload Utility"
10.1.4 Default Thesaurus
If you do not specify a thesaurus by name in a query, by default, the thesaurus operators use a thesaurus named DEFAULT. However, Oracle Text does not provide a DEFAULT thesaurus.
As a result, if you want to use a default thesaurus for the thesaurus operators, you must create a thesaurus named DEFAULT. You can create the thesaurus through any of the thesaurus creation methods supported by Oracle Text:
- CTX_THES.CREATE_THESAURUS (PL/SQL)
- CTX_THES.IMPORT_THESAURUS (PL/SQL)
- ctxload utility
See Also:
Oracle Text Reference to learn more about using ctxload and the CTX_THES package, and "ctxload Utility" in this chapter
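For instance, a thesaurus named DEFAULT could be created with CTX_THES as sketched below. The single synonym entry is hypothetical and stands in for real content; the block assumes no thesaurus named DEFAULT exists yet.

```sql
begin
  -- Create the thesaurus that the operators use when no name is given.
  ctx_thes.create_thesaurus('DEFAULT');
  -- Hypothetical entry so that SYN(dog) has something to expand to.
  ctx_thes.create_relation('DEFAULT', 'dog', 'SYN', 'canine');
end;
/
```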
10.1.5 Supplied Thesaurus
Although Oracle Text does not provide a default thesaurus, Oracle Text does supply a thesaurus, in the form of a file that you load with ctxload, that can be used to create a general-purpose, English-language thesaurus.
The thesaurus load file can be used to create a default thesaurus for Oracle Text, or it can be used as the basis for creating thesauruses tailored to a specific subject or range of subjects.
See Also:
Oracle Text Reference to learn more about using ctxload and the CTX_THES package, and "ctxload Utility" in this chapter
10.1.5.1 Supplied Thesaurus Structure and Content
The supplied thesaurus is similar to a traditional thesaurus, such as Roget's Thesaurus, in that it provides a list of synonymous and semantically related terms.
The supplied thesaurus provides additional value by organizing the terms into a hierarchy that defines real-world, practical relationships between narrower terms and their broader terms.
Additionally, cross-references are established between terms in different areas of the hierarchy.
10.1.5.2 Supplied Thesaurus Location
The exact name and location of the thesaurus load file is operating system dependent; however, the file is generally named dr0thsus (with an appropriate extension for text files) and is generally located in the following directory structure:
<Oracle_home_directory>
  <interMedia_Text_directory>
    sample
      thes
See Also:
Oracle Database Installation Guide for the installation documentation specific to your operating system for more information about the directory structure of Oracle Text
10.2 Defining Terms in a Thesaurus
You can create synonyms, related terms, and hierarchical relationships with a thesaurus.
10.2.1 Defining Synonyms
If you have a thesaurus of computer science terms, you might define a synonym for the term XML as extensible markup language. This enables queries on either of these terms to return the same documents.
XML
SYN Extensible Markup Language
You can thus use the SYN operator to expand XML into its synonyms:
'SYN(XML)'
is expanded to:
'XML, Extensible Markup Language'
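To inspect an expansion without running a query, the CTX_THES.SYN function can be called directly, as in this sketch. The thesaurus name compsci is an assumption; it stands for whatever thesaurus holds the XML entry above.

```sql
-- SQL*Plus setting so that DBMS_OUTPUT text is displayed.
set serveroutput on

declare
  expansion varchar2(2000);
begin
  -- COMPSCI is a hypothetical thesaurus containing the XML entry.
  expansion := ctx_thes.syn('XML', 'compsci');
  dbms_output.put_line('SYN(XML) expands to: ' || expansion);
end;
/
```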
10.2.2 Defining Hierarchical Relations
If your document set is made up of news articles, you can use a thesaurus to define a hierarchy of geographical terms. Consider the following hierarchy for the U.S. state of California:
California
  NT Northern California
    NT San Francisco
    NT San Jose
  NT Central Valley
    NT Fresno
  NT Southern California
    NT Los Angeles
You can thus use the NT operator to expand a query on California as follows:
'NT(California)'
expands to:
'California, Northern California, San Francisco, San Jose, Central Valley, Fresno, Southern California, Los Angeles'
The resulting hitlist shows all documents related to the U.S. state of California and its regions and cities.
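In a query, the NT operator also accepts an optional level and thesaurus name. The following sketch assumes a hypothetical table docs with a CONTEXT-indexed column text and a hypothetical thesaurus geo that holds the California hierarchy.

```sql
-- Expand California through two levels of narrower terms in thesaurus GEO.
select id
from   docs
where  contains(text, 'NT(California, 2, geo)') > 0;
```

Stating the level explicitly makes it clear how far down the hierarchy the expansion is meant to reach.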
10.3 Using a Thesaurus in a Query Application
Defining a custom thesaurus enables you to process queries more intelligently. Because users of your application might not know which words represent a topic, you can define synonyms or narrower terms for likely query terms. You can use the thesaurus operators to expand your query into your thesaurus terms.
There are two ways to enhance your query application with a custom thesaurus:
- Load your custom thesaurus and enter queries with thesaurus operators.
- Augment the knowledge base with your custom thesaurus (English only) and use the ABOUT operator to expand your query.
Each approach has its advantages and disadvantages.
10.3.1 Loading a Custom Thesaurus and Issuing Thesaurus-based Queries
You can build and load a custom thesaurus.
The advantage of using this method is that you can modify the thesaurus after indexing.
The limitation of this method is that you must use thesaurus expansion operators in your query. Long queries can cause extra overhead in the thesaurus expansion and slow your query down.
To build a custom thesaurus, follow these steps:
10.3.2 Augmenting Knowledge Base with Custom Thesaurus
You can add your custom thesaurus to a branch in the existing knowledge base. The knowledge base is a hierarchical tree of concepts used for theme indexing, ABOUT queries, and deriving themes for document services.
When you augment the existing knowledge base with your new thesaurus, you query with the ABOUT operator, which implicitly expands to synonyms and narrower terms. You do not query with the thesaurus operators.
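An ABOUT query of this kind might look like the following sketch. The table docs, its columns, and the query concept are hypothetical, and the index on text is assumed to include a theme component.

```sql
-- No thesaurus operator is needed; ABOUT works on themes in the index.
select id, score(1) as relevance
from   docs
where  contains(text, 'about(anesthesia)', 1) > 0
order  by score(1) desc;
```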
To augment the existing knowledge base with your custom thesaurus, follow these steps:
10.3.2.1 Advantage
Compiling your custom thesaurus with the existing knowledge base before indexing enables faster and simpler queries with the ABOUT operator. Document services can also take full advantage of the customized information for creating theme summaries and Gists.
10.3.2.2 Limitations
Use of the ABOUT operator requires a theme component in the index, which requires slightly more disk space. You must also define the thesaurus before indexing your documents. If you make any change to the thesaurus, you must recompile your thesaurus and re-index your documents.
10.3.2.3 Linking New Terms to Existing Terms
When adding terms to the knowledge base, Oracle recommends that new terms be linked to one of the categories in the knowledge base for best results in theme proving.
See Also:
Oracle Text Reference for more information about the supplied English knowledge base
If new terms are kept completely separate from existing categories, fewer themes from new terms will be proven. The result of this is poor precision and recall with ABOUT queries as well as poor quality of gists and theme highlighting.
You link new terms to existing terms by making an existing term the broader term for the new terms.
10.3.2.3.1 Example: Linking New Terms to Existing Terms
You purchase a medical thesaurus medthes containing a hierarchy of medical terms. The four top terms in the thesaurus are as follows:
- Anesthesia and Analgesia
- Anti-Allergic and Respiratory System Agents
- Anti-Inflammatory Agents, Antirheumatic Agents, and Inflammation Mediators
- Antineoplastic and Immunosuppressive Agents
To link these terms to the existing knowledge base, add the following entries to the medical thesaurus to map the new terms to the existing health and medicine branch:
health and medicine
  NT Anesthesia and Analgesia
  NT Anti-Allergic and Respiratory System Agents
  NT Anti-Inflammatory Agents, Antirheumatic Agents, and Inflammation Mediators
  NT Antineoplastic and Immunosuppressive Agents
10.3.2.4 Loading a Thesaurus with ctxload
Assuming the medical thesaurus is in a file called med.thes, you load the thesaurus as medthes with ctxload as follows:
ctxload -thes -thescase y -name medthes -file med.thes -user ctxsys
When you enter the ctxload command line, you are prompted for the user password. For best security practices, never enter the password at the command line. Alternatively, you can omit the -user parameter and let ctxload prompt you for the user name and password, respectively.
10.3.2.5 Loading a Thesaurus with PL/SQL procedure CTX_THES.IMPORT_THESAURUS
The following example creates a case-sensitive thesaurus named mythesaurus and imports the thesaurus content present in myclob into the Oracle Text thesaurus tables:
declare
  myclob clob;
begin
  -- Thesaurus content to import
  myclob := to_clob('peking SYN beijing BT capital country NT beijing tokyo');
  -- 'Y' requests a case-sensitive thesaurus
  ctx_thes.import_thesaurus('mythesaurus', myclob, 'Y');
end;
/
The format of the thesaurus to be imported (myclob in this example) should be the same as used by the ctxload utility. If the format of the thesaurus to be imported is not correct, then IMPORT_THESAURUS raises an exception.
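The reverse operation, EXPORT_THESAURUS, can be sketched in the same way. The variable name thes_dump is hypothetical, and creating a temporary LOB before the call is a cautious assumption rather than a documented requirement; only the first 2000 characters are printed here.

```sql
-- SQL*Plus setting so that DBMS_OUTPUT text is displayed.
set serveroutput on

declare
  thes_dump clob;
begin
  -- Allocate a temporary CLOB to receive the exported thesaurus.
  dbms_lob.createtemporary(thes_dump, true);
  ctx_thes.export_thesaurus('mythesaurus', thes_dump);
  -- Show the beginning of the dump.
  dbms_output.put_line(dbms_lob.substr(thes_dump, 2000, 1));
  dbms_lob.freetemporary(thes_dump);
end;
/
```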
10.3.2.6 Compiling a Loaded Thesaurus
To link the loaded thesaurus medthes to the knowledge base, use ctxkbtc as follows:
ctxkbtc -user ctxsys -name medthes
When you enter the ctxkbtc command line, you are prompted for the user password. As with ctxload, for best security practices, do not enter the password at the command line.
WARNING:
To ensure sound security practices, Oracle recommends that you enter the password for ctxload and ctxkbtc using the interactive mode, which prompts you for the user password. Oracle strongly recommends that you do not enter a password on the command line.
10.4 About the Supplied Knowledge Base
Oracle Text supplies a knowledge base for English and French. The supplied knowledge base contains the information used to perform theme analysis. Theme analysis includes theme indexing, ABOUT queries, and theme extraction with the CTX_DOC package.
The knowledge base is a hierarchical tree of concepts and categories. It has six main branches:
- Science and technology
- Business and economics
- Government and military
- Social environment
- Geography
- Abstract ideas and concepts
See Also:
Oracle Text Reference for the breakdown of the category hierarchy
The supplied knowledge base is like a thesaurus in that it is hierarchical and contains broader term, narrower term, and related term information. As such, you can improve the accuracy of theme analysis by augmenting the knowledge base with your industry-specific thesaurus by linking new terms to existing terms.
You can also extend theme functionality to other languages by compiling a language-specific thesaurus into a knowledge base.
Knowledge bases can be in any single-byte character set. Supplied knowledge bases are in WE8ISO8859P1. You can store an extended knowledge base in another character set such as US7ASCII.
10.4.1 Adding a Language-Specific Knowledge Base
You can extend theme functionality to languages other than English or French by loading your own knowledge base for any single-byte whitespace delimited language, including Spanish.
Theme functionality includes theme indexing, ABOUT queries, theme highlighting, and the generation of themes, gists, and theme summaries with CTX_DOC.
You extend theme functionality by adding a user-defined knowledge base. For example, you can create a Spanish knowledge base from a Spanish thesaurus.
To load your language-specific knowledge base, follow these steps:
To use this knowledge base for theme analysis during indexing and ABOUT queries, specify the NLS_LANG language as the THEME_LANGUAGE attribute value for the BASIC_LEXER preference.
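Setting the attribute might look like the following sketch, which assumes a Spanish knowledge base has already been compiled. The preference name es_lexer, the index name, and the table docs_es are hypothetical.

```sql
begin
  ctx_ddl.create_preference('es_lexer', 'BASIC_LEXER');
  -- Include a theme component in the index.
  ctx_ddl.set_attribute('es_lexer', 'index_themes', 'YES');
  -- Use the knowledge base for this language during theme analysis.
  ctx_ddl.set_attribute('es_lexer', 'theme_language', 'SPANISH');
end;
/

-- Hypothetical table DOCS_ES holding the Spanish documents.
create index docs_es_idx on docs_es (text)
  indextype is ctxsys.context
  parameters ('LEXER es_lexer');
```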
10.4.2 Limitations for Adding Knowledge Bases
The following limitations apply for adding knowledge bases:
- Oracle supplies knowledge bases in English and French only. You must provide your own thesaurus for any other language.
- You can only add knowledge bases for languages with single-byte character sets. You cannot create a knowledge base for languages that can be expressed only in multibyte character sets. If the database uses a multibyte universal character set, such as UTF-8, the NLS_LANG parameter must still be set to a compatible single-byte character set when compiling the thesaurus.
- Adding a knowledge base works best for whitespace-delimited languages.
- You can have at most one knowledge base for each NLS_LANG language.
- Obtaining hierarchical query feedback information such as broader terms, narrower terms, and related terms does not work in languages other than English and French. In other languages, the knowledge bases are derived entirely from your thesauruses. In such cases, Oracle recommends that you obtain hierarchical information directly from your thesauruses.
See Also:
Oracle Text Reference for more information about theme indexing, ABOUT queries, using the CTX_DOC package, and the supplied English knowledge base