8 Searching Document Sections in Oracle Text
This chapter describes how to use document sections in an Oracle Text query application.
The following topics are discussed in this chapter:
8.1 About Oracle Text Document Section Searching
Section searching enables you to narrow text queries down to blocks of text within documents. Section searching is useful when your documents have internal structure, such as HTML and XML documents.
You can also search for text at the sentence and paragraph level.
This section contains these topics:
8.1.1 Enabling Oracle Text Section Searching
The steps for enabling section searching for your document collection are:
8.1.1.1 Create a Section Group
Section searching is enabled by defining section groups. You use one of the system-defined section groups to create an instance of a section group. Choose a section group appropriate for your document collection.
You use section groups to specify the type of document set you have and implicitly indicate the tag structure. For instance, to index HTML tagged documents, you use the HTML_SECTION_GROUP. Likewise, to index XML tagged documents, you can use the XML_SECTION_GROUP.
Table 8-1 lists the different types of section groups you can use:
Table 8-1 Types of Section Groups
Section Group Preference | Description |
---|---|
NULL_SECTION_GROUP | This is the default. Use this group type when you define no sections or when you define only SENTENCE or PARAGRAPH sections. |
BASIC_SECTION_GROUP | Use this group type for defining sections where the start and end tags are of the form <A> and </A>. Note: This group type does not support input such as unbalanced parentheses, comment tags, and attributes. Use HTML_SECTION_GROUP for this type of input. |
HTML_SECTION_GROUP | Use this group type for indexing HTML documents and for defining sections in HTML documents. |
XML_SECTION_GROUP | Use this group type for indexing XML documents and for defining sections in XML documents. |
AUTO_SECTION_GROUP | Use this group type to automatically create a zone section for each start-tag/end-tag pair in an XML document. The section names derived from XML tags are case-sensitive as in XML. Attribute sections are created automatically for XML tags that have attributes. Attribute sections are named in the form tag@attribute. Stop sections, empty tags, processing instructions, and comments are not indexed. As Table 8-2 shows, the only section type you can add to an automatic section group is the stop section. |
PATH_SECTION_GROUP | Use this group type to index XML documents. Behaves like the AUTO_SECTION_GROUP. The difference is that with this section group you can do path searching with the INPATH and HASPATH operators. |
NEWS_SECTION_GROUP | Use this group for defining sections in newsgroup formatted documents according to RFC 1036. |
Note:
Documents sent to the HTML, XML, AUTO, and PATH sectioners must begin with \s*<, where \s* represents zero or more whitespace characters. Otherwise, the document is treated as a plaintext document, and no sections are recognized.
You use the CTX_DDL package to create section groups and define sections as part of section groups. For example, to index HTML documents, create a section group with HTML_SECTION_GROUP:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); end;
8.1.1.2 Define Your Sections
You define sections as part of the section group. The following example defines a zone section called heading for all text within the HTML <H1> tag:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1'); end;
Note:
If you are using the AUTO_SECTION_GROUP or PATH_SECTION_GROUP to index an XML document collection, then you need not explicitly define sections. The system does this for you during indexing.
See Also:
- "Oracle Text Section Types" for more information about sections
- "XML Section Searching with Oracle Text" for more information about section searching with XML
8.1.1.3 Index Your Documents
When you index your documents, you specify your section group in the parameter clause of CREATE INDEX:
create index myindex on docs(htmlfile) indextype is ctxsys.context parameters('filter ctxsys.null_filter section group htmgroup');
8.1.1.4 Section Searching with the WITHIN Operator
When your documents are indexed, you can query within sections using the WITHIN operator. For example, to find all the documents that contain the word Oracle within their headings, enter the following query:
'Oracle WITHIN heading'
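The query string above is the text-query argument of CONTAINS. A minimal sketch of the complete statement, assuming the docs table and htmlfile column from the CREATE INDEX example and a hypothetical id column that the original does not define:

```sql
-- Assumes docs(htmlfile) is indexed with section group htmgroup as shown earlier.
-- The id column is hypothetical and used only for illustration.
SELECT id
  FROM docs
 WHERE CONTAINS(htmlfile, 'Oracle WITHIN heading') > 0;
```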
See Also:
Oracle Text Reference to learn more about using the WITHIN operator
8.1.1.5 Path Searching with INPATH and HASPATH Operators
When you use the PATH_SECTION_GROUP, the system automatically creates XML sections for you. In addition to using the WITHIN operator to enter queries, you can enter path queries with the INPATH and HASPATH operators.
See Also:
- "XML Section Searching with Oracle Text" to learn more about using these operators
- Oracle Text Reference to learn more about using the INPATH operator
8.1.1.6 Marking an SDATA Section to be Searchable
To mark an SDATA section as searchable and to have a $Sdatatype table created, use the CTX_DDL.SET_SECTION_ATTRIBUTE API.
The following tables are created, one for each SDATA datatype:
- $SN: NUMBER
- $SD: DATE
- $SV: VARCHAR2, CHAR
- $SR: RAW
- $SBD: BINARY DOUBLE
- $SBF: BINARY FLOAT
- $ST: TIMESTAMP
- $STZ: TIMESTAMP WITH TIMEZONE
The following example creates a $SV table for this SDATA section, which allows efficient searching on that section:
ctx_ddl.add_sdata_section('sec_grp', 'sdata_sec', 'mytag', 'varchar');
ctx_ddl.set_section_attribute('sec_grp', 'sdata_sec', 'optimized_for', 'search');
The default value of this attribute is FALSE.
8.1.2 Oracle Text Section Types
All section types are blocks of text in a document. However, sections can differ in the way that they are delimited and the way that they are recorded in the index. Sections can be one of the following types:
- Zone Section
- Field Section
- Stop Section
- MDATA Section
- NDATA Section
- SDATA Section
- Attribute Section (for XML documents)
- Special Sections (sentences or paragraphs)
Table 8-2 shows which section types may be used with each kind of section group.
Table 8-2 Section Types and Section Groups
Section Group | ZONE | FIELD | STOP | MDATA | NDATA | SDATA | ATTRIBUTE | SPECIAL |
---|---|---|---|---|---|---|---|---|
NULL | NO | NO | NO | NO | NO | NO | NO | YES |
BASIC | YES | YES | NO | YES | YES | YES | NO | YES |
HTML | YES | YES | NO | YES | YES | YES | NO | YES |
XML | YES | YES | NO | YES | YES | YES | YES | YES |
NEWS | YES | YES | NO | YES | YES | YES | NO | YES |
AUTO | NO | NO | YES | NO | NO | NO | NO | NO |
PATH | NO | NO | NO | NO | NO | NO | NO | NO |
8.1.2.1 Zone Section
A zone section is a body of text delimited by start and end tags in a document. The positions of the start and end tags are recorded in the index so that any words in between the tags are considered to be within the section. Any instance of a zone section must have a start and an end tag.
For example, the text between the <TITLE> and </TITLE> tags can be defined as a zone section as follows:
<TITLE>Tale of Two Cities</TITLE> It was the best of times...
Zone sections can nest, overlap, and repeat within a document.
When querying zone sections, you use the WITHIN operator to search for a term across all sections. Oracle Text returns those documents that contain the term within the defined section.
Zone sections are well suited for defining sections in HTML and XML documents. To define a zone section, use CTX_DDL.ADD_ZONE_SECTION.
For example, assume you define the section booktitle as follows:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'booktitle', 'TITLE'); end;
After you index, you can search for all the documents that contain the term Cities within the section booktitle as follows:
'Cities WITHIN booktitle'
With multiple query terms such as (dog and cat) WITHIN booktitle, Oracle Text returns those documents that contain cat and dog within the same instance of a booktitle section.
Repeated Zone Sections
Zone sections can repeat. Each occurrence is treated as a separate section. For example, if <H1> denotes a heading section, heading sections can repeat in the same document as follows:
<H1> The Brown Fox </H1> <H1> The Gray Wolf </H1>
Assuming that these zone sections are named Heading, the query Brown WITHIN Heading returns this document. However, a query of (Brown and Gray) WITHIN Heading does not, because Brown and Gray occur in different instances of the section.
Overlapping Zone Sections
Zone sections can overlap each other. For example, if <B> and <I> denote two different zone sections, they can overlap in a document as follows:
plain <B> bold <I> bold and italic </B> only italic </I> plain
Nested Zone Sections
Zone sections can nest, including within themselves, as follows:
<TD> <TABLE><TD>nested cell</TD></TABLE></TD>
Using the WITHIN operator, you can write queries to search for text in sections within sections. For example, assume the BOOK1, BOOK2, and AUTHOR zone sections occur as follows in documents doc1 and doc2:
doc1:
<book1> <author>Scott Tiger</author> This is a cool book to read.</book1>
doc2:
<book2> <author>Scott Tiger</author> This is a great book to read.</book2>
Consider the nested query:
'(Scott within author) within book1'
This query returns only doc1.
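The nested query can be tried end to end with a setup along the following lines. This is a sketch under stated assumptions: the table books, the index books_idx, and the section group bookgroup are hypothetical names, and BASIC_SECTION_GROUP is chosen only because the sample tags have the plain <tag>...</tag> form.

```sql
-- Hypothetical table holding the two sample documents.
CREATE TABLE books (id NUMBER PRIMARY KEY, doc VARCHAR2(4000));

INSERT INTO books VALUES (1, '<book1> <author>Scott Tiger</author> This is a cool book to read.</book1>');
INSERT INTO books VALUES (2, '<book2> <author>Scott Tiger</author> This is a great book to read.</book2>');
COMMIT;

BEGIN
  -- One zone section per tag, so that sections can nest.
  ctx_ddl.create_section_group('bookgroup', 'BASIC_SECTION_GROUP');
  ctx_ddl.add_zone_section('bookgroup', 'book1',  'book1');
  ctx_ddl.add_zone_section('bookgroup', 'book2',  'book2');
  ctx_ddl.add_zone_section('bookgroup', 'author', 'author');
END;
/

CREATE INDEX books_idx ON books(doc)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('section group bookgroup');

-- Inner WITHIN restricts to author; outer WITHIN restricts to book1.
SELECT id
  FROM books
 WHERE CONTAINS(doc, '(Scott WITHIN author) WITHIN book1') > 0;
```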
8.1.2.2 Field Section
A field section is similar to a zone section in that it is a region of text delimited by start and end tags. Field sections differ from zone sections in that the region is indexed separately from the rest of the document, which makes them more efficient. You can create an unlimited number of field sections.
Because field sections are indexed differently, they can also give better query performance than zone sections when you have a large number of documents indexed.
Field sections are best suited to a section that occurs once in a document, such as a field in a news header. Field sections can also be made visible to the rest of the document.
Unlike zone sections, field sections have the following restrictions:
- Field sections cannot overlap
- Field sections cannot repeat (repeated occurrences are treated as a single section)
- Field sections cannot nest
This section contains the following topics.
8.1.2.2.1 Visible and Invisible Field Sections
By default, field sections are indexed as a sub-document separate from the rest of the document. As such, field sections are invisible to the surrounding text and can only be queried by explicitly naming the section in the WITHIN clause.
You can make field sections visible if you want the text within the field section to be indexed as part of the enclosing document. Text within a visible field section can be queried with or without the WITHIN operator.
The following example shows the difference between using invisible and visible field sections.
The following code defines a section group basicgroup of the BASIC_SECTION_GROUP type. It then creates a field section in basicgroup called Author for the <A> tag. It also sets the visible flag to FALSE to create an invisible section:
begin ctx_ddl.create_section_group('basicgroup', 'BASIC_SECTION_GROUP'); ctx_ddl.add_field_section('basicgroup', 'Author', 'A', FALSE); end;
Because the Author field section is not visible, to find text within the Author section, you must use the WITHIN operator as follows:
'(Martin Luther King) WITHIN Author'
A query of Martin Luther King without the WITHIN operator does not return instances of this term in field sections. If you want to query text within field sections without specifying WITHIN, you must set the visible flag to TRUE when you create the section as follows:
begin ctx_ddl.add_field_section('basicgroup', 'Author', 'A', TRUE); end;
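If basicgroup already contains the invisible Author section from the earlier block, one way to switch to the visible definition is to drop and recreate the section group before indexing. The sketch below assumes that approach; the table docs and the columns id and doc_text are hypothetical.

```sql
BEGIN
  -- Recreate the group so that Author can be defined as a visible field section.
  ctx_ddl.drop_section_group('basicgroup');
  ctx_ddl.create_section_group('basicgroup', 'BASIC_SECTION_GROUP');
  ctx_ddl.add_field_section('basicgroup', 'Author', 'A', TRUE);
END;
/

-- After the index on the hypothetical docs(doc_text) column is built with basicgroup,
-- text inside <A>...</A> can be queried with or without WITHIN.
SELECT id FROM docs WHERE CONTAINS(doc_text, 'Martin Luther King') > 0;
SELECT id FROM docs WHERE CONTAINS(doc_text, '(Martin Luther King) WITHIN Author') > 0;
```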
8.1.2.2.2 Nested Field Sections
Field sections cannot be nested. For example, if you define a field section to start with <TITLE> and define another field section to start with <FOO>, the two sections cannot be nested as follows:
<TITLE> dog <FOO> cat </FOO> </TITLE>
To work with nested sections, define them as zone sections.
8.1.2.2.3 Repeated Field Sections
Repeated field sections are allowed, but WITHIN queries treat them as a single section. The following is an example of a repeated field section in a document:
<TITLE> cat </TITLE> <TITLE> dog </TITLE>
The query dog and cat within title returns the document, even though these words occur in different sections.
To have WITHIN queries distinguish repeated sections, define them as zone sections.
8.1.2.3 Stop Section
A stop section may be added to an automatic section group. Adding a stop section causes the automatic section indexing operation to ignore the specified section in XML documents.
Note:
Adding a stop section causes no section information to be created in the index. However, the text within a stop section is always searchable.
Adding a stop section is useful when your documents contain many low-information tags. Adding stop sections also improves indexing performance with the automatic section group.
The number of stop sections you can add is unlimited.
Stop sections do not have section names and hence are not recorded in the section views.
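A stop section is added with CTX_DDL.ADD_STOP_SECTION. The following is a minimal sketch; the group name myautogroup and the tag fluff are hypothetical and stand in for a low-information tag in your own documents.

```sql
BEGIN
  ctx_ddl.create_section_group('myautogroup', 'AUTO_SECTION_GROUP');
  -- No section is created for <fluff> tags; the text inside them stays searchable.
  ctx_ddl.add_stop_section('myautogroup', 'fluff');
END;
/
```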
8.1.2.4 MDATA Section
An MDATA section is used to reference user-defined metadata for a document. Using MDATA sections can speed up mixed queries. There is no limit to the number of MDATA sections that can be returned in a query.
Consider the case in which you want to query both according to text content and document type (magazine, newspaper, or novel). You can create an index with a column for text and a column for the document type, and then perform a mixed query. The following query searches for all novels that contain the phrase Adam Thorpe (author of the novel Ulverton):
SELECT id FROM documents WHERE doctype = 'novel' AND CONTAINS(text, 'Adam Thorpe')>0;
However, it is usually faster to incorporate the attribute (in this case, the document type) into a field section, rather than use a separate column, and then use a single CONTAINS query:
SELECT id FROM documents WHERE CONTAINS(text, 'Adam Thorpe AND novel WITHIN doctype')>0;
There are two drawbacks to this approach:
- Each time the attribute is updated, the entire text document must be re-indexed, resulting in increased index fragmentation and slower rates of processing DML.
- Field sections tokenize the section value. This has several effects. Special characters in metadata, such as decimal points or currency characters, are not easily searchable; value searching (searching for Thurston Howell but not Thurston Howell, Jr.) is difficult; multi-word values are queried by phrase, which is slower than single-token searching; and multi-word values do not show up in browse-words, making author browsing or subject browsing impossible.
For these reasons, using MDATA sections instead of field sections may be worthwhile. MDATA sections are indexed like field sections, but metadata values can be added to and removed from documents without the need to re-index the document text. Unlike field sections, MDATA values are not tokenized. Additionally, MDATA section indexing generally takes up less disk space than field section indexing.
Starting with Oracle Database 12c Release 2 (12.2), the MDATA section can be updatable or non-updatable depending upon the value of its read-only tag, which can be set to either FALSE or TRUE.
Use CTX_DDL.ADD_MDATA_SECTION to add an MDATA section to a section group. By default, the read-only value of an MDATA section is FALSE, which permits calls to CTX_DDL.ADD_MDATA() and CTX_DDL.REMOVE_MDATA() for that section. Set it to TRUE to make the section non-updatable. When it is set to FALSE, queries on the MDATA section run less efficiently because a cursor must be opened on the index table to track the deleted values for that MDATA section. This example adds an MDATA section called AUTHOR and gives it the value Soseki Natsume (author of the novel Kokoro).
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_mdata_section('htmgroup', 'author', 'Soseki Natsume'); end;
MDATA values can be added with CTX_DDL.ADD_MDATA and removed with CTX_DDL.REMOVE_MDATA. Also, MDATA sections can have multiple values. Only the owner of the index is allowed to call CTX_DDL.ADD_MDATA and CTX_DDL.REMOVE_MDATA.
Neither CTX_DDL.ADD_MDATA nor CTX_DDL.REMOVE_MDATA is supported for CTXCAT and CTXRULE indexes.
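The calls can be sketched as follows. The index name documents_idx and the value being removed are hypothetical, and the documents table with its id column is taken from the query examples in this section. Because these procedures do not commit, the block commits explicitly.

```sql
DECLARE
  v_rowid ROWID;
BEGIN
  -- Locate the row whose metadata is being changed.
  SELECT rowid INTO v_rowid FROM documents WHERE id = 1;

  -- Add one value to the author MDATA section and remove another (illustrative) one.
  ctx_ddl.add_mdata('documents_idx', 'author', 'Soseki Natsume', ROWIDTOCHAR(v_rowid));
  ctx_ddl.remove_mdata('documents_idx', 'author', 'Natsume Soseki', ROWIDTOCHAR(v_rowid));

  COMMIT;
END;
/
```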
MDATA values are not passed through a lexer. Instead, all values undergo a simplified normalization as follows:
- Leading and trailing whitespace on the value is removed.
- The value is truncated to 64 bytes.
- The value is indexed as a single value; if the value consists of multiple words, it is not broken up.
- Case is preserved. If the document is dynamically generated, you can implement case-insensitivity by uppercasing MDATA values and making sure to search only in uppercase.
After a document has MDATA metadata added to it, you can query for that metadata using the MDATA operator in a CONTAINS query:
SELECT id FROM documents WHERE CONTAINS(text, 'Tokyo and MDATA(author, Soseki Natsume)')>0;
This query will only be successful if an AUTHOR tag has the exact value Soseki Natsume (after simplified normalization). Soseki or Natsume Soseki returns no rows.
The following are considerations for MDATA:
- MDATA values are not highlightable, will not appear in the output of CTX_DOC.TOKENS, and will not appear when FILTER PLAINTEXT is enabled.
- MDATA sections must be unique within section groups. You cannot have an MDATA section named FOO and a zone or field section of the same name in the same section group.
- Like field sections, MDATA sections cannot overlap or nest. An MDATA section is implicitly closed by the first tag encountered. For instance, in this example:
<AUTHOR>Dickens <B>Shelley</B> Keats</AUTHOR>
The <B> tag closes the AUTHOR MDATA section; as a result, this document has an AUTHOR of 'Dickens', but not of 'Shelley' or 'Keats'.
- To prevent race conditions, each call to ADD_MDATA and REMOVE_MDATA locks out other calls on that rowid for that index for all values and sections. However, since ADD_MDATA and REMOVE_MDATA do not commit, it is possible for an application to deadlock when calling them both. It is the application's responsibility to prevent deadlocking.
See Also:
- "ALTER INDEX" in Oracle Text Reference
- "ADD_MDATA_SECTION" in Oracle Text Reference
- "CTX_SECTIONS" in Oracle Text Reference
- The "CONTAINS" query operators chapter of the Oracle Text Reference for information on the MDATA operator
- The "CTX_DDL" package chapter of Oracle Text Reference for information on adding and removing MDATA sections
8.1.2.5 NDATA Section
Fields containing data to be indexed for name searching can be specified exclusively by adding NDATA sections to section groups of type BASIC_SECTION_GROUP, HTML_SECTION_GROUP, or XML_SECTION_GROUP.
Users can synthesize textual documents, which contain name data, using two possible datastores: MULTI_COLUMN_DATASTORE or USER_DATASTORE. The following example uses MULTI_COLUMN_DATASTORE to pick up relevant columns containing the name data for indexing:
create table people(firstname varchar2(80), surname varchar2(80)); insert into people values('John', 'Smith'); commit; begin ctx_ddl.create_preference('nameds', 'MULTI_COLUMN_DATASTORE'); ctx_ddl.set_attribute('nameds', 'columns', 'firstname,surname'); end; /
This produces the following virtual text for indexing:
<FIRSTNAME> John </FIRSTNAME> <SURNAME> Smith </SURNAME>
You can then create NDATA sections for the FIRSTNAME and SURNAME sections:
begin ctx_ddl.create_section_group('namegroup', 'BASIC_SECTION_GROUP'); ctx_ddl.add_ndata_section('namegroup', 'FIRSTNAME', 'FIRSTNAME'); ctx_ddl.add_ndata_section('namegroup', 'SURNAME', 'SURNAME'); end; /
Then create the index using the datastore preference and section group preference created earlier:
create index peopleidx on people(firstname) indextype is ctxsys.context parameters('section group namegroup datastore nameds');
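Once the index exists, name data can be searched with the NDATA query operator, which is documented in Oracle Text Reference rather than in this section. The following sketch assumes the people table and the SURNAME section defined above; the misspelled search term is only an illustration of an approximate name search, and no particular result is implied.

```sql
-- Searches the SURNAME NDATA section for names similar to the supplied string.
SELECT firstname, surname
  FROM people
 WHERE CONTAINS(firstname, 'NDATA(surname, Smyth)') > 0;
```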
NDATA sections support both single-byte and multi-byte data. However, there are character-based and term-based limitations. NDATA section data that is indexed is constrained as follows:
- The number of characters in a single whitespace-delimited term: 511
- The number of whitespace-delimited terms: 255
- The total number of characters, including whitespace: 511
8.1.2.6 SDATA Section
The value of an SDATA section is extracted from the document text like other sections, but is indexed as structured data, also referred to as SDATA. SDATA sections support operations such as projection, range searches, and ordering. They also enable SDATA indexing of section data such as embedded tags, and detail table or function invocations. This enables you to perform various combinations of text and structured searches in a single SQL statement.
SDATA operators should be used only as descendants of AND operators that also have non-SDATA children. SDATA operators are meant to be used as secondary, checking or non-driving, criteria. For instance, "find documents with DOG that also have price > 5", rather than "find documents with rating > 4".
Use CTX_DDL.ADD_SDATA_SECTION to add an SDATA section to a section group. Use CTX_DDL.UPDATE_SDATA to update the values of an existing SDATA section. When querying within an SDATA section, you must use the CONTAINS operator. The following example creates a table called items, adds SDATA sections to a section group called my_sec_group, and then queries SDATA in those sections.
After you create an SDATA section, you can further modify the attributes of the SDATA section using CTX_DDL.SET_SECTION_ATTRIBUTE.
Create the table items:
CREATE TABLE items (id NUMBER PRIMARY KEY, doc VARCHAR2(4000)); INSERT INTO items VALUES (1, '<description> Honda Pilot </description> <category> Cars & Trucks </category> <price> 27000 </price>'); INSERT INTO items VALUES (2, '<description> Toyota Sequoia </description> <category> Cars & Trucks </category> <price> 35000 </price>'); INSERT INTO items VALUES (3, '<description> Toyota Land Cruiser </description> <category> Cars & Trucks </category> <price> 45000 </price>'); INSERT INTO items VALUES (4, '<description> Palm Pilot </description> <category> Electronics </category> <price> 5 </price>'); INSERT INTO items VALUES (5, '<description> Toyota Land Cruiser Grill </description> <category> Parts & Accessories </category> <price> 100 </price>'); COMMIT;
Create the section group my_sec_group and add the SDATA sections category and price:
BEGIN CTX_DDL.CREATE_SECTION_GROUP('my_sec_group', 'BASIC_SECTION_GROUP'); CTX_DDL.ADD_SDATA_SECTION('my_sec_group', 'category', 'category', 'VARCHAR2'); CTX_DDL.ADD_SDATA_SECTION('my_sec_group', 'price', 'price', 'NUMBER'); END;
Create the CONTEXT index:
CREATE INDEX items$doc ON items(doc) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('SECTION GROUP my_sec_group');
Run a query:
SELECT id, doc FROM items WHERE contains(doc, 'Toyota AND SDATA(category = ''Cars & Trucks'') AND SDATA(price <= 40000 )') > 0;
Return the results:
ID DOC ---- ---------------------------------------------------------------------- 2 <description> Toyota Sequoia </description> <category> Cars & Trucks </category> <price> 35000 </price>
The following example updates the value of the SDATA section price for the document whose id is 1 to a new value of 30000.
DECLARE rowid_to_update ROWID; BEGIN SELECT ROWID INTO rowid_to_update FROM items WHERE id=1; CTX_DDL.UPDATE_SDATA('items$doc', 'price', SYS.ANYDATA.CONVERTVARCHAR2('30000'), rowid_to_update); END;
After executing this block, the price of Honda Pilot is changed from 27000 to 30000.
Note:
You can also add an SDATA section to an existing index, without rebuilding the index, using the ADD SDATA SECTION parameter of the ALTER INDEX PARAMETERS statement. See the "ALTER INDEX" section of the Oracle Text Reference for more information.
See Also:
- The "CONTAINS" query section of the Oracle Text Reference for information on the SDATA operator
- The "CTX_DDL" package section of the Oracle Text Reference for information on adding and updating the SDATA sections and changing their attributes using the ADD_SDATA_SECTION, SET_SECTION_ATTRIBUTE, and UPDATE_SDATA procedures
Storage
For optimized_for search SDATA sections, you can specify the storage preferences for the $Sdatatype tables and the indexes on these tables using CTX_DDL.SET_ATTRIBUTE.
LOB caching is turned on by default for $S* tables, but is turned off by default for $S* indexes. These attributes are only valid on SDATA sections.
Query Operators
Optimized_for search SDATA supports the following query operators:
- =
- <>
- between
- not between
- <=
- <
- >=
- >
- is null
- is not null
- like
- not like
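As an illustration of one of these operators, the following sketch runs a range search with between against the items table and price section from the earlier example. It assumes that the price section was marked with the optimized_for attribute set to search before the index was created.

```sql
-- Text criterion drives the query; the SDATA range acts as a secondary filter.
SELECT id, doc
  FROM items
 WHERE CONTAINS(doc, 'Toyota AND SDATA(price BETWEEN 30000 AND 50000)') > 0;
```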
8.1.2.7 Attribute Section
You can define attribute sections to query on XML attribute text. You can also have the system automatically define and index XML attributes for you.
8.1.2.8 Special Sections
Special sections are not recognized by tags. Currently the only special sections supported are sentence and paragraph. This enables you to search for combination of words within sentences or paragraphs.
The sentence and paragraph boundaries are determined by the lexer. For example, the BASIC_LEXER recognizes sentence and paragraph section boundaries as follows:
Table 8-3 Sentence and Paragraph Section Boundaries for BASIC_LEXER
Special Section | Boundary |
---|---|
SENTENCE |
|
PARAGRAPH |
|
If the lexer cannot recognize the boundaries, no sentence or paragraph sections are indexed.
To add a special section, use the CTX_DDL.ADD_SPECIAL_SECTION procedure. For example, the following code enables searching within sentences within HTML documents:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_special_section('htmgroup', 'SENTENCE'); end;
You can also add zone sections to the group to enable zone searching in addition to sentence searching. The following example adds the zone section Headline to the section group htmgroup:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_special_section('htmgroup', 'SENTENCE'); ctx_ddl.add_zone_section('htmgroup', 'Headline', 'H1'); end;
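With both section types defined, sentence and zone queries can be issued against the same index. The following sketch assumes the docs(htmlfile) table used elsewhere in this chapter, indexed with htmgroup; the id column and the search terms are hypothetical.

```sql
-- Both terms must occur within the same sentence.
SELECT id FROM docs WHERE CONTAINS(htmlfile, '(Oracle AND Text) WITHIN SENTENCE') > 0;

-- The term must occur within an <H1> heading, which is mapped to the Headline zone section.
SELECT id FROM docs WHERE CONTAINS(htmlfile, 'Oracle WITHIN Headline') > 0;
```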
8.1.3 Oracle Text Section Attributes
Section attributes are the settings related to the Oracle Text sections of tokenized type, such as Field, Zone, Hybrid, and SDATA. Section attributes improve query performance by giving finer control at the section level, rather than at the document level or index level.
By using the section attributes, you can:
- Specify lexer preferences on certain sections of a document. This is useful for part-name searches, where a certain section of a document containing a part-name needs to be lexed differently than the rest of the document. The lexer preference can also be used for handling multi-lingual documents, where there is a section to language mapping.
- Specify a substring index only on certain sections of a document. This helps in reducing the index size.
- Create prefix tokens only on certain sections of a document. The prefix tokens are used to improve the performance of right-truncated queries, but can also cause the index size to grow rapidly. Specifying prefix indexing only on certain sections provides improved performance for the right-truncated queries on the specific sections, without rapidly growing the size of the index.
- Specify stoplists for certain sections of a document.
- Create a new section type that combines the flexibility of Zone sections with the performance of Field sections. Currently, Zone sections perform poorly compared to Field sections. However, Field sections do not support nested section search.
Section attributes are set using the procedure CTX_DDL.SET_SECTION_ATTRIBUTE.
Table 8-4 lists the section attributes that you can use:
Table 8-4 Section Attributes
Section Attribute | Description |
---|---|
visible | You can use the Specify Default is For the Field section type, the visible attribute overrides the value specified in the |
lexer | You can use the Specify the lexer preference name to decide the tokenization of an SDATA section. Default is The lexer preference must be valid at the time of calling the |
wordlist | You can use the Specify the wordlist preference name for a section to enable section specific prefix indexing and substring indexing. Default is The wordlist preference must be valid at the time of calling the |
stoplist | You can use the Specify the stoplist preference name for enabling section specific stoplist. Default is The stoplist preference must be valid at the time of calling the |
The following example enables the visible attribute of a Field section:
begin ctx_ddl.create_section_group('fieldgroup', 'BASIC_SECTION_GROUP'); ctx_ddl.add_field_section('fieldgroup', 'author', 'AUTHOR'); ctx_ddl.set_section_attribute('fieldgroup', 'author', 'visible', 'true'); end;
See Also:
Oracle Text Reference for the syntax of the CTX_DDL.SET_SECTION_ATTRIBUTE procedure.
8.2 HTML Section Searching with Oracle Text
HTML has internal structure in the form of tagged text, which you can use for section searching. For example, you can define a section called headings for the <H1> tag. This enables you to search for terms only within these tags across your document set.
To query, you use the WITHIN operator. Oracle Text returns all documents that contain your query term within the headings section. Thus, to find all documents that contain the word oracle within headings, enter the following query:
'oracle within headings'
This section contains these topics:
8.2.1 Creating HTML Sections
The following code defines a section group called htmgroup of type HTML_SECTION_GROUP. It then creates a zone section in htmgroup called heading identified by the <H1> tag:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1'); end;
You can then index your documents as follows:
create index myindex on docs(htmlfile) indextype is ctxsys.context parameters('filter ctxsys.null_filter section group htmgroup');
After indexing with section group htmgroup, you can query within the heading section by issuing a query as follows:
'Oracle WITHIN heading'
8.2.2 Searching HTML Meta Tags
With HTML documents you can also create sections for NAME/CONTENT pairs in <META> tags. When you do so, you can limit your searches to text within CONTENT.
Example: Creating Sections for <META> Tags
Consider an HTML document that has a META tag as follows:
<META NAME="author" CONTENT="ken">
To create a zone section that indexes all CONTENT attributes for the META tag whose NAME value is author:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'author', 'meta@author'); end;
After indexing with section group htmgroup, you can query the document as follows:
'ken WITHIN author'
8.3 XML Section Searching with Oracle Text
Like HTML documents, XML documents have tagged text which you can use to define blocks of text for section searching. The contents of a section can be searched with the WITHIN or INPATH operators.
The following sections describe the different types of XML searching:
8.3.1 Automatic Sectioning
You can set up your indexing operation to automatically create sections from XML documents using the section group AUTO_SECTION_GROUP. The system creates zone sections for XML tags. Attribute sections are created for the tags that have attributes, and these sections are named in the form tag@attribute.
For example, the following statement creates the index myindex on a column containing the XML files using the AUTO_SECTION_GROUP:
CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS ('datastore ctxsys.default_datastore filter ctxsys.null_filter section group ctxsys.auto_section_group' );
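With automatic sectioning, tag names and tag@attribute names can be used directly in WITHIN queries. The following sketch assumes the indexed column contains a document such as the one shown in the comment; the id column is hypothetical. Section names are written in the same case as the tags because automatically derived section names are case-sensitive.

```sql
-- Assumed sample document in xmldocs.xmlfile:
--   <BOOK TITLE="Tale of Two Cities"> It was the best of times. </BOOK>

-- Search the text enclosed by the BOOK tag (automatic zone section).
SELECT id FROM xmldocs WHERE CONTAINS(xmlfile, 'times WITHIN BOOK') > 0;

-- Search the TITLE attribute of BOOK (automatic attribute section).
SELECT id FROM xmldocs WHERE CONTAINS(xmlfile, 'Cities WITHIN BOOK@TITLE') > 0;
```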
8.3.2 Attribute Searching
You can search XML attribute text in one of two ways:
- Creating Attribute Sections
Create attribute sections with CTX_DDL.ADD_ATTR_SECTION and then index with XML_SECTION_GROUP. If you use AUTO_SECTION_GROUP when you index, attribute sections are created automatically. You can query attribute sections with the WITHIN operator.
- Searching Attributes with the INPATH Operator
Index with the PATH_SECTION_GROUP and query attribute text with the INPATH operator.
8.3.2.1 Creating Attribute Sections
Consider an XML file that defines the BOOK tag with a TITLE attribute as follows:
<BOOK TITLE="Tale of Two Cities"> It was the best of times. </BOOK>
To define the title attribute as an attribute section, create an XML_SECTION_GROUP and define the attribute section as follows:
begin ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP'); ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title'); end;
To index:
CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS ('datastore ctxsys.default_datastore filter ctxsys.null_filter section group myxmlgroup' );
You can query the XML attribute section booktitle as follows:
'Cities within booktitle'
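As a sketch of how this query expression fits into a complete statement, the following selects the matching rows from the xmldocs table indexed above and orders them by relevance. The column aliases are illustrative only.

```sql
-- Search only the text of the TITLE attribute, exposed as section booktitle
SELECT SCORE(1) AS relevance, ROWID AS doc_rowid
FROM   xmldocs
WHERE  CONTAINS(xmlfile, 'Cities WITHIN booktitle', 1) > 0
ORDER  BY SCORE(1) DESC;
```

The words in the BOOK element body (for example, times) are not part of the attribute section, so searching for them WITHIN booktitle is not expected to match this document.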
8.3.3 Creating Document Type Sensitive Sections
For an XML document set that contains the <book> tag declared for different document types, you may want to create a distinct book section for each document type. The following scenario shows how to create book sections for each document type to improve search capability.
Assume that mydocname1 is declared as an XML document type (root element) as follows:
<!DOCTYPE mydocname1 ... [...
Within mydocname1, the element <book> is declared. For this tag, you can create a section named mybooksec1 that is sensitive to the tag's document type as follows:
begin
ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP'); ctx_ddl.add_zone_section('myxmlgroup', 'mybooksec1', 'mydocname1(book)');
end;
Assume that mydocname2 is declared as another XML document type (root element) as follows:
<!DOCTYPE mydocname2 ... [...
Within mydocname2, the element <book> is declared. For this tag, you can create a section named mybooksec2 that is sensitive to the tag's document type as follows:
begin
-- Section group myxmlgroup already exists from the mydocname1 step; creating it again would raise an error.
ctx_ddl.add_zone_section('myxmlgroup', 'mybooksec2', 'mydocname2(book)');
end;
To query within the section mybooksec1, use WITHIN as follows:
'oracle within mybooksec1'
8.3.4 Path Section Searching
XML documents can have parent-child tag structures such as:
<A> <B> <C> dog </C> </B> </A>
In this scenario, tag C is a child of tag B which is a child of tag A.
With Oracle Text, you can do path searching with PATH_SECTION_GROUP. This section group enables you to specify direct parentage in queries, for example, to find all documents that contain the term dog in element C, where C is a child of element B, and so on.
With PATH_SECTION_GROUP, you can also perform attribute value searching and attribute equality testing.
The new operators associated with this feature are:
- INPATH
- HASPATH
This section contains the following topics.
8.3.4.1 Creating an Index with PATH_SECTION_GROUP
To enable path section searching, index your XML document set with PATH_SECTION_GROUP. For example:
Create the preference.
begin ctx_ddl.create_section_group('xmlpathgroup', 'PATH_SECTION_GROUP'); end;
Create the index.
CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS ('datastore ctxsys.default_datastore filter ctxsys.null_filter section group xmlpathgroup' );
After you create the index, you can use the INPATH and HASPATH operators.
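The two steps above can be tried end to end with a small test table. The following is a sketch under stated assumptions: the id column and the two sample rows are hypothetical, the schema is assumed to have EXECUTE privilege on CTX_DDL (for example, through the CTXAPP role), and the slash lines are the SQL*Plus or SQLcl terminator for PL/SQL blocks.

```sql
-- Hypothetical test table; the source names only xmldocs(xmlfile).
CREATE TABLE xmldocs (
  id      NUMBER PRIMARY KEY,   -- assumed key column, not in the source
  xmlfile CLOB
);

-- Sample documents, loaded before the index is created so that
-- CREATE INDEX picks them up.
INSERT INTO xmldocs VALUES (1, '<A><B>My dog is friendly.</B></A>');
INSERT INTO xmldocs VALUES (2, '<C><B>My dog is friendly.</B></C>');
COMMIT;

-- Step 1: create the path section group.
BEGIN
  ctx_ddl.create_section_group('xmlpathgroup', 'PATH_SECTION_GROUP');
END;
/

-- Step 2: create the index with that section group.
CREATE INDEX myindex ON xmldocs(xmlfile)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore ctxsys.default_datastore
               filter ctxsys.null_filter
               section group xmlpathgroup');

-- Path query: dog in a B element that is a direct child of a top-level A.
SELECT id
FROM   xmldocs
WHERE  CONTAINS(xmlfile, 'dog INPATH(A/B)') > 0;
```

The path expressions in the following topics can be substituted into the last statement in the same way.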
8.3.4.2 Top-Level Tag Searching
To find all documents that contain the term dog in the top-level tag <A>:
dog INPATH (/A)
or
dog INPATH(A)
8.3.4.3 Any-Level Tag Searching
To find all documents that contain the term dog in the <A> tag at any level:
dog INPATH(//A)
This query finds the following documents:
<A>dog</A>
and
<C><B><A>dog</A></B></C>
8.3.4.4 Direct Parentage Searching
To find all documents that contain the term dog in a B element that is a direct child of a top-level A element:
dog INPATH(A/B)
This query finds the following XML document:
<A><B>My dog is friendly.</B></A>
but does not find:
<C><B>My dog is friendly.</B></C>
8.3.4.5 Tag Value Testing
You can test the value of tags. For example, the query:
dog INPATH(A[B="dog"])
finds the following document:
<A><B>dog</B></A>
but does not find:
<A><B>My dog is friendly.</B></A>
8.3.4.6 Attribute Searching
You can search the content of attributes. For example, the query:
dog INPATH(//A/@B)
finds the document:
<C><A B="snoop dog"> </A> </C>
8.3.4.7 Attribute Value Testing
You can test the value of attributes. For example, the query:
California INPATH (//A[@B = "home address"])
finds the document:
<A B="home address">San Francisco, California, USA</A>
but does not find:
<A B="work address">San Francisco, California, USA</A>
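When a path expression that contains double-quoted values is embedded in SQL, the whole query expression still goes inside a single-quoted string literal, so the double quotes need no escaping. A minimal sketch, assuming the xmldocs table is indexed with a PATH_SECTION_GROUP as in Creating an Index with PATH_SECTION_GROUP:

```sql
-- Attribute value test: California inside any A element whose B attribute
-- equals "home address"
SELECT ROWID AS doc_rowid
FROM   xmldocs
WHERE  CONTAINS(xmlfile, 'California INPATH(//A[@B = "home address"])') > 0;

-- Tag value test from the earlier topic, written the same way
SELECT ROWID AS doc_rowid
FROM   xmldocs
WHERE  CONTAINS(xmlfile, 'dog INPATH(A[B="dog"])') > 0;
```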
8.3.4.8 Path Testing
You can test whether a path exists with the HASPATH operator. For example, the query:
HASPATH(A/B/C)
finds and returns a score of 100 for the document:
<A><B><C>dog</C></B></A>
without the query having to reference dog at all.
8.3.4.9 Section Equality Testing with HASPATH
You can use the HASPATH operator to do section equality tests. For example, the query:
dog INPATH(A)
finds
<A>dog</A>
but it also finds
<A>dog park</A>
To limit the query to the term dog and nothing else, you can use a section equality test with the HASPATH operator. For example,
HASPATH(A="dog")
finds and returns a score of 100 only for the first document, and not the second.
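Because HASPATH takes no search term, the complete query expression is just the operator and its path. The following sketch shows both HASPATH forms in full statements and exposes the score with SCORE. It assumes the xmldocs table is indexed with a PATH_SECTION_GROUP, and the column aliases are illustrative only.

```sql
-- Path existence test: documents that contain the path A/B/C
SELECT SCORE(1) AS relevance, ROWID AS doc_rowid
FROM   xmldocs
WHERE  CONTAINS(xmlfile, 'HASPATH(A/B/C)', 1) > 0;

-- Section equality test: documents whose top-level A element contains
-- the term dog and nothing else
SELECT SCORE(1) AS relevance, ROWID AS doc_rowid
FROM   xmldocs
WHERE  CONTAINS(xmlfile, 'HASPATH(A="dog")', 1) > 0;
```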
See Also:
Oracle Text Reference to learn more about using the INPATH and HASPATH operators.