27 Mining Unstructured Text
Explains how to use Oracle Data Mining to mine unstructured text.
27.1 About Unstructured Text
Data mining algorithms act on data that is numerical or categorical. Numerical data is ordered. It is stored in columns that have a numeric data type, such as NUMBER
or FLOAT
. Categorical data is identified by category or classification. It is stored in columns that have a character data type, such as VARCHAR2
or CHAR
.
Unstructured text data is neither numerical nor categorical. Unstructured text includes items such as web pages, document libraries, Power Point presentations, product specifications, emails, comment fields in reports, and call center notes. It has been said that unstructured text accounts for more than three quarters of all enterprise data. Extracting meaningful information from unstructured text can be critical to the success of a business.
27.2 About Text Mining and Oracle Text
Understand what is text mining and oracle text.
Text mining is the process of applying data mining techniques to text terms, also called text features or tokens. Text terms are words or groups of words that have been extracted from text documents and assigned numeric weights. Text terms are the fundamental unit of text that can be manipulated and analyzed.
Oracle Text is a Database technology that provides term extraction, word and theme searching, and other utilities for querying text. When columns of text are present in the training data, Oracle Data Mining uses Oracle Text utilities and term weighting strategies to transform the text for mining. Oracle Data Mining passes configuration information supplied by you to Oracle Text and uses the results in the model creation process.
Related Topics
27.3 Data Preparation for Text Features
The model details view for text features is DM$VX
model_name.
The text feature view DM$VX
model_name
describes the extracted text features if there are text attributes present. The view has the following schema:
Name Type
-------------- ---------------------
PARTITION_NAME VARCHAR2(128)
COLUMN_NAME VARCHAR2(128)
TOKEN VARCHAR2(4000)
DOCUMENT_FREQUENCY NUMBER
Table 27-1 Text Feature View for Extracted Text Features
Column Name | Description |
---|---|
|
A partition in a partitioned model to retrieve details |
|
Name of the identifier column |
|
Text token which is usually a word or stemmed word |
|
A measure of token frequency in the entire training set |
27.4 Creating a Model that Includes Text Mining
Learn how to create a model that includes text mining.
Oracle Data Mining supports unstructured text within columns of VARCHAR2
, CHAR
, CLOB
, BLOB
, and BFILE
, as described in the following table:
Table 27-2 Column Data Types That May Contain Unstructured Text
Data Type | Description |
---|---|
|
Oracle Data Mining interprets |
|
Oracle Data Mining interprets |
|
Oracle Data Mining interprets |
|
Oracle Data Mining interprets Oracle Data Mining interprets |
The settings described in the following table control the term extraction process for text attributes in a model. Instructions for specifying model settings are in "Specifying Model Settings".
Table 27-3 Model Settings for Text
Setting Name | Data Type | Setting Value | Description |
---|---|---|---|
|
|
Name of an Oracle Text policy object created with |
Affects how individual tokens are extracted from unstructured text. See "Creating a Text Policy". |
|
|
1 <= value <= 100000 |
Maximum number of features to use from the document set (across all documents of each text column) passed to Default is 3000. |
A model can include one or more text attributes. A model with text attributes can also include categorical and numerical attributes.
To create a model that includes text attributes:
-
Create an Oracle Text policy object..
-
Specify the model configuration settings that are described in "Table 27-3".
-
Specify which columns must be treated as text and, optionally, provide text transformation instructions for individual attributes.
-
Pass the model settings and text transformation instructions to
DBMS_DATA_MINING.CREATE_MODEL
.Note:
All algorithms except O-Cluster can support columns of unstructured text.
The use of unstructured text is not recommended for association rules (Apriori).
27.5 Creating a Text Policy
An Oracle Text policy specifies how text content must be interpreted. You can provide a text policy to govern a model, an attribute, or both the model and individual attributes.
If a model-specific policy is present and one or more attributes have their own policies, Oracle Data Mining uses the attribute policies for the specified attributes and the model-specific policy for the other attributes.
The CTX_DDL.CREATE_POLICY
procedure creates a text policy.
CTX_DDL.CREATE_POLICY( policy_name IN VARCHAR2, filter IN VARCHAR2 DEFAULT NULL, section_group IN VARCHAR2 DEFAULT NULL, lexer IN VARCHAR2 DEFAULT NULL, stoplist IN VARCHAR2 DEFAULT NULL, wordlist IN VARCHAR2 DEFAULT NULL);
The parameters of CTX_DDL.CREATE_POLICY
are described in the following table.
Table 27-4 CTX_DDL.CREATE_POLICY Procedure Parameters
Parameter Name | Description |
---|---|
|
Name of the new policy object. Oracle Text policies and text indexes share the same namespace. |
|
Specifies how the documents must be converted to plain text for indexing. Examples are: For |
|
Identifies sections within the documents. For example, For Note: You can specify any section group that is supported by |
|
Identifies the language that is being indexed. For example, For |
|
Specifies words and themes to exclude from term extraction. For example, the word "the" is typically in the stoplist for English language documents. The system-supplied stoplist is used by default. See "Stoplists" in Oracle Text Reference. |
|
Specifies how stems and fuzzy queries must be expanded. A stem defines a root form of a word so that different grammatical forms have a single representation. A fuzzy query includes common misspellings in the representation of a word. See " |
Related Topics
27.6 Configuring a Text Attribute
Learn how to identify a column as a text attribute and provide transformation instructions for any text attribute.
As shown in Table 27-2, you can identify columns of CHAR,
shorter VARCHAR2
(<=4000), BFILE
, and BLOB
as text attributes. If CHAR
and shorter VARCHAR2
columns are not explicitly identified as unstructured text, then CREATE_MODEL
processes them as categorical attributes. If BFILE
and BLOB
columns are not explicitly identified as unstructured text, then CREATE_MODEL
returns an error.
To identify a column as a text attribute, supply the keyword TEXT
in an Attribute specification. The attribute specification is a field (attribute_spec
) in a transformation record (transform_rec
). Transformation records are components of transformation lists (xform_list
) that can be passed to CREATE_MODEL
.
Note:
An attribute specification can also include information that is not related to text. Instructions for constructing an attribute specification are in "Embedding Transformations in a Model".
You can provide transformation instructions for any text attribute by qualifying the TEXT
keyword in the attribute specification with the subsettings described in the following table.
Table 27-5 Attribute-Specific Text Transformation Instructions
Subsetting Name | Description | Example |
---|---|---|
|
Name of an Oracle Text policy object created with |
( |
|
The following values are supported:
|
|
|
Maximum number of features to use from the attribute. |
|
Note:
The TEXT
keyword is only required for CLOB
and longer VARCHAR2
(>4000) when you specify transformation instructions. The TEXT
keyword is always required for CHAR
, shorter VARCHAR2
, BFILE
, and BLOB
— whether or not you specify transformation instructions.
Tip:
You can view attribute specifications in the data dictionary view ALL_MINING_MODEL_ATTRIBUTES
, as shown in Oracle Database Reference.
Token Types in an Attribute Specification
When stems or themes are specified as the token type, the lexer preference for the text policy must support these types of tokens.
The following example adds themes and English stems to BASIC_LEXER
.
BEGIN CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER'); CTX_DDL.SET_ATTRIBUTE('my_lexer', 'index_stems', 'ENGLISH'); CTX_DDL.SET_ATTRIBUTE('my_lexer', 'index_themes', 'YES'); END;
Example 27-1 A Sample Attribute Specification for Text
This expression specifies that text transformation for the attribute must use the text policy named my_policy
. The token type is THEME
, and the maximum number of features is 3000.
"TEXT(POLICY_NAME:my_policy)(TOKEN_TYPE:THEME)(MAX_FEATURES:3000)"