21 Data Mining With SQL
Learn how to solve business problems using the Oracle Data Mining application programming interface (API).
21.1 Highlights of the Data Mining API
Learn about the advantages of Data Mining application programming interface (API).
Data mining is a valuable technology in many application domains. It has become increasingly indispensable in the private sector as a tool for optimizing operations and maintaining a competitive edge. Data mining also has critical applications in the public sector and in scientific research. However, the complexities of data mining application development and the complexities inherent in managing and securing large stores of data can limit the adoption of data mining technology.
Oracle Data Mining is uniquely suited to addressing these challenges. The data mining engine is implemented in the Database kernel, and the robust administrative features of Oracle Database are available for managing and securing the data. While supporting a full range of data mining algorithms and procedures, the API also has features that simplify the development of data mining applications.
The Oracle Data Mining API consists of extensions to Oracle SQL, the native language of the Database. The API offers the following advantages:
-
Scoring in the context of SQL queries. Scoring can be performed dynamically or by applying data mining models.
-
Automatic Data Preparation (ADP) and embedded transformations.
-
Model transparency. Algorithm-specific queries return details about the attributes that were used to create the model.
-
Scoring transparency. Details about the prediction, clustering, or feature extraction operation can be returned with the score.
-
A workflow-based graphical user interface (GUI) within Oracle SQL Developer. You can download SQL Developer free of charge from the following site:
http://www.oracle.com/pls/topic/lookup?ctx=db122&id=datminGUI
Related Topics
21.2 Example: Targeting Likely Candidates for a Sales Promotion
This example targets customers in Brazil for a special promotion that offers coupons and an affinity card.
The query uses data on marital status, education, and income to predict the customers who are most likely to take advantage of the incentives. The query applies a decision tree model called dt_sh_clas_sample
to score the customer data.
Example 21-1 Predict Best Candidates for an Affinity Card
SELECT cust_id FROM mining_data_apply_v WHERE PREDICTION(dt_sh_clas_sample USING cust_marital_status, education, cust_income_level ) = 1 AND country_name IN 'Brazil'; CUST_ID ---------- 100404 100607 101113
The same query, but with a bias to favor false positives over false negatives, is shown here.
SELECT cust_id FROM mining_data_apply_v WHERE PREDICTION(dt_sh_clas_sample COST MODEL USING cust_marital_status, education, cust_income_level ) = 1 AND country_name IN 'Brazil'; CUST_ID ---------- 100139 100163 100275 100404 100607 101113 101170 101463
The COST MODEL
keywords cause the cost matrix associated with the model to be used in making the prediction. The cost matrix, stored in a table called dt_sh_sample_costs
, specifies that a false negative is eight times more costly than a false positive. Overlooking a likely candidate for the promotion is far more costly than including an unlikely candidate.
SELECT * FROM dt_sh_sample_cost; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE COST ------------------- ---------------------- ---------- 0 0 0 0 1 1 1 0 8 1 1 0
21.3 Example: Analyzing Preferred Customers
The examples in this section reveal information about customers who use affinity cards or are likely to use affinity cards.
Example 21-2 Find Demographic Information About Preferred Customers
This query returns the gender, age, and length of residence of typical affinity card holders. The anomaly detection model, SVMO_SH_Clas_sample
, returns 1
for typical cases and 0
for anomalies. The demographics are predicted for typical customers only; outliers are not included in the sample.
SELECT cust_gender, round(avg(age)) age, round(avg(yrs_residence)) yrs_residence, count(*) cnt FROM mining_data_one_class_v WHERE PREDICTION(SVMO_SH_Clas_sample using *) = 1 GROUP BY cust_gender ORDER BY cust_gender; CUST_GENDER AGE YRS_RESIDENCE CNT ------------ ---------- ------------- ---------- F 40 4 36 M 45 5 304
Example 21-3 Dynamically Identify Customers Who Resemble Preferred Customers
This query identifies customers who do not currently have an affinity card, but who share many of the characteristics of affinity card holders. The PREDICTION
and PREDICTION_PROBABILITY
functions use an OVER
clause instead of a predefined model to classify the customers. The predictions and probabilities are computed dynamically.
SELECT cust_id, pred_prob FROM (SELECT cust_id, affinity_card, PREDICTION(FOR TO_CHAR(affinity_card) USING *) OVER () pred_card, PREDICTION_PROBABILITY(FOR TO_CHAR(affinity_card),1 USING *) OVER () pred_prob FROM mining_data_build_v) WHERE affinity_card = 0 AND pred_card = 1 ORDER BY pred_prob DESC; CUST_ID PRED_PROB ---------- --------- 102434 .96 102365 .96 102330 .96 101733 .95 102615 .94 102686 .94 102749 .93 . . . . 102580 .52 102269 .52 102533 .51 101604 .51 101656 .51 226 rows selected.
Example 21-4 Predict the Likelihood that a New Customer Becomes a Preferred Customer
This query computes the probability of a first-time customer becoming a preferred customer (an affinity card holder). This query can be executed in real time at the point of sale.
The new customer is a 44-year-old American executive who has a bachelors degree and earns more than $300,000/year. He is married, lives in a household of 3, and has lived in the same residence for the past 6 years. The probability of this customer becoming a typical affinity card holder is only 5.8%.
SELECT PREDICTION_PROBABILITY(SVMO_SH_Clas_sample, 1 USING 44 AS age, 6 AS yrs_residence, 'Bach.' AS education, 'Married' AS cust_marital_status, 'Exec.' AS occupation, 'United States of America' AS country_name, 'M' AS cust_gender, 'L: 300,000 and above' AS cust_income_level, '3' AS houshold_size ) prob_typical FROM DUAL; PROB_TYPICAL ------------ 5.8
Example 21-5 Use Predictive Analytics to Find Top Predictors
The DBMS_PREDICTIVE_ANALYTICS
PL/SQL package contains routines that perform simple data mining operations without a predefined model. In this example, the EXPLAIN
routine computes the top predictors for affinity card ownership. The results show that household size, marital status, and age are the top three predictors.
BEGIN DBMS_PREDICTIVE_ANALYTICS.EXPLAIN( data_table_name => 'mining_data_test_v', explain_column_name => 'affinity_card', result_table_name => 'cust_explain_result'); END; / SELECT * FROM cust_explain_result WHERE rank < 4; ATTRIBUTE_NAME ATTRIBUTE_SUBNAME EXPLANATORY_VALUE RANK ------------------------ -------------------- ----------------- ---------- HOUSEHOLD_SIZE .209628541 1 CUST_MARITAL_STATUS .199794636 2 AGE .111683067 3
21.4 Example: Segmenting Customer Data
The examples in this section use an Expectation Maximization clustering model to segment the customer data based on common characteristics.
Example 21-6 Compute Customer Segments
This query computes natural groupings of customers and returns the number of customers in each group.
SELECT CLUSTER_ID(em_sh_clus_sample USING *) AS clus, COUNT(*) AS cnt FROM mining_data_apply_v GROUP BY CLUSTER_ID(em_sh_clus_sample USING *) ORDER BY cnt DESC; CLUS CNT ---------- ---------- 9 311 3 294 7 215 12 201 17 123 16 114 14 86 19 64 15 56 18 36
Example 21-7 Find the Customers Who Are Most Likely To Be in the Largest Segment
The query in Example 21-6 shows that segment 9 has the most members. The following query lists the five customers who are most likely to be in segment 9.
SELECT cust_id FROM (SELECT cust_id, RANK() over (ORDER BY prob DESC, cust_id) rnk_clus2 FROM (SELECT cust_id, ROUND(CLUSTER_PROBABILITY(em_sh_clus_sample, 9 USING *),3) prob FROM mining_data_apply_v)) WHERE rnk_clus2 <= 5 ORDER BY rnk_clus2; CUST_ID ---------- 100002 100012 100016 100019 100021
Example 21-8 Find Key Characteristics of the Most Representative Customer in the Largest Cluster
The query in Example 21-7 lists customer 100002 first in the list of likely customers for segment 9. The following query returns the five characteristics that are most significant in determining the assignment of customer 100002 to segments with probability > 20% (only segment 9 for this customer).
SELECT S.cluster_id, probability prob, CLUSTER_DETAILS(em_sh_clus_sample, S.cluster_id, 5 using T.*) det FROM (SELECT v.*, CLUSTER_SET(em_sh_clus_sample, NULL, 0.2 USING *) pset FROM mining_data_apply_v v WHERE cust_id = 100002) T, TABLE(T.pset) S ORDER BY 2 desc; CLUSTER_ID PROB DET ---------- ------- -------------------------------------------------------------------------------- 9 1.0000 <Details algorithm="Expectation Maximization" cluster="9"> <Attribute name="YRS_RESIDENCE" actualValue="4" weight="1" rank="1"/> <Attribute name="EDUCATION" actualValue="Bach." weight="0" rank="2"/> <Attribute name="AFFINITY_CARD" actualValue="0" weight="0" rank="3"/> <Attribute name="BOOKKEEPING_APPLICATION" actualValue="1" weight="0" rank="4"/> <Attribute name="Y_BOX_GAMES" actualValue="0" weight="0" rank="5"/> </Details>
21.5 Example : Building an ESA Model with a Wiki Dataset
The examples shows FEATURE_COMPARE
function with Explicit Semantic Analysis (ESA) model, which compares a similar set of texts and then a dissimilar set of texts.
The example shows an ESA model built against a 2005 Wiki dataset rendering over 200,000 features. The documents are mined as text and the document titles are given as the feature IDs.
Similar Texts
SELECT 1-FEATURE_COMPARE(esa_wiki_mod USING 'There are several PGA tour golfers from South Africa' text AND USING 'Nick Price won the 2002 Mastercard Colonial Open' text) similarity FROM DUAL;
SIMILARITY
----------
.258
The output metric shows distance calculation. Therefore, smaller number represent more similar texts. So, 1
minus the distance in the queries result in similarity.
Dissimilar Texts
SELECT 1-FEATURE_COMPARE(esa_wiki_mod USING 'There are several PGA tour golfers from South Africa' text AND USING 'John Elway played quarterback for the Denver Broncos' text) similarity FROM DUAL;
SIMILARITY
----------
.007