(KTC) is a solution for text analytics. It automatically transforms unstructured text attributes into a structured representation that the KAF modeling components can use. The transformation removes "stop words", merges sequences of words declared as "concepts", reduces each word to its root through "stemming" rules, and merges synonyms. KTC allows text fields to be used "as is" in classification, regression, and clustering tasks. It comes packaged with rules for several languages, such as French, German, English, and Spanish, and can be easily extended to other languages.
Benefits: KTC improves the quality of predictive models by taking advantage of previously unused text attributes. For example, messages, emails sent to a support line, marketing survey results, or call center chats can be used to enhance the results of models for cross-sell or attrition.
What: KTC automatically prepares and transforms textual variables into a structured representation to be used in the KXEN Analytic Framework. KTC automatically handles language recognition and can be extended to domain-specific languages.
Why: Today's operational databases contain a lot of textual information, such as messages or synopses of emails sent to a support line, free-form answers in marketing surveys, or free text captured in call center tools during direct discussions with customers or prospects. This type of data can improve the accuracy and reliability of models built to forecast customer attrition, assign attitudes, improve cross-selling, or detect fraud. In most of these cases, statistical analysis of the texts is enough to improve these predictive and descriptive tasks at no additional collection cost, since the texts are already captured.
How: KTC, like any other KXEN component, is trained before it is applied to encode textual fields. The training phase has two goals: (1) to recognize the language from a list of possible languages, and (2) to process the textual field values in a training data set to isolate the word roots that can then be used to 'encode' the textual fields, translating unstructured data into a structured form that can be provided to K2C, for example. The internal process is decomposed into several steps. The first splits a textual value into a sequence of tokens (words); the user may associate some sequences of tokens with a 'concept' specific to a domain. The resulting list of words is then stripped of the words contained in a stop list, which the user can extend; the stop list is also used to recognize the language. The third phase, the most complex, is called 'stemming': it uses rules to extract the 'root' of each word by removing prefixes and suffixes that are specific to each language (such as feminine/masculine or plural/singular inflections, or verb tenses). KTC provides rules for several languages, such as French, German, English, and Spanish, and it can be extended to other languages without a new software version, through a file containing stemming rules specific to the language. Writing stemming rules may require some linguistic knowledge of the language, but domain users can extend an existing language to a domain-specific one without linguistic expertise (for example, extending English to English for aeronautics with specific rules or airplane conventions). Once the textual value is recognized in terms of roots, it is encoded through the presence or absence of these roots (advanced users can specify more complex encoding schemes). KTC also characterizes the textual value with the recognized language, the original word count, and the count of recognized roots.
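The steps described above can be illustrated with a small sketch. This is not KTC's implementation or API: the stop lists, suffix rules, concept table, function names, and the `vocabulary` argument (standing in for the roots kept after training) are all hypothetical, and real stemming rules are far richer than plain suffix stripping.

```python
import re

# Hypothetical, tiny stop lists; real ones would be much larger.
STOP_WORDS = {
    "en": {"the", "a", "is", "my", "and", "to", "of", "not"},
    "fr": {"le", "la", "est", "mon", "et", "de", "ne", "pas"},
}
# Hypothetical suffix-stripping rules, tried in the order listed.
SUFFIX_RULES = {
    "en": ["ing", "ed", "es", "s"],
    "fr": ["ement", "es", "er", "s"],
}
# Hypothetical user-declared concepts: a token sequence merged into one token.
CONCEPTS = {("credit", "card"): "credit_card"}


def tokenize(text):
    """Split a textual value into lowercase word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def detect_language(tokens):
    """Pick the language whose stop list matches the most tokens."""
    return max(STOP_WORDS, key=lambda lang: sum(t in STOP_WORDS[lang] for t in tokens))


def merge_concepts(tokens):
    """Replace declared token sequences with their concept name."""
    out, i = [], 0
    while i < len(tokens):
        for seq, name in CONCEPTS.items():
            if tuple(tokens[i:i + len(seq)]) == seq:
                out.append(name)
                i += len(seq)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def stem(word, lang):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIX_RULES[lang]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def encode(text, vocabulary):
    """Encode a text as presence/absence of known roots plus summary fields."""
    tokens = tokenize(text)
    lang = detect_language(tokens)
    words = [t for t in merge_concepts(tokens) if t not in STOP_WORDS[lang]]
    roots = [stem(w, lang) for w in words]
    row = {f"root_{r}": int(r in roots) for r in vocabulary}
    row.update(
        language=lang,
        word_count=len(tokens),
        root_count=sum(r in vocabulary for r in roots),
    )
    return row


row = encode(
    "My credit card is not working and payments failed",
    ["credit_card", "work", "refund"],
)
print(row)
```

Each text becomes one row of 0/1 columns, one per known root, plus the recognized language and the two counts, which is the kind of structured form a downstream classification or regression component can consume.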
Benefits for the business user: Encoding textual data is part of the regular automated process developed by KXEN, so everything can be transparent to the user once a field is declared as 'textual'. Roots that carry information are encoded automatically without further effort, and KXEN provides many reports to improve modeling performance and business interpretation. Reports on the correlations between useful roots can be used to design concepts present in textual values; roots that bring no information may be added to the stop lists so that they are discarded later. Reports on stemming rule usage may be used to develop more accurate domain-specific stemming rules.
Benefits for the Data Mining expert: KTC automates and speeds up data encoding, which accelerates the entire Data Mining process. KTC allows the expert to quickly evaluate whether textual information is useful for a specific task and, if needed, to develop specific rules to improve overall modeling quality. Experts may test several encoding schemes to gain more performance.
Benefits for the Integration specialist and IT: KTC is automatically integrated within the KXEN Analytic Framework when a textual field is declared in the data source. There is no additional integration work to be done for internally parsing long texts or blobs to include them in the analytical process.