Thursday 18 December 2014

Data and Text Mining - sentiment_sentences dataset

Competency 8.1

Using LightSIDE, the sentiment_sentences.csv dataset was loaded, a unigram feature space was extracted on the Feature Extraction panel, and the experiment was run with logistic regression and 10-fold cross-validation (a scikit-learn sketch follows the metrics below).

The experiment is aimed at counting positive and negative words in each review from the sentiment_sentences dataset to determine whether the review as a whole is a good or bad one.


The model evaluation metrics are shown below:

Accuracy = 75.9%

Kappa = .52
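The same baseline can be sketched outside LightSIDE with scikit-learn. This is a minimal sketch, not LightSIDE's exact pipeline: the column names ("text", "class") and the binary unigram setting are assumptions about how sentiment_sentences.csv is laid out.

# Binary unigram features + logistic regression, evaluated with
# 10-fold cross-validation, mirroring the LightSIDE experiment above.
# The column names "text" and "class" are assumptions about the CSV layout.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

df = pd.read_csv("sentiment_sentences.csv")
X = CountVectorizer(binary=True).fit_transform(df["text"])  # unigram features
y = df["class"]                                             # positive/negative label

pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)
print("Accuracy:", accuracy_score(y, pred))
print("Kappa:   ", cohen_kappa_score(y, pred))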


Competency 8.2

To better leverage the positive and negative words in the reviews, we added more capabilities to the basic feature extractor, including bigrams and trigrams in the model alongside the unigrams (see the vectorizer sketch after the metrics). The evaluation metrics below are slightly better than those of the unigram-only baseline model shown above.

Accuracy = 76.6%
Kappa = .53
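In the scikit-learn sketch above, adding bigrams and trigrams is a single vectorizer argument; this is a rough analogue of ticking the bigram and trigram boxes in LightSIDE, not its exact feature extractor.

# Unigrams + bigrams + trigrams in place of the unigram-only vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(binary=True, ngram_range=(1, 3))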

Competency 8.3
When we set the number of features to 3500, we got an accuracy of 76.9% and a kappa of .54.
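A rough scikit-learn analogue of capping the feature space is shown below; LightSIDE's own feature-selection strategy may differ, as max_features simply keeps the most frequent n-grams.

# Keep only the 3500 most frequent n-grams
vec = CountVectorizer(binary=True, ngram_range=(1, 3), max_features=3500)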


Competency 8.5

Using another text-category dataset (Movie Reviews.csv) and configuring basic features such as unigrams, bigrams, trigrams, and punctuation, we got an accuracy of 76.3% and a kappa of .45.
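One way to approximate the punctuation features in the same scikit-learn sketch is a token pattern that also matches punctuation marks; this is an assumption about what LightSIDE's punctuation option does, not its documented behavior.

# Keep words AND individual punctuation marks as tokens,
# approximating LightSIDE's "punctuation" basic feature.
vec = CountVectorizer(binary=True, ngram_range=(1, 3),
                      token_pattern=r"\w+|[!?.,;:]")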

Wednesday 10 December 2014

Data and Text Mining - collaborative learning process analysis

Week 7: 

Competency 7.1: Describe prominent areas of text mining.

Unstructured text mining is an area that is seeing a sudden spurt in adoption for business applications, triggered by heightened awareness of text mining and the reduced price points at which text mining tools are available today. Text mining is being applied to answer business questions, to optimize day-to-day operational efficiency, and to improve long-term strategic decisions. The objective of this article is to demystify the text mining process and examine its ROI by exploring practical real-world instances where text mining has been successfully applied in three industries:

1.     Automotive industry (warranty management)
2.     Health care industry
3.     Credit card industry

Text Mining in the Automotive Industry

It has been estimated that warranties cost automotive companies more than $35 billion in the U.S. annually. In such a tough environment, it is imperative that auto companies explore every opportunity for reducing costs, and optimizing warranty cost is a very important lever in the cost equation for automobile manufacturers: even a marginal improvement in warranty spend can have a multiplier effect on the overall bottom line. One of the most underutilized inputs for optimizing warranty cost is service technicians' comments. From those comments, the text mining process can surface nuggets of insight about component defects, yielding interventions that prevent them in the future.

Text Mining in the Healthcare Industry

Most countries typically spend between 3% and 10% of their GDP on healthcare. The healthcare industry is a huge spender on technology and, with the proliferation of hospital management systems and low-cost devices for logging patient statistics, there has been a sudden increase in the breadth and depth of patient data. Mining doctors' diagnosis transcripts can yield information that benefits the healthcare industry in numerous ways, such as:
1.  Isolating the top 10 diseases by keyword frequency per region, and leveraging the findings to optimize the mix of tablets/medicines stocked on limited outlet shelves as the frequency of disease-related keywords changes.
2.  Based on doctors' comments, an early warning system can be woven into the text mining outputs to detect sudden changes in "chatter" from doctors regarding specific diseases. For example, if the frequency of the keyword lungs or breathing exceeds 45 appearances in the last 30 days for a given ZIP code or region, it can be a clue to environmental conditions that are causing respiratory problems, and a proactive intervention can be activated to remedy the situation (a sketch of this rule follows).
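The early-warning rule in item 2 can be sketched in a few lines of pandas. The keyword list, 45-mention threshold, and 30-day window come from the example above; the DataFrame layout (date, zip_code, transcript columns, with date parsed as a datetime) is an assumption for illustration.

# Flag regions where respiratory keywords exceed 45 mentions in the
# last 30 days of doctors' transcripts.
import pandas as pd

KEYWORDS = ("lungs", "breathing")
THRESHOLD = 45

def respiratory_alerts(df: pd.DataFrame) -> pd.Series:
    recent = df[df["date"] >= df["date"].max() - pd.Timedelta(days=30)]
    counts = recent.groupby("zip_code")["transcript"].apply(
        lambda texts: sum(t.lower().count(kw) for t in texts for kw in KEYWORDS))
    return counts[counts > THRESHOLD]  # regions needing proactive intervention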
The components of such a successful text mining solution can be found in Figure 1 below.

[Figure 1: components of a text mining solution]

Text Mining in the Credit Card Industry

With the proliferation of credit cards, companies need to do the difficult balancing act of identifying which card features (i.e., line of credit, billing cycle, outlet points and coverage) are resonating with customers and, at the same time, minimize the number of defaults/recovery related interventions. Text mining can help optimize both the collection process as well as the customer experience optimization process.

1.  A top ten complaint keyword watch list can be generated by mining the inbound customer service rep (CSR) call transcripts on a daily basis, from which you can filter out keywords expressed by high-value customers. For example, if the keyword billing error occurs for customers with a credit limit over $200,000, relationship managers can call those customers and put interventions into the billing process to help prevent recurrence.

2.  Text mining can also be used to rate call center staff performance. As an example, a large credit card company in the U.S. had about 600 call center reps receiving inbound calls. Every rep was expected to enter verbose comments recording the nature of each call, but not all of them entered detailed text: on one end of the spectrum were representatives entering an average of 5 to 6 lines, while on the other were a few who entered just 3 to 5 words. Where only sparse text was recorded, the organization was missing out on valuable intelligence. A text mining process was built that reported keyword frequency counts by call center representative, and the bottom decile had to undergo additional training to ensure they entered the detailed text that is so valuable to the credit card company. See Figure 2 below.

[Figure 2: keyword frequency counts by call center representative]

In a diverse set of industries ranging from credit cards to auto to healthcare and beyond, the text mining process is slowly being adopted to mine gigabytes of unstructured data. In this tough economic environment, as the pressure to optimize the efficiency of business processes increases, using unstructured text mining techniques on previously ignored data such as comments from technicians, doctors and call center representatives can provide competitive differentiation. This competitive advantage can be in terms of optimizing internal business processes and managing external customer-facing experiences which, in turn, can have a multiplier effect on the overall bottom line. As Marcel Proust said, "The real voyage of discovery consists not in seeking new landscapes, but in having new eyes." Unstructured data has always been lying around, but never "discovered." All it takes is "new eyes" within the organization to look at the same unstructured data and gain new bottom-line-impacting insights.


Competency 7.2: Detail subareas of text mining such as collaborative learning process analysis.

Data and Text Mining - overview

DM/TM is a technique that consists of applying data analysis and discovery algorithms that, within acceptable computational efficiency limits, produce a particular enumeration of patterns (or models) over the data (Fayyad et al., 1996). Data mining searches for patterns in a data set using methods such as neural networks, symbolic machine learning algorithms, and probabilistic reasoning. In the field of symbolic algorithms, a characteristic approach is to incorporate background knowledge by adding labeled examples to an otherwise unlabeled data set, so that a learner can later be applied to the unlabeled data. There is no predefined number of labeled examples that must be inserted into the database; however, the more labeled examples a database contains, the easier and more accurate the learning becomes. Semi-supervised learning was chosen because of its flexibility and accuracy in using the incorporated knowledge (the ideal state), represented by the labeled examples in the data set, to classify students' performance in the collaborative process, represented by the unlabeled examples. For each classification made, it is possible to know its accuracy level and the patterns used to arrive at the value. Another reason is the ability to work with an undetermined number of examples, although a minimum quantity of data must still be provided.
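A minimal illustration of this semi-supervised setup with scikit-learn follows. Self-training around a logistic regression is just one of several semi-supervised methods, and the data here is synthetic rather than the collaborative learning data the paragraph describes.

# Semi-supervised classification: a few labeled examples guide learning
# on the unlabeled majority. Unlabeled points are marked with -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
hidden = rng.random(len(y)) < 0.9   # hide 90% of the labels
y_partial = y.copy()
y_partial[hidden] = -1              # -1 means "unlabeled" to scikit-learn

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print("Accuracy on the hidden labels:",
      accuracy_score(y[hidden], model.predict(X[hidden])))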

Competency 7.3: Use tools such as LightSIDE in a very simple way to run a text classification experiment.

Training and evaluating a predictive model on the newsgroup topic dataset
The evaluation was configured to use 20 folds in the cross-validation (a rough scikit-learn analogue follows the metrics below).

Evaluation metric:

Accuracy = 0.5796 ≈ 57.9%
Kappa = 0.4414 ≈ .44
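The sketch below uses the public 20 Newsgroups corpus as a stand-in for LightSIDE's bundled newsgroup dataset, so the numbers will not match the ones above exactly; the binary-unigram and logistic regression choices mirror the earlier experiments.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = CountVectorizer(binary=True).fit_transform(data.data)
# 20-fold cross-validation, as configured in the LightSIDE evaluation
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, data.target, cv=20)
print("Accuracy:", accuracy_score(data.target, pred))
print("Kappa:   ", cohen_kappa_score(data.target, pred))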

Competency 7.4: Describe how models might be used in Learning Analytics research, specifically for the problem of assessing some reasons for attrition along the way in MOOCs.

This endeavor (text mining for collaborative learning process analysis) holds the potential to substantially improve online instruction, both by providing teachers and facilitators with reports about the groups they are moderating and by triggering context-sensitive collaborative learning support on an as-needed basis.


Monday 1 December 2014

Data, Analytics, and Learning

Competency 6.1: Feature Engineering

Feature engineering is the art of creating predictor variables, and it is the least well-studied part of the process of developing prediction models. The key point is that a model will never be good if its features aren't any good.
The feature engineering process involves:

1. Brainstorming features
2. Deciding what features to create
3. Creating the features
4. Studying the impact of features on model goodness (sketched below)
5. Iterating on features if useful
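Step 4 can be made concrete with a small sketch: compare cross-validated accuracy with and without a candidate feature. The helper below is hypothetical, and logistic regression with accuracy is just one choice of model and metric; it assumes dense numeric feature arrays.

# Judge a candidate feature by the change in cross-validated accuracy
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feature_impact(X_base, x_new, y, cv=10):
    """Return (baseline accuracy, accuracy with the candidate feature)."""
    model = LogisticRegression(max_iter=1000)
    base = cross_val_score(model, X_base, y, cv=cv).mean()
    X_ext = np.hstack([X_base, np.asarray(x_new).reshape(-1, 1)])
    extended = cross_val_score(model, X_ext, y, cv=cv).mean()
    return base, extended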

Competency 6.2: Diagnostic Metrics  

There are various diagnostic metrics available. One is the ROC curve, which stands for Receiver Operating Characteristic curve. With ROC, one can evaluate predictions of something that has two values, such as:
1. Correct/incorrect
2. Gaming the system/not gaming the system
3. Student drops out/does not drop out
With ROC, the prediction model can output a probability or any other real value, and the curve shows how well that value separates the two classes.
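As a minimal sketch (on synthetic data for illustration), an ROC curve is computed by sweeping a decision threshold over the model's predicted probabilities:

# ROC analysis: trace the true-positive vs. false-positive trade-off
# as the decision threshold over predicted probabilities varies.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)
print("Area under the ROC curve:", roc_auc_score(y_te, probs))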