Competitive Advantage: See How Your Competitors Stack Up
Just as customers share information when they tweet, your competitors expose information about themselves when they report to public databases. In several instances the US government has mandated such reporting, and the resulting data is published and made available to all.
Zencos worked with a medical device manufacturer using FDA reports as a source. This data set contains several hundred thousand medical device reports of suspected device-associated deaths, acute injuries, and malfunctions. Our team was able to use the database to match the companies to their products. Then we used the text fields to understand the main issues with some of the company’s devices. Some detective work was required, but the result was fruitful.
A deep dive into the text data revealed that the customer had a much higher placement success rate than one of their leading competitors.
Using topic cluster analysis, we demonstrated the common causes of the failed device installations by our client's competitor that led to many patient deaths. Text clustering and sentiment analysis allowed us to find common problems with very adverse outcomes across many devices. Products such as SAS Visual Text Analytics contain sophisticated text mining algorithms.
Text analysis provides valuable insights into the customer’s own malfunctioning devices and allows comparison of their performance with others in the marketplace. It may not have been apparent before the analysis, but now you have a glimpse into market penetration.
Plus you understand the common issues your competitors have.
Other public data, such as that held by the Consumer Financial Protection Bureau, reports on what consumers find annoying about financial institutions. Yes, you can retrieve that text to see your own rankings – but what a great way to spy on your competitors.
Questions
1. Clustering techniques can be used to identify common problems with different devices. According to the given scenario, please discuss why a clustering method is more suitable than a decision tree method. [4 marks]
2. Text analysis relies on text mining technology. Within the context of this scenario, please discuss how you will prepare the corpus and set up the term-by-document matrix, in order to achieve the result as shown in the figure. [6 marks]
To understand why a clustering technique is more useful in this context than a decision tree, we first need to look at what clustering techniques are and why we use them.
In clustering we try to divide the data points into groups such that the points in the same group share similar traits; every clustering algorithm focuses on grouping data with similar traits together. The key point is that clustering is an unsupervised technique, i.e. it requires no output labels. In our context we need to identify device-related problems: we need to group devices with similar traits, such as death rate and failed-installation rate, into one group and the rest into others. This is a typical pattern-finding problem, which is exactly what clustering is designed for.
Decision trees are not well suited to this context because they are supervised models that predict an outcome from preconditions (labels) we already know. Even if we used a decision tree, we could estimate the probability of a successful or failed device installation, but we could not discover the common problems behind the failures. To find the common problems we first need to group all the problematic devices together and then extract the shared issues from each group, which is not something a decision tree does.
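To make the contrast concrete, here is a minimal sketch of how text clustering might group device reports, assuming Python with scikit-learn; the report narratives are invented purely for illustration and are not taken from the FDA data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented device-report narratives, for illustration only.
reports = [
    "device failed during installation, patient death reported",
    "installation failed, lead could not be positioned, patient died",
    "battery depleted earlier than expected, device replaced",
    "premature battery depletion, elective replacement performed",
]

# Represent each report as a TF-IDF vector, then group similar reports.
# No output labels are supplied anywhere: the groups are discovered, not taught.
X = TfidfVectorizer(stop_words="english").fit_transform(reports)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)  # e.g. [0 0 1 1]: failed-installation reports vs. battery issues
```

Reading the most heavily weighted terms in each cluster is what surfaces the "common problems" the scenario describes; a decision tree would instead require us to supply an outcome label up front.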
2. Since no figure is provided, we will discuss the general process of preparing the corpus and setting up the term-by-document matrix. There are three main steps in setting up the corpus, described below.
1 - Tokenization
Tokenization is a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized. Tokenization is also referred to as text segmentation or lexical analysis. Sometimes segmentation is used to refer to the breakdown of a large chunk of text into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved for the breakdown process which results exclusively in words.
This may sound like a straightforward process, but it is anything but. How are sentences identified within larger bodies of text? Off the top of your head you probably say "sentence-ending punctuation," and may even, just for a second, think that such a statement is unambiguous.
Sure, this sentence is easily identified with some basic segmentation rules:
The quick brown fox jumps over the lazy dog.
But what about this one:
Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog.
Or this one:
"What is all the fuss about?" asked Mr. Peters.
And that's just sentences. What about words? Easy, right? Right?
This full-time student isn't living in on-campus housing, and she's not wanting to visit Hawai'i.
It should be intuitive that there are varying strategies not only for identifying segment boundaries, but also what to do when boundaries are reached. For example, we might employ a segmentation strategy which (correctly) identifies a particular boundary between word tokens as the apostrophe in the word she's (a strategy tokenizing on whitespace alone would not be sufficient to recognize this). But we could then choose between competing strategies such as keeping the punctuation with one part of the word, or discarding it altogether. One of these approaches just seems correct, and does not seem to pose a real problem. But just think of all the other special cases in just the English language we would have to take into account.
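As a quick illustration of these pitfalls, the sketch below runs the example sentences through NLTK's off-the-shelf tokenizers. The library choice is ours; NLTK's pre-trained Punkt model generally copes with common abbreviations such as "Dr." and "Mr.", though no tokenizer is perfect.

```python
import nltk
nltk.download("punkt", quiet=True)  # pre-trained Punkt sentence-boundary model
from nltk.tokenize import sent_tokenize, word_tokenize

text = ('Dr. Ford did not ask Col. Mustard the name of Mr. Smith\'s dog. '
        '"What is all the fuss about?" asked Mr. Peters.')

print(sent_tokenize(text))   # ideally two sentences; Punkt knows many common abbreviations
print(word_tokenize("This full-time student isn't living in on-campus housing."))
# word_tokenize typically splits clitics ("isn't" -> "is", "n't") but keeps
# hyphenated forms such as "full-time" as single tokens.
```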
2 - Normalization
Before further processing, text needs to be normalized. Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.
Normalizing text can mean performing a number of tasks, but for our framework we will approach normalization in 3 distinct steps: (1) stemming, (2) lemmatization, and (3) everything else.
Stemming
Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.
running → run
Lemmatization
Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma.
For example, stemming the word "better" would fail to return its citation form (another word for lemma); however, lemmatization would result in the following:
better → good
It should be easy to see why the implementation of a stemmer would be the less difficult feat of the two.
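A minimal sketch of the difference, again assuming NLTK; note that the WordNet lemmatizer has to be told the part of speech in order to recover "good" from "better".

```python
import nltk
nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # -> run
print(stemmer.stem("better"))                   # -> better (no suffix rule applies)
print(lemmatizer.lemmatize("better", pos="a"))  # -> good (adjective lemma)
```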
Everything else
Stemming and lemmatization are major parts of a text preprocessing endeavor, and as such they need to be treated with the respect they deserve. These aren't simple text manipulations; they rely on a detailed and nuanced understanding of grammatical rules and norms.
There are, however, numerous other steps that can be taken to help put all text on equal footing, many of which involve the comparatively simple ideas of substitution or removal. They are, however, no less important to the overall process. These include:
Stop words are those words which are filtered out before further processing of text, since they contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content. As a simple example, the following pangram is just as legible if the stop words are removed:
The quick brown fox jumps over the lazy dog. → quick brown fox jumps over lazy dog
At this point, it should be clear that text preprocessing relies heavily on pre-built dictionaries, databases, and rules.
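For completeness, here is a small sketch of those simpler levelling tasks (lowercasing, dropping punctuation, and stop word removal) using NLTK's pre-built English stop word list. The normalize helper is our own name, and NLTK's list is broader than the minimal example above, so it also drops "over".

```python
import string
import nltk
nltk.download("stopwords", quiet=True)  # pre-built English stop word list
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def normalize(tokens):
    """Lowercase each token, then drop punctuation-only tokens and stop words."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        if tok in string.punctuation or tok in STOP_WORDS:
            continue
        cleaned.append(tok)
    return cleaned

tokens = "The quick brown fox jumps over the lazy dog .".split()
print(normalize(tokens))  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```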
3 - Noise Removal
Noise removal continues the substitution tasks of the framework. While the first 2 major steps of our framework (tokenization and normalization) were generally applicable as-is to nearly any text chunk or project (barring the decision of which exact implementation was to be employed, or skipping certain optional steps, such as sparse term removal, which simply does not apply to every project), noise removal is a much more task-specific section of the framework.
We have to keep in mind, again, that we are not dealing with a linear process whose steps must be applied in one fixed order. Noise removal, therefore, can occur before or after the previously outlined steps, or at some point in between.
How about something more concrete? Let us assume that we obtained a corpus from the world wide web and that it is housed in a raw web format. We can then safely assume that our text is wrapped in HTML or XML tags. While accounting for this metadata can take place as part of the text collection or assembly process (step 1 of our textual data task framework), it depends on how the data was acquired and assembled.
The good thing is that pattern matching can help ease the task, as can existing software tools built to deal with just such pattern matching tasks.
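A hedged sketch of that idea, assuming the raw pages are HTML and that BeautifulSoup is available; a regular expression is shown as a rougher fallback, and the sample markup is invented.

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

raw_page = "<html><body><h1>Device report</h1><p>Lead migration observed after implant.</p></body></html>"

# Preferred: let an HTML parser recover the visible text (handles nesting and entities).
clean = BeautifulSoup(raw_page, "html.parser").get_text(separator=" ")

# Rough fallback: strip anything that looks like a tag, then collapse whitespace.
clean_regex = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", raw_page)).strip()

print(clean_regex)  # Device report Lead migration observed after implant.
```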
As you can see, the boundary between noise removal and data collection and assembly is a fuzzy one, and as such some noise removal must take place before other preprocessing steps. For example, any text acquired from a JSON structure would obviously need to be extracted prior to tokenization.
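Finally, once the corpus has been tokenized, normalized, and de-noised, the term-by-document matrix the question refers to can be assembled: one axis per report, one per surviving term, with counts (or TF-IDF weights) as entries. A minimal sketch using scikit-learn, with invented report narratives:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented report narratives, purely for illustration.
reports = [
    "lead fractured during implant procedure",
    "battery depleted early and device replaced",
    "lead dislodged after implant and patient readmitted",
]

# scikit-learn produces a documents x terms matrix of raw counts;
# transpose (tdm.T) for a terms x documents view, or swap in
# TfidfVectorizer to weight terms instead of counting them.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
tdm = vectorizer.fit_transform(reports)

print(vectorizer.get_feature_names_out())
print(tdm.toarray())
```

It is this matrix that the clustering sketched in the answer to question 1 actually operates on.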