Question

Competitive Advantage: See How Your Competitors Stack Up

Just as customers share information when they tweet, your competitors expose information about themselves when they report to public databases. In several cases the US government has mandated that companies report information. That data is then published and made available to all.

Zencos worked with a medical device manufacturer using FDA reports as a source. This data set contains several hundred thousand medical device reports of suspected device-associated deaths, acute injuries, and malfunctions. Our team was able to use the database to match the companies to their products. Then we used the text fields to understand the main issues with some of the company’s devices. Some detective work was required, but the result was fruitful.

A deep dive into the text data revealed that the customer had a much higher placement success rate than one of their leading competitors.

Using topic cluster analysis, we demonstrated the common causes of failed device installations by our client’s competitor that led to many patient deaths. Text clustering and sentiment analysis allowed us to find common problems with very adverse outcomes for many devices. Products such as SAS Visual Text Analytics contain sophisticated text mining algorithms.

Text analysis provides valuable insights into the customer’s own malfunctioning devices and allows comparison of the device’s performance with others in the marketplace. It may not have been apparent before the analysis, but now you have a glimpse into market penetration.

Plus you understand the common issues your competitors have.

Other public data, such as that in the Consumer Financial Protection Bureau, reports on what consumers find annoying about financial institutions. Yes, you can retrieve that text to see your rankings – but what a great way to spy on your competitors.

Questions

1. Clustering techniques can be used to identify common problems with different devices. According to the given scenario, please discuss why clustering method is more suitable than decision tree method. [4 marks]

2. Text analysis relies on text mining technology. Within the context of this scenario, please discuss how you will prepare the corpus and set up the term-by-document matrix, in order to achieve the result as shown in the figure. [6 marks]

Solutions

Expert Solution

i. To understand why clustering is more useful here than a decision tree, we first need to look at what clustering techniques are and why we use them.

In clustering, we divide the data points into groups such that the points in the same group share similar traits. The key point is that clustering is an unsupervised technique, i.e. it needs no output labels. In our context we need to identify device-related problems: we want to group the devices with similar traits, such as death rate and failed installation rate, into one group and the rest into others. The scenario gives us free-text reports rather than a predefined list of problem categories, so this is a typical pattern-finding task, which is what clustering is designed for.

A decision tree is less suitable because it is a supervised method: it predicts a known outcome from preconditions we already have. With a decision tree we could estimate the probability of a successful or failed device installation, but not the common problems behind the failures. To find those, we first need to group the devices that cause problems and then extract what they have in common, which a decision tree cannot do.
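
To make the "no output labels" point concrete, here is a minimal sketch of unsupervised grouping. It is only an illustration: the four report snippets are made up, and the greedy "leader" clustering with a cosine-similarity threshold is a stand-in chosen for brevity, not the topic clustering used in the scenario.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_cluster(docs, threshold=0.4):
    """Greedy clustering: each document joins the first cluster whose
    leader (first member) is similar enough, else it starts a new one.
    No labels are supplied anywhere."""
    clusters = []  # list of (leader_vector, [document indices])
    for index, doc in enumerate(docs):
        vector = Counter(doc.lower().split())
        for leader, members in clusters:
            if cosine(vector, leader) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]


# Hypothetical report texts, invented for this example.
reports = [
    "lead fractured during placement",
    "battery depleted early",
    "lead fractured after placement",
    "battery depleted after one year",
]

print(leader_cluster(reports))  # [[0, 2], [1, 3]]
```

The groups emerge from word overlap alone. A decision tree would instead need a labeled column (for example, "placement failed: yes/no") to train on.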

ii. Since no figure is provided, we will discuss the general process of preparing the corpus and setting up the term-by-document matrix. There are three main steps in preparing the corpus, described below.

1 - Tokenization

Tokenization is a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized. Tokenization is also referred to as text segmentation or lexical analysis. Sometimes segmentation is used to refer to the breakdown of a large chunk of text into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved for the breakdown process which results exclusively in words.

This may sound like a straightforward process, but it is anything but. How are sentences identified within larger bodies of text? Off the top of your head you probably say "sentence-ending punctuation," and may even, just for a second, think that such a statement is unambiguous.

Sure, this sentence is easily identified with some basic segmentation rules:

The quick brown fox jumps over the lazy dog.

But what about this one:

Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog.

Or this one:

"What is all the fuss about?" asked Mr. Peters.

And that's just sentences. What about words? Easy, right? Right?

This full-time student isn't living in on-campus housing, and she's not wanting to visit Hawai'i.

It should be intuitive that there are varying strategies not only for identifying segment boundaries, but also what to do when boundaries are reached. For example, we might employ a segmentation strategy which (correctly) identifies a particular boundary between word tokens as the apostrophe in the word she's (a strategy tokenizing on whitespace alone would not be sufficient to recognize this). But we could then choose between competing strategies such as keeping the punctuation with one part of the word, or discarding it altogether. One of these approaches just seems correct, and does not seem to pose a real problem. But just think of all the other special cases in just the English language we would have to take into account.
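
As a rough sketch of how the choice of strategy changes the result, the snippet below compares whitespace splitting with a regular expression that keeps word-internal apostrophes and hyphens. The pattern and function names are assumptions made for this example, and the pattern only covers ASCII letters.

```python
import re

# A word is a run of letters, optionally continued by an apostrophe or
# hyphen followed by more letters (she's, on-campus, Hawai'i).
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()


def tokenize_regex(text):
    """Keep word-internal apostrophes and hyphens, drop other punctuation."""
    return TOKEN_PATTERN.findall(text)


sentence = ("This full-time student isn't living in on-campus housing, "
            "and she's not wanting to visit Hawai'i.")

print(tokenize_whitespace(sentence))
print(tokenize_regex(sentence))
```

Whitespace splitting leaves tokens such as "housing," and "Hawai'i." with trailing punctuation, while the regular expression returns "housing" and "Hawai'i". Neither handles every special case; a strategy that splits "she's" into two tokens would need further rules.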

2 - Normalization

Before further processing, text needs to be normalized. Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.

Normalizing text can mean performing a number of tasks, but for our framework we will approach normalization in 3 distinct steps: (1) stemming, (2) lemmatization, and (3) everything else.

Stemming

Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.

running → run

Lemmatization

Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma.

For example, stemming the word "better" would fail to return its citation form (another word for lemma); however, lemmatization would result in the following:

better → good

It should be easy to see why the implementation of a stemmer would be the less difficult feat of the two.
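
The difference can be sketched with a deliberately crude example. The suffix list and the three-entry lemma dictionary below are toy assumptions; real stemmers and lemmatizers rely on far larger rule sets and dictionaries.

```python
SUFFIXES = ("ing", "ed", "s")  # toy suffix list, checked in this order

# Toy lemma dictionary; a real lemmatizer uses a full lexicon.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse"}


def naive_stem(word):
    """Strip one known suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # runn -> run; leave ll, ss, zz alone (fall, miss, buzz)
            if (len(stem) >= 2 and stem[-1] == stem[-2]
                    and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem
    return word


def naive_lemmatize(word):
    """Look the word up in the dictionary, falling back to the stemmer."""
    return LEMMAS.get(word, naive_stem(word))


print(naive_stem("running"))      # run
print(naive_stem("better"))       # better (no rule applies)
print(naive_lemmatize("better"))  # good
```

The stemmer is a handful of string rules, while the lemmatizer only works for words someone has entered into its dictionary, which is why the stemmer is the easier of the two to build.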

Everything else

Stemming and lemmatization are major parts of a text preprocessing endeavor, and as such they need to be treated with the respect they deserve. These aren't simple text manipulation; they rely on detailed and nuanced understanding of grammatical rules and norms.

There are, however, numerous other steps that can be taken to help put all text on equal footing, many of which involve the comparatively simple ideas of substitution or removal. They are, however, no less important to the overall process. These include:

  • set all characters to lowercase
  • remove numbers (or convert numbers to textual representations)
  • remove punctuation (generally part of tokenization, but still worth keeping in mind at this stage, even as confirmation)
  • strip white space (also generally part of tokenization)
  • remove default stop words (general English stop words)

Stop words are those words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content. As a simple example, the following pangram is just as legible if the stop words are removed:

The quick brown fox jumps over the lazy dog. → quick brown fox jumps lazy dog.

  • remove given (task-specific) stop words
  • remove sparse terms (not always necessary or helpful, though!)

At this point, it should be clear that text preprocessing relies heavily on pre-built dictionaries, databases, and rules.
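
The substitution and removal steps listed above can be sketched in a few lines. The stop word set here is a tiny illustrative assumption, not a standard list, and the `extra_stop_words` parameter is a hypothetical hook for task-specific terms.

```python
import re
import string

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "over", "of", "to", "in"}


def normalize(text, extra_stop_words=()):
    """Lowercase, drop numbers and punctuation, strip extra white space,
    then remove default and task-specific stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # split() also collapses white space
    stop = STOP_WORDS | set(extra_stop_words)
    return [token for token in tokens if token not in stop]


print(normalize("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

print(normalize("The device failed 3 times.", extra_stop_words={"device"}))
# ['failed', 'times']
```

Note that removing punctuation this way also turns "isn't" into "isnt", so the order of tokenization and normalization steps matters.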


3 - Noise Removal

Noise removal continues the substitution tasks of the framework. While the first 2 major steps of our framework (tokenization and normalization) were generally applicable as-is to nearly any text chunk or project (barring the decision of which exact implementation was to be employed, or skipping certain optional steps, such as sparse term removal, which simply does not apply to every project), noise removal is a much more task-specific section of the framework.

We have to keep in mind again that we are not dealing with a linear process, the steps of which must exclusively be applied in a specified order. Noise removal, therefore, can occur before or after the previously outlined sections, or at some point between.

How about something more concrete? Let us assume that we obtained a corpus from the world wide web, and that it is housed in a raw web format. We can then assume that our text is very likely wrapped in HTML or XML tags. While this accounting for metadata can take place as part of the text collection or assembly process (step 1 of our textual data task framework), it depends on how the data was acquired and assembled.

The good thing is that pattern matching can help ease the task, as can existing software tools built to deal with just such pattern matching tasks.

  • remove text file headers, footers
  • remove HTML, XML, etc. markup and metadata
  • extract valuable data from other formats, such as JSON, or from within databases
  • if you fear regular expressions, this could potentially be the part of text preprocessing in which your worst fears are realized
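
A minimal sketch of two of the tasks above, using only the Python standard library. The sample strings and the `narrative` field name are made up for this example; a real data source will have its own structure.

```python
import json
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(raw):
    """Return the visible text of an HTML fragment, white space collapsed."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def text_from_json(raw, field):
    """Pull one text field out of a JSON record."""
    return json.loads(raw)[field]


print(strip_markup("<p>Device <b>failed</b> during placement.</p>"))
# Device failed during placement.

# "narrative" is a hypothetical field name.
print(text_from_json('{"id": 1, "narrative": "Lead fractured."}', "narrative"))
# Lead fractured.
```

Joining the chunks with a space keeps text from adjacent elements apart, at the cost of splitting a word that has a tag in the middle of it.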

As you can see, the boundary between noise removal and data collection and assembly is a fuzzy one, and as such some noise removal must take place before other preprocessing steps. For example, any text required from a JSON structure would obviously need to be extracted prior to tokenization.
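
Once the corpus has been cleaned, the term-by-document matrix that the question asks about can be set up by counting how often each term occurs in each document. The sketch below assumes the documents are already preprocessed strings; the three sample documents are invented, and raw counts are used where a real project might apply a weighting scheme.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix): one row per term, one column per
    document, each cell holding the raw count of the term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[count[term] for count in counts] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical, already-preprocessed documents.
docs = ["lead fractured lead", "battery depleted", "lead dislodged"]

vocabulary, matrix = term_document_matrix(docs)
for term, row in zip(vocabulary, matrix):
    print(f"{term:<10} {row}")
```

Here the row for "lead" is [2, 0, 1]: the term appears twice in the first document, not at all in the second, and once in the third. Clustering can then operate on the columns of this matrix.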

