Instead of just giving you the
answer, I would like to walk you through the logic of what works
and what does not based on my experience.
Let's say that you are faced with this exact problem. How about you
get cracking on the data? Get those who bought product A. Easy
enough. Now, see what other products they got. Easy, again. Now is
the hard part - how do you interpret these results? Let's suppose B
was number one on that list. Great! Correlation found! Or was it?
Was B the most popular product in your store outside of A? Would
you expect any customer to have B up there in the ranks, not just
those who bought A?
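The first two steps above (find the A buyers, then count what else they bought) can be sketched in a few lines. The toy `purchases` list and the function name `co_purchase_ranking` are made up for illustration; a real transaction table would replace them.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one (customer_id, product) row per item bought.
purchases = [
    ("c1", "A"), ("c1", "B"), ("c1", "C"),
    ("c2", "A"), ("c2", "B"),
    ("c3", "B"),
    ("c4", "C"),
    ("c5", "A"),
    ("c6", "C"),
]

def co_purchase_ranking(rows, anchor):
    """Rank the other products bought by customers who bought `anchor`."""
    baskets = defaultdict(set)
    for customer, product in rows:
        baskets[customer].add(product)

    counts = Counter()
    for items in baskets.values():
        if anchor in items:
            # Count each other product once per customer.
            counts.update(items - {anchor})
    return counts.most_common()

ranking = co_purchase_ranking(purchases, "A")
```

This only produces the raw list. It says nothing yet about whether B at the top is surprising.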
In other words, you need to create a baseline "expected" list of
products so you can compare those who bought A to that list and see
if they are more likely to buy B than just a random customer. But
before you do that, I believe I know what answer you are going to
get. The answer is yes, they are more likely to buy B than a random
customer. How do I know that? Experience.
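Here is a minimal sketch of that naive baseline comparison: the share of A buyers who also bought B, divided by the share of all customers who bought B. The data and the helper names are invented for the example, and this is the unadjusted comparison, before any correction for how much each customer buys.

```python
from collections import defaultdict

# Hypothetical toy data: one (customer_id, product) row per item bought.
purchases = [
    ("c1", "A"), ("c1", "B"), ("c1", "C"),
    ("c2", "A"), ("c2", "B"),
    ("c3", "B"),
    ("c4", "C"),
    ("c5", "A"),
    ("c6", "C"),
]

def baskets_by_customer(rows):
    """Collect the set of products each customer ever bought."""
    baskets = defaultdict(set)
    for customer, product in rows:
        baskets[customer].add(product)
    return baskets

def share_buying(baskets, product, customers):
    """Fraction of the given customers who bought `product`."""
    customers = list(customers)
    if not customers:
        return 0.0
    return sum(product in baskets[c] for c in customers) / len(customers)

baskets = baskets_by_customer(purchases)
a_buyers = [c for c, items in baskets.items() if "A" in items]

p_b_given_a = share_buying(baskets, "B", a_buyers)        # B among A buyers
p_b_overall = share_buying(baskets, "B", baskets.keys())  # B among everyone
naive_lift = p_b_given_a / p_b_overall                    # > 1 looks like a "correlation"
```

A ratio above 1 is what the naive analysis would report as "A buyers are more likely to buy B".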
We have to talk about things not directly related to your immediate
question. Let's start with the general understanding of the
database of transactions. What you will have is a bunch of
transactions by different customers, and some of them will have
many transactions, while the majority will have very few or just
one transaction. Wait, how does that relate to the problem at hand,
am I just wasting your time? No. Let's assume customers are buying
products in a completely randomized fashion, much like running a
random number generator every time they make a transaction. Now,
answer a simple question: who is more likely to have bought both A
and B, a customer with one transaction or a customer with 10
transactions? Of course, a customer with more transactions has
bought more products and so is likely to have bought from more
categories. Thus your high-transaction customers will be
over-represented in every category you analyze.
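To put a number on the random-generator thought experiment, here is a small calculation. It rests on a simplifying assumption that is mine, not something in the data: each transaction is a single product drawn uniformly and independently from `n_products` products. Under that assumption, the chance that a customer has bought both A and B follows from inclusion-exclusion.

```python
def prob_both(n_transactions, n_products):
    """P(customer has bought both A and B) under purely random buying.

    Assumes one product per transaction, drawn uniformly and
    independently. Inclusion-exclusion:
    1 - P(no A) - P(no B) + P(neither A nor B).
    """
    miss_one = (1 - 1 / n_products) ** n_transactions
    miss_both = (1 - 2 / n_products) ** n_transactions
    return 1 - 2 * miss_one + miss_both

# Compare a one-transaction customer with a ten-transaction customer
# in a hypothetical 20-product store.
p_one = prob_both(1, 20)    # cannot hold two products after one purchase
p_ten = prob_both(10, 20)
```

With a single transaction the probability is zero, and it only grows as the number of transactions goes up, with no real preference for A or B involved.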
Here is the crucial question - what kind of customer composition
are you likely to get when you are running a query "customers who
have bought A"? Are you going to get more low-transaction
customers or more high-transaction customers compared to the
total database? Of course, your sample is going to be biased toward
the high-transaction customers. So, if you were to compare
"customers who bought A" vs "all customers", then you are going to
find out that the first group is more likely to buy... well,
everything. Essentially, take your pick of category, and these
customers will be more likely to buy it than an average customer.
This is how I know that customers who bought A in all likelihood
are going to buy more B than an average customer.
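The selection effect can be checked with the same random-buying model. The store size `K` and the transaction-count mix `TXN_SHARE` below are invented numbers, chosen only to mimic "most customers have one transaction, a few have many". Nobody in this model prefers B, yet the A buyers still come out ahead.

```python
def p_bought(n, k):
    """P(a given product shows up at least once in n uniform draws from k)."""
    return 1 - (1 - 1 / k) ** n

def p_bought_both(n, k):
    """P(two given products both show up in n uniform draws from k)."""
    return 1 - 2 * (1 - 1 / k) ** n + (1 - 2 / k) ** n

K = 20  # hypothetical number of products
# Hypothetical mix: share of customers with each transaction count.
TXN_SHARE = {1: 0.60, 2: 0.20, 5: 0.15, 20: 0.05}

p_a = sum(share * p_bought(n, K) for n, share in TXN_SHARE.items())
p_both = sum(share * p_bought_both(n, K) for n, share in TXN_SHARE.items())

# Composition of the "customers who bought A" sample (Bayes' rule).
a_buyer_share = {n: share * p_bought(n, K) / p_a
                 for n, share in TXN_SHARE.items()}

mean_txn_all = sum(n * share for n, share in TXN_SHARE.items())
mean_txn_a_buyers = sum(n * share for n, share in a_buyer_share.items())

p_b_given_a = p_both / p_a
p_b_overall = p_a  # by symmetry, B is as likely as A under random buying
lift = p_b_given_a / p_b_overall
```

Here `a_buyer_share` shows how the A-buyer sample tilts toward heavy buyers, and `lift` comes out above 1 purely because of that tilt.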
The bottom line is that in every analysis that includes multiple
transactions per customer, you have to adjust for the number of
transactions the customer made and/or items they bought. To
accomplish that, you basically need to compare those who bought A
and had one other transaction (or item bought) to customers who
had just one transaction (or item bought), then two to two, and so
on. To get it done, you want to create a weight system
(essentially, percentages of the total sample), which you will
apply to the rankings you got for those who bought A.
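One possible reading of that weighting, sketched under my own assumptions: group A buyers by how many items they bought besides A, look up how often customers in general with that many items bought B, and blend those baseline rates using the A buyers' group shares as weights. The function name `adjusted_rates` and the toy data are hypothetical. Note that this sketch re-weights the baseline to match the A buyers; applying total-sample weights to the A buyers' results instead is the other direction of the same idea.

```python
from collections import Counter, defaultdict

def adjusted_rates(purchases, anchor="A", target="B"):
    """Return (observed, expected) rate of `target` among `anchor` buyers.

    purchases: iterable of (customer_id, product), one row per item bought.
    expected is the baseline rate re-weighted to the anchor buyers'
    mix of "other items bought".
    """
    items = defaultdict(list)
    for customer, product in purchases:
        items[customer].append(product)

    # Baseline: per item count n, share of ALL customers who bought target.
    size_total, size_hits = Counter(), Counter()
    for bought in items.values():
        size_total[len(bought)] += 1
        size_hits[len(bought)] += int(target in bought)

    # Anchor buyers, grouped by number of items other than the anchor.
    strata, hits, n_anchor = Counter(), 0, 0
    for bought in items.values():
        others = [p for p in bought if p != anchor]
        if anchor not in bought or not others:
            continue  # not an anchor buyer, or nothing else to compare
        if len(others) not in size_total:
            continue  # no baseline customers of that size
        strata[len(others)] += 1
        hits += int(target in others)
        n_anchor += 1

    if n_anchor == 0:
        raise ValueError("no comparable anchor buyers")

    observed = hits / n_anchor
    expected = sum(count / n_anchor * size_hits[n] / size_total[n]
                   for n, count in strata.items())
    return observed, expected

# Hypothetical toy data.
toy = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "C"),
    ("c3", "B"), ("c4", "C"), ("c5", "C"), ("c6", "D"),
    ("c7", "A"), ("c7", "B"), ("c7", "C"),
    ("c8", "B"), ("c8", "C"),
    ("c9", "C"), ("c9", "D"),
]
observed, expected = adjusted_rates(toy)
```

If `observed` is still well above `expected` after this adjustment, the A-to-B link is more than just heavy buyers buying everything.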