My contemporaneous notes from a particular INFORMS Marketing Science Conference six years ago feature the letters W, T, and F scrawled in the margins a few times. I learned of a deeper problem lurking in the way we were using the crosstab to identify segmentation. In this post, I’ll unpack a heap of jargon and lay the concern bare.

Of the twenty or so marketing scientists in the room at the time, I read concern on the faces of about a dozen. It was atypical, because that community doesn’t usually get concerned about much. One leader remarked that most in industry were not even executing basic segmentation on their users, so it wasn’t a huge industrial concern, but for marketing scientists in academia, this could be a very bad problem. For industry, for market researchers, the action was in reporting basic univariate means (that 22% of consumers would buy purple paper towel, or that Obama would win 51% of the popular vote), and those were no problem. Segmenting consumers, saying that 55% of women with dogs bought paper towel, was a different matter. Another gentleman, one of the four grandees I knew from the era, repeated, as though to wave everybody else off from the problem, that somebody he had supervised in the past was working on the underlying problem.

If the presentation ever made it to publication, I can’t find it. And I’m having a very tough time locating the specific author, or finding out whether there was ever a simple resolution.

The problem originates with the way we understand ground truth. So let’s start there and unpack it all.

Assume a Universe. This Universe exists entirely within the RAM of my machine. Or yours. And in this Universe, I create a hundred thousand humans. And I assert two things about each of them. Attribute A and Action B. Either somebody has Attribute A, or they don’t. They either have a 1, or a 0. And they will imminently do Action B. They will either do B, a 1, or they won’t, a 0.

If it helps you to think about Attribute A in a concrete term, you can associate it with age. Either somebody has accrued 35 years on Earth-RAM, or they have not. They are either 35 years old, a 1, or they are not. And if it helps you to think about Action B, you can think of it as they will either Buy brand B, a 1, or they will not, a 0. It doesn’t really matter. A is a 1 or a 0. And B is a 1 or a 0.

Because I have created this Universe, there are no missing values. There are no import errors. There is no number 2. Or -1. There are certainly no special characters.

There is only cold, sterile, dichotomous, data.

This data, of 100,000 perfect people in a perfect bank of RAM in a machine is the absolute ground truth of the Universe. Many of you may feel bad for those people, but that’s because you’re crazy, they have no feelings, and their existence in this Universe is much better. This is what I tell the computer to create this Universe.

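import numpy as np
import pandas as pd

n = 100000  # number of humans in the Universe
# each human gets a fair coin flip for A and another for B
dict_a = {a: np.random.randint(0,2) for a in range(n)}
dict_b = {b: np.random.randint(0,2) for b in range(n)}
df = pd.DataFrame({'A':dict_a, 'B':dict_b}, columns=list('AB'))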

As a God, I can run a query and discover the true, actual, ground truth of the Universe at any time. This forms the basis of all Judgement to come. And while I’m playing God, I’m going to instantiate twenty special agents – and task them with a few challenges.
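That God's-eye query can be sketched in a few lines. This is a minimal sketch, assuming a Universe built from the same fair coin flips as above but drawn in one vectorized call; the seed and the names `rng` and `universe` are illustrative and not part of the original script.

```python
import numpy as np
import pandas as pd

n = 100000                       # size of the Universe
rng = np.random.default_rng(42)  # assumed seed, only so the run is repeatable

# Same fair coin flip per human as above, drawn for all humans at once.
universe = pd.DataFrame({
    'A': rng.integers(0, 2, size=n),
    'B': rng.integers(0, 2, size=n),
})

# The ground truth: every human is counted, nothing is sampled.
humans_with_a = int(universe['A'].sum())
share_with_a = universe['A'].mean()
print(humans_with_a, share_with_a)
```

Nothing here is an estimate. The agents never get to run this; only the God does.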

First, I’m going to ask these agents to estimate, out of the 100,000 humans, do a majority of humans have Attribute A? How many have Attribute A?

Since I’m also a frugal God, and not much into apotheosis, I’m not just going to give these twenty agents read access to the entire human database. I’ll give them the ability to sample it. They may have a slice to check for the attribute.

import random

sample_size = 100
agents = 20
sample_predicted_values = {}
for i in range(agents):
    # each agent draws its own random list of distinct humans, human 0 included
    df_sample = df.iloc[random.sample(range(n), sample_size)]
    sample_predicted_values[i] = df_sample['A'].mean()
df_samples = pd.DataFrame.from_dict(sample_predicted_values, orient='index')
print(df_samples)

These agents are able to sample at random: one may be able to interview humans 99, 107, 998, and so on. One may coincidentally interview humans 98,000 to 98,099. Their interview lists are generated randomly and independently of each other. The humans in RAM can’t refuse to respond. They have to respond. And they have to respond honestly. And they do respond. Each agent gets to ask 100 humans what they are. The agents do not make a mistake in recording the responses or tabulating means. They have no feelings about what the answer should be.

What I’ve set up here is a simulation. It’s a controlled environment, free from most of the problems caused by the real Universe out there. How would twenty agents answer the question: how many humans have Attribute A? Put a different way, how many humans in the Universe have a 1 under column A?

They’d probably add up all the 1’s they got from those 100 humans and apply that average to the 100,000. It would stand to reason that if they asked 100 humans, and 50 of them had attribute A, then 50% of all humans would have attribute A.

What do you think these twenty agents would come back with? Would they all come back with the same answer? Would they all literally say 50%? How probable do you think that is?

That probably depends on the underlying distribution. What if the randomness of the birth of this Universe produced more than 50% with Attribute A? Or less than 50%? Wouldn’t one expect np.random, the random module contained in the numpy library, to produce something that isn’t perfectly 50%?

It would also depend on the random sample that the agent is pulling too. They’re polling 100 humans from a random list they, themselves, produce at random. And what are the odds that each agent, operating independently of each other, is going to draw the exact same 100 humans to poll? It really isn’t likely.
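For a sense of scale, here is a back-of-the-envelope count. It assumes each agent draws 100 distinct humans out of 100,000 with every possible list equally likely, which is what a uniform random sample without replacement gives you.

```python
import math

n = 100000
sample_size = 100

# Number of distinct sets of 100 humans an agent could end up interviewing.
possible_samples = math.comb(n, sample_size)

# The chance that a second agent lands on exactly the same set is 1 in that number.
digits = len(str(possible_samples))
print("possible samples: a number with", digits, "digits")
```

A number with hundreds of digits is why "it really isn’t likely" is an understatement.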

These twenty agents are going to return different lists of 1’s and 0’s, 100 items long, and arrive at a different percentage estimate of the ground truth. And remember, nobody, other than the God, which is me, knows the real ground truth.

Here’s how my agents did on the first run of the Universe. It’s a sorted list of what each Agent reported back, as a decimal. If it says 0.42 in the table, it can be interpreted as 42%. I sorted the list to make it easier to read. Below, you can see that Agent 3 predicted 42%. Agents 9, 7, and 19 all predicted 50%. Agent 4 predicted 60%.

Sorted
3   0.42
12  0.42
1   0.43
0   0.45
17  0.46
10  0.47
5   0.48
18  0.49
9   0.50
7   0.50
19  0.50
11  0.51
14  0.51
16  0.51
2   0.52
15  0.53
6   0.55
13  0.55
8   0.55
4   0.60
Range
0    0.18

That’s a total range of 18 percentage points!

Isn’t that interesting? It isn’t that Agent 4 is a bad person, any more than Agent 3 is a bad person. They just happened to generate very different lists, from the same Universe, and arrived at very different answers. And, you don’t know for sure if Agent 4 is right. Or if Agent 3 is right.

You might be eyeballing that list, and your gut might be telling you something. The agents, in aggregate, are returning a distribution of answers that are clustering around some figure in the middle, aren’t they? And maybe you reach for those classic summary statistics, the mean and the median, to describe the list. Maybe you take the average of the averages and end up at 49.75%. Maybe you take the median at 50%? All of these estimates, in total, are clumping near the middle.
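Those summary statistics can be checked directly from the sorted list. The snippet below simply retypes the twenty estimates reported above and uses Python's standard `statistics` module.

```python
import statistics

# The twenty estimates from the sorted list above, as reported by the agents.
estimates = [0.42, 0.42, 0.43, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.50,
             0.50, 0.51, 0.51, 0.51, 0.52, 0.53, 0.55, 0.55, 0.55, 0.60]

mean_of_means = statistics.mean(estimates)
median_of_means = statistics.median(estimates)
spread = max(estimates) - min(estimates)

print(round(mean_of_means, 4), median_of_means, round(spread, 2))
# prints 0.4975 0.5 0.18
```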

It turns out, in this instance, the ground truth in this Universe was that 50,058 humans had Attribute A, so the Ground Truth was 50.058%. Why would God allow such a Universe to exist?!? God plays dice.

Let’s spool up another Universe, and this time, let’s loosen the strings up a bit on our Agents. Let’s gift them ten times the information. Let’s grant them the RAM to ask 1,000 random humans. What then?

Sorted
2   0.478
16  0.485
17  0.487
8   0.495
1   0.496
11  0.497
0   0.500
3   0.502
4   0.504
7   0.504
14  0.507
18  0.508
19  0.509
15  0.511
6   0.512
9   0.514
5   0.517
12  0.517
10  0.520
13  0.523
Range
0    0.045

Two very interesting things happen. For one, a set of 1000 allows for more precision in their estimates. And, second, the range is a lot tighter: just 4.5 percentage points separate Agent 2 from Agent 13. So based off these twenty estimates, what would you say the ground truth is? Is Agent 0 right at their 50.0% estimate? Is Agent 7 at 50.4%? The mean and the median are very close to each other, around 0.504, or 50.4%. Do you feel more confident that more than half of the humans in this Universe have Attribute A?

The actual ground truth, in this example, was that 49,677 out of 100,000 humans had Attribute A. Agent 11 was more right than others. And again, just out of pure chance.

Did you feel the precision in the figures give you a false sense of confidence? Did you feel that? Insidious, isn’t it?

Let’s do one more.

I’m going to give these agents a sample size of 10,000! Such generosity!

#set size of universe
n = 100000
#set size of the sample to be pulled
sample_size = 10000
Sorted
6   0.1931
13  0.1950
5   0.1956
18  0.1958
15  0.1961
12  0.1964
0   0.1967
1   0.1974
11  0.1991
3   0.1993
4   0.2000
8   0.2000
10  0.2005
16  0.2009
9   0.2009
14  0.2021
17  0.2022
7   0.2023
19  0.2026
2   0.2067
Range
0.0136

Each agent is more precise. And the range went from 0.045 to 0.0136 (4.5 points down to 1.36 points). Which is a bit disappointing, isn’t it, since I increased the amount of sample given to each agent by a factor of 10 and only got a bit more than a threefold tightening of the range.

And all twenty agents are estimating figures far away from 50%, they’re all down at around 20%. Is Agent 3 right, at 19.93%? Is it weird that Agent 4 and Agent 8 both agree so much at 20.00%? The average of the average is right around Agent 3, 4, and 8 at 0.199, or 19.9%.

Could they all be so wrong?

In this Universe, 19,956 humans had Attribute A. The right answer, the actual ground truth of the Universe, was 19.956%.

Why?

dict_a = {a: np.random.choice(np.arange(0, 2), p=[0.80, 0.20]) for a in range(n)}

Because God microwaved the dice. I set p=[0.80, 0.20], manipulated the plastic in those bad boys, causing the dice to bounce a certain way – giving the odds a 20% chance of Attribute A for each human. But there was still chance at work.

You may have anchored on 0.50, so when you saw so many outliers, you may have thought there was something wrong with the agents.

Alright, so now you have some intuition for how simulation can help you understand more about the wonderful world of univariate statistics and making predictions about the ground truth of the Universe.

Controlled Universe. Controlled Agents. One variable. One prediction. Why we call it univariate (uni = one) analysis. And there’s still quite a bit of randomness.

As the amount of information made available to each agent increased, the accuracy of their predictions got better, even if it didn’t improve as fast as the amount of data did. I increased the amount of data they had access to by two orders of magnitude, from 100 to 10,000, and they only went from being within about 9 percentage points of the mean (18 was the total range) to being within about one percentage point. And the rate at which predictions get better itself generalizes into a predictable formula. Mathematical scientists have in turn discovered a whole bunch of observations about the way this relationship works, assigned Greek letters to some of the concepts, and have been working on it ever since.
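One version of that formula is the standard error of a sample proportion, sqrt(p(1-p)/n). The sketch below is an illustration rather than part of the original script. It assumes simple random sampling without replacement, so it also applies the textbook finite population correction for a Universe of 100,000; the function name is hypothetical.

```python
import math

def standard_error(p, sample_size, universe_size=None):
    """Standard error of a sample proportion, with an optional finite population correction."""
    se = math.sqrt(p * (1 - p) / sample_size)
    if universe_size is not None:
        # Sampling without replacement from a finite Universe shrinks the error a little.
        se *= math.sqrt((universe_size - sample_size) / (universe_size - 1))
    return se

for size in (100, 1000, 10000):
    se = standard_error(0.5, size, universe_size=100000)
    # Roughly 95% of agents should land within two standard errors of the truth.
    print(size, round(se, 4), "about +/-", round(2 * se * 100, 1), "points")
```

Ten times the sample buys only about sqrt(10), or roughly 3.2 times, the precision, which is why the ranges above shrink much more slowly than the sample grows.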

This is the intuition at the root of how nature, of how randomness and chance, even against a ground truth, generalize out in the world of univariate analysis. But because we come to know nature, we can manage that, and ourselves.

Bivariate

Now we’re going to compare two things. I’m spooling up a Universe and setting A and B.

n = 100000
sample_size = 100
dict_a = {a: np.random.randint(0,2) for a in range(n)}
dict_b = {a: np.random.randint(0,2) for a in range(n)}

And now I’m going to ask my twenty agents – how many of those with Attribute A are going to do Action B?

Stated a slightly different way, how many humans have a 1 in both columns, A and B?

I’m going to start off by giving each agent 100 humans to interview. Same rules apply as above.

Here’s what Agent 6 put together from their sample.

Agent
6
Sample Crosstab
A     0   1  All
B               
0    27  37   64
1    19  17   36
All  46  54  100

Here’s how to read that crosstab (sometimes it’s called a contingency table). Agent 6 interviewed 100 people. Of them, 54 had attribute A. And 36 did Action B. Agent 6 found that 17 people that had attribute A, also did Action B.
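A table like Agent 6’s can be produced with pandas’ `crosstab` function. This is a sketch under assumptions: the Universe is regenerated here with a fixed seed, and the agent’s draw uses `DataFrame.sample`, so the counts will not match Agent 6’s.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)   # assumed seed, for a repeatable illustration
n, sample_size = 100000, 100

universe = pd.DataFrame({
    'A': rng.integers(0, 2, size=n),
    'B': rng.integers(0, 2, size=n),
})

# One agent interviews 100 distinct humans chosen at random.
sample = universe.sample(n=sample_size, random_state=6)

# Rows are Action B, columns are Attribute A, margins adds the 'All' totals.
crosstab = pd.crosstab(sample['B'], sample['A'], margins=True)
print(crosstab)

# The cell the agents report: humans with a 1 in both columns.
both = crosstab.loc[1, 1]
print("1_1 count:", both)
```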

I’m reporting each agent’s count of those that had 1’s in each column, 1 for Attribute A, and a 1 for Action B.

Sorted Values 1_1
     0
6   17
1   18
11  18
16  19
17  20
15  20
0   21
18  22
2   23
4   23
12  23
7   24
8   24
9   25
3   26
13  26
19  26
14  30
5   33
10  38
Range of Values 1_1
0    21

And as you can see, agents would estimate a value as low as 17 (Agent 6) to as high as more than twice that, 38 (Agent 10)! It’s trivial here to put those into percentages: a range that varies from 17% to 38%.

The ground truth figure was 25,180 (that is, 25.180% had both Attribute A and did Action B).

Universal CrossTab
A        0      1     All
B                        
0    24863  24906   49769
1    25051  25180   50231
All  49914  50086  100000

So what happens when I give the agents a sample of 1000?

#set size of the sample to be pulled
sample_size = 1000

And we find that the agents are returning a tighter range, which is what we expected.

Sorted Values 1_1
      0
0   219
16  224
11  236
17  241
14  241
13  241
9   241
10  242
12  248
18  249
8   250
1   250
15  251
7   254
6   255
5   258
3   261
19  261
2   267
4   270
Range of Values 1_1
0    51

And it tightens. Agent Zero estimates 21.9%, Agent Four estimates 27.0%. Both are much closer to the ground truth, which was 25.202%.

Universal CrossTab
A        0      1     All
B                        
0    24996  24858   49854
1    24944  25202   50146
All  49940  50060  100000

So what happens when I give the agents a sample of 10000?

#set size of the sample to be pulled
sample_size = 10000

And we find that the estimates are narrowed again.

Sorted Values 1_1
       0
8   2417
12  2426
2   2427
10  2436
15  2454
5   2464
18  2476
19  2486
9   2489
7   2489
0   2525
14  2530
1   2530
3   2538
13  2538
4   2543
6   2546
11  2546
17  2560
16  2570
Range of Values 1_1
0    153

Which is good, against the ground truth of the situation.

Universal CrossTab
A        0      1     All
B                        
0    25197  24885   50082
1    24772  25146   49918
All  49969  50031  100000

So as you can see, more information helps the agents greatly in making a more accurate estimate of the ground truth.

So far, so good.

Next, I’m going to ask my agents a different question – what is the relationship between Attribute A, and Action B?

So I recreate the Universe, and give every agent a sample of 100.

#set size of the sample to be pulled
sample_size = 100

Here’s what Agent 17 discovered when they went out into the Universe.

Agent
17
Sample Crosstab
A     0   1  All
B               
0    26  25   51
1    26  23   49
All  52  48  100

So what can we deduce about the Ground Truth of the Universe from this distribution? Well, perhaps not much. If Attribute A had nothing to do with Action B, one might expect there to be a number around 25 in each cell. The number of people with Attribute A and doing Action B would be around 25 (and down below, I’ll report out the expected values). And we look above, and we can see that the Agent can’t really make a prediction using information about A.

If Agent 17 was sitting in a room, and I sent a human at random in there, and the person said they had Attribute A, the Agent might guess that there was about a 48% chance (23/48) that they were going to do Action B in the future. That’s barely different from a coin flip, and given how small the numbers are, Agent 17 couldn’t really say for sure. They wouldn’t be very confident in that prediction.

Agent 5 would have a different relationship with reality based on their experience.

Take a look at what Agent 5 discovered.

Agent
5
Sample Crosstab
A     0   1  All
B               
0    30  18   48
1    21  31   52
All  51  49  100

Agent 5 found that if somebody had Attribute A, 31 times out of 49, that human also did Action B (63% of the time). That’s pretty lopsided. So if you sent a random human into a room with Agent 5, and that human said they had Attribute A, Agent 5 would estimate a 63% chance that the human would do Action B. They might be pretty confident about that prediction.

That relationship, between a prediction and the confidence in it is quantified in a bunch of equations, rooted in observations, and expressed in Greek letters. I’m going to give them the gift of knowing about a Greek Letter called Chi, and have them report what they calculate.

The way Chi Squared is calculated is by taking the difference between what the agent observes in each cell in the crosstab table and what the agent would expect to see if there was no relationship between A and B, and then squaring it. That number is then divided by what was expected. Add all of these together, across the cells, and one gets the Chi Squared value. (The Greek letter Chi looks like a Latin X.)

For intuition, the greater the excursion from what’s expected, the greater the Chi Squared value.
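To make the arithmetic concrete, here is the calculation done by hand on Agent 5’s crosstab from above. It’s a sketch in plain Python, not the script behind this post. Two things in it are additions for illustration: the optional Yates continuity correction, which scipy’s `chi2_contingency` applies by default to 2×2 tables, and the `erfc` shortcut for the p-value, which is only valid for one degree of freedom.

```python
import math

def chi_squared_2x2(table, yates=False):
    """Chi squared statistic and p-value for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if A and B had nothing to do with each other.
            expected = row_totals[i] * col_totals[j] / total
            excursion = abs(table[i][j] - expected)
            if yates:
                excursion = max(0.0, excursion - 0.5)
            chi2 += excursion ** 2 / expected
    # With one degree of freedom, P(chi squared > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

agent_5 = [[30, 18],   # did not do Action B: without A, with A
           [21, 31]]   # did Action B:        without A, with A

print(chi_squared_2x2(agent_5))              # the plain sum described above
print(chi_squared_2x2(agent_5, yates=True))  # with the continuity correction
```

The corrected version gives a p-value of about 0.044, which matches the 0.044430 that Agent 5 reports in the first table of p-values, so the agents’ figures in this post appear to include the correction.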

The relationship between the Chi Squared value and the probability that it’s independent is moderated by a concept called the degrees of freedom. We have a 2×2 crosstab table here, so the total number of degrees of freedom is just 1. Just one degree of freedom. Check out the black line, the one furthest to the bottom of the chart, below.

As the excursion from what’s expected increases, the Chi Square value returned from the calculation increases. As the Chi Square value increases, the P-value, or the probability of seeing an excursion at least that large if there were no relationship between the two, goes down. High departures from the expected cause high Chi Square values, and high Chi Square values cause low p-values. Low p-values are read as evidence that A and B are not independent. And this point, about what a p-value means, is quite tortured and much debated. It wasn’t obvious to most real humans until the 1700’s, and a human called Pearson didn’t make the connection until 1900. But I’m going to give this knowledge to my Agents for free. (I spoil them!)

from scipy import stats
chi2, p, dof, expected = stats.chi2_contingency(obs)

I’m going to ask all 20 agents to report back to me the p-value they get: how plausible their sample would be if there were no relationship between Attribute A and Action B.

P VALUES
           0
5   0.044430
1   0.069816
16  0.243300
4   0.378896
12  0.399213
2   0.408720
7   0.408720
14  0.460522
19  0.488906
8   0.500817
18  0.545683
0   0.558591
10  0.595316
6   0.669254
11  0.689098
3   0.716851
13  0.776825
9   0.935711
15  0.993611
17  0.993611

Agent 17 reports a p-value of 0.9936: a sample like theirs is entirely unremarkable if there is no relationship between Attribute A and Action B. As does Agent 15.

Agent 5 reports a p-value of 0.0444: a sample as lopsided as theirs would turn up only about 4.4% of the time if there were no relationship between Attribute A and Action B.

And you can see here, the p-values vary from 0.0444 all the way to 0.9936. Now remember, everything in Agent 5’s experience tells them that there are pretty good odds that there is a relationship between A and B.

What effect does giving my agents more sample have? Let’s increase it by a factor of 10.

#set size of the sample to be pulled

sample_size = 1000

And we check out the p-values again.

P VALUES
           0
6   0.036959
13  0.076433
0   0.096252
7   0.107040
15  0.141339
17  0.161236
1   0.168180
3   0.227854
14  0.229311
4   0.294523
19  0.366309
9   0.368254
10  0.477954
18  0.531879
5   0.546537
16  0.584444
8   0.728194
11  0.874022
2   0.895047
12  0.994750

And we see right away that Agent 6 has something unusual.

Agent
6
Sample Crosstab
A      0    1   All
B                  
0    261  228   489
1    238  273   511
All  499  501  1000

Sample A Mean
0.501
Sample B Mean
0.511
Sample 1-1 Mean
273
CHI STATS FOR THIS AGENT
p
0.0369586132321
chi2
4.35231785869
dof
1
expected
[[244.011 254.989]
 [244.989 256.011]]

Agent 6 observed 273 people with both Attribute A and Action B, so they could predict that somebody with Attribute A would do Action B 54.5% of the time (273 out of 501), which is better than even odds. They report a p-value of 0.037: a sample this lopsided would be a fluke just 3.7% of the time if A and B were unrelated. This is their experience. None of the other agents finds a p-value below 0.05.

What about the ground truth of the Universe?

Universal CrossTab
A        0      1     All
B                        
0    25032  24941   49973
1    25202  24825   50027
All  50234  49766  100000
CHI STATS FOR THE UNIVERSE
p
0.36956011438
chi2
0.805143058414
dof
1
expected
[[25103.43682 25130.56318]
 [24869.56318 24896.43682]]

Nope. There isn’t a relationship. There’s no relationship between A and B.

Let’s amp the sample size up to 10,000 and see what happens.

#set size of the sample to be pulled
sample_size = 10000

Agent Zero, even with 10,000 in sample, reported a relationship.

P VALUES
           0
0   0.017687
9   0.061001
10  0.065904
18  0.105137
19  0.122991
3   0.209514
4   0.281474
12  0.326845
13  0.364398
14  0.433251
11  0.519662
1   0.521283
2   0.531240
6   0.560125
15  0.605112
5   0.688567
7   0.890137
17  0.900055
8   0.929782
16  0.955363

Upon closer inspection of Agent Zero, we see why:

Agent
0
Sample Crosstab
A       0     1    All
B                     
0    2524  2505   5029
1    2376  2595   4971
All  4900  5100  10000

Sample A Mean
0.51
Sample B Mean
0.4971
Sample 1-1 Mean
2595
CHI STATS FOR THIS AGENT
p
0.0176866154165
chi2
5.62692654471
dof
1
expected
[[2464.21 2435.79]
 [2564.79 2535.21]]

Even though the Ground Truth, indeed, indicates that A and B are not related.

Universal CrossTab
A        0      1     All
B                        
0    24909  24958   49867
1    25053  25080   50133
All  49962  50038  100000
CHI STATS FOR THE UNIVERSE
p
0.949061729484
chi2
0.00408130392167
dof
1
expected
[[24914.55054 25047.44946]
 [24952.44946 25085.55054]]

Again, Agent Zero isn’t a bad human. It’s just that they got unlucky. Their experience of the Universe leads them to believe that there is a relationship between A and B, even though there isn’t. Even with a sample of 10,000.

In all Four of these Universes, there was no relationship between A and B. Aside from showing you the ground truth and the related statistics, I can prove it by showing you the way I created the Universe.

dict_a = {a: np.random.randint(0,2) for a in range(n)}
dict_b = {a: np.random.randint(0,2) for a in range(n)}

A and B are independent because they were created independently. They were willed into being using two separate random draws, at two different times (one followed the other), and they have nothing to do with each other, other than existing in the same Simulated Universe together. The fact that roughly one agent out of twenty thought there was a relationship is predictable: with a threshold of 0.05, about one agent in twenty will cross it by luck alone. It’s just that no agent ever believes that they’d be so unlucky.

At sample = 100:

Sample Mean P VALUE
0    0.419183
Sample Range P VALUE
0    0.933486

At sample = 1000:

Sample Mean P VALUE
0    0.417475
Sample Range P VALUE
0    0.976968

At sample = 10000:

Sample Mean P VALUE
0    0.536264
Sample Range P VALUE
0    0.887396

The P-value range doesn’t become any tighter or more consistent, across the agents, as sample goes up. There is no tightening of certainty at scale.
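That lack of tightening can be checked with a quick simulation. The sketch below is an illustration under assumptions of its own: each agent interviews fresh coin-flip humans rather than sampling a stored Universe, the test is the uncorrected chi squared for a 2×2 table, and the seed and the count of 1,000 agents per sample size are arbitrary.

```python
import math
import random

def p_value_2x2(counts):
    """Uncorrected chi squared p-value for counts keyed by (a, b)."""
    total = sum(counts.values())
    a_totals = [counts[(a, 0)] + counts[(a, 1)] for a in (0, 1)]
    b_totals = [counts[(0, b)] + counts[(1, b)] for b in (0, 1)]
    chi2 = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = a_totals[a] * b_totals[b] / total
            if expected == 0:
                return 1.0  # degenerate table, no evidence either way
            chi2 += (counts[(a, b)] - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))  # one degree of freedom

def false_positive_rate(sample_size, agents, rng):
    """Share of agents who see p < 0.05 when A and B are truly independent."""
    hits = 0
    for _ in range(agents):
        counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
        for _ in range(sample_size):
            # Two independent fair coin flips: one for A, one for B.
            counts[(rng.getrandbits(1), rng.getrandbits(1))] += 1
        if p_value_2x2(counts) < 0.05:
            hits += 1
    return hits / agents

rng = random.Random(2024)  # assumed seed
rates = {size: false_positive_rate(size, 1000, rng) for size in (100, 1000)}
print(rates)
```

Both rates should come out in the neighborhood of 0.05 whatever the sample size. When there is nothing to find, more sample does not make the unlucky agent any rarer.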

What if I create a universe where A is related to B?

What if A is correlated to B? What if they’re not only just correlated, but I make A cause B? Naked. Pure. Causality.

dict_a = {a: np.random.choice(np.arange(0, 2), p=[0.50, 0.50]) for a in range(n)}
dict_b = {}

# B is drawn after A, with odds that depend on A
for key, value in dict_a.items():
    if value == 0:
        dict_b[key] = np.random.choice(np.arange(0, 2), p=[0.80, 0.20])
    elif value == 1:
        dict_b[key] = np.random.choice(np.arange(0, 2), p=[0.20, 0.80])

For each human created from the void, they’ll have a 50:50 chance of getting Attribute A. For those who do not get Attribute A, there is a 20% chance they will do Action B. If they have Attribute A, there is an 80% chance they will do Action B. It is this way because I say it is. Whether a human gets a 0 or a 1 under Action B is determined by, is because of, having Attribute A. It is causal both in terms of time, and in terms of definition. That’s just how I created the Universe. (This is the beauty of simulation!)

So I set the agents’ sample size to 100.

sample_size = 100

And they all come back with very low p-values.

P VALUES
               0
5   7.640166e-13
3   2.970227e-11
10  1.440464e-10
14  5.218926e-10
18  6.594466e-10
8   1.845767e-09
0   1.958641e-09
16  1.958641e-09
13  2.338188e-09
12  2.387777e-09
17  3.596767e-09
11  6.236816e-09
9   1.713752e-08
1   5.138308e-08
4   5.508875e-08
6   5.805862e-08
7   2.820377e-07
2   4.233467e-07
15  4.403962e-06
19  1.220610e-05

And, for intuition, let’s check out why Agent 5 came back with such a low value:

Agent
5
Sample Crosstab
A     0   1  All
B               
0    38   6   44
1     7  49   56
All  45  55  100
Sample A Mean
0.55
Sample B Mean
0.56
Sample 1-1 Mean
49
CHI STATS FOR THIS AGENT
p
7.64016573956e-13
chi2
51.3724911452
dof
1
expected
[[19.8 25.2]
 [24.2 30.8]]

As you can see, one would expect 30.8 people in the sample to have 1’s in both A and B, and the observed value was 49. Add up all of these variations, and we get a high chi squared value of 51, which produces a very low p-value. If a person walked into a room with Agent Five and told them they had Attribute A, they’d predict that the person would do Action B about 89% of the time (49 out of 55).

And that’s within a good range of what the Universe says it really is, what the ground truth really is, which is about 80%. It’s pretty close.

Universal CrossTab
A        0      1     All
B                        
0    39894   9935   49829
1     9960  40211   50171
All  49854  50146  100000
CHI STATS FOR THE UNIVERSE
p
0.0
chi2
36249.5631917
dof
1
expected
[[24841.74966 25012.25034]
 [24987.25034 25158.74966]]
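The comparison between Agent 5’s prediction and the ground truth can be worked out as plain arithmetic, using only the counts from Agent 5’s crosstab and the Universal CrossTab above.

```python
# Agent 5's sample: humans with Attribute A, and those among them who did Action B.
agent_with_a, agent_with_a_and_b = 55, 49

# The whole Universe, from the Universal CrossTab.
universe_with_a, universe_with_a_and_b = 50146, 40211
universe_without_a, universe_without_a_but_b = 49854, 9960

agent_estimate = agent_with_a_and_b / agent_with_a
truth_given_a = universe_with_a_and_b / universe_with_a
truth_given_not_a = universe_without_a_but_b / universe_without_a

print(round(agent_estimate, 3))     # Agent 5's estimate of P(B | A)
print(round(truth_given_a, 3))      # ground truth P(B | A)
print(round(truth_given_not_a, 3))  # ground truth P(B | not A)
```

The ground truth sits almost exactly on the 80% and 20% that were dialed in, while Agent 5’s sample of 100 overshoots to roughly 89%.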

Alright, let’s boost the sample size to 1,000.

sample_size = 1000

And see what happens:

P VALUES
               0
14  2.267577e-89
18  5.482615e-89
6   6.981511e-89
17  7.188154e-89
7   5.035317e-88
5   1.285554e-85
19  7.385426e-85
2   2.342636e-84
15  2.376226e-84
0   2.140627e-81
13  3.776932e-81
10  8.144878e-80
11  4.849325e-79
9   3.542902e-78
16  8.106565e-78
1   1.250593e-75
3   1.762194e-75
12  5.910935e-75
4   6.113513e-75
8   8.353442e-69
Sample Mean P VALUE
0    4.176729e-70
Sample Range P VALUE
0    8.353442e-69

And the Agents generally agree with the ground truth of the Universe.

Universal CrossTab
A        0      1     All
B                        
0    40081   9984   50065
1     9951  39984   49935
All  50032  49968  100000
CHI STATS FOR THE UNIVERSE
p
0.0
chi2
36153.7396954
dof
1
expected
[[25048.5208 24983.4792]
 [25016.4792 24951.5208]]

And let’s do it again, with a higher sample for each agent:

P VALUES
      0
0   0.0
17  0.0
16  0.0
15  0.0
14  0.0
13  0.0
12  0.0
11  0.0
10  0.0
9   0.0
8   0.0
7   0.0
6   0.0
5   0.0
4   0.0
3   0.0
2   0.0
1   0.0
18  0.0
19  0.0
Sample Mean P VALUE
0    0.0
Sample Range P VALUE
0    0.0

And these agree with the Universe.

Universal CrossTab
A        0      1     All
B                        
0    39892   9882   49774
1     9978  40248   50226
All  49870  50130  100000
CHI STATS FOR THE UNIVERSE
p
0.0
chi2
36333.9440092
dof
1
expected
[[24822.2938 25047.7062]
 [24951.7062 25178.2938]]

Our Agents are able to make good statements about relationships when God is leaning, 80/20, on the Universe.

What happens to the Agents’ ability to predict if we don’t make it so obvious?

dict_a = {a: np.random.choice(np.arange(0, 2), p=[0.50, 0.50]) for a in range(n)}
dict_b = {}

# a much weaker lean: 45% chance of Action B without A, 55% with it
for key, value in dict_a.items():
    if value == 0:
        dict_b[key] = np.random.choice(np.arange(0, 2), p=[0.55, 0.45])
    elif value == 1:
        dict_b[key] = np.random.choice(np.arange(0, 2), p=[0.45, 0.55])

In this Universe, there’s a 50/50 chance of having Attribute A, and if given Attribute A, there’s a 55% chance of doing Action B (45% without it). The dependence of B on A is quite a bit more subtle, but you can see that it’s still there.

P VALUES
           0
16  0.002654
14  0.022877
9   0.028232
1   0.047574
5   0.054960
3   0.101809
10  0.106456
0   0.124586
8   0.133113
11  0.219751
6   0.305201
15  0.355469
2   0.418625
18  0.446463
19  0.525728
12  0.551016
7   0.862065
13  0.902446
4   0.916301
17  0.974134
Sample Mean P VALUE
0    0.354973
Sample Range P VALUE
0    0.97148

Four agents out of twenty, 16, 14, 9, and 1, experienced P-Values that are less than 0.05.

And this Universe is very, very clear – there is dependence.

Universal CrossTab
A        0      1     All
B                        
0    27433  22344   49777
1    22492  27731   50223
All  49925  50075  100000
CHI STATS FOR THE UNIVERSE
p
7.52459087194e-234
chi2
1066.14820878
dof
1
expected
[[24851.16725 25073.83275]
 [24925.83275 25149.16725]]

Let’s see if more data helps my agents make better predictions.

sample_size = 1000

And the P-values narrow:

P VALUES
           0
12  0.000003
5   0.000006
1   0.000012
14  0.000017
11  0.000023
7   0.000148
19  0.000148
8   0.000173
18  0.000192
17  0.000709
15  0.000713
10  0.000942
3   0.002926
9   0.004837
4   0.006401
2   0.008387
6   0.010469
0   0.030132
16  0.057896
13  0.059988
Sample Mean P VALUE
0    0.009206
Sample Range P VALUE
0    0.059985

And in this instance, all agents but two, 16 and 13, experienced P-values less than 0.05. Pretty good.

We boost the sample to 10,000 each.

sample_size = 10000

And rerun the Universe.

P VALUES
               0
19  2.425200e-33
14  4.308918e-31
15  8.251558e-28
10  8.948502e-28
6   1.468066e-27
13  1.027866e-26
3   1.606491e-26
4   1.693300e-26
12  2.170612e-26
2   3.552389e-26
7   5.560343e-25
1   2.823116e-24
0   1.003131e-23
16  2.671182e-23
18  4.844996e-23
11  1.348288e-21
8   1.514325e-21
17  2.494315e-21
5   5.137547e-21
9   1.729439e-17
Sample Mean P VALUE
0    8.652486e-19
Sample Range P VALUE
0    1.729439e-17

And the P-values narrow even more. All the Agents would agree – B and A are correlated. And they match up with the Universe.

Universal CrossTab
A        0      1     All
B                        
0    27418  22720   50138
1    22104  27758   49862
All  49522  50478  100000
CHI STATS FOR THE UNIVERSE
p
4.27806977209e-235
chi2
1071.87737866
dof
1
expected
[[24829.34036 24692.65964]
 [25308.65964 25169.34036]]

What if I make a Universe where Attribute A is anomalous – just 5% probability that a human would have it, and what if it just gives that 55/45 Action B split? Note that there’s still a causal arrow in the data. It’s just a lot weaker.

dict_a = {a: np.random.choice(np.arange(0, 2), p=[0.95, 0.05]) for a in range(n)}
dict_b = {}

# A is rare (5%), and B still leans 45/55 depending on A
for key, value in dict_a.items():
    if value == 0:
        dict_b[key] = np.random.choice(np.arange(0, 2), p=[0.55, 0.45])
    elif value == 1:
        dict_b[key] = np.random.choice(np.arange(0, 2), p=[0.45, 0.55])

And I start off the Agents with a sample of just 100.

P VALUES
           0
15  0.187727
2   0.200325
13  0.206507
3   0.289892
1   0.290413
0   0.470615
9   0.472737
7   0.510751
6   0.673700
10  0.727745
16  0.777826
5   0.817648
17  0.817648
11  0.826133
8   0.894095
19  0.905466
18  0.917037
14  0.949945
12  1.000000
4   1.000000
Sample Mean P VALUE
0    0.64681
Sample Range P VALUE
0    0.812273

And we see that none of the agents were able to spot the dependence. Even though in this Universe, it is there.

Universal CrossTab
A        0     1     All
B                       
0    52269  2328   54597
1    42689  2714   45403
All  94958  5042  100000
CHI STATS FOR THE UNIVERSE
p
7.45561589877e-35
chi2
151.676711965
dof
1
expected
[[51844.21926 43113.78074]
 [ 2752.78074  2289.21926]]

So, we’re going to boost the sample to 1000.

sample_size = 1000

And we find that six agents are able to pick it up.

P VALUES
           0
16  0.001622
19  0.007936
3   0.008558
9   0.015707
14  0.018716
13  0.036360
11  0.057232
4   0.089602
12  0.105284
8   0.222948
15  0.244618
1   0.248235
5   0.274044
0   0.327197
17  0.448070
10  0.481531
2   0.513948
18  0.566968
6   0.575433
7   0.602760
Sample Mean P VALUE
0    0.242338
Sample Range P VALUE
0    0.601138

But the Universe says that it’s there.

Universal CrossTab
A        0     1     All
B                       
0    52166  2230   54396
1    42887  2717   45604
All  95053  4947  100000
CHI STATS FOR THE UNIVERSE
p
1.98924420195e-41
chi2
181.771280722
dof
1
expected
[[51705.02988 43347.97012]
 [ 2690.97012  2256.02988]]

In the final run, we give them the full 10,000 sample size.

sample_size = 10000

And we find that all the agents pick it up.

P VALUES
               0
14  1.743448e-09
4   1.713640e-08
0   9.902143e-08
7   1.935761e-07
9   7.475067e-07
17  3.601266e-06
15  3.666506e-06
13  5.525651e-06
6   1.075299e-05
12  2.196981e-05
10  2.213179e-05
3   2.672497e-05
1   2.791613e-05
19  4.573572e-05
11  7.288003e-05
16  1.366004e-04
18  1.926298e-04
2   2.279910e-04
8   2.320205e-04
5   2.297855e-03
Sample Mean P VALUE
0    0.000166
Sample Range P VALUE
0    0.002298

Which agrees with the Universe.

Universal CrossTab
A        0     1     All
B                       
0    52402  2256   54658
1    42567  2775   45342
All  94969  5031  100000
CHI STATS FOR THE UNIVERSE
p
1.28720079078e-46
chi2
205.546021449
dof
1
expected
[[51908.15602 43060.84398]
 [ 2749.84398  2281.15602]]

A much larger sample helped the agents see that a small relationship, inside an anomaly (just 5% of the humans have Attribute A), was present in the Universe.

Segmentation

Much of the work of segmentation is about observing some Attribute A and running a test for correlation against some Action B. The machinery around the word “because” is supplied by the human involved, not by the chi square test or the p-value. Chi and p make no statement about causality; they only speak to independence.
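To make concrete what the test actually measures, here is a minimal sketch that recomputes the chi square statistic for the last Universal CrossTab above using only the Python 3 standard library. It assumes Yates' continuity correction, because that is the variant that appears to reproduce the chi2 of about 205.546 printed above; the function name `chi_square_2x2` is my own and not part of any library.

```python
import math

def chi_square_2x2(table, yates=True):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count in each cell if the row and column variables were independent.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            diff = abs(table[i][j] - expected[i][j])
            if yates:
                # Continuity correction: shrink each deviation by up to 0.5.
                diff -= min(0.5, diff)
            chi2 += diff ** 2 / expected[i][j]
    # With one degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected

# Rows are B = 0, 1 and columns are A = 0, 1, copied from the last Universal CrossTab.
universe = [[52402, 2256], [42567, 2775]]
chi2, p_value, expected = chi_square_2x2(universe)
print(chi2, p_value)
```

Nothing in the calculation knows which variable came first or which one caused the other. It only compares the observed counts with the counts that independence would predict.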

The reason for bothering with chi and the p-value in the first place is that if one can understand the relationship between an Attribute and an Outcome, one can predict it, and prediction is the key to acting.

Many commercial segmentations one encounters make use of multiple Attributes, like age, gender, income, previous purchase, number of children, marital status, and location, to make a prediction. This increases the number of degrees of freedom, which in turn has a predictable effect on p-values and on each agent’s ability to reject the null hypothesis, or discover exploitable knowledge. In the end, most segmentation boils down to a 2×2 Pearson’s Chi Square test with a single degree of freedom. One either has the attribute of 35+Male+Lower_Income+3children+married+urban, or not. That accumulation of attributes into a single one has the effect of reducing the size of the single cell where the 1 and the 1 line up, that true-positive sweet spot. One pays the cost of uncertainty by either inflating the degrees of freedom or reducing the size of the segment.
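A rough sketch of how fast that combined segment shrinks. The shares below are hypothetical placeholders that I made up for illustration, not measurements, and the calculation assumes the attributes are independent of one another, which real demographics are not.

```python
import math

# Hypothetical share of the population holding each attribute (illustration only).
shares = {
    "35+": 0.50,
    "male": 0.50,
    "lower_income": 0.30,
    "3_children": 0.10,
    "married": 0.50,
    "urban": 0.60,
}

# Under independence, the share holding every attribute at once is the product.
combined_share = math.prod(shares.values())

# Expected number of sampled people who fall into the combined segment.
for sample_size in (1000, 3000, 10000):
    print(sample_size, round(combined_share * sample_size, 1))
```

Under these made-up shares, well under 1% of the population carries the combined attribute, so even a sample of 10,000 puts only a few dozen people in the segment.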

The Problem

The generalized concern, the problem, is that the sensitivity of the statistical test for independence to the size of the segment is much greater than is generally understood. In the simulations in this post, we demonstrated that even with a sample size of 1000, just six of the twenty agents were able to pick up (at p < 0.05) on a segment that was clearly there (5% of the population), with an exploitable feature (55:45 for purchase).

In fact, the numbers of agents that were able to detect the segment on successive runs were [4, 4, 4, 6, 3].

This took place in a simulation, in a world that I created, with a direct and, I assert, causal link between A and the generation of B. The sample was pulled, at random, from the Universe. And we knew the Ground Truth of the Universe with absolute certainty (because we created it). There are none of the usual problems with the real Universe. No census whose credibility we can quibble about. No transcription error. No pesky human with a fat finger. No liars. And nowhere did the God get shifty and change Confidence Levels randomly. The agents were all super sharp and good professionals.

The segment was there, and most agents missed it.

If the sample size is increased to 3000, a common size, we get a boost in detection: [12, 13, 12, 10, 15] agents managed to pick it up (p < 0.05). Again, that range, from 1000 to 3000, is where most commercial sampling studies sit.
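Here is a minimal sketch of that experiment in standard-library Python 3. It is not the code that produced the counts above. It assumes twenty agents, each drawing a random sample without replacement from a Universe of 100,000, and a Yates-corrected chi square test at p < 0.05. The function names and the seed are hypothetical choices of mine.

```python
import math
import random

def p_value_2x2(table):
    """Yates-corrected chi-square p-value for a 2x2 table (one degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        return 1.0  # a degenerate sample carries no evidence either way
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(table[i][j] - expected)
            diff -= min(0.5, diff)  # continuity correction
            chi2 += diff ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

def make_universe(n, rng, p_a=0.05):
    """Each human is an (A, B) pair; 55% do B with Attribute A, 45% without it."""
    humans = []
    for _ in range(n):
        a = 1 if rng.random() < p_a else 0
        p_b = 0.55 if a == 1 else 0.45
        b = 1 if rng.random() < p_b else 0
        humans.append((a, b))
    return humans

def count_detections(universe, sample_size, rng, n_agents=20, alpha=0.05):
    """Number of agents whose random sample rejects independence at the given alpha."""
    detections = 0
    for _ in range(n_agents):
        table = [[0, 0], [0, 0]]  # rows are B, columns are A
        for a, b in rng.sample(universe, sample_size):
            table[b][a] += 1
        if p_value_2x2(table) < alpha:
            detections += 1
    return detections

rng = random.Random(2024)  # hypothetical seed, only for repeatability
universe = make_universe(100_000, rng)
results = {size: count_detections(universe, size, rng) for size in (1000, 3000, 10000)}
print(results)
```

The exact counts move around from seed to seed, which is the point: at a sample of 1000, whether a given agent sees the segment is close to luck.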

Big Problem? Big Data?

So what, one might say, why sample when I have all the data?

We simulated several universes where there was absolute independence between A and B, and we found, consistently, that one agent in twenty was finding evidence of dependence. We saw this often.
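That one-in-twenty figure is what a 0.05 threshold implies when nothing is there. A back-of-the-envelope sketch, assuming the agents' tests behave independently and that p-values are uniform under the null (both approximations, since the agents sample the same Universe):

```python
alpha = 0.05      # significance threshold used by every agent
n_agents = 20     # number of agents in each run

# Average number of agents fooled per run when A and B are truly independent.
expected_false_alarms = alpha * n_agents

# Chance that at least one of the agents is fooled in a given run.
p_at_least_one = 1 - (1 - alpha) ** n_agents

print(expected_false_alarms)
print(round(p_at_least_one, 3))
```

Under those assumptions, about one agent per run is fooled on average, and in roughly 64% of runs at least one agent reports a segment that does not exist.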

At industry conferences, I have encountered practitioners who will argue that the correlation must exist if it was discovered, even in the case of correlating an astrological sign with the medical condition of breaking a leg. They insist that it is a problem with the sampling, or that more sample size always solves the problem. However, even with this sterile, simulated data, sampling 10,000 records out of a ground truth of 100,000, one agent was still tricked by randomness into stating there was an exploitable segment when there very clearly wasn’t one.

You now have more intuition about the structure of the Universe than they do.

Conclusion

Six years ago I wrote WTF repeatedly in the margins during a presentation. What I learned there worried me, and it wasn’t until I started running more simulations on my own that I truly appreciated what was happening deep down there.

The best practical advice that I can offer data scientists in industry is to look for big, bold, segments. Look for big chi squares and very tiny p-values. Be aware that a promising segment might not actually exist even if the p-value is under 0.05, or even under 0.01. You may be an unlucky agent. And you wouldn’t know any different.