p-Hacking and False Discovery in A/B Testing

gwern | 299 points

I would be shocked if it were as low as 57%. As an intern, I found that the analysts in charge of A/B tests often didn't have a background in science or running experiments, and didn't really care. There were a couple of data analytics teams in the company, and I think a lot of the developers didn't like my team because we were seen as "fussier" than the other one. We required people to preregister hypotheses, and run experiments for predetermined amounts of time.

I don't think the tech environment is very conducive to running experiments. Everything moves too fast: by the time you figure out that the results someone gave you are BS, they've already been promoted three times and work as a director at a different company.

I work in science now, and although people still p-hack like hell, there's at least some sort of shame about it. There's a long-term cost too: I've met a couple of researchers who have spent years trying to replicate a finding they got early in their career through suspicious means.

boron1006 | 6 years ago

Hi, I am one of the authors. We found that people p-hack with traditional t-tests. Most A/B tests were run this way in the past, and some still are. The paper uses Optimizely data from 2014, before Optimizely introduced new testing in 2015 designed to address the issues we found in the paper.

If you want to know how Optimizely prevents p-hacking, check out the math behind Optimizely's current testing here: https://www.optimizely.com/resources/stats-engine-whitepaper...
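
For anyone who hasn't run one, here is a rough sketch of the kind of traditional fixed-horizon t-test I mean, on simulated conversion data (the numbers are invented, not from the paper): you pick a sample size, wait for it, and test exactly once.

    # Traditional fixed-horizon test: choose the sample size up front,
    # collect all the data, then run a single two-sample t-test on the
    # 0/1 conversion outcomes.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    control = rng.binomial(1, 0.10, size=5000)   # ~10% baseline conversion
    variant = rng.binomial(1, 0.11, size=5000)   # hypothetical ~11% conversion

    t_stat, p_value = stats.ttest_ind(variant, control)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

The p-hacking we document comes from running a test like this repeatedly as data arrives and stopping on the first significant result, not from the test itself.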

aisscott | 6 years ago

Once at a programming conference, I was talking with a very senior developer at a well known company. He was going on and on about their A/B testing efforts.

I asked how they decided how long they would run an experiment for. The answer was "until we get a significant result."

I was shocked then, but now I am used to getting these kinds of responses from developers... that, and a belief that false positives are not a thing.
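
If anyone wants to see why "until we get a significant result" is so dangerous, here is a rough simulation sketch (all the traffic numbers are made up): an A/A test where both arms are identical, peeked at after every batch. Any "win" is a false positive, and the rate ends up well above the nominal 5%.

    # A/A test: both arms have the same true conversion rate, so any
    # "significant" result is a false positive. Peek after every batch
    # and stop at the first p < 0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_experiments, n_peeks, batch, rate = 2000, 20, 500, 0.10

    false_positives = 0
    for _ in range(n_experiments):
        a = rng.binomial(1, rate, size=n_peeks * batch)
        b = rng.binomial(1, rate, size=n_peeks * batch)
        for k in range(1, n_peeks + 1):
            _, p = stats.ttest_ind(a[:k * batch], b[:k * batch])
            if p < 0.05:
                false_positives += 1
                break

    print(f"false positive rate with peeking: {false_positives / n_experiments:.1%}")
    # Usually lands somewhere in the 20-30% range here, not 5%.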

nanis | 6 years ago

Hi, founder of VWO here. We revamped our testing engine to a Bayesian one in 2015 to prevent the 'peeking problem' with frequentist approaches. You can read about our approach here: https://vwo.com/blog/smartstats-testing-for-truth/
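
(The snippet below is not SmartStats itself, just the textbook Beta-Binomial version of the idea, with made-up numbers: you report the probability that the variant beats control instead of a p-value.)

    # Toy Beta-Binomial sketch: estimate P(variant beats control) and the
    # expected lift from posterior samples instead of computing a p-value.
    import numpy as np

    rng = np.random.default_rng(2)

    # Made-up observed data: conversions and visitors per arm.
    conv_a, n_a = 120, 1000
    conv_b, n_b = 145, 1000

    # Beta(1, 1) prior -> Beta(1 + conversions, 1 + non-conversions) posterior.
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

    print(f"P(B > A) = {np.mean(post_b > post_a):.3f}")
    print(f"expected lift = {np.mean(post_b - post_a):.4f}")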

paraschopra | 6 years ago

Traditional A/B testing has very poor ergonomics. Experimenters are usually put in awkward conflict-of-interest situations that create multiple strong incentives not to perform rigorous, disciplined, valid experiments.

Null hypothesis significance testing is fundamentally misaligned with business needs and is not a good tool for businesses. This is true in many fields of science as well, but at least they have some mechanisms that try to ensure that experiments are unbiased. Businesses often don't have the same internal and external incentives that lead to those mechanisms, and so NHST is abused even more.

cle | 6 years ago

People treat A/B tests as if they were like testing a revolutionary new drug or making a big discovery. They aren't.

Assuming 'B' is the new option, there are three possibilities: A is better than B, A is equivalent to B, or A is worse than B.

If your p-hacked experiment tells you to change from A to B when the null hypothesis was actually correct, you aren't much worse off than you were in the first place. And if your long-term metrics are in place, you can still get a better measure of the change afterwards.

Not to mention experimental failures caused by unaccounted-for variables.

raverbashing | 6 years ago

Hi all. Jimmy, statistician from Optimizely chiming in.

We were excited to collaborate with the authors on this study. Keep in mind that the data used in this analysis is from 2014, before we introduced sequential testing and FDR correction specifically to address this p-hacking issue. I expect these results are in line with any platform using fixed-horizon frequentist methods.

Check out this paper for more details: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...
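
For the FDR part, here is a rough sketch of the standard Benjamini-Hochberg procedure on made-up p-values. It is not our production code, but it shows the flavor of the correction: instead of holding every metric and variation to a 5% per-test error rate, you control the expected share of false discoveries among everything you call significant.

    # Standard Benjamini-Hochberg procedure: given p-values from many
    # metrics/variations, control the false discovery rate at level q.
    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        p = np.asarray(p_values)
        m = len(p)
        order = np.argsort(p)
        thresholds = q * np.arange(1, m + 1) / m
        below = p[order] <= thresholds
        rejected = np.zeros(m, dtype=bool)
        if below.any():
            cutoff = np.max(np.where(below)[0])   # largest k with p_(k) <= q*k/m
            rejected[order[:cutoff + 1]] = True
        return rejected

    p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
    print(benjamini_hochberg(p_vals))   # which tests survive at FDR q = 0.05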

yichijin | 6 years ago

I created an open-source A/B test framework[0], which also uses Bayesian analysis on the dashboard. IANAS(tatistician), but from what I understand it's still better to plan the checkpoint in advance rather than stop as soon as you reach significance (rough sketch of what that planning looks like after the links below).

A couple of articles worth reading [1][2]. I can't exactly vouch for their validity, but they make some arguments that seem well thought out.

[0] https://github.com/Alephbet/gimel

[1] http://varianceexplained.org/r/bayesian-ab-testing/

[2] http://blog.analytics-toolkit.com/2017/the-bane-of-ab-testin...
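
To illustrate what "planning the checkpoint in advance" means in practice, here is a standard two-proportion sample-size calculation with made-up numbers (this is not what gimel does internally, just the usual power formula):

    # How many visitors per arm you need before looking, for a given
    # baseline conversion rate and minimum detectable relative lift.
    from scipy.stats import norm

    def sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
        p_var = p_base * (1 + rel_lift)
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        variance = p_base * (1 - p_base) + p_var * (1 - p_var)
        return int((z_alpha + z_beta) ** 2 * variance / (p_base - p_var) ** 2) + 1

    # e.g. 10% baseline conversion, hoping to detect a 5% relative lift:
    print(sample_size_per_arm(0.10, 0.05))   # roughly 58,000 visitors per arm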

gingerlime | 6 years ago

Isn't randomly looking for a pattern and then slapping a hypothesis on it post facto a form of "p-hacking"? Because that's a completely commonplace and unremarkable practice in technology.

emodendroket | 6 years ago

My very recent meta-analysis of 115 A/B tests reveals that a large proportion are highly suspect for p-hacking: http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-t...

Going the Bayesian way, as suggested in some comments, is no solution at all, as I am not aware of an accepted Bayesian approach to dealing with the issue:

http://blog.analytics-toolkit.com/2017/bayesian-ab-testing-n...

(Feel free to run simulations if you do not trust the logic; there is a rough sketch below ;-).) And on a more general level:

http://blog.analytics-toolkit.com/2017/5-reasons-bayesian-ab...
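
Here is the kind of simulation I mean, with invented traffic numbers: an A/A test where both arms are identical and a naive "stop when P(B > A) leaves the 5-95% band" rule is checked at every peek. Every declared winner is a false discovery, and it happens far more often than the 95% threshold seems to promise.

    # A/A test under a naive Bayesian stopping rule: declare a result as
    # soon as P(B > A) crosses 95% (or drops below 5%) at any peek.
    import numpy as np

    rng = np.random.default_rng(3)
    n_experiments, n_peeks, batch, rate = 1000, 20, 500, 0.10

    false_calls = 0
    for _ in range(n_experiments):
        a = rng.binomial(1, rate, size=n_peeks * batch)
        b = rng.binomial(1, rate, size=n_peeks * batch)
        for k in range(1, n_peeks + 1):
            n = k * batch
            ca, cb = a[:n].sum(), b[:n].sum()
            post_a = rng.beta(1 + ca, 1 + n - ca, size=2000)
            post_b = rng.beta(1 + cb, 1 + n - cb, size=2000)
            prob_b = (post_b > post_a).mean()
            if prob_b > 0.95 or prob_b < 0.05:
                false_calls += 1
                break

    print(f"A/A tests that declared a winner: {false_calls / n_experiments:.1%}")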

geoprofi | 6 years ago

Seconding this. The number of e-commerce companies I can remember that "A/B tested" their way into bankruptcy approaches 20.

My take on this: even in cases where such testing was done by disciplined statisticians (which it is not in at least 9 cases out of 10; a math or CS PhD is not a professional statistician by any stretch), the value of the advice drawn from that data is marginal at best.

As e-commerce is the bread and butter of the cheap-electronics industry, I saw time and again that "science-driven" outfits lose out to others. Not so much because their quality of decision making was demonstrably inferior, but because their obsession with "statistical tasseography" drained their resources and shifted their focus away from things of obvious importance.

baybal2 | 6 years ago

After listening to Optimizely reps give a talk about their success with a client (who was present), I suspect the support reps encourage these false positives. They presented a few tests as fantastic wins when they all had basic flaws (like comparing a cold audience against one self-selected for interest). Maybe that was just one bad apple (doubtful)... but it was a large client and someone they felt should represent the company as a speaker.

Concerns from the audience were dismissed and deferred to follow-up after the talk. I never thought the same of Optimizely after that.

dahdum | 6 years ago

How can A/B testing tools be improved to prevent p-value hacking?

Could it be as simple as declaring your test duration before starting the experiment, and having the tool add an asterisk to your results if you stop the experiment early?
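
Something like this toy sketch, for example (all the names and dates are invented), would at least make early peeks visible:

    # Record the planned end date up front and flag any readout taken
    # before it as preliminary.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Experiment:
        name: str
        started: date
        planned_end: date

        def readout_label(self, today: date) -> str:
            if today < self.planned_end:
                return f"{self.name}: PRELIMINARY* (planned end {self.planned_end})"
            return f"{self.name}: final"

    exp = Experiment("new-checkout-button", date(2018, 3, 1), date(2018, 3, 29))
    print(exp.readout_label(date(2018, 3, 15)))   # flagged with an asterisk
    print(exp.readout_label(date(2018, 3, 29)))   # treated as final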

babl-yc | 6 years ago

I had the enjoyable experience of sitting at a tech conference and listening to the others in my group tell one of my friends that he had no idea what he was talking about when he said they weren't designing a proper experiment.

I was the only one there who knew he was a particle physicist.

The OP is horrifyingly right.

User23 | 6 years ago

Optimizely being aimed at large enterprises, I'm curious how people do A/B tests at their respective startups. Do most roll their own? How do you make sure your science is sound?

raphaelrk | 6 years ago

ELI5... what’s a p-hack?

nvahalik | 6 years ago