Sequential testing is applied to reduce costs in SP-based tests (OSCEs). Initially, all candidates take a screening test consisting of a part of the OSCE. Candidates who fail the screen sit the complete test, whereas those who pass the screen are qualified as a pass of the complete test. The procedure may result in a reduction of testing resources, but at the cost of false positives (candidates who pass the screen but would fail the complete test). In a diagnostic test an optimum cutpoint is obtained by minimizing the weighted sum of false negatives and false positives using Receiver Operator Characteristic (ROC) analysis. However, in the sequential procedure there are no false negatives, because the result of the complete test is considered the final outcome. Several authors ignored this difference and applied classic ROC analysis to a sequential test. Earlier, we doubted the validity of the procedure and proposed to set the cutpoint by minimizing the loss, defined as the weighted sum of the screen negatives and the false positives. Recently, Regehr and Colliver (2003) argued that under certain theoretically derived conditions the use of the loss formula in sequential testing is functionally identical to using classic ROC analysis. The current study shows (a) the conclusion of Regehr and Colliver is based on assumptions that can be challenged, and (b) ROC analysis indeed can be used in sequential testing, but only if the procedure is modified according to the results of a corresponding loss function analysis.

It's amazing how much confusion there is on the internet about how to run a simple A/B test. The crux is this: if you're going to look at the results of your experiment as it runs, you cannot just repeatedly apply a 5% significance level t-test. Doing so WILL lead to false-positive rates way above 5%. Most discussions of A/B testing do recognize this problem; however, the solutions they suggest are simply wrong. I discuss a few of these misguided approaches below. It turns out that easy to use, practical solutions were worked out by clinical statisticians decades ago, in papers with many thousands of citations. They call the setting "Group Sequential designs". At the bottom of this post I give tables and references so you can use group sequential designs.

What's wrong with just testing as you go?

The key problem is this: whenever you look at the data (whether you run a formal statistical test or just eyeball it) you are making a decision about the effectiveness of the intervention. If the results look positive, or particularly if it looks like the intervention is giving very bad results, you may decide to stop the experiment.

Say you make a p=0.05 statistical test at each step. On the very first step you have a 5% chance of a false positive (i.e. the intervention had no effect but it looks like it did). On each subsequent test, you have a non-zero additional probability of a false positive. It's obviously not an additional 5% each time, as the test statistics are highly correlated, but it turns out that the false positives accumulate very rapidly. Looking at the results each day for a 10 day experiment with, say, 1000 data points per day will give you an accumulated false positive rate of about 17% (according to a simple numerical simulation) if you stop the experiment on the first p<0.05 result you see.

Figure 2: Paths terminated when they cross the boundary on the next step. Giving a very high 12% false positive rate!

The key idea of "Group Sequential" testing is that, under the usual assumption that our test statistic is normally distributed, it is possible to compute the false-positive probabilities at each stage exactly. Since we can compute these probabilities, we can also adjust the test's false-positive chance at every step so that the total false-positive rate is below the threshold. Say we want to run the test daily for a 30 day experiment. Using dynamic programming, we can then determine a sequence of thresholds, one per day, to use with the test.
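The accumulation of false positives from daily peeking, and the idea of adjusting the per-look threshold, can be checked with a small simulation. The sketch below is only an illustration under stated assumptions, not the method used for the numbers in this post: it assumes a one-sided z-test with known variance and equal-sized daily batches, so that each day adds one standard normal increment under the null hypothesis. It also swaps the exact dynamic-programming computation for a plain Monte Carlo calibration of a single constant threshold shared by all looks. The function names `simulate_max_z`, `false_positive_rate` and `constant_threshold` are made up for this example.

```python
import math
import random
from statistics import NormalDist


def simulate_max_z(n_looks, n_sims, seed=0):
    """For each simulated no-effect experiment, return the largest z seen over all looks."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        total = 0.0
        best = -math.inf
        for k in range(1, n_looks + 1):
            # Assumption: each day's standardized contribution is N(0, 1),
            # so the cumulative z-statistic after k days is total / sqrt(k).
            total += rng.gauss(0.0, 1.0)
            best = max(best, total / math.sqrt(k))
        maxima.append(best)
    return maxima


def false_positive_rate(maxima, z_threshold):
    """Fraction of no-effect experiments that cross the threshold at some look."""
    return sum(m > z_threshold for m in maxima) / len(maxima)


def constant_threshold(maxima, alpha):
    """Smallest simulated cutoff whose overall false-positive rate is at most alpha."""
    ordered = sorted(maxima)
    index = math.ceil((1.0 - alpha) * len(ordered)) - 1
    return ordered[index]


# Naive peeking: apply the one-sided 5% cutoff (about 1.645) at each of 10 daily looks.
naive_z = NormalDist().inv_cdf(0.95)
ten_day = simulate_max_z(n_looks=10, n_sims=20000, seed=1)
print("10 daily peeks, naive cutoff:", false_positive_rate(ten_day, naive_z))

# Calibrated peeking: one stricter cutoff reused at each of 30 daily looks.
thirty_day = simulate_max_z(n_looks=30, n_sims=20000, seed=2)
cutoff = constant_threshold(thirty_day, alpha=0.05)
print("30 daily peeks, calibrated cutoff:", round(cutoff, 3))
print("overall rate at that cutoff:", false_positive_rate(thirty_day, cutoff))
```

The calibrated cutoff is fitted and checked on the same simulated paths, so on a fresh set of simulations the overall rate will be close to, but not exactly, 5%. A constant cutoff is only one possible shape for the boundary; a sequence of different thresholds per day, as described above, needs the exact computation instead.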