Can I use a Mann-Whitney U Test with a very small sample?

I want to compare differences between two independent groups (female and male) when the dependent variables are continuous. However, my sample size is very small (N = 6). I have run a Mann-Whitney U test, but I am not sure whether the results are meaningful given the small sample.

asked Dec 24, 2021 at 11:44

3 Answers


This has been discussed at length on this site. Briefly, the test is valid. But no test is especially helpful here, because of our inability to interpret large p-values, which do not indicate "no difference". Instead, I would replace the test with a confidence interval or a Bayesian credible interval. These have interpretations regardless of sample size and regardless of whether a null hypothesis is true.

answered Dec 24, 2021 at 13:06 by Frank Harrell

Many thanks, Frank. With a frequentist approach, which confidence interval would you use to replace the test?

Commented Dec 24, 2021 at 13:17

The confidence interval that is consistent with the Wilcoxon test is the Hodges-Lehmann interval, whose point estimate is the median of all possible pairwise differences between the two groups. For the mean, I would use the Bayesian t-test, which allows for non-normality and unequal variances; see the inference chapter in BBR.

Commented Dec 24, 2021 at 13:23
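As a concrete illustration of the Hodges-Lehmann interval mentioned in the comment above, here is a minimal R sketch using `wilcox.test()` with `conf.int = TRUE`; the data vectors are invented purely for illustration (two groups of six, matching the sample sizes discussed below), and this code is not part of either answer.

```r
# Hodges-Lehmann estimate and confidence interval from wilcox.test().
# The data below are made up solely to illustrate the call (n1 = n2 = 6).
female <- c(4.1, 5.3, 6.0, 4.7, 5.9, 5.2)
male   <- c(3.2, 4.8, 5.1, 3.9, 4.4, 4.0)

res <- wilcox.test(female, male, conf.int = TRUE)
res$estimate   # Hodges-Lehmann estimate: median of all pairwise differences
res$conf.int   # the corresponding confidence interval
```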

Frank's advice is useful; I don't wish my answer to suggest any disagreement with it.

The Wilcoxon-Mann-Whitney test "works as it should" in small samples. There are a few things to note:

  1. You will have the usual issues with trying to use hypothesis tests in small samples, such as low power against anything but large effects (and all the consequences that come with that). That's not specific to the test you're using; other tests will have the same issues, for the same reason: the sample sizes are small. (A small simulation illustrating the power issue is sketched at the end of this answer.)
  2. A permutation test (such as the Wilcoxon-Mann-Whitney) will not typically attain exactly the desired type I error rate, though this will only be particularly noticeable in quite small samples. For example, if your rejection rule is "reject if the computer tells me the p-value is $\leq 0.05$", then because of the discreteness of the test statistic you won't actually attain the 5% significance level you were aiming for. With the Wilcoxon-Mann-Whitney at $n_1=n_2=6$ and a two-sided alternative, that rejection rule actually leads to a 4.1% type I error rate (if the computer is using exact p-value calculations, at least), so the test conducted that way is somewhat conservative. [There's a way to mitigate this issue at least somewhat without going to randomized tests, though it may be best to avoid distracting from the main question here by detailing it.]

     Casting the rejection rule in terms of the p-value means you only get the attainable significance level just below the one you were seeking. Imagine you wanted a 1% test; the type I error rate you actually get drops down from there, even if there was an attainable level only just above 1% that you might have been happy to use had you realized it was there. For example, if the available significance levels were 0.4% and 1.01%, you'll get the 0.4% level with the approach of comparing the p-value to the significance level you were originally aiming for.

     This problem becomes more noticeable when adjusting individual significance levels for multiple testing to control the overall type I error rate, and considerably worse at even smaller sample sizes than yours. It's quite possible this way to get into a situation where your actual significance level is not just "lower than you wanted", but exactly $0$. This is not merely a theoretical possibility that would not occur in practice; I have seen multiple occasions where people have done just this without even realizing it.

     I think it's better (before seeing the data, to avoid any appearance of p-hacking) to consider the available significance levels and choose the testing strategy in full knowledge of the situation one will face, rather than being surprised by it after seeing the data, or worse, never realizing that it was not possible to reject the null with the rejection rule being used. It's not difficult to identify all the available significance levels for whatever test you're using (e.g. for the specific situation you're asking about, it can be done with a single line of R code; a sketch is given just after this list), and I can't see any good reason not to do so as a matter of course whenever you are faced with small sample sizes and a discrete test statistic.
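Here is one way (spread over a few lines for readability, and not necessarily the one-liner alluded to above) to list the attainable two-sided significance levels of the exact Wilcoxon-Mann-Whitney test at $n_1=n_2=6$, using the null distribution of the U statistic from `pwilcox()`:

```r
# Attainable two-sided significance levels of the exact Wilcoxon-Mann-Whitney
# test with n1 = n2 = 6, from the null distribution of the U statistic.
m <- 6; n <- 6
u <- 0:(m * n)                                        # possible values of U
p.lower <- pwilcox(u, m, n)                           # P(U <= u) under H0
p.upper <- pwilcox(u - 1, m, n, lower.tail = FALSE)   # P(U >= u) under H0
p.two   <- pmin(1, 2 * pmin(p.lower, p.upper))        # exact two-sided p-values
sort(unique(p.two))                                   # attainable significance levels
max(p.two[p.two <= 0.05])  # largest level not exceeding 0.05: about 0.041 (the 4.1% above)
```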
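To put a rough number on the low-power point in item 1, here is a small simulation sketch; the normal model and the one-standard-deviation shift are assumptions chosen only for illustration, not something taken from the question.

```r
# Rough illustration of point 1: simulated power of the two-sided
# Wilcoxon-Mann-Whitney test at n1 = n2 = 6 under an assumed normal
# shift of one standard deviation (scenario invented for illustration).
set.seed(1)
nsim  <- 10000
shift <- 1   # assumed difference in means, in SD units
pvals <- replicate(nsim, {
  x <- rnorm(6)
  y <- rnorm(6, mean = shift)
  wilcox.test(x, y)$p.value
})
mean(pvals <= 0.05)   # estimated power; modest unless the shift is large
```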