# Hypothesis testing {#sec-testing}
The goal of this chapter is to demonstrate some of the fundamentals of
hypothesis testing as used in bioinformatics. For prerequisites within
the Biomedical sciences masters degree at the UCLouvain, see
*WFARM1247* (Traitement statistique des données).
Parts of this chapter are based on chapter 6 from *Modern Statistics
for Modern Biology* [@MSMB].
## Refresher
```{r toss, echo = FALSE}
set.seed(2)
n <- 100
p <- 0.59
flips <- sample(c("H", "T"), size = n,
replace = TRUE,
prob = c(p, 1 - p))
```
We have flipped a coin 100 times and obtained the following results.
```{r tossres, echo = FALSE}
table(flips)
```
If the coin were unbiased, we would expect roughly 50 heads. Can we make any
claims regarding the biased or unbiased nature of that coin?
A coin toss can be modeled by a binomial distribution. The histogram
below shows the binomial probability for 0 to 100 heads; it represents
the binomial density of an unbiased coin. The vertical line shows the
number of heads observed.
```{r tosshist, echo = FALSE, fig.cap = "Binomial density of an unbiased coin to get 0 to 100 heads. The full area of the histogram sums to 1."}
num_heads <- sum(flips == "H")
binomial_dens <-
tibble(k = 0:n) %>%
mutate(p = dbinom(k, size = n, prob = 0.5))
ggplot(binomial_dens, aes(x = k, y = p)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = num_heads)
```
Above, we see that the most likely outcome would be 50 heads, with a
probability of `r max(binomial_dens$p)`. Zero heads and 100 heads each
have a probability of $`r binomial_dens$p[binomial_dens$k == 0]`$.
We set
- $H_0$: the coin is fair
- $H_1$: the coin is biased
If `r num_heads` isn't deemed too *extreme*, then we won't reject
$H_0$ and conclude that the coin is fair. If `r num_heads` is deemed
too extreme, then we reject $H_0$, accept $H_1$, and conclude that
the coin is biased.
To define *extreme*, we set $\alpha = 0.05$ and reject $H_0$ if our result is
outside of the 95% most probable values.
```{r tossstat, echo = FALSE}
alpha <- 0.05
binomial_dens <-
arrange(binomial_dens, p) %>%
mutate(reject = (cumsum(p) <= alpha))
```
```{r tosshist2, echo = FALSE, fig.cap = "Binomial density for an unbiased coin to get 0 to 100 heads. The areas in red sum to 0.05."}
ggplot(binomial_dens) +
geom_bar(aes(x = k, y = p, fill = reject), stat = "identity") +
scale_fill_manual(
values = c(`TRUE` = "red", `FALSE` = "#5a5a5a")) +
geom_vline(xintercept = num_heads, col = "blue") +
theme(legend.position = "none")
```
We can also compute the p-value, which tells us how likely we are to see
such an extreme or a more extreme value under $H_0$.
```{r binom_test}
binom.test(x = num_heads, n = n, p = 0.5)
```
Whenever we make such a decision, we will be in one of the following situations:
| | $H_0$ is true | $H_1$ is true |
|---------------------|-------------------------|--------------------------|
| Reject $H_0$ | Type I (false positive) | True positive |
| Do not reject $H_0$ | True negative | Type II (false negative) |
See below for a step by step guide to this example.
## A biological example
A typical biological example consists in measuring a gene of interest
in two populations of interest (say control and group), represented by
biological replicates. The figure below represents the distribution of
the gene of interest in the two populations, with the expression
intensities of triplicates in each population.
```{r exdists, echo = FALSE, fig.cap = "Expression of a gene in two populations with randomly chosen triplicates."}
set.seed(123)
s1 <- rnorm(3, 6, 1.2)
s2 <- rnorm(3, 11, 1.5)
tibble(groups = rep(c("control", "group"), each = 100),
expression = c(rnorm(100, 6, 1.2), rnorm(100, 11, 1.5))) %>%
ggplot(aes(x = expression, fill = groups, colour = groups)) +
geom_density(alpha = 0.5) +
geom_vline(xintercept = s1, colour = "#F8766D") +
geom_vline(xintercept = s2, colour = "#00BFC4")
```
We don't have access to the whole population and thus use a sample
thereof (the replicated measurements) to estimate the population
parameters. We have
```{r extab, echo = FALSE}
df <- data.frame(control = s1, group = s2)
df <- rbind(df,
c(mean(s1), mean(s2)),
c(sd(s1), sd(s2)))
rownames(df) <- c(paste("rep", 1:3, sep = "."),
"mean", "sd")
knitr::kable(df)
```
We set our hypothesis as
- $H_0$: the means of the two groups are the same, $\mu_{1} = \mu_{2}$.
- $H_1$: the means of the two groups are different, $\mu_{1} \neq \mu_{2}$.
and calculate a two-sided, two-sample t-test (assuming unequal
variances) with
$$
t = \frac{ \bar X_{1} - \bar X_{2} }
{ \sqrt{ \frac{ s_{1}^{2} }{ N_{1} } + \frac{ s_{2}^{2} }{ N_{2} } } }
$$
where $\bar X_i$, $s_{i}^{2}$ and $N_i$ are the mean, variance and
size of samples 1 and 2.
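To make the formula above concrete, here is a small sketch that
computes the Welch t statistic by hand; `welch_t` is a helper we
define here, not a base R function, and its result can be compared to
the `statistic` element returned by `t.test` below.

```{r welchsketch}
## Welch t statistic, following the formula above; compare with
## t.test(s1, s2)$statistic.
welch_t <- function(x1, x2)
    (mean(x1) - mean(x2)) /
        sqrt(var(x1) / length(x1) + var(x2) / length(x2))
welch_t(s1, s2)
```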
A t-test has the following assumptions:
- the data are normally distributed;
- the data are independent and identically distributed;
- equal or unequal (Welch test) variances.
Note that the t-test is relatively robust to deviations from these
assumptions[^models].
[^models]: All models are wrong. Some are useful.
In R, we do this with
```{r}
s1
s2
t.test(s1, s2)
```
`r msmbstyle::question_begin()`
Note that this result, this p-value, is specific to our sample, the
measured triplicates, that are assumed to be representative of the
population. Can you imagine other sets of triplicates that are
compatible with the population distributions above, but that could
lead to non-significant results?
`r msmbstyle::question_end()`
In practice, we would apply moderated versions of these tests, such as
those provided in the `r Biocpkg("limma")` package, which are also
widely applied to RNA-Seq count data.
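As an illustration, here is a minimal sketch (not evaluated) of a
typical limma workflow on a genes-by-samples matrix of
log-intensities; `exprs_mat` is a hypothetical placeholder, not an
object defined in this chapter.

```{r limmasketch, eval = FALSE}
library("limma")
## exprs_mat: hypothetical matrix of log-intensities, genes x 6 samples
grp <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ grp)
fit <- eBayes(lmFit(exprs_mat, design)) ## moderated t-statistics
topTable(fit, coef = "grpB")            ## top features, group B vs A
```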
## A more realistic biological example
Let's now use the `tdata1` dataset from the `rWSBIM1322` package, which
provides gene expression data for 100 genes and 6 samples, three in
group A and three in group B.
```{r}
library("rWSBIM1322")
data(tdata1)
head(tdata1)
```
`r msmbstyle::question_begin()`
Visualise the distribution of the `tdata1` data and, if necessary,
log-transform it.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r tex1}
log_tdata1 <- log2(tdata1)
par(mfrow = c(1, 2))
limma::plotDensities(tdata1)
limma::plotDensities(log_tdata1)
```
`r msmbstyle::solution_end()`
We are now going to apply a t-test to feature (row) 73, comparing the
expression intensities in groups A and B. As we have seen, this can be
done with the `t.test` function:
```{r}
x <- log_tdata1[73, ]
t.test(x[1:3], x[4:6])
```
`r msmbstyle::question_begin()`
- Interpret the results of the test above.
- Repeat it with another feature.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
We would now like to repeat the same analysis on the 100 genes.
- Write a function that takes a vector as input, performs a t-test of
  the first 3 values (our group A) against the last 3 values (our
  group B), and returns the p-value.
- Apply the test to all the genes.
- How many significantly differentially expressed genes do you find?
  What features are of possible biological interest?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
my_t_test <- function(x) {
t.test(x[1:3], x[4:6])$p.value
}
pvals <- apply(log_tdata1, 1, my_t_test)
table(pvals < 0.05)
head(sort(pvals))
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
The data above have been generated with the `rnorm` function for all
samples.
- Do you still think any of the features show significant differences?
- Why are there still some features (around 5%) that show a
significant p-value at an alpha of 0.05?
`r msmbstyle::question_end()`
To answer these questions, let's refer to [this xkcd
cartoon](https://xkcd.com/882/) that depicts scientists testing
whether eating jelly beans causes acne.
```{r, results='markup', fig.cap="Do jelly beans cause acne? Scientists investigate. From [xkcd](https://xkcd.com/882/).", echo=FALSE, purl=FALSE, fig.align='center'}
knitr::include_graphics("./figs/jellybeans.png")
```
The data that were used to calculate the p-values were all drawn from the same distribution, $N(10, 2)$.
As a result, we should not expect to find a statistically significant
result, unless we repeat the test enough times. *Enough* here depends
on the $\alpha$ we set to control the type I error. If we set $\alpha$
to 0.05, we accept wrongly rejecting $H_{0}$ in the 5% most extreme
cases. This is an acceptable threshold that, however, doesn't hold
when we repeat the test many times.
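A quick simulation (ours, for illustration) shows this: testing pure
noise against pure noise many times yields roughly 5% of p-values
below 0.05.

```{r typeIsim}
## 1000 t-tests comparing two samples drawn from the same normal
## distribution: about 5% of p-values fall below 0.05 by chance alone.
pv_null <- replicate(1000, t.test(rnorm(3), rnorm(3))$p.value)
mean(pv_null < 0.05)
```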
An important visualisation when performing statistical tests repeatedly
is the distribution of the computed p-values. Below, we see
the histogram of the p-values for the `tdata1` data and of 100 values
drawn from a uniform distribution between 0 and 1. Both are very
similar: they are flat.
```{r pvalhist, fig.cap = "Distribution of p-values for the `tdata1` dataset (left) and 100 (p-)values uniformly distributed between 0 and 1 (right)."}
par(mfrow = c(1, 2))
hist(pvals)
hist(runif(100))
```
Below we see the expected trends of p-values for different
scenarios. This example comes from the [Variance
explained](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/) blog.
```{r pvalhists2, results='markup', fig.cap="Expected trends of p-values for different scenarios ([source](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/)).", echo=FALSE, purl=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics("./figs/plot_melted-1.png")
```
In an experiment with enough truly differentially expressed features,
one expects to observe a substantial increase of small p-values
(the anti-conservative scenario). In other words, we expect to see more
small p-values than at random, i.e. than when no statistically
significant features are present (the uniform scenario). All other
scenarios warrant further inspection, as they might point to issues
with the data or the tests.
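The anti-conservative shape can be reproduced with a small simulation
of ours, mixing null features with features showing a genuine
difference:

```{r pvalmix}
## 80 null features and 20 features with a real mean shift: the
## histogram shows a peak of small p-values over a flat background.
pv_mix <- c(replicate(80, t.test(rnorm(3), rnorm(3))$p.value),
            replicate(20, t.test(rnorm(3), rnorm(3, mean = 3))$p.value))
hist(pv_mix)
```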
When many tests are performed, the p-values need to be *adjusted* to
take this multiplicity into account.
## Adjustment for multiple testing
There are two classes of multiple testing adjustment methods:
- **Family-wise error rate** (FWER): the probability of one
  or more false positives. The **Bonferroni correction** for *m* tests
  multiplies each p-value by *m*. One then checks whether any results
  still remain below the significance threshold.
- **False discovery rate** (FDR): the expected fraction
  of false positives among all discoveries. It allows us to choose *n*
  results with a given FDR. Widely used examples are the
  Benjamini-Hochberg procedure and q-values.
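As a small illustration with made-up p-values, both corrections are
available through the `p.adjust` function:

```{r padjust}
## Bonferroni (FWER) vs Benjamini-Hochberg (FDR) adjustments on
## made-up p-values.
pv <- c(0.0001, 0.004, 0.01, 0.03, 0.2, 0.7)
p.adjust(pv, method = "bonferroni")
p.adjust(pv, method = "BH")
```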
The figure below illustrates the principle behind the FDR
adjustment. The procedure estimates the proportion of hypotheses that
are null and then adjusts the p-values accordingly.
```{r fdrhist, results='markup', fig.cap="Principle behind the false discovery rate p-value adjustment ([source](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/)).", echo=FALSE, purl=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics("./figs/fdr.png")
```
Back to our `tdata1` example, we need to take this into account and
adjust the p-values for multiple testing. Below we apply the
Benjamini-Hochberg FDR procedure using the `p.adjust` function and
confirm that none of the features are differentially expressed.
```{r}
adj_pvals <- p.adjust(pvals, method = "BH")
```
`r msmbstyle::question_begin()`
Are there any adjusted p-values that are still significant?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
any(adj_pvals < 0.05)
min(adj_pvals)
```
`r msmbstyle::solution_end()`
## Result visualisation
`r msmbstyle::question_begin()`
The `tdata4` dataset is already log2-transformed and can be processed
as it is. Practice what we have seen so far and identify differentially
expressed genes in group A vs group B.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r tdata4ex1}
library("rWSBIM1322")
data(tdata4)
my_t_test <- function(x)
t.test(x[1:3], x[4:6])$p.value
pvals <- apply(tdata4, 1, my_t_test)
hist(pvals)
table(pvals < 0.05)
adjp <- p.adjust(pvals, method = "BH")
head(sort(adjp))
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Calculate a log2 fold-change between groups A and B. To do so, you
need to subtract the mean expression of group B from the mean of group
A[^l2fc]. Repeat this for all genes. Visualise and interpret the distribution
of the log2 fold-changes.
[^l2fc]: We subtract the means because the data are already log-transformed and $\log\frac{a}{b} = \log(a) - \log(b)$.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r tdata4ex2}
my_lfc <- function(x)
mean(x[1:3]) - mean(x[4:6])
lfc <- apply(tdata4, 1, my_lfc)
hist(lfc)
```
`r msmbstyle::solution_end()`
We generally want to consider both the statistical significance (the
p-value) and the magnitude of the difference in expression (the
fold-change) to provide a biological interpretation of the data. For
this, we use a *volcano plot*, as shown below. The most interesting
features are those towards the top corners, given that they have a
small p-value (i.e. a large value for *-log10(p-value)*) and a large
(in absolute value) fold-change.
```{r volc, echo = FALSE, fig.cap = "A volcano plot."}
plot(lfc, -log10(adjp))
grid()
abline(h = -log10(0.05))
abline(v = c(-1, 1))
```
`r msmbstyle::question_begin()`
Reproduce the volcano plot above using the `adjp` and `lfc` variables
computed above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r volcex}
plot(lfc, -log10(adjp))
grid()
abline(h = -log10(0.05))
abline(v = c(-1, 1))
```
`r msmbstyle::solution_end()`
[Chapter 6 on
*testing*](https://www.huber.embl.de/msmb/Chap-Testing.html) of
*Modern Statistics for Modern Biology* [@MSMB], and in particular
section 6.5, provides additional details about the t-test.
The t-test comes in multiple flavors, all of which can be chosen
through parameters of the `t.test` function. What we did above is
called a two-sided two-sample unpaired test with unequal
variance. **Two-sided** refers to the fact that we were open to reject
the null hypothesis if the expression in the first group was either
larger or smaller than that in the second one.
**Two-sample** indicates that we compared the means of two groups to
each other; another option is to compare the mean of one group against
a given, fixed number.
**Unpaired** means that there was no direct 1:1 mapping between the
measurements in the two groups. If, on the other hand, the data had
been measured on the same patients before and after treatment, then a
paired test would be more appropriate, as it looks at the change of
expression within each patient, rather than their absolute expression.
**Equal variance** refers to the way the statistic is calculated. That
expression is most appropriate if the variances within each group are
about the same. If they are very different, an alternative form,
called the Welch t-test, exists.
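Assuming two numeric vectors `x` and `y` (placeholders, hence the
chunk is not evaluated), these flavours map onto `t.test` arguments as
follows:

```{r ttestflavours, eval = FALSE}
t.test(x, y)                          ## two-sample, two-sided, Welch (default)
t.test(x, y, var.equal = TRUE)        ## assume equal variances
t.test(x, mu = 0)                     ## one-sample, against a fixed mean
t.test(x, y, alternative = "greater") ## one-sided
t.test(x, y, paired = TRUE)           ## paired measurements
```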
An important assumption when performing a t-test is the **independence
assumption** among the observations in a sample. An additional exercise
below demonstrates the risk of dependence between samples.
Finally, it is important to highlight that all the methods and
concepts described above will only be relevant if the problem is
properly stated. Hence the importance of clearly laying out the
**experimental design** and stating the **biological question** of
interest before running the experiment. Indeed, quoting the
mathematician Richard Hamming:
> It is better to solve the right problem the wrong way than to solve
> the wrong problem the right way.
To this effect, a [type III
error](https://en.wikipedia.org/wiki/Type_III_error) has been defined
as the error that occurs when researchers provide the right answer to
the wrong question.
<!-- ## Linear regression -->
<!-- Suggested TODO -->
<!-- ## Empirical approximations -->
<!-- - Bootstrapping to generate null distributions -->
## A word of caution
Summary of statistical inference:
1. Set up a model of reality: a *null hypothesis* $H_0$ (no difference) and an *alternative hypothesis*
$H_1$ (there is a difference).
2. Do an experiment to collect data.
3. Perform the statistical inference with the collected data.
4. Make a decision: reject $H_0$ if the computed probability is deemed too small.
What not to do:
### p-hacking {-}
Instead of setting up one statistical hypothesis before data
collection and then testing it, collect data at large, then try lots
of different tests until one gives a statistically significant result
[@Head:2015].
### p-harking {-}
HARKing (Hypothesizing After the Results are Known) is the fraudulent
practice of presenting a post hoc hypothesis, i.e. one based on prior
results, as an a priori hypothesis [@Kerr:1998].
## Additional exercises
`r msmbstyle::question_begin()`
- Draw three values from each of two distributions, $N(6, 2)$ and $N(8, 2)$, perform a t-test and interpret the results.
- Repeat the above experiment multiple times. Do you get *similar* results? Why?
- Repeat with values from distributions $N(5, 1)$ and $N(8, 1)$.
`r msmbstyle::question_end()`
```{r, include = FALSE}
s1 <- rnorm(3, 6, 2)
s2 <- rnorm(3, 8, 2)
t.test(s1, s2)
s1 <- rnorm(3, 5, 1)
s2 <- rnorm(3, 8, 1)
t.test(s1, s2)
```
`r msmbstyle::question_begin()`
- Generate random data using `rnorm` for 100 genes and 6 samples and
  test for differential expression, comparing for each gene the 3
  first samples against the 3 last samples. Verify that you identify
  about 5 p-values smaller than 0.05. Visualise and interpret the
  histogram of these p-values.
- Adjust the 100 p-values for multiple testing and compare the initial
and adjusted p-values.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
We have used two-sided two-sample t-tests above. Also familiarise
yourselves with one-sided, one-sample and paired tests. In particular,
verify how to implement these with the `t.test` function.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
Simulate a dataset of log2 fold-changes measured in triplicate for
1000 genes.
- What function would you use to generate these data?
- What test would you use to test for differential expression? Apply
it to calculate 1000 p-values.
- Visualise and interpret the histogram of p-values.
- FDR-adjust the p-values.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
Load the `tdata2` dataset from the `rWSBIM1322` package. These data
represent the measurement of an inflammation biomarker in the blood of
15 patients, before and after treatment for an acute liver
inflammation. Visualise the data and run a test to verify whether the
treatment has had an effect or not.
`r msmbstyle::question_end()`
```{r, include = FALSE}
library("rWSBIM1322")
data(tdata2)
tdata2$patient <- 1:15
library("tidyverse")
pivot_longer(tdata2,
names_to = "time",
values_to = "biomarker",
c(before, after)) %>%
ggplot(aes(x = factor(time, levels = c("before", "after")),
y = biomarker)) +
geom_boxplot() +
geom_point(aes(colour = factor(patient)), size = 3) +
geom_segment(data = tdata2,
aes(x = 1, xend = 2,
y = before, yend = after,
colour = factor(patient)))
t.test(tdata2$before, tdata2$after, paired = TRUE)
```
`r msmbstyle::question_begin()`
Load the `tdata3` data from the `rWSBIM1322` package (version >=
0.1.5). 100 genes have been measured in two groups (A and B) in
triplicate.
- Perform a t-test and visualise the p-values on a histogram. Do you
  expect to see any genes significantly differentially expressed?
  Verify your expectation by adjusting the p-values for multiple
  comparisons. Are there any significant results after adjustment?
- Produce a volcano plot for these results and interpret what you see.
- If you haven't already done so, visualise the distributions of
  the genes for the 6 samples. What do you observe? Re-interpret the
  results obtained above in the light of these findings.
- Normalise your data using centring. Visualise the distributions
before and after transformation and make sure you understand the
differences.
- Repeat the first point above on the normalised data. Do you still
find any differentially expressed genes? Explain why.
`r msmbstyle::question_end()`
```{r, include = FALSE}
library("rWSBIM1322")
data(tdata3)
my_t_test <- function(x) t.test(x[1:3], x[4:6])$p.value
pvals1 <- apply(tdata3, 1, my_t_test)
hist(pvals1)
## Anti-conservative distribution, hence expecting significant result(s).
pvals1_adj <- p.adjust(pvals1, method = "BH")
head(sort(pvals1_adj))
my_lfc <- function(x) mean(x[1:3]) - mean(x[4:6])
lfc <- apply(tdata3, 1, my_lfc)
plot(lfc, -log10(pvals1_adj))
grid()
tdata3_cent <- scale(tdata3, scale = FALSE, center = TRUE)
par(mfrow = c(1, 2))
boxplot(tdata3)
boxplot(tdata3_cent)
## Systematic effect in group B. Cancelled out by centring.
pvals2 <- apply(tdata3_cent, 1, my_t_test)
hist(pvals2)
## Uniform distribution, expecting no significant results.
head(sort(p.adjust(pvals2, method = "BH")))
```
`r msmbstyle::question_begin()`
- Load the `cptac_se_prot` data and extract the first feature,
`P00918ups|CAH2_HUMAN_UPS`, a protein in this case. Perform a
t-test between the groups 6A and 6B.
- Duplicate the data of that first feature to obtain a vector of 12
values and repeat the t-test (now 6 against 6) above. Interpret the
new results.
`r msmbstyle::question_end()`
```{r, include = FALSE}
library("SummarizedExperiment")
data(cptac_se_prot)
x1 <- assay(cptac_se_prot)["P00918ups|CAH2_HUMAN_UPS", 1:3]
x2 <- assay(cptac_se_prot)["P00918ups|CAH2_HUMAN_UPS", 4:6]
t.test(x1, x2)
t.test(rep(x1, 2), rep(x2, 2))
## - The power of the t-test depends on the sample size. Even if the
## underlying biological differences are the same, a dataset with
## more observations tends to give more significant results. This can
## also be seen from the way the numbers N1 and N2 appear in the t-test
## formula.
## - The assumption of independence between the measurements is really
## important, and in the case above, the duplicated data are clearly
## dependent.
```
### The coin example, step by step {-}
The following code chunk simulates 100 tosses of a biased coin.
```{r}
set.seed(2)
n <- 100
p <- 0.59
flips <- sample(c("H", "T"), size = n,
replace = TRUE,
prob = c(p, 1 - p))
```
If the coin were unbiased we expect roughly 50 heads. Let us see how
many heads and tails there are.
```{r}
table(flips)
```
We calculate the binomial probability for a number of heads between 0
and 100. This is the binomial density for an unbiased coin.
```{r}
library("dplyr")
num_heads <- sum(flips == "H")
binomial_dens <-
tibble(k = 0:n) %>%
mutate(p = dbinom(k, size = n, prob = 0.5))
```
`r msmbstyle::question_begin()`
- Write some code to check that the probabilities from the binomial
  density sum to $1$.
- Change the `prob` argument and show that the probabilities still sum
to $1$.
`r msmbstyle::question_end()`
The following code chunk plots the binomial density; the number of
heads observed is marked in blue.
```{r}
library("ggplot2")
ggplot(binomial_dens, aes(x = k, y = p)) +
    geom_bar(stat = "identity") +
    geom_vline(xintercept = num_heads, colour = "blue")
```
`r msmbstyle::question_begin()`
- Change the `prob` argument above and re-plot the binomial density.
  What do you notice about how the distribution is centred?
`r msmbstyle::question_end()`
Now we set the size of the rejection region. This is a choice that
corresponds to how many false discoveries we are happy to allow.
```{r}
alpha <- 0.05
```
`r msmbstyle::question_begin()`
- Without looking below, use the `arrange` function from `dplyr` to
  order the probabilities, smallest first.
- Looking at the output, what is the most unlikely number of heads to
observe?
- Looking at the output, what is the most likely number of heads to
observe?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
binomial_dens %>%
filter(p == max(p))
binomial_dens %>%
filter(p == min(p))
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
- What does the following code chunk do?
```{r}
binomial_dens <-
arrange(binomial_dens, p) %>%
mutate(reject = (cumsum(p) <= alpha))
```
`r msmbstyle::question_end()`
Let us plot the rejection region in red.
```{r}
ggplot(binomial_dens) +
geom_bar(aes(x = k, y = p, fill = reject), stat = "identity") +
scale_fill_manual(
values = c(`TRUE` = "red", `FALSE` = "#5a5a5a")) +
geom_vline(xintercept = num_heads, col = "blue") +
theme(legend.position = "none")
```
`r msmbstyle::question_begin()`
- Is there evidence that our coin is biased?
- Change the size of the rejection region to a smaller value. What is
  our conclusion now?
`r msmbstyle::question_end()`
The above test is already implemented in R as an easy-to-use function:
```{r}
binom.test(x = num_heads, n = n, p = 0.5)
```