Consider a placebo-controlled preventive HIV vaccine efficacy trial. An HIV amino acid sequence is measured from each volunteer who acquires HIV, and these sequences are aligned together with the reference HIV sequence represented in the vaccine. We develop genome scanning methods to identify HIV positions at which the amino acids in sequences from infected vaccine recipients tend to be more divergent from the corresponding reference amino acid than the amino acids in sequences from infected placebo recipients. We consider five two-sample test statistics, based on Euclidean, Mahalanobis, and Kullback-Leibler divergence measures. Weights are incorporated to reflect biological information contained in different amino acid positions and substitutions. Position-wise p-values are obtained by approximating the null distribution of the statistics either by a permutation procedure or by nonparametric estimation. Modified Bonferroni and false discovery rate procedures that exploit the discrete nature of the genetic data are used to infer statistically significant signature positions. The methods are examined in simulations and are applied to data from a vaccine trial. More broadly, these methods address the general problem of comparing discrete frequency distributions between groups in a high-dimensional data setting.
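
As a rough illustration of the permutation approach to position-wise p-values, the sketch below compares amino acid frequency distributions at a single aligned position between two groups. It is a minimal example under stated assumptions, not the statistics developed in this work: the unweighted squared Euclidean distance, the function names (`freq_vector`, `euclidean_stat`, `permutation_pvalue`), and the default number of permutations are all hypothetical choices, and the biological weights and the comparison against the reference sequence are omitted.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def freq_vector(residues):
    """Relative frequency of each amino acid among the residues at one position."""
    counts = Counter(residues)
    n = len(residues)
    return [counts.get(a, 0) / n for a in AMINO_ACIDS]


def euclidean_stat(vaccine, placebo):
    """Squared Euclidean distance between the two groups' frequency vectors.

    Assumption: an unweighted stand-in for a two-sample divergence statistic.
    """
    fv, fp = freq_vector(vaccine), freq_vector(placebo)
    return sum((x - y) ** 2 for x, y in zip(fv, fp))


def permutation_pvalue(vaccine, placebo, n_perm=2000, seed=0):
    """Permutation p-value for one position, shuffling the group labels."""
    rng = random.Random(seed)
    observed = euclidean_stat(vaccine, placebo)
    pooled = list(vaccine) + list(placebo)
    n_v = len(vaccine)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point ties.
        if euclidean_stat(pooled[:n_v], pooled[n_v:]) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Calling `permutation_pvalue` once per aligned position yields a vector of p-values to which a multiplicity adjustment can then be applied. Because the data are discrete, the attainable p-values at a position form a finite set, which is the property the modified Bonferroni and false discovery rate procedures exploit.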


Bioinformatics | Biostatistics | Computational Biology