

Re-implementation of correction for population structure
It is worth noting that the other reference-free methods for association mapping mentioned above are based on the presence or absence of k-mers, and hence are not suitable for finding sequences in sex chromosomes that are present in both sexes, e.g.

We apply our method to sequencing data from two populations in the 1000 Genomes dataset, labeling males and females as cases and controls. We find that the k-mers determined by HAWK cover the entire sequenced regions in the X and Y chromosomes.
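The contrast between presence/absence and count-based testing can be sketched in a few lines. The example below is an illustration only, not HAWK's code: it assumes k-mer counts are Poisson distributed with an exposure equal to the total k-mer count in each group, and the function name `poisson_lrt` and all numbers are hypothetical. A k-mer from the X chromosome occurs in both sexes, so a presence/absence comparison sees no difference, while a likelihood ratio test on the counts does.

```python
import math


def poisson_lrt(count_a, count_b, total_a, total_b):
    """Likelihood ratio test for equal Poisson rates of one k-mer in two groups.

    count_a, count_b: summed counts of the k-mer in each group.
    total_a, total_b: total k-mer counts in each group (used as exposure).
    Returns (statistic, p_value); the p-value uses a chi-square
    distribution with one degree of freedom.
    """
    def xlogy(x, y):
        # Convention: 0 * log(y) = 0, so zero counts do not break the test.
        return x * math.log(y) if x > 0 else 0.0

    # Null hypothesis: one shared rate for both groups.
    pooled_rate = (count_a + count_b) / (total_a + total_b)
    mean_a, mean_b = pooled_rate * total_a, pooled_rate * total_b

    # Poisson log-likelihoods; factorial terms cancel in the ratio.
    ll_null = (xlogy(count_a, mean_a) - mean_a) + (xlogy(count_b, mean_b) - mean_b)
    ll_alt = (xlogy(count_a, count_a) - count_a) + (xlogy(count_b, count_b) - count_b)

    statistic = max(2.0 * (ll_alt - ll_null), 0.0)
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts for a k-mer that is present in both groups at a 2:1 ratio.
count_a, count_b = 2000, 1000
total_a = total_b = 1_000_000_000

# Presence/absence cannot separate the groups: the k-mer is seen in both.
same_presence = (count_a > 0) == (count_b > 0)

# The count-based test can.
stat, p = poisson_lrt(count_a, count_b, total_a, total_b)

# With equal counts the statistic collapses to (about) zero.
null_stat, null_p = poisson_lrt(1500, 1500, total_a, total_b)
```

In this toy case `same_presence` is true, yet the test statistic is large and the p-value is far below any usual threshold, which is the situation described for sequences shared by both sexes.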

The methods are primarily based on finding k-mers, i.e. contiguous sequences of length k in sequenced reads, and identifying k-mers associated with the phenotype. In the association mapping tool named HAWK developed by Rahman et al., frequencies of k-mers are analyzed to find k-mers associated with a phenotype, and these are then assembled to form the associated sequences. First, they count k-mers in reads from each individual using Jellyfish. Second, using a likelihood ratio test, they find k-mers with significantly different counts in case and control samples. Next, population structure is determined from k-mer counts using Eigenstrat. After that, associations to k-mers after correcting for population structure are determined. Finally, the k-mers found to be associated may be assembled to get a sequence for each associated locus.

Here we re-implement HAWK with the goal of reducing its execution time and making it more convenient for users. We have re-implemented in C++ the step for finding associated k-mers after population structure correction, which was previously implemented in R. We have also extended support for Jellyfish 2 and implemented the Benjamini–Hochberg procedure, which can be used to correct for multiple tests when the study is underpowered for Bonferroni correction. We have tested our implementation with a dataset on E. coli ampicillin resistance and have compared its output with the output of the original implementation. We have also analyzed the execution times of the two implementations. Our implementation is faster and more flexible to run than the original implementation while producing similar results.

Finally, we show that our method can be used to find sequences in the sex chromosomes.
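To illustrate why the Benjamini–Hochberg procedure helps when a study is underpowered for Bonferroni correction, the sketch below applies both corrections to the same p-values. It follows the standard textbook step-up procedure and is not taken from the HAWK source; the function names and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Standard Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    Returns a list of booleans in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes

    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values for ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bonf = bonferroni(p_values)          # threshold is 0.05 / 10 = 0.005
bh = benjamini_hochberg(p_values)    # thresholds grow with rank: 0.005, 0.010, ...

# Step-up behaviour: a later rank passing rescues earlier ranks that failed.
step_up = benjamini_hochberg([0.01, 0.04, 0.03, 0.045])
```

On these values Bonferroni keeps one test and Benjamini–Hochberg keeps two, since the second smallest p-value (0.008) is compared against 0.010 rather than 0.005. The trade-off is that Benjamini–Hochberg controls the false discovery rate rather than the family-wise error rate.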

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Association mapping is the process of associating phenotypes with genotypes. In genome-wide association studies (GWAS), individuals are typically genotyped using microarrays or by aligning sequencing reads from individuals to a reference genome. However, both these approaches require a reference genome of the organism, which makes them inappropriate for association mapping in non-model organisms with incomplete reference genomes or none at all. To address this issue, alignment-free approaches for association mapping have been explored. A number of methods have been developed to perform association studies in bacterial genomes that do not require aligning reads to reference genomes. The high plasticity of bacterial genomes means that structural variants and even large genomic segments found in various strains are missing from the reference genomes, which makes application of reference-based methods difficult. However, these methods do not scale to organisms with large genomes, and as many of these organisms have incomplete reference genomes, association mapping in them remained challenging. Methods for mapping associations in large genomes have been presented, for categorical phenotypes and for both categorical and quantitative phenotypes.
The scripts, source code, and user manual for HAWK are available at, the code is available for and tested on Linux (Ubuntu 14.04 and Ubuntu 18.04), and the 1000 Genomes dataset is available at.
University of Texas School of Public Health, UNITED STATES

Citation: Mehrab Z, Mobin J, Tahmid IA, Rahman A (2021) Efficient association mapping from k-mers-An application in finding sex-specific sequences.

Received: J; Accepted: Decem; Published: January 7, 2021

Copyright: © 2021 Mehrab et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The E. coli ampicillin resistance dataset can be downloaded from the Sequence Read Archive (SRA Accession No.
