The MACH 1.0 Markov Chain based Haplotyper

The MaCH Tool is used to infer missing genotypes in a two-step process: Step 1 estimates the model parameters, and Step 2 uses those parameters, calibrated to your specific dataset and genotyping platform, to impute all SNPs.

Version 14.2 - published on 15 Sep 2010

This tool is closed source.

Category: Tools

Abstract

MaCH is a tool for haplotyping and genotype imputation. MaCH was developed at the University of Michigan by Yun Li (ylwtx@umich.edu) and Gonçalo Abecasis (goncalo@umich.edu).


The main web site of the MaCH research group provides a detailed Tutorial describing the input parameters, input file formats, and output results for running MaCH to haplotype a sample of unrelated individuals or to infer genotypes at untyped markers. The wiki page for MaCH is at http://www.sph.umich.edu/csg/abecasis/MaCH/.


We thank Yun Li and Gonçalo Abecasis for their permission to offer MaCH as a tool at cceHUB.



The MaCH Tool at cceHUB

The MaCH Tool available at cceHUB is used for genotype imputation. Users can input MaCH format data files from the cceHUB Data Repository or from their own cceHUB home folder.


Files from the Data Repository are only accessible for input by authorized users who belong to the Statistical Modeling at Purdue group. Please send mail to acc@purdue.edu to request membership in that group. Note that some input "ped" files in the Data Repository have associated "ped" files that begin with "sub". The "sub" files have a smaller sample population and require much shorter execution times in Step 1 to estimate model parameters. The associated, larger files should then be used in Step 2. As an example, for chromosome group number 6, the ped file subgawchr6.ped (26GB) can be used in Step 1, while the associated ped file gawpedchr6.ped (270GB) should be used in Step 2. Using the larger ped files may require up to a week of execution runtime.


Files from your own collection can be loaded into the cceHUB MaCH Tool. You will need to follow the instructions in the Tutorial for formatting your files. You will then need to upload your files to cceHUB using WebDAV or SFTP. See How to Use WebDAV for information on setting up WebDAV to "drag and drop" your datasets into your home folder at cceHUB. You can also use Core FTP Lite, a free file transfer package, for moving files from your desktop to your cceHUB home folder. (Use login.ccehub.org as the destination machine. Send email to acc@purdue.edu with any questions.) Files in your home folder can be input into the MaCH Tool by selecting the option to choose your dataset from your own collection.


The cceHUB MaCH Tool allows any registered cceHUB user to
  • choose and set parameters
  • select and load data files
  • specify a location for output results
  • execute the MaCH code
all from the graphical interface that runs in your web browser.


The sections below describing how to use MaCH to infer genotypes at untyped markers are copied from the tutorial at the MaCH web site. Additions/changes to the original text are underlined.



Using MaCH for Genotype Imputation

Genotype imputation makes it relatively straightforward to combine results of genome-wide association scans based on different genotyping platforms and to increase power of association analyses for studies based on a single platform.


To infer missing genotypes, you'll typically provide genotypes for your own samples as input together with haplotypes for a reference sample, such as the HapMap. An alternative is to create a large pooled dataset that includes genotypes both for your own samples and for the reference individuals in a single pedigree file. Since this alternative is not commonly used, we will focus here on describing the first strategy.



Preliminary Checks

Before genotype imputation, you should carry out basic data quality checks on available genotypes. Typically, we exclude from analysis markers that have low genotyping success rates (perhaps with <95% of genotypes called successfully), that show unexpected evidence for deviations from Hardy-Weinberg equilibrium (perhaps with an HWE p-value < 0.000001 or so), that have large numbers of discrepancies among duplicate samples or several Mendelian inconsistencies in available parent-offspring trios, or that are rare (with MAF < 1% or so). All these checks are platform- and study-specific, and you'll have to figure out what is appropriate for your data. They are mentioned here as a reminder.
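
As a rough illustration of the thresholds mentioned above, the following Python sketch filters a single marker on call rate, Hardy-Weinberg p-value, and minor allele frequency. It is not part of MaCH or the cceHUB tool. The function names and the use of a simple 1-degree-of-freedom chi-square test are assumptions made for this example (many studies prefer an exact HWE test), and the duplicate-discrepancy and Mendelian-inconsistency checks are not covered.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) p-value for deviation from Hardy-Weinberg proportions.

    Assumption: a plain chi-square test is used here for brevity.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic marker: the test is undefined
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_call_rate=0.95, min_hwe_p=1e-6, min_maf=0.01):
    """Return True if a marker passes the three count-based checks.

    Default thresholds follow the 'perhaps ... or so' values in the text;
    they are starting points, not recommendations for every study.
    """
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return False
    if called / total < min_call_rate:
        return False
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    if min(freq_a, 1.0 - freq_a) < min_maf:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p
```

For example, `passes_qc(250, 500, 250, 100)` fails because only about 91% of the genotypes were called.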


When MACH loads your pedigree and the reference haplotypes, it checks that allele labels in the two samples are compatible and that allele frequencies are broadly comparable. If your sample includes no A/T or G/C SNPs (e.g. because it was genotyped on an Illumina Infinium platform), you can use the autoFlip option (you can set this parameter in the cceHUB MaCH tool as a Phased data parameter option) to ensure that alleles in the pedigree file and those in the reference haplotypes refer to the same strand. If your sample does include A/T and G/C SNPs, you'll have to ensure they are aligned to the same strand manually and inspect allele frequency discrepancies identified by MACH to help pinpoint problems. Although it is typical that a small number of SNPs will drift in frequency between populations, we recommend that you read through the warnings generated by MACH. If you see large frequency discrepancies or anything else suspicious ... investigate!
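
A/T and G/C SNPs need manual attention because their two alleles are complements of each other, so allele labels alone cannot reveal which strand was reported. The sketch below lists such SNPs so you can decide whether autoFlip is safe for your data. It is an illustration only: the function names and the dictionary layout (SNP name mapped to an allele pair) are assumptions for this example, not a MaCH file format.

```python
# Watson-Crick complements of the four bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and G/C SNPs, whose alleles are complements of each other."""
    a, b = allele1.upper(), allele2.upper()
    return COMPLEMENT.get(a) == b


def ambiguous_snps(snp_alleles):
    """Return the names of strand-ambiguous SNPs.

    snp_alleles: dict mapping a SNP name to an (allele1, allele2) tuple
    (hypothetical layout chosen for this example).
    """
    return [name for name, (a, b) in snp_alleles.items()
            if is_strand_ambiguous(a, b)]
```

If `ambiguous_snps(...)` returns an empty list, the sample contains no A/T or G/C SNPs, which is the situation in which the text says autoFlip can be used.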


Newer versions of MACH will automatically ignore any SNPs that are present in your pedigree file but not in the reference panel. SNPs that are present only in the reference panel but not in your pedigree will be imputed!
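
To preview which SNPs fall into each group before a run, the marker lists can be compared with simple set operations. This is a minimal sketch and not MaCH's own logic; the function name and the dictionary keys are chosen for this example.

```python
def partition_snps(study_snps, reference_snps):
    """Split SNP names by where they occur.

    'ignored': present in the study pedigree only (skipped by newer MACH versions)
    'imputed': present in the reference panel only (will be imputed)
    'shared':  present in both (label chosen for this example)
    """
    study = set(study_snps)
    reference = set(reference_snps)
    return {
        "ignored": sorted(study - reference),
        "imputed": sorted(reference - study),
        "shared": sorted(study & reference),
    }
```

A large "ignored" group may be worth investigating, since it can point to marker naming differences between your data and the reference panel.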



Step 1: Estimating Model Parameters

Once you are happy with your input dataset, the most (computationally) efficient way to carry out imputation in large GWAS datasets is to use the greedy option (you can set this parameter in the cceHUB MaCH tool as a Phased data parameter option) and to carry out a two-step process. The first step is to build a model that relates your samples to the haplotypes in the reference panel. This model includes both an estimate of the "error" rate for each marker (an omnibus parameter which captures genotyping error, discrepancies between your platform and the reference panel, and recurrent mutation) and of "crossover" rates for each interval (a parameter that describes breakpoints in haplotype stretches shared between your samples and the reference panel).


The key choices for this first step are the number of iterations expended in estimating model parameters (specified with the rounds parameter, which you can set in the cceHUB MaCH tool as a Markov sample parameter option) and the number of individuals in your sample to use for model building. In small samples, it is often okay to include your entire sample in this model parameter estimation step; in larger samples, it is usually sufficient to include a random subset of 200-500 individuals in this step.
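
One way to draw such a random subset without loading a very large pedigree file into memory is reservoir sampling, sketched below. This is an illustration, not a cceHUB or MaCH utility. It assumes that each line of the pedigree file describes exactly one individual and that individuals are unrelated, so rows can be sampled independently; the function name and the file names in the usage comment are hypothetical.

```python
import random


def sample_individuals(lines, k, seed=None):
    """Return k lines chosen uniformly at random from an iterable of lines.

    Uses reservoir sampling (Algorithm R), so only k lines are held in
    memory at any time. If fewer than k lines exist, all are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Replace a stored line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


# Usage sketch (hypothetical file names):
#
# with open("study.ped") as src, open("study_subset.ped", "w") as dst:
#     dst.writelines(sample_individuals(src, 300, seed=1))
```

Fixing the seed makes the subset reproducible, which helps if Step 1 has to be rerun.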


Once all iterations are completed, MACH will store model parameters in two files, ".rec" and ".erate". You will specify which location in your cceHUB home folder to store these intermediate files as one of the runtime parameters set in the cceHUB MaCH tool. We will use these files as input for the next step, where model parameters will be fixed.
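
Because Step 1 can run for a long time, it may be worth confirming that both parameter files exist and are non-empty before starting Step 2. The sketch below does this check. It assumes the files are named by appending ".rec" and ".erate" to a common output prefix; the prefix and the function names are hypothetical, and the check is not something the cceHUB tool is known to perform.

```python
from pathlib import Path


def step1_outputs(prefix):
    """Return the expected crossover (.rec) and error rate (.erate) file paths.

    Assumption: both files share one output prefix chosen in Step 1.
    """
    base = str(prefix)
    return Path(base + ".rec"), Path(base + ".erate")


def missing_step1_outputs(prefix):
    """Return the expected Step 1 files that are absent or empty."""
    return [p for p in step1_outputs(prefix)
            if not (p.is_file() and p.stat().st_size > 0)]
```

An empty list from `missing_step1_outputs(...)` means both files are in place.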


Useful Tip: When analyzing very large samples, the compact option, which you can set in the tool as a Haplotyper parameter option, can help you save memory.



Step 2: Carrying Out Genotype Imputation

This step is relatively quick. It uses the parameters estimated in the previous round, calibrated to your specific dataset and genotyping platform, to impute all SNPs in the reference panel in your sampled individuals.



References

Li Y, Willer C, Sanna S and Abecasis G (2009) Genotype Imputation. Annual Review of Genomics and Human Genetics, Vol. 10, Pages 387-406 (doi: 10.1146/annurev.genom.9.081307.164242)

Li Y and Abecasis GR (2006) Mach 1.0: Rapid Haplotype Reconstruction and Missing Genotype Inference. Am J Hum Genet S79 2290

Li Y, Willer CJ, Ding J, Scheet P and Abecasis GR (2006) Rapid Markov Chain Haplotyping and Genotype Inference. submitted

Cite this work

Researchers should cite this work as follows:

  • Yanzhu Lin; Guoheng Chen; Ann Christine Catlin (2010), "The MACH 1.0 Markov Chain based Haplotyper," http://ccehub.org/resources/mach.


Tags
  1. genetics
  2. haplotype