Genetic Mixture Analysis (GMA)

Software for estimating the stock proportions within mixed stock fisheries

 

 

IMPORTANT: This program has been replaced by my new program ONCOR.

This webpage has been retained on the off chance that someone wants an old version of the program.

 

 

Introduction

GMA is a Windows program designed to use "baseline" genetic data to estimate the composition of a sample containing individuals from those populations. The most common application of GMA is analysis of samples from mixed stock fisheries. GMA can either estimate the mixture proportions for a sample or perform assignment tests.

GMA has two main functions: "Estimation" and "Simulation." Two types of estimation are available: mixture proportions and assignment tests. Several types of simulation are available. The simplest produces a sample from a simulated mixture. Others evaluate how well baseline data can estimate mixture proportions.

 

History

I wrote GMA to help fisheries managers improve the design of genetic stock identification (GSI) efforts. In particular, I was exploring the relationship between polymorphism at loci used for GSI and the quality of mixture estimates. I wrote GMA from scratch because I believed the computer program SPAM (Statistics Package for the Analysis of Mixtures) did not adequately deal with “sampling zeros” (alleles found in the mixture but not in one or more baseline populations). In my opinion, this statistical artifact of SPAM has had a substantial influence on how fisheries geneticists design mixed stock fisheries studies. Since I began this project, ADFG has released a revised version of SPAM (July 2002) that fixes this problem, and the differences between the two programs are now relatively minor.

 

Comparison to SPAM

GMA is quite similar to the Statistics Package for the Analysis of Mixtures (SPAM), distributed by the Alaska Department of Fish and Game. However, GMA has a few functions that SPAM does not. For example, during simulations, GMA lets users specify how many genes should be sampled from the actual baseline while creating a simulated baseline to analyze simulated data. This is useful for examining how baseline sample size affects mixture estimates. GMA also has a full set of assignment test simulations. Lastly, GMA is probably easier to use than SPAM.

 

Comparison to WHICHRUN

The assignment test functions available in GMA are similar to the assignment test program WHICHRUN. See here for the WHICHRUN website. The most important difference is that GMA uses estimates of the mixture proportions as a prior while performing assignment tests. WHICHRUN effectively assumes that an individual drawn from a mixture could have come from each population with equal probability. A second difference is that WHICHRUN outputs likelihood ratios instead of probabilities. A third difference is that WHICHRUN uses 1/(2N+1) as an estimate of the frequency of alleles found in the mixture but not in baseline populations. GMA uses a slightly more sophisticated approach (see below), but simulations that I have run suggest that results from the two methods are effectively indistinguishable. WHICHRUN and GMA also differ in the functions they include. WHICHRUN can use a jackknife procedure to evaluate confidence in assignment. GMA doesn't do this, but GMA does have simulation capabilities that WHICHRUN doesn't.

 

Download GMA

Click here to download the code:

GMA.exe      The main executable file.
kalinowski_library.dll    A collection of functions and procedures that GMA.exe requires.

 

And here for sample data files:

steelhead.bse A sample "baseline" file in SPAM format.
nexus_example.txt A sample "baseline" file in nexus GDA format.
up_col.con  A sample "control" file for running simulations.
up_col.mix A sample "mixture" file.

 

Installation / Running GMA / Uninstalling

GMA requires the .NET (pronounced "dot net") framework to run. Instructions are provided on my software page.

Once you have the .NET framework on your computer, click on GMA.exe to run GMA. Note:  kalinowski_library.dll must be in the same folder as GMA.exe.

To remove GMA from your computer, delete GMA.exe and kalinowski_library.dll.

 

Recommendation for new users

I recommend getting used to GMA by opening steelhead.bse and mid_col.con and running a few simulations. GMA is fairly slow, so don't do too many trials while exploring GMA's functions (10 is good for a start).

 

Quick Start Instructions (for simulations)

1. Open a baseline file.

2. Make a control file. See the “Utilities” menu for this option.

3. Open the control file.

4. Run the appropriate simulation.

 

Quick Start Instructions (for estimation)

1. Open a baseline file.

2. Open a mixture file.

3. Estimate mixture proportions or perform assignment tests.

 

Simulations

Four types of simulations can be run. All are quite similar. To run simulations, a baseline and a control file must be open (see below).

The simplest simulation produces a single simulated mixture and outputs the genotypes of that mixture to a mixture file. To do this, go to the Simulations menu and select "Mixture Simulation" --> "One Sample". This simulated mixture can then be opened and analyzed (estimate mixture proportions, or perform assignment tests). See here for sample output.

Two simulations test how well baseline data estimate mixture proportions. They differ only in their output. In each case, mixtures and baselines are simulated from the actual baseline according to the control file. GMA then estimates the mixture proportions of the simulated mixture using the simulated baseline (and the loci and populations specified in the control file). Two types of output are available: "All estimates" or "Averages and standard deviations". If "All estimates" is selected, the estimate of the proportion of every baseline population for every simulated mixture will be output to a text file. This can easily produce a big text file. I recommend importing it into a database for analysis. See here for sample output.
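The relationship between the two output types can be sketched in a few lines of Python. This is only an illustration, not GMA's code: the function name summarize, the in-memory layout (one dictionary of estimates per simulated mixture), and the use of the sample standard deviation are all assumptions, and the numbers are made up.

```python
from statistics import mean, stdev

def summarize(all_estimates):
    """Collapse per-mixture estimates into (average, standard deviation) per population.

    all_estimates is a list with one dict per simulated mixture, mapping a
    population name to its estimated proportion (a hypothetical layout).
    """
    populations = list(all_estimates[0])
    summary = {}
    for pop in populations:
        values = [row[pop] for row in all_estimates]
        # Sample standard deviation; which form GMA reports is an assumption.
        summary[pop] = (mean(values), stdev(values))
    return summary

# Three simulated mixtures, two baseline populations (made-up estimates).
runs = [
    {"PopA": 0.40, "PopB": 0.60},
    {"PopA": 0.50, "PopB": 0.50},
    {"PopA": 0.60, "PopB": 0.40},
]
summary = summarize(runs)
print(summary)
```

The "All estimates" output corresponds to the full list of rows, which grows with the number of trials, while the summary has only one entry per population.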

A fourth simulation performs assignment tests on each simulated mixture. The output should be self-explanatory. See here for an example. In the output file, "PMixture", "PSample", and "PEstimate" indicate, respectively, the parametric proportion of a population in a mixture, the proportion of the same population in a sample taken from the mixture, and the estimate of the proportion produced by GMA.

Output files from simulations end with .out, except for the last type, which ends with .txt (in order to facilitate importing the file to a spreadsheet or database).

 

Estimation

To estimate the mixture proportions in a sample from a mixture, or to assign the individuals in the sample to baseline populations: 1. open a baseline file, 2. open a mixture file, 3. select the appropriate analysis under the menu "Estimation". Assignment tests indicate the most likely populations for individuals to come from. These results are sorted from most likely to least likely. Only the most probable populations are output (output ends when the cumulative probability reaches 0.99). See an example output file from assignment tests here.
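The sorting and 0.99 cutoff described above can be sketched as follows. This is a minimal illustration, not GMA's code: the function name top_assignments is hypothetical, the probabilities are made up, and it assumes the population that pushes the cumulative probability past the cutoff is itself included in the output.

```python
def top_assignments(posteriors, cutoff=0.99):
    """Return (population, probability) pairs, most likely first.

    Output stops once the cumulative probability reaches the cutoff.
    Including the population that crosses the cutoff is an assumption.
    """
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for pop, prob in ranked:
        kept.append((pop, prob))
        cumulative += prob
        if cumulative >= cutoff:
            break
    return kept

# Made-up assignment probabilities for one individual.
individual = {"PopC": 0.004, "PopA": 0.90, "PopD": 0.001, "PopB": 0.095}
result = top_assignments(individual)
print(result)  # [('PopA', 0.9), ('PopB', 0.095)]
```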

Output files from estimation end with .est.

 

File Formats

GMA uses three types of files: baseline files, control files, and mixture files. To run simulations, you need to open a baseline file and then a control file (in that order). To estimate the composition of a mixture, you need to open a baseline file and then a mixture file (in that order).

In general, GMA uses spaces, tabs, and line feeds to separate characters. Because GMA uses spaces to separate different pieces of data (e.g. population name and sample size), locus names, individual IDs, etc. usually cannot have spaces in them.

Baseline, mixture, and control files must use line feeds according to the models provided. In other words, you can't put multiple pieces of data on the same line if GMA expects them to be on separate lines.
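The reason names cannot contain spaces is easy to see with any whitespace-delimited reader. The Python sketch below is not GMA's parser; the function name split_fields and the example line (a population name followed by a sample size) are only illustrations.

```python
def split_fields(line):
    """Split one line into fields on any run of spaces or tabs."""
    return line.split()

# A name written with an underscore stays one field...
good = split_fields("Lower_Columbia\t48")
print(good)  # ['Lower_Columbia', '48']

# ...but a name containing a space is read as two separate fields.
bad = split_fields("Lower Columbia\t48")
print(bad)   # ['Lower', 'Columbia', '48']
```

Replacing spaces with underscores, as in the reporting unit names used elsewhere on this page, avoids the problem.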

Baseline files

Baseline files are essentially databases containing the genotype or allele frequencies of the baseline data available to analyze mixtures.

GMA can read three types of baseline files: SPAM, GENEPOP, and Nexus. SPAM baseline files must list data as absolute counts (as opposed to frequencies). See here for an example of a SPAM file. GENEPOP files can list genotypes with two or three characters. In addition, the name of the sample can be listed after the "POP" separator.

Control Files

Control files (*.con) tell GMA how to run simulations. Control files have five sections: TITLE, LOCI, SAMPLESIZES, MIXTURE, and REPORTINGUNITS. The most important section is the MIXTURE section. All the others are optional.

The MIXTURE section specifies the mixture proportions within the mixture being sampled. This can be done as either a set of counts or a set of probabilities. If counts are used, each simulated sample will contain the specified number of individuals from each population. If probabilities are used, sampling will be simulated according to those probabilities.

The TITLE is a one-line description of the control file. The LOCI section lists the loci to be used. If this section is omitted, all loci will be used in simulations. The POPULATIONS section lists the baseline populations to use while analyzing mixtures. If this section is omitted, all baseline populations will be used. If this section is included and a population is omitted, GMA will conclude that the mixture does not contain any individuals from that population. This section lets users explore the consequences of having individuals in the mixture that came from populations not included in the baseline used to analyze the mixture.
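The difference between specifying the MIXTURE section as counts and as probabilities can be sketched in Python. This is only an illustration of the two sampling schemes, not GMA's code; the function names, population names, and numbers are hypothetical.

```python
import random

def sample_by_counts(counts):
    """Return a sample with exactly counts[pop] individuals from each population."""
    sample = []
    for pop, n in counts.items():
        sample.extend([pop] * n)
    return sample

def sample_by_probabilities(probs, sample_size, rng):
    """Draw each individual's population of origin independently according to probs."""
    pops = list(probs)
    weights = [probs[pop] for pop in pops]
    return rng.choices(pops, weights=weights, k=sample_size)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
fixed = sample_by_counts({"PopA": 60, "PopB": 40})
drawn = sample_by_probabilities({"PopA": 0.6, "PopB": 0.4}, 100, rng)

# With counts the composition is exact; with probabilities it varies by chance.
print(fixed.count("PopA"), drawn.count("PopA"))
```

With probabilities, the proportion of a population in the sample can differ from its proportion in the mixture, which is the distinction between "PSample" and "PMixture" in the assignment test simulation output.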

The SAMPLESIZES section lists the sample sizes (in numbers of genes) to use while simulating baselines to analyze simulated mixtures. (Note: GMA always simulates a new baseline to analyze simulated mixtures.) If this section is omitted, GMA will use the same sample sizes as the baseline.

The REPORTINGUNITS section lists sets of samples of interest. GMA sums the mixture proportions or assignment probabilities for each population in a reporting unit to produce an estimate for the reporting unit as a whole. See the Methods section below. The list of reporting units does not have to be comprehensive (i.e. contain all baseline populations). In addition, populations can belong to more than one reporting unit. For example, the following reporting units might be used: "Upper_Columbia", "Lower_Columbia", "Wild", "Hatchery", "Endangered" etc. In this example, a single population might be listed in three reporting units (e.g. Lower_Columbia, Wild, Endangered).
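The summing over reporting units described above amounts to the following. This is a minimal Python sketch, not GMA's code; the function name reporting_unit_totals, the population names, and the estimates are made up for illustration.

```python
def reporting_unit_totals(estimates, reporting_units):
    """Sum per-population estimates over the members of each reporting unit."""
    return {
        unit: sum(estimates[pop] for pop in members)
        for unit, members in reporting_units.items()
    }

# Made-up mixture proportion estimates for three baseline populations.
estimates = {"PopA": 0.5, "PopB": 0.25, "PopC": 0.25}

# PopA belongs to two units; the units need not cover every population.
units = {
    "Lower_Columbia": ["PopA", "PopB"],
    "Wild": ["PopA", "PopC"],
}
totals = reporting_unit_totals(estimates, units)
print(totals)  # {'Lower_Columbia': 0.75, 'Wild': 0.75}
```

Because a population can appear in several units, the unit totals do not have to sum to one.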

See here for an example of a control file.

The easiest way to make a control file is to open a baseline file, and use GMA's control file making "wizard" listed in the "Utilities" menu.

Mixture Files

Mixture files (.mix) contain: 1) the genotypes of individuals from a mixture to be analyzed, and 2) instructions for how to do the analysis. GMA does not read SPAM, NEXUS, or GENEPOP files directly as mixture files, because these file formats do not support the necessary features, but it can read genotypes recorded in each of these formats.

Mixture files contain the following sections: TITLE, LOCI, BASELINE, REPORTINGUNITS, and MIXTURE. The MIXTURE section contains the genotypes of the individuals in the mixture, one individual per line. See above for an explanation of the other sections. See here for an example of a mixture file containing genotypes in the GENEPOP format and here for an example of a mixture file containing genotypes in the SPAM format.

 

Statistical Methods

Genotype probabilities

GMA uses the method of Rannala and Mountain (1997, equation 9) for estimating the probability of observing a genotype in a population. This method has the advantage of being non-zero and quick to calculate.
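To show why such a probability is never zero, here is a Python sketch of the single-locus form commonly attributed to Rannala and Mountain (1997), in which each allele receives a prior weight of 1/K, where K is the number of alleles at the locus. It is not taken from GMA's source and should be checked against the paper before being relied on; the function name, the allele names, and the way K is counted are assumptions.

```python
def genotype_probability(a1, a2, allele_counts, num_alleles):
    """Single-locus genotype probability in one baseline population.

    allele_counts maps allele name -> number of copies in the baseline sample.
    num_alleles is K, the number of alleles at the locus (how K is counted,
    e.g. across all populations, is an assumption of this sketch).
    """
    n = sum(allele_counts.values())       # genes sampled at this locus
    prior = 1.0 / num_alleles             # prior weight given to every allele
    x1 = allele_counts.get(a1, 0) + prior
    denom = (n + 1) * (n + 2)
    if a1 == a2:
        return x1 * (x1 + 1) / denom      # homozygote
    x2 = allele_counts.get(a2, 0) + prior
    return 2 * x1 * x2 / denom            # heterozygote

# Allele "108" was never seen in this baseline sample (a sampling zero).
counts = {"100": 6, "104": 4}
alleles = ["100", "104", "108"]
p_unseen = genotype_probability("108", "108", counts, len(alleles))

# Probabilities over all unordered genotypes should sum to one.
total = sum(
    genotype_probability(a, b, counts, len(alleles))
    for i, a in enumerate(alleles)
    for b in alleles[i:]
)
print(p_unseen > 0, round(total, 12))
```

A genotype carrying an allele absent from the baseline sample still receives a small positive probability, so the sampling zeros mentioned in the History section do not force any population's likelihood to zero.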

GMA estimates the probability that an individual came from the ith baseline population with Bayes' rule, using that population's estimated contribution to the mixture as the prior. Individuals are "assigned" to the population from which they have the highest probability of coming.
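The effect of the prior can be seen in a small Python sketch of Bayes' rule. This is an illustration only, not GMA's code: the function name assignment_probabilities is hypothetical, and the genotype probabilities and mixture proportions are made up.

```python
def assignment_probabilities(genotype_probs, mixture_props):
    """Posterior probability that an individual came from each population.

    genotype_probs: probability of the individual's genotype in each population.
    mixture_props: estimated contribution of each population to the mixture,
    used as the prior.
    """
    joint = {pop: mixture_props[pop] * genotype_probs[pop] for pop in genotype_probs}
    total = sum(joint.values())
    return {pop: value / total for pop, value in joint.items()}

# The genotype is twice as likely in PopA, but PopB dominates the mixture.
genotype_probs = {"PopA": 0.02, "PopB": 0.01}
with_prior = assignment_probabilities(genotype_probs, {"PopA": 0.2, "PopB": 0.8})
equal_prior = assignment_probabilities(genotype_probs, {"PopA": 0.5, "PopB": 0.5})

print(max(with_prior, key=with_prior.get))    # PopB
print(max(equal_prior, key=equal_prior.get))  # PopA
```

The second call uses equal priors, which corresponds to the assumption attributed to WHICHRUN in the comparison above; the same individual is then assigned to a different population.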

Simulations

During simulations, the method of Rannala and Mountain is used to simulate the drawing of genotypes. When mixtures are being simulated, sampling of individuals is assumed to be independent. When simulated mixtures are analyzed, the baseline is always resampled.

 

Types of loci covered

GMA can only handle codominant nuclear loci such as allozymes, microsatellites, or MHC alleles. GMA cannot deal with isoloci or mtDNA haplotypes.

 

Citation

Kalinowski, ST. 2003. Genetic Mixture Analysis 1.0. Department of Ecology, Montana State University, Bozeman MT 59717. Available for download from http://www.montana.edu/kalinowski

 

Questions? Need help? Have a suggestion?

I realize this documentation is brief. Please don't hesitate to email or call if you have any questions or problems.

skalinowski@montana.edu

406 994-3232

 

Literature Cited

Rannala B, JL Mountain. 1997. Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94: 9197-9201.

Pella JJ, M Masuda, S Nelson. 1996. Search algorithms for computing stock composition of a mixture from traits of individuals by maximum likelihood. US Department of Commerce, NOAA/NMFS Technical Memo. NMFS-AFSC-61.

Smouse PE, RS Waples, JA Tworek. 1990. A genetic mixture analysis for use with incomplete source population data. Can. J. Fish. Aquat. Sci. 47: 620-634.

 

 
