Setup

library(tidyverse)
library(ape)
library(phangorn)
library(pegas)
library(adegenet)
#library(strataG)
library(knitr)
library(emoji)
library(coda)
library(ggmcmc)
library(perm)
library(Rmpfr)
library(spatstat.explore)
library(colorspace)
library(latex2exp)
library(ggridges)
library(raster)
library(RColorBrewer)
library(reshape2)
library(snpR)
library(graph4lg)
library(gghighlight)
library(hierfstat)
library(pophelper)
library(conStruct)
source("../IBD_Kernels/IBD_functions.R")

# Function library
harvest.model.likelihoods <- function(workingDir = workingDir,
                                      outfileName = "outfile.txt",
                                      multilocus = T){
    # this function harvests model marginal likelihoods for models calculated by
    # the program migrate-n (Beerli & Felsenstein 2001).
    # It takes as input a directory full of directories, 
    # each of which contains output from a migrate model, and is named
    # after that model. 
  
    #initialize a data frame to take the values
    modelMarglikes <- data.frame(model=character(),
                             thermodynamic=numeric(),
                             bezier.corrected=numeric(), 
                             harmonic=numeric()) 
    # loop through directories in the working directory, each of which is name
    # after a different model
  for(i in list.dirs(workingDir, full.names = F)[-1]){ #i<-"stepping.stone"
      modelDir<-file.path(workingDir,i)
      print(modelDir)
    #scan in the outfile, separating at each newline
      outfile<-scan(file=file.path(modelDir,outfileName),what="character",sep="\n") 
    #find the line with the likelihoods on it and split on runs of spaces
      marglikeline <- grep("Scaling factor",outfile,value=F)-1
      marglikeline <- strsplit(outfile[marglikeline],
                               "\\s+", perl = T)[[1]][3:5]
    #  if(length(marglikeline)==0){next}
      marglikes <- c(i,marglikeline)
     
      modelMarglikes <- rbind(modelMarglikes,marglikes, deparse.level = 2)
  }
  names(modelMarglikes) <- c("model","thermodynamic","bezier.corrected","harmonic")
  modelMarglikes[2:4] <- sapply(modelMarglikes[2:4], as.numeric)
  return(modelMarglikes)
}

bfcalcs<-function(df,ml="bezier.corrected"){
  # This calculates log bayes factors on data frames output by
  # harvest.model.likelihoods(), following Johnson and Omland (2004)
  # You may choose the likelihood flavor with
  # ml = "bezier.corrected", "thermodynamic" or "harmonic"
  #df$thermodynamic <- as.numeric(df$thermodynamic)
  #df$bezier.corrected <- as.numeric(df$bezier.corrected)
  #df$harmonic <- as.numeric(df$harmonic)
  mlcol <- df[[ml]] 
    bmvalue <- mlcol[which.max(mlcol)]
    lbf <- 2*(mlcol-bmvalue)
    choice <- rank(-mlcol)
    modelprob <- exp(lbf/2)/sum(exp(lbf/2))
    dfall <- cbind(df,lbf,choice,modelprob)
    return(dfall)
}   

migrants.per.gen<-function(x){
  #a function for creating Nm vectors out of m and Theta vectors.
  #x<-x[[1]]
  m<-names(x)[which(grepl("M_",names(x)))] #names of m columns
  #theta<-names(x)[which(grepl("Theta_",names(x)))] #names of theta columns
  for(n in m){
    t<-paste("Theta",strsplit(n,split="_")[[1]][3],sep="_")
    x[,paste("Nm",strsplit(n,split="_")[[1]][2],strsplit(n,split="_")[[1]][3],sep="_")]<- x[,which(names(x)==n)]*x[,which(names(x)==t)] 
    #this hairy little statement makes a new column named "Nm_X_Y" and then fills it by multiplying the M_X_Y column by the Theta_Y column  
  }
  return(x)
}

Introduction

Rob Toonen asked me to take a look at this dataset sequenced and analyzed by ’Ale’alani Dudoit.

From Rob:

It is a zoanthid - Palythoa tuberculosa - that is most common in anthropogenically disturbed habitats. It seems like it could be a cool story, and she has plenty of SNPs or contigs to do whatever analysis you think would be most interesting to work up.

’Ale’a:

Mahalo for checking out our data Eric, would be great to try out Migrate-n and see what results come of it. A couple other folks in the lab had similar interesting fst/structure results with their data (corals and fish) where O’ahu is more genetically similar to Kure/Midway atoll than to neighboring islands. Would be interesting to see if we get any cool results as Rob mentioned with military transport b/w O’ahu and Midway during WWII.

Transfer files

’Ale’a placed all the files into a shared folder on OneDrive. To copy them to orbicella I used rclone, following these instructions to authenticate.

rclone copy -v  EDC_OneDrive:Documents/Research/Ptuberculosa/raw_reads ./

SNPs and Haplotypes from Stacks

Process RadTags and Rename Files

’Ale’a said:

It was an ezRAD protocol developed by the Tobo lab. The restriction enzymes used were both MboI and Sau3AI. The sequences have indexes removed but still need to be trimmed/ends cleaned.

So I will process the radtags to remove adapters and the sau3AI cut site (GATC, which is the same as MboI). This command cleans data removing reads with uncalled bases, low phred scores, rescue barcodes and rad-tags.


process_radtags -p ./raw/ -o ./cleaned/ --rescue --clean --quality --paired --filter_illumina --renz-1 sau3AI \
--adapter-1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--adapter-2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

724964572 total sequences 0 failed Illumina filtered reads (0.0%) 227835784 reads contained adapter sequence (31.4%) 0 barcode not found drops (0.0%) 28941251 low quality read drops (4.0%) 63374437 RAD cutsite not found drops (8.7%) 404813100 retained reads (55.8%)

Hmm… gotta rename the cleaned files too

cd cleaned
for f in *001.1.fq.gz
do
mv $f ${f/_R[12]_001.1.fq.gz/.1.fq.gz}
done

for f in *001.2.fq.gz
do
mv $f ${f/_R[12]_001.2.fq.gz/.2.fq.gz}
done

for f in *L005.1.fq.gz
do
mv $f ${f/_S*_L005.1.fq.gz/.1.fq.gz}
done

for f in *L005.2.fq.gz
do
mv $f ${f/_S*_L005.2.fq.gz/.2.fq.gz}
done

for f in *rem.1.fq.gz
do
mv $f ${f/_S*_L005_R1_001.rem.1.fq.gz/.rem.1.fq.gz}
done

for f in *rem.2.fq.gz
do
mv $f ${f/_S*_L005_R2_001.rem.2.fq.gz/.rem.2.fq.gz}
done

R80 Dataset fromt the Stacks Pipeline

I’m going to try to find the R80 parameter set (Paris, Stevens, and Catchen 2017), which is the set of stacks parameters (m, M and n) that yield the highest number of variant loci. I will run across 9 different parameter combinations as below.

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_332 --paired \
-m 3 -M 3 -n 2 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_333 --paired \
-m 3 -M 3 -n 3 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_334 --paired \
-m 3 -M 3 -n 4 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_365 --paired \
-m 3 -M 6 -n 5 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_366 --paired \
-m 3 -M 6 -n 6 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_367 --paired \
-m 3 -M 6 -n 7 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_398 --paired \
-m 3 -M 9 -n 8 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_399 --paired \
-m 3 -M 9 -n 9 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

denovo_map.pl --samples ./cleaned/ --popmap popmap --out-path ./stacks_output_3910 --paired \
-m 3 -M 9 -n 10 --min-samples-per-pop 80 -T 12 -X "populations: --fasta-samples" -X "populations: --fasta-loci"\ 
-X "populations: --write-random-snp" -X "populations: --vcf"

Determine R80 via this line from Rivera-Colon

cat populations.sumstats.tsv | grep -v '^#' | cut -f 1 | sort -n -u | wc -l

R80 <- read_csv("tables/R80.csv") %>% mutate(Parameter_Set = factor(Parameter_Set, levels = Parameter_Set, ordered=T))

## Rows: 9 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Parameter_Set
## dbl (4): m, M, n, Variant_Loci
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

r80plot <- R80 %>% ggplot(aes(x = Parameter_Set, y = Variant_Loci, group = 1)) + geom_point() + geom_path()

r80plot

#ggsave("figures/final_figures/R80.pdf",r80plot)

Find Symbiont Loci

Raul Gonzalez-Pech kindly provided a FASTA file of all Symbiodinium genomes onto our server. I am going to blast the R80 populations.loci.fa file against it, following along here.

First I need to create a database from Raul’s FASTA file

makeblastdb -in All_Symbiodiniacea_genomes.fasta -dbtype nucl

Building a new DB, current time: 06/05/2024 21:23:37 
New DB name:   /data/home/common_data/Symbiodiniaceae_genomes/Symbiodiniacea
New DB title:  All_Symbiodiniacea_genomes.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 1136861 sequences in 129.668 seconds.

blastn -query ../populations.loci.fa -db  /data/home/common_data/Symbiodiniaceae_genomes/All_Symbiodiniacea_genomes.fasta -outfmt 6 -out SNPquery.tsv

Filter the list to evalues < 10E-15, and write them out as a blacklist to be filtered out by populations

blast <- read_tsv("stacks/blast/Symbiont_blast.tsv") %>% filter(Evalue < 10e-15)

blacklist <- unique(blast$StacksLocus) %>% str_remove("CLocus_") %>% as.numeric()

length(blacklist)

write.csv(blacklist, "stacks/blast/Symbio_blacklist.txt", header = F, quote = F)

SNPs

Filter to create final SNP dataset

Filter in Populations

So 325 loci match to Symbiodinium.

I’ll use populations to create a VCF and Genepop output with populations that removes the symbiont loci, and otherwise filters such that loci are found in 50% of individuals, minimum allele count of 3, max observed heterozygosity is 0.5

populations -P ./ -O ./SNPs_mac3_ind50 -M ../popmap -R 50 --max-obs-het 0.5 -B ./blast/Symbio_blacklist.txt --min-mac 3 --write-random-snp --hwe --vcf --genepop --treemix --phylip 

Removed 2146561 loci that did not pass sample/population constraints from 2152778 loci.
Kept 6217 loci, composed of 2370236 sites; 62823 of those sites were filtered, 2896 variant sites remained.
Number of loci with PE contig: 6217.00 (100.0%);
  Mean length of loci: 371.25bp (stderr 1.46);
Number of loci with SE/PE overlap: 6216.00 (100.0%);
  Mean length of overlapping loci: 381.24bp (stderr 1.46); mean overlap: 31.00bp (stderr 0.00);
Mean genotyped sites per locus: 381.24bp (stderr 1.46).

Filter with snpR

I’m going to use Will Hemstrom’s new snpR package for filtering and basic stats. It attaches sample metadata as “facets” to the datasets

snps<-read_vcf("stacks/populations.snps.vcf")

labels <- read_tsv("stacks/labels.txt")

labels <- labels %>% mutate(region = case_when(island %in% c("Hawaii","Maui","Molokai","Oahu","Kauai") ~ "High",
                                      .default = "Low"      ))

sample.meta(snps)$pop <- str_replace_all(labels$island," ","_")
sample.meta(snps)$region <- labels$region

snps <- filter_snps(snps, hwe = 1e-6, min_loci = 0.5)

format_snps(snps, output = "structure", facets = "pop", outfile = "Ptuberculosa_filtered_SNPs.stru")
format_snps(snps, output = "genepop", facets = "pop", outfile = "Ptuberculosa_filtered_SNPs.gen")
format_snps(snps, output = "vcf", facets = "pop", outfile = "Ptuberculosa_filtered_SNPs.vcf", chr = "CHROM")
format_snps(snps, output = "ac", facets = "pop", outfile = "Ptuberculosa_filtered_SNPs_allele_counts.txt")
write_tsv(sample.meta(snps, "pop"), "popmap72.tsv")

Filtering loci. Starting loci: 2896 
Filtering non-biallelic loci...
    0 bad loci
Filtering non_polymorphic loci...
    0 bad loci
Filtering loci out of hwe...
     61  bad loci
    Ending loci: 2835 
Filtering individuals. Starting individuals: 90 
Filtering out individuals sequenced in few kept loci...
Re-calculating and adding facets.
    Ending individuals: 72 
Re-filtering loci...
Filtering non_polymorphic loci...
    2 bad loci
    Final loci count: 2833

63 loci were removed by the HWE filter, for a final total of 2833. 18 individuals were removed for having genotypes for less than 50% of loci for a final total of 72 individuals in the SNP dataset.

Haplotypes

Now I need to re-run populations to get haplotypes. More stringent requirements that the locus appear in 75% of all individuals across all populations. Write the fasta-samples-raw for fasta2genotype to operate on.

populations -P ../ -O ./ -M ../popmap -t 12 -p 9 -r 75 -H -B ./blast/Symbio_blacklist.txt  --hwe --fstats --fasta-loci --fasta-samples --vcf --genepop --structure --treemix --fasta-samples-raw

Fasta2Genotype

Paul Maier created this script to convert Stacks haplotypes to migrate format. I had to remove all the [individualName] tokens from the populations.samples.fa to get it to work.

python2 /Applications/migrate/migrate-4.4.4/fasta2genotype/fasta2genotype.py populations.samples2.fa NA ../popmap.tsv  populations.snps.vcf ptuberculosa.mig

###################################################################
###                                                             ###
###       Fasta2Genotype | Data Conversion | Version 1.10       ###
###                                                             ###
###                        Cite as follows:                     ###
###                                                             ###
###   Maier P.A., Vandergast A.G., Ostoja S.M., Aguilar A.,     ###
###   Bohonak A.J. (2019). Pleistocene glacial cycles drove     ###
###   lineage diversification and fusion in the Yosemite toad   ###
###   (Anaxyrus canorus). Evolution, in press.                  ###
###   https://www.doi.org/10.1111/evo.13868                     ###
###                                                             ###
###################################################################

Output type? [1] Migrate [2] Arlequin [3] DIYABC [4] LFMM [5] Phylip [6] G-Phocs [7] Treemix [8] Haplotype: 1
Remove restriction enzyme or adapter sequences? These may bias data. [1] Yes [2] No: 2
Coverage Cutoff (number reads for locus)? Use '0' to ignore coverage: 0
Remove monomorphic loci? [1] Yes [2] No: 2
Remove loci with excess heterozygosity? This can remove paralogs. [1] Yes [2] No: 1
Maximum heterozygosity cutoff for removing loci out of Hardy-Weinberg? 0.5
Filter for allele frequency? False alleles might bias data. [1] Yes [2] No: 2
Filter for missing genotypes? These might bias data. [1] Yes [2] No: 2
 
**************************************************************************************************************
***                                       ... BEGINNING CONVERSION ...                                     ***
**************************************************************************************************************
 
Cataloging loci...
Counting locus lengths...
Cataloging populations...
Counting gene copies...
Counting alleles for each locus...
Identifying loci with excess heterozygosity...
     Calculating observed heterozygosity and homozygosity...
     Calculating expected heterozygosity and homozygosity...
     Flagging loci with excess heterozygosity for removal...
     Removing loci...
Outputting migrate-n file...
*** DONE! ***

Then I manually re-ordered the populations to run from north to south in the output file, and moved it into the migrate folder.

Isolation by Distance

SNPs

Calculate Distance Matrices

Calculate Fst

Calculate Fst ($\theta_{WC}$) of Weir and Cockerham.

snps <- calc_pairwise_fst(snps,"pop",fst_over_one_minus_fst = F,verbose=T, 
                          boot = 1000, boot_par = 4)
fst <- get.snpR.stats(snps,facets = "pop",stats = "fst")$fst.matrix$pop$fst
fstps <- get.snpR.stats(snps,facets = "pop",stats = "fst")$fst.matrix$pop$p
#FDR correction to q-values
fstqs <- fstps[,2:9] %>% as.matrix %>% as.vector %>% p.adjust("BY") %>%
            matrix(ncol = 8) %>% bind_cols(fstps[,1]) %>% relocate(p1)
names(fstqs) <- names(fstps)

#add extra row and column to the data frame, then make it a square matrix
fst <- tibble(fst) %>% add_column(French_Frigate_Shoals = rep(NA,8),.before = 2) %>% add_row(.after = 8)
fst <- as.matrix(fst[,2:10])
fstqs <- tibble(fstqs) %>% add_column(French_Frigate_Shoals = rep(NA,8),.before = 2) %>% add_row(.after = 8)
fstqs <- as.matrix(fstqs[,2:10])
#set diagonal equal to zero
diag(fst) <- 0
diag(fstqs) <- 0
#make the matrix symmetrical
fst[lower.tri(fst)] <- t(fst)[lower.tri(fst)]
dimnames(fst)[[1]] <- dimnames(fst)[[2]]

fstqs[lower.tri(fstqs)] <- t(fstqs)[lower.tri(fstqs)]
dimnames(fstqs)[[1]] <- dimnames(fstqs)[[2]]
#reorder
#fst <- fst[c(1,6,7,8,3,2,5,9,4),c(1,6,7,8,3,2,5,9,4)] # this is south to north
fst <- fst[c(4,9,5,1,3,8,7,6,2),c(4,9,5,1,3,8,7,6,2)] # this is north to south
fst_dist <- as.dist(t(fst), diag = T)

fstqs <- fstqs[c(4,9,5,1,3,8,7,6,2),c(4,9,5,1,3,8,7,6,2)]
fstqs_dist <- as.dist(t(fstqs), diag = T)

fst_qs <- fst
fst_qs[upper.tri(fst_qs)] <- fstqs[upper.tri(fstqs)]

#fst_pairwise <- get.snpR.stats(snps,facets = "pop",stats = "fst")$weighted.means  %>% 
#                  separate(subfacet,into=c("from","to"), sep = "~")
#fst2 <- fst_pairwise %>% spread(to,weighted_mean_fst, fill = NA)

#write.csv(fst_qs, "IBD/Ptuberculosa_Fst_SNPs_qvaluesInUpper.csv",row.names=F,quote=F)
#write.csv(fst, "IBD/Ptuberculosa_Fst_SNPs.csv",row.names=F,quote=F)


#do this for 109 migrate loci!

Calculate Geographic Distance

localities <- read_csv("./figures/hypotheses_as_graphs/hawaii_vertices.csv") %>% 
                filter(sampled == "black") %>% map_df(rev)
gcdists <- as.dist(pointDistance(localities[,c(8,7)], lonlat=T)/1000)
attr(gcdists, "Labels") <- dimnames(fst)[[1]]

#write.csv(as.matrix(gcdists), "IBD/Ptuberculosa_pairwise_great_circle_distances_km.csv", row.names = F, quote=F)

Read Matrices Back In

#read in the two distance matrices that were created above
gen.dist.mat <- read.csv(file = "IBD/Ptuberculosa_Fst_SNPs.csv") 
#linearize
gen.dist.mat.og <- gen.dist.mat/(1-gen.dist.mat)                           
geo.dist.mat.og <- read.csv(file = "IBD/Ptuberculosa_pairwise_great_circle_distances_km.csv")

# convert to distance matrix
geo.dist <- as.dist(geo.dist.mat.og)
gen.dist <- as.dist(gen.dist.mat.og)

gen.df <- reshape2::melt(as.matrix(gen.dist), varnames = c("row", "col")) %>% 
                    filter(value %in% as.matrix(gen.dist)[lower.tri(gen.dist)]) %>% 
                    distinct(value, .keep_all=T)  %>% rename(fst = value)
geo.df <- reshape2::melt(as.matrix(geo.dist), varnames = c("row", "col")) %>% rename(distance = value)

dist.df <- gen.df %>% left_join(geo.df, by=c("row","col"))

Calculate linear model

First to get the slope $m$ we need to make a simple linear model, calculating significance with a Mantel test.

mantelt_withOahu <- mantel.randtest(gen.dist,geo.dist, nrepet = 10000)
lmodel_withOahu <- lm(fst ~ distance , dist.df)
slope <- round(lmodel_withOahu$coefficients[2],7)
mantelr <- round(mantelt_withOahu$obs, 2)
pvalue <- round(mantelt_withOahu$pvalue, 5)

ibd_plot <- ggplot(dist.df,aes(x=distance,y=fst)) +
                geom_point(shape = "triangle", size = 3) + 
                geom_smooth(method=lm) + xlab("Great Circle Distance (km)") + 
                ylab(expression(F["ST"]/1-F["ST"])) + 
                ylim(-0.025, 0.11) + xlim(0,3000) +
                geom_text(label = paste("m =", slope, 
                                        "; Mantel r =", mantelr,
                                        ", p =", pvalue ), 
                          mapping = aes(x = 800, y = 0.11), size = 4) +
                 theme(axis.text = element_text(size = rel(1.5)), 
                                axis.title = element_text(size = rel(1.5)))

ibd_plot

## Warning in geom_text(label = paste("m =", slope, "; Mantel r =", mantelr, : All aesthetics have length 1, but the data has 36 rows.
## ℹ Did you mean to use `annotate()`?

## `geom_smooth()` using formula = 'y ~ x'

#ggsave("figures/final_figures/final_ibd_plot_allpops.pdf",ibd_plot,  width = 7, height = 5, units = "in")

Remove Oahu

Because of initial findings with Structure, I’m going to take a closer look at those $F_{st}$ involving Oahu with gghighlight().

#remove Oahu

gen.dist.mat <- gen.dist.mat.og[-6,-6]
geo.dist.mat <- geo.dist.mat.og[-6,-6]
geo.dist <- as.dist(geo.dist.mat)
gen.dist <- as.dist(gen.dist.mat)
dist.df_withoutOahu <- dist.df %>% filter(row != "Oahu" & col != "Oahu")
lmodel_withoutOahu <- lm(fst ~ distance , dist.df_withoutOahu)
mantelt_withoutOahu <- mantel.randtest(gen.dist,geo.dist, nrepet = 10000)
mantelt_withoutOahu

## Monte-Carlo test
## Call: mantel.randtest(m1 = gen.dist, m2 = geo.dist, nrepet = 10000)
## 
## Observation: 0.8708939 
## 
## Based on 10000 replicates
## Simulated p-value: 0.00059994 
## Alternative hypothesis: greater 
## 
##      Std.Obs  Expectation     Variance 
## 4.5797324643 0.0003885757 0.0361295617

slope <- round(lmodel_withoutOahu$coefficients[2],7)
mantelr <- round(mantelt_withoutOahu$obs, 2)
pvalue <- round(mantelt_withoutOahu$pvalue, 5)

ibd_plot_oahu <- ggplot(dist.df,aes(x=distance,y=fst)) +
                geom_point(fill = "grey40", color = "red",shape = "triangle", size = 4) + 
                geom_smooth(method=lm, color = "red") + xlab("Great Circle Distance (km)") +
                geom_smooth(data = dist.df_withoutOahu, method = lm, color = "blue") +
                ylab(expression(F["ST"]/1-F["ST"])) + 
                gghighlight(row == "Oahu" | col == "Oahu", keep_scales = T,
                        unhighlighted_params = (aes(color = NULL, shape = "triangle", size = 3))) +
                ylim(-0.025, 0.11) + xlim(0,3000) +
                geom_text(label = paste("m =", slope, 
                                        "; Mantel r =", mantelr,
                                        ", p =", pvalue ), 
                         mapping = aes(x = 800, y = 0.11), size = 4) +
                 theme(axis.text = element_text(size = rel(1.5)), 
                                axis.title = element_text(size = rel(1.5)))




dist.df_O <- dist.df %>% filter(row == "Oahu" | col == "Oahu")
lmodel_Oahu <- lm(fst ~ distance , dist.df_O)
mantelt_Oahu <- mantel.randtest(dist(dist.df_O$fst), dist(dist.df_O$distance), nrepet=10000)
slope <- round(lmodel_Oahu$coefficients[2],7)
mantelr <- round(mantelt_Oahu$obs, 2)
pvalue <- round(mantelt_Oahu$pvalue, 5)

ibd_plot_oahu <- ibd_plot_oahu + geom_text(label = paste("m =", slope, 
                                        "; Mantel r =", mantelr,
                                        ", p =", pvalue ), 
                         mapping = aes(x = 800, y = 0.105), size = 4,
                         color = "red")

ibd_plot_oahu

## Warning in geom_text(label = paste("m =", slope, "; Mantel r =", mantelr, : All aesthetics have length 1, but the data has 8 rows.
## ℹ Did you mean to use `annotate()`?

## Warning in geom_text(label = paste("m =", slope, "; Mantel r =", mantelr, : All aesthetics have length 1, but the data has 8 rows.
## ℹ Did you mean to use `annotate()`?

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#ggsave("figures/final_figures/final_ibd_plot_noOahu.pdf",ibd_plot_Kure,  width = 7, height = 5, units = "in")

Wow! the negative slope of comparisons involving Oahu is almost exactly the same as the positive slope for the other islands.

Remove Kure

gen.dist.mat <- gen.dist.mat.og[-1,-1]
geo.dist.mat <- geo.dist.mat.og[-1,-1]
geo.dist <- as.dist(geo.dist.mat)
gen.dist <- as.dist(gen.dist.mat)
dist.df_withoutKure <- dist.df %>% filter(row != "Kure" & col != "Kure")
lmodel_withoutKure <- lm(fst ~ distance , dist.df_withoutKure)
mantelt_withoutKure <- mantel.randtest(gen.dist,geo.dist, nrepet = 10000)
mantelt_withoutKure

## Monte-Carlo test
## Call: mantel.randtest(m1 = gen.dist, m2 = geo.dist, nrepet = 10000)
## 
## Observation: 0.1357342 
## 
## Based on 10000 replicates
## Simulated p-value: 0.2407759 
## Alternative hypothesis: greater 
## 
##       Std.Obs   Expectation      Variance 
##  0.6512521753 -0.0001933121  0.0435628766

slope <- round(lmodel_withoutKure$coefficients[2],7)
mantelr <- round(mantelt_withoutKure$obs, 2)
pvalue <- round(mantelt_withoutKure$pvalue, 5)

ibd_plot_Kure <- ggplot(dist.df,aes(x=distance,y=fst)) +
                geom_point(fill = "grey40", color = "red",shape = "triangle", size = 4) + 
                geom_smooth(method=lm, color = "red") + xlab("Great Circle Distance (km)") +
                geom_smooth(data = dist.df_withoutKure, method = lm, color = "blue") +
                ylab(expression(F["ST"]/1-F["ST"])) + 
                gghighlight(row == "Kure" | col == "Kure", keep_scales = T,
                        unhighlighted_params = (aes(color = NULL, shape = "triangle", size = 3))) +
                ylim(-0.025, 0.11) + xlim(0,3000) +
                geom_text(label = paste("m =", slope, 
                                        "; Mantel r =", mantelr,
                                        ", p =", pvalue ), 
                         mapping = aes(x = 800, y = 0.11), size = 4) +
                 theme(axis.text = element_text(size = rel(1.5)), 
                                axis.title = element_text(size = rel(1.5)))




dist.df_O <- dist.df %>% filter(row == "Kure" | col == "Kure")
lmodel_Kure <- lm(fst ~ distance , dist.df_O)
mantelt_Kure <- mantel.randtest(dist(dist.df_O$fst), dist(dist.df_O$distance), nrepet=10000)
slope <- round(lmodel_Kure$coefficients[2],7)
mantelr <- round(mantelt_Kure$obs, 2)
pvalue <- round(mantelt_Kure$pvalue, 5)

ibd_plot_Kure <- ibd_plot_Kure + geom_text(label = paste("m =", slope, 
                                        "; Mantel r =", mantelr,
                                        ", p =", pvalue ), 
                         mapping = aes(x = 800, y = 0.105), size = 4,
                         color = "red")

ibd_plot_Kure

## Warning in geom_text(label = paste("m =", slope, "; Mantel r =", mantelr, : All aesthetics have length 1, but the data has 8 rows.
## ℹ Did you mean to use `annotate()`?

## Warning in geom_text(label = paste("m =", slope, "; Mantel r =", mantelr, : All aesthetics have length 1, but the data has 8 rows.
## ℹ Did you mean to use `annotate()`?

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#ggsave("figures/final_figures/final_ibd_plot_noKure.pdf",ibd_plot_Kure,  width = 7, height = 5, units = "in")

Structure

I wanted to take a closer look at how Structure (Pritchard, Stephens, and Donnelly 2000) sees these data. ’Ale’a had already tried with a location prior and admixture, but given low values of m below, I wanted to try without admixture and with an extra long MCMC chain (1 million steps)

Parameter settings

MAINPARAMS

#define BURNIN 250000
#define NUMREPS 1000000
#define INFILE structure_infile
#define OUTFILE structure_outfile
#define NUMINDS 72
#define NUMLOCI 2833
#define PLOIDY 2
#define MISSING -9
#define ONEROWPERIND 0
#define LABEL 1
#define POPDATA 1
#define POPFLAG 0
#define LOCDATA 0
#define PHENOTYPE 0
#define EXTRACOLS 0
#define MARKERNAMES 0
#define RECESSIVEALLELES 0
#define MAPDISTANCES 0
#define PHASED 0
#define PHASEINFO 0
#define MARKOVPHASE 0
#define NOTAMBIGUOUS -999

EXTRAPARAMS (just changed NOADMIX to 1 for the no admixture runs)

#define NOADMIX 0
#define LINKAGE 0
#define USEPOPINFO 0
#define LOCPRIOR 1
#define FREQSCORR 1
#define ONEFST 0
#define INFERALPHA 1
#define POPALPHAS 0
#define ALPHA 5
#define INFERLAMBDA 0
#define POPSPECIFICLAMBDA 0
#define LAMBDA 1
#define FPRIORMEAN 0.012
#define FPRIORSD 0.01
#define UNIFPRIORALPHA 1
#define ALPHAMAX 10
#define ALPHAPRIORA 1
#define ALPHAPRIORB 2
#define LOG10RMIN -4.0
#define LOG10RMAX 1.0
#define LOG10RPROPSD 0.1
#define LOG10RSTART -2.0
#define GENSBACK 2
#define MIGRPRIOR 0.01
#define PFROMPOPFLAGONLY 0
#define LOCISPOP 1
#define LOCPRIORINIT 1
#define MAXLOCPRIOR 20
#define PRINTNET     1
#define PRINTLAMBDA  1
#define PRINTQSUM    1
#define SITEBYSITE   0
#define PRINTQHAT    0
#define UPDATEFREQ   0
#define PRINTLIKES   0
#define INTERMEDSAVE 0
#define ECHODATA     0
#define ANCESTDIST   0
#define NUMBOXES   1000
#define ANCESTPINT 0.90
#define COMPUTEPROB 1
#define ADMBURNIN  0
#define ALPHAPROPSD 0.025
#define STARTATPOPINFO FALSE
#define RANDOMIZE 0
#define SEED 45196
#define METROFREQ 10
#define REPORTHITRATE 0

Structure Threader

I used Structure Threader to run many values of K in parallel for 5 replicates each. Command line for structure_threader

structure_threader run -K 10 -R 5 -t 45  -i ../structure_infile -o ./output -st /opt/conda/bin/structure --params ./mainparams 

#for usepop
structure_threader run -Klist 9 -R 5 -t 5  -i ../structure_infile -o ./output -st /opt/conda/bin/structure --params ./mainparams

Summarize results

I used pophelper to calculate optimal K and summarize the results.

Admixture, location prior

popmap <- read_tsv("filtered_data/popmap_SNPs_n72.tsv")

files <- list.files("structure/admix_loc/output/",pattern = "str",full.names = T)
admix_loc <- readQ(files) #read in the files
admix_loc_tab <- tabulateQ(admix_loc) #tabulate log-likelihoods
admix_loc_summ <- summariseQ(admix_loc_tab) #take the mean across replicates
evannoMethodStructure(admix_loc_summ,writetable = T,
                      returnplot = T,exportplot = T, exportpath ="structure/admix_loc",
                      xaxisbreaks = 1:10)

admix_loc_align <- alignK(admix_loc, type = "auto") #align k clusters so they are the same across replicates (like clumpak)

# summarize across replicates. This is not working because the sortQ function has 
# been broken by recent updates to R. My Ks are already sorted so I used the code from within the function thusly:
#admix_loc_merge <- mergeQ(admix_loc_align) 
        mergy<-function(x){
            return(list(Reduce(`+`, x)/length(x)))
        }
 
      labels <- summariseQ(tabulateQ(admix_loc_align, sorttable = FALSE))$k
      admix_loc_merged <- unlist(lapply(splitQ(admix_loc_align), mergy), recursive = FALSE)
      admix_loc_merged <- admix_loc_merged[-1]

#spit out a plot of all replicates to look at convergence
admix_loc_all_plots <- plotQ(admix_loc_align, imgoutput = "join", returnplot = T,
              exportplot = T, exportpath = "structure/admix_loc", imgtype = "pdf", 
              grplab = popmap["Name"], selgrp = "Name", barbordersize = 1, grplabsize = 0.9,
              showindlab = F, grplabangle = 45, grplabheight = 2, grplabpos = 0.6)
#alignK is not matching layers correctly, so I am doing this sort
admix_loc_merged2 <- alignK(as.qlist(admix_loc_merged), "across")
names(admix_loc_merged2[[1]]) <- names(admix_loc_merged2[[1]][c(2,1)])
names(admix_loc_merged2[[2]]) <- names(admix_loc_merged2[[2]][c(2,1,3)])
names(admix_loc_merged2[[3]]) <- names(admix_loc_merged2[[3]][c(2,1,3,4)])
names(admix_loc_merged2[[5]]) <- names(admix_loc_merged2[[5]][c(6,1, 3,2,5,4)])
names(admix_loc_merged2[[6]]) <- names(admix_loc_merged2[[6]][c(6,1,3,2,5,4,7)])
names(admix_loc_merged2[[7]]) <- names(admix_loc_merged2[[7]][c(8,1,4,2,5,6,7,3)])
#names(admix_loc_merged2[[7]]) <- names(admix_loc_merged2[[7]][c(6,1,3,2,5,4,7,8)])
names(admix_loc_merged2[[8]]) <- names(admix_loc_merged2[[8]][c(8,1,7,2,5,6,4,3,9)])
#names(admix_loc_merged2[[9]]) <- names(admix_loc_merged2[[9]][c(6,1,7,2,5,3,4,8,9,10)])
#final plot
admix_loc_p1 <- plotQ(admix_loc_merged2[-9], imgoutput = "sep", returnplot = T,
              exportplot = T, exportpath = "structure/admix_loc", imgtype = "pdf",
              grplab = popmap["Name"], selgrp = "Name", barbordersize = 1, grplabsize = 0.9,
              showindlab = F, grplabangle = 45, grplabheight = 2, grplabpos = 0.6,
              showyaxis = F, showticks = F)

Evanno Plot with admixture

Q Plot with admixture

No Admixture, Location Prior

files <- list.files("structure/noadmix_loc/output/",pattern = "str",full.names = T)
noadmix_loc <- readQ(files)
noadmix_loc_tab <- tabulateQ(noadmix_loc)
noadmix_loc_summ <- summariseQ(noadmix_loc_tab)
evannoMethodStructure(noadmix_loc_summ,writetable = T,returnplot = T,exportplot = T, exportpath ="structure/noadmix_loc",
                      xaxisbreaks = 1:10)

noadmix_loc_align <- alignK(noadmix_loc, type = "across")

#spit out a plot of all replicates to look at convergence
noadmix_loc_all_plots <- plotQ(noadmix_loc_align, imgoutput = "join", returnplot = T,
              exportplot = T, exportpath = "structure/noadmix_loc", imgtype = "pdf",
              grplab = popmap["Name"], selgrp = "Name", barbordersize = 1, grplabsize = 0.9,
              showindlab = F, grplabangle = 45, grplabheight = 2, grplabpos = 0.6)


#Merging the results here yields garbage because the indiviudals are 
# assigned to different clusters each time, so I 
# am just showing results for the first replicate of each
 
  noadmix_loc_merged <- noadmix_loc_align[c(6,11,16,21,26,31,36,41)]
#and the final plot
noadmix_loc_p1 <- plotQ(noadmix_loc_merged, imgoutput = "join", returnplot = T,
              exportplot = T, exportpath = "structure/noadmix_loc", imgtype = "pdf",
              grplab = popmap["Name"], selgrp = "Name", barbordersize = 1, grplabsize = 0.9,
              showindlab = F, grplabangle = 45, grplabheight = 2, grplabpos = 0.6,
              showyaxis = F, showticks = F)

Evanno Plot with no admixture

Q Plot with no admixture

ConStruct

ConStruct (Bradburd, Coop, and Ralph 2018) attempts to categorize genetically homogeneous groups, like Structure, but it considers continuous variation (Isolation by Distance) as it does so. I’m going to follow Gideon’s vignettes to perform the analysis.

Read and Convert Data

Empirical Data

From Structure to ConStruct format. Also create a geographic distance matrix, and a matrix of lat/longs for plotting.

conStruct.infile <- structure2conStruct(infile = "structure/structure_infile", 
                                        onerowperind= F,
                                        start.loci = 3,
                                        start.samples = 1,
                                        missing.datum = -9,
                                        outfile = "structure/construct_infile")

popmap <- read_tsv("popmap.tsv")

popmap <- popmap %>% right_join(tibble(IndID = dimnames(conStruct.infile)[[1]]))


localities <- read_csv("./figures/hypotheses_as_graphs/hawaii_vertices.csv") %>% 
                filter(sampled == "black") %>% map_df(rev)
localities$island[3] <- "Kamo"
localities$island[6] <- "O'ahu"

popmap <- popmap %>% left_join(localities, by = join_by(Name == island ))

latlongs <- matrix(c(popmap$long, popmap$lat), ncol = 2)

#calculate pairwise geographic distances between individuals
ind_geog_dists <- pointDistance(latlongs,latlongs,lonlat = T, allpairs = T)/1000


write.csv(latlongs, file = "conStruct/empirical/ind_coordinates.csv", quote = F, row.names = F)


write.csv(ind_geog_dists, file = "conStruct/empirical/ind_geog_matrix.csv", quote = F, row.names = F)

Run a spatial model

I wrote the following as an r-script to run on Orbicella in parallel at multiple values of K. Ran it for two chains, 40,000 iterations each.

# usage: Rscript run_ConStruct_K.R K T n.chains n.iter conStruct.data popmap popdists
# where K is the number of layers (i.e. Structure clusters) to be included in the analysis
# and T (or F) indicates whether to run the spatial model
# n.chains is the number of MCMC chains
# conStruct.data is an .RData object with conStruct formatted data
# latlongs.name is a csv file of lat longs of each individual
# ind.geog.dists.name is a csv file of individual pairwise distances

require(tidyverse)
require(conStruct)

args = commandArgs(trailingOnly=TRUE)


if (length(args) < 4) {
  stop("Must supply K, spatial(T/F), n.chains, n.iter in this order", call.=FALSE)
}

K <- args[1] %>% as.numeric()
spatial <- args[2] %>% as.logical()
n.chains <- args[3] %>% as.numeric()
n.iter <- args[4] %>% as.numeric()
conStruct.data.name <- args[5] %>% as.character()
latlongs.name <- args[6] %>% as.character()
ind.geog.dists.name <- args[7] %>% as.character()


if(spatial){
  prefix <- "sp"
}else{
  prefix <- "nonsp"
}

## Read and Convert Data
print("Reading In The Data and Converting it for use by ConStruct")
load(conStruct.data.name) 
latlongs<- read_csv(latlongs.name) %>% as.matrix()
ind_geog_dists <- read.csv(file = ind.geog.dists.name,header =T) %>% as.matrix()


## Run a spatial model
print(c("Running a conStruct model for K = ",K, "spatial = ", spatial))
spatial9 <- conStruct(spatial = T, K = K,
                      freqs = freqs,
                      geoDist = ind_geog_dists,
                      coords = latlongs,
                      prefix = paste(prefix,"K",K, sep = "_"),
                      n.iter = n.iter,
                      n.chains = n.chains)

Start ConStruct runs for K = 1 - 10 in different tmux sessions


for j in {1..10}
do
echo "now starting job $j"
tmux new -A -s cs$j -d
tmux send-keys -t cs$j "conda activate R" Enter
tmux send-keys -t cs$j "Rscript run_ConStruct_K.R.txt $j T 2 40000 construct_infile.RData ind_coordinates.csv ind_geog_matrix.csv | tee -a ConStructLog$j.txt" Enter
done

Calculate Layer Contributions

This is code from Gideon’s “How to compare conStruct model runs” vignette.

# Loop through output files generated by conStruct 
#   runs with K=1 through 10 and calculate the 
#   layer contributions for each layer in each run  

layer.contributions <- matrix(NA,nrow=10,ncol=10)
construct.admix.props <- list()
# load the conStruct.results.Robj and data.block.Robj
#   files saved at the end of a conStruct run
load("conStruct/sp_K_1_conStruct.results.Robj")
load("conStruct/sp_K_1_data.block.Robj")


# calculate layer contributions
layer.contributions[,1] <- c(calculate.layer.contribution(conStruct.results[[1]],data.block),rep(0,4))
layer.contributions[6,1] <- 0
tmp <- conStruct.results[[1]]$MAP$admix.proportions

admix.df <- data.frame(tmp)
names(admix.df) <- paste0("Cluster",1:dim(admix.df)[2])
construct.admix.props[[paste0("conStruct_K",1)]] <- admix.df

for(i in 2:10){
    # load the conStruct.results.Robj and data.block.Robj
    #   files saved at the end of a conStruct run
    load(sprintf("conStruct/sp_K_%s_conStruct.results.Robj",i))
    load(sprintf("conStruct/sp_K_%s_data.block.Robj",i))
    
    # match layers up across runs to keep plotting colors consistent
    #   for the same layers in different runs
    tmp.order <- match.layers.x.runs(tmp,conStruct.results[[1]]$MAP$admix.proportions)  

    # calculate layer contributions
    layer.contributions[,i] <- c(calculate.layer.contribution(conStruct.results=conStruct.results[[1]],
                                                             data.block=data.block,
                                                             layer.order=tmp.order),
                                    rep(0,10-i))
    tmp <- conStruct.results[[1]]$MAP$admix.proportions[,tmp.order]
    admix.df <- data.frame(tmp)
    names(admix.df) <- paste0("Cluster",1:dim(admix.df)[2])
   construct.admix.props[[paste0("conStruct_K",i)]] <- admix.df[,tmp.order]
}


structure_K_colors <- c("#2121D9","#9999FF","#DF0101", "#04B404","#FFFB23","#FF9326","#A945FF", "#0089B2", "#B26314")


layer_contributions_plot <- barplot(layer.contributions,
        col=structure_K_colors,
        las = 1,
        xlab="",
        ylab="layer contributions",
        names.arg=paste0("K = ",1:10))

layer_contributions_plot

##  [1]  0.7  1.9  3.1  4.3  5.5  6.7  7.9  9.1 10.3 11.5

#write.csv(layer.contributions, "tables/layer_contributions.csv", quote = F, row.names = T, col.names = 1:10)

Visualize Results

ConStruct automagically creates structure Q-plots and even pie charts for geography. But I’d like to get the plots into the same format as the Structure output.

construct.admix.props.q <-as.qlist(construct.admix.props)



conStruct_p1 <- plotQ(construct.admix.props.q[2:9], imgoutput = "join", returnplot = T,
              exportplot = T, exportpath = "conStruct/", imgtype = "pdf",
              grplab = popmap["Name"], selgrp = "Name", barbordersize = 1, grplabsize = 0.9,
              showindlab = F, grplabangle = 45, grplabheight = 2, grplabpos = 0.6,
              showyaxis = F, showticks = F)

conStruct_p1

ConStruct Q Plot

Effect of Missing Data

The SNP dataset unfortunately has a lot of missing data. Does this have any effect on ConStruct inference?

Use BCFtools to look at the distribution of missing data across samples

 bcftools query --list-samples Ptuberculosa_filtered_SNPs.vcf > samples.txt
 bcftools stats -S samples.txt Ptuberculosa_filtered_SNPs.vcf > bcftoolsoutput.txt
 plot-vcfstats bcftoolsoutput.txt

I then manually merged this output with admixture proportions for each layer for K = 4 from ConStruct.

bc <- read_csv("filtered_data/bcfstats/bcftools_conStruct.csv")

## Rows: 72 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): sampleID, population
## dbl (13): inRefHom, nonRefHom, InHets, InTransitions, InTransversions, Indel...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

l1<-lm(Construct_L1 ~ Missing_Percent, data = bc)
l2<-lm(Construct_L2 ~ Missing_Percent, data = bc)
l3<-lm(Construct_L3 ~ Missing_Percent, data = bc)
l4<-lm(Construct_L4 ~ Missing_Percent, data = bc)

bc %>% dplyr::select(Missing_Percent, Construct_L1, Construct_L2, Construct_L3, Construct_L4) %>% 
  gather(-Missing_Percent, key = "var", value = "value") %>% 
  ggplot(aes(x = Missing_Percent, y = value)) +
    facet_wrap(~ var, scales = "free") +
    geom_point() +
    geom_smooth(method = "lm") + ylab("Layer Admixture Proportion") + xlab("% Missing Data")

## `geom_smooth()` using formula = 'y ~ x'

None of the correlations are significant.

Migrate

Install Migrate

Install the parallel version of Migrate on Nautilus server. I hade to remove -fvectorize from line 100 of the makefile to get this to work.

curl https://peterbeerli.com/migrate-html5/download_version4/migrate-newest-src.tar.gz > migrate.tar.gz
tar -xvf migrate.tar.gz
cd migrate-4.4.4/
cd src
./configure
make mpis (for parallel running)
sudo cp migrate-n-mpi /usr/local/bin/migrate-n-mpi

Locus Statistics & Mutation Model

These RAD loci are much less variable than COI that I’m used to using. I need to figure out an overall mutation model to use with RAD loci. I can’t find any discussion of this issue on the Migrate Google Group, so I created one. I guess I’ll use modelTest in the phangorn package to see where that gets me.

I renamed the FASTA headers in populations.samples.fa with BBEdit:

Find: >CLocus_\d+_Sample_\d+_Locus_(\d+)_Allele_([01]) \[(.+)\]
Replace: >\3_L\1_A\2

Lets load in the data and calculate some statistics for each locus. Previously Migrate-n only implemented the F84 (=HKY) model, with two rates (Transistions and Transversions) and gamma distributed rate variability. The new v4 has a datamodel parameter that suggests it might take other models of molecular evolution, but the possible models are not listed in the documentation! So I will just fit an HKY model.

note the interface lists these possible models: 1 Jukes-Cantor model 2 Kimura 2-parameter model 3 Felsenstein 1981 (F81) model 4 Felsenstein 1984 (F84) model 5 Hasegawa-Kishino-Yano model 6 Tamura-Nei model

fastadata <- read.FASTA("/Volumes/GoogleDrive-103325533945714033178/My Drive/Ptuberculosa/stacks_output/allpops75_hwe/populations.samples_rename.fasta")

# have to read in the locus lengths, because this is the only way to make sure the stats
# and the migrate infile are ordered the same
migrate_lengths <- read_lines("/Volumes/GoogleDrive-103325533945714033178/My Drive/Ptuberculosa/stacks_output/ptuberculosa.mig", skip = 1, n_max=1) %>% 
  str_split("\\t", simplify = T) %>%  t(.) %>% tibble(length=.)

# Get locus names. fasta2genotype orders them alphabetically, so do this here too
locus_names <- str_extract(names(fastadata),"L\\d+") %>% unique() %>% sort()

stats <- tibble(locus = character(), 
                length = numeric(),
                segSites = numeric(),
                nHaps = numeric(),
                nucDiv = numeric(),
                ttRatio = numeric(),
                model = character(),
                gammaShape = numeric(),
                rate1 = numeric(), 
                rate2 = numeric(), 
                rate3 = numeric(),
                rate4 = numeric(),
                Q1= numeric(),
                Q2 = numeric(),
                Q3 = numeric(),
                )
                
for(l in locus_names){
  print(paste("Now Doing Locus", l, match(l, locus_names), "out of", length(locus_names)))
  locus_dnabin <- fastadata[str_which(names(fastadata),pattern = l)]
  # convert to package formats
  locus_dnabin <- as.matrix(locus_dnabin)
  locus_gtypes <- sequence2gtypes(locus_dnabin)
  locus_phy <- phyDat(locus_dnabin)
  #create a haplotype network .. to be continued
  haps <- haplotype(locus_dnabin)
  nhaps <- length(dimnames(haps)[[1]])
  #tcs <- haploNet(haps)
  #find parameters of HKY (F84) model
  modeltest <- modelTest(locus_phy, model = c("JC","K80", "F81", "HKY","TrN"), 
                         G = T, I = F, k = 4, mc.cores = 4)
  # pick out the best model. If multiple models are tied for best, pick the simplest one
  bestmodel <- modeltest$Model[which(modeltest$BIC == min(modeltest$BIC))][1]
  #open the object environment
  env <- attr(modeltest,"env")
  bestparams <- get(bestmodel, env)
  bestparams <- eval(bestparams, env=env)
  # use this code for v3, which only has F84 (HKY)
  #HKY <- modelTest(locus_phy, model = c("HKY"), 
  #                       G = T, I = F, k = 4)
  #env <- attr(HKY, "env")
  #HKYG <- get("HKY+G", env)
  #model <- eval(HKYG, env=env)
  # calculate TiTv Ratio
  ttratio <- TiTvRatio(locus_gtypes)
  
  stats <- bind_rows(stats, tibble(locus=l, 
                          length = length(locus_dnabin[1,]),
                          segSites = length(seg.sites(locus_dnabin)),
                          nHaps = length(dimnames(haps)[[1]]),
                          nucDiv = nuc.div(locus_dnabin),
                          ttRatio =  ttratio[3],
                          model = bestmodel,
                          gammaShape = bestparams$shape,
                          rate1 = bestparams$g[1],
                          rate2 = bestparams$g[2],
                          rate3 = bestparams$g[3],
                          rate4 = bestparams$g[4],
                          Q1 = sort(unique(bestparams$Q))[1],
                          Q2 = sort(unique(bestparams$Q))[2],
                          Q3 = sort(unique(bestparams$Q))[3]
                          ))
                         
                        
}

stats <- stats[which(stats$length %in% migrate_lengths$length),]
#setdiff(stats$length, as.numeric(migrate_lengths$length))
# write_csv(stats, "./migrate/run4_locus_statistics.csv")

Write a model block

Write a space delimited textfile that can be added to the migrate data file in the format that Peter suggested.

stats <- read_csv("./tables/migrate_locus_statistics.csv")

## Rows: 109 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): locus, model
## dbl (9): length, segSites, nHaps, nucDiv, ttRatio, gammaShape, rate1, Q1, Q2
## lgl (4): rate2, rate3, rate4, Q3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# write a space delimited textfile that can be added to the migrate data file
# following Peter's suggestion here:
#https://groups.google.com/g/migrate-support/c/XjV_4jZW4RI/m/HbRWoGY6AwAJ
migrate_mutation_models <- tibble(prefix = "#$",
                                 locus=1:length(stats$locus),
                                 type = "s",
                                 model = stats$model,
                                 q1 = stats$Q2,
                                 q2 = stats$Q3)

#write_delim(migrate_mutation_models,"./migrate/run4_modelblock.txt", na="", delim = " ")
stats

## # A tibble: 109 × 15
##    locus    length segSites nHaps    nucDiv ttRatio model gammaShape rate1 rate2
##    <chr>     <dbl>    <dbl> <dbl>     <dbl>   <dbl> <chr>      <dbl> <dbl> <lgl>
##  1 L10061      411        1     2 0.0000280   Inf   F81            1     1 NA   
##  2 L1096075    388        7     8 0.000234      6   K80            1     1 NA   
##  3 L11008      472        2     6 0             1   JC             1     1 NA   
##  4 L11074      468        2     3 0.0000491     0   JC             1     1 NA   
##  5 L11315      499        4     5 0.000162      1   F81            1     1 NA   
##  6 L11796      516        1     2 0.0000239   Inf   F81            1     1 NA   
##  7 L1180       496        5     6 0.000158      1.5 F81            1     1 NA   
##  8 L12220      337        2     4 0.0000346   Inf   F81            1     1 NA   
##  9 L1257       512        3     9 0             0.5 JC             1     1 NA   
## 10 L12728      618        1     2 0.0000180     0   JC             1     1 NA   
## # ℹ 99 more rows
## # ℹ 5 more variables: rate3 <lgl>, rate4 <lgl>, Q1 <dbl>, Q2 <dbl>, Q3 <lgl>

So we have a 16 invariant loci, and the mean overall transition:transversion ratio is 0.8073394. Mean gamma shape parameter is 1, which argues for only one rate.

Parmfile

I went through and compared my 3.x parmfile line by line to the default one generated by Migrate 4.4.4, and came up with this

################################################################################
# Parmfile for Migrate 4.4.4(git:v4-series-26-ge85c6ff)-June-1-2019 [do not remove these first TWO lines]
# generated automatically on
# Fri Dec 12 2022
menu=YES
nmlength=10
datatype=SequenceData
datamodel=HKY
ttratio=1.000000

freqs-from-data=YES 

seqerror-rate={0.0001,0.0001,0.0001,0.0001}
categories=1 rates=1: 1.000000 
prob-rates=1: 1.000000 
autocorrelation=NO
weights=NO
recover=NO
fast-likelihood=NO
inheritance-scalars={1.00000000000000000000}
haplotyping=YES:no-report
population-relabel={1 2 3 4 5 6 7 8 9 10}
infile=../../ptuberculosa.mig
random-seed=AUTO title= Palythoa tuberculosa - Hawaii
progress=YES
logfile=NO
print-data=NO
outfile=outfile.txt
pdf-outfile=outfile.pdf
pdf-terse=NO
use-M=YES
print-tree=NONE
mig-histogram=NO
skyline=NO theta=PRIOR:50
migration=PRIOR:10
rate=PRIOR:50
split=PRIOR:10
splitstd=PRIOR:10
mutation=CONSTANT
analyze-loci=A
divergence-distrib=S
custom-migration={
**00000000
***0000000
0***000000
00***00000
000***0000
0000***000
00000***00
000000***0
0000000***
00000000**
}
geo=NO

updatefreq=0.200000 0.200000 0.200000 0.200000 bayes-posteriorbins= 500 500
bayes-posteriormaxtype=TOTAL
bayes-file=YES:bayesfile
bayes-allfile=YES:bayesallfile
bayes-all-posteriors=YES:bayesallposterior
bayes-proposals= THETA METROPOLIS-HASTINGS Sampler
bayes-proposals= MIG SLICE Sampler
bayes-proposals= DIVERGENCE METROPOLIS-HASTINGS Sampler
bayes-proposals= DIVERGENCESTD METROPOLIS-HASTINGS Sampler
bayes-priors= THETA WEXPPRIOR: 0.0 0.01 0.1000000 0.01000 
bayes-priors= MIG WEXPPRIOR: 0.000100 100000.000000 1000000
bayes-priors= RATE * * UNIFORMPRIOR: 0.000000 10000000000.000000 1000000000.000000 
bayes-hyperpriors=NO
long-chains=1
long-inc=100
long-sample=10000
burn-in=2000  
auto-tune=YES:0.440000
assign=NO
heating=YES:1:{1.000000,1.500000,3.000000,1000000.000000}
heated-swap=YES
moving-steps=NO
gelman-convergence=No
replicate=NO
end

Test Run 1

Let’s see what happens! Started this run on Jan 22, at 10pm.

screen -S migrate_testrun

mpirun -np 16 ~/migrate-4.4.4/src/migrate-n-mpi parmfile

And it was finished the next morning! Prior on m was too wide, giving results like this:

Results

Wide Priors on m -pg 1 Wide Priors on m-pg3 No bueno.

Test Run 2

Revise m prior

Revised the m prior like so (I had no window value in the original!) Order of values is min, mean, maximum, proposal window.

bayes-priors= MIG WEXPPRIOR: 0.000100 1000.000000 10000 100

Panmixia

A few modifications to make a panmictic model

population-relabel={1 1 1 1 1 1 1 1 1 1}
#and
custom-migration={
*
}

Island Model

Change all parameter values to m for island model.

custom-migration={
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
mmmmmmmmmm
}

Minor github issue. Got a little hung up at this point because I accidentally committed some bayesallfiles that were hundreds of Mb, and of course github wouldn’t let me upload those. Thought I fixed it with these instructions. Then ended up trying git-filter-repo (installed via homebrew), then ended up restoring the whole thing from the github repository. Ouch. Then it wouldn’t push until I ran git prune

Results

Model Selection

Very nice! 😃 Going to need to modify my code to read multilocus output from migrate-n.

workingDir <- "./migrate/run2"

modelMarglikes <- harvest.model.likelihoods(workingDir = workingDir)

## [1] "./migrate/run2/island"
## [1] "./migrate/run2/panmixia"
## [1] "./migrate/run2/stepping.stone"

kable(modelMarglikes, format.args = list(digits = 5))

model	thermodynamic	bezier.corrected	harmonic
island	-71972	-71120	-71813
panmixia	-70400	-69492	-69655
stepping.stone	-65995	-65136	-65790

results <- bfcalcs(modelMarglikes)

kable(results)

model	thermodynamic	bezier.corrected	harmonic	lbf	choice	modelprob
island	-71971.81	-71120.19	-71812.53	-11967.82	3	0
panmixia	-70400.30	-69492.18	-69654.60	-8711.80	2	0
stepping.stone	-65995.43	-65136.28	-65789.54	0.00	1	1

Parameters

workingDir2 <- "/Volumes/GoogleDrive-103325533945714033178/My Drive/Ptuberculosa/stacks_output/migrate/testrun/run2"

winningModelDir <- file.path(workingDir2, "stepping.stone")

#this may take a minute or two
data <- read.table(file.path(winningModelDir,"bayesallfile"), header=T) 

#calculate migrants per generation
data.nm <- migrants.per.gen(data)

data %>% group_by(locus) %>% summarize()
# Split the whole list into the individual replicates, so that you will
# have a list of data frames
#data.list.1 <- split(data.nm,data$Replicate)

#split the data on locus
data.list.1 <- split(data.nm,data$Locus)
data.list.mcmc <- mcmc.list(lapply(data.list.1,mcmc))   
# Remove burnin if it wasn't removed in the run, where burnin is a proportion of the total.
# e.g. First 40% First 20,000 samples out of 50,000 or 10 million steps out of 25 million
#burnin <- 0.2
#data.list.1<-lapply(data.list.1,subset,subset=Steps>(burnin*max(data.list.1[[1]]$Steps)))

data.list.mcmc <- mcmc.list(data.list.1)
    
#convert each dataframe to an mcmc object and then convert the whole thing to an mcmc list

#collapse all replicates to one for purposes of highest posterior density
data.list.allinone <- mcmc(data=do.call("rbind",data.list.2)) 
    

summ <- summary(data.list.mcmc)
ess <- effectiveSize(data.list.mcmc)
gelman <- gelman.diag(data.list.mcmc,multivariate=F)
HPD <- HPDinterval(data.list.allinone)
    
#concatenate the stats
allstats<-cbind(summ$statistics[,1:2],HPD,ess,gelman$psrf)
#write.csv(allstats,file="codastats.csv")

Models

’Ale’a and I settled on the following initial models


#Panmixia (panmixia; 1 panmictic population)                        
population-relabel={1 1 1 1 1 1 1 1 1 1}                                
custom-migration=   *                           
                                
                                
                                
                                
                                
#n-Island (island; all populations same size, all exchange migrants)                                
population-relabel={1 2 3 4 5 6 7 8 9}                              
custom-migration=   
  Pop_Kure                  m   m   m   m   m   m   m 
    Pop_P&H                   m m   m   m   m   m   m
    Pop_MaroReef                m   m   m   m   m   m   m
    Pop_FFS                   m m   m   m   m   m   m
    Pop_Kauai                   m   m   m   m   m   m   m
    Pop_Oahu                    m   m   m   m   m   m   m
    Pop_Molokai               m m   m   m   m   m   m
    Pop_Maui                    m   m   m   m   m   m   m
    Pop_BigIsland               m   m   m   m   m   m   m
                                
                                
                                
                                
                                
                                
#Stepping-Stone (stepping.stone)                
population-relabel={1 2 3 4 5 6 7 8 9}                              
    Pop_Kure                    *   *   0   0   0   0   0   0   0
    Pop_P&H                   * *   0   0   0   0   0   0   0
    Pop_MaroReef                0   *   *   *   0   0   0   0   0
    Pop_FFS                   0 0   *   *   *   0   0   0   0
    Pop_Kauai                   0   0   0   *   *   *   0   0   0
    Pop_Oahu                    0   0   0   0   *   *   *   0   0
    Pop_Molokai               0 0   0   0   0   *   *   *   0
    Pop_Maui                    0   0   0   0   0   0   *   *   *
    Pop_BigIsland               0   0   0   0   0   0   0   *   *
                                
                                
                                
                                
                                
#Stepping-Stone with a Kure-Oahu connection (stepping.stone.KO)                             
population-relabel={1 2 3 4 5 6 7 8 9}                              
custom-migration=   
    Pop_Kure                    *   *   0   0   0   0   *   0   0
    Pop_P&H                   * *   0   0   0   0   0   0   0
    Pop_MaroReef                0   *   *   *   0   0   0   0   0
    Pop_FFS                   0 0   *   *   *   0   0   0   0
    Pop_Kauai                   0   0   0   *   *   *   0   0   0
    Pop_Oahu                    *   0   0   0   *   *   *   0   0
    Pop_Molokai               0 0   0   0   0   *   *   *   0
    Pop_Maui                    0   0   0   0   0   0   *   *   *
    Pop_BigIsland               0   0   0   0   0   0   0   *   *
                                
                                
                                
                                
#NWHI vs MHI    (NWHI_MHI; two panmictic regions)                   
population-relabel={1 1 1 1 2 2 2 2 2}                              
custom-migration=   
  Pop_NWHI                  *   *                   
    Pop_MHI                   * *                   
                                
                                
                                
                                
# Stepping Stone Breaks (stepping.stones.breaks; 
#  regional groups of Kure_PH vs MR_FFS vs MHI as lumped populations 
#  with stepping stone gene flow between them. )                                
population-relabel={1 1 2 2 3 4 5 6 7}                              
custom-migration=   
  Pop_Kure_PH                   *   *   0   0   0   0   0
    Pop_MaroReef_FFS                *   *   *   0   0   0   0
    Pop_Kauai                       0   *   *   *   0   0   0
    Pop_Oahu                        0   0   *   *   *   0   0
    Pop_Molokai                   0 0   0   *   *   *   0
    Pop_Maui                        0   0   0   0   *   *   *
    Pop_BigIsland                   0   0   0   0   0   *   *
                                
                                
                                
# Stepping Stone Breaks (stepping.stones.breaks; 
#  regional groups of Kure_PH vs MR_FFS vs MHI as lumped populations 
#  with stepping stone gene flow between them, and a Kure-Oahu connection.)                                 
population-relabel={1 1 2 2 3 4 5 6 7}                              
custom-migration=   
  Pop_Kure_PH                   *   *   0   *   0   0   0
    Pop_MaroReef_FFS                *   *   *   0   0   0   0
    Pop_Kauai                       0   *   *   *   0   0   0
    Pop_Oahu                        *   0   *   *   *   0   0
    Pop_Molokai                   0 0   0   *   *   *   0
    Pop_Maui                        0   0   0   0   *   *   *
    Pop_BigIsland                   0   0   0   0   0   *   *

I also turned menu = NO, and replicate=YES:3.

Gene Flow between Oahu and Kure

There is an intriguing result of nearly one-way gene flow from Oahu to Kure. Also, the stepping.stone.breaks.KO is much better than just stepping.stone.breaks. This suggests that a divergence parameter for these two populations is in order.

Gene Flow from Oahu to Kure

Gene Flow from Kure to Oahu

What about one way gene flow?

Stepping.stone.oneway

# (stepping.stone.oneway)                               
population-relabel={1 2 3 4 5 6 7 8 9}                              
custom-migration={
*   *   0   0   0   0   0   0   0
*   *   0   0   0   0   0   0   0
0   *   *   0   0   0   0   0   0
0   0   *   *   0   0   0   0   0
0   0   0   *   *   0   0   0   0
0   0   0   0   *   *   0   0   0
0   0   0   0   0   *   *   0   0
0   0   0   0   0   0   *   *   0
0   0   0   0   0   0   0   *   *
}

Adding divergence models

Peter says you can mix * with d parameters, and I tested it and it worked! Based on these results I’m going to add a model of one way ongoing migration from Oahu to Kure, and another model of divergence from Oahu to Kure about 80 years ago with no subsequent gene flow, and finally, divergence plus migration between Oahu to Kure.

stepping.stone.KO.D

Equilibrium Gene Flow, but divergence with gene flow from Oahu to Kure


#Stepping-Stone with a *one way* Kure-Oahu divergence with gene flow 
# (stepping.stone.KO.divmig)                                
population-relabel={1 2 3 4 5 6 7 8 9}                              
custom-migration=   
    Pop_Kure                    *   0   0   0   0   0   D   0   0
    Pop_P&H                   0 *   0   0   0   0   0   0   0
    Pop_MaroReef                0   *   *   *   0   0   0   0   0
    Pop_FFS                   0 0   *   *   *   0   0   0   0
    Pop_Kauai                   0   0   0   *   *   *   0   0   0
    Pop_Oahu                    0   0   0   0   *   *   *   0   0
    Pop_Molokai               0 0   0   0   0   *   *   *   0
    Pop_Maui                    0   0   0   0   0   0   *   *   *
    Pop_BigIsland               0   0   0   0   0   0   0   *   *

stepping.stone.KOd

Equilibrium Gene Flow, but divergence without gene flow from Oahu to Kure

#Stepping-Stone with a *one way* Kure-Oahu divergence without gene flow 
# (stepping.stone.KO.div)   
population-relabel={1 2 3 4 5 6 7 8 9}                              
custom-migration=   
    Pop_Kure                    *   0   0   0   0   0   d   0   0
    Pop_P&H                   0 *   0   0   0   0   0   0   0
    Pop_MaroReef                0   *   *   *   0   0   0   0   0
    Pop_FFS                   0 0   *   *   *   0   0   0   0
    Pop_Kauai                   0   0   0   *   *   *   0   0   0
    Pop_Oahu                    0   0   0   0   *   *   *   0   0
    Pop_Molokai               0 0   0   0   0   *   *   *   0
    Pop_Maui                    0   0   0   0   0   0   *   *   *
    Pop_BigIsland               0   0   0   0   0   0   0   *   *

stepping.stone.splitsD

North to South Divergence with migration

*   0   0   0   0   0   0   0   0
D   *   0   0   0   0   0   0   0
0   D   *   0   0   0   0   0   0
0   0   D   *   0   0   0   0   0
0   0   0   D   *   0   0   0   0
0   0   0   0   D   *   0   0   0
0   0   0   0   0   D   *   0   0
0   0   0   0   0   0   D   *   0
0   0   0   0   0   0   0   D   *

stepping.stone.splits.d

North to South Divergence without migration

*   0   0   0   0   0   0   0   0
d   *   0   0   0   0   0   0   0
0   d   *   0   0   0   0   0   0
0   0   d   *   0   0   0   0   0
0   0   0   d   *   0   0   0   0
0   0   0   0   d   *   0   0   0
0   0   0   0   0   d   *   0   0
0   0   0   0   0   0   d   *   0
0   0   0   0   0   0   0   d   *

stepping.stone.splitsD.KOsplitD

Here, population 1 isn’t born until it splits from population 6. I’m having trouble setting an individual prior for a very young divergence from 6 to 1- apparently I found a bug- Peter is fixing it.

*   0   0   0   0   d   0   0   0
0   *   0   0   0   0   0   0   0
0   D   *   0   0   0   0   0   0
0   0   D   *   0   0   0   0   0
0   0   0   D   *   0   0   0   0
0   0   0   0   D   *   0   0   0
0   0   0   0   0   D   *   0   0
0   0   0   0   0   0   D   *   0
0   0   0   0   0   0   0   D   *

Priors

And we need to add reasonable priors on divergence time to match the foundation of Naval Air Station Midway in 1941, with a standard deviation of say 20 years because it was placed under Navy control in 1903. Splitting times $\tau$ in Migrate are in units of $\mu$ x generations. I will assume a nuclear mutation rate of 1e-9 and a generation time of 1 year. Peter confirmed

#take the mean of the modes from one replicate of the stepping.stone.KO model
#meantheta <- mean(0.00570+0.00450+0.00450+0.00490+0.00630+0.00510+0.00570+0.00510+0.00590)
mu <- 1e-9
generation_time <- 1

mean_split <- 80*mu/generation_time
std_split <- 20*mu/generation_time
max_split <- 163 * mu/generation_time

80 years ago is 8^{-8}, and a max of 163 years (Midway discovered in 1859) which is 1.63^{-7}.

It turns out that setting individual priors on divergence isn’t working right now, so I’m just going to have to set a wide prior if there are other d or D parameters and only can set the specific prior if the Kure Oahu split is the only split in the model.

This results in adding the following two lines to the parmfiles with only the Kure Oahu split.

bayes-priors= SPLIT * * WEXPPRIOR: 0.0 8e-8 1.63e-7 1.63e-8 
bayes-priors= SPLITSTD * * WEXPPRIOR: 0.0 8e-8 1.63e-7 1.63e-8

Unfortunately the prior above was resulting in error messages.

[1]    63427 illegal hardware instruction  /Applications/migrate/migrate-4.4.4/migrate-n

From the output, it looks like it can’t consider priors smaller than 1e-6. But even using 1e-5 or 1-e4 results in errors.

So all will use this wider prior. Also, I am using exponentially distributed priors, so the SPLITSTD is not going to be used.

bayes-priors= SPLIT * *  WEXPPRIOR: 0.0 0.001 0.1000000 0.01000
bayes-priors= SPLITSTD * *  WEXPPRIOR: 0.0 0.001 0.1000000 0.01000

Final Parmfile

Here’s the parmfile that I used for Run 4, with of course the various models-specific settings changed for each model.

################################################################################
# Parmfile for Migrate 4.4.4(git:v4-series-26-ge85c6ff)-June-1-2019 [do not remove these first TWO lines]
# generated automatically on
# Fri Dec 12 2022
menu=NO
nmlength=10
datatype=SequenceData
datamodel=HKY
ttratio=1.000000

freqs-from-data=YES 

seqerror-rate={0.0001,0.0001,0.0001,0.0001}
categories=1 #no categories file specified
rates=1: 1.000000 
prob-rates=1: 1.000000 
autocorrelation=NO
weights=NO
recover=NO
fast-likelihood=NO
inheritance-scalars={1.00000000000000000000}
haplotyping=YES:no-report
population-relabel={1 2 3 4 5 6 7 8 9}
infile=../../../ptuberculosa.mig
random-seed=AUTO #OWN:410568459
title= Palythoa tuberculosa - Hawaii
progress=YES
logfile=YES:logfile.txt
print-data=NO
outfile=outfile.txt
pdf-outfile=outfile.pdf
pdf-terse=YES
use-M=YES
print-tree=NONE
mig-histogram=MIGRATIONEVENTSONLY
skyline=NO #needs mig-histogram=ALL:...
theta=PRIOR:50
migration=PRIOR:10
rate=PRIOR:50
split=PRIOR:10
splitstd=PRIOR:10
mutation=CONSTANT
analyze-loci=A
divergence-distrib=E
custom-migration={
*   0   0   0   0   d   0   0   0
0   *   *   0   0   0   0   0   0
0   *   *   *   0   0   0   0   0
0   0   *   *   *   0   0   0   0
0   0   0   *   *   *   0   0   0
0   0   0   0   *   *   *   0   0
0   0   0   0   0   *   *   *   0
0   0   0   0   0   0   *   *   *
0   0   0   0   0   0   0   *   *
}
geo=NO

updatefreq=0.200000 0.200000 0.200000 0.200000 #tree, parameter haplotype, timeparam updates
bayes-posteriorbins= 2000 2000
bayes-posteriormaxtype=TOTAL
bayes-file=YES:bayesfile
bayes-allfile=YES:bayesallfile
bayes-all-posteriors=YES:bayesallposterior
bayes-proposals= THETA METROPOLIS-HASTINGS Sampler
bayes-proposals= MIG SLICE Sampler
bayes-proposals= DIVERGENCE METROPOLIS-HASTINGS Sampler
bayes-proposals= DIVERGENCESTD METROPOLIS-HASTINGS Sampler
bayes-priors= THETA WEXPPRIOR: 0.0 0.001 0.1000000 0.01000 
bayes-priors= MIG WEXPPRIOR: 0.000100 1000.000000 10000 100
bayes-priors= SPLIT * *  WEXPPRIOR: 0.0 0.001 0.1000000 0.01000
bayes-priors= SPLITSTD * *  WEXPPRIOR: 0.0 0.001 0.1000000 0.01000
bayes-hyperpriors=NO
long-chains=1
long-inc=100
long-sample=30000
burn-in=10000
auto-tune=YES:0.440000
assign=NO
heating=YES:1:{1.000000,1.500000,3.000000,1000000.000000}
heated-swap=YES
moving-steps=NO
gelman-convergence=No
replicate=NO
end

Copy It Up

Copy it up, and then make a total of 10 replicate copies of the folder.

scp -r ./run4.1 ecrandall@nautilus.psu.edu:./Ptuberculosa/run4.1    
mkdir rep1
mv step* rep1
cp ../run4/ptuberculosa.mig ./ptuberculosa.mig
cp ../run4/mpi_run_migrate-n.sh ./mpi_run_migrate-n.sh
cd rep1
#once on server, copy it 9 times
for a in $(seq 2 10); do cp -r rep1 rep$a; done

And Back down


rsync -av -e ssh --exclude='bayes*'  ecrandall@nautilus.psu.edu:~/Ptuberculosa/run4.5 ./

rsync -av -e ssh --exclude='bayes*' --exclude="*.pdf"  eric@moneta.tobolab.org:~/Palythoa/run4.6 ./run4.6

# then copied them into the run4 folder with:
# the exclude statement excludes stepping.stone.KOd and stepping.stone.KO.D which didn't complete.
rsync -av --progress run4.5/ run4/ --exclude '*KO[\.d]*'  

rsync -av --progress run4.6/ run4/

Bash Script

So we will do 10 replicates of 3 replicates. This will start at r1, and run all models for that before moving on. Pretty sure this will finish one whole model before moving on to the next one (since all threads are being used for different loci)

#!
for r in */
    do
        cd $r
        echo $r
        date
        date > date.txt
            for m in */
              do
                cd $m
                  date > date.txt
                  echo $m
                  date
                  mpirun -np 32 ~/migrate-4.4.4/src/migrate-n-mpi parmfile
                  sleep 1
                cd ..
              done
        cd ..
    done

Results

Model Marginal Likelihoods

runDir <- "/Users/eric/Library/CloudStorage/OneDrive-ThePennsylvaniaStateUniversity/Documents/Research/Ptuberculosa/migrate/run4"
#runDir <-"/Volumes/GoogleDrive-103325533945714033178/My Drive/ptuberculosa/migrate/run4"

likelist <- list()
for(r in 1:10){
  rep = paste0("rep",r)
  print(rep)
  likelist[[rep]] <- harvest.model.likelihoods(workingDir=file.path(runDir,rep))
}

# Model selection for each replicate...
likelist %>% map(bfcalcs)

The final model marginal likelihood estimates based on the mean of 10 replicates:

like.df <-  likelist %>% bind_rows() %>% group_by(model)

means <- like.df %>% summarize(bezier.corrected = mean(bezier.corrected))
  

final_model <- bfcalcs(means)

final_model

##                              model bezier.corrected       lbf choice modelprob
## 1                         NWHI_MHI        -72281.30 -35630.64      8         0
## 2                           island        -75311.81 -41691.66      9         0
## 3                         panmixia        -72172.88 -35413.80      7         0
## 4                   stepping.stone        -71256.23 -33580.50      6         0
## 5                stepping.stone.KO        -69254.51 -29577.06      3         0
## 6              stepping.stone.KO.D        -55474.16  -2016.36      2         0
## 7               stepping.stone.KOd        -54465.98      0.00      1         1
## 8            stepping.stone.breaks        -71110.23 -33288.51      5         0
## 9         stepping.stone.breaks.KO        -70307.17 -31682.39      4         0
## 10         stepping.stone.splits.d        -77896.08 -46860.21     12         0
## 11          stepping.stone.splitsD        -77091.94 -45251.93     11         0
## 12 stepping.stone.splitsD.KOsplitd        -76884.10 -44836.24     10         0

Figures

And some figures summarizing all this. The two best models are equilibrium models across the archipelago, with recent divergence from Oahu to Kure. The best one is divergence without gene flow, followed by divergence with gene flow. The three models with divergences are much worse than those without. Difference between stepping.stone and stepping.stone.KO is very significant.

models <- c("N-Island", "NWHI vs. MHI"," Panmixia", "SS: Islands",
            "SS: Regional Breaks", "SS: Regional Breaks, KO 2way",
            "SS: Islands, OK 2way", 
            "SS: Islands, O2K Colonization with GF",
            "SS: Islands, O2K Colonization without GF",
            "Seq Div: No GF","Seq Div: GF",
            "Seq Div: GF, O2K Colonization without GF")

model_letters <- c("B", "C"," A", "D","G", "H",
                    "F", "L", "M", "I","J","K")
model_letters <- rep(model_letters,10)

likesPlot <- likelist %>% bind_rows() %>% 
              group_by(model) %>% 
              ggplot(mapping = aes(x=model, y = bezier.corrected)) +
              geom_violin(draw_quantiles = 0.5) +
              theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
              scale_x_discrete(labels = models) +
              labs(x = "Metapopulation Model", 
                   y = "Bezier Corrected Marginal Likelihood")
likesPlot

likesPlot_letters <- likelist %>% bind_rows() %>% 
                      bind_cols(model_letters=model_letters) %>% 
                      group_by(model_letters) %>% 
                      ggplot(mapping = aes(x=model_letters, y = bezier.corrected)) +
                      geom_violin(draw_quantiles = 0.5) +
                      labs(x = "Metapopulation Model", 
                      y = "Bezier Corrected Marginal Likelihood")

#ggsave("figures/final_figures/MarginalLikelihoodsPlot.pdf", likesPlot_letters)

Rerun the three top models

Sigh. I figured out that I had a ridiculously high Ti:Tv rate ($\kappa$) for 3 loci that had K80 models. This might have affected the summed marginal likelihood of each model just a bit, and definitely could affect parameter estimation, so I am going to rerun the three top models (stepping.stone.KOd, stepping.stone.KO.D, stepping.stone.KO) using a new version of migrate that doesn’t have bayesfile visualization issues, and with the ti:tv ratio fixed to 6 for those three loci.

Final Model Selection Results

Model Marginal Likelihoods

runDir <-"/Users/eric/Library/CloudStorage/OneDrive-ThePennsylvaniaStateUniversity/Documents/Research/Ptuberculosa/migrate/run4.7"

likelist2 <- list()
for(r in 1:10){
  rep = paste0("rep",r)
  print(rep)
  likelist2[[rep]] <- harvest.model.likelihoods(workingDir=file.path(runDir,rep))
}

# Model selection for each replicate...
likelist2 %>% map(bfcalcs)

The final model marginal likelihood estimates based on the mean of 10 replicates:

like2.df <-  likelist2 %>% bind_rows() %>% group_by(model)

means2 <- like2.df %>% summarize(bezier.corrected = mean(bezier.corrected))
  

final_model2 <- bfcalcs(means2)

final_model2

##                 model bezier.corrected        lbf choice    modelprob
## 1   stepping.stone.KO        -65211.06 -34149.632      3 0.000000e+00
## 2 stepping.stone.KO.D        -48214.14   -155.794      2 1.478301e-34
## 3  stepping.stone.KOd        -48136.25      0.000      1 1.000000e+00

T-Test

The two best models are equilibrium models across the archipelago, with recent divergence from Oahu to Kure. The best one is divergence without gene flow, followed by divergence with gene flow. The difference between the top two models is significant. 😃

top.choice <- final_model2$model[which(final_model2$choice ==1)]
second.choice <- final_model2$model[which(final_model2$choice ==2)]

TS <- permTS(like2.df$bezier.corrected[which(like2.df$model == top.choice)],
       like2.df$bezier.corrected[which(like2.df$model == second.choice)],
       alternative = "greater")

TS

Figures

models <- c("SS: Islands, O2K Colonization with GF", 
            "SS: Islands, O2K Colonization without GF")

model_letters2 <- c("L","M")

likesPlot2 <- likelist2 %>% bind_rows() %>% filter(model != "stepping.stone.KO") %>% 
              group_by(model) %>% 
              ggplot(mapping = aes(x=model, y = bezier.corrected)) +
              geom_violin(draw_quantiles = 0.5) +
              theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
              scale_x_discrete(labels = models) +
              labs(x = "Metapopulation Model", 
                   y = "Bezier Corrected Marginal Likelihood") +
              ylim(-50000,-45000)
likesPlot2

likesPlot2_letters <- likelist2 %>% bind_rows() %>% filter(model != "stepping.stone.KO") %>%
              group_by(model) %>% 
              ggplot(mapping = aes(x=model, y = bezier.corrected)) +
              geom_violin(draw_quantiles = 0.5) +
              scale_x_discrete(labels = model_letters2) +
              labs(x = "Metapopulation Model", 
                   y = "Bezier Corrected Marginal Likelihood") +
              ylim(-50000,-45000)

Model Parameter Estimates

Because I set such wide priors on gene flow for marine species, the standard PDF output does a poor job of binning the data, such that even though an equilibrium model of gene flow is the best model, Migrate is telling me that all migration parameters have a mode of 2.5 and 0 at the 25th percentile. It would be nice if I could look at the posterior calculated across all loci myself, but it is unclear how Migrate does this exactly. I inquired about this to Peter, and he got back to me with this method for summing the posterior over all loci and some Python code as a demo. See email thread “correctly combining over loci using the bayesallfile” from November 2022.

create a histogram for each locus

log the histogram (there will be issues with zero value cells) *. migrate smoothes each locus histogram

subtract the log(prior)

sum all log histograms for each locus

add the log prior back

convert to non-log

adjust so that the sum of the new histogram sums to 1.0

Read in the posteriors and make a labeling key

#
Dir <- "/Users/eric/Library/CloudStorage/OneDrive-ThePennsylvaniaStateUniversity/Documents/Research/Ptuberculosa/migrate/run4.7"
      
winningModel <- "stepping.stone.KO.D"

posteriors <- list()

#read in the posterior for one replicate this may take 5 minutes
for(r in 1:2){
  print(paste("Loading Rep",r))
  posteriors[[r]] <- read.table(file.path(Dir,paste0("rep",r),winningModel,
                                      "bayesallfile"), header=T) 
}

## [1] "Loading Rep 1"
## [1] "Loading Rep 2"

#combine into one, and then keep only the first 2 reps (based on diagnostics below)
posteriors12 <- posteriors %>% bind_rows(.id = "rep") %>% 
                                filter(rep <= 2) %>% migrants.per.gen()

posteriors12_long <- posteriors12 %>% pivot_longer(starts_with(c("Theta_", "M_", 
                                                           "Nm_","D_","d_")),
                                                    names_to = "parameter",
                                                    values_to = "value") 


# read in a population key to translate numbers into names in figures
popkey <- read_csv("filtered_data/popkey.csv")

## Rows: 9 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Pop, Hawaiian_Pop
## dbl (1): Index
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

names(popkey$Pop) <- popkey$Index
#make a new labeling key
posterior_parameters <- names(posteriors12) %>% str_subset("Theta|M_|Nm_|D_|d_")

parameter_key <- posterior_parameters %>% 
                                  str_replace("(\\d)_(\\d)", "\\1 to \\2") %>%
                                  str_replace("M_", "m/mu " ) %>% 
                                  str_replace("Theta_", "theta ") %>% 
                                  str_replace_all(popkey$Pop) %>% 
                                  str_replace("Nm_", "4Nem ") 
                                  

names(parameter_key) <- posterior_parameters

Check Convergence

# posteriors.grouped <- posteriors %>% 
#                       select(all_of(c("Locus",
#                                         "Steps",
#                                         posterior_parameters))) %>% 
#                       group_by(rep)

# posterior.reps <- group_split(posterior.grouped)
# posterior.key <- group_keys(posterior.grouped)

#thin by taking every other step
posteriors2 <- lapply(posteriors,filter,Steps%%500==0)
# select parameters only
posteriors2 <- lapply(posteriors2, select, starts_with(c("Theta_", "M_", 
                                                          "Nm_","D_","d_")))

                        
posteriors.mcmc <- mcmc.list(lapply(posteriors2,mcmc,thin=500))
posterior.ggs <- ggs(posteriors.mcmc, burnin = F)
effectiveSize(posteriors2[[1]])
#ggmcmc(posterior.ggs, param_page = 4)

Effective Sizes

Effective sizes are good except for M_3_4.

effectiveSize(posteriors2[[1]]) (first replicate)

     Theta_1      Theta_2      Theta_3      Theta_4      Theta_5      Theta_6      Theta_7      Theta_8 
127773.10928  60095.43078  58552.43846  71943.00597  77691.51114  61662.55359  45230.02725  64583.65543 
     Theta_9        M_6_1        M_3_2        M_2_3        M_4_3        M_3_4        M_5_4        M_4_5 
 79323.26808 147192.07123  67813.55644    448.34109   1473.22013     68.48918   3983.63608  39822.63084 
       M_6_5        M_5_6        M_7_6        M_6_7        M_8_7        M_7_8        M_9_8        M_8_9 
140352.48682 316433.45752 333193.44720  63909.27617  63612.89938 653891.00000  38723.33119 441357.70803 
       D_6_1 
202634.81598

effectiveSize(posteriors.mcmc[c(1,2)])
  Theta_1   Theta_2   Theta_3   Theta_4   Theta_5   Theta_6   Theta_7   Theta_8   Theta_9     M_6_1 
268101.80 175117.21  86454.19  84954.50  91454.47 134416.13 127731.50 110109.44 174394.67 313435.43 
    M_3_2     M_2_3     M_4_3     M_3_4     M_5_4     M_4_5     M_6_5     M_5_6     M_7_6     M_6_7 
674097.79 236556.17 421717.30 218933.72 390887.59 272926.62 585871.37 589532.54 613794.87 495933.81 
    M_8_7     M_7_8     M_9_8     M_8_9     D_6_1 
421019.97 667326.66  39515.35 531464.72 415678.34

Gelman-Rubin Diagnostic

Gelman-Rubin diagnostic is a bit high for M_ parameters, but if I log transform them, all Gelman diagnostics are below 1.2.


gelman.diag(posteriors.mcmc)

Potential scale reduction factors:

        Point est. Upper C.I.
Theta_1       1.00       1.00
Theta_2       1.01       1.02
Theta_3       1.00       1.01
Theta_4       1.00       1.01
Theta_5       1.00       1.00
Theta_6       1.00       1.01
Theta_7       1.01       1.02
Theta_8       1.00       1.00
Theta_9       1.01       1.01
M_6_1         1.00       1.00
M_3_2         1.20       1.25
M_2_3         1.29       1.34
M_4_3         1.28       1.34
M_3_4         1.29       1.82
M_5_4         1.12       1.15
M_4_5         1.06       1.06
M_6_5         1.04       1.04
M_5_6         1.05       1.05
M_7_6         1.01       1.01
M_6_7         1.20       1.20
M_8_7         1.01       1.01
M_7_8         1.17       1.17
M_9_8         1.24       1.40
M_8_9         1.28       1.30
D_6_1         1.00       1.00

Multivariate psrf

1.03

gelman.diag(posteriors.mcmc, transform = T)
Potential scale reduction factors:

        Point est. Upper C.I.
Theta_1       1.00       1.00
Theta_2       1.00       1.01
Theta_3       1.00       1.00
Theta_4       1.00       1.01
Theta_5       1.00       1.00
Theta_6       1.00       1.01
Theta_7       1.00       1.01
Theta_8       1.00       1.00
Theta_9       1.00       1.01
M_6_1         1.00       1.00
M_3_2         1.01       1.03
M_2_3         1.01       1.03
M_4_3         1.01       1.02
M_3_4         1.02       1.04
M_5_4         1.01       1.02
M_4_5         1.00       1.01
M_6_5         1.00       1.01
M_5_6         1.01       1.02
M_7_6         1.00       1.01
M_6_7         1.00       1.01
M_8_7         1.00       1.00
M_7_8         1.01       1.01
M_9_8         1.01       1.01
M_8_9         1.00       1.01
D_6_1         1.00       1.00

Multivariate psrf

1.02

Theta plots

Here are plots of each parameter posterior, with each colored line being the posterior from one of 106 loci. Next thing I have to do is code up the summation described by Peter above.

$\theta=4N_e\mu$

Theta_plot <- posteriors12_long %>%  
              filter(str_detect(parameter, "Theta_")) %>%  
              ggplot() + 
              geom_density(mapping = aes(x = value, 
                    group=Locus, color =Locus), weight = 0.5) + 
                    #scale_x_log10(limits=c(1e-4,0.1)) +
                    facet_wrap(~parameter, 
                    scales = "free_y",
                    labeller = labeller(parameter= parameter_key), 
                           dir = "h",nrow=2) +
                    theme(axis.text.x = element_text(size = rel(0.6)))

Theta_plot

Migration plots

Proportion of migrants relative to mutation rate $\frac{m}{\mu}$

M_plot <- posteriors12_long %>%  
          filter(str_detect(parameter, "M_")) %>%  
          ggplot() + 
          geom_density(mapping = aes(x = value, 
                       group=Locus, color =Locus), weight = 0.5) + 
          scale_x_log10(limits=c(1e-4,1000)) + 
          facet_wrap(~parameter, scales = "free_y", 
                    labeller = labeller(parameter= parameter_key), 
                    dir = "v",nrow=2) +
          theme(axis.text.x = element_text(size = rel(0.6)))
M_plot

Gene Flow plots

Migrants per generation $4N_em$

Nm_plot <- posteriors12_long %>%  
          filter(str_detect(parameter, "Nm_")) %>%  
          ggplot() + 
          geom_density(mapping = aes(x = value, group=Locus, color
                                     =Locus), weight = 0.5) + 
                    scale_x_log10(limits=c(1e-6,10)) + 
                    facet_wrap(~parameter, scales = "free_y",
                    labeller = labeller(parameter= parameter_key), 
                    dir = "v",nrow=2) +
           theme(axis.text.x = element_text(size = rel(0.6)))

Nm_plot

Sum over loci

Priors

Here are the priors I’m using, as a reminder.

bayes-priors= THETA WEXPPRIOR: 0.0 0.001 0.1000000 0.01000 
bayes-priors= MIG WEXPPRIOR: 0.000100 1000.000000 10000 100
bayes-priors= SPLIT * *  WEXPPRIOR: 0.0 0.001 0.1000000 0.01000
bayes-priors= SPLITSTD * *  WEXPPRIOR: 0.0 0.001 0.1000000 0.01000

Two Loci

Here are the basics for doing it over two loci:

# make density objects for two loci (n must be a power of 2, using 2^14)
testdens1 <- density(posterior$Theta_1[which(posterior$Locus==1)], n=16384, 
                     from = 0, to = 0.1, bw = "SJ")
testdens2 <- density(posterior$Theta_1[which(posterior$Locus==2)], n=16384, 
                     from = 0, to = 0.1, bw = "SJ")

#create the prior
thetaprior <- dexp(testdens1$x, rate = 0.001, log = F)

#create an empty tibble
densum <- tibble(.rows=16384)
densum$x <- testdens1$x
#remove the prior and standardize
densum$l1 <- (testdens1$y/ thetaprior) / sum(testdens1$y/ thetaprior)
densum$l2 <- (testdens2$y/ thetaprior) / sum(testdens2$y/ thetaprior)
#sum over loci and standardize again
densum$sum <- exp(log(densum$l1+1e-20) + log(densum$l2 + 1e-20) + 
                    log(thetaprior)) / sum( exp(log(densum$l1+1e-20) + 
                                                  log(densum$l2 + 1e-20) + 
                                                  log(thetaprior)))
#pivot to long format
densum_long <- pivot_longer(densum,cols=2:4,names_to = "yplot", values_to = "y")
#plot
ggplot(densum_long, aes(x = x, y = y, group = yplot, color = yplot)) + 
  geom_line() + scale_x_log10()

#bayes-priors= THETA WEXPPRIOR: 0.0 0.001 0.1000000 0.01000 
#bayes-priors= MIG WEXPPRIOR: 0.000100 1000.000000 10000 100

Many Loci

Functions for Summarizing from Bayesfile

These functions are for reconstructing distributions from the bayesfile which consists of distributions for each locus calculated as histograms

bayesfilenames <- c("Locus","pnumber","HPC50","HPC95","value","count",
                    "freq","cumFreq","priorFreq")
DirB <- "/Users/eric/Library/CloudStorage/OneDrive-ThePennsylvaniaStateUniversity/Documents/Research/Ptuberculosa/migrate/run4.6"
bf <- read_table(file = file.path(DirB,paste0("rep",1),winningModel,
                                      "bayesfile"), 
                             col_types = "ifiiddddd", col_names = bayesfilenames,
                             comment = "#", skip_empty_rows = T)

parameter_number_table <- read_table(file = file.path(DirB,paste0("rep",1),winningModel,
                                      "bayesfile"), 
                             col_types = "cic", 
                             col_names = c("drop","pnumber","parameter"),
                             skip=9, n_max = 24) %>% 
                      select(-drop) %>%
                      mutate(parameter = str_replace_all(parameter, ",","_")) %>% 
                      mutate(parameter = str_remove_all(parameter,"\\(")) %>% 
                      mutate(parameter = str_remove_all(parameter,"\\)")) %>% 
                      left_join(tibble(parameter_name =
                                        parameter_key,parameter=names(parameter_key)),
                                by = "parameter")

parameter_number_key <- parameter_number_table$parameter_name 
names(parameter_number_key) <- parameter_number_table$pnumber

#bf %>% ggplot(aes(x=value, y = freq, color = Locus)) + 
#  geom_line() + scale_color_continuous() + 
#  facet_wrap(vars(Parameter), scales = "free")

# bf %>% filter(pnumber %in% 1:9) %>% filter(Locus %in% 1:109) %>% 
#       ggplot(aes(x=value, y = freq, color = Locus)) + geom_line() +
#       facet_wrap(vars(pnumber), scales = "free")

bf %>% filter(pnumber %in% 1:9) %>% filter(Locus == 110) %>% 
      ggplot(aes(x = value, y = freq, color = pnumber)) + geom_line() +
      scale_color_brewer(palette = "YlGnBu", labels = parameter_number_key) +
      ylim(0,0.5) + xlim(0,0.01) + geom_line(aes(x = value, y = priorFreq),
                                             color = "grey", linetype="dotted")

Functions for Summarizing from Bayesallfile

These functions are for reconstructing distributions from the “raw” posterior output, which Peter calls a bayesallfile.

Note I revised these functions following the “Sanity Check” section below, when I re-ran Migrate due to a bug in v4.4.4. I found an issue in the remove_prior function where it wasn’t removing the prior correctly. Code for parameter estimates from before the “Sanity Check” will no longer work in these functions.

remove_prior <- function(densityd,prior,floor = 1e-10){
  # this function subtracts the prior from the posterior density for the y values of a single density column
  # and changes values less than zero to a small floor value
  # and then standardizes the result to sum to 1
  minusprior <- densityd - prior
  minusprior[minusprior <= 0] <- floor
  standardized <- minusprior / sum(minusprior)
  return(standardized)
}

sum_over_loci <- function(df,parameter){
      #this function takes a data frame of probability densities for many loci
      # that have had the prior removed, and also has a logged prior column called logPrior
      # as well as the name of a parameter (e.g. "Theta_1")
      # and sums the densities over loci.
       # Rmpfr package allows quadruple precision for calcs on very small numbers.
    require(Rmpfr)

      #log all values
       df2 <- df %>%  mutate(across(starts_with(c(parameter)),
                  .fns=log)) %>% 
      # convert the df to rowwise so that rows can be summed
      # and then sum across the row, including the prior
          rowwise() %>% 
          mutate(sum_param_prior = 
          sum(c_across(starts_with(c(parameter,"logPrior"))))) %>% 
    #convert back to a regular df
          ungroup()
    
    #need to convert to quadruple precision because 
    #these will exponentiate to very small numbers.
    sum_param_prior_exp <- exp(mpfr(df2$sum_param_prior, precBits = 128))
    # standardize by dividing by the sum of the column
    sum_param_prior_standardized <-
             sum_param_prior_exp/sum(sum_param_prior_exp)
    #drop the intermediate columns (no longer needed), change the standardized
    # output back to double precision so that it can be incorporated into the df
    # rename the summed column after the parameter
      df3 <- df2 %>% dplyr::select(-c(sum_param_prior)) %>%
          mutate(summed_over_loci =
          as.numeric(sum_param_prior_standardized)) %>% 
          rename_with(.fn = ~ paste(parameter), 
                      .cols = summed_over_loci)
    return(df3)
}



summarize_posterior <- function(posterior, 
                  parameter_type = c("Theta","M","Nm","D"),
                  exponential_mean = 1,
                  lower.bound, upper.bound,
                  n = 2^10, floor = 1e-10){
  # this function takes a Migrate-n posterior "bayesallfile" as a dataframe
  # as well as one of the parameter types, and the mean of an exponential prior
  # Currently only exponential priors supported
  # it will create densities for each parameter of the given type,
  # remove the prior from each, sum across loci, and re-add the prior (once)
  parameters <- names(posterior) %>%
                        str_subset(parameter_type)
  # create a tibble with the x values for all density plots 
  #  for a particular parameter type
  getx <- posterior %>% filter(Locus == 1) %>% 
        dplyr::select(parameters[1])

  # create a tibble with x values for a density plot
  #  of the chosen number of points
  dens <- tibble(.rows = n, x = density(getx[[1]],  n=n, 
                           from =lower.bound, 
                           to = upper.bound,
                           bw = "nrd0")$x)
  print("calculating densities")
  # calculate densities for each parameter of a given type at each locus
  dens <- posterior %>% 
          dplyr::select(starts_with(c("Steps","Locus","rep",
                  paste0(parameter_type,"_")))) %>% 
          pivot_wider(names_from = "Locus", values_from = 
                  starts_with(paste0(parameter_type,"_")),
                          names_sep = "-") %>% 
          dplyr::select(starts_with(paste0(parameter_type,"_"))) %>% 
          map_dfc(function(x) density(x, n = n, from = lower.bound,
                                    to = upper.bound, 
                                    bw = "nrd0")$y) %>%
          bind_cols(dens)
  
  # create, log and remove prior
  dens$prior <- dexp(dens$x, rate = 1/exponential_mean, 
                     log = F)
  
  dens$logPrior <- log(dens$prior)
  
  print("removing prior")
  dens2 <- dens %>% 
        #remove the prior, standardize
        mutate(across(starts_with(parameter_type), 
                  ~ remove_prior(densityd = .x,
                                  prior = dens$prior,
                                 floor = floor) ))

  dens3 <- dens2
    
  for(p in parameters){
    print(p)  
    dens3 <- sum_over_loci(df = dens3, parameter = p)
  }
  # trying to do the above loop with purrr:map_dfc
  #dens4 <- parameters %>% 
  #        map_dfc(.f = ~ sum_over_loci(df = dens2, parameter = .x))
  return(dens3)
}

posterior_stats <- function(df,parameter){
  require(spatstat.explore)
  p <- df %>% dplyr::select(x,parameter) %>% as.list(bw = 6.33e-05)
  names(p) <- c("x", "y")
  p$bw <- 6.33e-5
  attr(p, "class") <- "density"
  qu <- quantile.density(p, c(0.025, 0.25, 0.5, 0.75, 0.975))
  wmo <- p$x[which(p$y==max(p$y))]
  wme <- weighted.mean(p$x, p$y)
  wsd <- sqrt(weighted.var(p$x, p$y))
  stats <- c(qu,mode = wmo, mean = wme, sd = wsd)
  return(stats)
}

Parameter Estimates

These are the estimates, summed over loci.

Stepping-Stone_KOD Model

For the model with ongoing gene flow between Kure and Oahu

Theta Estimates

And here I run the code for the $\theta$ estimates

parameter_type <- "Theta"

parameters <- names(posterior) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
thetas <- summarize_posterior(posterior, parameter_type = parameter_type, 
                              prior_params = c(0,0.001,0.1))

#convert to long format
thetas_long <- thetas %>% dplyr::select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density")

theta.colors <- rev(brewer.pal(9,"Spectral"))
theta.colors.light <- lighten(theta.colors)
theta.colors.dark <- darken(theta.colors)

#plot
thetas_density <-  ggplot(thetas_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
        scale_color_discrete(type = theta.colors, 
                             labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") +
        ylim(0,0.175) + xlim(0,0.01)

thetas_ridges <- ggplot(thetas_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_fill_discrete(type = theta.colors, labels = parameter_key) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$\\Theta$")) + xlim(0,0.01)

thetas_violins <- ggplot(thetas_long, aes(x = Parameter, y = x, 
                                          violinwidth = Density*10, 
                                          fill = Parameter)) +
        geom_violin(stat = "identity") + 
        scale_fill_discrete(type = theta.colors, labels = parameter_key) +
        scale_x_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(y = TeX("$\\Theta$")) + ylim(0,0.01)


# get stats for each summed-over-loci posterior

theta_stats <- parameter_key2 %>% map_dfr(.f = ~ posterior_stats(df = thetas, 
                                                             parameter = .x),
                                      .id = "parameter")

# write_csv(theta_stats, "tables/stepping.stone.KO.D/theta_stats.csv")
# 
# ggsave("figures/stepping.stone.KO.D/thetas_density.jpg",thetas_density)
# ggsave("figures/stepping.stone.KO.D/thetas_ridges.jpg",thetas_ridges)
# ggsave("figures/stepping.stone.KO.D/thetas_violins.jpg",thetas_violins)
# 
# ggsave("figures/stepping.stone.KO.D/thetas_density.pdf",thetas_density)
# ggsave("figures/stepping.stone.KO.D/thetas_ridges.pdf",thetas_ridges)
# ggsave("figures/stepping.stone.KO.D/thetas_violins.pdf",thetas_violins)
#https://adiradaniel.netlify.app/post/ggmultipane/#using-ggpubrggarrange
#to make multipanel figures

Theta estimates:

thetastats <- read_csv("./tables/stepping.stone.KO.D/theta_stats.csv")

## Rows: 9 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

thetastats

## # A tibble: 9 × 9
##   parameter      `2.5%`   `25%`   `50%`   `75%` `97.5%`    mode    mean       sd
##   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
## 1 theta Kure    0.00543 0.00559 0.00568 0.00575 0.00596 0.00569 0.00567  1.61e-4
## 2 theta P&H     0.00112 0.00123 0.00129 0.00136 0.00148 0.00128 0.00130  9.25e-5
## 3 theta Maro    0.00674 0.00698 0.00699 0.00700 0.00701 0.00699 0.00698  5.73e-5
## 4 theta FFS     0.00621 0.00678 0.00680 0.00682 0.00687 0.00680 0.00678  1.37e-4
## 5 theta Kauai   0.00797 0.00801 0.00803 0.00804 0.00807 0.00803 0.00802  2.73e-5
## 6 theta Oahu    0.00762 0.00765 0.00767 0.00768 0.00772 0.00767 0.00767  2.62e-5
## 7 theta Molokai 0.00678 0.00684 0.00686 0.00689 0.00775 0.00685 0.00693  2.44e-4
## 8 theta Maui    0.00771 0.00779 0.00783 0.00785 0.00789 0.00784 0.00781  4.82e-5
## 9 theta Big Is. 0.00632 0.00638 0.00640 0.00643 0.00649 0.00640 0.00639  1.10e-4

Theta Density Theta Ridges

Theta Violins

m/mu estimates

parameter_type <- "M"

parameters <- names(posteriors12) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
Ms <- summarize_posterior(posteriors12, parameter_type = parameter_type, 
                              prior_params = c(0,1,1000), n = 2^14)

#convert to long format
Ms_long <- Ms %>% select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density") %>% 
#                filter(Parameter != "M_6_1")

migration.colors <- c(rbind(theta.colors, theta.colors.dark))[4:18]            
#plot
Ms_density <-  ggplot(Ms_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        scale_x_log10(limits = c(1e-2,100)) + 
        scale_color_discrete(type = migration.colors, labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted")
  
           
#plot
M_ridges <- ggplot(Ms_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_x_log10(limits = c(1e-2,100)) +
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$\\frac{m}{\\mu}$"))

  
M_violins <- ggplot(Ms_long, aes(x = Parameter, y = x, violinwidth = Density, 
                    fill = Parameter)) +
        geom_violin(stat = "identity") +
        scale_y_log10(limits = c(1e-2,100)) +
        scale_fill_discrete(type = migration.colors, labels = parameter_key) 
# get stats for each summed-over-loci posterior

Ms_stats <- parameter_key2 %>% map_dfr(.f = ~ posterior_stats(df = Ms, 
                                                             parameter = .x),
                                      .id = "parameter")

# write_csv(Ms_stats, "tables/stepping.stone.KO.D/Mmu_stats.csv")
# 
ggsave("figures/stepping.stone.KO.D/Mmu_density.jpg",Ms_density)
ggsave("figures/stepping.stone.KO.D/Mmu_ridges.jpg",M_ridges)
ggsave("figures/stepping.stone.KO.D/Mmu_violins.jpg",M_violins)

ggsave("figures/stepping.stone.KO.D/Mmu_density.pdf",Ms_density)
ggsave("figures/stepping.stone.KO.D/Mmu_ridges.pdf",M_ridges)
ggsave("figures/stepping.stone.KO.D/Mmu_violins.pdf",M_violins)
M_ridges2<-M_ridges + theme(legend.position = "none")
ggsave(filename = "M_ridges_nolegend.pdf",plot=M_ridges2, path = "./figures/stepping.stone.KO.D/", width = 10.7, height = 6.42)

M/mu estimates:

mmustats <- read_csv("./tables/stepping.stone.KO.D/Mmu_stats.csv")

## Rows: 15 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mmustats

## # A tibble: 15 × 9
##    parameter            `2.5%`  `25%`  `50%`  `75%` `97.5%`   mode   mean     sd
##    <chr>                 <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1 m/mu Oahu to Kure    66.3   66.7   67.3   68.3    71.6   66.3   67.7   1.46  
##  2 m/mu Maro to P&H      1.16   1.16   1.16   1.28    1.40   1.16   1.21  0.143 
##  3 m/mu P&H to Maro      1.10   1.16   1.22   1.22    1.28   1.16   1.20  0.0680
##  4 m/mu FFS to Maro      0.244  0.244  0.244  0.244   0.305  0.244  0.249 0.0435
##  5 m/mu Maro to FFS      1.10   1.22   1.28   1.34    1.46   1.28   1.28  0.117 
##  6 m/mu Kauai to FFS     0.549  0.671  0.732  0.794   0.916  0.671  0.724 0.104 
##  7 m/mu FFS to Kauai     0.122  0.183  0.183  0.183   0.244  0.183  0.174 0.0461
##  8 m/mu Oahu to Kauai    0.183  0.183  0.244  0.488   0.610  0.183  0.308 0.183 
##  9 m/mu Kauai to Oahu    0.305  0.427  0.488  0.488   0.610  0.488  0.465 0.0897
## 10 m/mu Molokai to Oahu  0.183  0.244  0.244  0.244   0.305  0.244  0.236 0.0499
## 11 m/mu Oahu to Molokai  0.366  0.427  0.427  0.488   0.549  0.427  0.454 0.0632
## 12 m/mu Maui to Molokai  1.04   1.22   1.28   1.40    1.65   1.28   1.30  0.149 
## 13 m/mu Molokai to Maui  0.183  0.183  0.244  0.244   0.244  0.244  0.217 0.0439
## 14 m/mu Big Is. to Maui  0.732  0.916  1.10   1.16    1.22   1.16   1.04  0.173 
## 15 m/mu Maui to Big Is.  1.95   1.95   1.95   2.01    2.01   1.95   1.98  0.0454

$Mmu Violins$

Mmu Violins

4Nem estimates

parameter_type <- "Nm"

parameters <- names(posteriors12) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
nms <- summarize_posterior(posteriors12, parameter_type = parameter_type, 
                              prior_params = c(0,1,1), n = 2^17)

#convert to long format
nms_long <- nms %>% select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density") %>% 
                filter(Parameter != "Nm_6_1")

migration.colors <- c(rbind(theta.colors, theta.colors.dark))[4:18]            
#plot
Nms_density <-  ggplot(nms_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
        scale_color_discrete(type = migration.colors, labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") +
        ylim(0,0.125) + xlim(0,0.003)
  
           
#plot
nm_ridges <- ggplot(nms_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$4N_em$")) +
        xlim(0,0.003)
  
nm_violins <- ggplot(nms_long, aes(x = Parameter, y = x, violinwidth = Density*20, 
                    fill = Parameter)) +
        geom_violin(stat = "identity") + 
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        ylim(0,0.003)
# get stats for each summed-over-loci posterior

nms_stats <- parameter_key2 %>% map_dfr(.f = ~ posterior_stats(df = nms, 
                                                             parameter = .x),
                                      .id = "parameter")

# write_csv(nms_stats, "tables/stepping.stone.KO.D/4Nm_stats.csv")
# 
# ggsave("figures/stepping.stone.KO.D/4Nm_density.jpg",Nms_density)
# ggsave("figures/stepping.stone.KO.D/4Nm_ridges.jpg",nm_ridges)
# ggsave("figures/stepping.stone.KO.D/4Nm_violins.jpg",nm_violins)
# 
# ggsave("figures/stepping.stone.KO.D/4Nm_density.pdf",Nms_density)
# ggsave("figures/stepping.stone.KO.D/4Nm_ridges.pdf",nm_ridges)
# ggsave("figures/stepping.stone.KO.D/4Nm_violins.pdf",nm_violins)
# Nm_ridges2 <- Nm_ridges + theme(legend.position = "none")
# ggsave("figures/stepping.stone.KO.D/4Nm_ridges.pdf",nm_ridges)

$4N_em$ estimates:

nmstats <- read_csv("./tables/stepping.stone.KO.D/4Nm_stats.csv")

## Rows: 15 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nmstats

## # A tibble: 15 × 9
##    parameter      `2.5%`   `25%`   `50%`   `75%` `97.5%`    mode    mean      sd
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 4Nem Oahu to… 9.09e-1 9.66e-1 9.83e-1 9.93e-1 9.99e-1 1   e+0 9.75e-1 2.46e-2
##  2 4Nem Maro to… 3.66e-4 4.27e-4 4.58e-4 4.96e-4 5.72e-4 4.50e-4 4.64e-4 5.47e-5
##  3 4Nem P&H to … 6.03e-4 6.71e-4 7.17e-4 7.63e-4 8.54e-4 7.10e-4 7.19e-4 6.61e-5
##  4 4Nem FFS to … 1.98e-4 2.21e-4 2.37e-4 2.59e-4 2.98e-4 2.37e-4 2.41e-4 2.66e-5
##  5 4Nem Maro to… 1.35e-3 1.52e-3 1.62e-3 1.72e-3 1.94e-3 1.61e-3 1.63e-3 1.51e-4
##  6 4Nem Kauai t… 6.49e-4 7.40e-4 7.93e-4 8.47e-4 9.69e-4 7.86e-4 7.97e-4 8.23e-5
##  7 4Nem FFS to … 8.70e-4 1.04e-3 1.14e-3 1.27e-3 1.54e-3 1.12e-3 1.16e-3 1.73e-4
##  8 4Nem Oahu to… 1.49e-3 1.74e-3 1.87e-3 2.01e-3 2.27e-3 1.86e-3 1.87e-3 2.01e-4
##  9 4Nem Kauai t… 1.08e-3 1.23e-3 1.31e-3 1.40e-3 1.57e-3 1.30e-3 1.31e-3 1.26e-4
## 10 4Nem Molokai… 9.99e-4 1.17e-3 1.26e-3 1.36e-3 1.56e-3 1.24e-3 1.26e-3 1.43e-4
## 11 4Nem Oahu to… 7.10e-4 8.24e-4 8.93e-4 9.61e-4 1.11e-3 8.85e-4 8.97e-4 1.03e-4
## 12 4Nem Maui to… 1.46e-3 1.65e-3 1.75e-3 1.86e-3 2.08e-3 1.75e-3 1.76e-3 1.61e-4
## 13 4Nem Molokai… 4.35e-4 5.04e-4 5.42e-4 5.80e-4 6.56e-4 5.34e-4 5.40e-4 5.67e-5
## 14 4Nem Big Is.… 1.26e-3 1.40e-3 1.47e-3 1.56e-3 1.75e-3 1.46e-3 1.48e-3 1.26e-4
## 15 4Nem Maui to… 1.17e-3 1.39e-3 1.52e-3 1.66e-3 2.05e-3 1.52e-3 1.54e-3 2.22e-4

4Nem Density $4Nem Ridges$

##### 4NeM Estimate for Oahu - Kure

This parameter is orders of magnitude larger than the others, and effectively can’t be summarized under the same prior limits. So this code is just rejigged to focus on the Oahu-Kure parameter

parameter_type <- "Nm"

parameters <- names(posteriors12) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
nms61 <- summarize_posterior(posteriors12, parameter_type = parameter_type, 
                              prior_params = c(1,1,1000), n = 2^14)

#convert to long format
nms_61long <- nms61 %>% select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density") %>% 
                filter(Parameter == "Nm_6_1")

migration.colors <- c(rbind(theta.colors, theta.colors.dark))            
#plot
Nms_61density <-  ggplot(nms_61long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
        scale_color_discrete(type = migration.colors, labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") +
        ylim(0,0.125) + xlim(0,100)
  
           
#plot
nm_61ridges <- ggplot(nms_61long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$4N_em$")) +
        xlim(0,50)
  
nm_61violins <- ggplot(nms_61long, aes(x = Parameter, y = x, violinwidth = Density*20, 
                    fill = Parameter)) +
        geom_violin(stat = "identity") + 
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        ylim(0,50)
# get stats for each summed-over-loci posterior

nms_61stats <-posterior_stats(df = nms61, parameter = "Nm_6_1")

nm61stats <- read.csv("./tables/stepping.stone.KO.D/4Nm_Oahu_Kure.csv")
nm61stats

##       X          x
## 1  2.5% 34.6597693
## 2   25% 34.7207471
## 3   50% 34.7207471
## 4   75% 34.7207471
## 5 97.5% 38.1355063
## 6  mode 34.7207471
## 7  mean 34.7966844
## 8    sd  0.8936556

Divergence Estimate Oahu-Kure

This is the divergence

parameter_type <- "D"
getx <- posteriors12 %>% filter(Locus == 1) %>% 
     dplyr::select("D_6_1")

dens <- tibble(.rows = 2^15, x = density(getx[,1],  n=2^15, 
                                      from = 0, 
                                      to = 0.0001,
                                      bw = "nrd0")$x)

dens2 <- posteriors12 %>%
          select(starts_with(c("Steps","Locus","rep",
                paste0(parameter_type,"_")))) %>%
          pivot_wider(names_from = "Locus", values_from =
                      starts_with(paste0(parameter_type,"_")),
                      names_sep = "-") %>%
          select(4:112) %>%
          map_dfc(function(x) density(x, n = 2^15, from = 0, to = 0.001, bw = "nrd0")$y)

dens3<-bind_cols(dens,dens2)

dens3$prior <- dexp(dens3$x, rate = 1/0.001,log = F)
dens3$prior <- dens3$prior/sum(dens3$prior)
dens3$logPrior <- log(dens3$prior)


dens4 <- dens3 %>%
        #remove the prior, standardize
        mutate(across(starts_with(parameter_type),~ remove_prior(densityd = .x,
                                           prior =dens$prior,threshold = 1e-10) ))
names(dens4)[2:110] <- paste0("D_6_1-",names(dens4)[1:109])

dens5 <- sum_over_loci(df = dens4, parameter = "D_6_1")


#convert to long format
D_61long <- dens5 %>% select(all_of(c("x","prior","D_6_1"))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density") %>% 
                filter(Parameter == "D_6_1")

migration.colors <- c(rbind(theta.colors, theta.colors.dark))            
#plot
D_61density <-  ggplot(D_61long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-5,1e-2)) + 
        scale_color_discrete(type = migration.colors, labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") +
        ylim(0,0.0004) + xlim(0,0.0001) + theme(legend.position = "none")

  
           

# get stats for each summed-over-loci posterior

D_61stats <-posterior_stats(df = dens5, parameter = "D_6_1")

# write_csv(data.frame(D_61stats), "tables/stepping.stone.KO.D/D_6_1_stats.csv")
# 
# ggsave("figures/stepping.stone.KO.D/D_6_1.jpg",D_61density)

Divergence Density

D61stats <- read.csv("./tables/stepping.stone.KO.D/D_6_1_stats.csv")
D61stats

##      D_61stats
## 1 5.261391e-05
## 2 5.735954e-05
## 3 6.005432e-05
## 4 6.291086e-05
## 5 6.875210e-05
## 6 5.970336e-05
## 7 6.022191e-05
## 8 4.123291e-06

Divergence time in coalescent units following along on page 25 here. Assuming 1 year generation time Beerli et al. (2019)

D <- 5.970336e-05
tau <- D / (0.00767+0.00569)
tau

## [1] 0.004468814

t <- D / 1.2e-8
t

## [1] 4975.28

Divergence time is estimated at 4975.28 years before present

Stepping-Stone KO.d model

This is the same model, without ongoing gene flow between Kure and Oahu

Read in the posteriors and make a labeling key

#
Dir <- "/Users/eric/Library/CloudStorage/OneDrive-ThePennsylvaniaStateUniversity/Documents/Research/Ptuberculosa/migrate/run4.7"
      
winningModel <- "stepping.stone.KOd"

posteriors <- list()

#read in the posterior for two replicates this may take 5 minutes
for(r in 1:2){
  print(paste("Loading Rep",r))
  posteriors[[r]] <- read.table(file.path(Dir,paste0("rep",r),winningModel,
                                      "bayesallfile"), header=T) 
}

posteriors <- lapply(posteriors, select, starts_with(c("Locus","Steps","Theta_", "M_", 
                                                          "Nm_","D_","d_")))



#combine into one, and then keep only the first 2 reps (based on diagnostics below)
posteriors12 <- posteriors %>% bind_rows(.id = "rep") %>% 
                                 migrants.per.gen()  %>% mutate(rep = as.integer(rep))




#posteriors12_long <- posteriors12 %>% pivot_longer(starts_with(c("Theta_", "M_", 
#                                                           "Nm_","D_","d_")),
#                                                    names_to = "parameter",
#                                                    values_to = "value") 


# read in a population key to translate numbers into names in figures
popkey <- read_csv("popkey.csv")
names(popkey$Pop) <- popkey$Index
#make a new labeling key
posterior_parameters <- names(posteriors12) %>% str_subset("Theta|M_|Nm_|D_|d_")

parameter_key <- posterior_parameters %>% 
                                  str_replace("(\\d)_(\\d)", "\\1 to \\2") %>%
                                  str_replace("M_", "m/mu " ) %>% 
                                  str_replace("Theta_", "theta ") %>% 
                                  str_replace_all(popkey$Pop) %>% 
                                  str_replace("Nm_", "4Nem ") 
                                  

names(parameter_key) <- posterior_parameters

Check Convergence

# posteriors.grouped <- posteriors %>% 
#                       select(all_of(c("Locus",
#                                         "Steps",
#                                         posterior_parameters))) %>% 
#                       group_by(rep)

# posterior.reps <- group_split(posterior.grouped)
# posterior.key <- group_keys(posterior.grouped)

#thin by taking every other step
posteriors2 <- lapply(posteriors,filter,Steps%%500==0)


# select parameters only
posteriors2 <- lapply(posteriors2, select, starts_with(c("Theta_", "M_", 
                                                          "Nm_","D_","d_")))

                        
posteriors.mcmc <- mcmc.list(lapply(posteriors2,mcmc,thin=500))
posterior.ggs <- ggs(posteriors.mcmc, burnin = F)

ggmcmc(posterior.ggs, param_page = 4)







#if I wanted to pull out parameters by locus, but I don't - Peter calculates Effective Sizes on the whole posterior

posteriors3 <- posteriors15 %>% select(starts_with(c("Replicate","Steps","Locus","Theta_", "M_", 
                                                          "Nm_","D_","d_"))) %>% mutate(Locus_name = Locus)

posteriors3_Nm <- posteriors3 %>% select(starts_with(c("Replicate","Steps","Locus", 
                                                          "Nm_"))

posteriors3_Nm <- posteriors3 %>% pivot_wider(id_cols = c("Steps", "Replicate","Locus"), 
                                                            names_from = "Locus_name",
                                               values_from =starts_with(c("Theta_", "M_", 
                                                          "Nm_","D_","d_")), names_sep = "-L")

posteriors3_Nm.mcmc <- mcmc(posteriors3_Nm)

effectiveSize(posteriors3_Nm.mcmc)

Effective Sizes

Effective sizes are good

effectiveSize(posteriors2[[1]]) (first replicate)

 Theta_1   Theta_2   Theta_3   Theta_4   Theta_5   Theta_6   Theta_7   Theta_8   Theta_9     M_3_2     M_2_3     M_4_3     M_3_4 
140194.20  99698.77 114330.54  65325.45  59135.13  73193.97 109083.37  70626.00 107475.10  50591.32 172911.83  90997.00 269569.09 
    M_5_4     M_4_5     M_6_5     M_5_6     M_7_6     M_6_7     M_8_7     M_7_8     M_9_8     M_8_9     D_6_1 
327963.24 297808.14 273528.90 337943.73 270026.70 354645.35 350985.75 653891.00 653891.00 653891.00 170758.40

Gelman-Rubin Diagnostic

Gelman-Rubin diagnostic across 5 replicates is a bit high for some M_ parameters, but all below 1.2.


gelman.diag(posteriors.mcmc, autoburnin = F)

Potential scale reduction factors:

        Point est. Upper C.I.
Theta_1       1.00       1.00
Theta_2       1.00       1.01
Theta_3       1.01       1.02
Theta_4       1.01       1.01
Theta_5       1.01       1.03
Theta_6       1.00       1.00
Theta_7       1.00       1.00
Theta_8       1.00       1.01
Theta_9       1.01       1.01
M_3_2         1.14       1.17
M_2_3         1.13       1.13
M_4_3         1.12       1.12
M_3_4         1.07       1.07
M_5_4         1.15       1.15
M_4_5         1.00       1.00
M_6_5         1.01       1.01
M_5_6         1.12       1.12
M_7_6         1.06       1.06
M_6_7         1.10       1.10
M_8_7         1.23       1.24
M_7_8         1.10       1.10
M_9_8         1.12       1.12
M_8_9         1.21       1.25
D_6_1         1.00       1.00

Multivariate psrf

1.01

gelman.diag(posteriors.mcmc, transform = T,autoburnin=F)
Potential scale reduction factors:

Potential scale reduction factors:

        Point est. Upper C.I.
Theta_1       1.00       1.00
Theta_2       1.00       1.00
Theta_3       1.01       1.02
Theta_4       1.00       1.01
Theta_5       1.01       1.02
Theta_6       1.00       1.00
Theta_7       1.00       1.00
Theta_8       1.00       1.00
Theta_9       1.00       1.00
M_3_2         1.01       1.01
M_2_3         1.00       1.01
M_4_3         1.01       1.02
M_3_4         1.00       1.01
M_5_4         1.00       1.01
M_4_5         1.00       1.01
M_6_5         1.00       1.01
M_5_6         1.00       1.01
M_7_6         1.00       1.00
M_6_7         1.01       1.01
M_8_7         1.01       1.02
M_7_8         1.01       1.01
M_9_8         1.01       1.02
M_8_9         1.01       1.02
D_6_1         1.00       1.00

Multivariate psrf

1.02

Theta Estimates

And here I run the code for the $\theta$ estimates

parameter_type <- "Theta"

parameters <- names(posteriors12) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
thetas <- summarize_posterior(posteriors12, parameter_type = parameter_type, 
                              prior_params = c(0,0.001,0.1))

#convert to long format
thetas_long <- thetas %>% select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density")

theta.colors <- rev(brewer.pal(9,"Spectral"))
theta.colors.light <- lighten(theta.colors)
theta.colors.dark <- darken(theta.colors)

#plot
thetas_density <-  ggplot(thetas_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(linewidth= 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
        scale_color_discrete(type = theta.colors, 
                             labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") +
        ylim(0,0.175) + xlim(0,0.01)

thetas_ridges <- ggplot(thetas_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity", scale = 5) + 
        scale_fill_discrete(type = theta.colors, labels = parameter_key) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$\\Theta$")) + xlim(0,0.01)

thetas_violins <- ggplot(thetas_long, aes(x = Parameter, y = x, 
                                          violinwidth = Density*10, 
                                          fill = Parameter)) +
        geom_violin(stat = "identity") + 
        scale_fill_discrete(type = theta.colors, labels = parameter_key) +
        scale_x_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(y = TeX("$\\Theta$")) + ylim(0,0.01)


# get stats for each summed-over-loci posterior

theta_stats <- parameter_key2 %>% map_dfr(.f = ~ posterior_stats(df = thetas, 
                                                             parameter = .x),
                                      .id = "parameter")

#write_csv(theta_stats,"./tables/stepping.stone.KOd/theta_stats.csv")

# 
# ggsave(filename = "thetas_density.jpg",plot=thetas_density, 
#         path = "./figures/stepping.stone.KOd")
# ggsave(filename = "thetas_ridges.jpg",plot=thetas_ridges, 
#        path = "./figures/stepping.stone.KOd")
# ggsave(filename = "thetas_violins.jpg",plot=thetas_violins, 
#        path = "./figures/stepping.stone.KOd")
# 
# ggsave(filename = "thetas_density.pdf",plot=thetas_density, 
#         path = "./figures/stepping.stone.KOd")
# ggsave(filename = "thetas_ridges.pdf",plot=thetas_ridges, 
#        path = "./figures/stepping.stone.KOd")
# ggsave(filename = "thetas_violins.pdf",plot=thetas_violins, 
#        path = "./figures/stepping.stone.KOd")

#

Theta estimates:

thetastats <- read_csv("./tables/stepping.stone.KOd/theta_stats.csv")

## Rows: 9 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

thetastats

## # A tibble: 9 × 9
##   parameter      `2.5%`   `25%`   `50%`   `75%` `97.5%`    mode    mean       sd
##   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
## 1 theta Kure    0.00393 0.00418 0.00425 0.00435 0.00520 0.00423 0.00429  2.42e-4
## 2 theta P&H     0.00132 0.00146 0.00152 0.00159 0.00173 0.00150 0.00153  1.04e-4
## 3 theta Maro    0.00667 0.00670 0.00671 0.00672 0.00675 0.00671 0.00671  2.10e-5
## 4 theta FFS     0.00778 0.00780 0.00782 0.00783 0.00786 0.00782 0.00781  6.22e-5
## 5 theta Kauai   0.00773 0.00778 0.00780 0.00783 0.00788 0.00779 0.00780  3.97e-5
## 6 theta Oahu    0.00740 0.00745 0.00747 0.00749 0.00753 0.00748 0.00747  3.33e-5
## 7 theta Molokai 0.00691 0.00697 0.00701 0.00707 0.00716 0.00699 0.00702  7.19e-5
## 8 theta Maui    0.00726 0.00731 0.00734 0.00736 0.00740 0.00734 0.00734  4.51e-5
## 9 theta Big Is. 0.00498 0.00513 0.00536 0.00545 0.00623 0.00541 0.00540  3.62e-4

$Theta Ridges$

Theta Ridges

m/mu estimates

parameter_type <- "M"

parameters <- names(posteriors12) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
Ms <- summarize_posterior(posteriors12, parameter_type = parameter_type, 
                              prior_params = c(0,1,1000), n = 2^14)

#convert to long format
Ms_long <- Ms %>% select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density")


migration.colors <- c(rbind(theta.colors, theta.colors.dark))[4:18]  

#plot
M_density <-  ggplot(Ms_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        scale_x_log10(limits = c(1e-2,100)) + 
        scale_color_discrete(type = migration.colors, labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted")
  
           
#plot
M_ridges <- ggplot(Ms_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_x_log10(limits = c(1e-2,100)) +
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$\\frac{m}{\\mu}$"))

  
M_violins <- ggplot(Ms_long, aes(x = Parameter, y = x, violinwidth = Density, 
                    fill = Parameter)) +
        geom_violin(stat = "identity") +
        scale_y_log10(limits = c(1e-2,100)) +
        scale_fill_discrete(type = migration.colors, labels = parameter_key) 
# get stats for each summed-over-loci posterior



Ms_stats <- parameter_key2 %>% map_dfr(.f = ~ posterior_stats(df = Ms, 
                                                             parameter = .x),
                                      .id = "parameter")

ggsave(filename = "M_density.jpg",plot=M_density, path = "./figures/stepping.stone.KOd")
ggsave(filename = "M_ridges.jpg",plot=M_ridges, path = "./figures/stepping.stone.KOd")
ggsave(filename = "M_violins.jpg",plot=M_violins, path = "./figures/stepping.stone.KOd")
write_csv(Ms_stats,"./tables/stepping.stone.KOd/Mmu_stats.csv")

ggsave(filename = "M_density.pdf",plot=M_density, path = "./figures/stepping.stone.KOd")
ggsave(filename = "M_ridges.pdf",plot=M_ridges, path = "./figures/stepping.stone.KOd")
ggsave(filename = "M_violins.pdf",plot=M_violins, path = "./figures/stepping.stone.KOd")
M_ridges2 <- M_ridges + theme(legend.position="none")
ggsave(filename = "M_ridges_nolegend.pdf",plot=M_ridges2, path = "./figures/stepping.stone.KOd", width = 10.7, height = 6.42)

M/mu estimates:

mmustats <- read_csv("./tables/stepping.stone.KOd/Mmu_stats.csv")

## Rows: 14 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mmustats

## # A tibble: 14 × 9
##    parameter            `2.5%` `25%` `50%` `75%` `97.5%`  mode  mean     sd
##    <chr>                 <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>
##  1 m/mu Maro to P&H      1.28  1.34  1.40  1.40    1.46  1.40  1.37  0.0663
##  2 m/mu P&H to Maro      0.916 1.04  1.04  1.10    1.28  1.04  1.07  0.0989
##  3 m/mu FFS to Maro      0.244 0.244 0.244 0.244   0.305 0.244 0.249 0.0434
##  4 m/mu Maro to FFS      1.53  2.01  2.08  2.08    2.14  2.08  1.96  0.251 
##  5 m/mu Kauai to FFS     0.183 0.244 0.244 0.244   0.305 0.244 0.253 0.0596
##  6 m/mu FFS to Kauai     0.183 0.244 0.244 0.244   0.305 0.244 0.244 0.0532
##  7 m/mu Oahu to Kauai    0.122 0.122 0.183 0.244   0.305 0.183 0.182 0.0634
##  8 m/mu Kauai to Oahu    0.183 0.183 0.183 0.244   0.305 0.183 0.212 0.0582
##  9 m/mu Molokai to Oahu  0.122 0.183 0.183 0.183   0.244 0.183 0.186 0.0449
## 10 m/mu Oahu to Molokai  0.305 0.366 0.366 0.366   0.427 0.366 0.364 0.0540
## 11 m/mu Maui to Molokai  1.04  1.16  1.22  1.28    1.34  1.16  1.20  0.110 
## 12 m/mu Molokai to Maui  0.183 0.183 0.244 0.244   0.244 0.244 0.222 0.0435
## 13 m/mu Big Is. to Maui  1.04  1.10  1.16  1.16    1.16  1.16  1.13  0.0548
## 14 m/mu Maui to Big Is.  0.916 0.977 0.977 1.04    1.04  0.977 0.992 0.0461

Mmu Ridges

4Nem estimates

parameter_type <- "Nm"

parameters <- names(posteriors12) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
nms <- summarize_posterior(posteriors12, parameter_type = parameter_type, 
                              prior_params = c(0,1,100), n = 2^17)

#convert to long format
nms_long <- nms %>% select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density") %>% 
                filter(Parameter != "Nm_6_1")

migration.colors <- c(rbind(theta.colors, theta.colors.dark))[4:18]            
#plot
Nm_density <-  ggplot(nms_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
        scale_color_discrete(type = migration.colors, labels = parameter_key) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") +
        ylim(0,0.125) + xlim(0,0.003)
  
           
#plot
Nm_ridges <- ggplot(nms_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$4N_em$")) +
        xlim(0,0.003)
  
Nm_violins <- ggplot(nms_long, aes(x = Parameter, y = x, violinwidth = Density*20, 
                    fill = Parameter)) +
        geom_violin(stat = "identity") + 
        scale_fill_discrete(type = migration.colors, labels = parameter_key) +
        ylim(0,0.003)
# get stats for each summed-over-loci posterior

Nm_stats <- parameter_key2 %>% map_dfr(.f = ~ posterior_stats(df = nms, 
                                                             parameter = .x),
                                      .id = "parameter")

# ggsave(filename = "Nm_density.jpg",plot=Nm_density, path = "./figures/stepping.stone.KOd")
# ggsave(filename = "Nm_ridges.jpg",plot=Nm_ridges, path = "./figures/stepping.stone.KOd")
# ggsave(filename = "Nm_violins.jpg",plot=Nm_violins, path = "./figures/stepping.stone.KOd")
# 
# ggsave(filename = "Nm_density.pdf",plot=Nm_density, path = "./figures/stepping.stone.KOd")
# ggsave(filename = "Nm_ridges.pdf",plot=Nm_ridges, path = "./figures/stepping.stone.KOd")
# ggsave(filename = "Nm_violins.pdf",plot=Nm_violins, path = "./figures/stepping.stone.KOd")
# 
# write_csv(Nm_stats,"./tables/stepping.stone.KOd/Nm_stats.csv")
# 
# Nm_ridges2 <- Nm_ridges + theme(legend.position="none")
# ggsave(filename = "Nm_ridges_nolegend.pdf",plot=Nm_ridges2, path = "./figures/stepping.stone.KOd", width = 10.7, height = 6.42)
# ggsave(filename = "Nm_ridges.pdf",plot=Nm_ridges, path = "./figures/stepping.stone.KOd", width = 10.7, height = 6.42)

$4N_em$ estimates:

nmstats <- read_csv("./tables/stepping.stone.KOd/Nm_stats.csv")

## Rows: 14 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nmstats

## # A tibble: 14 × 9
##    parameter      `2.5%`   `25%`   `50%`   `75%` `97.5%`    mode    mean      sd
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 4Nem Maro to… 4.73e-4 5.57e-4 6.10e-4 6.64e-4 7.86e-4 6.03e-4 6.14e-4 8.12e-5
##  2 4Nem P&H to … 5.42e-4 6.18e-4 6.64e-4 7.02e-4 7.93e-4 6.56e-4 6.62e-4 6.50e-5
##  3 4Nem FFS to … 2.06e-4 2.37e-4 2.52e-4 2.67e-4 3.05e-4 2.44e-4 2.51e-4 2.71e-5
##  4 4Nem Maro to… 1.17e-3 1.32e-3 1.40e-3 1.50e-3 1.71e-3 1.39e-3 1.41e-3 1.39e-4
##  5 4Nem Kauai t… 5.19e-4 5.95e-4 6.33e-4 6.79e-4 7.71e-4 6.26e-4 6.36e-4 6.46e-5
##  6 4Nem FFS to … 1.21e-3 1.41e-3 1.53e-3 1.66e-3 1.95e-3 1.48e-3 1.54e-3 1.87e-4
##  7 4Nem Oahu to… 1.10e-3 1.29e-3 1.40e-3 1.50e-3 1.77e-3 1.41e-3 1.40e-3 1.71e-4
##  8 4Nem Kauai t… 9.23e-4 1.06e-3 1.14e-3 1.24e-3 1.43e-3 1.13e-3 1.15e-3 1.28e-4
##  9 4Nem Molokai… 8.54e-4 9.77e-4 1.05e-3 1.13e-3 1.30e-3 1.04e-3 1.06e-3 1.17e-4
## 10 4Nem Oahu to… 6.94e-4 7.86e-4 8.39e-4 9.00e-4 1.03e-3 8.32e-4 8.47e-4 8.83e-5
## 11 4Nem Maui to… 1.10e-3 1.27e-3 1.36e-3 1.46e-3 1.66e-3 1.36e-3 1.36e-3 1.46e-4
## 12 4Nem Molokai… 4.27e-4 4.81e-4 5.11e-4 5.42e-4 6.10e-4 5.04e-4 5.11e-4 4.87e-5
## 13 4Nem Big Is.… 1.11e-3 1.24e-3 1.31e-3 1.39e-3 1.56e-3 1.30e-3 1.32e-3 1.15e-4
## 14 4Nem Maui to… 1.25e-3 1.76e-3 2.11e-3 2.37e-3 2.80e-3 2.23e-3 2.07e-3 4.23e-4

$4Nem Ridges$

4Nem Ridges

#### Divergence Estimate Oahu-Kure

This is the divergence

parameter_type <- "D"
getx <- posteriors12 %>% filter(Locus == 1) %>% 
     dplyr::select("D_6_1")

dens <- tibble(.rows = 2^15, x = density(getx[,1],  n=2^15, 
                                      from = 0, 
                                      to = 0.0001,
                                      bw = "nrd0")$x)

dens2 <- posteriors12 %>%
          select(starts_with(c("Steps","Locus","rep",
                paste0(parameter_type,"_")))) %>%
          pivot_wider(names_from = "Locus", values_from =
                      starts_with(paste0(parameter_type,"_")),
                      names_sep = "-") %>%
          select(3:111) %>%
          map_dfc(function(x) density(x, n = 2^15, from = 0, to = 0.001, bw = "nrd0")$y)

dens2<-bind_cols(dens,dens2)

dens2$prior <- dexp(dens2$x, rate = 1/0.001,log = F)
dens2$prior <- dens2$prior/sum(dens2$prior)
dens2$logPrior <- log(dens2$prior)


dens4 <- dens2 %>%
        #remove the prior, standardize
        mutate(across(starts_with(parameter_type),
                      ~ remove_prior(densityd = .x,
                          prior = dens2$prior,threshold = 1e-10) ))
names(dens4)[1:109] <- paste0("D_6_1-",names(dens4)[2:110])

dens5 <- sum_over_loci(df = dens4, parameter = "D_6_1") 
dens5 <- bind_cols(dens5, x=dens$x)

#convert to long format
D_61long <- dens5 %>% 
                pivot_longer(starts_with(c(paste0("D","_"))), 
                            names_to = "Parameter",
                            values_to = "Density") %>% 
                filter(Parameter == "D_6_1")


#D_61long <- bind_cols(D_61long,x=dens$x)

#migration.colors <- c(rbind(theta.colors, theta.colors.dark))            
#plot
D_61density <-  ggplot(D_61long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-5,1e-2)) + 
        scale_color_discrete(type = migration.colors, labels = parameter_key) +
        #geom_line(aes(x = x, y=prior), color = "grey",
        #          linetype="dotted") +
        ylim(0,0.0003) + xlim(0,0.0001) + theme(legend.position = "none")
  
           

# get stats for each summed-over-loci posterior

D_61stats <-posterior_stats(df = dens5, parameter = "D_6_1")

#ggsave(filename = "D_6_1.jpg",plot=D_61density, path = "./figures/stepping.stone.KOd")
#write.csv(D_61stats ,"./tables/stepping.stone.KOd/D_6_1.csv",quote=F)

Divergence Density

D61stats <- read.csv("./tables/stepping.stone.KOd/D_6_1.csv")
D61stats

##       X            x
## 1  2.5% 5.244911e-05
## 2   25% 5.739006e-05
## 3   50% 6.014893e-05
## 4   75% 6.302683e-05
## 5 97.5% 6.880093e-05
## 6  mode 5.988647e-05
## 7  mean 6.027578e-05
## 8    sd 4.178363e-06

Sanity Check

Simulations with very low gene flow

So these results are intriguing in that they suggest that we could be getting quite low pairwise $F_{ST}$ values even when there is negligible gene flow between islands. However, it doesn’t make sense that an equilbrium model could yield such low $F_{ST}$ without a non-equilibrium event preceeding it, such as a founder event (i.e. Hawaiian population being founded from elsewhere in the Indo-Pacific) I want to double check that the results are indeed consistent with low $F_{ST}$ so I am going to simulate the model (minus the anthropogenic gene flow between Oahu and Kure) with a founder event to see if our results are consistent.

IBDsim (Leblois, Estoup, and Rousset 2009) did well for me in my previous Hawaii paper, so I am turning to it again, as it is able to simulate all the aspects that I need. I’m going to start with just doing 90 diploid samples of DNA under the Juke Cantor model to simulate the haplotype dataset used in migrate.

Here is the parmfile for a coalescentsimulation with a 1.2e-8 mutation rate per generation (* 500bp = 6e-6) (Zhang et al. 2022), and an equivalent migration rate (as Migrate found this ratio close to 1). From the present, there are 15 demes of 50K effective individuals each, with an emigration rate of 6e-6, which results in immigration rates of close to this value (see file EmpImmmigRate), exchanging individuals only with neighboring populations (stepping-stone model). Absorbing boundaries. At 999 generations, there is a founder event of 100 individuals in one deme, which, at 1,000 generations comes from a larger population of 100K effective individuals. Sampling is exactly as in the empirical observations in terms of 9 islands being sampled with the same sample sizes as we have.

I have varied the simulations to test the effect of the age and size of the founder event as such:

[A]ge of Founder Event (generations)	Size of [F]ounding Population
Infinite (true equilibrium)	NA
1000	100
10000	100
100000	100
1000	1000
10000	1000
100000	1000

%%%%%%%%%% SIMULATION PARAMETERS %%%%%%%
Data_File_Name = stepstone_radseq
GenePop_File_Extension = .gen
Run_Number = 1
Pause = Never
Genepop = T
Migrate = F
Migraine_Settings = F
Nexus_File_Format = F

%%%%%%%%%% MARKERS PARAMETERS %%%%%%%%%
Locus_Number = 100
Mutation_Rate = 0.000006
Mutation_Model = JC69
MRCA_Sequence = TACTCTATATATTATGTTTGGTGTGTGATCTGGTTTAGTCGGGACTGCTTTGAGGCTCTTGATTCGAGCTGAACTTGGACAGCCAGGAGCTCTTTTAGGTGATGATCAACTTTATAATGTGATCGTCACTGCGCATGCATTTGTGATAATTTTTTTCTTGGTGATGCCTATGATGATTGGGGGATTCGGTAATTGGTTGGTTCCTTTAATGTTGGGGGCTCCTGATATGGCGTTCCCGCGGTTGAATAATATAAGTTTTTGGTTGCTTCCGCCTTCGTTGACTTTGTTGCTTGCTTCTTCTGCTGTTGAGAGTGGTGTAGGGACAGGTTGAACGGTTTATCCTCCTTTGTCTGGGAACTTAGCTCATGCTGGGGGTTCTGTGGATCTAGCTATCTTCTCGTTACACTTAGCTGGTGTATCTTCTATTTTAGGTGCTGTAAATTTTATTACTACGATCATTAATATGCGGTGACAGGGGATGCAATTTGAGCGGTTGCCTC
Sequence_Size = 500
Transition_Transversion_ratio = 2
Equilibrium_Frequencies = 0.25 0.25 0.25 0.25
Polymorphic_Loci_Only= False
Ploidy = Diploid

%%%%%%%%%% VARIOUS COMPUTATION OPTIONS %%%
DiagnosticTables = Hexp, Fis, Seq_stats, Effective_Dispersal, Iterative_Statistics


%%%%%%%%%% DEMOGRAPHIC OPTIONS %%%%%%%%
NewDemographicPhaseAt=0
Lattice_Boundaries = Absorbing
Total_Range_Dispersal = True
Dispersal_Distribution = SteppingStone
Immigration_Control = Simple1Dproduct
Total_Emigration_Rate = 0.000006
Dist_max = 1
Lattice_SizeX = 15
Lattice_SizeY = 1
Ind_Per_Pop = 50000
Continuous_Deme_Size_Variation = Exponential
Dens_Logistic_Growth_Rate = 0.3
Continuous_Lattice_Size_Variation = Linear
Lattice_Logistic_Growth_Rate = 0.3

NewDemographicPhaseAt=999
Lattice_SizeX = 1
Lattice_SizeY = 1
Ind_Per_Pop = 100

NewDemographicPhaseAt=1000
Lattice_SizeX = 1
Lattice_SizeY = 1
Ind_Per_Pop = 100000

%%%%%%%%%% SAMPLE PARAMETERS %%%%%%%%
%Sample_SizeX = 9
%Sample_SizeY = 9
Sample_Coordinates_X = 1 3 6 8 11 12 13 14 15
Sample_Coordinates_Y = 1 1 1 1 1  1  1  1  1
Ind_Per_Pop_Sampled = 10 4 5 9 15 10 10 10 17

Now read and plot the results

#read in the empirical haplotype data as numbered haps
haps <- read.genepop("filtered_data/Ptuberculosa_filtered_haps.gen", quiet = T)
#fst_haps <- mat_pw_fst(haps)
fst_haps <- genet.dist(genind2hierfstat(haps), method = "Nei87")
fst_haps <- fst_haps / (1-fst_haps)
stats_haps <- basic.stats(haps)

sims <- list()
for(sim in list.files("IBDsim", pattern = "^F", include.dirs=T, full.names = T)){
  simname <- sim %>% str_match("F\\d+_A.+") %>% as.character()
  sims[simname] <- read.genepop(file.path(sim,"stepstone_radseq.gen"),
                                ncode = 3,  quiet = T)
}

# use map() to loop over all the genepop objects and calculate Fst 
locus_fsts <- sims %>% map(\(x) basic.stats(genind2hierfstat(x)))
locus_fsts$Observed <- stats_haps

fst_hists <- locus_fsts %>% map(\(x) ggplot() + geom_histogram(x$perloc, mapping=aes(x=Fst),binwidth = 0.01) + xlim(0,1))

cowplot::plot_grid(plotlist = fst_hists, labels = names(fst_hists))

So basically, there is no realistic founder event scenario that leads to $F_{ST}$ values as low as what we observe.

Check IBD

#create a "geographic" distance matrix                  
sim_geo_points <- matrix(c(1,1,3,1,6,1,8,1,11,1,12,1,13,1,14,1,15,1),ncol = 2, byrow = T)
sim_geo_dists <- pointDistance(sim_geo_points,lonlat=F) %>% as.dist()

# re-load the geographic distances and convert to distance matrix
geo.dist.mat <- read.csv(file = "IBD/Ptuberculosa_pairwise_great_circle_distances_km.csv")
geo.dist <- as.dist(geo.dist.mat)

# (WC Theta is not working for reasons I can't figure out, so using Nei 87)
fsts <- sims %>% map(\(x) genet.dist(genind2hierfstat(x), method = "Nei87"))

ibd_plots <- fsts %>% map(\(x) ggplot(tibble(Lattice_Dist = as.vector(sim_geo_dists), Fst = as.vector(x))) + 
                             geom_point(aes(x = Lattice_Dist, y = Fst)) + ylim(0,1))

ibd_plots$Observed <- tibble(Geog_Dist = geo.dist, Fst = fst_haps) %>%  ggplot() + 
                              geom_point(aes(x = Geog_Dist, y = Fst)) + ylim(0,1)

cowplot::plot_grid(plotlist = ibd_plots, labels = names(ibd_plots))

## Don't know how to automatically pick scale for object of type <dist>.
## Defaulting to continuous.

#sim_dists.df <-  bind_cols(fst = as.dist(sim_haps_fst), distance = as.dist(sim_geo_dists))

#mantelt_sims <- mantel.randtest(as.dist(sim_haps_fst),as.dist(sim_geo_dists), nrepet = 10000)
#lmodel_sims <- lm(fst ~ distance , sim_dists.df)

Re-run Stepping.Stone.KOd with larger priors

I have re-run the Stepping.Stone.KOd model with expanded M prior, expanded sampling window, and ensuring that larger values are explored by starting in the middle of the prior. Also tried to get it to output a histogram of migration events but that doesn’t seem to have worked.

Changes to parmfile

bayes-priors= MIG WEXPPRIOR: 0.000100 1000.000000 1000000 10000
theta=PRIOR:50

mig-histogram=MIGRATIONEVENTSONLY:0.00001:mighistogram.txt

Read in the posterior and make a labeling key

#
Dir <- "/Users/eric/Library/CloudStorage/OneDrive-ThePennsylvaniaStateUniversity/Documents/Research/Ptuberculosa/migrate/run5/rep1/stepping.stone.KOd/bayesallfile"
      

#read in the posterior for two replicates this may take 5 minutes
posterior <- read.table(Dir, header=T) 

posterior <- posterior %>% migrants.per.gen()
                                 
# read in a population key to translate numbers into names in figures
popkey <- read_csv("popkey.csv")
names(popkey$Pop) <- popkey$Index
#make a new labeling key
posterior_parameters <- names(posterior) %>% str_subset("Theta|M_|Nm_|D_|d_")

parameter_key <- posterior_parameters %>% 
                                  str_replace("(\\d)_(\\d)", "\\1 to \\2") %>%
                                  str_replace("M_", "m/mu " ) %>% 
                                  str_replace("Theta_", "theta ") %>% 
                                  str_replace_all(popkey$Pop) %>% 
                                  str_replace("Nm_", "4Nem ") 
                                  

names(parameter_key) <- posterior_parameters

Check Convergence

#thin
posterior$Steps <- as.numeric(posterior$Steps)
posterior2 <- posterior %>% as_tibble() %>% filter(Steps%%500==0) %>% 
                            dplyr::select(starts_with(c("Theta_", "M_", 
                                        "Nm_","D_","d_")))

posterior.mcmc <- mcmc(posterior2,end = 3000000,thin=500)
posterior.ggs <- ggs(posteriors.mcmc, burnin = F)

ggmcmc(posterior.ggs, param_page = 8)

Effective Sizes

Effective sizes are good

effectiveSize(posterior.mcmc)

   Theta_1    Theta_2    Theta_3    Theta_4    Theta_5    Theta_6    Theta_7    Theta_8    Theta_9      M_3_2      M_2_3      M_4_3      M_3_4 
 49337.815  10646.909   5125.929   4722.193   9552.176   4626.342   7675.959   7246.239  10490.718  38715.991   5248.512 606155.173 114465.448 
     M_5_4      M_4_5      M_6_5      M_5_6      M_7_6      M_6_7      M_8_7      M_7_8      M_9_8      M_8_9     Nm_3_2     Nm_2_3     Nm_4_3 
 93326.272 158502.135 231502.904 135276.560 130166.133  77014.260  84461.711   6471.971  20463.577  10983.852  52598.200   4685.068 582050.268 
    Nm_3_4     Nm_5_4     Nm_4_5     Nm_6_5     Nm_5_6     Nm_7_6     Nm_6_7     Nm_8_7     Nm_7_8     Nm_9_8     Nm_8_9      D_6_1 
 82934.776 125485.354 136584.397 205575.637 121067.103 168445.994  84396.049 121708.508  10313.493  20041.834  13056.593  53007.255

Here is what M parameters look like. They are all in the realm of what I found - right around 1:1 with mutation rate.

M_parameters

Comparison to Outfile

And yet, this is what the outfile for this run shows

Locus Parameter        2.5%      25.0%    mode     75.0%   97.5%     median   mean
-----------------------------------------------------------------------------------
  All  M_3->2         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20403.54
  All  M_2->3         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20401.98
  All  M_4->3         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20401.99
  All  M_3->4         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20401.99
  All  M_5->4         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20402.00
  All  M_4->5         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20402.00
  All  M_6->5         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20402.03
  All  M_5->6         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20402.03
  All  M_7->6         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20402.08
  All  M_6->7         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20401.99
  All  M_8->7         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20402.02
  All  M_7->8         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20405.28
  All  M_9->8         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20402.92
  All  M_8->9         19500.00 19500.00 20250.00 20500.00 21000.00 20750.00 20408.30

And the histograms generated across all loci are blank

Migrate Outfile Histograms

So, what is going on here?

The outfile M estimates of M = 20,500 seem more reasonable as they translate to ~ 164 effective migrants per generation.

Simulations with Nm = 150

Let’s see what that looks like in simulations (with Nm = 150, m = 50,000/150 = 0.003). All other parameters the same as above.

sims <- list()
for(sim in list.files("IBDsim", pattern = "^Nm", include.dirs=T, full.names = T)){
  simname <- sim %>% str_match("F\\d+_A.+") %>% as.character()
  sims[simname] <- read.genepop(file.path(sim,"stepstone_radseq.gen"),ncode = 3, quiet = T)
  
}

# use map() to loop over all the genepop objects and calculate Fst 
locus_fsts <- sims %>% map(\(x) basic.stats(genind2hierfstat(x)))
locus_fsts$Observed <- stats_haps

fst_hists <- locus_fsts %>% map(\(x) ggplot() + geom_histogram(x$perloc, mapping=aes(x=Fst),binwidth = 0.01) + xlim(0,1))

cowplot::plot_grid(plotlist = fst_hists, labels = names(fst_hists))

Check IBD

fsts <- sims %>% map(\(x) genet.dist(genind2hierfstat(x), method = "Nei87"))


ibd_plots <- fsts %>% map(\(x) ggplot(tibble(Lattice_Dist = as.vector(sim_geo_dists), Fst = as.vector(x))) + 
                             geom_point(aes(x = Lattice_Dist, y = Fst)) + ylim(0,0.4))

ibd_plots$Observed <- tibble(Geog_Dist = geo.dist, Fst = fst_haps) %>%  ggplot() + 
                              geom_point(aes(x = Geog_Dist, y = Fst)) + ylim(0,0.4)

cowplot::plot_grid(plotlist = ibd_plots, labels = names(ibd_plots))

## Don't know how to automatically pick scale for object of type <dist>.
## Defaulting to continuous.

That’s more like it!

So simulations are consistent with Nm = 150 (M > 20K) but I still don’t understand the inconsistency between the raw output values for M ( M < 10) into bayesallfile and the summary found outfile.txt (M > 20K).

Re-Running

So I went back and forth with Peter about this, and it seems to have to do with an issue with models with a “d” or “D” divergence parameter in them. Somehow they are maybe causing an “underflow” that affects all m/mu parameters. What’s more this seems to only happen on AMD chips, according to Peter. Anyway, it is a bug, and now I need to re-run all of the models.

I also have changed the sampler for migration parameters to Metropolis-Hastings, widened the migration prior, and had it start from the middle of the migration prior. All changes to the parmfiles are noted here:

Migrate parmfile revisions

migration=PRIOR:50
bayes-proposals= MIG METROPOLIS-HASTINGS Sampler
bayes-priors= MIG WEXPPRIOR: 0.000100 1000.000000 10000000 10000
bayes-posteriorbins= 2000 2000

I ran all the models for 5 replicates during the month of January. The runs would have been completed much sooner, but Kinilau went down due to power failures many times during this period. I also had to rerun a number of models which still had histogram bins set to 500 (which changes the marginal likelihood of the model - upward bias).

A strong prior on Oahu-Kure divergence?

For models with Divergence between Oahu and Kure, I considered using the very stong prior below, which corresponds to maximum split of 1.63e-6 (assumes fast mutation rate of 1e-8 * 163 years since founding of Midway naval base)

#for models with Divergence between Oahu and Kure
bayes-priors= SPLIT * * WEXPPRIOR: 0.0 8e-7 1.63e-6 1.0e-7 
bayes-priors= SPLITSTD * * WEXPPRIOR: 0.0 2e-7 2e-6 1.0e-7
# this corresponds to maximum split of 1.63e-6 (assumes fast mutation rate of 1e-8 * 163 years since founding of Midway naval base)

My reasoning is that Midway - was discovered by westerners in 1859. Naval activity increased in 1941. So the earliest date that westerners could have transported P. tuberculosa to Midway is 1859, but more likely to have happened in 1941ish.

mulo = 1e-9
generationtime = 1
muhi = 1e-8

max_age = 2022-1859

max_split = max_age * (muhi /generationtime)
mean_split = 80 * muhi
sd_split = 20 * muhi

So, a maximum splitting time of 1.63^{-6}, with a mean of 8^{-7} and pick an sd of 2^{-7}. In the end, I didn’t use this prior on any of the models.

In the end, I didn’t apply this prior to any of the models for the following reasons:

Migrate doesn’t handle such a small prior. It reset the maximum to 0.00002.
I also couldn’t seem to add the prior to models that had other divergence times. Whenever I tried to individual priors on all divergence times, with the small prior on Oahu->Kure, Migrate would throw a segmentation fault.

But note that for equilibrium models with just the divergence from Oahu to Kure, the strong prior did improve the marginal likelihood.

Copy it Down

rsync -av -e ssh --exclude='bayes*' --exclude="*.pdf"  eric@kinilau.tobolab.org:~/run6/ ./final

Model Selection Results

Model Marginal Likelihoods

runDir <-"~/github/Palythoa_tuberculosa/migrate/run6/final/"

likelist3 <- list()
for(r in 1:5){
  rep = paste0("rep",r)
  print(rep)
  likelist3[[rep]] <- harvest.model.likelihoods(workingDir=file.path(runDir,rep))
}

# Model selection for each replicate...
likelist3 %>% map(bfcalcs)

The final model marginal likelihood estimates based on the mean of 5 replicates:

model_letters <- c("C","B","A","G","H","J","I","D","E","F","K","L","M","O","P","Q","N")

like3.df <-  likelist3 %>% bind_rows() %>% group_by(model)

means3 <- like3.df %>% summarize(bezier.corrected = mean(bezier.corrected))
  

final_model3 <- bfcalcs(means3) %>% bind_cols(model_letters=model_letters) %>% arrange(model_letters)

final_model3

##                                  model bezier.corrected       lbf choice
## 1                             panmixia        -70726.61 -1657.472      8
## 2                               island        -73364.51 -6933.268     17
## 3                             NWHI_MHI        -70735.71 -1675.676      9
## 4                stepping.stone.breaks        -71043.93 -2292.108     12
## 5             stepping.stone.breaks.KO        -70653.59 -1511.420      5
## 6      stepping.stone.breaks.splits.KO        -71015.46 -2235.172     11
## 7                       stepping.stone        -71215.42 -2635.088     14
## 8                    stepping.stone.KO        -70715.82 -1635.888      7
## 9                   stepping.stone.KOd        -71149.81 -2503.876     13
## 10                 stepping.stone.KO.D        -71007.04 -2218.336     10
## 11               stepping.stone.oneway        -72217.46 -4639.172     16
## 12             stepping.stone.splits.d        -71906.80 -4017.840     15
## 13              stepping.stone.splitsD        -70666.44 -1537.132      6
## 14     stepping.stone.splitsD.KOsplitd        -70650.82 -1505.880      4
## 15     stepping.stone.splitsD.2way.KOd        -70086.97  -378.184      3
## 16 stepping.stone.splitsD.2way.NS.KO.D        -69897.88     0.000      1
## 17 stepping.stone.splitsD.2way.SN.KO.D        -69956.86  -117.968      2
##       modelprob model_letters
## 1  0.000000e+00             A
## 2  0.000000e+00             B
## 3  0.000000e+00             C
## 4  0.000000e+00             D
## 5  0.000000e+00             E
## 6  0.000000e+00             F
## 7  0.000000e+00             G
## 8  0.000000e+00             H
## 9  0.000000e+00             I
## 10 0.000000e+00             J
## 11 0.000000e+00             K
## 12 0.000000e+00             L
## 13 0.000000e+00             M
## 14 0.000000e+00             N
## 15 7.557668e-83             O
## 16 1.000000e+00             P
## 17 2.418657e-26             Q

#write_csv(final_model3, "tables/final_model_table.csv")

T-Test

The two best models are stepping.stone.splitsD.KOsplitd (sequential colonization with gene flow north to south, then colonization of Kure from Oahu wihout gene flow 94% relative probability) and stepping.stone.breaks.KO (equilibrium gene flow with two lumped metapopulations in the NWHI and gene flow between Oahu and Kure). With 5 reps, there is no significant difference in marginal likelihood between these two.

top.choice <- final_model3$model[which(final_model3$choice ==1)]
second.choice <- final_model3$model[which(final_model3$choice ==2)]

TS <- permTS(like3.df$bezier.corrected[which(like3.df$model == top.choice)],
       like3.df$bezier.corrected[which(like3.df$model == second.choice)],
       alternative = "greater")

TS

## 
##  Exact Permutation Test (network algorithm)
## 
## data:  GROUP 1 and GROUP 2
## p-value = 0.003968
## alternative hypothesis: true mean GROUP 1 - mean GROUP 2 is greater than 0
## sample estimates:
## mean GROUP 1 - mean GROUP 2 
##                      58.984

Figures

And some figures summarizing all this.

model_letters2 <- rep(model_letters,5)

likesPlot_letters <- likelist3 %>% bind_rows() %>% bind_cols(model_letters2) %>% 
              group_by(model) %>% 
              ggplot(mapping = aes(x=model_letters2, y = bezier.corrected)) +
              geom_boxplot() +
              labs(x = "Metapopulation Model", 
                   y = "Bezier Corrected Marginal Likelihood") + ylim(-73500, -69000)

## New names:
## • `` -> `...5`

likesPlot_letters

likesPlot_letters

ggsave("figures/final_figures/MarginalLikelihoodsPlot.pdf",likesPlot_letters,width = 10.7, height = 6.42)

Parameter Estimates

stepping.stone.splitsD.2way.NS.KO.D

For the model with divergence-with-gene-flow from north to south, followed by divergence-with-gene-flow from Oahu to Kure (Model O)

Read in the posteriors and make a labeling key

#
Dir <- "/Users/eric/Library/CloudStorage/OneDrive-ThePennsylvaniaStateUniversity/Documents/Research/Ptuberculosa/migrate/run6"
      
winningModel <- "stepping.stone.splitsD.2way.NS.KO.D_final"

#posteriors <- list()

#read in the posterior for two replicates this may take 5 minutes
#for(r in 1:2){
#  print(paste("Loading Rep",r))
#  posteriors[[r]] <- read.table(file.path(Dir,paste0("rep",r),winningModel,
#                                      "bayesallfile"), header=T) 
#}

posterior <- read_table(file.path(Dir,"rep1",winningModel,"bayesallfile"), comment = "#") %>% 
                  migrants.per.gen() %>% dplyr::select(starts_with(c("Locus","Steps","Replicate","Theta_", "M_", 
                                                    "Nm_","D_","d_")))

#create a long version for locus-by-locus plots
posterior_long <- posterior %>% pivot_longer(starts_with(c("Theta_", "M_", 
                                                           "Nm_","D_","d_")),
                                                    names_to = "parameter",
                                                    values_to = "value") 


#posteriors12_long <- posteriors12 %>% pivot_longer(starts_with(c("Theta_", "M_", 
#                                                           "Nm_","D_","d_")),
#                                                    names_to = "parameter",
#                                                    values_to = "value") 


# read in a population key to translate numbers into names in figures
popkey <- read_csv("popkey.csv")
names(popkey$Hawaiian_Pop) <- popkey$Index
#make a new labeling key
posterior_parameters <- names(posterior) %>% str_subset("Theta|M_|Nm_|D_|d_")

parameter_key <- posterior_parameters %>% 
                                  str_replace("(\\d)_(\\d)", "\\1 to \\2") %>%
                                  str_replace("M_", "m/mu " ) %>% 
                                  str_replace("Theta_", "theta ") %>% 
                                  str_replace_all(popkey$Hawaiian_Pop) %>% 
                                  str_replace("Nm_", "4Nem ") 

parameter_key <- parameter_key[c(1,2,3,4,5,6,7,8,9,
                                 18,12,11,14,13,16,15,10,19,17,21,20,23,22,25,24,
                                 34,28,27,30,29,32,31,26,35,33,37,36,39,38,41,40,
                                 43,44,45,46,47,48,49,42)]
                                  

names(parameter_key) <- posterior_parameters[c(1,2,3,4,5,6,7,8,9,
                                 18,12,11,14,13,16,15,10,19,17,21,20,23,22,25,24,
                                 34,28,27,30,29,32,31,26,35,33,37,36,39,38,41,40,
                                 43,44,45,46,47,48,49,42)]

Check Convergence

posterior.grouped <- posterior %>% 
                       dplyr::select(all_of(c("Locus","Replicate",
                                         "Steps",
                                         posterior_parameters))) %>% 
                       group_by(Replicate)

posterior.reps <- group_split(posterior.grouped)
posterior.key <- group_keys(posterior.grouped)

#thin by taking every other step
#posteriors2 <- lapply(posteriors,filter,Steps%%500==0)


# select parameters only
#posterior2 <-  posterior %>% dplyr::select(starts_with(c("Theta_", "M_", 
#                                                          "Nm_","D_","d_")))

posteriors.mcmc <- posterior.reps %>% map(~ dplyr::select(., starts_with(c("Theta_", "M_", 
                                                          "Nm_","D_","d_"))))  %>% map(mcmc)      
posteriors.mcmc <- mcmc.list(posteriors.mcmc)
posterior.ggs <- ggs(posteriors.mcmc, burnin = F)

ggmcmc(posterior.ggs, param_page = 4)







#if I wanted to pull out parameters by locus, but I don't - Peter calculates Effective Sizes on the whole posterior

posteriors3 <- posterior %>% dplyr::select(starts_with(c("Replicate","Steps","Locus","Theta_", "M_", 
                                                          "Nm_","D_","d_"))) %>% mutate(Locus_name = Locus)

posteriors3_Nm <- posteriors3 %>% pivot_wider(id_cols = c("Steps", "Replicate","Locus"), 
                                                            names_from = "Locus_name",
                                               values_from =starts_with(c("Theta_", "M_", 
                                                          "Nm_","D_","d_")), names_sep = "-L")

posteriors3_Nm.mcmc <- mcmc(posteriors3_Nm)

effectiveSize(posteriors3_Nm.mcmc)

Effective Sizes

Effective sizes are good

effectiveSize(posteriors.mcmc[[1]]) (first replicate)

Theta_1   Theta_2   Theta_3   Theta_4   Theta_5   Theta_6   Theta_7   Theta_8   Theta_9     M_6_1     M_3_2     M_2_3     M_4_3     M_3_4     M_5_4     M_4_5 
 755139.9  826428.2  796069.0  891979.6  379318.7  679559.4  696500.7  951356.4  691646.8  956594.2  953136.2  968733.7 1094410.4 1089891.0  849360.1 1089891.0 
    M_6_5     M_1_6     M_5_6     M_7_6     M_6_7     M_8_7     M_7_8     M_9_8     M_8_9    Nm_6_1    Nm_3_2    Nm_2_3    Nm_4_3    Nm_3_4    Nm_5_4    Nm_4_5 
1043252.9  963042.5 1089891.0  959816.7  882511.5  924205.2 1089891.0  928377.0  952468.5  879150.2  941641.8  930176.8 1095970.2  928354.5  783268.4  875664.3 
   Nm_6_5    Nm_1_6    Nm_5_6    Nm_7_6    Nm_6_7    Nm_8_7    Nm_7_8    Nm_9_8    Nm_8_9     D_6_1     D_2_3     D_3_4     D_4_5     D_5_6     D_6_7     D_7_8 
1122918.5 1089891.0  901501.8  916741.7  910306.5  881627.3  964591.7  953676.9  855007.4 1089891.0 1110109.8 1084510.7  800135.8  989120.0  945399.5  890976.6 
    D_8_9 
 781656.3

Gelman-Rubin Diagnostic

Gelman-Rubin diagnostic across 3 replicates looks great!


gelman.diag(posteriors.mcmc, autoburnin = F)

Potential scale reduction factors:

        Point est. Upper C.I.
Theta_1          1          1
Theta_2          1          1
Theta_3          1          1
Theta_4          1          1
Theta_5          1          1
Theta_6          1          1
Theta_7          1          1
Theta_8          1          1
Theta_9          1          1
M_6_1            1          1
M_3_2            1          1
M_2_3            1          1
M_4_3            1          1
M_3_4            1          1
M_5_4            1          1
M_4_5            1          1
M_6_5            1          1
M_1_6            1          1
M_5_6            1          1
M_7_6            1          1
M_6_7            1          1
M_8_7            1          1
M_7_8            1          1
M_9_8            1          1
M_8_9            1          1
Nm_6_1           1          1
Nm_3_2           1          1
Nm_2_3           1          1
Nm_4_3           1          1
Nm_3_4           1          1
Nm_5_4           1          1
Nm_4_5           1          1
Nm_6_5           1          1
Nm_1_6           1          1
Nm_5_6           1          1
Nm_7_6           1          1
Nm_6_7           1          1
Nm_8_7           1          1
Nm_7_8           1          1
Nm_9_8           1          1
Nm_8_9           1          1
D_6_1            1          1
D_2_3            1          1
D_3_4            1          1
D_4_5            1          1
D_5_6            1          1
D_6_7            1          1
D_7_8            1          1
D_8_9            1          1

Theta Estimates

Locus by Locus Plots

theta_plot <- posterior_long %>%  
    filter(str_detect(parameter, "Theta_")) %>%  
    ggplot() + 
    geom_density(mapping = aes(x = value, group=Locus, color
                               =Locus), weight = 0.5) + 
    scale_x_log10(limits=c(0.0001,0.01)) + 
    facet_wrap(~parameter, scales = "free_y",
               labeller = labeller(parameter= parameter_key), 
               dir = "v",nrow=2) +
    theme(axis.text.x = element_text(size = rel(0.6)))

#ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/Theta_loci.pdf",theta_plot,width = 10.7, height = 6.42)

Theta - Locus by Locus

Summed Over Loci Plots

parameter_type <- "Theta"

parameters <- names(posterior) %>%
                        str_subset(parameter_type)

parameter_key2 <- parameters
names(parameter_key2) <- parameter_key[names(parameter_key) %in% parameter_key2]

#summarize the posterior for theta
thetas <- summarize_posterior(posterior, parameter_type = parameter_type, 
                              exponential_mean=0.001,lower.bound=0, upper.bound=0.1,n=2^14)

#convert to long format
thetas_long <- thetas %>% dplyr::select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density")

theta.colors <- rev(brewer.pal(9,"Spectral"))
theta.colors.light <- lighten(theta.colors)
theta.colors.dark <- darken(theta.colors)

#plot
thetas_density <-  ggplot(thetas_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(linewidth = 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
        scale_color_discrete(type = theta.colors, 
                             labels = str_replace(parameter_key, "theta ",""),
                             name= TeX("$\\Theta$ Parameter")) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") + labs(x = TeX("$\\Theta$")) +
        ylim(0,0.175) + xlim(0,0.005) + 
        theme(axis.text.x = element_text(size = rel(2)))

thetas_ridges <- ggplot(thetas_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_fill_discrete(type = theta.colors, 
                             labels = str_replace(parameter_key, "theta ",""),
                             name= TeX("$\\Theta$ Parameter")) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$\\Theta = 4N_{e}\\mu$")) + xlim(0,0.005) + 
        theme(axis.text.x = element_text(size = rel(2)))

thetas_violins <- ggplot(thetas_long, aes(x = Parameter, y = x, 
                                          violinwidth = Density*10, 
                                          fill = Parameter)) +
        geom_violin(stat = "identity") + 
        scale_fill_discrete(type = theta.colors, 
                             labels = str_replace(parameter_key, "theta ",""),
                             name= TeX("$\\Theta$ Parameter")) +
        scale_x_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(y = TeX("$\\Theta$")) + ylim(0,0.005) + 
        theme(axis.text.x = element_text(size = rel(2)))


# get stats for each summed-over-loci posterior

theta_stats <- parameter_key2 %>% map_dfr(.f = ~ posterior_stats(df = thetas, 
                                                             parameter = .x),
                                      .id = "parameter")

# write_csv(theta_stats, "tables/stepping.stone.splitsD.2way.NS.KO.D/theta_stats.csv")
# 
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/thetas_density.jpg",thetas_density, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/thetas_ridges.jpg",thetas_ridges, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/thetas_violins.jpg",thetas_violins,width = 10.7, height = 6.42)
# 
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/thetas_density.pdf",thetas_density,width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/thetas_ridges.pdf",thetas_ridges,width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/thetas_violins.pdf",thetas_violins, width = 10.7, height = 6.42)
#https://adiradaniel.netlify.app/post/ggmultipane/#using-ggpubrggarrange
#to make multipanel figures

Theta estimates:

theta_stats <- read_csv("/Users/eric/github/Palythoa_tuberculosa/tables/stepping.stone.splitsD.2way.NS.KO.D/theta_stats.csv")

## Rows: 9 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

theta_stats

## # A tibble: 9 × 9
##   parameter       `2.5%`   `25%`   `50%`   `75%` `97.5%`    mode    mean      sd
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 theta Holaniku 0.00204 0.00212 0.00216 0.00221 0.00230 0.00215 0.00216 6.79e-5
## 2 theta Manawai  0.00132 0.00132 0.00132 0.00134 0.00135 0.00132 0.00133 3.69e-5
## 3 theta Kamokuo… 0.00129 0.00133 0.00135 0.00138 0.00143 0.00135 0.00135 3.66e-5
## 4 theta Lalo     0.00195 0.00201 0.00204 0.00208 0.00216 0.00203 0.00204 5.33e-5
## 5 theta Kaua'i   0.00274 0.00280 0.00283 0.00286 0.00292 0.00283 0.00283 4.70e-5
## 6 theta O'ahu    0.00236 0.00237 0.00238 0.00239 0.00243 0.00237 0.00239 1.86e-5
## 7 theta Moloka'i 0.00206 0.00211 0.00214 0.00217 0.00225 0.00214 0.00214 4.89e-5
## 8 theta Maui     0.00215 0.00219 0.00222 0.00225 0.00233 0.00222 0.00222 4.71e-5
## 9 theta Hawai'i  0.00291 0.00299 0.00303 0.00308 0.00317 0.00303 0.00303 6.83e-5

Theta Ridges

m/mu estimates

Locus by Locus Plots

M_plot <- posterior_long %>%  
    filter(str_detect(parameter, "M_")) %>%  
    ggplot() + 
    geom_density(mapping = aes(x = value, group=Locus, color
                               =Locus), weight = 0.5) + 
    scale_x_log10(limits=c(1,10000)) + 
    facet_wrap(~parameter, scales = "free_y",
               labeller = labeller(parameter= parameter_key), 
               dir = "v",nrow=2) +
    theme(axis.text.x = element_text(size = rel(0.6)))

#ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/M_loci.pdf",M_plot,width = 10.7, height = 6.42)

M/mu Locus By Locus ##### Summed Over Loci Plots

parameter_type <- "M"

parameters <- names(posterior) %>%
                        str_subset(parameter_type)

parameter_key2 <- subset(parameter_key, grepl(pattern = parameter_type, x = names(parameter_key)))

#summarize the posterior for theta
Ms <- summarize_posterior(posterior, parameter_type = parameter_type, 
                              exponential_mean = 1000, n = 2^15, 
                          lower.bound = 0, upper.bound = 10000)

#convert to long format
Ms_long <- Ms %>% dplyr::select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density") 

migration.colors <- c(rbind(theta.colors, theta.colors.dark))
migration.colors <- migration.colors[c(1,3,4,5,6,7,8,2,9,10,11,12,13,14,15,16)]
#plot
Ms_density <-  ggplot(Ms_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        scale_x_continuous(limits = c(0,1000)) + 
       scale_color_discrete(type = migration.colors, 
                             labels = str_replace(parameter_key2, "m/mu ",""),
                             name= TeX("$\\frac{m}{\\mu}$ Parameter")) +
        labs(x = TeX("$\\frac{m}{\\mu}$")) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") + 
        theme(axis.text.x = element_text(size = rel(2)))

 
           
#plot
M_ridges <- ggplot(Ms_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_x_continuous(limits = c(0,1000)) +
        scale_fill_discrete(type = migration.colors, 
                             labels = str_replace(parameter_key2, "m/mu ",""),
                             name= TeX("$\\frac{m}{\\mu}$ Parameter")) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$\\frac{m}{\\mu}$")) + 
        theme(axis.text.x = element_text(size = rel(2)))

#get stats for each summed-over-loci posterior
parameter_key3 <- setNames(names(parameter_key2), parameter_key2)

Ms_stats <- parameter_key3 %>% map_dfr(.f = ~ posterior_stats(df = Ms, 
                                                               parameter = .x),
                                        .id = "parameter")
  
# write_csv(Ms_stats, "tables/stepping.stone.splitsD.2way.NS.KO.D/Mmu_stats.csv")
# 
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/Mmu_density.jpg",
#          Ms_density, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/Mmu_ridges.jpg",
#          M_ridges, width = 10.7, height = 6.42)
# 
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/Mmu_density.pdf",Ms_density, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/Mmu_ridges.pdf",
#         M_ridges, width = 10.7, height = 6.42)
# 
# M_ridges2<-M_ridges + theme(legend.position = "none")
# ggsave(filename = "M_ridges_nolegend.pdf",plot=M_ridges2, path = "./figures/stepping.stone.splitsD.2way.NS.KO.D/",
#          width = 10.7, height = 6.42)

M/mu estimates:

mmu_stats <- read_csv("tables/stepping.stone.splitsD.2way.NS.KO.D/Mmu_stats.csv")

## Rows: 16 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mmu_stats

## # A tibble: 16 × 9
##    parameter                  `2.5%` `25%` `50%` `75%` `97.5%`  mode  mean    sd
##    <chr>                       <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1 m/mu Holaniku to O'ahu       233.  253.  265.  276.    299.  264.  265. 17.0 
##  2 m/mu Manawai to Kamokuoka…   461.  478.  485.  490.    496.  489.  483.  9.39
##  3 m/mu Kamokuokamohoali'i t…   470.  502.  520.  538.    573.  519.  520. 26.7 
##  4 m/mu Kamokuokamohoali'i t…   517.  548.  565.  581.    612.  565.  565. 24.1 
##  5 m/mu Lalo to Kamokuokamoh…   368.  394.  408.  424.    456.  407.  409. 22.5 
##  6 m/mu Lalo to Kaua'i          440.  468.  484.  500.    533.  482.  484. 23.7 
##  7 m/mu Kaua'i to Lalo          308.  330.  342.  354.    379.  341.  342. 18.2 
##  8 m/mu O'ahu to Holaniku       490.  515.  532.  550.    592.  527.  534. 26.1 
##  9 m/mu Kaua'i to O'ahu         466.  500.  517.  535.    567.  518.  517. 25.9 
## 10 m/mu O'ahu to Kaua'i         288.  314.  328.  342.    367.  328.  328. 20.3 
## 11 m/mu O'ahu to Moloka'i       495.  526.  543.  561.    596.  542.  544. 25.8 
## 12 m/mu Moloka'i to O'ahu       287.  310.  323.  335.    361.  322.  323. 18.9 
## 13 m/mu Moloka'i to Maui        464.  498.  517.  537.    578.  515.  518. 29.0 
## 14 m/mu Maui to Moloka'i        260.  282.  294.  306.    331.  292.  294. 18.0 
## 15 m/mu Maui to Hawai'i         479.  500.  513.  528.    558.  511.  515. 20.7 
## 16 m/mu Hawai'i to Maui         214.  236.  249.  262.    287.  248.  249. 18.8

M/mu Ridges

4Nem estimates

Locus by Locus Plots

Nm_plot <- posterior_long %>%  
    filter(str_detect(parameter, "Nm_")) %>%  
    ggplot() + 
    geom_density(mapping = aes(x = value, group=Locus, color
                               =Locus), weight = 0.5) + 
    scale_x_log10(limits=c(0.01,10)) + 
    facet_wrap(~parameter, scales = "free_y",
               labeller = labeller(parameter= parameter_key), 
               dir = "v",nrow=2) +


#ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_loci.pdf",Nm_plot,width = 10.7, height = 6.42)

4Nm Locus By Locus

Summed Over Loci Plots

parameter_type <- "Nm"

parameters <- names(posterior) %>%
                        str_subset(parameter_type)

parameter_key2 <- subset(parameter_key, grepl(pattern = parameter_type, x = names(parameter_key)))


#summarize the posterior for nm parameters
nms <- summarize_posterior(posterior, parameter_type = parameter_type, 
                           exponential_mean = 1, n = 2^14, lower.bound=0, 
                           upper.bound=100, floor = 1e-10)

#convert to long format
nms_long <- nms %>% dplyr::select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density")

          
#plot
Nms_density <-  ggplot(nms_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
       scale_color_discrete(type = migration.colors, 
                             labels = str_replace(parameter_key2, "4Nem ",""),
                             name= TeX("$4N_{e}m$ Parameter")) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") + xlim(0,3) + 
        theme(axis.text.x = element_text(size = rel(2)))
  
           
#plot
nm_ridges <- ggplot(nms_long, aes(x = x, y = Parameter, height = Density*5, 
                    fill = Parameter)) +
        geom_ridgeline(stat = "identity") + 
        #scale_x_log10(limits = c(0.01,100)) +
        scale_fill_discrete(type = migration.colors, 
                             labels = str_replace(parameter_key2, "4Nem ",""),
                             name= TeX("$4N_{e}m$ Parameter")) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("$4N_em$")) +
        xlim(0,3) + 
        theme(axis.text.x = element_text(size = rel(2)))
  
nm_violins <- ggplot(nms_long, aes(x = Parameter, y = x, violinwidth = Density*10, 
                    fill = Parameter)) +
        geom_violin(stat = "identity") + 
        scale_fill_discrete(type = migration.colors, 
                             labels = str_replace(parameter_key2, "4Nem ",""),
                             name= TeX("$4N_{e}m$ Parameter")) + 
        labs(y = TeX("$4N_em$")) +
        ylim(0,3) + 
        theme(axis.text.x = element_text(size = rel(2)))
# get stats for each summed-over-loci posterior
parameter_key3 <- setNames(names(parameter_key2), parameter_key2)
nms_stats <- parameter_key3 %>% map_dfr(.f = ~ posterior_stats(df = nms, 
                                                             parameter = .x),
                                      .id = "parameter")

# write_csv(nms_stats, "tables/stepping.stone.splitsD.2way.NS.KO.D/4Nm_stats.csv")
#  
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_density.jpg",Nms_density,width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_ridges.jpg",nm_ridges,width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_violins.jpg",nm_violins,width = 10.7, height = 6.42)
# 
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_density.pdf",Nms_density,width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_ridges.pdf",nm_ridges,width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_violins.pdf",nm_violins,width = 10.7, height = 6.42)
# nm_ridges2 <- nm_ridges + theme(legend.position = "none")
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/4Nm_ridges_nolegend.pdf",nm_ridges2,width = 10.7, height = 6.42)

$4N_em$ estimates:

nm_stats <- read_csv("tables/stepping.stone.splitsD.2way.NS.KO.D/4Nm_stats.csv")

## Rows: 16 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nm_stats

## # A tibble: 16 × 9
##    parameter                `2.5%` `25%` `50%` `75%` `97.5%`  mode  mean      sd
##    <chr>                     <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>   <dbl>
##  1 4Nem Holaniku to O'ahu    0.610 0.616 0.623 0.623   0.623 0.623 0.620 0.00529
##  2 4Nem Manawai to Kamokuo…  1.85  1.85  1.85  1.86    1.86  1.85  1.85  0.0440 
##  3 4Nem Kamokuokamohoali'i…  0.476 0.488 0.494 0.501   0.513 0.494 0.494 0.0109 
##  4 4Nem Kamokuokamohoali'i…  2.09  2.11  2.12  2.15    2.21  2.11  2.13  0.0333 
##  5 4Nem Lalo to Kamokuokam…  0.470 0.482 0.488 0.494   0.501 0.488 0.488 0.00976
##  6 4Nem Lalo to Kaua'i       1.92  2.01  2.08  2.14    2.26  2.06  2.08  0.0892 
##  7 4Nem Kaua'i to Lalo       1.61  1.61  1.62  1.63    1.65  1.61  1.62  0.0132 
##  8 4Nem O'ahu to Holaniku    1.80  1.90  1.97  2.03    2.15  1.97  1.97  0.0932 
##  9 4Nem Kaua'i to O'ahu      1.77  1.79  1.82  1.84    1.91  1.80  1.82  0.0373 
## 10 4Nem O'ahu to Kaua'i      1.79  1.82  1.84  1.87    1.97  1.83  1.85  0.0449 
## 11 4Nem O'ahu to Moloka'i    1.84  1.87  1.90  1.93    2.01  1.89  1.91  0.0452 
## 12 4Nem Moloka'i to O'ahu    0.934 0.964 0.983 0.995   1.20  0.983 1.00  0.0757 
## 13 4Nem Moloka'i to Maui     1.75  1.83  1.87  1.94    2.09  1.85  1.89  0.0891 
## 14 4Nem Maui to Moloka'i     1.67  1.67  1.67  1.68    1.68  1.67  1.67  0.00680
## 15 4Nem Maui to Hawai'i      2.14  2.26  2.33  2.39    2.54  2.31  2.33  0.103  
## 16 4Nem Hawai'i to Maui      0.769 0.775 0.781 0.781   0.781 0.781 0.779 0.00483

4Nm Ridges

Divergence estimates

Locus by Locus Plots

D_plot <- posterior_long %>%  
    filter(str_detect(parameter, "D_")) %>%  
    ggplot() + 
    geom_density(mapping = aes(x = value, group=Locus, color
                               =Locus), weight = 0.5) + 
    scale_x_log10(limits=c(1e-7,0.01)) + 
    facet_wrap(~parameter, scales = "free_y",
               labeller = labeller(parameter= parameter_key), 
               dir = "v",nrow=2) +
    theme(axis.text.x = element_text(size = rel(0.6)))

#ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/D_loci.pdf",D_plot,width = 10.7, height = 6.42)

Summed Over Loci Plots

parameter_type <- "D"

parameters <- names(posterior) %>%
                        str_subset(parameter_type)

parameter_key2 <- subset(parameter_key, grepl(pattern = parameter_type, x = names(parameter_key)))


#summarize the posterior for D parameters
ds <- summarize_posterior(posterior, parameter_type = parameter_type, 
                              exponential_mean=0.001, n = 2^15, lower.bound = 0, upper.bound = 0.005)

#convert to long format
ds_long <- ds %>% dplyr::select(all_of(c("x","prior",parameters))) %>% 
                pivot_longer(starts_with(c(paste0(parameter_type,"_"))), 
                            names_to = "Parameter",
                            values_to = "Density") 

          
#plot
ds_density <-  ggplot(ds_long, aes(x = x, y = Density, 
                    color = Parameter)) +
        geom_line(size = 1) + 
        #scale_x_log10(limits = c(5e-4,1e-2)) + 
        scale_color_discrete(type = theta.colors[c(3,4,5,6,7,8,9,1)], 
                             labels = str_replace(parameter_key2, "D_",""),
                             name= TeX("d Parameter")) +
        geom_line(aes(x = x, y=prior), color = "grey",
                  linetype="dotted") + xlim(0,0.00125) + 
        theme(axis.text.x = element_text(size = rel(2)))
  
           
#plot
ds_ridges <- ggplot(ds_long, aes(x = x, y = Parameter, height = Density, 
                    fill = Parameter)) +
        geom_density_ridges(stat = "identity") + 
        scale_fill_discrete(type = theta.colors[c(3,4,5,6,7,8,9,1)], 
                             labels = str_replace(parameter_key2, "D_",""),
                             name= TeX("d Parameter")) +
        scale_y_discrete(limits = rev, labels = NULL, breaks = NULL) +
        labs(x = TeX("Divergence Time (generations x $\\mu)$")) +
        xlim(0,0.00125) + 
        theme(axis.text.x = element_text(size = rel(2)))
  
ds_violins <- ggplot(ds_long, aes(x = Parameter, y = x, violinwidth = Density*20, 
                    fill = Parameter)) +
        geom_violin(stat = "identity") + 
       scale_fill_discrete(type = theta.colors[c(3,4,5,6,7,8,9,1)], 
                             labels = str_replace(parameter_key2, "D_",""),
                             name= TeX("d Parameter")) +
        ylim(0,0.00125) + 
        theme(axis.text.x = element_text(size = rel(2)))
# get stats for each summed-over-loci posterior
parameter_key3 <- setNames(names(parameter_key2), parameter_key2)
ds_stats <- parameter_key3 %>% map_dfr(.f = ~ posterior_stats(df = ds, 
                                                             parameter = .x),
                                      .id = "parameter")

# write_csv(ds_stats, "tables/stepping.stone.splitsD.2way.NS.KO.D/D_stats.csv")
# 
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/D_density.jpg",ds_density, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/D_ridges.jpg",ds_ridges, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/D_violins.jpg",ds_violins, width = 10.7, height = 6.42)
# 
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/D_density.pdf",ds_density, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/D_ridges.pdf",ds_ridges, width = 10.7, height = 6.42)
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/D_violins.pdf",ds_violins, width = 10.7, height = 6.42)
# ds_ridges2 <- ds_ridges + theme(legend.position = "none")
# ggsave("figures/stepping.stone.splitsD.2way.NS.KO.D/ds_ridges_nolegend.pdf",ds_ridges2,width = 10.7, height = 6.42)

D estimates:

ds_stats <- read_csv("tables/stepping.stone.splitsD.2way.NS.KO.D/D_stats.csv")

## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): parameter
## dbl (8): 2.5%, 25%, 50%, 75%, 97.5%, mode, mean, sd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ds_stats

## # A tibble: 8 × 9
##   parameter       `2.5%`   `25%`   `50%`   `75%` `97.5%`    mode    mean      sd
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 D_Manawai to … 9.35e-4 9.39e-4 9.44e-4 9.50e-4 9.67e-4 9.40e-4 9.46e-4 9.91e-6
## 2 D_Kamokuokamo… 9.48e-4 9.53e-4 9.58e-4 9.65e-4 9.84e-4 9.53e-4 9.60e-4 9.66e-6
## 3 D_Lalo to Kau… 8.42e-4 8.73e-4 8.90e-4 9.08e-4 9.47e-4 8.88e-4 8.91e-4 2.69e-5
## 4 D_Kaua'i to O… 8.19e-4 8.48e-4 8.65e-4 8.82e-4 9.17e-4 8.63e-4 8.65e-4 2.50e-5
## 5 D_O'ahu to Mo… 7.45e-4 7.78e-4 7.97e-4 8.16e-4 8.55e-4 7.95e-4 7.98e-4 2.79e-5
## 6 D_Moloka'i to… 7.50e-4 7.85e-4 8.05e-4 8.26e-4 8.66e-4 8.04e-4 8.06e-4 2.96e-5
## 7 D_Maui to Haw… 8.20e-4 8.57e-4 8.78e-4 9.00e-4 9.42e-4 8.77e-4 8.79e-4 3.11e-5
## 8 D_O'ahu to Ho… 7.57e-4 7.90e-4 8.08e-4 8.27e-4 8.66e-4 8.06e-4 8.09e-4 2.77e-5

D Ridges Divergence time is in generations multiplied by expected mutations (gen * $\mu$), following along on page 25 here. Beerli et al. (2019)

Assuming a 1 year generation time, and 1.2e-8 mutations/generation rate (Zhang et al. 2022)

d <-mean(ds_stats$mode) #units of gen*mu
gen <- 1
t <- d / (1.2e-8 * gen)

t

## [1] 72144.48

Divergence time is estimated at 7.2144485^{4} years before present

Second Sanity Check

Simulations with Migrate-derived parameters

OK. Just double checking that the new results make sense. Here is the parmfile for a coalescent simulation with a 1.2e-8 mutation rate (* 500bp = 6e-6) (Zhang et al. 2022), and a migration rate that is 429.3191321 times greater (0.0025759). From the present, there are 15 demes of 4.5002046^{4} effective individuals each, with an emigration rate of 6e-6 (which, when multiplied by the per-locus mutation rate (rather than per-site used in the paper), works out to 0.00262), which results in immigration rates of close to this value (see file EmpImmmigRate), exchanging individuals only with neighboring populations (stepping-stone model). Absorbing boundaries. At 7.2144485^{4} generations, there is a founder event of F individuals in one deme, which, at 7.2145485^{4} generations comes from a larger population of 100K effective individuals. Sampling is exactly as in the empirical observations in terms of 9 islands being sampled with the same sample sizes as we have.

I have varied the simulations to test the effect of [F]ounder event as such:

Size of [F]ounding Population
10
100
10000

Haplotypes

%%%%%%%%%% SIMULATION PARAMETERS %%%%%%%
Data_File_Name = stepstone_radseq
GenePop_File_Extension = .gen
Run_Number = 1
Pause = Never
Genepop = T
Migrate = F
Migraine_Settings = F
Nexus_File_Format = F

%%%%%%%%%% MARKERS PARAMETERS %%%%%%%%%
Locus_Number = 109
Mutation_Rate = 0.000006
Mutation_Model = JC69
MRCA_Sequence = TACTCTATATATTATGTTTGGTGTGTGATCTGGTTTAGTCGGGACTGCTTTGAGGCTCTTGATTCGAGCTGAACTTGGACAGCCAGGAGCTCTTTTAGGTGATGATCAACTTTATAATGTGATCGTCACTGCGCATGCATTTGTGATAATTTTTTTCTTGGTGATGCCTATGATGATTGGGGGATTCGGTAATTGGTTGGTTCCTTTAATGTTGGGGGCTCCTGATATGGCGTTCCCGCGGTTGAATAATATAAGTTTTTGGTTGCTTCCGCCTTCGTTGACTTTGTTGCTTGCTTCTTCTGCTGTTGAGAGTGGTGTAGGGACAGGTTGAACGGTTTATCCTCCTTTGTCTGGGAACTTAGCTCATGCTGGGGGTTCTGTGGATCTAGCTATCTTCTCGTTACACTTAGCTGGTGTATCTTCTATTTTAGGTGCTGTAAATTTTATTACTACGATCATTAATATGCGGTGACAGGGGATGCAATTTGAGCGGTTGCCTC
Sequence_Size = 500
Transition_Transversion_ratio = 2
Equilibrium_Frequencies = 0.25 0.25 0.25 0.25
Polymorphic_Loci_Only= False
Ploidy = Diploid

%%%%%%%%%% VARIOUS COMPUTATION OPTIONS %%%
DiagnosticTables = Hexp, Fis, Seq_stats, Effective_Dispersal, Iterative_Statistics


%%%%%%%%%% DEMOGRAPHIC OPTIONS %%%%%%%%
NewDemographicPhaseAt=0
Lattice_Boundaries = Absorbing
Total_Range_Dispersal = True
Dispersal_Distribution = SteppingStone
Immigration_Control = Simple1Dproduct
Total_Emigration_Rate = 0.002628559
Dist_max = 1
Lattice_SizeX = 15
Lattice_SizeY = 1
Ind_Per_Pop = 44324
Continuous_Deme_Size_Variation = None
Continuous_Lattice_Size_Variation = None

NewDemographicPhaseAt=70962
Lattice_SizeX = 1
Lattice_SizeY = 1
Ind_Per_Pop = 100

NewDemographicPhaseAt=70963
Lattice_SizeX = 1
Lattice_SizeY = 1
Ind_Per_Pop = 100000

%%%%%%%%%% SAMPLE PARAMETERS %%%%%%%%
%Sample_SizeX = 9
%Sample_SizeY = 9
Sample_Coordinates_X = 1 3 6 8 11 12 13 14 15
Sample_Coordinates_Y = 1 1 1 1 1  1  1  1  1
Ind_Per_Pop_Sampled = 10 4 5 9 15 10 10 10 17

Per-Locus FST

Now read and plot the results

theta.colors <- rev(brewer.pal(9,"Spectral"))
#read in the empirical haplotype data as numbered haps
haps <- read.genepop("filtered_data/Ptuberculosa_filtered_haps.gen", quiet = T)
#fst_haps <- mat_pw_fst(haps)
fst_haps <- genet.dist(genind2hierfstat(haps), method = "Nei87")
fst_haps <- fst_haps / (1-fst_haps)
stats_haps <- basic.stats(haps)

#read in the simulated data
sims <- list()
for(sim in list.files("IBDsim", pattern = "^second", include.dirs=T, full.names = T)){
  simname <- sim %>% str_match("F\\d+") %>% as.character()
  sims[simname] <- read.genepop(file.path(sim,"stepstone_radseq.gen"),
                                ncode = 3,  quiet = T)
}

# use map() to loop over all the genepop objects and calculate Fst 
locus_fsts <- sims %>% map(\(x) basic.stats(genind2hierfstat(x)))
locus_fsts$Observed <- stats_haps

fst_hists <- locus_fsts %>% map(\(x) ggplot() + geom_histogram(x$perloc, mapping=aes(x=Fst),binwidth = 0.005) +
                                xlim(-0.05,0.2) + ylim(0,25) + labs(y = "locus count", x = TeX("$F_{ST}$")))

fst_hists_all<-cowplot::plot_grid(plotlist = fst_hists, labels = c("A","B","C","D"), nrow=2)

#ggsave("figures/final_figures/fst_hists_all_letters.pdf", fst_hists_all, width = 8, height = 5, units = "in" )

fst_loci_distribution <- ggplot(locus_fsts$Observed$perloc, mapping=aes(x = Fst)) + geom_histogram() +
                          geom_histogram(locus_fsts$F100$perloc, mapping=aes(x = Fst), fill = theta.colors[9], alpha = 0.5) +
                          labs(y = "locus count", x = TeX("$F_{ST}$")) + 
                          theme(axis.text = element_text(size = rel(1.5)), 
                                axis.title = element_text(size = rel(2)))

#ggsave("figures/final_figures/fst_hists.pdf", fst_loci_distribution, width = 8, height = 5, units = "in" )

These results are quite reasonable. The observed values have more variance than the simulated values, for reasons I’m not quite sure of, but are probably biological in nature.

Check IBD

Once again, getting reasonable results, with the simulated results looking more orderly than the empirical values.

#create a "geographic" distance matrix                  
sim_geo_points <- matrix(c(1,1,3,1,6,1,8,1,11,1,12,1,13,1,14,1,15,1),ncol = 2, byrow = T)
sim_geo_dists <- pointDistance(sim_geo_points,lonlat=F) %>% as.dist()

# re-load the geographic distances and convert to distance matrix
geo.dist.mat <- read.csv(file = "IBD/Ptuberculosa_pairwise_great_circle_distances_km.csv")
geo.dist <- as.dist(geo.dist.mat)

# (using Nei 87 for multi-allelic data)
fsts <- sims %>% map(\(x) genet.dist(genind2hierfstat(x), method = "Nei87"))

ibd_plots <- fsts %>% map(\(x) ggplot(tibble(Lattice_Dist = as.vector(sim_geo_dists), Fst = as.vector(x)), 
                                      aes(x = Lattice_Dist, y = Fst)) + 
                             geom_point() + ylim(-0.02,0.075) +
                              labs(x = "Lattice Distance", y = TeX("$F_{ST}$")) + 
                            geom_smooth(method=lm, color = "blue") ) 
                
fst_haps.mat <- as.matrix(fst_haps)

fst.df <- melt(fst_haps.mat)[melt(lower.tri(fst_haps))$value,]
names(fst.df) <- c("row","col","fst")
fst.df <- fst.df %>% bind_cols(geo=geo.dist)

oahu.fst <- fst.df %>% filter(row == "93OAPtu" | col == "93OAPtu")

fst.df <- fst.df %>% filter(row != "93OAPtu" & col != "93OAPtu")




ibd_plots$Observed <- fst.df %>%  ggplot(aes(x = geo, y = fst)) + 
                              geom_point() + ylim(-0.02,0.075) +
                              labs(x = "Kilometers", y = TeX("$F_{ST}$")) + 
                              geom_smooth(method=lm, color = "blue") + 
                              geom_point(oahu.fst, mapping=aes(x = geo, y = fst), col = "red")

                              

ibd_compare <- cowplot::plot_grid(plotlist = ibd_plots, labels = names(ibd_plots), ncol=2)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Don't know how to automatically pick scale for object of type <dist>.
## Defaulting to continuous.
## `geom_smooth()` using formula = 'y ~ x'

#ggsave("figures/final_figures/ibd_compare.pdf", ibd_compare, width = 8, height = 5, units = "in" )
#sim_dists.df <-  bind_cols(fst = as.dist(sim_haps_fst), distance = as.dist(sim_geo_dists))

#mantelt_sims <- mantel.randtest(as.dist(sim_haps_fst),as.dist(sim_geo_dists), nrepet = 10000)
#lmodel_sims <- lm(fst ~ distance , sim_dists.df)

Literature Cited

Beerli, Peter, Somayeh Mashayekhi, Marjan Sadeghi, Marzieh Khodaei, and Kyle Shaw. 2019. “Population Genetic Inference With MIGRATE.” Current Protocols in Bioinformatics 68 (1): e87. https://doi.org/10.1002/cpbi.87.

Bradburd, Gideon S., Graham M. Coop, and Peter L. Ralph. 2018. “Inferring Continuous and Discrete Population Genetic Structure Across Space.” Genetics 210 (1): 33–52. https://doi.org/10.1534/genetics.118.301333.

Leblois, Raphaël, Arnaud Estoup, and Francois Rousset. 2009. “IBDSim: A Computer Program to Simulate Genotypic Data Under Isolation by Distance.” Molecular Ecology Resources 9 (1): 107–9. https://doi.org/10.1111/j.1755-0998.2008.02417.x.

Paris, Josephine R., Jamie R. Stevens, and Julian M. Catchen. 2017. “Lost in Parameter Space: A Road Map for Stacks.” Edited by Susan Johnston. Methods in Ecology and Evolution 8 (10): 1360–73. https://doi.org/10.1111/2041-210x.12775.

Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155 (2): 945–59.

Zhang, Jia, Zoe T Richards, Arne A S Adam, Cheong Xin Chan, Chuya Shinzato, James Gilmour, Luke Thomas, Jan M Strugnell, David J Miller, and Ira Cooke. 2022. “Evolutionary Responses of a Reef-building Coral to Climate Change at the End of the Last Glacial Maximum.” Molecular Biology and Evolution 39 (10): msac201. https://doi.org/10.1093/molbev/msac201.

Gene Flow Models for Palythoa tuberculosa

Setup

Introduction

Transfer files

SNPs and Haplotypes from Stacks

Process RadTags and Rename Files

R80 Dataset fromt the Stacks Pipeline

Find Symbiont Loci

SNPs

Filter to create final SNP dataset

Filter in Populations

Filter with snpR

Haplotypes

Fasta2Genotype

Isolation by Distance

SNPs

Calculate Distance Matrices

Calculate Fst

Calculate Geographic Distance

Read Matrices Back In

Calculate linear model

Remove Oahu

Remove Kure

Structure

Parameter settings

MAINPARAMS

EXTRAPARAMS (just changed NOADMIX to 1 for the no admixture runs)

Structure Threader

Summarize results

Admixture, location prior

No Admixture, Location Prior

ConStruct

Read and Convert Data

Empirical Data

Run a spatial model

Calculate Layer Contributions

Visualize Results

Effect of Missing Data

Use BCFtools to look at the distribution of missing data across samples

Migrate

Install Migrate

Locus Statistics & Mutation Model

Write a model block

Parmfile

Test Run 1

Results

Test Run 2

Revise m prior

Panmixia

Island Model

Results

Model Selection

Parameters

Models

Gene Flow between Oahu and Kure

What about one way gene flow?

Adding divergence models

stepping.stone.KO.D

stepping.stone.KOd

stepping.stone.splitsD

stepping.stone.splits.d

stepping.stone.splitsD.KOsplitD

Priors

Final Parmfile

Copy It Up

And Back down

Bash Script

Results

Model Marginal Likelihoods

Figures

Rerun the three top models

Final Model Selection Results

Model Marginal Likelihoods

T-Test

Figures

Model Parameter Estimates

Read in the posteriors and make a labeling key

Check Convergence

Effective Sizes

Gelman-Rubin Diagnostic