This notebook visualises the results of k-means clustering of the genomic variants from chromosome 22 of the 1000 Genomes Project dataset (phase 3).
We have reduced all the variants to 50 cluster centers, so that each of the ~2500 individuals can now be represented by a vector of size 50.
The results are available in: data/cluster-centers_chr22.csv.gz.
Now we will compute the average representation for each population by averaging the vectors of the individuals from that population, and then use hierarchical clustering to see which populations are similar.
# import pandas and set display options
import pandas as pd
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 8)
# read and display the representations
representationPD = pd.read_csv('data/cluster-centers_chr22.csv.gz', index_col=0)
representationPD
# read the pedigree file that associates individuals with their populations
pedPD = pd.read_csv('data/integrated_call_samples_v2.20130502.ALL.ped.bz2', sep='\t', index_col=1)
pedPD
# compute the average representation per population
populationRepresentationPD = representationPD.join(pedPD).groupby('Population')[list(representationPD.columns)].mean()
populationRepresentationPD
# load population descriptions
populationDescPD = pd.read_csv('data/1000_gen_populations.txt', sep='\t', index_col=0)
populationDescPD
# create labels for the dendrogram: `Super Population Code` + `Population Description`
populationLabels = populationDescPD \
.loc[populationRepresentationPD.index][['Super Population Code', 'Population Description']].apply(" ".join, axis=1)
populationLabels
# compute pair-wise distances between population representations
# and run hierarchical clustering
from scipy.spatial.distance import pdist
import scipy.cluster
pairWisePopulationDistances = pdist(populationRepresentationPD)
populationLinkage = scipy.cluster.hierarchy.linkage(pairWisePopulationDistances, method='complete')
# display the dendrogram
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (5,6)
plt.close()
scipy.cluster.hierarchy.dendrogram(populationLinkage, orientation='left',
                                   color_threshold=0.4, labels=populationLabels,
                                   leaf_font_size=10)
plt.show()
display()
We can clearly see from the chart above that the 50 cluster centers are indeed a reasonable representation of the entire chromosome.
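One way to make this visual impression a little more concrete is to cut the tree into a fixed number of flat clusters and cross-tabulate them against the super population codes. The sketch below is only an illustration: the helper name `clusters_vs_super_population` and the toy data are made up here and are not part of the original analysis. It uses the same complete-linkage clustering on Euclidean distances as the cells above.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def clusters_vs_super_population(representation, super_population, n_clusters):
    """Cut a complete-linkage tree into at most `n_clusters` flat clusters
    and cross-tabulate the clusters against super population codes.

    Hypothetical helper, for illustration only.
    """
    tree = linkage(pdist(representation), method='complete')
    flat = fcluster(tree, t=n_clusters, criterion='maxclust')
    return pd.crosstab(
        pd.Series(np.asarray(super_population), name='super_population'),
        pd.Series(flat, name='cluster'),
    )


# Toy data (made up): six "populations" forming two well separated groups
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
toy_super = ['A', 'A', 'A', 'B', 'B', 'B']

table = clusters_vs_super_population(toy, toy_super, n_clusters=2)
print(table)
```

With the notebook's own data, the same helper could be called as `clusters_vs_super_population(populationRepresentationPD, populationDescPD.loc[populationRepresentationPD.index]['Super Population Code'], n_clusters=5)`, since the 1000 Genomes Project defines five super populations. A table in which each row has most of its counts in a single column would support the reading of the dendrogram.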
You can now experiment by modifying pieces of the code.
When you are done and you are running off the local machine, remember to close the notebook with File/Close and Halt.