r - Phylogenetic tree
I am trying to build a phylogenetic tree based on gene-gene data. Below is a subset of my data (test.txt). The tree is not built from any DNA sequences; the gene names are treated simply as words.
    id  gene1   gene2
    1   ADRA1D  ADK
    2   ADRA1B  ADK
    3   ADRA1A  ADK
    4   ADRB1   ASIC1
    5   ADRB1   ADK
    6   ADRB2   ASIC1
    7   ADRB2   ADK
    8   AGTR1   ACH
    9   AGTR1   ADK
    10  ALOX5   ADRB1
    11  ALOX5   ADRB2
    12  ALPPL2  ADRB1
    13  ALPPL2  ADRB2
    14  AMI2    AGR1
    15  AR      ADORA1
    16  AR      ADRA1D
    17  AR      ADRA1B
    18  AR      ADRA1A
    19  AR      ADRA2A
    20  AR      ADRA2B
Below is my R code:

    library(ape)
    tab = read.csv("test.tx…
However, I only get the genes from the gene1 column and not those from the gene2 column. The plot below is actually what I want, but it should also contain the genes from the gene2 column.
My data is attached here.

I have a question about how the clustering is done. The pairs

    17  AR  ADRA1B
    18  AR  ADRA1A

and

    2  ADRA1B  ADK
    3  ADRA1A  ADK

should be clustered closely, since they have a gene in common: 17 and 2 should be together, and so should 18 and 3.
If this method (Euclidean distance) is the wrong one to use, should I use a different method?

Should I convert my data to a matrix of rows and columns, with gene1 on the x-axis and gene2 on the y-axis, where each cell is filled with 1 or 0? (Basically, 1 if the two genes are paired and 0 if they are not.)
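The gene1 by gene2 layout described here could be sketched as follows. This is only an illustration on four rows typed in by hand from the sample data (ids 2, 3, 17 and 18); the name pair.matrix is made up for the example.

```r
# A hand-typed subset of the sample data (ids 2, 3, 17 and 18).
tab <- data.frame(
  id    = c(2, 3, 17, 18),
  gene1 = c("ADRA1B", "ADRA1A", "AR", "AR"),
  gene2 = c("ADK", "ADK", "ADRA1B", "ADRA1A"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the pairs: rows are gene1 values, columns are gene2 values.
counts <- table(tab$gene1, tab$gene2)

# Turn the counts into a plain 0/1 matrix (1 = the pair occurs in the data).
pair.matrix <- (unclass(counts) > 0) * 1L

print(pair.matrix)
```

Note that the rows of such a matrix are gene1 values only, so anything computed per row (for example a distance between rows) covers the genes in gene1 and not those that appear only in gene2.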
Updated code:
    table = table(tab$gene1, tab$gene2)
    d = dist(table)
    fit = hclust(d, method = "ward")
    plot(as.phylo(fit))
My answer is only valid if each line represents one individual and the two genes listed are the only genes present in that individual. If, however, each line means that gene1 occurs together with gene2, then in my opinion no useful clustering can be done. In that case I would expect an extra column stating the likelihood of their co-occurrence, and something like a principal component analysis (PCA) might be the way to go, but I am far from being an expert on (hierarchical) clustering.

Before you can use the dist function, you have to bring your data into a suitable format:

    # Convert the format: one row per id, one 0/1 column per gene
    # All distinct gene names from both columns
    gene.names <- sort(unique(c(as.character(tab$gene1), as.character(tab$gene2))))
    gene.matrix <- cbind(tab[, "id"],
                         matrix(0L, nrow = nrow(tab), ncol = length(gene.names)))
    colnames(gene.matrix) <- c("id", gene.names)
    # Mark the two genes of each row as present
    for (x in seq_len(nrow(tab))) {
      genes <- c(as.character(tab$gene1[x]), as.character(tab$gene2[x]))
      gene.matrix[x, match(genes, colnames(gene.matrix))] <- 1L
    }
Result:

         id ACH ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
    [1,]  1   0   1      0      0      0      1      0
    [2,]  2   0   1      0      0      1      0      0
    [3,]  3   0   1      0      1      0      0      0
    [4,]  4   0   0      0      0      0      0      0
    ...
So each row represents one observation (= individual): the individual is identified in the first column, and each subsequent column contains a 1 if the gene is present and a 0 if it is absent. Now the dist function can be applied appropriately to this matrix (with the id column removed):

    d <- dist(gene.matrix[, -1], method = "euclidean")
    fit <- hclust(d, method = "ward")  # spelled "ward.D" in R >= 3.1.0
    plot(as.phylo(fit))                # as.phylo() comes from the ape package

Perhaps it is a good idea to read up on the differences between euclidean, manhattan, etc. For example, the Euclidean distance between the individuals with id = 1 and id = 2 is:

    euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)

while the Manhattan distance is:

    manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
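To make the comparison concrete, here is a small sketch that computes both distances with dist for the individuals with id = 1 (ADRA1D, ADK) and id = 2 (ADRA1B, ADK). The two 0/1 vectors are typed in by hand and restricted to five gene columns chosen for illustration; all other columns are 0 for both individuals and would not change the result.

```r
# Presence/absence vectors for id = 1 and id = 2 over a few gene columns.
# The column selection is an assumption made to keep the example short.
m <- rbind(
  id1 = c(ADK = 1, ADORA1 = 0, ADRA1A = 0, ADRA1B = 0, ADRA1D = 1),
  id2 = c(ADK = 1, ADORA1 = 0, ADRA1A = 0, ADRA1B = 1, ADRA1D = 0)
)

# Distances as computed by dist()
euclidean_dist <- as.numeric(dist(m, method = "euclidean"))
manhattan_dist <- as.numeric(dist(m, method = "manhattan"))

# The same distances written out by hand
diffs <- m["id1", ] - m["id2", ]
euclidean_by_hand <- sqrt(sum(diffs^2))
manhattan_by_hand <- sum(abs(diffs))

print(c(euclidean = euclidean_dist, manhattan = manhattan_dist))
```

The two individuals differ in exactly two genes, so the Euclidean distance is sqrt(2) and the Manhattan distance is 2. For 0/1 data the Manhattan distance simply counts the genes in which two individuals differ, and the Euclidean distance is the square root of that count.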