Methodology described in Drake et al. 2014:
Sequences from over 100 biomineral proteome studies were grouped by hierarchical clustering using the CD-HIT suite web server (Li and Godzik, 2006; Huang et al., 2010; http://weizhong-lab.ucsd.edu/cd-hit/) and assigned gene ontology (GO) terms using Blast2GO software (Conesa et al., 2005). Although the 1531 proteins were reduced to 1051 clusters at 30% similarity, only 64 clusters showed sequence similarity across phyla. Studies published from the 1990s through June 2013 that used N-terminal and mass spectrometry sequencing of carbonate organic matrix (COM) proteins, RT-PCR, or GO and KEGG annotation of genomic and transcriptomic data sets are included. Mass spectrometry sequences were excluded if the experimental data were compared with gene models from a different species.
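As an illustration of how cross-phyla clusters could be identified from CD-HIT output, the following minimal Python sketch parses the standard CD-HIT .clstr text format and reports clusters whose members come from two or more phyla. It is not the procedure used by Drake et al. 2014, who ran the CD-HIT suite web server. The function names, the example sequence identifiers (seqA through seqE), and the sequence-to-phylum mapping are hypothetical and serve only as an example.

```python
import re
from collections import defaultdict

# Member lines in a .clstr file look like: "0<TAB>245aa, >seqA... *"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(text):
    """Return {cluster_id: [sequence_id, ...]} from CD-HIT .clstr text."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
            clusters[current]  # register the cluster even before members are read
        else:
            match = MEMBER_RE.search(line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


def cross_phyla_clusters(clusters, phylum_of):
    """Return sorted ids of clusters whose members span two or more phyla."""
    result = []
    for cluster_id, members in clusters.items():
        # Members missing from the mapping are ignored (an assumption of this sketch).
        phyla = {phylum_of[m] for m in members if m in phylum_of}
        if len(phyla) >= 2:
            result.append(cluster_id)
    return sorted(result)


# Hypothetical example data: identifiers and phylum assignments are made up.
EXAMPLE_CLSTR = (
    ">Cluster 0\n"
    "0\t245aa, >seqA... *\n"
    "1\t240aa, >seqB... at 45.00%\n"
    ">Cluster 1\n"
    "0\t130aa, >seqC... *\n"
    ">Cluster 2\n"
    "0\t310aa, >seqD... *\n"
    "1\t305aa, >seqE... at 88.00%\n"
)
EXAMPLE_PHYLA = {
    "seqA": "Cnidaria",
    "seqB": "Mollusca",
    "seqC": "Mollusca",
    "seqD": "Echinodermata",
    "seqE": "Echinodermata",
}

if __name__ == "__main__":
    example_clusters = parse_clstr(EXAMPLE_CLSTR)
    print(len(example_clusters), "clusters")
    print("cross-phyla clusters:", cross_phyla_clusters(example_clusters, EXAMPLE_PHYLA))
```

In this toy input, only cluster 0 contains sequences from more than one phylum. Applied to the real data, the same kind of filter would correspond to selecting the 64 cross-phyla clusters out of the 1051 reported above, given a mapping from each accession number to its phylum.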
This dataset includes information presented in Supplemental Table S1 from Drake et al. 2014:
Cross-phyla clustering of non-redundant carbonate organic matrix (COM) proteins from N-terminal and mass spectrometry COM sequencing, RT-PCR, or GO and KEGG annotation of genomic and transcriptomic data sets (from over 100 studies), grouped by hierarchical clustering using the CD-HIT suite web server (http://weizhong-lab.ucsd.edu/cd-hit/). Gene accession numbers are included. The 1531 proteins were reduced to 1051 clusters at 30% or greater similarity, although only 64 clusters showed sequence similarity across phyla. Studies published from the 1990s through June 2013 are included. Clusters with the same name have been combined. Note: Mass spectrometry sequences were excluded if the experimental data were compared with gene models from a different species.
* indicates that non-homologous proteins with similar function were also observed. See Table 2 in Drake et al. 2014.
** indicates a likely cellular contaminant due to location based on cellular component GO term.