As touched on previously, the genome is the entirety of the genetic material carried by an individual or species, and it varies accordingly. The collection of sequenced genomes from different species keeps growing and includes the human genome (sequenced by the Human Genome Project). For example, the human genome can be viewed chromosome by chromosome here: https://www.ncbi.nlm.nih.gov/genome/?term=homo+sapiens
Simple genomes, such as those of viruses, make it relatively straightforward to assign a protein to each gene in the genome and so build a database of them. The complete set of proteins encoded by a genome is known as a proteome.
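As a rough illustration of assigning a protein to each gene, the sketch below translates a few coding sequences with the standard genetic code and stores the results in a dictionary that plays the role of a tiny proteome database. The gene names and sequences in `toy_genome` are made up for illustration and do not come from any real virus.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_sequence):
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(coding_sequence) - 2, 3):
        amino_acid = CODON_TABLE[coding_sequence[i:i + 3]]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical two-gene "genome"; names and sequences are invented.
toy_genome = {
    "gene_A": "ATGGCCAAATAA",
    "gene_B": "ATGTGGTTTGGATAG",
}

# The toy proteome: one protein sequence per gene.
proteome = {gene: translate(seq) for gene, seq in toy_genome.items()}
print(proteome)
```

Real genomes need an extra step that this sketch skips: finding where each gene starts and ends before it can be translated.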
The information gleaned from a virus proteome, for example, can inform vaccination targets by selecting appropriate antigens such as elements of the viral capsid.
Other exciting synthetic biology applications can also be explored, such as glowing beer, or synthesising compounds useful in medicine or manufacturing in organisms to which the product is not native, in an attempt to boost production or create new products.
Analysing and storing information about more complex genomes is hindered by non-coding DNA and regulatory genes, which take up the vast majority of this type of genome. The sequences that actually code for protein products are therefore in the minority.
The proteomes corresponding to complex genomes, human included, are therefore difficult to build. Sequencing methods themselves have undergone, and continue to undergo, a rapid evolution towards faster, more efficient, automated techniques that can yield tremendous amounts of data.
For example, Sanger sequencing was long the main method of sequencing DNA and has given rise to many variations. The basic concept follows these steps:
1. Mix copies of your target DNA to be sequenced with normal nucleotides and a small proportion of labelled, chain-terminating nucleotides (with A, T, G or C bases)
2. Whenever a chain-terminating nucleotide is incorporated, it prevents further lengthening of that strand, resulting in a mixture of strands of different lengths, each complementary to part of the template DNA, e.g. the template AATGGC (written 3' to 5') gives T, TT, TTA, TTAC, TTACC and TTACCG
3. Run the DNA mixture on a gel to separate the different strands by size
4. Infer the sequence from the results: read the labelled final base of each strand (A, T, C or G) in order of strand size (smaller strands run further down the gel, while larger strands stay towards the top, where they were loaded)
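The logic of these steps can be sketched in a few lines of Python. This is a simplified model, not a simulation of the chemistry: it assumes the template is written 3' to 5', that every new strand starts at the same point and stops at a labelled terminating base, and that exactly one fragment of each length is produced. The function names are illustrative.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement, so position i pairs with strand[i]."""
    return "".join(PAIRS[base] for base in strand)


def chain_terminated_fragments(template):
    """Every partial copy of the new strand: same start, one stop per position."""
    new_strand = complement(template)
    return [new_strand[:n] for n in range(1, len(new_strand) + 1)]


def read_gel(fragments):
    """Order fragments shortest first (furthest down the gel) and read each final base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


fragments = chain_terminated_fragments("AATGGC")
new_strand = read_gel(fragments)

print(fragments)               # the mixture loaded on the gel
print(new_strand)              # sequence read from the bottom of the gel upwards
print(complement(new_strand))  # the inferred template
```

Reading only the final base of each fragment works because that base is the labelled terminator, and sorting by length stands in for the separation by size on the gel.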
At present, protein sequencing is not efficient enough to compete with simply sequencing the corresponding DNA and inferring the protein from it. The speed and cost of DNA sequencing have been regarded as following a trend similar to Moore's Law, the observation that the number of transistors on a chip doubles roughly every two years while the cost per transistor falls. So far a comparable trend has held for DNA sequencing, where it is known as the Carlson Curve.
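The arithmetic behind a halving trend of this kind is simple to sketch. The function below assumes a constant halving time, which is an idealisation, and the starting cost and halving time in the example are hypothetical numbers chosen only to make the arithmetic easy to follow, not measured sequencing costs.

```python
def projected_cost(initial_cost, years, halving_time_years):
    """Cost after `years` if the cost halves once every `halving_time_years`."""
    return initial_cost * 0.5 ** (years / halving_time_years)


# Hypothetical example: a cost of 1000 units that halves every 2 years.
for year in (0, 2, 4, 6):
    print(year, projected_cost(1000.0, year, 2))
```

Under these assumptions the cost falls by a factor of four every four years. Real cost curves are less regular than this model suggests.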