Ultimate Genome Data Size

Updated on 2022-05-24

Background

Genome data is large. A human genome takes about 3.3 GB in FASTA format. Hundreds of thousands of genomes of various organisms have already been determined. The GenomeSync database currently stores 478,512 genomes for 94,139 species. Combined, these genomes contain 7.43 Tbp of sequence data and would occupy 7.62 TB in FASTA format. In the compressed NAF format, they actually consume 1.62 TB on disk.

These genomes represent only a small fraction of known organisms. The NCBI taxonomy database currently includes 1,977,269 species. Therefore, only 4.76% of known species (94,139 of 1,977,269) have their genomes assembled.

Questions

Given that nearly two million known species live on our planet, how much space will be needed to store all their genomes? How much space would a smaller set require, such as one genome per genus, or per family? And how much space would be needed when storing a genome for each taxonomic leaf node (which can be a species, subspecies, strain, etc.)?

Knowing the answers to these questions will help in planning storage systems for future genome databases. It will help in assessing the suitability of data analysis tools for working with large-scale sequence data. Ultimately, it will help us prepare for storing and effectively using the massive genome data of the future.

Method and Results

We first took the sizes of all 478,512 genomes available in GenomeSync and propagated these sizes towards the root of the taxonomic tree, taking the average at each node. In this way we essentially produced a simplified reconstruction of ancestral genome sizes, using the taxonomy as a phylogenetic tree and ignoring branch lengths.
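This upward propagation can be sketched as a post-order traversal that assigns each node the mean of its available values. This is a minimal illustration, not the actual GenomeSync code; the tree representation (a `children` mapping and a `genome_size` dict, in Gbp) is an assumption.

```python
def propagate_up(children, genome_size, root):
    """Reconstruct ancestral genome sizes by averaging upwards.

    Each node's value is the mean of its own genome size (if sequenced)
    and its children's reconstructed values. Hypothetical helper, not
    the actual GenomeSync implementation.
    """
    reconstructed = {}

    def visit(node):
        values = []
        if node in genome_size:                  # node has a sequenced genome
            values.append(genome_size[node])
        for child in children.get(node, ()):
            v = visit(child)
            if v is not None:
                values.append(v)
        if values:
            reconstructed[node] = sum(values) / len(values)
            return reconstructed[node]
        return None                              # no genomes below this node

    visit(root)
    return reconstructed

# Toy taxonomy: root -> (A, B); A has a 3.3 Gbp genome, B has two species.
children = {"root": ["A", "B"], "B": ["B1", "B2"]}
sizes = {"A": 3.3, "B1": 1.0, "B2": 3.0}
rec = propagate_up(children, sizes, "root")
# B's reconstructed size is mean(1.0, 3.0) = 2.0 Gbp
```

A real taxonomy with millions of nodes would need an iterative traversal to avoid recursion limits, but the averaging logic stays the same.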

Next we propagated these reconstructed genome sizes to all branches of the taxonomy still missing genomes. After this step, each taxonomic node had an average genome size associated with it, either real or hypothetical.
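The downward fill-in step can be sketched as a top-down pass in which every node lacking a value inherits its nearest ancestor's average. Again a hedged sketch under the same assumed data structures, not the actual pipeline.

```python
def fill_missing(children, reconstructed, root):
    """Top-down pass: a node without a reconstructed size inherits the
    value of its nearest ancestor that has one (illustrative sketch)."""
    filled = dict(reconstructed)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in filled:
                filled[child] = filled[node]     # inherit ancestor's average
            stack.append(child)
    return filled

# Toy data: B2 has no genome and no reconstructed size yet.
children = {"root": ["A", "B"], "B": ["B1", "B2"]}
reconstructed = {"root": 2.65, "A": 3.3, "B": 2.0, "B1": 1.0}
filled = fill_missing(children, reconstructed, "root")
# B2 inherits B's average size of 2.0 Gbp
```

After this pass every taxonomic node carries a size, either real or hypothetical, which is exactly the state the next step requires.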

At this point we can sum the sizes for any selection of taxonomic nodes. We summed the sizes for all nodes having the rank of "species", to produce an estimate of the total size of a dataset containing one genome per species. We did the same for the nodes with the ranks of "genus" and "family". We also summed the sizes for all leaf nodes (of various ranks).
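The summation step amounts to filtering nodes by rank (or by being a leaf) and adding up their sizes. A minimal sketch, assuming a `rank_of` mapping; all names here are illustrative, not GenomeSync code.

```python
def total_size(sizes, rank_of, children, rank=None):
    """Sum genome sizes over all nodes of a given rank, or over all
    leaf nodes when rank is None (illustrative helper)."""
    if rank is None:
        # Leaf nodes: nodes with no children, whatever their rank.
        return sum(s for n, s in sizes.items() if not children.get(n))
    return sum(s for n, s in sizes.items() if rank_of.get(n) == rank)

# Toy data: every node now carries a real or hypothetical size (Gbp).
sizes = {"root": 2.65, "A": 3.3, "B": 2.0, "B1": 1.0, "B2": 3.0}
rank_of = {"A": "species", "B1": "species", "B2": "species", "B": "genus"}
children = {"root": ["A", "B"], "B": ["B1", "B2"]}

species_total = total_size(sizes, rank_of, children, rank="species")  # 7.3
leaf_total = total_size(sizes, rank_of, children)                     # 7.3
```

In this toy tree every leaf happens to be a species, so the two totals coincide; in the real taxonomy, leaf nodes also include subspecies and strains, which is why the "Leaf" estimate exceeds the "Species" one.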

We used the current overall compression ratio of 4.597 bp/byte (7,431 Gbp / 1,617 GB) to estimate the ultimate NAF-compressed size of all genomes.
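The conversion from sequence length to compressed size is a single division by the observed ratio. A sketch of that arithmetic (function name is an assumption):

```python
# Overall ratio observed on the current data: about 4.597 bp of sequence
# per byte of NAF-compressed output.
BP_PER_BYTE = 4.597

def naf_size_tb(length_tbp):
    """Estimated NAF-compressed size in TB for a sequence length in Tbp:
    1 Tbp of sequence maps to 1/4.597 TB of compressed data."""
    return length_tbp / BP_PER_BYTE

species_tb = naf_size_tb(1031.94)   # ~224.5 TB for one genome per species
```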

Coverage:

One genome per ...   Number of genomes   Ultimate genome length (Tbp)   Ultimate NAF size (TB)
Family                       9,804                 7.73                      1.68
Genus                      103,138                92.59                     20.14
Species                  1,977,269             1,031.94                    224.50
Leaf                     2,205,746             1,060.77                    230.77

In order to check the stability of the produced estimates, we reconstructed how the estimates changed over the last few years. To do that, we used the assembly dates of currently available genomes. We started with the set of genomes already assembled on 2005-01-01 and computed the ultimate genome size estimates based on just those genomes. We then moved forward in time, re-computing the estimates on each day on which any new genomes were assembled.
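The re-computation over time can be sketched as grouping genomes by assembly date and re-running the estimator on the growing set. The `estimator` callable stands in for the whole propagate-and-sum pipeline described above; the record layout is an assumption.

```python
from itertools import groupby
from operator import itemgetter

def estimate_history(genomes, estimator):
    """Re-run the ultimate-size estimator at each date on which new
    genomes were assembled (sketch, not the actual pipeline)."""
    genomes = sorted(genomes, key=itemgetter("date"))
    history, seen = [], []
    for day, batch in groupby(genomes, key=itemgetter("date")):
        seen.extend(batch)                  # genomes assembled so far
        history.append((day, estimator(seen)))
    return history

# Toy run: the "estimator" is just the total size of genomes seen so far.
toy = [{"date": "2005-01-01", "size": 3.3},
       {"date": "2006-06-01", "size": 1.0},
       {"date": "2006-06-01", "size": 3.0}]
hist = estimate_history(toy, lambda gs: sum(g["size"] for g in gs))
# Two entries: one per distinct assembly date.
```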

The following chart shows, on the same time scale, how many genomes (among current genome assemblies in GenomeSync) had been assembled by each date and were used for the estimates above.

As we can see, despite the more than 100-fold increase in the number of genomes, the ultimate genome data size estimates do not change much, and the amplitude of the fluctuations becomes smaller over time. This means that the estimates are unlikely to change much as additional genomes are sequenced in the future.

The estimate chart above is based on current-day NCBI taxonomy information. However, the NCBI taxonomy database is not static: it is continuously maintained, corrected, and expanded. The rate of its growth can be seen on this page: Taxonomy history charts at GenomeSync. There you can see that the number of species-rank nodes has more than doubled over the last several years.

Clearly, if the year 2015 taxonomy were used as the basis for calculation, the ultimate genome data size estimate would be much smaller than the one obtained using current taxonomy. Therefore, we also computed the history of estimates based on both the genomes and the taxonomy information available at each point in time. This chart is based on all publicly available monthly dumps of NCBI taxonomy, going back to 2014-08-01. Each estimate uses the subset of current-day genomes assembled prior to that date on the chart; since not all current genome names exist in historical taxonomy dumps, each estimate includes only those genomes whose taxa are present in the nearest subsequent taxonomy dump.

The next chart shows the number of genomes used for this calculation at each point in time, as well as the total number of genomes assembled by each date.

This chart shows that only a small fraction of current genomes is missing from the historical taxonomy dumps.

Discussion

The entire genomes of all known species will contain about 1,031.94 Tbp of nucleotide sequence, and will take about 224.50 TB when compressed in NAF format (based on currently available genomes and current taxonomy, as of 2022-05-24). This data would therefore fit on about a dozen 20 TB hard drives (224.50 / 20 ≈ 11.2 drives).

This estimate seems relatively stable with regard to additions of new genomes. However, the taxonomy database keeps growing, so the estimate will continue to increase as well. We should expect taxonomy growth to slow down at some point, but it is not clear whether we are close to that point.