protein-get-taxid-and-title

About

This script takes a list of protein accessions and retrieves their taxonomy IDs and sequence titles from the NCBI server. This can be useful, for example, after running a homology search against the nr database.

Getting this information is more complicated than it seems, because:

  1. Some proteins are not in the Protein database, but are instead found in the IPG database.
  2. The IPG database may return summaries with a different accession from the one being queried.
  3. Some proteins may be 'suppressed' and not found by the esearch tool. Such proteins can still be found with the efetch tool.
  4. Some proteins are found in neither the Protein nor the IPG database, but can still be found in the Sequences database.
  5. Sometimes we have to repeat a request several times.
  6. Querying accessions one by one takes too long, so we have to process them in batches.
  7. Each tool and database has its own format of input arguments and returned data.
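The batching mentioned above can be pictured with a minimal Python sketch. This is only an illustration, not the code of the script itself (which is written in Perl); the function name `batches` and the batch size of 200 are assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def batches(accessions: Iterable[str], size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` accessions, preserving order.

    The default size of 200 is an arbitrary value chosen for this sketch,
    not a value taken from the script.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for accession in accessions:
        batch.append(accession)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # The last batch may be shorter than `size`.
        yield batch
```

Each yielded batch would then be sent to the server as a single request instead of one request per accession.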

This script employs several strategies, in series:

  1. Using esearch + esummary tools with the "Protein" database. Failed accessions are searched using esearch + esummary with the "IPG" (Identical Protein Groups) database.
  2. Any accessions that were not found in step 1 are queried in the "Protein" database with the efetch tool.
  3. Any remaining unknown accessions are queried in the "Sequences" database with the efetch tool.

Each step is repeated up to 3 times before moving on to the next step, to account for possible connection failures.
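The overall control flow described above (try one strategy, retry it a few times, then pass whatever is still unresolved to the next strategy) can be sketched in Python as follows. This is a hedged illustration only: the script itself is written in Perl, and the names `resolve` and `Strategy`, the use of `OSError` to signal a connection failure, and the exact retry condition are all assumptions. The strategies are passed in as plain functions, so no real NCBI call is shown here.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (taxid, title)
# Hypothetical interface: a strategy receives the unresolved accessions
# and returns records for the ones it managed to find.
Strategy = Callable[[List[str]], Dict[str, Record]]


def resolve(
    accessions: Iterable[str],
    strategies: List[Strategy],
    max_attempts: int = 3,
) -> Tuple[Dict[str, Record], List[str]]:
    """Run each strategy up to `max_attempts` times on the unresolved accessions.

    Returns the found records and the accessions that no strategy resolved.
    """
    found: Dict[str, Record] = {}
    remaining = list(accessions)
    for strategy in strategies:
        for _ in range(max_attempts):
            if not remaining:
                break
            try:
                result = strategy(remaining)
            except OSError:
                # Assumed failure signal (e.g. a dropped connection): retry.
                continue
            # Keep only records for accessions we asked about; a database
            # may answer with accessions other than the queried ones.
            for accession in remaining:
                if accession in result:
                    found[accession] = result[accession]
            remaining = [a for a in remaining if a not in found]
    return found, remaining
```

With three strategy functions wrapping the three steps listed above, `resolve(accessions, [step1, step2, step3])` would return the resolved records together with the list of accessions that failed in every step.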

This procedure seems to work most of the time for our purposes. We used it to query tens of millions of accessions in a day or two.

This tool is shared in the hope that it will be useful, but without any guarantees. Future changes to the NCBI tools or databases may break this script. Please let us know if you notice any breakage or have other suggestions or requests.

Download

Usage

protein-get-taxid-and-title.pl [Options]

Options:

--in FILE - Read input from FILE (can be 'stdin' for piping) (Required).

--out FILE - Write output to FILE (can be 'stdout' for piping) (Required).

--failed FILE - Save failed accessions into FILE.

--include-failed - Include failed accessions in the main output (Default).

--exclude-failed - Exclude failed accessions from the main output.

Input format

Accessions, one per line.

Output format

Tab-separated format, with 3 fields in each line: Accession, Taxid, Title. Output lines are in the same order as the input. When the --include-failed option is in effect, failed accessions are included with empty Taxid and Title fields.
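For downstream processing, the output described above can be read with a few lines of Python. This is a minimal sketch under the stated format only; the names `Row` and `parse_output` are made up for the example, and treating an empty Taxid as the marker of a failed accession is an assumption based on the description above.

```python
from typing import Iterable, Iterator, NamedTuple


class Row(NamedTuple):
    accession: str
    taxid: str  # empty string for a failed accession
    title: str  # empty string for a failed accession


def parse_output(lines: Iterable[str]) -> Iterator[Row]:
    """Parse tab-separated Accession, Taxid, Title lines."""
    for line in lines:
        # Strip only the line terminator, so trailing empty fields survive.
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first two tabs only, in case a title contains a tab.
        fields = line.split("\t", 2)
        fields += [""] * (3 - len(fields))
        yield Row(*fields)
```

A caller could then separate failed requests with something like `[row.accession for row in parse_output(handle) if not row.taxid]`.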