UProC - tools for ultra-fast protein sequence classification

About UProC

With rapidly increasing volumes of biological sequence data, the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics. The ultrafast protein classification (UProC) toolbox implements a novel algorithm ("Mosaic Matching") for large-scale sequence analysis and is available as an open source C library. UProC is up to three orders of magnitude faster than profile-based methods and achieved up to 80% higher sensitivity on unassembled short reads (100 bp) from simulated metagenomes. UProC does not depend on a multiple alignment of family-specific sequences. Therefore, in addition to protein domain classification according to the Pfam database, UProC can, in principle, also detect KEGG Orthologs. We provide a precompiled database for KEGG Ortholog classification, which we applied to the prediction of functional repertoires from short reads (see below).

In the Downloads section below you will find links to the corresponding database files that we have precompiled for import into UProC.

References

  • Meinicke, Peter

    UProC: tools for ultra-fast protein domain classification

    Bioinformatics, 2014

Using UProC

  • Asshauer, K. P. and Wemheuer, B. and Daniel, R. and Meinicke, P.

    Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data

    Bioinformatics, 2015

  • Landesfeind, Manuel and Meinicke, Peter

    Predicting the functional repertoire of an organism from unassembled RNA–seq data

    BMC Genomics, 2014

Downloads

All downloads are also available from ftp://projects.gobics.de/uproc

UProC command-line tools and C library

Changelog

HTML README of the latest release

Version 1.2.0

Released 2015-03-09.

PGP Signatures

You can verify the authenticity of downloaded software packages by checking the corresponding .sig PGP signature for the following key(s):

For example, with GnuPG, you can obtain the key with

gpg --keyserver pgpkeys.mit.edu --recv-key 8E6F6473

To verify the authenticity of FILE, download FILE.sig to the same directory and run

gpg --verify FILE.sig

and make sure that the output says "Good signature":

gpg: Signature made Tue 08 Apr 2014 03:27:24 PM CEST using DSA key ID 8E6F6473
gpg: Good signature from "Robin Martinjak <robin@rmartinjak.de>"
# ...
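
When several packages and databases are downloaded at once, the check above can be scripted. The following is a minimal sketch, assuming that gpg exits with status 0 only when the signature is good and that each FILE.sig sits next to its FILE; the function name verify_downloads is hypothetical and not part of UProC.

```shell
# Hypothetical helper (not part of UProC): verify each given file against
# its detached signature FILE.sig, which must be in the same directory.
verify_downloads() {
    rc=0
    for file in "$@"; do
        # Assumption: gpg --verify exits non-zero if the signature is bad
        # or cannot be checked (e.g. the public key is missing).
        if gpg --verify "$file.sig" "$file" >/dev/null 2>&1; then
            echo "good signature: $file"
        else
            echo "verification FAILED: $file" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Called as verify_downloads followed by the names of the downloaded files, it reports each file and returns a non-zero status if any check fails. It does not replace reading the gpg output when a check fails.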

Databases

These databases need to be imported using uproc-import. Even though they are compressed with gzip, you don't have to decompress them manually, as uproc-import will take care of this. After importing you can delete the downloaded file if you wish.

If you have problems importing a database, verify that zlib (de-)compression is available by running uproc-import -V (capital v). If it says zlib: no, either decompress the database manually or install the zlib library and header files and recompile UProC.
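
The zlib check described above could be wrapped in a small helper. This is only a sketch: the helper name has_zlib_support is hypothetical, and the only thing it relies on is the "zlib: no" text mentioned above. The exact layout of the uproc-import -V output is an assumption.

```shell
# Hypothetical helper: reads the output of `uproc-import -V` on stdin and
# succeeds unless it reports "zlib: no" (spacing after the colon is assumed
# to be flexible).
has_zlib_support() {
    ! grep -Eq 'zlib:[[:space:]]*no'
}
```

A possible use is: uproc-import -V 2>&1 | has_zlib_support || echo "decompress the database manually or rebuild UProC with zlib"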

Note

To avoid severe performance problems, make sure you have enough main memory (RAM) to load the whole database. This is usually a bit more than twice the size of the downloaded file.
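
The rule of thumb above can be turned into a quick check before importing. The sketch below is an illustration, not a UProC tool: the function names are hypothetical, and the factor 2.2 is an assumption standing in for "a bit more than twice".

```shell
# Rough estimate of the RAM (in bytes) needed to load a database, based on
# the size of the downloaded file. The factor 2.2 is an assumption.
required_ram_bytes() {
    size=$(wc -c < "$1") || return 1
    echo $(( size * 22 / 10 ))
}

# Succeeds if the total RAM in bytes (second argument) covers the estimate
# for the downloaded file (first argument).
enough_ram_for() {
    needed=$(required_ram_bytes "$1") || return 1
    [ "$2" -ge "$needed" ]
}
```

On Linux the total RAM in bytes can be obtained with $(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) * 1024 )) and passed as the second argument. The result is only a rough guide, since other running programs also need memory.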

KEGG Orthologs

Built from the KEGG Orthologs database (release from March 2014, preprocessed with the SEG low complexity filter).

Pfam

Built from Pfam (preprocessed with SEG).

Models

Just extract the contained directory to an appropriate place.

Questions & Answers

How do I install UProC?
You can find the installation instructions in the README.rst file contained in the software packages or rendered as HTML here.
Does UProC require additional software or programs for installation?

Whether you need to install additional software or libraries depends on the operating system and the particular installation. The precompiled Windows binaries (see below) do not require any additional software. When compiling UProC from source on a Linux PC, the requirements depend on the environment: in a typical developer environment the configuration scripts should run without problems; otherwise, additional developer tools have to be installed, which should be easy on most common Linux distributions. On Ubuntu 12.04 LTS, for example, we have seen the following requirements (installation command in brackets):

  • gcc, make etc. (sudo apt-get install build-essential)
  • zlib header files (sudo apt-get install zlib1g-dev)
Does UProC run on a Windows PC or Laptop?

Yes, we successfully tested the following options and versions:

  1. Compilation within the Cygwin environment. This requires installing Cygwin and possibly several additional developer components within the environment. A drawback of this variant is that you also have to run the compiled UProC programs within the Cygwin environment. We therefore recommend the second option: the precompiled binaries for Windows (most probably the 64-bit version), which you can obtain from the UProC homepage.

  2. Using the precompiled binaries (see the Downloads section). We successfully tested the 64-bit binaries on several machines:

    • Windows 7 PC with 8 GB RAM and Intel quadcore processor
    • Windows 7 notebook with 8 GB RAM and Intel quadcore processor, running with imported Pfam24 database (~90% memory usage)
    • Windows 8.1 notebook with 8 GB RAM and Intel quadcore processor, running with imported Pfam24 database (~90% memory usage)
Does UProC run on Mac OS X?

Yes, but we faced problems with slow file access on hard drives, which severely degrades UProC performance. The only working solution we have found so far that provides sufficient speed on Mac OS X is to store the database on a solid state disk (SSD). We expect that a RAM disk might also work. You may test UProC with a conventional hard disk on OS X, but in our experience it can be extremely slow. In the following we sketch how we installed UProC on a MacBook with OS X Mavericks, 8 GB RAM and an SSD.

Install Homebrew and GCC

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew install gcc49

Download and extract the source code, then change to the directory containing the source, compile and install using the newly installed compiler

curl http://uproc.gobics.de/downloads/uproc/uproc-latest.tar.gz | tar xz
cd $(echo uproc-* | awk '{print $NF}')
export CC=gcc-4.9
./configure && make && sudo make install

We have also built two binary packages for Mac OS X, compiled without OpenMP, that you might try in order to simplify installation. Just click on a package file to install it (the binaries are then installed to /usr/local/bin). If you have an SSD, use the version with mmap; otherwise install the version without mmap. Afterwards you can use the UProC programs, for example uproc-import and uproc-dna (see the section on Windows binaries above).
When do I need the SEG program?
The SEG low complexity filtering program (part of the BLAST suite) is not needed if you import one of the precompiled databases available on the UProC homepage. However, SEG is highly recommended if you want to build your own protein database. In that case, you have to provide a multi-FASTA file of labeled protein sequences in which the protein family label, in general a numerical identifier, is placed in the FASTA comment line preceding the corresponding amino acid sequence. This multi-FASTA file should be processed with SEG to create a smaller and better database for UProC. You can run SEG with default parameters and should use the masking ('X') option, e.g.

seg my_db_protein_sequences.fasta -x > my_xmasked_db_protein_sequences.fasta
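
To get a feeling for how much of a database SEG has masked, one could count the masked residues in the filtered file. The following sketch is an illustration only: the function name masked_percent is hypothetical, and it assumes that masked positions appear as 'X' or 'x' and that lines starting with '>' are FASTA header lines. Unknown residues that were already 'X' in the input are counted as well.

```shell
# Hypothetical helper: print the percentage of residues that are masked
# ('X' or 'x') in a FASTA file, with one decimal place.
masked_percent() {
    awk '
        /^>/ { next }   # skip FASTA header lines
        {
            total += length($0)
            masked += gsub(/[Xx]/, "")   # gsub returns the number of replacements
        }
        END {
            if (total > 0) printf "%.1f\n", 100 * masked / total
            else print "0.0"
        }
    ' "$1"
}
```

For example, masked_percent my_xmasked_db_protein_sequences.fasta prints the masked share of the file produced by the seg command above.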
The latest preprocessed database file provided at the UProC website is based on Pfam 28. Does UProC support more recent Pfam versions?
The original implementation of UProC (version 1.2.0), as available on the website, limits the maximum size of the database. The Pfam database has grown over time and, unfortunately, more recent versions exceed this limit. We provide an unofficial UProC update, which may only run on some recent processor architectures. We had success with Pfam 36 on an Intel i9-13900 computer with 64 GB RAM. However, the same code did not run on some older systems that we tested. We are currently investigating this issue.

The unofficial UProC version together with a preprocessed Pfam 36 database file can be found here: UProC for large databases, Pfam36-UProC-DB (28 GB)

Runtimes

We tested UProC on a large file from the Human Microbiome Project (SRS017007) containing about 13 Gigabases of 100 bp short reads. We ran UProC in short read mode (uproc-dna -s) with multithreading enabled. If the number of logical cores (in brackets) differed from the number of physical cores, we passed the higher number to the -t option. Runtime was measured as total wall clock time including all I/O processing.

Because Pfam27 needs much more RAM than Pfam24, we could only use it on a subset of the available computers. We also tested the UProC binaries with Pfam24 on an 8 GB notebook running Windows 8.1, which was successful for smaller FASTA files but failed for the large HMP file above due to limited memory.

Computer                   | OS             | Processor                        | Cores | RAM   | Storage | Pfam24 | Pfam27
---------------------------+----------------+----------------------------------+-------+-------+---------+--------+-------
Desktop PC                 | Windows 7      | AMD Phenom II X6 1090T @ 3.2 GHz | 6     | 8 GB  | HDD     | 61m02s | N/A
MacBook Pro                | OS X Mavericks | Intel Core i5 @ 2.6 GHz          | 2 (4) | 16 GB | SSD     | 76m24s | 78m01s
DELL T1650 Precision       | Ubuntu 12.04   | Intel Xeon E3-1240 V2 @ 3.4 GHz  | 4 (8) | 32 GB | HDD     | 52m47s | 53m50s
DELL T1650 Precision       | Ubuntu 13.10   | Intel Core i5-3470 @ 3.2 GHz     | 4     | 32 GB | HDD     | 56m26s | 58m10s
TOSHIBA Satellite notebook | Windows 7      | Intel Core i7-3610QM @ 2.3 GHz   | 4     | 8 GB  | HDD     | 68m00s | N/A
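
The wall clock times above can be converted into an approximate throughput. The sketch below assumes times in the "52m47s" format used in the table and takes the data set size as an argument; the function names are hypothetical, and because the data set size is only given as "about 13 Gigabases", the results are rough figures.

```shell
# Convert a wall clock time such as "52m47s" to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ print $1 * 60 + $2 }'
}

# Bases processed per second (rounded down), given the total number of
# bases and a wall clock time in the format above.
bases_per_second() {
    secs=$(to_seconds "$2")
    awk -v b="$1" -v s="$secs" 'BEGIN { printf "%d\n", b / s }'
}
```

For the fastest Pfam24 run in the table, bases_per_second 13000000000 52m47s gives roughly 4.1 million bases per second, which corresponds to about 41,000 reads of 100 bp per second.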