With rapidly increasing volumes of biological sequence data, the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics. The ultrafast protein classification (UProC) toolbox implements a novel algorithm ("Mosaic Matching") for large-scale sequence analysis and is now available as an open source C library. UProC is up to three orders of magnitude faster than profile-based methods and achieved up to 80% higher sensitivity on unassembled short reads (100 bp) from simulated metagenomes. UProC does not depend on a multiple alignment of family-specific sequences. Therefore, in addition to protein domain classification according to the Pfam database, UProC can, in principle, also detect KEGG Orthologs. We provide a precompiled database for KEGG Ortholog classification, which we applied to the prediction of functional repertoires from short reads (see below).
In the Downloads section below you will find links to the corresponding database files that we have precompiled for import into UProC.
Meinicke, Peter
UProC: tools for ultra-fast protein domain classification
Bioinformatics, 2014
Asshauer, K. P. and Wemheuer, B. and Daniel, R. and Meinicke, P.
Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data
Bioinformatics, 2015
Landesfeind, Manuel and Meinicke, Peter
Predicting the functional repertoire of an organism from unassembled RNA–seq data
BMC Genomics, 2014
Asshauer, Kathrin P. and Klingenberg, Heiner and Lingner, Thomas and Meinicke, Peter
Exploring Neighborhoods in the Metagenome Universe
Lingner, Thomas and Meinicke, Peter
Characterizing metagenomic novelty with unexplained protein domain hits
All downloads are also available from ftp://projects.gobics.de/uproc
HTML README of the latest release
Released 2015-03-09.
You can verify the authenticity of downloaded software packages by checking the corresponding .sig PGP signature for the following key(s):
For example, with GnuPG, you can obtain the key with
gpg --keyserver pgpkeys.mit.edu --recv-key 8E6F6473
To verify the authenticity of FILE, download FILE.sig to the same directory, run
gpg --verify FILE.sig
and make sure that the output says "Good signature":
gpg: Signature made Tue 08 Apr 2014 03:27:24 PM CEST using DSA key ID 8E6F6473
gpg: Good signature from "Robin Martinjak <robin@rmartinjak.de>"
# ...
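When several packages have been downloaded, the check above can be repeated in a loop. The following is a minimal sketch, assuming each FILE sits next to its FILE.sig in one directory and the key has already been imported; `verify_all` is a hypothetical helper name, not part of UProC.

```shell
#!/bin/sh
# verify_all DIR: run "gpg --verify FILE.sig FILE" for every .sig file in DIR.
# Hypothetical helper for illustration; exits non-zero if any check fails.
verify_all() (
    dir=${1:-.}
    status=0
    for sig in "$dir"/*.sig; do
        [ -e "$sig" ] || continue        # directory contains no .sig files
        file=${sig%.sig}                 # FILE.sig -> FILE
        if gpg --verify "$sig" "$file" 2>/dev/null; then
            echo "OK   $file"
        else
            echo "FAIL $file"
            status=1
        fi
    done
    exit "$status"
)
```

Calling `verify_all .` in the download directory prints one OK or FAIL line per signed file. It only reports gpg's exit status, so it is still worth reading the full gpg output for any file that fails.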
These databases need to be imported using uproc-import. Even though they are compressed with gzip, you don't have to decompress them manually, as uproc-import will take care of this. After importing you can delete the downloaded file if you wish.
If you have problems importing a database, verify that zlib (de-)compression is available by running uproc-import -V (capital V). If it says zlib: no, either decompress the database manually or install the zlib library and header files and recompile UProC.
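The zlib check described above can be scripted. This is a minimal sketch that only looks for the string "zlib: no" in the version output; the helper name `zlib_enabled` and the wording of the messages are assumptions made for illustration.

```shell
#!/bin/sh
# zlib_enabled: read the text printed by "uproc-import -V" on stdin and
# succeed unless it contains "zlib: no". Hypothetical helper name.
zlib_enabled() {
    ! grep -q 'zlib: no'
}

# Only run the check if uproc-import is actually installed.
if command -v uproc-import >/dev/null 2>&1; then
    if uproc-import -V 2>&1 | zlib_enabled; then
        echo "zlib available: gzip-compressed databases can be imported directly"
    else
        echo "no zlib: decompress the database with gunzip first, or rebuild UProC with zlib"
    fi
fi
```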
Note
To avoid severe performance problems, make sure you have enough main memory (RAM) to load the whole database. This is usually a bit more than twice the size of the downloaded file.
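The rule of thumb in this note can be turned into a quick check before importing. The sketch below treats twice the download size as a lower bound and compares it with the installed RAM; the helper names are hypothetical, and the RAM lookup assumes Linux because it reads /proc/meminfo.

```shell
#!/bin/sh
# min_ram_bytes FILE: lower bound on required RAM, i.e. twice the size of
# the downloaded database file (the real requirement is "a bit more").
min_ram_bytes() {
    size=$(wc -c < "$1") || return 1
    echo $((size * 2))
}

# total_ram_bytes: installed RAM in bytes (Linux only, via /proc/meminfo).
total_ram_bytes() {
    awk '/^MemTotal:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo
}

# fits_in_ram FILE: succeeds if installed RAM exceeds the lower bound.
# Returns 2 if the answer cannot be determined on this system.
fits_in_ram() {
    [ -r /proc/meminfo ] || return 2
    need=$(min_ram_bytes "$1") || return 2
    have=$(total_ram_bytes)
    [ "$have" -gt "$need" ]
}
```

For example, `fits_in_ram DATABASE_FILE || echo "database may not fit into RAM"`, where DATABASE_FILE stands for the downloaded file. Because the estimate is only a lower bound, a result close to the limit should be treated as a warning.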
Built from the KEGG Orthologs database (release from March 2014, preprocessed with the SEG low complexity filter).
Built from Pfam (preprocessed with SEG).
Whether you need to install additional software or libraries depends on the operating system and the particular installation. The precompiled Windows binaries (see below) need no additional software. When compiling UProC from scratch on a Linux PC, the requirements depend on the particular environment. In a typical developer environment the configuration scripts should run without problems. Otherwise, additional developer tools have to be installed, which should be easy on most common Linux distributions. On Ubuntu 12.04 LTS, for example, we needed the following packages (the installation command is given in brackets):
- gcc, make etc. (sudo apt-get install build-essential)
- zlib header files (sudo apt-get install zlib1g-dev)
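Before running the configuration scripts, it can help to check whether these prerequisites are present. The following is a rough sketch for Ubuntu; `missing_tools` is a hypothetical helper, and the header path /usr/include/zlib.h is an assumption that matches where zlib1g-dev installs the header on Ubuntu.

```shell
#!/bin/sh
# missing_tools NAME...: print each named program that is not on PATH.
# Hypothetical helper, not part of UProC.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools gcc make)
if [ -n "$missing" ]; then
    echo "missing build tools:" $missing "(try: sudo apt-get install build-essential)"
fi

# Assumed header location on Ubuntu; other systems may differ.
[ -e /usr/include/zlib.h ] || echo "zlib.h not found (try: sudo apt-get install zlib1g-dev)"
```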
Yes, we successfully tested the following options and versions:
Compilation within the Cygwin environment requires installing Cygwin and possibly several additional developer components within that environment. A drawback of this variant is that you also have to run the compiled UProC programs within the Cygwin environment. We therefore recommend the second option, namely the precompiled binaries for Windows (most probably the 64-bit version), which you can obtain from the UProC homepage.
Using the precompiled binaries (see the Downloads section). We successfully tested the 64-bit binaries on several machines:
- Windows 7 PC with 8 GB RAM and Intel quad-core processor
- Windows 7 notebook with 8 GB RAM and Intel quad-core processor, running with imported Pfam24 database (~90% memory usage)
- Windows 8.1 notebook with 8 GB RAM and Intel quad-core processor, running with imported Pfam24 database (~90% memory usage)
Yes, but we ran into slow file access on hard drives, which severely degrades UProC performance. The only working solution we have found so far that provides sufficient speed on Mac OS X is to store the database on a solid state disk (SSD). We expect that a RAM disk might also work. You may test UProC with a conventional hard disk on OS X, but in our experience it can be extremely slow. In the following we sketch how we installed UProC on a MacBook with OS X Mavericks, 8 GB RAM and an SSD.
Install Homebrew and GCC
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew install gcc49
Download and extract the source code, then change to the directory containing the source, compile and install using the newly installed compiler
curl http://uproc.gobics.de/downloads/uproc/uproc-latest.tar.gz | tar xz
cd $(echo uproc-* | awk '{print $NF}')
export CC=gcc-4.9
./configure; make; sudo make install
We have also built two binary packages without OpenMP for Mac OS X, which you can try in order to simplify installation. Just click on a package file to install it (the binaries are installed to /usr/local/bin). If you have an SSD, use the version with mmap; otherwise install the version without mmap. Afterwards you can use the UProC programs, for example uproc-import and uproc-dna (see the section on Windows binaries above).
The unofficial UProC version together with a preprocessed Pfam 36 database file can be found here: UProC for large databases, Pfam36-UProC-DB (28 GB)
We tested UProC on a large file from the Human Microbiome Project (SRS017007) containing about 13 gigabases of 100 bp short reads. We ran UProC in short read mode (uproc-dna -s) with multithreading enabled. If the number of logical cores (given in brackets in the table) differed from the number of physical cores, we passed the higher number to the -t option. Runtime was measured as total wall clock time, including all I/O processing.
Because Pfam27 needs much more RAM than Pfam24, we could only use it on a subset of the available computers. We also tested the UProC binaries with Pfam24 on an 8 GB notebook running Windows 8.1, which was successful for smaller FASTA files but failed for the large HMP file above due to limited memory.
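A benchmark run of this kind could be set up roughly as follows. The sketch only prints the command line instead of executing it. The options -s and -t are taken from the description above, but the positional arguments (database directory, model directory, input file) are placeholders and an assumption about the argument order, so please check the README of your UProC release. The helper names are hypothetical.

```shell
#!/bin/sh
# logical_cores: number of logical cores, the value passed to -t above.
logical_cores() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# benchmark_cmd DBDIR MODELDIR READS: print (do not run) a timed command.
# ASSUMPTION: the order of the positional arguments is illustrative only.
benchmark_cmd() {
    echo "time uproc-dna -s -t $(logical_cores) $1 $2 $3"
}

benchmark_cmd DBDIR MODELDIR reads.fasta
```

The "real" value reported by `time` corresponds to the wall clock time used for the measurements, which includes I/O.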
Computer | OS | Processor | Cores (logical) | RAM | Storage | Runtime Pfam24 | Runtime Pfam27 |
---|---|---|---|---|---|---|---|
Desktop PC | Windows 7 | AMD Phenom II X6 1090T @ 3.2 GHz | 6 | 8 GB | HDD | 61m02s | N/A |
MacBook Pro | OS X Mavericks | Intel Core i5 @ 2.6 GHz | 2 (4) | 16 GB | SSD | 76m24s | 78m01s |
DELL T1650 Precision | Ubuntu 12.04 | Intel Xeon E3-1240 V2 @ 3.4 GHz | 4 (8) | 32 GB | HDD | 52m47s | 53m50s |
DELL T1650 Precision | Ubuntu 13.10 | Intel Core i5-3470 @ 3.2 GHz | 4 | 32 GB | HDD | 56m26s | 58m10s |
TOSHIBA Satellite notebook | Windows 7 | Intel Core i7-3610QM @ 2.3 GHz | 4 | 8 GB | HDD | 68m00s | N/A |
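To compare the machines, the runtimes in the table can be converted into an approximate throughput. The sketch below assumes the input size of about 13 gigabases stated above, so the results are estimates rather than measurements; the helper names are hypothetical.

```shell
#!/bin/sh
# to_seconds 52m47s: convert a runtime from the table into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ print $1 * 60 + $2 }'
}

# mbases_per_second RUNTIME: approximate throughput in megabases per second,
# assuming an input of about 13 gigabases.
mbases_per_second() {
    awk -v secs="$(to_seconds "$1")" 'BEGIN { printf "%.1f\n", 13e9 / secs / 1e6 }'
}

to_seconds 52m47s           # prints 3167
mbases_per_second 52m47s
```

For the fastest Pfam24 run (52m47s) this gives roughly 4 megabases per second, or on the order of 40,000 reads of 100 bp per second.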