Pygustus
A Python wrapper for the gene prediction program AUGUSTUS.
Requirements
Pygustus requires an installed or built AUGUSTUS, version 3.3.2 or later. On Ubuntu, AUGUSTUS can be installed as follows.
sudo apt install augustus augustus-data augustus-doc
More information can be found on the AUGUSTUS GitHub page.
To run Pygustus properly, the AUGUSTUS environment variable AUGUSTUS_CONFIG_PATH must be set correctly and point to the configuration directory.
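A quick way to verify this requirement before calling Pygustus is sketched below. The helper name check_augustus_config is hypothetical and not part of Pygustus; it only inspects the environment and does not validate the contents of the directory.

```python
import os


def check_augustus_config(env=None):
    """Return the AUGUSTUS config directory, or raise if it is unusable.

    Hypothetical helper, not part of Pygustus. `env` defaults to os.environ
    and can be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    path = env.get("AUGUSTUS_CONFIG_PATH")
    if not path:
        raise EnvironmentError("AUGUSTUS_CONFIG_PATH is not set")
    if not os.path.isdir(path):
        raise EnvironmentError(
            "AUGUSTUS_CONFIG_PATH is not a directory: {}".format(path))
    return path
```

Calling check_augustus_config() without arguments checks the environment of the current process.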
If AUGUSTUS was built from source but not installed (so the command augustus is not available), the path to the executable can be set as described in the configuration section.
Pygustus requires Python 3.6 or higher.
The following examples assume that Python 3 is the default on the executing system. To make sure that Python 3 is used, a virtual environment is recommended. A virtual environment can be created with venv.
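The following sketch shows how the interpreter version and the virtual environment can be checked from Python itself. Both function names are hypothetical helpers; the virtual environment check relies on sys.prefix differing from sys.base_prefix, which holds for environments created with venv.

```python
import sys


def python_is_supported(version_info=sys.version_info):
    """True if the interpreter is Python 3.6 or higher."""
    return tuple(version_info[:2]) >= (3, 6)


def in_virtual_environment():
    """True if the interpreter runs inside an environment created with venv."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)
```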
Installation
Pygustus is in alpha development status and is currently not recommended for production use. Pygustus can be installed from PyPI as follows.
pip install pygustus
Building Pygustus from source
:warning: Pygustus currently only works with AUGUSTUS built from git, not with AUGUSTUS from package managers or third parties. The crucial commit was made on Feb 28, 2023.
As an alternative to installing Pygustus from PyPI, Pygustus can also be built from source as follows. After cloning the repository from GitHub,
git clone git@github.com:Gaius-Augustus/pygustus.git
required dependencies need to be installed.
pip install numba property-manager pysais
# sometimes required: pip install --force-reinstall -v "numpy==1.23"
# sometimes required: pip install --force-reinstall -v "numexpr==2.6.2"
pip install -r requirements.txt
# can be skipped for just installing the package: pip install -r requirements-dev.txt
After that Pygustus can be built and installed as follows.
python setup.py sdist bdist_wheel
pip install dist/pygustus-<VERSION>.tar.gz
The tests are executed with pytest. Example usage:
pytest -m ghactions tests/
The test cases marked with ghactions are those that are not too expensive in terms of runtime.
Usage
Pygustus supports training and prediction with AUGUSTUS. The prediction can be executed either in a single thread or in parallel. In multithreaded execution, the input file is split into smaller pieces and AUGUSTUS is executed in parallel on the partial inputs. Finally, the partial results are joined together.
Parameter values for all Pygustus programs must be given as Python types (for example, UTR=True).
Training
To train AUGUSTUS, the etraining program was adopted in Pygustus. More information about the program can be found here. The usage in Pygustus is as follows.
from pygustus import etraining
etraining.train('path/to/trainfilename.gb', species='SPECIES')
The species to be trained must be present in the config folder of AUGUSTUS (see also AUGUSTUS_CONFIG_PATH). To create a new species, the Perl script new_species.pl from the script folder of AUGUSTUS can be used.
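Whether a species is already present can be checked before training. The sketch below assumes that the species parameter sets live in a species subdirectory of the configuration directory, one folder per species; the function names are hypothetical and not part of Pygustus.

```python
import os


def list_species(config_dir):
    """Return the sorted species folder names found under <config_dir>/species.

    Assumption: each species has its own folder below the 'species' directory.
    """
    species_root = os.path.join(config_dir, "species")
    if not os.path.isdir(species_root):
        return []
    return sorted(
        name for name in os.listdir(species_root)
        if os.path.isdir(os.path.join(species_root, name))
    )


def species_is_configured(species, config_dir):
    """True if a folder for `species` exists in the configuration directory."""
    return species in list_species(config_dir)
```

A typical call would be species_is_configured('SPECIES', os.environ['AUGUSTUS_CONFIG_PATH']).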
If the path to the etraining executable is to be specified temporarily, the Pygustus parameter path_to_bin='path/to/etraining' can be used.
Prediction
A prediction can be run in two ways: AUGUSTUS is executed on the input file as usual, or the input file is split and AUGUSTUS is run on the parts in parallel. For the second variant, the Pygustus parameter jobs=n must be set with n > 1.
Default (Single Thread)
If the prediction is executed with jobs=1 (default, may be omitted), AUGUSTUS is executed on the input file exactly as if it were started from the console. Usage example:
from pygustus import augustus
augustus.predict('path/to/input/file', species='human',
UTR=True, softmasking=False)
To redirect the output to a file, the AUGUSTUS parameters outfile and errfile can be used. The following example writes the prediction to out.gff and any errors that occurred to out.err:
augustus.predict('path/to/input/file', species='human',
                 UTR=True, softmasking=False,
                 outfile='out.gff', errfile='out.err')
If the path to the AUGUSTUS executable is to be specified temporarily, the Pygustus parameter path_to_bin='path/to/augustus' can be used.
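Once a prediction has been written to a file via outfile, the result can be inspected with plain Python. The following is a minimal sketch, assuming tab-separated GFF lines with nine columns and the feature type in the third column; the helper count_features is hypothetical and not part of Pygustus.

```python
from collections import Counter


def count_features(gff_lines):
    """Count feature types (third GFF column) in an iterable of GFF lines.

    Comment lines starting with '#' and blank lines are skipped, as are
    lines with fewer than nine tab-separated columns.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts
```

For a file written with outfile='out.gff', the function can be applied to the open file handle, for example count_features(open('out.gff')).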
Multithreaded
If the Pygustus parameter jobs=n is set with n > 1, the input file is split into several smaller files and AUGUSTUS is run in parallel on each file with the given parameters. After AUGUSTUS has been executed on all parts, the partial results are combined into the final result. If the parameter outfile is set, the result is saved in the file given there. Otherwise, the result is saved in the file augustus.gff (default). A usage example is shown below.
from pygustus import augustus
augustus.predict('path/to/input/file', [augustus_parameters],
[pygustus_parameters], jobs=n)
All parameters permitted for AUGUSTUS can be used as augustus_parameters. The following pygustus_parameters are additionally available.
Parameter | Default Value | Description |
---|---|---|
jobs (int) | 1 | If jobs=n with n > 1 is set, AUGUSTUS is executed in parallel on sequence segments or split input files using n jobs. After the execution of all jobs, the output files are merged. |
chunksize (int) | 2500000 | If chunksize=n with n > 0 is set and jobs > 1 , each AUGUSTUS instance is executed on sequence segments of the maximum size n . |
overlap (int) | 500000 | If overlap=n with n > 1 is set and jobs > 1 , each AUGUSTUS instance is executed on sequence segments of size chunksize and the segments overlap by n . |
partitionHints (bool) | False | If this option is set to True, a hints file is given and jobs > 1 , then the hints file is split into appropriate pieces for the respective AUGUSTUS jobs. |
minSplitSize (int) | 1000000 | The input FASTA file is split into files of at least minSplitSize=n base pairs. Set n=0 to split the input into single-sequence files. |
partitionLargeSequences (bool) | False | Parallelize large sequences by automatically setting the AUGUSTUS parameters predictionStart and predictionEnd based on the given values for chunksize and overlap . |
maxSeqSize (int) | 3500000 | The sequence length above which a sequence is partitioned. To turn on the partitioning, partitionLargeSequences=True must be set. |
debugOutputDir (string) | None | If the directory is specified, all generated files, i.e. the split of the input file and intermediate results, as well as the generated AUGUSTUS command lines are stored there. This option works only for the parallelization, i.e. when jobs > 1 is set. |
path_to_bin (string) | None | Sets the path to the desired executable version of AUGUSTUS when augustus.predict() is called or etraining when etraining.train() is called. The path is not saved for further executions. |
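
The interplay of chunksize and overlap can be illustrated with a small calculation. The sketch below is an assumption about how overlapping segments could be laid out (1-based inclusive coordinates, each segment starting chunksize - overlap after the previous one); it is not taken from the Pygustus source, and the function name segment_bounds is hypothetical.

```python
def segment_bounds(seq_len, chunksize=2500000, overlap=500000):
    """Return (start, end) pairs covering a sequence of length seq_len.

    Assumed layout: segments have at most `chunksize` bases and consecutive
    segments share `overlap` bases. Coordinates are 1-based and inclusive.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if chunksize <= overlap:
        raise ValueError("chunksize must be larger than overlap")
    step = chunksize - overlap
    bounds = []
    start = 1
    while True:
        end = min(start + chunksize - 1, seq_len)
        bounds.append((start, end))
        if end == seq_len:
            break
        start += step
    return bounds


# With the default values, a sequence of 6,000,000 bases yields three segments:
# [(1, 2500000), (2000001, 4500000), (4000001, 6000000)]
print(segment_bounds(6000000))
```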
To redirect the output to a file, the AUGUSTUS parameters outfile and errfile can be used as for the default case.
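A concrete parallel call could be assembled as sketched below. The two helper functions are hypothetical; only augustus.predict and the parameter names jobs, outfile, species and softmasking are taken from this document. Pygustus is imported inside the second function so that the option check can be used without Pygustus being installed.

```python
def build_parallel_options(jobs, outfile="augustus.gff", **augustus_parameters):
    """Collect the keyword arguments for a parallel prediction.

    Raises ValueError unless jobs is an integer greater than 1.
    """
    if not isinstance(jobs, int) or jobs < 2:
        raise ValueError("parallel execution needs jobs=n with n > 1")
    options = dict(augustus_parameters)
    options["jobs"] = jobs
    options["outfile"] = outfile
    return options


def run_parallel_prediction(input_file, **options):
    """Run augustus.predict with the given options (requires Pygustus)."""
    from pygustus import augustus  # imported lazily, see lead-in
    augustus.predict(input_file, **options)
```

Usage: run_parallel_prediction('path/to/input/file', **build_parallel_options(4, species='human', softmasking=False)).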
Configuration
The paths to the augustus and etraining binaries can be configured. The configured path is only used if the Pygustus parameter path_to_bin is not specified. The configuration is saved until the next change. It works identically for pygustus.etraining and pygustus.augustus, so the following examples are restricted to pygustus.augustus.
Read the configured path
To get the currently configured path to the executable of AUGUSTUS you can proceed as follows.
from pygustus import augustus
augustus.config_get_bin()
Update the path to the binary
To update the currently configured path to the executable of AUGUSTUS you can proceed as follows.
augustus.config_set_bin('path/to/augustus')
Set the default binary
To set the default binary you can proceed as follows.
augustus.config_set_default_bin()
This method sets the configured path to the AUGUSTUS executable to augustus. This command should exist if AUGUSTUS is properly installed on the system.
As mentioned earlier, the configured path can be overridden by specifying the Pygustus parameter path_to_bin for the current prediction with augustus or the current training with etraining.
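The precedence described in this section can be summarized in a few lines. The sketch below is not the Pygustus implementation; resolve_binary is a hypothetical helper that mirrors the rule "path_to_bin, if given, wins over the configured path" and additionally reports whether the chosen executable can be found.

```python
import shutil


def resolve_binary(path_to_bin=None, configured="augustus"):
    """Return (candidate, found) for the executable that would be used.

    `path_to_bin` stands for the per-call parameter, `configured` for the
    saved configuration value. `found` is True if shutil.which can locate
    an executable for the candidate.
    """
    candidate = path_to_bin if path_to_bin is not None else configured
    found = shutil.which(candidate) is not None
    return candidate, found
```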
Help
To have easy access to the AUGUSTUS and Pygustus help system, the following methods are available.
Method | Description |
---|---|
help() | Shows usage information about the Pygustus wrapper and its parameters. |
show_aug_help() | Shows the help output of AUGUSTUS, equivalent to the AUGUSTUS call with the parameter --help . |
show_aug_paramlist() | Shows all possible parameter names of AUGUSTUS, equivalent to the AUGUSTUS call with the parameter --paramlist . |
show_species_info() | Shows species information of AUGUSTUS, equivalent to the AUGUSTUS call with the parameter --species=help . |
Usage example
from pygustus import augustus
augustus.help()
Examples
The use of Pygustus is also demonstrated with an executable Python script and a Jupyter notebook. The script assumes that Pygustus and AUGUSTUS are installed as described above.
Executable example
The following command lines install required dependencies, download the script and execute it. The script creates a folder structure in the working directory and downloads required data. After that, different AUGUSTUS prediction examples are executed with Pygustus.
pip install wget
wget https://raw.githubusercontent.com/Gaius-Augustus/pygustus/main/examples/aug_run_examples.py
chmod +x aug_run_examples.py
./aug_run_examples.py
After execution, the debug folder contains the generated AUGUSTUS command lines as well as the split input files for parallel execution.
Jupyter notebook
What the output of the examples should look like can also be seen in a Jupyter notebook.