Difference between revisions of "Alphafold"

From HPC Guide
Latest revision as of 14:01, 28 October 2024

Alphafold

AlphaFold is an artificial intelligence (AI) program developed by DeepMind (part of Alphabet/Google) that predicts protein structures.

Databases

The necessary databases are mounted on nodes with GPUs and are located at `/alphafold_storage/alphafold_db`.

Usage

To run AlphaFold, use the `run_alphafold.sh` script located at `/powerapps/share/centos7/alphafold/alphafold-2.3.1/run_alphafold.sh`.

Required Parameters:
  • `-d <data_dir>`: Path to the directory of supporting data.
  • `-o <output_dir>`: Path to a directory that will store the results.
  • `-f <fasta_paths>`: Path to one or more FASTA files. A file containing multiple sequences is folded as a multimer. To fold several files one after another, separate their paths with commas.
  • `-t <max_template_date>`: Maximum template release date to consider (ISO-8601 format, i.e., YYYY-MM-DD). This matters when folding historical test sets.
Optional Parameters:
  • `-g <use_gpu>`: Enable NVIDIA runtime to run with GPUs (default: true).
  • `-r <run_relax>`: Whether to run the final relaxation step on the predicted models (default: true).
  • `-e <enable_gpu_relax>`: Run relax on GPU if GPU is enabled (default: true).
  • `-n <openmm_threads>`: OpenMM threads (default: all available cores).
  • `-a <gpu_devices>`: Comma-separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0).
  • `-m <model_preset>`: Choose preset model configuration: 'monomer', 'monomer_casp14', 'monomer_ptm', or 'multimer' (default: 'monomer').
  • `-c <db_preset>`: Choose preset MSA database configuration ('reduced_dbs' or 'full_dbs', default: 'full_dbs').
  • `-p <use_precomputed_msas>`: Whether to read MSAs written to disk (default: 'false').
  • `-l <num_multimer_predictions_per_model>`: Number of predictions per model when using `model_preset=multimer` (default: 5).
  • `-b <benchmark>`: Run multiple JAX model evaluations to obtain a timing that excludes compilation time (default: 'false').
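
Because `-f` expects several FASTA files as one comma-separated value, a small helper can build that value. The sketch below is a minimal illustration only: `join_fasta_paths` is a hypothetical function that is not part of `run_alphafold.sh`, and the file names are placeholders.

```shell
# Hypothetical helper (not part of run_alphafold.sh): join FASTA paths with
# commas so they can be passed as a single -f argument.
join_fasta_paths() {
    joined=""
    for path in "$@"; do
        if [ -z "$joined" ]; then
            joined="$path"
        else
            joined="$joined,$path"
        fi
    done
    printf '%s\n' "$joined"
}

# Example with placeholder file names.
join_fasta_paths first.fasta second.fasta
# prints: first.fasta,second.fasta
```

The result could then be passed to the run script as `-f "$(join_fasta_paths first.fasta second.fasta)"`.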

Example Slurm Script

This script shows how to submit an AlphaFold job with Slurm:

#!/bin/bash
#SBATCH --job-name=AlphaFold-Multimer     # Job name
#SBATCH --partition=gpu2                  # Specify GPU partition
#SBATCH --nodes=1                         # Number of nodes
#SBATCH --ntasks=1                        # Number of tasks (processes)
#SBATCH --cpus-per-task=4                 # Number of CPU cores per task
#SBATCH --mem=32G                         # Request 32 GB of RAM
#SBATCH --gres=gpu:1                      # Request 1 GPU
#SBATCH --output=alphafold_%j.out         # Standard output (with job ID)
#SBATCH --error=alphafold_%j.err          # Standard error (with job ID)

# Description: AlphaFold-Multimer (Non-Docker); Slurm allocates the GPU requested with --gres

# Load the required module/environment
module load alphafold/alphafold_non_docker_2.3.1

# Run the AlphaFold script (variables are quoted in case a path contains spaces)
bash "$ALPHAFOLD_SCRIPT_PATH/run_alphafold.sh" \
    -d "$ALPHAFOLD_DB_PATH" \
    -o ~/output_dir \
    -f "$ALPHAFOLD_SCRIPT_PATH/examples/query.fasta" \
    -t "$(date +%Y-%m-%d)"
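
The script above relies on `ALPHAFOLD_SCRIPT_PATH` and `ALPHAFOLD_DB_PATH` being defined, presumably by the loaded module. As a sketch only, a pre-flight check such as the following could make the job stop early with a clear message if they are missing. `require_vars` is a hypothetical helper, not something the module provides.

```shell
# Hypothetical pre-flight check (not provided by the AlphaFold module):
# fail early if any named environment variable is unset or empty.
require_vars() {
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set; was the module loaded?" >&2
            return 1
        fi
    done
}

# Demonstration with a placeholder variable.
EXAMPLE_VAR="set"
require_vars EXAMPLE_VAR && echo "EXAMPLE_VAR is set"
```

In a job script, it might be called right after `module load` as `require_vars ALPHAFOLD_SCRIPT_PATH ALPHAFOLD_DB_PATH || exit 1`.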

Important Notes

  • Output Directory: The `-o` parameter sets the directory where results are stored. Any directory you can write to will work.
  • Template Date: The `-t` (max_template_date) parameter restricts the template search to structures released on or before the given date, in `YYYY-MM-DD` format. This matters when working with historical test sets. Use the current date with `-t $(date +%Y-%m-%d)`, or a specific historical date such as `-t 2021-12-31`.
  • Memory Requirements: For monomer jobs, at least 32 GB of RAM is recommended. For multimer jobs, allocate at least 64 GB; for large or complex structures, consider 128 GB or more to ensure stability.
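
The notes above can be turned into simple checks before submitting a job. The following is a minimal sketch with hypothetical helper names. `is_iso_date` checks only the `YYYY-MM-DD` shape, not whether the date exists, and mapping all three monomer presets to 32 GB is an assumption based on the monomer recommendation.

```shell
# Hypothetical helpers illustrating the notes above.

# Succeeds only if the argument has the YYYY-MM-DD shape (format check only).
is_iso_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the minimum recommended RAM in GB for a model preset.
# Assumption: every monomer preset follows the 32 GB monomer recommendation.
recommended_mem_gb() {
    case "$1" in
        multimer) echo 64 ;;
        monomer|monomer_casp14|monomer_ptm) echo 32 ;;
        *) echo "unknown preset: $1" >&2; return 1 ;;
    esac
}

is_iso_date "2021-12-31" && echo "valid template date"
echo "multimer needs at least $(recommended_mem_gb multimer) GB"
```

The memory value could feed the `--mem` line of the Slurm script, for example `--mem=64G` for a multimer job.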

Additional Resources

  • You can download the `dummy_test` folder for sample output from the alphafold_non_docker GitHub repository: https://github.com/kalininalab/alphafold_non_docker
  • For sample data, you can use `/home/alphafold_folder/alphafold_multimer_non_docker/example/query.fasta` or provide your own data for queries.