Repository for generating encodings of amino acid mutations.
ProtEncode provides pipelines for encoding protein mutations using multiple schemes:
- Sequence preparation → processes mutation data and UniProt sequences.
- Sample preparation → generates sample-level encoding matrices (binary, multi-mutation, ESM-based).
- Embeddings generation → coming soon.
The package installs a CLI tool protencode for running the pipelines end-to-end.
git clone https://github.com/BIMSBbioinfo/protEncode.git
cd protEncode
# Create environment
mamba env create -f environment.yml
conda activate protencode-env
This installs all dependencies (Python, PyTorch, transformers, etc.) and ProtEncode itself.
git clone https://github.com/BIMSBbioinfo/protEncode.git
cd protEncode
pip install -e .
This installs ProtEncode in editable mode, so changes to the source code take effect immediately.
The package installs the command-line tool protencode. Run:
protencode --help
You should see the available pipelines:
usage: protencode [-h] {embeddings,sample,sequence} ...
ProtEncode: encode protein mutations with different embedding schemes.
Available pipelines:
• sequence Prepare sequences from MAF files and UniProt
• sample Generate sample-level encoding matrices (binary, multi-mutation, ESM)
• embeddings (coming soon) Generate embeddings for sequences
Processes mutation data (MAF/CSV) and UniProt sequences into finalised mutated sequences.
protencode sequence --data ./data --output ./output --organism 9606 --email YOUR_EMAIL --min-length 200 --update
Arguments:
- --data (required) → directory with .maf or .csv mutation files.
- --output (required) → output directory.
- --organism → NCBI taxonomy ID (default: 9606 = human).
- --email → contact email for UniProt downloads.
- --min-length → minimum gene length filter (default: 200).
- --update → force UniProt FASTA re-download.
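To make the idea of a "mutated sequence" concrete, here is a minimal Python sketch of applying a single amino acid substitution to a protein sequence. It is not ProtEncode's implementation: the function names, the `p.V600E`-style notation, and the reading of `--min-length` as an inclusive threshold on sequence length are all assumptions made for illustration.

```python
import re

# Assumed notation: a short protein change such as "p.V600E"
# (reference residue, 1-based position, alternate residue).
_SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def apply_substitution(sequence: str, protein_change: str) -> str:
    """Return `sequence` with one amino acid substitution applied (hypothetical helper)."""
    match = _SUBSTITUTION.match(protein_change)
    if match is None:
        raise ValueError(f"unsupported protein change: {protein_change!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    index = pos - 1  # protein positions are 1-based
    if not 0 <= index < len(sequence) or sequence[index] != ref:
        raise ValueError(f"reference residue mismatch at position {pos}")
    return sequence[:index] + alt + sequence[index + 1:]


def long_enough(sequence: str, min_length: int = 200) -> bool:
    """Assumed reading of --min-length: keep sequences with at least this many residues."""
    return len(sequence) >= min_length


# Toy example: replace the lysine at position 2 with arginine.
mutated = apply_substitution("MKTAYIAKQR", "p.K2R")
```

Checking the reference residue before substituting catches mismatches between the mutation file and the downloaded sequence, for example when a different isoform was used for annotation.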
Generates encoding matrices at the sample level.
protencode sample --output ./output
By default, all matrices are generated.
You can restrict output with flags:
- Only binary matrix:
  protencode sample --output ./output --binary
- Only multi-mutation matrix:
  protencode sample --output ./output --multi
- Only ESM matrix (top-20 embeddings):
  protencode sample --output ./output --esm --top-n 20
Arguments:
- --output (required) → directory with outputs from sequence preparation.
- --top-n → number of top embeddings to use for ESM (default: 10).
- --binary → generate binary matrix.
- --multi → generate multi-mutation matrix.
- --esm → generate ESM attention matrix.
- If no flags are given, all three are generated.
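If you want to drive the sample pipeline from a script rather than typing the command, one option is to assemble the argument list in Python and pass it to `subprocess`. The sketch below uses only the flags documented above; the helper names `build_sample_command` and `run_sample` are hypothetical and not part of ProtEncode.

```python
import subprocess
from typing import List, Optional


def build_sample_command(
    output_dir: str,
    binary: bool = False,
    multi: bool = False,
    esm: bool = False,
    top_n: Optional[int] = None,
) -> List[str]:
    """Assemble a `protencode sample` command line from the documented flags."""
    cmd = ["protencode", "sample", "--output", output_dir]
    if binary:
        cmd.append("--binary")
    if multi:
        cmd.append("--multi")
    if esm:
        cmd.append("--esm")
    if top_n is not None:
        cmd += ["--top-n", str(top_n)]
    return cmd


def run_sample(output_dir: str, **flags) -> None:
    """Run the command; requires `protencode` to be installed and on PATH."""
    subprocess.run(build_sample_command(output_dir, **flags), check=True)


# Example: run_sample("./output", esm=True, top_n=20)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces, and `check=True` raises an exception if the pipeline exits with an error.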
Placeholder for embeddings pipeline:
protencode embeddings
- Sequence preparation
  Produces mutated sequences, UniProt data, logs, and sample-to-sequence mappings in the specified --output directory.
- Sample preparation
  Produces encoding matrices saved in the output directory:
  - binary_matrix.*
  - multi_matrix.*
  - esm_topN_matrix.*
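Since the matrix file extensions are not fixed here, a small helper can locate whatever the sample pipeline wrote by matching the file name stems. This is a sketch under assumptions: `find_matrices` is a hypothetical helper, and the pattern `esm_top*_matrix.*` assumes that the `N` in `esm_topN_matrix` is replaced by the `--top-n` value.

```python
from pathlib import Path
from typing import Dict, List

# Name patterns taken from the output description; the ESM pattern is an assumption.
MATRIX_PATTERNS = {
    "binary": "binary_matrix.*",
    "multi": "multi_matrix.*",
    "esm": "esm_top*_matrix.*",
}


def find_matrices(output_dir: str) -> Dict[str, List[Path]]:
    """Return the matrix files found in `output_dir`, grouped by matrix type."""
    root = Path(output_dir)
    return {
        name: sorted(root.glob(pattern))
        for name, pattern in MATRIX_PATTERNS.items()
    }


# Example: list which matrices exist after a run.
# for kind, paths in find_matrices("./output").items():
#     print(kind, [p.name for p in paths])
```

An empty list for a given type simply means that matrix was not generated, for example because the run was restricted with --binary, --multi, or --esm.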
- Command not found → ensure your conda env is activated or pip install ran successfully.
- Missing module errors → make sure __init__.py files exist in each subpackage, then reinstall with: pip install -e .
- Torch/CUDA errors → check your GPU setup or install the CPU version of PyTorch via pip.
- UniProt download fails → ensure you provide a valid email with --email.
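Several of the checks above can be done in one step from Python. The following is a minimal diagnostic sketch, not something shipped with ProtEncode; `check_environment` and `cuda_available` are hypothetical helper names. It reports whether the `protencode` command is on PATH and whether PyTorch and transformers can be found in the active environment.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether the CLI and its main dependencies are visible."""
    return {
        "protencode_cli": shutil.which("protencode") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
    }


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable GPU."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check also works without PyTorch

    return torch.cuda.is_available()


for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

If `protencode_cli` is reported as missing while `torch` is found, the environment is probably active but the package itself was not installed; if everything is missing, the environment is likely not activated.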
MIT License.