Protein Encoding

Repository for generating encodings of amino acid mutations.

ProtEncode provides pipelines for encoding protein mutations using multiple schemes:

Sequence preparation → processes mutation data and UniProt sequences.
Sample preparation → generates sample-level encoding matrices (binary, multi-mutation, ESM-based).
Embeddings generation → coming soon.

The package installs a CLI tool protencode for running the pipelines end-to-end.

🚀 Installation

Option 1: Using `conda`/`mamba` (recommended)

git clone https://github.com/BIMSBbioinfo/protEncode.git
cd protEncode

# Create environment
mamba env create -f environment.yml
conda activate protencode-env

This installs all dependencies (Python, PyTorch, transformers, etc.) and ProtEncode itself.

Option 2: Using `pip`

git clone https://github.com/BIMSBbioinfo/protEncode.git
cd protEncode

pip install -e .

This installs ProtEncode in editable mode, so changes to the source code are immediately reflected.

🛠️ Usage

The package installs the command-line tool protencode. Run:

protencode --help

You should see available pipelines:

usage: protencode [-h] {embeddings,sample,sequence} ...

ProtEncode: encode protein mutations with different embedding schemes.

Available pipelines:
  • sequence   Prepare sequences from MAF files and UniProt
  • sample     Generate sample-level encoding matrices (binary, multi-mutation, ESM)
  • embeddings (coming soon) Generate embeddings for sequences

🔬 Pipelines

1️⃣ Sequence preparation

Processes mutation data (MAF/CSV) and UniProt sequences into finalised mutated sequences.

protencode sequence \
    --data ./data \
    --output ./output \
    --organism 9606 \
    --email you@example.org \
    --min-length 200 \
    --update

Arguments:

--data (required) → directory with .maf or .csv mutation files.
--output (required) → output directory.
--organism → NCBI taxonomy ID (default: 9606 = human).
--email → contact email for UniProt downloads.
--min-length → minimum gene length filter (default: 200).
--update → force UniProt FASTA re-download.
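To make "mutated sequences" concrete, the following is a minimal sketch of applying a single amino acid substitution to a reference sequence. It is not ProtEncode's implementation: the function name `apply_substitution` is hypothetical, and it assumes mutations are written in the short protein notation commonly found in MAF files (for example `p.V600E`). Other mutation types (insertions, deletions, frameshifts) are not handled here.

```python
import re

# Short protein substitution notation such as "p.V600E" or "V600E".
# Assumption: this is the notation used in the input mutation files.
_SUBSTITUTION = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z])$")


def apply_substitution(sequence: str, mutation: str) -> str:
    """Return `sequence` with one amino acid substitution applied."""
    match = _SUBSTITUTION.match(mutation)
    if match is None:
        raise ValueError(f"unsupported mutation notation: {mutation!r}")

    ref, pos_text, alt = match.groups()
    index = int(pos_text) - 1  # protein positions are 1-based

    if not 0 <= index < len(sequence):
        raise ValueError(f"position {pos_text} is outside the sequence")
    if sequence[index] != ref:
        # The reference residue in the mutation must match the sequence;
        # a mismatch usually means a different isoform or annotation version.
        raise ValueError(
            f"reference mismatch at {pos_text}: "
            f"expected {ref}, found {sequence[index]}"
        )

    return sequence[:index] + alt + sequence[index + 1:]


mutated = apply_substitution("MKVLA", "p.V3E")
```

Checking the reference residue before substituting is a cheap guard against pairing a mutation with the wrong UniProt sequence.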

2️⃣ Sample preparation

Generates encoding matrices at the sample level.

protencode sample --output ./output

By default, all matrices are generated.
You can restrict output with flags:

Only binary matrix:

protencode sample --output ./output --binary

Only multi-mutation matrix:

protencode sample --output ./output --multi

Only ESM matrix (top-20 embeddings):

protencode sample --output ./output --esm --top-n 20

Arguments:

--output (required) → directory with outputs from sequence preparation.
--top-n → number of top embeddings to use for ESM (default: 10).
--binary → generate binary matrix.
--multi → generate multi-mutation matrix.
--esm → generate ESM attention matrix.
If no flags are given, all three are generated.
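As a rough illustration of how a binary matrix differs from a multi-mutation matrix, the sketch below builds both from a toy list of mutations. This is an assumption about what the two matrices contain, not a description of ProtEncode's output: here "binary" means 1 if a sample has any mutation in a gene, and "multi-mutation" is interpreted as the number of mutations per sample and gene. The input layout and the function name `build_matrices` are hypothetical.

```python
from collections import Counter

# Hypothetical input: one (sample_id, gene) pair per observed mutation.
mutations = [
    ("S1", "TP53"),
    ("S1", "TP53"),
    ("S1", "KRAS"),
    ("S2", "KRAS"),
]


def build_matrices(records):
    """Return (samples, genes, binary_matrix, count_matrix) as nested lists."""
    samples = sorted({sample for sample, _ in records})
    genes = sorted({gene for _, gene in records})
    counts = Counter(records)  # missing (sample, gene) pairs count as 0

    # Rows are samples, columns are genes.
    count_matrix = [[counts[(s, g)] for g in genes] for s in samples]
    binary_matrix = [[1 if value > 0 else 0 for value in row] for row in count_matrix]
    return samples, genes, binary_matrix, count_matrix


samples, genes, binary_matrix, count_matrix = build_matrices(mutations)
```

In this toy example, sample `S1` carries two `TP53` mutations, so its `TP53` entry is 1 in the binary matrix and 2 in the count matrix.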

3️⃣ Embeddings generation (coming soon)

Placeholder for embeddings pipeline:

protencode embeddings

📂 Output

Sequence preparation
Produces mutated sequences, UniProt data, logs, and sample-to-sequence mappings in the specified --output directory.
Sample preparation
Produces encoding matrices saved in the output directory:
- binary_matrix.*
- multi_matrix.*
- esm_topN_matrix.*
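A small helper for locating these files after a run might look like the sketch below. The file extensions are not specified above, so the patterns match any extension; it also assumes that `N` in `esm_topN_matrix` is replaced by the `--top-n` value. The function name `find_matrices` is hypothetical.

```python
from pathlib import Path

# Glob patterns derived from the file names listed above.
# Assumption: "topN" becomes e.g. "top20" when --top-n 20 is used.
PATTERNS = {
    "binary": "binary_matrix.*",
    "multi": "multi_matrix.*",
    "esm": "esm_top*_matrix.*",
}


def find_matrices(output_dir):
    """Map each matrix type to the sorted list of matching files."""
    root = Path(output_dir)
    return {name: sorted(root.glob(pattern)) for name, pattern in PATTERNS.items()}
```

For example, `find_matrices("./output")["esm"]` returns an empty list if the ESM matrix was not generated.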

🐛 Troubleshooting

Command not found → ensure your conda env is activated or pip install ran successfully.
Missing module errors → make sure __init__.py files exist in each subpackage, then reinstall with `pip install -e .`
Torch/CUDA errors → check your GPU setup or install the CPU version of PyTorch via pip.
UniProt download fails → ensure you provide a valid email with --email.
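The first three checks can be run from Python in one go. The sketch below is only a diagnostic aid under simple assumptions: it looks for the `protencode` executable on `PATH`, checks whether `torch` is importable, and asks PyTorch whether CUDA is available. The function name `diagnose` is hypothetical.

```python
import importlib.util
import shutil


def diagnose():
    """Collect basic facts about the current environment."""
    report = {
        "protencode_on_path": shutil.which("protencode") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install can fail at import time;
            # this sketch simply reports that as "no CUDA".
            pass
    return report


report = diagnose()
```

If `protencode_on_path` is false, the environment is probably not activated; if `cuda_available` is false on a GPU machine, the installed PyTorch build may be CPU-only.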

📜 License

MIT License.
