Scripts to apply Fbeta-score to classify predicted probabilities from txt2onto models into binary classes. #1

Damonlin11 · 2025-06-03T16:20:50Z

What

~~1. A new script to apply MCC-F1 method to classify the txt2onto predicted probability into binary classes.~~

The MCC-F1 method has been abandoned, but its scripts are kept in case they are needed in the future.
We switched to the F-beta score (F1 and F0.5) to find the best threshold for classifying the txt2onto predicted probabilities into binary classes.

How

We identify the best threshold by looking for the predicted probability that yields the highest fbeta-score.
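
The threshold search described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `fbeta` and `best_threshold`, the use of plain lists, and the rule that a sample is positive when its probability is at or above the threshold are all assumptions.

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta from precision and recall; returns 0.0 when the denominator is zero."""
    denom = beta**2 * precision + recall
    return 0.0 if denom == 0 else (1 + beta**2) * precision * recall / denom


def best_threshold(
    probs: list[float], labels: list[int], beta: float = 1.0
) -> tuple[float, float]:
    """Return (threshold, score) for the candidate cut-off with the highest F-beta.

    Every distinct predicted probability is tried as a candidate threshold.
    Assumption: a sample is called positive when prob >= threshold.
    """
    n_pos = sum(labels)
    best_th, best_score = 1.0, -1.0
    for th in sorted(set(probs)):
        preds = [int(p >= th) for p in probs]
        tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
        n_pred = sum(preds)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_pos if n_pos else 0.0
        score = fbeta(precision, recall, beta)
        # Strict ">" keeps the lowest threshold when several candidates tie.
        if score > best_score:
            best_th, best_score = th, score
    return best_th, best_score


# Toy data (made up for illustration only).
probs = [0.05, 0.1, 0.2, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
f1_threshold, f1_score = best_threshold(probs, labels, beta=1.0)
f05_threshold, f05_score = best_threshold(probs, labels, beta=0.5)
```

On this toy input the F1 search settles on 0.2 and the F0.5 search on 0.8, since F0.5 weights precision more heavily than recall.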

Why

The txt2onto models output continuous probabilities, which need a cut-off value to be converted into binary classes.
Because the data are highly imbalanced, most of the predicted probabilities are very low, i.e., their distribution is highly skewed. We therefore use the F-beta score to identify the best cut-off value.

Damonlin11 · 2025-06-03T16:21:46Z

Hi @phicks22, the script for classifying the txt2onto predicted probabilities is also ready for review.

phicks22

@Damonlin11 Please see my comments.

src/MCC_F1_funcs.py

src/mcc_f1_predictions_binary.py

Damonlin11 · 2025-06-23T16:46:02Z

@phicks22 I have updated the MCC-F1 workflow to reflect the plan we discussed. It is ready for your review.

phicks22

@Damonlin11 Nice, just two more things.

results/txt2onto2_predicions_binary.parquet

phicks22 · 2025-06-24T17:47:44Z

src/mcc_f1_binary_classification.py

@Damonlin11 Format this script with black once more please.

@phicks22 updated here cd5061a

Damonlin11 · 2025-07-24T20:15:05Z

@phicks22 I pushed all the updated scripts generating the predicted annotations using fbeta-score here. Please review them. The main script is src/fbeta_binary_classification.py.

phicks22

@Damonlin11 Good work, Junxia. Please see my comments. Most of them have to do with refactoring to modularize these scripts even more. A couple comments about argument naming and module import conventions too.

phicks22 · 2025-07-24T21:06:13Z

src/mcc_f1_binary_classification.py

+
+            # save the best threshold
+            best_th = [[task, best_threshold]]
+            best_th_df = pd.DataFrame(best_th, columns=["task", "best_threshold"])


@Damonlin11 Any reason you're using both pandas and polars here? Seems a bit unnecessary to include both of them. I would recommend choosing only one when possible to reduce dependencies and imports.

phicks22 · 2025-07-24T21:08:23Z

src/mcc_f1_binary_classification.py

+if __name__ == "__main__":
+    parser = ArgumentParser()
+    parser.add_argument(
+        "-label_dir", help="Path to the label data", required=True, type=str


@Damonlin11 Minor comment. The label_dir argument is a path to a file though right? Not a directory? I suggest keeping argument names to single words whenever possible. Makes the script easier to run.
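
A possible shape for the argument parser, assuming single-word flags. The names `--labels` and `--probs` and the example values are hypothetical, chosen only to illustrate the suggestion; the script in this PR uses `-label_dir`.

```python
from argparse import ArgumentParser


def build_parser() -> ArgumentParser:
    """Build a parser with single-word argument names (hypothetical names)."""
    parser = ArgumentParser(description="Binarize predicted probabilities.")
    # "--labels" points to a file, so the name no longer suggests a directory.
    parser.add_argument(
        "--labels", required=True, type=str, help="Path to the label file"
    )
    # "--probs" is a directory holding one prediction file per term.
    parser.add_argument(
        "--probs", required=True, type=str, help="Directory of prediction CSVs"
    )
    return parser


# Parsing an explicit list keeps the example runnable without a command line.
args = build_parser().parse_args(["--labels", "labels.parquet", "--probs", "preds"])
```

Keeping the parser in its own function also makes it easy to exercise from a test without touching `sys.argv`.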

phicks22 · 2025-07-24T21:09:09Z

src/sampleLASSO_labels_prior_pos_predpos.py

+    annotations = pl.scan_parquet(args.annotations)
+    best_th_df = pd.read_csv(args.best_threshold, sep="\t")
+    best_th_df = best_th_df[best_th_df["task"] != "task"].reset_index(drop=True)
+    # pred_label_agg_data = label_data.select(["group", "index"]).collect()


@Damonlin11 Any reason this is commented out?

No longer needed, removed it.

phicks22 · 2025-07-24T21:09:35Z

src/sampleLASSO_labels_prior_pos_predpos.py

+    best_th_df = best_th_df[best_th_df["task"] != "task"].reset_index(drop=True)
+    # pred_label_agg_data = label_data.select(["group", "index"]).collect()
+
+    prior = [pd.NA] * len(best_th_df["task"])


@Damonlin11 What exactly is this doing here?

This gets the prior for each term. The prior values come from a separate parquet file.

phicks22 · 2025-07-24T21:10:32Z

src/sampleLASSO_labels_prior_pos_predpos.py

+import polars as pl
+from glob import glob
+import numpy as np
+import pandas as pd


@Damonlin11 Same comment here about using both polars and pandas.

This is because the script writes its output as a pandas DataFrame. That can be changed to a polars DataFrame so that only polars is used.

phicks22 · 2025-07-24T21:26:45Z

src/mcc_f1_binary_classification.py

+def best_threshold_classify(
+    train: pl.DataFrame, test: pl.DataFrame, task: str
+) -> pl.DataFrame:
+    """Function to calculate the best threshold using testing set and apply it to training set"""


@Damonlin11 What exactly are train and test here? I recommend describing them in the docstring. Also describe the return variables.
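
One way the docstring and return annotation could look. This is a sketch under stated assumptions: the function body is not shown in this PR, so the parameter meanings, the tuple return value, the use of F1, and the plain lists standing in for polars DataFrames are all illustrative guesses.

```python
def best_threshold_classify(
    train_probs: list[float],
    test_probs: list[float],
    test_labels: list[int],
) -> tuple[list[int], float]:
    """Choose a threshold on the test set and apply it to the training set.

    Args:
        train_probs: Predicted probabilities to convert into 0/1 classes.
        test_probs: Predicted probabilities used to select the threshold.
        test_labels: True 0/1 labels aligned with ``test_probs``.

    Returns:
        ``(train_classes, best_threshold)``: the binarized training
        predictions and the threshold that maximized F1 on the test set.
    """

    def f1_at(th: float) -> float:
        # F1 written in terms of counts: 2*TP / (2*TP + FP + FN).
        pairs = list(zip(test_probs, test_labels))
        tp = sum(1 for p, y in pairs if p >= th and y == 1)
        fp = sum(1 for p, y in pairs if p >= th and y == 0)
        fn = sum(1 for p, y in pairs if p < th and y == 1)
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    # Candidates are the distinct test probabilities; ties go to the lowest one.
    best_threshold = max(sorted(set(test_probs)), key=f1_at)
    train_classes = [int(p >= best_threshold) for p in train_probs]
    return train_classes, best_threshold
```

The annotation now matches what is returned, and the docstring names each input and each element of the returned tuple.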

phicks22 · 2025-07-24T21:26:56Z

src/mcc_f1_binary_classification.py

+
+def best_threshold_classify(
+    train: pl.DataFrame, test: pl.DataFrame, task: str
+) -> pl.DataFrame:


@Damonlin11 Incorrect return type.

phicks22 · 2025-07-24T21:30:17Z

src/mcc_f1_binary_classification.py

+    pred_label_agg_data = label_data.select(["group", "index"]).collect()
+
+    # loop over the prediction file of each term
+    for file in tqdm(glob(f"{prob_dir}/*.csv"), total=len(glob(f"{prob_dir}/*.csv"))):


@Damonlin11 DRY. Don't repeat yourself (when possible). This is an instance where you can have a variable defining glob(f"{prob_dir}/*.csv"), then use that in the for loop instead of computing it twice.

Also, pathlib has a glob method too: Path(dir).glob("<some pattern>"). It yields the matching files as pathlib.Path objects, so wrap it in list() or sorted() if you need its length.
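
A small sketch of the DRY suggestion using pathlib. The helper name `list_prediction_files` is hypothetical; the `*.csv` pattern is taken from the loop above.

```python
from pathlib import Path


def list_prediction_files(prob_dir: str) -> list[Path]:
    """Collect the per-term CSV files once so the result can be reused."""
    # Path.glob yields paths lazily; sorted() turns them into a list
    # with a stable order, so len() works and the glob runs only once.
    return sorted(Path(prob_dir).glob("*.csv"))
```

The loop can then iterate over the returned list, e.g. `files = list_prediction_files(prob_dir)` followed by `for file in tqdm(files):`. Since `files` is a list, tqdm can take the total from its length.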

phicks22 · 2025-07-24T21:32:24Z

src/sampleLASSO_labels_prior_pos_predpos.py

+
+    # loop over the prediction file of each term
+    for file in tqdm(
+        glob(f"{args.prob_dir}/*.csv"), total=len(glob(f"{args.prob_dir}/*.csv"))


@Damonlin11 Similar comments as in src/mcc_f1_binary_classification.py. Modularize and DRY if possible.

phicks22 · 2025-07-24T21:34:17Z

src/fbeta_binary_classification.py

@Damonlin11 Why are there two separate scripts for f1 and fbeta? I suggest merging them into one.

It seems like the two scripts are relatively different though. Why is this? Aren't they the same procedure, just optimizing for different scores?
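
For reference, the standard F-beta definition in terms of precision $P$ and recall $R$ covers both scores, which is consistent with treating $\beta$ as a single parameter of one procedure:

```latex
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R},
\qquad
F_1 = \frac{2\, P R}{P + R},
\qquad
F_{0.5} = \frac{1.25\, P R}{0.25\, P + R}
```

Setting $\beta = 1$ weights precision and recall equally, while $\beta = 0.5$ weights precision more heavily.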

A script to apply MCC-F1 score to classify binary classes.

78fc983

Add reference note on the top.

a7124fe

phicks22 reviewed Jun 5, 2025

src/MCC_F1_funcs.py (resolved)

src/mcc_f1_predictions_binary.py (outdated, resolved)

src/mcc_f1_predictions_binary.py (outdated, resolved)

Junxia Lin added 3 commits June 9, 2025 11:00

Adding the citation.

1efad39

Updated the script.

a8200f5

Deleted a file.

23ce5ff

Adding the predictions MCC-F1 binary annotations

328c129

phicks22 requested changes Jun 24, 2025

Junxia Lin and others added 4 commits June 26, 2025 11:22

Applied black format to the script and updated the filename of the annotation file

cd5061a

Adding the fbeta method annotation script and analysis-related script.

94a6cf6

Uploading all the new and updated scripts after implementing fbeta to obtain predicted annotations.

c5a94db

Format the scripts with black.

3ebbdae

Damonlin11 changed the title ~~A script to apply MCC-F1 score to classify binary classes.~~ Scripts to apply Fbeta-score to classify predicted probabilities from txt2onto models into binary classes. Jul 24, 2025

phicks22 requested changes Jul 24, 2025

Damonlin11 commented Aug 7, 2025 (edited)
