Skip to content

Conversation

Damonlin11
Copy link
Collaborator

@Damonlin11 Damonlin11 commented Jun 3, 2025

What

1. A new script to apply MCC-F1 method to classify the txt2onto predicted probability into binary classes.

  1. Abandon MCC-F1 method. But the scripts of MCC-F1 are kept in case there is any needs in the future.
  2. Switch to using fbeta (f1 & f0.5)-score to find the best threshold to classify the txt2onto predicted probabilities into binary class.

How

  1. We identify the best threshold by looking for the predicted probability that yields the highest fbeta-score.

Why

  1. The predicted outcome from txt2onto models is continuous probability values which need a cut-off value to convert to binary classes.
  2. As the data is highly imbalanced, most of the predicted probabilities are very low, i.e., the distribution of the predicted probas is highly skewed. We applied fbeta-score to identify the best cut-off value.

@Damonlin11
Copy link
Collaborator Author

Hi @phicks22 , the script for classifying the txt2onto predicted probability is also ready to review.

Copy link
Member

@phicks22 phicks22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Please see my comments.

@Damonlin11
Copy link
Collaborator Author

@phicks22 I have updated the MCC-F1 workflow to reflect the plan we discussed. It is ready for your review.

Copy link
Member

@phicks22 phicks22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Nice just two more things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Format this script with black once more please.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phicks22 updated here cd5061a

@Damonlin11 Damonlin11 changed the title A script to apply MCC-F1 score to classify binary classes. Scripts to apply Fbeta-score to classify predicted probabilities from txt2onto models into binary classes. Jul 24, 2025
@Damonlin11
Copy link
Collaborator Author

@phicks22 I pushed all the updated scripts generating the predicted annotations using fbeta-score here. Please review them. The main script is src/fbeta_binary_classification.py.

Copy link
Member

@phicks22 phicks22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Good work, Junxia. Please see my comments. Most of them have to do with refactoring to modularize these scripts even more. A couple comments about argument naming and module import conventions too.


# save the best threshold
best_th = [[task, best_threshold]]
best_th_df = pd.DataFrame(best_th, columns=["task", "best_threshold"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Any reason you're using both pandas and polars here? Seems a bit unnecessary to include both of them. I would recommend choosing only one when possible to reduce dependencies and imports.

if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument(
"-label_dir", help="Path to the label data", required=True, type=str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Minor comment. The label_dir argument is a path to a file though right? Not a directory? I suggest keeping argument names to single words whenever possible. Makes the script easier to run.

annotations = pl.scan_parquet(args.annotations)
best_th_df = pd.read_csv(args.best_threshold, sep="\t")
best_th_df = best_th_df[best_th_df["task"] != "task"].reset_index(drop=True)
# pred_label_agg_data = label_data.select(["group", "index"]).collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Any reason this is commented out?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Any reason this is commented out?

No longer needed, removed it.

best_th_df = best_th_df[best_th_df["task"] != "task"].reset_index(drop=True)
# pred_label_agg_data = label_data.select(["group", "index"]).collect()

prior = [pd.NA] * len(best_th_df["task"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 What exactly is this doing here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 What exactly is this doing here?

This is to get the prior for each term. The prior value is from a separate parquet file.

import polars as pl
from glob import glob
import numpy as np
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Same comment here about using both polars and pandas.

Copy link
Collaborator Author

@Damonlin11 Damonlin11 Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Same comment here about using both polars and pandas.

This is because the output file format from this script is in pandas dataframe, which can be changed to polars dataframe such that only the polars is used.

def best_threshold_classify(
train: pl.DataFrame, test: pl.DataFrame, task: str
) -> pl.DataFrame:
"""Function to calculate the best threshold using testing set and apply it to training set"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 What exactly are train and test here? I recommend describing them in the docstring. Also describe the return variables.


def best_threshold_classify(
train: pl.DataFrame, test: pl.DataFrame, task: str
) -> pl.DataFrame:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Incorrect return type.

pred_label_agg_data = label_data.select(["group", "index"]).collect()

# loop over the prediction file of each term
for file in tqdm(glob(f"{prob_dir}/*.csv"), total=len(glob(f"{prob_dir}/*.csv"))):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 DRY. Don't repeat yourself (when possible). This is an instance where you can have a variable defining glob(f"{prob_dir}/*.csv"), then use that in the for loop instead of computing it twice.

Also, pathlib has a glob function too: Path(dir).glob("<some pattern>") and it will return a list of files as pathlib.Path objects.


# loop over the prediction file of each term
for file in tqdm(
glob(f"{args.prob_dir}/*.csv"), total=len(glob(f"{args.prob_dir}/*.csv"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Similar comments as in src/mcc_f1_binary_classification.py. Modularize and DRY if possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonlin11 Why are there two separate scripts for f1 and fbeta? I suggest merging them into one.

It seems like the two scripts are relatively different though. Why is this? Aren't they the same procedure, just optimizing for different scores?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants