Scripts to apply Fbeta-score to classify predicted probabilities from txt2onto models into binary classes. #1
Conversation
Hi @phicks22, the script for classifying the txt2onto predicted probabilities is also ready for review.
@Damonlin11 Please see my comments.
@phicks22 I have updated the MCC-F1 workflow to reflect the plan we discussed. It is ready for your review.
@Damonlin11 Nice, just two more things.
@Damonlin11 Format this script with black once more, please.
… obtain predicted annotations.
@phicks22 I pushed all the updated scripts that generate the predicted annotations using the fbeta-score here. Please review them. The main script is
@Damonlin11 Good work, Junxia. Please see my comments. Most of them have to do with refactoring to modularize these scripts even more. A couple comments about argument naming and module import conventions too.
```python
# save the best threshold
best_th = [[task, best_threshold]]
best_th_df = pd.DataFrame(best_th, columns=["task", "best_threshold"])
```
@Damonlin11 Any reason you're using both `pandas` and `polars` here? Seems a bit unnecessary to include both of them. I would recommend choosing only one when possible to reduce dependencies and imports.
```python
if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument(
        "-label_dir", help="Path to the label data", required=True, type=str
    )
```
@Damonlin11 Minor comment. The `label_dir` argument is a path to a file though, right? Not a directory? I suggest keeping argument names to single words whenever possible. It makes the script easier to run.
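A sketch of that suggestion; the flag name `-labels` and the sample path are hypothetical, not taken from the repository:

```python
from argparse import ArgumentParser

parser = ArgumentParser()
# single-word flag naming a file, not a directory
parser.add_argument("-labels", help="Path to the label file", required=True, type=str)

# parse a sample command line to show the resulting attribute name
args = parser.parse_args(["-labels", "data/labels.tsv"])
print(args.labels)
```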
```python
annotations = pl.scan_parquet(args.annotations)
best_th_df = pd.read_csv(args.best_threshold, sep="\t")
best_th_df = best_th_df[best_th_df["task"] != "task"].reset_index(drop=True)
# pred_label_agg_data = label_data.select(["group", "index"]).collect()
```
@Damonlin11 Any reason this is commented out?
No longer needed, removed it.
```python
prior = [pd.NA] * len(best_th_df["task"])
```
@Damonlin11 What exactly is this doing here?
This gets the prior for each term. The prior value comes from a separate parquet file.
```python
import polars as pl
from glob import glob
import numpy as np
import pandas as pd
```
@Damonlin11 Same comment here about using both `polars` and `pandas`.
This is because the output of this script is a pandas DataFrame; it could be changed to a polars DataFrame so that only polars is used.
```python
def best_threshold_classify(
    train: pl.DataFrame, test: pl.DataFrame, task: str
) -> pl.DataFrame:
    """Function to calculate the best threshold using testing set and apply it to training set"""
```
@Damonlin11 What exactly are `train` and `test` here? I recommend describing them in the docstring. Also describe the return variables.
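One possible shape for such a docstring; the parameter and return descriptions below are guesses about the intended schema, not taken from the code, and the body is omitted:

```python
def best_threshold_classify(
    train: "pl.DataFrame", test: "pl.DataFrame", task: str
) -> "pl.DataFrame":
    """Select the best threshold on one split and apply it to the other.

    Args:
        train: predicted probabilities and true labels used to pick
            the threshold (hypothetical schema).
        test: rows to classify with the selected threshold.
        task: the ontology term these probabilities belong to.

    Returns:
        A DataFrame of binary predictions for ``test`` (hypothetical).
    """
    raise NotImplementedError  # body omitted in this sketch
```

String annotations keep the sketch runnable without importing polars.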
```python
def best_threshold_classify(
    train: pl.DataFrame, test: pl.DataFrame, task: str
) -> pl.DataFrame:
```
@Damonlin11 Incorrect return type.
```python
pred_label_agg_data = label_data.select(["group", "index"]).collect()

# loop over the prediction file of each term
for file in tqdm(glob(f"{prob_dir}/*.csv"), total=len(glob(f"{prob_dir}/*.csv"))):
```
@Damonlin11 DRY. Don't repeat yourself (when possible). This is an instance where you can have a variable defining `glob(f"{prob_dir}/*.csv")`, then use that in the for loop instead of computing it twice.

Also, `pathlib` has a glob function too: `Path(dir).glob("<some pattern>")` yields the matching files as `pathlib.Path` objects.
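A minimal sketch of both points, using a temporary directory and made-up file names; `sorted(...)` materializes the iterator from `Path.glob` into a list so it can be reused (e.g. for `tqdm`'s `total`):

```python
from pathlib import Path
import tempfile

# hypothetical directory of per-term prediction files
prob_dir = Path(tempfile.mkdtemp())
(prob_dir / "MONDO_0005148.csv").write_text("prob\n0.9\n")
(prob_dir / "MONDO_0007254.csv").write_text("prob\n0.2\n")

# compute the file list once, then reuse it in the loop
pred_files = sorted(prob_dir.glob("*.csv"))
for file in pred_files:  # wrapping in tqdm(pred_files) works the same way
    print(file.name)
```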
```python
# loop over the prediction file of each term
for file in tqdm(
    glob(f"{args.prob_dir}/*.csv"), total=len(glob(f"{args.prob_dir}/*.csv"))
):
```
@Damonlin11 Similar comments as in `src/mcc_f1_binary_classification.py`. Modularize and DRY if possible.
@Damonlin11 Why are there two separate scripts for f1 and fbeta? I suggest merging them into one.
It seems like the two scripts are relatively different, though. Why is this? Aren't they the same procedure, just optimizing for different scores?
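One hypothetical way to merge them: since F1 is the special case of F-beta with beta = 1, a single scoring helper parameterized by beta could serve both scripts. A sketch from the standard definition:

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall; beta=1 gives plain F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# beta=1 reproduces F1; beta=0.5 weights precision more heavily
f1 = fbeta(0.8, 0.5)
f05 = fbeta(0.8, 0.5, beta=0.5)
```

With this, the threshold-search loop only needs a `--beta` argument instead of a second script.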
What

1. A new script to apply the MCC-F1 method to classify the txt2onto predicted probabilities into binary classes.

How

Why