Implementation of the paper:
- Seiji Maekawa, Hayate Iso, Nikita Bhutani. The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs
- Introduced Distinctive Feature Mining (DFM) and DiFBench: a new task and benchmark framework that systematically evaluates LLMs’ statistical reasoning by asking them to identify globally rare features across document collections (a minimal sketch of the rarity criterion follows this list).
- Conducted a large-scale empirical evaluation: the first comprehensive study of ten state-of-the-art LLMs, revealing that even advanced reasoning models degrade significantly with scale and often misclassify frequent features as distinctive, providing computational evidence of base rate neglect.
- Demonstrated mitigation via explicit verification prompting: a simple yet effective prompting strategy that improves F1 scores by 65% relative, highlighting both a practical mitigation and persistent limitations in multi-document comparative reasoning.
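To make the task concrete, here is a minimal, self-contained sketch of the rarity criterion (not the benchmark's actual scoring code): a feature is distinctive for a document if it occurs in at most k documents across the collection. The example output further below applies a ≤2-occurrence rule, so the sketch uses k = 2 with exact string matching; the benchmark itself works over free-text features produced by LLMs, so matching there is presumably less literal. The `rare_features` helper and the toy `docs` collection are illustrative only.

```python
from collections import Counter

def rare_features(doc_features, max_occurrences=2):
    """Keep, for each document, only the features whose document frequency
    across the whole collection is at most `max_occurrences`."""
    doc_freq = Counter()
    for feats in doc_features.values():
        doc_freq.update(set(feats))  # each document counts a feature at most once
    return {
        doc_id: [f for f in feats if doc_freq[f] <= max_occurrences]
        for doc_id, feats in doc_features.items()
    }

# Toy collection: the axle-ratio line is shared by two documents only, so it is
# kept; the LED-headlight line appears in three documents and is filtered out.
docs = {
    "1": ["3.09:1 axle ratio", "LED headlights", "VIN-matching perk"],
    "2": ["LED headlights", "head studs"],
    "3": ["LED headlights"],
    "7": ["3.09:1 axle ratio"],
}
print(rare_features(docs))
```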
conda create -n difbench python=3.11
conda activate difbench
pip install -r requirements.txt
The following scripts give a quick overview of how to use DiFBench to evaluate LLMs on the distinctive feature mining task.
bash scripts/resume_dfm.sh
bash scripts/news_dfm.sh
The results look like:
{
  "0": {
    "average_scores": {
      "micro_avg_precision": 1.0,
      "micro_avg_recall": 1.0,
      "micro_avg_f1_score": 1.0,
      "macro_avg_precision": 1.0,
      "macro_avg_recall": 1.0,
      "macro_avg_f1_score": 1.0,
      "input_tokens": 7008,
      "output_tokens": 15101,
      "reasoning_tokens": 14464
    },
    "outputs": [
      {
        "summary_id": 1,
        "reasoning": "I listed every bullet from summary 1, then checked how many other summaries contained the exact same wording. Only two lines met the \u22642-occurrence rule: the 3.09:1 axle ratio line (appears here and in summary 7 only) and the VIN-matching perk line (unique to this summary).",
        "output": [
          "3.09:1 axle ratio balances highway comfort with quarter-mile thrust.",
          "VIN-matching perk for 2018 owners requires pre-reserved serial blocks on assembly line."
        ]
      },
      {
        "summary_id": 2,
        "reasoning": "All but one of summary 2\u2019s lines were found in three or more other summaries. The head-bolt/stud line appears only here and in summary 5, so it qualifies.",
        "output": [
          "Head bolts replaced by studs to prevent gasket lift under roughly 2,500-psi peak combustion loads."
        ]
      },
      ...
    ]
  },
  "1": {
    ...
  }
}
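You can inspect a results file programmatically. Top-level keys ("0", "1", ...) index document collections; each entry holds averaged scores plus per-summary outputs and reasoning traces. In line with the usual convention, the micro averages appear to pool feature-level decisions across all summaries in a collection, while the macro averages are computed per summary and then averaged. The path `results.json` below is a placeholder; point it at wherever your run wrote its output.

```python
import json

# Hypothetical path: adjust to your run's actual results file.
with open("results.json") as f:
    results = json.load(f)

# Print the aggregate scores for each document collection.
for collection_id, entry in results.items():
    scores = entry["average_scores"]
    print(
        f"collection {collection_id}: "
        f"micro F1={scores['micro_avg_f1_score']:.3f}, "
        f"macro F1={scores['macro_avg_f1_score']:.3f}, "
        f"output tokens={scores['output_tokens']}"
    )
```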
| ID | OSS Component Name | Modified | Copyright Holder | Upstream Link | License |
|---|---|---|---|---|---|
| 1 | DiFBench | No | Megagon Labs | link | BSD-3-Clause license |
You can find our generated resume features and news insights under `./data/`.
Each dataset is stored as a separate `.jsonl` file in that folder. The filename corresponds to the category name (e.g., `Legal_occupations.jsonl`).
In each file:
- Each row represents a source document (e.g., a resume or a news article).
- Each column represents a set of features in a specific section of the document.
The 10 dataset categories are:
- Resume Domain:
  - Computer_and_mathematical_occupations
  - Life_physical_and_social_science_occupations
  - Legal_occupations
  - Architecture_and_engineering_occupations
  - Healthcare_occupations
- News Summary Domain:
  - topic1
  - topic2
  - topic3
  - topic4
  - topic5
You can easily load any of the datasets using Python with the `pandas` library. Make sure the `.jsonl` files are in the same directory as your script, or provide the correct path to the files.
Here is an example of how to load a single dataset:
import os
import pandas as pd
# List of all dataset categories
categories = [
"Computer_and_mathematical_occupations",
"Life_physical_and_social_science_occupations",
"Legal_occupations",
"Architecture_and_engineering_occupations",
"Healthcare_occupations",
"topic1",
"topic2",
"topic3",
"topic4",
"topic5",
]
# --- Load a single dataset ---
# Select a category to load
category_to_load = categories[0]
# Define the path to the data file
# Assumes the data files are in the current directory ("./")
data_path = f"./{category_to_load}.jsonl"
# Load the dataset into a pandas DataFrame
df = pd.read_json(data_path, lines=True)
# Display the first few rows of the dataframe
print(f"Successfully loaded {category_to_load}:")
print(df.head())
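Building on the snippet above, you can also load every category at once and inspect how a single document's features are organized by section. This continuation reuses the `categories` list and `pandas` import from the example; the path pattern and the assumption that each column holds one section's feature set follow the dataset description above.

```python
# Load every category into a dict of DataFrames (same path assumptions as above).
datasets = {c: pd.read_json(f"./{c}.jsonl", lines=True) for c in categories}

# Inspect the first document of one dataset: each column holds the feature set
# for one section of that document.
row = datasets["Legal_occupations"].iloc[0]
for section, features in row.items():
    print(f"{section}: {features}")
```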
@misc{maekawa2025distinctive,
      title={The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs},
      author={Seiji Maekawa and Hayate Iso and Nikita Bhutani},
      url={https://arxiv.org/abs/2509.00245},
      year={2025}
}
Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses, which may be BSD 3 clause license and Apache 2.0 license. In the event of conflicts between Megagon Labs, Inc., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software.

You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of data sets from known sources by including links to original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. are governed by the respective third party’s license conditions.

All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein.

While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. If you see any error or omission, please help us improve this document by sending information to [email protected].