Awesome lists about all kinds of LLM-related datasets
- MATH Dataset: A collection of 12,500 challenging competition mathematics problems, each with a step-by-step solution for training models in answer derivation and explanation generation
 - GSM8k Dataset: A collection of 8,500 grade school math problems. This dataset tests the multi-step reasoning abilities of models, highlighting their limitations despite the simplicity of the problems
 - MathQA: A large-scale dataset of math word problems.
 - AQUA-RAT: An algebraic word problem dataset, with multiple-choice questions annotated with rationales.
 
- Magicoder
 - Salesforce/xlam-function-calling-60k: APIGen Function-Calling Datasets
 
- ImageInWords: Unlocking Hyper-Detailed Image Descriptions
 
- Mendeley digital knee X-ray images
 - PAD-UFES-20
 - UltraMedical: Building Specialized Generalists in Biomedicine.
 
- MS MARCO Web Search: A large-scale information-rich web dataset, featuring millions of real clicked query-document labels
 
