DatasetDocs

This repository serves two purposes:

It hosts a documentation generator for Lmod dataset module files
It contains the generated documentation that is hosted on Read the Docs

Documentation

The documentation for all available datasets can be found online at Read the Docs. It is generated automatically from Lmod module files and describes each dataset, its versions, and the environment variables its module sets.

The documentation is stored in the /docs directory of this repository and is continuously built and updated on Read the Docs.

Use of Datasets

The datasets provided in this repository are federated and support HPC-optimized workflows across research domains. Anvil's community dataset storage offers high-speed access to large-scale datasets. The hundreds of terabytes of meteorological and geospatial datasets available on Anvil have been essential to our tools and scientific work, letting researchers focus on scientific discovery rather than on data-related challenges.

For more information about Anvil and its capabilities, please visit the RCAC Anvil page.

Documentation Generator

The documentation generator tool automatically creates and maintains documentation for scientific datasets by parsing Lmod module files and creating structured documentation in reStructuredText (rst) format.

Features

Recursively scans directories containing Lmod (.lua) module files
Extracts and formats help text from module files
Captures environment variables set by the modules
Automatically detects version information from date-based filenames (YYYY-MM-DD format)
Generates structured documentation in reStructuredText format
Builds documentation using Sphinx and hosts it on Read the Docs
Maintains hierarchical documentation structure mirroring the dataset organization
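
The scanning and version-detection features above can be illustrated with a minimal sketch. This is not the code in generate_docs.py; the function names find_module_files and detect_version are hypothetical, and the sketch assumes a version is simply the file name without its .lua extension when that name is a valid YYYY-MM-DD date.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumption: a versioned module file is named like 2024-01-15.lua
DATE_STEM = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def detect_version(module_path):
    """Return the date-based version for a module file, or None if it has none."""
    stem = Path(module_path).stem
    if not DATE_STEM.match(stem):
        return None
    try:
        # Reject strings that look like dates but are not, e.g. 2024-02-30
        datetime.strptime(stem, "%Y-%m-%d")
    except ValueError:
        return None
    return stem


def find_module_files(datasets_dir):
    """Recursively collect .lua module files, sorted for stable output."""
    return sorted(Path(datasets_dir).rglob("*.lua"))
```

Files whose names are not dates (for example a hypothetical default.lua) yield None here, so a caller can decide whether to skip them or document them without a version.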

Installation

Clone this repository:

git clone https://github.com/PurdueRCAC/DatasetDocs.git
cd DatasetDocs

Install Python dependencies:

pip install -r docs/requirements.txt

Usage

The main script generate_docs.py can be run as follows:

python generate_docs.py \
  --datasets-dir /path/to/lmod/datasets \
  --output-dir /path/to/DatasetDocs/docs

Arguments

--datasets-dir: Directory containing the Lmod (.lua) module files
--output-dir: Directory where the generated documentation will be written
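
For readers who want to wrap or extend the script, the two arguments above could be declared with argparse roughly as follows. This is a sketch, not the script's actual parser: the function name build_parser is hypothetical, and marking both options as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser for the two documented options (sketch only)."""
    parser = argparse.ArgumentParser(
        description="Generate reStructuredText docs from Lmod module files."
    )
    parser.add_argument(
        "--datasets-dir",
        required=True,  # assumption: the real script may provide a default
        help="Directory containing the Lmod (.lua) module files",
    )
    parser.add_argument(
        "--output-dir",
        required=True,  # assumption, as above
        help="Directory where the generated documentation will be written",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts --datasets-dir to the attribute name datasets_dir
    print(args.datasets_dir, args.output_dir)
```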

Documentation Structure

The generated documentation follows this structure:

docs/
├── index.rst               # Main documentation index
├── category1/              # Top-level dataset category
│   ├── dataset1/           # Dataset subdirectory
│   │   └── YYYY-MM-DD.rst  # Version-specific documentation
│   └── index.rst           # Category index
└── category2/
    └── ...
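
One way to obtain this mirrored layout is to keep each module file's path relative to the datasets directory and swap the extension. The helper below is a hypothetical sketch of that mapping, assuming one .rst file per .lua file; the generator's actual logic may differ.

```python
from pathlib import Path


def rst_path_for(module_file, datasets_dir, output_dir):
    """Map a module file to the .rst path that mirrors its location."""
    # e.g. category1/dataset1/2024-01-15.lua, relative to the datasets root
    relative = Path(module_file).relative_to(datasets_dir)
    # Same relative location under the output directory, with an .rst suffix
    return Path(output_dir) / relative.with_suffix(".rst")
```

For example, a module at /data/category1/dataset1/2024-01-15.lua with a datasets directory of /data and an output directory of docs maps to docs/category1/dataset1/2024-01-15.rst.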

Building Documentation Locally

While the documentation is automatically built on Read the Docs, you can also build it locally using Sphinx:

cd docs
make html

The built documentation will be available in docs/_build/html/.

AI Search Functionality (ai_search)

The ai_search directory provides advanced AI-powered search and knowledge management tools that integrate with the DatasetDocs ecosystem. These tools enable programmatic uploading, tagging, and management of documentation files as well as conversational search via large language models (LLMs).

Overview of Modules

knowledge_manager.py: Handles API interactions for managing the knowledge base. Supports uploading files, adding files to knowledge bases, tagging documents, retrieving document lists, and deleting documents via RESTful endpoints. This is the core for programmatic knowledge base management.
chat.py: Provides a command-line chatbot interface that connects to an LLM API (such as OpenWebUI/AnvilGPT). It allows users to ask questions and receive answers based on the indexed documentation and datasets.
dataset_uploader.py: Automates the process of finding, renaming, and uploading .rst documentation files into the knowledge base, associating them with the correct knowledge base for datasets.
config.py: Loads API keys and endpoint URLs from a secrets.json configuration file, enabling secure and flexible deployment.
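
As an illustration of the configuration step, the sketch below reads a secrets.json file and checks that the expected entries are present. The key names api_key and base_url are assumptions made for this example and are not confirmed by the repository; check config.py for the names it actually expects.

```python
import json
from pathlib import Path

# Assumption: hypothetical key names, not taken from the repository
REQUIRED_KEYS = ("api_key", "base_url")


def parse_config(text):
    """Parse JSON configuration text and verify the required keys exist."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError("configuration is missing keys: " + ", ".join(missing))
    return data


def load_config(path="secrets.json"):
    """Read and validate the configuration file at the given path."""
    return parse_config(Path(path).read_text(encoding="utf-8"))
```

Failing early on a missing key gives a clearer error than a failed API call later on. Keeping secrets.json out of version control is the usual practice for files that hold API keys.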

These tools are designed to:

Automate ingestion and management of dataset documentation into an AI-powered knowledge base
Enable conversational and programmatic search over documentation using LLMs
Support tagging, organization, and bulk management of documentation assets

See the individual module docstrings and code for usage examples and further details.

Chat API (HTTP)

The chat_server.py module provides an HTTP API for interacting with the AI-powered chat assistant. This allows you to send messages and receive responses via simple HTTP requests, making it easy to integrate with web frontends or other systems.

Endpoint:

GET /chat

Query Parameters:

message (required): The user message to send to the assistant.
session_id (optional): A unique identifier for the conversation session. If omitted, a default session is used. This allows for multi-turn conversations per user/session.

Example Request:

GET http://localhost:5000/chat?message=What+datasets+are+available%3F&session_id=testuser

Example Response:

{
  "response": "The following datasets are available: ..."
}

Error Responses:

If the message parameter is missing:

{ "error": "Missing required GET parameter: message" }

If the chat backend fails:

{ "error": "Failed to retrieve a response from the API." }

You can run the server with:

python ai_search/chat_server.py

and then interact with it using any HTTP client (browser, curl, Postman, etc.).
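
A small Python client for the endpoint might look like the sketch below. It uses only the standard library; the function names build_chat_url and ask are hypothetical, and the handling of non-2xx responses is an assumption, since the HTTP status codes used for errors are not stated here.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def build_chat_url(message, session_id=None, base_url="http://localhost:5000"):
    """Build the GET /chat URL with properly encoded query parameters."""
    params = {"message": message}
    if session_id is not None:
        params["session_id"] = session_id
    return f"{base_url}/chat?{urlencode(params)}"


def ask(message, session_id=None, base_url="http://localhost:5000"):
    """Send one message and return the assistant's reply text."""
    url = build_chat_url(message, session_id, base_url)
    try:
        with urlopen(url) as resp:
            payload = json.load(resp)
    except HTTPError as err:
        # Assumption: error responses also carry a JSON body
        payload = json.load(err)
    if "error" in payload:
        raise RuntimeError(payload["error"])
    return payload["response"]


if __name__ == "__main__":
    # Requires the server to be running locally
    print(ask("What datasets are available?", session_id="testuser"))
```

Reusing the same session_id across calls should continue a multi-turn conversation, per the parameter description above.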

Contributing

Contributions are welcome! You can contribute in several ways:

Improving the documentation generator
Fixing documentation errors
Enhancing the documentation structure
Adding new features

Please feel free to submit a Pull Request.

License

This project is licensed under an open source license; see the LICENSE file for details.
