
Conversation

itsNintu (Collaborator) commented Jul 31, 2025

Replace static sitemap and llms.txt generation with a dynamic Next.js App Router implementation for both the web client and docs applications. The implementation automatically discovers pages and generates sitemap.xml, llms.txt, llms-full.txt, and robots.txt using Next.js built-in metadata routes.

Key Changes:

• Implement dynamic sitemap.ts and robots.ts files using Next.js App Router conventions (a minimal sketch follows this list)
• Remove next-sitemap dependency from docs application
• Add automatic page discovery with configurable exclusion patterns
• Ensure proper SEO optimization with appropriate priorities and change frequencies
• Maintain consistency between robots.txt disallow rules and sitemap exclusions
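
As a rough illustration of the first bullet, here is a minimal sketch of the App Router metadata-route convention for the web client sitemap. The actual sitemap.ts in this PR delegates to getWebRoutes() in sitemap-utils.ts; the inline entries below (including the /pricing route) are purely illustrative.

import type { MetadataRoute } from 'next';

// apps/web/client/src/app/sitemap.ts: minimal sketch, not the PR's exact code.
const BASE_URL = process.env.APP_URL ?? 'https://onlook.com';

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
    // In the PR, entries come from scanning src/app for page.tsx files and
    // filtering out excluded patterns (auth, API, user-specific routes).
    return [
        {
            url: BASE_URL,
            lastModified: new Date(),
            changeFrequency: 'weekly',
            priority: 1.0,
        },
        {
            url: `${BASE_URL}/pricing`, // hypothetical marketing page, for illustration only
            lastModified: new Date(),
            changeFrequency: 'monthly',
            priority: 0.9,
        },
    ];
}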

Related Issues

Type of Change

[ ] Bug fix
[✓] New feature
[ ] Documentation update
[ ] Release
[✓] Refactor
[ ] Other (please describe):

Testing

Manual Testing Steps:

  1. Web Client (onlook.com):
    • Visit /robots.txt - should show proper disallow rules and sitemap reference
    • Visit /sitemap.xml - should show all public pages with correct priorities
    • Verify excluded routes (auth, API, user-specific) are not in sitemap
  2. Docs (docs.onlook.com):
    • Visit /robots.txt - should reference sitemap correctly
    • Visit /sitemap.xml - should show docs homepage
    • Verify old next-sitemap functionality is replaced
  3. Build Testing:
    • Run bun install and bun build for both applications
    • Confirm no next-sitemap related errors in docs build

Screenshots (if applicable)

Additional Notes

• Breaking Change: Removes the next-sitemap dependency; the docs application no longer needs a postbuild script
• SEO Optimized: Homepage gets priority 1.0, marketing pages 0.9, auth pages 0.6 (a rough sketch of this mapping follows this list)
• Automatic: New pages are automatically included in sitemap without manual configuration
• Secure: Private routes (user dashboards, API endpoints) are automatically excluded
• Standards Compliant: Uses official Next.js metadata route conventions for better caching and performance
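
For illustration, one way the priority scheme above could be expressed in the sitemap utility; the helper name getRouteMetadata and the specific route prefixes (/pricing, /about, /login) are assumptions rather than the PR's actual code.

import type { MetadataRoute } from 'next';

type RouteMeta = Pick<MetadataRoute.Sitemap[number], 'priority' | 'changeFrequency'>;

// Hypothetical helper: maps a discovered route to sitemap metadata per the scheme above.
function getRouteMetadata(route: string): RouteMeta {
    if (route === '/') {
        return { priority: 1.0, changeFrequency: 'weekly' }; // homepage
    }
    if (route.startsWith('/pricing') || route.startsWith('/about')) {
        return { priority: 0.9, changeFrequency: 'monthly' }; // marketing pages (assumed paths)
    }
    if (route.startsWith('/login')) {
        return { priority: 0.6, changeFrequency: 'yearly' }; // public auth pages (assumed path)
    }
    return { priority: 0.7, changeFrequency: 'monthly' }; // default for everything else
}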


Important

Replaces static sitemap and robots.txt generation with a dynamic Next.js App Router implementation, removing the next-sitemap dependency and adding automatic page discovery with exclusion patterns.

  • Behavior:
    • Replace static sitemap and robots.txt generation with dynamic implementation using Next.js App Router.
    • Automatic page discovery with exclusion patterns in sitemap-utils.ts.
    • Ensure SEO optimization with priorities and change frequencies.
    • Consistency between robots.ts disallow rules and sitemap exclusions.
  • Files:
    • Add sitemap.ts, robots.ts, llms.txt/route.ts, and llms-full.txt/route.ts in both web/client and docs applications.
    • Remove next-sitemap.config.js and related postbuild script from package.json in docs.
  • Misc:
    • Remove next-sitemap dependency from docs/package.json.
    • Update constants/index.ts for route management.

This description was created by Ellipsis for caf080d. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features

    • Added public LLMS documentation endpoints (llms.txt and llms-full.txt) for both the app and docs sites.
    • Implemented robots.txt via metadata routes with sensible crawl rules and sitemap references.
    • Introduced dynamic sitemap generation, including automatic route discovery for the app and a daily-updated sitemap for docs.
  • Chores

    • Removed legacy sitemap tooling and configuration.
    • Cleaned up build scripts and dependencies related to sitemap generation.
    • Minor file formatting cleanup.

itsNintu and others added 4 commits July 31, 2025 07:26
- Replace next-sitemap with native Next.js sitemap.ts and robots.ts files
- Add automatic page discovery for web client sitemap generation
- Create comprehensive sitemap utilities with SEO optimization
- Remove deprecated robots.txt route handler in docs
- Update docs package.json to remove next-sitemap dependency
- Add detailed implementation documentation in DYNAMIC_SITEMAP_SETUP.md
- Fix constants.ts formatting

🤖 Generated with [opencode](https://opencode.ai)

Co-Authored-By: opencode <[email protected]>
- Remove redundant /docs path from sitemap (docs.onlook.com/docs -> docs.onlook.com)
- Keep correct docs.onlook.com domain for both sitemap and robots

🤖 Generated with [opencode](https://opencode.ai)

Co-Authored-By: opencode <[email protected]>
Keep implementation documentation local only

🤖 Generated with [opencode](https://opencode.ai)

Co-Authored-By: opencode <[email protected]>

vercel bot commented Jul 31, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
docs | Ready | Preview | Comment | Aug 18, 2025 6:10am
web | Ready | Preview | Comment | Aug 18, 2025 6:10am


supabase bot commented Jul 31, 2025

This pull request has been ignored for the connected project wowaemfasoptxrdjhilu because there are no changes detected in apps/backend/supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.


Preview Branches by Supabase.
Learn more about Supabase Branching ↗︎.


coderabbitai bot commented Aug 18, 2025

Walkthrough

New text routes generate llms.txt and llms-full.txt for both web and docs apps. Web adds robots and sitemap metadata plus a filesystem-based sitemap utility. Docs migrates robots/sitemap to MetadataRoute, removes next-sitemap config and related script/dependency, and deletes the old robots.txt route. A minor formatting change adds a newline.

Changes

  • Web LLMS text routes (apps/web/client/src/app/llms.txt/route.ts, apps/web/client/src/app/llms-full.txt/route.ts): Add GET handlers serving plaintext llms*.txt. Use DOCS_URL env fallback, set Content-Type text/plain and X-Robots-Tag: llms-txt. llms-full builds comprehensive documentation text; llms renders structured sections. (A rough handler sketch follows this list.)
  • Web SEO routes and sitemap utility (apps/web/client/src/app/robots.ts, apps/web/client/src/app/sitemap.ts, apps/web/client/src/lib/sitemap-utils.ts): Add a robots metadata route using APP_URL and disallow lists. Add a sitemap metadata route delegating to getWebRoutes(). Implement getWebRoutes() to scan app routes, filter excluded patterns, and return MetadataRoute.Sitemap entries with priorities and frequencies.
  • Docs LLMS text routes (docs/src/app/llms.txt/route.ts, docs/src/app/llms-full.txt/route.ts): Add GET handlers generating plaintext llms*.txt. llms-full scans docs content, extracts titles, cleans Markdown, builds a TOC and sections; sets revalidate=3600. llms outputs static sections. Both set X-Robots-Tag: llms-txt.
  • Docs SEO config migration (docs/src/app/robots.ts, docs/src/app/sitemap.ts, docs/src/app/robots.txt/route.ts removed, docs/next-sitemap.config.js deleted, docs/package.json): Migrate to MetadataRoute-based robots and sitemap. Remove the dynamic robots.txt route. Delete the next-sitemap config and drop the next-sitemap script/dependency from package.json.
  • Misc formatting (apps/web/client/src/utils/constants/index.ts): Add a trailing newline; no functional change.
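
For orientation, a rough sketch of the route-handler shape described in the first item above, assuming the DOCS_URL fallback and headers listed there; the content builder in the PR is more elaborate than this.

// Sketch of a GET handler like apps/web/client/src/app/llms.txt/route.ts; the body text is illustrative.
export async function GET(): Promise<Response> {
    const docsUrl = process.env.DOCS_URL ?? 'https://docs.onlook.com';

    const content = ['# Onlook', '', '## Documentation', `- Docs: ${docsUrl}`].join('\n');

    return new Response(content, {
        headers: {
            'Content-Type': 'text/plain; charset=utf-8',
            'X-Robots-Tag': 'llms-txt',
        },
    });
}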

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant NextApp as Next.js Route (web/docs)
  participant Generator as LLMS Generator

  Client->>NextApp: GET /llms.txt or /llms-full.txt
  NextApp->>Generator: Build documentation text (env DOCS_URL)
  Generator-->>NextApp: Plaintext content
  NextApp-->>Client: 200 text/plain (X-Robots-Tag: llms-txt)
sequenceDiagram
  participant Client
  participant WebApp as Next.js sitemap (web)
  participant Utils as getWebRoutes()
  participant FS as File System

  Client->>WebApp: GET /sitemap.xml
  WebApp->>Utils: getWebRoutes()
  Utils->>FS: readdir(app/src/app recursively)
  FS-->>Utils: Directory entries
  Utils-->>WebApp: MetadataRoute.Sitemap entries
  WebApp-->>Client: Sitemap response
sequenceDiagram
  participant Client
  participant DocsApp as Next.js llms-full (docs)
  participant Scanner as Docs Scanner
  participant FS as File System

  Client->>DocsApp: GET /llms-full.txt
  DocsApp->>Scanner: scanDocsDirectory()
  Scanner->>FS: Read *.mdx/*.md
  FS-->>Scanner: File contents
  Scanner-->>DocsApp: Titles + cleaned content
  DocsApp-->>Client: 200 text/plain (assembled document)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

In burrows of bytes I twitch my nose,
New maps and manuals neatly compose.
Robots now know where not to tread,
Sitemaps bloom where routes are read.
LLMS scrolls, a carrot-long list—
I thump approval: nothing’s missed! 🥕🐇



coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (15)
docs/src/app/sitemap.ts (2)

5-5: Prefer DOCS_URL for docs app + normalize trailing slashes

Use DOCS_URL (consistent with other docs routes) and strip trailing slashes to avoid double slashes when composing URLs elsewhere.

-    const BASE_URL = process.env.APP_URL ?? 'https://docs.onlook.com';
+    const BASE_URL = (process.env.DOCS_URL ?? 'https://docs.onlook.com').replace(/\/+$/, '');

4-4: Confirm whether lastModified should be build-time or request-time

Using new Date() makes lastModified change per request. If you want stable values between builds, consider computing once at module load or using ISR semantics.

Would you like me to prepare a variant that hoists timestamps to module scope or sets revalidate for predictable freshness?
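
A sketch of both options, assuming a docs sitemap shaped like the one in this PR: hoist the timestamp to module scope so it is fixed per build/server start, and optionally export revalidate so the generated sitemap is cached and refreshed on a schedule. Exact values are placeholders.

import type { MetadataRoute } from 'next';

export const revalidate = 3600; // assumption: ISR-style caching, regenerate at most hourly

const BUILD_TIME = new Date(); // evaluated once at module load, not per request
const BASE_URL = (process.env.DOCS_URL ?? 'https://docs.onlook.com').replace(/\/+$/, '');

export default function sitemap(): MetadataRoute.Sitemap {
    return [
        {
            url: BASE_URL,
            lastModified: BUILD_TIME,
            changeFrequency: 'daily',
            priority: 1.0,
        },
    ];
}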

apps/web/client/src/app/robots.ts (1)

3-3: Normalize BASE_URL to avoid double slashes in sitemap/host

If APP_URL ends with '/', ${BASE_URL}/sitemap.xml will produce a double slash. Normalize once.

-const BASE_URL = process.env.APP_URL ?? 'https://onlook.com';
+const BASE_URL = (process.env.APP_URL ?? 'https://onlook.com').replace(/\/+$/, '');
docs/src/app/robots.ts (1)

3-3: Use DOCS_URL for docs app + normalize trailing slashes

Align with other docs routes and prevent accidental double slashes.

-const BASE_URL = process.env.APP_URL ?? 'https://docs.onlook.com';
+const BASE_URL = (process.env.DOCS_URL ?? 'https://docs.onlook.com').replace(/\/+$/, '');
docs/src/app/llms.txt/route.ts (3)

27-27: Normalize docsUrl to avoid double slashes in links

Prevent accidental // in constructed URLs.

-    const docsUrl = process.env.DOCS_URL ?? 'https://docs.onlook.com';
+    const docsUrl = (process.env.DOCS_URL ?? 'https://docs.onlook.com').replace(/\/+$/, '');

82-87: Use a standard X-Robots-Tag value or remove it

'X-Robots-Tag: llms-txt' isn’t a standard directive. If you intend to keep this page out of search results, use 'noindex'; if indexing is fine, drop the header.

-        headers: {
-            'Content-Type': 'text/plain; charset=utf-8',
-            'X-Robots-Tag': 'llms-txt',
-        },
+        headers: {
+            'Content-Type': 'text/plain; charset=utf-8',
+            // Use 'noindex' to prevent indexing, or remove this header entirely if indexing is desired.
+            'X-Robots-Tag': 'noindex',
+        },

If you need a custom marker for observability, prefer a custom header name (e.g., X-LLMS-Doc: true) instead of overloading X-Robots-Tag.


1-24: Deduplicate LLMS types/renderer across apps

The same LLMSSection/LLMSData and renderMarkdown exist in apps/web and docs. Consider a tiny shared module to avoid divergence.

Would you like me to extract a shared llms-utils.ts and update both routes?
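
If that is useful, the shared module could look roughly like the sketch below. The names (LLMSSection, LLMSData, renderMarkdown) mirror what this review describes, but the exact shapes are assumptions about the PR's code.

export interface LLMSSection {
    title: string;
    items: string[];
}

export interface LLMSData {
    title: string;
    description: string;
    sections: LLMSSection[];
}

// Renders the structured data as the plaintext/markdown body served by the llms.txt routes.
export function renderMarkdown(data: LLMSData): string {
    const lines: string[] = [`# ${data.title}`, '', data.description, ''];
    for (const section of data.sections) {
        lines.push(`## ${section.title}`, '');
        for (const item of section.items) {
            lines.push(`- ${item}`);
        }
        lines.push('');
    }
    return lines.join('\n');
}

Both llms.txt routes would then import these instead of declaring their own copies.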

apps/web/client/src/app/llms.txt/route.ts (3)

26-27: Normalize docsUrl to avoid double slashes in links

Small hygiene improvement matching the docs route.

-    const docsUrl = process.env.DOCS_URL ?? 'https://docs.onlook.com';
+    const docsUrl = (process.env.DOCS_URL ?? 'https://docs.onlook.com').replace(/\/+$/, '');

71-75: Use a standard X-Robots-Tag or remove

Same reasoning as the docs route; 'llms-txt' isn’t a recognized directive.

-        headers: {
-            'Content-Type': 'text/plain; charset=utf-8',
-            'X-Robots-Tag': 'llms-txt',
-        },
+        headers: {
+            'Content-Type': 'text/plain; charset=utf-8',
+            'X-Robots-Tag': 'noindex',
+        },

1-24: DRY up LLMS data model and renderer

Same structures and renderer exist in docs. Extracting a shared utility prevents drift.

I can propose a shared file (e.g., packages/shared/llms-utils.ts or apps/web/client/src/lib/llms-utils.ts consumed by both) if you’re open to it.

apps/web/client/src/app/llms-full.txt/route.ts (1)

1-162: Unnecessarily async function

The getFullDocumentation function is declared as async but doesn't perform any asynchronous operations, so the async keyword and Promise return type can be dropped. The docsUrl parameter is used only once, to append the docs link to the generated content.

-async function getFullDocumentation(docsUrl: string): Promise<string> {
+function getFullDocumentation(docsUrl: string): string {

Also update the call site:

-        const content = await getFullDocumentation(docsUrl);
+        const content = getFullDocumentation(docsUrl);
docs/src/app/llms-full.txt/route.ts (2)

48-59: Consider more robust title extraction

The regex pattern for extracting titles from frontmatter doesn't handle multi-line YAML values or trailing quotes robustly. Additionally, the filename fallback regex contains a redundant alternative (mdx? already matches both .md and .mdx).

 function extractTitle(content: string, filename: string): string {
     // Try to extract title from frontmatter or first heading
     const titleMatch =
-        content.match(/^title:\s*["']?([^"'\n]+)["']?/m) || content.match(/^#\s+(.+)$/m);
+        content.match(/^title:\s*["']?([^"'\n]+?)["']?\s*$/m) || content.match(/^#\s+(.+)$/m);
 
     if (titleMatch) {
         return titleMatch[1].trim();
     }
 
     // Fallback to filename without extension
-    return filename.replace(/\.(mdx?|md)$/, '').replace(/-/g, ' ');
+    return filename.replace(/\.(mdx?)$/, '').replace(/-/g, ' ');
 }

89-90: Potential anchor collision in table of contents

The anchor generation could produce duplicate IDs if multiple documents have the same title after normalization.

Consider adding the file path or index to ensure uniqueness:

-        const anchor = file.title.toLowerCase().replace(/[^a-z0-9]+/g, '-');
+        const anchor = `${file.title.toLowerCase().replace(/[^a-z0-9]+/g, '-')}-${docFiles.indexOf(file)}`;

Apply the same change at line 97.

apps/web/client/src/lib/sitemap-utils.ts (2)

17-61: Consider handling symbolic links and improving error messages

The directory scanning doesn't handle symbolic links, which could cause infinite loops. Also, the error message could be more specific about the failure type.

 async function scanAppDirectory(
     dir: string,
     basePath = '',
     excludedPatterns: string[],
 ): Promise<string[]> {
     const routes: string[] = [];
 
     try {
         const entries = await readdir(dir, { withFileTypes: true });
 
         for (const entry of entries) {
             const fullPath = join(dir, entry.name);
             const routePath = join(basePath, entry.name);
 
-            if (entry.isDirectory()) {
+            if (entry.isDirectory() && !entry.isSymbolicLink()) {
                 if (
                     entry.name.startsWith('_') ||
                     entry.name.startsWith('(') ||
                     entry.name.startsWith('[')
                 ) {
                     continue;
                 }
 
                 const subRoutes = await scanAppDirectory(fullPath, routePath, excludedPatterns);
                 routes.push(...subRoutes);
             } else if (entry.name === 'page.tsx' || entry.name === 'page.ts') {
                 let route = basePath === '' ? '/' : basePath.replace(/\\/g, '/');
 
                 if (!route.startsWith('/')) {
                     route = '/' + route;
                 }
 
                 const shouldExclude = excludedPatterns.some((pattern) => route.startsWith(pattern));
 
                 if (!shouldExclude) {
                     routes.push(route);
                 }
             }
         }
     } catch (error) {
-        console.warn(`Failed to scan directory ${dir}:`, error);
+        console.warn(`Failed to scan directory ${dir}:`, error instanceof Error ? error.message : String(error));
     }
 
     return routes;
 }

82-100: Consider caching the route discovery for production

The filesystem scanning operation could be expensive in production. Consider implementing a caching mechanism.

For production environments, you might want to cache the discovered routes to avoid filesystem operations on every sitemap request. You could either:

  1. Use Next.js ISR by exporting a revalidate constant (similar to the llms-full.txt route)
  2. Implement a simple in-memory cache with TTL (sketched after the ISR example below)
  3. Generate the sitemap at build time if routes don't change dynamically

Example with ISR:

// Add at the top of the file
export const revalidate = 3600; // Revalidate every hour
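
And a sketch of option 2: a module-level cache with a TTL wrapped around the existing getWebRoutes(). The wrapper name and the one-hour TTL are assumptions.

import type { MetadataRoute } from 'next';

const CACHE_TTL_MS = 60 * 60 * 1000; // one hour; arbitrary choice

let cached: { routes: MetadataRoute.Sitemap; expiresAt: number } | null = null;

// Returns cached routes when fresh; otherwise re-runs the (expensive) filesystem scan.
export async function getWebRoutesCached(
    loadRoutes: () => Promise<MetadataRoute.Sitemap>, // e.g. the existing getWebRoutes
): Promise<MetadataRoute.Sitemap> {
    const now = Date.now();
    if (cached && cached.expiresAt > now) {
        return cached.routes;
    }
    const routes = await loadRoutes();
    cached = { routes, expiresAt: now + CACHE_TTL_MS };
    return routes;
}
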
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 46ce8ce and caf080d.

📒 Files selected for processing (13)
  • apps/web/client/src/app/llms-full.txt/route.ts (1 hunks)
  • apps/web/client/src/app/llms.txt/route.ts (1 hunks)
  • apps/web/client/src/app/robots.ts (1 hunks)
  • apps/web/client/src/app/sitemap.ts (1 hunks)
  • apps/web/client/src/lib/sitemap-utils.ts (1 hunks)
  • apps/web/client/src/utils/constants/index.ts (1 hunks)
  • docs/next-sitemap.config.js (0 hunks)
  • docs/package.json (0 hunks)
  • docs/src/app/llms-full.txt/route.ts (1 hunks)
  • docs/src/app/llms.txt/route.ts (1 hunks)
  • docs/src/app/robots.ts (1 hunks)
  • docs/src/app/robots.txt/route.ts (0 hunks)
  • docs/src/app/sitemap.ts (1 hunks)
💤 Files with no reviewable changes (3)
  • docs/package.json
  • docs/src/app/robots.txt/route.ts
  • docs/next-sitemap.config.js
🧰 Additional context used
🧬 Code Graph Analysis (9)
docs/src/app/robots.ts (1)
apps/web/client/src/app/robots.ts (1)
  • robots (5-27)
apps/web/client/src/lib/sitemap-utils.ts (1)
apps/web/client/src/components/store/editor/pages/helper.ts (1)
  • scanAppDirectory (198-307)
apps/web/client/src/app/robots.ts (1)
docs/src/app/robots.ts (1)
  • robots (5-16)
docs/src/app/llms.txt/route.ts (3)
apps/web/client/src/app/llms-full.txt/route.ts (1)
  • GET (164-179)
apps/web/client/src/app/llms.txt/route.ts (1)
  • GET (26-77)
docs/src/app/llms-full.txt/route.ts (1)
  • GET (152-201)
apps/web/client/src/app/sitemap.ts (2)
docs/src/app/sitemap.ts (1)
  • sitemap (3-15)
apps/web/client/src/lib/sitemap-utils.ts (1)
  • getWebRoutes (82-100)
apps/web/client/src/app/llms.txt/route.ts (3)
apps/web/client/src/app/llms-full.txt/route.ts (1)
  • GET (164-179)
docs/src/app/llms.txt/route.ts (1)
  • GET (26-88)
docs/src/app/llms-full.txt/route.ts (1)
  • GET (152-201)
apps/web/client/src/app/llms-full.txt/route.ts (1)
docs/src/app/llms-full.txt/route.ts (1)
  • GET (152-201)
docs/src/app/sitemap.ts (1)
apps/web/client/src/app/sitemap.ts (1)
  • sitemap (4-7)
docs/src/app/llms-full.txt/route.ts (3)
apps/web/client/src/app/llms-full.txt/route.ts (1)
  • GET (164-179)
apps/web/client/src/app/llms.txt/route.ts (1)
  • GET (26-77)
docs/src/app/llms.txt/route.ts (1)
  • GET (26-88)
🔇 Additional comments (10)
apps/web/client/src/utils/constants/index.ts (1)

37-37: LGTM — formatting-only change adds trailing newline

No functional impact; improves POSIX compliance and avoids diff churn.

docs/src/app/sitemap.ts (1)

7-14: Homepage-only sitemap entry looks good

Returns a valid MetadataRoute.Sitemap entry with priority and changeFrequency aligned with PR goals.

apps/web/client/src/app/robots.ts (1)

24-26: Sitemap/host composition looks correct

Once BASE_URL is normalized, these fields render clean URLs.

docs/src/app/robots.ts (2)

13-15: LGTM

Sitemap and host fields are correct and consistent with BASE_URL.


7-12: CrawlDelay is supported in Next.js ≥ 13.3.0
The crawlDelay?: number property on rules was introduced in Next.js v13.3.0. No changes are required here—your use of crawlDelay: 1 will be honored.

Official docs: https://nextjs.org/docs/app/api-reference/file-conventions/metadata/robots

docs/src/app/llms.txt/route.ts (1)

12-24: Renderer is straightforward and safe

Simple markdown generation with no dynamic input; low risk of injection or formatting issues.

apps/web/client/src/app/llms-full.txt/route.ts (1)

164-179: Good error handling and header configuration

The implementation correctly handles errors with appropriate logging and status codes, and sets the proper headers for LLM consumption.

docs/src/app/llms-full.txt/route.ts (1)

152-201: Well-structured error recovery with fallback content

The error handling with fallback content ensures the endpoint always returns valid documentation, even when file reading fails. Good defensive programming.

apps/web/client/src/lib/sitemap-utils.ts (1)

63-80: Consistent metadata configuration

The route metadata configuration is well-structured with appropriate priorities and change frequencies for different page types.

apps/web/client/src/app/sitemap.ts (1)

1-7: Clean and minimal sitemap implementation

The implementation correctly delegates to the utility function and maintains a clean separation of concerns.

Comment on lines +7 to +23
rules: {
userAgent: '*',
allow: '/',
disallow: [
'/api/',
'/auth/',
'/callback/',
'/webhook/',
'/projects/',
'/project/',
'/invitation/',
'/_next/',
'/_vercel/',
'/private/',
],
crawlDelay: 1,
},

💡 Verification agent

🧩 Analysis chain

Validate crawlDelay support and sync disallow list with sitemap exclusions

  • Confirm Next.js MetadataRoute.Robots supports crawlDelay; some versions don’t emit it. If unsupported, it will be ignored.
  • Ensure this disallow list exactly matches paths excluded from sitemap generation (apps/web/client/src/lib/sitemap-utils.ts) to keep robots and sitemap consistent per PR objectives.

[suggest_optional_refactor]

Run this to inspect sitemap-utils for exclusion patterns and compare:

If desired, extract disallow into a constant to make intent explicit:

+const DISALLOW = [
+  '/api/',
+  '/auth/',
+  '/callback/',
+  '/webhook/',
+  '/projects/',
+  '/project/',
+  '/invitation/',
+  '/_next/',
+  '/_vercel/',
+  '/private/',
+];
 ...
-            disallow: [
-                '/api/',
-                '/auth/',
-                '/callback/',
-                '/webhook/',
-                '/projects/',
-                '/project/',
-                '/invitation/',
-                '/_next/',
-                '/_vercel/',
-                '/private/',
-            ],
+            disallow: DISALLOW,

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- Inspecting sitemap exclusions ----"
fd -a "sitemap-utils.ts" | while read -r f; do
  echo "File: $f"
  rg -n -C2 -e 'exclude|excluded|EXCLUDE|DISALLOW|private|auth|api|webhook|project' "$f" || true
done

echo
echo "---- Current robots disallow list ----"
rg -n -A20 -e 'disallow:\s*\[' apps/web/client/src/app/robots.ts

Length of output: 2137


Validate crawlDelay support and align robots/disallow with sitemap exclusions

  • Confirm your Next.js version’s MetadataRoute.Robots supports crawlDelay (older versions will ignore it). If unsupported, consider upgrading or removing the field.
  • The disallow list in apps/web/client/src/app/robots.ts and the EXCLUDED_PATTERNS in sitemap-utils.ts must match exactly. Right now /private/ is present in robots but missing from sitemap exclusions. Add it:
--- a/apps/web/client/src/lib/sitemap-utils.ts
+++ b/apps/web/client/src/lib/sitemap-utils.ts
@@ -6,7 +6,8 @@ const BASE_URL = process.env.APP_URL ?? 'https://onlook.com';
 const EXCLUDED_PATTERNS = [
   '/api/',
   '/auth/',
  '/callback/',
   '/webhook/',
   '/projects/',
   '/project/',
   '/invitation/',
+  '/private/',       // ← add this to mirror robots.ts
   '/_',              // covers both /_next/ and /_vercel/
 ];
  • Optional refactor: extract the shared exclusion array into a single constant (e.g. EXCLUDED_PATHS) and import it in both robots.ts and sitemap-utils.ts to avoid future drift (sketched below).
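
A sketch of that shared constant; the module path is an assumption:

// apps/web/client/src/lib/excluded-paths.ts (hypothetical location)
export const EXCLUDED_PATHS = [
    '/api/',
    '/auth/',
    '/callback/',
    '/webhook/',
    '/projects/',
    '/project/',
    '/invitation/',
    '/private/',
    '/_next/',
    '/_vercel/',
] as const;

// robots.ts could then use `disallow: [...EXCLUDED_PATHS]`, and sitemap-utils.ts
// could filter discovered routes against the same array.
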
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rules: {
userAgent: '*',
allow: '/',
disallow: [
'/api/',
'/auth/',
'/callback/',
'/webhook/',
'/projects/',
'/project/',
'/invitation/',
'/_next/',
'/_vercel/',
'/private/',
],
crawlDelay: 1,
},
// File: apps/web/client/src/lib/sitemap-utils.ts
const BASE_URL = process.env.APP_URL ?? 'https://onlook.com';
const EXCLUDED_PATTERNS = [
'/api/',
'/auth/',
'/callback/',
'/webhook/',
'/projects/',
'/project/',
'/invitation/',
'/private/', // ← add this to mirror robots.ts
'/_', // covers both /_next/ and /_vercel/
];
🤖 Prompt for AI Agents
In apps/web/client/src/app/robots.ts around lines 7 to 23, validate whether your
Next.js MetadataRoute.Robots implementation supports the crawlDelay field and if
not either remove the crawlDelay entry or upgrade Next.js to a version that
supports it; also ensure the robots disallow list exactly matches the
EXCLUDED_PATTERNS in sitemap-utils.ts by adding '/private/' to the sitemap
exclusions (or better, extract a shared EXCLUDED_PATHS constant and import it
into both robots.ts and sitemap-utils.ts so both lists remain identical going
forward).

Comment on lines +61 to +73
function cleanMarkdownContent(content: string): string {
// Remove frontmatter
content = content.replace(/^---[\s\S]*?---\n/, '');

// Remove JSX components and imports
content = content.replace(/^import\s+.*$/gm, '');
content = content.replace(/<[^>]+>/g, '');

// Clean up extra whitespace
content = content.replace(/\n{3,}/g, '\n\n');

return content.trim();
}

🛠️ Refactor suggestion

Consider preserving code blocks in markdown content

The current implementation removes all JSX/HTML tags indiscriminately, which would also remove legitimate code blocks containing JSX examples from the documentation.

 function cleanMarkdownContent(content: string): string {
     // Remove frontmatter
     content = content.replace(/^---[\s\S]*?---\n/, '');
 
     // Remove JSX components and imports
     content = content.replace(/^import\s+.*$/gm, '');
-    content = content.replace(/<[^>]+>/g, '');
+    // Only remove JSX components that are not within code blocks
+    content = content.replace(/^(?!```).*<[A-Z][^>]*>.*$/gm, '');
 
     // Clean up extra whitespace
     content = content.replace(/\n{3,}/g, '\n\n');
 
     return content.trim();
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In docs/src/app/llms-full.txt/route.ts around lines 61 to 73, the regex that
removes all angle-bracket tags (content.replace(/<[^>]+>/g, '')) strips JSX/HTML
inside fenced code blocks too; update the function to preserve fenced code
blocks by extracting or tokenizing triple-backtick sections (or using a simple
stateful parser), perform import and tag removals only on the non-code segments,
then restore the fenced code blocks intact; ensure the import-stripping regex
remains line-anchored and apply the whitespace collapse after restoring code
blocks.
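
A sketch of the tokenize-then-clean approach described above: pull fenced code blocks out, strip imports and tags only from the prose segments, then splice the blocks back in verbatim. The placeholder scheme is an implementation choice, not the PR's code.

function cleanMarkdownContentPreservingCode(content: string): string {
    // Remove frontmatter first.
    content = content.replace(/^---[\s\S]*?---\n/, '');

    // Swap each fenced code block for a placeholder and remember it.
    const codeBlocks: string[] = [];
    content = content.replace(/```[\s\S]*?```/g, (block) => {
        codeBlocks.push(block);
        return `@@CODE_BLOCK_${codeBlocks.length - 1}@@`;
    });

    // Clean only the non-code text.
    content = content.replace(/^import\s+.*$/gm, '');
    content = content.replace(/<[^>]+>/g, '');
    content = content.replace(/\n{3,}/g, '\n\n');

    // Restore the fenced blocks verbatim.
    content = content.replace(/@@CODE_BLOCK_(\d+)@@/g, (_match, index) => codeBlocks[Number(index)] ?? '');

    return content.trim();
}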
