
Conversation

**nao1215** (Owner) commented Aug 27, 2025

Summary by CodeRabbit

  • New Features
    • Streaming processing for large files with configurable chunk sizes.
    • Flexible input sources: file paths, directories, io.Reader, and embed.FS.
  • Refactor
    • Removed the legacy driver and auto-registration; unified data export through a generic path.
  • Documentation
    • Updated READMEs (multi-language) to include streaming and input sources.
    • Documented auto-save restrictions by input type and added a manual export workflow.
    • Removed driver directory references from contribution guides.
  • Tests
    • Stabilized example timeout output; ensured directory-build tests include a valid CSV.
  • Chores
    • Lint configuration update.

**coderabbitai** (bot) commented Aug 27, 2025

Walkthrough

Removes the custom filesql driver package and its tests, simplifies the filesql API by deleting driver registration and specializing the dump logic, and refactors the builder into a direct in-memory SQLite streaming workflow with auto-save validation. Updates the README and CONTRIBUTING docs (multi-language) to reflect streaming, flexible inputs, and removal of the driver. Adjusts the linter config.

Changes

| Cohort / File(s) | Summary of Changes |
|---|---|
| **Builder streaming + auto-save refactor**<br>`builder.go`, `builder_test.go` | Reworked to stream directly into in-memory SQLite; added input dedup and compressed-variant handling, header-only table creation, and explicit auto-save validation/flow. Tests create a valid CSV in temp dirs for Build. |
| **Driver package removal**<br>`driver/doc.go`, `driver/driver.go`, `driver/driver_test.go`, `driver/errors.go` | Deleted the filesql driver implementation, its package docs, exported errors, and its entire test suite. |
| **Filesql API simplification**<br>`filesql.go`, `filesql_test.go` | Removed DriverName and Register(); unified DumpDatabase to generic SQLite dumping; deleted TestRegisterDriver. |
| **README feature updates**<br>`README.md`, `doc/es/README.md`, `doc/fr/README.md`, `doc/ja/README.md`, `doc/ko/README.md`, `doc/ru/README.md`, `doc/zh-cn/README.md` | Added streaming and flexible input-source features; the main README documents auto-save restrictions and manual export; the ko README replaces a compression bullet. |
| **CONTRIBUTING: remove driver dir**<br>`CONTRIBUTING.md`, `doc/es/CONTRIBUTING.md`, `doc/fr/CONTRIBUTING.md`, `doc/ja/CONTRIBUTING.md`, `doc/ko/CONTRIBUTING.md`, `doc/ru/CONTRIBUTING.md`, `doc/zh-cn/CONTRIBUTING.md` | Removed references to the driver/ directory from project structure and directory roles across languages. |
| **Linter config**<br>`.golangci.yml` | Dropped the SA1019 exclusion for `driver/.*\.go`; SA1019 is now reported for that path (though those files are removed). |
| **Examples**<br>`example_test.go` | Normalized timeout error output by matching on "context deadline exceeded". |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant U as User
  participant B as Builder
  participant FS as FileSystem/FS/Readers
  participant D as Deduper
  participant DB as SQLite (in-memory)
  participant AS as Auto-save

  U->>B: Build(ctx)
  B->>B: validateAutoSaveConfig()
  B->>FS: collect inputs (paths, dirs, readers, FS)
  B->>D: deduplicate (abs paths, prefer uncompressed)
  B-->>U: build result (collected paths/readers)

  U->>B: Open(ctx)
  B->>DB: open in-memory connection
  loop for each input
    B->>DB: streamReaderToSQLite (create tables, insert rows)
    alt header-only/empty
      B->>DB: createEmptyTable/createTableFromHeaders
    end
  end
  alt auto-save enabled
    B->>AS: re-open with auto-save wrapper
    B->>DB: re-stream into new DB
  end
  B-->>U: *sql.DB
```
```mermaid
sequenceDiagram
  autonumber
  participant U as User
  participant DB as *sql.DB (auto-save wrapped)
  participant TX as Transaction (wrapped)
  participant DMP as DumpDatabase
  participant OUT as Output (dir or overwrite originals)

  U->>DB: Close() / TX Commit()
  alt auto-save on Close/Commit
    DB->>DMP: performAutoSave()
    DMP->>OUT: DumpDatabase(export)
    DMP-->>DB: result
  else no auto-save
    DB-->>U: close/commit only
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

A bunny streams with tidy grace,
From files to tables, row by row we trace.
The driver hopped away—goodbye, old lane—
Now auto-save burrows paths again.
With readers, dirs, and headers neat,
The carrots of data are crisp to eat. 🥕



Code Metrics Report

|                     | main (bf81dc8) | #16 (edecb13) |  +/-  |
|---------------------|----------------|---------------|-------|
| Coverage            |          80.8% |         81.1% | +0.3% |
| Files               |              9 |             8 |    -1 |
| Lines               |           1474 |          1130 |  -344 |
| Covered             |           1191 |           917 |  -274 |
| Test Execution Time |             3s |            3s |    0s |

Code coverage of files in pull request scope (77.4% → 78.3%)

| File | Coverage | +/- | Status |
|------|----------|-----|--------|
| builder.go | 76.6% | -4.0% | modified |
| driver/driver.go | 0.0% | -82.6% | removed |
| filesql.go | 82.3% | +24.7% | modified |

Reported by octocov

**nao1215** merged commit 756ba3e into main Aug 27, 2025 (7 of 8 checks passed)

**coderabbitai** (bot) left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
filesql.go (1)

308-310: Harden identifier quoting to prevent SQL injection via table names.
Table names are formatted into SQL with backticks; if a file name contains a backtick, injection is possible. Prefer double-quoted identifiers with escaping.

Apply this diff:

```diff
-	query := fmt.Sprintf("SELECT * FROM `%s`", tableName) //nolint:gosec // Table name is validated and comes from database metadata
+	query := fmt.Sprintf(`SELECT * FROM "%s"`, strings.ReplaceAll(tableName, `"`, `""`)) // escape " per SQLite
```

And in getSQLiteTableColumns:

```diff
-	query := fmt.Sprintf("PRAGMA table_info(`%s`)", tableName)
+	query := fmt.Sprintf(`PRAGMA table_info("%s")`, strings.ReplaceAll(tableName, `"`, `""`))
```

Also applies to: 323-327
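For reference, the escaping rule proposed above can be exercised in isolation. This is a minimal sketch; `quoteIdent` is a hypothetical helper name, not a function in the library:

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps an identifier in double quotes and doubles any embedded
// double quote, which is how SQLite expects quoted identifiers to be escaped.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

func main() {
	fmt.Println(quoteIdent("users"))    // "users"
	fmt.Println(quoteIdent(`odd"name`)) // "odd""name"
}
```

A file named `` users`; DROP TABLE x; -- `` would previously have escaped the backtick quoting; with double-quote escaping it is treated as a single opaque identifier.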

builder.go (2)

775-808: Close compression readers (gzip/zstd) to release resources.
gzip.Reader and zstd.Decoder need Close(); currently they’re leaked.

Change createDecompressedReader to return a cleanup function and use it in streamFileToSQLite:

```diff
-func (b *DBBuilder) createDecompressedReader(file *os.File, filePath string) (io.Reader, error) {
+func (b *DBBuilder) createDecompressedReader(file *os.File, filePath string) (io.Reader, func() error, error) {
 	// gzip
 	gzReader, err := gzip.NewReader(file)
 	if err != nil {
-		return nil, fmt.Errorf("failed to create gzip reader: %w", err)
+		return nil, nil, fmt.Errorf("failed to create gzip reader: %w", err)
 	}
-	return gzReader, nil
+	return gzReader, gzReader.Close, nil
 	// bz2
-	return bzip2.NewReader(file), nil
+	return bzip2.NewReader(file), func() error { return nil }, nil
 	// xz
 	xzReader, err := xz.NewReader(file)
 	if err != nil {
-		return nil, fmt.Errorf("failed to create xz reader: %w", err)
+		return nil, nil, fmt.Errorf("failed to create xz reader: %w", err)
 	}
-	return xzReader, nil
+	return xzReader, func() error { return nil }, nil
 	// zstd
 	zstdReader, err := zstd.NewReader(file)
 	if err != nil {
-		return nil, fmt.Errorf("failed to create zstd reader: %w", err)
+		return nil, nil, fmt.Errorf("failed to create zstd reader: %w", err)
 	}
-	return zstdReader.IOReadCloser(), nil
+	// zstd.Decoder.Close has no error return, so wrap it
+	return zstdReader.IOReadCloser(), func() error { zstdReader.Close(); return nil }, nil
 	// none
-	return file, nil
+	return file, func() error { return nil }, nil
 }
```

And in streamFileToSQLite:

```diff
-	reader, err := b.createDecompressedReader(file, filePath)
+	reader, cleanup, err := b.createDecompressedReader(file, filePath)
 	if err != nil {
 		return fmt.Errorf("failed to create decompressed reader for %s: %w", filePath, err)
 	}
+	defer func() { _ = cleanup() }()
```

Also applies to: 618-637


724-757: Quote and escape identifiers to avoid injection when creating tables/inserts.
If headers or table names contain quotes, current formatting allows injection.

Apply this diff:

```diff
+	quote := func(s string) string { return `"` + strings.ReplaceAll(s, `"`, `""`) + `"` }
 	columns := make([]string, 0, len(columnInfo))
 	for _, col := range columnInfo {
-		columns = append(columns, fmt.Sprintf(`"%s" %s`, col.Name, col.Type.String()))
+		columns = append(columns, fmt.Sprintf(`%s %s`, quote(col.Name), col.Type.String()))
 	}
-	query := fmt.Sprintf(
-		`CREATE TABLE IF NOT EXISTS "%s" (%s)`,
-		chunk.TableName(),
-		strings.Join(columns, ", "),
-	)
+	query := fmt.Sprintf(`CREATE TABLE IF NOT EXISTS %s (%s)`,
+		quote(chunk.TableName()),
+		strings.Join(columns, ", "),
+	)
```

And in prepareInsertStatement:

```diff
-	query := fmt.Sprintf(
-		`INSERT INTO "%s" VALUES (%s)`,
-		chunk.TableName(),
-		strings.Join(placeholders, ", "),
-	)
+	quote := func(s string) string { return `"` + strings.ReplaceAll(s, `"`, `""`) + `"` }
+	query := fmt.Sprintf(`INSERT INTO %s VALUES (%s)`,
+		quote(chunk.TableName()),
+		strings.Join(placeholders, ", "),
+	)
```
🧹 Nitpick comments (10)
doc/es/README.md (1)

21-22: Use "bloques" and add a reference to SetDefaultChunkSize

A small terminology adjustment for native readability, plus a mention of the public API that configures the chunk size.

Points of note:

  • builder.go (lines 172–177) has a public method SetDefaultChunkSize(size int) *DBBuilder that configures the default chunk size.

```diff
- 🌊 **Procesamiento de flujos** - Maneja eficientemente archivos grandes mediante streaming con tamaños de bloques configurables
+ 🌊 **Procesamiento de flujos** - Maneja eficientemente archivos grandes mediante streaming con tamaños de bloques configurables (use DBBuilder.SetDefaultChunkSize para ajustar)
```
doc/zh-cn/README.md (1)

21-22: Minor Chinese grammar/readability fixes

Add “进行” for smoother flow and insert spaces around English tokens.

```diff
-- 🌊 **流式处理** - 通过可配置的块大小流式处理高效处理大文件
-- 📖 **灵活的输入源** - 支持文件路径、目录、io.Reader和embed.FS
+- 🌊 **流式处理** - 通过可配置的块大小进行流式处理,高效处理大文件
+- 📖 **灵活的输入源** - 支持文件路径、目录、io.Reader 和 embed.FS
```

If the project standardizes spacing around Latin identifiers in Chinese docs, apply consistently across files. Would you like me to submit a follow-up patch to unify this?

doc/fr/README.md (1)

21-22: French style/terminology: replace “chunk” and prefer “Prise en charge de(s)”

Use “blocs” for “chunks” and a more idiomatic construction for support.

```diff
-- 🌊 **Traitement en flux** - Gère efficacement les gros fichiers grâce au streaming avec des tailles de chunk configurables
-- 📖 **Sources d'entrée flexibles** - Support pour les chemins de fichiers, répertoires, io.Reader et embed.FS
+- 🌊 **Traitement en flux** - Traite efficacement les gros fichiers grâce au streaming avec des tailles de blocs configurables
+- 📖 **Sources d'entrée flexibles** - Prise en charge des chemins de fichiers, des répertoires, io.Reader et embed.FS
```

If you prefer “Gère” over “Traite” for tone consistency, keep it—both are fine. Also confirm we have an API knob for streaming chunk size to reference later in this section.

filesql.go (1)

241-243: Avoid opening an extra sql.Conn you don’t use.
DumpDatabase no longer branches on driver type; acquiring a dedicated connection is unnecessary overhead.

Apply this diff:

```diff
-	// Get the underlying connection
-	conn, err := db.Conn(context.Background())
-	if err != nil {
-		return fmt.Errorf("failed to get connection: %w", err)
-	}
-	defer conn.Close()
-
-	// Use generic dump functionality for all connections
-	return dumpSQLiteDatabase(db, outputDir, options)
+	// Use generic dump functionality for all connections
+	return dumpSQLiteDatabase(db, outputDir, options)
```
builder.go (3)

596-617: Empty-file handling: prefer consistent behavior with header-only flow.
Returning “file is empty” here differs from header-only handling below. Consider skipping empty files or creating an empty table to keep behavior predictable across inputs.

If you choose to skip, return nil with a debug log; if you choose to create an empty table, reuse the same path as header-only (table with headers if any, else skip).


823-892: Fallback “column1 TEXT” table is lossy for header-only sources.
Creates a misleading schema. Consider parsing just the header line (peek/tee) to preserve real column names, or require caller-provided headers for Reader inputs.

If deferring a bigger change, fail fast with a clear error for header-less files instead of creating a fake schema.


1065-1096: Case-insensitive extension checks.
Handle .GZ/.BZ2/.XZ/.ZST variants.

Implement isCompressedFile with lower-casing:

```diff
-func (b *DBBuilder) isCompressedFile(filePath string) bool {
-	return strings.HasSuffix(filePath, ".gz") ||
-		strings.HasSuffix(filePath, ".bz2") ||
-		strings.HasSuffix(filePath, ".xz") ||
-		strings.HasSuffix(filePath, ".zst")
+func (b *DBBuilder) isCompressedFile(filePath string) bool {
+	lp := strings.ToLower(filePath)
+	return strings.HasSuffix(lp, ".gz") ||
+		strings.HasSuffix(lp, ".bz2") ||
+		strings.HasSuffix(lp, ".xz") ||
+		strings.HasSuffix(lp, ".zst")
 }
```

Also applies to: 1098-1105

README.md (3)

23-24: Tighten wording; prefer fs.FS over embed.FS in feature list.

Minor wording polish and use the general interface fs.FS (which also covers embed.FS).

```diff
- - 🌊 **Stream Processing** - Efficiently handles large files through streaming with configurable chunk sizes
- - 📖 **Flexible Input Sources** - Support for file paths, directories, io.Reader, and embed.FS
+ - 🌊 **Stream Processing** - Efficiently handles large files via streaming with configurable chunk sizes
+ - 📖 **Flexible Input Sources** - Supports file paths, directories, io.Reader, and fs.FS
```

401-409: Be precise about the type and variable names for filesystem inputs.

Use fs.FS terminology and a neutral variable (e.g., fsys) to avoid implying a concrete type.

```diff
-- **Filesystems** (`AddFS`): **Only supports output directory mode**
+- **Filesystems (`fs.FS`)** (`AddFS`): **Only supports output directory mode**
@@
-  builder.AddFS(filesystem).EnableAutoSave("")
+  builder.AddFS(fsys).EnableAutoSave("")
@@
-  builder.AddFS(filesystem).EnableAutoSave("./output")
+  builder.AddFS(fsys).EnableAutoSave("./output")
```

410-411: Clarify restriction rationale and explicitly scope overwrite to path-based inputs; add caution.

Small wording tweak to reduce ambiguity and highlight risk.

```diff
-This restriction exists because io.Reader and filesystem inputs don't have original file paths that can be overwritten. The builder will return an error at build time if you try to use overwrite mode with these input types.
+This restriction exists because `io.Reader` and `fs.FS` inputs do not have stable original file paths to overwrite. Overwrite mode (`""`) is only valid with path-based inputs added via `AddPath`/`AddPaths`. The builder returns an error during `Build` if overwrite mode is used with `AddReader` or `AddFS`. Use an output directory for these inputs to avoid data loss.
```
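The validation rule described in this comment can be sketched as a standalone check. This is a hypothetical illustration — `buildConfig` and `validateAutoSave` are invented names, not the library's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// buildConfig mirrors the inputs relevant to auto-save validation (hypothetical).
type buildConfig struct {
	hasReaderOrFSInputs bool   // AddReader / AddFS were used
	autoSaveEnabled     bool   // EnableAutoSave was called
	outputDir           string // "" means overwrite-originals mode
}

// validateAutoSave rejects overwrite mode when any input lacks an original
// file path that could be overwritten.
func validateAutoSave(c buildConfig) error {
	if c.autoSaveEnabled && c.outputDir == "" && c.hasReaderOrFSInputs {
		return errors.New("overwrite auto-save requires path-based inputs; specify an output directory")
	}
	return nil
}

func main() {
	// Reader/FS input with overwrite mode: rejected at build time.
	fmt.Println(validateAutoSave(buildConfig{hasReaderOrFSInputs: true, autoSaveEnabled: true, outputDir: ""}))
	// Same input with an explicit output directory: allowed.
	fmt.Println(validateAutoSave(buildConfig{hasReaderOrFSInputs: true, autoSaveEnabled: true, outputDir: "./out"}))
}
```

Failing at `Build` rather than at `Close` surfaces the misconfiguration before any data is loaded, which matches the behavior the README describes.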
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bf81dc8 and 51771de.

📒 Files selected for processing (24)
  • .golangci.yml (0 hunks)
  • CONTRIBUTING.md (0 hunks)
  • README.md (2 hunks)
  • builder.go (14 hunks)
  • builder_test.go (1 hunks)
  • doc/es/CONTRIBUTING.md (0 hunks)
  • doc/es/README.md (1 hunks)
  • doc/fr/CONTRIBUTING.md (0 hunks)
  • doc/fr/README.md (1 hunks)
  • doc/ja/CONTRIBUTING.md (0 hunks)
  • doc/ja/README.md (1 hunks)
  • doc/ko/CONTRIBUTING.md (0 hunks)
  • doc/ko/README.md (1 hunks)
  • doc/ru/CONTRIBUTING.md (0 hunks)
  • doc/ru/README.md (1 hunks)
  • doc/zh-cn/CONTRIBUTING.md (0 hunks)
  • doc/zh-cn/README.md (1 hunks)
  • driver/doc.go (0 hunks)
  • driver/driver.go (0 hunks)
  • driver/driver_test.go (0 hunks)
  • driver/errors.go (0 hunks)
  • example_test.go (1 hunks)
  • filesql.go (1 hunks)
  • filesql_test.go (0 hunks)
💤 Files with no reviewable changes (13)
  • .golangci.yml
  • filesql_test.go
  • doc/zh-cn/CONTRIBUTING.md
  • doc/ru/CONTRIBUTING.md
  • CONTRIBUTING.md
  • doc/es/CONTRIBUTING.md
  • doc/ja/CONTRIBUTING.md
  • doc/ko/CONTRIBUTING.md
  • driver/driver.go
  • doc/fr/CONTRIBUTING.md
  • driver/doc.go
  • driver/driver_test.go
  • driver/errors.go
🧰 Additional context used
🧬 Code graph analysis (1)
builder.go (5)
domain/model/file.go (2)
  • IsSupportedFile (141-156)
  • FileType (18-18)
filesql.go (1)
  • DumpDatabase (227-243)
domain/model/stream.go (1)
  • NewStreamingParser (17-23)
domain/model/types.go (3)
  • Header (5-5)
  • ColumnInfo (86-89)
  • ColumnTypeText (51-51)
domain/model/table.go (1)
  • TableFromFilePath (77-88)
🪛 LanguageTool
doc/fr/README.md

[style] ~21-~21: Ce verbe peut être considéré comme familier dans un contexte formel.
Context: ...z et .zst - 🌊 Traitement en flux - Gère efficacement les gros fichiers grâce au...

(VERBES_FAMILIERS_PREMIUM)


[typographical] ~22-~22: Il manque une espace après le point.
Context: ...ichiers, répertoires, io.Reader et embed.FS - 🚀 Configuration zéro - Aucun ser...

(ESPACE_APRES_POINT)

doc/zh-cn/README.md

[uncategorized] ~22-~22: 动词的修饰一般为‘形容词(副词)+地+动词’。您的意思是否是:灵活"地"输入
Context: ...流式处理* - 通过可配置的块大小流式处理高效处理大文件 - 📖 灵活的输入源 - 支持文件路径、目录、io.Reader和embed.FS - ?...

(wb4)

doc/ko/README.md

[grammar] ~21-~21: There might be a mistake here.
Context: ... 가능한 청크 크기로 스트리밍을 통해 대용량 파일을 효율적으로 처리합니다 - 📖 유연한 입력 소스 - 파일 경로, 디렉터리, io.Reade...

(QB_NEW_EN)

README.md

[grammar] ~23-~23: There might be a mistake here.
Context: ... streaming with configurable chunk sizes - 📖 Flexible Input Sources - Support ...

(QB_NEW_EN)


[grammar] ~24-~24: There might be a mistake here.
Context: ...** - Support for file paths, directories, io.Reader, and embed.FS - 🚀 **Zero Set...

(QB_NEW_EN)


[grammar] ~24-~24: There might be a mistake here.
Context: ...hs, directories, io.Reader, and embed.FS - 🚀 Zero Setup - No database server r...

(QB_NEW_EN)

🔇 Additional comments (8)
doc/ja/README.md (1)

21-22: Wording improvements for the Japanese README
Revising lines 21–22 of doc/ja/README.md as follows gives more natural, clearer phrasing. The API side provides SetDefaultChunkSize and AddFS, so the documentation matches the code.

```diff
- 🌊 **ストリーム処理** - 設定可能なチャンクサイズでストリーミングにより大きなファイルを効率的に処理します
- 📖 **柔軟な入力ソース** - ファイルパス、ディレクトリ、io.Reader、embed.FSをサポートします
+ 🌊 **ストリーム処理** - 設定可能なチャンクサイズのストリーミングで大容量ファイルを効率的に処理します
+ 📖 **柔軟な入力ソース** - ファイルパス、ディレクトリ、io.Reader、および embed.FS をサポートします
```

  • The "configurable chunk size" mentioned in the docs is provided by DBBuilder.SetDefaultChunkSize.
  • The "flexible input sources" are realized via AddPath/AddPaths (files and directories), AddReader (io.Reader), and AddFS (embed.FS).
    These APIs were confirmed to exist, so please apply the wording fix.
doc/ko/README.md (1)

21-22: Request to apply terminology and list-connector fixes

Please change the following two lines in doc/ko/README.md as shown below.

  • Before (lines 21–22)

    🌊 **스트림 처리** - 구성 가능한 청크 크기로 스트리밍을 통해 대용량 파일을 효율적으로 처리합니다
    📖 **유연한 입력 소스** - 파일 경로, 디렉터리, io.Reader, embed.FS를 지원합니다

  • Proposed after

    🌊 **스트림 처리** - 구성 가능한 블록 크기의 스트리밍으로 대용량 파일을 효율적으로 처리합니다
    📖 **유연한 입력 소스** - 파일 경로, 디렉터리, io.Reader 및 embed.FS를 지원합니다

Rationale:
• "블록" is more familiar to Korean readers than "청크", which improves readability.
• Adding "및" before the last list item completes a natural Korean list expression.

doc/ru/README.md (1)

21-22: Docs: new capabilities bullets look good and consistent with repo-wide changes.
Accurately reflects streaming and flexible inputs; no action needed.

example_test.go (1)

485-491: Stabilize timeout assertion — good change.
String-contains check avoids driver-specific wrapping noise and keeps the example deterministic.

builder_test.go (1)

385-391: Good: ensure directory-based Build has a valid CSV.
Precreating test.csv removes flakiness and aligns with new Build semantics.

builder.go (2)

1042-1056: Overwrite mode writes all tables to the first path’s directory.
This violates “overwrite original files” when inputs originate from multiple directories.

Would you like a targeted refactor to track table→source path and write each table back to its original location? I can draft it with a minimal map carried from Build/Open into autoSaveConnection.


1107-1135: Auto-save overwrite guard — nice.
Clear error message and prevents ambiguous writes when using readers/FS.

README.md (1)

383-391: Overwrite and output-directory modes for AddPath/AddPaths are correctly implemented and tested

  • Both EnableAutoSave("") (overwrite mode) and EnableAutoSave("<dir>") (output-directory mode) are supported for file-path inputs added via both AddPath and AddPaths.
  • Validation in validateAutoSaveConfig() allows empty-output (overwrite) only when no reader/FS inputs are present, and errors appropriately otherwise.
  • Builder tests cover:
    • EnableAutoSave("") and EnableAutoSaveOnCommit("") with path inputs succeeding (no error)【builder_test.go 1203–1211】【builder_test.go 1222–1230】
    • Overwrite on multiple commits and auto-save on close scenarios with path inputs【builder_test.go 960–959】【builder_test.go 981–988】
  • Doc snippets for both modes accurately reflect code behavior for both single and multiple paths; no inconsistencies found.

No further changes required.

Comment on lines +309 to 360
```go
	// If it's a directory, collect all supported files from it
	if info.IsDir() {
		// Recursively collect all supported files from the directory
		err := filepath.WalkDir(path, func(filePath string, d os.DirEntry, err error) error {
			if err != nil {
				return err
			}

			// Skip directories and non-supported files
			if d.IsDir() || !model.IsSupportedFile(filePath) {
				return nil
			}

			// Skip files with known issues (like duplicate columns for testing)
			if strings.Contains(filepath.Base(filePath), "duplicate_columns") {
				return nil
			}

			// Get absolute path to avoid duplicates
			absPath, err := filepath.Abs(filePath)
			if err != nil {
				return fmt.Errorf("failed to get absolute path for %s: %w", filePath, err)
			}

			// Only add if not already processed
			if !processedFiles[absPath] {
				processedFiles[absPath] = true
				b.collectedPaths = append(b.collectedPaths, filePath)
			}
			return nil
		})
		if err != nil {
			return nil, fmt.Errorf("failed to walk directory %s: %w", path, err)
		}
	} else {
		// It's a file, check if it has a supported extension
		if !model.IsSupportedFile(path) {
			return nil, fmt.Errorf("unsupported file type: %s", path)
		}

		// Get absolute path to avoid duplicates
		absPath, err := filepath.Abs(path)
		if err != nil {
			return nil, fmt.Errorf("failed to get absolute path for %s: %w", path, err)
		}

		// Only add if not already processed
		if !processedFiles[absPath] {
			processedFiles[absPath] = true
			b.collectedPaths = append(b.collectedPaths, path)
		}
	}
```

🛠️ Refactor suggestion

Don’t bake test-specific filters into production (skipping “duplicate_columns”).
This silently ignores legitimate files whose names accidentally match, surprising users.

Apply this diff:

```diff
-				// Skip files with known issues (like duplicate columns for testing)
-				if strings.Contains(filepath.Base(filePath), "duplicate_columns") {
-					return nil
-				}
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the directory walk with the test-only filter removed):

```go
	// If it's a directory, collect all supported files from it
	if info.IsDir() {
		// Recursively collect all supported files from the directory
		err := filepath.WalkDir(path, func(filePath string, d os.DirEntry, err error) error {
			if err != nil {
				return err
			}
			// Skip directories and non-supported files
			if d.IsDir() || !model.IsSupportedFile(filePath) {
				return nil
			}
			// Get absolute path to avoid duplicates
			absPath, err := filepath.Abs(filePath)
			if err != nil {
				return fmt.Errorf("failed to get absolute path for %s: %w", filePath, err)
			}
			// Only add if not already processed
			if !processedFiles[absPath] {
				processedFiles[absPath] = true
				b.collectedPaths = append(b.collectedPaths, filePath)
			}
			return nil
		})
		if err != nil {
			return nil, fmt.Errorf("failed to walk directory %s: %w", path, err)
		}
	} else {
		// It's a file, check if it has a supported extension
		if !model.IsSupportedFile(path) {
			return nil, fmt.Errorf("unsupported file type: %s", path)
		}
		// Get absolute path to avoid duplicates
		absPath, err := filepath.Abs(path)
		if err != nil {
			return nil, fmt.Errorf("failed to get absolute path for %s: %w", path, err)
		}
		// Only add if not already processed
		if !processedFiles[absPath] {
			processedFiles[absPath] = true
			b.collectedPaths = append(b.collectedPaths, path)
		}
	}
```
🤖 Prompt for AI Agents
In builder.go around lines 309-360, remove the hard-coded test-only filter that
skips files with "duplicate_columns" in their name and the related comment;
instead add a configurable ignore pattern (or boolean) to the Builder struct
(default off/empty) and change the WalkDir logic to consult that configuration
when deciding to skip a file. Keep all other logic (supported-file check, abs
path dedupe, processedFiles map) unchanged; update tests to set the Builder's
ignore-pattern to "duplicate_columns" (or enable the test-only flag) so tests
retain the current behavior without baking test filters into production code.
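The configurable ignore-pattern approach described in the prompt can be sketched in isolation. This is a hypothetical illustration — `pathCollector`, `ignoreSubstring`, and `shouldSkip` are invented names, not the library's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// pathCollector carries an optional ignore substring (hypothetical field);
// the zero value ignores nothing, so production behavior is unchanged.
type pathCollector struct {
	ignoreSubstring string
}

// shouldSkip reports whether a file base name matches the configured
// ignore substring. The WalkDir callback would consult this instead of a
// hard-coded "duplicate_columns" check.
func (p *pathCollector) shouldSkip(base string) bool {
	return p.ignoreSubstring != "" && strings.Contains(base, p.ignoreSubstring)
}

func main() {
	prod := &pathCollector{}                                   // default: skip nothing
	test := &pathCollector{ignoreSubstring: "duplicate_columns"} // test opts in

	fmt.Println(prod.shouldSkip("duplicate_columns.csv")) // false
	fmt.Println(test.shouldSkip("duplicate_columns.csv")) // true
	fmt.Println(test.shouldSkip("users.csv"))             // false
}
```

Tests would set the ignore substring explicitly, so the filter never silently drops a user's legitimately named file in production.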

Comment on lines +430 to 443
```diff
 	// Remove compressed duplicates when uncompressed versions exist
 	b.collectedPaths = b.deduplicateCompressedFiles(b.collectedPaths)

-	// Case 1: Only file paths, use filesql driver for autosave support
-	if len(b.collectedPaths) > 0 && len(b.readers) == 0 {
-		// Create DSN with all collected paths and auto-save config
-		dsn := strings.Join(b.collectedPaths, ";")
-
-		// Append auto-save configuration to DSN if enabled
-		if b.autoSaveConfig != nil && b.autoSaveConfig.Enabled {
-			configJSON, err := json.Marshal(b.autoSaveConfig)
-			if err != nil {
-				return nil, fmt.Errorf("failed to serialize auto-save config: %w", err)
-			}
-			configEncoded := base64.StdEncoding.EncodeToString(configJSON)
-			dsn += "?autosave=" + configEncoded
-		}
+	// Unified approach: always use direct SQLite with streaming
+	// Create in-memory SQLite database using the same method as filesql driver
+	sqliteDriver := &sqlite.Driver{}
+	conn, err := sqliteDriver.Open(":memory:")
+	if err != nil {
+		return nil, fmt.Errorf("failed to create in-memory database: %w", err)
+	}
+
+	// Wrap the driver connection in sql.DB
+	db := sql.OpenDB(&directConnector{conn: conn})
```

⚠️ Potential issue

Single underlying driver.Conn without pool limits can break with database/sql.
directConnector returns the same Conn; without limiting the pool, database/sql may create multiple logical connections referencing the same Conn, risking concurrent use of a non-thread-safe Conn.

Apply this minimal safety net (keeps one physical connection):

```diff
-	db := sql.OpenDB(&directConnector{conn: conn})
+	db := sql.OpenDB(&directConnector{conn: conn})
+	// Ensure a single physical connection to avoid concurrent use of the same driver.Conn
+	db.SetMaxOpenConns(1)
+	db.SetMaxIdleConns(1)
+	db.SetConnMaxLifetime(0)
```

Note: If you later increase pool size, switch to a DSN like file:memdb?mode=memory&cache=shared and return a fresh driver.Conn per Connect.

🤖 Prompt for AI Agents

```text
In builder.go around lines 430 to 443, the current sql.OpenDB wraps a single
physical driver.Conn but doesn't restrict database/sql's pool, risking
concurrent use of that one Conn; after creating db := sql.OpenDB(...), set the
connection pool limits to a single logical connection (e.g.,
db.SetMaxOpenConns(1) and db.SetMaxIdleConns(1), optionally
db.SetConnMaxLifetime(0)) so database/sql never opens multiple concurrent
connections against the single underlying driver.Conn; if you later need more
concurrency, switch to a true in-memory DSN and return a fresh driver.Conn per
Connect.
```

Comment on lines +452 to 458

```go
	// Process reader inputs using streaming
	for _, readerInput := range b.readers {
		if err := b.streamReaderToSQLite(ctx, db, readerInput); err != nil {
			_ = db.Close() // Ignore close error during error handling
			return nil, fmt.Errorf("failed to stream reader input for table '%s': %w", readerInput.TableName, err)
		}

		// Note: Auto-save for readers-only case is not fully implemented yet
		// This requires extending the driver to support auto-save without original file paths
	}
```

🛠️ Refactor suggestion

**Close fs.FS readers after streaming to avoid descriptor leaks.**

`ReaderInput` may wrap an `fs.File` or compressed readers that implement `io.Closer`; they're never closed here.

Apply this diff:

```diff
 	for _, readerInput := range b.readers {
-		if err := b.streamReaderToSQLite(ctx, db, readerInput); err != nil {
+		if err := b.streamReaderToSQLite(ctx, db, readerInput); err != nil {
 			_ = db.Close() // Ignore close error during error handling
 			return nil, fmt.Errorf("failed to stream reader input for table '%s': %w", readerInput.TableName, err)
 		}
+		// Best-effort close if the reader is closable (fs.File, gzip, zstd, etc.)
+		if c, ok := readerInput.Reader.(io.Closer); ok {
+			_ = c.Close()
+		}
 	}
```
🤖 Prompt for AI Agents

```text
In builder.go around lines 452 to 458, the loop that streams b.readers does not
close reader resources and can leak file descriptors; after calling
b.streamReaderToSQLite(ctx, db, readerInput) you must detect if
readerInput.Reader (or the underlying reader) implements io.Closer and invoke
Close() when done. Ensure Close() is called regardless of whether
streamReaderToSQLite succeeded or returned an error (close before returning),
and surface or log any Close() error (do not ignore it silently). Implement the
Close() call directly after streaming each reader, handling the closer assertion
and propagating/annotating errors appropriately.
```

Comment on lines +488 to +510

```go
	// Create auto-save enabled database using our connector
	connector := &autoSaveConnector{
		sqliteConn:     freshConn,
		autoSaveConfig: b.autoSaveConfig,
		originalPaths:  b.collectOriginalPaths(),
	}
	db = sql.OpenDB(connector)

	// Stream all data to new connection
	for _, path := range b.collectedPaths {
		if err := b.streamFileToSQLite(ctx, db, path); err != nil {
			_ = db.Close() // Ignore close error during error handling
			return nil, fmt.Errorf("failed to stream file %s: %w", path, err)
		}
	}

	for _, readerInput := range b.readers {
		if err := b.streamReaderToSQLite(ctx, db, readerInput); err != nil {
			_ = db.Close() // Ignore close error during error handling
			return nil, fmt.Errorf("failed to stream reader input for table '%s': %w", readerInput.TableName, err)
		}
	}
}
```

🛠️ Refactor suggestion

**Mirror the pool limits for the auto-save connection.**

Same rationale as above; keep the connection count at 1.

Apply this diff:

```diff
-		db = sql.OpenDB(connector)
+		db = sql.OpenDB(connector)
+		db.SetMaxOpenConns(1)
+		db.SetMaxIdleConns(1)
+		db.SetConnMaxLifetime(0)
```
🤖 Prompt for AI Agents

```text
In builder.go around lines 488 to 510, the auto-save connection created with
sql.OpenDB(connector) does not mirror the single-connection pool limits used
elsewhere; after creating db = sql.OpenDB(connector), set the connection pool
limits to a single connection by calling db.SetMaxOpenConns(1) and
db.SetMaxIdleConns(1) (and optionally SetConnMaxIdleTime/SetConnMaxLifetime to
match other usage if present) before streaming files/readers so the auto-save
connection remains limited to one connection.
```

Comment on lines +392 to +400
- **io.Reader** (`AddReader`): **Only supports output directory mode**
```go
// ❌ Build error - overwrite mode not supported
builder.AddReader(reader, "table", model.FileTypeCSV).EnableAutoSave("")

// ✅ Must specify output directory
builder.AddReader(reader, "table", model.FileTypeCSV).EnableAutoSave("./output")
```


⚠️ Potential issue

**Docs reference the internal symbol `model.FileTypeCSV`; use the public API (likely `filesql.FileTypeCSV`).**

End users shouldn't import internal packages. The snippet should reference an exported constant from the public package.

```diff
-  builder.AddReader(reader, "table", model.FileTypeCSV).EnableAutoSave("")
+  builder.AddReader(reader, "table", filesql.FileTypeCSV).EnableAutoSave("")
@@
-  builder.AddReader(reader, "table", model.FileTypeCSV).EnableAutoSave("./output")
+  builder.AddReader(reader, "table", filesql.FileTypeCSV).EnableAutoSave("./output")
```

If the exported identifier differs (e.g., `filesql.CSV`), adjust accordingly.

🤖 Prompt for AI Agents

```text
In README.md around lines 392 to 400, the example references the internal symbol
model.FileTypeCSV; replace it with the exported public constant from the public
package (e.g., filesql.FileTypeCSV or filesql.CSV) and update the code snippet
import and usage accordingly — change the snippet to import the public package
(filesql) instead of model, use the correct exported identifier, and verify the
README compiles against the public API shown to avoid suggesting internal
package usage.
```
