KeemenaPreprocessing

License: MIT | Dev | Build Status


One-stop text pre-processor for Julia: clean -> tokenise -> segment -> build vocabulary -> align levels -> save bundle.


What you get

  • Vocabulary

    • deterministic id <-> token tables
    • minimum-frequency filtering
    • user-defined special tokens
  • Tokenisation

    • byte, character, whitespace or Unicode-word
    • pluggable custom function
  • Offset vectors

    • word, sentence, paragraph and document boundaries
    • always begin with 1 and end with n_tokens + 1 (see the sketch after this list)
  • Alignment cross-maps

    • byte <-> char <-> word indices (forward & backward)
  • Streaming mode

    • constant-memory two-pass pipeline
    • choose vector of bundles or single merged bundle
  • Bundles

    • everything packed into a PreprocessBundle
    • save / load with JLD2 in one line
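
The offset convention above makes per-unit token counts a one-liner. A minimal sketch with a hand-written offsets vector rather than a real bundle (the accessor that yields such vectors varies by level; consult the package docs):

# Hypothetical sentence-offset vector for a 9-token corpus with
# sentences of 4, 2 and 3 tokens; real vectors come from the bundle.
sentence_offsets = [1, 5, 7, 10]            # begins at 1, ends at n_tokens + 1

n_tokens         = sentence_offsets[end] - 1   # 9
sentence_lengths = diff(sentence_offsets)      # [4, 2, 3]

# Tokens of sentence i occupy positions sentence_offsets[i] : sentence_offsets[i+1] - 1
for i in 1:length(sentence_offsets)-1
    println("sentence ", i, " spans tokens ",
            sentence_offsets[i], ":", sentence_offsets[i+1] - 1)
end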

Quick example (full corpus in RAM)

using KeemenaPreprocessing

docs = ["First document.", "Second document..."]

cfg  = PreprocessConfiguration(
          tokenizer_name          = :unicode,
          record_sentence_offsets = true,
          minimum_token_frequency = 2)

bundle = preprocess_corpus(docs; config = cfg)

word_ids = get_token_ids(bundle, :word)
println("tokens:", length(word_ids))

This single call does all of the work: loading, cleaning, tokenising, vocabulary building, offset recording, and bundle assembly.
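
The resulting bundle can be persisted in one line, as promised in the feature list. A minimal sketch, assuming the JLD2-backed helpers are named save_preprocess_bundle and load_preprocess_bundle (check the package docs for the exact names):

# Assumed helper names; the bundle round-trips through a single JLD2 file.
save_preprocess_bundle(bundle, "corpus_bundle.jld2")
bundle2 = load_preprocess_bundle("corpus_bundle.jld2")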


Processing huge corpora with constant memory

using KeemenaPreprocessing, Downloads

# Two Project Gutenberg books
alice = Downloads.download(
          "https://www.gutenberg.org/files/11/11-0.txt", "alice.txt")
time_machine = Downloads.download(
          "https://www.gutenberg.org/files/35/35-0.txt", "time_machine.txt")

cfg = PreprocessConfiguration(tokenizer_name = :whitespace)

merged = preprocess_corpus_streaming_full(
           [alice, time_machine];    # any iterable of sources
           cfg          = cfg,
           chunk_tokens = 5_000)    # ~5 k tokens per internal chunk

println("total tokens:",
        length(get_token_ids(merged, :word)))

preprocess_corpus_streaming_full runs the two-pass streaming pipeline, merges the internal chunks on the fly, and returns one cohesive bundle covering the entire corpus. This is ideal when downstream code expects a single artefact but you still need strict memory bounds during preprocessing.
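
If you prefer the vector-of-bundles option mentioned in the feature list, the per-chunk variant keeps each chunk as its own bundle. A minimal sketch, assuming an entry point named preprocess_corpus_streaming (the name is inferred from the _full variant above and may differ; verify against the package docs):

# Sketch only: yields one bundle per ~5 k-token chunk instead of merging.
bundles = preprocess_corpus_streaming([alice, time_machine];
                                      cfg = cfg, chunk_tokens = 5_000)

for (i, b) in enumerate(bundles)
    println("chunk ", i, ": ", length(get_token_ids(b, :word)), " word tokens")
end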


Installing

Install from the General registry with import Pkg; Pkg.add("KeemenaPreprocessing"), or press ] to enter package mode and type add KeemenaPreprocessing. Back at the REPL prompt, load it with using KeemenaPreprocessing.

For the development version, open the Julia REPL, enter package mode with ], and run: add https://github.com/mantzaris/KeemenaPreprocessing.jl
