KeemenaPreprocessing

License: MIT | Dev | Build Status


One-stop text pre-processor for Julia: clean -> tokenise -> segment -> build vocabulary -> align levels -> save bundle.


What you get

  • Vocabulary

    • deterministic id <-> token tables
    • minimum-frequency filtering
    • user-defined special tokens
  • Tokenisation

    • byte, character, whitespace or Unicode-word
    • pluggable custom function
  • Offset vectors

    • word, sentence, paragraph and document boundaries
    • always begin with 1 and end with n_tokens + 1 (see the sketch after this list)
  • Alignment cross-maps

    • byte <-> char <-> word indices (forward & backward)
  • Streaming mode

    • constant-memory two-pass pipeline
    • choose vector of bundles or single merged bundle
  • Bundles

    • everything packed into a PreprocessBundle
    • save / load with JLD2 in one line
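
The offset convention above makes per-unit token counts a one-liner. A minimal sketch with a hand-written offsets vector rather than a real bundle (the accessor that yields such vectors varies by level; consult the package docs):

# Hypothetical sentence-offset vector for a 9-token corpus with
# sentences of 4, 2 and 3 tokens; real vectors come from the bundle.
sentence_offsets = [1, 5, 7, 10]            # begins at 1, ends at n_tokens + 1

n_tokens         = sentence_offsets[end] - 1   # 9
sentence_lengths = diff(sentence_offsets)      # [4, 2, 3]

# Tokens of sentence i occupy positions sentence_offsets[i] : sentence_offsets[i+1] - 1
for i in 1:length(sentence_offsets)-1
    println("sentence ", i, " spans tokens ",
            sentence_offsets[i], ":", sentence_offsets[i+1] - 1)
end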

Quick example (full corpus in RAM)

using KeemenaPreprocessing

docs = ["First document.", "Second document..."]

cfg  = PreprocessConfiguration(
          tokenizer_name          = :unicode,
          record_sentence_offsets = true,
          minimum_token_frequency = 2)

bundle = preprocess_corpus(docs; config = cfg)

word_ids = get_token_ids(bundle, :word)
println("tokens:", length(word_ids))

This single call does all of the work: loading, cleaning, tokenising, vocabulary building, offset recording, and bundle assembly.
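
The resulting bundle can be persisted in one line, as promised in the feature list. A minimal sketch, assuming the JLD2-backed helpers are named save_preprocess_bundle and load_preprocess_bundle (check the package docs for the exact names):

# Assumed helper names; the bundle round-trips through a single JLD2 file.
save_preprocess_bundle(bundle, "corpus_bundle.jld2")
bundle2 = load_preprocess_bundle("corpus_bundle.jld2")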


Processing huge corpora with constant memory

using KeemenaPreprocessing, Downloads

# Two Project Gutenberg books
alice = Downloads.download(
          "https://www.gutenberg.org/files/11/11-0.txt", "alice.txt")
time_machine = Downloads.download(
          "https://www.gutenberg.org/files/35/35-0.txt", "time_machine.txt")

cfg = PreprocessConfiguration(tokenizer_name = :whitespace)

merged = preprocess_corpus_streaming_full(
           [alice, time_machine];    # any iterable of sources
           cfg          = cfg,
           chunk_tokens = 5_000)    # ~5 k tokens per internal chunk

println("total tokens:",
        length(get_token_ids(merged, :word)))

preprocess_corpus_streaming_full runs the two-pass streaming pipeline, merges the internal chunks on the fly, and returns one cohesive bundle covering the entire corpus. This is ideal when downstream code expects a single artefact but you still need strict memory bounds during preprocessing.
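
If you prefer the vector-of-bundles option mentioned in the feature list, the per-chunk variant keeps each chunk as its own bundle. A minimal sketch, assuming an entry point named preprocess_corpus_streaming (the name is inferred from the _full variant above and may differ; verify against the package docs):

# Sketch only: yields one bundle per ~5 k-token chunk instead of merging.
bundles = preprocess_corpus_streaming([alice, time_machine];
                                      cfg = cfg, chunk_tokens = 5_000)

for (i, b) in enumerate(bundles)
    println("chunk ", i, ": ", length(get_token_ids(b, :word)), " word tokens")
end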


Installing

Install from the General registry with import Pkg; Pkg.add("KeemenaPreprocessing"), or press ] to enter package mode and type add KeemenaPreprocessing. Back at the REPL prompt, load it with using KeemenaPreprocessing.

For the development version, open the Julia REPL, enter package mode with ], and run: add https://github.com/mantzaris/KeemenaPreprocessing.jl
