Analysis tool for ocr files in Mokuro processed manga. Also supports miscellaneous files.
japanese_text_analyzer directory_or_file_path OPTIONS
-
--mokurojson(Default): Searches only for.jsonfiles in the specified path.Note: The Mokuro
_ocrjson files must be present. -
--mokuro: Searches only for.mokurofiles in the specified path.Note: The Mokuro
.mokurofiles must be present. -
--any: Searches for all files in the specified path. -
--any=EXTENSION: Searches for all files matching the file extension in the specified path.
japanese_text_analyzer ./mokuro_manga_path/
japanese_text_analyzer "./example path/" --any
japanese_text_analyzer "./example path/" --any=.html
analysis.txt (Stats on the analyzed text)
./sample_manga/
----------------------------------------------------------------------------
Number of Japanese characters: 43811
Number of kanji characters: 10952
Number of unique kanji: 1082
Number of unique kanji appearing only once: 285 (26.34% of unique kanji)
Number of words in total: 25204
Number of unique words: 3519 (13.96% of all words)
Number of words appearing only once: 2018 (57.35% of unique words)
Average volume length in characters: 14603 (3 total volumes)
Average page length in characters: 103 (422 total pages)
Average textbox length in characters: 11 (shortest: 1) (longest: 254) (4302 total textboxes)
word_list.csv (Deduped list of words along with the number of times they were found in the analyzed text)
て 831
の 805
に 710
た 702
です 555
は 528
で 521
が 508
ん 504
... (3510 more lines)
word_list_raw.csv (Unsorted list of words found in the analyzed text)
まぁ
まぁ
話し
て
き
まし
... (25198 more lines)
Linux:
./setup.sh
cargo build --release
Windows:
setup.bat
cargo build --release