
Multibyte characters are broken when split across block boundaries during comparison #17

@kmuto

Describe the problem

TTY::File::CompareFiles#call appears to read each file in chunks of the block size.
When a multibyte character (CJK character, emoji, etc.) straddles a block boundary, the character is split and ends up broken.
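
To illustrate the mechanism in isolation, here is a minimal sketch that does not use TTY::File at all. `split_into_blocks` is a hypothetical helper, and the 4096-byte block size and the position of the character are assumptions chosen so that the 3-byte character "あ" lands on the boundary. It only shows that cutting a UTF-8 string at a fixed byte offset can leave both neighbouring blocks with invalid encoding.

```ruby
# Hypothetical helper (not part of TTY::File): cut a string into fixed-size
# byte blocks and tag each block as UTF-8, roughly what a naive block-wise
# reader ends up doing.
def split_into_blocks(string, block_size)
  bytes = string.b # binary copy, so byteslice works on raw bytes
  blocks = []
  offset = 0
  while offset < bytes.bytesize
    blocks << bytes.byteslice(offset, block_size).force_encoding(Encoding::UTF_8)
    offset += block_size
  end
  blocks
end

block_size = 4096 # assumed block size, for illustration only
# 4095 ASCII bytes, so the first byte of "あ" is the last byte of block 0.
content = ("a" * (block_size - 1)) + "あい"

split_into_blocks(content, block_size).each_with_index do |block, i|
  puts "block #{i}: #{block.bytesize} bytes, valid UTF-8: #{block.valid_encoding?}"
end
```

No bytes are lost by the split. The blocks only become invalid once each one is interpreted as text on its own.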

Steps to reproduce the problem

./diff-j.rb
       diff  4096-a.txt and 4096-aj.txt
--- 4096-a.txt
+++ 4096-aj.txt
@@ -1 +1 @@
-aaa(repeats 4096 times )aaa�
@@ -1 +1 @@
-A
+��い

4096-a.txt

aaa(repeats 4096 times)aaaA

4096-aj.txt

aaa(repeats 4096 times)aaaあい

check

puts TTY::File.diff("4096-a.txt", "4096-aj.txt")

Actual behaviour

The multibyte character is split at the byte level and comes out broken:

�
��い

Expected behaviour

./diff-j.rb
       diff  4096-a.txt and 4096-aj.txt
--- 4096-a.txt
+++ 4096-aj.txt
@@ -1 +1 @@
-aaa(repeats 4096 times )aaa
@@ -1 +1 @@
-A
+あい

This looks hard to solve with the current implementation, which relies on block reads.
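
One possible direction, offered only as a sketch and not as a patch for TTY::File: keep the block reads, but hold back any incomplete trailing UTF-8 sequence and prepend it to the next block. Both `each_utf8_block` and `complete_utf8_prefix_length` are hypothetical names, the code assumes the input is UTF-8, and it does not address how the diff itself consumes the blocks.

```ruby
require "stringio"

# Hypothetical helper: number of leading bytes of +bytes+ that end on a
# complete UTF-8 character. Assumes the data is UTF-8.
def complete_utf8_prefix_length(bytes)
  size = bytes.bytesize
  # Look at most 4 bytes back for the lead byte of the last character.
  1.upto([4, size].min) do |back|
    byte = bytes.getbyte(size - back)
    next if (byte & 0xC0) == 0x80 # continuation byte, keep looking

    expected = if byte >= 0xF0 then 4
               elsif byte >= 0xE0 then 3
               elsif byte >= 0xC0 then 2
               else 1
               end
    # If the last character needs more bytes than are present, cut before it.
    return expected > back ? size - back : size
  end
  size # no lead byte found; pass the data through unchanged
end

# Hypothetical helper: yield blocks from +io+ so that no UTF-8 character is
# split between two consecutive blocks.
def each_utf8_block(io, block_size = 4096)
  return enum_for(:each_utf8_block, io, block_size) unless block_given?

  carry = "".b
  while (chunk = io.read(block_size))
    buffer = carry + chunk.b
    cut = complete_utf8_prefix_length(buffer)
    yield buffer.byteslice(0, cut).force_encoding(Encoding::UTF_8) unless cut.zero?
    carry = buffer.byteslice(cut, buffer.bytesize - cut)
  end
  # Leftover bytes at EOF mean the input ended mid-character.
  yield carry.force_encoding(Encoding::UTF_8) unless carry.empty?
end

# Usage with an in-memory stand-in for a file (layout is an assumption).
io = StringIO.new(("a" * 4095) + "あい")
each_utf8_block(io, 4096) do |block|
  puts "#{block.bytesize} bytes, valid UTF-8: #{block.valid_encoding?}"
end
```

The trade-off is that blocks are no longer a fixed size, so two files being compared may fall out of step by a few bytes per block. Any comparison built on top would have to tolerate that.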

Describe your environment

  • OS version: Debian 11
  • Ruby version: 2.7.4
  • TTY::File version: 0.10.0
    diff-j.zip
