Closed
Description
I had a chat with a Linux kernel committer (nick removed for privacy) earlier today:
<kernel> i'm aware of scancode
<kernel> haven't had time yet to run it on a large scale
<kernel> btw, one of the files I ran it against is
<kernel> linux-kernel/drivers/gpu/drm/gma500/cdv_intel_crt.c
<kernel> that's an MIT license and scancode says 100% match
<kernel> which is actually wrong
<kernel> the text is modified
<kernel> fossology gets it wrong as well
<pombreda> let me fix this asap. that's an easy one
<kernel> have you spotted the extra text in there?
<kernel> it's not changing the meaning of the license
<kernel> but one can put anything into it
<pombreda> so running
<pombreda> $ ./scancode --license --diag --license-text --format json-pp cdv_intel_crt.c
<pombreda> with the head of the develop branch yields this:
<pombreda> https://gist.githubusercontent.com/pombredanne/a9bdd72ffed4acdb3c904e43a3f22c43/raw/9db30c36ed19ea7834618ae25be2d1d9fdfb854c/match.json
<pombreda> where you can see the actual detected text differences [enclosed]
<pombreda> in square brackets for the non-matched parts
<kernel> looks correct
<pombreda> so IMHO here: MIT is properly detected. AND scancode also returns
<pombreda> the exact text to use in a Debian machine-readable copyright file or attribution notice
<pombreda> there are so many variants of MIT that this is the approach
<pombreda> would you agree this is sane approach?
<kernel> ok
<kernel> but for me it's confusing to see 100% while it's not
<pombreda> agreed. ok, the 100% match is based on 100% of the MIT license text
<pombreda> being matched. not on 100% of the scanned file being matched as I
<pombreda> can never know what it is really, can I?
<kernel> the thing is that if I do pure machine based analysis
<kernel> then I want to see that this is not 100%
<kernel> well, in this case the modification is inside the license text
<kernel> and I can add random crap instead of this harmless (blurb)
<pombreda> ok, I see: I guess that when I have a match that is not contiguous
<pombreda> I could introduce some bias in the score
<kernel> that would be helpful
<pombreda> excellent point :)
<kernel> so we can point someone to it automatically and say:
<kernel> Look someone fiddled with the text
<kernel> figure out what it means
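
To make the idea of biasing the score concrete, here is a minimal sketch in Python. It is not ScanCode's actual scoring code: the function names and the density-based penalty are assumptions made for illustration. It separates the two notions discussed above: how much of the license text was matched, and how contiguous the matched tokens are inside the scanned file.

```python
def rule_coverage(rule_length, matched_rule_positions):
    """Percentage of the license (rule) tokens that were matched."""
    return 100.0 * len(set(matched_rule_positions)) / rule_length


def query_density(matched_query_positions):
    """Fraction of tokens inside the matched region of the scanned file
    that belong to the match. 1.0 means the match is contiguous."""
    positions = sorted(set(matched_query_positions))
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) / span


def biased_score(rule_length, matched_rule_positions, matched_query_positions):
    """Hypothetical score: coverage scaled down when extra text is interleaved."""
    coverage = rule_coverage(rule_length, matched_rule_positions)
    return coverage * query_density(matched_query_positions)


# A 10-token license fully matched, but with 3 extra tokens inserted in the
# middle of the scanned text (file positions 5, 6 and 7 are not part of the match).
rule_positions = list(range(10))
file_positions = list(range(5)) + list(range(8, 13))
score = biased_score(10, rule_positions, file_positions)
```

With this kind of bias, a fully matched but interrupted license text scores below 100, which gives an automated pipeline a signal that someone modified the text.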
<pombreda> let me enter a ticket. Do you care if I paste (and anonymize)
<pombreda> some of this chat log
<kernel> no problem
<pombreda> thanks. Do you want to have your GH nick mentioned in it?
<pombreda> I do not expect a hacker of your class to care for this nor have a
<pombreda> GH account anyway :)
<kernel> I don't think I have one
<kernel> my repositories are on kernel.org :)
<pombreda> ok, I will ping you when this is ready
<pombreda> this is a great bug
<kernel> when the full run is done, I'll probably have some more for you :)
<pombreda> I want scancode to be the final solution to license detection
[..]
<kernel> I need to do some wrappery around the tool, so I can store
<kernel> info in a database for comparison and other things
<kernel> quick question while I have your attention
<pombreda> sure: shoot. I am honored to have YOUR attention :)
<kernel> is the invocation per file or does the tool take directories as well?
<pombreda> you can pass a single file or a directory, any way you like.
<pombreda> We are also adding ignores support and eventually list of files or globs
<kernel> ok
<kernel> does it parallelize multiple files?
<kernel> i.e. does it go into multithreaded mode?
<pombreda> multithreaded: yes, using a combo of Python threads and multiprocessing
<pombreda> with cli options for timeouts and max memory usage:
<pombreda> https://github.com/nexB/scancode-toolkit/blob/develop/src/scancode/interrupt.py#L46
<pombreda> and the number of processes :)
<pombreda> and we do not leak threads or memory. This was fixed a few weeks ago.
<kernel> nice
<pombreda> and we are also working on deduction and summarization such that a
<pombreda> Dep5 machine readable copyright file can be created at once with a
<pombreda> reasonable amount of summarization
<pombreda> we have a prototype "server" that I need to push and I will task a
<pombreda> GSOC student to polish it
<kernel> :)
<kernel> so should I use devel for our tests
<kernel> or what branch is the best choice?
<pombreda> devel
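
The wrapper mentioned in the chat (store scan results in a database for later comparison) could look roughly like the sketch below. This is only an illustration under stated assumptions: the command line is copied from the invocation quoted in the chat and is assumed to print JSON to stdout, and the table layout and the helper names (`open_store`, `save_scan`, `scan_and_store`, `changed_paths`) are hypothetical. The raw JSON is stored as-is, so nothing here depends on the structure of ScanCode's output.

```python
import sqlite3
import subprocess

# Assumption: mirrors the invocation quoted in the chat and prints JSON to
# stdout. Adjust the options for the ScanCode version actually in use.
SCAN_COMMAND = ["./scancode", "--license", "--diag", "--license-text",
                "--format", "json-pp"]


def open_store(db_path=":memory:"):
    """Open (or create) the results database. The schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans ("
        "path TEXT, run_id TEXT, result_json TEXT, "
        "PRIMARY KEY (path, run_id))"
    )
    return conn


def save_scan(conn, run_id, path, result_json):
    """Store the raw scan output for one file in one run."""
    conn.execute(
        "INSERT OR REPLACE INTO scans (path, run_id, result_json) "
        "VALUES (?, ?, ?)",
        (path, run_id, result_json),
    )
    conn.commit()


def scan_and_store(conn, run_id, path, command=SCAN_COMMAND):
    """Run the scanner on one file and store whatever it prints to stdout."""
    completed = subprocess.run(
        list(command) + [path], capture_output=True, text=True, check=True
    )
    save_scan(conn, run_id, path, completed.stdout)


def changed_paths(conn, old_run, new_run):
    """Paths whose stored output differs between two runs."""
    rows = conn.execute(
        "SELECT a.path FROM scans a JOIN scans b ON a.path = b.path "
        "WHERE a.run_id = ? AND b.run_id = ? "
        "AND a.result_json != b.result_json ORDER BY a.path",
        (old_run, new_run),
    )
    return [row[0] for row in rows]
```

Comparing raw output is deliberately crude: any difference in the JSON, including formatting, counts as a change. A real wrapper would probably parse the output and compare only the fields of interest.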