License detection scores can be misleading - conversation with a Linux kernel committer #534

@pombredanne

I had a chat with a Linux kernel committer (nick removed for privacy) earlier today:

<kernel> i'm aware of scancode
<kernel> haven't had time yet to run it on a large scale
<kernel> btw, one of the files I ran it against is
<kernel> linux-kernel/drivers/gpu/drm/gma500/cdv_intel_crt.c
<kernel> that's an MIT license and scancode says 100% match
<kernel> which is actually wrong
<kernel> the text is modified
<kernel> fossology gets it wrong as well

<pombreda> let me fix this asap. that's an easy one

<kernel> have you spotted the extra text in there?
<kernel> it's not changing the meaning of the license
<kernel> but one can put anything into it

<pombreda> so running 
<pombreda> $ ./scancode --license --diag --license-text --format json-pp cdv_intel_crt.c  
<pombreda> with the head of the develop branch yields this: 
<pombreda> https://gist.githubusercontent.com/pombredanne/a9bdd72ffed4acdb3c904e43a3f22c43/raw/9db30c36ed19ea7834618ae25be2d1d9fdfb854c/match.json
<pombreda> where you can see the actual detected text differences [enclosed] 
<pombreda> in square brackets for the non-matched parts

<kernel> looks correct

<pombreda> so IMHO here: MIT is properly detected, and scancode also returns
<pombreda> the exact text to use in a Debian machine-readable copyright file or an attribution notice
<pombreda> there are so many variants of MIT that this is the approach
<pombreda> would you agree this is a sane approach?

<kernel> ok
<kernel> but for me it's confusing to see 100% while it's not

<pombreda> agreed. ok, the 100% match is based on 100% of the MIT license text
<pombreda> being matched. not on 100% of the scanned file being matched as I 
<pombreda> can never know what it is really, can I?

<kernel> the thing is that if I do pure machine based analysis
<kernel> then I want to see that this is not 100%
<kernel> well, in this case the modification is inside the license text
<kernel> and I can add random crap instead of this harmless (blurb)

<pombreda> ok, I see: I guess that when I have a match that is not contiguous 
<pombreda> I could introduce some bias in the score

<kernel> that would be helpful

<pombreda> excellent point :)

<kernel> so we can point someone to it automatically and say: 
<kernel> Look someone fiddled with the text
<kernel> figure out what it means
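
Note on the scoring idea discussed above: a minimal sketch of how a score could be biased when a match is not contiguous. This is an illustration only, not ScanCode's actual scoring code. The `Match` record, `coverage`, `gap_tokens`, and `biased_score` are hypothetical names, and the density-based penalty is just one possible choice of bias.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical match record; not ScanCode's actual data model."""
    rule_length: int         # number of tokens in the reference license text
    matched_positions: list  # sorted token positions in the scanned file that matched


def coverage(match: Match) -> float:
    """Percentage of the reference license tokens found in the file."""
    return 100.0 * len(match.matched_positions) / match.rule_length


def gap_tokens(match: Match) -> int:
    """Unmatched tokens sitting between the first and last matched token."""
    if not match.matched_positions:
        return 0
    span = match.matched_positions[-1] - match.matched_positions[0] + 1
    return span - len(match.matched_positions)


def biased_score(match: Match) -> float:
    """Coverage scaled by how densely the matched tokens are packed."""
    matched = len(match.matched_positions)
    if matched == 0:
        return 0.0
    density = matched / (matched + gap_tokens(match))
    return coverage(match) * density


# A 10-token license matched in one contiguous run.
contiguous = Match(rule_length=10, matched_positions=list(range(10)))
# The same 10 tokens, with 2 foreign tokens inserted in the middle.
modified = Match(rule_length=10, matched_positions=[0, 1, 2, 3, 4, 7, 8, 9, 10, 11])
```

With this sketch, both matches still have full coverage of the license text, but only the contiguous one keeps a full `biased_score`; the modified one drops below 100 and can be flagged for a human to review the inserted text.
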

<pombreda> let me enter a ticket. Do you mind if I paste (and anonymize)
<pombreda> some of this chat log?

<kernel> no problem

<pombreda> thanks. Do you want to have your GH nick mentioned in it? 
<pombreda> I do not expect a hacker of your class to care for this nor have a 
<pombreda> GH account anyway :)

<kernel> I don't think I have one
<kernel> my repositories are on kernel.org :)

<pombreda> ok, I will ping you when this is ready
<pombreda> this is a great bug

<kernel> when the full run is done, I'll probably have some more for you :)

<pombreda> I want scancode to be the final solution to license detection

[..]
<kernel> I need to do some wrappery around the tool, so I can store 
<kernel> info in a database for comparison and other things
<kernel> quick question while I have your attention

<pombreda> sure: shoot. I am honored to have YOUR attention :)

<kernel> is the invocation per file or does the tool take directories as well?

<pombreda> you can pass a single file or a directory, any way you like.
<pombreda> We are also adding support for ignores and, eventually, for lists of files or globs

<kernel> ok
<kernel> does it parallelize multiple files?
<kernel> i.e. does it go into multithreaded mode?

<pombreda> multithreaded: yes, using a combo of Python threads and multiprocessing
<pombreda> with cli options for timeouts and max memory usage: 
<pombreda> https://github.com/nexB/scancode-toolkit/blob/develop/src/scancode/interrupt.py#L46
<pombreda> and the number of processes :)
<pombreda> and we do not leak threads or memory. This was fixed a few weeks ago.

<kernel> nice

<pombreda> and we are also working on deduction and summarization such that a 
<pombreda> Dep5 machine readable copyright file can be created at once with a 
<pombreda> reasonable amount of summarization

<pombreda> we have a prototype "server" that I need to push and I will task a 
<pombreda> GSOC student to polish it

<kernel> :)

<kernel> so should I use devel for our tests
<kernel> or what branch is the best choice?

<pombreda> devel
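
Note on the wrapper mentioned in the log (storing results in a database for comparison): a rough sketch of what that could look like with SQLite. It assumes the JSON output has already been parsed into a dict with a `files` list whose entries carry a `path` and a `licenses` list with `key` and `score` fields. That shape is an assumption for illustration and should be checked against the real output. The table layout and the function names `store_scan` and `changed_scores` are hypothetical.

```python
import sqlite3


def store_scan(conn, scan, run_label):
    """Store one parsed scan (assumed shape, see note above) under a run label."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS detections ("
        "run TEXT, path TEXT, license_key TEXT, score REAL)"
    )
    rows = [
        (run_label, f["path"], lic["key"], lic["score"])
        for f in scan.get("files", [])
        for lic in f.get("licenses", [])
    ]
    conn.executemany("INSERT INTO detections VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def changed_scores(conn, old_run, new_run):
    """List (path, license_key, old_score, new_score) where the score differs."""
    query = (
        "SELECT a.path, a.license_key, a.score, b.score "
        "FROM detections a JOIN detections b "
        "ON a.path = b.path AND a.license_key = b.license_key "
        "WHERE a.run = ? AND b.run = ? AND a.score != b.score "
        "ORDER BY a.path"
    )
    return conn.execute(query, (old_run, new_run)).fetchall()
```

Storing each run under its own label makes it possible to rerun the scan after a scoring change and list only the files whose score moved.
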
