Commit ff41366

Add docs for LicenseDetection and detail referencing
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
1 parent f861310 commit ff41366

4 files changed: +195 -0 lines changed

docs/scripts/doc8_style_check.sh

File mode changed: 100644 -> 100755

docs/scripts/sphinx_build_link_check.sh

File mode changed: 100644 -> 100755

docs/source/explanations/index.rst

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@
 :maxdepth: 2

 overview
+license-detection-reference

 ..
 [ToAdd]
Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
License Detection and Reference Additions
=========================================

`Main Issue <https://github.com/nexB/scancode-toolkit/issues/2878>`_

`Main Pull Request <https://github.com/nexB/scancode-toolkit/pull/2961>`_

`A presentation on this <https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1079639973>`_


Previous Work
-------------

- Akansha's GSoC work on unknown local references and unknown detection
  based on ngrams from LicenseDB texts.

- Work from ``scancode-analyzer`` and ``debian copyright detection``,
  which had the concept of a LicenseDetection, flat LicenseMatches and
  getting unique detections across a scan that reference the details.

- Work on primary-license and license scoring.

LicenseDetection
----------------

This aims to solve a few types of false positives commonly observed in
ScanCode license detection. These are:

The ``unknown`` cases
^^^^^^^^^^^^^^^^^^^^^

- Unknown intros with proper detections after them
- Unknown references to local files

License Clues
^^^^^^^^^^^^^

This would also introduce a ``license_clues`` list of LicenseMatches
holding improper detections or other clues, such as URLs, which
cannot be marked as detections.

License Versions
^^^^^^^^^^^^^^^^

This would also simplify license expressions for GPL/LGPL cases
where versioned and unversioned matches are detected together.

Package License Detections
^^^^^^^^^^^^^^^^^^^^^^^^^^

License detections in package manifests currently carry only the license-expression
from the detection, unlike licenses detected directly in files, which
have details. Packages would now also have details.

Other Solution Elements
^^^^^^^^^^^^^^^^^^^^^^^

Merged:

- Key {{phrases}} in license text rules
- New license clarity scoring
- Report the primary license

Upcoming:

- Make it easier to report, review and curate license detections
  (GSoC project in scancode.io)

- Fixing bugs and updating the heuristics.
  (This will be ongoing, like the LicenseDB updates)

Examples
^^^^^^^^

An example from the Eclipse Foundation::

    /*********************************************************************
    * Copyright (c) 2019 Red Hat, Inc.
    *
    * This program and the accompanying materials are made
    * available under the terms of the Eclipse Public License 2.0
    * which is available at https://www.eclipse.org/legal/epl-2.0/
    *
    * SPDX-License-Identifier: EPL-2.0
    **********************************************************************/


The text ``"This program and the accompanying materials are made\n* available under the terms
of the",`` is detected as ``unknown-license-reference`` with ``is_license_intro`` set to True,
and is followed by several ``"epl-2.0"`` detections.
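
The handling of such an unknown intro can be sketched as follows. This is only
an illustration under stated assumptions: the ``Match`` class and the
``drop_unknown_intros`` helper are hypothetical and are not ScanCode APIs; only
the ``is_license_intro`` flag and the expression names come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical, simplified stand-in for a LicenseMatch.
    license_expression: str
    is_license_intro: bool = False


def drop_unknown_intros(matches):
    """Drop unknown intros that are followed by a proper detection."""
    kept = []
    for i, match in enumerate(matches):
        # A "proper" detection here is any later match that is neither
        # an intro nor an unknown expression (an assumption of this sketch).
        followed_by_proper = any(
            not m.is_license_intro and "unknown" not in m.license_expression
            for m in matches[i + 1:]
        )
        if match.is_license_intro and followed_by_proper:
            continue
        kept.append(match)
    return kept


matches = [
    Match("unknown-license-reference", is_license_intro=True),
    Match("epl-2.0"),
    Match("epl-2.0"),
]
print([m.license_expression for m in drop_unknown_intros(matches)])
# prints ['epl-2.0', 'epl-2.0']
```

An intro that is not followed by any proper detection is kept, so it can still
be reported as an unknown match.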

What is a LicenseDetection?
---------------------------

A LicenseDetection groups one or more LicenseMatch objects
and produces the license expression that we finally report.

Properties:

- A file can have multiple LicenseDetections (separated by non-legalese lines).
- A LicenseDetection can come from a file directly or from a package.
- We should be mostly certain of a proper detection to create a LicenseDetection.
- One LicenseDetection can have matches from different files, in the case of local license
  references.
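
As a rough data-structure sketch of the properties above: the class and field
names below are assumptions made for illustration, and the way expressions are
combined (a plain ``AND`` join of unique expressions) is a simplification, not
necessarily how ScanCode builds the final expression.

```python
from dataclasses import dataclass, field


@dataclass
class LicenseMatch:
    # Hypothetical, minimal set of fields for illustration.
    license_expression: str
    path: str
    start_line: int
    end_line: int


@dataclass
class LicenseDetection:
    matches: list = field(default_factory=list)

    @property
    def license_expression(self):
        # Keep each expression once, in order of first appearance.
        unique = list(dict.fromkeys(m.license_expression for m in self.matches))
        if len(unique) == 1:
            return unique[0]
        # Parenthesize compound expressions before joining them.
        return " AND ".join(f"({e})" if " " in e else e for e in unique)


# Matches may come from different files.
detection = LicenseDetection(matches=[
    LicenseMatch("mit", "src/main.c", 1, 3),
    LicenseMatch("apache-2.0 OR gpl-2.0", "LICENSE", 1, 20),
])
print(detection.license_expression)
# prints mit AND (apache-2.0 OR gpl-2.0)
```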


LicenseMatch Result Data
------------------------

LicenseMatch data is currently based on a ``license key`` instead of being based
on a ``license-expression``.

So if a ``mit and apache-2.0`` license expression is detected from a single
LicenseMatch, we currently add two entries in the ``licenses`` list for that
resource, one for each license key (here ``mit`` and ``apache-2.0`` respectively).
This repeats the match details, as the two entries are identical except for the
license key. This is wrong.

We should add only one entry per match (and therefore per ``rule``), and the
primary attribute should be the ``license-expression`` rather than the ``license-key``.

We also create a mapping inside a mapping in these license details to refer to the
license rule (and there are other inconsistencies in how we report here). We should
just report a flat mapping here (with a list at the end for each of the license keys).
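
The difference between the two shapes can be sketched as below. The field
names, the rule identifier and the helper functions are hypothetical and only
illustrate the idea; they are not the actual ScanCode output schema.

```python
import re

OPERATORS = {"and", "or", "with"}


def license_keys(expression):
    """Return the license keys of an expression, skipping operators."""
    tokens = re.findall(r"[A-Za-z0-9_.+-]+", expression)
    return [t for t in tokens if t.lower() not in OPERATORS]


match = {
    "license_expression": "mit and apache-2.0",
    "rule_identifier": "mit_and_apache-2.0_1.RULE",  # hypothetical rule name
    "score": 100.0,
    "start_line": 1,
    "end_line": 3,
}


def current_entries(match):
    # Current shape: one entry per license key, repeating the match details.
    details = {k: v for k, v in match.items() if k != "license_expression"}
    return [{"key": key, **details} for key in license_keys(match["license_expression"])]


def proposed_entry(match):
    # Proposed shape: one flat entry per match, with the keys listed last.
    return {**match, "license_keys": license_keys(match["license_expression"])}


print(len(current_entries(match)))
# prints 2
print(proposed_entry(match)["license_keys"])
# prints ['mit', 'apache-2.0']
```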


Only reference License related Data
-----------------------------------

Currently all license-related data is inlined in each match, and this repeats
a lot of information. This repetition exists at three levels:

- License Data
- LicenseDB Data
- LicenseDetection Data

If we introduce a new command line option ``--licenses-reference``, which of these
should we reference: just License/LicenseDB data, just LicenseDetection level data,
or all of them?

License Data
^^^^^^^^^^^^

This is referencing data related to whole licenses, referenced by their license key.

Example: ``apache-2.0``

Other attributes are its full text, links to origin, LicenseDB, SPDX, OSI etc.


LicenseDB Data
^^^^^^^^^^^^^^

This is referencing data related to a LicenseDB entry,
i.e. the identifier is a ``RULE`` or a ``LICENSE`` file.

Example: ``apache-2.0_2.RULE``

Other attributes are its license-expression, the boolean fields, length, relevance etc.


LicenseDetection Data
^^^^^^^^^^^^^^^^^^^^^

This is referencing by LicenseDetection. Each one has one or more LicenseMatches.

The identifier is a hash/uuid field computed from a nested tuple of select attributes.

A single entry will represent each unique LicenseDetection, even if the same detection is
present across multiple files.

Attributes will be:

- File regions where these are found (file path + start and end line)
- Score, matched length, matcher (like ``1-hash``, ``2-aho``), and matched text.
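
One possible way to compute such an identifier is sketched below. Which
attributes go into the nested tuple, and the choice of SHA-1 folded into a
UUID, are assumptions of this sketch; the text above only says the identifier
is a hash/uuid computed from a nested tuple of select attributes.

```python
import hashlib
import uuid


def detection_identifier(license_expression, matches):
    """Return a stable identifier for a detection, independent of file paths."""
    # Nested tuple of select attributes: the expression, then one tuple
    # per match. File regions are left out on purpose so the same
    # detection found in several files gets the same identifier.
    content = (
        license_expression,
        tuple(
            (m["rule_identifier"], m["score"], m["matched_length"])
            for m in matches
        ),
    )
    digest = hashlib.sha1(repr(content).encode("utf-8")).hexdigest()
    # Fold the first 128 bits of the digest into a UUID-shaped string.
    return str(uuid.UUID(hex=digest[:32]))


matches = [
    {"rule_identifier": "epl-2.0_2.RULE", "score": 100.0, "matched_length": 12},
]
in_file_a = detection_identifier("epl-2.0", matches)
in_file_b = detection_identifier("epl-2.0", list(matches))
print(in_file_a == in_file_b)
# prints True
```

The file regions where the detection occurs would then be stored as attributes
of the referenced entry rather than as part of its identity.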


What should be the default option?
----------------------------------

Two changes were long-planned and should be default:

- LicenseDetections in the results
- LicenseMatch being for a ``license-expression``

This is already a lot of change, so also making the referenced details the default doesn't
make sense IMHO.

- We surely need to have the details inlined as an option, because otherwise it will be the
  responsibility of downstream tools to fetch these details and inline them.

We can always make the details referenced as the default option in a later release after more
testing and feedback. So we can then have the ``--licenses-reference`` command line option,
which removes the details and puts them in a top-level list, with the details inlined by
default.
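
The effect of moving details into a top-level list could look roughly like the
sketch below. The function, the ``license_rule_references`` key and the chosen
rule-level fields are hypothetical; this only illustrates de-duplicating
inlined details and is not the behavior of an existing ScanCode option.

```python
def reference_rule_details(files, detail_fields=("relevance", "rule_length")):
    """Move rule-level details out of each match into one top-level list."""
    references = {}
    slim_files = []
    for scanned in files:
        slim_matches = []
        for match in scanned["matches"]:
            rule_id = match["rule_identifier"]
            # Store the details once per rule, keyed by its identifier.
            references.setdefault(
                rule_id,
                {"rule_identifier": rule_id, **{k: match[k] for k in detail_fields}},
            )
            # Keep only the non-detail fields inline in the match.
            slim_matches.append(
                {k: v for k, v in match.items() if k not in detail_fields}
            )
        slim_files.append({**scanned, "matches": slim_matches})
    return {
        "files": slim_files,
        "license_rule_references": list(references.values()),
    }


inlined = [
    {"path": "a.c", "matches": [
        {"rule_identifier": "mit_1.RULE", "license_expression": "mit",
         "relevance": 100, "rule_length": 161},
    ]},
    {"path": "b.c", "matches": [
        {"rule_identifier": "mit_1.RULE", "license_expression": "mit",
         "relevance": 100, "rule_length": 161},
    ]},
]
referenced = reference_rule_details(inlined)
print(len(referenced["license_rule_references"]))
# prints 1
```

With the details inlined by default, downstream tools that want the compact
form would opt in, rather than every consumer having to re-inline the data.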
