Add scancode evaluation

Merged Dylan Aïssi requested to merge wip/daissi/scancode into master
+++
title = "Scancode evaluation"
short-description = "Evaluate switching from scan-copyrights to scancode"
weight = 100
outputs = [ "html", "pdf-in",]
date = "2024-04-10"
+++
Currently, [scan-copyrights](https://tracker.debian.org/pkg/libconfig-model-dpkg-perl)
(which uses [licensecheck](https://tracker.debian.org/pkg/licensecheck) under
the hood) is used in Apertis to scan copyright/license notices. This tool has
some downsides, so we are evaluating [scancode-toolkit](https://github.com/nexB/scancode-toolkit)
as a replacement.
A comparison of `licensecheck` and `scancode` is available on the
[ScanCode website](https://scancode-toolkit.readthedocs.io/en/stable/misc/faq.html#how-is-scancode-different-from-debian-licensecheck).
TL;DR: *scancode is more accurate but slower*.
`scancode-toolkit` has an option to export results in the [DEP5 format](https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/)
(see [GH#472](https://github.com/nexB/scancode-toolkit/issues/472)), which is the
format currently used by the Apertis license tooling. This means that `scancode-toolkit`
is potentially compatible with the rest of the Apertis licensing tooling.
# Scancode installation
`scancode` is not available as a Debian package ([GH#1580](https://github.com/nexB/scancode-toolkit/issues/1580)
and [GH#3253](https://github.com/nexB/scancode-toolkit/issues/3253)) nor as a
Docker image ([GH#3026](https://github.com/nexB/scancode-toolkit/issues/3026)),
but a [Dockerfile](https://github.com/nexB/scancode-toolkit/blob/develop/Dockerfile)
is [provided by upstream](https://scancode-toolkit.readthedocs.io/en/latest/getting-started/install.html#installation-via-docker).
This means we can either create our own Docker image or reuse the
OSS Review Toolkit (ORT) Docker image, which integrates `scancode`. Since the ORT
Docker image used in our pipeline is outdated, it is easier for now
to decouple scancode from the ORT Docker image and avoid relying on an outdated
scancode (the version in the ORT image is [one year old](https://github.com/nexB/scancode-toolkit/releases/tag/v31.2.4)).
Here are the steps to build a docker image:
```sh
git clone https://github.com/nexB/scancode-toolkit
cd scancode-toolkit
LATEST_VER=v32.0.8
git checkout $LATEST_VER
docker build --tag scancode-toolkit --tag scancode-toolkit:$LATEST_VER .
```
# Scancode output format
Scancode is able to write its output in different formats:
```
docker run scancode-toolkit --help
...
output formats:
--json FILE Write scan output as compact JSON to FILE.
--json-pp FILE Write scan output as pretty-printed JSON to FILE.
--json-lines FILE Write scan output as JSON Lines to FILE.
--yaml FILE Write scan output as YAML to FILE.
--csv FILE [DEPRECATED] Write scan output as CSV to FILE. The
--csv option is deprecated and will be replaced by
new CSV and tabular output formats in the next
ScanCode release. Visit
https://github.com/nexB/scancode-toolkit/issues/3043
to provide inputs and feedback.
--html FILE Write scan output as HTML to FILE.
--custom-output FILE Write scan output to FILE formatted with the custom
Jinja template file.
--debian FILE Write scan output in machine-readable Debian
copyright format to FILE.
--custom-template FILE Use this Jinja template FILE as a custom template.
--cyclonedx FILE Write scan output in CycloneDX JSON format to FILE.
--cyclonedx-xml FILE Write scan output in CycloneDX XML format to FILE.
--spdx-rdf FILE Write scan output as SPDX RDF to FILE.
--spdx-tv FILE Write scan output as SPDX Tag/Value to FILE.
...
```
These formats include:
- Debian DEP5 format, the one already in use with the Apertis licensing tooling.
- YAML, widely used in Apertis and already used to teach `scan-copyrights` how to
handle license detection issues (i.e. `debian/apertis/copyright.yml`).
- SPDX, an open standard for communicating SBOM information.
- CycloneDX, another SBOM standard.
Initially, it should be simpler to continue using the Debian DEP5 format since the
whole Apertis licensing tooling relies on it. In the long term, we may want to
switch to a more widely used format like `SPDX` or `CycloneDX`, which are also
compatible with `ORT` and other tools. This should make the Apertis license/SBOM
processes more flexible.
# Select Apertis packages to evaluate scancode
Let's use packages that are wrongly detected by `scan-copyrights` (i.e. packages
that use `override-license` in `debian/apertis/copyright.yml`).
Here is a small random list of packages based on a local grep of `override-license`:
debianutils, libarchive, libgdata, libunistring, libusb, nss, nss-pem, openjpeg2,
openssl, xorg-server.
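The local grep mentioned above can be sketched as follows. This is only an illustration: it assumes that every package repository is already cloned as a sub-directory of a common folder, and the function name `list_override_pkgs` is made up for this example.

```sh
# List packages whose debian/apertis/copyright.yml uses "override-license".
# Assumption: each package is checked out as a sub-directory of $1.
list_override_pkgs() {
    pkg_root=${1:-.}
    for yml in "$pkg_root"/*/debian/apertis/copyright.yml
    do
        # Skip the unexpanded pattern when no package has such a file
        [ -f "$yml" ] || continue
        if grep -q 'override-license' "$yml"
        then
            # Drop the trailing path to keep only the package directory name
            pkg_dir=${yml%/debian/apertis/copyright.yml}
            basename "$pkg_dir"
        fi
    done
}
```

For instance, `list_override_pkgs ~/apertis-pkgs` (a hypothetical folder) prints one package name per line.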
# Run scan-copyrights as gold standard
First, `scan-copyrights` is run on each package in a v2025dev2 VM:
```sh
# Because of the use of "override-license" in debian/apertis/copyright.yml
LIST_PKGS=" debianutils libarchive libgdata libunistring libusb nss nss-pem openjpeg2 openssl xorg-server "
# Adding other well known pkgs
LIST_PKGS+=" pipewire rust-coreutils "
for PKG in $LIST_PKGS
do
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
/usr/bin/time -f "%e" scan-copyrights > ../${PKG}-scan-copyright 2> ../${PKG}-time
cd ..
done
```
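Once the loop is done, the timings can be gathered in one place. A minimal sketch, assuming the `<package>-time` files written by the loop above are all in one directory (`collect_times` is a hypothetical helper name, not part of any Apertis tool):

```sh
# Print "<package> <seconds>" for every <package>-time file found in $1.
# scan-copyrights may also write warnings to stderr, so only the last line
# (the one written by /usr/bin/time -f "%e") is kept.
collect_times() {
    for f in "${1:-.}"/*-time
    do
        # Skip the unexpanded pattern when there is no timing file
        [ -f "$f" ] || continue
        pkg=$(basename "$f" -time)
        printf '%s %s\n' "$pkg" "$(tail -n 1 "$f")"
    done
}
```

Calling `collect_times .` from the directory holding the clones then gives the raw values of the `scan-copyrights` timing column.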
# Run scancode
Now, `scancode` is run while excluding `debian/copyright` and `debian/apertis/copyright`,
because these files can easily confuse `scancode` (see [GH#2885](https://github.com/nexB/scancode-toolkit/issues/2885#issuecomment-1136268172)).
```sh
# Because of the use of "override-license" in debian/apertis/copyright.yml
LIST_PKGS=" debianutils libarchive libgdata libunistring libusb nss nss-pem openjpeg2 openssl xorg-server "
# Adding other well known pkgs
LIST_PKGS+=" pipewire rust-coreutils firefox-esr"
for PKG in $LIST_PKGS
do
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
# DEP5 output
docker run -v $PWD/:/project scancode-toolkit \
--copyright --license --license-text --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/copyright \
--ignore */debian/apertis/${PKG}-scancode-copyright \
-n 8 --debian /project/debian/apertis/${PKG}-scancode-copyright \
/project/.
# YAML output
docker run -v $PWD/:/project scancode-toolkit \
--copyright --license --license-text --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/copyright \
--ignore */debian/apertis/${PKG}-scancode-copyright \
-n 8 \
--yaml /project/debian/apertis/${PKG}-scancode-copyright-yaml \
/project/.
cd ..
done
```
# Analysis time
This analysis was performed on an XPS13-9310 laptop with an i7-1185G7 CPU
(@3.00GHz×8), 16 GB of RAM and an SSD.
| Package | Time scan-copyrights | Time scancode | Diff |
| -------- | ------- | ------- | ------- |
| debianutils | 1.3 s | 38 s | ~29 times slower |
| libarchive | 6.6 s | 7 m 25 s | ~67 times slower |
| libgdata | 4.7 s | 3 m 55 s | ~50 times slower |
| libunistring | 8.1 s | 10 m 48 s | ~80 times slower |
| libusb | 1.2 s | 54 s | ~45 times slower |
| nss | 25.8 s | 28 m 47 s | ~67 times slower |
| nss-pem | 28.6 s | 26 m 59 s | ~56 times slower |
| openjpeg2 | 3.7 s | 3 m 8 s | ~50 times slower |
| openssl | 23.6 s | 18 m 3 s | ~46 times slower |
| pipewire | 6.1 s | 4 m 57 s | ~48 times slower |
| rust-coreutils| 4.4 s | 1 m 54 s | ~26 times slower |
| xorg-server | 9.1 s | 9 m 24 s | ~62 times slower |
| firefox-esr* | XX s | OOM killed after ~ 1 d [1] | ~XX times slower |
* `firefox-esr` is one of the biggest packages in Apertis, but it is
not in the `target` repository. Thus, we would not have to analyze it with
`scancode`; it is used here to evaluate scancode in the worst case.
- [1] scancode ran on `firefox-esr` for ~ 23 hours and 30 mins before being OOM
killed. It seems that the scan itself was over and scancode was processing data to
generate its output file when it was killed: its parallel scanning processes had
stopped, only the main process was running, and RAM usage was at ~ 3 GB (of 16
GB available) about 10 mins before the OOM.
- While doing the analysis with 8 parallel processes, all of them were at 100 % CPU
during the entire analysis time, so the CPU is at least one of the bottlenecks.
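The last column of the table is simply the ratio between the two timings. As an illustration, the arithmetic can be reproduced with the sketch below; `to_seconds` and `slowdown` are hypothetical helper names, and the duration format is assumed to be the one used in the table (`7 m 25 s`, `38 s`, `18 m`).

```sh
# Convert a duration such as "7 m 25 s", "38 s" or "18 m" into seconds.
to_seconds() {
    echo "$1" | awk '{
        total = 0
        # Fields come in "<value> <unit>" pairs
        for (i = 1; i < NF; i += 2) {
            if ($(i + 1) == "m") total += $i * 60
            else if ($(i + 1) == "s") total += $i
        }
        print total
    }'
}

# Ratio between the scancode time ($2) and the scan-copyrights time ($1),
# rounded to the nearest integer.
slowdown() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", b / a }'
}
```

For `libarchive`, `slowdown "6.6 s" "7 m 25 s"` prints `67` (445 s / 6.6 s). The values in the table are approximate, so a few of them differ by one from this rounding.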
## Analysis time with --processes from 1 to 8
From the [scancode options](https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/core-options.html):
```
-n, --processes INTEGER
Scan <input> using n parallel processes. [Default: 1]
```
This option allows several processes to be used for scanning files.
```sh
PKG="debianutils"
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
for N in {1..8}
do
# YAML output
docker run -v $PWD/:/project scancode-toolkit \
--copyright --license --license-text --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/* \
-n ${N} \
--yaml /project/debian/apertis/${PKG}-${N}-scancode-copyright-yaml \
/project/.
done
```
| N processes | Time scancode |
| -------- | ------- |
| 1 | 1 m 34.5 s |
| 2 | 58.2 s |
| 3 | 45.4 s |
| 4 | 39.0 s |
| 5 | 38.1 s |
| 6 | 37.2 s |
| 7 | 36.7 s |
| 8 | 36.0 s |
Adding more parallel processes improves the scanning time, but it seems we are
reaching a threshold at ~ 4 parallel processes, beyond which adding more processes
only slightly improves the scanning time. This may be due to the fact that the tested
package is *small*. For a bigger package like `firefox-esr`, this threshold may
be higher and it could be beneficial to have more parallel processes.
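To put numbers on this threshold, the speed-up (single-process time divided by the time with N processes) and the parallel efficiency (speed-up divided by N) can be computed from the table. The helper below is only a sketch with a made-up name (`speedup`); it takes all times in seconds.

```sh
# Speed-up and parallel efficiency of a run with $2 processes taking $3 s,
# relative to the single-process time $1 (all times in seconds).
speedup() {
    awk -v t1="$1" -v n="$2" -v tn="$3" \
        'BEGIN { s = t1 / tn; printf "%.2f %.0f%%\n", s, 100 * s / n }'
}
```

With the measurements above (1 m 34.5 s = 94.5 s), `speedup 94.5 4 39.0` prints `2.42 61%`, while 8 processes only bring the speed-up to about 2.6, i.e. roughly 33 % efficiency.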
## Analysis time with --timeout X
From the [scancode options](https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/core-options.html):
```
--timeout FLOAT
Stop scanning a file if scanning takes longer than a timeout in seconds. [Default: 120]
```
This option prevents scancode from getting stuck on a single file for too long.
```sh
PKG="debianutils"
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
for N in 120 110 100 90 80 70 60 30 10
do
# YAML output
docker run -v $PWD/:/project scancode-toolkit \
--copyright --license --license-text --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/* \
--timeout ${N} -n 8 \
--yaml /project/debian/apertis/${PKG}-${N}-scancode-copyright-yaml \
/project/.
done
```
| Timeout | Time scancode |
| -------- | ------- |
| 120 (default) | 38.5 s |
| 110 | 43.6 s |
| 100 | 44.1 s |
| 90 | 39.9 s |
| 80 | 40.3 s |
| 70 | 38.3 s |
| 60 | 37.8 s |
| 30 | 37.9 s |
| 10 | 29.6 s |
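In this test, only the lowest timeout (10 s) clearly reduces the scanning time
(29.6 s vs 38.5 s with the default); the runs with timeouts between 30 s and
110 s take between 37.8 s and 44.1 s, with no clear trend. Since some files are no
longer fully scanned with a short timeout, a more comprehensive comparison of
detected licenses should be done to ensure we are not losing too much data.
The following run applies the 10 s timeout to `firefox-esr`: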
```sh
PKG="firefox-esr"
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
date
docker run -v $PWD/:/project scancode-toolkit \
--copyright --license --license-text --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/* \
--timeout 10 -n 8 \
--yaml /project/debian/apertis/${PKG}-scancode-timeout-10-copyright-yaml \
/project/.
date
```
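The comparison of detected licenses called for above could start from a per-report summary of license expressions. The sketch below is an assumption-laden illustration: it supposes that file-level results are stored under the `detected_license_expression_spdx` key of the YAML report (as in the scancode 32.x output we looked at), it uses a plain text match rather than a real YAML parser, and `license_counts` and `compare_reports` are hypothetical names.

```sh
# Count how often each SPDX license expression appears in a scancode YAML
# report. Assumption: file-level results use the
# "detected_license_expression_spdx" key; empty or null values are skipped.
license_counts() {
    sed -n 's/^ *detected_license_expression_spdx: *//p' "$1" |
        awk '$0 != "" && $0 != "null" { count[$0]++ }
             END { for (expr in count) print expr "\t" count[expr] }' |
        LC_ALL=C sort
}

# Show the expressions whose counts differ between two reports.
# Returns 0 when both summaries are identical, like diff does.
compare_reports() {
    ref=$(mktemp) && new=$(mktemp) || return 1
    license_counts "$1" > "$ref"
    license_counts "$2" > "$new"
    diff "$ref" "$new"
    status=$?
    rm -f "$ref" "$new"
    return $status
}
```

For example, `compare_reports debian/apertis/debianutils-120-scancode-copyright-yaml debian/apertis/debianutils-10-scancode-copyright-yaml` would list the expressions that are detected more or less often with the 10 s timeout than with the default one.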
## Analysis time with --max-in-memory 0
From the [scancode options](https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/core-options.html):
```
--max-in-memory INTEGER
Maximum number of files and directories scan details kept in memory during a
scan. Additional files and directories scan details above this number are
cached on-disk rather than in memory. Use 0 to use unlimited memory and
disable on-disk caching. Use -1 to use only on-disk caching. [Default: 10000]
```
Based on an upstream issue (see [GH#1014](https://github.com/nexB/scancode-toolkit/issues/1014)),
the disk cache seems to be really slow.
```sh
# Because of the use of "override-license" in debian/apertis/copyright.yml
LIST_PKGS=" debianutils libarchive libgdata libunistring libusb nss nss-pem openjpeg2 openssl xorg-server "
# Adding other well known pkgs
LIST_PKGS+=" pipewire rust-coreutils firefox-esr"
for PKG in $LIST_PKGS
do
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
# YAML output
docker run -v $PWD/:/project scancode-toolkit \
--copyright --license --license-text --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/* \
--max-in-memory 0 -n 8 \
--yaml /project/debian/apertis/${PKG}-scancode-copyright-yaml \
/project/.
cd ..
done
```
| Package | Time scan-copyrights | Time scancode | Diff |
| -------- | ------- | ------- | ------- |
| debianutils | 1.3 s | 34 s | ~26 times slower |
| libarchive | 6.6 s | 6 m 41 s | ~60 times slower|
| libgdata | 4.7 s | 3 m 35 s | ~46 times slower |
| libunistring | 8.1 s | 11 m 12 s | ~83 times slower |
| libusb | 1.2 s | 54 s | ~45 times slower |
| nss | 25.8 s | 27 m 47 s | ~66 times slower |
| nss-pem | 28.6 s | 26 m 49 s | ~57 times slower |
| openjpeg2 | 3.7 s | 3 m 14 s | ~52 times slower |
| openssl | 23.6 s | 18 m | ~46 times slower |
| pipewire | 6.1 s | 5 m 29 s | ~54 times slower |
| rust-coreutils| 4.4 s | 2 m 3 s | ~28 times slower |
| xorg-server | 9.1 s | 10 m 9 s | ~67 times slower |
| firefox-esr* | XXX s | OOM killed after ~ 1 d [1] | ~XX times slower |
Passing `--max-in-memory 0` to `scancode` doesn't improve scanning time: these
results are of the same order of magnitude (+/- random fluctuation) as the
ones without this option.
## Analysis time with ONLY --license
i.e. without `--copyright --license-text`
```sh
PKG="nss"
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
# YAML output
docker run -v $PWD/:/project scancode-toolkit \
--license --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/* \
-n 8 \
--yaml /project/debian/apertis/${PKG}-scancode-ONLYlicense-copyright-yaml \
/project/.
```
| Package | Time scan-copyrights | Time scancode with copyright | Time scancode without copyright |
| -------- | ------- | ------- | ------- |
| nss | 25.8 s | 27 m 47 s | 21 m 42 s |
Skipping the copyright scan (i.e. detecting licenses only) decreases the scanning
time for `nss` from 27 m 47 s to 21 m 42 s, i.e. by about 22 %.
# Reliability of detected license
*Some of the `debian/apertis/copyright.yml` files used are no longer required
since the Bookworm rebase, so not all analyzed packages still have a problematic file
that can be used to compare `scan-copyrights` and `scancode`.*
| Package | File | Actual license | Detected license (scancode) | Detected license (scan-copyrights) |
| -------- | ------- | ------- | ------- | ------- |
| libarchive | [shar.1](https://gitlab.apertis.org/pkg/libarchive/-/blob/13653b9a562c658133b37e5859693898a7b929e8/contrib/shar/shar.1) | BSD-4-Clause-UC | BSD-4-Clause-UC | BSD-4-Clause-UC [0] |
| libgdata | [README](https://gitlab.apertis.org/pkg/libgdata/-/blob/6e88c1a214f86c3025b0124a97b557c3710df339/README) | LGPL-2.1-or-later | LGPL-2.0-or-later | LGPL |
| libunistring | [version.c](https://gitlab.apertis.org/pkg/libunistring/-/blob/d4765b859b6ca17f0ea885d3ec9adea691dc1133/lib/version.c) | LGPL-3.0-or-later OR GPL-2.0-or-later | LGPL-3.0-or-later OR GPL-2.0-or-later | LGPL |
| libusb | [06_bsd.diff](https://gitlab.apertis.org/pkg/libusb/-/blob/5890eb277e80734cea681e58c52712a06516d85b/debian/patches/06_bsd.diff) | BSD-2-Clause [1] | BSD-4-Clause | [1.1] |
| nss | [derdump.1](https://gitlab.apertis.org/pkg/nss/-/blob/4a93da238116e085ad1d949b650e9ec32d98038a/nss/doc/nroff/derdump.1) | MPL-2.0 [2] | MPL-2.0 | MPL-2.0 |
| nss-pem | [doc/rst/legacy/*](https://gitlab.apertis.org/pkg/nss-pem/-/blob/65192a7e89364a15fadb7b2c832a72a0efa16e63/nss/nss/doc/rst/legacy/nss_3.12.2_release_notes.html/index.rst) | MPL-1.1 OR GPL-2.0-only OR LGPL-2.1-only | MPL-1.1 OR GPL-2.0-only OR LGPL-2.1-only | MPL-2.0 |
| openjpeg2 | [opj_getopt.c](https://gitlab.apertis.org/pkg/openjpeg2/-/blob/a1d3412b2721a1d7dbd7cd3d7f13bd6647bda7f5/src/bin/common/opj_getopt.c) | BSD-3-Clause | (BSD-2-Clause AND LicenseRef-scancode-proprietary-license) AND BSD-3-Clause | BSD-3-clause |
| openssl | [cmll-x86*.pl](https://gitlab.apertis.org/pkg/openssl/-/blob/f603f6ae103b26ef89089fc63945e833c0401839/crypto/camellia/asm/cmll-x86.pl) | Apache-2.0 OR GPL-2.0-or-later OR LGPL-2.1-or-later OR MPL-1.1 OR BSD-2-Clause | OpenSSL AND (GPL-2.0-or-later OR LGPL-2.1-or-later OR MPL-1.1 OR BSD-3-Clause) | Apache-2.0 and/or GPL-2+ |
| xorg-server | [hw/xwin/winprefsyacc.*](https://gitlab.apertis.org/pkg/xorg-server/-/blob/2d6db144d7c3393d2f3f113168b4f421192c90c8/hw/xwin/winprefsyacc.c)| GPL-3.0-or-later WITH Bison-exception-2.2 AND LicenseRef-scancode-xfree86-1.0 | GPL-3.0-or-later WITH Bison-exception-2.2 AND LicenseRef-scancode-xfree86-1.0 [3] | GPL-3+ with Bison-2.2 exception |
- [0] scan-copyrights has improved in Bookworm.
- [1] Retrospective change: https://www.netbsd.org/about/redistribution.html#why2clause
- [1.1] BSD-2-Clause-NetBSD and/or BSD-2-clause and/or BSD-3-clause and/or FSFUL and/or FSFULLR and/or GPL-2 and/or LGPL-2 and/or X11
- [2] Simplified upstream with Bookworm, `debian/apertis/copyright.yml` is outdated.
- [3] A second license appears later in the file.
`scancode` is better at dealing with complex license combinations, especially
because it scans the whole file and not only the first lines. Moreover, it
reports all detected licenses with a matching score (available in the YAML
output).
# No license deduction for project and folder
While `scan-copyrights` is able to perform some deduction of license/copyright
for a project and/or folder, `scancode` only performs scanning at file level.
For instance, `scan-copyrights` gives the following result for `openssl`:
```
Files: *
Copyright: 1998-2023, The OpenSSL Project
1995-1998, Eric A. Young, Tim J. Hudson
License: Apache-2.0
...
Files: crypto/ec/asm/*
Copyright: 1998-2023, The OpenSSL Project Authors.
License: Apache-2.0 and/or OpenSSL
```
This result tells us that the project is under the `Apache-2.0` license
and that files in `crypto/ec/asm/*` are under the `Apache-2.0 and/or OpenSSL`
licenses.
This behavior allows a license to be assigned to files by inheriting it from the
license of the project (or from the license of a higher-level folder).
Some projects don't add license/copyright information to all of their files, which
is a problem for `scancode` as it won't be able to detect the right license.
We would need to add this logic to `scancode` or to another Apertis script
(like [ci-license-scan](https://gitlab.apertis.org/infrastructure/apertis-docker-images/-/blob/apertis/v2025dev2/package-source-builder/overlay/usr/bin/ci-license-scan)?).
Some related upstream issues:
- [Proposal: Scan deduction and summarization](https://github.com/nexB/scancode-toolkit/issues/377)
- [Primary license detections not shown properly in debian_copyright](https://github.com/nexB/scancode-toolkit/issues/3424)
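To make the inheritance idea more concrete, here is a minimal sketch of what such logic could look like. It is not how `scan-copyrights` implements its deduction and it is not an existing scancode feature; the tab-separated input files and the `inherit_licenses` name are assumptions made for this example.

```sh
# inherit_licenses DIR_LICENSES FILE_LICENSES
#   DIR_LICENSES:  "<directory><TAB><license>" lines, "." being the project root
#   FILE_LICENSES: "<path><TAB><license>" lines, license empty when undetected
# Prints "<path><TAB><license>", where files without a detected license take
# the license of their closest parent folder, or UNKNOWN if there is none.
inherit_licenses() {
    awk -F '\t' '
    NR == FNR { dir_license[$1] = $2; next }
    {
        license = $2
        dir = $1
        # Walk up the directory tree until a folder with a known license
        while (license == "" && dir != ".") {
            # Strip the last path component; fall back to the project root
            if (sub(/\/[^\/]*$/, "", dir) == 0) dir = "."
            if (dir in dir_license) license = dir_license[dir]
        }
        if (license == "") license = "UNKNOWN"
        print $1 "\t" license
    }' "$1" "$2"
}
```

With `.` mapped to `Apache-2.0` and `crypto/ec/asm` mapped to `Apache-2.0 and/or OpenSSL`, a file under `crypto/ec/asm/` without a detected license gets `Apache-2.0 and/or OpenSSL`, any other undetected file gets `Apache-2.0`, and files with their own detected license are left untouched.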
## Statistics on files without a detected license
The YAML file generated by scancode is sometimes malformed due to `--license-text`
(see [GH#3219](https://github.com/nexB/scancode-toolkit/issues/3219), which does not seem to be enough to fix it),
so the scan below only uses `--license`.
```sh
# Because of the use of "override-license" in debian/apertis/copyright.yml
LIST_PKGS=" debianutils libarchive libgdata libunistring libusb nss nss-pem openjpeg2 openssl xorg-server "
# Adding other well known pkgs
LIST_PKGS+=" pipewire rust-coreutils firefox-esr"
for PKG in $LIST_PKGS
do
git clone https://gitlab.apertis.org/pkg/${PKG}.git
cd ${PKG}
# YAML output
docker run -v $PWD/:/project scancode-toolkit \
--license --strip-root \
--ignore */debian/copyright --ignore */debian/apertis/copyright \
--ignore */debian/apertis/${PKG}-scancode-copyright \
-n 8 \
--yaml /project/debian/apertis/${PKG}-scancode-copyright-yaml \
/project/.
cd ..
done
```
| Package | Number of files | Files with a detected license | Files with a detected license (%) |
| -------- | ------- | ------- | ------- |
| debianutils | 134 | 44 | 32.8 % |
| libarchive | 1420 | 716 | 50.4 % |
| libgdata | 705 | 326 | 46.2 % |
| libunistring | 2118 | 1961 | 92.5 % |
| libusb | 96 | 30 | 31.2 % |
| nss | 4531 | 2393 | 52.8 % |
| nss-pem | 4574 | 2423 | 52.9 % |
| openjpeg2 | 478 | 334 | 69.8 % |
| openssl | 4655 | 3349 | 72 % |
| pipewire | 1251 | 952 | 76 % |
| rust-coreutils| 1311 | 326 | 24.8 % |
| xorg-server | 1791 | 1227 | 68.5 % |
| firefox-esr* | XXX | XXX | XXX |
# DEP5 invalid format
`scancode` generates malformed `Files` stanzas. As defined in the
[debian/copyright specification](https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/#files-stanza),
each `Files` stanza is composed of mandatory fields (i.e. `Files`, `Copyright`
and `License`) and one optional field (i.e. `Comment`). For instance:
```
Files: Xext/sleepuntil.h
Copyright: 1993-2003, The XFree86 Project, Inc.
License: Expat
```
When `scancode` is not able to determine a copyright or a license for a file,
it creates a stanza with only the `Files` field, whereas `scan-copyrights` fills
the missing field with `UNKNOWN`.
Here is an example of a malformed stanza generated by `scancode`:
```
Files: CODE_OF_CONDUCT.md
```
Here is another example from `scan-copyrights` where a missing field is filled:
```
Files: xkb/Makefile.in
Copyright: 1994-2021, Free Software Foundation, Inc.
License: UNKNOWN
```
This issue should be easy to fix in `scancode` by filling missing fields with
an `UNKNOWN` value.
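Until this is fixed upstream, the same result could be obtained by post-processing the generated file. The following is a minimal sketch rather than a tested part of the Apertis tooling: it assumes stanzas are separated by blank lines, as in the DEP5 format, and `fill_unknown_fields` is a hypothetical name.

```sh
# Add "Copyright: UNKNOWN" and/or "License: UNKNOWN" to every Files stanza
# of the DEP5 file $1 that lacks the field. Other stanzas (header, standalone
# License stanzas) are printed unchanged.
fill_unknown_fields() {
    # RS = "" switches awk to paragraph mode: one record per stanza
    awk 'BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }
    {
        has_files = has_copyright = has_license = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^Files:/) has_files = 1
            if ($i ~ /^Copyright:/) has_copyright = 1
            if ($i ~ /^License:/) has_license = 1
        }
        stanza = $0
        if (has_files && !has_copyright) stanza = stanza "\nCopyright: UNKNOWN"
        if (has_files && !has_license) stanza = stanza "\nLicense: UNKNOWN"
        print stanza
    }' "$1"
}
```

Applied to the malformed stanza above, this turns `Files: CODE_OF_CONDUCT.md` into a stanza that also carries `Copyright: UNKNOWN` and `License: UNKNOWN`.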
# scancode output format
Support for the `DEP5` format is incomplete in `scancode`, but the biggest issues
can probably be fixed easily.
The `YAML` format gives far more information, such as the lines of the detected
licenses, the matched pattern, a matching score, several identifiers for the
detected licenses, etc.
Having all of this information may be useful for future enhancements of the Apertis
license tooling.
# Summary
The different approaches the two tools take to file analysis explain why `scancode`
is slower but able to detect far more licenses than `scan-copyrights`
(see [upstream comparison](https://scancode-toolkit.readthedocs.io/en/stable/misc/faq.html#how-is-scancode-different-from-debian-licensecheck)):
- `licensecheck` is "a Perl script using hand-crafted regex patterns to find
typical copyright statements and about 50 common licenses";
- `scancode`'s detection "is based on a (large) number of license full texts
(~2100) and license notices, mentions and variants (~32,000) and is data-driven
as opposed to regex-driven. It detects and reports exactly where license text is
found in a file. Just throw in more license texts to improve the detection."
## Required resources for analysis
- `scancode` is ~50 times slower than `scan-copyrights` (between ~26 and ~80 times
in the measurements above).
- `scancode` requires much more RAM than `scan-copyrights`.
## License scan accuracy
- `scan-copyrights` has improved between Bullseye and Bookworm.
- `scancode` has a better detection for complex cases.
## Output format
- `DEP5`: incomplete support, but already used by the Apertis license tooling.
- `YAML`: seems a sensible alternative since it provides much more information,
but would require adapting the Apertis license tooling to this new format.
## Outdated debian/apertis/copyright.yml
This file needs to be refreshed in Apertis packages since the Bookworm
rebase: `scan-copyrights` has become smarter and some packages have fixed their
licensing issues.
# Proposed plan
Some general guidelines:
- We need to come up with a progressive approach; this is not something that will happen overnight
- Most of the packages are small and should take a reasonable amount of time to scan
- We should be able to selectively disable scancode when necessary
- We can add additional logic to only scan the files that have changed since last scan
Proposed plan to use scancode instead of scan-copyrights:
1. Update the docker image used to generate ORT reports in order to reuse it
for scancode.
Apertis uses a handcrafted docker image to generate ORT reports. Since this
image already contains scancode, it's possible to reuse it to run scancode.
The first step is to switch to an up-to-date [image provided by ORT](https://github.com/oss-review-toolkit/ort/pkgs/container/ort).
This step will require adjusting some scripts used by Apertis, including the
template used to generate ORT reports.
2. Fix the DEP5 format created by scancode by adding missing mandatory fields.
Scancode generates reports in the [DEP5 format](https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/).
Unfortunately, mandatory fields are missing when the copyright/license is not
detected (see [GH#3714](https://github.com/nexB/scancode-toolkit/issues/3714)).
Instead, scancode should fill missing fields with `UNKNOWN` or `no-info-found`,
as done by `scan-copyrights`.
3. Add support for "license deduction for project and folder" to scancode.
scancode is unable to deduce a license for a whole project/folder based on
the license of other files. Without this feature, ~ 50 % of files will have a
missing license, which is a regression compared to scan-copyrights. This task
consists in adding logic to scancode to deduce a license for a folder
and/or project.
4. Add a new job running `scancode` to the `ci-package-builder` pipeline in
parallel to the current `scan-licenses` job using `scan-copyrights`.
5. Generate a new `scancode` report for all packages in `target` using the job
added in the previous step.
6. In the SBOM logic, add preference to use the `scancode` report if available
otherwise use the one from `scan-copyrights`.
Some other tasks can be done in parallel:
- Investigate how to use caching to avoid scanning files already scanned in a
previous run.
- Investigate how to improve performance of scancode (speed and RAM usage).
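The preference described in step 6 boils down to a simple fallback. A minimal sketch, in which both report paths are supplied by the caller (no file naming scheme is assumed) and `pick_license_report` is a hypothetical name:

```sh
# Print the license report to use for a package: the scancode one ($1) when
# it exists and is not empty, the scan-copyrights one ($2) otherwise.
pick_license_report() {
    scancode_report=$1
    scan_copyrights_report=$2
    if [ -s "$scancode_report" ]
    then
        echo "$scancode_report"
    else
        echo "$scan_copyrights_report"
    fi
}
```

Testing for a non-empty file rather than mere existence means that a scancode job which was killed before writing its output does not hide the `scan-copyrights` report.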
# References
- [scancode-toolkit](https://github.com/nexB/scancode-toolkit)