We compared the license data of a few packages for v2023pre.
We observed differences in the license data reported by scan-copyrights and fossology.
Expected result
License data should be the same in scan-copyrights and fossology.
Actual result
Attachments
Observation captured in excel sheet will be added.
Management data
These kinds of differences need to be checked manually to see which is the right license. Please also note the license reported by scan-copyrights.
Just as an example, the Zlib package is licensed under its own license, while fossology reports "Public domain", most probably due to strings like "Not copyrighted -- provided to the public domain" which may confuse it.
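To shortlist the files that need such a manual check, the per-file results of the two tools can be compared mechanically. The following is only a minimal sketch: the function name `license_mismatches`, the dictionary layout and the file paths are assumptions made for illustration, and the real output of scan-copyrights and fossology would first have to be parsed into this shape.

```python
# Hypothetical per-file results; neither tool emits this format directly.
def license_mismatches(scan_a, scan_b):
    """Return {path: (license_a, license_b)} for files where the tools disagree."""
    mismatches = {}
    for path in sorted(set(scan_a) | set(scan_b)):
        a = scan_a.get(path, "UNKNOWN")
        b = scan_b.get(path, "UNKNOWN")
        # Compare case-insensitively so "zlib" and "Zlib" are not flagged.
        if a.strip().lower() != b.strip().lower():
            mismatches[path] = (a, b)
    return mismatches


# Illustrative data only, modelled on the Zlib / "Public domain" case above.
scan_copyrights = {"example/a.c": "Zlib", "example/b.c": "Zlib"}
fossology = {"example/a.c": "zlib", "example/b.c": "Public-domain"}

for path, (a, b) in license_mismatches(scan_copyrights, fossology).items():
    print(f"{path}: scan-copyrights={a} fossology={b}")
```

Only the files listed by such a comparison would then need to be opened and checked by hand.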
High level review comments with reference on the slides.
1- Tool tested only on script level on Apertis SDK by running complex packages taken from Apertis distribution.
2- Observed following open points (please provide the reference links for these findings):
Missing information section in scan copyright tool results is not 100% accurate - Page 11
Wrong copyright and license updated - Page 17
Unknown copyright/License updated - Page 16
Extra copyright/License details shown though not present in files - Page 12
Multiline Copyright text is not captured correctly - Page 15
Dual license captured text adding “and/or” - Page 15
Unknown license/copyright errors are displayed. - Page 13
3- Complex packages are taking more time to scan and generate the report
4- Overall License finding rate is 20% to 70% only.
Thanks again for sharing this report, which provides useful and more detailed information. As we have discussed in the past, I was waiting for the rebase on top of Debian Bookworm to check how the new versions of licensecheck and scan-copyrights were performing.
I haven't gone deeply into this analysis; instead, I took some of the packages you use in your report to investigate and also to check the difference between v2023 and v2024dev2.
One important thing to mention is that on Apertis, the output of scan-copyrights is processed by ci-license-scan, which tries to fill missing license and copyright information with the information available in debian/copyright. For this reason, the UNKNOWN mentions have less impact than described in the report.
Just as an example, you can check the copyright report for libxml2 which does not contain any UNKNOWN.
The script ci-license-scan also allows us to tweak or override license/copyright information, as described in the documentation.
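The fallback described above can be pictured with a small sketch. This is not the actual ci-license-scan code: the function `fill_unknown`, the dictionary layout and the sample entries are assumptions, used only to show how scanner results marked UNKNOWN could be completed from declared data such as debian/copyright.

```python
UNKNOWN = "UNKNOWN"


def fill_unknown(scanned, declared):
    """Fill UNKNOWN or empty fields from declared data; keep scanner findings otherwise."""
    merged = {}
    for path, entry in scanned.items():
        fallback = declared.get(path, {})
        merged[path] = {
            field: value if value and value != UNKNOWN else fallback.get(field, UNKNOWN)
            for field, value in entry.items()
        }
    return merged


# Illustrative data only: the scanner found a copyright holder but no license.
scanned = {"src/a.c": {"license": "UNKNOWN", "copyright": "2020 Jane Doe"}}
declared = {"src/a.c": {"license": "MIT", "copyright": "2019 Someone Else"}}

print(fill_unknown(scanned, declared))
```

In this sketch the scanner's own finding wins whenever it exists, and the declared data is used only to fill gaps; whether ci-license-scan applies the same precedence should be checked against its documentation.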
Besides that point, which I think has a huge impact on the overall result, the rest of the comments apply: the scanner is not as accurate as we would like.
Let me comment on the cons you described:
Missing information section in scan-copyrights tool results is not 100% accurate, “.c” & “.h” contains reference to other files for license & copyright
This is complemented with debian/copyright by ci-license-scan
− Unknown license → If the file contains reference to other files for license. For all files under test, po/makevars, package results shown is unknown license
This is complemented with debian/copyright by ci-license-scan
− Extra copyright details shown though not present in files
− Extra license details are captured though not present in files
Same as above.
− Multiple licenses & its copyright are not captured
Same as above.
− Package license can be found under filename “*”, but results captured is not 100% accurate
This is the default license of the package, which should be used if no more specific entry is found.
− Dual license is captured as “and/or” instead of just “OR”
Yes.
− Auto report generation for Denylisted OSS licenses with file path details is not there
Can you clarify?
− Capturing of all copyright details file wise in single file is not there
Can you clarify?
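On the point about the entry under filename “*”: in the machine-readable debian/copyright format, the last Files paragraph that matches a file applies to it, so a leading `Files: *` paragraph acts as the package default. The sketch below illustrates that lookup under simplifying assumptions: the stanzas are given as plain Python tuples rather than parsed from a real file, the name `license_for` is hypothetical, and `fnmatchcase` is only an approximation of the format's wildcard rules.

```python
from fnmatch import fnmatchcase


def license_for(path, stanzas):
    """Return the license of the last stanza with a pattern matching path, or None."""
    result = None
    for patterns, license_name in stanzas:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            result = license_name  # later stanzas override earlier ones
    return result


# Illustrative stanzas only: a package default followed by a specific entry.
stanzas = [
    (["*"], "MIT"),
    (["debian/*"], "GPL-2+"),
]

print(license_for("src/main.c", stanzas))    # falls back to the "*" default
print(license_for("debian/rules", stanzas))  # matched by the specific entry
```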
As a general conclusion, I believe that there is room for improvement in the scanning tools, which will be very welcome. At the same time, the report probably does not reflect reality, since the evaluation was done using scan-copyrights alone instead of ci-license-scan, whose output can be found in each repository at debian/apertis/copyright.
Additionally, after evaluating the new version of the tool in Bookworm, some improvements were found, but nothing substantial to share.
Thanks for the input.
You mentioned above that ci-license-scan addresses some of the issues reported for scan-copyrights, so we will cross-verify and update the OSS team.
− Auto report generation for Denylisted OSS licenses with file path details is not there
Handling of blacklisted licenses in ci-license-scan is the same as the Denylisted OSS part, so no action is required for this item.
− Capturing of all copyright details file wise in single file is not there
There is a requirement from the OSS team to capture the copyright details for all packages, so this won't be covered under our requirement.
As suggested above, we plan to evaluate our CI/CD-based license scan with respect to the OSS team's findings and create a separate task to handle any further findings; hence closing this task.