Overhaul needed for Package scan requests and indexing #49

JonoYang · 2023-03-15T19:32:27Z

I've set up purldb and scancode.io locally, where I run make run_visit and make run_map to visit and map Maven packages. I then run make request_scans and make process_scans to get information on the Resources in the packages we visit and map. I've noticed that the scan requests we send off to scancode.io are for multiple versions of the same package. This causes a few problems:

We cannot possibly scan every single package in the Maven index to create our matching index
In the case of the directory structure fingerprint, if there is little or no difference between the versions of a package, then we are essentially scanning and indexing the same package repeatedly
We can't populate our matching index with a single scancode.io instance; we need multiple instances so we can have multiple scans going at once

For the first two issues, we need to come up with a new way to group and index fingerprints. A starting idea would be to come up with something a bit more general. Currently, we create directory fingerprints for every package we map. If two packages we index are the same package but different versions, then we may store the same fingerprints twice. We could index fingerprints against a package in general, rather than against a specific package version.
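As a rough illustration of indexing against a package rather than a package version, here is a minimal in-memory sketch. `PackageKey` and `FingerprintIndex` are hypothetical names made up for this example and are not purldb models; the point is only that a fingerprint shared by several versions is stored once per package.

```python
from collections import defaultdict
from typing import NamedTuple


class PackageKey(NamedTuple):
    """Version-less identity of a package (hypothetical, for illustration)."""
    type: str
    namespace: str
    name: str


class FingerprintIndex:
    """Maps a directory fingerprint to the packages (not versions) that contain it."""

    def __init__(self):
        # fingerprint -> {PackageKey -> set of versions where it was seen}
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, fingerprint, key, version):
        """Record a fingerprint; return True only if it is new for this package."""
        versions = self._index[fingerprint][key]
        is_new = not versions
        versions.add(version)
        return is_new

    def lookup(self, fingerprint):
        """Return {PackageKey: sorted versions} for a fingerprint."""
        return {k: sorted(v) for k, v in self._index.get(fingerprint, {}).items()}
```

With this shape, adding the same fingerprint for a second version of a package only extends the version set, so the index grows with the number of distinct fingerprints per package rather than with the number of versions.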

For the third issue, we will have to flip the current scan queue request model. purldb will keep a queue of packages that it wants scanned, and it will be up to scancode.io to poll purldb to see what needs to be scanned. scancode.io would poll purldb, get the package that needs to be scanned, scan and fingerprint it, then send the results back to purldb. This issue is tracked at #14.
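The flipped model can be sketched as follows. This is an in-memory stand-in under stated assumptions, not the purldb API: `ScanQueue`, `get_next`, `submit` and `run_worker` are hypothetical names, and a real deployment would go through HTTP endpoints with authentication.

```python
from collections import deque


class ScanQueue:
    """In-memory stand-in for the purldb side of the queue (hypothetical)."""

    def __init__(self, purls):
        self._pending = deque(purls)
        self._in_progress = set()
        self.results = {}

    def get_next(self):
        """Hand the next package to whichever worker asks first, or None if empty."""
        if not self._pending:
            return None
        purl = self._pending.popleft()
        self._in_progress.add(purl)
        return purl

    def submit(self, purl, scan_result):
        """Accept results only for packages that were actually handed out."""
        if purl not in self._in_progress:
            raise ValueError(f"{purl} was not requested from this queue")
        self._in_progress.remove(purl)
        self.results[purl] = scan_result


def run_worker(queue, scan):
    """One scancode.io-like worker: poll, scan, report, until the queue is empty."""
    processed = 0
    while (purl := queue.get_next()) is not None:
        queue.submit(purl, scan(purl))
        processed += 1
    return processed
```

Because each worker pulls its own work, adding another scancode.io instance only means starting another `run_worker` loop against the same queue; purldb does not need to know how many workers exist.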

JonoYang · 2023-03-30T00:48:57Z

These are my notes for an on-demand queue for requesting package information. This queue would take in a package URL (purl), figure out which handler works for it, and then look at the upstream repo for package information.

This is a queue that allows us to:

Visit and map a package requested by the user
Visit and map the other versions of the package requested by the user

How does (should) it work:

Take a purl and create something similar to a ResourceURI
Using the purl, look up the package in the right repo, then get the POM info and map it as a Package
Look up other versions of the package and map that package info

How should we start:

Create ResourceURI-like model
Create queue loop (we will add the ResourceURI in manually for now; in the future we will have an API or something that can take in requests)
Queue loop only works on Maven packages for now; in the future we will have a router like the one we use with the visitors and mappers
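The starting steps above could look roughly like this. It is only a sketch: `PriorityURI`, `package_type` and `process_queue` are hypothetical names, the `handlers` dict stands in for the future router, and recording the error on the entry is an assumption about how failures would be kept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PriorityURI:
    """ResourceURI-like record for an on-demand request (hypothetical fields)."""
    purl: str
    processed: bool = False
    error: Optional[str] = None


def package_type(purl):
    """Return the type of a purl such as 'pkg:maven/yom/yom@1.0-alpha-1'."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a package URL: {purl}")
    return purl[len("pkg:"):].split("/", 1)[0]


def process_queue(queue, handlers):
    """Process each pending entry once, recording errors instead of stopping."""
    for entry in queue:
        if entry.processed:
            continue
        try:
            handler = handlers.get(package_type(entry.purl))
            if handler is None:
                raise LookupError(f"no handler for {entry.purl}")
            handler(entry.purl)
        except Exception as exc:
            # Keep the failure on the entry so it can be inspected later.
            entry.error = str(exc)
        entry.processed = True
```

Starting with `handlers = {"maven": some_maven_handler}` keeps the loop Maven-only for now, and other package types can be added later without touching the loop itself.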

https://repo1.maven.org/maven2/yom/yom/1.0-alpha-1/
https://repo1.maven.org/maven2/<namespace>/<name>/<version>/<name>-<version>-<classifier>.jar
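For reference, the two URL shapes above can be built from Maven coordinates with a small helper. The function names are hypothetical; the only rule added beyond the template is the standard Maven repository layout, where a dotted namespace (group id) becomes nested directories.

```python
MAVEN_BASE_URL = "https://repo1.maven.org/maven2"


def maven_version_url(namespace, name, version, base_url=MAVEN_BASE_URL):
    """Directory URL that holds every artifact of one package version."""
    # Maven stores a dotted group id as nested directories.
    group_path = namespace.replace(".", "/")
    return f"{base_url}/{group_path}/{name}/{version}/"


def maven_artifact_url(namespace, name, version, classifier=None,
                       extension="jar", base_url=MAVEN_BASE_URL):
    """Download URL for one artifact; classifier is e.g. 'sources'."""
    filename = f"{name}-{version}"
    if classifier:
        filename += f"-{classifier}"
    directory = maven_version_url(namespace, name, version, base_url)
    return f"{directory}{filename}.{extension}"
```

Calling `maven_artifact_url("yom", "yom", "1.0-alpha-1", classifier="sources")` only constructs the URL where a sources jar would live; whether that file exists still has to be checked against the repo.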

We would first create a new Package in the packagedb with the initial information from Maven, then we would download the sources jar of the package (if available), create a Package for it and index it. We would also have to do some sort of summarization (maybe use the license clarity score plugin?) on the scanned sources jar in order to get a copyright and license.

DennisClark · 2023-03-30T14:39:37Z

@JonoYang all this looks great! Question please: is the "ResourceURI" the same concept as the "Inferred URL" that we can see in scan results? or does it include other cases? (I guess I am asking for a somewhat precise definition of a ResourceURI.)

JonoYang · 2023-03-30T21:06:38Z

@DennisClark

ResourceURI is different from the inferred URLs that are generated from purls.

ResourceURI is a model that represents a Resource from the internet that you can download. The values of these URIs are usually the download URLs for a Package. In special cases, URIs could point to an upstream repo's index, like https://repo1.maven.org/maven2/.index/nexus-maven-repository-index.gz or https://replicate.npmjs.com/registry/_changes?include_docs=true&limit=1000&since=0. These are called seed URIs.

In purldb, we have visitors and mappers that work on these ResourceURIs. Visitors visit the seed URIs and create ResourceURIs for the packages they find listed in the repo index. Mappers take the ResourceURIs created by visitors and create a new entry in the PackageDB for each ResourceURI.
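To make the visitor and mapper roles concrete, here is a simplified sketch. It is not the purldb implementation: this `ResourceURI` is a plain dataclass rather than the real model, `fetch_index` is a hypothetical callable that returns the download URLs listed by a seed URI, and the PackageDB is stood in for by a dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceURI:
    """Simplified stand-in: a downloadable URI, optionally a seed (repo index)."""
    uri: str
    is_seed: bool = False


def visit(seed, fetch_index):
    """Visitor: read a seed URI's index and emit one ResourceURI per listed URL."""
    if not seed.is_seed:
        raise ValueError("visitors start from a seed URI")
    return [ResourceURI(uri=url) for url in fetch_index(seed.uri)]


def map_uris(resource_uris, packagedb):
    """Mapper: create one package entry per ResourceURI, skipping known ones."""
    created = 0
    for resource in resource_uris:
        if resource.uri not in packagedb:
            packagedb[resource.uri] = {"download_url": resource.uri}
            created += 1
    return created
```

Passing `fetch_index` in as a parameter keeps the sketch free of network access, so the same two functions can be exercised with a fake index in tests.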

DennisClark · 2023-03-30T21:11:17Z

@JonoYang thanks, that all makes sense!

JonoYang · 2024-03-28T23:27:02Z

We have updated the scan queue in #285

JonoYang mentioned this issue Mar 29, 2023

PURLDB: Ensure Maven matching works #52

Open

JonoYang added a commit that referenced this issue Mar 30, 2023

Create new queue for Package requests #49

72388e7

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 31, 2023

Create PriorityResourceURI processing logic #49

2fc8812

* This is maven specific for now Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 31, 2023

Create PriorityResourceURI processing logic #49

a827f7f

* This is maven specific for now * Add PriorityResourceURI migration Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 31, 2023

Move Package map logic into own function #49

ae37f02

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 31, 2023

Use merge_or_create function in priority_queue #49

c19ee5b

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 3, 2023

Send scan request for binary package #49

efe5e44

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 4, 2023

Save processing errors #49

a26e092

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 4, 2023

Scan source package #49

17e1691

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 5, 2023

Strip whitespace from sha1 #49

6d74f12

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 5, 2023

Validate sha1's #49

52d73d2

* Generate sha1 values if we can't get them from the sha1 files Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 5, 2023

Validate sha1's #49

934c992

* Generate sha1 values if we can't get them from the sha1 files Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 5, 2023

Get data from parent package #49

f372ab9

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 6, 2023

Check pom.parent before variable assignments #49

e162c5e

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 6, 2023

Do not use compute_normalized_license #49

122b3e4

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 7, 2023

Code cleanup #49

df5c20e

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Create new queue for Package requests #49

99e7d39

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Create PriorityResourceURI processing logic #49

9c10fe9

* This is maven specific for now * Add PriorityResourceURI migration Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Move Package map logic into own function #49

57de192

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Use merge_or_create function in priority_queue #49

3466c02

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Send scan request for binary package #49

ec26d69

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Save processing errors #49

e08d582

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Scan source package #49

e4b281f

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Apr 13, 2023

Strip whitespace from sha1 #49

76e417b

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 5, 2024

Create scan_queue_workers group #49 #285

5856cde

* Guard scan_queue API endpoint Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 7, 2024

Update ScannableURI migrations #49 #285

86b0885

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 7, 2024

Do not track scan progress from purldb #49 #285

4169da5

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 8, 2024

Share code between user creation commands #49 #285

38c5805

Signed-off-by: Jono Yang <[email protected]>

pombredanne mentioned this issue Mar 12, 2024

Improve indexing queue with multiple instances of ScanCode.io #236

Closed

9 tasks

pombredanne added this to 02-MatchCode - More matching Mar 12, 2024

pombredanne assigned JonoYang Mar 12, 2024

JonoYang added a commit that referenced this issue Mar 15, 2024

Get scan summary data and use it in index_package #49

ef57b49

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 16, 2024

Fix Package indexing issues #49

ccdee32

* Do not overwrite spdx license fields as these are generated * set Package fields to be nullable Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 16, 2024

Update package checksums using project extra data #49

4921a9a

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 16, 2024

Update update_status error responses #49

1192879

* Test to see if package data can be updated from indexing Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 18, 2024

Do not allow updates of finished ScannableURIs #49 #285

c41dda0

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 18, 2024

Do not allow updates of finished ScannableURIs #49 #285

b0b8ddd

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 18, 2024

Move validate_uuid to utils.py #49 #285

ff83b09

* Add tests for validate_uuid * Test for missing scan_status in update_status Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 18, 2024

Move code from minecode/collectors to original location #49 #285

4ecd97a

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Add tests to ensure proper API permissions for scan_queue #49

e3f46fd

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Remove previous scan queue Docker services #49 #285

093d94b

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Fix quotes #49 #285

560368e

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Fix quotes #49 #285

5de9076

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Fix quotes #49 #285

6fbd43b

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Update default pipelines to be run for packages #49 #285

8dcc1a3

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Use reindex_uri attribute when calling index_package #49 #285

254edd1

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Use reindex_uri attribute when calling index_package #49 #285

6e6f3b9

* Update quotes Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Use reindex_uri attribute when calling index_package #49 #285

abb6439

* Update quotes Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Remove duplicate import #49 #285

767e8a9

Signed-off-by: Jono Yang <[email protected]>

JonoYang added a commit that referenced this issue Mar 19, 2024

Merge pull request #290 from nexB/49-scan-queue-update

b2c8013

Expose new scan queue to many workers #49

JonoYang closed this as completed Mar 28, 2024

github-project-automation bot moved this to Done in 02-MatchCode - More matching Mar 28, 2024

pombredanne mentioned this issue Apr 3, 2024

ScanCode.io: Support multiple scan workers systems aboutcode-org/scancode.io#1107

Closed
