Overhaul needed for Package scan requests and indexing #49

Closed
JonoYang opened this issue Mar 15, 2023 · 5 comments

@JonoYang
Member

JonoYang commented Mar 15, 2023

I've set up purldb and scancode.io locally, where I run make run_visit and make run_map to visit and map Maven packages. I then run make request_scans and make process_scans to get information on the Resources in the packages we visit and map. I've noticed that the scan requests we send off to scancode.io are for multiple versions of the same package. This causes a few problems:

  • We cannot possibly scan every single package in the maven index to create our matching index
  • In the case of the directory structure fingerprint, if there is little/no difference between the different versions of a package, then we are essentially scanning and indexing the same package repeatedly
  • We can't populate our matching index with a single scancode.io instance; we need multiple instances so we can have multiple scans going at once

For the first two issues, we need to come up with a new way to group and index fingerprints. A starting idea would be to make the index a bit more general. Currently, we create directory fingerprints for every package we map. If two packages we index are the same package but different versions, then we may store the same fingerprints twice. We could index fingerprints to a package in general, rather than to a specific package version.
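To make the "index to a package in general" idea concrete, here is a minimal in-memory sketch. The `FingerprintIndex` class and the `versionless` helper are hypothetical names used only for illustration and are not part of purldb. The helper also assumes canonical purls, where any `@` in the namespace is percent-encoded.

```python
from collections import defaultdict


def versionless(purl: str) -> str:
    """Drop version, qualifiers and subpath from a purl string (hypothetical helper)."""
    # Strip subpath ("#...") and qualifiers ("?...") first so that an "@"
    # inside them is not mistaken for the version separator.
    base = purl.split("#", 1)[0].split("?", 1)[0]
    return base.rsplit("@", 1)[0] if "@" in base else base


class FingerprintIndex:
    """Hypothetical index keyed by fingerprint, pointing at versionless packages."""

    def __init__(self):
        self._packages_by_fingerprint = defaultdict(set)

    def add(self, fingerprint: str, purl: str) -> bool:
        """Index a fingerprint; return True only if it is new for this package."""
        key = versionless(purl)
        packages = self._packages_by_fingerprint[fingerprint]
        is_new = key not in packages
        packages.add(key)
        return is_new

    def match(self, fingerprint: str) -> set:
        """Return the versionless packages known to contain this fingerprint."""
        return set(self._packages_by_fingerprint.get(fingerprint, ()))
```

With this shape, indexing a second version whose directory fingerprint is unchanged adds no new entry, so the index grows only when a version actually differs.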

For the third issue, we will have to flip the current scan request queue model. purldb will keep a queue of packages that it wants scanned, and it will be up to scancode.io to poll purldb to see what needs to be scanned. scancode.io would poll purldb, get the package that needs to be scanned, scan and fingerprint it, then send the results back to purldb. This issue is tracked at #14.
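A rough sketch of the flipped (pull-based) model, assuming an in-memory queue in place of a real purldb API. `ScanQueue`, `run_worker`, and the `scan` callable are all hypothetical stand-ins; a real worker would talk to purldb over HTTP and sleep between polls.

```python
from collections import deque


class ScanQueue:
    """Stand-in for the purldb side: holds packages waiting to be scanned."""

    def __init__(self):
        self._pending = deque()
        self._in_progress = set()
        self.results = {}

    def request_scan(self, purl):
        """Queue a package for scanning."""
        self._pending.append(purl)

    def get_next(self):
        """Called by a worker; returns None when there is nothing to scan."""
        if not self._pending:
            return None
        purl = self._pending.popleft()
        self._in_progress.add(purl)
        return purl

    def submit_result(self, purl, result):
        """Called by a worker once the scan and fingerprinting are done."""
        self._in_progress.discard(purl)
        self.results[purl] = result


def run_worker(queue, scan, max_polls=10):
    """Stand-in for one scancode.io instance: poll, scan, send results back."""
    scanned = 0
    for _ in range(max_polls):
        purl = queue.get_next()
        if purl is None:
            break  # a real worker would wait and poll again instead of stopping
        queue.submit_result(purl, scan(purl))
        scanned += 1
    return scanned
```

Because workers pull work rather than having it pushed to them, any number of scancode.io instances could poll the same queue without purldb needing to know about them in advance.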

@JonoYang
Member Author

JonoYang commented Mar 30, 2023

These are my notes for an on-demand queue for requesting package information. This queue would take in a package URL (purl), figure out which handler works for that purl, and then look at the upstream repo for package information.

This is a queue that allows us to:

  • Visit and map a package requested by the user
  • Visit and map the other versions of the package requested by the user

How does (should) it work:

  • Take a purl and create something similar to a ResourceURI
  • Using the purl, look up the package in the right repo, then get the pom info and map that info as a package
  • Look up other versions of the package and map that package info

How should we start:

  • Create ResourceURI-like model
  • Create queue loop (we will add the ResourceURI in manually for now, in the future we will have an API or something that can take in requests)
  • Queue loop only works on maven packages for now, in the future we will have that router thing like what we do with the visitors and mappers

https://repo1.maven.org/maven2/yom/yom/1.0-alpha-1/
https://repo1.maven.org/maven2/<namespace>/<name>/<version>/<name>-<version>-<classifier>.jar
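As an illustration of the URL pattern above, the following sketch builds the version directory URL and the jar URL from purl-like components. `maven_urls` is a hypothetical helper, not a purldb function. It assumes the standard Maven repository layout, where dots in the namespace (group id) become path separators.

```python
MAVEN_BASE_URL = "https://repo1.maven.org/maven2"


def maven_urls(namespace, name, version, classifier=None, base_url=MAVEN_BASE_URL):
    """Return (version directory URL, jar URL) for a Maven package (hypothetical helper)."""
    # Maven repositories lay out group ids as nested directories,
    # e.g. "org.example" is stored under "org/example".
    group_path = namespace.replace(".", "/")
    directory = f"{base_url}/{group_path}/{name}/{version}/"
    # The classifier (for example "sources") is optional in the file name.
    suffix = f"-{classifier}" if classifier else ""
    jar = f"{directory}{name}-{version}{suffix}.jar"
    return directory, jar


directory, sources_jar = maven_urls("yom", "yom", "1.0-alpha-1", classifier="sources")
```

Here `directory` matches the first example URL above, and `sources_jar` is where the sources jar would live if the package publishes one.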

We would first create a new Package in the packagedb with the initial information from maven, then we would download the sources jar of the package (if available), create a Package and index it. We would also have to do some sort of summarization (maybe use the license clarity score plugin?) on the scanned sources jar in order to get a copyright and license.

JonoYang added a commit that referenced this issue Mar 30, 2023
@DennisClark
Member

@JonoYang all this looks great! Question please: is the "ResourceURI" the same concept as the "Inferred URL" that we can see in scan results? or does it include other cases? (I guess I am asking for a somewhat precise definition of a ResourceURI.)

@JonoYang
Member Author

@DennisClark

ResourceURI is different than inferred URLs that are generated from purls.

ResourceURI is a model that represents a Resource on the internet that you can download. The values of these URIs are usually the download URLs for a Package. In special cases, URIs could point to an upstream repo's index, like https://repo1.maven.org/maven2/.index/nexus-maven-repository-index.gz or https://replicate.npmjs.com/registry/_changes?include_docs=true&limit=1000&since=0. These are called seed URIs.

In purldb, we have visitors and mappers that work on these ResourceURIs. Visitors visit the seed URIs and create ResourceURIs for the packages they find listed in the repo index. Mappers take the ResourceURIs created by visitors and create a new entry in the PackageDB for each ResourceURI.
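A toy sketch of that visitor/mapper split, for illustration only. The `ResourceURI` dataclass, `visit`, and `map_uri` below are simplified stand-ins and do not reflect the actual purldb models or function signatures. The listing passed to `visit` is assumed to have already been parsed from the repo index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceURI:
    """Simplified stand-in: a downloadable URI, optionally flagged as a seed."""
    uri: str
    is_seed: bool = False


def visit(seed, listing):
    """Visitor: turn the download URLs listed by a seed index into ResourceURIs."""
    if not seed.is_seed:
        raise ValueError("visitors start from a seed URI")
    return [ResourceURI(uri=url) for url in listing]


def map_uri(resource_uri):
    """Mapper: create one package record (here a plain dict) per ResourceURI."""
    filename = resource_uri.uri.rsplit("/", 1)[-1]
    return {"download_url": resource_uri.uri, "filename": filename}


seed = ResourceURI(
    "https://repo1.maven.org/maven2/.index/nexus-maven-repository-index.gz",
    is_seed=True,
)
# Hypothetical listing; a real visitor would read these from the index itself.
uris = visit(seed, ["https://repo1.maven.org/maven2/yom/yom/1.0-alpha-1/yom-1.0-alpha-1.jar"])
packages = [map_uri(u) for u in uris]
```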

@DennisClark
Member

@JonoYang thanks, that all makes sense!

JonoYang added a commit that referenced this issue Mar 31, 2023
    * This is maven specific for now

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 31, 2023
    * This is maven specific for now
    * Add PriorityResourceURI migration

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 31, 2023
JonoYang added a commit that referenced this issue Mar 31, 2023
JonoYang added a commit that referenced this issue Apr 3, 2023
JonoYang added a commit that referenced this issue Apr 4, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 4, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 5, 2023
    * Do not handle scan request during Package creation time as it takes too long to wait for the scan to complete
    * Add new command for creating PriorityResourceURIs

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 5, 2023
JonoYang added a commit that referenced this issue Apr 5, 2023
    * Generate sha1 values if we can't get them from the sha1 files

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 5, 2023
JonoYang added a commit that referenced this issue Apr 6, 2023
JonoYang added a commit that referenced this issue Apr 6, 2023
JonoYang added a commit that referenced this issue Apr 7, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 13, 2023
JonoYang added a commit that referenced this issue Apr 13, 2023
    * This is maven specific for now
    * Add PriorityResourceURI migration

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 13, 2023
JonoYang added a commit that referenced this issue Apr 13, 2023
JonoYang added a commit that referenced this issue Apr 13, 2023
JonoYang added a commit that referenced this issue Apr 13, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 13, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 13, 2023
    * Do not handle scan request during Package creation time as it takes too long to wait for the scan to complete
    * Add new command for creating PriorityResourceURIs

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Apr 13, 2023
JonoYang added a commit that referenced this issue Mar 5, 2024
    * Guard scan_queue API endpoint

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 7, 2024
JonoYang added a commit that referenced this issue Mar 7, 2024
JonoYang added a commit that referenced this issue Mar 8, 2024
JonoYang added a commit that referenced this issue Mar 15, 2024
JonoYang added a commit that referenced this issue Mar 16, 2024
    * Do not overwrite spdx license fields as these are generated
    * set Package fields to be nullable

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 16, 2024
JonoYang added a commit that referenced this issue Mar 16, 2024
    * Test to see if package data can be updated from indexing

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 18, 2024
JonoYang added a commit that referenced this issue Mar 18, 2024
JonoYang added a commit that referenced this issue Mar 18, 2024
    * Add tests for validate_uuid
    * Test for missing scan_status in update_status

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 18, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 19, 2024
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 19, 2024
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Mar 19, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
JonoYang added a commit that referenced this issue Mar 19, 2024
Expose new scan queue to many workers #49
@JonoYang
Member Author

We have updated the scan queue in #285
