Add DB v6 schema #2128

Closed · 7 tasks done
wagoodman opened this issue Sep 17, 2024 · 2 comments · Fixed by #2258

Labels: changelog-ignore (Don't include this issue in the release changelog), enhancement (New feature or request)
wagoodman (Contributor) commented Sep 17, 2024

DB v6 is meant to cover several use cases. The high-level goals are:

  • Make the distributed DB smaller
    • Via removing unnecessary information: for instance, "namespace" is a holdover from earlier projects and is no longer strictly necessary.
    • Via minimizing redundant information: in order to optimize the matching process, the v5 Vulnerability table is filled with redundant information (rows are duplicated for every unique fixed-in version found in upstream data).
  • Make adding new use cases to the DB easier (e.g. add EPSS or KEV data sources so grype can functionally use them without needing structural changes to grype)

The high-level design is as follows:

  • Add a table per search use case. The bare minimum information needed to search for records lives in these tables, which are indexed. These tables allow you to identify a record ID. We'll refer to these tables as "Search" or "*Handle" tables.
  • Have a pseudo-content-addressable (single) blobs table that maps record IDs to JSON blobs. When writing to the DB, these record IDs are 1:1 with JSON content digests, so we can cheaply avoid writing duplicate information. These digests should be stripped away before distribution, since they are not strictly needed at runtime. The JSON blobs represent the business elements (e.g. Vulnerability, Severity, AffectedPackage, etc.). All payloads are self-describing, with a named/versioned "kind" field used for deserialization concerns (see the sketch after this list). We'll refer to these as "*Blob" models.
  • Have auxiliary tables related to the search tables to aid in refining and augmenting searches.
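
To make the blob write path concrete, here is a minimal, hypothetical Go sketch (not the actual prototype code) of wrapping a payload with a self-describing "kind" field, digesting the JSON, and deduplicating on that digest before it lands in the Blobs table. All names here (envelope, blobStore, the "vulnerability@1" kind string) are illustrative assumptions, and the store is an in-memory stand-in for the real table:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// envelope is a hypothetical self-describing wrapper: "kind" names (and versions)
// the payload type so readers know how to deserialize the value.
type envelope struct {
	Kind  string          `json:"kind"`  // e.g. "vulnerability@1"
	Value json.RawMessage `json:"value"` // the business payload (VulnerabilityBlob, AffectedBlob, ...)
}

// blobStore is an in-memory stand-in for the Blobs table: content digests map to
// row IDs so identical JSON payloads are written only once.
type blobStore struct {
	ids    map[string]int64 // content digest -> blob ID
	values []string         // blob ID (1-based index) -> JSON value
}

func newBlobStore() *blobStore {
	return &blobStore{ids: map[string]int64{}}
}

// add writes the payload only if its digest hasn't been seen and returns the blob ID.
func (s *blobStore) add(kind string, payload any) (int64, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return 0, err
	}
	env, err := json.Marshal(envelope{Kind: kind, Value: raw})
	if err != nil {
		return 0, err
	}
	sum := sha256.Sum256(env)
	digest := hex.EncodeToString(sum[:])
	if id, ok := s.ids[digest]; ok {
		return id, nil // duplicate content: reuse the existing blob row
	}
	s.values = append(s.values, string(env))
	id := int64(len(s.values)) // mimics an INTEGER PRIMARY KEY
	s.ids[digest] = id
	return id, nil
}

func main() {
	store := newBlobStore()
	a, _ := store.add("vulnerability@1", map[string]string{"id": "CVE-2024-0001"})
	b, _ := store.add("vulnerability@1", map[string]string{"id": "CVE-2024-0001"})
	fmt.Println(a == b) // true: identical payloads share one blob row
}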

Here is a list of the “Handle” tables to search against:

  • AffectedPackageHandles: search by package name within an ecosystem
  • AffectedCPEHandles: search by CPE fields
  • VulnerabilityHandles: search by vulnerability ID

These are related to two other auxiliary tables:

  • Packages: listing of all unique pairings of package name and ecosystem
  • OperatingSystems: listing of all unique tuples of distro name, major version, minor version (with optional codename)

Here's how the AffectedPackageHandle table would relate to the auxiliary tables:

CREATE TABLE AffectedPackageHandle (
    id INTEGER PRIMARY KEY,
    vulnerability_id INTEGER NOT NULL,
    operating_system_id INTEGER,
    package_id INTEGER NOT NULL,
    blob_id INTEGER NOT NULL,
    FOREIGN KEY(vulnerability_id) REFERENCES VulnerabilityHandle(id),
    FOREIGN KEY(operating_system_id) REFERENCES OperatingSystem(id),
    FOREIGN KEY(package_id) REFERENCES Package(id),
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

-- supporting tables...

CREATE TABLE VulnerabilityHandle (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    blob_id INTEGER UNIQUE,
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

CREATE TABLE Package (
    id INTEGER PRIMARY KEY,
    ecosystem TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE(name, ecosystem)
);

CREATE TABLE OperatingSystem (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    major_version TEXT NOT NULL,
    minor_version TEXT NOT NULL,
    codename TEXT,
    UNIQUE(name, major_version, minor_version)
);

And the rather simple Blobs table:

CREATE TABLE Blobs (
    id INTEGER PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);

Here's how you might craft a search for an affected package for a specific OS:

SELECT 
    affected.blob_id AS affected_blob_id,
    vuln.blob_id AS vulnerability_blob_id
FROM 
    AffectedPackageHandle affected
JOIN 
    OperatingSystem os ON affected.operating_system_id = os.id
JOIN 
    Package pkg ON affected.package_id = pkg.id
JOIN 
    VulnerabilityHandle vuln ON affected.vulnerability_id = vuln.id
WHERE 
    os.name = 'ubuntu' AND 
    os.major_version = '22' AND 
    os.minor_version = '4' AND 
    pkg.name = 'openssl';

At this point you can take these blob IDs, query the blob table for the JSON payloads, and deserialize them. This has an advantage over the existing schemas: you can conditionally inflate DB objects based on what you need, rather than inflating entire records that you end up throwing away.
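
As an illustration only (not grype's actual code), the follow-up lookup and deserialization could look like the Go sketch below. It assumes a SQLite file, the Blobs table above, and the go-sqlite3 driver; the file name, blob ID, and the trimmed-down vulnerabilityBlob struct are placeholders:

package main

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed SQLite driver; the real implementation may differ
)

// vulnerabilityBlob is a trimmed-down stand-in for the full VulnerabilityBlob
// proposed below; only the fields needed by this example are declared.
type vulnerabilityBlob struct {
	ID          string `json:"id"`
	Description string `json:"description"`
}

// fetchVulnerabilityBlob reads one row from the Blobs table and inflates only
// the JSON payload the caller actually needs.
func fetchVulnerabilityBlob(db *sql.DB, blobID int64) (*vulnerabilityBlob, error) {
	var value string
	if err := db.QueryRow("SELECT value FROM Blobs WHERE id = ?", blobID).Scan(&value); err != nil {
		return nil, fmt.Errorf("fetching blob %d: %w", blobID, err)
	}
	var blob vulnerabilityBlob
	if err := json.Unmarshal([]byte(value), &blob); err != nil {
		return nil, fmt.Errorf("deserializing blob %d: %w", blobID, err)
	}
	return &blob, nil
}

func main() {
	db, err := sql.Open("sqlite3", "vulnerability.db") // hypothetical DB path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	blob, err := fetchVulnerabilityBlob(db, 42) // blob ID taken from the search query above
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(blob.ID, blob.Description)
}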

In v1-v5 you'd need to craft the correct namespace, which was a bespoke string; v6 shifts this to per-record relations.

Eventually we’d like to add additional handle tables (out of scope for v6 though).

The proposed blobs are as follows:

type VulnerabilityBlob struct {
	ID string `json:"id"` 

	// Name of the Vunnel provider (or sub-processor responsible for data records from a single specific source, e.g. "ubuntu")
	ProviderName string `json:"provider"`

	// List of names, emails, or organizations who submitted the vulnerability
	Assigner []string `json:"assigner,omitempty"`

	// Description of the vulnerability as provided by the source
	Description string `json:"description"`

	// Date the vulnerability record was last modified
	ModifiedDate *time.Time `json:"modified,omitempty"`

	// Date the vulnerability record was first published
	PublishedDate *time.Time `json:"published,omitempty"`

	// Date the vulnerability record was withdrawn
	WithdrawnDate *time.Time `json:"withdrawn,omitempty"`

	// Conveys the current status of the vulnerability
	Status VulnerabilityStatus `json:"status"`

	// URLs to external resources that provide more information about the vulnerability
	References []struct {
		// External resource
		URL string `json:"url"`

		// Free-form organizational field to convey additional information about the reference
		Tags []string `json:"tags,omitempty"`
	} `json:"refs,omitempty"`

	// List of IDs of the same vulnerability in other databases, in the form of the ID field.
	Aliases []string `json:"aliases,omitempty"`

	// List of severity indications (quantitative or qualitative) for the vulnerability
	Severities []struct {
		// Describes the quantitative method used to determine the score, or indicates the qualitative value
		Scheme SeverityScheme `json:"scheme"`

		// Severity score (e.g., "7.5", "CVSS:4.0/AV:N/AC:L/AT:N/PR:H/UI:N/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N", or "high")
		Value string `json:"value"`

		// Name of the source of the severity score (e.g., "[email protected]" or "[email protected]")
		Source string `json:"source"`

		// Free-form organizational field to convey priority over other severities
		Rank int `json:"rank"`
	} `json:"severities,omitempty"`
}
type AffectedBlob struct {
	CVEs          []string `json:"cves"`
	RpmModularity string   `json:"rpm_modularity,omitempty"`
	PlatformCPEs  []string `json:"platform_cpes,omitempty"`
	Ranges        []struct {
		Version struct {
			// Type of version range, such as "semver", "rpm", "pypi", etc.
			Type string `json:"type"`

			// Allows for a version range expression, such as ">=1.0.0", "1.0.0", ">= 1.0, <2.0", etc.
			Constraint string `json:"constraint"`
		} `json:"version"`

		Fix *struct {
			Version string `json:"version"`
			State   string `json:"state"`
			Detail  *struct {
				GitCommit  string `json:"git_commit"`
				Timestamp  time.Time `json:"timestamp"`
				References []struct {
					// External resource
					URL string `json:"url"`

					// Free-form organizational field to convey additional information about the reference
					Tags []string `json:"tags,omitempty"`
				} `json:"references,omitempty"`
			} `json:"detail,omitempty"`
		} `json:"fix,omitempty"`
	} `json:"ranges,omitempty"`
}

Implied changes from this:

  1. All Blob entities…
    a. may have fields added to them over time, but not removed or modified (within the same schema version)
    b. only one “version” of a blob may exist within a DB at a time
  2. When you have additional data that fits new use cases, add a new Handle table that points to new Blob types.
  3. Remove the concept of a “Namespace” from the DB. All search data should be searched with the proper joins to auxiliary tables (or not) depending on what you’re searching for. This way you are not searching within a namespace string pattern; instead, you enumerate the associated qualities you want to filter on (see the sketch after this list).
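
As a purely hypothetical Go sketch of point 3 (the type and method names here are illustrative, not part of the proposal), the qualities a caller wants to filter on become explicit, optional fields that a query builder turns into joins against the auxiliary tables, instead of being packed into a namespace string:

package main

import "fmt"

// affectedPackageCriteria illustrates the idea: each quality you want to filter on
// is an explicit, optional field that maps to a column/join, not a namespace string.
type affectedPackageCriteria struct {
	PackageName  string // required: joins against Package.name
	Ecosystem    string // optional: Package.ecosystem
	DistroName   string // optional: OperatingSystem.name
	MajorVersion string // optional: OperatingSystem.major_version
	MinorVersion string // optional: OperatingSystem.minor_version
}

// activeFilters lists the column filters a query builder would turn into joins,
// in place of matching against a bespoke namespace string pattern.
func (c affectedPackageCriteria) activeFilters() map[string]string {
	filters := map[string]string{"Package.name": c.PackageName}
	if c.Ecosystem != "" {
		filters["Package.ecosystem"] = c.Ecosystem
	}
	if c.DistroName != "" {
		filters["OperatingSystem.name"] = c.DistroName
	}
	if c.MajorVersion != "" {
		filters["OperatingSystem.major_version"] = c.MajorVersion
	}
	if c.MinorVersion != "" {
		filters["OperatingSystem.minor_version"] = c.MinorVersion
	}
	return filters
}

func main() {
	c := affectedPackageCriteria{PackageName: "openssl", DistroName: "ubuntu", MajorVersion: "22"}
	fmt.Println(c.activeFilters()) // e.g. map[OperatingSystem.major_version:22 OperatingSystem.name:ubuntu Package.name:openssl]
}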

Specific changes (see prototype models for reference):

@wagoodman wagoodman added the enhancement New feature or request label Sep 17, 2024
@wagoodman wagoodman added this to the DB v6 milestone Sep 17, 2024
@wagoodman wagoodman changed the title Add DB v6 reader/writers Add DB v6 schema foundation Sep 17, 2024
@wagoodman wagoodman moved this to Backlog in OSS Sep 17, 2024
@wagoodman wagoodman changed the title Add DB v6 schema foundation Add DB v6 schema Sep 18, 2024
TimBrown1611 commented Sep 18, 2024

hi @wagoodman ,

  • Are we expecting the scan results to change from v5 to v6?
  • Will the DB be stateful (i.e. able to track what has changed or been added)? Meaning, if a CVE was updated in a source that doesn't have a published/updated field, will that be reflected in the DB?
  • Flexibility to add new data: how easy will it be to add new tables (for example, EOL)?
  • Will it affect grype's schema (v6 = grype 1.0.0)?

Thanks for your time!

wagoodman (Contributor, Author) commented

Are we expecting the scan results to change from v5 to v6?

No, there is no functional change to the match results as currently planned; you should see the same match results. This could change if we fix any matcher bugs while rewriting that portion of the search package, but we would not merge changes that made results worse, only better.

Will the DB be stateful?

It will be more stateful than it is today. Most information will be in the Blobs table and be content addressable. We won't be keeping the value digests in the DB (to save on space), but it would be possible to digest all values in that table and compare the digests computed from one DB against another DB to get a rough sense of the changes. We cannot use the blob IDs themselves since they cannot be considered stable between DB builds.
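
For instance, a rough diff between two DB builds could be computed externally by re-digesting the blob values, along the lines of this sketch (illustrative only; the file names and the go-sqlite3 driver are assumptions):

package main

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed SQLite driver
)

// blobDigests recomputes a content digest for every row in the Blobs table,
// since the digests themselves are stripped before distribution.
func blobDigests(db *sql.DB) (map[string]bool, error) {
	rows, err := db.Query("SELECT value FROM Blobs")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	digests := map[string]bool{}
	for rows.Next() {
		var value string
		if err := rows.Scan(&value); err != nil {
			return nil, err
		}
		sum := sha256.Sum256([]byte(value))
		digests[hex.EncodeToString(sum[:])] = true
	}
	return digests, rows.Err()
}

func main() {
	oldDB, err := sql.Open("sqlite3", "old.db") // hypothetical paths
	if err != nil {
		log.Fatal(err)
	}
	defer oldDB.Close()
	newDB, err := sql.Open("sqlite3", "new.db")
	if err != nil {
		log.Fatal(err)
	}
	defer newDB.Close()

	oldDigests, err := blobDigests(oldDB)
	if err != nil {
		log.Fatal(err)
	}
	newDigests, err := blobDigests(newDB)
	if err != nil {
		log.Fatal(err)
	}

	added := 0
	for d := range newDigests {
		if !oldDigests[d] {
			added++ // blob content present in the new DB but not the old one
		}
	}
	fmt.Printf("%d blobs added or changed\n", added)
}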

How easy will it be to add new tables?

Since there would be new Handle tables for searching and new Blob schemas for entries that are searchable by a handle table, at any point during the v6 lifetime we can add these new elements without it being a breaking change or a paradigm shift. Take your EOL example: I imagine there would be a new EolBlob that represents a JSON payload with CPE, PURL, version, date, etc., as well as a new EolHandle table that is indexed by CPE or PURL. We might also add new entries to the Packages table for joining on (optionally, depending on what kind of data we pull in from that dataset).
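
Purely as a hypothetical sketch (none of these types exist yet, and the real shape would depend on the upstream dataset), following the Handle/Blob pattern above it might look something like:

package v6 // hypothetical package name

import "time"

// EolBlob is a hypothetical payload for end-of-life data, stored as JSON in the Blobs table.
type EolBlob struct {
	CPE     string     `json:"cpe,omitempty"`
	PURL    string     `json:"purl,omitempty"`
	Version string     `json:"version"`
	EolDate *time.Time `json:"eol_date,omitempty"`
}

// EolHandle is a hypothetical search-table entry, mirroring the existing *Handle
// pattern: indexed lookup columns plus a pointer to the blob payload.
type EolHandle struct {
	ID     int64
	CPE    string // indexed search key
	PURL   string // indexed search key
	BlobID int64  // references Blobs(id)
}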

Will it affect grype's schema (v6 = grype 1.0.0)?

grype is still in v0, so we're still making breaking changes on minor releases. However, the v6 schema is a big step towards grype 1.0. We still need to determine whether we want to make the DB schema part of the grype version contract (e.g. should we allow breaking schema changes without needing to bump the major version of grype?); that hasn't been decided yet.

@wagoodman wagoodman moved this from Backlog to Ready in OSS Sep 26, 2024
@wagoodman wagoodman moved this from Ready to In Progress in OSS Sep 26, 2024
@willmurphyscode willmurphyscode removed their assignment Oct 4, 2024
@wagoodman wagoodman linked a pull request Nov 7, 2024 that will close this issue
@wagoodman wagoodman removed a link to a pull request Nov 13, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in OSS Nov 14, 2024
@wagoodman wagoodman added the changelog-ignore Don't include this issue in the release changelog label Nov 18, 2024