Standardize how metadata supporting text mined results is represented #399

Open
mbrush opened this issue Feb 28, 2023 · 1 comment
Collaborator

mbrush commented Feb 28, 2023

Translator uses two main sources for text-mined knowledge: TMKP and SemMedDB.

These sources want to report metadata supporting a text-mined edge, including the sentence(s) mined, metrics/scores reflecting confidence in accurate extraction of concepts and relationships from each sentence, and information about the context in which the sentence is found (e.g. what section of an article).

Often, a given edge is supported by mining of multiple sentences/spans of text - each of which comes with its own set of such metadata.

Precise representation of this information requires a way to group metadata for each NLP-based sentence analysis together in a TRAPI message.

The modeling team worked with TMKP to define a way to do this using Biolink StudyResult objects, and leveraging nested Attributes in the TRAPI structure. Details and examples of this model are here.
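As a rough illustration of the grouping idea, here is a minimal Python sketch of two supporting sentences on one edge, each wrapped in its own parent attribute with nested sub-attributes. The `attribute_type_id`, `value`, and nested `attributes` fields are TRAPI Attribute fields; the specific Biolink type ids, the helper names (`make_attribute`, `study_result`), and all values are assumptions for illustration and should be checked against the TMKP model documentation.

```python
from typing import Any, Optional


def make_attribute(type_id: str, value: Any,
                   nested: Optional[list] = None) -> dict:
    """Build a TRAPI-style Attribute dict; sub-attributes go in 'attributes'."""
    attribute = {"attribute_type_id": type_id, "value": value}
    if nested:
        attribute["attributes"] = nested
    return attribute


def study_result(result_id: str, pmid: str, sentence: str,
                 score: float, section: str) -> dict:
    """Group everything known about one mined sentence under one parent."""
    return make_attribute(
        "biolink:has_supporting_study_result",  # assumed type id
        result_id,
        nested=[
            # The nested type ids below are assumptions as well.
            make_attribute("biolink:publications", pmid),
            make_attribute("biolink:supporting_text", sentence),
            make_attribute("biolink:extraction_confidence_score", score),
            make_attribute("biolink:supporting_text_located_in", section),
        ],
    )


# Placeholder values only; not taken from any real KP response.
edge_attributes = [
    study_result("example:result-1", "PMID:0000001",
                 "Placeholder sentence one.", 0.98, "abstract"),
    study_result("example:result-2", "PMID:0000002",
                 "Placeholder sentence two.", 0.71, "title"),
]
```

Because each sentence's score and location sit inside that sentence's own parent attribute, a consumer never has to guess which score belongs to which sentence.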

This modeling structure is reflected in how edge metadata is returned in the ARAX-ARS interface. Below I show a subset of the metadata on a 'is treated_by' edge from TMKP, which shows up in the KG supporting ARAGORN's 'Nutarsudil' result for this query:
image
image
image

However, other KPs that provide text-mined edges from SemMedDB (BTE, RTX-KG2) return less detailed metadata . . .
image
(from https://arax.ncats.io/?r=623df483-e0c8-45b5-80bb-38f15627c93c, specifically a 'treated by' edge in ARAGORN's TOFACITIMIB result)

. . . and when more detail is provided, a very different structure is used. In the rtx2-semmed example below, sentence text and pub date are stuffed next to the PMID in the publications attribute for convenience, and then duplicated in a richer JSON format alongside score and date info in a separate bts:sentence attribute:
image
image
(from https://arax.transltr.io/?r=9360c5c9-cb10-47d2-9910-535fc4cbbf05, specifically a 'treats' edge in ARAGORN's biguaniide result)


In summary, SemMedDB edge metadata describing source publications, sentences, and details about these (dates, scores, etc.) is inconsistently provided and represented across KPs, and does not use the same detailed structure as TMKP.

  • The TMKP model provides rich metadata using StudyResult objects as organizing nodes in a two-level structure.
  • In the bte-semmeddb example, the KP doesn’t include sentence or other metadata at all.
  • In the rtx2-semmed example, sentence text and pub date are stuffed next to the PMID in the publications attribute for convenience, and then duplicated in a richer JSON format alongside score and date info in a separate bts:sentence attribute. (Each object in this blob is analogous to a StudyResult in the TMKP model.)

We should try to use a similar structure in all cases, aligning where possible with that defined by TMKP.
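To make the alignment concrete, here is a hedged sketch of how a per-sentence JSON blob could be reshaped into one StudyResult-like attribute per sentence. The input shape (a dict keyed by PMID with `sentence`, `publication date`, `subject score`, and `object score` keys), the function name `blob_to_study_results`, the `example:` type ids, and the generated parent values are all hypothetical; the actual rtx2-semmed blob and the agreed Biolink type ids may differ.

```python
# Hypothetical mapping from blob keys to attribute type ids.
# The "example:" ids are placeholders, not real Biolink terms.
FIELD_MAP = {
    "sentence": "biolink:supporting_text",  # assumed type id
    "publication date": "example:publication_date",
    "subject score": "example:subject_score",
    "object score": "example:object_score",
}


def blob_to_study_results(blob: dict) -> list:
    """Turn {pmid: {field: value}} into a list of nested attributes."""
    results = []
    for pmid, info in sorted(blob.items()):
        # Every group starts with the publication it came from.
        nested = [{"attribute_type_id": "biolink:publications", "value": pmid}]
        for key, type_id in FIELD_MAP.items():
            if key in info:  # skip fields the KP did not provide
                nested.append({"attribute_type_id": type_id,
                               "value": info[key]})
        results.append({
            "attribute_type_id": "biolink:has_supporting_study_result",
            "value": f"{pmid}#sentence",  # placeholder identifier
            "attributes": nested,
        })
    return results
```

A KP that has no sentence-level detail would simply emit no such groups, so the same structure covers both the sparse and the rich cases.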
