Create an automated metadata harmonization/curation tool #111
Comments
Hello maintainers, please acknowledge if I can proceed.
Hi @shbrief, this seems a very interesting project to work on. I am familiar with cosine similarity and vectorization of words. I have used Word2vec in the past, but I am new to txt2onto, which seems specialized in tissue and cell-type annotations.
Hello cBioPortal mentors team! Sehyun Oh (@shbrief), Sean Davis (@seandavi), I'm Omkar Nikam, a second-year student studying Artificial Intelligence and Machine Learning (CSE) at KITCOEK (India). I have good exposure to machine learning concepts, and I have started learning the R programming language; I expect to be comfortable with it soon. Please guide me on the next steps! Sincerely,
Thank you for expressing your interest! @HarshaTejaswi @RINO-GAELICO @iOmkarNikam
Hi @shbrief, I am working on number 2 and I have a question. How are we supposed to "create a set of uncurated values"?
@RINO-GAELICO The
hi @shbrief ,
Hello @shbrief , Vaishnavi Mudaliar this side :)
Hi @shbrief, Could you please take a look at my colab notebook and provide feedback? Colab notebook (removed) Thank you
Thanks @VaishnaviMudaliar @RINO-GAELICO! Could you remove the link to your colab notebook from your reply and directly send it to me?
Mailed, thanks a lot :)
Hello @shbrief , could you please let me know where to check if we have successfully gotten through the evaluation task or not?
Hello @shbrief , I have made a prototype with the architecture explanation and I have mailed the colab link to you please check.
Hi @shbrief, I've worked as a data scientist for 3+ years in the bioinformatics domain. Having a background in biophysics and statistical genetics (M.S. from UC Davis, B.Tech from IIT-BHU), I'm passionate about streamlining ML- and AI-based pipelines in bioinformatics. One of my personal projects in the domain of automation and streamlining of bioinformatics tools and ML is https://github.com/adhal007/OmixHub. I've worked out a basic solution for the automated metadata curation (Problem 2) and emailed you the code for the same. I look forward to getting your feedback on my approach and discussing any prospective avenues to contribute to this exciting project at GSOC 2024! Best Regards,
Hi @shbrief ,
Background:
Though many omics data repositories host large volumes of datasets from diverse studies, cross-study analysis within these repositories is still limited by the heterogeneity of their metadata structures. This lack of metadata harmonization especially impedes the application and development of machine learning tools for high-throughput biological data, which are in high demand due to the complexity and high dimensionality of multi-omics datasets. To facilitate comparable analysis across data sources through machine learning, we initiated the OmicsMLRepo project, which harmonizes metadata from diverse omics data repositories. Under this project, we manually reviewed metadata schemas, consolidated similar or identical information spread across schemas, and incorporated ontologies where possible. One of our target data repositories is cBioPortal: we have harmonized cBioPortal’s key clinical metadata across the whole data repository, not just at the study level, and incorporated ontology terms to improve the AI/ML-readiness of the cBioPortal data.
We performed a manual inspection of clinical metadata from 375 studies in cBioPortal (available on 5/13/2023) and harmonized major attributes, such as treatment, demographic information (e.g., age, sex), and disease. For example, 24 different values (e.g., RADIO_THERAPY, Rad, XRT) categorized as ‘treatment_type’ were harmonized into a single ontology term, “Radiation Therapy” (NCIT:C15313). While the comparability of the 375 datasets has improved substantially, cBioPortal is continuously growing, and new data must be harmonized to follow the data dictionary established under the OmicsMLRepo project. To reduce this maintenance effort, we would like to create an automated data harmonization tool.
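As a rough illustration of what the manual harmonization above amounts to, the sketch below maps the three raw values named in this issue to the curated term through a lookup table. The names `CURATED_TERM`, `RAW_TO_CURATED`, and `harmonize` are hypothetical, and the case-insensitive matching is an assumption, not a description of the actual OmicsMLRepo data dictionary.

```python
# Hypothetical lookup table; only the three raw values quoted in this issue are listed.
CURATED_TERM = {"label": "Radiation Therapy", "ontology_id": "NCIT:C15313"}
RAW_TO_CURATED = {raw.lower(): CURATED_TERM for raw in ["RADIO_THERAPY", "Rad", "XRT"]}


def harmonize(value):
    """Return the curated term for a raw 'treatment_type' value, or None if unknown.

    Matching ignores case and surrounding whitespace (an assumption for this sketch).
    """
    return RAW_TO_CURATED.get(value.strip().lower())


print(harmonize("xrt"))      # known raw value
print(harmonize("Surgery"))  # not in the table
```

A table like this has to be extended by hand for every new spelling, which is the maintenance effort the automated tool is meant to reduce.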
The main approach we are currently considering is semantic similarity. Understanding the meaning of a set of terms is often not straightforward because different words can carry the same meaning (e.g., leukemia and blood cancer). “Semantic similarity search” means searching by meaning rather than by word, and it relies on “encoding”. Encoding transforms words or sentences into vectors of numbers, i.e., points in an N-dimensional space (usually 700~2,000 dimensions), such that points near each other have similar meanings. We want to encode both curated terms (from our data dictionary) and uncurated terms (from new incoming data), compare them, and map uncurated terms to curated terms.
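The encode, compare, and map steps could be sketched as follows. This is only a minimal sketch: the `encode` function here is a toy character-trigram counter standing in for a real embedding model, so it captures spelling overlap, not meaning. The function names, the curated and uncurated lists, and the `threshold` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def encode(term, n=3):
    """Toy encoder: sparse vector of character n-gram counts.

    Stand-in for a real semantic embedding model (assumption for this sketch).
    """
    text = f" {term.lower().replace('_', ' ')} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_terms(uncurated, curated, threshold=0.3):
    """Map each uncurated term to its most similar curated term.

    Terms whose best score falls below `threshold` map to None,
    so they can be flagged for manual review.
    """
    curated_vecs = {c: encode(c) for c in curated}
    mapping = {}
    for term in uncurated:
        vec = encode(term)
        best, score = max(
            ((c, cosine(vec, cv)) for c, cv in curated_vecs.items()),
            key=lambda pair: pair[1],
        )
        mapping[term] = (best if score >= threshold else None, score)
    return mapping


curated = ["Radiation Therapy", "Chemotherapy", "Immunotherapy"]
uncurated = ["RADIO_THERAPY", "chemo therapy", "XRT"]
result = map_terms(uncurated, curated)
for term, (match, score) in result.items():
    print(f"{term!r} -> {match!r} (score={score:.2f})")
```

With this toy encoder, "XRT" shares no trigrams with any curated term and stays unmapped, which is exactly the kind of case where an encoder that captures meaning would be needed.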
Goal:
Approach:
Need skills:
Python
R
Possible mentors:
Sehyun Oh (@shbrief), Sean Davis (@seandavi)
If you are interested:
If you are interested in this project, please try the EDA below and email your EDA work to [email protected]. Looking forward to hearing your ideas. Thanks!