Commit 3f5d753 ("update: readme"), parent 2cb7f25
yunfeixie233 committed Sep 23, 2024
- **[📄💥 September 24, 2024] Our arXiv paper is released (link coming soon).**

### Performances Overview

<p align="center">
  <img src="./resources/radar_chart.png" alt="Overall results of o1 and four other strong LLMs" width="50%">
</p>

<p align="center"><strong>Figure 1:</strong> Overall results of o1 and four other strong LLMs on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over both closed- and open-source models.</p>

<p align="center">
  <img src="./resources/bar.png" alt="Average accuracy of o1 and four other strong LLMs" width="50%">
</p>

<p align="center"><strong>Figure 2:</strong> Average accuracy of o1 and four other strong LLMs. o1 achieves the highest average accuracy, 73.3%, across 19 medical datasets.</p>

---

<img src="./resources/pipeline.png" width="100%">
</p>

<p align="left"><strong>Figure 3:</strong> Our evaluation pipeline covers different evaluation (a) <em>aspects</em>, each containing various <em>tasks</em>. We collect multiple (b) <em>datasets</em> for each task and combine them with various (c) <em>prompt strategies</em> to evaluate the latest (d) <em>language models</em>. We leverage a comprehensive set of (e) <em>evaluations</em> to present a holistic view of model progress in the medical domain.</p>
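The pipeline in Figure 3 can be pictured as a loop over every (model, dataset, prompt strategy) combination. The sketch below is a minimal, hypothetical illustration; the names (`evaluate`, `run_model`, `PROMPT_STRATEGIES`) are our assumptions, not this repository's actual API.

```python
# Hypothetical sketch of the aspect -> task -> dataset -> prompt -> model loop.
# All names here are illustrative, not the repository's real code.
PROMPT_STRATEGIES = {
    "direct": lambda q: q,
    "cot": lambda q: q + "\nLet's think step by step.",
}

def evaluate(models, datasets, strategies, run_model, score):
    """Return {(model, dataset, strategy): score} for every combination."""
    results = {}
    for model in models:
        for ds_name, examples in datasets.items():
            for strat_name, build_prompt in strategies.items():
                # Build one prompt per example, query the model, then score.
                preds = [run_model(model, build_prompt(ex["question"]))
                         for ex in examples]
                golds = [ex["answer"] for ex in examples]
                results[(model, ds_name, strat_name)] = score(preds, golds)
    return results
```

A caller would plug in its own `run_model` (an API call) and `score` (accuracy, F1, BLEU, ...), which is what lets one loop cover all the aspects, tasks, and metrics in the tables below.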

---

<img src="./resources/table1.png" width="100%">
</p>

<p align="left"><strong>Table 1:</strong> Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Results marked with * are taken from <cite>Wu et al. (2024)</cite> for reference. We also report the average score (Average) for each metric.</p>

<p align="center">
<img src="./resources/table2.png" width="100%">
</p>

<p align="left"><strong>Table 2:</strong> BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. A gray background highlights o1 results. We also report the average score (Average) for each metric.</p>
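As a reminder of what Table 2 measures: BLEU-1 is (clipped) unigram precision of the generated text against the reference, while ROUGE-1 is unigram recall. A minimal sketch, assuming whitespace tokenization and omitting BLEU's brevity penalty and any smoothing:

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision (brevity penalty omitted for simplicity)."""
    cand = candidate.split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.split())
    # Each candidate unigram counts at most as often as it appears in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

def rouge1(candidate: str, reference: str) -> float:
    """Unigram recall: overlapping unigrams / reference length."""
    ref = reference.split()
    if not ref:
        return 0.0
    cand_counts = Counter(candidate.split())
    overlap = sum(min(c, cand_counts[w]) for w, c in Counter(ref).items())
    return overlap / len(ref)
```

Production evaluations typically use library implementations (e.g. NLTK or rouge-score) with proper tokenization and smoothing; this sketch only shows the precision-vs-recall distinction between the two metrics.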

<p align="center">
<img src="./resources/table3.png" width="100%">
</p>

<p align="left"><strong>Table 3:</strong> Accuracy of models on the multilingual task, XmedBench <cite>Wang et al. (2024)</cite>.</p>

<p align="center">
<img src="./resources/table4.png" width="100%">
</p>

<p align="left"><strong>Table 4:</strong> Accuracy of LLMs on two agentic benchmarks.</p>

<p align="center">
<img src="./resources/table5.png" width="100%">
</p>

<p align="left"><strong>Table 5:</strong> Accuracy of models with and without CoT prompting on 5 knowledge QA datasets.</p>
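The with/without-CoT comparison in Table 5 comes down to how the prompt is built for each multiple-choice item. The following is an illustrative sketch only; the exact wording is our assumption, not the paper's real template.

```python
# Illustrative with/without chain-of-thought prompt construction for a
# multiple-choice knowledge QA item. The wording is a hypothetical example.
def build_prompt(question: str, options: dict, use_cot: bool) -> str:
    opts = "\n".join(f"({k}) {v}" for k, v in options.items())
    base = f"Question: {question}\n{opts}\n"
    if use_cot:
        # CoT: ask the model to reason before committing to an option.
        return base + "Let's think step by step, then give the final option letter."
    # Direct: ask for the option letter only.
    return base + "Answer with the option letter only."
```

Everything else in the run (dataset, model, scoring) stays fixed, so any accuracy difference in Table 5 is attributable to this one prompting change.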

---

<img src="./resources/case_1.png" width="100%">
</p>

<p align="left"><strong>Figure 4:</strong> Comparison of the answers from o1 and GPT-4 for a question from NEJM. o1 provides a more concise and accurate reasoning process compared to GPT-4.</p>

<p align="center">
<img src="./resources/hos_case_1.png" width="100%">
</p>

<p align="left"><strong>Figure 5:</strong> Comparison of the answers from o1 and GPT-4 for a case from the Chinese dataset AI Hospital, along with its English translation. o1 offers a more precise diagnosis and practical treatment suggestions compared to GPT-4.</p>

---

This work is partially supported by the OpenAI Researcher Access Program and Mic
## 📜 Citation

If you find this work useful for your research and applications, please cite using this BibTeX:

<!--
```bibtex
@misc{xie2024preliminarystudy,
      title={A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?},
      author={Yunfei Xie and Juncheng Wu and Haoqin Tu and Siwei Yang and Bingchen Zhao and Yongshuo Zong and Qiao Jin and Cihang Xie and Yuyin Zhou},
      year={2024},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={link_to_arxiv},
}
```
-->

## 🔗 Related Projects

