Commit 3f5d753 ("update: readme"), parent 2cb7f25
yunfeixie233 committed Sep 23, 2024
- **[📄💥 September 24, 2024] Our arXiv paper is released (link coming soon).**

### Performances Overview

<p align="center">
  <img src="./resources/radar_chart.png" alt="Overall results of o1 and four other strong LLMs" width="50%">
</p>

<p align="center"><strong>Figure 1:</strong> Overall results of o1 and four other strong LLMs on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over both closed- and open-source models.</p>

<p align="center">
  <img src="./resources/bar.png" alt="Average accuracy of o1 and four other strong LLMs" width="50%">
</p>

<p align="center"><strong>Figure 2:</strong> Average accuracy of o1 and four other strong LLMs. o1 achieves the highest average accuracy, 73.3%, across 19 medical datasets.</p>

---

<img src="./resources/pipeline.png" width="100%">
</p>

<p align="left"><strong>Figure 3:</strong> Our evaluation pipeline covers different evaluation (a) <em>aspects</em>, each containing various <em>tasks</em>. We collect multiple (b) <em>datasets</em> for each task and combine them with various (c) <em>prompt strategies</em> to evaluate the latest (d) <em>language models</em>. We leverage a comprehensive set of (e) <em>evaluations</em> to present a holistic view of model progress in the medical domain.</p>
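The pipeline in Figure 3 can be pictured as a loop over every (model, dataset, prompt strategy) combination. The sketch below is a minimal, hypothetical illustration; the names (`evaluate`, `run_model`, `PROMPT_STRATEGIES`) are our assumptions, not this repository's actual API.

```python
# Hypothetical sketch of the aspect -> task -> dataset -> prompt -> model loop.
# All names here are illustrative, not the repository's real code.
PROMPT_STRATEGIES = {
    "direct": lambda q: q,
    "cot": lambda q: q + "\nLet's think step by step.",
}

def evaluate(models, datasets, strategies, run_model, score):
    """Return {(model, dataset, strategy): score} for every combination."""
    results = {}
    for model in models:
        for ds_name, examples in datasets.items():
            for strat_name, build_prompt in strategies.items():
                # Build one prompt per example, query the model, then score.
                preds = [run_model(model, build_prompt(ex["question"]))
                         for ex in examples]
                golds = [ex["answer"] for ex in examples]
                results[(model, ds_name, strat_name)] = score(preds, golds)
    return results
```

A caller would plug in its own `run_model` (an API call) and `score` (accuracy, F1, BLEU, ...), which is what lets one loop cover all the aspects, tasks, and metrics in the tables below.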

---

<img src="./resources/table1.png" width="100%">
</p>

<p align="left"><strong>Table 1:</strong> Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Results marked with * are taken from <cite>Wu et al. (2024)</cite> for reference. We also report the average score (Average) for each metric.</p>

<p align="center">
<img src="./resources/table2.png" width="100%">
</p>

<p align="left"><strong>Table 2:</strong> BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. A gray background highlights o1 results. We also report the average score (Average) for each metric.</p>
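As a reminder of what Table 2 measures: BLEU-1 is (clipped) unigram precision of the generated text against the reference, while ROUGE-1 is unigram recall. A minimal sketch, assuming whitespace tokenization and omitting BLEU's brevity penalty and any smoothing:

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision (brevity penalty omitted for simplicity)."""
    cand = candidate.split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.split())
    # Each candidate unigram counts at most as often as it appears in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

def rouge1(candidate: str, reference: str) -> float:
    """Unigram recall: overlapping unigrams / reference length."""
    ref = reference.split()
    if not ref:
        return 0.0
    cand_counts = Counter(candidate.split())
    overlap = sum(min(c, cand_counts[w]) for w, c in Counter(ref).items())
    return overlap / len(ref)
```

Production evaluations typically use library implementations (e.g. NLTK or rouge-score) with proper tokenization and smoothing; this sketch only shows the precision-vs-recall distinction between the two metrics.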

<p align="center">
<img src="./resources/table3.png" width="100%">
</p>

<p align="left"><strong>Table 3:</strong> Accuracy of models on the multilingual task, XmedBench <cite>Wang et al. (2024)</cite>.</p>

<p align="center">
<img src="./resources/table4.png" width="100%">
</p>

<p align="left"><strong>Table 4:</strong> Accuracy of LLMs on two agentic benchmarks.</p>

<p align="center">
<img src="./resources/table5.png" width="100%">
</p>

<p align="left"><strong>Table 5:</strong> Accuracy of models with and without CoT prompting on 5 knowledge QA datasets.</p>
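The with/without-CoT comparison in Table 5 comes down to how the prompt is built for each multiple-choice item. The following is an illustrative sketch only; the exact wording is our assumption, not the paper's real template.

```python
# Illustrative with/without chain-of-thought prompt construction for a
# multiple-choice knowledge QA item. The wording is a hypothetical example.
def build_prompt(question: str, options: dict, use_cot: bool) -> str:
    opts = "\n".join(f"({k}) {v}" for k, v in options.items())
    base = f"Question: {question}\n{opts}\n"
    if use_cot:
        # CoT: ask the model to reason before committing to an option.
        return base + "Let's think step by step, then give the final option letter."
    # Direct: ask for the option letter only.
    return base + "Answer with the option letter only."
```

Everything else in the run (dataset, model, scoring) stays fixed, so any accuracy difference in Table 5 is attributable to this one prompting change.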

---

<img src="./resources/case_1.png" width="100%">
</p>

<p align="left"><strong>Figure 4:</strong> Comparison of the answers from o1 and GPT-4 for a question from NEJM. o1 provides a more concise and accurate reasoning process compared to GPT-4.</p>

<p align="center">
<img src="./resources/hos_case_1.png" width="100%">
</p>

<p align="left"><strong>Figure 5:</strong> Comparison of the answers from o1 and GPT-4 for a case from the Chinese dataset AI Hospital, along with its English translation. o1 offers a more precise diagnosis and practical treatment suggestions compared to GPT-4.</p>

---

This work is partially supported by the OpenAI Researcher Access Program and Mic
## 📜 Citation

If you find this work useful for your research and applications, please cite using this BibTeX:

<!--
```bibtex
@misc{xie2024preliminarystudy,
      title={A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?},
      author={Yunfei Xie and Juncheng Wu and Haoqin Tu and Siwei Yang and Bingchen Zhao and Yongshuo Zong and Qiao Jin and Cihang Xie and Yuyin Zhou},
      year={2024},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={link_to_arxiv},
}
```
-->

## 🔗 Related Projects

