docs:eval add exp and summary res,make base eval as same ,del filter (#…
wangzaistone authored Dec 8, 2023
1 parent cb8aeeb commit 3ec335f
This doc aims to summarize the performance of publicly available big language models on Text-to-SQL.
| Model | Execution Accuracy (EX) | Reference |
|:------|:-----------------------:|:----------|
| Baichuan2-13B-Chat | 0.392 | Evaluated in this project with default parameters. |
| llama2_13b_hf | 0.449 | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |
| llama2_13b_hf_lora_best | 0.744 | SFT-trained by this project on only the Spider train set; evaluated the same way as the other rows. |
| chatglm3_lora_default | 0.590 | SFT-trained by this project on only the Spider train set; evaluated the same way as the other rows. |
| chatglm3_qlora_default | 0.581 | SFT-trained by this project on only the Spider train set; evaluated the same way as the other rows. |




It's important to note that our evaluation results are obtained based on the current version of this project's code and configuration.
If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.


## LLMs Text-to-SQL capability evaluation before 20231208
The following table shows the execution accuracy (EX) of our experiments on Spider. This round of evaluation is based on the database downloaded from [the Spider-based test-suite](https://github.com/taoyds/test-suite-sql-eval) (about 1.27 GB), which differs from the one on the Spider official [website](https://yale-lily.github.io/spider) (only about 95 MB).
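As a concrete illustration of the metric: execution accuracy counts a predicted query as correct when running it returns the same result set as the gold query on the target database. A minimal sketch in Python's standard `sqlite3` module — the schema, data, and queries below are hypothetical illustrations, not Spider data or this project's actual evaluation code:

```python
# Minimal sketch of execution accuracy (EX): a prediction is correct when it
# returns the same result set as the gold SQL on the same database.
import sqlite3

def execution_match(db: sqlite3.Connection, gold_sql: str, pred_sql: str) -> bool:
    try:
        gold = db.execute(gold_sql).fetchall()
        pred = db.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # an un-executable prediction scores 0
    # order-insensitive comparison, since most queries do not fix row order
    return sorted(gold) == sorted(pred)

# Toy database (hypothetical, not from Spider)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
db.executemany("INSERT INTO singer VALUES (?, ?)", [("A", 30), ("B", 25)])

print(execution_match(db,
                      "SELECT name FROM singer WHERE age > 26",
                      "SELECT name FROM singer WHERE age >= 27"))  # True
```

The per-model EX reported below is simply the fraction of dev-set examples for which such a match succeeds.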

<table>
    <tr>
        <th>Model</th>
        <th>Method</th>
        <th>EX (easy)</th>
        <th>EX (medium)</th>
        <th>EX (hard)</th>
        <th>EX (extra)</th>
        <th>EX (all)</th>
    </tr>
<tr>
<td>Llama2-7B-Chat</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.887</td>
<td>0.641</td>
<td>0.489</td>
<td>0.331</td>
<td>0.626</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.847</td>
<td>0.623</td>
<td>0.466</td>
<td>0.361</td>
<td>0.608</td>
</tr>
<tr>
<td>Llama2-13B-Chat</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.907</td>
<td>0.729</td>
<td>0.552</td>
<td>0.343</td>
<td>0.68</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.7</td>
<td>0.552</td>
<td>0.319</td>
<td>0.664</td>
</tr>
<tr>
<td>CodeLlama-7B-Instruct</td>
<td>base</td>
<td>0.214</td>
<td>0.177</td>
<td>0.092</td>
<td>0.036</td>
<td>0.149</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.923</td>
<td>0.756</td>
<td>0.586</td>
<td>0.349</td>
<td>0.702</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.751</td>
<td>0.598</td>
<td>0.331</td>
<td>0.696</td>
</tr>
<tr>
<td>CodeLlama-13B-Instruct</td>
<td>base</td>
<td>0.698</td>
<td>0.601</td>
<td>0.408</td>
<td>0.271</td>
<td>0.539</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.94</td>
<td>0.789</td>
<td>0.684</td>
<td>0.404</td>
<td>0.746</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.94</td>
<td>0.774</td>
<td>0.626</td>
<td>0.392</td>
<td>0.727</td>
</tr>
<tr>
<td>Baichuan2-7B-Chat</td>
<td>base</td>
<td>0.577</td>
<td>0.352</td>
<td>0.201</td>
<td>0.066</td>
        <td>0.335</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.871</td>
<td>0.63</td>
<td>0.448</td>
<td>0.295</td>
<td>0.603</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.891</td>
<td>0.637</td>
<td>0.489</td>
<td>0.331</td>
<td>0.624</td>
</tr>
<tr>
<td>Baichuan2-13B-Chat</td>
<td>base</td>
<td>0.581</td>
<td>0.413</td>
<td>0.264</td>
<td>0.187</td>
<td>0.392</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.903</td>
<td>0.702</td>
<td>0.569</td>
<td>0.392</td>
<td>0.678</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.895</td>
<td>0.675</td>
<td>0.58</td>
<td>0.343</td>
<td>0.659</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td>base</td>
<td>0.395</td>
<td>0.256</td>
<td>0.138</td>
<td>0.042</td>
<td>0.235</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.855</td>
<td>0.688</td>
<td>0.575</td>
<td>0.331</td>
<td>0.652</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.675</td>
<td>0.575</td>
<td>0.343</td>
<td>0.662</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td>base</td>
<td>0.871</td>
<td>0.632</td>
<td>0.368</td>
<td>0.181</td>
<td>0.573</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.895</td>
<td>0.702</td>
<td>0.552</td>
<td>0.331</td>
<td>0.663</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.919</td>
<td>0.744</td>
<td>0.598</td>
<td>0.367</td>
<td>0.701</td>
</tr>
<tr>
<td>ChatGLM3-6b</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.855</td>
<td>0.605</td>
<td>0.477</td>
<td>0.271</td>
<td>0.59</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.843</td>
<td>0.603</td>
<td>0.506</td>
<td>0.211</td>
<td>0.581</td>
</tr>
</table>
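The overall EX in the last column is consistent with a count-weighted average of the per-difficulty columns. Assuming the commonly cited Spider dev-set difficulty counts (248 easy, 446 medium, 174 hard, 166 extra — an assumption on our part, not stated in this doc), the Baichuan2-7B-Chat base row reproduces an overall score of about 0.335:

```python
# Count-weighted overall EX from per-difficulty EX scores.
# Difficulty counts are the commonly cited Spider dev-set split
# (an assumption here, not taken from this doc).
counts = {"easy": 248, "medium": 446, "hard": 174, "extra": 166}
scores = {"easy": 0.577, "medium": 0.352, "hard": 0.201, "extra": 0.066}  # Baichuan2-7B-Chat, base

total = sum(counts.values())  # 1034 dev examples
overall = sum(scores[k] * counts[k] for k in counts) / total
print(round(overall, 3))  # 0.335
```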


1. All LoRA and QLoRA models were fine-tuned with default settings on the Spider training set.
2. All candidate models use the same evaluation method and prompt. The prompt explicitly requires the model to output only SQL. The base evaluation results of Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0; analysis shows that many of the errors come from these models generating content other than SQL.
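The failure mode in note 2 follows directly from strict execution scoring: if the raw model output is executed as-is, any conversational wrapper around the SQL is a syntax error, so the prediction scores 0 even when the embedded query is right. A hypothetical illustration (the chatty string below is invented, not actual model output):

```python
# Hypothetical example of why chatty output scores 0 under execution accuracy:
# executing the raw output fails because the leading prose is not valid SQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x INTEGER)")

chatty_output = "Sure! Here is the SQL you asked for:\nSELECT x FROM t;"
try:
    db.execute(chatty_output)
    executable = True
except sqlite3.Error:
    executable = False
print(executable)  # False
```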


## 2. Acknowledgements
Thanks to the following open source projects.

