Added readme for so_semantic_text track
Mikep86 committed May 8, 2024
1 parent 07af735 commit 6448184
Showing 3 changed files with 109 additions and 0 deletions.
60 changes: 60 additions & 0 deletions so_semantic_text/README.md
@@ -0,0 +1,60 @@
## StackOverflow semantic_text track

This track benchmarks the `semantic_text` field type and `semantic` query performance.
It also enables comparison with existing inference generation & query approaches,
with a focus on comparison with ML inference pipelines and the `text_expansion` query.

The corpus is the same as that used by the `so` track; see `so/README.md` for more info about it.

### Generating the query set

Since the `so` track does not include queries, they were synthetically generated for this track
using the titles of questions from the corpus.
Only the first 1,000,000 questions were used to keep the query file size manageable.

To regenerate the query set from scratch, download the `so` corpus using
[this link](https://rally-tracks.elastic.co/so/posts.json.bz2) and run the query generation script:

```shell
./_tools/generate_queries.py -c 1000000 <path_to_posts_file> queries.txt.bz2
```
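
The script itself is not reproduced here. As a rough illustration of the approach described above, the following is a minimal sketch, not the actual `_tools/generate_queries.py`. It assumes that the corpus file contains one JSON document per line, that questions carry a non-empty `title` field, and that posts without a title can be skipped; the function name `generate_queries` is hypothetical.

```python
import bz2
import json


def generate_queries(posts_path, output_path, count=1_000_000):
    """Write the titles of the first `count` questions to a bz2 file, one per line.

    Assumption: each input line is a JSON document and questions have a
    non-empty "title" field. Posts without a title are skipped.
    """
    written = 0
    with bz2.open(posts_path, "rt", encoding="utf-8") as posts, \
            bz2.open(output_path, "wt", encoding="utf-8") as out:
        for line in posts:
            if written >= count:
                break
            doc = json.loads(line)
            # Collapse embedded newlines so that each query stays on one line.
            title = (doc.get("title") or "").replace("\n", " ").strip()
            if not title:
                continue
            out.write(title + "\n")
            written += 1
    return written
```

Streaming the compressed file line by line keeps memory use flat regardless of corpus size, which matters for a corpus of this scale.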

### Parameters

This track allows the following parameters to be overridden using `--track-params` (Rally 0.8.0+):

* `bulk_size` (default: 10): Number of documents per bulk request.
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.
* `number_of_replicas` (default: 0)
* `number_of_shards` (default: 5)
* `source_enabled` (default: true): A boolean defining whether the `_source` field is stored in the index.
* `index_settings`: A list of index settings. Index settings defined elsewhere (e.g. `number_of_replicas`) need to be overridden explicitly.
* `cluster_health` (default: "green"): The minimum required cluster health.
* `error_level` (default: "non-fatal"): The value of `ignore-response-error-level`. Applies to bulk operations only.
* `post_ingest_sleep` (default: false): Whether to pause after ingest and prior to subsequent operations.
* `post_ingest_sleep_duration` (default: 30): Sleep duration in seconds.
* `use_pipelines` (default: false): A boolean defining whether to use an ML inference pipeline (`true`) or a `semantic_text` field (`false`).
  This value also controls the query type used in the `semantic-search` operation.
  See the flow chart below for more details.
* `elser_model_id` (default: ".elser_model_2"): The name of the ELSER model to use.
* `num_allocations` (default: 1): The number of ELSER allocations to deploy.
* `num_threads` (default: 1): The number of threads to use per ELSER allocation.
* `calculate_body_vector` (default: false): A boolean defining whether inference should be performed on the `body` field during ingest.
* `enable_search` (default: false): A boolean defining if the `semantic-search` operation is enabled.
* `semantic_search_clients` (default: 1): The number of clients that issue queries in the `semantic-search` operation.
* `semantic_search_time_period` (default: 300): The time period, in seconds, to execute the `semantic-search` operation for.
* `semantic_search_warmup_time_period` (default: 10): The warmup time period, in seconds, for the `semantic-search` operation.
* `semantic_search_page_size` (default: 20): The number of results to fetch for each query.
* `use_nested_text_expansion` (default: false): A boolean defining if a nested `text_expansion` query is used instead of a `semantic` query.
  Only consulted when `use_pipelines` is false.
  See the flow chart below for more details.
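
As a convenience when overriding several parameters at once, the sketch below writes a set of overrides to a JSON file. It assumes that the installed Rally version accepts a JSON file path as the value of `--track-params`. The helper name `write_track_params` and the example values are hypothetical; only the parameter names come from the list above.

```python
import json


def write_track_params(path, params):
    """Serialize track parameter overrides to a JSON file and return its path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)
    return path


# Illustrative overrides: ingest 10% of the corpus, then run semantic search
# with 4 clients for 120 seconds. The values are examples, not recommendations.
example_params = {
    "ingest_percentage": 10,
    "enable_search": True,
    "semantic_search_clients": 4,
    "semantic_search_time_period": 120,
}
```

Calling `write_track_params("params.json", example_params)` produces a file that could then be passed as `--track-params="params.json"`.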

When the `semantic-search` operation is enabled, the type of query executed is controlled by multiple parameters:

![image](query_flow_chart.png)
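
The decision logic in the flow chart can also be read as a small function. This is a sketch of the chart only, not code taken from the track; the function name `select_query_type` and the returned labels are hypothetical.

```python
def select_query_type(use_pipelines=False, use_nested_text_expansion=False):
    """Return the query type chosen by the flow chart for the given parameters."""
    if use_pipelines:
        # An ML inference pipeline was used at ingest: plain text_expansion query.
        return "text_expansion"
    if use_nested_text_expansion:
        # semantic_text field, queried with a nested text_expansion query.
        return "nested text_expansion"
    # semantic_text field, queried with the semantic query (both flags false).
    return "semantic"
```

Note that `use_nested_text_expansion` has no effect when `use_pipelines` is true, because the first branch is taken before it is checked.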

### License

The data is distributed under the same license as the original data: [CC BY-SA 3.0](http://creativecommons.org/licenses/by-sa/3.0/)


49 changes: 49 additions & 0 deletions so_semantic_text/query_flow_chart.drawio
@@ -0,0 +1,49 @@
<mxfile host="app.diagrams.net" modified="2024-05-08T18:07:43.482Z" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" etag="MGauHMRWr4ZgbiEgyRJX" version="24.3.1" type="device">
<diagram name="Page-1" id="V41hylW0FixsCSBRaKah">
<mxGraphModel dx="745" dy="1060" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" background="#ffffff" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
<mxCell id="lYxadVdlI73aAsubxVjZ-5" value="true" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;strokeColor=#000000;strokeWidth=2;fontColor=#000000;labelBackgroundColor=none;fontSize=12;" parent="1" source="lYxadVdlI73aAsubxVjZ-2" target="lYxadVdlI73aAsubxVjZ-4" edge="1">
<mxGeometry x="-0.0108" y="10" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="lYxadVdlI73aAsubxVjZ-7" value="false" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;strokeColor=#000000;strokeWidth=2;fontColor=#000000;labelBackgroundColor=none;fontSize=12;" parent="1" source="lYxadVdlI73aAsubxVjZ-2" target="lYxadVdlI73aAsubxVjZ-6" edge="1">
<mxGeometry x="-0.0092" y="-10" relative="1" as="geometry">
<Array as="points">
<mxPoint x="355" y="140" />
<mxPoint x="198" y="140" />
</Array>
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="lYxadVdlI73aAsubxVjZ-2" value="use_pipelines" style="rounded=1;whiteSpace=wrap;html=1;fillColor=#dae8fc;strokeColor=#6c8ebf;" parent="1" vertex="1">
<mxGeometry x="295" y="30" width="120" height="60" as="geometry" />
</mxCell>
<mxCell id="lYxadVdlI73aAsubxVjZ-4" value="text_expansion&lt;div&gt;query&lt;/div&gt;" style="rhombus;whiteSpace=wrap;html=1;fillColor=#d5e8d4;strokeColor=#82b366;" parent="1" vertex="1">
<mxGeometry x="415" y="190" width="145" height="90" as="geometry" />
</mxCell>
<mxCell id="2SGeM3FuVzMsr3sRwzxZ-1" value="true" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;strokeColor=#000000;strokeWidth=2;fontColor=#000000;labelBackgroundColor=none;fontSize=12;" edge="1" parent="1" source="lYxadVdlI73aAsubxVjZ-6" target="lYxadVdlI73aAsubxVjZ-8">
<mxGeometry x="-0.0108" y="18" relative="1" as="geometry">
<mxPoint y="1" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="2SGeM3FuVzMsr3sRwzxZ-3" value="false" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;strokeColor=#000000;strokeWidth=2;fontColor=#000000;labelBackgroundColor=none;fontSize=12;" edge="1" parent="1" source="lYxadVdlI73aAsubxVjZ-6" target="2SGeM3FuVzMsr3sRwzxZ-2">
<mxGeometry x="0.0602" y="-17" relative="1" as="geometry">
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="lYxadVdlI73aAsubxVjZ-6" value="use_nested_text_expansion" style="rounded=1;whiteSpace=wrap;html=1;fillColor=#dae8fc;strokeColor=#6c8ebf;" parent="1" vertex="1">
<mxGeometry x="100" y="205" width="195" height="60" as="geometry" />
</mxCell>
<mxCell id="lYxadVdlI73aAsubxVjZ-8" value="nested&lt;div&gt;text_expansion&lt;/div&gt;&lt;div&gt;query&lt;/div&gt;" style="rhombus;whiteSpace=wrap;html=1;fillColor=#d5e8d4;strokeColor=#82b366;" parent="1" vertex="1">
<mxGeometry x="230" y="370" width="150" height="80" as="geometry" />
</mxCell>
<mxCell id="2SGeM3FuVzMsr3sRwzxZ-2" value="semantic&lt;div&gt;query&lt;/div&gt;" style="rhombus;whiteSpace=wrap;html=1;fillColor=#d5e8d4;strokeColor=#82b366;" vertex="1" parent="1">
<mxGeometry x="20" y="370" width="150" height="80" as="geometry" />
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>
Binary file added so_semantic_text/query_flow_chart.png