diff --git a/elser-getting-started.ipynb b/elser-getting-started.ipynb new file mode 100644 index 0000000..80c47c6 --- /dev/null +++ b/elser-getting-started.ipynb @@ -0,0 +1,798 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "s49gpkvZ7q53" + }, + "source": [ + "# Semantic Search using ELSER and Python\n", + "\n", + "Learn how to use the [Elastic Learned Sparse Encoder](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) for text expansion-powered semantic search." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Y01AXpELkygt" + }, + "source": [ + "# 🧰 Requirements\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.6 or later\n", + "- An Elastic deployment with minimum **4GB machine learning node**\n", + " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n", + "- The [ELSER](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html) model installed on your Elastic deployment\n", + "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "N4pI1-eIvWrI" + }, + "source": [ + "# Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Under **Advanced settings**, go to **Machine Learning instances**\n", + " - You'll need at least **4GB** RAM per zone for this tutorial\n", + " - Select **Create deployment**" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "nSw1R8e28F_E" + }, + "source": [ + "# Setup ELSER\n", + "To use ELSER, you must have the [appropriate subscription]() level\n", + "for semantic search or the trial period activated.\n", + "\n", + "Follow these [instructions](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html#trained-model) to download and deploy ELSER in the Kibana UI or using the Dev Tools **Console**.\n", + "\n", + "(Console commands in comments 👇)\n", + "\n", + "" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gaTFHLJC-Mgi" + }, + "source": [ + "# Install packages and initialize the Elasticsearch Python client\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to `pip` install the following packages:\n", + "\n", + "- `elasticsearch`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K9Q1p2C9-wce", + "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" + }, + "outputs": [], + "source": [ + "!pip install elasticsearch" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gEzq2Z1wBs3M" + }, + "source": [ + "[TODO: Update]\n", + "Next we need to import the `elasticsearch` module and the `getpass` module.\n", + "`getpass` is part of the Python standard library and is used to securely prompt for credentials." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uP_GTVRi-d96" + }, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "from urllib.request import urlopen\n", + "import getpass\n", + "import requests\n", + "import json" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "AMSePFiZCRqX" + }, + "source": [ + "Now we can instantiate the Python Elasticsearch client.\n", + "First we prompt the user for their password and Cloud ID.\n", + "\n", + "🔐 NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal, or storing it in memory.\n", + "\n", + "Then we create a `client` object that instantiates an instance of the `Elasticsearch` class." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "h0MdAZ53CdKL", + "outputId": "96ea6f81-f935-4d51-c4a7-af5a896180f1" + }, + "outputs": [], + "source": [ + "# Found in the 'Manage Deployment' page\n", + "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n", + "\n", + "# Password for the 'elastic' user generated by Elasticsearch\n", + "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n", + "\n", + "# Create the client instance\n", + "client = Elasticsearch(\n", + " cloud_id=CLOUD_ID,\n", + " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "bRHbecNeEDL3" + }, + "source": [ + "Confirm that the client has connected with this test" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rdiUKqZbEKfF", + "outputId": "43b6f1cd-a43e-4dbe-caa5-7fd170464881" + }, + "outputs": [], + "source": [ + "print(client.info())" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "enHQuT57DhD1" + }, + "source": [ + "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n", + "\n", + "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect using API keys.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "TF_wxIAhD07a" + }, + "source": [ + "# Create Elasticsearch index with required mappings\n", + "\n", + "To use the ELSER model at index time, we'll need to create an index mapping that supports a [`text_expansion`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-text-expansion-query.html) query.\n", + "The mapping must include a field of type [`rank_features`](https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-features.html) to work with our feature vectors of interest.\n", + "This field contains the token-weight pairs the ELSER model created based on the input text.\n", + "\n", + "Let's create an index named `elser-movies` with the mappings we need.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cvYECABJJs_2", + "outputId": "18fb51e4-c4f6-4d1b-cb2d-bc6f8ec1aa84" + }, + "outputs": [], + "source": [ + "INDEX = 'elser-movies'\n", + "client.indices.create(\n", + " index=INDEX,\n", + " settings={\n", + " \"index\": {\n", + " \"number_of_shards\": 1,\n", + " \"number_of_replicas\": 1\n", + " }\n", + " },\n", + " mappings={\n", + " \"properties\": {\n", + " \"genre\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"keyScene\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"plot\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"released\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"runtime\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"title\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"ml.tokens\": {\n", + " \"type\": \"rank_features\"\n", + " },\n", + " \"keyScene\": {\n", + " \"type\": \"text\"\n", + " }\n", + " }\n", + "}\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "ohcvdngCGJlo" + }, + "source": [] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "EmELvr_JK_22" + }, + "source": [ + "# Create an ingest pipeline with an inference processor to use ELSER\n", + "\n", + "In order to use ELSER on our Elastic Cloud deployment we'll need to create an ingest pipeline that contains an inference processor that runs the ELSER model.\n", + "Let's add that pipeline using the [`put_pipeline`](https://www.elastic.co/guide/en/elasticsearch/reference/master/put-pipeline-api.html) method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "XhRng99KLQsd", + "outputId": "00ea73b5-45a4-472b-f4bc-2c2c790ab94d" + }, + "outputs": [], + "source": [ + "\n", + "client.ingest.put_pipeline(id=\"elser-v1-test\", body={\n", + " \"processors\": [\n", + " {\n", + " \"inference\": {\n", + " \"model_id\": \".elser_model_1\",\n", + " \"target_field\": \"ml\",\n", + " \"field_map\": {\n", + " \"keyScene\": \"text_field\",\n", + " \"plot\": \"text_field\"\n", + " },\n", + " \"inference_config\": {\n", + " \"text_expansion\": {\n", + " \"results_field\": \"tokens\"\n", + " }\n", + " }\n", + " }\n", + " }\n", + " ]\n", + "})" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "0wCH7YHLNW3i" + }, + "source": [ + "Let's note a few important parameters from that API call:\n", + "\n", + "- `inference`: A processor that performs inference using a machine learning model.\n", + "- `model_id`: Specifies the ID of the machine learning model to be used. In this example, the model ID is set to `.elser_model_1`.\n", + "- `target_field`: Defines the field where the inference result will be stored. Here, it is set to `ml`.\n", + "- `text_expansion`: Configures text expansion options for the inference process.\n", + "In this example, the inference results will be stored in a field named \"tokens\"." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "WgWDMgf9NkHL" + }, + "source": [] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "U3vT2g5LVIQF" + }, + "source": [ + "# Create index and mapping for test data\n", + "\n", + "\n", + "We have some test data in a `json` file at this [URL](https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json).\n", + "Let's load that into our Elastic deployment.\n", + "First we'll create an index named `search-movies` to store that data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "X3ONJckPnUIT", + "outputId": "07ea0766-c226-4510-c910-893db89757ad" + }, + "outputs": [], + "source": [ + "client.indices.create(\n", + " index=\"search-movies\",\n", + " mappings= {\n", + " \"properties\": {\n", + " \"genre\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"keyScene\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"plot\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"released\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"runtime\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"title\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " }\n", + " }\n", + "})" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "lFHgRUYVpNKP" + }, + "source": [ + "# Upload sample data\n", + "\n", + "> ⚠ To use the UI to upload data, follow the approach described [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-search-elser.html#load-data).\n", + "\n", + "Let's upload the JSON data.\n", + "The dataset provides information on twelve iconic films.\n", + "Each film's entry includes its title, runtime, plot summary, a key scene, genre classification, and release year." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IBfqgdAcuKRG", + "outputId": "3b86daa1-ade1-4ff3-da81-4207fa814d30" + }, + "outputs": [], + "source": [ + "url = \"https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json\"\n", + "\n", + "# Send a request to the URL and get the response\n", + "response = urlopen(url)\n", + "\n", + "# Load the response data into a JSON object\n", + "data_json = json.loads(response.read())\n", + "\n", + "def create_index_body(doc):\n", + " \"\"\" Generate the body for an Elasticsearch document. \"\"\"\n", + " return {\n", + " \"_index\": \"search-movies\",\n", + " \"_source\": doc,\n", + " }\n", + "\n", + "# Prepare the documents to be indexed\n", + "documents = [create_index_body(doc) for doc in data_json]\n", + "\n", + "# Use helpers.bulk to index\n", + "helpers.bulk(client, documents)\n", + "\n", + "print(\"Done indexing documents into `search-movies` index!\")\n", + "\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "73d3Td-1ubhv" + }, + "source": [ + "# Ingest the data through the inference ingest pipeline\n", + "\n", + "Create tokens from the text by reindexing the data throught the inference pipeline that uses ELSER as the inference model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ysYobyC9uhn5", + "outputId": "27af8c88-9039-4ff8-a20f-9af9ffcff05c" + }, + "outputs": [], + "source": [ + "client.reindex(wait_for_completion=False,\n", + " source={\n", + " \"index\": \"search-movies\"\n", + " },\n", + " dest= {\n", + " \"index\": \"elser-movies\",\n", + " \"pipeline\": \"elser-v1-test\"\n", + " }\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "tUDGeY7e2-I2" + }, + "source": [ + "# Confirm documents are indexed with additional fields\n", + "\n", + "A successful API call in the previous step returns a task ID to monitor the job's progress.\n", + "Use the [task management API](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html) to check progress.\n", + "You can also monitor this task using the **Trained Models** UI in Kibana, selecting the **Pipelines** tab under **ELSER**.\n", + "\n", + "Call the following, replacing `` with the task id returned in the previous step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "2KXeXCc63WVw", + "outputId": "e8fee6dd-34a1-401d-c879-71fd54de3c90" + }, + "outputs": [], + "source": [ + "client.tasks.get(task_id='cxy4bU9ASFKpFgZUpa-jnA:19545263')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "oCj3jHHML4Tn" + }, + "source": [ + "Inspect a new document to confirm that it now has an `\"ml\": {\"tokens\":...}` field that contains a list of new, additional terms.\n", + "These terms are the **text expansion** of the field(s) you targeted for ELSER inference.\n", + "ELSER essentially creates a tree of expanded terms to improve the semantic searchability of your documents.\n", + "We'll be able to search these documents using a `text_expansion` query.\n", + "\n", + "But first let's start with a simple keyword search, to see how ELSER delivers semantically relevant results out of the box." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "_KahQAbPPd9l" + }, + "source": [ + "# Keyword match\n", + "\n", + "## Successful match\n", + "\n", + "Let's start by assuming a user queries the data set and hits an exact match.\n", + "BM25 is perfect for exact keyword matches.\n", + "Imagine our user remembers a movie where a child's spinning top was a recurring image.\n", + "They search for `spinning top` and because these exact words are used in the key scene description, we get a perfect hit.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FsZkFhGaYnzD", + "outputId": "843c72f1-6a0c-43ce-c1e4-ad5e763ebc95" + }, + "outputs": [], + "source": [ + "response = client.search(\n", + " index=\"elser-movies\",\n", + " query= {\n", + " \"match\": {\n", + " \"keyScene\": \"spinning top\"\n", + " }\n", + " }\n", + ")\n", + "for hit in response['hits']['hits']:\n", + " doc_id = hit['_id']\n", + " score = hit['_score']\n", + " title = hit['_source']['title']\n", + " text = hit['_source']['keyScene']\n", + " print(f\"\\nTitle: {title}\\nKey scene description: {text}\\n\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Y01WHeOtbTZ-" + }, + "source": [ + "## Unsuccessful match\n", + "\n", + "Unfortunately, searches that rely on exact matches are brittle.\n", + "What if you can't remember the exact name of the thing you're searching for?\n", + "Who knows what a spinning top is anyway?\n", + "\n", + "Imagine I can only think of the word `child toy` to describe this apparatus?\n", + "A match query won't find any relevant documents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "osifkhqidjYw", + "outputId": "6b917df6-b0af-4947-9280-98f7b17f2ff9" + }, + "outputs": [], + "source": [ + "response = client.search(\n", + " index=\"elser-movies\",\n", + " query= {\n", + " \"match\": {\n", + " \"keyScene\": \"child toy\"\n", + " }\n", + " }\n", + ")\n", + "hits = response['hits']['hits']\n", + "\n", + "if not hits:\n", + " print(\"No matches found\")\n", + "else:\n", + " for hit in hits:\n", + " doc_id = hit['_id']\n", + " score = hit['_score']\n", + " title = hit['_source']['title']\n", + " text = hit['_source']['keyScene']\n", + " print(f\"\\nTitle: {title}\\nKey scene description: {text}\\n\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "MPCVztOLeAk_" + }, + "source": [ + "So it turns out classical term matching strategies are very good, if you know precisely what you're looking for.\n", + "But they break down when a user has a hard time articulating what they're trying to find.\n", + "Here's where semantic search shines.\n", + "It helps capture a user's intent or meaning better, without relying on brittle term matches.\n", + "\n", + "Traditional dense vector based similarity strategies require you to generate embeddings for your data and then map queries into the same mathematical space as the data.\n", + "This works well but is time consuming and requires a lot of legwork.\n", + "The beauty of the Elastic Learned Sparse Encoder model is that it works out-of-the-box, without the need to fine tune on your data.\n", + "\n", + "The Elastic Learned Sparse Encoder creates a tree of expanded terms, adds them to your documents, improving their semantic searchability.\n", + "The fields that you targeted for inference are now enriched with a range of relevant synonyms and related terms, that increase the probability of a successful search." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Zy5GT2xb38oz" + }, + "source": [ + "# Semantic search with the `text_expansion` query\n", + "\n", + "Let's test out semantic search using the Elastic Learned Sparse Encoder, and see if we can improve our earlier unsuccessful search, using the query `child toy`.\n", + "\n", + "To perform semantic search using the Elastic Learned Sparse Encoder, you need the following:\n", + "- A `text_expansion` query\n", + "- Query text\n", + " - In this example we use `child toy`\n", + "- ELSER model ID" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bAZRxja-5Q6X", + "outputId": "37a26a2c-4284-4e51-c34e-9a55edf77cb8" + }, + "outputs": [], + "source": [ + "response = client.search(index='elser-movies', size=3,\n", + " query={\n", + " \"text_expansion\": {\n", + " \"ml.tokens\": {\n", + " \"model_id\":\".elser_model_1\",\n", + " \"model_text\":\"child toy\"\n", + " \n", + " }\n", + " }\n", + "}\n", + ")\n", + "\n", + "for hit in response['hits']['hits']:\n", + " doc_id = hit['_id']\n", + " score = hit['_score']\n", + " title = hit['_source']['title']\n", + " text = hit['_source']['keyScene']\n", + " print(f\"Score: {score}\\nTitle: {title}\\nKey scene description: {text}\\n\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "yYSJ7fnv5uWd" + }, + "source": [ + "Success! Out of the box ELSER has taken a fuzzy, but semantically similar query and found the correct match.\n", + "Our user has found the movie they're looking for!" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/notebook.ipynb b/notebook.ipynb new file mode 100644 index 0000000..e69de29