lss_solr_configs

Report Bug - Request Feature


About The Project

This project provides the configuration for Solr 6 and Solr 8 used in HathiTrust full-text search.

The main problem we are trying to solve is providing a HathiTrust custom architecture for the Solr server that can deal with:

  • Huge indexes that require significant changes to Solr defaults to work;
  • Custom indexing to deal with multiple languages and huge documents.

The initial version of the HT Solr cluster runs Solr 6 in standalone mode. The current proposal of this repository is to upgrade the Solr server to Solr 8 in cloud mode. However, the Solr 6 documentation will remain here for a while as legacy material and as a reference.

Built With

Phases

The project is divided into three phases. Each phase has a specific goal to achieve.

  • Phase 1: Upgrade Solr server from Solr 6 to Solr 8 in cloud mode
    • Understand the Solr 6 architecture to migrate to Solr 8
    • Create a docker image for Solr 8 in cloud mode
      • Create a docker image for Solr 8 and external Zookeeper
      • Create a docker image for Solr 8 and embedded Zookeeper
  • Phase 2: Index data in Solr 8 and integrate it in babel-local-dev and ht_indexer
    • Create a script to automate the process of indexing data in Solr 8
  • Phase 3: Set up the Solr cluster in Kubernetes with data persistence and security
    • Create a Kubernetes deployment for Solr 8 and external Zookeeper
    • Deploy the Solr cluster in Kubernetes
    • Create a Python module to manage Solr collections and configsets
    • Clean up the code and documentation

Project Set Up

Prerequisites

  • Docker
  • Python
  • Java

Installation

  1. Clone the repo: git clone [email protected]:hathitrust/lss_solr_configs.git

  2. Start the Solr server in standalone mode: docker-compose -f docker-compose_solr6_standalone.yml up

  3. Start the Solr server in cloud mode: docker-compose -f docker-compose.yml up
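
Once the containers are up, you can sanity-check that Solr is answering. This is a minimal example; the solr:SolrRocks credentials match the curl examples later in this README and only matter when authentication is enabled (cloud mode):

    curl -u solr:SolrRocks "http://localhost:8983/solr/admin/info/system?wt=json"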

Content Structure

Project Structure

lss_solr_configs/
├── solr_manager/
│   ├── Dockerfile
│   ├── .env
│   ├── solr_init.sh
│   ├── security.json
│   ├── collection_manager.sh
│   └── README.md
├── solr8.11.2_cloud/
│   ├── Dockerfile
│   ├── solrconfig.xml
│   ├── schema.xml
│   ├── core.properties
│   ├── lib/
│   └── data/
├── solr6_standalone/
├── docker-compose_test.yml
├── docker-compose_solr6_standalone.yml
├── README.md
└── indexing_data.sh

Design

  • solr6_standalone: Contains the Dockerfile and configuration files for Solr 6 in standalone mode.
  • solr8.11.2_cloud: Contains the Dockerfile and configuration files for Solr 8.11.2 in cloud mode.
    • Dockerfile: Dockerfile for building the Solr 8.11.2 cloud image (see the example build commands at the end of this section).
      • Create the image with the target external_zookeeper_docker to run Solr in Docker. This variant uses the script init_files/solr_init.sh to copy a custom security.json file and to initialize Solr and the external Zookeeper with Basic authentication.
      • Create the image with the target common to run Solr in Kubernetes. Solr will start automatically without the need to run the script solr_init.sh.

The image copies the files that are relevant to setting up the cluster:

  • conf/: Directory for Solr configuration files.

    • solrconfig.xml: Solr configuration file.
    • schema.xml: Solr schema file.
  • lib/: Directory for JAR files.

  • solr_manager: Contains the Dockerfile and scripts for managing Solr collections and configurations using Python.

    • This application will have access to any Solr server running in Docker or Kubernetes.
    • Inside the solr_manager you will see the Dockerfile for building the image to run the Python application and its documentation.
    • To create collections and upload configsets, Solr requires admin credentials, so you will need to provide the Solr admin password as an environment variable. Find the admin credentials here.
  • indexing_data.sh: Use this script to index data into Solr once the Solr cluster is running.
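
For reference, the two image targets mentioned above can be built locally like this (a minimal sketch; the image tags are illustrative):

    # Image for Docker: runs init_files/solr_init.sh to set up Basic authentication
    docker build . --file solr8.11.2_cloud/Dockerfile --target external_zookeeper_docker --tag lss-solr8:docker

    # Image for Kubernetes: Solr starts on its own, without solr_init.sh
    docker build . --file solr8.11.2_cloud/Dockerfile --target common --tag lss-solr8:k8s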

[Legacy] Overview Solr 6

A Solr configuration for LSS consists of five symlinks in the same directory that point to the correct files for that core:

Three of these symlinks will point to the same file regardless of what core you're configuring:

The other two symlinks are specific to whether it's an x or y core (similarity.xml) and to the core number (mergePolicy.xml). These are referenced in schema.xml and solrconfig.xml via a standard XML entity declaration, <!ENTITY <name> SYSTEM "./<relative_file_path>">. This allows us to have a single base file that can be modified just by putting a symlink to the right file in the conf directory (see the sketch at the end of this section).

  • similarity.xml contains directives to use either the tfidf or the BM25 similarity score; the variants are stored in the corresponding directories. We've been linking the tfidf file into core-#x and the BM25 file into core-#y for each of the cores.

  • mergePolicy.xml configures merge variables and ramBuffer size for each core (as specified in Confluence), with a goal of making them different enough that it's less likely that many cores will be involved in big merges at the same time.

    Note that production Solr servers should symlink in serve/mergePolicy.xml, while the indexing servers should use the core-specific version in the indexing_core_specific directory.
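
As an illustration of the symlink-plus-entity setup described above, configuring one serving x core could look roughly like this (the directory names are assumptions based on the description above, not the exact repository layout):

    # Link the core-specific files into the core's conf directory:
    cd core-1x/conf
    ln -s ../../tfidf/similarity.xml similarity.xml      # x cores use tfidf, y cores use BM25
    ln -s ../../serve/mergePolicy.xml mergePolicy.xml    # serving cores use serve/mergePolicy.xml
    # schema.xml and solrconfig.xml then resolve these files through their <!ENTITY ...> declarations.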

Overview Solr 8

  • Solr 8.11.2 is the latest version of the 8.x series

Upgrading our index from Solr 6 to Solr 8.11.2

To set up Solr 8, the same logic and resources used with Solr 6 were reused; only minimal changes were made to the JAR files, the Solr schema, and the solrconfig.xml files.

The steps followed to upgrade the server from Solr 6 to Solr 8 are described below.

  1. Create a Dockerfile to generate our own image.
  • (Dockerfile) The image was built using the official Solr image solr:8.11.2 and adding the necessary files to set up the Solr server. We have to ensure the lib (JAR files) directory is copied into the image.
    • The lib directory contains the JAR files.
    • The conf folder, which contains the configuration files used by Solr to index the documents, such as:
      • schema.xml: The schema file that defines the fields and types of the documents
      • solrconfig.xml: The configuration file that defines the handlers and configurations of the Solr server
      • security.json: The security file that defines the authentication to access the Solr server
  • The image was built with the target external_zookeeper_docker to run Solr in Docker; this variant uses the script init_files/solr_init.sh to copy a custom security.json file and to initialize Solr and the external Zookeeper with Basic authentication. The image was also built with the target common to run Solr in Kubernetes; in that case Solr starts automatically without the need to run the script solr_init.sh.
  2. Copy some of the Java JARs that were already generated in Catalog

    • icu4j-62.1.jar
    • lucene-analyzers-icu-8.2.0.jar
    • lucene-umich-solr-filters-3.0-solr-8.8.2.jar
  3. Upgrade the JAR HTPostingsFormatWrapper (check here to see all the steps to re-generate this JAR)

  4. Update schema.xml

    • The root field is type=int in Solr 6 and type=string in Solr 8. In Solr 8 the root field must be defined using the exact same fieldType as the uniqueKey field (id) uses: string.
  5. Update solr8.11.2_cloud/conf/solrconfig.xml

    • This file has been updated along with this project. The date of each update was added in the file to track the changes.
  6. Create a docker-compose file to start up the Solr server and to index data.

  • On Docker, the SolrCloud cluster runs a single replica and a single shard. If you want to add more Solr and Zookeeper nodes, copy the solr and zookeeper services in the docker-compose file. Remember to update the port of each service.

Although different architectures for setting up SolrCloud have been tested, the best option is to use an external Zookeeper server because:

  • It is the recommended architecture for production environments;
  • It is more stable and secure (In our solution, authentication is applied);
  • It is easier to manage the Solr cluster and Zookeeper separately;
  • It is easier to scale the Solr cluster.

Functionality

In the docker-compose file, the address (a string) where ZooKeeper is running is defined so that Solr can connect to the ZooKeeper server. Additional environment variables have been added to ensure the Solr server starts up. On this page, you can find a detailed explanation of the environment variables used to set up Solr and Zookeeper in Docker.
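
For example, the relevant variables in this repository's docker-compose file look like this (values taken from the service definition shown later in this README):

    ZK_HOST=zoo1:2181              # address of the ZooKeeper server Solr connects to
    SOLR_OPTS=-XX:-UseLargePages   # additional JVM option added so the Solr container starts up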

When the Solr cluster starts up, it is empty. To upload configsets and create new collections, the Solr API is used. In this repository, the Python package solr_manager builds on the Solr Collections API to manage Solr collections and configsets.

On Docker, to start up the Solr server in cloud mode, we mount the script init_files/solr_init.sh in the container to set up authentication with a predefined security.json file. The script also copies the security.json file to ZooKeeper using the solr zk cp command. In the docker-compose file, each Solr container should run a command that starts Solr in foreground mode.
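
Conceptually, the important part of solr_init.sh boils down to something like the lines below (a sketch only; the real script in init_files/ may differ in paths and options, and the file path here is an assumption):

    # Push the predefined security.json into ZooKeeper so Basic authentication is active,
    # then start Solr in cloud mode in the foreground.
    solr zk cp file:/opt/solr/security.json zk:/security.json -z zoo1:2181
    solr-foreground -c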

In the containers, we should define health checks to verify that Zookeeper and Solr are working well. These health checks let us define the dependencies between the services in the docker-compose file.

If we do not use the health checks, we will probably have to use the scripts wait-for-solr.sh and wait-for-zookeeper.sh to make sure the authentication is set up correctly.
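
Such a wait script is essentially a polling loop; a rough sketch (not the exact wait-for-solr.sh shipped in this repository) could look like this:

    # Poll the Solr admin endpoint until it responds, then continue with the setup.
    until curl -sf -u solr:SolrRocks "http://localhost:8983/solr/admin/info/system" > /dev/null; do
      echo "Waiting for Solr..."
      sleep 5
    done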

On Kubernetes, no script is necessary to set up authentication because the Solr operator creates the secrets by default.

Usage

How to start up the Solr 6 server in standalone mode

  • Launch Solr server

    • docker-compose -f docker-compose_solr6_standalone.yml up
  • Stop Solr server

    • docker-compose -f docker-compose_solr6_standalone.yml down
  • Go inside the Solr container

    • docker exec -it solr-lss-dev-8 /bin/bash

If you are using an Apple Silicon (M1) chip, you will get this error: "no matching manifest for linux/arm64/v8 in the manifest list entries". To fix it, add the platform in the docker-compose.yml file as shown below:

platform: linux/amd64

How to integrate it in babel-local-dev

Update the docker-compose.yml file inside the babel directory, replacing the solr-lss-dev service. Create a new one with the following specifications:

    image: solr:6.6.6-alpine
    ports:
      - "8983:8983"
    user: ${CURRENT_USER}
    volumes:
      - ${BABEL_HOME}/lss_solr_configs/solr6_standalone/lss-dev/core-x:/opt/solr/server/solr/core-x
      - ${BABEL_HOME}/lss_solr_configs/solr6_standalone/lss-dev/core-y:/opt/solr/server/solr/core-y
      - ${BABEL_HOME}/lss_solr_configs/solr6_standalone:/opt/lss_solr_configs
      - ${BABEL_HOME}/lss_solr_configs/solr6_standalone/lib:/opt/solr/server/solr/lib
      - ${BABEL_HOME}/logs/solr:/opt/solr/server/logs

How to start up the Solr 8 server in cloud mode with external Zookeeper

To start up the Solr server in cloud mode with external Zookeeper, run:

    docker-compose -f docker-compose.yml up

The following services defined in the docker-compose file will run:
  • solr1
  • zoo1

In the folder .github/workflows, there is a workflow to create the image for Solr and external Zookeeper. This workflow creates the image for the different platforms (linux/amd64, linux/arm64, linux/arm/v7) and pushes the image to the GitHub container registry. You should use this image to start up the Solr server in Kubernetes.

In Kubernetes, you should use a multi-platform image to run the Solr server. The recommendation is to use the GitHub Actions workflow to create the image for the different platforms.
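
For reference, building the same target for several platforms with docker buildx looks roughly like this (a sketch only; the Dockerfile path and tag mirror the examples in this README and may need adjusting, and buildx must be set up with a multi-platform builder):

    docker buildx build \
      --platform linux/amd64,linux/arm64,linux/arm/v7 \
      --file solr8.11.2_cloud/Dockerfile \
      --target external_zookeeper_docker \
      --tag ghcr.io/hathitrust/full-text-search-cloud:shards-docker \
      --push .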

If you are making changes to the Dockerfile or to the solr_init.sh script, it is better to build the image each time you run the docker-compose file instead of using the image from the registry.

Update the solr service adding the following lines:

    build:
      context: .
      dockerfile: solr8.11.2_cloud/Dockerfile
      target: external_zookeeper_docker
  • [For testing] Manually create the Solr image with external Zookeeper
cd lss_solr_configs
export IMAGE_REPO=ghcr.io/hathitrust/full-text-search-cloud
docker build . --file solr8.11.2_files/Dockerfile --target external_zookeeper_docker --tag $IMAGE_REPO:shards-docker
docker image tag shards-docker:latest ghcr.io/hathitrust/full-text-search-cloud:shards-docker
docker image push ghcr.io/hathitrust/full-text-search-cloud:shards-docker

How to run the application to manage collections and configset

The service to manage collections is defined in the docker-compose.yml. As it is dependent on Solr, for convenience, it is in the same docker-compose file. However, once the solr_manager container is up, you can use it to manage any collection in any Solr server running in Docker or Kubernetes, because it is a Python module that receives the Solr URL as a parameter. You will have to pass the admin password to create collections and upload configsets.

If you start the Solr server in Docker, the admin password is defined in the security.json file and it is the default password used by Solr (solrRocks).

If you start the Solr server in Kubernetes, the admin password is defined in the secrets.
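
For example, the bootstrapped password can be read back with kubectl. The secret name below follows the Solr operator's naming convention and is a hypothetical example; adjust it to your SolrCloud resource name:

    # Hypothetical example: read the admin password from the operator-created bootstrap secret
    kubectl get secret <solrcloud-name>-solrcloud-security-bootstrap \
      -o jsonpath='{.data.admin}' | base64 --decode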

export SOLR_PASSWORD=solrRocks
docker compose -f docker-compose.yml --profile solr_collection_manager up

Using the --profile option with the docker-compose file, you can start up the following services:

  • solr1
  • zoo1
  • solr_manager

Read solr_manager/README.md to see how to use this module.

How to run the application to create the Solr cluster with one collection

How to integrate it in babel-local-dev

Update the docker-compose.yml file inside the babel directory, replacing the solr-lss-dev service. Create a new one with the following specifications:

    image: ghcr.io/hathitrust/solr-lss-dev:shards-docker
    container_name: solr1
    ports:
     - "8981:8983"
    environment:
      - ZK_HOST=zoo1:2181
      - SOLR_OPTS=-XX:-UseLargePages
    networks:
      - solr
    depends_on:
      zoo1:
        condition: service_healthy
    volumes:
      - solr1_data:/var/solr/data
    command: solr-foreground -c # Solr command to start the container to make sure the security.json is created
    healthcheck:
      test: [ "CMD", "/usr/bin/curl", "-s", "-f", "http://solr-lss-dev:8983/solr/#/admin/ping" ]
      interval: 30s
      timeout: 10s
      retries: 5
  zoo1:
    image: zookeeper:3.8.0
    container_name: zoo1
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
      - 7001:7000
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181
      ZOO_4LW_COMMANDS_WHITELIST: mntr, conf, ruok
      ZOO_CFG_EXTRA: "metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider metricsProvider.httpPort=7000 metricsProvider.exportJvmInfo=true"
    networks:
      - solr
    volumes:
      - zoo1_data:/data
    healthcheck:
      test: [ "CMD", "echo", "ruok", "|", "nc", "localhost", "2181", "|", "grep", "imok" ]
      interval: 30s
      timeout: 10s
      retries: 5
  solr_manager:
    build:
      context: solr_manager
      target: runtime
      dockerfile: Dockerfile
      args:
        UID: ${UID:-1000}
        GID: ${GID:-1000}
        ENV: ${ENV:-dev}
        POETRY_VERSION: ${POETRY_VERSION:-1.5.1}
        SOLR_PASSWORD: ${SOLR_PASSWORD:-solrRocks}
        SOLR_USER: ${SOLR_USER:-solr}
        ZK_HOST: ${ZK_HOST:-zoo1:2181}
    env_file:
      - solr_manager/.env
    volumes:
      - .:/app
    stdin_open: true
    depends_on:
      solr-lss-dev:
        condition: service_healthy
    tty: true
    container_name: solr_manager
    networks:
      - solr
    profiles: [ solr_collection_manager ]

You might add the following list of volumes to the docker-compose file (note that the services above also reference solr1_data and zoo1_data, which must be declared as volumes too):

volumes:
  zookeeper1_data:
  zookeeper1_datalog:
  zookeeper1_log:
  zookeeper1_wd:

  • To start up the application,

    • docker-compose build
    • docker-compose up
  • To create the collection in the full-text search server, use the command below

    • docker exec solr-lss-dev /var/solr/data/collection_manager.sh
  • To index data in the full-text search server, use the command below

    • ./indexing_data.sh http://localhost:8983 solr_pass ~/mydata data_sample.zip core-x

Hosting

The Solr server is hosted in Kubernetes.

Find here a detailed explanation of how the Solr server was set up in Kubernetes.

Fulltext search Solr cluster Argocd application: https://argocd.ictc.kubernetes.hathitrust.org/applications/argocd/fulltext-workshop-solrcloud?resource=

Resources

How to index data using a sample of documents

The sample of data is in macc-ht-ingest-000.umdl.umich.edu:/htprep/fulltext_indexing_sample/data_sample.zip

  • Download a zip file with a sample of documents to your local environment: scp macc-ht-ingest-000.umdl.umich.edu:/htprep/fulltext_indexing_sample/data_sample.zip ~/datasets

  • In your working directory, after starting up the Solr server inside Docker, run the script indexing_data.sh. You will need the admin password to do that.

    ./indexing_data.sh http://localhost:8983 solr_pass ~/mydata data_sample.zip collection_name

    The script will extract all the XML files inside the zip file to a destination folder. Then, it will index the documents in the Solr server. The script input parameters are:

    • solr_url
    • the Solr password
    • the path to the target folder where the files are extracted
    • the path to the zip file with the sample of documents
    • the collection name

At the end of this process, your Solr server should have a sample of 150 documents.
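
You can double-check the count with a quick query (adjust the collection name and the credentials to your setup):

    # numFound in the response should be 150 once the sample is indexed
    curl -u solr:SolrRocks "http://localhost:8983/solr/core-x/select?q=*:*&rows=0"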

Note: If in the future we want to automate this process, a service to index documents could be included in the docker-compose file. You would have to add the data sample to the Docker image or download it from a repository. See the example below as a reference:

    build:
      context: ./lss_solr_configs
      dockerfile: ./solr8.11.2_files/Dockerfile
      target: external_zookeeper
    entrypoint: [ "/bin/sh", "-c" ,"indexing_data.sh http://solr-lss-dev:8983" ]
    volumes:
      - solr1_data:/var/solr/data
    depends_on:
      collection_creator:
        condition: service_completed_successfully
    networks:
      - solr

Useful commands

  • Command to create core-x collection. Recommendation: Pass the instanceDir and the dataDir to the curl command

    • curl -u solr:SolrRocks "http://localhost:8983/solr/admin/collections?action=CREATE&name=core-x&instanceDir=/var/solr/data/core-x&numShards=1&collection.configName=core-x&dataDir=/var/solr/data/core-x"
  • Command to index documents into core-x collection, remove authentication if you do not need it

    • curl -u solr:SolrRocks 'http://localhost:8983/solr/core-x/update?commit=true' --data-binary @core-data.json -H 'Content-type:application/json'
  • Delete a configset

    • curl -u solr:SolrRocks -X DELETE "http://localhost:8983/api/cluster/configs/core-x"
  • Create a configset through a .zip file (You should create the zip file using this command: e.g. zip -r core-x.zip core-x)

    • curl -u solr:SolrRocks -X PUT --header "Content-Type:application/octet-stream" --data-binary @core-x.zip "http://localhost:8983/api/cluster/configs/core-x"
  • Delete documents, you should add commit=true

    • curl -X POST -H 'Content-Type: application/json' 'http://<host>:<port>/solr/<core>/update?commit=true' -d '{ "delete": {"query":"*:*"} }'
  • Export JSON file with index documents

    • curl "http://localhost:8983/solr/core-x/select?q=*%3A*&wt=json&indent=true&start=0&rows=2000000000&fl=*" > full-output-of-my-solr-index.json
  • The URL below can be used in a browser to delete documents from the Solr index:

    • http://host:port/solr/collection_name/update?commit=true&stream.body=<delete><query>*:*</query></delete>
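
Two more read-only commands that can be handy when debugging (these use standard Solr 8 endpoints; remove the authentication if you do not need it):

    # List the configsets currently stored in ZooKeeper
    curl -u solr:SolrRocks "http://localhost:8983/api/cluster/configs"

    # Show the cluster state (collections, shards, replicas)
    curl -u solr:SolrRocks "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS"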

Deployment and Use

Go to section How to integrate it in babel-local-dev to see how to integrate each Solr server into another application.

Re-Indexing

  • To solve the issue below, the solrconfig.xml file was updated to enable the updateLog option.

ERROR org.apache.solr.cloud.SyncStrategy – No UpdateLog found - cannot sync

  • In the Solr cloud logs with embedded ZooKeeper, you could see the issue below. It is more of a warning than an error, and it appears because we are running only one ZooKeeper in standalone mode. More details on this message here.

Invalid configuration, only one server specified (ignoring)

Testing

Deployment to Production

Production Indexing

Production Serving

Considerations for future modification

Move to AWS

Testing

Links to more background