Elasticsearch fails to startup with valid fsize and virtual memory limits #113705

Closed
nathan-maves opened this issue Sep 27, 2024 · 26 comments · Fixed by #113723
Labels
>bug · :Core/Infra/Core (Core issues without another label) · Team:Core/Infra (Meta label for core/infra team)

Comments

@nathan-maves

nathan-maves commented Sep 27, 2024

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

bundled

OS Version

Linux

Problem Description

We have found a few issues with the bootstrap checks on Linux/Unix machines. The first is that a value of -1 should be accepted along with unlimited and infinity, based on this documentation:

All items support the values -1, unlimited or infinity indicating no limit, except for priority, nice, and nonewprivs. If nofile is to be set to one of these values, it will be set to the contents of /proc/sys/fs/nr_open instead (see setrlimit(3)).

The second issue is that this code appears to be incorrect.

It should be checking the max file size and NOT the max memory size.

long getMaxFileSize() {
    return NativeAccess.instance().getProcessLimits().maxFileSize();
}
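
To illustrate why such a mix-up makes a valid fsize setting fail, here is a minimal sketch. `ProcessLimitsSketch` and both method names are hypothetical stand-ins, not the real Elasticsearch classes; the sketch only assumes that the two limits are stored side by side and that the getter reads the wrong one.

```java
// Hypothetical stand-in for the limits holder; not the real Elasticsearch class.
final class ProcessLimitsSketch {
    final long maxVirtualMemorySize;
    final long maxFileSize;

    ProcessLimitsSketch(long maxVirtualMemorySize, long maxFileSize) {
        this.maxVirtualMemorySize = maxVirtualMemorySize;
        this.maxFileSize = maxFileSize;
    }

    // Buggy variant: answers the file size question with the virtual memory limit.
    long getMaxFileSizeBuggy() {
        return maxVirtualMemorySize;
    }

    // Intended variant: returns the file size limit itself.
    long getMaxFileSizeFixed() {
        return maxFileSize;
    }
}
```

With the buggy variant, the file size check would pass or fail depending on the virtual memory limit, regardless of what fsize is set to.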

Steps to Reproduce

Set the fsize value to -1 in the /etc/security/limits.conf file, then start Elasticsearch 8.15.x.

Logs (if relevant)

[2024-09-25T17:00:39,398][ERROR][o.e.b.Elasticsearch ] [node-f41036c7-370b-4665-a0d0-679e2bedef84] node validation exception
[2] bootstrap checks failed. You must address the points described in the following [2] lines before starting Elasticsearch. For more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.15/bootstrap-checks.html]
bootstrap check failure [1] of [2]: max size virtual memory [-1] for user [###] is too low, increase to [unlimited]; for more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.15/max-size-virtual-memory-check.html]
bootstrap check failure [2] of [2]: max file size [-1] for user [###] is too low, increase to [unlimited]; for more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.15/_max_file_size_check.html]
@nathan-maves nathan-maves added >bug needs:triage Requires assignment of a team area label labels Sep 27, 2024
@andreidan andreidan added :Core/Infra/Core Core issues without another label and removed needs:triage Requires assignment of a team area label labels Sep 27, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Sep 27, 2024
@prdoyle prdoyle self-assigned this Sep 27, 2024
@prdoyle
Contributor

prdoyle commented Sep 27, 2024

For the second part... maxVirtualMemorySize certainly looks like a bug. @rjernst - do you agree it should be calling maxFileSize? The code came from here.

@prdoyle
Contributor

prdoyle commented Sep 27, 2024

For the first part, I think we want to change this Long.MIN_VALUE to just -1 like it is elsewhere.

@prdoyle
Contributor

prdoyle commented Sep 27, 2024

I've changed it to accept both Long.MIN_VALUE and -1. A comment in the unit test seems to suggest that this value can be MIN_VALUE if the size is "not available".

@nathan-maves
Author

I think you might want to add the -1 to this check as well.

if (getMaxSizeVirtualMemory() != Long.MIN_VALUE && getMaxSizeVirtualMemory() != ProcessLimits.UNLIMITED) {

There could be others too.

@prdoyle
Contributor

prdoyle commented Sep 27, 2024

I believe @rjernst is also looking at this now.

@prdoyle
Contributor

prdoyle commented Sep 27, 2024

My original -1 fix was incorrect. The code is already supposed to be turning that -1 into ProcessLimits.UNLIMITED here (where constants.RLIMIT_INFINITY is defined to be -1L by the first parameter here).
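
A minimal sketch of the translation described above, under stated assumptions: `RlimitTranslationSketch` and `translate` are hypothetical names, and the concrete value used for `UNLIMITED` is a placeholder rather than the value Elasticsearch necessarily uses. Only `RLIMIT_INFINITY = -1L` comes from this thread.

```java
// Hypothetical illustration; not the real Elasticsearch code.
final class RlimitTranslationSketch {
    // Raw "no limit" value reported by the OS on Linux, per the comment above.
    static final long RLIMIT_INFINITY = -1L;
    // Assumed internal sentinel for "no limit"; the real value may differ.
    static final long UNLIMITED = Long.MAX_VALUE;

    // Maps the raw OS value to the internal sentinel; other values pass through.
    static long translate(long rawLimit) {
        return rawLimit == RLIMIT_INFINITY ? UNLIMITED : rawLimit;
    }
}
```

If the translation runs, a raw -1 should never reach the bootstrap check unchanged, which is why accepting -1 in the check itself would not be the right fix.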

@prdoyle
Contributor

prdoyle commented Sep 27, 2024

@nathan-maves - what happens if you try to use -1?

@nathan-maves
Author

nathan-maves commented Sep 27, 2024

As the logs I added to the issue show, we already have the system set to -1. So the code is reading in the value of -1 and telling us that it is too low.

max file size [-1] for user [####] is too low

@rjernst
Member

rjernst commented Sep 27, 2024

Are there any other log messages like unable to retrieve max size virtual memory? We translate the RLIMIT_INFINITY value for each system into our own (which is represented by MAX_INT). I suspect what is happening here is the rlimit call failed, which then stores our own UNKNOWN (-1), but the bootstrap checks aren't currently specializing the error message for that case, so it looks as if -1 was not handled.
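
One way the check could distinguish the two cases is sketched below. This is only an illustration: `MaxFileSizeCheckSketch`, the `check` method, and the wording of the "could not be determined" message are hypothetical, and the value chosen for `UNLIMITED` is a placeholder. Only `UNKNOWN = -1` is taken from the comment above.

```java
// Hypothetical illustration; not the real bootstrap check.
final class MaxFileSizeCheckSketch {
    // Sentinel stored when the rlimit call failed, per the comment above.
    static final long UNKNOWN = -1L;
    // Assumed sentinel for "no limit"; the real value may differ.
    static final long UNLIMITED = Long.MAX_VALUE;

    // Returns null when the check passes, otherwise a failure message.
    static String check(long maxFileSize) {
        if (maxFileSize == UNLIMITED) {
            return null;
        }
        if (maxFileSize == UNKNOWN) {
            // Hypothetical wording for the specialized message.
            return "max file size could not be determined";
        }
        return "max file size [" + maxFileSize + "] is too low, increase to [unlimited]";
    }
}
```

With a specialized message, a failed rlimit call would no longer look like a configured limit of -1 being rejected.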

@rjernst
Member

rjernst commented Sep 27, 2024

The issue description mentions "Unix" as the OS. Do you mean Linux, and if so, what distribution? We do not support any Unix distributions, and I can see how that might be an issue (our rlimit calls are probably not set up right for Unix).

@nathan-maves
Author

That was my bad. I am pretty sure we support RHEL and Rocky Linux.

@prdoyle
Contributor

prdoyle commented Sep 30, 2024

The merged PR only fixes one of the reported problems. The "-1 problem" still exists.

@prdoyle prdoyle reopened this Sep 30, 2024
@nathan-maves
Author

Is there anything your team needs from me?

A member of my team tried both "unlimited" and "-1", and ES 8.15 would not start up on Debian Linux. This might stem from the issue you fixed, as the code is not reading the correct setting value. Is there any chance we can get a build with the fix to test things out?

@prdoyle
Contributor

prdoyle commented Oct 7, 2024

Hey @nathan-maves - I think we have everything we need. I'll try to reproduce today and reach out if I'm unable to do so.

@prdoyle
Contributor

prdoyle commented Oct 7, 2024

Actually @nathan-maves - can you please confirm that you have no files in /etc/security/limits.d?

@prdoyle
Contributor

prdoyle commented Oct 8, 2024

@nathan-maves - if you want to try with the fix... are you in a position to clone this repo and run ./gradlew run?

@nathan-maves
Author

@prdoyle I asked some people on our team and they confirmed that there is not much or sometimes nothing in their /etc/security/limits.d.

@prdoyle
Contributor

prdoyle commented Oct 8, 2024

Note to self: these limits are annoyingly sticky.

First, they only take effect on a new ssh session. The existing ssh session will retain the old limits even after limits.conf is edited.

Second, they apparently also stick to a Gradle daemon started during an ssh session. So even if you log out and ssh back into the machine, if a Gradle daemon is still running, it will remember the old limits too.

The command line I'm experimenting with right now is:

./gradlew --no-daemon run -Dtests.jvm.argline="-Des.enforce.bootstrap.checks=true" -Dtests.es.xpack.security.enabled=false

This appears to be doing what I want.
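
Because limits are inherited from the parent process, it can help to look at what a process actually received rather than at limits.conf. On Linux the effective limits are listed in /proc/self/limits (or /proc/<pid>/limits for another process). The helper below is a rough sketch for pulling the soft limit out of that text; `ProcLimitsSketch` and `softLimit` are hypothetical names and are not part of Elasticsearch.

```java
// Hypothetical helper; parses text in the format of /proc/<pid>/limits.
final class ProcLimitsSketch {
    // Returns the soft limit column for the named row, or null if the row is absent.
    static String softLimit(String procLimitsText, String rowName) {
        for (String line : procLimitsText.split("\n")) {
            if (line.startsWith(rowName)) {
                // The columns after the row name are: soft limit, hard limit, units.
                String[] columns = line.substring(rowName.length()).trim().split("\\s+");
                return columns[0];
            }
        }
        return null;
    }
}
```

On a Linux machine the input could be read with `new String(java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("/proc/self/limits")))` and queried with the row name `"Max file size"`.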

@prdoyle
Contributor

prdoyle commented Oct 8, 2024

Using the above technique, -1 appears to work as expected. I'm unable to provoke ES into showing -1 as the file size setting in the error message.

Here's what I tried:

  1. Boot an Ubuntu 20 VM.
  2. Set fsize to 10001000000 in /etc/security/limits.conf. Open a new SSH session into the same machine. Boot Elasticsearch using the above ./gradlew run command line. ES fails to boot. At the end of build/testclusters/runTask-0/logs/es.out I see an error like the one originally reported (quoted below).
  3. Set fsize to -1. Open a new SSH session into the same machine. Boot ES. ES boots successfully.

The error I saw in the log was this:

bootstrap check failure [1] of [1]: max file size [10241024000000] for user [###] is too low, increase to [unlimited].

(Note that the reported value is 1024 times the setting I put in limits.conf as expected.)
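
The factor of 1024 matches fsize in limits.conf being expressed in KB while the error message reports bytes. A small sketch of the arithmetic (`FsizeUnitsSketch` and `kibibytesToBytes` are hypothetical names used only for illustration):

```java
// Hypothetical illustration of the unit conversion; not Elasticsearch code.
final class FsizeUnitsSketch {
    // Converts a limits.conf fsize value (KB) to bytes, failing loudly on overflow.
    static long kibibytesToBytes(long kib) {
        return Math.multiplyExact(kib, 1024L);
    }
}
```

For the value used above, `FsizeUnitsSketch.kibibytesToBytes(10001000000L)` gives 10241024000000, the number shown in the error message.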

@prdoyle
Contributor

prdoyle commented Oct 8, 2024

@nathan-maves - could you please check your ES logs for a line like this? If you're seeing this in your log, it's pretty hard for me to understand how you could have seen the -1 error message. 🤔

[INFO ][o.e.n.NativeAccess       ] [runTask-0] Using [jdk] native provider and native methods for [Linux]

@prdoyle
Contributor

prdoyle commented Oct 9, 2024

...in particular, it should say either [Linux], [MacOS], or [Windows]. If it says [Linux] that should be an indicator it was called from this location, which also sets -1 to mean "unlimited", in which case I'm not seeing how the code could generate the error message you saw.

@prdoyle
Contributor

prdoyle commented Oct 11, 2024

@nathan-maves - given that I've fixed one bug and cannot reproduce the other, I'm going to close this issue for now. If you do add a comment with some additional guidance for reproducing the "-1" error, we can reopen this issue and resume the investigation.

@nathan-maves
Author

Is a backport to 8.15 planned for this?

@rjernst
Member

rjernst commented Oct 24, 2024

@prdoyle we should backport to 8.15 as well. I had thought the breakage had been committed to 8.16 only, but it looks like it was in 8.15 (#108805).

@prdoyle
Contributor

prdoyle commented Nov 3, 2024

8.15 backport is in #116152.

5 participants