Sep 23, 2019 Too Many Open Files: A Fence or an Ambulance
Today I was researching an issue for a client that produced the error:
Too many open files
Usually this leads to answers like "change the file limit". But this is using an ambulance when we need a fence.
What do I mean by that?
When I was in elementary school a man came to give a "Say no to drugs!" type of talk to the entire school.
The metaphor he used made an impression on me and I still remember it now. If there were a cliff that people kept falling off, would you build a fence to prevent the accidents, or would you park an ambulance where people land to get them help as quickly as possible?
Too often, with competing priorities driven by business value, we compromise on how we solve a problem. As developers we frequently decide it's quicker to deal with a problem after it happens (use an ambulance) than to prevent it from happening (build a fence).
Many times we think, "I'll just increase the file limit." Which would look like this:
$ ulimit -n 2048
That would fix it for the current shell session and the processes it starts, but not for anything started elsewhere. So you might change the /etc/security/limits.conf file so the higher limit applies to new login sessions as well. That's great, right?
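Before raising anything, it can help to see which limits are currently in effect. Here is a minimal sketch, assuming a bash-like shell; the function name show_nofile_limits is made up for this illustration and is not a standard command:

```shell
# Print the soft and hard open-file limits for the current shell.
# show_nofile_limits is a hypothetical helper name used only in this example.
show_nofile_limits() {
  printf 'soft=%s hard=%s\n' "$(ulimit -Sn)" "$(ulimit -Hn)"
}

show_nofile_limits
```

The soft limit is the one a process actually runs into. An unprivileged user can raise it only as far as the hard limit, so if the two numbers are already equal, ulimit -n alone will not buy you more room.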
Yet we should be asking ourselves, "Why does Linux limit open files in the first place?"
The reason is that the operating system needs memory to manage each open file, and memory is a limited resource - especially on embedded systems.
While it's important to address issues in the short term, in the long term we'll need a fence. So you need to find out what process(es) are using too many files.
The lsof command lists open files, yet running it bare can be like trying to find a needle in your Linux stack because it outputs too much. That is why I suggest you narrow the search with a couple of useful flags.
For instance, if you thought it might be the syslog process opening a bunch of files, you could filter by that user.
$ lsof -usyslog
Then you could also check whether that user is holding a lot of Unix domain sockets or TCP/IP connections.
# Sockets
$ lsof -a -U -usyslog
# TCP/IP connections
$ lsof -a -i -usyslog
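If you do not have a suspect yet, a rough ranking of processes by open file descriptors can point you toward one. The sketch below assumes a Linux system with /proc mounted and does not use lsof at all. The names count_fds and top_fd_users are invented for this example, and without root you will only get real counts for your own processes.

```shell
# Count the open file descriptors of one process by listing /proc/<pid>/fd.
# Prints 0 when the directory is missing or unreadable
# (for example, another user's process).
count_fds() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print "<count> <pid> <command name>" for every visible process,
# highest count first, limited to the top N entries (default 5).
top_fd_users() {
  for dir in /proc/[0-9]*; do
    pid="${dir#/proc/}"
    name="$(cat "$dir/comm" 2>/dev/null)"
    printf '%s %s %s\n' "$(count_fds "$pid")" "$pid" "$name"
  done | sort -rn | head -n "${1:-5}"
}

top_fd_users 5
```

Once a process stands out, the lsof flags above let you look at what exactly it has open.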
Let's say you've found the culprit and it's vault instead of syslog. What do you do next? Open source to the rescue: you'd find or raise an issue on the repo for the project.
Together you can then figure out how to solve the problem.
In the case of vault, it was a third-party dependency on the aws-sdk-go to use AWS's software to interface with the
Who Builds the Fence?
This vault issue we've discussed here is a great example of how each of us in the Open Source Software community can help each other succeed.
Which is what we're here to do at Stark & Wayne and how we operate when it comes to solving problems.
In doing research for this article, I also learned that if you don't find a man page, there's another way to get help.
$ man ulimit
No manual entry for ulimit
You can also run help <command>.
$ help ulimit
ulimit: ulimit [-SHabcdefilmnpqrstuvxT] [limit]
    Modify shell resource limits.

    Provides control over the resources available to the shell and processes
    it creates, on systems that allow such control.

    Options:
      -S        use the `soft' resource limit
      -H        use the `hard' resource limit
      ...
Why do you have to use help instead of man? Because the ulimit command is a bash built-in command.
Builtin commands are contained within the shell itself. When the name of a builtin command is used as the first word of a simple command (see Simple Commands), the shell executes the command directly, without invoking another program. Builtin commands are necessary to implement functionality impossible or inconvenient to obtain with separate utilities.
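You can also ask bash which kind of command a name refers to, which tells you whether to reach for help or man. This is a small sketch that assumes bash, since type -t and help are bash features; help_or_man is a hypothetical wrapper name, not an existing tool:

```shell
# Show documentation for a command, choosing `help` for bash builtins and
# shell keywords, and `man` for everything else.
# help_or_man is a made-up helper for this example; it requires bash.
help_or_man() {
  case "$(type -t "$1")" in
    builtin|keyword) help "$1" ;;
    *)               man "$1" ;;
  esac
}

# Report how bash classifies ulimit.
type -t ulimit
```

Running help_or_man ulimit should then land on the same help text shown earlier instead of the "No manual entry" message.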