Stark & Wayne

Too Many Open Files: A Fence or an Ambulance

Today I was researching an error message for a client that read:

Too many open files

Usually this leads to answers like "change the file limit".  But this is using an ambulance when we need a fence.

What do I mean by that?

When I was in elementary school a man came to give a "Say no to drugs!" type of talk to the entire school.

The metaphor he used made an impression on me, and I still remember it now.  If there were a cliff that people kept falling off, would you build a fence to prevent the accidents, or would you park an ambulance where people land to get them help as quickly as possible?

Too often, with competing priorities driven by business value, we compromise on how we solve a problem.  As developers we frequently decide it's quicker to deal with a problem after it happens (use an ambulance) than to prevent it from happening (build a fence).

The Ambulance

Often we think, "I'll just increase the file limit," which would look like this:

$ ulimit -n 2048

That raises the limit for the current shell session and for anything started from it, but not for processes that are already running or for other sessions.  So you might edit the /etc/security/limits.conf file so that new login sessions get more file descriptors.  That's great, right?

Yet we should be asking ourselves, "Why does Linux limit open files in the first place?"
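Before raising anything, it helps to see what the limits actually are. The sketch below only uses the ulimit flags shown in the help output further down (-S, -H and -n); the variable names are mine, and the subshell is just a convenient way to show that a changed limit belongs to one process and its children.

```shell
# Soft limit: the number of open files this shell is currently allowed.
soft=$(ulimit -Sn)
# Hard limit: the ceiling an unprivileged user may raise the soft limit to.
hard=$(ulimit -Hn)
echo "soft=$soft hard=$hard"

# Raise the soft limit to the hard limit inside a subshell only.
( ulimit -Sn "$hard" && echo "subshell soft limit: $(ulimit -Sn)" )

# The parent shell is unaffected by what the subshell did.
echo "parent soft limit: $(ulimit -Sn)"
```

For a change that survives a new login, an entry in /etc/security/limits.conf takes the form "<domain> <type> <item> <value>", for example "someuser soft nofile 4096", where someuser and 4096 are placeholders rather than recommendations.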

The reason is that the operating system needs memory to manage each open file, and memory is a limited resource - especially on embedded systems.
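Besides the per-process limit, the Linux kernel keeps a system-wide count of open file handles. As a rough illustration, and assuming a Linux system with /proc mounted, you can read both the current usage and the ceiling:

```shell
# /proc/sys/fs/file-nr holds three numbers on Linux:
# allocated handles, allocated-but-unused handles, and the system-wide maximum.
if [ -r /proc/sys/fs/file-nr ]; then
    read -r allocated unused max < /proc/sys/fs/file-nr
    echo "allocated file handles: $allocated"
    echo "system-wide maximum:    $max"
else
    echo "/proc/sys/fs/file-nr is not available on this system"
fi
```

If the allocated number is nowhere near the maximum, the machine as a whole is fine and a single process is the more likely suspect.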

The Fence

While it's important to address issues in the short term, in the long term we'll need a fence.  So you need to find out which process(es) are using too many files.

$ lsof

Yet that can be like trying to find a needle in your Linux stack, because it outputs far too much.  I suggest narrowing the search with a couple of useful flags.

For instance, if you thought it may be the syslog process opening a bunch of files, you could specify the user.

$ lsof -usyslog

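Then you could also check whether that user is holding a lot of Unix domain sockets or network connections.  The -a flag ANDs the filters together; without it lsof ORs them.

# Unix domain sockets
$ lsof -a -U -usyslog

# Internet sockets (TCP and UDP)
$ lsof -a -i -usyslog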
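If you have no suspect yet, a ranking of processes by open descriptor count is a quick way to pick one. This is a minimal sketch of my own rather than an lsof feature: it assumes Linux, where /proc/<pid>/fd lists a process's descriptors, and count_fds is a made-up helper name. Run as a normal user it only sees your own processes; run as root it sees all of them.

```shell
# Print "<count> <pid> <name>" for every process whose fd directory we can read.
count_fds() {
    for dir in /proc/[0-9]*/fd; do
        # Skip processes we may not inspect, or that exited in the meantime.
        [ -r "$dir" ] || continue
        pid=${dir#/proc/}
        pid=${pid%/fd}
        n=$(ls "$dir" 2>/dev/null | wc -l)
        name=$(cat "/proc/$pid/comm" 2>/dev/null)
        echo "$n $pid $name"
    done
}

# Show the ten processes holding the most descriptors, largest first.
count_fds | sort -rn | head -n 10
```

Once a PID stands out, lsof -p <pid> lists exactly what it has open.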

Let's say you've found the culprit, and it's Vault instead of syslog.  What do you do next?  Open source to the rescue: you'd find or raise an issue on the project's repo.

Together you can then figure out how to solve the problem.

In the case of Vault, the cause was a third-party dependency: aws-sdk-go, AWS's SDK for Go, which Vault uses to interface with its backend.

Who Builds the Fence?

The Vault issue we've discussed here is a great example of how each of us in the Open Source Software community can help the others succeed.

That is what we're here to do at Stark & Wayne, and it's how we operate when it comes to solving problems.

Additional Help

While researching this article, I also learned that if a command has no man page, there's another way to get help.

$ man ulimit
No manual entry for ulimit

You can also run help <command>.

$ help ulimit
ulimit: ulimit [-SHabcdefilmnpqrstuvxT] [limit]
    Modify shell resource limits.

    Provides control over the resources available to the shell and processes
    it creates, on systems that allow such control.

    Options:
      -S	use the `soft' resource limit
      -H	use the `hard' resource limit
...

Why do you have to use help instead of man?  Because ulimit is a Bash builtin command.  The Bash Reference Manual describes builtins this way:

"Builtin commands are contained within the shell itself. When the name of a builtin command is used as the first word of a simple command (see Simple Commands), the shell executes the command directly, without invoking another program. Builtin commands are necessary to implement functionality impossible or inconvenient to obtain with separate utilities."
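You don't have to guess whether a command is a builtin. The type command, itself a builtin, reports how the shell would resolve a name, so it tells you up front whether to reach for help or man:

```shell
# Ask the shell how it resolves the name "ulimit".
type ulimit
# → ulimit is a shell builtin
```

For a name that resolves to a program on disk, type prints the path instead, and man is the place to look.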