Solaris: Too many files open

Recently, I had to diagnose an issue where the only symptom was a failure with ‘mailx’ command from time to time, which caused the following output to be mailed instead of the expected report from a cron job:

Your “cron” job on <Hostname> /path/to/script | mailx my@email.domain
produced the following output:
/tmp/Rs10293: File exists

At this point, it was unclear to me where the “/tmp/Rs10293” file was coming, but apparently, /tmp was containing a lot of these RsXXXXX files.

It became quickly obvious that the command which produced those RsXXXX files was ‘mailx’:

# strings /usr/bin/mailx|grep /tmp/Rs
/tmp/Rs%-ld

Oddly enough, I knew now that the file was created by mailx and that after a while, it was trying to create the same file and quit failing. But how comes this file was left in /tmp ?

I’ve decided to put a wrapper to the mailx command which would dump the ‘truss’ output to /var/tmp and then see what happens:

# mv /usr/bin/mailx /usr/bin/mailx.orig
# cat > /usr/bin/mailx
#!/bin/bash
truss -eafld -vall -wall -rall -xall -u a.out -o /var/tmp/mail.$(date +%Y%m%d_%H%M%S).truss /usr/bin/mailx.orig $@
^D

After a while, I’ve noticed that I got some very short truss output file and gotcha! They were exitting quickly after having created the corresponding ‘/tmp/RsXXXX’ file:

14581/1: 0.3988 openat(0xFFD19553, 0x080865C2, 02402, 0600) = 263
14581/1: 0xFFD19553: AT_FDCWD
14581/1: 0x080865C2: “/tmp/Rs14581”
14581/1: 0.3990 fcntl(263, 1, 0x00000000) = 0
14581/1: 0.3991 openat(0xFFD19553, 0xF0FC7F60, 0, 024) Err#2 ENOENT
14581/1: 0xFFD19553: AT_FDCWD
14581/1: 0xF0FC7F60: “/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSLIB.mo”
14581/1: 0.3992 write(2, 0x080865C2, 12) = 12
14581/1: 0x080865C2: ” / t m p / R s 1 4 5 8 1″
14581/1: 0.3993 write(2, 0xEF558B90, 2) = 2
14581/1: 0xEF558B90: ” : ”
14581/1: 0.3995 write(2, 0xEF50393A, 19) = 19
14581/1: T o o m a n y o p e n f i l e s
14581/1: 0.3995 write(2, 0xEF558B8C, 1) = 1
14581/1: 0xEF558B8C: “\n”

I also discovered that the utility which was launching theses faulty ‘mailx’ commands was ‘smartmon-ux’ disk monitoring software.

I confirmed that by reloading smartmon-ux and waiting on the console:

/dev/rdsk/c0t5000C50055C78D1Bd0s0 polled at Mon Feb 11 06:18:07 2013 Status:Passed (Temperature = 34C 93F) (Speed: Port0=6.0 G; Port1=6.0 G)
/dev/rdsk/c0t5000C50055C78D43d0s0 polled at Mon Feb 11 06:18:08 2013 Status:Passed (Temperature = 28C 82F) (Speed: Port0=6.0 G; Port1=6.0 G)
/dev/rdsk/c0t5000C50055C791F7d0s0 polled at Mon Feb 11 06:18:08 2013 Status:Passed (Temperature = 28C 82F) (Speed: Port0=6.0 G; Port1=6.0 G)
/dev/rdsk/c0t5000C50055C792EFd0s0 polled at Mon Feb 11 06:18:08 2013 Status:Passed (Temperature = 31C 87F) (Speed: Port0=6.0 G; Port1=6.0 G)
/dev/rdsk/c0t5000C50055C7982Bd0s0 polled at Mon Feb 11 06:18:09 2013 Status:Passed (Temperature = 36C 96F) (Speed: Port0=6.0 G; Port1=6.0 G)
/dev/rdsk/c0t5000C50055C7999Fd0s0 polled at Mon Feb 11 06:18:09 2013 Status:Passed (Temperature = 32C 89F) (Speed: Port0=6.0 G; Port1=6.0 G)
/dev/rdsk/c0t5000C5005A8B8693d0s0 polled at Mon Feb 11 06:18:10 2013 Status:Passed (Temperature = 27C 80F) (Speed: Port0=6.0 G; Port1=<unattached>)
/tmp/Rs16850: Too many open files
Device on /dev/rdsk/c2t0d0s0, Thermal alert. Temperature now at 255C 491F degrees.
/dev/rdsk/c2t0d0s0 polled at Mon Feb 11 06:18:10 2013 Status:Passed (Temperature = 255C 491F)
/tmp/Rs16853: Too many open files
Device on /dev/rdsk/c3t1d0s0, Thermal alert. Temperature now at 255C 491F degrees.
/dev/rdsk/c3t1d0s0 polled at Mon Feb 11 06:18:10 2013 Status:Passed (Temperature = 255C 491F)
/tmp/Rs16856: Too many open files
Device on /dev/rdsk/c6t6d0s0, Thermal alert. Temperature now at 255C 491F degrees.
/dev/rdsk/c6t6d0s0 polled at Mon Feb 11 06:18:10 2013 Status:Passed (Temperature = 255C 491F)
/tmp/Rs16859: Too many open files
Device on /dev/rdsk/c7t7d0s0, Thermal alert. Temperature now at 255C 491F degrees.
/dev/rdsk/c7t7d0s0 polled at Mon Feb 11 06:18:11 2013 Status:Passed (Temperature = 255C 491F)

Right, now we know the “Who”, but why?

The syscall returned ‘Too many open files’ error which was impossible as per our /etc/system:

set rlim_fd_max=900000
set rlim_fd_cur=32768

And checking the current per-process limit:

# plimit $$|grep nofiles
nofiles(descriptors) 32768 900000

So we can open 32k files! How comes smartmon-ux/mailx is complaining?

# pgrep smartmon-ux
16692
# ls -l /proc/16692/fd/*|wc -l
261

So we’re clearly not reaching the 32k file descriptor limit! I’ve then found this blog post from 2006.

Let’s confirm that by compiling this little .c file:

#include <stdio.h>
#define MAXF 65535

int main()
{
        char filename[20];
        FILE *fds[MAXF];
        int i;

        for (i = 0; i < MAXF; ++i) 
        {
                sprintf (filename, "./out/%d.log", i);
                fds[i] = fopen(filename, "w");

                if (fds[i] == NULL)
                {
                        printf("\n** Number of open files = %d. fopen() failed with error:  ", i);
                        perror("");
                        exit(1);
                }
                else
                {
                        fprintf (fds[i], "some string");
                }
        }
        return (0);
}

# gcc -o files files.c
# mkdir out
# ./files
** Number of open files = 253. fopen() failed with error: Too many open files
# ls out|wc -l
254

ARGHL… ?

Then, the funny part is comming:

# gcc -o files files.c -m64
# ./files
** Number of open files = 32765. fopen() failed with error: Too many open files

Okay, this is a 32 vs 64 bits problem. Googling again is taking us to the /usr/lib/extendedFILE.so.1 library, which overcomes this bug:

# gcc -o files files.c
# file files
files: ELF 32-bit LSB executable 80386 Version 1 [FPU], dynamically linked, not stripped, no debugging information available
# LD_PRELOAD=/usr/lib/extendedFILE.so.1 ./files
** Number of open files = 32353. fopen() failed with error: Not enough space

And so our problem is fixed! Now let’s change a little bit the smartmon-ux start script to work around our problem:
Edit /etc/init.d/smartmon-ux.d and change line

/etc/smartmon-ux -E -G 45 -link -F 1200 -P -sq -M smartmon-alert@domain.tld

by:

LD_PRELOAD=/usr/lib/secure/extendedFILE.so.1 /etc/smartmon-ux -E -G 45 -link -F 1200 -P -sq -M smartmon-alert@domain.tld

also, symlink libextendedFILE.so.1 in the /usr/lib/secure directory to avoid the system complaining (mailx is setguid!!)

# cd /usr/lib/secure ; ln -s /usr/lib/extendedFILE.so.1

restart smartmon-ux and check it’s been using the library:

# /etc/init.d/smartmon-ux.d restart
# pgrep smartmon-ux
18319
# ls -al /proc/18319/path/|grep extendedFILE
lrwxrwxrwx 1 root root 0 2013-02-11 06:43 zfs.124.65538.134236 -> /usr/lib/extendedFILE.so.1

Then, I’ve begin thinking… what would be the proper and definitive fix for that problem? Would smartmon-ux compiled in 64bit be the fix?

Let’s try with our little .c file slightly modified:

#include <stdio.h>
#define MAXF 252

int main()
{
        char filename[20];
        FILE *fds[MAXF];
        int i;

        for (i = 0; i < MAXF; ++i)
        {
                sprintf (filename, "./out/%d.log", i);
                fds[i] = fopen(filename, "w");

                if (fds[i] == NULL)
                {
                        printf("\n** Number of open files = %d. fopen() failed with error:  ", i);
                        perror("");
                        exit(1);
                }
                else
                {
                        fprintf (fds[i], "some string");
                }
        }
        /* we have MAXF opened FD, now try to send a mail... */
        execl("/usr/bin/mailx", "-s", "test files2.c", "-f", "/root/test/mymail", "my@mail.tld", NULL);
        return (0);
}

# gcc -o files2 files2.c # ./files2
/tmp/Rs20586: Too many open files
# gcc -o files2_64 files2.c -m64
# ./files2_64
/tmp/Rs20637: Too many open files

Conclusions:

  • Smartmon-UX is leaking FDs to mailx process
  • Having Smartmon-UX compiled in 64 bits wouldn’t help
  • mailx should be fixed to cleanup /tmp/RsXXXXX files before exiting in error
  • extendedFILE.so.1 preload could be used as workaround
  • Final fix should be fixing smartmon-ux file descriptor leak issue

NOTE: I filed a support case with this, I will let you know the outcome…

This entry was posted in Solaris and tagged , , , , . Bookmark the permalink.

3 Responses to Solaris: Too many files open

Leave a Reply

Your email address will not be published. Required fields are marked *