en - [sympa-users] Possible race condition in bulk.pid handling

Subject: The mailing list for listmasters using Sympa

List archive

[sympa-users] Possible race condition in bulk.pid handling

From: John Kirkland <address@concealed>
To: address@concealed
Subject: [sympa-users] Possible race condition in bulk.pid handling
Date: Tue, 19 Jun 2012 09:35:11 -0500

Hi, all,

I have a sympa server that sends out a few hundred thousand mailing list emails per night. I use 5 bulk senders.

I have had problems, about once a week, for a very long time, where sympa will no longer start up any new bulk senders and only the parent process is delivering mail (all children have shut down). This obviously slows down my mail delivery quite dramatically. The quick fix is to restart sympa, but today I did some troubleshooting.

First, in the bulk.pid file, I had 2 processes listed, the parent and an old child process that was no longer running.

These were PIDs: 24938 21089

The parent PID (24938) won't start up any new child processes because it thinks that the child (21089) is still running. I then did some digging in the log file and saw that the child process seemed to have a normal lifecycle:

Jun 18 00:10:23 maillists bulk[21089]: info Bulk slave daemon started with pid 21089
Jun 18 00:10:23 maillists bulk[24938]: info Starting bulk child daemon, pid 21089
Jun 18 00:44:00 maillists bulk[21089]: info Process 21089 didn't send any message since 60 seconds, exiting.
Jun 18 00:44:00 maillists bulk[21089]: notice bulk.pl exited normally due to signal

I then checked what else occurred during the shutdown of the child process and saw the following:

Jun 18 00:44:00 maillists bulk[21089]: info Process 21089 didn't send any message since 60 seconds, exiting.
Jun 18 00:44:00 maillists bulk[21146]: info Process 21146 didn't send any message since 60 seconds, exiting.
Jun 18 00:44:00 maillists bulk[21089]: notice bulk.pl exited normally due to signal
Jun 18 00:44:00 maillists bulk[21146]: notice bulk.pl exited normally due to signal

It seems 2 child processes ended their lives at this exact same moment in time. I checked the bulkfile pid functions (get_number_of_pids, write_pid, and remove_pid), and there is no file locking.

Based on all this information, that leads me to believe that there is a race condition that occurs when 2 bulk processes shut down at precisely the same time.

I added explicit file locking into the get_number_of_pids, write_pid, and remove_pid functions in tools.pm. I'll see if that fixes my problem, but I feel pretty good this is the root cause of my issue.

Here's the changes I made to tools.pm. Even if this is not the preferred locking method for sympa, I'd suggest that locking be added to the core sympa code for bulk sender handling.

Best Regards,
John Kirkland

--- tools.pm.bak 2012-06-19 08:06:46.000000000 -0500 +++ tools.pm 2012-06-19 09:27:14.000000000 -0500 @@ -2494,6 +2494,7 @@ do_log('err','Could not open %s to remove pid %s', $pidfile, $pid); return undef; } + flock (PFILE, 1); # LOCK_SH my $l = <PFILE>; close PFILE; my @pids = grep {/[0-9]+/} split(/\s+/, $l); @@ -2512,6 +2513,7 @@ &do_log('err', "Failed to open $pidfile: %s", $!); return undef; } + flock (PFILE, 2); # LOCK_EX print PFILE join(' ', @pids)."\n"; close(PFILE); }else{ @@ -2574,6 +2576,7 @@ ## If pidfile exists, read the PIDs if(-f $pidfile) { open(PFILE, $pidfile); + flock(PFILE, 1); # LOCK_SH my $l = <PFILE>; close PFILE; @pids = grep {/[0-9]+/} split(/\s+/, $l); @@ -2585,9 +2588,9 @@ unless(open(LCK, '> '.$pidfile)) { fatal_err('Could not open %s, exiting', $pidfile); } - ## Print other pids + this one push(@pids, $pid); + flock(LCK, 2); # LOCK_EX print LCK join(' ', @pids)."\n"; close(LCK); }else{@@ -2595,8 +2598,11 @@ unless(open(LOCK, '+>> '.$pidfile)) { fatal_err('Could not open %s, exiting', $pidfile); } - unless(flock(LOCK, 6)) { - fatal_err('Could not lock %s, process is probably already running : %s', $pidfile, $!); + # Don't lock the pidfile_bulk here... we have to lock it elsewhere + if ($pidfile ne $Conf::Conf{'pidfile_bulk'}) { + unless(flock(LOCK, 6)) { + fatal_err('Could not lock %s, process is probably already running : %s', $pidfile, $!); + } } ## The previous process died suddenly, without pidfile cleanup @@ -3858,6 +3864,7 @@ &do_log('err', "unable to open pidfile %s:%s",$pidfile,$!); return undef; } + flock(PFILE,1); # LOCK_SH my $l = <PFILE>; close PFILE; my @pids = grep {/[0-9]+/} split(/\s+/, $l);

[sympa-users] Fwd: Database queries should use placeholders, Mitar, 06/14/2012
- Re: [sympa-users] Fwd: Database queries should use placeholders, David Verdin, 06/15/2012
  - [sympa-users] Possible race condition in bulk.pid handling, John Kirkland, 06/19/2012
    - Re: [sympa-users] Possible race condition in bulk.pid handling, David Verdin, 06/19/2012

List archive

[sympa-users] Possible race condition in bulk.pid handling