Hi John,
thanks a lot for your patch. Actually, we just modified this part of
the Sympa code by explicitly changing the Lock.pm module. Indeed,
it is likely that, in a future version of Sympa, we will allow any
Sympa daemon to run as multiple instances, thus making
generalized file locking mandatory.
Also, we clean up the pid file and count the actual number of bulks
running on the system, so that we won't completely trust what we
read in the pid file.
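
To illustrate what "not completely trusting the pid file" can look like, here is a minimal sketch, not the actual Sympa code. The helper name live_pids is hypothetical; it keeps only those PIDs that still refer to an existing process, using Perl's kill with signal 0.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sympa): given the PIDs read from a
# pid file, return only those that still belong to an existing process.
sub live_pids {
    my (@pids) = @_;
    return grep {
        # Reject anything that is not a positive integer.
        /^\d+$/ && $_ > 0
            # Signal 0 delivers nothing; it only tests for existence.
            # EPERM means the process exists but is owned by another user.
            && (kill(0, $_) || $!{EPERM})
    } @pids;
}

# Example: the current process is always alive.
my @alive = live_pids($$);
print scalar(@alive), " live process(es)\n";
```

Note that a PID can be reused by an unrelated process, so a check like this is a sanity filter rather than a guarantee.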
In addition, we will change the way the bulk is run: we will have a
master bulk which won't send messages at all and will just create the
processes needed according to the sending load.
This will prevent a decrease in the number of bulks if some of them
die unexpectedly.
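
As a rough sketch of that idea only (the real design may differ), a master process could reap dead children and fork replacements until a target count is reached. The function name reap_and_respawn, the %workers hash and the $work callback are all hypothetical names chosen for this example.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical supervisor step (not Sympa code): keep the set of worker
# PIDs in %$workers at $target entries. $work is the code a worker runs.
sub reap_and_respawn {
    my ($workers, $target, $work) = @_;

    # Collect every child that has exited since the last call,
    # without blocking if none has.
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        delete $workers->{$pid};
    }

    # Fork replacements until the target number is reached.
    while (keys(%$workers) < $target) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            $work->();      # child: do the sending work, then exit
            exit 0;
        }
        $workers->{$pid} = 1;   # parent: remember the new worker
    }
    return scalar keys %$workers;
}
```

The master would call this periodically, computing $target from the current load; because the count comes from the children it actually reaped, it does not depend on the content of a pid file.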
This code will be available in the 6.2 version of Sympa. We'll
probably add the locking mechanism alone to Sympa 6.1 as well.
Regards,
David
On 19/06/12 16:35, John Kirkland wrote:
Hi, all,
I have a sympa server that sends out a few hundred thousand
mailing list emails per night. I use 5 bulk senders.
I have had problems, about once a week, for a very long time,
where sympa will no longer start up any new bulk senders and only
the parent process is delivering mail (all children have shut
down). This obviously slows down my mail delivery quite
dramatically. The quick fix is to restart sympa, but today I did
some troubleshooting.
First, in the bulk.pid file, I had 2 processes listed, the parent
and an old child process that was no longer running.
These were PIDs: 24938 21089
The parent PID (24938) won't start up any new child processes
because it thinks that the child (21089) is still running. I then
did some digging in the log file and saw that the child process
seemed to have a normal lifecycle:
Jun 18 00:10:23 maillists bulk[21089]: info Bulk slave daemon started with pid 21089
Jun 18 00:10:23 maillists bulk[24938]: info Starting bulk child daemon, pid 21089
Jun 18 00:44:00 maillists bulk[21089]: info Process 21089 didn't send any message since 60 seconds, exiting.
Jun 18 00:44:00 maillists bulk[21089]: notice bulk.pl exited normally due to signal
I then checked what else occurred during the shutdown of the child
process and saw the following:
Jun 18 00:44:00 maillists bulk[21089]: info Process 21089 didn't send any message since 60 seconds, exiting.
Jun 18 00:44:00 maillists bulk[21146]: info Process 21146 didn't send any message since 60 seconds, exiting.
Jun 18 00:44:00 maillists bulk[21089]: notice bulk.pl exited normally due to signal
Jun 18 00:44:00 maillists bulk[21146]: notice bulk.pl exited normally due to signal
It seems 2 child processes ended their lives at exactly the same
moment. I checked the bulk pid file functions
(get_number_of_pids, write_pid, and remove_pid), and there is no
file locking.
All this leads me to believe that there
is a race condition that occurs when 2 bulk processes shut down at
precisely the same time.
I added explicit file locking into the get_number_of_pids,
write_pid, and remove_pid functions in tools.pm. I'll see if that
fixes my problem, but I feel pretty good this is the root cause of
my issue.
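
For illustration, here is a minimal sketch of removing a PID from such a file under file locking. It is not the tools.pm code: the name remove_pid_locked is hypothetical, and it assumes the pid file holds space-separated PIDs on one line. The file is opened without truncation and one exclusive lock is held across both the read and the rewrite, so two processes exiting at the same moment cannot overwrite each other's update.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET);

# Hypothetical helper (not the actual tools.pm function): remove one PID
# from a pid file while holding a single exclusive lock.
# Returns the number of PIDs left, or undef on failure.
sub remove_pid_locked {
    my ($pidfile, $pid) = @_;

    # '+<' opens for reading and writing WITHOUT truncating, so the old
    # content is still there once the lock has been obtained.
    open(my $fh, '+<', $pidfile) or return undef;
    flock($fh, LOCK_EX) or do { close $fh; return undef };

    my $line = <$fh>;
    $line = '' unless defined $line;
    my @pids = grep { /^\d+$/ && $_ != $pid } split(/\s+/, $line);

    # Rewrite the file in place; the lock is still held.
    seek($fh, 0, SEEK_SET) or do { close $fh; return undef };
    truncate($fh, 0)       or do { close $fh; return undef };
    print {$fh} join(' ', @pids), "\n" if @pids;

    # close flushes the buffer and then releases the lock.
    close($fh) or return undef;
    return scalar @pids;
}
```

A caller could then unlink the pid file when the function returns 0, although doing so safely would need the same kind of locking.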
Here are the changes I made to tools.pm. Even if this is not the
preferred locking method for sympa, I'd suggest that locking be
added to the core sympa code for bulk sender handling.
Best Regards,
John Kirkland
--- tools.pm.bak 2012-06-19 08:06:46.000000000 -0500
+++ tools.pm 2012-06-19 09:27:14.000000000 -0500
@@ -2494,6 +2494,7 @@
do_log('err','Could not open %s to remove pid %s', $pidfile, $pid);
return undef;
}
+ flock (PFILE, 1); # LOCK_SH
my $l = <PFILE>;
close PFILE;
my @pids = grep {/[0-9]+/} split(/\s+/, $l);
@@ -2512,6 +2513,7 @@
&do_log('err', "Failed to open $pidfile: %s", $!);
return undef;
}
+ flock (PFILE, 2); # LOCK_EX
print PFILE join(' ', @pids)."\n";
close(PFILE);
}else{
@@ -2574,6 +2576,7 @@
## If pidfile exists, read the PIDs
if(-f $pidfile) {
open(PFILE, $pidfile);
+ flock(PFILE, 1); # LOCK_SH
my $l = <PFILE>;
close PFILE;
@pids = grep {/[0-9]+/} split(/\s+/, $l);
@@ -2585,9 +2588,9 @@
unless(open(LCK, '> '.$pidfile)) {
fatal_err('Could not open %s, exiting', $pidfile);
}
-
## Print other pids + this one
push(@pids, $pid);
+ flock(LCK, 2); # LOCK_EX
print LCK join(' ', @pids)."\n";
close(LCK);
}else{
@@ -2595,8 +2598,11 @@
unless(open(LOCK, '+>> '.$pidfile)) {
fatal_err('Could not open %s, exiting', $pidfile);
}
- unless(flock(LOCK, 6)) {
- fatal_err('Could not lock %s, process is probably already running : %s', $pidfile, $!);
+ # Don't lock the pidfile_bulk here... we have to lock it elsewhere
+ if ($pidfile ne $Conf::Conf{'pidfile_bulk'}) {
+ unless(flock(LOCK, 6)) {
+ fatal_err('Could not lock %s, process is probably already running : %s', $pidfile, $!);
+ }
}
## The previous process died suddenly, without pidfile cleanup
@@ -3858,6 +3864,7 @@
&do_log('err', "unable to open pidfile %s:%s",$pidfile,$!);
return undef;
}
+ flock(PFILE,1); # LOCK_SH
my $l = <PFILE>;
close PFILE;
my @pids = grep {/[0-9]+/} split(/\s+/, $l);