Re: [sympa-users] How to increase Archiver throughput?

Subject: The mailing list for listmasters using Sympa

From: Matt Taggart <address@concealed>
To: "address@concealed" <address@concealed>
Subject: Re: [sympa-users] How to increase Archiver throughput?
Date: Tue, 25 Mar 2014 22:04:04 -0700

Steve Shipway writes:
> Hi -
>
> We have a lot of messages coming through our system, and the apparently
> single-threaded archived process cannot keep up. Now, the 'outgoing' spool
> has several thousand messages awaiting archival and rising. A single
> mhonarc process is running as a child of archived.pl
>
> Is there any way to increase the number of mhonarc threads we are running?
> If not, then we may never catch up with our backlog of messages pending
> archival. I'd like to hear how some of the other large mail sites handle
> this.

Hi Steve,

Here's a brain dump:

* The disk I/O comments I made in this message are relevant to arc/ too

https://listes.renater.fr/sympa/arc/sympa-users/2014-02/msg00021.html

sympa/MTA/OS are doing so many things that if several of them are on the
same disk it can easily cause i/o starvation.

* newer versions of munin (>2.0) have some nice disk i/o and latency graphs
that are good for spotting problems. The other munin graphs are helpful in
determining if cpu or memory are limiting things as well. I also use some
sympa plugins to graph the queues and number of subscribers/users/lists.
First step is having some data on the problem.
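
A minimal sketch of what a queue-graphing munin plugin could look like, in case it helps get that first bit of data. This is not one of the plugins mentioned above; the spool path and the field name `queued` are assumptions, so adjust them to your install.

```python
#!/usr/bin/env python3
"""Sketch of a munin plugin that graphs the number of files in a spool."""
import os
import sys

# Assumed location of the archiver's spool; check your sympa.conf.
SPOOL_DIR = "/var/spool/sympa/outgoing"


def count_spool_files(spool_dir):
    """Count regular files directly inside spool_dir."""
    with os.scandir(spool_dir) as entries:
        return sum(1 for entry in entries if entry.is_file())


def munin_lines(args, spool_dir=SPOOL_DIR):
    """Return the lines munin expects for 'config' or for a normal fetch."""
    if args[:1] == ["config"]:
        return [
            "graph_title Sympa archive spool",
            "graph_vlabel messages",
            "queued.label messages awaiting archival",
        ]
    try:
        return ["queued.value %d" % count_spool_files(spool_dir)]
    except OSError:
        # 'U' is munin's marker for an unknown value.
        return ["queued.value U"]


if __name__ == "__main__":
    print("\n".join(munin_lines(sys.argv[1:])))
```

A rising line on that graph while the disk latency graphs stay flat would suggest the bottleneck is somewhere other than i/o.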

* Are you using the sympa upstream version of mhonarc-ressources.tt2 or a
custom version? What version of mhonarc?

* Anything else on the system that might be causing a bottleneck?
- entropy shortage (common on VMs; install haveged and get a hardware RNG)
- interrupt thrashing (consider irqbalance)
- you mentioned you are using VMware, maybe other VMs on the system
causing disruption
- SAN bottleneck, in the controller/link/array/etc
The munin graphs will help you determine the above.
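
For the entropy item, a quick check is possible without munin. The sketch below reads the kernel's entropy estimate from /proc/sys/kernel/random/entropy_avail; the threshold of 200 bits is an arbitrary assumption, and on recent Linux kernels the reported value is essentially constant, so this is mainly useful on older systems.

```python
from pathlib import Path

ENTROPY_FILE = Path("/proc/sys/kernel/random/entropy_avail")
LOW_ENTROPY_BITS = 200  # assumed threshold, tune for your system


def parse_entropy(text):
    """Parse the contents of entropy_avail into an integer bit count."""
    return int(text.strip())


def is_entropy_low(bits, threshold=LOW_ENTROPY_BITS):
    """True when the pool is below the chosen threshold."""
    return bits < threshold


if __name__ == "__main__":
    if ENTROPY_FILE.exists():
        bits = parse_entropy(ENTROPY_FILE.read_text())
        print("entropy_avail=%d low=%s" % (bits, is_entropy_low(bits)))
```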

* Long ago we had to switch to ext4 because ext3 wasn't able to handle
everything. We run newer backported kernels in order to have the latest
performance/fixes to all the block layer stuff (we use ext4+lvm+dmcrypt+md
raid).

* Running a large sympa install is really hard on mechanical drives. We
monitor the SMART disk attributes and watch the Reallocated_Sector_Ct. We
use RAID1 and if we suspect a drive is having problems we will remove it
from the array and run an offline badblocks test on it. This serves two
purposes: 1) tests the drive and tells if there are any bad blocks (which
also means the drive has used up its spare blocks) and 2) exercises the
blocks and hopefully gets any marginal blocks to fail during testing rather
than while in production. Having a block have problems in production can
cause serious i/o hiccups or drives to be thrown from arrays.
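
The Reallocated_Sector_Ct check described above could be scripted roughly like this. It is a sketch that parses the attribute table printed by `smartctl -A`, assuming the usual ten-column layout with the raw value in the last column; the sample text is made up for illustration.

```python
def reallocated_sectors(smartctl_output):
    """Return the raw Reallocated_Sector_Ct value, or None if not present.

    Expects the text printed by `smartctl -A <device>`.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED,
        # WHEN_FAILED, RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return None


# Illustrative sample, not real output from any particular drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       11432
"""

count = reallocated_sectors(SAMPLE)
```

A count that is non-zero and growing between runs is the signal to pull the drive and test it.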

You said you are on a SAN, but the same things will apply, just on the SAN
device instead of local. How the SAN is configured will have a big effect
too. For example, for sympa archives I bet you are better off with the
disks split into multiple arrays (RAID0+1, or even just multiple RAID1 as
LVM PVs in the same VG) than you would be with all the disks in one big
RAID5/6 array, because then the disks could do different things rather than
all the disks needing to deal with every request.

* We've actually seen HDD i/o problems caused by vibrations. If you are
having HDD i/o problems and they won't go away by isolating the load, keep
this in mind (hopefully not an issue for your SAN).
Watch this: https://www.youtube.com/watch?v=tDacjrSCeq4

* What is your web archive load like? If you have lots of people browsing
the archives all the time (or web crawlers) that could disrupt write i/o.

* How do you back up the archives, and when does that happen?

* Can you determine if the load is being caused by particular lists? If so
maybe you can think of ways to isolate those particular lists, maybe
putting them on a separate drive and using symlinks.
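
One rough way to spot the heavy lists is to count files per list directory under arc/. The sketch below assumes each list has its own subdirectory directly under the archive root; the path in the comment is only an example.

```python
import os
from collections import Counter


def files_per_list(arc_root):
    """Count files (recursively) under each immediate subdirectory of arc_root."""
    counts = Counter()
    with os.scandir(arc_root) as entries:
        for entry in entries:
            if entry.is_dir():
                counts[entry.name] = sum(
                    len(files) for _, _, files in os.walk(entry.path)
                )
    return counts


def busiest(arc_root, n=10):
    """Return the n lists with the most archived files, largest first."""
    return files_per_list(arc_root).most_common(n)


# Example (assumed path): busiest("/var/lib/sympa/arc", 5)
```

Running it twice some hours apart and comparing the counts gives a crude picture of which lists are generating the archive writes. Note that the walk itself puts read load on the disk, so run it off-peak.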

* With lots of lists in the arc/ directory, it can be painful to do
anything that involves a stat on that dir. We've considered suggesting that
lists be bucketed in a-z0-9 directories (for expl/ too). So far we just
try to avoid operations like that (bash tab completion in there is always fun).
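
To make the bucketing idea concrete, here is a sketch of how a list name could map to an a-z0-9 bucket directory. This is not something sympa does; the function names, the `_` fallback bucket, and the example path are all hypothetical.

```python
import posixpath


def bucket_for(list_name):
    """Pick a one-character bucket (a-z or 0-9) from the list name.

    Names starting with anything else go into a catch-all '_' bucket.
    """
    first = list_name[:1].lower()
    if first.isascii() and first.isalnum():
        return first
    return "_"


def bucketed_path(arc_root, list_name):
    """Build arc_root/<bucket>/<list_name>."""
    return posixpath.join(arc_root, bucket_for(list_name), list_name)


example = bucketed_path("/var/lib/sympa/arc", "sympa-users@example.org")
```

With 36 buckets plus the fallback, each directory holds a small fraction of the lists, which keeps directory scans and tab completion manageable.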

* We've considered using the new bcache (http://bcache.evilpiepirate.org/)
to speed things up, but do all the above before considering that.

> We have many bulk threads handling the distribution of messages, so no
> problems scaling out there, but how do we increase the archival capacity?

If our sympa install keeps growing we will eventually need to be able to
split the archive stuff off to dedicated machine(s). Ideally sympa would be
able to do it, but I've also thought about ways it could be done outside of
sympa, like maybe have a mirror of the archive for the website to serve, or
migrate dormant list archives to a separate server and do some apache magic
to make things work.

When you find the problem, let us know what it is :)

--
Matt Taggart
address@concealed

[sympa-users] How to increase Archiver throughput?, Steve Shipway, 03/25/2014
- Re: [sympa-users] How to increase Archiver throughput?, Matt Taggart, 03/26/2014
  - RE: [sympa-users] How to increase Archiver throughput?, Steve Shipway, 03/26/2014
