Thread: Re: [courier-users] Newbie: mail queu... (8 messages in net.sourceforge.lists.courier-users)

From              Sent On                  Attachments
Colin Dick        Sep 6, 2003 10:33 am
Mircea Damian     Sep 7, 2003 1:10 am
Mircea Damian     Sep 7, 2003 8:44 am      .pl
Gordon Messmer    Sep 7, 2003 6:50 pm
Evelyn Pichler    Sep 8, 2003 10:28 pm
Colin Dick        Sep 9, 2003 10:23 pm
Gordon Messmer    Sep 9, 2003 11:50 pm
Mitch (WebCob)    Sep 9, 2003 11:51 pm
Subject: Re: [courier-users] Newbie: mail queue optimization.
From: Colin Dick (cdi@mail.ocis.net)
Date: Sep 9, 2003 10:23:44 pm
List: net.sourceforge.lists.courier-users

Hi,

Thanks for your comments. Now I have a couple more questions:

I have all Courier files installed under /usr/lib/courier. Which directories specifically should I move to a partition mounted with 'noatime'? I guess /usr/lib/courier/var. Or would it be best to move all of /usr/lib/courier? Do you think I should mount /home as noatime as well? Also, is noatime safe? Quick web searches turn up conflicting reports and sob stories of busted production servers.
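Before moving anything, it may help to see which filesystem currently holds a given directory and whether it is already mounted with noatime. The following is a minimal sketch that parses text in the /proc/mounts layout (device, mount point, type, options). The device names in the sample are hypothetical, and the function name is made up for this illustration.

```python
import os

# Hypothetical sample in /proc/mounts layout; the devices are made up.
SAMPLE_MOUNTS = """\
/dev/hda1 / ext3 rw 0 0
/dev/hda3 /usr/lib/courier ext3 rw,noatime 0 0
/dev/hda5 /home ext3 rw 0 0
"""

def mount_options_for(path, mounts_text):
    """Return (mount_point, options) for the filesystem that holds `path`."""
    path = os.path.normpath(path)
    best_point, best_opts = None, []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        point, opts = fields[1], fields[3].split(",")
        # A mount point holds the path if it is the path itself or a parent.
        prefix = point.rstrip("/") + "/"
        if path == point or path.startswith(prefix):
            # The longest matching mount point is the most specific one.
            if best_point is None or len(point) >= len(best_point):
                best_point, best_opts = point, opts
    return best_point, best_opts

if __name__ == "__main__":
    for directory in ("/usr/lib/courier/var", "/home/someuser"):
        point, opts = mount_options_for(directory, SAMPLE_MOUNTS)
        state = "noatime" if "noatime" in opts else "atime updates enabled"
        print(directory, "is on", point, "->", state)
```

On a live Linux box the same function could be fed the contents of /proc/mounts instead of the sample string.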

Thanks for the queuetime suggestion. I suppose that setting is the equivalent of exim's ignore_errormsg_errors_after. Is it relevant if I have a smarthost set? Do messages have to go into the queue to be delivered to the smarthost anyway? I set a smarthost in an attempt to ensure outbound mail from the box was not delayed.

My queue is still quite large (but I am slowly deleting and delivering the 137000-odd messages that remain; it was at 242845 recently). Without the atime change that you suggest, I guess it will continue to take hours for the mailq command to complete, so I am not able to use your perl script yet. Plus, as I mentioned, I have a php script that allows me to do similar things.

I have stopped saving messages in /var/opt/rav/quarantine and /var/opt/rav/bulk. I understand the disk i/o issue. A subsequent message suggested clustering a couple machines. If these measures don't help, that will be my next step.

I have stopped sending messages as an action of RAV scanning. I understand I was sending three bounces for every virus/spam detected by RAV: one to the sender, one to the receiver, and I think one to the RAV coders as well so that they can gather stats. I have also stopped bouncing a message to the senders of messages deemed to be spam by spamassassin. Hopefully, by taking these steps, the mailq will be restricted to inbound local deliveries only.

Please let me know if any of my understanding is off base. This problem has been going on for almost a month now and I hope to have it resolved soon so I can get on to other tasks. Thanks again for your suggestions.

On Sun, 7 Sep 2003, Mircea Damian wrote:

Hello,

I had the same problem at the beginning of SoBig and similar worms, which were filling my queue. At that point I had at least 20000 messages in the queue at peak time. I came up with the following optimizations:

1. Mount the partition that holds the mail folders and the mail queue with the noatime flag. That will speed up mailq. In my case mailq took about 10s to run with 20K messages.

2. Lower the time messages may stay in the queue: queuetime. I've set 24h.

3. I've made a short perl program that will parse the queue and help you cancel messages (see attachment). Just run it with -c 10 and it will show the top 10 messages. Remove them from the queue (if they are spam or bounces) with the -r flag _BUT_ if they are bounces or spam also add them to the bofh file.

4. Remove any quarantine actions from RAV. It's just killing your disks and I do not think it is useful (use it only for debugging). Disk bandwidth is what courierd is lacking to parse the disk queue faster.

5. DO NOT bounce any message from RAV (just do not inform anyone!). Drop all high-probability spam and virus messages (the return address of sobig is broken anyway).

If you do that I promise you:

# time mailq | tail -n 2

166 messages.
mailq  0.01s user 0.02s system 108% cpu 0.028 total
tail -n 2  0.00s user 0.00s system 0% cpu 0.020 total
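The attached perl program is not reproduced in this archive, so the following is not a port of it. It is only a minimal sketch of one plausible reading of "show the top 10": group queued messages by envelope sender and list the largest groups. It assumes the queue has already been reduced to (queue ID, sender) pairs; the IDs, addresses, and function name are hypothetical.

```python
from collections import Counter

# Hypothetical (queue_id, sender) pairs extracted from the queue listing.
SAMPLE_QUEUE = [
    ("ID-0001", "spam@example.com"),
    ("ID-0002", "Spam@Example.com"),
    ("ID-0003", "user@example.org"),
    ("ID-0004", "spam@example.com"),
    ("ID-0005", ""),  # an empty sender stands in for a bounce (null return path)
    ("ID-0006", ""),
]

def top_senders(entries, count=10):
    """Return the `count` most frequent senders as (sender, messages) pairs."""
    # Compare addresses case-insensitively so variants are counted together.
    tally = Counter(sender.lower() for _queue_id, sender in entries)
    return tally.most_common(count)

if __name__ == "__main__":
    for sender, messages in top_senders(SAMPLE_QUEUE, count=2):
        print(messages, sender or "<> (bounce)")
```

A sender that dominates the tally is a candidate for cancelling in bulk and, per the advice above, for the bofh file.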

On Sat, Sep 06, 2003 at 10:33:21AM -0700, Colin Dick wrote:

Hi, I am trying to use courier-mta (courierimap/courierpop), squirrelmail, spamassassin (w/razor2) and RAV antivirus as a spam/virus reduced service to my customers. In testing everything worked great. Now that I have about 1000 users on the box, I am running into mail delays (up to 3 days).

As far as I can tell, I am getting more mail than I am able to process through the memory-based queue during peak times, and the messages are getting dumped to the disk-based queue. I have many messages (mostly bounce messages to spoofed senders) that are old and take up the space in the memory queue when it reloads due to my queuefill value (15m). The legitimate local deliveries never get a chance. I have tried playing with queuehi/queuelo (2500/2000), but it doesn't seem to matter how many messages I allow into memory; the system still can't keep up.

I have a couple ideas but need some help in implementing:

How would I reserve some space in the memory queue for new local deliveries? In other words, prioritize for the local users on the box. I have currently written a program to dump the mailq (which takes 8 or more hours), parse it for local users and flush specific messages. However, when the memory queue is full, I don't think the messages can even be forced to flush.

How can I discard bounce messages? In exim, there is a setting called ignore-errormsg-errors-after or simply ignore-errormsg-errors. This really helps with bounce messages to spoofed (invalid) senders. Again, I have currently written a script to parse the mailq for specific addresses that I can run cancelmsg on. This way, I can toss 'batches' of messages once I determine the return address of any particular spam run.
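The batch-cancel approach described above can be sketched as follows. This is a minimal illustration, not the script mentioned in the message: it assumes the queue listing has already been parsed into (queue ID, sender) pairs, and it assumes cancelmsg is invoked as `cancelmsg <msgid> [reason]`, which should be checked against the installed man page. The function only builds the command lines and does not run them; all IDs and addresses are hypothetical.

```python
# Hypothetical (queue_id, sender) pairs parsed from the queue listing.
SAMPLE_QUEUE = [
    ("ID-0001", "run42@spam.example"),
    ("ID-0002", "customer@example.org"),
    ("ID-0003", "RUN42@spam.example"),
]

def cancel_commands(entries, bad_senders, reason="cancelled by postmaster"):
    """Build one cancelmsg argument list per message from a listed sender."""
    bad = {address.lower() for address in bad_senders}
    return [
        ["cancelmsg", queue_id, reason]  # assumed syntax: cancelmsg msgid [reason]
        for queue_id, sender in entries
        if sender.lower() in bad
    ]

if __name__ == "__main__":
    for command in cancel_commands(SAMPLE_QUEUE, ["run42@spam.example"]):
        # Print for review; a real run could pass each list to subprocess.run().
        print(" ".join(command))
```

Building argument lists instead of shell strings keeps odd characters in a reason or ID from being interpreted by a shell if the commands are later executed.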

Writing my own scripts is fine; however, I am sure there is a better way since this must be a common issue. Also, my scripts rely on the mailq output (which takes forever), so my forcing of local mail can only be as quick as the mailq results. The computer is a P4 1.60GHz (cpu MHz : 1615.935) running RH9.0. We recently doubled the RAM from 512M to 1G, which helped the speed of Squirrelmail; however, the queue just won't process fast enough. Hopefully that is enough information to get a couple of tips, or perhaps someone's document on how they optimized a similar setup. I look forward to any replies (as do my users... yup, I have already turned this into a production server and the delays have been going on for 3 weeks now).