mirror of
https://github.com/apache/httpd.git
synced 2025-05-17 15:21:13 +03:00
git-svn-id: https://svn.apache.org/repos/asf/httpd/httpd/trunk@87505 13f79535-47bb-0310-9956-ffa450edef68
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<TITLE>Apache Performance Notes</TITLE>
|
|
</HEAD>
|
|
<!-- Background white, links blue (unvisited), navy (visited), red (active) -->
|
|
<BODY
|
|
BGCOLOR="#FFFFFF"
|
|
TEXT="#000000"
|
|
LINK="#0000FF"
|
|
VLINK="#000080"
|
|
ALINK="#FF0000"
|
|
>
|
|
<!--#include virtual="header.html" -->
|
|
|
|
<blockquote><strong>Warning:</strong>
|
|
This document has not been updated to take into account changes
|
|
made in the 2.0 version of the Apache HTTP Server. Some of the
|
|
information may still be relevant, but please use it
|
|
with care.
|
|
</blockquote>
|
|
|
|
<H1 align="center">Apache Performance Notes</H1>
|
|
|
|
<P>Author: Dean Gaudet
|
|
|
|
<ul>
|
|
<li><a href="#introduction">Introduction</a></li>
|
|
<li><a href="#hardware">Hardware and Operating System Issues</a></li>
|
|
<li><a href="#runtime">Run-Time Configuration Issues</a></li>
|
|
<li><a href="#compiletime">Compile-Time Configuration Issues</a></li>
|
|
<li>Appendixes
|
|
<ul>
|
|
<li><a href="#trace">Detailed Analysis of a Trace</a></li>
|
|
<li><a href="#patches">Patches Available</a></li>
|
|
<li><a href="#preforking">The Pre-Forking Model</a></li>
|
|
</ul></li>
|
|
</ul>
|
|
|
|
<hr>
|
|
|
|
<H3><a name="introduction">Introduction</A></H3>
|
|
<P>Apache is a general webserver, which is designed to be correct first, and
|
|
fast second. Even so, its performance is quite satisfactory. Most
|
|
sites have less than 10 Mbit/s of outgoing bandwidth, which Apache can
|
|
fill using only a low end Pentium-based webserver. In practice sites
|
|
with more bandwidth require more than one machine to fill the bandwidth
|
|
due to other constraints (such as CGI or database transaction overhead).
|
|
For these reasons the development focus has been mostly on correctness
|
|
and configurability.
|
|
|
|
<P>Unfortunately many folks overlook these facts and cite raw performance
|
|
numbers as if they are some indication of the quality of a web server
|
|
product. There is a bare minimum performance that is acceptable; beyond
that, extra speed only caters to a much smaller segment of the market.
|
|
But in order to avoid this hurdle to the acceptance of Apache in some
|
|
markets, effort was put into Apache 1.3 to bring performance up to a
|
|
point where the difference with other high-end webservers is minimal.
|
|
|
|
<P>Finally there are the folks who just plain want to see how fast something
|
|
can go. The author falls into this category. The rest of this document
|
|
is dedicated to these folks who want to squeeze every last bit of
|
|
performance out of Apache's current model, and want to understand why
|
|
it does some things which slow it down.
|
|
|
|
<P>Note that this is tailored towards Apache 1.3 on Unix. Some of it applies
|
|
to Apache on NT. Apache on NT has not been tuned for performance yet;
|
|
in fact it probably performs very poorly because NT performance requires
|
|
a different programming model.
|
|
|
|
<hr>
|
|
|
|
<H3><a name="hardware">Hardware and Operating System Issues</a></H3>
|
|
|
|
<P>The single biggest hardware issue affecting webserver performance
|
|
is RAM. A webserver should never have to swap, because swapping increases
|
|
the latency of each request beyond a point that users consider "fast
|
|
enough". This causes users to hit stop and reload, further increasing
|
|
the load. You can, and should, control the <CODE>MaxClients</CODE>
|
|
setting so that your server does not spawn so many children it starts
|
|
swapping.
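<P>As a rough illustration of the arithmetic behind choosing
<CODE>MaxClients</CODE>, the sketch below divides the RAM available to
Apache by the size of one child. The function name and the figures in the
usage note are hypothetical; measure the per-child size on your own system.

```c
/*
 * Rough upper bound for MaxClients: the RAM left over for Apache divided
 * by the resident size of one child. All inputs are in megabytes and are
 * values you must measure yourself (for example with ps or top).
 * Returns 0 when the inputs leave no room for any child.
 */
static unsigned long estimate_max_clients(unsigned long total_ram_mb,
                                          unsigned long reserved_mb,
                                          unsigned long per_child_mb)
{
    if (per_child_mb == 0 || reserved_mb >= total_ram_mb)
        return 0;
    return (total_ram_mb - reserved_mb) / per_child_mb;
}
```

<P>For instance, with 128 MB of RAM, 32 MB reserved for the rest of the
system, and 2 MB per child, <CODE>estimate_max_clients(128, 32, 2)</CODE>
gives 48. Treat the result as a starting point for experimentation, not as
a guarantee that the machine will never swap.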
|
|
|
|
<P>Beyond that the rest is mundane: get a fast enough CPU, a fast enough
|
|
network card, and fast enough disks, where "fast enough" is something
|
|
that needs to be determined by experimentation.
|
|
|
|
<P>Operating system choice is largely a matter of local concerns. But
|
|
a general guideline is to always apply the latest vendor TCP/IP patches.
|
|
HTTP serving completely breaks many of the assumptions built into Unix
|
|
kernels up through 1994 and even 1995. Good choices include
|
|
recent FreeBSD, and Linux.
|
|
|
|
<hr>
|
|
|
|
<H3><a name="runtime">Run-Time Configuration Issues</a></H3>
|
|
|
|
<H4>HostnameLookups</H4>
|
|
<P>Prior to Apache 1.3, <CODE>HostnameLookups</CODE> defaulted to On.
|
|
This adds latency
|
|
to every request because it requires a DNS lookup to complete before
|
|
the request is finished. In Apache 1.3 this setting defaults to Off.
|
|
However (1.3 or later), if you use any <CODE>Allow from domain</CODE> or
|
|
<CODE>Deny from domain</CODE> directives then you will pay for a
|
|
double reverse DNS lookup (a reverse, followed by a forward to make sure
|
|
that the reverse is not being spoofed). So for the highest performance
|
|
avoid using these directives (it's fine to use IP addresses rather than
|
|
domain names).
|
|
|
|
<P>Note that it's possible to scope the directives, such as within
|
|
a <CODE>&lt;Location /server-status&gt;</CODE> section. In this
|
|
case the DNS lookups are only performed on requests matching the
|
|
criteria. Here's an example which disables
|
|
lookups except for .html and .cgi files:
|
|
|
|
<BLOCKQUOTE><PRE>
HostnameLookups off
&lt;Files ~ "\.(html|cgi)$"&gt;
    HostnameLookups on
&lt;/Files&gt;
</PRE></BLOCKQUOTE>
|
|
|
|
Even so, if you only need DNS names
in some CGIs, consider doing the reverse lookup (a
<CODE>gethostbyaddr</CODE> call) in the specific CGIs that need it.
|
|
|
|
<H4>FollowSymLinks and SymLinksIfOwnerMatch</H4>
|
|
<P>Wherever in your URL-space you do not have an
|
|
<CODE>Options FollowSymLinks</CODE>, or you do have an
|
|
<CODE>Options SymLinksIfOwnerMatch</CODE>, Apache will have to
issue extra system calls to check up on symlinks: one extra call per
filename component. For example, if you had:

<BLOCKQUOTE><PRE>
DocumentRoot /www/htdocs
&lt;Directory /&gt;
    Options SymLinksIfOwnerMatch
&lt;/Directory&gt;
</PRE></BLOCKQUOTE>

and a request is made for the URI <CODE>/index.html</CODE>,
then Apache will perform <CODE>lstat(2)</CODE> on <CODE>/www</CODE>,
<CODE>/www/htdocs</CODE>, and <CODE>/www/htdocs/index.html</CODE>. The
results of these <CODE>lstat</CODE> calls are never cached,
|
|
so they will occur on every single request. If you really desire the
|
|
symlinks security checking you can do something like this:
|
|
|
|
<BLOCKQUOTE><PRE>
DocumentRoot /www/htdocs
&lt;Directory /&gt;
    Options FollowSymLinks
&lt;/Directory&gt;
&lt;Directory /www/htdocs&gt;
    Options -FollowSymLinks +SymLinksIfOwnerMatch
&lt;/Directory&gt;
</PRE></BLOCKQUOTE>
|
|
|
|
This at least avoids the extra checks for the <CODE>DocumentRoot</CODE>
|
|
path. Note that you'll need to add similar sections if you have any
|
|
<CODE>Alias</CODE> or <CODE>RewriteRule</CODE> paths outside of your
|
|
document root. For highest performance, and no symlink protection,
|
|
set <CODE>FollowSymLinks</CODE> everywhere, and never set
|
|
<CODE>SymLinksIfOwnerMatch</CODE>.
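<P>To make the per-component cost concrete, the sketch below walks an
absolute path and calls <CODE>lstat(2)</CODE> on each leading component,
counting the calls. It is contrived for illustration and is not Apache's
code; the function name and the 1024-byte limit are assumptions.

```c
#include <string.h>
#include <sys/stat.h>

/*
 * Call lstat(2) on every leading component of an absolute path, the way
 * a per-component symlink check must. Returns the number of lstat calls
 * made, or -1 if the path is not absolute or is too long.
 */
static int lstat_each_component(const char *path)
{
    char prefix[1024];
    struct stat sb;
    size_t len;
    size_t i;
    int calls = 0;

    if (path == NULL || path[0] != '/')
        return -1;
    len = strlen(path);
    if (len >= sizeof(prefix))
        return -1;

    for (i = 1; i <= len; ++i) {
        /* a component ends at each '/' and at the end of the string */
        if (path[i] == '/' || path[i] == '\0') {
            if (path[i - 1] == '/')
                continue;               /* skip empty components */
            memcpy(prefix, path, i);
            prefix[i] = '\0';
            (void) lstat(prefix, &sb);  /* result ignored, we only count */
            ++calls;
        }
    }
    return calls;
}
```

<P>For <CODE>/www/htdocs/index.html</CODE> this makes three calls, one each
for <CODE>/www</CODE>, <CODE>/www/htdocs</CODE>, and the file itself,
matching the example above.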
|
|
|
|
<H4>AllowOverride</H4>
|
|
|
|
<P>Wherever in your URL-space you allow overrides (typically
|
|
<CODE>.htaccess</CODE> files) Apache will attempt to open
|
|
<CODE>.htaccess</CODE> for each filename component. For example, if you had:

<BLOCKQUOTE><PRE>
DocumentRoot /www/htdocs
&lt;Directory /&gt;
    AllowOverride all
&lt;/Directory&gt;
</PRE></BLOCKQUOTE>

and a request is made for the URI <CODE>/index.html</CODE>, then
|
|
Apache will attempt to open <CODE>/.htaccess</CODE>,
|
|
<CODE>/www/.htaccess</CODE>, and <CODE>/www/htdocs/.htaccess</CODE>.
|
|
The solutions are similar to the previous case of <CODE>Options
|
|
FollowSymLinks</CODE>. For highest performance use
|
|
<CODE>AllowOverride None</CODE> everywhere in your filesystem.
|
|
|
|
<H4>Negotiation</H4>
|
|
|
|
<P>If at all possible, avoid content-negotiation if you're really
|
|
interested in every last ounce of performance. In practice the
|
|
benefits of negotiation outweigh the performance penalties. There's
|
|
one case where you can speed up the server. Instead of using
|
|
a wildcard such as:
|
|
|
|
<BLOCKQUOTE><PRE>
DirectoryIndex index
</PRE></BLOCKQUOTE>
|
|
|
|
Use a complete list of options:
|
|
|
|
<BLOCKQUOTE><PRE>
DirectoryIndex index.cgi index.pl index.shtml index.html
</PRE></BLOCKQUOTE>
|
|
|
|
where you list the most common choice first.
|
|
|
|
<H4>Process Creation</H4>
|
|
|
|
<P>Prior to Apache 1.3 the <CODE>MinSpareServers</CODE>,
|
|
<CODE>MaxSpareServers</CODE>, and <CODE>StartServers</CODE> settings
|
|
all had drastic effects on benchmark results. In particular, Apache
|
|
required a "ramp-up" period in order to reach a number of children
|
|
sufficient to serve the load being applied. After the initial
|
|
spawning of <CODE>StartServers</CODE> children, only one child per
|
|
second would be created to satisfy the <CODE>MinSpareServers</CODE>
|
|
setting. So a server being accessed by 100 simultaneous clients,
|
|
using the default <CODE>StartServers</CODE> of 5, would take on
the order of 95 seconds to spawn enough children to handle the load. This
works fine in practice on real-life servers, because they aren't restarted
frequently. But it does really poorly on benchmarks which might only run
for ten minutes.
|
|
|
|
<P>The one-per-second rule was implemented in an effort to avoid
|
|
swamping the machine with the startup of new children. If the machine
|
|
is busy spawning children it can't service requests. But it has such
|
|
a drastic effect on the perceived performance of Apache that it had
|
|
to be replaced. As of Apache 1.3,
|
|
the code will relax the one-per-second rule. It
|
|
will spawn one, wait a second, then spawn two, wait a second, then spawn
|
|
four, and it will continue exponentially until it is spawning 32 children
|
|
per second. It will stop whenever it satisfies the
|
|
<CODE>MinSpareServers</CODE> setting.
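<P>The two spawning rules can be compared with a little arithmetic. The
sketch below is a simplified model, not Apache's code: it counts one-second
steps, spawns one batch per step, and ignores the idle-server accounting
that the real server does. The function name is hypothetical.

```c
/*
 * Number of one-second steps needed to spawn `needed` children when one
 * batch is started per step. The batch size starts at 1 and doubles each
 * step up to `max_rate`; pass max_rate = 1 to model the old
 * one-per-second rule. Returns -1 if max_rate is less than 1.
 */
static int seconds_to_spawn(int needed, int max_rate)
{
    int spawned = 0;
    int rate = 1;
    int seconds = 0;

    if (max_rate < 1)
        return -1;
    while (spawned < needed) {
        spawned += rate;
        ++seconds;
        if (rate < max_rate)
            rate *= 2;          /* exponential ramp-up */
        if (rate > max_rate)
            rate = max_rate;    /* never exceed the cap */
    }
    return seconds;
}
```

<P>For the 95 extra children in the example above, the old rule
(<CODE>seconds_to_spawn(95, 1)</CODE>) needs 95 steps, while the 1.3 rule
with its cap of 32 (<CODE>seconds_to_spawn(95, 32)</CODE>) needs 7, because
1 + 2 + 4 + 8 + 16 + 32 + 32 = 95.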
|
|
|
|
<P>This appears to be responsive enough that it's
|
|
almost unnecessary to twiddle the <CODE>MinSpareServers</CODE>,
|
|
<CODE>MaxSpareServers</CODE> and <CODE>StartServers</CODE> knobs. When
|
|
more than 4 children are spawned per second, a message will be emitted
|
|
to the <CODE>ErrorLog</CODE>. If you see a lot of these errors then
|
|
consider tuning these settings. Use the <CODE>mod_status</CODE> output
|
|
as a guide.
|
|
|
|
<P>Related to process creation is process death induced by the
|
|
<CODE>MaxRequestsPerChild</CODE> setting. By default this is 0, which
|
|
means that there is no limit to the number of requests handled
|
|
per child. If your configuration currently has this set to some
|
|
very low number, such as 30, you may want to bump this up significantly.
|
|
If you are running SunOS or an old version of Solaris, limit this
|
|
to 10000 or so because of memory leaks.
|
|
|
|
<P>When keep-alives are in use, children will be kept busy
|
|
doing nothing waiting for more requests on the already open
|
|
connection. The default <CODE>KeepAliveTimeout</CODE> of
|
|
15 seconds attempts to minimize this effect. The tradeoff
|
|
here is between network bandwidth and server resources.
|
|
In no event should you raise this above about 60 seconds, as
|
|
<A HREF="http://www.research.digital.com/wrl/techreports/abstracts/95.4.html"
|
|
>most of the benefits are lost</A>.
|
|
|
|
<hr>
|
|
|
|
<H3><a name="compiletime">Compile-Time Configuration Issues</a></H3>
|
|
|
|
<H4>mod_status and ExtendedStatus On</H4>
|
|
|
|
<P>If you include <CODE>mod_status</CODE>
|
|
and you also set <CODE>ExtendedStatus On</CODE> when building and running
|
|
Apache, then on every request Apache will perform two calls to
|
|
<CODE>gettimeofday(2)</CODE> (or <CODE>times(2)</CODE> depending
|
|
on your operating system), and (pre-1.3) several extra calls to
|
|
<CODE>time(2)</CODE>. This is all done so that the status report
|
|
contains timing indications. For highest performance, set
|
|
<CODE>ExtendedStatus off</CODE> (which is the default).
|
|
|
|
<H4>accept Serialization - multiple sockets</H4>
|
|
|
|
<P>This discusses a shortcoming in the Unix socket API.
|
|
Suppose your
|
|
web server uses multiple <CODE>Listen</CODE> statements to listen on
|
|
either multiple ports or multiple addresses. In order to test each
|
|
socket to see if a connection is ready Apache uses <CODE>select(2)</CODE>.
|
|
<CODE>select(2)</CODE> indicates that a socket has <EM>zero</EM> or
|
|
<EM>at least one</EM> connection waiting on it. Apache's model includes
|
|
multiple children, and all the idle ones test for new connections at the
|
|
same time. A naive implementation looks something like this
|
|
(these examples do not match the code, they're contrived for
|
|
pedagogical purposes):
|
|
|
|
<BLOCKQUOTE><PRE>
    for (;;) {
        for (;;) {
            fd_set accept_fds;

            FD_ZERO (&amp;accept_fds);
            for (i = first_socket; i &lt;= last_socket; ++i) {
                FD_SET (i, &amp;accept_fds);
            }
            rc = select (last_socket+1, &amp;accept_fds, NULL, NULL, NULL);
            if (rc &lt; 1) continue;
            new_connection = -1;
            for (i = first_socket; i &lt;= last_socket; ++i) {
                if (FD_ISSET (i, &amp;accept_fds)) {
                    new_connection = accept (i, NULL, NULL);
                    if (new_connection != -1) break;
                }
            }
            if (new_connection != -1) break;
        }
        process the new_connection;
    }
</PRE></BLOCKQUOTE>
|
|
|
|
But this naive implementation has a serious starvation problem. Recall
|
|
that multiple children execute this loop at the same time, and so multiple
|
|
children will block at <CODE>select</CODE> when they are in between
|
|
requests. All those blocked children will awaken and return from
|
|
<CODE>select</CODE> when a single request appears on any socket
|
|
(the number of children which awaken varies depending on the operating
|
|
system and timing issues).
|
|
They will all then fall down into the loop and try to <CODE>accept</CODE>
|
|
the connection. But only one will succeed (assuming there's still only
|
|
one connection ready), the rest will be <EM>blocked</EM> in
|
|
<CODE>accept</CODE>.
|
|
This effectively locks those children into serving requests from that
|
|
one socket and no other sockets, and they'll be stuck there until enough
|
|
new requests appear on that socket to wake them all up.
|
|
This starvation problem was first documented in
|
|
<A HREF="http://bugs.apache.org/index/full/467">PR#467</A>. There
|
|
are at least two solutions.
|
|
|
|
<P>One solution is to make the sockets non-blocking. In this case the
|
|
<CODE>accept</CODE> won't block the children, and they will be allowed
|
|
to continue immediately. But this wastes CPU time. Suppose you have
|
|
ten idle children in <CODE>select</CODE>, and one connection arrives.
|
|
Then nine of those children will wake up, try to <CODE>accept</CODE> the
|
|
connection, fail, and loop back into <CODE>select</CODE>, accomplishing
|
|
nothing. Meanwhile none of those children are servicing requests that
|
|
occurred on other sockets until they get back up to the <CODE>select</CODE>
|
|
again. Overall this solution does not seem very fruitful unless you
|
|
have as many idle CPUs (in a multiprocessor box) as you have idle children,
|
|
not a very likely situation.
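<P>For reference, the non-blocking variant comes down to setting
<CODE>O_NONBLOCK</CODE> on each listening socket, after which a child that
loses the race gets an immediate error from <CODE>accept</CODE> instead of
blocking. The sketch below is an illustration only; the function names and
the loopback listener are assumptions made so that the example is
self-contained.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a listening TCP socket on an ephemeral loopback port and switch
 * it to non-blocking mode. Returns the descriptor, or -1 on error.
 */
static int make_nonblocking_listener(void)
{
    struct sockaddr_in sa;
    int flags;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;                    /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0
        || listen(fd, 16) < 0
        || (flags = fcntl(fd, F_GETFL, 0)) < 0
        || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Try to accept once. Returns the new connection, or -1 with errno set
 * to EAGAIN or EWOULDBLOCK when no connection is waiting (for example
 * because another child already took it).
 */
static int try_accept(int listener)
{
    return accept(listener, NULL, NULL);
}
```

<P>A child that gets <CODE>EAGAIN</CODE> or <CODE>EWOULDBLOCK</CODE> from
<CODE>try_accept</CODE> simply goes back to <CODE>select</CODE>, which is
exactly the wasted wake-up described above.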
|
|
|
|
<P>Another solution, the one used by Apache, is to serialize entry into
|
|
the inner loop. The loop looks like this (differences highlighted):
|
|
|
|
<BLOCKQUOTE><PRE>
    for (;;) {
        <STRONG>accept_mutex_on ();</STRONG>
        for (;;) {
            fd_set accept_fds;

            FD_ZERO (&amp;accept_fds);
            for (i = first_socket; i &lt;= last_socket; ++i) {
                FD_SET (i, &amp;accept_fds);
            }
            rc = select (last_socket+1, &amp;accept_fds, NULL, NULL, NULL);
            if (rc &lt; 1) continue;
            new_connection = -1;
            for (i = first_socket; i &lt;= last_socket; ++i) {
                if (FD_ISSET (i, &amp;accept_fds)) {
                    new_connection = accept (i, NULL, NULL);
                    if (new_connection != -1) break;
                }
            }
            if (new_connection != -1) break;
        }
        <STRONG>accept_mutex_off ();</STRONG>
        process the new_connection;
    }
</PRE></BLOCKQUOTE>
|
|
|
|
<A NAME="serialize">The functions</A>
|
|
<CODE>accept_mutex_on</CODE> and <CODE>accept_mutex_off</CODE>
|
|
implement a mutual exclusion semaphore. Only one child can have the
|
|
mutex at any time. There are several choices for implementing these
|
|
mutexes. The choice is defined in <CODE>src/conf.h</CODE> (pre-1.3) or
|
|
<CODE>src/include/ap_config.h</CODE> (1.3 or later). Some architectures
do not have any locking choice made; on these architectures it is unsafe
|
|
to use multiple <CODE>Listen</CODE> directives.
|
|
|
|
<DL>
|
|
<DT><CODE>USE_FLOCK_SERIALIZED_ACCEPT</CODE>
|
|
<DD>This method uses the <CODE>flock(2)</CODE> system call to lock a
|
|
lock file (located by the <CODE>LockFile</CODE> directive).
|
|
|
|
<DT><CODE>USE_FCNTL_SERIALIZED_ACCEPT</CODE>
|
|
<DD>This method uses the <CODE>fcntl(2)</CODE> system call to lock a
|
|
lock file (located by the <CODE>LockFile</CODE> directive).
|
|
|
|
<DT><CODE>USE_SYSVSEM_SERIALIZED_ACCEPT</CODE>
|
|
<DD>(1.3 or later) This method uses SysV-style semaphores to implement the
|
|
mutex. Unfortunately SysV-style semaphores have some bad side-effects.
|
|
One is that it's possible Apache will die without cleaning up the semaphore
|
|
(see the <CODE>ipcs(8)</CODE> man page). The other is that the semaphore
|
|
API allows for a denial of service attack by any CGIs running under the
|
|
same uid as the webserver (<EM>i.e.</EM>, all CGIs, unless you use something
|
|
like suexec or cgiwrapper). For these reasons this method is not used
|
|
on any architecture except IRIX (where the previous two are prohibitively
|
|
expensive on most IRIX boxes).
|
|
|
|
<DT><CODE>USE_USLOCK_SERIALIZED_ACCEPT</CODE>
|
|
<DD>(1.3 or later) This method is only available on IRIX, and uses
|
|
<CODE>usconfig(2)</CODE> to create a mutex. While this method avoids
|
|
the hassles of SysV-style semaphores, it is not the default for IRIX.
|
|
This is because on single processor IRIX boxes (5.3 or 6.2) the
|
|
uslock code is two orders of magnitude slower than the SysV-semaphore
|
|
code. On multi-processor IRIX boxes the uslock code is an order of magnitude
|
|
faster than the SysV-semaphore code. Kind of a messed up situation.
|
|
So if you're using a multiprocessor IRIX box then you should rebuild your
|
|
webserver with <CODE>-DUSE_USLOCK_SERIALIZED_ACCEPT</CODE> on the
|
|
<CODE>EXTRA_CFLAGS</CODE>.
|
|
|
|
<DT><CODE>USE_PTHREAD_SERIALIZED_ACCEPT</CODE>
|
|
<DD>(1.3 or later) This method uses POSIX mutexes and should work on
|
|
any architecture implementing the full POSIX threads specification;
however, it appears to work only on Solaris (2.5 or later), and even then
|
|
only in certain configurations. If you experiment with this you should
|
|
watch out for your server hanging and not responding. Static content
|
|
only servers may work just fine.
|
|
</DL>
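<P>To show what such a mutex amounts to, here is a minimal sketch of
<CODE>accept_mutex_on</CODE> and <CODE>accept_mutex_off</CODE> built on
<CODE>fcntl(2)</CODE> record locks. It is not the Apache source: the
<CODE>accept_mutex_init</CODE> helper, the lock file location under
<CODE>/tmp</CODE>, and the return values are assumptions made for
illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int lock_fd = -1;   /* lock file descriptor, inherited by children */

/* Create and immediately unlink a lock file; call once before forking. */
static int accept_mutex_init(void)
{
    char path[] = "/tmp/accept.lock.XXXXXX";   /* hypothetical location */

    lock_fd = mkstemp(path);
    if (lock_fd < 0)
        return -1;
    unlink(path);          /* the open descriptor keeps the file alive */
    return 0;
}

/* Take (F_WRLCK) or release (F_UNLCK) a whole-file lock, waiting if needed. */
static int accept_mutex_op(short type)
{
    struct flock fl;
    int rc;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;          /* 0 means "through end of file" */
    do {
        rc = fcntl(lock_fd, F_SETLKW, &fl);
    } while (rc < 0 && errno == EINTR);   /* retry if a signal interrupts */
    return rc;
}

static int accept_mutex_on(void)  { return accept_mutex_op(F_WRLCK); }
static int accept_mutex_off(void) { return accept_mutex_op(F_UNLCK); }
```

<P>Because <CODE>fcntl</CODE> locks belong to the process rather than to
the descriptor, children that inherit <CODE>lock_fd</CODE> across
<CODE>fork</CODE> exclude one another: only one of them returns from
<CODE>accept_mutex_on</CODE> at a time.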
|
|
|
|
<P>If your system has another method of serialization which isn't in the
|
|
above list then it may be worthwhile adding code for it (and submitting
|
|
a patch back to Apache).
|
|
|
|
<P>Another solution that has been considered but never implemented is
|
|
to partially serialize the loop -- that is, let in a certain number
|
|
of processes. This would only be of interest on multiprocessor boxes
|
|
where it's possible multiple children could run simultaneously, and the
|
|
serialization actually doesn't take advantage of the full bandwidth.
|
|
This is a possible area of future investigation, but priority remains
|
|
low because highly parallel web servers are not the norm.
|
|
|
|
<P>Ideally you should run servers without multiple <CODE>Listen</CODE>
|
|
statements if you want the highest performance. But read on.
|
|
|
|
<H4>accept Serialization - single socket</H4>
|
|
|
|
<P>The above is fine and dandy for multiple socket servers, but what
|
|
about single socket servers? In theory they shouldn't experience
|
|
any of these same problems because all children can just block in
|
|
<CODE>accept(2)</CODE> until a connection arrives, and no starvation
|
|
results. In practice this hides almost the same "spinning" behaviour
|
|
discussed above in the non-blocking solution. The way that most TCP
|
|
stacks are implemented, the kernel actually wakes up all processes blocked
|
|
in <CODE>accept</CODE> when a single connection arrives. One of those
|
|
processes gets the connection and returns to user-space, the rest spin in
|
|
the kernel and go back to sleep when they discover there's no connection
|
|
for them. This spinning is hidden from the user-land code, but it's
|
|
there nonetheless. This can result in the same load-spiking wasteful
|
|
behaviour that a non-blocking solution to the multiple sockets case can.
|
|
|
|
<P>For this reason we have found that many architectures behave more
|
|
"nicely" if we serialize even the single socket case. So this is
|
|
actually the default in almost all cases. Crude experiments under
|
|
Linux (2.0.30 on a dual Pentium Pro 166 with 128 MB RAM) have shown that
|
|
the serialization of the single socket case causes less than a 3%
|
|
decrease in requests per second over unserialized single-socket.
|
|
But unserialized single-socket showed an extra 100ms latency on
|
|
each request. This latency is probably a wash on long haul lines,
|
|
and only an issue on LANs. If you want to override the single socket
|
|
serialization you can define <CODE>SINGLE_LISTEN_UNSERIALIZED_ACCEPT</CODE>
|
|
and then single-socket servers will not serialize at all.
|
|
|
|
<H4>Lingering Close</H4>
|
|
|
|
<P>As discussed in
|
|
<A
|
|
HREF="http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt"
|
|
>draft-ietf-http-connection-00.txt</A> section 8,
|
|
in order for an HTTP server to <STRONG>reliably</STRONG> implement the protocol
|
|
it needs to shut down each direction of the communication independently
|
|
(recall that a TCP connection is bi-directional, each half is independent
|
|
of the other). This fact is often overlooked by other servers, but
|
|
is correctly implemented in Apache as of 1.2.
|
|
|
|
<P>When this feature was added to Apache it caused a flurry of
|
|
problems on various versions of Unix because of a shortsightedness.
|
|
The TCP specification does not state that the FIN_WAIT_2 state has a
|
|
timeout, but it doesn't prohibit it. On systems without the timeout,
|
|
Apache 1.2 induces many sockets stuck forever in the FIN_WAIT_2 state.
|
|
In many cases this can be avoided by simply upgrading to the latest
|
|
TCP/IP patches supplied by the vendor. In cases where the vendor has
|
|
never released patches (<EM>e.g.</EM>, SunOS4 -- although folks with a source
|
|
license can patch it themselves) we have decided to disable this feature.
|
|
|
|
<P>There are two ways of accomplishing this. One is the
|
|
socket option <CODE>SO_LINGER</CODE>. But as fate would have it,
|
|
this has never been implemented properly in most TCP/IP stacks. Even
|
|
on those stacks with a proper implementation (<EM>e.g.</EM>, Linux 2.0.31) this
|
|
method proves to be more expensive (cputime) than the next solution.
|
|
|
|
<P>For the most part, Apache implements this in a function called
|
|
<CODE>lingering_close</CODE> (in <CODE>http_main.c</CODE>). The
|
|
function looks roughly like this:
|
|
|
|
<BLOCKQUOTE><PRE>
    void lingering_close (int s)
    {
        char junk_buffer[2048];

        /* shutdown the sending side */
        shutdown (s, 1);

        signal (SIGALRM, lingering_death);
        alarm (30);

        for (;;) {
            select (s for reading, 2 second timeout);
            if (error) break;
            if (s is ready for reading) {
                if (read (s, junk_buffer, sizeof (junk_buffer)) &lt;= 0) {
                    break;
                }
                /* just toss away whatever is here */
            }
        }

        close (s);
    }
</PRE></BLOCKQUOTE>
|
|
|
|
This naturally adds some expense at the end of a connection, but it
|
|
is required for a reliable implementation. As HTTP/1.1 becomes more
|
|
prevalent, and all connections are persistent, this expense will be
|
|
amortized over more requests. If you want to play with fire and
|
|
disable this feature you can define <CODE>NO_LINGCLOSE</CODE>, but
|
|
this is not recommended at all. In particular, as HTTP/1.1 pipelined
|
|
persistent connections come into use <CODE>lingering_close</CODE>
|
|
is an absolute necessity (and
|
|
<A HREF="http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html">
|
|
pipelined connections are faster</A>, so you
|
|
want to support them).
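<P>The outline of <CODE>lingering_close</CODE> shown earlier can be turned
into compilable code along the following lines. This is a sketch under
stated assumptions, not the function from <CODE>http_main.c</CODE>: it
replaces the <CODE>SIGALRM</CODE> timer with a wall-clock deadline, it
gives up after the first quiet two-second period instead of polling again,
and the function name is hypothetical.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define LINGER_TOTAL_SECS 30   /* overall limit on the drain loop */
#define LINGER_POLL_SECS   2   /* timeout for each select call */

/* Half-close s, drain whatever the peer still sends, then close it. */
static int lingering_close_sketch(int s)
{
    char junk_buffer[2048];
    time_t deadline = time(NULL) + LINGER_TOTAL_SECS;

    shutdown(s, SHUT_WR);               /* shut down the sending side */

    while (time(NULL) < deadline) {
        fd_set readable;
        struct timeval tv;
        int rc;

        FD_ZERO(&readable);
        FD_SET(s, &readable);
        tv.tv_sec = LINGER_POLL_SECS;
        tv.tv_usec = 0;
        rc = select(s + 1, &readable, NULL, NULL, &tv);
        if (rc <= 0)
            break;                      /* error, or nothing arrived in time */
        if (read(s, junk_buffer, sizeof(junk_buffer)) <= 0)
            break;                      /* peer closed, or read failed */
        /* otherwise toss the data away and keep draining */
    }
    return close(s);
}
```

<P>The point of the drain loop is that the server keeps reading until the
client has seen the end of the response and closed its side, so data still
in flight from the client cannot trigger a reset that destroys the
response.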
|
|
|
|
<H4>Scoreboard File</H4>
|
|
|
|
<P>Apache's parent and children communicate with each other through
|
|
something called the scoreboard. Ideally this should be implemented
|
|
in shared memory. For those operating systems that we either have
|
|
access to, or have been given detailed ports for, it typically is
|
|
implemented using shared memory. The rest default to using an
|
|
on-disk file. The on-disk file is not only slow, but it is unreliable
|
|
(and less featured). Peruse the <CODE>src/main/conf.h</CODE> file
|
|
for your architecture and look for either <CODE>USE_MMAP_SCOREBOARD</CODE> or
|
|
<CODE>USE_SHMGET_SCOREBOARD</CODE>. Defining one of those two (as
|
|
well as their companions <CODE>HAVE_MMAP</CODE> and <CODE>HAVE_SHMGET</CODE>
|
|
respectively) enables the supplied shared memory code. If your system has
|
|
another type of shared memory, edit the file <CODE>src/main/http_main.c</CODE>
|
|
and add the hooks necessary to use it in Apache. (Send us back a patch
|
|
too please.)
|
|
|
|
<P>Historical note: The Linux port of Apache didn't start to use
|
|
shared memory until version 1.2 of Apache. This oversight resulted
|
|
in really poor and unreliable behaviour of earlier versions of Apache
|
|
on Linux.
|
|
|
|
<H4><CODE>DYNAMIC_MODULE_LIMIT</CODE></H4>
|
|
|
|
<P>If you have no intention of using dynamically loaded modules
|
|
(you probably don't if you're reading this and tuning your
|
|
server for every last ounce of performance) then you should add
|
|
<CODE>-DDYNAMIC_MODULE_LIMIT=0</CODE> when building your server.
|
|
This will save RAM that's allocated only for supporting dynamically
|
|
loaded modules.
|
|
|
|
<hr>
|
|
|
|
<H3><a name="trace">Appendix: Detailed Analysis of a Trace</a></H3>
|
|
|
|
Here is a system call trace of Apache 1.3 running on Linux. The run-time
|
|
configuration file is essentially the default plus:
|
|
|
|
<BLOCKQUOTE><PRE>
&lt;Directory /&gt;
    AllowOverride none
    Options FollowSymLinks
&lt;/Directory&gt;
</PRE></BLOCKQUOTE>
|
|
|
|
The file being requested is a static 6K file of no particular content.
|
|
Traces of non-static requests or requests with content negotiation
|
|
look wildly different (and quite ugly in some cases). First the
|
|
entire trace, then we'll examine details. (This was generated by
|
|
the <CODE>strace</CODE> program; other similar programs include
|
|
<CODE>truss</CODE>, <CODE>ktrace</CODE>, and <CODE>par</CODE>.)
|
|
|
|
<BLOCKQUOTE><PRE>
accept(15, {sin_family=AF_INET, sin_port=htons(22283), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
flock(18, LOCK_UN) = 0
sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
setsockopt(3, IPPROTO_TCP1, [1], 4) = 0
read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
time(NULL) = 873959960
gettimeofday({873959960, 404935}, NULL) = 0
stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
close(4) = 0
time(NULL) = 873959960
write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
gettimeofday({873959960, 417742}, NULL) = 0
times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
shutdown(3, 1 /* send */) = 0
oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
read(3, "", 2048) = 0
close(3) = 0
sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
munmap(0x400ee000, 6144) = 0
flock(18, LOCK_EX) = 0
</PRE></BLOCKQUOTE>
|
|
|
|
<P>Notice the accept serialization:
|
|
|
|
<BLOCKQUOTE><PRE>
flock(18, LOCK_UN) = 0
...
flock(18, LOCK_EX) = 0
</PRE></BLOCKQUOTE>
|
|
|
|
These two calls can be removed by defining
|
|
<CODE>SINGLE_LISTEN_UNSERIALIZED_ACCEPT</CODE> as described earlier.
|
|
|
|
<P>Notice the <CODE>SIGUSR1</CODE> manipulation:
|
|
|
|
<BLOCKQUOTE><PRE>
sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
...
sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
...
sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
</PRE></BLOCKQUOTE>
|
|
|
|
This is caused by the implementation of graceful restarts. When the
|
|
parent receives a <CODE>SIGUSR1</CODE> it sends a <CODE>SIGUSR1</CODE>
|
|
to all of its children (and it also increments a "generation counter"
|
|
in shared memory). Any children that are idle (between connections)
|
|
will immediately die
|
|
off when they receive the signal. Any children that are in keep-alive
|
|
connections, but are in between requests will die off immediately. But
|
|
any children that have a connection and are still waiting for the first
|
|
request will not die off immediately.
|
|
|
|
<P>To see why this is necessary, consider how a browser reacts to a closed
|
|
connection. If the connection was a keep-alive connection and the request
|
|
being serviced was not the first request then the browser will quietly
|
|
reissue the request on a new connection. It has to do this because the
|
|
server is always free to close a keep-alive connection in between requests
|
|
(<EM>e.g.</EM>, due to a timeout or because of a maximum number of requests).
|
|
But, if the connection is closed before the first response has been
|
|
received the typical browser will display a "document contains no data"
|
|
dialogue (or a broken image icon). This is done on the assumption that
|
|
the server is broken in some way (or maybe too overloaded to respond
|
|
at all). So Apache tries to avoid ever deliberately closing the connection
|
|
before it has sent a single response. This is the cause of those
|
|
<CODE>SIGUSR1</CODE> manipulations.
|
|
|
|
<P>Note that it is theoretically possible to eliminate all three of
|
|
these calls. But in rough tests the gain proved to be almost unnoticeable.
|
|
|
|
<P>In order to implement virtual hosts, Apache needs to know the
|
|
local socket address used to accept the connection:
|
|
|
|
<BLOCKQUOTE><PRE>
getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
</PRE></BLOCKQUOTE>
|
|
|
|
It is possible to eliminate this call in many situations (such as when
|
|
there are no virtual hosts, or when <CODE>Listen</CODE> directives are
|
|
used which do not have wildcard addresses). But no effort has yet been
|
|
made to do these optimizations.
|
|
|
|
<P>Apache turns off the Nagle algorithm:
|
|
|
|
<BLOCKQUOTE><PRE>
setsockopt(3, IPPROTO_TCP1, [1], 4) = 0
</PRE></BLOCKQUOTE>
|
|
|
|
because of problems described in
|
|
<A HREF="http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html">a
|
|
paper by John Heidemann</A>.
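<P>In C, turning off the Nagle algorithm is a single
<CODE>setsockopt</CODE> call with the <CODE>TCP_NODELAY</CODE> option. The
sketch below is a minimal illustration with hypothetical function names;
the second function only reads the option back so that the effect can be
checked.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn off the Nagle algorithm on a TCP socket. Returns 0 on success. */
static int disable_nagle(int sock)
{
    int one = 1;

    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Report whether TCP_NODELAY is set: 1 if set, 0 if not, -1 on error. */
static int nagle_disabled(int sock)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

<P>With Nagle disabled, small writes go out immediately instead of waiting
for outstanding data to be acknowledged, which is why the server should
batch headers and body into as few writes as it can.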
|
|
|
|
<P>Notice the two <CODE>time</CODE> calls:
|
|
|
|
<BLOCKQUOTE><PRE>
time(NULL) = 873959960
...
time(NULL) = 873959960
</PRE></BLOCKQUOTE>
|
|
|
|
One of these occurs at the beginning of the request, and the other occurs
|
|
as a result of writing the log. At least one of these is required to
|
|
properly implement the HTTP protocol. The second occurs because the
|
|
Common Log Format dictates that the log record include a timestamp of the
|
|
end of the request. A custom logging module could eliminate one of the
|
|
calls. Or you can use a method which moves the time into shared memory,
|
|
see the <A HREF="#patches">patches section below</A>.
|
|
|
|
<P>As described earlier, <CODE>ExtendedStatus On</CODE> causes two
<CODE>gettimeofday</CODE> calls and a call to <CODE>times</CODE>:

<BLOCKQUOTE><PRE>
gettimeofday({873959960, 404935}, NULL) = 0
...
gettimeofday({873959960, 417742}, NULL) = 0
times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
</PRE></BLOCKQUOTE>

These can be removed by setting <CODE>ExtendedStatus Off</CODE> (which
is the default).

<P>It might seem odd to call <CODE>stat</CODE>:

<BLOCKQUOTE><PRE>
stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
</PRE></BLOCKQUOTE>

This is part of the algorithm which calculates the
<CODE>PATH_INFO</CODE> for use by CGIs. In fact, if the request had been
for the URI <CODE>/cgi-bin/printenv/foobar</CODE> then there would be
two calls to <CODE>stat</CODE>: the first for
<CODE>/home/dgaudet/ap/apachen/cgi-bin/printenv/foobar</CODE>,
which does not exist, and the second for
<CODE>/home/dgaudet/ap/apachen/cgi-bin/printenv</CODE>, which does exist.
Regardless, at least one <CODE>stat</CODE> call is necessary when
serving static files because the file size and modification times are
used to generate HTTP headers (such as <CODE>Content-Length</CODE>,
<CODE>Last-Modified</CODE>) and implement protocol features (such
as <CODE>If-Modified-Since</CODE>). A somewhat more clever server
could avoid the <CODE>stat</CODE> when serving non-static files;
however, doing so in Apache is very difficult given the modular structure.

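<P>The walk-back described above can be sketched roughly as follows. This is
an illustration under simplifying assumptions, not Apache's actual
implementation: <CODE>split_path_info</CODE> is a hypothetical helper, it
does not inspect <CODE>errno</CODE>, and it never tests the root directory
itself.

```c
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: strip trailing components from `full` until stat()
 * succeeds. The existing prefix is copied into file_out and *path_info is
 * set to point at the remainder inside `full`.
 * Returns the number of stat() calls made, or -1 if no prefix exists
 * or file_out is too small.
 */
int split_path_info(const char *full, char *file_out, size_t out_len,
                    const char **path_info)
{
    struct stat st;
    size_t len = strlen(full);
    int calls = 0;

    if (out_len <= len)
        return -1;
    memcpy(file_out, full, len + 1);

    while (len > 0) {
        calls++;
        if (stat(file_out, &st) == 0) {
            *path_info = full + len;   /* whatever was stripped off */
            return calls;
        }
        /* Drop the last component, then the slash in front of it. */
        while (len > 0 && file_out[len - 1] != '/')
            len--;
        if (len > 0)
            len--;
        file_out[len] = '\0';
    }
    return -1;
}
```

<P>With this sketch, a path such as <CODE>/cgi-bin/printenv/foobar</CODE>
under the document tree costs two <CODE>stat</CODE> calls, while a plain
static file costs one.
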
<P>All static files are served using <CODE>mmap</CODE>:

<BLOCKQUOTE><PRE>
mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
...
munmap(0x400ee000, 6144) = 0
</PRE></BLOCKQUOTE>

On some architectures it's slower to <CODE>mmap</CODE> small
files than it is to simply <CODE>read</CODE> them. The define
<CODE>MMAP_THRESHOLD</CODE> can be set to the minimum
size required before using <CODE>mmap</CODE>. By default
it's set to 0 (except on SunOS4 where experimentation has
shown 8192 to be a better value). Using a tool such as <A
HREF="http://www.bitmover.com/lmbench/">lmbench</A> you
can determine the optimal setting for your environment.

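<P>The idea behind a size threshold can be illustrated with the sketch
below. It is not Apache's code: <CODE>copy_file</CODE> and
<CODE>write_all</CODE> are hypothetical helpers, the
<CODE>threshold</CODE> parameter merely stands in for
<CODE>MMAP_THRESHOLD</CODE>, and interrupted system calls are not retried.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes of buf to fd. Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: copy the first `size` bytes of in_fd to out_fd.
 * Files of at least `threshold` bytes are mapped with mmap(); smaller
 * ones are read into a stack buffer. Returns 0 on success, -1 on error.
 */
int copy_file(int out_fd, int in_fd, size_t size, size_t threshold)
{
    char buf[4096];
    size_t done = 0;

    if (size > 0 && size >= threshold) {
        void *mm = mmap(NULL, size, PROT_READ, MAP_PRIVATE, in_fd, 0);
        int rc;

        if (mm == MAP_FAILED)
            return -1;
        rc = write_all(out_fd, mm, size);
        munmap(mm, size);
        return rc;
    }

    /* Small file: rewind and fall back to plain read(). */
    if (lseek(in_fd, 0, SEEK_SET) < 0)
        return -1;
    while (done < size) {
        size_t want = size - done < sizeof buf ? size - done : sizeof buf;
        ssize_t n = read(in_fd, buf, want);

        if (n <= 0)
            return -1;
        if (write_all(out_fd, buf, (size_t)n) != 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}
```

<P>A threshold of 0 sends every non-empty file through the
<CODE>mmap</CODE> path, which matches the default described above.
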
<P>You may also wish to experiment with <CODE>MMAP_SEGMENT_SIZE</CODE>
(default 32768) which determines the maximum number of bytes that
will be written at a time from mmap()d files. Apache only resets the
client's <CODE>Timeout</CODE> in between write()s, so setting this
to a large value may lock out low bandwidth clients unless you also
increase the <CODE>Timeout</CODE>.

<P>It may even be the case that <CODE>mmap</CODE> isn't
used on your architecture; if so, then defining <CODE>USE_MMAP_FILES</CODE>
and <CODE>HAVE_MMAP</CODE> might work (if it works then report back to us).

<P>Apache does its best to avoid copying bytes around in memory. The
first write of any request typically is turned into a <CODE>writev</CODE>
which combines both the headers and the first hunk of data:

<BLOCKQUOTE><PRE>
writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
</PRE></BLOCKQUOTE>

When doing HTTP/1.1 chunked encoding Apache will generate up to
four-element <CODE>writev</CODE>s. The goal is to push the byte copying
into the kernel, where it typically has to happen anyhow (to assemble
network packets). In testing, various Unixes (BSDI 2.x, Solaris 2.5,
Linux 2.0.31+) properly combine the elements into network packets.
Pre-2.0.31 Linux will not combine, and will create a packet for
each element, so upgrading is a good idea. Defining <CODE>NO_WRITEV</CODE>
will disable this combining, but result in very poor chunked encoding
performance.

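<P>A two-element <CODE>writev</CODE> like the one in the trace could be
issued as in the sketch below. The helper name
<CODE>send_headers_and_body</CODE> is an assumption made for illustration;
it is not a function in Apache.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: hand the header block and the first hunk of body
 * data to the kernel in one system call, without first copying them into
 * a single buffer. Returns the number of bytes written, or -1 on error.
 */
ssize_t send_headers_and_body(int fd, const char *headers, size_t headers_len,
                              const void *body, size_t body_len)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)headers;  /* writev() does not modify it */
    iov[0].iov_len = headers_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = body_len;
    return writev(fd, iov, 2);
}
```

<P>As with <CODE>write</CODE>, the return value may be less than the total
requested, so a real caller has to be prepared to send the remainder.
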
<P>The log write:

<BLOCKQUOTE><PRE>
write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
</PRE></BLOCKQUOTE>

can be deferred by defining <CODE>BUFFERED_LOGS</CODE>. In this case
up to <CODE>PIPE_BUF</CODE> bytes (a POSIX defined constant) of log entries
are buffered before writing. At no time does it split a log entry
across a <CODE>PIPE_BUF</CODE> boundary because those writes may not
be atomic (<EM>i.e.</EM>, entries from multiple children could become
mixed together).
The code does its best to flush this buffer when a child dies.

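<P>The buffering rule can be sketched as follows. This is only an
illustration of the technique, not the code that
<CODE>BUFFERED_LOGS</CODE> enables; <CODE>struct log_buffer</CODE>,
<CODE>log_append</CODE> and <CODE>log_flush</CODE> are hypothetical names,
and the fallback value of 512 is the POSIX minimum for
<CODE>PIPE_BUF</CODE>.

```c
#include <limits.h>   /* PIPE_BUF */
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF 512  /* POSIX minimum, used only if limits.h lacks it */
#endif

struct log_buffer {
    int fd;                /* log file descriptor */
    size_t used;           /* bytes currently buffered */
    char data[PIPE_BUF];
};

/* Write out whatever is buffered. Returns 0 on success, -1 on error. */
int log_flush(struct log_buffer *lb)
{
    if (lb->used > 0) {
        if (write(lb->fd, lb->data, lb->used) < 0)
            return -1;
        lb->used = 0;
    }
    return 0;
}

/* Queue one complete log entry; an entry is never split across writes. */
int log_append(struct log_buffer *lb, const char *entry, size_t len)
{
    /* Flush first if this entry would not fit in the remaining space. */
    if (lb->used + len > PIPE_BUF && log_flush(lb) != 0)
        return -1;
    /* An entry too large to buffer is written on its own. */
    if (len > PIPE_BUF)
        return write(lb->fd, entry, len) < 0 ? -1 : 0;
    memcpy(lb->data + lb->used, entry, len);
    lb->used += len;
    return 0;
}
```

<P>In this sketch a caller would also invoke <CODE>log_flush</CODE> before
the process exits, which mirrors the flush-on-death behaviour mentioned
above.
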
<P>The lingering close code causes four system calls:

<BLOCKQUOTE><PRE>
shutdown(3, 1 /* send */) = 0
oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
read(3, "", 2048) = 0
close(3) = 0
</PRE></BLOCKQUOTE>

which were described earlier.

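<P>The same four-call pattern can be written out as the sketch below. It is
a simplified illustration rather than Apache's implementation:
<CODE>lingering_close</CODE> here is a hypothetical helper, the timeout is
passed in by the caller, and the total time spent draining is not bounded.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: stop sending, then read and discard whatever the
 * client still sends until it closes its side, an error occurs, or
 * timeout_sec passes with no data. Returns the result of close().
 */
int lingering_close(int fd, int timeout_sec)
{
    char junk[2048];

    shutdown(fd, SHUT_WR);   /* the "1" (send) in the trace */

    for (;;) {
        fd_set readable;
        struct timeval tv;

        FD_ZERO(&readable);
        FD_SET(fd, &readable);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        if (select(fd + 1, &readable, NULL, NULL, &tv) <= 0)
            break;           /* timed out or failed */
        if (read(fd, junk, sizeof junk) <= 0)
            break;           /* client closed, or read error */
    }
    return close(fd);
}
```
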
<P>Let's apply some of these optimizations:
<CODE>-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT -DBUFFERED_LOGS</CODE> and
<CODE>ExtendedStatus Off</CODE>. Here's the final trace:

<BLOCKQUOTE><PRE>
accept(15, {sin_family=AF_INET, sin_port=htons(22286), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
sigaction(SIGUSR1, {SIG_IGN}, {0x8058c98, [], SA_INTERRUPT}) = 0
getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
setsockopt(3, IPPROTO_TCP, 1, [1], 4) = 0
read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
time(NULL) = 873961916
stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400e3000
writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
close(4) = 0
time(NULL) = 873961916
shutdown(3, 1 /* send */) = 0
oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
read(3, "", 2048) = 0
close(3) = 0
sigaction(SIGUSR1, {0x8058c98, [], SA_INTERRUPT}, {SIG_IGN}) = 0
munmap(0x400e3000, 6144) = 0
</PRE></BLOCKQUOTE>

That's 19 system calls, of which 4 remain relatively easy to remove,
but don't seem worth the effort.

<H3><A NAME="patches">Appendix: Patches Available</A></H3>

There are
<A HREF="http://www.arctic.org/~dgaudet/apache/1.3/">
several performance patches available for 1.3.</A> Although they may
not apply cleanly to the current version,
it shouldn't be difficult for someone with a little C knowledge to
update them. In particular:

<UL>
<LI>A
<A HREF="http://www.arctic.org/~dgaudet/apache/1.3/shared_time.patch"
>patch</A> to remove all <CODE>time(2)</CODE> system calls.
<LI>A
<A HREF="http://www.arctic.org/~dgaudet/apache/1.3/mod_include_speedups.patch"
>patch</A> to remove various system calls from <CODE>mod_include</CODE>;
these calls are used by few sites but are required for backwards compatibility.
<LI>A
<A HREF="http://www.arctic.org/~dgaudet/apache/1.3/top_fuel.patch"
>patch</A> which integrates the above two plus a few other speedups at the
cost of removing some functionality.
</UL>

<H3><a name="preforking">Appendix: The Pre-Forking Model</a></H3>

<P>Apache (on Unix) is a <EM>pre-forking</EM> model server. The
<EM>parent</EM> process is responsible only for forking <EM>child</EM>
processes; it does not serve any requests or service any network
sockets. The child processes actually process connections; they serve
multiple connections (one at a time) before dying.
The parent spawns new or kills off old
children in response to changes in the load on the server (it does so
by monitoring a scoreboard which the children keep up to date).

<P>This model for servers offers a robustness that other models do
not. In particular, the parent code is very simple, and with a high
degree of confidence the parent will continue to do its job without
error. The children are complex, and when you add in third party
code via modules, you risk segmentation faults and other forms of
corruption. Even should such a thing happen, it only affects one
connection and the server continues serving requests. The parent
quickly replaces the dead child.

<P>Pre-forking is also very portable across dialects of Unix.
Historically this has been an important goal for Apache, and it continues
to remain so.

<P>The pre-forking model comes under criticism for various
performance aspects. Of particular concern are the overhead
of forking a process, the overhead of context switches between
processes, and the memory overhead of having multiple processes.
Furthermore, it does not offer as many opportunities for data-caching
between requests (such as a pool of <CODE>mmapped</CODE> files).
Various other models exist and extensive analysis can be found in the
<A HREF="http://www.cs.wustl.edu/~jxh/research/research.html"> papers
of the JAWS project</A>. In practice all of these costs vary drastically
depending on the operating system.

<P>Apache's core code is already multithread aware, and Apache version
1.3 is multithreaded on NT. There have been at least two other experimental
implementations of threaded Apache, one using the 1.3 code base on DCE,
and one using a custom user-level threads package and the 1.0 code base;
neither is publicly available. There is also an experimental port of
Apache 1.3 to <A HREF="http://www.mozilla.org/docs/refList/refNSPR/">
Netscape's Portable Run Time</A>, which
<A HREF="http://www.arctic.org/~dgaudet/apache/2.0/">is available</A>
(but you're encouraged to join the
<A HREF="http://dev.apache.org/mailing-lists">new-httpd mailing list</A>
if you intend to use it).
Part of our redesign for version 2.0
of Apache will include abstractions of the server model so that we
can continue to support the pre-forking model, and also support various
threaded models.

<!--#include virtual="footer.html" -->
</BODY>
</HTML>