mirror of
				https://github.com/apache/httpd.git
				synced 2025-10-31 19:10:37 +03:00 
			
		
		
		
	Submitted by: Rodent of Unusual Size <Ken.Coar@Golux.Com> git-svn-id: https://svn.apache.org/repos/asf/httpd/httpd/trunk@89401 13f79535-47bb-0310-9956-ffa450edef68
		
			
				
	
	
		
			206 lines
		
	
	
		
			9.0 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
			
		
		
	
	
			206 lines
		
	
	
		
			9.0 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
 | |
| <HTML>
 | |
| <HEAD>
 | |
| <TITLE>Apache module mod_unique_id</TITLE>
 | |
| </HEAD>
 | |
| 
 | |
| <!-- Background white, links blue (unvisited), navy (visited), red (active) -->
 | |
| <BODY
 | |
|  BGCOLOR="#FFFFFF"
 | |
|  TEXT="#000000"
 | |
|  LINK="#0000FF"
 | |
|  VLINK="#000080"
 | |
|  ALINK="#FF0000"
 | |
| >
 | |
| <!--#include virtual="header.html" -->
 | |
| <H1 ALIGN="CENTER">Module mod_unique_id</H1>
 | |
| 
 | |
| <p>This module provides an environment variable with a unique identifier
 | |
| for each request.</p>
 | |
| 
 | |
| <P><A
 | |
| HREF="module-dict.html#Status"
 | |
| REL="Help"
 | |
| ><STRONG>Status:</STRONG></A> Extension
 | |
| <BR>
 | |
| <A
 | |
| HREF="module-dict.html#SourceFile"
 | |
| REL="Help"
 | |
| ><STRONG>Source File:</STRONG></A> mod_unique_id.c
 | |
| <BR>
 | |
| <A
 | |
| HREF="module-dict.html#ModuleIdentifier"
 | |
| REL="Help"
 | |
| ><STRONG>Module Identifier:</STRONG></A> unique_id_module
 | |
| <BR>
 | |
| <A
 | |
| HREF="module-dict.html#Compatibility"
 | |
| REL="Help"
 | |
| ><STRONG>Compatibility:</STRONG></A> Available in Apache 1.3 and later.
 | |
| </P>
 | |
| 
 | |
| <h2>Summary</h2>
 | |
| 
 | |
| <p>This module provides a magic token for each request which is guaranteed
 | |
| to be unique across "all" requests under very specific conditions.
 | |
| The unique identifier is even unique across multiple machines in a
 | |
| properly configured cluster of machines.  The environment variable
 | |
| <CODE>UNIQUE_ID</CODE> is set to the identifier for each request.
 | |
| Unique identifiers are useful for various reasons which are beyond the
 | |
| scope of this document.</p>
 | |
| 
 | |
| <h2>Directives</h2>
 | |
| 
 | |
| <p>This module has no directives.</p>
 | |
| 
 | |
| 
 | |
| <H2>Theory</H2>
 | |
| 
 | |
| <P>
 | |
| First a brief recap of how the Apache server works on Unix machines.
 | |
| This feature currently isn't supported on Windows NT.  On Unix machines,
 | |
| Apache creates several children, the children process requests one at
 | |
| a time.  Each child can serve multiple requests in its lifetime.  For the
 | |
| purpose of this discussion, the children don't share any data
 | |
| with each other.  We'll refer to the children as httpd processes.
 | |
| 
 | |
| <P>
 | |
| Your website has one or more machines under your administrative control,
 | |
| together we'll call them a cluster of machines.  Each machine can
 | |
| possibly run multiple instances of Apache.  All of these collectively
 | |
| are considered "the universe", and with certain assumptions we'll
 | |
| show that in this universe we can generate unique identifiers for each
 | |
| request, without extensive communication between machines in the cluster.
 | |
| 
 | |
| <P>
 | |
| The machines in your cluster should satisfy these requirements.
 | |
| (Even if you have only one machine you should synchronize its clock
 | |
| with NTP.)
 | |
| 
 | |
| <UL>
 | |
| <LI>The machines' times are synchronized via NTP or other network time
 | |
|     protocol.
 | |
| 
 | |
| <LI>The machines' hostnames all differ, such that the module can do a
 | |
|     hostname lookup on the hostname and receive a different IP address
 | |
|     for each machine in the cluster.
 | |
| </UL>
 | |
| 
 | |
| <P>
 | |
| As far as operating system assumptions go, we assume that pids (process
 | |
| ids) fit in 32-bits.  If the operating system uses more than 32-bits
 | |
| for a pid, the fix is trivial but must be performed in the code.
 | |
| 
 | |
| <P>
 | |
| Given those assumptions, at a single point in time we can identify
 | |
| any httpd process on any machine in the cluster from all other httpd
 | |
| processes.  The machine's IP address and the pid of the httpd process
 | |
| are sufficient to do this.  So in order to generate unique identifiers
 | |
| for requests we need only distinguish between different points in time.
 | |
| 
 | |
| <P>
 | |
| To distinguish time we will use a Unix timestamp (seconds since January
 | |
| 1, 1970 UTC), and a 16-bit counter.  The timestamp has only one second
 | |
| granularity, so the counter is used to represent up to 65536 values
 | |
| during a single second.  The quadruple <EM>( ip_addr, pid, time_stamp,
 | |
| counter )</EM> is sufficient to enumerate 65536 requests per second per
 | |
| httpd process.  There are issues however with pid reuse over
 | |
| time, and the counter is used to alleviate this issue.
 | |
| 
 | |
| <P>
 | |
| When an httpd child is created, the counter is initialized with (
 | |
| current microseconds divided by 10 ) modulo 65536 (this formula was
 | |
| chosen to eliminate some variance problems with the low order bits of
 | |
| the microsecond timers on some systems).  When a unique identifier is
 | |
| generated, the time stamp used is the time the request arrived at the
 | |
| web server.  The counter is incremented every time an identifier is
 | |
| generated (and allowed to roll over).
 | |
| 
 | |
| <P>
 | |
| The kernel generates a pid for each process as it forks the process, and
 | |
| pids are allowed to roll over (they're 16-bits on many Unixes, but newer
 | |
| systems have expanded to 32-bits).  So over time the same pid will be
 | |
| reused.  However unless it is reused within the same second, it does not
 | |
| destroy the uniqueness of our quadruple.  That is, we assume the system
 | |
| does not spawn 65536 processes in a one second interval (it may even be
 | |
| 32768 processes on some Unixes, but even this isn't likely to happen).
 | |
| 
 | |
| <P>
 | |
| Suppose that time repeats itself for some reason.  That is, suppose that
 | |
| the system's clock is screwed up and it revisits a past time (or it is
 | |
| too far forward, is reset correctly, and then revisits the future time).
 | |
| In this case we can easily show that we can get pid and time stamp reuse.
 | |
| The choice of initializer for the counter is intended to help defeat this.
 | |
| Note that we really want a random number to initialize the counter,
 | |
| but there aren't any readily available numbers on most systems (<EM>i.e.</EM>, you
 | |
| can't use rand() because you need to seed the generator, and can't seed
 | |
| it with the time because time, at least at one second resolution, has
 | |
| repeated itself).  This is not a perfect defense.
 | |
| 
 | |
| <P>
 | |
| How good a defense is it?  Suppose that one of your machines serves
 | |
| at most 500 requests per second (which is a very reasonable upper bound
 | |
| at this writing, because systems generally do more than just shovel out
 | |
| static files).  To do that it will require a number of children which
 | |
| depends on how many concurrent clients you have.  But we'll be pessimistic
 | |
| and suppose that a single child is able to serve 500 requests per second.
 | |
| There are 1000 possible starting counter values such that two sequences
 | |
| of 500 requests overlap.  So there is a 1.5% chance that if time (at one
 | |
| second resolution) repeats itself this child will repeat a counter value,
 | |
| and uniqueness will be broken.  This was a very pessimistic example,
 | |
| and with real world values it's even less likely to occur.  If your
 | |
| system is such that it's still likely to occur, then perhaps you should
 | |
| make the counter 32 bits (by editing the code).
 | |
| 
 | |
| <P>
 | |
| You may be concerned about the clock being "set back" during summer
 | |
| daylight savings.  However this isn't an issue because the times used here
 | |
| are UTC, which "always" go forward.  Note that x86 based Unixes may need
 | |
| proper configuration for this to be true -- they should be configured to
 | |
| assume that the motherboard clock is on UTC and compensate appropriately.
 | |
| But even still, if you're running NTP then your UTC time will be correct
 | |
| very shortly after reboot.
 | |
| 
 | |
| <P>
 | |
| The <CODE>UNIQUE_ID</CODE> environment variable is constructed by
 | |
| encoding the 112-bit (32-bit IP address, 32 bit pid, 32 bit time stamp,
 | |
| 16 bit counter) quadruple using the alphabet <CODE>[A-Za-z0-9@-]</CODE>
 | |
| in a manner similar to MIME base64 encoding, producing 19 characters.
 | |
| The MIME base64 alphabet is actually <CODE>[A-Za-z0-9+/]</CODE> however
 | |
| <CODE>+</CODE> and <CODE>/</CODE> need to be specially encoded in URLs,
 | |
| which makes them less desirable.  All values are encoded in network
 | |
| byte ordering so that the encoding is comparable across architectures of
 | |
| different byte ordering.  The actual ordering of the encoding is: time
 | |
| stamp, IP address, pid, counter.  This ordering has a purpose, but it
 | |
| should be emphasized that applications should not dissect the encoding.
 | |
| Applications should treat the entire encoded <CODE>UNIQUE_ID</CODE> as an
 | |
| opaque token, which can be compared against other <CODE>UNIQUE_ID</CODE>s
 | |
| for equality only.
 | |
| 
 | |
| <P>
 | |
| The ordering was chosen such that it's possible to change the encoding
 | |
| in the future without worrying about collision with an existing database
 | |
| of <CODE>UNIQUE_ID</CODE>s.  The new encodings should also keep the time
 | |
| stamp as the first element, and can otherwise use the same alphabet and
 | |
| bit length.  Since the time stamps are essentially an increasing sequence,
 | |
| it's sufficient to have a <EM>flag second</EM> in which all machines in the
 | |
| cluster stop serving and request, and stop using the old encoding format.
 | |
| Afterwards they can resume requests and begin issuing the new encodings.
 | |
| 
 | |
| <P>
 | |
| This we believe is a relatively portable solution to this problem.  It can
 | |
| be extended to multithreaded systems like Windows NT, and can grow with
 | |
| future needs.  The identifiers generated have essentially an infinite
 | |
| life-time because future identifiers can be made longer as required.
 | |
| Essentially no communication is required between machines in the cluster
 | |
| (only NTP synchronization is required, which is low overhead), and no
 | |
| communication between httpd processes is required (the communication is
 | |
| implicit in the pid value assigned by the kernel).  In very specific
 | |
| situations the identifier can be shortened, but more information needs
 | |
| to be assumed (for example the 32-bit IP address is overkill for any
 | |
| site, but there is no portable shorter replacement for it).
 | |
| 
 | |
| <!--#include virtual="footer.html" -->
 | |
| </BODY>
 | |
| </HTML>
 |