Octavian Andrei Dragoi, Assignment 1
Abstract:
This report presents the conceptual (abstract) architecture of the Apache web server. It emphasizes the overall structure of the system without going into implementation details and without requiring the reader to know such details in advance. The main purpose is to make the architecture "intellectually tractable" ([Monroe97]).
The conceptual architecture has been inferred from a number of Apache-related documents and from the way the source files are grouped and named.
At a high level, the Apache server architecture is composed of a core that implements the most basic functionality of a web server and a set of standard modules that actually service the phases of handling an HTTP request.
The server core accepts an HTTP request and implicitly invokes the appropriate handlers, sequentially and in the appropriate order, to service the request.
The report shows that the architectural style (in the sense of [Garlan94]) that most closely characterizes the Apache architecture is "implicit invocation", although the notion of event does not exist in the architecture.
The architecture offers great opportunities for extending or changing the Apache functionality by adding or replacing modules.
Keywords:
Apache, conceptual architecture, abstract architecture, web server
Available online at:
http://www.grad.math.uwaterloo.ca/~oadragoi/CS746G/a1/apache_conceptual_arch.html
The report assumes no previous familiarity with the architecture of the Apache web server, so it can serve as introductory reading on the subject.
It should be noted that the architecture described here might not be entirely accurate. It has been inferred from several sources, including the overall structure of the files and the file names, rather than from a previously existing, complete design document.
This may be the place to mention that Apache is written to be drop-in compatible with the NCSA server. This has design consequences related to some configuration commands promoted by the NCSA server, which cannot be naturally implemented in Apache. These commands are supported in a way that does not quite follow the general "philosophy" of the system ([Thau96]); more details are given in the configuration section.
Additional concerns, related to controlling access and authorizing clients, are also the responsibility of the web server. As has been said, the web server might execute programs in response to client requests. It must ensure that this is not a threat to the host system (where the web server runs). In addition, the web server must be capable not only of responding to a high rate of requests, but also of satisfying each request as quickly as possible.
The following are the components of the core:
http_protocol.c
: contains routines that communicate directly with the client (through the socket connection), following the HTTP protocol. All data transfers to the client are done using this component.
http_main.c
: the component that starts up the server and contains the main server loop that waits for and accepts connections. It is also in charge of managing timeouts.
http_request.c
: the component that handles the flow of request processing, dispatching control to the modules in the appropriate order. It is also in charge of error handling.
http_core.c
: the component implementing the most basic functionality, which a comment in a source file describes as being "just 'barely' functional enough to serve documents, though not terribly well". Another comment in a source file illustrates the function of this component very well: "this file could almost be mod_core.c". That is, the component behaves like a module but has to access some globals directly (which is not characteristic of a module).
alloc.c
: [...]
http_config.c
: [...], as well as support for virtual hosts. An important function of http_config is to form the list of modules that will be called to service the different phases of a request.
It is interesting to observe that although the components of the core have rather distinct functionality, there is no simple way to depict the interactions between them. Most of the architectural information lies in the names of the components rather than in the connectors between them.
This is due to the considerable effort made by the designers to move everything that can be expressed as a separate entity into the modules part of the Apache server. What is left in the core are components too interconnected to be written as separate modules.
The following are the phases of handling a request for the Apache server:
Handlers are defined by modules, and a module might specify handlers for one, many or none of the phases of a request. Handlers are the part of the module that is called when the processing of the request enters the phase for which the handler is defined.
The rationale for letting a module define handlers for more than one phase is that a module might internally save data about the request being processed, and when its handlers for a subsequent phase are called they can make use of that data. In theory the module might even save data between different requests (e.g. it might cache some file content for future use).
It should be noted that modules export additional functions, related to configuration and initialization. These are called in the startup phase of the server.
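The idea of a module that defines handlers for more than one phase, and keeps private data between them, can be sketched as follows. This is an illustration in Python rather than Apache's C; the phase names, the `Module` class and the `mod_example` module are hypothetical and do not reproduce the real Apache structures.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical phase names, used only for this illustration.
TRANSLATE, LOG = "translate", "log"


@dataclass
class Module:
    """A module is a named bundle of handlers, keyed by phase."""
    name: str
    handlers: Dict[str, Callable[[dict], None]] = field(default_factory=dict)


def make_example_module() -> Module:
    seen = {}  # private module data, shared by both handlers

    def translate(request: dict) -> None:
        # Remember something about the request for a later phase.
        seen[request["uri"]] = len(request["uri"])

    def log(request: dict) -> None:
        # Reuse what the earlier handler stored.
        request["log_line"] = f'{request["uri"]} ({seen[request["uri"]]} chars)'

    return Module("mod_example", {TRANSLATE: translate, LOG: log})


module = make_example_module()
request = {"uri": "/index.html"}
for phase in (TRANSLATE, LOG):        # the core would drive this loop
    handler = module.handlers.get(phase)
    if handler is not None:           # no handler for this phase: skip it
        handler(request)
```

A module with no handler for a phase is simply skipped when that phase is reached, which matches the "one, many or none" rule described above.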
mod_userdir
: translates user home directories into actual paths
mod_rewrite (Apache 1.2 and up)
: rewrites URLs based on regular expressions; it has additional handlers for fix-ups and for determining the MIME type
mod_auth, mod_auth_anon, mod_auth_db, mod_auth_dbm
: user authentication using text files, FTP-style anonymous access, Berkeley DB files and DBM files, respectively
mod_access
: host-based access control
mod_mime
: determines document types using file extensions
mod_mime_magic
: determines document types using "magic numbers" (e.g. all GIF files start with a certain code)
mod_alias
: replaces aliases with the actual path
mod_env
: fixes up the environment (based on information in configuration files)
mod_speling
: automatically corrects minor typos in URLs
mod_actions
: file type/method-based script execution
mod_asis
: sends the file as it is
mod_autoindex
: sends an automatically generated representation of a directory listing
mod_cgi
: invokes CGI scripts and returns the result
mod_include
: handles server-side includes (documents parsed by the server, which includes certain additional data before handing the document to the client)
mod_dir
: basic directory handling
mod_imap
: handles image-map files
mod_log_*
: various types of logging modules
For some phases only one module (handler in a module) can be called. Such phases are the authorization, the authentication, the return of the actual object to the client, and sometimes the URI to filename translation.
Other phases of servicing a request can have more than one handler called. For example, more than one module can be called to implement the logging part of the request.
In some phases of processing a request, the handlers (in the registered modules) are called in turn until one returns a special code meaning that subsequent registered handlers for the current phase should not be called. An example is the URI to filename translation phase.
Furthermore, a handler might return an error code. In that case the processing of the request stops and an error is returned to the client (i.e. no other handlers are called, from this phase or subsequent phases).
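The calling rules described above (phases where the first accepting handler wins, phases where every handler runs, and errors that end the request) can be sketched as a small dispatch loop. This is a simplified Python illustration; the return codes `OK` and `DECLINED`, the use of an integer as an error code, and all handler names are assumptions made for the example, not the actual Apache interface.

```python
OK, DECLINED = "OK", "DECLINED"      # hypothetical return codes


def run_phase(handlers, request, run_all):
    """Call handlers in order; return OK, DECLINED or an error code."""
    result = DECLINED
    for handler in handlers:
        status = handler(request)
        if status == DECLINED:
            continue                  # this handler did not take part
        if status != OK:
            return status             # error: stop the whole request
        result = OK
        if not run_all:
            break                     # first handler that accepts wins
    return result


def process(phases, request):
    """phases: list of (handlers, run_all) pairs in servicing order."""
    for handlers, run_all in phases:
        status = run_phase(handlers, request, run_all)
        if status not in (OK, DECLINED):
            return status             # no later phase is run
    return OK


calls = []

def alias(req):
    calls.append("alias")
    return DECLINED

def userdir(req):
    calls.append("userdir")
    req["file"] = "/home/u/index.html"
    return OK

def never_reached(req):
    calls.append("never_reached")
    return OK

def log_a(req):
    calls.append("log_a")
    return OK

def log_b(req):
    calls.append("log_b")
    return OK

def deny(req):
    return 403

status = process([([alias, userdir, never_reached], False),
                  ([log_a, log_b], True)], {})
denied = process([([deny], False), ([never_reached], True)], {})
```

In the first call the translation phase stops at `userdir`, while both logging handlers run; in the second call the error code returned by `deny` prevents any later handler from being called.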
As a consequence, Apache uses a different technique, namely persistent server processes. It forks a fixed number of children right from the beginning. The children service incoming requests independently (in different address spaces). Concurrency in the Apache server is pictured in Figure 5.
Alternatively, when Apache is compiled on MS Windows (as opposed to UNIX), a fixed number of threads is started from the beginning to service the incoming requests (probably due to specific characteristics of this operating system).
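The persistent-process technique can be illustrated with a minimal sketch for a UNIX-like system. The `prefork` function and the pipe-based demonstration are hypothetical; a real server child would loop accepting connections on a shared listening socket instead of writing a single byte.

```python
import os


def prefork(num_children, serve):
    """Fork num_children processes up front; each runs serve(index)."""
    pids = []
    for index in range(num_children):
        pid = os.fork()
        if pid == 0:                  # child: runs in its own address space
            try:
                serve(index)
            finally:
                os._exit(0)           # never fall back into the parent's code
        pids.append(pid)
    return pids


# Demonstration: each child reports back through a shared pipe.
read_end, write_end = os.pipe()
pids = prefork(3, lambda i: os.write(write_end, bytes([65 + i])))
for pid in pids:
    os.waitpid(pid, 0)                # wait until every child has finished
os.close(write_end)
data = os.read(read_end, 16)
os.close(read_end)
```

The number of children is fixed before any work arrives, so no process creation cost is paid per request.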
From another point of view, one might ask whether a module is a separate process or can be implemented as a separate process. In Apache a module is not a separate process. However, some modules might fork new children in order to do their job. A ready example is the mod_cgi module, which handles CGI scripts. It must fork a new child to execute the actual CGI script (after proper redirection of the standard input and output for the child process) and wait for it to finish. But this is a characteristic of mod_cgi; many other modules do not need to fork children.
A different kind of module is one that, although it is not a separate process and does not fork children, communicates through IPC mechanisms or sockets with a different process (which might, for instance, be located on a different machine). An example of such a module would be an authorization module that communicates with a server managing user and password information. Even the CGI module might be implemented in this way (i.e. the actual script running as a completely different process, not as a child), which would result in improved security but would incur communication overhead as a penalty.
An interesting concept implemented by Apache is that of Virtual Hosts. The server can respond to more than one name (e.g. www.example and www2.example), each assigned to one of the multiple IP addresses of the machine. The multiple IP addresses can be associated with physical network interfaces or with virtual network interfaces (simulated via logical devices by the operating system). Apache is able to "tell" under which name the host has been referenced and to use different configuration options accordingly (e.g. allowing more access rights to users accessing the host through an interface on the local network, as opposed to users accessing the web server through an interface on the outside-the-company network). Modules also have access to this information.
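A minimal sketch of selecting configuration by the address through which the server was reached is given below. The addresses, option names and the `config_for` function are hypothetical and serve only to illustrate the idea; they are not Apache configuration syntax.

```python
# Options that apply when no virtual host overrides them (hypothetical).
DEFAULT = {"document_root": "/var/www/default", "allow_admin": False}

# Per-address overrides (hypothetical addresses and options).
VIRTUAL_HOSTS = {
    "192.0.2.10": {"document_root": "/var/www/public"},
    "10.0.0.5": {"document_root": "/var/www/intranet", "allow_admin": True},
}


def config_for(local_address):
    """Merge the default options with those of the addressed virtual host."""
    merged = dict(DEFAULT)
    merged.update(VIRTUAL_HOSTS.get(local_address, {}))
    return merged


public = config_for("192.0.2.10")
intranet = config_for("10.0.0.5")
unknown = config_for("203.0.113.1")
```

Here the host reached through the internal address gets wider access rights, while an unlisted address falls back to the defaults.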
To summarize, the Apache "philosophy" related to configuration is: each component takes care of its own configuration and configuration commands. The server core parses the configuration files and dispatches each configuration command to the appropriate module to be interpreted (executed), or interprets (executes) the command itself if it was meant for the core rather than for a module.
To "fix" this, the problematic commands of the NCSA server (e.g. Options) are interpreted by the Apache core, even when they affect modules. The core makes this configuration available to modules in the same way it makes the general configuration information available.
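The dispatching of configuration commands can be sketched as follows. This is a simplified Python illustration: the `parse_config` function, the dictionaries used as command tables, and the one-command-per-line format are assumptions made for the example and do not reproduce the real parser.

```python
def parse_config(text, core_commands, modules):
    """Dispatch each 'Command args...' line to whoever registered Command."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blanks and comments
        name, _, args = line.partition(" ")
        if name in core_commands:             # command meant for the core
            core_commands[name](args.strip())
            continue
        for module in modules:                # otherwise ask the modules
            handler = module.get(name)
            if handler is not None:
                handler(args.strip())
                break
        else:
            raise ValueError(f"unknown configuration command: {name}")


core_state, alias_state = {}, {}

def set_options(args):
    # Interpreted by the core, then made available to modules.
    core_state["options"] = args.split()

def add_alias(args):
    fake, real = args.split()
    alias_state[fake] = real

core = {"Options": set_options}
mod_alias = {"Alias": add_alias}              # command table of one module

parse_config(
    "# sample configuration\n"
    "Options Indexes Includes\n"
    "Alias /icons /usr/share/icons\n",
    core,
    [mod_alias],
)
```

The core never needs to know what `Alias` means; it only needs to know which module registered the command.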
Another key structure is the one the Apache core uses to keep track of the various modules. It is a linked list of module records, each holding all the information related to that module (e.g. handlers, per-module configuration data). The module record is the means by which the core calls the module.
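A linked list of module records might look like the sketch below. The field names, phase names and the `handlers_for` helper are hypothetical; they only illustrate how the core could walk the list to find the handlers for a phase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ModuleRecord:
    """One node of the list: everything the core knows about a module."""
    name: str
    handlers: Dict[str, Callable] = field(default_factory=dict)
    config: dict = field(default_factory=dict)
    next: Optional["ModuleRecord"] = None


def handlers_for(head, phase):
    """Walk the list and collect the handlers registered for one phase."""
    found = []
    record = head
    while record is not None:
        if phase in record.handlers:
            found.append((record.name, record.handlers[phase]))
        record = record.next
    return found


# Build a three-element list, last record first.
log = ModuleRecord("mod_log", {"log": lambda req: None})
mime = ModuleRecord("mod_mime", {"type": lambda req: "text/html"}, next=log)
head = ModuleRecord("mod_alias", {"translate": lambda req: None}, next=mime)

type_handlers = handlers_for(head, "type")
names = [name for name, _ in handlers_for(head, "translate")]
```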
What is characteristic of the resource pool is that all resources are freed at once, when the resource pool is freed, preventing resource leakage. This is particularly important because of the use of persistent processes.
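The "freed all at once" behaviour can be sketched with a small pool class. The `Pool` class and its method names are hypothetical and are not the interface of alloc.c; the sketch only shows the idea of tying the lifetime of many resources to one object.

```python
import io


class Pool:
    """Collects cleanup callbacks and runs them all when destroyed."""

    def __init__(self):
        self._cleanups = []

    def register(self, resource, cleanup):
        """Attach a resource to the pool and hand it back to the caller."""
        self._cleanups.append((resource, cleanup))
        return resource

    def destroy(self):
        # Release in reverse order of acquisition, then forget everything.
        while self._cleanups:
            resource, cleanup = self._cleanups.pop()
            cleanup(resource)


# A per-request pool: nothing allocated for the request outlives it.
pool = Pool()
header = pool.register(io.StringIO("header"), lambda f: f.close())
body = pool.register(io.StringIO("body"), lambda f: f.close())
open_before = not header.closed and not body.closed
pool.destroy()
```

Because a persistent process serves many requests, destroying the pool at the end of each request keeps forgotten resources from accumulating over the life of the process.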
There is, however, something that might be compared with announcing an event, namely the issuing of a sub-request by a module in order to "force" the core to perform some of the steps for a request on the sub-request (i.e. calling handlers sequentially for each servicing phase). However, this is not (conceptually) a proper event, because the issuing module does not announce something to other modules (unknown to it). It is just a means of "forcing" an implicit invocation.
There are other characteristics of event systems (as summarized in [Shaw96]) that do not "fit" the description of the core-modules architecture of Apache. For example, there is no control asynchrony, in the sense that the module issuing the sub-request waits for the sub-request to be completed.
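The synchronous nature of a sub-request can be made visible with a small trace. In this hypothetical Python sketch, `run_phases` stands in for the core and `include` for a handler that issues a sub-request; none of these names come from Apache.

```python
trace = []


def run_phases(request, handlers):
    """Stand-in for the core: call each phase handler in order."""
    for handler in handlers:
        handler(request, handlers)
    return request


def sub_request(uri, handlers):
    # The caller blocks here until the sub-request has passed all phases.
    return run_phases({"uri": uri, "is_sub": True}, handlers)


def translate(request, handlers):
    trace.append(("translate", request["uri"]))


def include(request, handlers):
    trace.append(("include-start", request["uri"]))
    if not request.get("is_sub"):
        sub_request("/footer.html", handlers)
    trace.append(("include-end", request["uri"]))


run_phases({"uri": "/index.html"}, [translate, include])
```

The trace shows the whole sub-request nested between the start and the end of the issuing handler, which is the absence of control asynchrony noted above.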
Also, two phases of the request cannot be handled in parallel (one uses the outcome of the preceding one). Moreover, the module is not a separate process, although it can fork children for some phases, such as running a CGI script.
So although the connectors between modules are implicit invocations and the data flow is a tree, with some restrictions (e.g. some phases cannot have more than one module to handle them, and one phase comes after the other), the architecture does not have the other characteristics of event systems.
It can be argued, however, that there is asynchrony, since different instances of Apache (sub-processes) can handle requests from different HTTP clients at the same time. However, the different instances are independent and do not share information related to the requests being processed.
The way a request is serviced, with phases handled one after the other and the outcome of one phase used (most of the time) by the next phase, has some similarities with the general "pipeline" style (as in [Shaw96]). There is no upstream control (i.e. when the core invokes the handlers for one phase there is no data or control flowing upstream). However, again, there is no asynchrony and, more importantly, the core regains control after each phase (i.e. after the handler has been invoked and its job is done).
Furthermore, some phases do not produce any change in the conceptual data flow. For example, authorization and authentication do not change the request; they can only deny its execution. More significantly, some handlers might be implemented by the same module, and those handlers might exchange information via private data of the module, bypassing the main data flow. To conclude, the pipeline is rather poorly reflected by the module structures, although conceptually the idea exists; implicit invocation therefore seems more appropriate to characterize the general conceptual architectural style.
Furthermore, the ability to load modules dynamically, present in the Apache 1.3 release (no static linking with the server code), makes the task of customizing the server even easier, as there is no need to recompile the entire server. It is only necessary to change some configuration files.
Another feature worth mentioning again here is the capability of modules to define their own configuration commands, which they are implicitly called to execute.
An important part of the Apache web server that cannot be changed just by changing or adding a module is the one that implements the HTTP protocol. On the good side, the protocol is implemented as a separate piece of code (http_protocol.c), and all communication with the client is done through it, so only that part must be changed in order to implement a future version of HTTP. However, there is no well-defined API for it, as there is for modules.
The core is the one that accepts and manages HTTP connections and calls the handlers in modules in the appropriate order to service the current request.
The architectural style can be characterized as implicit invocation made by the server core on handlers implemented by the modules. Concurrency exists only between a number of persistent, identical processes that service incoming HTTP requests on the same port. Modules are not implemented as separate processes, although a module can fork children or cooperate with other independent processes to handle a phase of processing a request.
The functionality of Apache can easily be changed by writing new modules that complement or replace the existing ones. The server is also highly configurable at different levels (virtual host, directory, module), and modules can define their own configuration commands.