rwasa | 2 Ton Digital

Rapid Web Application Server in Assembler is our full-featured, high-performance web server designed to compete with the likes of nginx. In addition to doing all of the common things you'd expect a modern webserver to do, we also include assembly language function hooks ready-made to facilitate rapid web application server development.

Most webserver software developers start with their webserver, and then write benchmarking and performance testing tools for it like our webslap utility. We did it quite the other way around, and built rwasa only after we discovered several curiosities with existing and popular webservers. As with all of our products, rwasa is bundled with the HeavyThing library itself. NOTE: compiling from source is not required; the compiled binary is included with the library. Feature highlights:

  • Opensource/GPLv3
  • TLS 2048 bit 1.6X performance increase over OpenSSL-based webservers
  • TLS 4096 bit 2.0X performance increase over OpenSSL-based webservers
  • TLS auto-blacklisting for anti-tampering
  • OCSP Stapling by default
  • Randomized Diffie-Hellman safe prime pool
  • Multi-process lockless TLS session resumption cache
  • TLS session cache is encrypted by default
  • Faster dynamic content compression
  • Large-scale FastCGI safely via unix sockets (without hitting EAGAIN)
  • Simple command-line arguments cover all common configurations
  • Small footprint, no external dependencies
  • Server-side BREACH mitigation (randomized headers, see notes)
  • HSTS enabled by default
  • UPDATED: Backpath (aka upstream) support

Armed with a sufficiently fast epoll implementation, the actual code requirements for a proper webserver implementation are not high. This can be evidenced by comparing what we'll deem "simple" web serving; no encryption, no on-the-fly compression, etc. While we can't say all simple webservers are equal, we can say that when we compared rwasa with nginx and lighttpd doing simple tasks, they all readily hit [insane] interface-speed bottlenecks.

On the other hand, when we add in strong encryption and dynamic content compression, things get far more interesting. This has little to do with the webserver layer itself for any of the three webservers, and more to do with the actual encryption and compression libraries that they use. Considering that nginx, lighttpd and a whole host of other server applications rely on OpenSSL and the stock-standard zlib libraries, this means that those performance areas are not directly in the webserver developers' hands.

Since our HeavyThing library itself has hand-written assembler to accomplish all of TLS and gzip, and since our implementation is indeed faster than either of the stock-standard reference libraries, we ended up with a considerably higher performing webserver than not only nginx and lighttpd, but all that we could get our hands on.

This isn't to say that rwasa should be used in every imaginable web environment as-is. It is specifically tuned as part of the HeavyThing library to serve up the most common large scale web application environments and do so securely and faster than anything else available. This means dynamic content via either assembler itself, or via FastCGI, along with the whole gamut of normal web site assets (HTML, CSS, JS, images, media). If however you are serving up nothing but thousands of gigantic static files, the way rwasa is presently tuned will not suit you, and you are far better off sticking to nginx or contacting us for some custom tuning specific to your environment.

Even if you aren't running a large scale operation, you should still care about how fast your TLS implementation is when using strong encryption parameters. This is specifically because, even if rwasa isn't doing a huge amount of work, the time in which it does it should be important to you. Specifically, for new TLS session establishment, the time in milliseconds it takes rwasa to perform the necessary encryption tasks will be considerably lower than in other webserver environments. This affects your end-user experience insofar as the time delays associated with employing strong cryptography.

Please forgive us for the giant wall of content presented here, but we prefer one scrollable, navigable and connected page rather than many disjointed ones.

UPDATE Feb 9th, 2015: Due to all of the amazing feedback we have received since the initial release, several community members have provided invaluable insight into tests that we never thought to do. Today's release of 1.04 marks just over a 20% speed improvement over our earlier versions. In the coming week we'll be updating the included tests and adding others to reflect these changes. Thanks to all who have provided feedback, keep it coming! What this means of course is that the baseline charts below are inaccurate and based on older versions of rwasa. They'll look even better here shortly.

While we could present all manner of performance metrics, the simplest forms provide the most insight. With so many ways to do contrived and thus meaningless performance tests, it would certainly be nice if there were a standardized test suite for normal web operations, TLS, etc. The performance tests that follow are specific to TLS because we feel it is the single most important metric. As mentioned above, "simple" web serving typically hits bandwidth ceilings long before CPU and/or code efficiency ever becomes a problem.

UPDATE: These tests were performed when the library-wide default setting for dh_bits was still set to 4096, rather than its now default setting of 2048 due to strong feedback that our default security settings were too high. As a result, rerunning these tests as they were initially done would require a 4096 bit rwasa to be compiled.

Test setup

We configured rwasa, nginx and lighttpd to all use the same TLS parameters as well as Diffie-Hellman parameters. This means the same exact RSA keys and certificate chains, and the same cipher lists as specified below. Regarding Diffie-Hellman parameters, since the OpenSSL implementations only allow us to specify a single dhparam file, we created both a 4kbit and a 2kbit dhparam and used them for both nginx and lighttpd. By default, our HeavyThing library uses 4kbit DH, and randomly selects from its pool during normal operations. For the 2kbit test, we specifically recompiled rwasa and modified the dh_bits setting of our HeavyThing library so that it too would make use of a 2kbit DHE exchange rather than its default of 4kbit.
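
For reference, the Diffie-Hellman parameter files for the OpenSSL-based webservers can be generated with OpenSSL. This is only a sketch of the process; dhparam.pem matches the configurations below, while dhparam_2k.pem is simply a hypothetical name for the 2kbit file:

$ openssl dhparam -out dhparam.pem 4096
$ openssl dhparam -out dhparam_2k.pem 2048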

UPDATE: As of version 1.02, after much feedback regarding our previous default of 4096 bits for dh_bits, the library and thus rwasa defaults are now 2048 bits.

Lighttpd does not support OCSP Stapling. While this does not present significant overhead for our tests, both rwasa and nginx are configured to do so. Further, lighttpd DOES support TLS session resumption, despite there being no documentation that we can find on the subject. We won't speculate as to why this may be, considering how important this feature is to a decent TLS web environment, but it is great that all three support it for our tests below.

Equipment: 1gbps LAN* between the server, an older 24GB Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz running OpenSUSE 13.1, and our client machine, a 32GB Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz running Mac OS X 10.9.5, inside of which runs an 8GB VMware Fusion 5.0.5 virtual machine running Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-43-generic x86_64).

NOTE: Our test setup certainly does not reflect an actual production environment, and our effective 1gbps link isn't really 1gbps. That being said, the purpose of these performance tests is still met due to the fact that all tests except the last are CPU-only, and the last is there to measure link saturation. Since our setup actually provides a sub-standard LAN environment, this is actually a good thing for our saturation test. Be advised in any case that our link speed test is not an actual 1gbps wide-open link. In fact, our test setup provides us with about 80% of that, which does not affect the point we are making.

Except for the final link speed test (which we had to exclude lighttpd from anyway), both rwasa and nginx are configured to only use 1 worker process. Since we are only benchmarking TLS performance, this works perfectly well and provides meaningful insights into multi-process scalability when dealing with more grunt than lighttpd could provide with its single process model.

nginx 1.7.9 configuration:


worker_processes  1;

http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;

    keepalive_timeout  65;

    gzip  on;
    gzip_comp_level 6;
    gzip_types text/plain text/xml text/css application/x-javascript text/html application/javascript image/svg+xml;

server {
        listen       443;
        server_name  2ton.com.au;

        ssl                  on;
        # Our key is 4096 bits
        ssl_certificate      2ton.crt;
        ssl_certificate_key  2ton.key;

        ssl_session_timeout  5m;
        ssl_session_cache shared:SSL:60m;

        # Our DH parameters are also 4096 bits
        ssl_dhparam dhparam.pem;

        ssl_protocols  TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers  DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-DSS-AES256-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK;
        ssl_prefer_server_ciphers   on;

        add_header Strict-Transport-Security 'max-age=31536000; includeSubDomains';

        ssl_stapling on;
        ssl_stapling_verify off;
        ssl_trusted_certificate gd_bundle-g2-g1.crt;

        resolver 10.0.0.1;

        location / {
            root   html;
            index  index.html index.htm;
        }
        # pass the PHP scripts to FastCGI server listening on unix:/dev/shm/php.sock:
        #
        location ~ \.php$ {
            root           html;
            fastcgi_pass   unix:/dev/shm/php.sock;
            fastcgi_index  index.php;
            fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
            include        fastcgi_params;
        }
}

}

Lighttpd 1.4.35 configuration (only the relevant mods are listed):


var.server_root = "/usr/local/nginx"
server.port = 4002
server.username = "nobody"
server.groupname = "nobody"
server.document-root = server_root + "/html"
ssl.engine = "enable"
ssl.pemfile = "2ton.pem"
ssl.dh-file = "/usr/local/nginx/dhparam.pem"
ssl.ca-file = "gd_bundle-g2-g1.crt"
ssl.cipher-list = "DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-DSS-AES256-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK"
ssl.honor-cipher-order = "enable"

rwasa configuration:

# ./rwasa -cpu 1 -runas nobody -tls 2ton.pem -bind 4001 -logpath logs -errsyslog -sandbox /usr/local/nginx/html -foreground

4kbit New TLS DHE sessions/s

This test determines the peak number of new TLS DHE sessions per second for each of our three webservers. We use our 2 Ton Digital 4096 bit key, and we are using 4096 bit DH parameters. The file we are getting contains only the text "Hello World" (a sketch for creating it follows the commands below). For the below charts, webslap was used 6 times as follows to produce the results:

# nginx concurrency 64
$ ./webslap -cpu 1 -n 5000 -c 64 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33/hello_world.txt
# nginx concurrency 8
$ ./webslap -cpu 1 -n 5000 -c 8 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33/hello_world.txt
# lighttpd concurrency 64
$ ./webslap -cpu 1 -n 5000 -c 64 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4002/hello_world.txt
# lighttpd concurrency 8
$ ./webslap -cpu 1 -n 5000 -c 8 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4002/hello_world.txt
# rwasa concurrency 64
$ ./webslap -cpu 1 -n 5000 -c 64 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4001/hello_world.txt
# rwasa concurrency 8
$ ./webslap -cpu 1 -n 5000 -c 8 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4001/hello_world.txt
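
(The "Hello World" test file itself is trivial to create; assuming the document root from the configurations above, something along these lines would do, though the exact path is our assumption:)

$ echo "Hello World" > /usr/local/nginx/html/hello_world.txt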

4kbit New TLS DHE sessions

This test runs each single-core webserver at 100% CPU and of course doesn't use much bandwidth. Since each and every connection requires a full TLS DHE handshake, the results clearly indicate how many 4096 bit TLS DHE handshakes a single core of each is capable of doing per second. The first four charts show that the sustained 100% CPU rate per second remains the same regardless of whether 64 or 8 are happening concurrently. The last two charts reflect the backlog effect of running at 100% CPU, and are also a good indication of what CPU starvation looks like in a strong crypto environment (read: CPU-based denial of service attacks).

The nearly identical results for nginx and lighttpd highlight that both use the same crypto library: OpenSSL. Our HeavyThing crypto implementation for modular arithmetic is clearly on display here, and the speed differences between the two are substantial.

2kbit New TLS DHE sessions/s

Same exact test as the first test, but we use a different 2048 bit RSA key and certificate chain, along with specific use of 2048 bit DH parameters for all three webservers. For the below charts, webslap was used 6 times as follows to produce the results:

# nginx concurrency 64
$ ./webslap -cpu 1 -n 5000 -c 64 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33/hello_world.txt
# nginx concurrency 8
$ ./webslap -cpu 1 -n 5000 -c 8 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33/hello_world.txt
# lighttpd concurrency 64
$ ./webslap -cpu 1 -n 5000 -c 64 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4002/hello_world.txt
# lighttpd concurrency 8
$ ./webslap -cpu 1 -n 5000 -c 8 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4002/hello_world.txt
# rwasa concurrency 64
$ ./webslap -cpu 1 -n 5000 -c 64 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4001/hello_world.txt
# rwasa concurrency 8
$ ./webslap -cpu 1 -n 5000 -c 8 -notlsresume -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4001/hello_world.txt

2kbit New TLS DHE sessions

Same story as our first test: we ran each single-core webserver at 100% CPU, only this time with 2048 bit Diffie-Hellman parameters and 2048 bit RSA keys. Since our decision to use 4096 bit DH and keys for 2 Ton Digital isn't really that common, this test is directly relevant to the most common key sizes. Perhaps we should have put this test first, but we like 4096 bit keys.

Here we can see that 2048 bit RSA+DHE for OpenSSL performs ~7.5X faster than 4096 bits. This was somewhat of a surprise to us, after having read things like this that roughly state the effort required when you double RSA and Diffie-Hellman lengths to be around 4X (quadruple the amount of work). Having done our own modular arithmetic implementation, we know that a "perfect world" it is not. Our HeavyThing library is 5.4X, still not at the 4X mark, but closer than OpenSSL.

With this test and the first, what we are really looking at is modular exponentiation speeds. This is because both DHE and RSA require modular arithmetic, and it is here that the bulk of the work is required (AES by itself, as we'll see in the next tests, is insignificant in comparison). We reiterate that both tests so far have nothing to do with the webserver software being used, and everything to do with the underlying modular arithmetic. Obviously we could have removed both of the OpenSSL webservers from our charts and replaced them with OpenSSL itself, but it appears that this is a little-known fact when researching webserver vendors for performance data on TLS. As we'll see in the following tests however, webserver software begins to matter much more when the work being done is not entirely modular arithmetic.

4kbit Resumed TLS DHE sess/s

Single core tests as noted above in the Test setup, and we are back to using our 2 Ton Digital 4096 bit key and 4096 bit DH parameters for all. In this test, however, we make specific use of TLS session resumption while still forcing -nokeepalive. We dramatically increased the number of requests for these so that they really do highlight the TLS resume speeds. As noted above in the Test setup, lighttpd DOES support TLS session resumption. Due to the high individual connection counts, we went ahead and set /proc/sys/net/ipv4/tcp_tw_reuse and /proc/sys/net/ipv4/tcp_tw_recycle both to 1 to keep the kernels from messing with our results unnecessarily (it should go without saying that those two settings are NOT for production environments; a sketch of applying them follows the commands below). For the below charts, webslap was used 3 times as follows to produce the results:

# nginx concurrency 256
$ ./webslap -cpu 6 -n 500000 -c 256 -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33/hello_world.txt
# lighttpd concurrency 256
$ ./webslap -cpu 6 -n 500000 -c 256 -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4002/hello_world.txt
# rwasa concurrency 256
$ ./webslap -cpu 6 -n 500000 -c 256 -nokeepalive -noetag -nolastmodified -noui https://10.0.0.33:4001/hello_world.txt
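
The kernel settings mentioned above were applied only for the duration of the test; one way to do so (again, NOT for production environments):

# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
# echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle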

4kbit Resumed TLS DHE sessions

Again we ran each single-core webserver at 100% CPU. Because we did not make use of keepalive, each request was on its own TCP connection, and all but the first 256 were TLS resumed (and for those that aren't aware, this means no heavy modular exponentiations like we saw in both of the first tests). We were quite surprised by these results, so much so that we ran them several times. Specifically, nginx's TLS resume speed was so much slower than lighttpd's. We have not experienced unusual deficiencies using multiple worker threads with nginx, so perhaps it is because our tests are restricted to a single worker and we are operating them at 100% CPU.

The latencies presented here are skewed by the very large backlog during the start of the tests, but due to the enormous number of requests we performed these results are meaningful. As you can plainly see, TLS session resumption is an absolute requirement in any webserver environment.

Throughput CPU analysis

To highlight the fact that both rwasa and OpenSSL make use of AESNI hardware instructions, we perform this test with keepalive and TLS session resumption enabled for all three, while keeping to a single core for each. We note this test was enough to keep all three webserver processes at 100% CPU, while not quite at link saturation. To crank up our throughput we replace our "Hello World" from the previous tests with 128KB of random generated data from dd if=/dev/urandom of=rand_128kb.bin bs=1024 count=128. In short, these tests highlight each server's AES256/HMAC speed at "full stick." Were they not running on AESNI hardware of course, these results would look very different. For the below charts, webslap was used 3 times as follows to produce the results (which took a fair while):

# nginx concurrency 64
$ ./webslap -cpu 4 -n 500000 -c 64 -noetag -nolastmodified -noui https://10.0.0.33/rand_128kb.bin
# lighttpd concurrency 64
$ ./webslap -cpu 4 -n 500000 -c 64 -noetag -nolastmodified -noui https://10.0.0.33:4002/rand_128kb.bin
# rwasa concurrency 64
$ ./webslap -cpu 4 -n 500000 -c 64 -noetag -nolastmodified -noui https://10.0.0.33:4001/rand_128kb.bin

Throughput CPU analysis

Since only the first 64 connections resulted in a full TLS handshake with the rest all being done keepalive style, what we are looking at here is raw TLS AES256/HMAC speeds. For AESNI hardware, both OpenSSL and rwasa are quite similar.

Link-speed CPU analysis

This test is basically the same as the previous test, but we increased the worker thread count for both nginx and rwasa to 2. We had to exclude lighttpd from this test simply because it doesn't support scaling upward from its single core model. What this test highlights is how each server performed when saturating our test 1gbps link speed (see earlier note re: our LAN), the effective saturation rate, and how much CPU was used for each during the test. For the below charts, webslap was used twice as follows to produce the results (which again took a while):

# nginx concurrency 64
$ ./webslap -cpu 4 -n 500000 -c 64 -noetag -nolastmodified -noui https://10.0.0.33/rand_128kb.bin
# rwasa concurrency 64
$ ./webslap -cpu 4 -n 500000 -c 64 -noetag -nolastmodified -noui https://10.0.0.33:4001/rand_128kb.bin

Link-speed CPU analysis

Again we see both sides of the crypto being quite similar. The reason we included this test, however, was to highlight the difference in required CPU time under multi-process link saturation, which took a sharp turn from the prior CPU test.

Test Conclusions

There are many more rwasa features that we could have highlighted here where rwasa would have similarly stood out. Considering that TLS represents the most complexity and computational difficulty for any webserver, focussing our test efforts here highlights the core components of our HeavyThing library more than rwasa itself. We hope that our tests provide sufficient evidence for you to do your own tests and come to similar conclusions as our own.

Every reasonable effort has been undertaken to ensure rwasa is bug-free. We have been running it on the wild-wild interwebs for a decent while in various configurations. Despite our extensive background in these things, we know first-hand that every operation and environment is different. In this way, while rwasa works very well for us, you may encounter problematic situations that we simply have not thought to test. If you do encounter such situations that make rwasa (or you) cry, please contact us and/or the community and let us know so we can fix it up.

This section provides specific details about our HeavyThing library's TLS implementation as it pertains to HTTPS webserving.

BREACH/TIME/etc

Both the BREACH and TIME attacks rely on measuring the size of compressed response bodies. Since rwasa supports dynamic content compression by default, the HeavyThing library's default setting for webserver_breach_mitigation is enabled and set to 48 bytes. When TLS and gzip are active, this setting adds an X-NB header containing between 0 and 48 random bytes, hex-encoded, to each rwasa response. While this doesn't render response sizing attacks completely useless, it makes a would-be attacker's job much more difficult due to the highly variable response lengths.
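
A quick way to see this against a live rwasa install is to request a TLS+gzip response and look at the headers. This is only a sketch assuming curl is available; whether the header appears depends on the response actually being compressed as described above:

$ curl -s -H 'Accept-Encoding: gzip' -D - -o /dev/null https://2ton.com.au/ | grep -i '^x-nb'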

TLS Blacklist

To prevent padding oracle and other timing side-channel attacks, the HeavyThing library employs a TLS blacklist feature. When a decryption error occurs, by default the offending source IP address is added to a blacklist such that no further connections will be accepted for a period of a full 24 hours. Please note that the blacklist is not shared or synchronized between multiple rwasa processes, so in practice at most -cpu offending connections will be accepted per 24 hour period. Even if this setting is disabled, special care is taken such that no timing information is leaked during MAC failures. During the course of normal operations, decrypt errors do not (and should not) occur, so if everyone plays nice with rwasa this setting goes largely unnoticed. It is our opinion that this strategy of automatically blacklisting clients who are tampering is an effective server-side mitigation technique.

Nessus/Comodo Scans

Thanks entirely to our TLS blacklisting feature, if you run a Nessus scan against a TLS-enabled rwasa installation, the scanner will end up blacklisted when it performs its various TLS vulnerability checks (one or more times). Since Hackerguardian (Comodo's PCI Compliance scanner) uses Nessus, normal PCI compliance scans can produce unpredictable results. This is because the Nessus scanner does not seem to appreciate the way our blacklist hangs up on it. The only way to get reliable results from these scans is to disable the HeavyThing tls_blacklist feature entirely (at least for the duration of your scanning periods). It is unfortunate that in order to pass these predictably and without issue we have to disable our anti-tampering blacklist.

NOTE: This also applies to the SSL LABS tests, in that if our TLS blacklist is enabled, the results that come back are not accurate.

SSL LABS A- rating

Obviously, our own 2 Ton Digital website is running rwasa. As we understand, our SSL Labs rating of A- is directly a result of Internet Explorer versions not being able to negotiate Perfect Forward Secrecy with rwasa. Internet Explorer supports DHE, but only with DSA keys. So, to increase our rating, we'd have to replace our RSA key with a DSA key, or go ahead with ECDHE which we specifically opted out of (for the time being). The decision by Microsoft to support DHE-DSS but not DHE-RSA seems quite strange. They went ahead with ECDHE-RSA, but skipped DHE-RSA entirely despite the code requirements being basically the same for both DHE methods.

When the HeavyThing library was built, and up to the time of this writing, the Wikipedia article on EC states "In the wake of the exposure of Dual_EC_DRBG as 'an NSA undercover operation', cryptography experts have also expressed concern over the security of the NIST recommended elliptic curves, suggesting a return to encryption based on non-elliptic-curve groups." This was taken from comments made by the much-respected Bruce Schneier, and it does not appear to us that he (or anyone) has retracted them. This was sufficient grounds for us to specifically exclude elliptic curve methods in our current TLS implementation. If there is sufficient community interest, we may come back around on this position. In any case, we don't feel that PFS with Internet Explorer is worth contravening our position on elliptic curve cryptography to bump our rating up from an A-.

NSS/Firefox and DH >2236 bits

By default, rwasa is configured to use 4096 bit Diffie-Hellman parameters. Old versions of NSS (prior to mid 2012), and therefore Firefox, which uses NSS, have a known maximum DH size of 2236 bits. By changing the HeavyThing library setting for dh_bits, you can reduce this to 2048 bits and recompile if you need to include Perfect Forward Secrecy and support these older Firefox users. We note that the limit put in place mid 2012 was upped to 16384 bits.

UPDATE: As of version 1.02, the library default for dh_bits is now 2048, and as such no interoperability issues exist with older NSS/Firefox with the prepackaged rwasa.

# ./rwasa
Usage: rwasa [options...]
Options are:
    -cpu count                  How many processes to start, defaults to 1
    -runas username             Run as username (defaults to nobody, parses /etc/passwd)
    -foreground                 Run in foreground (defaults to background)
    -new                        Start a new webserver configuration object
    -tls pemfile                Specify TLS PEM for next bind option
    -bind [addr:]port           Add a listener on [addr:]port
    -cachecontrol secs          Set static file cache control (default: 300)
    -filestattime secs          Set static file stat time (default: 120)
    -logpath directory          Specify full pathname where to put logs
    -errlog filename            Specify full filename for error logs
    -errsyslog                  Send errors to syslog
    -fastcgi endswith address   Add fastcgi handler (addr:host or /unixpath)
    -backpath address           Add backpath/upstream (addr:host or /unixpath)
    -vhost directory            Add virtual hosting directory (full path)
    -sandbox directory          Add global sandbox directory (full path)
    -hostsandbox host directory Add hostname sandbox directory (full path)
    -indexfiles list            Index files list (comma separated)
    -redirect url               Redirect all requests to url
    -funcmatch endswith         Function map ends with match (default: .asmcall)

Option: -cpu count

Specifies the number of "worker" processes to fire up, defaults to 1. It is perhaps counterintuitive to think that more is better here. The number you choose should depend on the kind of loads you are running, and not necessarily just a simple CPU core count of your webserver machine. This is due to two main factors: 1) Static content compression is not shared per process, so each independent process maintains its own separate cache of gzipped goods. 2) TLS session cache is broadcast to all child processes with our design. So, just because you have a 64 core piece of webserver hardware does not necessarily mean you should configure 64 CPUs for rwasa (though you certainly CAN, and in some rare cases it may even be prudent to do that or more still).
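
As a purely hypothetical illustration, a TLS-heavy box might be started with a handful of workers rather than one per core (all options shown are documented below):

# ./rwasa -cpu 4 -tls /root/example.pem -bind 443 -sandbox /var/www/html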

Option: -runas username

rwasa is meant to be started as root, and by default switches to the user nobody. Specifying a different user here overrides this behaviour. Note that rwasa parses /etc/passwd to determine the UID and GID of whatever user is specified.

Option: -foreground

By default, rwasa will be very quiet and detach from its controlling terminal without a word. Specifying this option will cause rwasa to display its banner and remain attached to your terminal session.

Option: -new

This option is a configuration "separator" if you will. For single-configuration startups, this option is obviously unnecessary, but specifying it allows you to start over, as it were, with separate and additional rwasa configurations.

Option: -tls pemfile

This option expects as its argument a PEM file that must contain a private key, public key, and any intermediate certificates (in that order). NOTE: this option MUST appear before the -bind option.
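
If your key, certificate and intermediates live in separate files, a PEM in the expected order can be assembled with a simple concatenation. The filenames here are our assumption:

# cat example.key example.crt intermediate.crt > /root/example.pem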

Option: -bind [addr:]port

Simple as it sounds, bind the current configuration to a port with an optional IP address specified. If the bind fails, rwasa will complain and refuse to start.

Option: -cachecontrol secs

For file-based (sandbox/vhost) serving, rwasa automatically adds Cache-Control, Last-Modified, and ETag headers. This setting determines the max-age value. rwasa sets s-maxage to whatever this value is * 3. If this value is set to zero, then rwasa will only send Cache-Control: no-cache.
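
As an illustration of the arithmetic described above (not captured output), starting rwasa with -cachecontrol 300 should result in static file responses carrying something along the lines of:

Cache-Control: max-age=300, s-maxage=900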

Option: -filestattime secs

Also for file-based (sandbox/vhost) serving, rwasa does not constantly stat underlying static files for each and every request, and instead does them periodically. This setting determines how frequently rwasa checks for underlying file modifications. For production systems that don't change a lot, higher is better. The maximum upper limit of this is 900 seconds.

Option: -logpath directory

If specified, this is the directory location where rwasa will dump "normal" webserver access logs. Special care must be taken such that the run-as user has write permissions to this path. rwasa will create files in this directory named access.log.YYYYMMDD. NOTE: log writes are on a 1.5 second interval, so if you are tailing them and it seems "chunky" this is quite by design.

Option: -errlog filename

Similarly, this option specifies a full filename for the error log. Unlike the access logs, only a single error log can be specified. If you are employing FastCGI, stderr from there will also land in this file.

Option: -errsyslog

You can use this option instead of, or in addition to, the prior option; it sends error logs to syslog.

Option: -fastcgi endswith address

This option configures rwasa for FastCGI. The endswith argument is precisely that, e.g. .php would redirect all requests that end in .php to be forwarded to the specified FastCGI handler. Multiples of this option are fine. For the address argument, if this begins with a forward slash, it is assumed that the FastCGI handler is a unix socket, otherwise it will assume it is an IP:port combination. See the section below on Unix FastCGI for details as to why you should be using unix sockets if your FastCGI handler is on the same machine as rwasa.

Option: -backpath address

This option configures rwasa for backpath (aka upstream) handling (think: HAProxy). The address supplied must either be an IPv4 address:port, or a full pathname for a unix socket. Note that the same benefits for FastCGI via unix socket exist for backpaths. Also note that if this option is combined with a sandbox, the sandbox will be tested first, and the backpath will only receive requests for files that do not exist in the locally configured sandbox.

Option: -vhost directory

To provide virtual hosting (many domains on one address of course), this option specifies the directory whereby hostnames exist. For example, if /tmp was passed for this option, and then a request arrives for example.com, rwasa will construct the document root to be /tmp/example.com and proceed with normal processing from there.

Option: -sandbox directory

As opposed to virtual host based serving, this option causes rwasa to ignore the Host header entirely and use the specified sandbox directory as the document root for all requests that arrive on the current configuration.

Option: -hostsandbox host directory

Alternatively, you can specify individual hosts, which is perhaps a security enhancement over the blanket directory-based -vhost option approach, but accomplishes similar things.

Option: -indexfiles list

Just as it sounds, a comma-separated list of index filenames, e.g. index.php,index.html.

Option: -redirect url

A brutish option that overrides all other configuration directives (if they are specified), and does a 302 Redirect for any and all requests that arrive on the current configuration. Must obviously be a fully qualified URL.

Option: -funcmatch endswith

Similar to the -fastcgi directive, this option directs all incoming requests that match the endswith argument to the default assembly language function hook included in rwasa.

Simple

In its simplest form, all rwasa needs is a bind and a sandbox, and this will start rwasa as user nobody:

# ./rwasa -bind 80 -sandbox /var/www/html

Assuming you had started a PHP FastCGI server like: PHP_FCGI_CHILDREN=20 php-cgi -e -b /dev/shm/php.sock, then you could add to the first:

# ./rwasa -bind 80 -sandbox /var/www/html -fastcgi .php /dev/shm/php.sock

Simple with logs

Assuming you created a log directory like: mkdir /var/log/rwasa && chown nobody:nobody /var/log/rwasa, then you could add to that:

# ./rwasa -bind 80 -sandbox /var/www/html -fastcgi .php /dev/shm/php.sock -logpath /var/log/rwasa

Simple with TLS

NOTE: rwasa is not a TLS validation tool, far from it. It is assumed that you already know beforehand your private key, certificate and intermediates are good. So, to replicate our last test, but toss TLS into the mix (noting that rwasa reads the certificates before it changes privilege level):

# ./rwasa -tls /root/example.pem -bind 443 -sandbox /var/www/html -fastcgi .php /dev/shm/php.sock -logpath /var/log/rwasa

Virtual hosts

If we abandon our global sandbox and want to do virtual hosting, say: for i in {1..5}; do mkdir -v /var/www/html/example$i.com; echo "Heya" > /var/www/html/example$i.com/index.html; done, then we could say:

# ./rwasa -bind 80 -vhost /var/www/html -fastcgi .php /dev/shm/php.sock -logpath /var/log/rwasa

Multiple binds

To combine our port 80 example with our TLS example:

# ./rwasa -bind 80 -sandbox /var/www/html -fastcgi .php /dev/shm/php.sock -logpath /var/log/rwasa -new -tls /root/example.pem -bind 443 -sandbox /var/www/html -fastcgi .php /dev/shm/php.sock -logpath /var/log/rwasa

It is assumed that only highly experienced web/system administrators will be using rwasa. By design, rwasa is not verbose about any administrative error reporting. This section aims to provide simple ways to troubleshoot various common situations where rwasa misbehaves as a result of configuration issues.

TLS issues

TLS issues are usually related to PEM file issues. Thanks to lighttpd making use of PEM files in the same manner as rwasa, you may find it helpful to first get lighttpd to be happy with your TLS configuration, and then move to rwasa. This is because of the level of verbosity provided by OpenSSL that we didn't include with rwasa.

OCSP Stapling issues

OCSP Stapling can be difficult to get right. By default, rwasa logs its OCSP handling via syslog, so that it is easy to determine what precisely is going on. It also makes use of /etc/resolv.conf in order to perform DNS queries to OCSP servers. It is assumed that /etc/ssl/certs contains a valid (and current) CA store, noting that not all linux distributions appear to treat this path the same. If, after verifying that your certificate chains are in order and that DNS resolution works correctly, you still don't receive syslog messages re: OCSP, see the next section for identifying the culprit.
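
A couple of quick sanity checks before digging deeper, using the paths mentioned above; the syslog file location is distribution-dependent and only an example:

# cat /etc/resolv.conf
# ls /etc/ssl/certs | wc -l
# grep -i ocsp /var/log/messages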

NOTE: Until RFC 6961 gets adopted everywhere, even though rwasa will go ahead and acquire OCSP responses for multiple certificate chains, it will only send the first one. Insofar as the purpose and intent behind OCSP Stapling, this seems to work well for us. Once multiple certificate stapling is adopted, we'll move to adding both options. The simple short-term solution, as noted by others, is to acquire a certificate that doesn't require intermediates, or deal with only the first one.

All other issues

Due to the lack of verbosity with rwasa for configuration issues, often the easiest way to locate errors is with strace. In an isolated environment, if you start rwasa with only a single cpu and force it to remain in the foreground, then you can use strace -f ./rwasa [your config options] and usually locate configuration problems without much difficulty. When in doubt, make sure your basic configuration/environment works well with other more verbose webservers first if rwasa gives you grief.
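
For instance, a minimal foreground run under strace (the configuration options here are placeholders) makes permission and path problems on your sandbox, log or PEM locations fairly easy to spot:

# strace -f ./rwasa -cpu 1 -foreground -bind 80 -sandbox /var/www/html 2>&1 | grep -E 'EACCES|ENOENT|EADDRINUSE'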

Unix FastCGI

Because most FastCGI/PHP server environments are tightly coupled with their supporting webserver software, they normally operate on the same physical machine. Often, administrators configure their FastCGI listeners on localhost via IPv4 as opposed to AF_UNIX. This is because under high load and/or peak demand scenarios, AF_UNIX will return EAGAIN (rather than EWOULDBLOCK, which all our nonblocking webservers require). Typically, this EAGAIN condition results in the webserver returning a 502 Bad Gateway response to the end-users. If the same server is configured to use IPv4 and localhost however, that same load does not cause an error condition for the webserver (and the connection is queued normally). NOTE: Unless of course you are on a very powerful single-system-image multicore machine, and you run out of localhost ports. If an administrator chose localhost over AF_UNIX for one service, likely the same choice was made for services OTHER than FastCGI.

A deeper investigation as to why AF_UNIX sockets are avoided in most configurations revealed that the maximum number of connect attempts to an AF_UNIX socket is limited by two factors: 1) /proc/sys/net/core/somaxconn, and 2) the listening process' listen() backlog parameter. It is unclear from the documentation whether the call to listen() has any effect on AF_UNIX sockets, but it is definitely so for the /proc/sys/net/core/somaxconn system setting.
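
Checking (and, if appropriate, raising) the first of those two limits is straightforward; the value passed to sysctl is only an example:

# cat /proc/sys/net/core/somaxconn
# sysctl -w net.core.somaxconn=1024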

In a normal high availability webserver environment, any condition that raises a 502 Bad Gateway should really be a bad gateway, and not a full backlog for local connects. While many administrators set /proc/sys/net/core/somaxconn to a very high (improbably so) number and then modify their FastCGI process' arguments to listen(), we did not feel this is the proper solution to the problem.

When rwasa under extreme loads hits this backlog ceiling, the linux kernel returns an EAGAIN condition to us (as is the case with any nonblocking webserver). We carefully considered the language of the EAGAIN return, and decided to include the HeavyThing library setting epoll_unixconnect_forgiving, which is enabled by default. This has the pleasant side-effect that FastCGI calls from rwasa to AF_UNIX sockets will not return 502 Bad Gateway even if the backlog is full. Instead, the HeavyThing library manages its own pending connect queue and waits as it should (and thus does what the documentation suggests for receiving EAGAIN from connect()).

We are not suggesting that /proc/sys/net/core/somaxconn does not play a role; certainly it does and should be set according to your load and operating environment. What we are suggesting is that it need not be set to some insane value, and that rwasa will manage peak demands without returning errors to your user base. Of course, if your FastCGI handler actually does get "stuck", then this behaviour may not be desirable, but high availability FastCGI webserver environments are commonplace.

In addition to being a full-featured webserver and a showcase piece for our HeavyThing library, rwasa has been designed as a template for quickly building web application servers in x86_64 assembly language. To this end, rwasa by default includes a function hook whereby all requests that arrive that end with .asmcall get directed to this function. This can be seen on our own rwasa webserver here. The function itself is named asmcall and lives in rwasa.asm, and returns a simple dynamic response containing the original request URL. See the HeavyThing page for details on recompiling rwasa. For production environments using rwasa as-is, it is recommended that the command line option -funcmatch be utilised to change the default away from .asmcall, though the function provided in rwasa is harmless.

The HeavyThing library's webserver architecture, which is not specific to rwasa, provides a webserver object that in itself is an epoll listener. For every inbound connection, a new webserver object is created. Multiple requests may occur for any given webserver object. This section here is intentionally oversimplified, and you are encouraged to peruse the code itself for a deeper understanding of how the functionality all comes together. We'll start with the function hook code itself, and follow that with more descriptive information after:

	; this is our main function call hook, as defined by _start.hookthemall
	; it is called by the webserver layer with:
	; rdi == webserver object, rsi == request url, rdx == mimelike request object
	; per the webserver layer requirements, we must return one of:
	; null: webserver will respond with a 404 automatically.
	; -1 == webserver will sit there and do absolutely nothing
	; or anything else is a properly formed mimelike response object (including
	; preface line)
	;
	; for our demonstration purposes, we'll construct a simple text/plain return
falign
asmcall:
	prolog	asmcall
	push	rbx r12
	; build a dynamic text reply first up
	mov	rbx, rsi
	call	buffer$new
	mov	rdi, rax
	mov	rsi, .stringpreface
	mov	r12, rax
	call	buffer$append_string
	mov	rdi, rbx
	call	url$tostring
	mov	rbx, rax
	mov	rdi, r12
	mov	rsi, rax
	call	buffer$append_string
	mov	rdi, rbx
	call	heap$free
	mov	rdi, r12
	mov	rsi, .stringreply
	call	buffer$append_string

	; construct our return object
	call	mimelike$new
	; set the http preface
	mov	rbx, rax
	mov	rdi, rax
	mov	rsi, .httppreface
	call	mimelike$setpreface
	; set our content type
	mov	rdi, rbx
	mov	rsi, mimelike$contenttype
	mov	rdx, mimelike$textplain
	call	mimelike$setheader
	; set our body to the UTF8 of our string
	mov	rdi, rbx
	mov	rsi, [r12+buffer_itself_ofs]
	mov	rdx, [r12+buffer_length_ofs]
	call	mimelike$setbody
	; free our working buffer
	mov	rdi, r12
	call	buffer$destroy
	; return our mimelike response
	mov	rax, rbx
	pop	r12 rbx
	epilog
cleartext .stringpreface, 'Welcome to rwasa!',13,10,'URL: '
cleartext .stringreply, 13,10,'This is a native assembler function call hook.',13,10,13,10,'See https://2ton.com.au/rwasa for more information/documentation.',13,10
cleartext .httppreface, 'HTTP/1.1 200 rwasa reporting for duty'

The first thing to point out is that for such a simple example, the HeavyThing library tools we have used are a bit overkill. For demonstration purposes however, this serves as an excellent example. Next we see the comments about what arguments the function receives, and what its possible return values are. Passed in rdi is the client connection webserver object, in rsi is the url object of the request itself, and in rdx is the mimelike object of the request, which includes headers, POST body if present, etc.

The function hook needn't worry about the communications layer, or any of the other required and/or standard HTTP headers, only the basics such that the webserver layer can take over from there. The mimelike object, which provides both MIME and HTTP parsing and composition capabilities, serves as both our request and return values throughout the webserver layer. Depending of course on the request itself, the Content-Type that the function hook returns, and the body length, the webserver layer will automatically gzip the outbound contents, all without any action inside our function call hook.

While this rwasa page isn't intended to be a programming guide or reference to the HeavyThing library itself, we hope it provides a decent introduction to both the method and difficulty level of writing assembly language applications using our HeavyThing library. Perusing the code from the function hook backward through rwasa and the library itself is made much easier with a starting reference point.
