FFW - About hangs, timeouts and latency

When replaying captured network protocol packets, we have like three different kind of error situations:

  • Hangs
  • Timeouts
  • Latency

For further discussion, have a look of the following diagram loosely depicting the IRC protocol:

Client                           Server
  [1] Hello ----------------------->
        <-------------------------- [2] Hi
        
        
  [3] I'm groot ------------------->
                                    <DNS lookup of the client>
        <-------------------------- [4] Ok, hi groot
                                    
        
  [5] Join channel ---------------->
                                    <not so fast! sleep(1)>
         <------------------------- [6] You joined, here is user list
  ...

Problems

Hangs

The server is not reachable anymore. This can happen when:

  • The server crashed (what we actually want), good
  • The server hangs (endless loop), which is okish
  • The server stops accepting connections, which is bad. E.g. via:
    • Rate limiting / throttling
    • Out of sockets / late socket cleanup
    • Anti DoS

The last one (anti DoS) is the most troublesome one, as it heavily restricts our iterations per second, and cannot easily be detected when capturing the traffic. We can only start to see it when we send a lot of connection attempts and commands in a short time-frame.
The server also just appears to hang, but actually it just blocks our connection attempts. This may produce false-positives, as FFW thinks that the fuzzed message produced a crash, as the server is not answering anymore.

Timeouts

When replaying a session, the server suddenly stops answering, e.g. after packet 6 out of 24 of the TCP connection. We are running into a timeout when trying to recv() one of the server answers.
Some examples:

  • IRC: Duplicate username (server caches username for a bit, declines identical nicknames)
  • IRC: K-Lined (because login as root user)
  • IRC: Penalty. For example, each executed IRC commands adds an 1 second penalty time. This prohibits mass-join'ing of channels or other shenanigans which increase the workload of the server considerably.

The effect is that while some packets can be sent without a problem, the server suddenly just terminates the connection because something is wrong. We identify this as recv() timeout. We can only fuzz the first few packets, before the disconnect/timeout (e.g. packets 0-5). If the user does not identify this, we not only fuzz just a small part of the code, but also waste a lot of iterations fuzzing and sending packets which are not handled at all.

Latency

How fast the server is answering to our packets on average (or max), in milliseconds.

  • We assume that the server answers in less than 0.03s (30ms, 30 iterations/s)
  • Increasing the amount of fuzzing processes may overload the cpu, which then in turn increases the latency. This leads to disconnects, and unreliable fuzzing.
  • Some commands (e.g. involving DNS lookups, whois etc.) may just be very slow (in relationship to the other commands). Then it is only one packet of the connection which has a high latency. But FFW detects a timeout, and terminates the connection anyway (the timeout is defined per-connection, not per-packet).
  • IRC: Anti-DOS: Add penalty for each command for the user, usually in complete seconds. We may have a latency of up to 1-5 seconds, depending on the previously sent commands. But, if we increase the timeout to like 5 seconds, we'll just have 1/5 iterations per second.

Handling the problems

Handling hangs

We can always just restart the server when detecting a hang. If the hangs are rare (<0.1%), this is not a problem. One possibility is also just to restart the server periodically, e.g. via "restart_server_after": 100 all 100 iterations.

False positives are handled by the Verify-Phase of FFW, where we check if the identified crashes are really crashing the server, by just sending them again.

If we have code in the server which is actively doing rate limiting / anti DoS, we have to configure it or manually patch the source so it is not doing this anymore. This should be detected by the Test-Phase of FFW, because we send the corpus 30 or 60 times as fast as possible.

At the Fuzzing-Phase, we output information about the amount of hangs. The user should be able to identify itself that something is wrong, if there are a lot of hangs. Similar to AFL.

Handling timeouts/latency

We can specify a timeout in python for sockets. When doing a recv(), after time spent waiting for new bytes specified in timeout, python will throw an exception.
If we are fuzzing, we can safely assume that we generate data that will throw the server off, resetting our connection - and then we are running into the timeout. Therefore the timeout value defines how many iterations per second we can perform: If we have a 100ms timeout, we can perform a maximum of 10 iterations per second. The lower the latency of the server is (the faster it answers), the lower we can make the timeout, increasing our iterations/s.

We should identify in-connection-resets (timeouts) in our input corpuses on Test-Phase of FFW. Replaying the same recorded information several times should identify both problems (completely un-replayable recording as in K-Lined, and also error-after-replay-x like the duplicate username case).

At the Fuzzing-Phase, we also output information about the location of the timeouts (which message is affected). FFW intentionally does not track timeouts which happen after sending the fuzzed message, as this can be expected. Only unexpected timeouts are recorded (basically on recv()'s before we sent the fuzzed message).

Examples for FFW-Test

Inspircd

Plain Inspircd:

root@fuzz:/ffw-examples/inspircd# python2.7 /ffw/ffw.py --config config.py --basedir /ffw  --test
Using port: 6668
Initial test successful - could connect to server.
Testing CorpusData: intercept0.pickle 32 times
Message 5 timeouts: 32
Last request before good answer: 2
00000000:  43 41 50 20 52 45 51 20  3a 6d 75 6c 74 69 2d 70  |CAP REQ :multi-p|
00000010:  72 65 66 69 78 0d 0a                              |refix..|

Last good server answer was: 3
00000000:  3a 70 65 6e 67 75 69 6e  2e 6f 6d 65 67 61 2e 65  |:penguin.omega.e|
00000010:  78 61 6d 70 6c 65 2e 6f  72 67 20 4e 4f 54 49 43  |xample.org NOTIC|
00000020:  45 20 41 75 74 68 20 3a  2a 2a 2a 20 43 6f 75 6c  |E Auth :*** Coul|
00000030:  64 20 6e 6f 74 20 72 65  73 6f 6c 76 65 20 79 6f  |d not resolve yo|
00000040:  75 72 20 68 6f 73 74 6e  61 6d 65 3a 20 44 6f 6d  |ur hostname: Dom|
00000050:  61 69 6e 20 6e 61 6d 65  20 6e 6f 74 20 66 6f 75  |ain name not fou|
00000060:  6e 64 3b 20 75 73 69 6e  67 20 79 6f 75 72 20 49  |nd; using your I|
00000070:  50 20 61 64 64 72 65 73  73 20 28 31 32 37 2e 30  |P address (127.0|
00000080:  2e 30 2e 31 29 20 69 6e  73 74 65 61 64 2e 0d 0a  |.0.1) instead...|

Msg before request: 4
00000000:  43 41 50 20 45 4e 44 0d  0a                       |CAP END..|

 We have timeouts. This means that we closed the connection,
 because the server didnt answer fast enough, or for other
 reasons.
 You can try to increase the timeout to a sane level,
 but this usually means that the server has some kinde of
 processing/ratelimiting/antiddos/resolving stuff to do.
 Patch it.
 in config.py: "recvTimeout": 0.1

Message 5 consistently timed out. Not shown here is the research for the reason and more testing. But the problem is that the server only handles the user commands once every second. If we patch this, the server is lighting fast, and FFW-Test does not show any more errors:

@@ -799,7 +799,7 @@ int InspIRCd::Run()
                 * timing using this event, so we dont have to
                 * time this exactly).
                 */
-               if (TIME.tv_sec != OLDTIME)
+               if (1)
                {

ngircd

Lets test ngircd:

root@fuzz:/ffw-examples/ngircd# python2.7 /ffw/ffw.py --config config.py --basedir /ffw  --test
Using port: 6669
Initial test successful - could connect to server.
Testing CorpusData: intercept0.pickle 32 times
Message 3 timeouts: 30
Last request before good answer: 0
00000000:  43 41 50 20 4c 53 0d 0a  4e 49 43 4b 20 64 6f 62  |CAP LS..NICK dob|
00000010:  69 6e 0d 0a 55 53 45 52  20 64 6f 62 69 6e 20 64  |in..USER dobin d|
00000020:  6f 62 69 6e 20 31 32 37  2e 30 2e 30 2e 31 20 3a  |obin 127.0.0.1 :|
00000030:  64 6f 62 69 6e 0d 0a                              |dobin..|

Last good server answer was: 1
00000000:  3a 69 72 63 2e 65 78 61  6d 70 6c 65 2e 6e 65 74  |:irc.example.net|
00000010:  20 43 41 50 20 2a 20 4c  53 20 3a 6d 75 6c 74 69  | CAP * LS :multi|
00000020:  2d 70 72 65 66 69 78 0d  0a                       |-prefix..|

Msg before request: 2
00000000:  43 41 50 20 52 45 51 20  3a 6d 75 6c 74 69 2d 70  |CAP REQ :multi-p|
00000010:  72 65 66 69 78 0d 0a                              |refix..|

Message 10 timeouts: 2
Last request before good answer: 7
00000000:  50 49 4e 47 20 69 72 63  2e 65 78 61 6d 70 6c 65  |PING irc.example|
00000010:  2e 6e 65 74 0d 0a                                 |.net..|

Last good server answer was: 8
00000000:  6c 65 2e 6e 65 74 20 32  36 35 20 64 6f 62 69 6e  |le.net 265 dobin|
00000010:  20 31 20 31 20 3a 43 75  72 72 65 6e 74 20 6c 6f  | 1 1 :Current lo|
00000020:  63 61 6c 20 75 73 65 72  73 3a 20 31 2c 20 4d 61  |cal users: 1, Ma|
00000030:  78 3a 20 31 0d 0a 3a 69  72 63 2e 65 78 61 6d 70  |x: 1..:irc.examp|
00000040:  6c 65 2e 6e 65 74 20 32  36 36 20 64 6f 62 69 6e  |le.net 266 dobin|
00000050:  20 31 20 31 20 3a 43 75  72 72 65 6e 74 20 67 6c  | 1 1 :Current gl|
00000060:  6f 62 61 6c 20 75 73 65  72 73 3a 20 31 2c 20 4d  |obal users: 1, M|
00000070:  61 78 3a 20 31 0d 0a 3a  69 72 63 2e 65 78 61 6d  |ax: 1..:irc.exam|
00000080:  70 6c 65 2e 6e 65 74 20  32 35 30 20 64 6f 62 69  |ple.net 250 dobi|
00000090:  6e 20 3a 48 69 67 68 65  73 74 20 63 6f 6e 6e 65  |n :Highest conne|
000000a0:  63 74 69 6f 6e 20 63 6f  75 6e 74 3a 20 32 32 20  |ction count: 22 |
000000b0:  28 33 33 20 63 6f 6e 6e  65 63 74 69 6f 6e 73 20  |(33 connections |
000000c0:  72 65 63 65 69 76 65 64  29 0d 0a 3a 69 72 63 2e  |received)..:irc.|
000000d0:  65 78 61 6d 70 6c 65 2e  6e 65 74 20 34 32 32 20  |example.net 422 |
000000e0:  64 6f 62 69 6e 20 3a 4d  4f 54 44 20 66 69 6c 65  |dobin :MOTD file|
000000f0:  20 69 73 20 6d 69 73 73  69 6e 67 0d 0a 3a 64 6f  | is missing..:do|
00000100:  62 69 6e 21 7e 64 6f 62  69 6e 40 31 32 37 2e 30  |bin!~dobin@127.0|
00000110:  2e 30 2e 31 20 4d 4f 44  45 20 64 6f 62 69 6e 20  |.0.1 MODE dobin |
00000120:  3a 2b 69 0d 0a                                    |:+i..|

Msg before request: 9
00000000:  4a 4f 49 4e 20 23 74 65  73 74 0d 0a              |JOIN #test..|

 We have timeouts. This means that we closed the connection,
 because the server didnt answer fast enough, or for other
 reasons.
 You can try to increase the timeout to a sane level,
 but this usually means that the server has some kinde of
 processing/ratelimiting/antiddos/resolving stuff to do.
 Patch it.
 in config.py: "recvTimeout": 0.1

Interestingly, we have 30 timeouts at message 3, and an additional 2 timeouts at message 10. We ignore message 10 for now. After some research and feedback from the IRC channel, I just modified the SetPenalty to an noop:

--- ngircd.orig/src/ngircd/irc-write.c  2018-05-26 14:41:23.038058619 +0000
+++ ngircd/src/ngircd/irc-write.c       2018-05-26 14:49:24.295215261 +0000
@@ -554,7 +554,7 @@
        if (c <= NONE)
                return;

-       Conn_SetPenalty(c, Seconds);
+       //Conn_SetPenalty(c, Seconds);
 } /* IRC_SetPenalty */

Conclusion

The Test-Phase of FFW is important. It is often necessary to patch the target program for efficient fuzzing.

Identifying bottlenecks in fuzzing requires to collect many different metrics, and to handle them correctly.