WRITTEN – Page 23 – rule 11 reader

Research: Tail Attacks on Web Applications

When you think of a Distributed Denial of Service (DDoS) attack, you probably think about an attack which overflows the bandwidth available on a single link; or overflowing the number of half open TCP sessions a device can have open at once, preventing the device from accepting more sessions. In all cases, a DoS or DDoS attack will involve a lot of traffic being pushed at a single device, or across a single link.

TL;DR[time-span]

Denial of service attacks do not always require high volumes of traffic
An intelligent attacker can exploit the long tail of service queues deep in a web application to bring the service down
These kinds of attacks would be very difficult to detect

But if you look at an entire system, there are a lot of places where resources are scarce, and hence are places where resources could be consumed in a way that prevents services from operating correctly. Such attacks would not need to be distributed, because they could take much less traffic than is traditionally required to deny a service. These kinds of attacks are called tail attacks, because they attack the long tail of resource pools, where these pools are much thinner, and hence much easier to attack.

There are two probable reasons these kinds of attacks are not often seen in the wild. First, they require an in-depth knowledge of the system under attack. Most of these long tail attacks will take advantage of the interaction surface between two subsystems within the larger system. Each of these interaction surfaces can also be attack surfaces if an attacker can figure out how to access and take advantage of them. Second, these kinds of attacks are difficult to detect, because they do not require large amounts of traffic, or other unusual traffic flows, to launch.

The paper under review today, Tail Attacks on Web Applications, discusses a model for understanding and creating tail attacks in a multi-tier web application—the kind commonly used for any large-scale frontend service, such as ecommerce and social media.

Huasong Shan, Qingyang Wang, and Calton Pu. 2017. Tail Attacks on Web Applications. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS ’17). ACM, New York, NY, USA, 1725-1739. DOI: https://doi.org/10.1145/3133956.3133968

The figure below illustrates a basic service of this kind for those who are not familiar with it.

The typical application at scale will have at least three stages. The first stage will terminate the user’s session and render content; this is normally some form of modified web server. The second stage will gather information from various backend services (generally microservices), and pass the information required to build the page or portal to the rendering engine. The microservices, in turn, build individual parts of the page, and rely on various storage and other services to supply the information needed.

If you can find some way to clog up the queue at one of the storage nodes, you can cause every other service along the information path to wait on the prior service to fulfill its part of the job in hand. This can cause a cascading effect through the system, where a single node struggling because of full queues can cause an entire set of dependent nodes to become effectively unavailable, cascading to a larger set of nodes in the next layer up. For instance, in the network illustrated, if an attacker can somehow cause the queues at storage service 1 to fill up, even for a moment, this can cascade into a backlog of work at services 1 and 2, cascading into a backlog at the front-end service, ultimately slowing—or even shutting—the entire service down. The queues at storage service 1 may be the same size as every other queue in the system (although they are likely smaller, as they face internal, rather than external, services), but storage system 1 may be servicing many hundreds, perhaps thousands, of copies of services 1 and 2.

The queues at storage service 1—and all the other storage services in the system—represent a hidden bottleneck in the overall system. If an attacker can, for a few moments at a time, cause these internal, intra-application queue to fill up, the overall service can be made to slow down to the point of being almost unusable.

How plausible is this kind of attack? The researchers modeled a three-stage system (most production systems have more than three stages) and examined the total queue path through the system. By examining the queue depths at each stage, they devised a way to fill the queues at the first stage in the system by sending millibursts of valid sessions requests to the rend engine, or the use facing piece of the application. Even if these millibursts are spread out across the edge of the application, so long as they are all the same kind of requests, and timed correctly, they can bring the entire system down. In the paper, the researchers go further and show that once you understand the architecture of one such system, it is possible to try different millibursts on a running system, causing the same DoS effect.

This kind of attack, because it is built out of legitimate traffic, and it can be spread across the entire public facing edge of an application, would be nearly impossible to detect or counter at the network edge. One possible counter to this kind of attack would be increasing capacity in the deeper stages of the application. This countermeasure could be expensive, as the data must be stored on a larger number of servers. Further, data synchronized across multiple systems will subject to CAP limitations, which will ultimately limit the speed at which the application can run anyway. Operators could also consider fine grained monitoring, which increases the amount of telemetry that must be recovered from the network and processed—another form of monetary tradeoff.

Posted in RESEARCH, SECURITY, WRITTEN

Why is the Feasibility Condition Less Than?

A reader recently emailed me with this question: Why isn’t the condition for a Feasible Successor set to less than (<), rather than less than of equal (<=), in EIGRP? It certainly seems, as noted in the email, that this rules out a lot of possible possible loop free alternate paths. The network below will be used to illustrate.

First, assume all links are cost of 1 except D->C, which is cost of 2. Here D will choose B as the Successor, and the FC will be set to 2. The RD of C will be 1, so C will be an FS. Now consider two failures. The first failure is D->B. D will immediately reroute to the FS, which is C, without changing the FC. This works, because C’s cost to 100::/64 via D is 4, much higher than it’s cost to 100::64 along C->A. Now consider what happens if A->100::/64 fails. If the timing of the query “works right,” C and B will be notified first, then finally D. Even if D is somehow notified before C, and D switches to C as its FS, the traffic is dropped, rather than looped—so all is happy.

Now change the situation a little. Assume the A->C link is cost of 2, and the remaining links are 1. Now assume you make the FS condition <=, rather than "just" <. From D's perspective, C is still an FS. From C's perspective, D is also an FS. Suppose the B->D link fails. D switches to C, who’s path is intact; this works. Assume the C->A link fails. C switches to D, who’s path is intact; this works.

Finally, assume the A->100::/64 link fails. If the timing is just right, D and C will receive the update with this failure at the same moment, and will both switch to their FS’s. Now you have a loop. How long will this loop last? Until C and D can do a diffusing update — probably around 250ms or less… But if you count the outside computation time, it’s the SIA timer, which is around 10 minutes in more recent versions. Hence, “just the wrong circumstances” can cause up to a 10 minute loop. Not good.

The bottom line is this: any time you have a situation where two routers can end up pointing at one another as their local FS, you have a ring of some number of hops. If the final destination is outside the ring, some member of the ring must be the point at which traffic leaves the ring, and moves towards the destination. If the link connecting the ring to the destination fails, the update carrying the information about the loss of connectivity to the destination must travel around the ring in both directions.

When the update reaches the point at which routing would normally split horizon—the “point at which the waterfall splits,” so-to-speak, it will like reach both sides of the split horizon point at a close enough interview to cause a loop in the forwarding tables. This situation causes microloops in a link state protocol, but microloops are often resolved quickly, and hence tend to be tolerable. In a distance vector protocol, like EIGRP, the length of time the microloop can exist can be much longer—ultimately it depends on the speed at which a distributed computation can take place (because the computation is not local to each node), and, failing that, the amount of time the network can remain in an unstable state before “something is done about it.”

This, by the way, is why I am always opposed to increasing the SIA timer in EIGRP above the “factory defaults.” The SIA timer is essentially the amount of time you are willing to allow your network to remain unconverged in the worst case, and hence either dropping or looping traffic.

Posted in TECH, WRITTEN

OSPF Topology Transparent Zones

Anyone who has worked with OSPF for any length of time has at least heard of areas—but perhaps before diving into Topology Transparent Zones (TTZs), a short review is in order.

In this diagram, routers A and B are in area 0, routers C and D are Area Border Routers (ABRs), and routers E, F, G, H, and K are all in area 1. The ABRs, C and D, do not advertise the existence of E, F, G, H, or K to the routers in area 0, nor the links to or between any of those routers. Any reachable destinations in area 1 are advertised using a em>summary LSA, or a type 3 LSA, towards A and B. From the perspective of A and B, 100::/64 and 101::/64 would be advertised by C and D as directly connected destinations, using the cost from C and D to each of these two destinations, based on a summary LSA.

What if you wanted to place H and K in their own area, with G as an ABR, behind the existing area 1? You cannot do this in OSPF using any form of a standard flooding domain, or area. There is no way to take information about a destination out of a summary LSA and place it in another summary LSA without risking the formation of a routing loop. Once the topology information is lost, which OSPF relies on to prevent the formation of routing loops, there is no way to gain it back.

Why might you want to do this? The primary reasons for stacking multiple areas in this way relate to mobile ad-hoc networks; you cannot control the way routers might be connected together, nor where flooding domain boundaries might be drawn. Assuming you have a requirement, you would be out of luck if the only tool you have is a standard OSPF area.

However, you can stack areas using TTZs. What is a TTZ? It is a flooding domain that does not appear to exist to the routers around it—or rather, connected to either side of the TTZ. In the illustration below, routers C, D, E, F, and G have been configured so they form a TTZ.

Note that when the network is configured this way, routers A, B, H, and K all believe they are in a single flooding domain. From the perspective of these four routers, C, D, and G appear to be directly connected by a set of point-to-point links. Traffic sent from B to 101::/64, for instance, would pass into the TTZ at D, through the TTZ, and back out the other side at G. This traffic would be forwarded normally through the TTZ, as the routers within the TTZ have a full view of the network topology, as well as reachable destinations.

The destination within the TTZ, 100::/64, would be advertised by each TTZ edge router as if it were a directly connected destination. Hence G, C, and D would each advertise 100::/64 as a connected interface in their router LSA, or their type 1 advertisement. This allows routers outside the TTZ to reach destinations within the TTZ, without revealing the internal topology of the TTZ.

Information about the topology within the TTZ is carried in a set of opaque LSAs which are not forwarded outside the TTZ. This allows the TTZ to maintain consistent internal state, without burdening the rest of the network with internal topology information, or topology changes.

TTZ’s are a rather narrowly focused solution; it is not likely you will see these used in an OSPF network near you any time soon. On the other hand, they are an interesting, experimental, addition to OSPF that can be used to solve a set of edge cases.

You can read about TTZs more in RFC8099.

Posted in STANDARDS, TECH, WRITTEN

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. IT just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500’s with SSE’s, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.

Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

EIGRP receives a set of queries to which it must respond
EIGRP allocates memory for each packet to build the responses
Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.

There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some times), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIA’s in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

Posted in DESIGN, TECH, WRITTEN

DFS and Low Points

On a recent history of networking episode, Alia talked a little about Maximally Redundant Trees (MRTs), and the concept of Depth First Search (DFS) numbering, along with the idea of a low point. While low points are quickly explained in my new book in the context of MRTs, I thought it worthwhile to revisit the concept in a blog post. Take a look at the following network:

On the left side is a small network with the nodes (think of these as routers) being labeled from A through G. On the right side is the same network, only each node has been numbered by traversing the graph, starting at A. This process, in a network, would either require some device which knows about every node and edge (link) in the network, or it would require a distributed algorithm that “walks” the network from one node to another, numbering each node as it is touched, and skipping any node that has already been visited (again, for more details on this, please see the book).

Once this numbering has been done, the numbers now produce this interesting property: if you remove the parent of any node, and the node can still reach a number lower than its own number, the network is two-connected. Take E, numbered as 5, as an example. E’s parent is D, labeled as 3 on the numbered side of the illustration. If you remove D from the network, what is the lowest numbered node E can reach? Start by jumping to the lowest numbered neighbor. In this case, E only has one neighbor remaining, C, which is numbered 6. From here, what is the lowest numbered neighbor of C? It is A, with a number of 1.

D, then, can reach a node which is numbered 1 through some other neighbor than its parent. This means D has some other path to the parent than through its parent, which means D is part of a topology with at least two connections to some other node in the network—it is two connected.

Using this sort of calculation, you can find alternate paths in a network. The problem with using DFS numbering for this is what was stated above—the calculation requires either a “walk through the network” protocol, or it requires some device with a complete view of the network (an LSDB, in link state terms). Neither of these are really conducive to real time calculation during a topology change. MRT solves this by using low points from DFS numbering with Dijkstra’s SPF algorithm to allow the calculation of disjoint paths in near real time in a distributed control plane.

Posted in STANDARDS, TECH, WRITTEN

If you haven’t found the tradeoff…

This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:

[time-span]

Just like how every action has an equal and opposite reaction, each “positive” design decision necessarily creates a “negative” compromise. Insofar as designs necessarily create compromises, those compromises are very much intentional. (And in the same vein, unintentional compromises are a sign of bad design.)

In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.

Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.

Cross posted at CircleID

But are either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity—is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs consider here, as it turns out—it just sometimes takes a little out of the box thinking to find them.

What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.

Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.

What about routing protocols? The standards communities seem focused on creating and maintaining a small handulf of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol—because there is enough to know in this one protocol to spend a lifetime learning it.

In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.

The way to prevent landing on either extreme—in a world where every device is capable of anything, but cannot be understood by even the smartest of engineers, and a world where every device is uniquely designed to fit its environment, and no other device will do—is consider the tradeoffs.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.

Posted in DESIGN, WRITTEN

Leave Your Ego at the Door

You are just about to walk into the interview room. Regardless of whether you are being interviewed, or interviewing—what are you thinking about? Are you thinking about winning? Are you thinking about whining? Or are you thinking about engaging? I have noticed, on many mailing lists, and in many other forums, that interviews in our world have devolved into a contest of egos.

The person on the other side of the table has some certification I don’t care about—how can I prove they are dumb, not as smart as their certification might indicate, or… The person on the other side of the table claims to know some protocol, can I find some bit of information they don’t know? These kinds of questions are really just ego questions—and you need to leave them at the door. This is particularly acute with certifications right now—a lot of people doubt the value of certifications, claiming folks who have them don’t know anything, the certifications are worthless, they don’t reflect the real world, etc.

I will agree that we have a problem with the depth and level of knowledge of network engineers at the moment. We all need to grow up a little, learn technologies rather than CLIs, and actually learn how to be engineers. On the other hand, when you interview someone, or when you are being interviewed…

Leave your ego at the door.

Is it really worth losing a really good hire because you needed your ego stroked by “beating” someone in an interview?

No, I didn’t think so.

Posted in CULTURE, WRITTEN