



# Programmable Multi-Dimensional Table Filters for Line Rate Network Functions

Vishal Shrivastav



# Evolution of Programmable Data Plane Hardware\*



<sup>\*</sup> Not an exhaustive list

#### **Attributes**

|       | Delay | Utilization |
|-------|-------|-------------|
| Paths |       |             |
|       |       |             |
|       |       |             |
|       |       |             |

#### **Performance-aware Routing**

Filter paths with delay < d and utilization < u. Choose a random path from the filtered set


#### **Congestion-aware Load Balancing**

Filter path with minimum congestion




or

Filter d random egress ports. Choose the least queued port from those d ports

#### **Attributes**

|         | mem | bw | cpu |
|---------|-----|----|-----|
| Servers |     |    |     |
|         |     |    |     |
|         |     |    |     |
|         |     |    |     |


#### **Resource-aware L4 Load Balancing**

Filter servers with avail mem > m and avail bw > b. From the filtered set, choose the server with the least cpu utilization


#### **Data Plane Diagnosis**

Filter switch ports with packet rate > t

#### **Attributes**

|       | A (0/1) | B (0/1) |
|-------|---------|---------|
| Paths |         |         |
|       |         |         |
|       |         |         |
|       |         |         |

#### **Policy Compliance**

From all available paths, filter the paths not carrying tenant A's or B's traffic.

Choose a path at random from the filtered paths to route a new flow from tenant C.
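The policy above can be sketched in software as a two-column 0/1 filter followed by a random pick. This is only an illustration: the path names, the `paths` table, and the helper functions are hypothetical, and "not carrying tenant A's or B's traffic" is read here as carrying neither.

```python
import random

# Hypothetical path table: one row per path, with 0/1 flags recording whether
# the path currently carries traffic from tenant A or tenant B.
paths = {
    "p1": {"A": 1, "B": 0},
    "p2": {"A": 0, "B": 0},
    "p3": {"A": 0, "B": 1},
    "p4": {"A": 0, "B": 0},
}

def compliant_paths(table):
    """Return the IDs of paths that carry neither tenant A's nor tenant B's traffic."""
    return sorted(pid for pid, row in table.items()
                  if row["A"] == 0 and row["B"] == 0)

def route_new_flow(table, rng=random):
    """Pick a random compliant path for tenant C, or None if no path qualifies."""
    candidates = compliant_paths(table)
    return rng.choice(candidates) if candidates else None
```

Returning `None` when nothing qualifies is a choice made for this sketch; the slides do not say what the policy does with an empty filtered set.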



**Chained Multi-dimensional Filtering at Line Rate** 

### State-of-the-Art

#### **Current Programmable Switch Processing Pipeline**



Does not support line rate multi-dimensional filtering

# Thanos




#### **Thanos Switch Processing Pipeline**





**Programmable Filter Module** 

# Filter Abstractions and Primitives

**Unary Filter Processing Unit (UFPU)** 

UFPU: table  $\rightarrow$  table

Example UFPU filters:

- Filter a load balancing server at random
- Filter load balancing servers with avail mem > X
- Filter load balancing servers in a round robin manner weighted by avail bw
- Filter load balancing server with min cpu utilization



#### **Unary Filter Processing Unit (UFPU)**

UFPU: table → table 

random()

predicate(attr relop X)

weighted-round-robin(attr)

min/max(attr)

Filter N least cpu utilized load balancing servers

or

Filter N random load balancing servers
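A minimal software sketch of the four UFPU operators follows, assuming a table is modelled as a Python dict from entry ID to an attribute dict. The function names, the `RELOPS` set, and the sample `servers` table are hypothetical, and the weighted round robin shown is one simple interpretation rather than the hardware's scheduling order.

```python
import operator
import random

# Relational operators a predicate filter may use (an illustrative set).
RELOPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
          ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def ufpu_random(table, rng=random):
    """random(): keep one entry chosen uniformly at random."""
    if not table:
        return {}
    rid = rng.choice(sorted(table))
    return {rid: table[rid]}

def ufpu_predicate(table, attr, relop, x):
    """predicate(attr relop X): keep every entry whose attribute passes the test."""
    test = RELOPS[relop]
    return {rid: row for rid, row in table.items() if test(row[attr], x)}

def ufpu_min(table, attr):
    """min(attr): keep the entry with the smallest attribute (lowest ID on ties)."""
    if not table:
        return {}
    rid = min(sorted(table), key=lambda r: table[r][attr])
    return {rid: table[rid]}

class WeightedRoundRobin:
    """weighted-round-robin(attr): successive calls cycle through the entries,
    returning each one a number of times equal to its integer weight."""

    def __init__(self, attr):
        self.attr = attr
        self.turn = 0  # position in the repeating schedule

    def __call__(self, table):
        schedule = [rid for rid in sorted(table)
                    for _ in range(int(table[rid][self.attr]))]
        if not schedule:
            return {}
        rid = schedule[self.turn % len(schedule)]
        self.turn += 1
        return {rid: table[rid]}

# Hypothetical server table used only to exercise the operators.
servers = {
    1: {"mem": 8, "bw": 2, "cpu": 0.7},
    2: {"mem": 2, "bw": 1, "cpu": 0.2},
    3: {"mem": 16, "bw": 1, "cpu": 0.5},
}
```

Every operator takes a table and returns a table, which is what lets the outputs feed further filters.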








A K-UFPU comprises N UFPUs and adds a new configurable parameter K

K specifies the length of the chain (at most N); setting K = 1 reduces a K-UFPU to a single UFPU


We use K-UFPU (instead of UFPU) as the basic computing unit
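One way to picture a K-UFPU is as K applications of the same unary operator, each working on the entries the earlier ones did not select. The slides do not spell out how the chained UFPUs combine their results, so the accumulation below is an assumption, and `min_by`, `k_ufpu`, and the `servers` table are hypothetical names for this sketch.

```python
def min_by(attr):
    """Build a unary min(attr) operator over a table (dict: ID -> attributes)."""
    def op(table):
        if not table:
            return {}
        rid = min(sorted(table), key=lambda r: table[r][attr])
        return {rid: table[rid]}
    return op

def k_ufpu(table, op, k):
    """Apply `op` up to k times, each time on the entries not yet selected,
    and return everything selected so far (assumed chaining semantics)."""
    remaining = dict(table)
    selected = {}
    for _ in range(k):
        picked = op(remaining)
        if not picked:
            break  # table exhausted before the chain ended
        selected.update(picked)
        for rid in picked:
            remaining.pop(rid)
    return selected

# Hypothetical table: "filter the 2 least cpu utilized servers".
servers = {1: {"cpu": 0.7}, 2: {"cpu": 0.2}, 3: {"cpu": 0.5}, 4: {"cpu": 0.9}}
two_least = k_ufpu(servers, min_by("cpu"), k=2)
```

With `k=1` the function returns exactly what the single operator returns, matching the statement that K = 1 reduces a K-UFPU to a UFPU.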


#### **Binary Filter Processing Unit (BFPU)**

BFPU: table, table → table

intersection()

difference()




Filter all load balancing servers with avail mem > X and avail bw > Y



Filter all load balancing servers with avail mem > X and avail bw > Y and # conn < Z



Filter m load balancing servers at random




Filter load balancing server with min cpu util
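Read as one chain, the filters above combine three predicates through intersections, then sample m entries, then take the minimum. The sketch below assumes that ordering and models tables as Python dicts; all function names, attribute names, and the `servers` data are hypothetical.

```python
import random

def predicate(table, attr, test):
    """Unary filter: keep entries whose attribute satisfies `test`."""
    return {rid: row for rid, row in table.items() if test(row[attr])}

def intersection(a, b):
    """Binary filter: entries present in both tables."""
    return {rid: a[rid] for rid in a if rid in b}

def difference(a, b):
    """Binary filter: entries of `a` that are absent from `b`."""
    return {rid: a[rid] for rid in a if rid not in b}

def random_m(table, m, rng=random):
    """Unary filter: keep up to m entries chosen at random."""
    ids = rng.sample(sorted(table), min(m, len(table)))
    return {rid: table[rid] for rid in ids}

def min_by(table, attr):
    """Unary filter: keep the entry with the smallest attribute value."""
    if not table:
        return {}
    rid = min(sorted(table), key=lambda r: table[r][attr])
    return {rid: table[rid]}

def pick_server(servers, x, y, z, m, rng=random):
    """mem > x AND bw > y AND conn < z, then m at random, then min cpu."""
    mem_ok = predicate(servers, "mem", lambda v: v > x)
    bw_ok = predicate(servers, "bw", lambda v: v > y)
    conn_ok = predicate(servers, "conn", lambda v: v < z)
    eligible = intersection(intersection(mem_ok, bw_ok), conn_ok)
    return min_by(random_m(eligible, m, rng), "cpu")

# Hypothetical server table.
servers = {
    1: {"mem": 8, "bw": 5, "conn": 10, "cpu": 0.6},
    2: {"mem": 2, "bw": 9, "conn": 3, "cpu": 0.1},
    3: {"mem": 16, "bw": 4, "conn": 20, "cpu": 0.3},
    4: {"mem": 32, "bw": 1, "conn": 5, "cpu": 0.2},
}
```

The three predicates are independent of one another, which is why they can be evaluated in parallel before the binary units merge their outputs.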



#### **Unary Filter Processing Unit (UFPU)**

UFPU: table → table

random()

predicate(attr *relop* X)

weighted-round-robin(attr)

min/max(attr)

#### **Binary Filter Processing Unit (BFPU)**

BFPU: table, table → table

union()

intersection()

difference()

#### A 5-stage serial chain filter pipeline

(outputs from stage i are inputs to stage i+1)



# From Abstractions to Hardware Design

# Two Hardware Components



# Hardware Component # 1



#### 1. Multi-dimensional Table

# Hardware Component # 2



#### 2. Programmable Filter Pipeline

# Multi-Dimensional Table



How to design an efficient data structure for a multi-dimensional relational table?

Should allow line rate read, write, update!

### Multi-Dimensional Table Data Structure

#### **Limitations of Classic Data Structures**

#### No universal data structure

- Range trees / B-Trees for range filtering
- Heap for min/max filtering
- Disjoint Set Data Structure for set operations
- Either compromise on performance of certain operations or pay the cost of maintaining multiple data structures over the same data

#### **Hierarchical Structure**

- Fundamental O(log(N)) latency
- Hard to pipeline<sup>[1]</sup>

[1] Yi-Hua E. Yang and Viktor K. Prasanna. "High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA". In Proceedings of FPGA, 2010

### Multi-Dimensional Table Data Structure

#### **Our Solution:**

Sorted Multidimensional Bidirectional Map (SMBM)


| ID | X  | Y  |
|----|----|----|
| 1  | 15 | 6  |
| 2  | 4  | 19 |
| 3  | 22 | 19 |
| 4  | 15 | 8  |

**Example Table** 

Stored as sorted lists, one per dimension:

- ID: 1, 2, 3, 4
- X: 4, 15, 15, 22
- Y: 6, 8, 19, 19

Store each dimension as a *flat* list of flip-flops

Allows parallel access and processing


Each list is kept sorted

Allows fast max/min filter operations




Each dimension is stored as an *independent* list

Allows parallel filter operations on multiple dimensions


Bidirectional mapping between ID and attributes


Forward map keeps track of which attributes belong to which ID

Reverse map allows fast mapping of filtered attributes to their respective IDs
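The properties listed for SMBM (one independently sorted list per dimension, plus forward and reverse maps) can be mimicked in software as follows. This is a sequential sketch only: the hardware reads its flip-flop lists in parallel, whereas the Python below uses ordinary lists and binary search. The class layout and method names are assumptions, and the reverse map is represented simply by storing each ID next to its value in the sorted list.

```python
import bisect

class SMBM:
    """Software stand-in for a sorted multidimensional bidirectional map."""

    def __init__(self, dims):
        self.dims = tuple(dims)
        self.forward = {}                        # forward map: ID -> {dim: value}
        self.lists = {d: [] for d in self.dims}  # per dim: sorted [(value, ID)]

    def add(self, rid, **values):
        """Insert or update an entry, keeping every dimension sorted."""
        if rid in self.forward:
            self.delete(rid)
        self.forward[rid] = {d: values[d] for d in self.dims}
        for d in self.dims:
            bisect.insort(self.lists[d], (values[d], rid))

    def delete(self, rid):
        """Remove an entry; the forward map tells us which values to drop."""
        row = self.forward.pop(rid)
        for d in self.dims:
            self.lists[d].remove((row[d], rid))

    def min_id(self, dim):
        """ID holding the smallest value in `dim` (head of the sorted list)."""
        return self.lists[dim][0][1]

    def max_id(self, dim):
        """ID holding the largest value in `dim` (tail of the sorted list)."""
        return self.lists[dim][-1][1]

    def ids_greater_than(self, dim, x):
        """IDs whose value in `dim` exceeds x (IDs are assumed numeric)."""
        lst = self.lists[dim]
        start = bisect.bisect_right(lst, (x, float("inf")))
        return {rid for _, rid in lst[start:]}

# The example table from the slides.
table = SMBM(["X", "Y"])
table.add(1, X=15, Y=6)
table.add(2, X=4, Y=19)
table.add(3, X=22, Y=19)
table.add(4, X=15, Y=8)
```

Because each dimension has its own list, a min/max query touches only one end of one list, and predicates on X and Y never contend for the same structure.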

### SMBM Performance




Can read the entire data structure in parallel in 1 clock cycle

Add and Delete can be issued every clock cycle with a latency of 2 cycles

# Filter Pipeline



How to design a *fully reconfigurable* and *fast* filter pipeline?

Can express an arbitrary chain of filter operations with chain length ≤ k on the n input lines

Runs at line rate

# Filter Pipeline Layout



# Reconfigurable Pipeline



#### In each stage:

1. Ability to apply any filter operation to an input line (or pair of input lines)
2. Ability to connect the output of a filter operation to any output line

# Single Pipeline Stage Design

#### Naive design with both K-UFPUs and BFPUs



# Single Pipeline Stage Design

#### Our design with both K-UFPUs and BFPUs



#### Apply a UFPU op on Input 1 and connect to Output 3

#### Apply a BFPU op on Input 1 and 3 and connect to Output 2
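The two configurations above can be mimicked by a stage that takes a list of (operation, input lines, output line) rules. This is a behavioural sketch only, not the hardware design: tables are modelled as Python sets of IDs, line numbers are 1-based as on the slides, and `run_stage`, `keep_even`, and `intersect` are hypothetical names.

```python
def run_stage(inputs, config):
    """Run one pipeline stage.

    inputs: list of tables, one per input line.
    config: list of (op, input_lines, output_line) rules; a unary op names one
            input line, a binary op names two. Unconfigured outputs stay None.
    """
    outputs = [None] * len(inputs)
    for op, in_lines, out_line in config:
        operands = [inputs[i - 1] for i in in_lines]
        outputs[out_line - 1] = op(*operands)
    return outputs

def keep_even(table):
    """Stand-in unary filter: keep even IDs."""
    return {rid for rid in table if rid % 2 == 0}

def intersect(a, b):
    """Stand-in binary filter: IDs present in both inputs."""
    return a & b

inputs = [{1, 2, 3, 4}, {9}, {2, 3, 5}]
config = [
    (keep_even, (1,), 3),    # UFPU op on Input 1, connected to Output 3
    (intersect, (1, 3), 2),  # BFPU op on Inputs 1 and 3, connected to Output 2
]
outputs = run_stage(inputs, config)
```

Chaining stages then amounts to passing one stage's `outputs` as the next stage's `inputs`.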



# Filter Pipeline Performance

- Filter pipeline can process a new filter request every clock cycle
- Our implementation runs at clock speeds in excess of 1 GHz
  - 1 GHz is the typical clock speed of today's switches
- However, scalability is limited to a few thousand table entries
  - ...beyond that, the clock speed falls below 1 GHz
  - Still sufficient for many applications where table entries include network paths, switch ports, servers in a cluster, etc.

### Application Performance

#### Load balancing client requests

**Policy 1.** Select a server uniformly at random.

**Policy 2.** Select a server uniformly at random from the set of servers with CPU utilization < X and available memory > Y and available bandwidth > Z. If the filtered set is empty, select a server uniformly at random from the entire set.
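Policy 2, including its fallback to Policy 1 when no server qualifies, can be written down as a short sketch. The attribute names and the `policy2` helper are hypothetical; the thresholds X, Y, and Z from the text appear as the parameters `x`, `y`, and `z`.

```python
import random

def policy2(servers, x, y, z, rng=random):
    """Pick uniformly at random among servers with cpu < x, mem > y and bw > z.
    If none qualifies, pick uniformly at random among all servers."""
    eligible = [sid for sid, s in sorted(servers.items())
                if s["cpu"] < x and s["mem"] > y and s["bw"] > z]
    pool = eligible or sorted(servers)
    return rng.choice(pool)

# Hypothetical cluster state.
servers = {
    "s1": {"cpu": 0.9, "mem": 4, "bw": 10},
    "s2": {"cpu": 0.3, "mem": 16, "bw": 10},
    "s3": {"cpu": 0.2, "mem": 1, "bw": 10},
}
```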



# Application Performance

#### In-network caching of relational graph filter queries



Response time with caching normalized w.r.t. no caching

# Summary

- Thanos extends programmable switches with the ability to do programmable line rate filtering over a multi-dimensional table
- Use cases include performance-aware routing, load balancing, network diagnosis, security, firewall...
- The design runs in excess of 1 GHz clock speed and scales to thousands of table entries
- Evaluations show up to 1.7x improvement in performance of key network functions, such as routing and load balancing
- Overall, advances the current state of in-network computing
  - enables richer algorithms for existing in-network applications
  - enables new in-network applications, e.g., caching of relational queries

# Thank you!