## Towards High-speed Networking in the Post-Moore Era

Vishal Shrivastav

Ph.D. Thesis Defense Cornell University 2020









datacenter serves 2.6 billion active users daily



In 2018, 40% of the world's total data was stored and processed inside datacenters























Network Stack




**Applications** 

CPU processing improving **2x / 20 years** [Source: Hennessy & Patterson 6e]





































# End-host Network Stack Operation: Packet Scheduling
















\*\* 1 core, 1500B packets, running a variant of the SJF algorithm




## How to build a packet scheduler that is simultaneously Programmable, Scalable, High-speed?

- Programmable: express wide range of packet scheduling algorithms
- Scalable: schedule 10s of thousands of flows, ref. [SENIC - NSDI'14] [Carousel - SIGCOMM'17]
- High-speed: schedule at line rate 100G+




## PIEO








#### Programmable

Abstraction: more expressive than any state-of-the-art packet scheduling primitive

#### Scalable

Hardware Design: easily scales to 10s of thousands of flows

#### High-speed

makes scheduling decisions in O(1) time [4 clock cycles]


Scheduling Algorithms

- When does an element become eligible for scheduling? Encode using a $t_{eligible}$ value.
- In what order should eligible elements be scheduled? Encode using a rank value.

Whenever the link is idle: among all elements satisfying the eligibility predicate $t_{current} \geq t_{eligible}$, schedule the smallest ranked element.

PIEO scheduler simply schedules the **smallest ranked eligible** element at any given time

Each element carries a rank and a $t_{eligible}$ value, both programmed based on the choice of scheduling algorithm.

## Push-In-Extract-Out Primitive

Rank-ordered list (increasing rank value):

| rank           | 10 | 12 | 13 | 16 | 19 | 21 | 22 |
|----------------|----|----|----|----|----|----|----|
| $t_{eligible}$ | 16 | 9  | 4  | 13 | 6  | 2  | 15 |

enqueue(f) "Push-In": inserts element at position dictated by its rank value. Enqueuing an element with rank 18 and $t_{eligible} = 1$ gives:

| rank           | 10 | 12 | 13 | 16 | 18 | 19 | 21 | 22 |
|----------------|----|----|----|----|----|----|----|----|
| $t_{eligible}$ | 16 | 9  | 4  | 13 | 1  | 6  | 2  | 15 |

dequeue() "Extract-Out": returns "smallest ranked eligible" element. With $t_{current} = 7$ and the filter $t_{current} \ge t_{eligible}$, dequeue() on the original list returns the element with rank 13 ($t_{eligible} = 4$), leaving:

| rank           | 10 | 12 | 16 | 19 | 21 | 22 |
|----------------|----|----|----|----|----|----|
| $t_{eligible}$ | 16 | 9  | 13 | 6  | 2  | 15 |

dequeue(f): returns a specific element
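The enqueue and dequeue semantics above can be modelled in a few lines of software. The following is a minimal reference sketch, not the hardware design from the talk: it scans the list linearly, so each operation is O(N). The names `Element`, `PieoList`, and `dequeue_flow` are illustrative assumptions.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Element:
    rank: int                                   # ordering key
    t_eligible: int = field(compare=False)      # earliest time it may be scheduled
    flow_id: str = field(compare=False, default="")


class PieoList:
    """Software model of the PIEO primitive (linear scan, for illustration only)."""

    def __init__(self) -> None:
        self._items: List[Element] = []         # kept sorted by increasing rank

    def enqueue(self, elem: Element) -> None:
        """Push-In: insert at the position dictated by the rank value."""
        bisect.insort_right(self._items, elem)

    def dequeue(self, t_current: int) -> Optional[Element]:
        """Extract-Out: remove and return the smallest ranked eligible element."""
        for i, elem in enumerate(self._items):
            if t_current >= elem.t_eligible:    # eligibility predicate
                return self._items.pop(i)
        return None

    def dequeue_flow(self, flow_id: str) -> Optional[Element]:
        """dequeue(f): remove and return a specific element."""
        for i, elem in enumerate(self._items):
            if elem.flow_id == flow_id:
                return self._items.pop(i)
        return None

    def ranks(self) -> List[int]:
        return [e.rank for e in self._items]


# Replay the example from the slides.
pieo = PieoList()
for rank, t in [(10, 16), (12, 9), (13, 4), (16, 13), (19, 6), (21, 2), (22, 15)]:
    pieo.enqueue(Element(rank, t))
pieo.enqueue(Element(18, 1))
print(pieo.ranks())               # 18 lands between 16 and 19
print(pieo.dequeue(t_current=7))  # the rank-13 element (t_eligible = 4)
```

Ranks 10 and 12 are smaller than 13, but their eligibility times (16 and 9) are still in the future at $t_{current} = 7$, so they are filtered out.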

- Work conserving
  - e.g., DRR, WFQ, WF<sup>2</sup>Q
- Non-work conserving
  - e.g., Token Bucket, RCSP
- Hierarchical scheduling
  - e.g., HPFQ
- Asynchronous scheduling
  - e.g., Starvation avoidance, D<sup>3</sup>
- Priority scheduling
  - e.g., SJF, SRTF, LSTF, EDF
- Complex scheduling policies
  - mixture of shaping and ordering




e.g., WF<sup>2</sup>Q:

for each element:
 calculate **start\_time** and **finish\_time**

at time x, among all elements s.t. **virtual\_time(x) >= start\_time**:
 schedule element with **smallest finish\_time**

programming PIEO

 $rank = finish\_time$

 $t_{eligible} = start\_time$

Predicate for filtering at dequeue at time x:

 $(virtual\_time(x) \ge t_{eligible})$
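As a rough illustration of this mapping, the sketch below tags packets with virtual start and finish times and then applies the PIEO rule. The tagging formula (`start = max(virtual time, previous finish of the flow)`, `finish = start + length / weight`) is the usual fair-queueing form and is an assumption here, as are the names `Packet`, `tag`, and `dequeue`; how the virtual time itself advances is left out.

```python
from typing import Dict, List, NamedTuple, Optional


class Packet(NamedTuple):
    flow: str
    start_time: float    # virtual start time
    finish_time: float   # virtual finish time

    @property
    def rank(self) -> float:          # rank = finish_time
        return self.finish_time

    @property
    def t_eligible(self) -> float:    # t_eligible = start_time
        return self.start_time


def tag(flow: str, length: float, weight: float, virtual_now: float,
        last_finish: Dict[str, float]) -> Packet:
    """Assumed tagging rule: a packet starts when the flow's previous one finishes."""
    start = max(virtual_now, last_finish.get(flow, 0.0))
    finish = start + length / weight
    last_finish[flow] = finish
    return Packet(flow, start, finish)


def dequeue(queue: List[Packet], virtual_now: float) -> Optional[Packet]:
    """Among packets with virtual_now >= t_eligible, extract the smallest rank."""
    eligible = [p for p in queue if virtual_now >= p.t_eligible]
    if not eligible:
        return None
    best = min(eligible, key=lambda p: p.rank)
    queue.remove(best)
    return best


last_finish: Dict[str, float] = {}
queue = [
    tag("A", 1000, 2.0, 0.0, last_finish),   # start 0,   finish 500
    tag("A", 1000, 2.0, 0.0, last_finish),   # start 500, finish 1000
    tag("B", 1000, 1.0, 0.0, last_finish),   # start 0,   finish 1000
]
print(dequeue(queue, 0.0))   # flow A's first packet (smallest finish time)
print(dequeue(queue, 0.0))   # flow B: A's second packet is not yet eligible
```

The second call shows why the eligibility filter matters: A's second packet ties with B's packet on finish time, but it is held back until the virtual time reaches its start time.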





#### Programmable

Abstraction

more expressive than any state-of-the-art packet scheduling primitive



#### Scalable

Hardware Design

easily scales to 10s of thousands of flows



#### High-speed

makes scheduling decisions in O(1) time [4 clock cycles]















Is it fundamentally necessary to access and compare O(N) elements in parallel to maintain an (exact) ordered list (of size N) in O(1) time?


We present a design that can maintain an (exact) ordered list in O(1) time, but only needs to access and compare  $O(\sqrt{N})$  elements in parallel.

#### **Key Insight**

"All problems in computer science can be solved by another level of indirection"

David Wheeler











































enqueue(f), dequeue(), dequeue(f) each execute in exactly 4 clock cycles

... at the cost of 2x memory overhead
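One way to picture the indirection idea is a two-level list: the N elements are split into rank-ordered sublists of about $\sqrt{N}$ elements, and each operation first picks a sublist by looking at one summary value per sublist, then works inside that single sublist. The sketch below is a software illustration under that assumption, not the hardware design from the talk: the split-on-overflow policy and the names `TwoLevelOrderedList` and `sublist_size` are hypothetical, and sparse sublists are never merged.

```python
import bisect
import math
from typing import List, Optional, Tuple

Elem = Tuple[int, int]  # (rank, t_eligible)


class TwoLevelOrderedList:
    def __init__(self, capacity: int) -> None:
        self.sublist_size = max(1, math.isqrt(capacity))   # about sqrt(N)
        self.sublists: List[List[Elem]] = []               # ordered by rank

    def enqueue(self, rank: int, t_eligible: int) -> None:
        if not self.sublists:
            self.sublists.append([])
        # Level 1: compare against the smallest rank of each sublist.
        mins = [s[0][0] for s in self.sublists if s]
        idx = max(0, bisect.bisect_right(mins, rank) - 1)
        sub = self.sublists[idx]
        # Level 2: compare against the elements of that one sublist.
        pos = bisect.bisect_right([r for r, _ in sub], rank)
        sub.insert(pos, (rank, t_eligible))
        if len(sub) > self.sublist_size:                   # overflow: split in two
            half = len(sub) // 2
            self.sublists[idx:idx + 1] = [sub[:half], sub[half:]]

    def dequeue(self, t_current: int) -> Optional[Elem]:
        for idx, sub in enumerate(self.sublists):
            # Level 1: a sublist qualifies if its smallest t_eligible has passed.
            if min(t for _, t in sub) <= t_current:
                # Level 2: first eligible element inside that sublist.
                pos = next(i for i, (_, t) in enumerate(sub) if t <= t_current)
                elem = sub.pop(pos)
                if not sub:
                    del self.sublists[idx]
                return elem
        return None

    def ranks(self) -> List[int]:
        return [r for sub in self.sublists for r, _ in sub]


lst = TwoLevelOrderedList(capacity=9)
for rank, t in [(10, 16), (12, 9), (13, 4), (16, 13), (19, 6), (21, 2), (22, 15), (18, 1)]:
    lst.enqueue(rank, t)
print(lst.sublists)
print(lst.dequeue(t_current=7))
```

In software the two levels are scanned sequentially; the point of the sketch is only that each level touches on the order of $\sqrt{N}$ values, which is the quantity a hardware design would need to compare in parallel. Keeping spare room in sublists so that inserts do not shift the whole list is one plausible reading of the memory overhead mentioned above.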

#### Implementation



- Implemented PIEO scheduler on a Stratix V FPGA
  - 234K logic modules (ALMs)
  - 6.5MB SRAM
  - 40Gbps interface bandwidth
- ~1300 LOCs in System Verilog

### Memory Consumption




- PIEO on FPGA: 80 MHz (for 32K flows) → 50ns per primitive operation
  - ~200 Gbps scheduling rate for 1500B (MTU) packets
- PIEO on ASIC: >1 GHz (expected) → <4ns per primitive operation
  - \>3 Tbps scheduling rate for 1500B (MTU) packets
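These figures follow from the 4-clock-cycle operation latency. The short calculation below reproduces them, assuming one primitive operation per scheduled 1500B packet; the function names are illustrative.

```python
def op_latency_ns(clock_hz: float, cycles_per_op: int = 4) -> float:
    """Latency of one primitive operation in nanoseconds."""
    return cycles_per_op * 1e9 / clock_hz


def scheduling_rate_gbps(latency_ns: float, packet_bytes: int = 1500) -> float:
    """Bits per nanosecond equals Gbps when one packet is scheduled per operation."""
    return packet_bytes * 8 / latency_ns


fpga_ns = op_latency_ns(80e6)   # 80 MHz FPGA clock
asic_ns = op_latency_ns(1e9)    # 1 GHz ASIC clock (lower bound of ">1 GHz")

print(f"FPGA: {fpga_ns:.0f} ns/op, {scheduling_rate_gbps(fpga_ns):.0f} Gbps")
print(f"ASIC: {asic_ns:.0f} ns/op, {scheduling_rate_gbps(asic_ns) / 1000:.0f} Tbps")
```

At 80 MHz the bound works out to 240 Gbps, consistent with the "~200 Gbps" quoted above; at 1 GHz it is 3 Tbps, and any faster clock pushes it higher.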

## Scalability



















### Part II: Switching Fabric






#### Packet Switching

(Per packet, Per hop Decisions)

Packet Switch

Circuit Switch













~2010: milliseconds

Today: nanoseconds

#### Challenge of Circuit Switching

Centralized

# How to build a fast control plane for circuit switching?

nanosecond-scale circuit scheduling with high performance






# Shoal



1. Physical Layer: Fast circuit scheduling mechanism
2. Routing: Bounded worst-case throughput
3. Congestion Control: Bounded worst-case queuing

Achieves comparable or better performance than several recent packet-switched network designs





static, pre-defined schedule

#### N-1 timeslots (epoch)

| End-hosts | 1 | 2 | 3 | 4 | 5 |
|-----------|---|---|---|---|---|
| A         | B | C | D | E | F |
| B         | C | D | E | F | A |
| C         | D | E | F | A | B |
| D         | E | F | A | B | C |
| E         | F | A | B | C | D |
| F         | A | B | C | D | E |

Full Mesh Virtual Topology

#### Synchronous System

ns-precision time sync DTP [SIGCOMM'16]
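The static schedule in the table follows a simple rotation: in timeslot t, end-host i is connected to end-host (i + t) mod N. A minimal sketch that regenerates the table and checks its properties is shown below; the function name `round_robin_schedule` is an illustrative assumption.

```python
from typing import List


def round_robin_schedule(hosts: List[str]) -> List[List[str]]:
    """Row i lists the host that hosts[i] is connected to in timeslots 1..N-1."""
    n = len(hosts)
    return [[hosts[(i + t) % n] for t in range(1, n)] for i in range(n)]


hosts = list("ABCDEF")
schedule = round_robin_schedule(hosts)
for host, row in zip(hosts, schedule):
    print(host, "->", " ".join(row))

# Within a timeslot every host receives from exactly one sender, and over an
# epoch every host is connected to every other host once (full mesh).
for t in range(len(hosts) - 1):
    assert sorted(row[t] for row in schedule) == hosts
for host, row in zip(hosts, schedule):
    assert sorted(row + [host]) == hosts
```

Because the schedule depends only on the slot number and the host index, each node can compute it locally given synchronized time, with no per-flow or per-packet scheduling decisions.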







Routing in Shoal

Valiant Load Balancing (uniform load balancing)



Bounded worst-case throughput of 50% compared to an ideal packet-switched network
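A rough way to see where the 50% figure comes from: with Valiant Load Balancing a cell is first sent to whichever node the source happens to be connected to, and that intermediate node forwards it when its own circuit to the destination comes up. Each cell therefore crosses at most two links. The sketch below assumes the rotation schedule "node i connects to node (i + t) mod N in timeslot t"; `two_hop_path` is a hypothetical helper, not code from Shoal.

```python
from typing import List, Tuple

Hop = Tuple[int, int, int]  # (timeslot, sender, receiver)


def two_hop_path(src: int, dst: int, slot: int, n: int) -> List[Hop]:
    """Path of a cell that src injects in `slot` under the rotation schedule."""
    assert src != dst and 1 <= slot <= n - 1
    mid = (src + slot) % n                  # node connected to src in this slot
    hops = [(slot, src, mid)]
    if mid != dst:
        second_slot = (dst - mid) % n       # slot in which mid is connected to dst
        hops.append((second_slot, mid, dst))
    return hops


n = 6  # nodes A..F numbered 0..5
for slot in range(1, n):
    print(slot, two_hop_path(src=0, dst=3, slot=slot, n=n))
```

Over one epoch the source spreads its cells across every other node exactly once, which is the uniform load balancing. Since no cell uses more than two link transmissions, at most half of the link capacity is spent on the extra hop.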



**Congestion Control** 

The per-destination queue $Q_i$ for each destination $i$ is bounded:

 $len(Q_i) \le 1 + incast\_degree(i)$ packets


- Fast, de-centralized, and traffic-agnostic circuit scheduling
- Routing with bounded worst-case throughput
- Congestion Control with bounded worst-case queuing
- Worst-case network throughput of 50% (compared to an ideal packet-switched network)
  - Equip each node with 2x bandwidth
    - power (Shoal) < power (packet-switched network with 1/2 bandwidth of Shoal)
    - cost (Shoal) < cost (packet-switched network with 1/2 bandwidth of Shoal)

### Implementation

FPGA code for the implementation of Shoal is available at: https://github.com/vishal1303/Shoal

#### Evaluation

- Packet-level simulator in C
- 512 end-hosts
- Shoal has 2x bandwidth (100Gbps vs. 50Gbps) at lower cost and power!

Results shown for Short Flows (<100KB) and Long Flows (>1MB).

Shoal performs comparably to or better than several recent packet-switched network designs











### Future Directions

- High-speed and Programmable Processors for (stateful) Network Functions
  - e.g., load balancers, firewalls, deep packet inspectors
- Low latency Optical Circuit Switching







#### **Bandwidth Demand**

#### **Networking Infrastructure Speed**

Thesis:

- end-host: General-purpose Processor → Domain-specific Processor
  - Programmability vs. Performance?
  - Programmable and high-performance domain-specific processor for packet scheduling
- switching fabric: Packet Switch → Fast Circuit Switch
  - Fast Control Plane?
  - High-performance, ns-granularity control plane for circuit switching

# Thank you!