TCP Acceleration as an OS Service¶
User Guide¶
Quick Start¶
Building¶
Requirements:
TAS is built on top of Intel DPDK for direct access to the NIC. We have tested this version with dpdk versions 17.11.9, 18.11.5, 19.11.
Assuming that dpdk is installed in ~/dpdk-inst
TAS can be built as follows
(for a system installation of dpdk the RTE_SDK
variable does not need to be
passed explicitly):
make RTE_SDK=~/dpdk-inst
This will build the TAS service (binary tas/tas
), client libraries (in
lib/
), and a few debugging tools (in tools/
).
Running¶
Before running TAS the following steps are necessary:
Make sure
hugetlbfs
is mounted on/dev/hugepages
and enough huge pages are allocated for TAS and dpdk.Binding the NIC to the dpdk driver, as with any other dpdk application (for Intel NICs use
vfio
becauseuio
does not support multiple interrupts).
sudo modprobe vfio-pci
sudo mount -t hugetlbfs nodev /dev/hugepages
echo 1024 | sudo tee /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
sudo ~/dpdk-inst/sbin/dpdk-devbind -b vfio-pci 0000:08:00.0
To run (--ip-addr
and --fp-cores-max
are the minimum arguments typically
needed to run tas):
sudo code/tas/tas --ip-addr=10.0.0.1/24 --fp-cores-max=2
Once tas is running, applications that directly link to libtas
or
libtas_sockets
can be run directly. To run an unmodified application with
sockets interposition run as follows (for example):
sudo LD_PRELOAD=lib/libtas_interpose.so ../benchmarks/micro_rpc/echoserver_linux 1234 1 foo 8192 1
In Qemu/KVM¶
For functional testing and development TAS can run in Qemu (with or without
acceleration through KVM). We have tested this with the virtio
dpdk driver.
By default, the qemu virtio device only provides a single queue, and thus only
allows TAS to run on a single core. To run a virtual machine with support for
multiple queue, qemu requires a tap device with multi-queue support enabled.
Here is an example sequence of commands to create a tap device with multi queue support and then start a qemu instance that binds this tap device to a multi-queue virtio device:
sudo ip link add tastap0 type tuntap
sudo ip tuntap add mode tap multi_queue name tastap0
sudo ip link set dev tastap0 up
qemu-system-x86_64 \
-machine q35 -cpu host \
-drive file=vm1.qcow2,if=virtio \
-netdev tap,ifname=tastap0,script=no,downscript=no,vhost=on,queues=8,id=nInt\
-device virtio-net-pci,mac=52:54:00:12:34:56,vectors=18,mq=on,netdev=nInt \
-serial mon:stdio -m 8192 -smp 16 -display none -enable-kvm
Inside the virtual machine, the following sequence of commands first takes the
linux network interface down, binds it to the uio_pci_generic
driver that
the dpdk virtio PMD supports, and then reserves huge pages:
sudo ifconfig enp0s2 down
sudo modprobe uio
sudo modprobe uio_pci_generic
sudo dpdk-devbind.py -b uio_pci_generic 0000:00:02.0
echo 1024 | sudo tee /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
Virtio does not support all the NIC features that we depend on in physical NICs. In particular virtio does not support transmit checksum offload or the RSS redirection table TAS uses for scaling up and down. The dpdk virtio PMD also does not support multiple MSI-X interrupts. To run TAS given these constraints, the following command line parameters disable the use of these features (note that this implies busy polling and no autoscaling):
sudo code/tas/tas --ip-addr=10.0.0.1/24 --fp-cores-max=8 \
--fp-no-xsumoffload --fp-no-ints --fp-no-autoscale
Kernel NIC Interface¶
TAS supports the DPDK kernel NIC interface (KNI) to pass packets to the Linux kernel network stack. With KNI enabled, TAS becomes an opt-in fastpath where TAS-enabled applications operate through TAS, and other applications can use the Linux network stack as before, sharing the same physical NIC.
To run TAS with KNI the first step is to load the rte_kni
kernel module.
Next, when run with the --kni-name=
option, TAS will create a KNI dummy
network interface with the specified name. After assigning an IP address to this
network interface, the Linux network stack can send and receive packets through
this interface as long as TAS is running. Here is the complete sequence of
commands:
sudo modprobe rte_kni
sudo code/tas/tas --ip-addr=10.0.0.1/24 --kni-name=tas0
# in separate terminal
sudo ifconfig tas0 10.0.0.1/24 up
TAS Command-Line Parameters¶
IP Configuration¶
--ip-addr=ADDR[/PREFIXLEN]
Set local IP address. Currently only exactly one IP address is supported.
--ip-route=DEST[/PREFIX],NEXTHOP
Add an IP route for the destination subnet
DEST/PREFIX
viaNEXTHOP
. Can be specified more than once. For example, a default route could be--ip-route=0.0.0.0/0,192.168.1.1
.
Fast Path Configuration¶
--fp-cores-max=CORES
Maximum number of cores to use for fast-path. (default: 1)
--fp-no-ints
Disable receive interrupts in the NIC driver, switches over to just polling.
--fp-no-xsumoffload
Disable transmit checksum offloads, primarily useful to run TAS with NICs that do not support checksum offload, but comes at a slight performance cost.
--fp-no-autoscale
Disable auto scaling, instead fix the number of cores used by the fast path to the maximum.
--fp-no-hugepages
Do not use huge pages for the shared memory region between TAS and applications. (DPDK still uses huge pages for it’s buffers unless explicitly disabled through
--dpdk-extra
)
--dpdk-extra=ARG
Pass
ARG
through as a parameter to the dpdk EAL. (see https://doc.dpdk.org/guides/linux_gsg/linux_eal_parameters.html)
TCP Protocol Parameters¶
--tcp-rtt-init=RTT
Initial RTT used for congestion control. Is updated with actual measurements when they arrive.
--tcp-link-bw=BANDWIDH
Link bandwidth in GBPS. TODO: what is this used for? (default: 10).
--tcp-rxbuf-len=LEN
Connection receive buffer len in bytes (default: 8,192).
--tcp-txbuf-len=LEN
Connection transmit buffer len in bytes (default: 8,192).
--tcp-handshake-timeout=TIMEOUT
TCP handshake timeout in microseconds (default 10,000us).
--tcp-handshake-retries=RETRIES
Maximum retries for timeouts during handshake. (default: 10).
Congestion Control Parameters¶
--cc=ALGORITHM
Choose which congestion control algorithm to use. The supported options are:
dctcp-rate
: dctcp algorithm adapted to directly operate on the connection rate.
dctcp-win
: original dctcp algorithm with the window converted to a rate for enforcement.
timely
: latency-based TIMELY control law.
const-rate
: set all connections to a constant rate (effectively disables congestion control, useful for debugging).
--cc-control-interval=INT
Control interval length as multiples of the connection’s RTT. (default: 2)
--cc-control-granularity=G
Minimal control loop granularity. Control loop is only executed at most once every
G
microseconds. (default: 50)
--cc-rexmit-ints=INTERVALS
Number of connection cnotrol intervals before TAS triggers a re-transmit. (default: 4).
DCTCP¶
For the dctcp-rate
and dctcp-win
algorithm:
--cc-dctcp-weight=WEIGHT
EWMA weight for dctcp’s ECN rate (alpha, default: 0.0625).
--cc-dctcp-mimd=INC_FACT
Enable mutliplicative increase by
INC_FACT
(disabled by default, only used for tests).
--cc-dctcp-min=RATE
Minimum rate to set for flows (kbps, default: 10000).
Timely¶
Parameters for the timely
algorithm:
--cc-timely-tlow=TIME
Tlow threshold in microseconds. (default: 30)
--cc-timely-thigh=TIME
Thigh threshold in microseconds. (default: 150)
--cc-timely-step=STEP
Additive increase step size in kbps (default: 10000)
--cc-timely-init=RATE
Initial connection rate in kbps (default: 10000)
--cc-timely-alpha=FRAC
EWMA weight for rtt diff. (default: 0.02)
--cc-timely-beta=FRAC
Multiplicative decrease factor. (default: 0.8)
--cc-timely-minrtt=RTT
Minimal RTT without queueing in microseconds. (default: 11)
--cc-timely-minrate=RTT
Minimal connection rate to use in kbps (default: 10000)
Constant Rate¶
For the const-rate
“algorithm” the following configuration options apply:
--cc-const-rate=RATE
Sets the rate to use in kbps.
ARP Protocol Parameters¶
--arp-timeout=TIMEOUT
Initial ARP request timetout in microseconds. This doubles with every retry (default: 500).
--arp-timeout-max=TIMEOUT
Maximal ARP timeout in microseconds. If the retry-timeout grows larger than this, the request fails. (default: 10,000,000 us)
Slowpath Queues¶
--nic-rx-len=LEN
Number of entries in TAS slowpath receive queue. (default: 16,384).
--nic-tx-len=LEN
Number of entries in TAS slowpath transmit queue. (default: 16,384).
--app-kin-len=LEN
Application slow path receive queue length in bytes. (default: 1,048,576).
--app-kout-len=LEN
Application slow path transmit queue length in bytes. (default: 1,048,576).
Host Kernel Interface¶
--kni-name=NAME
Enables the DPDK kernel network interface, by creating a dummy network interface with the name
NAME
. (default: disabled)
Miscellaneous¶
--quiet
Disable non-essential logging.
--ready-fd=FD
Causes TAS to write to file descriptor
FD
when ready. Can be used by supervisor processes to detect when TAS is ready, e.g. used in full system tests.
TAS Troubleshooting¶
Developer Guide¶
Code Structure¶
tas/
: service implementation
tas/fast
: TAS fast path
tas/slow
: TAS slow path
lib/
: client libraries
lib/tas
: lowlevel TAS client library (interface:lib/tas/include/tas_ll.h
)
lib/sockets
: socket emulation layer
tools/
: debugging tools
API¶
The full API documentation extracted by doxygen can be found here.
TAS Low-Level Application API¶
-
int
flextcp_init
(void)¶ Initializes global flextcp state, must only be called once.
- Return
0 on success, < 0 on failure
Contexts¶
-
struct
flextcp_context
¶ A flextcp context is per-thread state for the stack. (opaque) This includes:
admin queue pair to kernel
notification queue pair to flexnic
-
int
flextcp_context_create
(struct flextcp_context *ctx)¶ Create a flextcp context.
-
int
flextcp_context_poll
(struct flextcp_context *ctx, int num, struct flextcp_event *events)¶ Poll events from a flextcp socket.
Warning
doxygenfunction: Cannot find function “flextcp_block” in doxygen xml output for project “TAS” from directory: xml/
Connections¶
-
struct
flextcp_connection
¶ TCP connection. (opaque)
-
int
flextcp_connection_open
(struct flextcp_context *ctx, struct flextcp_connection *conn, uint32_t dst_ip, uint16_t dst_port)¶ Open a connection (asynchronous).
-
int
flextcp_connection_close
(struct flextcp_context *ctx, struct flextcp_connection *conn)¶ Close a connection (asynchronous).
-
int
flextcp_connection_rx_done
(struct flextcp_context *ctx, struct flextcp_connection *conn, size_t len)¶ Receive processing for `len’ bytes done.
-
ssize_t
flextcp_connection_tx_alloc
(struct flextcp_connection *conn, size_t len, void **buf)¶ Allocate transmit buffer for `len’ bytes, returns number of bytes allocated.
NOTE: short allocs can occur if buffer wraps around
-
ssize_t
flextcp_connection_tx_alloc2
(struct flextcp_connection *conn, size_t len, void **buf_1, size_t *len_1, void **buf_2)¶ Allocate transmit buffer for `len’ bytes, returns number of bytes allocated. May be split across two buffers, in case of wrap around.
-
int
flextcp_connection_tx_send
(struct flextcp_context *ctx, struct flextcp_connection *conn, size_t len)¶ Send previously allocated bytes in transmit buffer
-
int
flextcp_connection_tx_close
(struct flextcp_context *ctx, struct flextcp_connection *conn)¶ Send previously allocated bytes in transmit buffer
-
int
flextcp_connection_tx_possible
(struct flextcp_context *ctx, struct flextcp_connection *conn)¶ Make sure there is room in the context send queue (not send buffer)
Returns 0 if transmit is possible, -1 otherwise.
-
int
flextcp_connection_move
(struct flextcp_context *ctx, struct flextcp_connection *conn)¶ Move connection to specfied context
Listeners¶
-
struct
flextcp_listener
¶ TCP listening “socket”. (opaque)
-
FLEXTCP_LISTEN_REUSEPORT
¶
-
int
flextcp_listen_open
(struct flextcp_context *ctx, struct flextcp_listener *lst, uint16_t port, uint32_t backlog, uint32_t flags)¶ Open a listening socket (asynchronous).
-
int
flextcp_listen_accept
(struct flextcp_context *ctx, struct flextcp_listener *lst, struct flextcp_connection *conn)¶ Accept connections on a listening socket (asynchronous). This can be called more than once to register multiple connection handles.
Events¶
-
enum
flextcp_event_type
¶ Types of events that can occur in flextcp contexts
Values:
-
enumerator
FLEXTCP_EV_LISTEN_OPEN
¶ flextcp_listen_open() result.
-
enumerator
FLEXTCP_EV_LISTEN_NEWCONN
¶ New connection on listening socket arrived.
-
enumerator
FLEXTCP_EV_LISTEN_ACCEPT
¶ Accept operation completed
-
enumerator
FLEXTCP_EV_CONN_OPEN
¶ flextcp_connection_open() result
-
enumerator
FLEXTCP_EV_CONN_CLOSED
¶ Connection was closed
-
enumerator
FLEXTCP_EV_CONN_RECEIVED
¶ Data arrived on connection
-
enumerator
FLEXTCP_EV_CONN_SENDBUF
¶ More send buffer available
-
enumerator
FLEXTCP_EV_CONN_RXCLOSED
¶ Receive stream closed
-
enumerator
FLEXTCP_EV_CONN_TXCLOSED
¶ transmit stream closed
-
enumerator
FLEXTCP_EV_CONN_MOVED
¶ Connection moved to new context
-
enumerator
-
struct
flextcp_event
¶ Events that can occur on flextcp contexts.
Public Members
-
struct flextcp_event::[anonymous]::[anonymous]
listen_open
¶
-
struct flextcp_event::[anonymous]::[anonymous]
listen_newconn
¶
-
struct flextcp_event::[anonymous]::[anonymous]
listen_accept
¶
-
struct flextcp_event::[anonymous]::[anonymous]
conn_open
¶
-
struct flextcp_event::[anonymous]::[anonymous]
conn_received
¶
-
struct flextcp_event::[anonymous]::[anonymous]
conn_sendbuf
¶
-
struct flextcp_event::[anonymous]::[anonymous]
conn_rxclosed
¶
-
struct flextcp_event::[anonymous]::[anonymous]
conn_txclosed
¶
-
struct flextcp_event::[anonymous]::[anonymous]
conn_moved
¶
-
struct flextcp_event::[anonymous]::[anonymous]
conn_closed
¶
-
struct flextcp_event::[anonymous]::[anonymous]