Connection-Oriented Transport: TCP

The TCP Connection

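Basic concepts

connection-oriented

Before two application processes can exchange data, they must first perform a handshake: they send each other preliminary segments to establish the parameters of the data transfer that follows.
Both sides also initialize a number of TCP state variables.

A TCP connection is a logical connection, not a physical one; its shared state resides only in the TCP instances of the two communicating end systems.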

Recall that because the TCP protocol runs only in the end systems and not in the intermediate network elements (routers and link-layer switches), the intermediate network elements do not maintain TCP connection state. In fact, the intermediate routers are completely oblivious to TCP connections; they see datagrams, not connections.

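A TCP connection provides full-duplex service

While data flows from process A to process B, data can flow from process B to process A at the same time.
A TCP connection is also point-to-point: it always connects exactly one sender and one receiver, never one sender and many receivers.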

So-called “multicasting”—the transfer of data from one sender to many receivers in a single send operation—is not possible with TCP.

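How a connection is established (overview)

The process that initiates the connection is the client process; the other process is the server process.

The client process first tells its transport layer that it wants to connect to a process on the server, for example with the call below. The client's TCP then sends a SYN segment (SEQ=x).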

clientSocket.connect((serverName,serverPort))

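The client's TCP then proceeds to establish a connection with the server's TCP: the server replies with a SYN segment that also acknowledges the client's SYN (SEQ=y, ACK=x+1), the client acknowledges that segment in turn, and normal data transfer can begin.

A brief note on this three-way handshake: the client first sends a special TCP segment, the server responds with a second special TCP segment, and the client then responds with a third special segment. The first two segments carry no payload, that is, no application-layer data; the third may carry a payload. Because three segments are exchanged between the two hosts, this connection-establishment procedure is called a three-way handshake.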

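Sending data

The client process passes a stream of data through the socket (the door of the process)

The client hands the data to TCP, which places it in the connection's send buffer, one of the buffers set aside during the initial three-way handshake.

From time to time, TCP will grab chunks of data from the send buffer and pass the data to the network layer. The specification does not pin down exactly when TCP must send the buffered data.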

The maximum amount of data that can be grabbed and placed in a segment is limited by the maximum segment size (MSS)

Note that the MSS is the maximum amount of application-layer data in the segment, not the maximum size of the TCP segment including headers. (This terminology is confusing, but we have to live with it, as it is well entrenched.)

Encapsulation: data + TCP header = TCP segment

TCP pairs each chunk of client data with a TCP header, thereby forming TCP segments. The segments are passed down to the network layer, where they are separately encapsulated within network-layer IP datagrams.

Receiving side

When TCP receives a segment at the other end, the segment’s data is placed in the TCP connection’s receive buffer. The application reads the stream of data from this buffer.

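A TCP connection therefore consists of buffers, variables, and a socket connection to a process in one host, plus another set of buffers, variables, and a socket connection to a process in the other host. No buffers or variables are allocated to the connection in the network elements (routers, switches, and repeaters) between the two hosts.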

TCP Segment Structure

The TCP segment consists of header fields and a data field

header

Source and destination port numbers (as in UDP)

are used for multiplexing/demultiplexing data from/to upper-layer applications.

The sequence number and acknowledgment number fields are 32 bits each.
The 16-bit receive window field is used for flow control.

We will see shortly that it is used to indicate the number of bytes that a receiver is willing to accept.

The 4-bit header length field

specifies the length of the TCP header in 32-bit words. The TCP header can be of variable length due to the TCP options field. (Typically, the options field is empty, so that the length of the typical TCP header is 20 bytes.)

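Because the options field varies in length, the header length varies too; this field tells the receiver where the header ends and the data begins.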

The optional and variable-length options field

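The options field of a TCP segment carries additional information and parameters while a connection is being established and maintained. Options give TCP the flexibility and extensibility to adapt to different network environments and requirements.

Some common TCP options:

Maximum Segment Size (MSS): announces the largest segment payload a side is willing to receive, which lets the two sides settle on a suitable segment size.

Window Scale: a scaling factor that enlarges the receive window beyond what the 16-bit window field can express, improving throughput on paths that need large windows.

The two sides agree on scaling during connection setup; the value in the window size field can be shifted left by at most 14 bits.

Timestamps: record when segments are sent and echoed back, which helps measure network delay and compute the round-trip time.

Selective Acknowledgment (SACK): lets the receiver report which ranges of data it has received, which speeds up recovery when segments are lost.

It tells the sender the sequence-number ranges that have arrived.
It reports those ranges even when an earlier segment is missing.
The sender can therefore tell exactly which segments it needs to retransmit.

Urgent pointer: marks where urgent data ends in the byte stream. Strictly speaking, this is a fixed header field rather than an option.

No-Operation (NOP): a one-byte filler that keeps the options aligned.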
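To make the window scale arithmetic concrete, here is a minimal sketch. The function name `effective_window` and the constant `MAX_SHIFT` are made up for this illustration; the only facts taken from the notes are the 16-bit window field and the 14-bit shift limit.

```python
MAX_SHIFT = 14  # largest shift count the window scale option allows


def effective_window(window_field: int, shift: int) -> int:
    """Window in bytes implied by a 16-bit window field and a shift count."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


print(effective_window(0xFFFF, 0))   # 65535 (no scaling)
print(effective_window(0xFFFF, 14))  # 1073725440 (largest scaled window)
```

With the maximum shift, the advertised window can grow from 64 KB to roughly 1 GB.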

The flag field holds the control bits: the six classic flags (URG, ACK, PSH, RST, SYN, FIN) plus the CWR and ECE bits used by explicit congestion notification.

The ACK bit is used to indicate that the value carried in the acknowledgment field is valid; that is, the segment contains an acknowledgment for a segment that has been successfully received. The RST, SYN, and FIN bits are used for connection setup and teardown, as we will discuss at the end of this section. The CWR and ECE bits are used in explicit congestion notification, as discussed in Section 3.7.2. Setting the PSH bit indicates that the receiver should pass the data to the upper layer immediately. Finally, the URG bit is used to indicate that there is data in this segment that the sending-side upper-layer entity has marked as “urgent.”

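ECN: a congested router marks the packet (in the IP header) to signal congestion and ask for a lower sending rate.
ECE: set to 1 by the receiver to ask the sender to slow down.
CWR: set to 1 by the sender to indicate that it has slowed down.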

The location of the last byte of this urgent data is indicated by the 16-bit urgent data pointer field.
checksum

Similar to the UDP checksum.
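The header layout described above can be decoded with Python's standard `struct` module. This is a sketch for the fixed 20-byte part only; the function name `parse_tcp_header`, the dictionary keys, and the hand-built sample header are illustrative assumptions, and options are not parsed.

```python
import struct

# Flag masks in the low byte of the 16-bit offset/flags word.
FLAG_MASKS = {
    "FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08,
    "ACK": 0x10, "URG": 0x20, "ECE": 0x40, "CWR": 0x80,
}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (options are not parsed)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src_port, dst_port, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_flags >> 12) * 4,  # 32-bit words -> bytes
        "flags": {name for name, mask in FLAG_MASKS.items() if offset_flags & mask},
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }


# Hand-built SYN header: ports 50000 -> 80, seq 42, header length 5 words.
syn = struct.pack("!HHIIHHHH", 50000, 80, 42, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(syn))
```

The header length comes out as 20 bytes because the 4-bit field counts 32-bit words and the sample header carries no options.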

data

Its maximum size is set by the MSS.
When TCP sends a large file, such as an image as part of a Web page, it typically breaks the file into chunks of size MSS (except for the last chunk, which will often be less than the MSS).
Interactive applications, however, often transmit data chunks that are smaller than the MSS

About the sequence number

The sequence number for a segment is therefore the byte-stream number of the first byte in the segment.

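Sequence number: a field in the TCP segment that identifies the position, within the byte stream being sent, of the segment's first byte. TCP uses sequence numbers to order and reassemble data so that it is delivered in the right order. Every TCP segment carries a sequence number, and the receiver uses it to put the segments' data back together into a complete byte stream.

The initial sequence number is chosen at random, for security:

It keeps other parties from easily guessing the sequence numbers in use.
Typically an algorithm derives the initial sequence number from the current clock value.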
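As a small illustration of byte-stream numbering, the sketch below cuts a byte string into MSS-sized chunks and labels each with the sequence number of its first byte. The helper `number_segments` and its `first_seq` parameter are hypothetical names introduced for this example.

```python
from typing import Iterator, Tuple

SEQ_SPACE = 2 ** 32  # sequence numbers are 32 bits wide and wrap around


def number_segments(data: bytes, mss: int, first_seq: int = 0) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence number, chunk) pairs for a byte stream cut into MSS-sized chunks."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    for offset in range(0, len(data), mss):
        # The sequence number is the stream position of the chunk's first byte.
        yield (first_seq + offset) % SEQ_SPACE, data[offset:offset + mss]


for seq, chunk in number_segments(b"x" * 2500, mss=1000):
    print(seq, len(chunk))
```

For 2500 bytes and an MSS of 1000, the chunks start at sequence numbers 0, 1000, and 2000, and the last chunk holds only 500 bytes.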

About the acknowledgment number

The acknowledgment number that Host A puts in its segment is the sequence number of the next byte Host A is expecting from Host B.

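Cumulative acknowledgment:

Acknowledging a number, say 8, also acknowledges every sequence number before 8: all bytes before byte 8 have been received.
When an out-of-order segment arrives, the receiver therefore either discards it or buffers it until the gap is filled.
An acknowledgment number of n means "I have received every byte up to and including n-1; send me the data starting at byte n next."
The acknowledgment number is valid only when the ACK bit in the flag field is set to 1.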
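The cumulative rule can be sketched as a small function. It assumes the receiver buffers out-of-order segments (one of the two choices mentioned above) and ignores sequence-number wraparound and partially overlapping segments; `next_ack` is a name invented for this example.

```python
from typing import List, Tuple


def next_ack(expected: int, received: List[Tuple[int, int]]) -> int:
    """Cumulative ACK number after receiving the given (seq, length) segments.

    `expected` is the next in-order byte before any of the segments arrive.
    """
    pending = dict(received)      # seq -> length of each received segment
    while expected in pending:    # advance while the next in-order byte is present
        expected += pending.pop(expected)
    return expected


# Bytes 0-535 and 900-999 have arrived; bytes 536-899 are still missing.
print(next_ack(0, [(0, 536), (900, 100)]))
# Once the gap is filled, the ACK jumps past the buffered data.
print(next_ack(0, [(0, 536), (900, 100), (536, 364)]))
```

The first call yields 536, because the receiver can only acknowledge up to the first missing byte; the second yields 1000.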

Telnet: A Case Study for Sequence and Acknowledgment Numbers

Suppose the starting sequence numbers are 42 and 79 for the client and server. Recall that the acknowledgment number is the sequence number of the next byte of data that the host is waiting for. After the TCP connection is established but before any data is sent, the client is waiting for byte 79 and the server is waiting for byte 42.

The third segment, with Seq=43 and ACK=80, carries no data. It still has a sequence number, because every TCP segment has a sequence number field that must be filled in; the field holds 43, the number of the next byte the client would send.
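The numbers in the Telnet exchange follow mechanically from the two starting sequence numbers. The sketch below reproduces them for a one-byte keystroke that the server echoes back; the function `telnet_echo_trace` and the tuple layout are assumptions made for this illustration.

```python
def telnet_echo_trace(client_seq: int = 42, server_seq: int = 79, payload: bytes = b"C"):
    """Return (direction, seq, ack, data) for the three segments of one echoed keystroke."""
    n = len(payload)
    return [
        # Client sends the keystroke; it is still waiting for the server's first byte.
        ("client->server", client_seq, server_seq, payload),
        # Server echoes the keystroke and acknowledges the client's byte(s).
        ("server->client", server_seq, client_seq + n, payload),
        # Client acknowledges the echo; no data, but the seq field is still filled in.
        ("client->server", client_seq + n, server_seq + n, b""),
    ]


for direction, seq, ack, data in telnet_echo_trace():
    print(direction, f"Seq={seq}", f"ACK={ack}", data)
```

The last line of the trace shows Seq=43, ACK=80 with an empty payload, matching the third segment discussed above.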

Round-Trip Time Estimation and Timeout

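TCP handles lost segments the same way rdt3.0 does: with a timeout/retransmit mechanism.

Compared with the rdt3.0 approach discussed earlier, a few subtle issues arise, chiefly the length of the timeout interval. Clearly, the timeout should be larger than the connection’s round-trip time (RTT), that is, the time from when a segment is sent until it is acknowledged. Otherwise, unnecessary retransmissions would be sent. But how much larger? How should the RTT be estimated in the first place? Should a timer be associated with each unacknowledged segment? These questions are discussed next.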

Estimating the Round-Trip Time

The sample RTT, denoted SampleRTT, for a segment is the amount of time between when the segment is sent (that is, passed to IP) and when an acknowledgment for the segment is received.

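Most TCP implementations take only one SampleRTT measurement at a time rather than measuring a SampleRTT for every transmitted segment. At any point in time, SampleRTT is being estimated for only one of the transmitted but currently unacknowledged segments, which yields a new SampleRTT value approximately once every RTT.

Measurements for retransmitted segments are ignored.

Because SampleRTT fluctuates from segment to segment with congestion in the routers and varying load on the end systems, TCP maintains an average of the SampleRTT values, called EstimatedRTT. Upon obtaining a new SampleRTT, TCP updates EstimatedRTT according to the following formula:

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT

The recommended value of α is α = 0.125 (that is, 1/8) [RFC 6298]

This is a weighted average (an exponentially weighted moving average) that puts more weight on recent samples than on old ones. Each update multiplies the previous average by (1 − α), so a sample taken k updates ago carries weight α(1 − α)^k: the newest sample contributes most, and the influence of older samples decays exponentially. This is reasonable, because recent samples better reflect the current state of the network.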
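The exponential decay of the weights is easy to check numerically. This is a sketch under two assumptions: the update is EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT, and the first sample is used as the starting estimate. The helper names `ewma` and `sample_weights` are invented for this example.

```python
ALPHA = 0.125  # recommended value of alpha


def ewma(samples, alpha=ALPHA):
    """Run the EstimatedRTT update over a list of SampleRTT values."""
    estimate = samples[0]  # assumption: the first sample seeds the estimate
    for sample in samples[1:]:
        estimate = (1 - alpha) * estimate + alpha * sample
    return estimate


def sample_weights(n, alpha=ALPHA):
    """Weight carried by each of the n most recent samples, newest first."""
    return [alpha * (1 - alpha) ** k for k in range(n)]


print(ewma([100.0, 200.0]))   # one outlier moves the estimate only slightly
print(sample_weights(4))      # each older sample counts 7/8 as much as the next newer one
```

A single sample of 200 ms after an estimate of 100 ms moves the estimate to 112.5 ms, which shows how the average damps short-lived spikes.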

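Estimating how far SampleRTT deviates from EstimatedRTT

The timeout must also take the variability of the round-trip delay into account: if the delay fluctuates widely, the timeout has to be set correspondingly longer.

[RFC 6298] defines the RTT variation, DevRTT, as an estimate of how much SampleRTT typically deviates from EstimatedRTT:

DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|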

The recommended value of β is 0.25.

Setting and Managing the Retransmission Timeout Interval

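Given EstimatedRTT and DevRTT, what value should TCP use for its timeout interval? The interval must be no smaller than EstimatedRTT, and it should also reflect the variability: wait longer when SampleRTT fluctuates a lot, and less when it fluctuates little:

TimeoutInterval = EstimatedRTT + 4 · DevRTT

Whenever a segment is received and EstimatedRTT is updated, TimeoutInterval is recomputed with the formula above.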
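Putting the three quantities together, a minimal estimator might look like the sketch below. It assumes the standard formulas EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT, DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|, and TimeoutInterval = EstimatedRTT + 4 · DevRTT. Two details are borrowed from RFC 6298 rather than from these notes: the first sample seeds DevRTT with half its value, and DevRTT is updated before EstimatedRTT. The class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    """Tracks EstimatedRTT and DevRTT and derives TimeoutInterval from them."""

    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2  # assumption: RFC 6298 style initialization

    def update(self, sample_rtt: float) -> None:
        """Fold one new SampleRTT into the running estimates."""
        # DevRTT first, so it measures the deviation from the previous estimate.
        deviation = abs(sample_rtt - self.estimated_rtt)
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * deviation
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt

    @property
    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
print(est.timeout_interval)
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

Samples from retransmitted segments should not be fed to `update`, in line with the rule that such measurements are ignored.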

Reliable Data Transfer

TCP creates a reliable data transfer service on top of IP’s unreliable best-effort service.

TCP’s reliable data transfer service ensures that the data stream that a process reads out of its TCP receive buffer is uncorrupted, without gaps, without duplication, and in sequence; that is, the byte stream is exactly the same byte stream that was sent by the end system on the other side of the connection.

In our earlier development of reliable data transfer techniques, it was conceptually easiest to assume that an individual timer is associated with each transmitted but not yet acknowledged segment. With one timer per segment, however, timer management can require considerable overhead. Thus, the recommended TCP timer management procedures [RFC 6298] use only a single retransmission timer, even if there are multiple transmitted but not yet acknowledged segments.

We first present a highly simplified description of a TCP sender that uses only timeouts to recover from lost segments; we then present a more complete description that uses duplicate acknowledgments in addition to timeouts. In the ensuing discussion, we suppose that data is being sent in only one direction, from Host A to Host B, and that Host A is sending a large file.

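The highly simplified TCP sender works as follows.

It handles three events.

Event 1

TCP receives data from the application and encapsulates it in a segment. Each segment carries a sequence number, which is the byte-stream number of the first byte in the segment.
If the timer is not already running, TCP starts it; think of the timer as being associated with the oldest unacknowledged segment.
TCP passes the segment to IP.
TCP advances the sequence number to be used for the next segment by the length of the data.

Event 2

TCP responds to the timeout event
It retransmits the not-yet-acknowledged segment with the smallest sequence number.
It restarts the timer.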

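Event 3

The third major event that must be handled by the TCP sender is the arrival of an acknowledgment segment (ACK) from the receiver (more specifically, a segment containing a valid ACK field value).
On the occurrence of this event, TCP compares the ACK value y with its variable SendBase. The TCP state variable SendBase is the sequence number of the oldest unacknowledged byte. (Thus SendBase–1 is the sequence number of the last byte that is known to have been received correctly and in order at the receiver.)
An ACK value of y confirms that every byte before y has been received correctly. So if y > SendBase, TCP sets SendBase to y: byte y is now the oldest byte not yet acknowledged.
If any segments are still unacknowledged, TCP restarts the timer; it then goes back to waiting for the next event.
Note the last step: It also restarts the timer if there currently are any not-yet-acknowledged segments. Why? Without the restart, the timer would still be counting from the moment it was started for the previously oldest unacknowledged segment. That segment has now been acknowledged, and the segments still outstanding were sent later, so a timer with this early start time would expire too soon for them. In the worst case, an outstanding segment was sent just before the ACK arrived, and the timer would already have run for nearly a full RTT on its behalf. Restarting the timer makes it count from roughly the right starting point for the segments still in flight, which is why the restart is necessary.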
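The three events can be sketched as a toy Python class. This is an illustration under simplifying assumptions, not a real TCP sender: the timer is modeled as a boolean flag, nothing is actually transmitted, and names such as `SimpleTcpSender` and `passed_to_ip` are invented here.

```python
class SimpleTcpSender:
    """Toy model of the three-event sender; the timer is only a flag here."""

    def __init__(self, initial_seq: int = 0):
        self.send_base = initial_seq      # oldest unacknowledged byte
        self.next_seq_num = initial_seq   # sequence number for the next segment
        self.unacked = {}                 # seq -> data, not yet acknowledged
        self.timer_running = False
        self.passed_to_ip = []            # log of (seq, data) handed to IP

    def on_app_data(self, data: bytes) -> None:
        """Event 1: data received from the application."""
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.passed_to_ip.append((seq, data))
        self.timer_running = True         # start the timer; no effect if already running
        self.next_seq_num += len(data)

    def on_timeout(self) -> None:
        """Event 2: retransmit the oldest unacknowledged segment."""
        if self.unacked:
            seq = min(self.unacked)
            self.passed_to_ip.append((seq, self.unacked[seq]))
            self.timer_running = True     # restart the timer

    def on_ack(self, y: int) -> None:
        """Event 3: a cumulative ACK with value y arrived."""
        if y > self.send_base:
            self.send_base = y
            # Drop every segment that lies entirely before byte y.
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            # Restart the timer if anything is still outstanding, otherwise stop it.
            self.timer_running = bool(self.unacked)


sender = SimpleTcpSender(initial_seq=92)
sender.on_app_data(b"a" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"b" * 20)    # Seq=100, 20 bytes
sender.on_ack(120)               # ACK 100 was lost, but ACK 120 arrives
print(sender.send_base, sender.unacked, sender.timer_running)
```

In the usage example a single cumulative ACK 120 clears both outstanding segments, so nothing is retransmitted and the timer stops.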

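Examples

A segment is lost, which triggers a timeout and a retransmission.

A lost ACK triggers a timeout and a retransmission in the same way.

Here the ACK that comes back is ACK 120. Because acknowledgments are cumulative, the sender knows that everything before byte 120 has been received even though it never saw ACK 100, so it retransmits nothing.

The first segment (Seq=92) times out, so the sender retransmits it and restarts the timer. The second segment therefore does not time out: ACK 120 arrives before the new timeout expires. When the retransmitted Seq=92 segment reaches the receiver, the receiver again answers with ACK 120.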

Doubling the Timeout Interval

Whenever the timeout event occurs, TCP retransmits the not-yet-acknowledged segment with the smallest sequence number. But each time TCP retransmits, it sets the next timeout interval to twice the previous value, rather than deriving it from the last EstimatedRTT and DevRTT. This modification provides a limited form of congestion control. In times of congestion, if the sources continue to retransmit packets persistently, the congestion may get worse.

A timeout both retransmits the segment and restarts the timer, and the restarted timer runs for twice the previous interval.
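A short sketch of the doubling rule follows. The helper name `backoff_intervals` is invented for this example, and real implementations also cap the interval at some maximum, which is omitted here.

```python
def backoff_intervals(initial: float, retransmissions: int):
    """Timeout intervals used for successive retransmissions of the same segment."""
    return [initial * 2 ** k for k in range(retransmissions)]


print(backoff_intervals(0.75, 4))  # [0.75, 1.5, 3.0, 6.0]
```

Once new data is acknowledged or new data arrives from the application, the interval is again derived from EstimatedRTT and DevRTT.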

Fast Retransmit

Timeout-triggered retransmission can make delivery slow, because the timeout period is often relatively long.

When a segment is lost, this long timeout period forces the sender to delay resending the lost packet, thereby increasing the end-to-end delay. Fortunately, the sender can often detect packet loss well before the timeout event occurs by noting so-called duplicate ACKs. To understand the sender’s response to a duplicate ACK, we must look at why the receiver sends a duplicate ACK in the first place.

The key rule is this:

When a TCP receiver gets a segment whose sequence number is larger than the next, expected, in-order sequence number, it detects a gap in the data stream, that is, a missing segment. This gap could be the result of lost or reordered segments within the network. Since TCP does not use explicit negative acknowledgments (NAKs), the receiver simply reacknowledges the last in-order byte of data it has received, that is, it sends a duplicate ACK. (Note that Table 3.2 allows for the case that the receiver does not discard out-of-order segments.) If the TCP sender receives three duplicate ACKs for the same data, it takes this as an indication that the segment following the segment that has been ACKed three times has been lost. In the case that three duplicate ACKs are received, the TCP sender performs a fast retransmit [RFC 5681], retransmitting the missing segment before that segment’s timer expires.

Event 3 of the simplified sender can then be replaced with the following logic:

If y > SendBase, set SendBase to y and, if any segments are still unacknowledged, restart the timer.

Otherwise the ACK is a duplicate for already-acknowledged data (y is no greater than SendBase): count it, and when the third duplicate ACK for the same y arrives, conclude that the segment starting at y was lost and fast-retransmit it.
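The revised ACK handling can be sketched as follows. This is a simplified model: it keeps a single duplicate counter (duplicates always refer to the current SendBase in this one-directional setting), it returns a string describing the action instead of sending anything, and the names `FastRetransmitSender` and `has_unacked` are assumptions of this example.

```python
class FastRetransmitSender:
    """ACK handling with a duplicate-ACK counter (actions are returned as strings)."""

    def __init__(self, send_base: int):
        self.send_base = send_base
        self.dup_acks = 0
        self.retransmitted = []   # sequence numbers resent by fast retransmit

    def on_ack(self, y: int, has_unacked: bool = True) -> str:
        if y > self.send_base:
            # New data acknowledged: slide SendBase and reset the duplicate count.
            self.send_base = y
            self.dup_acks = 0
            return "restart timer" if has_unacked else "stop timer"
        # Duplicate ACK for data that was already acknowledged.
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.retransmitted.append(y)   # resend the segment starting at y
            return "fast retransmit"
        return "duplicate"


sender = FastRetransmitSender(send_base=92)
for _ in range(4):
    print(sender.on_ack(100))
```

The first ACK 100 acknowledges new data; the next three are duplicates, and the third duplicate triggers the fast retransmit of the segment starting at byte 100.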

Go-Back-N or Selective Repeat?

Overall, the TCP error-recovery mechanism in widespread use (without the selective acknowledgment proposed in RFC 2018) is best seen as a hybrid of the GBN and SR protocols.

Flow Control

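Recall that the hosts on each side of a TCP connection set aside a receive buffer. When TCP receives bytes that are correct and in sequence, it places them in this buffer. The associated application reads data from the buffer, but not necessarily at the instant the data arrives. If the application is slow to read, the sender can easily overflow the buffer by sending too much data too quickly.

Flow control solves this problem by matching the rate at which the sender sends to the rate at which the receiving application reads.

The discussion of flow control below assumes that the receiver discards out-of-order segments.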

receive window

TCP provides flow control by having the sender maintain a variable called the receive window, which tells the sender how much free buffer space is available at the receiver.

Let’s investigate the receive window in the context of a file transfer. Suppose that Host A is sending a large file to Host B over a TCP connection. Host B allocates a receive buffer to this connection; denote its size by RcvBuffer. From time to time, the application process in Host B reads from the buffer. Define the following variables:

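LastByteRead: the number of the last byte in the data stream read from the buffer by the application process in B

LastByteRcvd: the number of the last byte in the data stream that has arrived from the network and been placed in the receive buffer at B

The buffer must not overflow, so the number of bytes held in it is bounded: LastByteRcvd − LastByteRead ≤ RcvBuffer
The spare room in the buffer is the receive window: rwnd = RcvBuffer − [LastByteRcvd − LastByteRead]

When B (the receiver) sends a segment to A, it places the current value of rwnd in the segment's receive window field. Initially, rwnd = RcvBuffer.

A (the sender) in turn keeps track of two variables, LastByteSent and LastByteAcked. Note that LastByteAcked is the number of the last byte that has been acknowledged, not the acknowledgment number (which is one higher).

LastByteSent − LastByteAcked is the amount of data A has sent but not yet had acknowledged. By keeping this amount ≤ rwnd, A ensures that it does not overflow the receive buffer at B.

When B's rwnd is 0, A does not stop entirely: it keeps sending segments carrying one data byte. B acknowledges these segments, and each ACK reports the current value of rwnd. Once the application at B has read data from the buffer, rwnd becomes greater than 0, and when A learns this it can resume sending. If A simply stopped sending once rwnd reached 0, B would have nothing to acknowledge and no reason to send a segment, A would never find out that rwnd had grown again, and the transfer would be stuck.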
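The bookkeeping on both sides reduces to two small computations, sketched below. The function names `receive_window` and `sender_may_send` are invented for this example; the variables mirror RcvBuffer, LastByteRcvd, LastByteRead, LastByteSent, and LastByteAcked, with rwnd = RcvBuffer − (LastByteRcvd − LastByteRead) assumed as the window formula.

```python
def receive_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Receiver side: spare room in the receive buffer (rwnd)."""
    buffered = last_byte_rcvd - last_byte_read
    if buffered > rcv_buffer:
        raise ValueError("receive buffer overflow")
    return rcv_buffer - buffered


def sender_may_send(n_bytes: int, last_byte_sent: int, last_byte_acked: int, rwnd: int) -> bool:
    """Sender side: would sending n_bytes more keep unacknowledged data within rwnd?"""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + n_bytes <= rwnd


rwnd = receive_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                      # 2096 bytes of spare room
print(sender_may_send(1000, 5000, 4000, rwnd))   # True: 2000 bytes in flight
print(sender_may_send(1200, 5000, 4000, rwnd))   # False: 2200 bytes would exceed rwnd
```

When rwnd drops to 0, the one-byte probe segments described above are the exception to this check.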

UDP, in contrast, provides no flow control: segments can overflow the receive buffer and be lost, and UDP takes no action to recover them.

TCP Connection Management

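How a TCP connection is established

Suppose a client wants to set up a TCP connection with a server. The client application process first informs the client TCP that it wants to connect to a process on the server. The client TCP then proceeds as follows:

The client TCP sends a special segment to the server TCP. This segment contains no application-layer data, and its SYN flag is set to 1, so it is called a SYN segment. For security, the client chooses a random initial sequence number (client_isn), the first sequence number of its byte stream, and puts it in the sequence number field. The segment is encapsulated in an IP datagram and sent to the server.

Once the SYN segment arrives, the server allocates memory for the connection's receive buffer and sends a connection-granted segment back to the client. This segment also contains no application-layer data. Its SYN flag is set to 1, its acknowledgment number is client_isn+1, and its sequence number field carries the server's own initial sequence number, server_isn. In effect it says: "I received your SYN packet and your initial sequence number and am ready to start a connection; my own initial sequence number is server_isn." The connection-granted segment is also called a SYNACK segment.

On receiving the SYNACK segment, the client likewise allocates memory for its receive buffer. It then sends a segment acknowledging the SYNACK, with acknowledgment number server_isn+1 and SYN set to 0; this segment may carry client-to-server data in its payload.

Once these three steps are complete, the client and server can send data to each other. Together the three steps are called the three-way handshake.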
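The sequence and acknowledgment numbers of the three segments can be written down directly. The sketch below represents each segment as a plain dictionary; the function `three_way_handshake` and the dictionary keys are assumptions made for illustration, and the random ISNs stand in for whatever algorithm a real stack uses.

```python
import secrets

SEQ_SPACE = 2 ** 32  # sequence numbers wrap around at 32 bits


def three_way_handshake(client_isn=None, server_isn=None):
    """Return the SYN, SYNACK, and ACK segments as dictionaries."""
    if client_isn is None:
        client_isn = secrets.randbits(32)   # random initial sequence number
    if server_isn is None:
        server_isn = secrets.randbits(32)
    syn = {"SYN": 1, "seq": client_isn}
    synack = {"SYN": 1, "ACK": 1, "seq": server_isn,
              "ack": (client_isn + 1) % SEQ_SPACE}
    ack = {"SYN": 0, "ACK": 1, "seq": (client_isn + 1) % SEQ_SPACE,
           "ack": (server_isn + 1) % SEQ_SPACE}
    return [syn, synack, ack]


for segment in three_way_handshake(client_isn=100, server_isn=300):
    print(segment)
```

With client_isn=100 and server_isn=300, the SYNACK acknowledges 101 and the final ACK acknowledges 301.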

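Closing a TCP connection

Suppose the client decides to close the connection. The client sends a segment with FIN=1 and sequence number x, indicating that it has no more data to send to the server. The connection is now half-closed.

The server replies with a segment that has FIN=0 and ack=x+1; FIN=0 indicates that the server may still have data to send.

When the server has finished sending its data, it sends a segment with FIN=1, indicating that it too has nothing more to send; the server closes its side once this FIN has been acknowledged.

The client acknowledges the server's FIN with an ACK whose sequence number is x+1, waits for a timeout period, and then closes its side of the connection.

Once the connection is closed, the memory that both sides allocated for it is released.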
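The four segments of the close can be listed the same way. In this sketch x is the sequence number of the client's FIN and y is the sequence number of the server's FIN (whatever value the server has reached after sending its remaining data); the function `close_trace` and the dictionary layout are invented for this example.

```python
def close_trace(x: int, y: int):
    """Return (direction, segment) pairs for a client-initiated close."""
    return [
        ("client->server", {"FIN": 1, "seq": x}),                            # client is done sending
        ("server->client", {"FIN": 0, "ACK": 1, "ack": x + 1}),              # server may still send data
        ("server->client", {"FIN": 1, "seq": y}),                            # server is done sending
        ("client->server", {"FIN": 0, "ACK": 1, "seq": x + 1, "ack": y + 1}),  # final ACK, then timed wait
    ]


for direction, segment in close_trace(x=1000, y=5000):
    print(direction, segment)
```

Each FIN consumes one sequence number, which is why the acknowledgments are x+1 and y+1.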

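The steps above assume that both the client and the server are prepared to communicate. But what if the server receives a TCP segment addressed to port 80 while no web server is running on port 80?

The server then sends the client an error indication: a TCP segment with the RST flag set to 1. In effect it says: "I don't have a socket for that segment. Please do not resend the segment."

If the client receives this TCP RST, the segment reached the target host but no application is running on TCP port 80 there. If the client receives nothing at all, the SYN segment was most likely blocked by a firewall.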

SYN flood attack

We’ve seen in our discussion of TCP’s three-way handshake that a server allocates and initializes connection variables and buffers in response to a received SYN. The server then sends a SYNACK in response, and awaits an ACK segment from the client. If the client does not send an ACK to complete the third step of this 3-way handshake, eventually (often after a minute or more) the server will terminate the half-open connection and reclaim the allocated resources. This TCP connection management protocol sets the stage for a classic Denial of Service (DoS) attack known as the SYN flood attack.

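The attacker sends a large number of SYN segments and never completes the third step of the handshake.

The server allocates buffers for each of these SYN segments until it has no resources left to serve legitimate clients.

Defense: SYN cookies

When the server receives a SYN segment, it does not allocate any state yet. Instead it creates its initial TCP sequence number as a function of the SYN segment's source address, source port, destination address, and destination port (in practice, a hash that also involves a secret known only to the server). This sequence number is the cookie.

The server then sends a SYNACK carrying the cookie. A legitimate client answers with an ACK; the server recomputes the cookie from that segment's source address, source port, destination address, and destination port and checks that it matches the sequence number being acknowledged. Only then does the server create the connection and its socket.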
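The cookie check can be sketched with a keyed hash from Python's standard library. This is a simplified illustration, not the scheme any particular operating system uses: real SYN cookies also encode a time counter and an MSS hint, and the names `make_cookie`, `ack_is_valid`, and `SECRET` are invented here.

```python
import hashlib
import hmac
import secrets

SECRET = secrets.token_bytes(16)  # known only to the server


def make_cookie(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                secret: bytes = SECRET) -> int:
    """Derive a 32-bit initial sequence number from the connection 4-tuple."""
    message = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def ack_is_valid(ack_number: int, src_ip: str, src_port: int,
                 dst_ip: str, dst_port: int, secret: bytes = SECRET) -> bool:
    """A genuine ACK acknowledges cookie + 1; the server recomputes the cookie to check."""
    expected = (make_cookie(src_ip, src_port, dst_ip, dst_port, secret) + 1) % 2 ** 32
    return ack_number == expected


cookie = make_cookie("10.0.0.1", 50000, "10.0.0.2", 80)
print(ack_is_valid((cookie + 1) % 2 ** 32, "10.0.0.1", 50000, "10.0.0.2", 80))  # True
print(ack_is_valid((cookie + 2) % 2 ** 32, "10.0.0.1", 50000, "10.0.0.2", 80))  # False
```

Because the cookie can be recomputed from the ACK itself, the server needs no per-connection memory until the handshake completes.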