From 43c26a1a45938624fb9301e8bf7dfabbed293619 Mon Sep 17 00:00:00 2001 From: Davide Caratti Date: Thu, 18 May 2017 15:44:41 +0200 Subject: net: more accurate checksumming in validate_xmit_skb() skb_csum_hwoffload_help() uses netdev features and skb->csum_not_inet to determine if skb needs software computation of Internet Checksum or crc32c (or nothing, if this computation can be done by the hardware). Use it in place of skb_checksum_help() in validate_xmit_skb() to avoid corruption of non-GSO SCTP packets having skb->ip_summed equal to CHECKSUM_PARTIAL. While at it, remove references to skb_csum_off_chk* functions, since they are not present anymore in Linux _ see commit cf53b1da73bd ("Revert "net: Add driver helper functions to determine checksum offloadability""). Signed-off-by: Davide Caratti Signed-off-by: David S. Miller --- Documentation/networking/checksum-offloads.txt | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt index 56e36861245f..d52d191bbb0c 100644 --- a/Documentation/networking/checksum-offloads.txt +++ b/Documentation/networking/checksum-offloads.txt @@ -35,6 +35,9 @@ This interface only allows a single checksum to be offloaded. Where encapsulation is used, the packet may have multiple checksum fields in different header layers, and the rest will have to be handled by another mechanism such as LCO or RCO. +CRC32c can also be offloaded using this interface, by means of filling + skb->csum_start and skb->csum_offset as described above, and setting + skb->csum_not_inet: see skbuff.h comment (section 'D') for more details. No offloading of the IP header checksum is performed; it is always done in software. This is OK because when we build the IP header, we obviously have it in cache, so summing it isn't expensive. It's also rather short. @@ -49,9 +52,9 @@ A driver declares its offload capabilities in netdev->hw_features; see and csum_offset given in the SKB; if it tries to deduce these itself in hardware (as some NICs do) the driver should check that the values in the SKB match those which the hardware will deduce, and if not, fall back to - checksumming in software instead (with skb_checksum_help or one of the - skb_csum_off_chk* functions as mentioned in include/linux/skbuff.h). This - is a pain, but that's what you get when hardware tries to be clever. + checksumming in software instead (with skb_csum_hwoffload_help() or one of + the skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in + include/linux/skbuff.h). The stack should, for the most part, assume that checksum offload is supported by the underlying device. The only place that should check is @@ -60,7 +63,7 @@ The stack should, for the most part, assume that checksum offload is may include other offloads besides TX Checksum Offload) and, if they are not supported or enabled on the device (determined by netdev->features), performs the corresponding offload in software. In the case of TX - Checksum Offload, that means calling skb_checksum_help(skb). + Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features). LCO: Local Checksum Offload -- cgit From aad9c8c470f2a8321a99eb053630ce0e199558d6 Mon Sep 17 00:00:00 2001 From: Miroslav Lichvar Date: Fri, 19 May 2017 17:52:38 +0200 Subject: net: add new control message for incoming HW-timestamped packets Add SOF_TIMESTAMPING_OPT_PKTINFO option to request a new control message for incoming packets with hardware timestamps. It contains the index of the real interface which received the packet and the length of the packet at layer 2. The index is useful with bonding, bridges and other interfaces, where IP_PKTINFO doesn't allow applications to determine which PHC made the timestamp. With the L2 length (and link speed) it is possible to transpose preamble timestamps to trailer timestamps, which are used in the NTP protocol. While this information could be provided by two new socket options independently from timestamping, it doesn't look like they would be very useful. With this option any performance impact is limited to hardware timestamping. Use dev_get_by_napi_id() to get the device and its index. On kernels with disabled CONFIG_NET_RX_BUSY_POLL or drivers not using NAPI, a zero index will be returned in the control message. CC: Richard Cochran Acked-by: Willem de Bruijn Signed-off-by: Miroslav Lichvar Signed-off-by: David S. Miller --- Documentation/networking/timestamping.txt | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index 96f50694a748..ce11e3a08c0d 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -193,6 +193,16 @@ SOF_TIMESTAMPING_OPT_STATS: the transmit timestamps, such as how long a certain block of data was limited by peer's receiver window. +SOF_TIMESTAMPING_OPT_PKTINFO: + + Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming + packets with hardware timestamps. The message contains struct + scm_ts_pktinfo, which supplies the index of the real interface which + received the packet and its length at layer 2. A valid (non-zero) + interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is + enabled and the driver is using NAPI. The struct contains also two + other fields, but they are reserved and undefined. + New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate regardless of the setting of sysctl net.core.tstamp_allow_data. -- cgit From 67953d47bb24e63d209705f745a0de411a4c6578 Mon Sep 17 00:00:00 2001 From: Miroslav Lichvar Date: Fri, 19 May 2017 17:52:39 +0200 Subject: net: fix documentation of struct scm_timestamping The scm_timestamping struct may return multiple non-zero fields, e.g. when both software and hardware RX timestamping is enabled, or when the SO_TIMESTAMP(NS) option is combined with SCM_TIMESTAMPING and a false software timestamp is generated in the recvmsg() call in order to always return a SCM_TIMESTAMP(NS) message. CC: Richard Cochran CC: Willem de Bruijn Signed-off-by: Miroslav Lichvar Acked-by: Willem de Bruijn Signed-off-by: David S. Miller --- Documentation/networking/timestamping.txt | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index ce11e3a08c0d..50eb0e554778 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -322,7 +322,7 @@ struct scm_timestamping { }; The structure can return up to three timestamps. This is a legacy -feature. Only one field is non-zero at any time. Most timestamps +feature. At least one field is non-zero at any time. Most timestamps are passed in ts[0]. Hardware timestamps are passed in ts[2]. ts[1] used to hold hardware timestamps converted to system time. @@ -331,6 +331,12 @@ a HW PTP clock source, to allow time conversion in userspace and optionally synchronize system time with a userspace PTP stack such as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt. +Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled +together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false +software timestamp will be generated in the recvmsg() call and passed +in ts[0] when a real software timestamp is missing. This happens also +on hardware transmit timestamps. + 2.1.1 Transmit timestamps with MSG_ERRQUEUE For transmit timestamps the outgoing packet is looped back to the -- cgit From b50a5c70ffa4fd6b6da324ab54c84adf48fb17d9 Mon Sep 17 00:00:00 2001 From: Miroslav Lichvar Date: Fri, 19 May 2017 17:52:40 +0200 Subject: net: allow simultaneous SW and HW transmit timestamping Add SOF_TIMESTAMPING_OPT_TX_SWHW option to allow an outgoing packet to be looped to the socket's error queue with a software timestamp even when a hardware transmit timestamp is expected to be provided by the driver. Applications using this option will receive two separate messages from the error queue, one with a software timestamp and the other with a hardware timestamp. As the hardware timestamp is saved to the shared skb info, which may happen before the first message with software timestamp is received by the application, the hardware timestamp is copied to the SCM_TIMESTAMPING control message only when the skb has no software timestamp or it is an incoming packet. While changing sw_tx_timestamp(), inline it in skb_tx_timestamp() as there are no other users. CC: Richard Cochran CC: Willem de Bruijn Signed-off-by: Miroslav Lichvar Acked-by: Willem de Bruijn Signed-off-by: David S. Miller --- Documentation/networking/timestamping.txt | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index 50eb0e554778..196ba17cc344 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -203,6 +203,14 @@ SOF_TIMESTAMPING_OPT_PKTINFO: enabled and the driver is using NAPI. The struct contains also two other fields, but they are reserved and undefined. +SOF_TIMESTAMPING_OPT_TX_SWHW: + + Request both hardware and software timestamps for outgoing packets + when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE + are enabled at the same time. If both timestamps are generated, + two separate messages will be looped to the socket's error queue, + each containing just one timestamp. + New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate regardless of the setting of sysctl net.core.tstamp_allow_data. -- cgit From 85cfa71764cab95228e0abebdd77e0382c3c34be Mon Sep 17 00:00:00 2001 From: Jesse Brandeburg Date: Thu, 11 May 2017 11:23:21 -0700 Subject: i40evf: update i40evf.txt with new content The addition of the AVF and virtchnl code to the i40evf driver means we should update the i40evf.txt file with the most up to date information. It seems this file hasn't been updated in a while, so the changes cover a little more than just AVF, but it's all only in the i40evf.txt. Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- Documentation/networking/i40evf.txt | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/i40evf.txt b/Documentation/networking/i40evf.txt index 21e41271af79..e9b3035b95d0 100644 --- a/Documentation/networking/i40evf.txt +++ b/Documentation/networking/i40evf.txt @@ -1,8 +1,8 @@ Linux* Base Driver for Intel(R) Network Connection ================================================== -Intel XL710 X710 Virtual Function Linux driver. -Copyright(c) 2013 Intel Corporation. +Intel Ethernet Adaptive Virtual Function Linux driver. +Copyright(c) 2013-2017 Intel Corporation. Contents ======== @@ -11,19 +11,26 @@ Contents - Known Issues/Troubleshooting - Support -This file describes the i40evf Linux* Base Driver for the Intel(R) XL710 -X710 Virtual Function. +This file describes the i40evf Linux* Base Driver. -The i40evf driver supports XL710 and X710 virtual function devices that -can only be activated on kernels with CONFIG_PCI_IOV enabled. +The i40evf driver supports the below mentioned virtual function +devices and can only be activated on kernels running the i40e or +newer Physical Function (PF) driver compiled with CONFIG_PCI_IOV. +The i40evf driver requires CONFIG_PCI_MSI to be enabled. The guest OS loading the i40evf driver must support MSI-X interrupts. +Supported Hardware +================== +Intel XL710 X710 Virtual Function +Intel Ethernet Adaptive Virtual Function +Intel X722 Virtual Function + Identifying Your Adapter ======================== -For more information on how to identify your adapter, go to the Adapter & -Driver ID Guide at: +For more information on how to identify your adapter, go to the +Adapter & Driver ID Guide at: http://support.intel.com/support/go/network/adapter/idguide.htm -- cgit From 28036f44851e2515aa91b547b45cefddcac52ff6 Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 5 Jun 2017 14:30:49 +0100 Subject: rxrpc: Permit multiple service binding Permit bind() to be called on an AF_RXRPC socket more than once (currently maximum twice) to bind multiple listening services to it. There are some restrictions: (1) All bind() calls involved must have a non-zero service ID. (2) The service IDs must all be different. (3) The rest of the address (notably the transport part) must be the same in all (a single UDP socket is shared). (4) This must be done before listen() or sendmsg() is called. This allows someone to connect to the service socket with different service IDs and lays the foundation for service upgrading. The service ID used by an incoming call can be extracted from the msg_name returned by recvmsg(). Signed-off-by: David Howells --- Documentation/networking/rxrpc.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 1b63bbc6b94f..b7115ec55e04 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -600,6 +600,10 @@ A server would be set up to accept operations in the following manner: }; bind(server, &srx, sizeof(srx)); + More than one service ID may be bound to a socket, provided the transport + parameters are the same. The limit is currently two. To do this, bind() + should be called twice. + (3) The server is then set to listen out for incoming calls: listen(server, 100); -- cgit From 4722974d90e06d0164ca1b73a6b34cec6bdb64ad Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 5 Jun 2017 14:30:49 +0100 Subject: rxrpc: Implement service upgrade Implement AuriStor's service upgrade facility. There are three problems that this is meant to deal with: (1) Various of the standard AFS RPC calls have IPv4 addresses in their requests and/or replies - but there's no room for including IPv6 addresses. (2) Definition of IPv6-specific RPC operations in the standard operation sets has not yet been achieved. (3) One could envision the creation a new service on the same port that as the original service. The new service could implement improved operations - and the client could try this first, falling back to the original service if it's not there. Unfortunately, certain servers ignore packets addressed to a service they don't implement and don't respond in any way - not even with an ABORT. This means that the client must then wait for the call timeout to occur. What service upgrade does is to see if the connection is marked as being 'upgradeable' and if so, change the service ID in the server and thus the request and reply formats. Note that the upgrade isn't mandatory - a server that supports only the original call set will ignore the upgrade request. In the protocol, the procedure is then as follows: (1) To request an upgrade, the first DATA packet in a new connection must have the userStatus set to 1 (this is normally 0). The userStatus value is normally ignored by the server. (2) If the server doesn't support upgrading, the reply packets will contain the same service ID as for the first request packet. (3) If the server does support upgrading, all future reply packets on that connection will contain the new service ID and the new service ID will be applied to *all* further calls on that connection as well. (4) The RPC op used to probe the upgrade must take the same request data as the shadow call in the upgrade set (but may return a different reply). GetCapability RPC ops were added to all standard sets for just this purpose. Ops where the request formats differ cannot be used for probing. (5) The client must wait for completion of the probe before sending any further RPC ops to the same destination. It should then use the service ID that recvmsg() reported back in all future calls. (6) The shadow service must have call definitions for all the operation IDs defined by the original service. To support service upgrading, a server should: (1) Call bind() twice on its AF_RXRPC socket before calling listen(). Each bind() should supply a different service ID, but the transport addresses must be the same. This allows the server to receive requests with either service ID. (2) Enable automatic upgrading by calling setsockopt(), specifying RXRPC_UPGRADEABLE_SERVICE and passing in a two-member array of unsigned shorts as the argument: unsigned short optval[2]; This specifies a pair of service IDs. They must be different and must match the service IDs bound to the socket. Member 0 is the service ID to upgrade from and member 1 is the service ID to upgrade to. Signed-off-by: David Howells --- Documentation/networking/rxrpc.txt | 34 ++++++++++++++++++++++++++-------- 1 file changed, 26 insertions(+), 8 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index b7115ec55e04..2a1662760450 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -433,6 +433,13 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: Encrypted checksum plus entire packet padded and encrypted, including actual packet length. + (*) RXRPC_UPGRADEABLE_SERVICE + + This is used to indicate that a service socket with two bindings may + upgrade one bound service to the other if requested by the client. optval + must point to an array of two unsigned short ints. The first is the + service ID to upgrade from and the second the service ID to upgrade to. + ======== SECURITY @@ -588,7 +595,7 @@ A server would be set up to accept operations in the following manner: The keyring can be manipulated after it has been given to the socket. This permits the server to add more keys, replace keys, etc. whilst it is live. - (2) A local address must then be bound: + (3) A local address must then be bound: struct sockaddr_rxrpc srx = { .srx_family = AF_RXRPC, @@ -604,11 +611,22 @@ A server would be set up to accept operations in the following manner: parameters are the same. The limit is currently two. To do this, bind() should be called twice. - (3) The server is then set to listen out for incoming calls: + (4) If service upgrading is required, first two service IDs must have been + bound and then the following option must be set: + + unsigned short service_ids[2] = { from_ID, to_ID }; + setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE, + service_ids, sizeof(service_ids)); + + This will automatically upgrade connections on service from_ID to service + to_ID if they request it. This will be reflected in msg_name obtained + through recvmsg() when the request data is delivered to userspace. + + (5) The server is then set to listen out for incoming calls: listen(server, 100); - (4) The kernel notifies the server of pending incoming connections by sending + (6) The kernel notifies the server of pending incoming connections by sending it a message for each. This is received with recvmsg() on the server socket. It has no data, and has a single dataless control message attached: @@ -620,13 +638,13 @@ A server would be set up to accept operations in the following manner: the time it is accepted - in which case the first call still on the queue will be accepted. - (5) The server then accepts the new call by issuing a sendmsg() with two + (7) The server then accepts the new call by issuing a sendmsg() with two pieces of control data and no actual data: RXRPC_ACCEPT - indicate connection acceptance RXRPC_USER_CALL_ID - specify user ID for this call - (6) The first request data packet will then be posted to the server socket for + (8) The first request data packet will then be posted to the server socket for recvmsg() to pick up. At that point, the RxRPC address for the call can be read from the address fields in the msghdr struct. @@ -638,7 +656,7 @@ A server would be set up to accept operations in the following manner: RXRPC_USER_CALL_ID - specifies the user ID for this call - (8) The reply data should then be posted to the server socket using a series + (9) The reply data should then be posted to the server socket using a series of sendmsg() calls, each with the following control messages attached: RXRPC_USER_CALL_ID - specifies the user ID for this call @@ -646,7 +664,7 @@ A server would be set up to accept operations in the following manner: MSG_MORE should be set in msghdr::msg_flags on all but the last message for a particular call. - (9) The final ACK from the client will be posted for retrieval by recvmsg() +(10) The final ACK from the client will be posted for retrieval by recvmsg() when it is received. It will take the form of a dataless message with two control messages attached: @@ -656,7 +674,7 @@ A server would be set up to accept operations in the following manner: MSG_EOR will be flagged to indicate that this is the final message for this call. -(10) Up to the point the final packet of reply data is sent, the call can be +(11) Up to the point the final packet of reply data is sent, the call can be aborted by calling sendmsg() with a dataless message with the following control messages attached: -- cgit From 4e255721d1575a766ada06dc7eb03acdcd34eaaf Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 5 Jun 2017 14:30:49 +0100 Subject: rxrpc: Add service upgrade support for client connections Make it possible for a client to use AuriStor's service upgrade facility. The client does this by adding an RXRPC_UPGRADE_SERVICE control message to the first sendmsg() of a call. This takes no parameters. When recvmsg() starts returning data from the call, the service ID field in the returned msg_name will reflect the result of the upgrade attempt. If the upgrade was ignored, srx_service will match what was set in the sendmsg(); if the upgrade happened the srx_service will be altered to indicate the service the server upgraded to. Note that: (1) The choice of upgrade service is up to the server (2) Further client calls to the same server that would share a connection are blocked if an upgrade probe is in progress. (3) This should only be used to probe the service. Clients should then use the returned service ID in all subsequent communications with that server (and not set the upgrade). Note that the kernel will not retain this information should the connection expire from its cache. (4) If a server that supports upgrading is replaced by one that doesn't, whilst a connection is live, and if the replacement is running, say, OpenAFS 1.6.4 or older or an older IBM AFS, then the replacement server will not respond to packets sent to the upgraded connection. At this point, calls will time out and the server must be reprobed. Signed-off-by: David Howells --- Documentation/networking/rxrpc.txt | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 2a1662760450..18078e630a63 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -325,6 +325,8 @@ calls, to invoke certain actions and to report certain conditions. These are: RXRPC_LOCAL_ERROR -rt error num Local error encountered RXRPC_NEW_CALL -r- n/a New call received RXRPC_ACCEPT s-- n/a Accept new call + RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call + RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) @@ -387,6 +389,23 @@ calls, to invoke certain actions and to report certain conditions. These are: return error ENODATA. If the user ID is already in use by another call, then error EBADSLT will be returned. + (*) RXRPC_EXCLUSIVE_CALL + + This is used to indicate that a client call should be made on a one-off + connection. The connection is discarded once the call has terminated. + + (*) RXRPC_UPGRADE_SERVICE + + This is used to make a client call to probe if the specified service ID + may be upgraded by the server. The caller must check msg_name returned to + recvmsg() for the service ID actually in use. The operation probed must + be one that takes the same arguments in both services. + + Once this has been used to establish the upgrade capability (or lack + thereof) of the server, the service ID returned should be used for all + future communication to that server and RXRPC_UPGRADE_SERVICE should no + longer be set. + ============== SOCKET OPTIONS @@ -566,6 +585,17 @@ A client would issue an operation by: buffer instead, and MSG_EOR will be flagged to indicate the end of that call. +A client may ask for a service ID it knows and ask that this be upgraded to a +better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the +first sendmsg() of a call. The client should then check srx_service in the +msg_name filled in by recvmsg() when collecting the result. srx_service will +hold the same value as given to sendmsg() if the upgrade request was ignored by +the service - otherwise it will be altered to indicate the service ID the +server upgraded to. Note that the upgraded service ID is chosen by the server. +The caller has to wait until it sees the service ID in the reply before sending +any more calls (further calls to the same destination will be blocked until the +probe is concluded). + ==================== EXAMPLE SERVER USAGE -- cgit From f8fe99754673719ab791713a676bf27dae616fbc Mon Sep 17 00:00:00 2001 From: "yuval.shaia@oracle.com" Date: Mon, 5 Jun 2017 10:18:40 +0300 Subject: net: phy: Delete unused function phy_ethtool_gset It's unused, so remove it. Signed-off-by: Yuval Shaia Reviewed-by: Florian Fainelli Signed-off-by: David S. Miller --- Documentation/networking/phy.txt | 1 - 1 file changed, 1 deletion(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt index 16f90d817224..bdec0f700bc1 100644 --- a/Documentation/networking/phy.txt +++ b/Documentation/networking/phy.txt @@ -295,7 +295,6 @@ Doing it all yourself settings in the PHY. int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd); - int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd); Ethtool convenience functions. -- cgit From 515559ca21713218595f3a4dad44a4e7eea2fcfb Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 7 Jun 2017 16:27:15 +0100 Subject: rxrpc: Provide a getsockopt call to query what cmsgs types are supported Provide a getsockopt() call that can query what cmsg types are supported by AF_RXRPC. --- Documentation/networking/rxrpc.txt | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 18078e630a63..bce8e10a2a8e 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -406,6 +406,10 @@ calls, to invoke certain actions and to report certain conditions. These are: future communication to that server and RXRPC_UPGRADE_SERVICE should no longer be set. +The symbol RXRPC__SUPPORTED is defined as one more than the highest control +message type supported. At run time this can be queried by means of the +RXRPC_SUPPORTED_CMSG socket option (see below). + ============== SOCKET OPTIONS @@ -459,6 +463,11 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: must point to an array of two unsigned short ints. The first is the service ID to upgrade from and the second the service ID to upgrade to. + (*) RXRPC_SUPPORTED_CMSG + + This is a read-only option that writes an int into the buffer indicating + the highest control message type supported. + ======== SECURITY -- cgit From e754eba685aac2a9b5538176fa2d254ad25f464d Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 7 Jun 2017 12:40:03 +0100 Subject: rxrpc: Provide a cmsg to specify the amount of Tx data for a call Provide a control message that can be specified on the first sendmsg() of a client call or the first sendmsg() of a service response to indicate the total length of the data to be transmitted for that call. Currently, because the length of the payload of an encrypted DATA packet is encrypted in front of the data, the packet cannot be encrypted until we know how much data it will hold. By specifying the length at the beginning of the transmit phase, each DATA packet length can be set before we start loading data from userspace (where several sendmsg() calls may contribute to a particular packet). An error will be returned if too little or too much data is presented in the Tx phase. Signed-off-by: David Howells --- Documentation/networking/rxrpc.txt | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index bce8e10a2a8e..8c70ba5dee4d 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -327,6 +327,7 @@ calls, to invoke certain actions and to report certain conditions. These are: RXRPC_ACCEPT s-- n/a Accept new call RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded + RXRPC_TX_LENGTH s-- data len Total length of Tx data (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) @@ -406,6 +407,19 @@ calls, to invoke certain actions and to report certain conditions. These are: future communication to that server and RXRPC_UPGRADE_SERVICE should no longer be set. + (*) RXRPC_TX_LENGTH + + This is used to inform the kernel of the total amount of data that is + going to be transmitted by a call (whether in a client request or a + service response). If given, it allows the kernel to encrypt from the + userspace buffer directly to the packet buffers, rather than copying into + the buffer and then encrypting in place. This may only be given with the + first sendmsg() providing data for a call. EMSGSIZE will be generated if + the amount of data actually given is different. + + This takes a parameter of __s64 type that indicates how much will be + transmitted. This may not be less than zero. + The symbol RXRPC__SUPPORTED is defined as one more than the highest control message type supported. At run time this can be queried by means of the RXRPC_SUPPORTED_CMSG socket option (see below). @@ -577,6 +591,9 @@ A client would issue an operation by: MSG_MORE should be set in msghdr::msg_flags on all but the last part of the request. Multiple requests may be made simultaneously. + An RXRPC_TX_LENGTH control message can also be specified on the first + sendmsg() call. + If a call is intended to go to a destination other than the default specified through connect(), then msghdr::msg_name should be set on the first request message of that call. @@ -764,6 +781,7 @@ The kernel interface functions are as follows: struct sockaddr_rxrpc *srx, struct key *key, unsigned long user_call_ID, + s64 tx_total_len, gfp_t gfp); This allocates the infrastructure to make a new RxRPC call and assigns @@ -780,6 +798,11 @@ The kernel interface functions are as follows: control data buffer. It is entirely feasible to use this to point to a kernel data structure. + tx_total_len is the amount of data the caller is intending to transmit + with this call (or -1 if unknown at this point). Setting the data size + allows the kernel to encrypt directly to the packet buffers, thereby + saving a copy. The value may not be less than -1. + If this function is successful, an opaque reference to the RxRPC call is returned. The caller now holds a reference on this and it must be properly ended. @@ -931,6 +954,17 @@ The kernel interface functions are as follows: This is used to find the remote peer address of a call. + (*) Set the total transmit data size on a call. + + void rxrpc_kernel_set_tx_length(struct socket *sock, + struct rxrpc_call *call, + s64 tx_total_len); + + This sets the amount of data that the caller is intending to transmit on a + call. It's intended to be used for setting the reply size as the request + size should be set when the call is begun. tx_total_len may not be less + than zero. + ======================= CONFIGURABLE PARAMETERS -- cgit From 99c195fb4eea405160ade58f74f62aed19b1822c Mon Sep 17 00:00:00 2001 From: Dave Watson Date: Wed, 14 Jun 2017 11:37:51 -0700 Subject: tls: Documentation Add documentation for the tcp ULP tls interface. Signed-off-by: Boris Pismenny Signed-off-by: Dave Watson Signed-off-by: David S. Miller --- Documentation/networking/tls.txt | 135 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 Documentation/networking/tls.txt (limited to 'Documentation/networking') diff --git a/Documentation/networking/tls.txt b/Documentation/networking/tls.txt new file mode 100644 index 000000000000..77ed00631c12 --- /dev/null +++ b/Documentation/networking/tls.txt @@ -0,0 +1,135 @@ +Overview +======== + +Transport Layer Security (TLS) is a Upper Layer Protocol (ULP) that runs over +TCP. TLS provides end-to-end data integrity and confidentiality. + +User interface +============== + +Creating a TLS connection +------------------------- + +First create a new TCP socket and set the TLS ULP. + + sock = socket(AF_INET, SOCK_STREAM, 0); + setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")); + +Setting the TLS ULP allows us to set/get TLS socket options. Currently +only the symmetric encryption is handled in the kernel. After the TLS +handshake is complete, we have all the parameters required to move the +data-path to the kernel. There is a separate socket option for moving +the transmit and the receive into the kernel. + + /* From linux/tls.h */ + struct tls_crypto_info { + unsigned short version; + unsigned short cipher_type; + }; + + struct tls12_crypto_info_aes_gcm_128 { + struct tls_crypto_info info; + unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE]; + unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE]; + unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE]; + unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE]; + }; + + + struct tls12_crypto_info_aes_gcm_128 crypto_info; + + crypto_info.info.version = TLS_1_2_VERSION; + crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128; + memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE); + memcpy(crypto_info.rec_seq, seq_number_write, + TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); + memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE); + memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE); + + setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)); + +Sending TLS application data +---------------------------- + +After setting the TLS_TX socket option all application data sent over this +socket is encrypted using TLS and the parameters provided in the socket option. +For example, we can send an encrypted hello world record as follows: + + const char *msg = "hello world\n"; + send(sock, msg, strlen(msg)); + +send() data is directly encrypted from the userspace buffer provided +to the encrypted kernel send buffer if possible. + +The sendfile system call will send the file's data over TLS records of maximum +length (2^14). + + file = open(filename, O_RDONLY); + fstat(file, &stat); + sendfile(sock, file, &offset, stat.st_size); + +TLS records are created and sent after each send() call, unless +MSG_MORE is passed. MSG_MORE will delay creation of a record until +MSG_MORE is not passed, or the maximum record size is reached. + +The kernel will need to allocate a buffer for the encrypted data. +This buffer is allocated at the time send() is called, such that +either the entire send() call will return -ENOMEM (or block waiting +for memory), or the encryption will always succeed. If send() returns +-ENOMEM and some data was left on the socket buffer from a previous +call using MSG_MORE, the MSG_MORE data is left on the socket buffer. + +Send TLS control messages +------------------------- + +Other than application data, TLS has control messages such as alert +messages (record type 21) and handshake messages (record type 22), etc. +These messages can be sent over the socket by providing the TLS record type +via a CMSG. For example the following function sends @data of @length bytes +using a record of type @record_type. + +/* send TLS control message using record_type */ + static int klts_send_ctrl_message(int sock, unsigned char record_type, + void *data, size_t length) + { + struct msghdr msg = {0}; + int cmsg_len = sizeof(record_type); + struct cmsghdr *cmsg; + char buf[CMSG_SPACE(cmsg_len)]; + struct iovec msg_iov; /* Vector of data to send/receive into. */ + + msg.msg_control = buf; + msg.msg_controllen = sizeof(buf); + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_TLS; + cmsg->cmsg_type = TLS_SET_RECORD_TYPE; + cmsg->cmsg_len = CMSG_LEN(cmsg_len); + *CMSG_DATA(cmsg) = record_type; + msg.msg_controllen = cmsg->cmsg_len; + + msg_iov.iov_base = data; + msg_iov.iov_len = length; + msg.msg_iov = &msg_iov; + msg.msg_iovlen = 1; + + return sendmsg(sock, &msg, 0); + } + +Control message data should be provided unencrypted, and will be +encrypted by the kernel. + +Integrating in to userspace TLS library +--------------------------------------- + +At a high level, the kernel TLS ULP is a replacement for the record +layer of a userspace TLS library. + +A patchset to OpenSSL to use ktls as the record layer is here: + +https://github.com/Mellanox/tls-openssl + +An example of calling send directly after a handshake using +gnutls. Since it doesn't implement a full record layer, control +messages are not supported: + +https://github.com/Mellanox/tls-af_ktls_tool -- cgit From c017ce0a9ad753e3835d7f8f5fca89960380e816 Mon Sep 17 00:00:00 2001 From: Vincent Bernat Date: Tue, 27 Jun 2017 15:42:57 +0200 Subject: net: remove policy-routing.txt documentation It dates back from 2.1.16 and is obsolete since 2.1.68 when the current rule system has been introduced. Signed-off-by: Vincent Bernat Signed-off-by: David S. Miller --- Documentation/networking/policy-routing.txt | 150 ---------------------------- 1 file changed, 150 deletions(-) delete mode 100644 Documentation/networking/policy-routing.txt (limited to 'Documentation/networking') diff --git a/Documentation/networking/policy-routing.txt b/Documentation/networking/policy-routing.txt deleted file mode 100644 index 36f6936d7f21..000000000000 --- a/Documentation/networking/policy-routing.txt +++ /dev/null @@ -1,150 +0,0 @@ -Classes -------- - - "Class" is a complete routing table in common sense. - I.e. it is tree of nodes (destination prefix, tos, metric) - with attached information: gateway, device etc. - This tree is looked up as specified in RFC1812 5.2.4.3 - 1. Basic match - 2. Longest match - 3. Weak TOS. - 4. Metric. (should not be in kernel space, but they are) - 5. Additional pruning rules. (not in kernel space). - - We have two special type of nodes: - REJECT - abort route lookup and return an error value. - THROW - abort route lookup in this class. - - - Currently the number of classes is limited to 255 - (0 is reserved for "not specified class") - - Three classes are builtin: - - RT_CLASS_LOCAL=255 - local interface addresses, - broadcasts, nat addresses. - - RT_CLASS_MAIN=254 - all normal routes are put there - by default. - - RT_CLASS_DEFAULT=253 - if ip_fib_model==1, then - normal default routes are put there, if ip_fib_model==2 - all gateway routes are put there. - - -Rules ------ - Rule is a record of (src prefix, src interface, tos, dst prefix) - with attached information. - - Rule types: - RTP_ROUTE - lookup in attached class - RTP_NAT - lookup in attached class and if a match is found, - translate packet source address. - RTP_MASQUERADE - lookup in attached class and if a match is found, - masquerade packet as sourced by us. - RTP_DROP - silently drop the packet. - RTP_REJECT - drop the packet and send ICMP NET UNREACHABLE. - RTP_PROHIBIT - drop the packet and send ICMP COMM. ADM. PROHIBITED. - - Rule flags: - RTRF_LOG - log route creations. - RTRF_VALVE - One way route (used with masquerading) - -Default setup: - -root@amber:/pub/ip-routing # iproute -r -Kernel routing policy rules -Pref Source Destination TOS Iface Cl - 0 default default 00 * 255 - 254 default default 00 * 254 - 255 default default 00 * 253 - - -Lookup algorithm ----------------- - - We scan rules list, and if a rule is matched, apply it. - If a route is found, return it. - If it is not found or a THROW node was matched, continue - to scan rules. - -Applications ------------- - -1. Just ignore classes. All the routes are put into MAIN class - (and/or into DEFAULT class). - - HOWTO: iproute add PREFIX [ tos TOS ] [ gw GW ] [ dev DEV ] - [ metric METRIC ] [ reject ] ... (look at iproute utility) - - or use route utility from current net-tools. - -2. Opposite case. Just forget all that you know about routing - tables. Every rule is supplied with its own gateway, device - info. record. This approach is not appropriate for automated - route maintenance, but it is ideal for manual configuration. - - HOWTO: iproute addrule [ from PREFIX ] [ to PREFIX ] [ tos TOS ] - [ dev INPUTDEV] [ pref PREFERENCE ] route [ gw GATEWAY ] - [ dev OUTDEV ] ..... - - Warning: As of now the size of the routing table in this - approach is limited to 256. If someone likes this model, I'll - relax this limitation. - -3. OSPF classes (see RFC1583, RFC1812 E.3.3) - Very clean, stable and robust algorithm for OSPF routing - domains. Unfortunately, it is not widely used in the Internet. - - Proposed setup: - 255 local addresses - 254 interface routes - 253 ASE routes with external metric - 252 ASE routes with internal metric - 251 inter-area routes - 250 intra-area routes for 1st area - 249 intra-area routes for 2nd area - etc. - - Rules: - iproute addrule class 253 - iproute addrule class 252 - iproute addrule class 251 - iproute addrule to a-prefix-for-1st-area class 250 - iproute addrule to another-prefix-for-1st-area class 250 - ... - iproute addrule to a-prefix-for-2nd-area class 249 - ... - - Area classes must be terminated with reject record. - iproute add default reject class 250 - iproute add default reject class 249 - ... - -4. The Variant Router Requirements Algorithm (RFC1812 E.3.2) - Create 16 classes for different TOS values. - It is a funny, but pretty useless algorithm. - I listed it just to show the power of new routing code. - -5. All the variety of combinations...... - - -GATED ------ - - Gated does not understand classes, but it will work - happily in MAIN+DEFAULT. All policy routes can be set - and maintained manually. - -IMPORTANT NOTE --------------- - route.c has a compilation time switch CONFIG_IP_LOCAL_RT_POLICY. - If it is set, locally originated packets are routed - using all the policy list. This is not very convenient and - pretty ambiguous when used with NAT and masquerading. - I set it to FALSE by default. - - -Alexey Kuznetov -kuznet@ms2.inr.ac.ru -- cgit From 75674c4cbb03a8b6ed7ef1f72b9e9aecc0bfd9dc Mon Sep 17 00:00:00 2001 From: Matteo Croce Date: Fri, 30 Jun 2017 18:21:47 +0200 Subject: Documentation: fix wrong example command In the IPVLAN documentation there is an example command line where the master and slave interface names are inverted. Fix the command line and also add the optional `name' keyword to better describe what the command is doing. v2: added commit message Signed-off-by: Matteo Croce Signed-off-by: David S. Miller --- Documentation/networking/ipvlan.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index 24196cef7c91..1fe42a874aae 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -22,9 +22,9 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module There are no module parameters for this driver and it can be configured using IProute2/ip utility. - ip link add link type ipvlan mode { l2 | l3 | l3s } + ip link add link name type ipvlan mode { l2 | l3 | l3s } - e.g. ip link add link ipvl0 eth0 type ipvlan mode l2 + e.g. ip link add link eth0 name ipvl0 type ipvlan mode l2 4. Operating modes: -- cgit