Linux Plumbers Conference 2010
User Visible Networking Issues Micro-conference Notes
http://domsch.com/blog for more links
Use Twitter/Identi.ca hashtag #lpc2010
http://www.linuxplumbersconf.org/2010/ocw/sessions/465
http://wiki.linuxplumbersconf.org/2010:user-visible_networking_issues

Matt led the introduction.

Dan Williams up first - The Challenge of Mobile Broadband
=========================================================

Slide: Old Spice Guy on a horse with a tablet-like device.

Slide: So Everything is Great, right?
NetworkManager makes it easy, popular cards work, hardware is $100, so it's becoming commonplace.

But it's not "easy". Ton of hardware, lots of ways to talk to it.
Top-tier vendors
Second-tier vendors: Huawei, ZTE
Third-tier (aka Septic Tank): longcheer, Simtech, JRD/TCT
Embedded vendors

Standards are for Losers - command set differences
- AT command sets differ for each vendor and are heavily underspecified
- command sets are recovered by reverse-engineering Windows drivers
- hard to contact vendors

Does Not Play Well With Others
- proprietary protocols: QCDM, QMI

Vendor participation:
A+: Sierra, Option, Ericsson (they hand out docs, answer firmware questions, contribute drivers to Linux)
C-: Huawei, ZTE, Novatel (contribute USB ID updates, sometimes answer questions)
D-: Qualcomm
- Android has drivers for the Qualcomm "Gobi" chipset
- Qualcomm has a binary library to implement their special commands
- Need to continue developing the Qualcomm/Linux relationship. It's a big company; some folks there "get it", others not so much.

modeswitch: devices first appear as a fake CD-ROM carrying Windows drivers; this needs to be suppressed in Linux
- the usb-modeswitch tool has a DB of 3G cards, and issues USB commands or AT commands to flip the device over to modem mode
- devices usually export several ports for various functionality (GPS, firmware upgrades, diagnostics, etc.); the driver needs to know which port maps where, and talk to the right port for actual internet connectivity
- some vendors (Option) offer AT commands to enumerate/tag the ports; for most others the mapping has to be scraped from Windows .inf drivers
- modeswitch is invoked by udev rules, usually matching vendor/product IDs (see the example rule below)
- most of the switching latency is due to firmware reboot and re-enumeration on the USB bus. Worst case ~10 seconds.
- parts of usb-modeswitch are written in Tcl, which is expensive and slows down boot; should be ported to C (or Vala?)
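
A minimal sketch of such a udev rule, assuming a Huawei stick that first enumerates as a fake CD-ROM at 12d1:1446 and reboots into modem mode as 12d1:1001 (real rules ship with usb-modeswitch's device database; the switch message is device-specific, so only a placeholder is shown):

    # Illustrative only - match the mass-storage personality and hand
    # the device to usb_modeswitch, which sends the vendor-specific
    # message that makes it re-enumerate as a modem.
    SUBSYSTEM=="usb", ATTR{idVendor}=="12d1", ATTR{idProduct}=="1446", \
        RUN+="/usr/sbin/usb_modeswitch -v 12d1 -p 1446 -V 12d1 -P 1001 -M <hex-message>"
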
Device detection is pretty good for normal devices (data cards). Cell phones are the new challenge. Lots of cell phones.
- can't rely on static port tagging
- some vendors use the same USB IDs for wildly different phone hardware
- as phones and connection technologies get faster, they move away from classic PPP towards pseudo-ethernet devices

The 4G Devolution
WiMAX: Intel and Beceem (USB dongles)
- userspace stacks are huge; they move a lot of security and other infrastructure out of firmware
- no unified driver API
- current drivers need cleanup love. Volunteers? Shout-out to shemminger for his efforts thus far.
LTE: the only card right now is Samsung's. 50,000 deployed, no drivers yet. Qualcomm chipsets are starting to be released.

What do we need?
- lots more testers (new devices every week); submit new IDs to usb-modeswitch
- more reverse engineering
- public ridicule
- better serial drivers (latency)
- better vendor participation
- janitors (clean up published drivers)
- configuration crowdsourcing (udev rules)

Next Up: Matt Domsch - Discussing Network Device Naming
=======================================================

Slide with lots of eth cables: Which is eth0?
* net device naming is transient and (without intervention) based on system attributes that have no relevance to what we might want the name to be
* names can change with OS/version/PCI scan order/etc., or when adding new cards
* this is problematic for large-scale deployments and "turnkey" no-config systems
* sysadmins want consistency

Slide: Problem - Enumeration != Naming
* sysadmins expect names to be deterministic
* enumeration is completely non-deterministic
  * module load order / racing udev renames
  * PCI discovery order
* undefined relationship between physical labeling and enumeration
* we've tried to fix this:
  * MAC->name mapping (the 70-persistent-net.rules generator) -> prevents installing the same image on multiple machines; strong desire to reduce on-disk state (an example rule follows below)
  * PCI ID->name mapping
* these solutions add state (system-specific state, at that)
* some vendors try to externalize MAC address assignment, like HP blades; the blades carry no "personality" of their own, but get MACs assigned depending on where you plug them in
* names are limited to 15 characters and a limited character set
* only one name is allowed per device
* it would be nice to fix all these problems

Question: If we're not using stateful, system-specific info, how do we map interfaces to names?
* the MAC addresses might migrate along with system exchanges
* other solutions may exist

Slide: Current solutions / Hacks
* persistent-net-rules -> introduces state to stateless systems
* pci=bfsort -> ties the mapping to bus topology
* force PCI bus layout types in hardware, so that all OSes detect them in the same order -> not cost-reasonable
* pxelinux + ks-bootif -> only a one-NIC solution, and only for PXE
* need to improve
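
For reference, a generated 70-persistent-net.rules entry looks roughly like this (the MAC address and PCI IDs are made up for illustration; exact fields vary by udev version):

    # PCI device 0x14e4:0x1659 (tg3)
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:55", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

This is exactly the per-machine, on-disk state the slide complains about: clone this image to another box and the rule pins eth0 to a MAC that isn't there.
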
Slide: Standards
* HP added NIC-ordering info to SMBIOS vendor fields -> it has gone unused
* Dell added labels to device objects in the SMBIOS 2.6 standard
* Dell is working on labels and indexing in ACPI

Question: Do these SMBIOS entries define an order, or do they assign labels?
* they assign labels, so they can be mixed with external cards without breaking the ordering
* but it doesn't work for add-in cards

Question: Can we use VPD to store similar information?
* maybe; it would need standardization

Slide: Proposed (and rejected) solutions
* rename devices using biosdevname -> never got much traction
* /dev/net/eth* device nodes for network devices -> core network maintainers were unwilling
* OS-prompted names during install -> um, no.
* libnetdevname -> too much fan-out for updating user-space tools
* reservation of the first N ethN names -> requires naming policy in the kernel, so, no.

Slide: Current proposal
* use the BIOS-provided index in udev rules
* only rename LOMs
* needs help -> need to give users the ability to opt in/out
* requires systems to have the latest SMBIOS 2.6 support
* NetworkManager to display BIOS labels as informational text
* Jon Masters: should just be renamed once, in the kernel, for race-free processing
* seems easier to first implement it in udev and put the policy to a real-life test

Lots of discussion regarding the shortcomings of BIOS-bound names
* generally it seems like using SMBIOS is good enough
* Dell will be adding udev rules to rename eth to lom for onboard devices

Ying Cai - SO_REUSEPORT
=======================

Problems:
* servers with very high transaction rates
* behind one server socket there might be a big number of serving threads
* only one process can bind to a single socket

Standard solution: one listener thread dispatches incoming connections to server threads -> bottleneck
Google solution: all server threads accept() on a single socket -> lock contention, poor balancing
UDP has similar problems, but has SO_REUSEADDR to allow multiple sockets to bind()

New socket option SO_REUSEPORT -> allows multiple sockets to bind()/listen() on the same local address/port -> avoids contention (a minimal usage sketch follows at the end of this section)
Load balancing is done by the kernel, by randomly assigning incoming connections to sockets.
Security: all sockets must be opened by the same user, to avoid spoofing.
Not enabled by default; enabled with the sysctl net.core.allow_reuseport=1 plus setsockopt(SO_REUSEADDR|SO_REUSEPORT).

Status:
* sent upstream, not accepted yet

Question: if apps compile in support for this, an admin can then break them by disabling the sysctl
* the sysctl was introduced to address a first-level security concern
* mitigated by requiring all processes to use the option
* idea: require all listeners to be in the same process group, cgroup, or netlabel

Problem: hashing
* the hash includes the number of open server sockets, so as soon as that changes, packets get hashed to different sockets
* workaround: use a fixed number of sockets and open them all up front
* allow multiple sockets to share the TCP request table
* or get rid of the hash and pick the local server socket that is on the same CPU

Problem: cache
* cache-line bouncing isn't solved yet
* different requests could hit different CPUs
* solution: bind each server thread to a single CPU, and use RPS (Receive Packet Steering) and XPS-mq (Transmit Packet Steering) to stay on the same CPU (see the second sketch below)
* with multi-queue hardware, set up receive queues to be per-CPU (except that not all hardware has as many queues as the system has CPUs)

Other scalability issues: scheduler interactions, locking contention (the HTB qdisc lock). Cannot remove the qdisc lock.
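
A minimal usage sketch in C, assuming a kernel carrying the proposed patch with net.core.allow_reuseport=1 (the option was not yet upstream at the time, so the SO_REUSEPORT constant is defined here for illustration):

    /* Minimal sketch: each worker thread opens its own listening socket
     * on the same port; the proposed SO_REUSEPORT option lets every
     * bind() succeed, and the kernel spreads incoming connections
     * across the sockets. */
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef SO_REUSEPORT
    #define SO_REUSEPORT 15   /* illustrative; depends on the patch */
    #endif

    static int make_listener(unsigned short port)
    {
        struct sockaddr_in addr;
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;

        /* The proposal requires both options to be set. */
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        /* Every thread calls this with the same port; without
         * SO_REUSEPORT the second bind() fails with EADDRINUSE. */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 128) < 0) {
            close(fd);
            return -1;
        }
        return fd;   /* each thread then accept()s on its own fd */
    }

Each worker creates its own listener with this helper and then accept()s independently: no shared socket, so no shared lock to contend on.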
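
On the RPS/XPS point: RPS steering is configured per receive queue through sysfs. A hedged example, assuming eth0 and a server thread pinned to CPU 2 (the transmit-side xps_cpus knob appeared in kernels later than RPS):

    # hex CPU bitmask: 0x4 = CPU 2 only
    echo 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus
    echo 4 > /sys/class/net/eth0/queues/tx-0/xps_cpus
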
Shyam Iyer - Simplify network config for VMs by harmonizing multiple bridging, QoS, DCB, ...
============================================================================================

Background:
* old days: ifconfig, ethtool, etc. on plain ethX devices
* storage controller cards: SCSI low-level drivers; SCSI/SAS/FC link types

Evolution:
* ethernet became faster than storage, and more reliable -> start using ethernet to do storage, too
* finally a common standard, "Data Center Bridging" -> a set of features/notifications for controlling QoS, congestion notification, etc.

Data Center Bridging TLVs are sent over LLDP to exchange parameters on a single link; this allows hardware-based QoS, like splitting your available bandwidth across several storage and network links.

Converged Network Adapters (CNAs) are hitting the market soon - ethernet, iSCSI, and FCoE functionality, sometimes as a storage HBA, delivered through NIC partitioning and SR-IOV. These will require configuration, and each vendor has its own way to configure its devices.

Some devices have embedded switches. Need to comprehend virtual switches (hypervisors) as well. "Hairpin" mode: traffic leaves a physical port but is destined for another OS on the same physical port, so the remote switch must receive the packets and send them right back. Bridging modes: VEB, VEPA.

Complications:
- vendor implementations
  -- vlan management: vconfig
  -- DCB management: dcbnl (netlink communication for configuring ethernet devices, rather than ioctl())
  -- multiqueue
  -- QoS implementation
     --- TC filters, qdiscs
- iSCSI offload vendors
  -- vlan management
     --- iscsi iface configuration
     --- vconfig
- FC offload vendors
  -- vlan management
     --- sysfs attributes
     --- switch-based operation

FC storage vendors haven't traditionally had to work in the networking space. Several different vendors' cards end up in a single data center environment.

QoS implementations: per-VM QoS configuration
- storage QoS configuration
  --- per-VM I/O priority via the blkio controller cgroup. libvirt can provide an API.
  --- load-balance an aggregated/multipathed I/O path to a LUN
- network QoS configuration
  --- per-VM network I/O priority can be configured via tc and lldpad (a tc sketch follows this section)
  --- need to expose traffic-class queues

LLDP is link-layer; does a VM have a link all the way to the external switch? Firmware may terminate LLDP and not pass that info up to the OS.

Lively conversation about the different configurability of each vendor's hardware - hard to wrangle into a single config API. Storage adapters doing DCB configuration don't (today) expose a network device for use with dcbnl, but they could. Stephen: need to see code. Intel: we have dcbnl in the kernel, but no one else uses it.
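
A hedged sketch of the per-VM priority idea using plain tc, assuming the host uplink is eth0 and the VM is identified by a made-up address (lldpad/DCB would be what maps such classes onto hardware traffic classes):

    # 3-band prio qdisc on the uplink; steer one VM's traffic into band 0
    tc qdisc add dev eth0 root handle 1: prio
    tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip src 192.168.122.10/32 flowid 1:1
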