Post on 22-Dec-2015
The ‘zero-copy’ initiative
A look at the ‘zero-copy’ concept and an x86 Linux implementation for the case of outgoing packets
From Wikipedia, the free encyclopedia:
Zero-copy is an adjective that refers to computer operations in which the CPU does not perform the task of copying data from one area of memory to another. The availability of zero-copy versions of operating system elements such as device drivers, file systems and network protocol stacks greatly increases the performance of many applications, since using a CPU that is capable of complex operations just to make copies of data can be a great waste of resources. Zero-copy also reduces the number of context-switches from User space to Kernel space and vice-versa. Several OS like Linux support zero copying of files through specific API's like sendfile, sendfile64, etc.
Techniques for creating zero-copy software include the use of DMA-based copying, and memory-mapping through an MMU. These features require specific hardware support and usually involve particular memory alignment requirements.
Zero-copy protocols are especially important for high-speed networks, as memory copies would cause a serious workload for the host cpu. Still, such protocols have some initial overhead so that avoiding programmed IO (PIO) there only makes sense for large messages.
Application source-code
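#include <stdio.h>      // for printf(), perror()
#include <stdlib.h>     // for exit()
#include <string.h>     // for strlen()
#include <fcntl.h>      // for open()
#include <unistd.h>     // for write()
char message[] = "This is a test of network-packet transmission \n";
int main( void )
{
    int fd = open( "/dev/nic", O_RDWR );
    if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }
    int msglen = strlen( message );
    int nbytes = write( fd, message, msglen );
    if ( nbytes < 0 ) { perror( "write" ); exit(1); }
    printf( "Transmitted %d bytes \n", nbytes );
}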
Transmit operation
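[Diagram: in user space, the application program passes its user data-buffer to the runtime library's write(); in kernel space, the Linux OS kernel's file subsystem calls the nic device-driver's my_write(), which uses copy_from_user() to copy the data into a packet buffer; the hardware then fetches that packet buffer by DMA.]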
We want to eliminate this copying-operation
Our driver’s packet-layout
[Diagram: packet-buffer in kernel-space: destn-address | source-address | TYPE/LENGTH | count | -- data --]
[Diagram: Format for Legacy Transmit-Descriptor (16 bytes): base-address (64-bits) | Packet-length | CSO | cmd | status | CSS | special]
Can zero-copy be transparent?
• We would like to implement the zero-copy concept in our ‘nic2.c’ character driver in such a manner that no changes would be required to an ‘application’ program’s code
• We will show how to do this for ‘outgoing’ packets (i.e., by modifying ‘my_write()’), but achieving zero-copy with ‘incoming’ packets would be a lot more complicated!
TX Descriptor’s CMD byte
Command-Byte Format
bit:    7     6     5    4    3     2     1      0
field:  IDE   VLE   0    0    RS    IC    IFCS   EOP
EOP = End-Of-Packet (1=yes, 0=no)
RS = Report Status (1=yes, 0=no)
VLE = VLAN-tag Enable
Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
Splitting our packet-layout
[Diagram: packet-buffer in kernel-space, divided into a header part of HDR bytes (destn-address | source-address | TYPE/LENGTH | count) and a data part of LEN bytes (-- data --)]
Format for Legacy Transmit-Descriptor Pair
first descriptor:   base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
second descriptor:  base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special
Splitting our packet-buffer
[Diagram: the header part of HDR bytes (destn-address | source-address | TYPE/LENGTH | count) stays in a packet-buffer in kernel-space, while the data part of LEN bytes (-- data --) is a packet-buffer in user-space]
Format for Legacy Transmit-Descriptor Pair
first descriptor:   base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
second descriptor:  base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special
Two physical packet-buffers comprise one logical packet that gets transmitted!
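The descriptor pair can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the driver's actual code: the struct name `tx_desc`, its field names, the helper `fill_split_pair()` and the `CMD_EOP` / `CMD_RS` constant names are all introduced here for the example. The field order follows the 16-byte legacy descriptor layout, and the bit positions follow the Command-Byte Format.

```c
#include <stdint.h>

/* Legacy transmit-descriptor, 16 bytes. All names here are illustrative. */
struct tx_desc {
    uint64_t base_address;  /* physical address of this buffer   */
    uint16_t length;        /* number of bytes in this buffer    */
    uint8_t  cso;           /* checksum offset                   */
    uint8_t  cmd;           /* command byte (EOP, RS, ...)       */
    uint8_t  status;        /* written back by the controller    */
    uint8_t  css;           /* checksum start                    */
    uint16_t special;
};

#define CMD_EOP 0x01        /* bit 0: End-Of-Packet */
#define CMD_RS  0x08        /* bit 3: Report Status */

/* Describe one logical packet as two physical buffers:
 * pair[0] = header (EOP=0), pair[1] = data (EOP=1).     */
void fill_split_pair(struct tx_desc *pair,
                     uint64_t hdr_phys,  uint16_t hdr_len,
                     uint64_t data_phys, uint16_t data_len)
{
    pair[0] = (struct tx_desc){ .base_address = hdr_phys,
                                .length = hdr_len,
                                .cmd = 0 };                 /* more to come */
    pair[1] = (struct tx_desc){ .base_address = data_phys,
                                .length = data_len,
                                .cmd = CMD_EOP | CMD_RS };  /* packet ends  */
}
```

Only the final descriptor carries the EOP bit, which is what tells the controller where the logical packet ends.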
Transmitting a ‘split-packet’
[Diagram: the application-program's packet-data buffer lies in user-space and the device-driver module's packet-header buffer lies in kernel-space; the NIC hardware fetches each of the two buffers by DMA.]
The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet
The ‘virt_to_phys()’ macro
• Linux provides a convenient macro which kernel-module code can employ to obtain the physical-address for a memory-region from its virtual-address – but it only works for addresses that aren’t in ‘high’ memory
• For ‘normal’ memory-regions, conversion between ‘virtual’ and ‘physical’ addresses amounts to a simple addition/subtraction
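The ‘simple addition/subtraction’ can be illustrated with a small user-space sketch. It is not the kernel's macro: it assumes the common 32-bit x86 configuration in which kernel virtual addresses begin at 0xC0000000, and the names `DEMO_PAGE_OFFSET`, `DEMO_LOWMEM_LIMIT`, `demo_virt_to_phys()`, `demo_phys_to_virt()` and `demo_is_lowmem_phys()` are hypothetical.

```c
#include <stdint.h>

/* Assumed start of kernel virtual addresses on 32-bit x86. */
#define DEMO_PAGE_OFFSET   0xC0000000u
/* Assumed size of the persistently mapped region: 896 MB.  */
#define DEMO_LOWMEM_LIMIT  (896u * 1024u * 1024u)

/* Kernel virtual address -> physical address (valid for 'normal' memory only). */
uint32_t demo_virt_to_phys(uint32_t vaddr)
{
    return vaddr - DEMO_PAGE_OFFSET;
}

/* Physical address -> kernel virtual address (valid for 'normal' memory only). */
uint32_t demo_phys_to_virt(uint32_t paddr)
{
    return paddr + DEMO_PAGE_OFFSET;
}

/* The conversion above is meaningful only below the 'high' memory boundary. */
int demo_is_lowmem_phys(uint32_t paddr)
{
    return paddr < DEMO_LOWMEM_LIMIT;
}
```

A user-space buffer address is not a kernel virtual address of this kind, so this arithmetic cannot be applied to it.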
Linux memory-mapping
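[Diagram: the CPU’s virtual address-space is divided into user space and kernel space; the first 896-MB of physical RAM has a persistent mapping into kernel space, while high memory (HMA) beyond that is reachable only through transient mappings.]
There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses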
Two-Level Translation Scheme
[Diagram: CR3 locates the PAGE DIRECTORY, whose entries locate the PAGE TABLES, whose entries locate the PAGE FRAMES.]
Linear to Physical
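[Diagram: a linear address is split into dir-index, table-index and offset; CR3 locates the page directory, dir-index selects a page table, table-index selects a page frame, and offset selects a byte within that page frame in the physical address-space.]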
Address-translation
• The CPU examines any virtual address it encounters, subdividing it into three fields
bits 31-22 (10-bits): index into page-directory. This field selects one of the 1024 array-entries in the Page-Directory
bits 21-12 (10-bits): index into page-table. This field selects one of the 1024 array-entries in that Page-Table
bits 11-0 (12-bits): offset into page-frame. This field provides the offset to one of the 4096 bytes in that Page-Frame
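As a small illustration of the three-field split, here is a user-space sketch. The struct `vaddr_fields` and the function `split_vaddr()` are names made up for this example; the shifts and masks follow the 10/10/12-bit layout.

```c
#include <stdint.h>

/* The three fields of a 32-bit virtual address (illustrative names). */
struct vaddr_fields {
    uint32_t dindex;    /* bits 31-22: index into page-directory */
    uint32_t pindex;    /* bits 21-12: index into page-table     */
    uint32_t offset;    /* bits 11-0 : offset into page-frame    */
};

struct vaddr_fields split_vaddr(uint32_t vaddr)
{
    struct vaddr_fields f;
    f.dindex = (vaddr >> 22) & 0x3FF;   /* 10 bits */
    f.pindex = (vaddr >> 12) & 0x3FF;   /* 10 bits */
    f.offset = vaddr & 0xFFF;           /* 12 bits */
    return f;
}
```

For instance, the address 0x08049F34 splits into dindex 0x020, pindex 0x049 and offset 0xF34.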
Format of a Page-Table entry
bits 31-12: PAGE-FRAME BASE ADDRESS
bits 11-9:  AVAIL
bits 8-7:   0 0
bit 6: D    bit 5: A    bit 4: PCD    bit 3: PWT    bit 2: U    bit 1: W    bit 0: P
LEGEND
P = Present (1 = yes, 0 = no)
W = Writable (1 = yes, 0 = no)
U = User (1 = yes, 0 = no)
A = Accessed (1 = yes, 0 = no)
D = Dirty (1 = yes, 0 = no)
PWT = Page Write-Through (1 = yes, 0 = no)
PCD = Page Cache-Disable (1 = yes, 0 = no)
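A Page-Table entry in this format could be decoded along these lines. The `PTE_*` constant names and the helper functions are assumptions made for this sketch; only the bit positions come from the format itself.

```c
#include <stdint.h>

/* Flag bits of a 32-bit Page-Table entry (names are illustrative). */
#define PTE_P    (1u << 0)  /* Present            */
#define PTE_W    (1u << 1)  /* Writable           */
#define PTE_U    (1u << 2)  /* User               */
#define PTE_PWT  (1u << 3)  /* Page Write-Through */
#define PTE_PCD  (1u << 4)  /* Page Cache-Disable */
#define PTE_A    (1u << 5)  /* Accessed           */
#define PTE_D    (1u << 6)  /* Dirty              */

/* Bits 31-12 hold the page-frame's base address. */
uint32_t pte_frame_base(uint32_t pte)
{
    return pte & 0xFFFFF000u;
}

/* The same bits, shifted down, give the Page-Frame Number. */
uint32_t pte_pfn(uint32_t pte)
{
    return pte >> 12;
}

/* An entry is only meaningful to the CPU when its Present bit is set. */
int pte_present(uint32_t pte)
{
    return (pte & PTE_P) != 0;
}
```

For the entry value 0x1234A067, the frame base is 0x1234A000 and the P, W, U, A and D bits are set.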
Finding the user-buffer’s PFN
• To program the ‘base-address’ field in the second TX-Descriptor, our driver’s ‘write()’ function will need to know which physical Page-Frame the application’s buffer lies in
• And its PFN (Page-Frame Number) can be found from its virtual address by ‘walking-the-cpu-page-tables’ – even when Linux puts some page-tables in ‘high’ memory
Performing ‘virt_to_phys()’
ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    unsigned int _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
    unsigned int dindex, pindex, offset;

    // take apart the virtual-address of the user's 'buf' variable
    dindex = ((unsigned long)buf >> 22) & 0x3FF;    // pgdir-index (10-bits)
    pindex = ((unsigned long)buf >> 12) & 0x3FF;    // pgtbl-index (10-bits)
    offset = ((unsigned long)buf >>  0) & 0xFFF;    // frame-offset (12-bits)

    // then walk the CPU's paging-tables to get buf's physical-address
    asm(" mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
    pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
    pfn_pgtbl = (pgdir[ dindex ] >> 12);
    pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );  // page-table may be in 'high' memory
    pfn_frame = (pgtbl[ pindex ] >> 12);
    kunmap( &mem_map[ pfn_pgtbl ] );
    txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
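The same walk can be tried out in user space against simulated tables. The sketch below is not kernel code and makes several assumptions: the array `demo_frames` stands in for physical memory, `demo_walk()` and the `DEMO_*` names are hypothetical, only 4 KB pages are handled, and a Present-bit check is included even though the driver excerpt does not show one.

```c
#include <stdint.h>

#define DEMO_ENTRIES  1024      /* 32-bit entries per 4 KB frame   */
#define DEMO_NFRAMES  4         /* size of the simulated 'RAM'     */
#define DEMO_PRESENT  0x1u      /* P bit of a directory/table entry */

/* Simulated physical memory: frame number -> 1024 entries. */
uint32_t demo_frames[DEMO_NFRAMES][DEMO_ENTRIES];

/* Walk a two-level table rooted at frame 'pgdir_pfn'.
 * Returns 0 and stores the physical address in *paddr on success,
 * or -1 if an entry is not present (or lies outside the simulation). */
int demo_walk(uint32_t pgdir_pfn, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t dindex = (vaddr >> 22) & 0x3FF;
    uint32_t pindex = (vaddr >> 12) & 0x3FF;
    uint32_t offset = vaddr & 0xFFF;

    if (pgdir_pfn >= DEMO_NFRAMES)
        return -1;
    uint32_t pde = demo_frames[pgdir_pfn][dindex];
    if (!(pde & DEMO_PRESENT) || (pde >> 12) >= DEMO_NFRAMES)
        return -1;

    uint32_t pte = demo_frames[pde >> 12][pindex];
    if (!(pte & DEMO_PRESENT))
        return -1;

    *paddr = (pte & 0xFFFFF000u) + offset;
    return 0;
}
```

With the page-directory in frame 0, a page-table in frame 1, and a table entry naming frame 0x12345, the virtual address 0x08049F34 translates to the physical address 0x12345F34.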
Can’t cross a ‘page-boundary’
• In order for the NIC to fetch the user’s data using its Bus-Master DMA capability, the buffer needs to reside in a physically contiguous memory-region
• But we can’t be sure Linux will have set up the CPU’s page-tables that way – unless the ‘buf’ is confined to a single page-frame
Truncate ‘len’ if necessary
ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;
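[Diagram: ‘buf’ begins at ‘offset’ within one PAGE_SIZE page-frame and extends for ‘len’ bytes, shown against a row of consecutive PAGE_SIZE regions.]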
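The truncation rule can be checked in isolation with a small user-space sketch. The names `DEMO_PAGE_SIZE` and `clamp_to_page()` are hypothetical, and a 4096-byte page is assumed.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u    /* assumed page-frame size */

/* Return how many of 'len' bytes, starting at 'offset' within a
 * page-frame, lie inside that same page-frame.                  */
size_t clamp_to_page(size_t offset, size_t len)
{
    if (offset + len > DEMO_PAGE_SIZE)
        len = DEMO_PAGE_SIZE - offset;
    return len;
}
```

For example, a 500-byte buffer starting at offset 0xF34 (3892) is cut to 204 bytes, so in that case write() would report fewer bytes than the application requested.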
‘zerocopy.c’
• We created this modification of our ‘nic2.c’ device-driver so its ‘my_write()’ function lets an application perform transmissions without performing a memory-to-memory copy-operation (i.e., ‘copy_from_user()’)
• It is not so easy to implement ‘zero-copy’ for receiving packets – can you say why?
Website article
• We’ve posted a link on our CS686 website to a frequently cited research-article about the various issues that arise when trying to implement the ‘zero-copy’ concept for the case of ‘incoming’ network-packets:
The Need for Asynchronous, Zero-Copy Network I/O, by Ulrich Drepper, Red Hat, Inc.