
The ‘zero-copy’ initiative

A look at the ‘zero-copy’ concept and an x86 Linux implementation for the case of outgoing packets

From Wikipedia, the free encyclopedia:

Zero-copy is an adjective that refers to computer operations in which the CPU does not perform the task of copying data from one area of memory to another. The availability of zero-copy versions of operating system elements such as device drivers, file systems and network protocol stacks greatly increases the performance of many applications, since using a CPU that is capable of complex operations just to make copies of data can be a great waste of resources. Zero-copy also reduces the number of context-switches from User space to Kernel space and vice-versa. Several OS like Linux support zero copying of files through specific APIs like sendfile, sendfile64, etc.

Techniques for creating zero-copy software include the use of DMA-based copying, and memory-mapping through an MMU. These features require specific hardware support and usually involve particular memory alignment requirements.

Zero-copy protocols are especially important for high-speed networks, as memory copies would cause a serious workload for the host CPU. Still, such protocols have some initial overhead so that avoiding programmed IO (PIO) there only makes sense for large messages.
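The sendfile interface mentioned in the quotation can be illustrated with a minimal sketch. It assumes a Linux system (kernel 2.6.33 or later if the output descriptor is a regular file); the helper name send_whole_file() is hypothetical and is not part of the course's driver code.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file at 'path' to 'out_fd'
 * without copying its contents through a user-space buffer.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_whole_file( int out_fd, const char *path )
{
	int in_fd = open( path, O_RDONLY );
	if ( in_fd < 0 ) return -1;

	struct stat st;
	if ( fstat( in_fd, &st ) < 0 ) { close( in_fd ); return -1; }

	off_t   offset = 0;	/* sendfile() advances this for us */
	ssize_t total = 0;
	while ( offset < st.st_size ) {
		ssize_t n = sendfile( out_fd, in_fd, &offset, st.st_size - offset );
		if ( n < 0 ) { total = -1; break; }	/* error */
		if ( n == 0 ) break;			/* unexpected end-of-file */
		total += n;
	}
	close( in_fd );
	return total;
}
```

The kernel moves the data between the two descriptors itself, so the application never issues a read() into its own buffer followed by a write().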

Application source-code

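#include <stdio.h>	// for printf(), perror()
#include <stdlib.h>	// for exit()
#include <string.h>	// for strlen()
#include <fcntl.h>	// for open()
#include <unistd.h>	// for write()

char message[] = "This is a test of network-packet transmission \n";

int main( void )
{
	// open the device-file for our network interface controller
	int fd = open( "/dev/nic", O_RDWR );
	if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

	int msglen = strlen( message );

	// hand the message to the driver for transmission
	int nbytes = write( fd, message, msglen );
	if ( nbytes < 0 ) { perror( "write" ); exit(1); }

	printf( "Transmitted %d bytes \n", nbytes );
}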

Transmit operation

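[Diagram: in user space, the application program's user data-buffer is handed to the runtime library's write() function. In kernel space, the Linux file subsystem passes the request to the nic device-driver's my_write() function, which uses copy_from_user() to copy the data into a kernel packet buffer. The hardware then fetches that packet buffer using DMA.]

We want to eliminate this copying-operation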

Our driver’s packet-layout

packet-buffer in kernel-space:

  destn-address | source-address | TYPE/LENGTH | count | -- data -- | -- data -- | -- data --

Format for Legacy Transmit-Descriptor (16 bytes):

  bytes 0-7     base-address (64-bits)
  bytes 8-9     Packet-length
  byte  10      CSO
  byte  11      cmd
  byte  12      status
  byte  13      CSS
  bytes 14-15   special
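The 16-byte descriptor layout can be written down as a C structure. This is a minimal sketch, not the driver's actual declaration: the type name tx_descriptor_t and the field names other than base_address are assumptions chosen to match the labels on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name; fields follow the legacy 16-byte layout */
typedef struct {
	uint64_t base_address;	/* physical address of the packet-buffer */
	uint16_t packet_length;	/* number of bytes the NIC should fetch  */
	uint8_t  cso;		/* checksum offset                       */
	uint8_t  cmd;		/* command byte (EOP, RS, and so on)     */
	uint8_t  status;	/* written back by the NIC               */
	uint8_t  css;		/* checksum start                        */
	uint16_t special;
} tx_descriptor_t;

/* The NIC expects each descriptor to occupy exactly 16 bytes */
static_assert( sizeof( tx_descriptor_t ) == 16, "descriptor must be 16 bytes" );
static_assert( offsetof( tx_descriptor_t, cmd ) == 11, "cmd must be byte 11" );
```

The compile-time checks guard against a compiler inserting padding that would shift the fields away from where the hardware looks for them.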

Can zero-copy be transparent?

• We would like to implement the zero-copy concept in our ‘nic2.c’ character driver in such a manner that no changes would be required to an ‘application’ program’s code

• We will show how to do this for ‘outgoing’ packets (i.e., by modifying ‘my_write()’), but achieving zero-copy with ‘incoming’ packets would be a lot more complicated!

TX Descriptor’s CMD byte

Command-Byte Format

  bit:     7     6     5     4     3     2     1      0
         IDE   VLE     0     0    RS    IC   IFCS   EOP

EOP = End-Of-Packet (1=yes, 0=no)

RS = Report Status (1=yes, 0=no)

VLE = VLAN-tag Enable

Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
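The command-byte layout can be expressed as bit masks. The sketch below is illustrative only: the constant names and the helper tx_cmd() are hypothetical, and which optional bits a real driver sets is its own choice.

```c
#include <stdint.h>

/* Bit positions taken from the Command-Byte Format */
enum {
	CMD_EOP  = 1 << 0,	/* End-Of-Packet  */
	CMD_IFCS = 1 << 1,
	CMD_IC   = 1 << 2,
	CMD_RS   = 1 << 3,	/* Report Status  */
	CMD_VLE  = 1 << 6,	/* VLAN-tag Enable */
	CMD_IDE  = 1 << 7
};

/* Hypothetical helper: build the command byte for one descriptor.
 * 'is_last' says whether this buffer ends the packet;
 * 'report' asks the NIC to write back the descriptor's status. */
uint8_t tx_cmd( int is_last, int report )
{
	uint8_t cmd = 0;
	if ( is_last ) cmd |= CMD_EOP;
	if ( report )  cmd |= CMD_RS;
	return cmd;
}
```

A descriptor whose command byte lacks CMD_EOP describes only part of a packet, which is the property the split-packet slides rely on.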

Splitting our packet-layout



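[Diagram: the packet (count and data shown) is now described by a pair of descriptors rather than one. HDR is the length of the first piece and LEN is the length of the second piece, which lies in the packet-buffer in user-space.]

Format for Legacy Transmit-Descriptor Pair

  first descriptor:    base-address (64-bits), Packet-Length = HDR, cmd has EOP=0
  second descriptor:   base-address (64-bits), Packet-Length = LEN, cmd has EOP=1

(Each descriptor also has its status, CSO, CSS and special fields.)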


Splitting our packet-buffer


[Diagram: the Legacy Transmit-Descriptor Pair drawn with two separate buffers. The first descriptor (Packet-Length = HDR, EOP=0) points to one buffer; the second descriptor (Packet-Length = LEN, EOP=1) points to the other.]

Two physical packet-buffers comprise one logical packet that gets transmitted!
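Filling in such a descriptor pair might look like the following sketch. All names here are hypothetical (the structure, the constants, and setup_split_packet() itself), and the wrap-around of the ring index is an assumption about how the ring is managed.

```c
#include <stdint.h>

typedef struct {
	uint64_t base_address;
	uint16_t packet_length;
	uint8_t  cso, cmd, status, css;
	uint16_t special;
} tx_descriptor_t;		/* hypothetical 16-byte legacy descriptor */

#define CMD_EOP	0x01		/* End-Of-Packet */
#define CMD_RS	0x08		/* Report Status */

/* Describe one logical packet using two physical buffers:
 * the header (kernel memory) and the data (user memory). */
void setup_split_packet( tx_descriptor_t *ring, unsigned tail, unsigned ring_size,
			 uint64_t hdr_phys, uint16_t hdr_len,
			 uint64_t data_phys, uint16_t data_len )
{
	tx_descriptor_t *first  = &ring[ tail ];
	tx_descriptor_t *second = &ring[ (tail + 1) % ring_size ];

	first->base_address  = hdr_phys;
	first->packet_length = hdr_len;		/* Packet-Length = HDR */
	first->cmd           = 0;		/* EOP=0: more follows  */
	first->status        = 0;

	second->base_address  = data_phys;
	second->packet_length = data_len;	/* Packet-Length = LEN */
	second->cmd           = CMD_EOP | CMD_RS;	/* EOP=1: packet ends here */
	second->status        = 0;
}
```

After both descriptors are written, a driver would advance its tail index by two so the NIC sees the pair as a unit.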

Transmitting a ‘split-packet’

[Diagram: the packet-data buffer belongs to the Application-program in User-space, and the packet-header buffer belongs to the Device-driver module in Kernel-space. The NIC hardware fetches each of the two buffers using DMA.]

The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet

The ‘virt_to_phys()’ macro

• Linux provides a convenient macro which kernel-module code can employ to obtain the physical-address for a memory-region from its virtual-address – but it only works for addresses that aren’t in ‘high’ memory

• For ‘normal’ memory-regions, conversion between ‘virtual’ and ‘physical’ addresses amounts to a simple addition/subtraction
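The addition/subtraction can be shown in a few lines of ordinary C. This is a sketch under a stated assumption: kernel space begins at virtual address 0xC0000000, as in the usual 32-bit x86 Linux configuration. The names below are hypothetical stand-ins, not the kernel's own macros.

```c
#include <stdint.h>

/* Assumed start of the kernel's persistent mapping (32-bit x86) */
#define SKETCH_PAGE_OFFSET 0xC0000000u

/* Valid only for 'normal' memory, never for 'high' memory */
static inline uint32_t sketch_virt_to_phys( uint32_t vaddr )
{
	return vaddr - SKETCH_PAGE_OFFSET;
}

static inline uint32_t sketch_phys_to_virt( uint32_t paddr )
{
	return paddr + SKETCH_PAGE_OFFSET;
}
```

Under that assumption, kernel virtual address 0xC0100000 corresponds to physical address 0x00100000. No such fixed relationship exists for a user-space buffer, which is why the driver has to consult the page-tables instead.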

Linux memory-mapping

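[Diagram: the CPU’s virtual address-space, divided into user space and kernel space, drawn beside physical RAM. Physical RAM below the 896-MB mark has a persistent mapping into kernel space; the HMA above that mark is reached only through transient mappings.]

There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses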

Two-Level Translation Scheme

[Diagram: register CR3 locates the PAGE DIRECTORY; its entries locate PAGE TABLES; their entries locate PAGE FRAMES.]

Linear to Physical

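[Diagram: a linear address is divided into dir-index, table-index and offset. CR3 locates the page directory; dir-index selects the directory entry that locates a page table; table-index selects the table entry that locates a page frame; offset selects the byte within that page frame in the physical address-space.]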

Address-translation

• The CPU examines any virtual address it encounters, subdividing it into three fields

  bits 31-22 (10-bits)   index into page-directory
  bits 21-12 (10-bits)   index into page-table
  bits 11-0  (12-bits)   offset into page-frame

The first field selects one of the 1024 array-entries in the Page-Directory

The second field selects one of the 1024 array-entries in that Page-Table

The third field provides the offset to one of the 4096 bytes in that Page-Frame
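The three-way subdivision can be tried out in user space with plain integer arithmetic. In this sketch the structure and function names are hypothetical.

```c
#include <stdint.h>

/* The three fields of a 32-bit linear address */
struct va_fields {
	uint32_t dindex;	/* bits 31-22: index into page-directory */
	uint32_t pindex;	/* bits 21-12: index into page-table     */
	uint32_t offset;	/* bits 11-0 : offset into page-frame    */
};

struct va_fields split_linear_address( uint32_t vaddr )
{
	struct va_fields f;
	f.dindex = (vaddr >> 22) & 0x3FF;	/* 10 bits */
	f.pindex = (vaddr >> 12) & 0x3FF;	/* 10 bits */
	f.offset = (vaddr >>  0) & 0xFFF;	/* 12 bits */
	return f;
}
```

For example, the address 0xC0123456 splits into dindex 0x300, pindex 0x123 and offset 0x456.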

Format of a Page-Table entry

  bits 31-12   PAGE-FRAME BASE ADDRESS
  bits 11-9    AVAIL
  bit  8       0
  bit  7       0
  bit  6       D
  bit  5       A
  bit  4       PCD
  bit  3       PWT
  bit  2       U
  bit  1       W
  bit  0       P

LEGEND
P = Present (1 = yes, 0 = no)
W = Writable (1 = yes, 0 = no)
U = User (1 = yes, 0 = no)
A = Accessed (1 = yes, 0 = no)
D = Dirty (1 = yes, 0 = no)
PWT = Page Write-Through (1 = yes, 0 = no)
PCD = Page Cache-Disable (1 = yes, 0 = no)
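A page-table entry can be decoded with a handful of masks. This sketch uses hypothetical constant and function names; the bit positions are the ones given in the entry format.

```c
#include <stdint.h>

/* Flag bits of a 32-bit page-table entry */
enum {
	PTE_P   = 1 << 0,	/* Present            */
	PTE_W   = 1 << 1,	/* Writable           */
	PTE_U   = 1 << 2,	/* User               */
	PTE_PWT = 1 << 3,	/* Page Write-Through */
	PTE_PCD = 1 << 4,	/* Page Cache-Disable */
	PTE_A   = 1 << 5,	/* Accessed           */
	PTE_D   = 1 << 6	/* Dirty              */
};

/* Page-Frame Number: the upper 20 bits of the entry */
uint32_t pte_pfn( uint32_t pte ) { return pte >> 12; }

/* Physical base-address of the page-frame */
uint32_t pte_frame_base( uint32_t pte ) { return pte & ~0xFFFu; }

/* An entry is usable only when its Present bit is set */
int pte_present( uint32_t pte ) { return (pte & PTE_P) != 0; }
```

For instance, the entry 0x12345067 names page-frame 0x12345 and has its P, W, U, A and D bits set.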

Finding the user-buffer’s PFN

• To program the ‘base-address’ field in the second TX-Descriptor, our driver’s ‘write()’ function will need to know which physical Page-Frame the application’s buffer lies in

• And its PFN (Page-Frame Number) can be found from its virtual address by ‘walking-the-cpu-page-tables’ – even when Linux puts some page-tables in ‘high’ memory

Performing ‘virt_to_phys()’

ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
	unsigned int	_cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
	unsigned int	dindex, pindex, offset;

	// take apart the virtual-address of the user's 'buf' variable
	dindex = ((int)buf >> 22) & 0x3FF;	// pgdir-index (10-bits)
	pindex = ((int)buf >> 12) & 0x3FF;	// pgtbl-index (10-bits)
	offset = ((int)buf >>  0) & 0xFFF;	// frame-offset (12-bits)

	// then walk the CPU's paging-tables to get buf's physical-address
	asm(" mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
	pgdir = (unsigned int*)phys_to_virt( _cr3 & ~0xFFF );
	pfn_pgtbl = (pgdir[ dindex ] >> 12);
	pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );
	pfn_frame = (pgtbl[ pindex ] >> 12);
	kunmap( &mem_map[ pfn_pgtbl ] );
	txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
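The same walk can be rehearsed in user space against simulated tables, which makes the logic easy to test without loading a kernel module. Everything in this sketch is hypothetical scaffolding: the phys_mem array stands in for physical RAM, and the Present-bit checks are an addition that the driver fragment does not show.

```c
#include <stdint.h>

#define NFRAMES     4		/* size of the simulated physical memory   */
#define NENTRIES    1024	/* entries per page-directory / page-table */
#define PTE_PRESENT 0x001

/* Simulated physical memory: page-frame number N is phys_mem[ N ] */
static uint32_t phys_mem[ NFRAMES ][ NENTRIES ];

/* Translate 'vaddr' using the tables rooted at 'cr3'.
 * Returns 0 and stores the physical address on success, else -1. */
int walk_page_tables( uint32_t cr3, uint32_t vaddr, uint32_t *paddr )
{
	uint32_t dindex = (vaddr >> 22) & 0x3FF;
	uint32_t pindex = (vaddr >> 12) & 0x3FF;
	uint32_t offset = vaddr & 0xFFF;

	/* CR3 holds the physical address of the page-directory */
	uint32_t pfn_pgdir = cr3 >> 12;
	if ( pfn_pgdir >= NFRAMES ) return -1;

	/* the directory entry locates the page-table */
	uint32_t pde = phys_mem[ pfn_pgdir ][ dindex ];
	if ( !(pde & PTE_PRESENT) || (pde >> 12) >= NFRAMES ) return -1;

	/* the table entry locates the page-frame */
	uint32_t pte = phys_mem[ pde >> 12 ][ pindex ];
	if ( !(pte & PTE_PRESENT) ) return -1;

	*paddr = (pte & ~0xFFFu) + offset;
	return 0;
}
```

With the directory placed in frame 1, a page-table in frame 2, and entry 0x49 of that table pointing at frame 3, the virtual address 0x08049123 translates to physical address 0x3123.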

Can’t cross a ‘page-boundary’

• In order for the NIC to fetch the user’s data using its Bus-Master DMA capability, the buffer needs to reside in a physically contiguous memory-region

• But we can’t be sure Linux will have setup the CPU’s page-tables that way – unless the ‘buf’ is confined to a single page-frame


Truncate ‘len’ if necessary

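ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
	if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;

[Diagram: three consecutive regions, each PAGE_SIZE bytes long. ‘buf’ begins ‘offset’ bytes into one of them and extends for ‘len’ bytes, reaching past the end of that page.]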
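The truncation rule is easy to check in isolation. This sketch assumes 4096-byte pages; the names SKETCH_PAGE_SIZE and clamp_to_page() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size (x86, 4KB pages) */

/* Return how many of the 'len' bytes starting at 'buf_addr'
 * lie within the page-frame that contains 'buf_addr'. */
size_t clamp_to_page( uintptr_t buf_addr, size_t len )
{
	size_t offset = buf_addr & (SKETCH_PAGE_SIZE - 1);
	if ( offset + len > SKETCH_PAGE_SIZE )
		len = SKETCH_PAGE_SIZE - offset;
	return len;
}
```

A buffer that starts at offset 0xF00 within its page and is 0x200 bytes long is cut back to 0x100 bytes. Since write() reports how many bytes it accepted, an application that checks the return value can submit the remainder in a further call.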

‘zerocopy.c’

• We created this modification of our ‘nic2.c’ device-driver so its ‘my_write()’ function lets an application perform transmissions without performing a memory-to-memory copy-operation (i.e., ‘copy_from_user()’)

• It is not so easy to implement ‘zero-copy’ for receiving packets – can you say why?

Website article

• We’ve posted a link on our CS686 website to a frequently cited research-article about the various issues that arise when trying to implement the ‘zero-copy’ concept for the case of ‘incoming’ network-packets:

The Need for Asynchronous, Zero-Copy Network I/O, by Ulrich Drepper, Red Hat, Inc.

http://people.redhat.com/drepper/newni.pdf
