Epoll - From the kernel side
“There are no secret messages in the source code.”
Lijin Liu <[email protected]>, Twitter: @llj098, blog: http://blog.fatlj.me
Some basics about I/O
• Q: What is I/O?
• A: I/O is what connects the CPU to the outside world.
Some basics about I/O
• Three kinds of I/O:
  – memory-mapped input/output
  – I/O-mapped (port-mapped) input/output
  – direct memory access (DMA)
Some basics about I/O
• Buses: PCI, ISA, EISA, NuBus, ...
• PCI controller
• Interrupt controller
  – Also an I/O device
  – Some devices can communicate with it directly, without needing to talk to the CPU
• Polled I/O
• Deferred handling
I/O models – back to the software world
• Blocking I/O
  – normal read/write/open... system calls
• Non-blocking I/O
  – fcntl/ioctl
• I/O multiplexing
  – select
• Event driven
  – epoll/kqueue
• AIO
  – IOCP
Blocking/non-blocking I/O
• User-space API/system calls
  – read, write, accept, open, close, ...
• Blocking I/O
  – one thread/process per connection
• Non-blocking I/O
  – ioctl/fcntl
  – the caller checks readiness in a loop
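The "loop check" style of non-blocking I/O can be sketched as follows. This is a minimal illustration, not code from the talk: the helper names set_nonblocking() and read_with_loop_check() and the max_tries parameter are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * "Loop check": call read() up to max_tries times on a non-blocking fd.
 * Returns the number of bytes read, 0 on EOF, or -1 on failure. If no data
 * showed up within max_tries attempts, errno is EAGAIN.
 */
ssize_t read_with_loop_check(int fd, void *buf, size_t len, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                       /* data or EOF */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                      /* a real error */
        /* nothing yet: spin and try again (this is what burns CPU) */
    }
    errno = EAGAIN;
    return -1;
}
```

With one end of a pipe() set non-blocking, the read fails with EAGAIN until the other end is written to. The process never sleeps, which is why this model wastes CPU as the number of connections grows.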
I/O multiplexing
• select/poll
• Shortcomings
  – the number of fds is limited (select: FD_SETSIZE)
  – it is another type of loop check: every call rescans all watched fds
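A small sketch of the "another type of loop check" point, using poll(). The helper name wait_readable() and its ready_idx output array are assumptions for illustration. poll() only reports how many descriptors are ready, so user space still has to scan the whole array.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>   /* pipe(), write() for the usage described below */

/*
 * Wait up to timeout_ms for any of the nfds descriptors to become readable.
 * Stores the indexes of readable entries in ready_idx (capacity nfds) and
 * returns how many were found, 0 on timeout, or -1 on error.
 */
int wait_readable(struct pollfd *fds, size_t nfds, int timeout_ms,
                  size_t *ready_idx)
{
    assert(fds != NULL && ready_idx != NULL);
    for (size_t i = 0; i < nfds; i++) {
        fds[i].events = POLLIN;
        fds[i].revents = 0;
    }
    int n = poll(fds, (nfds_t)nfds, timeout_ms);
    if (n <= 0)
        return n;
    /* poll() returned a count only: walk every entry to find the ready ones. */
    int found = 0;
    for (size_t i = 0; i < nfds; i++)
        if (fds[i].revents & POLLIN)
            ready_idx[found++] = i;
    return found;
}
```

For example, with the read ends of two pipes in fds and one byte written to the second pipe, the call reports one ready descriptor at index 1. The scan cost depends on nfds, not on how many descriptors are actually active.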
SELECT/POLL Internals - basics
• Process: task_struct
  – Linux has no separate thread object, just processes or "tasks"
• Data structure
  – include/linux/list.h
• Process scheduler
  – CFS
• Process state machine
SELECT/POLL Internals - basics
• Sleep/wake-up mechanism in the Linux kernel: wait_queue
• Structures:
    struct __wait_queue {
        unsigned int flags;
        void *private;
        wait_queue_func_t func;      /* callback function */
        struct list_head task_list;
    };
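To make the role of the func callback concrete, here is a user-space model of a wait queue. It is a simplified sketch, not kernel code: all model_* names are hypothetical, a singly linked list stands in for list_head, and an int flag stands in for a task being made runnable.

```c
#include <assert.h>
#include <stddef.h>

struct wait_entry;
typedef int (*wait_func_t)(struct wait_entry *entry, void *key);

/* User-space stand-in for struct __wait_queue. */
struct wait_entry {
    unsigned int flags;
    void *private_data;          /* the kernel field is named "private" */
    wait_func_t func;            /* callback run when the queue is woken */
    struct wait_entry *next;     /* simplified replacement for list_head */
};

struct model_wait_queue_head {
    struct wait_entry *first;
};

/* Register a waiter at the head of the queue. */
void model_add_wait_queue(struct model_wait_queue_head *q, struct wait_entry *e)
{
    assert(q != NULL && e != NULL);
    e->next = q->first;
    q->first = e;
}

/* Run every waiter's callback; returns how many reported a wake-up. */
int model_wake_up(struct model_wait_queue_head *q, void *key)
{
    int woken = 0;
    for (struct wait_entry *e = q->first; e != NULL; e = e->next)
        if (e->func != NULL && e->func(e, key))
            woken++;
    return woken;
}

/* Example callback: mark the "task" (an int flag here) as runnable. */
int model_default_wake(struct wait_entry *e, void *key)
{
    (void)key;
    *(int *)e->private_data = 1;
    return 1;
}
```

The point of the callback field is that waking a queue does not have to mean "make a task runnable". A waiter can install any function, which is what epoll relies on later.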
SELECT/POLL Internals – basics
• How to wait:
  – linux/kernel/sched/core.c: schedule() -> __schedule():
    ...
    next = pick_next_task(rq);
    ...
    context_switch(rq, prev, next); /* unlocks the rq */
    switch_mm(oldmm, mm, next);     /* arch-specific, x86: arch/x86/include/asm/mmu_context.h */
    ...
SELECT/POLL Internals - some basics
• Interrupts
  – Interrupt controller
    • A device
    • Programmable
  – Interrupt handler
    • Often the device driver registers itself as the interrupt handler
  – Softirq
    • Bottom halves
    • ksoftirqd
SELECT/POLL Internals
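• Why are poll/select not cool?
  – Every call walks all watched fds and invokes each file's poll method, so the cost grows with the number of watched fds rather than the number of active ones
• fs/select.c, do_select():
    for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
        ...
        file = fget_light(i, &fput_needed);
        if (file) {
            f_op = file->f_op;
            mask = DEFAULT_POLLMASK;
            if (f_op && f_op->poll)
                mask = (*f_op->poll)(file, retval ? NULL : wait);
            ...
        }
    }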
SELECT/POLL Internals -- tcp_poll
• In the VFS part, xxx_poll() does not block
• net/ipv4/tcp.c: unsigned int tcp_poll()
    /* connect/close state checks omitted */
    ...
    if (tp->urg_seq == tp->copied_seq &&
        !sock_flag(sk, SOCK_URGINLINE) &&
        tp->urg_data)
        target++;
    if (tp->rcv_nxt - tp->copied_seq >= target)
        mask |= POLLIN | POLLRDNORM;
    ...
Here comes EPOLL
• User-space API
  – epoll_create(), epoll_ctl(), epoll_wait()
• Structures
    struct epoll_event {
        uint32_t events;
        epoll_data_t data;
    };
    typedef union epoll_data {
        void *ptr;
        int fd;
        uint32_t u32;
        uint64_t u64;
    } epoll_data_t;
• LT/ET mode (level-triggered / edge-triggered)
Epoll code demo
    #define MAX_EVENTS 10
    struct epoll_event ev, events[MAX_EVENTS];
    int efd = epoll_create(1024);   /* the size argument is only a hint */
    ...
    epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &ev);
    ...
    while (1) {
        int n = epoll_wait(efd, events, MAX_EVENTS, -1);
        for (i = 0; i < n; i++) {
            /* index with i, not n: events[n] is past the returned entries */
            if (events[i].data.fd == listenfd) {
                conn = accept(listenfd, (struct sockaddr *)addr, &addrlen);
                setnonblocking(conn);
                ev.events = EPOLLIN | EPOLLET;   /* edge-triggered read */
                ev.data.fd = conn;
                epoll_ctl(efd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                do_work(events[i].data.fd);
            }
        }
    }
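The difference between LT and ET mode can be observed with a pipe. The sketch below is an illustration added here, with count_notifications() as a made-up helper name. It writes one byte, then calls epoll_wait() twice without reading the data in between.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Register the read end of a fresh pipe with EPOLLIN | trigger_flags
 * (0 for level-triggered, EPOLLET for edge-triggered), write one byte, and
 * call epoll_wait() twice WITHOUT draining the pipe. The two return values
 * are stored in *first and *second. Returns 0 on success, -1 on error.
 */
int count_notifications(uint32_t trigger_flags, int *first, int *second)
{
    assert(first != NULL && second != NULL);
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int efd = epoll_create1(0);
    if (efd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | trigger_flags };
    ev.data.fd = p[0];
    int rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    if (rc == 0 && write(p[1], "x", 1) != 1)
        rc = -1;
    if (rc == 0) {
        struct epoll_event out;
        *first = epoll_wait(efd, &out, 1, 100);   /* data just arrived */
        *second = epoll_wait(efd, &out, 1, 100);  /* data is still unread */
    }
    close(efd);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

In level-triggered mode both calls report the fd, because it is still readable. In edge-triggered mode only the first call does, and the second times out. This is why ET handlers such as do_work() in the demo should read until EAGAIN on a non-blocking fd.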
EPOLL Internals
• Some structures
  – Kernel side:
    • eventpoll: the main structure, created by epoll_create()
    • epitem: wraps a file; this is the struct stored in the RB tree
    • eppoll_entry: wait structure for poll hooks
    • epoll_event: same as in user space
  – User space:
    • epoll_event: same structure as on the kernel side
    • epoll_data: custom data area
EPOLL Internals
• Why does it work?
  – The kernel is event based; user space may not be
• How does it work?
  – Add the fd to the epoll instance with epoll_ctl()
  – Sleep in epoll_wait() to fish for active fds
  – An interrupt happens
  – The active fds are sent to user space
  – The sleeping process is woken up
EPOLL Internals
• Adding an fd to epoll
• fs/eventpoll.c:
  – epoll_ctl() -> ep_insert() -> ep_rbtree_insert()
• When we add an fd to an eventpoll, first initialize the corresponding structure: epitem
• Set up a callback function for this file
• Add the epitem to the rbtree
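From user space, the bookkeeping done at insert time is visible through the error codes of epoll_ctl(). The following sketch uses an assumed helper name, add_for_read(), and shows that an fd can be registered with a given epoll instance only once: the kernel looks up existing entries before inserting a new epitem, and a second EPOLL_CTL_ADD fails with EEXIST.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Register fd with the epoll instance efd for EPOLLIN.
 * Returns 0 on success, or the errno value set by epoll_ctl()
 * (for example EEXIST for a duplicate add, EBADF for an invalid fd).
 */
int add_for_read(int efd, int fd)
{
    assert(efd >= 0);
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = fd;
    if (epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev) == -1)
        return errno;
    return 0;
}
```

Calling add_for_read() twice with the same pipe fd returns 0 and then EEXIST. To change the event mask of an fd that is already registered, use EPOLL_CTL_MOD instead of adding it again.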
EPOLL Internals - how to sleep
• Two wait_queues
  – one for the current process
  – one for ksoftirqd
• epoll_wait() system call
  – set the current process to TASK_INTERRUPTIBLE
  – schedule()
EPOLL Internals - how to wake up
• Workflow
  – Interrupt handler runs
  – The fd becomes active
  – wait_queue #1 is woken in ksoftirqd context
  – ep_poll_callback() fires and wakes wait_queue #2
  – The ready fds are copied to user space
  – The user process is set to running
  – The user process is scheduled: it is awake!
EPOLL Internals - how to wake up
• TCP demo
  – cd net/ipv4/
  – af_inet.c: struct net_protocol tcp_protocol
  – tcp_ipv4.c: tcp_v4_rcv()
  – tcp_ipv4.c: tcp_v4_do_rcv()
  – tcp_input.c: tcp_rcv_established()
  – cd ../core
  – sock.c: sock->sk_data_ready()
  – sock.c: sock_def_readable()
  – ep_poll_callback():
    • add the fd to the epoll's ready list
    • wake the process blocked above (in epoll_wait)
  – after the blocked process wakes:
    • ep_send_events()
    • ep_scan_ready_list(): copy the epoll's ready list to a temporary list (reference copy)
    • ep_send_events_proc(): transfer to user space
    • move the ovflist entries to the epoll's ready list
EPOLL: the whole picture
• Two wait_queues: one for ksoftirqd, one for the user process; one fires the other
• Three locks (two mutexes, one spinlock)
• An epitem red-black tree
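The "one fires the other" chain can be modeled in plain user-space C. This is a deliberately simplified sketch with hypothetical model_* names. It leaves out the locks, the red-black tree and the ovflist, and uses a fixed array in place of the ready list.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_READY 16

struct model_epitem;

/* Hypothetical model of an epoll instance (not the kernel's eventpoll). */
struct model_eventpoll {
    struct model_epitem *ready[MODEL_MAX_READY]; /* stand-in for the ready list */
    size_t nready;
    int waiter_runnable;   /* stand-in for wait_queue #2 (the user process) */
};

/* Stand-in for an epitem: links one fd to the instance it was added to. */
struct model_epitem {
    int fd;
    struct model_eventpoll *ep;
    int on_ready_list;
};

/*
 * Models ep_poll_callback(): runs when the file's own wait queue (#1) is
 * woken. Queues the item at most once, then wakes the epoll_wait() sleeper.
 */
void model_poll_callback(struct model_epitem *item)
{
    assert(item != NULL && item->ep != NULL);
    struct model_eventpoll *ep = item->ep;
    if (!item->on_ready_list && ep->nready < MODEL_MAX_READY) {
        ep->ready[ep->nready++] = item;
        item->on_ready_list = 1;
    }
    ep->waiter_runnable = 1;
}

/*
 * Models the tail of epoll_wait(): copy up to maxevents ready fds to the
 * caller and keep the remainder queued. Returns the number copied.
 */
size_t model_send_events(struct model_eventpoll *ep, int *out, size_t maxevents)
{
    size_t n = ep->nready < maxevents ? ep->nready : maxevents;
    for (size_t i = 0; i < n; i++) {
        out[i] = ep->ready[i]->fd;
        ep->ready[i]->on_ready_list = 0;
    }
    for (size_t i = n; i < ep->nready; i++)
        ep->ready[i - n] = ep->ready[i];
    ep->nready -= n;
    ep->waiter_runnable = 0;
    return n;
}
```

The key property is that the work done per wake-up depends on the number of ready items, not on the number of registered fds. That is the difference from the do_select() loop.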
Compared with IOCP
• IOCP is AIO; epoll/kqueue are event-based multiplexing
• IOCP needs to take care of the I/O operation itself
• epoll is just a notification mechanism: light and flexible
• IOCP needs a thread pool, which adds overhead