gsc/pub/master/sreddy/doc/reportv4.docx · web viewserial attached scsi [1], the successor of scsi...

Investigating Serial Attached SCSI (SAS) over TCP (tSAS)

Master’s Project Report

by

Deepti Reddy

As part of the requirements for the degree of

Master of Science in Computer Science

University of Colorado, Colorado Springs

Committee Members and Signatures

Approved by Date

______________________________ _______________

Project Advisor: Dr. Edward Chow

______________________________ _______________

Member: Dr. Xiaobo Zhou

______________________________ _______________

Member: Dr. Chuan Yue

1

Acknowledgements I would like to express my sincere thanks and appreciation to Dr. Chow. His support and encouragement as my advisor helped me learn a lot and fuelled my enthusiasm for this project. He has been an excellent professor, mentor and advisor throughout my Master’s program at UCCS.

Appreciation and thanks also to my committee members Dr. Zhou and Dr. Yue for their guidance and support. I would also like to thank Patricia Rea for helping me with all the required paper work during my Master’s program.

2

Investigating Serial Attached SCSI (SAS) over TCP (tSAS).............................................................................4

1. Abstract...............................................................................................................................................5

2. Background on SCSI, iSCSI & SAS.............................................................................................................5

2.1 SCSI (Small Computer Systems Interface)..........................................................................................5

2.1.1 SCSI Architecture Model........................................................................................................6

2.1.2 SCSI Command Descriptor Block.........................................................................................7

2.1.3 Typical SCSI IO Transfer.......................................................................................................9

2.1.4 Limitations of SCSI..........................................................................................................10

2.2 iSCSI (Internet Small Computer System Interface)...........................................................................11

2.2.1 iSCSI Session and Phases..................................................................................................11

2.2.2 iSCSI PDU.............................................................................................................................12

2.2.3 Data Transfer between Initiator and Target(s)............................................................13

2.2.4 Read/Write command sequence in iSCSI....................................................................15

2.3 Serial Attached SCSI (SAS)....................................................................................................................16

Figure 2.3.0 – A typical SAS topology.....................................................................................................18

2.3.1 Protocols used in SAS....................................................................................................18

2.3.2 Layers of the SAS Standard..........................................................................................18

2.3.3 SAS Ports.........................................................................................................................20

2.3.4 Primitives..........................................................................................................................20

2.3.5 SSP frame format............................................................................................................21

2.3.6 READ/WRITE command sequence..............................................................................25

3.0. tSAS (Ethernet SAS).......................................................................................................................27

3.1 Goal, Motivation and Challenges of the Project........................................................................27

3.2 Project Implementation.............................................................................................................27

3.2.0 tSAS Topology and Command flow sequence............................................................28

3.2.1 Software and Hardware solutions for tSAS implementations........................................32

3.2.2 Primitives..........................................................................................................................34

3.2.4 Task Management................................................................................................................40

3.2.5 tSAS mock application to compare with an iSCSI mock application..............................40

3.3 Performance evaluation............................................................................................................43

3.3.0 Measuring SAS performance using IOMeter in Windows and VDbench in Linux. .43

3.3.1 Measuring iSCSI performance using IOMeter in Windows.......................................50

3

3.3.2 Measuring tSAS performance using the client and server mock application written and comparing it to the iSCSI client/server mock application as well as to legacy SAS and legacy iSCSI....................................................................................................................................55

4.0 Similar Work..................................................................................................................................73

5.0 Future Direction.............................................................................................................................73

6.0 Conclusion (Lessons learned).........................................................................................................74

7.0 References.....................................................................................................................................74

8.0 Appendix........................................................................................................................................77

8.1 How to run the tSAS and iSCSI mock initiator (client) and target (server) application...............77

8.2 How to run iSCSI Server and iSCSI Target Software...................................................................81

8.3 How to run LeCroy SAS Analyzer Software................................................................................81

8.4 WireShark to view the WireShark traces...................................................................................81

Investigating Serial Attached SCSI (SAS) over TCP (tSAS)Project directed by Professor Edward Chow

4

1. Abstract

Serial Attached SCSI [1], the successor of SCSI is gaining popularity by leaps and bounds in enterprise storage systems. SAS is reliable, cheaper, faster and more scalable than its predecessor SCSI. One of the limiting features of SAS is its distance limitation. A single point to point SAS cable connection can cover only around 8 meters. To scale topologies to support a large number of devices beyond the native port count, expanders are used in SAS topologies [2]. With zoning [2] capabilities introduced in SAS2 expanders, SAS is gaining popularity in Storage Area Networks. With the growing demand for SAS in large topologies arises the need to investigate SAS over TCP (tSAS) to increase the distance and scalability of SAS. The iSCSI protocol [3] today provides similar functionality where it sends SCSI commands over TCP. However, SAS drives and SAS expanders can’t be used in an iSCSI topology such that the iSCSI HBA talks directly in-band to SAS devices making the iSCSI back-end less scalable than tSAS. The iSCSI specification is leveraged heavily for design consideration for tSAS.

The goal of this project is to provide research results for future industry specification for tSAS and iSCSI. The project involves understanding the iSCSI protocol as well as the SAS protocol and providing guidance on how tSAS can be designed. The project also involves investigating sending a set of SAS commands and responses over TCP/IP (tSAS) to address scalability and the distance limitations of legacy SAS. A client prototype application will be implemented to send/receive a small set of commands. A server prototype application will be implemented that receives a set of tSAS commands from the client and sends tSAS responses. The server application mocks a tSAS Initiator while the client application mocks a tSAS target. The performance of tSAS will be compared to legacy SAS and iSCSI to determine the speed and scalability of tSAS. To compare fairly with legacy iSCSI, a client and server prototype that mock an iSCSI Initiator and iSCSI target will also be implemented.

2. Background on SCSI, iSCSI & SAS

2.1 SCSI (Small Computer Systems Interface)

Since work first began in 1981 on an I/O technology that was later named the Small Computer System Interface, this set of standard electronic interfaces has evolved to keep pace with a storage industry that demands more performance, manageability, flexibility, and features for high-end desktop/server connectivity each year [4]. SCSI allows connectivity with up to seven devices on a narrow bus and 15 devices on a wide bus, plus the controller [5].

5

The SCSI protocol is an application layer storage protocol. It's a standard for connecting peripherals to your computer via a standard hardware interface, which uses standard SCSI commands. The primary motivation for SCSI was to provide a way to logically address blocks. Logical addresses eliminate the need to physically address data blocks in terms of cylinder, head, and sector. The advantage of logical addressing is that it frees the host from having to know the physical organization of a drive [6][7][8][9]. Currently the SCSI protocol being used is SCSI-3. The SCSI standard defines the data transfer process over a SCSI bus, arbitration policies, and even device addressing [10].

Below is a snapshot of SCSI history:

Type/Bus Approx. Speed Mainly used forSCSI-2 (8 bit narrow) 10 MB/Sec Scanners, Zip-drives, CD-

ROMsUltraSCSI (8-bit narrow) 20 MB/Sec CD-Recorders, Tape Drives,

DVD drivesUltra Wide SCSI (16-bit wide) 40 MB/Sec Lower end Hard Disk DrivesUltra2 SCSI (16-bit wide) 80 MB/Sec Mid range Hard Disk DrivesUltra 160-SCSI (16-bit Wide) 160 MB/Sec High end Hard Dis Drives and

Tape DrivesUltra-320 SCSI (16-bit Wide) 320 MB/Sec State-of-the-art Hard Disk

Drives, RAID backup applications

Ultra-640 SCSI (16-bit Wide) 640 MB/Sec High end Hard Disk Drives, RAID applications, Tape Drives

Figure 2.1.0 – Snapshot of SCSI History [10]

The SCSI protocol emerged as the predominant protocol inside host servers because of its well-standardized and clean message-based interface [11]. Moreover, in later years, SCSI supported command queuing at the storage devices and also allowed for overlapping commands [11]. In particular, since the storage was local to the server, the preferred SCSI transport that was used was Parallel SCSI where multiple storage devices were connected to the host server using cable-based bus [11].

2.1.1 SCSI Architecture ModelThe SCSI architecture model is a client-server model. The initiator (Host Bus Adapter) initiates commands and acts like the client while the target (hard disk drives, tape drives etc) responds to commands initiated by the initiator and therefore act as servers. Figure 2.1.1.0 & 2.1.1.1 show the SCSI architecture model[9][12].

6

Figure 2.1.1.0: SCSI Standards Architecture Model [9][12]

Figure 2.1.1.1: Basic SCSI Architecture[9]

2.1.2 SCSI Command Descriptor Block

7

Protocol Data Units (PDUs) are passed between the initiator and target to send commands between a client and server. A PDU in SCSI is known as a Command Descriptor Block (CDB). It is used to communicate a command from a SCSI application client to a SCSI device server. In other words, the CDB defines the operation to be performed by the server. A CDB may have a fixed length of 16 bytes or a variable length between 12 and 260 bytes. A typical 10 byte CDB format is shown below in Figure 2.1.2.0 [9] [13] [14].

Figure 2.1.2.0: 10 byte SCSI CDB

SCSI Common CDB Fields:

Operation Code:The first byte of the CDB consists of the operation code (opcode) and it identifies the operation being requested by the CDB. The two main Opcodes of interest for this project are Read and Write opcodes. The opcode for a Read operation is 0x28 and the opcode for a Write operation is 0x2A [9] [13] [14].

Logical block address:The logical block addresses on a logical unit or within a volume/partition begins with block zero and is contiguous up to the last logical block of that logical unit or within that volume/partition [9] [13][14].

Transfer length:The transfer length field specifies the amount of data to be transferred for each IO. This is usually the number of blocks. Some commands use transfer length to specify the requested number of bytes to be sent as defined in the command description. A transfer length of zero implies that no data will be transferred for the particular command. A command without any data and simply a response (non-DATA command) will have the transfer length set to a value of zero [9][13][14][15].

Logical Unit Numbers:

8

The SCSI protocol defines how to address the various units to which the CDB is to be delivered to. Each SCSI device (target) can be subdivided into one or more logical units (LUNs). A logical unit is simply a virtual controller that handles SCSI communications on behalf of storage devices in the target. Each logical unit has an address associated with it which is referred to as the logical unit number. Each target must have at least one LUN. If only one LUN is present, it is assigned as LUN0 [9][13][14][15].

For more details on these fields, please refer to the SCSI spec [12].

2.1.3 Typical SCSI IO TransferThe three main phases of an IO transfer are the command phase, the data phase and the status phase. The initiator sends the command to a target. Data is then exchanged between the target and initiator. Finally, the target sends the status completion for the command to the initiator. Certain commands known as non-DATA commands do not have a data phase. Figure 2.1.3.0 shows a SCSI IO transfer for a non-data command while Figure 2.1.3.1 shows a SCSI IO transfer for a data command [7][8][9][10].

Figure 2.1.3.0: Non-Data Command Transfer[9]

9

Figure 2.1.3.1: Data I/O Operation[9]

2.1.4 Limitations of SCSIAlthough the SCSI protocol has been successfully used for many years, it has limited capabilities in terms of the realization of storage networks due the limitations of the SCSI bus [11]. As the need for storage and servers grew over the years, the limitations of SCSI as a technology became seemingly obvious [14]. Initially, the use of parallel cables limited the number of storage devices and the distance capability of the storage devices from the host server. The length of the bus limits the distance over which SCSI may operate (maximum of around 25 meters)[9][14]. The limits imply that adding additional storage devices means the need to purchase a host server for attaching the storage [14]. Second, the concept of attaching storage to every host server in the topology means that the storage has to be managed on a per-host basis. This is a costly solution for centers with a large number of host servers. Finally, the technology doesn’t allow for a convenient sharing of storage between several host servers, nor typically does the SCSI technology allow for easy addition or removal of storage without host server downtime [16].

Despite these limitations, the SCSI protocol is still of importance since it can be used with other protocols simply by replacing the SCSI bus with a different interconnection type such as fibre channel, IP networks etc [9][16]. Availability of high bandwidth, low latency network interconnects such as Fibre Channel (FC) and Gigabit Ethernet (GbE) along with the complexities of managing dispersed islands of data storage, led to the development of Storage Area Networks (SANs)[16]. Lately, Internet Protocol (IP) is advocated as an alternative to transport SCSI traffic over long distance [11]. Proposals like iSCSI try to standardize the encapsulation of SCSI data in TCP/IP (Transmission Control Protocol/Internet Protocol) packets [11][17]. Once

10

the data is in IP packets, it can be carried over a range of physical network connections. Today, GbE is widely used for local area networks (LANs) and campus networks [11].

2.2 iSCSI (Internet Small Computer System Interface)The advantages of IP networks are seemingly obvious. The presence of well tested and established protocols like TCP/IP, allow IP networks both wide-area connectivity as well as proven bandwidth sharing capabilities. The emergence of Gigabit Ethernet indicates that the bandwidth requirements of serving storage over a network should not be an issue [15].

The limitations of the SCSI bus, identified in the previous section, and the increased desire for IP storage led to the development of iSCSI. iSCSI was developed as an end-to-end protocol to enable transportation of storage I/O block data over IP networks thus dispensing with the physical bus implementation as the transport mechanism[7][20][21]. iSCSI works by mapping SCSI functionality to the TCP/IP protocol. By utilizing; TCP flow control, congestion control, segmentation mechanisms, IP addressing, and discovery mechanisms, iSCSI facilitates remote backup, storage, and data mirroring [7][20][22]. The iSCSI protocol standard defines amongst other things, the way SCSI commands can be carried over the TCP/IP protocol [7][23].

2.2.1 iSCSI Session and Phases Data is transferred between an initiator and target via an iSCSI session. An iSCSI session is a physical or logical link which carries TCP/IP protocols and iSCSI PDUs, between an initiator and target. The PDUs in turn carry SCSI commands and data in the form of SCSI CDBs [7][23].

There are four phases in a session, where the first phase, login, starts with the establishment of the first TCP connection [19]. The four phases are:

1) Initial login phase: In this phase, an initiator sends the name of the initiator and target, and specifies the authentication options. The target then responds with the authentication options the target selects[19].

2) Security authentication phase: This phase is used to exchange authentication information (ID, password, certificate, etc.) based on the agreed authentication methods to make sure each party is actually talking to the intended party. The authentication can occur both ways such that a target can authenticate an initiator, and an initiator can also request the authentication of the target. This phase is optional[19]

3) Operational negotiating phase: The Operational negotiating phase is used to exchange certain operational parameters such as protocol data unit (PDU) length and buffer size. This phase as well is optional [19].

11

4) Full featured phase: This is the normal phase of an iSCSI session where iSCSI commands, and data messages are transferred between an initiator and a target(s)[19].

2.2.2 iSCSI PDU The iSCSI PDU is the equivalent of the SCSI CDB. It is used to encapsulate the SCSI CDB and any associated data. The general format of a PDU is shown in Figure 2.2.2.0. It is comprised of a number of segments, one of which is the basic header segment (BHS). The BHS is mandatory and is the segment that is mostly used. The BHS segment layout is shown in Figure 2.2.2.1. It has a fixed length of 48 bytes. The Opcode, TotalAHSLength, and DataSegmentLength fields in the BHS are mandatory fields in all iSCSI PDUs. The Additional Header Segment (AHS) begins with 4-byte Type-Length-Value (TLV) information. This field specifies the length of the actual AHS following the TLV. The Header and Data 19 digests are optional values. The purpose of these fields is to protect the integrity the authenticity of the header and data. The digest types are negotiated during the login phase [9].

Figure 2.2.2.0 – iSCSI PDU Structure

12

Figure 2.2.2.0 – Basic Header Segment (BHS)

2.2.3 Data Transfer between Initiator and Target(s)

Once the full feature phase of the normal session has been established, data can be exchanged between the initiator and the target(s). The normal session is used to allow transfer of data or/from the initiator and target.

Let us assume that an application on the initiator wishes to perform storage I/O to/from the target. This can be broken down into two stages:

1. Progression of the SCSI command through the initiator, and 2. Progression of the SCSI command through the target.

To help assist in understanding the progression of the commands, the iSCSI protocol layering model is shown in Figure 2.2.3.0 [9].

13

Figure 2.2.3.0 – iSCSI protocol layering model

Progression of a SCSI Command through the Initiator

1. The user/kernel application on the initiator issues a system call for an I/O operation which is sent to the SCSI layer.

14

3. On receipt at the SCSI layer, the system call is converted into a SCSI command and a CDB containing this information is then constructed. The SCSI CDB is then passed to the iSCSI initiator protocol layer [9].

4. At the iSCSI protocol layer, the SCSI CDB and any SCSI data are encapsulated into a PDU and the PDU is forwarded to the TCP/IP layer [9].

5. At the TCP layer, a TCP header is added. The IP layer encapsulates the TCP segment by adding an IP header before the TCP header [9].

6. The IP datagram is passed to the Ethernet Data Link Layer where it is framed with Ethernet headers and trailers. The resulting datagram is finally placed on the network [9].

Progression of a SCSI Command through the Target

1. At the target, the Ethernet frame is stripped off at the Data Link Layer. The IP datagram is passed up to the TCP/IP layer [9].

2. The IP and TCP layers each check and strip off headers and pass iSCSI PDU up to the iSCSI layer [9].

3. At the iSCSI layer, the SCSI CDB is extracted from the iSCSI PDU and passed along with the data to the SCSI layer [9].

4. Finally, the SCSI layer sends the SCSI request and data to the upper layer application [9].

2.2.4 Read/Write command sequence in iSCSI

Read Operation Example

+------------------+-----------------------+----------------------+ |Initiator Function| PDU Type | Target Function | +------------------+-----------------------+----------------------+ | Command request |SCSI Command (READ)>>> | | | (read) | | | +------------------+-----------------------+----------------------+ | | |Prepare Data Transfer | +------------------+-----------------------+----------------------+ | Receive Data | <<< SCSI Data-In | Send Data | +------------------+-----------------------+----------------------+ | Receive Data | <<< SCSI Data-In | Send Data | +------------------+-----------------------+----------------------+

15

| Receive Data | <<< SCSI Data-In | Send Data | +------------------+-----------------------+----------------------+ | | <<< SCSI Response |Send Status and Sense | +------------------+-----------------------+----------------------+ | Command Complete | | | +------------------+-----------------------+----------------------+

Figure 2.2.4.1 Read Operation Example[3]

Write Operation Example

+------------------+-----------------------+---------------------+ |Initiator Function| PDU Type | Target Function | +------------------+-----------------------+---------------------+ | Command request |SCSI Command (WRITE)>>>| Receive command | | (write) | | and queue it | +------------------+-----------------------+---------------------+ | | | Process old commands| +------------------+-----------------------+---------------------+ | | | Ready to process | | | <<< R2T | WRITE command | +------------------+-----------------------+---------------------+ | Send Data | SCSI Data-Out >>> | Receive Data | +------------------+-----------------------+---------------------+ | | <<< R2T | Ready for data | +------------------+-----------------------+---------------------+ | | <<< R2T | Ready for data | +------------------+-----------------------+---------------------+ | Send Data | SCSI Data-Out >>> | Receive Data | +------------------+-----------------------+---------------------+ | Send Data | SCSI Data-Out >>> | Receive Data | +------------------+-----------------------+---------------------+ | | <<< SCSI Response |Send Status and Sense| +------------------+-----------------------+---------------------+ | Command Complete | | | +------------------+-----------------------+---------------------+

To learn more about the SCSI command PDU, the Ready To Transfer (R2T) PDU, SCSI Data-In PDU and the SCSI Data-Out PDU, please refer to the iSCSI specification [3].

Figure 2.2.4.1 Read Operation Example[3]

2.3 Serial Attached SCSI (SAS)

SAS is the successor of SCSI technology and is becoming wide-spread as performance requirements and addressability exceeds well beyond what legacy SCSI supports.

In 2004, SAS interfaces were initially introduced at 3Gb/s. Currently, supporting 6Gb/s and moving to 12Gbps by 2012, SAS interfaces have significantly increased the available bandwidth

16

offered by legacy SCSI storage systems. Though fibre channel is more scalable, it is a costly solution for use in a SAN. Table 2.3.0 compares SCSI, SAS and Fibre Channel technologies.

SCSI SAS Fibre ChannelTopology Parallel Bus Full Duplex Full DuplexSpeed 3.2 Gbps 3 Gbps, 6Gbps

Moving to 12 Gbps2 Gbps4 GbpsMoving to 8 Gbps

Distance 1 to 12 meters 8 meters 10 kmDevices SCSI only SAS & SATA Fibre Channel onlyNumber of Targets

14 devices 128 expandersby 1 expander.>16,000 with cascaded expanders

127 devices in a loop. Switched fabric can go to millions of devices

Connectivity Single-port Dual-port Dual-portDrive Form Factor

3.5” 2.5” 3.5”

Cost Low Medium HighTable 2.3.0 – Comparing SCSI, SAS and Fibre Channel

An initiator, also called a Host Bus Adapter or Controller, is used to send commands to SAS targets. SAS controller devices have a limited number of ports. A narrow Port in SAS consisting of a single port is referred to as a PHY [1].

Expander devices in a SAS domain facilitate communication between multiple SAS devices. Expanders have a typical port count of 12 to 36 ports while SAS controllers have a typical port count of 4-16 ports. Expanders can be cascaded as well to increase scalability. One of the most significant SAS feature is the transition from 3.5” drives to 2.5” drives. This helps reduce floor space and power consumption [2]. Another advantage of using SAS targets is that a SAS hard drive is dual-ported providing a redundant path to each hard drive in case of an Initiator/Controller fail-over. Also, unlike SCSI, SAS employs a serial means of data transfer like fibre channel [25]. Serial interfaces are known to reduce crosstalk and related signal integrity issues.

Figure 2.3.0 shows an example of a typical SAS topology. SAS commands originate from the HBA driver and are eventually sent to the HBA. The SAS controller/HBA sends commands to the disk drives through the expander for expander attached targets/drives. The target replies to the command through the expander. The expander simply acts like a switch and routes the commands to the appropriate target and routes the responses from a particular target to the Controller.

17

Figure 2.3.0 – A typical SAS topology

2.3.1 Protocols used in SASThe three protocols used in SAS are Serial Management protocol, Serial SCSI Protocol and SATA Tunnel Protocol. Serial Management Protocol (SMP) [1] is used to discover the SAS topology and to perform system management. The Serial SCSI Protocol (SSP) [1] is used to send SCSI commands and receive responses from SAS targets. SATA Tunnel Protocol (STP) [1] is used to communicate with SATA targets in a SAS topology.

2.3.2 Layers of the SAS StandardBelow is the organization and layers of the SAS standard:

18

X X X

X- DISK DRIVES- EXPANDERS

MPT Driver

SAS CONTROLLER(HOST BUS ADAPTER)

Figure 2.3.2.0 – Layers of the SAS Standard

As can be seen from the above Figure 2.3.2.0, the SAS Physical layer consists of:

a) Passive interconnect (e.g., connectors and cable assemblies); and

b) Transmitter and receiver device electrical characteristics.

The phy layer state machines interface between the link layer and the physical layer to keep track of dword synchronization [2]. The link layer defines primitives, address frames, and connections. Link layer state machines interface to the port layer and the phy layer and perform the identification and hard reset sequences, connection management, and SSP, STP, and SMP specific frame transmission and reception [2]. The port layer state machines interface with one or more SAS link layer state machines and one or more SSP, SMP, and STP transport layer state machines to establish port connections and disconnections. The port layer state machines also interpret or pass transmit data, receive data, commands, and confirmations between the link and transport layers. The transport layer defines frame formats. Transport layer state machines interface to the application layer and port layer and construct and parse frame contents [2]. The application layer defines SCSI, ATA, and management specific features [2].

19

2.3.3 SAS PortsA port contains one or more phys. Ports in a device are associated with physical phys based on the identification sequence. A port is a wide port if there is more than one phy in the port. A port is a narrow port if there is only one phy in the port. In other words, a port contains groups of phy with the same SAS address, attached to another group of phys with the same SAS address [2]. Each device in the topology has a unique SAS address. Therefore, for example if a HBA is connected using PHYs 0,1,2 and 3 to expander A and PHYs 4,5,6 & 7 to expander B, PHYs 0,1,2 & 3 of the HBA are a single wide-port and PHYs 4,5 6 & 7 are part of another wide-port.

Figure 2.3.3.0 – Wide Ports in SAS

2.3.4 PrimitivesPrimitives are DWords mainly used to manage flow control. Some of the common primitives are:

1. ALIGN(s) – Used during speed negotiation of a link, rate matching of connections etc2. AIP(s) (Arbitration in Progress) - AIP is transmitted by an expander device after a

connection request to specify that the connection request is being processed and specify the status of the connection request.

3. BREAKS(s) – A phy aborts a connection requests and break a connection by transmitting the BREAK primitive sequence

4. CLOSE – A close primitive is used to close a connection

20

5. OPEN ACCEPT – Specifies a connection has been accepted6. OPEN REJECT – These primitives are used to specify that a connection has been rejected

and specifies the reason for the rejection as well.7. ACK – Specifies the acknowledgement of a SSP frame8. NAK – Negative acknowledgement of a SSP frame9. RRDY – Advertise SSP frame credit10. BROADCAST(s) – Used to notify SAS ports of events such as change in topology etc [1]

To learn more about the other primitives and the primitives mentioned above, please refer to the SAS Specification.

2.3.5 SSP frame formatIn this project, we primarily work with SSP Read/Write commands. A typical SSP frame format is below:

21

Figure 2.3.5.0 – SSP Frame Format

The Information Unit is a DATA frame, XFER_RDY frame, COMMAND frame, RESPONSE frame or TASK frame. For SSP requests of interest for this project, the information unit is either a COMMAND frame, XFER_RDY frame, DATA frame or a RESPONSE frame [2].

Command frame:

22

The COMMAND frame is sent by an SSP initiator port to request that a command be processed. The command frame consists of the logical unit number the command is intended for as well as the SCSI CDB that contains the type of command, transfer length etc [2].

Figure 2.3.5.1 – SSP Command Frame

XFER_RDY frame:

The XFER_RDY frame is sent by an SSP target port to request write data from the SSP initiator port during a write command or a bidirectional command [2].

Figure 2.3.5.2 – SSP XFER_RDY Frame

23

The REQUESTED OFFSET field contains the application client buffer offset of the segment of write data in the data-out buffer that the SSP initiator port may transmit to the logical unit using write DATA frames [2].

The WRITE DATA LENGTH field contains the number of bytes of write data the SSP initiator port may transmit to the logical unit using write DATA frames from the application client data-out buffer starting at the requested offset [2].

DATA frame:

Figure 2.3.5.3 – SSP Data Frame

A typical DATA frame in SAS is limited to 1024 bytes (1K) [2].

Response Frame:

The response frame is sent by an SSP target port n response to a SSP command by an initiator [2].

24

Figure 2.3.5.4 – SSP Response Frame

A successful write/read completion will not contain any sense data. In this project, we work with successful read/write completions and therefore sense data won’t be returned by the target.

2.3.6 READ/WRITE command sequenceSSP Read Sequence [20]

25

Figure 2.3.6.0 – SSP Read Sequence

SSP Write Sequence[20]

Figure 2.3.6.1 – SSP Write Sequence

26

3.0. tSAS (Ethernet SAS)

3.1 Goal, Motivation and Challenges of the ProjectThe goal of this project is to investigate sending a set of SAS commands, data and responses over TCP/IP and to investigate how tSAS can performs against legacy iSCSI and legacy SAS as best as possible. Since Ethernet contains its own physical layer, SAS over TCP (tSAS) eliminates the need for the SAS physical layer overcoming the distance limitations of SAS. This overcomes the distance limitation of the Serial Attached Small Computer System Interface (SAS) physical layer interface so that SAS storage protocol may be used for communication between host systems and storage controllers in the Storage Area Network (SAN) [21]. SANs allow sharing of data storage over long distances and still permit centralized control and management [16]. More particularly, the SAN embodiments can comprise at least one host computer system and at least one storage controller that are physically separated by greater than around 8 meters which is the physical limitation of a SAS cable. An Ethernet fabric can connect the host computer system(s) and storage controller(s)[21]. The SAS storage protocol over TCP can also be used to communicate between storage controllers/hosts and SAS expanders as explained later in this section.

Using gigabit Ethernet (10G/40G/100G) [32], tSAS also overcomes the 6G and 12G limitations of SAS2 (6G) and SAS3 (12G) respectively. As mentioned earlier in this paper, the main challenge of developing an tSAS client/server application is that there is no standard specification for tSAS. We will leverage Michael Ko’s patent [21] on SAS over Ethernet[27] to help us through the process of defining our tSAS protocol required for this project.

Similar to iSCSI, TCP was chosen as the transport for tSAS. TCP has many features that are utilized by iSCSI. The exact same features and reasoning is behind the choice of using TCP for tSAS as well.

• TCP provides reliable in-order delivery of data.

• TCP provides automatic retransmission of data that was not acknowledged.

• TCP provides the necessary flow control and congestion control to avoid overloading a congested network.

• TCP works over a wide variety of physical media and inter-connect topologies. [23]

3.2 Project Implementation

27

3.2.0 tSAS Topology and Command flow sequenceThe Figure 3.2.0.0 below shows a typical usage of tSAS to expand scalability, speed and distance of legacy SAS by using a tSAS HBA. In Figure 3.2.0.0, tSAS is the protocol of communication used between a remote tSAS HBA and a tSAS controller. The tSAS controller is connected to the back-end expander and drives using legacy SAS cables.

Figure 3.2.0.0 – Simple tSAS Topology

All SSP frames will be encapsulated in an Ethernet frame. Figure 3.2.0.1 shows how an Ethernet frame with the SSPframe data encapsulated in it looks. The tSAS Header is the SSP Frame Header and the tSAS Data is the SSP Information Unit (Refer to Figure 2.3.4.0 for the SSP frame format).

28

X


SCSI Driver

tSAS CONTROLLER/

HBA

- TCP link- SAS link

SCSI Driver

tSAS

HBA

Figure 3.2.0.1 – tSAS header and data embedded in an Ethernet frame

The back-end of tSAS is a tSAS HBA that can receive tSAS commands, strip off the TCP header and pass on the SAS command to the expander and drives. The back-end of tSAS will talk in-band to the SAS expanders and drives. The remote tSAS Initiator communicates with the tSAS target by sending tSAS commands. Figure 3.2.0.2 shows the typical SSP request/response read data flow. The tSAS SSP Request is initially sent by the tSAS Initiator to the tSAS Target over TCP. The tSAS Target strips off the TCP header and sends the SSP request using the SAS Initiator block on the tSAS Target to the SAS expander. The SAS expander sends the data frames and the SSP Response to the tSAS Target. Finally, the tSAS Target embeds the SSP data frames and response frame over TCP and sends the frames to the tSAS Initiator. A write (Figure 3.2.0.3) will look the same with the tSSP Request sent by the initiator followed by the Xfer_rdy (similar to the R2T in iSCSI) sent by the target followed by the DATA sent by the initiator and finally the tSAS response from the target.

29

Ethernet Header

IP Heade

r

TCP Heade

r

tSAS

Header

tSASData

Ethernet Trailer TCP

SegmentIP Datagra

mEthernet Frame

tSAS SSP Request

tSAS Initiator

tSAS Target SAS Expander

Open Address FrameOpen Accept

SSP Request FrameData Frame

.

.Data Frame

tSAS SSP Response

SSP Response Frame

Data Frame..

Data Frame

Figure 3.2.0.2 – tSAS Read SSP Request & Response Sequence Diagram.

This figure doesn’t show all the SAS primitives exchanged on the SAS wire within a connection after the Open Accept

Figure 3.2.0.3 – tSAS WRITE SSP Request & Response Sequence Diagram.

This figure doesn’t show all the SAS primitives exchanged on the SAS wire within a connection after the Open Accept

SAS over Ethernet can also be used for a SAS controller to communicate with a SAS expander. In SAS1, expanders did not have support to receive SAS commands out-of-band. SAS1 controllers/HBAs would need to send commands to an expander in-band even for expander diagnosis and management. SAS HBAs/controllers have a lot more complex functionally than expanders. Diagnosing issues by sending commands in-band to expanders made it harder and time-consuming to root cause where the problem is in the SAS topology. Also, managing expanders via in-band lacked the advantage of remotely managing expanders via out-of-band over Ethernet. With the gaining popularity of zoning, expander vendors have implemented support for limited SMP zoning commands out-of-band via Ethernet in SAS2 [1]. A client management application is used to send a limited set of SMP commands out-of-band to the expander. The expander processes the commands and sends the SMP responses out-of-band to the management application. Figure 3.2.0.4 shows the communication between the client management application and the expander during a SMP command.

30

tSAS SSP Request

tSAS Initiator

tSAS Target SAS Expander


SSP Request FrameXfer Rdy Frame

.

.Data Frame

tSAS SSP ResponseSSP Response Frame

Data Frame..

Data Frame

.

.Data Frame

Xfer Rdy Frame

Figure 3.2.0.4 – SMP Request & Response Sequence Diagram.

This figure doesn’t show the SAS primitives exchanged on the SAS wire within a connection after the Open Accept

This already existing functionality on a SAS expander can be leveraged to design the tSAS functionality on an Expander to communicate via TCP with a SAS controller/HBA. Figure 3.2.0.5 shows a topology where the tSAS protocol is used for communication between the tSAS Controller and the back-end expander as well. Michael Ko’s patent doesn’t cover using tSAS to talk with expanders. However, expanders can also be designed to send commands/data/responses via TCP.

31

SMP Request

ClientSMP Initiator Port

on ExpanderSMP Target Port on

Target Expander


SMP RequestSMP Response

Close

SMP Response

Figure 3.2.0.5 -Topology where tSAS is used to communicate with an expander

3.2.1 Software and Hardware solutions for tSAS implementationsSimilar to iSCSI, tSAS can be implemented in both hardware and software. This is one of the benefits of iSCSI and tSAS since each organization can customize their SAN configuration based on budget and the performance needed [23][24].

Software based tSAS solution:

This solution is cheaper than a hardware based tSAS solution since you do not need extra hardware to implement this tSAS solution. In this solution, all tSAS processing is done by the processor and TCP/IP operations are also executed by the CPU. The NIC is merely an interface to the network, this implementation requires a great deal of CPU cycles hurting the overall performance of the system [23][24].

TCP/IP Offload engine tSAS solution:

As network infrastructures have reached Gigabit speeds, network resources are becoming more abundant and the bottleneck is moving from the network to the processor. Since TCP/IP processing requires a large portion of CPU cycles, a software tSAS implementation may be used along with specialized network interface cards with TCP offload engines (TOEs) on board. NICs with integrated TOEs have hardware built into the card that allows the TCP/IP processing to be

32

X


SCSI Driver

tSAS CONTROLLER/

HBA

- TCP link- SAS link

tSAS

HBA

SCSI Driver

done at the interface. This prevents the TCP/IP processing from making it to the CPU freeing the system processor to spend its resources on other applications [23][24].

Figure 3.2.1.0 – TCP/IP Offload Engine [23] [24]

Hardware Based tSAS solution:

In a hardware-based tSAS environment, the initiator and target machines contain a host bus adapter (HBA) that is responsible for both TCP/IP and tSAS processing. This will free the CPU from both TCP/IP and tSAS functions. This dramatically increases performance in those settings where the CPU may be burdened with other tasks [23][24].

33

Figure 3.2.1.2 – tSAS implementations [25]

.

3.2.2 PrimitivesIn conventional SAS storage protocol, the SAS link layer uses a construct known as primitives. Primitives are special 8b/10b encoded characters that are used as frame delimiters, for out of band signaling, control sequencing, etc. Primitives were explained in section 2.3.3. These SAS primitives are defined to work in conjunction with the SAS physical layer [21].As far as primitives go, all ALIGN(s), OPEN REJECT(s), OPEN(s), CLOSE, DONE, BREAK, HARD RESET, NAKs, RRDY etc can simply be ignored on the tSAS protocol side since these are link layer primitive required only on the SAS side. For example, if an IO on the SAS side is timed out or fails due to NAKs or BREAKs, OPEN timeouts or OPEN REJECTs, the IO will simply timeout on the tSAS side to the tSAS Initiator. The primitives of interest include BROADCAST primitives, especially BROADCAST CHANGE primitive, as this primitive tells an initiator that the topology has changed

34

Software tSAS Software tSAS with TCP Offload

Hardware tSAS with TCP Offload

and to re-discover the topology using SMP commands. However, since, as discussed above, the SAS physical layer is unnecessary, an alternate means of conveying the SAS primitives is needed. In one embodiment, this can be accomplished by defining a SAS primitive to be encapsulated in an Ethernet frame [21].

The SAS Trace below in Figure 3.2.2.0 shows the primitives on a READ command exchanged on the wire between the initiator and target. The lower panel shows the primitives such as RRDys, ACKs, DONEs, CLOSEs etc exchanged during a READ command sequence. These primitives are not required in the tSAS protocol. Please refer to Appendix Section 8.0 for information on SAS trace capturing.

35

Figure 3.2.2.0 – Primitives on the SAS side [25]

36

3.2.3 Discovery

Discovery in tSAS will be similar to SAS and will be accomplished by sending Serial management protocol (SMP) commands over TCP to the initiators and expanders downstream to learn the topology.

The SMP Request frame will be embedded in an Ethernet frame and sent to the expander/initiator. The expander/initiator will reply to the SMP Request by sending a SMP Response frame embedded in an Ethernet frame. Figure 3.2.0.4 in section 3.2.0 shows how the SMP commands are communicated in tSAS. For more information on SMP commands and Discovery, please refer to the SAS Specification [1].

For example, on a Discover List command that is used to return information on an attached device/PHY, the SMP Discover List Command is sent by the initiator and the SMP Discover List is sent via TCP as the response.

37

Figure 3.2.3.0 – SMP Discover List Request Frame

38

Figure 3.2.3.1 – SMP Discover List Response Frame

Since Ethernet frames are assembled such that they include a cyclic redundancy check (CRC) for providing data protection, a SMP frame that is encapsulated in an Ethernet frame can rely on the data protection afforded by this same cyclic redundancy check [21]. In other words, the SAS side CRC on the request and response SMP frame need not be transmitted.

39

Please refer to the SAS Specification [1] for information on these SMP commands.

3.2.4 Task ManagementSimilar to SAS and iSCSI, a Task Frame will be sent by the initiator to another initiator/expander in the topology. A Task Management may be sent to manage a target. For instance, when IOs to a target fail, the host may request a Task Management Target Reset command to reset the target in the hope that the target is reset and cooperates after being reset. A host may request a Task Management LUN reset to reset an entire LUN and have all IOs to that LUN be failed.

To learn more about the various Task Commands, please refer to the SAS [1] and iSCSI specifications [3].

3.2.5 tSAS mock application to compare with an iSCSI mock applicationFor the purpose of investigating iSCSI vs tSAS, a client application and a server application that communicate using iSCSI and tSAS are written. A tSAS client application will send read/write tSAS commands to the tSAS server application which will process and send responses to the client. Similarly, an iSCSI client application will send certain read/write iSCSI commands to the iSCSI server application which will process and send responses to the client. Commands are sent single threaded such that the queue depth (number of outstanding commands) is one. The algorithm used for the tSAS application and the iSCSI application is similar helping us investigate the two protocols.

Initially, the tSAS application is written such that each REQUEST, RESPONSE and DATA frame is encapsulated into an independent Ethernet frame. Revisiting the SSP Format in Figure 2.3.40, the entire SSP Frame excluding the CRC is encapsulated into an Ethernet frame. Since Ethernet frames are assembled such that they include a cyclic redundancy check (CRC) for providing data protection, a SAS/SSP frame that is encapsulated in an Ethernet frame can rely on the data protection afforded by this same cyclic redundancy check [21].

In SAS, each data frame is 1K in length. In the initial design, each Ethernet frame that carries the data frame only carried 1K of DATA..

This causes the time to complete the IO to be significantly high compared to not limiting the amount of data to be sent in each frame to 1K. The performance was hence slow. The application was then revised to send more than 1K of data in each frame by maxing out the Data that can be stuffed into each Ethernet frame. This means that each Ethernet frame can contain more than just 1K of data. Below are the results from this implementation.

40

Below are the numbers of running the tSAS application when each REQUEST, RESPONSE and DATA frame is individually encapsulated into an Ethernet frame and sent across. The test bench used for this experiment is a Windows Server 2008 machine with the client and server application running such that the client sends requests and the server replies to requests. A Netgear Prosafe 5 port Gigabit switch model GS 105 is used in between such that the client and server auto negotiate to 1 Gbps.

1 Gbps:

READ

Transfer Length (KB) Average Time from Read Command to Completion (milliseconds) in tSAS where each DATA frame is encapsulated in an Ethernet frame

IOPSI/Os per second

1 KB 0.249 ms 4016.0642 KB 0.206 ms 4854.3684 KB 0.216 ms 4629.628 KB 0.368 ms 2717.39116 KB 0.495 ms 2020.2032 KB 0.28 ms 3571.428

64 KB 1.616 ms 618.811128 KB 2.711 ms 368.867256 KB 4.913 ms 203.541512 KB 5.954 ms 167.9541024 KB 7.681 ms 130.1912048 KB 16.111 ms 62.069

Table 3.2.5.0 - Average Time from Read Command to Completion (milliseconds) in tSAS where each DATA frame is encapsulated in an Ethernet frame

Below are the numbers of running the tSAS application when each REQUEST, RESPONSE and DATA frame is encapsulated into an Ethernet frame and sent across. However, in this implementation, the DATA is not limited to 1K in each Ethernet frame. DATA frames are combined to use each Ethernet frame to maximum capacity.

Transfer Length (KB) Average Time from Read Command to Completion (milliseconds) in tSAS where each Ethernet frame containing SSP Data is used efficiently

IOPSI/Os per second

1 KB 0.199 ms 5025.1252 KB 0.114 ms 8771.9294 KB 0.280 ms 3571.428

41

8 KB 0.258 ms 3875.96816 KB 0.174 ms 5747.12432 KB 0.455 ms 2197.80264 KB 0.828 ms 1207.729128 KB 1.418 ms 705.218256 KB 2.714 ms 368.459512 KB 3.000 ms 333.3331024 KB 3.756 ms 266.2402048 KB 7.854 ms 127.323

Table 3.2.5.1 - Average Time from Read Command to Completion (milliseconds) in tSAS where each Ethernet frame containing SSP Data is used efficiently

Below are the numbers of running the iSCSI application when each REQUEST, RESPONSE and DATA frame is encapsulated into an Ethernet frame and sent across. In this implementation, the DATA is not limited to 1K in each Ethernet frame. The iSCSI implementation itself doesn’t pack each SCSI DATA frame into a separate Ethernet frame. It allows DATA frames to be combined such that more than just a single DATA frame is sent. Therefore, in our implementation as well DATA frames are combined to use each Ethernet frame to maximum capacity.

Transfer Length (KB) Average Time from Read Command to Completion (milliseconds) in iSCSI where each Ethernet frame containing SCSI Data is Maxed out

IOPSI/Os per second

1 KB 0.189 ms 5291.0052 KB 0.261 ms 3831.4174 KB 0.205 ms 4878.0488 KB 0.501 ms 2996.00716 KB 0.327 ms 3058.10432 KB 0.454 ms 2202.64364 KB 0.898 ms 1113.585128 KB 1.421 ms 703.729256 KB 3.311 ms 302.023512 KB 3.138 ms 318.6741024 KB 4.955 ms 201.8162048 KB 8.942 ms 111.831

Table 3.2.5.2 - Average Time from Read Command to Completion (milliseconds) in iSCSI where each Ethernet frame containing SCSI Data is Maxed out

As can be seen from Tables 3.2.5.0, 3.2.5.1 and 3.2.5.2:

1. A tSAS implementation where each DATA frame is encapsulated in a separate Ethernet frame is not an efficient implementation

42

2. A tSAS implementation where more than just 1 K of DATA (a single DATA frame) is encapsulated in an Ethernet frame is more efficient. This is comparable to the iSCSI implementation in the market as well and the iSCSI client/server app written. Therefore, for the rest of this project we will go with this tSAS implementation.

3.3 Performance evaluation3.3.0 Measuring SAS performance using IOMeter in Windows and VDbench in Linux

3.3.0.1 SAS Performance using IOMeter

Iometer is an I/O subsystem measurement and characterization tool that can be used in both single and clustered systems [32]. Iometer is both a workload generator as it performs I/O operations in order to stress the system being tested, and a measurement tool as it examines and records the performance of its I/O operations and their impact on the system under test. It can either be configured to emulate a disk target or network I/O load of any program. It can also be used to generate entirely synthetic I/O loads. It can also generate and measure loads on single or multiple networked systems [32].

Iometer can be used to measure and characterize the: Performance of network controllers. Performance of disk controllers. Bandwidth and latency capabilities of various buses. Network throughput to attached drive targets. Shared bus performance. System-level performance of a hard drive . System-level performance of a network [32].

Iometer consists of two programs, namely, Iometer and Dynamo. Iometer is the name of the controlling program. Using the graphical user interface, a user can configure the workload, set the operating parameters, and start and stop tests. Iometer tells Dynamo what to do, collects the resulting data, and summarizes the results into output files. Only one copy of IOMeter should be running at a time. It is typically run on the server machine. Dynamo is the IO workload generator. It doesn’t come with a user interface. At the Iometer’s command, Dynamo performs I/O operations, records the performance information and finally returns the data to IOMeter [32].

In this project, IOMeter is used to measure performance of a SAS topology/drive.

The test bench used to measure SAS performance via the IOMeter is:

43

1. The Operating System used is Windows Server 2008. 2. The server used was a Super Micro server3. A SAS 6 Gbps HBA in a PCIe slot4. The HBA attached to the 6 Gbps SAS Expander5. The 6G SAS expander attached downstream to a 6G SAS drive.6. A LeCroy SAS Analyzer placed between the target and expander7. IOMeter was set to have a maximum number of outstanding IOs of 1. In other words,

the queue depth is set to 1. This makes IOs single-threaded. This option was used since the mock server and client iSCSI and tSAS applications also have a queue depth of 1.

8. For the maximum I/O rate (I/O operations per second), the Percent Read/Write Distribution was set to 100% Read while testing the read performance and was set to 100% write while testing the write performance. The Percent Random/Sequential Distribution was set to 100% Sequential while testing the read and write performance.

9. For measurements taken without an expander, the SAS drive was directly attached to the SAS analyzer and the SAS analyzer was attached to the HBA.

A SAS Protocol Analyzer can be used to capture SSP/STP/SATA traffic between various components in a SAS topology. For example, a SAS Protocol Analyzer can be placed between an Initiator and an Expander to capture the IO traffic between the Initiator and the Expander. Similarly, a SAS protocol analyzer may be placed between drives and an Expander helping the user to capture IO traffic between the drives and an Expander. A capture using the SAS Protocol Analyzer is commonly known as a SAS trace.

44

Figure 3.3.0.1.0 – SAS Trace using Le Croy SAS Protocol Analyzer

Timings on READ and WRITE commands with transfer sizes of 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and 2048K are captured. The first two tables in Table 3.3.0.1.0 and Table 3.3.0.1.1 capture performance of READs. Figure 3.3.0.1.0 shows READ performance when a SAS drive is direct-attached to the HBA. Figure 3.3.0.1.1 shows READ performance when a SAS drive is connected to a HBA via an expander.

Performance of READ10 command using Direct-Attached drive:

Transfer Length (KB)

Average Time from Read Command to Completion (milliseconds) using IOMeter – Direct-Attached

Average time from READ command completion on the SAS trace from the drive

Average IOPSI/Os per Second

1 KB 0.0644 ms 0.0365 ms 15527.9502 KB 0.0768 ms 0.0389 ms 13020.8334 KB 0.0800 ms 0.0563 ms 12500

45

8 KB 0.0916 ms 0.0508 ms 10917.03016 KB 0.112 ms 0.0675 ms 8928.57132 KB 0.219 ms 0.180 ms 4566.2164 KB 0.438 ms 0.376 ms 2283.105 128 KB 0.861 ms 0.788 ms 1161.440256 KB 1.706 ms 1.579 ms 586.166512 KB 3.409 ms 3.264 ms 293.3411024 KB 6.896 ms 6.693 ms 145.0112048 KB 30.972 ms 21.653 ms 46.182

Table 3.3.0.1.0 – Direct-Attached SSP READ performance

In Table 3.3.0.1.0, the average time for READ command to complete using IOMeter is the value calculated by IOMeter. The average time for READ command to complete using the SAS analyzer is the time it takes for the drive to respond to the command once the HBA sends the command. As can be seen, the drive is the bottle neck in this topology. The I/Os per second is not always a direct multiple of the average time for the IO completion due to delays at the HBA, hardware etc. However it is close enough and in this project we assume that the IOPS is 1000 ms/ (Average time in millisecond for 1 IO to complete).


Average Time from Read Command to Completion (milliseconds) using IOMeter - Expander Attached

Average time from READ command completion on the SAS trace from the drive

Average IOPSI/Os per second

Average time for READ Completion without including delay from the drive

1 KB 0.0649 ms 0.0365 ms 15408.320 0.0284 ms2 KB 0.0709 ms 0.0389 ms 14104.372 0.032 ms4 KB 0.0810 ms 0.0563 ms 12345.679 0.0247 ms8 KB 0.0840 ms 0.0508 ms 11904.761 0.0332 ms16 KB 0.113 ms 0.0675 ms 8849.557 0.0455 ms32 KB 0.225 ms 0.180 ms 4444.444 0.045 ms64 KB 0.416 ms 0.376 ms 2403.846 0.04 ms128 KB 0.872 ms 0.788 ms 1146.788 0.084 ms256 KB 1.716 ms 1.579 ms 582.750 0.137 ms512 KB 3.418 ms 3.264 ms 292.568 0.154 ms1024 KB 7.022 ms 6.693 ms 142.409 0.329 ms2048 KB 31.344 ms 21.653 ms 31.904 9.691 ms

Table 3.3.0.1.1 – Expander-Attached READ performance

46

As can be seen from the above tables, the performance numbers on READ commands of various transfer lengths when the SAS target is directly connected to the HBA or is behind an expander are very similar. In other words, the timing on the wire between the HBA and the expander is less than 1 millisecond for transfer sizes between 1K and 2048K. The HBA and the expander are generally designed such that h/w takes most of the heavy weight lighting when it comes to IO path/transfers.

Performance of WRITE10 command:

Timings on WRITE commands of sizes 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and 2048K are captured below. Table 3.3.0.1.2 shows READ performance when a SAS drive is direct-attached to the HBA. Table 3.3.0.1.3 shows READ performance when a SAS drive is connected to a HBA via an expander.

Transfer Length (KB) Average Time from Write Command to Completion (milliseconds) using IOMeter – Direct Attached

1 KB 6.014 ms2 KB 6.020 ms4 KB 6.030 ms8 KB 6.059 ms16 KB 6.111 ms32 KB 6.216 ms64 KB 6.424 ms128 KB 6.836 ms256 KB 7.672 ms512 KB 9.338 ms1024 KB 12.824 ms2048 KB 37.346 ms

Table 3.3.0.1.2– Direct-Attached WRITE performance


Average Time from Write Command to Completion (milliseconds) using IOMeter – Expander Attached

Average time from WRITE command completion on the SAS trace from the drive

IOPSI/Os per second

Average time for WRITE Completion without including delay from the drive

1 KB 6.012 ms 5.957 ms 166.334 0.055 ms2 KB 6.020 ms 5.964 ms 166.112 0.056 ms4 KB 6.032 ms 5.990 ms 165.782 0.042 ms8 KB 6.059 ms 6.011 ms 165.047 0.048 ms

47

16 KB 6.110 ms 6.054 ms 163.666 0.056 ms32 KB 6.215 ms 6.157 ms 160.901 0.058 ms64 KB 6.424 ms 6.378 ms 155.666 0.046 ms128 KB 6.839 ms 6.782 ms 146.220 0.057 ms256 KB 7.672 ms 7.573 ms 130.344 0.099 ms512 KB 9.337 ms 9.204 ms 107.101 0.133 ms1024 KB 12.665 ms 12.466 ms 78.957 0.199 ms2048 KB 37.345 ms 27.751 ms 26.777 9.74 ms

Table 3.3.0.1.3 – Expander-Attached WRITE performance

3.3.0.2 SAS Performance using VDBench in Linux

Vdbench is a disk and tape I/O workload generator that is used for testing and benchmarking of existing and future storage products. Vdbench generates a wide variety of controlled storage I/O workloads by allowing the user to set workload parameters such as allowing control over workload parameters such as I/O rate, transfer sizes, read and write percentages, and random or sequential workloads etc [37].

The test bench used to measure SAS performance via the VDBench is:

1. The Operating System used is Red Hat Enterprise Linux 5.4. 2. The server used was a Super Micro server3. A SAS 6 Gbps HBA in a PCIe slot4. The HBA attached to the 6 Gbps SAS Expander5. The 6G SAS expander attached downstream to a 6G SAS drive.6. A LeCroy SAS Analyzer placed between the target and expander7. VDBench was set to have a maximum number of outstanding IOs of 1. In other words,

the queue depth is set to 1. This makes IOs single-threaded. This option was used since the mock server and client iSCSI and tSAS applications also have a queue depth of 1.

8. For the maximum I/O rate (I/O operations per second), the Percent Read/Write distribution was set to 100% Read while testing the read performance and was set to 100% write while testing the write performance. The Percent Random/Sequential Distribution was set to 100% Sequential while testing the read and write performance. Please refer to Appendix in Section 8 to learn more about VDBench and the scripts used.

Performance of READ10 command using VDBench


Average Time from Read Command to Completion

Average time from READ command completion on the


Average time for READ Completion without including delay from

48

(milliseconds) using VDBench

SAS trace from the drive

the drive

1 KB 0.0780 ms 0.034 ms 12820.51 0.044 ms2 KB 0.0850 ms 0.039 ms 11764.705 0.046 ms4 KB 0.100 ms 0.056 ms 10000 0.044 ms8 KB 0.132 ms 0.085 ms 7575.75 0.047 ms16 KB 0.264 ms 0.215 ms 3787.87 0.049 ms32 KB 0.528 ms 0.476 ms 1893.94 0.052 ms64 KB 1.058 ms 0.998 ms 945.18 0.060 ms128 KB 2.117 ms 2.051 ms 472.366 0.066 ms256 KB 4.235 ms 4.162 ms 236.127 0.073 ms512 KB 8. 572 ms 8.367 ms 116.658 0.205 ms1024 KB 18.317 ms 12.576 ms 54.594 5.741 ms2048 KB 34.323 ms 20.983 ms 29.135 13.34 ms

Table 3.3.0.2.0 – SSP READ performance using VDBench

Performance of WRITE10 command using VDBench


Average Time from Write Command to Completion (milliseconds) using VDBench

Average time from Write command completion on the SAS trace from the drive


Average time for WRITE Completion without including delay from the drive

1 KB 6.006 ms 5.965 ms 166.500 0.041 ms2 KB 6.025 ms 5.978 ms 165.975 0.047 ms4 KB 6.058 ms 6.012 ms 165.070 0.046 ms8 KB 6.121 ms 6.068 ms 163.371 0.053 ms16 KB 6.252 ms 6.203 ms 159.95 0.049 ms32 KB 6.518 ms 5.638 ms 153.421 1.15 ms64 KB 4.086 ms 4.023 ms 244.738 0.063 ms128 KB 5.142 ms 5.078 ms 194.476 0.064 ms256 KB 7.290 ms 7.199 ms 137.174 0.091 ms512 KB 14.511 ms 11.895 ms 68.913 2.616 ms1024 KB 23.083 ms 19.738 ms 43.321 3.345 ms2048 KB 40.105 ms 35.078 ms 24.935 5.027 ms

Table 3.3.0.2.1 – SSP WRITE performance using VDBench

The following conclusions can be drawn from the tests above using IOMeter and VDBench:

1. Looking at the performance numbers above, one notices that the performance drops drastically for a 2048K Read/Write as compared to a 1024K Read/Write. After analyzing the SAS traces that were collected for the transfer sizes of 1K, 2K, 4K, 8K, 16K, 32K, 64K,

49

128K, 256K, 512K, 1024K and 2048K on Reads and Writes, and analyzing the SAS traces, one finds that up until 1024K transfer size requests, the HBA sends a command to the target requesting for all the data in a single IO. However, on the 2048K transfer sizes and higher, the HBA sends commands of varying transfer sizes to the target. In other words, a single IO doesn’t fetch the entire 2048K of data on a read and a single IO is not used to write 2048K of data to a drive. Multiple smaller transfer sizes IOs are used to read or write 2048K of data from or to the disk respectively causing the performance to suddenly drop. This most likely an optimization or limitation of the driver.

2. One also notices that performance on READs is better than the performance of WRITEs. This is obvious as the frame sizes on READs is lesser and the number of frame being transmitted and the number of handshakes that occur on a READ command is lesser than what occurs on a WRITE command. Also, it takes more time for a drive to write DATA than to read from Disk. The IOmeter user guide as well states “For the maximum I/O rate (I/O operations per second), try changing the Transfer Request Size to 512 bytes, the Percent Read/Write Distribution to 100% Read, and the Percent Random/Sequential Distribution to 100% Sequential.”

3. At smaller transfer sizes, the performance difference between each transfer size is not so apparent. However, at larger transfer sizes (above 256K etc), the performance and time for IOs to complete is more visibly lower and higher respectively.

4. The results obtained via VDBench are slightly poorer than the results obtained via IOMeter. A different SAS drive was used for both tests and therefore the SAS drive performance used during VDBench is poorer than the SAS drive performance used using IOMeter testing. Also, timings can vary as the OS and driver are different for Windows and Red hat Linux.

Note: The SAS Analyzer Traces, performance results, VDBench scripts etc are located in the SASAnalyzerTraces folder in the project folder where all the deliverables are located. Refer to section Appendix 8.

3.3.1 Measuring iSCSI performance using IOMeter in Windows

The following measurements are taken on the following test bench:

1. An iSCSI software Initiator running on a windows system. The Starwind iSCSI Initiator was used as the iSCSI Initiator. Please refer to Appendix in Section 8 to learn more about the StarWind iSCSI Initiator.

2. An iSCSI software Target emulated on a windows system. The KernSafe iSCSI Target was used to create an iSCSI target and talk to it. Please refer to Appendix in Section 8 to learn more about the KernSafe iSCSI Target.

50

3. The iSCSI target was created using a SCSI USB flash drive4. The iSCSI Initiator and iSCSI Target system are connected to each other via a NetGear

Pro Safe Gigabit Switch at a connection rate of 1Gbps.5. READs/WRITEs of transfer lengths/sizes 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K,

1024K, 2048K are issued by the iSCSI Initiator6. A WireShark analyzer is also running on the Initiator system to view the data passed

between the iSCSI Initiator and iSCSI Target. Please refer to Appendix in Section 8 to learn more about the WireShark Network Protocol Analyzer.

7. IOMeter is used to view the performance of these transfer sizes8. The number of outstanding IOs (queue Depth) is set to 1 in the IOMeter. 9. On each READ, the test is set to 100% sequential READS. On each WRITE, the test is set

to 100% sequential WRITEs.

1Gbps

Read

Table shows the iSCSI Read Completion timings.


Average Time from Read Command to Completion (milliseconds) using IOMeter with iSCSI Device

IOPS I/Os per second

1 KB 1.208 ms 827.8102 KB 1.423 ms 702.7404 KB 2.377 ms 845.3088 KB 2.252 ms 444.04916 KB 3.251 ms 307.59732 KB 4.550 ms 219.78064 KB 5.683 ms 175.963 128 KB 14.640 ms 68.306256 KB 28.505 ms 35.081512 KB 164.172 ms 6.0911024 KB 415.445 ms 2.4072048 KB 913.563 ms 1.094

Table 3.3.1.0 – iSCSI Read Completion Timings at 1 Gbps

Write

Table shows the iSCSI Write Completion timings.


Average Time from Read Command to Completion (milliseconds) using IOMeter with iSCSI Device

MBs/Sec

51


Table 3.3.1.1 – iSCSI Write Completion Timings at 1 Gbps

The above timings include the delay at the USB flash drive. Since USB flash drives are slow, we then ran IOMeter on the machine connected to the USB to get the read/write timings when IOs are issued to the SCSI drive directly.

The following measurements are taken on the following test bench:

1. A virtual SCSI target (USB flash drive) was used as the SCSI target.2. READs/WRITEs of transfer lengths/sizes 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K,

1024K, 2048K are issued by the iSCSI Initiator3. IOMeter is used to benchmark the SCSI device

READ

Table shows the SCSI Read Completion timings.


Average Time from Read Command to Completio (milliseconds) using IOMeter with SCSI Device

IOPSI/Os per second

1 KB 0.919 ms 1088.1392 KB 1.073 ms 931.9664 KB 1.194 ms 837.5208 KB 1.453 ms 688.23116 KB 1.984 ms 504.03232 KB 3.448 ms 290.02364 KB 4.455 ms 224.466128 KB 7.044 ms 141.964256 KB 13.205 ms 75.728512 KB 25.885 ms 38.6321024 KB 51.234 ms 19.518

52

2048 KB 102.571 ms 9.749

Table 3.3.1.2 – SCSI Read Completion Timings

WRITE

Table shows the SCSI Write Completion timings.


Average Time from WriteCommand to Completion (milliseconds) using IOMeter with SCSI Device

MBs/Sec


Table 3.3.1.3 – SCSI Write Completion Timings

To get the performance of iSCSI without including the time it takes for IOs to complete from the SCSI target itself (the bottleneck), the iSCSI performance timings via IOMeter is subtracted by the SCSI Performance timings via IOMeter. These results then make it more feasible for us to compare the iSCSI numbers here to the mock client/server iSCSI application written for this project.

Read

Table shows the iSCSI Read Completion timings without including the time it takes for IOs to complete from the SCSI target itself.


Average Time from Read Command to Completion (milliseconds) without including the time it takesFor IOS to complete from the SCSI target

1 KB 1.208 – 0.919 = 0.289 ms2 KB 0.35 ms

53

4 KB 0.585 ms8 KB 0.799 ms16 KB 1.267 ms32 KB 1.102 ms64 KB 1.228 ms128 KB 7.596 ms256 KB 15.3 ms512 KB 138.287 ms1024 KB 264.211 ms2048 KB 810.992 ms

Table 3.3.1.4 – iSCSI Read Completion Timings without including delay at the drive

Write

Table shows the iSCSI Write Completion timings without including the time it takes for IOs to complete from the SCSI target itself.


Average Time from Read Command to Completion (milliseconds) without including the time it takesFor IOS to complete from the SCSI target

1 KB 0.398 ms2 KB 0.815 ms4 KB 1.013 ms8 KB 0.818 ms16 KB 1.34 ms32 KB 2.028 ms64 KB 4.262 ms128 KB 9.987 ms256 KB 18.287 ms512 KB 163.677 ms1024 KB 473.518 ms2048 KB 814.723 ms

Table 3.3.1.5 – iSCSI Write Completion Timings without including delay at the drive

Note: The IOMeter data collected as well as any WireShark Traces are located in the in the project where all the deliverables are located. Refer to Section Appendix 8.

54

3.3.2 Measuring tSAS performance using the client and server mock application written and comparing it to the iSCSI client/server mock application as well as to legacy SAS and legacy iSCSI

A. The tSAS performance was measured by running the client/server application written. The test bench used to test the tSAS applications is two Windows 208 Server systems connected using a netgear switch with a connection rate of 10 Mbps, 100 Mbps and 1 Gbps. One windows machine runs the client application while the other runs the server application.

10 Mbps:

READ:

Transfer Length (KB) Average Time from Read Command to Completion (milliseconds) using mock application in iSCSI

IOPSI/Os per second

1 KB 2.786 ms 358.9372 KB 5.968 ms 167.5604 KB 7.541 ms 132.6088 KB 11.002 ms 90.89216 KB 18.258 ms 54.77032 KB 175.630 ms 5.69364 KB 197.788 ms 5.055128 KB 255.342 ms 3.916256 KB 601.288 ms 1.663512 KB 741.555 ms 1.3481024 KB 2228.483 ms (~2.228 sec) 0.4482048 KB 3863.979 ms (~3.863 sec) 0.259

Table 3.3.2.0 – READ Command Timings iSCSi Mock app at 10 Mbps


Average Time from Read Command to Completion (milliseconds) using mock application in tSAS

IOPSI/Os per second

1 KB 2.543ms 393.2362 KB 5.933 ms 168.5484 KB 6.896 ms 145.0118 KB 10.902 ms 91.72616 KB 18.152 ms 55.090

55

32 KB 153.126 ms 6.53064 KB 192.224 ms 5.202128 KB 192.103 ms 5.205256 KB 576.096 ms 1.736512 KB 996.854 ms 1.0031024 KB 1614.082 ms (~1.614 sec) 0.6192048 KB 3615. 275 ms (~3.615 sec) 0.276

Table 3.3.2.1 – READ Command Timings tSAS Mock app at 10 Mbps

Figure 3.3.2.0– iSCSI vs tSAS Read Completion Time at 10 Mbps. [X Axis =

Transfer Size in kilo bytes, Y Axis = Completion time in milliseconds]

Looking at the chart above, tSAS performs better than iSCSI. One also observes that at small READ transfers, iSCSI and tSAS have a more similar performance at 10 Mbps. However, at larger transfer sizes, tSAS performs more visibly better than iSCSI.

WRITE

Transfer Length (KB) Average Time from Write Command to Completion (milliseconds) using mock application in iSCSI

IOPSI/Os per second

1 KB 13.968 ms 71.5922 KB 14.909 ms 67.0734 KB 16.867 ms 59.2878 KB 20.078 ms 49.80516 KB 27.365 ms 36.54332 KB 505.044 ms 1.98064 KB 710.429 ms 1.407128 KB 1572.559 ms (~1.572 sec) 0.636256 KB 3380.042 ms (~3.380 sec) 0.256512 KB 6886.112 ms (~6.886 sec) 0.1451024 KB 1431.612 ms (~14.316 sec) 0.6982048 KB 1977.700 ms (~19.777 sec) 5.056

56

Table 3.3.2.2 – WRITE Command Timings iSCSi Mock app at 10 Mbps


Average Time from Write Command to Completion (milliseconds) using mock application in tSAS

IOPSI/Os per second

1 KB 4.892 ms 204.4152 KB 5.849 ms 170.9694 KB 8.054 ms 91.0898 KB 10.979 ms 91.08316 KB 17.984 ms 55.60532 KB 233.573 ms 4.28164 KB 614.819 ms 1.626128 KB 1584.924 ms (~1.584 sec) 0.631256 KB 3540.684 (~3.540 sec) 0.282512 KB 6684.609 (~6.684 sec) 0.1491024 KB 1245.677 ms (~12.456 sec) 0.8032048 KB 1772.838 ms (~17.728 sec) 0.564

Table 3.3.2.3 – WRITE Command Timings tSAS Mock app at 10 Mbps

57

0 500 1000 1500 2000 25000

1000

2000

3000

4000

5000

6000

7000

8000

tSAS vs iSCSI Write 10Mbps

tSAS Write 10MBpsiSCSI Write 10MBps

Figure 3.3.2.1– iSCSI vs tSAS Write Completion Time at 10 Mbps. [X Axis =


Looking at the chart above, tSAS performs better than iSCSI. One also observes that at small READ transfers, iSCSI and tSAS have a more similar performance. However, at larger transfer sizes, tSAS performs more visibly better than iSCSI.

100 Mbps


Average Time from Read Command to Completion (milliseconds) using mock application in iSCSI

IOPSI/Os per second

1 KB 1.996ms 501.0022 KB 2.692 ms 371.4714 KB 2.579 ms 387.7478 KB 3.093 ms 323.31016 KB 3.802 ms 263.09232 KB 15.001 ms 66.66264 KB 17.193 ms 58.163

58

128 KB 35.913 ms 27.845256 KB 82.172 ms 12.169512 KB 115.905 ms 8.6271024 KB 311.735 ms 3.2082048 KB 577.684 ms 1.731

Table 3.3.2.4 – READ Command Timings iSCSI Mock app at 100 Mbps



IOPSI/Os per second


Table 3.3.2.5 – READ Command Timings tSAS Mock app at 100 Mbps

59

0 500 1000 1500 2000 25000

100

200

300

400

500

600

700

tSAS vs iSCSI Read 100 Mbps

tSAS Read 100 MbpsiSCSI Read 100 Mbps



Looking at the chart above in Figure 3.3.2.2, tSAS performs better than iSCSI for all transfer sizes captured.


Average Time from Write Command to Completion (milliseconds) using mock application in iSCSI

IOPSI/Os per second

1 KB 14.559 ms 68.6862 KB 15.203 ms 65.7764 KB 14.716 ms 67.9538 KB 15.030 ms 66.53316 KB 22..011 ms 45.43132 KB 25.735 ms 38.85764 KB 55.918 ms 17.883128 KB 110.481 ms 9.051256 KB 193.932 ms 5.156512 KB 272.651 ms 3.6671024 KB 350.924 ms 2.8492048 KB 772.876 ms 1.294

Table 3.3.2.6 – WRITE Command Timings iSCSI Mock app at 100 Mbps

60



IOPSI/Os per second


Table 3.3.2.7 – WRITE Command Timings tSAS Mock app at 100 Mbps

0 500 1000 1500 2000 25000

100

200

300

400

500

600

700

800

900

tSAS vs iSCSi Write 100 Mbps

tSAS Write 100 MbpsiSCSI Write 100 Mbps



Looking at the chart above, tSAS performs better than iSCSI overall.

1 Gbps:61


Average Time from Read Command to Completion (milliseconds) using mock application in iSCSI

IOPSI/Os per second

1KB 1.999 ms 500.2502KB 1.231 ms 812.3474KB 1.227 ms 814.9958KB 1.436 ms 696.37816KB 1.338 ms 747.38432KB 1.795 ms 557.10364KB 2.401 ms 416.493128KB 4.264 ms 234.521256KB 7.072 ms 141.402512KB 12.395 ms 80.6771024KB 24.880 ms 40.1932048KB 44.383 ms 22.531

Table 3.3.2.8 – READ Command Timings iSCSI Mock app at 1000 Mbps (1 Gbps)



IOPSI/OS per second


Table 3.3.2.9 – READ Command Timings tSAS Mock app at 1000 Mbps (1 Gbps)

62

0 500 1000 1500 2000 25000

5

10

15

20

25

30

35

40

45

50

tSAS vs iSCSI Read 1 Gbps

tSAS Read 1 GbpsiSCSi Read 1 Gbps



Looking at the chart above, tSAS performs better than iSCSI. At smaller transfer sizes tSAS and iSCSI perform more similar while at larger transfer sizes tSAS performs visibly faster than iSCSI.


Average Time from Write Command to Completion (milliseconds) using mock application in iSCSI

IOPSI/Os per second


Table 3.3.2.10 – WRITE Command Timings iSCSI Mock app at 1000 Mbps (1 Gbps)

63



IOPsI/Os per s econd


Table 3.3.2.11 – WRITE Command Timings tSAS Mock app at 1000 Mbps (1 Gbps)

0 500 1000 1500 2000 25000

10

20

30

40

50

60

70

80

tSAS vs iSCSI Write 1 Gbps

tSAS Write 1 GbpsiSCSI Write 1 Gbps



64

tSAS performs visibly better than iSCSI on both small and larger transfer sizes for Writes as 1 Gbps.

From the data collected on the tSAS mock application and the iSCSI mock application, the following conclusions can be drawn:

1. tSAS performs better than iSCSI overall at all transfer sizes regardless of the speed of the connection between the initiator and the target. The reason for this can be easily attributed to the fact that the REQUEST, TRANSFER READY (Xfer Rdy for SAS) and RESPONSE frame sizes are smaller in tSAS vs the REQUEST, TRANSGFER READY (R2T) and RESPONSE frame sizes in iSCSI

2. At smaller speeds, the performance of iSCSI and tSAS is very comparable with tSAS performing slightly better than iSCSI. However, as transfer sizes get larger, tSAS performs more visibly better than iSCSI.

3. Overall, WRITE performance is poorer than READ performance in tSAS and iSCSI. This can be attributed to the fact that handshaking is more for WRITEs than READs. On WRITEs, the initiator needs to wait for the transfer ready (Xfer_Rdy or R2T) frame before sending data.

4. For better performance, it may be best to use smaller transfer sizes since at larger transfer sizes the error rate and retransmission of packets on TCP is higher looking at the wireshark traces collected.

B. Next we will look at how tSAS performs at different connections speeds for a fixed transfer size at each connection rate.

The below graph in Figure 3.3.2.6 compares tSAS READ performance at varying connection speeds for a 16K Transfer size.

65

0 200 400 600 800 1000 12000

1

2

3

4

5

6

7

Time for READ Completion for a transfer size of 2K at 10Mbps, 100Mbps and 1Gbps

Time for READ Completion

Figure 3.3.2.6–tSAS READ Completion Time for a transfer size of 2K at 10 Mbps, 100 Mbps and 1 Gbps. [X Axis = Transfer Size in kilo bytes, Y Axis = Completion time in milliseconds]


66

0 200 400 600 800 1000 12000

2

4

6

8

10

12

14

16

18

20

Time for READ Compeltion with transfer size of 16K at 10 Mbps, 100 Mbps and 1 Gbps

Time for READ Compeltion



67

0 200 400 600 800 1000 12000

200

400

600

800

1000

1200

Time for READ Completion with transfer size of 512K at 10 Mbps, 100Mbps and 1Gbps

Time for READ Compeltion


As can be seen for the graphs above, performance drastically improves from 10 Mbps to 1 Gbps.

C. With 40 Gbps and 100 Gbps soon to be available [36], tSAS performance should outperform SAS. From a performance analysis done by Netapp on 1 Gbps and 10 Gbps Ethernet server scalability [35], one can infer that at 10 Gbps can perform 4.834 times better than 1 Gbps. Extrapolating this, we can assume that 100 Gbps will perform 48.34 times better than 1 Gbps. These numbers can be used as guidance for understanding the performance of our tSAS numbers at 10 G and higher connection rates. Below are extrapolated values for tSAS at 10 Gbps and 100 Gbps. These values are compared to the values we obtained using legacy SAS and legacy iSCSI.

10/40/100 Gbps tSAS Read Completion:


Average Time from Read Command to Completion (milliseconds) using mock application in tSAS (100 Gbps)48.34 times better than values in

Average Time from Read Command to Completion (milliseconds) using mock application in tSAS (10 Gbps)4.834 times better than values in

68

Table 3.3.2.9 Table 3.3.2.91 KB 0.041 ms 0.408 ms2 KB 0.0311 ms 0.312 ms4 KB 0.035 ms 0.350 ms8 KB 0.0258 ms 0.258 ms16 KB 0.0206 ms 0.257 ms32 KB 0.0353 ms 0.353 ms64 KB 0.054 ms 0.543 ms128 KB 0.0924 ms 0.924 ms256 KB 0.160 ms 1.604 ms512 KB 0.279 ms 2.700 ms1024 KB 0.489 ms 4.899 ms2048 KB 0.840 ms 8.404 ms

Table 3.3.2.12 – READ Command Timings tSAS Mock app at 10/40/100 Gbps (Extrapolated from the values at 1 Gbps from Table 3.3.2.9)

10/40/100 Gbps tSAS Write Completion:


Average Time from Write Command to Completion (milliseconds) using mock application in tSAS (100 Gbps)48.34 times better than values in Table 3.3.2.11

Average Time from Write Command to Completion (milliseconds) using mock application in tSAS (10 Gbps) 4.834 times better than values in Table 3.3.2.11

1 KB 0.028 ms 0.278 ms2 KB 0.029 ms 0.290 ms4 KB 0.031 ms 0.309 ms8 KB 0.036 ms 0.366 ms16 KB 0.031 ms 0.315 ms32 KB 0.050 ms 0.500 ms64 KB 0.055 ms 0.549 ms128 KB 0.077 ms 0.776 ms256 KB 0.142 ms 1.423 ms512 KB 0.293 ms 2.903 ms1024 KB 0.569 ms 5.696 ms2048 KB 0.956 ms 9.569 ms

Table 3.3.2.13 – WRITE Command Timings tSAS Mock app at 10/40/100 Gbps (Extrapolated from the values at 1 Gbps from Table 3.3.2.9)

Comparing tSAS results at 100 Gbps to legacy SAS without delay at the SAS drive:

69

One way to compare our tSAS mock implementation to legacy SAS is to compare it to legacy SAS performance without the delay at the drive. Figure 3.3.2.9 and Figure 3.3.2.10 do just that. We notice from these figures that our tSAS mock application performs similar to legacy SAS. However, at a transfer size of 2048K, the tSAS application out performs legacy SAS clearly. This again is due to the fact that at 2048K in legacy SAS, the Read/Write command is slit into several smaller transfers causing the performance to drop at 2048K transfer sizes. We already discussed this in Section 3.3.0.2

0 500 1000 1500 2000 25000

2

4

6

8

10

12

14

16

Comparing tSAS at 100 Gbps to legacy SAS Per-formance Results without Delay at the Drive

Read extrapolated tSAS 100 GbpsLegacy SAS(SSP) Read Timings using IOMeter without Delay at the driveLegacy SAS(SSP) Read Timings using VDBench without Delay at the Drive

Figure 3.3.2.9–Comparing tSAS Read at 100 Gbps to legacy SAS Performance Results without Delay at the Drive. [X Axis = Transfer Size in kilo bytes, Y Axis = Completion time in milliseconds]

70

0 500 1000 1500 2000 25000

2

4

6

8

10

12

tSAS 100 Gbps Performance vs Legacy SSP Write tim-ings without delay at the drive

Write extrapolated tSAS 100 GbpsLegacy SAS(SSP) Write Timings using IOMeter without Delay at the driveLegacy SAS(SSP) Write Timings using VDBench without Delay at the Drive

Figure 3.3.2.10–Comparing tSAS Write at 100 Gbps to legacy SAS Performance Results without Delay at the Drive. [X Axis = Transfer Size in kilo bytes, Y Axis = Completion time in milliseconds]

Comparing tSAS results at 100 Gbps to legacy SAS by looking at performance numbers between the HBA and the expander:

As mentioned in section 3.3.0.1 and looking at Tables 3.3.0.1.0 and 3.3.0.1.1, the delay between the HBA and expander is in the order of microseconds (less than a millisecond for all transfer sizes between 1K to 2048K). Comparing this to our tSAS mock application performance, we can easily see that tSAS performance is much slower than legacy SAS between a HBA and an expander. Since we can use tSAS between a HBA and an expander, this is a valid comparison of tSAS to legacy SAS. However, without having a solution where tSAS is implemented in hardware by using a tSAS HBA, it may not be fair to compare our tSAS results at 100 Gbps to legacy SAS between the HBA and the expander.

Comparing tSAS results at 100 Gbps to legacy iSCSI without delay at the SCSI drive:

iSCSI Read/Write performance results at 1 Gbps without the delay at the drive from Table 3.3.1.4 and Table 3.3.1.5 are used to calculate the performance of legacy iSCSI at 100 Gbps. The values in Table 3.3.1.4 and 3.3.1.5 Table are simply divided by 48.348 since we are assuming

71

that 100 Gbps connection rate performs 43.348 times better than 1 Gbps connection rates. Figure 3.3.2.11 and Figure 3.3.2.12 compare tSAS at 100 Gbps to legacy iSCSI at 100 Gbps.

0 500 1000 1500 2000 25000

2

4

6

8

10

12

14

16

18

Read performance tSAS at 100 Gbps vs legacy iSCSI at 100 Gbps using IOMeter without delay

at the drive

Read performance extrapo-lated tSAS 100 GbpsLegacy iSCSI Read timings using IOMeter without de-lay at the drive extrapo-lated 100 Gbps

Figure 3.3.2.11– Read performance tSAS at 100 Gbps vs legacy iSCSI at 100 Gbps using IOMeter without delay at the drive [X Axis = Transfer Size in kilo bytes, Y Axis = Completion time in milliseconds]

0 500 1000 1500 2000 250002468

1012141618

Write performance tSAS at 100 Gbps vs legacy iSCSI at 100 Gbps using IOMeter without delay

at the drive

Write performance ex-trapolated tSAS 100 GbpsLegacy iSCSI Write timings using IOMeter without de-lay at the drive extrapo-lated 100 Gbps

72

Figure 3.3.2.12– Write performance tSAS at 100 Gbps vs legacy iSCSI at 100 Gbps using IOMeter without delay at the drive [X Axis = Transfer Size in kilo bytes, Y Axis = Completion time in milliseconds]

The above Figures show that tSAS at 100 Gbps outperforms legacy iSCSI at 100 Gbps.

However, it is not completely fair to compare the tSAS numbers with the iSCSI numbers we got using the StarWind iSCSI Initiator and KernSafe iSCSI target (Table 3.3.1.4 and 3.3.1.5). tSAS outperforms the iSCSI performance numbers we got using legacy iSCSI. However, our tSAS implementation is not a full implementation of the tSAS Software Initiator or Target. Therefore, it is best to stick with the comparison of tSAS with the iSCSI mock application itself.

Note: The WireShark Traces are located in the project folder where all the deliverables are located. Refer to Appendix 8.4.

4.0 Similar Work

1. Michael Ko’s patent on Serial Attached SCSI over Ethernet proposes a very similar solution to the tSAS solution provided in this project.

2. iSCSI specification (SCSI over TCP) itself is similar to a tSAS solution (SAS over TCP). The iSCSI solution can be heavily leveraged for a tSAS solution.

3. The Fibre Channel over TCP/IP specification also can be leveraged to design and implement a tSAS solution [31].

5.0 Future Direction1. The tSAS mock application can be run using a faster switch with connection rate

of 10 Gbps to get more data points2. The tSAS mock application can be designed such that it uses piggy backing where

the SSP Read Response frame from the target is piggy backed with the last DATA frame sent by the target. This may slightly improve READ performance.

3. Jumbo frames can be used to increase the amount of DATA that is passed from the initiator and target per Ethernet packet improving the performance results.

4. Using an existing Generation 3 SAS HBA and expanders that have an Ethernet Port, read/write commands can be implemented on an expander and the HBA such that they are sent via TCP. This can be used to benchmark and see the feasibility further of tSAS. An embedded TCP./IP stack such as lwIP can be used to implement this [33].

73

5. The Storage Associations can be motivated with the results of this project to work on a tSAS specification

6.0 Conclusion (Lessons learned)Overall, tSAS is a viable solution. tSAS will be faster than a similar iSCSI implementation due to the frame sizes in tSAS being smaller than frame sizes in iSCSI. Also, in a tSAS topology the back-end will always be a legacy SAS drive as opposed to iSCSI where the back-end may be a SCSI drive which is much slower than a SAS drive.

At smaller transfer sizes, the performance of a tSAS and iSCSI solution may be very similar with tSAS performing slightly better than iSCSI. However, at larger transfer sizes, tSAS should be a better solution improving the overall performance of a storage system.

For tSAS to outperform a typical SAS solution today, a HBA solution of tSAS should be used to increase performance. A software solution of tSAS may not be a good choice if the aim is to beat the performance of legacy SAS. However, with 40G/100G Ethernet in the horizon [36], a software solution of tSAS can provide both performance and prove to be a cheaper solution. tSAS can also make use of jumbo frames to increase performance.

From a pure interest of overcoming the distance limitation of legacy SAS, tSAS is an excellent solution since it sends SAS packets over TCP.

7.0 References

[1] T10/1760-D Information Technology – Serial Attached SCSI – 2 (SAS-2), T10, 18 April 2009, Available from http://www.t10.org/drafts.htm#SCSI3_SAS

[2] Harry Mason, Serial attached SCSI Establishes its Position in the Enterprise, LSI Corporation, available from http://www.scsita.org/aboutscsi/sas/6GbpsSAS.pdf

[3] http://www.scsilibrary.com/

[4] http://www.scsifaq.org/scsifaq.html

[5] Kenneth Y. Yun ; David L. Dill; A High-Performance Asynchronous SCSI Controller, available from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=528789

74

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=528789

http://www.scsifaq.org/scsifaq.html

http://www.scsilibrary.com/

http://www.scsita.org/aboutscsi/sas/6GbpsSAS.pdf

http://www.t10.org/drafts.htm#SCSI3_SAS

[6] http://www.t10.org/scsi-3.htm

[7] Sarah Summers, Secure asymmetric iScsi system for online storage, 2008, University of Colorado, Colorado Springs, available from http://www.cs.uccs.edu/~gsc/pub/master/sasummer/doc/

[8] SCSI Architecture Model - 5 (SAM-5), Revision 21, T10, 2011/05/12, available from http://www.t10.org/members/w_sam5.htm

[9] SCSI Primary Commands - 4 (SPC-4), Revision 31, T10, 2011/06/13, available from http://www.t10.org/members/w_spc4.htm

[10] Marc Farley, Storage Networking Fundamentals: An Introduction to Storage Devices, Subsystems, Applications,Management, and File Systems, Cisco Press, 2005, ISBN 1-58705- 1621

[11] Huseyin Simitci; Chris Malakapalli; Vamsi Gunturu; Evaluation of SCSI Over TCP/IP and SCSI Over Fibre Channel Connections, XIOtech Corporation, available from http://www.computer.org/portal/web/csdl/abs/proceedings/hoti/2001/1357/00/13570087abs.htm

[12] Harry Mason, SCSI, the Industry Workhorse, Is Still Working Hard, Dec 2000, SCSI Trade Association available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=889098&tag=1

[13] Mark S. Kolich, Basics of SCSI: Firmware Applications and Beyond, Computer Science Department, Loyola Marymount University, Los Angeles, available from http://mark.koli.ch/2008/10/25/CMSI499_MarkKolich_SCSIPaper.pdf

[14] Prasenjit Sarkar; Kaladhar Voruganti, IP Storage: The Challenge Ahead, IBM Almaden Research Center, San Jose, CA, available from http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8984

[15] Prasenjit Sarkar; Sandeep Uttamchandani; Kaladhar Voruganti, Storage over IP: When Does Hardware Support help?, 2003 IBM Almaden Research Center, San Jose, California available from http://dl.acm.org/citation.cfm?id=1090723

[16] A. Benner, "Fibre Channel: Gigabit Communications and I/O for Computer Networks", McGraw-Hill, 1996.

75

http://dl.acm.org/citation.cfm?id=1090723

http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8984

http://mark.koli.ch/2008/10/25/CMSI499_MarkKolich_SCSIPaper.pdf

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=889098&tag=1

http://www.computer.org/portal/web/csdl/abs/proceedings/hoti/2001/1357/00/13570087abs.htm

http://www.t10.org/members/w_spc4.htm

http://www.t10.org/members/w_sam5.htm

http://www.t10.org/cgi-bin/ac.pl?t=f&f=sam5r07.pdf

http://www.cs.uccs.edu/~gsc/pub/master/sasummer/doc/

http://www.t10.org/scsi-3.htm

[17] Infiniband Trade Association available from http://www.infinibandta.org

[18] K.Voruganti; P. Sarkar, An Analysis of Three Gigabit Networking Protocols for Storage Area Networks’, 20th IEEE International Performance, Computing, and Communications Conference”, April 2001, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=918661&tag=1

[19] Kalmath Meth; Julian Satran, Features of the iSCSI Protocol, August 2003, IBM Haifa Research Lab available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1222720

[20] Yingping Lu; David H. C. Du, Performance Study of iSCSI-Based Storage Subsystems, IEEE Communications Magazine, August 2003, pp 76-82.

[21] Integration Scenarios for iSCSI and Fibre Channel, available from http://www.snia.org/forums/ipsf/programs/about/isci/iSCSI_FC_Integration_IPS.pdf

[22] Irina Gerasimov; Alexey Zhuravlev; Mikhail Pershin; Dennis V. Gerasimov, Design and Implementation of a Block Storage Multi-Protocol Converter, Proceedings of the 20th IEEE/11th NASA Goddard Conference of Mass Storage Systems and Technologies (MSS‟03) available from http://storageconference.org/2003/papers/26-Gerasimov-Design.pdf

[23] Internet Small Computer Systems Interface (iSCSI), http://www.ietf.org/rfc/rfc3720.txt

[24] Yingping Lu; David H. C. Du, Performance Study of iSCSI-Based Storage Subsystems, University of Minnesota, Aug 2003, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1222721

[25] Cai, Y.; Fang, L.; Ratemo, R.; Liu, J.; Gross, K.; Kozma, M.; A test case for 3Gbps serial attached SCSI (SAS) Test Conference, 2005. Proceedings. ITC 2005. IEEE International, February 2006, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1584027

[26] Rob Eliot, Serial Attached SCSI, HP Industry Standard Servers, Server Storage Advanced Technology , 30 September 2003 available from http://www.scsita.org/sas_library/tutorials/SAS_General_overview_public.pdf

[27] Michael A. Ko, LAYERING SERIAL ATTACHED SMALL COMPUTER SYSTEM INTERFACE (SAS) OVER ETHERNET, United States Patent Application 20080228897, 09/18/2008

76

http://www.scsita.org/sas_library/tutorials/SAS_General_overview_public.pdf

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1584027


http://www.ietf.org/rfc/rfc3720.txt

http://storageconference.org/2003/papers/26-Gerasimov-Design.pdf

http://www.snia.org/forums/ipsf/programs/about/isci/iSCSI_FC_Integration_IPS.pdf


http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=918661&tag=1

http://www.infinibandta.org/

available from http://www.faqs.org/patents/app/20080228897

[28] Mathew R. Murphy, iSCSI-based Storage Area Networks for Disaster Recovery Operations, The Florida State University, College of engineering, 2005, available from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.8245

[29] “Increase Performance of Network-Intensive Applications with TCP/IP Offload Engines (TOEs),” Adaptec, Inc. White Paper, May 2003 available from http://www.probsolvesolutions.co.uk/solutions/white_papers/adaptec/NAC_TechPaper2.pdf

[30] IEEE P802.3ba 40Gb/s and 100Gb/s Ethernet Task Force available from http://www.ieee802.org/3/ba/

[31] M. Rajagopal; E. Rodriguez; R. Weber; Fibre Channel Over TCP/IP, Network Working Group, July 2004, available from http://rsync.tools.ietf.org/html/rfc3821

[32] IOMeter Users Guide, Version 2003.12.16 available from http://www.iometer.org/doc/documents.html

[33] The lwIP TCP/IP stack, available from http://www.sics.se/~adam/lwip/

[34] 29West Messaging Performance on 10-Gigabit Ethernet, September 2008, available from http://www.cisco.com/web/strategy/docs/finance/29wMsgPerformOn10gigtEthernet.pdf

[35] 1Gbps and 10Gbps Ethernet Server Scalability, NetApp, available from http://partners.netapp.com/go/techontap/matl/downloads/redhat- neterion_10g.pdf

[36] John D. Ambrosia, 40 gigabit Ethernet and 100 Gigabit Ethernet: The development of a flexible architecture available from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4804384

[37] Henk Vandenbergh, VDBench Users Guide, Version 5.00, October 2008 available from http://iweb.dl.sourceforge.net/project/vdbench/vdbench/Vdbench%205.00/vdbench.pdf

77

http://iweb.dl.sourceforge.net/project/vdbench/vdbench/Vdbench%205.00/vdbench.pdf

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4804384

http://partners.netapp.com/go/techontap/matl/downloads/redhat-%20%20%20%20%20%20%20%20neterion_10g.pdf

http://partners.netapp.com/go/techontap/matl/downloads/redhat-%20%20%20%20%20%20%20%20neterion_10g.pdf

http://www.cisco.com/web/strategy/docs/finance/29wMsgPerformOn10gigtEthernet.pdf

http://www.sics.se/~adam/lwip/

http://www.iometer.org/doc/documents.html

http://rsync.tools.ietf.org/html/rfc3821

http://www.ieee802.org/3/ba/

http://www.probsolvesolutions.co.uk/solutions/white_papers/adaptec/NAC_TechPaper2.pdf

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.8245

http://www.faqs.org/patents/app/20080228897

8.0 Appendix

8.1 How to run the tSAS and iSCSI mock initiator (client) and target (server) application

Since the applications were written in C using Microsoft Visual Studio 2008 Professional Edition, you will need to have Visual Studio 2008 downloaded on the system where you would like to run these applications. A client.exe and server.exe files are provided for tSAS and iSCSi in the code directory of this project.

Please run the Server.exe and Client.exe program for either iSCSI or tSAS:

1. You will see the following screens when you run the Server.exe and Client.exe files respectively

78

2. Enter the IP address of your server/target to see the following output screens

79

3. Select if you would like to test READs/WRITEs along with the transfer size to see the

output of the test results

80

8.2 How to run iSCSI Server and iSCSI Target Software

1. The StarWind iSCSI Initiator was used for this project. a. You may download the StarWind iSCSi Initiator software for free from

http://www.starwindsoftware.com/iscsi-initiatorb. After installing the software please refer to the “Using as iSCSi Initiator” PDF file

included in the references section.

2. The Kern Safe iSCSI Target was used to create an iSCSI Target

a. You may download the iSCSI target software (KernSafe iStorage Server) from http://www.kernsafe.com/product/istorage-server.aspx.

b. After installing and running it, please Click on the Create Target to Create a target and specify the type of target you would like to create as well as security specifications.

8.3 How to run LeCroy SAS Analyzer SoftwareThe LeCroy SAS Analyzer software can be downloaded from http://lecroy.ru/protocolanalyzer/protocolstandard.aspx?standardID=7

You can open the SAS Analyzer Traces provided in the SAS Analyzer Traces folder with this software.

Running the Report->Statistical Report will give you the Average Completion time of IOs and other useful information.

The SAS Analyzer traces are located in the project deliverable folder.

8.4 WireShark to view the WireShark tracesThe Wireshark Network analyzer Software can be downloaded from http://www.wireshark.org/

This software will let you capture and view the WireShark traces provided with this project. The WireShark traces are located in the project deliverable folder.

8.5 VDBench for LinuxVDBench can be downloaded from http://sourceforge.net/projects/vdbench/

After installing VDBench on linux, you may use a script similar to the one below to run IOs and look at the performance results.

82

http://sourceforge.net/projects/vdbench/

http://www.wireshark.org/

http://lecroy.ru/protocolanalyzer/protocolstandard.aspx?standardID=7

http://www.kernsafe.com/product/istorage-server.aspx

http://www.starwindsoftware.com/iscsi-initiator

sd=s1,lun=/dev/sdb,align=4096,openflags=o_direct*wd=wd1,sd=(s1),xfersize=2048KB,seekpct=0,rdpct=0*rd=rd1,wd=wd1,iorate=max,forthreads=1,elapsed=300,interval=1*

Lun=/dev/sdb simply states the target you are testing.

Xfersize is used to change the transfer size [37].

Seekpct=0 states that all IOs are sequential [37].

Forthreads=1 states that the Queue Depth or Number of outstanding IOs is 1 [37].

Interval=1 will simply display/update the performance results onto the screen every second [37].

For additional information on each field and additional feields please refer to the VDBench user guide [37].

83

gsc/pub/master/sreddy/doc/reportv4.docx · web viewserial attached scsi [1], the successor of scsi...

Documents