Genomics Big data: SAN

NFS VS. SAN VS. lUSTRE

NFS (Network File System)

NFS has been around for over 20 years, is very stable, easy to use and most systems administrators, as well as users, are generally familiar with its strengths and weaknesses. In low end HPC storage environments, NFS can still be a very effective medium for distributing data, where low end HPC storage systems are defined as capacity under 100TB and high end generally above 1PB. However, for high end HPC clusters, the NFS server can quickly become a major bottleneck as it does not scale well when used in large cluster environments. The NFS server also becomes a single point of failure, for which the consequences of it crashing can become severe.

SAN (Storage Area Networks)

SAN file systems are capable of very high performance, but are extremely expensive to scale up since they are implemented using Fibre Channel and therefore, each node that connects to the SAN must have a Fibre Channel card to connect to the Fibre channel switch.

Lustre (a Global Parallel File System)

The main advantage of Lustre, a global parallel file system, over NFS and SAN file systems is that it provides; wide scalability, both in performance and storage capacity; a global name space, and the ability to distribute very large files across many nodes. Because large files are shared across many nodes in the typical cluster environment, a parallel file system, such as Lustre, is ideal for high end HPC cluster I/O systems.

A typical storage system consists of a variety of components, including disks, storage controllers, IO cards, storage servers, storage area network switches, and related management software. Fitting all these components together and tuning them to achieve optimal performance presents significant challenges.

If you are managing your own infrastructure in your own private data center, then you are bound to go through a selection of different storage offerings. Selecting a storage solution pretty much depends on your requirement. Before finalizing a particular storage option for your use case, a little bit of understanding about the technology is always helpful.

I was actually going to write an article about object storage (which is the current hottest storage option in the cloud). But before going and discussing that part of the storage arena, I thought its better to discuss the two main storage methods which co-exists together from a very long time, used by companies internally for their needs.

The decision of your storage type will depend on many factors like the below ones.

Type of data that you want to store
Usage pattern
Scaling concerns
Finally your budget

When you begin your career as a system administrator, you will often hear your colleagues talking about different storage methods like SAN, NAS, DAS etc. And without a little bit of digging, you are bound to get confused with different terms in storage. The confusion arises often because of the similarities between the different approaches in storage. The only hard and fast rule to stay up to date in technical terms, is to keep on reading stuffs (especially concepts behind a certain technology.)

Today we will be discussing two different methods that defines the structure of storage in your environment. Your choice of the two in your architecture should only depend on your use case, and type of data that you store.

By the end of this tutorial, I hope you will have a clear picture about the main two types of storage methods, and what to select for your need.

SAN (Storage Area Network) and NAS(Network Attached Storage)

The main things that differentiate each of these technologies are mentioned below.

How a storage is connected to a system. In short how the connection is made between the accessing system and the storage component (directly attached or network attached)
Type of cabling used to connect. In short this is the type of cabling done to connect a system to the storage component (eg. Ethernet & Fiber channel)
How are input and output requests done. In short this is the protocol used to conduct input and output requests (eg. SCSI, NFS, CIFS etc)

Related: How to monitor IO on linux

Let's discuss SAN first and then NAS, and at the end, let's compare each of these technologies to clear the differences between them.

SAN(Storage Area Network)

Today's applications are very much resource intensive, due to the kind of requests that needs to be processed simultaneously per second. Take example of an e-commerce website, where thousands of people are making orders per second, and all needs to be stored properly in the database for later retrieval. The storage technology used to store such high traffic data bases must be fast in request serving and response(in short it should be fast in Input and Output).

Related: Web server Performance test

In such cases(where you need high performance, and fast I/O ) we can use SAN.

SAN is nothing but a high speed network that makes connections between storage devices and servers.

Traditionally application servers used to have their own storage devices attached to them. Server's talk to these devices by a protocol known as SCSI(Small Computer System Interface). SCSI is nothing but a standard used to communicate between servers and storage devices. All normal hard disks, tape drives etc uses SCSI. In the beginning the storage needs of a server was fulfilled by a storage devices that was included inside the server(the server used to talk to those internal storage device, using SCSI. This is very much similar to how a normal desktop talks to its internal hard disk.).

Devices like Compact Disk drives are attached to the server(which are part of the server) using SCSI. The main advantage of SCSI for connecting devices to a server was its high throughput. Although this architecture is sufficient for low end requirements, there are few limitations like the below mentioned ones.

The server can only access data on the devices, which are directly attached to it.
If something happens to the server, access to data will fail (because the storage device is part of the server and is attached to it using SCSI)
There is a limit in the number of storage devices the server can access. In case the server needs more storage space, there will be no more space that can be attached, as the SCSI bus can accommodate only a finite number of devices.
Also the server using the SCSI storage has to be near the storage device(because parallel SCSI, which is the normal implementation in most computer's and servers, has some distance limitations. It can work up to 25 meters.)

Some of these limitations can be overcame using DAS (Directly Attached Storage). The media used to directly connect storage to the server can be any one of SCSI, Ethernet, Fiber channel etc.). Low complexity, Low investment, Simplicity in deployment caused DAS to be adopted by many for normal requirement's. The solution was good even performance wise, if used with faster mediums like fiber channel.

Even an external USB drive attached to a server is also a DAS(well conceptually its DAS, as its directly attached to the server's USB bus). But USB drives are normally not used due to the speed limitation of USB bus. Normally for heavy and large DAS storage solutions, the media used are SAS(Serially attached SCSI). Internally the storage device can use RAID(which normally is the case) or anything to provide storage volumes to servers. SAS storage options provide 6Gb/s speed these days.

An example of DAS storage device is Dell's MD1220

To the server, a DAS storage will appear very much similar to its own internal drive or an external drive that you plugged in.

Although DAS is good for normal needs and gives good performance, there are limitations like the number of servers that can access it. Storage device, or say DAS storage has to be near to the server (in the same rack or within the limits of the accepted distance of the medium used.).

It can be argued that, directly attached storage(DAS) is faster than any other storage methods. This is because it does not involve any overhead of data transfer over the network (all data transfer occurs on a dedicated connection between the server and the storage device. Mostly its Serially attached SCSI or SAS). However due to latest improvement's in fiber channel and other caching mechanism's, SAN also provides better speed's similar to DAS, and in some cases, it surpasses the speed provided by a DAS.

Before getting inside SAN, let's understand several media types and methods that are used to interconnect storage devices(when i say storage devices, please dont consider it as one single hard disk. Take it as an array of disk's, probably in some RAID level. Consider it as something like Dell's MD1200).

what is SAS(Serially Attached SCSI), FC(Fibre Channel), and iSCSI (Internet Small Computer System Interface)?

Traditionally the SCSI devices like the internal hard disk's are connected to a shared parallel SCSI bus. This means all devices attached, will be using the same bus to send/receive data. But shared parallel connections are not good for high accuracy, and create issues during high speed transfers. However a serial connection between the device and the server can increase the overall throughput of the data transfer. SAS connections between storage devices and servers uses a dedicated 300 MB/Sec per disk. Think of SCSI bus that shares the same speed for all devices connected.

SAS uses the same SCSI commands to send and receive data from a device. Also please do not think that SCSI is only used for internal storage. It is also used for external storage device to be connected to the server.

If data transfer performance and reliability is the choice, then using SAS is the best solution. In terms of reliability and error rate SAS disks are much better compared to the old SATA disks. SAS was designed by keeping performance in mind, due to which it is full-duplex. This means, data can be send and received simultaniously from a device using SAS. Also a single SAS host port can connect to multiple SAS drives using expanders. SAS uses point to point data transfer by using serial communication between devices (storage device, like disk drives & disk array's) and hosts.

The first generation of SAS provided around 3Gb/s of speed. The second generation of SAS improved this to 6Gb/s. And the third generation (which is currently used by many organization's for extremly high throughput) improved this to 12Gb/s.

Fiber Channel Protocol

Fiber Channel is a relatively new interconnection technology used for fast data transfer. The main purpose of its design is to enable transport of data at faster rates with a very less/negligible delay. It can be used to interconnect workstations, peripherals, storage array's etc.

The major factor that distinguishes fiber channel from other interconnecting method is that, it can manage both networking and I/O communication over a single channel using the same adapters.

ANSI (American National Standards Institute) standardized Fiber channel during 1988. When we say Fiber (in Fiber channel) do not think that it only supports optical fiber medium. Fiber is a term used for any medium used to interconnect in fiber channel protocol. You can even use copper wire for lower cost.

Please note the fact that fiber channel standard from ANSI supports networking, storage and data transfer. Fiber channel is not aware of the type of data that you transfer. It can send SCSI commands encapsulated inside a fiber channel frame(it does not have its own I/O commands to send and receive storage). The main advantage is that it can incorporate widely adopted protocols like SCSI and IP inside.

The components of making a fiber channel connection are mentioned below. The below requirement is very minimal to achieve a point to point connection. Typically this can be used for a direct connection between a storage array and a host.

An HBA (Host Bus Adapter) with Fiber channel port
Driver for the HBA card
Cables to interconnect devices in HBA fiber channel port

As mentioned earlier, SCSI protocol is encapsulated inside fiber channel. So normally SCSI data has to be modified to a different format that fiber channel can deliver to the destination. And when the destination receives the data it then retranslates it to SCSI.

You might be thinking that why do we need this mapping and re-mapping, why cant we directly use SCSI to deliver data. Its because SCSI cannot deliver data to greater distances to large number of devices (or large number of hosts).

Fiber cannel can be used to interconnect systems as far as 10KM (if used with optical fibers. You can increase this distance by having repeaters in between). And you can also transfer data to an extent of 30m using a copper wire for lower cost in fiber cannel.

With the emergence of fiber channel switches from variety of major vendors, connecting many large number of storage devices and servers have now become an easy task(provided you have the budget to invest). The networking ability of fiber channel led to the advanced adoption of SAN(Storage Area Networks) for faster, long distance, and reliable data access. Most of the high computing environment's(which requires fast and large volume data transfers) uses fiber channel SAN with optical fiber cables.

The current fiber channel standard (called as 16GFC) can transmit data at the rate of 1600MB/s(dont forget the fact that this standard was released in 2011). The upcoming standards in the coming years are expected to provide 3200MB/s and 6400MB/s speed.

iSCSI(Internet Small Computer System Interface )

iSCSI is nothing but an IP based standard for interconnecting storage arrays and hosts. It is used to carry SCSI traffic over IP networks. This is the simplest and cheap solution(although not the best) to connect to a storage device.

This is a nice technology for location independent storage. Because it can establish connection to a storage device using local area networks, Wide area network. Its a Storage Area Network interconnection standard. It does not require special cabling and equipments like the case of a fiber channel network.

To the system using a storage array with iSCSI, the storage appears as a locally attached disk. This technology came after fiber channel and was widely adopted due to it low cost.

Its a networking protocol which is made on top of TCP/IP. You can guess that its not at all good performance wise, when compared with fiber channel(simply because everything is running over TCP with no special hardware and modifications to your architecture.)

iSCSI introduces a little bit of CPU load on the server, because the server has to do the extra processing for all storage requests over the network, with the regular TCP.

Related: Linux CPU performance Monitoring

iSCSI has the following disadvantages, compared to fiber channel

iSCSI introduces a little bit more latency compared to fiber channel, due to the overhead of IP headers
Database applications have small read and write operations, which when done on iSCSI will introduce more latency
iSCSI when done on the same LAN, which contains other normal traffic (other infrastructure traffic other than iSCSI), it will introduce a read/write lag or say low performance.
The maximum speed/bandwidth is limited to your ethernet and network speed. Even if you aggregate multiple links, it does not scal to the level of a fiber channel.

NAS(Network Attached Storage)

The simplest definition of NAS is "Any server that shares its own storage with others on the network and acts as a file server is the simplest form NAS".

Please make a note of the fact that Network Attached Storage shares files over the network. Not storage device over the network.

NAS will be using an ethernet connection for sharing files over the network. The NAS device will have an IP address, and then will be accessible over the network through that IP address. When you access files on a file server on your windows system, its basically NAS.

The main difference is in how your computer or the server treats a particular storage. If the computer treats a storage as part of itself(similar to how you attach a DAS to your server), in other words, if the server's processor is responsible for managing the storage attached, it will be some sort of DAS. And if the computer/server treats the storage attached as another computer, which is sharing its data through the network, then its a NAS.

Directly attached storage(DAS) can be viewed as any other peripheral device like mouse keyboard etc. Because to the server/computer, its a directly attached storage device. However NAS is another server, or say an equipment having its own computing features that can share its own storage with others.

Even SAN storage can also be considered as an equipment that has its own processing/computing power. So the main difference between NAS, SAN and DAS is how the server/computer accessing it sees. A DAS storage device appears to the server as part of itself. The server sees it as its own physical part. Although the DAS storage device might not be inside the server(its normally another device with its own storage array), the server sees it as its own internal part(DAS storage appears to the server as its own internal storage)

When we talk about NAS, we need to call them shares rather than storage devices. Because NAS appears to a server as a shared folder instead of a shared device over the network. Please do not forget the fact that NAS devices are computers in themselves, who can share their storage space with others. When you share a folder with access control using SAMBA, its NAS.

Although NAS is a cheaper option for your storage needs. It really does not suit for an enterprise level high performance application. Never ever think of using a database storage (which needs to be high performing) with a NAS. The main downside of using NAS is its performance issue, and dependency on network(most of the times, the LAN which is used for normal traffic is also used for sharing storage with NAS, which makes it more congested)

Related: Linux Network Performance Tuning

When you share an export with NFS over the network, its also a form of NAS.

Related: NFS Tutorial in Linux

A NAS is nothing but a device/equipmet/server attached to TCP/IP network, that shares its own storage with other's. If you dig a little deeper, when a file read/write request is send to a NAS share attached to a server, the request is sent in the form of a CIFS(Common internet file system) or NFS(Network File system) requests over the network. The receiving end(NAS device), on receiving the NFS, CIFS request, will then convert it into the local storage I/O command set. This is the reason, why a NAS device has its own processing and computing power.

So NAS is file level storage(Because its basically a file sharing technology). This is because it hides the actual file system under the hood. It gives the users an interface to access its shared storage space using NFS, or CIFS.

A common use of NAS you can find is to provide each user with a home directory. These home directories are stored in a NAS device, and mounted to the computer, where the user logs in. As the home directory is networkly accessible, the user can log in from any computer on the network.

Advantages of NAS

NAS has a less complex architecture compared to SAN
Its cheaper to deploy in an existing architecture.
No modification is required on your architecture, as a normal TCP/IP network is the only requirement

Disadvantages of NAS

NAS is slow
Lowever throughput and high latency, due to which it cannot be used for high performance applications

Getting Back to SAN

Now let's get back to our discussion of SAN(Storage area network) which we started earlier in the beginning.

The first and foremost thing to understand about SAN (apart from the things we already discussed in the beginning) is the fact that its a block level storage solution. And SAN is optimized for high volume of block level data transfer. SAN is performs best when used with fiber channel medium (optical fibers, and a fiber channel switch )

Both NAS and SAN solves the problem of keeping the storage device nearer to the server accessing it(which was the case with DAS). A SAN storage can be alloted to a server, which in tern can share it with other's using NAS. Please do not forget the fact that the underlying disks on a DAS, NAS and a SAN can be in any form of a RAID (what makes the real difference is how the server access these storage devices, using which protocol and media).

The name Storage Area Network itself implies that the storage resides in its own dedicated network. Hosts can attach the storage device to itself using either Fiber channel, TCP/IP network (SAN uses iSCSI when used over tcp/ip network).

SAN can be considered as a technology that combines the best features of both DAS and NAS. If you remember, DAS appears to the computer as its own storage device, and is known for good speed, DAS is also a block level storage solution(if you remember, we never talked of CIFS or NFS during DAS). NAS is known for its flexibility, primary access through network, access control etc. SAN combines the best features of both of these worlds together because....

SAN storage also appears to the server as its own storage device
Its a block level storage solution
Good performance/speed
Networking features using iSCSI

SAN and NAS are not competing technologies, but were designed for different needs and purposes. As SAN is a block level storage solution, its best suited for high performance data base storage, email storage etc. Most modern SAN solutions provide, disk mirroring, archiving backup and replication features as well.

SAN is a dedicated network of storage devices(can include tape drives storages, raid disk arrays etc) all working together to provide an excellent block level storage. While NAS is a single device/server/computing appliance, sharing its own storage over the network.

Major Differences between SAN and NAS

SAN	NAS
Block level data access	File Level Data access
Fiber channel is the primary media used with SAN.	Ethernet is the primary media used with NAS
SCSI is the main I/O protocol	NFS/CIFS is used as the main I/O protocol in NAS
SAN storage appears to the computer as its own storage	NAS appers as a shared folder to the computer
It can have excellent speeds and performance when used with fiber channel media	It can sometimes worsen the performance, if the network is being used for other things as well(which normally is the case)
Used primarily for higher performance block level data storage	Is used for long distance small read and write operations

[1] http://www.slashroot.in/san-vs-nas-difference-between-storage-area-network-and-network-attached-storage

[2]http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/architecting-lustre-storage-white-paper.pdf

Genomics Big data

Tuesday, July 5, 2016

SAN