What to account for when starting a file hosting platform
*The following article is in no way a complete "HOWTO" on designing and implementing a file hosting platform; it should be taken as a series of recommendations on the subject.*
First of all, we need fast storage servers.
One would think that a few bulky servers with lots of HDD space, fast RAID10 arrays and gigabit uplinks would be enough. In the real world, this is not quite right: you will soon notice that you cannot push even 200-300 Mbit/s from these servers, because the bottleneck is the hard drives, which have to service many concurrent random reads and writes.
A better option would be a few RAID1 arrays, or even better, a few separate drives, since MTBF is quite high for the current HDD models on the market.
This way, for the storage servers, you could use dual- or quad-CPU machines with 10-20 x 1 TB hard drives (how many depends on the SCSI/SAS/SATA controller model). The system partitions are best placed on a separate RAID1 array; 2x36 GB or 2x72 GB SAS/SCSI drives are usually a good choice.
This will ensure necessary performance at a reasonable cost.
With high-performance storage that provides high disk I/O comes the need for faster network links. Putting these servers on 100 Mbps links would underutilise the disk resources. 1 Gbps links are more than welcome, but be aware that the practical maximum throughput of a gigabit link is somewhere around 700-800 Mbps. If your hosting provider bills you for allocated bandwidth, it is more economical to get an 800 Mbps allotment than to get a full gigabit and pay for 200-300 Mbps you will never use.
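To make the billing argument concrete, here is a quick back-of-the-envelope calculation. The per-Mbps price is a made-up figure purely for illustration; substitute your provider's real rates.

```python
# Rough cost comparison for committed-bandwidth billing.
# PRICE_PER_MBPS and the ~750 Mbps practical ceiling are assumptions
# for illustration only; plug in your provider's real numbers.

PRICE_PER_MBPS = 10.0            # hypothetical monthly price per committed Mbps
PRACTICAL_GIG_THROUGHPUT = 750   # Mbps you can realistically push over one gig link

cost_full_gig = 1000 * PRICE_PER_MBPS    # paying for the full gigabit
cost_800_commit = 800 * PRICE_PER_MBPS   # paying for an 800 Mbps allotment

# Money spent on headroom the link can never actually deliver:
wasted = cost_full_gig - PRACTICAL_GIG_THROUGHPUT * PRICE_PER_MBPS

print(f"1 Gbps commit:   ${cost_full_gig:.0f}/month ({wasted:.0f} of it unusable headroom)")
print(f"800 Mbps commit: ${cost_800_commit:.0f}/month")
```

With these illustrative numbers, the 800 Mbps commitment saves 20% per month while sacrificing little real throughput.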
Apart from high-performance hardware, we must use appropriate software for this job.
The most popular and most used open-source web server - Apache - won't give us the appropriate performance since it is not best-suited for serving (lots of) static content.
Instead, we consider nginx (http://sysoev.ru/nginx/) a much better alternative.
Nginx is a very fast HTTP server with a small memory footprint (it also has IMAP/POP3/SMTP proxy functionality, but that is not related to the current subject). It makes use of kqueue on FreeBSD, epoll on Linux, and sendfile() on both.
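A minimal sketch of an nginx configuration tuned for serving static downloads, using the mechanisms mentioned above. The domain and paths are placeholders; the directives themselves are standard nginx.

```nginx
worker_processes  4;

events {
    # nginx picks epoll/kqueue automatically on Linux/FreeBSD,
    # but the method can also be forced explicitly:
    use epoll;
}

http {
    sendfile    on;   # zero-copy file transmission via sendfile()
    tcp_nopush  on;   # send response headers and file start in one packet

    server {
        listen      80;
        server_name files.example.com;   # placeholder domain
        root        /var/www/files;      # placeholder storage path
    }
}
```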
Now, the following depends heavily on your application's architecture and on the number of visitors you expect, but there are some common ways to arrive at the final design of the infrastructure. If your budget permits it, it is far better to have more resources from the start and expand later than to discover very quickly that your implementation does not scale and be forced to re-design parts of the system and add new resources in a hurry. Some questions to answer early on:
- How will users upload files: FTP? HTTP?
- How will they download files: FTP? HTTP?
- Do end users authenticate when downloading?
- Will you distinguish between paid accounts and free users?
If FTP is involved, you will need an FTP server that supports virtual user accounts and virtual quotas, and that can be integrated with MySQL (or whichever other DBMS you are going to use).
For the web part (the site itself and any eventual payment system), it is highly recommended to start with a few servers rather than a single one, even though this is the part that can be adjusted and scaled more easily than, for example, the DB backend.
To reduce the load on the web servers, don't rely on Apache alone. Separate the static content (images, JS, CSS files) from the scripts:
1. Move the static content to a different subdomain (this needs to be done within the application) and serve it through nginx.
2. At the web server level, set up nginx in front of Apache and serve all content through nginx; requests for dynamic content are proxied to Apache.
3. If you are using PHP scripts, same as (2), but get rid of Apache and proxy requests for .php files to a FastCGI backend (such as PHP with the php-fpm patch).
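Option 3 can be sketched roughly as follows: nginx serves static files from disk itself and hands .php requests to a FastCGI backend such as php-fpm. Domain, paths and the backend port are placeholder assumptions.

```nginx
server {
    listen      80;
    server_name www.example.com;   # placeholder domain
    root        /var/www/site;     # placeholder document root

    # Static content is served directly from disk by nginx.
    location / {
        index index.php;
    }

    # Requests for .php files go to the FastCGI backend
    # (e.g. php-fpm listening on 127.0.0.1:9000).
    location ~ \.php$ {
        fastcgi_pass  127.0.0.1:9000;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include       fastcgi_params;
    }
}
```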
Once you run out of resources, you can easily add new web servers, mount the web content from a common NFS share, and balance the load in one of the following ways:
1. Round-robin DNS load balancing.
2. A software load balancer in front of the web servers: either nginx (in this case you can have multiple Apache or FastCGI backends and serve static content directly from the balancer) or HAProxy (which can only proxy requests, but has advanced health checks, ACLs and other interesting features).
3. A hardware load balancer (like Alteon / Radware / BIG-IP).
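Option 2 with nginx as the balancer could look like the sketch below: requests are distributed round-robin across the backends, while static files are served directly from the balancer's disk (e.g. the NFS mount). Backend addresses, domain and paths are placeholders.

```nginx
# Pool of web backends; nginx distributes requests round-robin by default.
upstream backends {
    server 10.0.0.11:80;   # placeholder backend addresses
    server 10.0.0.12:80;
    server 10.0.0.13:80;
}

server {
    listen      80;
    server_name www.example.com;   # placeholder domain

    # Static content served directly from the balancer.
    location ~* \.(jpg|png|gif|css|js)$ {
        root /var/www/static;      # e.g. the common NFS mount
    }

    # Everything else is proxied to the backend pool.
    location / {
        proxy_pass http://backends;
    }
}
```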
The most sensitive part: DB servers.
We will give examples assuming that you are going to use MySQL.
Implementing a successful topology from the start greatly depends on your application structure.
It is great if the application is able to separate database reads and writes; this gives you greater flexibility. It is also great if the application can use multiple DB servers.
When it comes to hardware:
Use fast 15k RPM SCSI/SAS drives in RAID1 or RAID10, and get lots of RAM: DB servers love huge amounts of RAM.
Try to avoid RAID5 if possible, since its write performance is slower than RAID1/RAID10, and a degraded RAID5 array (one with a failed drive) is slower still.
Use a scalable architecture.
As stated above with regard to the application structure, it is great if we can separate reads and writes to different servers. In this case we can use master-slave replication, or even better, master-multiple-slave replication: INSERTs / UPDATEs / DELETEs are performed on the master, while SELECTs can be spread across the slaves.
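The read/write split described above can be sketched at the application level as a small router that sends SELECTs to the slaves (round-robin) and everything else to the master. The connection objects here are stand-ins for illustration; in a real application they would be actual MySQL connections.

```python
import itertools

# Minimal sketch of application-level read/write splitting.
# StubConn stands in for a real MySQL connection (illustration only).

class DBRouter:
    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)  # round-robin over the slaves

    def execute(self, sql, *params):
        verb = sql.lstrip().split()[0].upper()
        if verb == "SELECT":
            conn = next(self._slaves)   # reads go to a slave
        else:
            conn = self.master          # INSERT/UPDATE/DELETE go to the master
        return conn.run(sql, params)

class StubConn:
    def __init__(self, name):
        self.name = name
    def run(self, sql, params):
        return self.name  # a real connection would return rows / row counts

router = DBRouter(StubConn("master"), [StubConn("slave1"), StubConn("slave2")])
print(router.execute("SELECT * FROM files WHERE id = %s", 1))          # slave1
print(router.execute("INSERT INTO files (name) VALUES (%s)", "a.zip")) # master
```

A real implementation also has to consider replication lag: a SELECT issued immediately after a write may not see it on a slave yet, so reads that must be current can be forced to the master.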
On the master, store the databases and the binary logs on different RAID1 arrays: separate physical devices mean faster random reads/writes.
The most sensitive server is the master. You can double it with an intermediary slave, which can become the master if the current one fails (think about automating this; tip: heartbeat / wackamole).
A sample diagram is attached, which includes the last few discussed details: