Designing RavenFS

What is RavenFS? RavenFS is a distributed file system designed to reliably replicate large files across WANs.

What does that actually mean? The scenario we have is actually quite simple. Given a file in location A that we need to have in location B (geo-distributed), how do we move that file across the WAN? Let me make the problem slightly more interesting:

  • The file is large: hundreds of megabytes at the low end, tens of gigabytes at the high end.
  • The two locations might only be connected over a WAN.
  • The connection is assumed to be flaky.

Let us consider a few scenarios where this can be useful:

  • I have a set of videos that I would like edited in some fashion (say, putting Bang! and Wham! callouts in some places). Since I have zero video-editing ability, I hire a firm in India to do that for me. The problem is that each video file is large, and just sending the files to India and back is a challenge. (Large file distributed collaboration problem.)
  • I have a set of web servers where users can upload images. We need to send those images to background servers for processing, and then they need to be made available to the web servers again. The images are too large to send over traditional queuing technologies. (Udi Dahan calls this the Data Bus problem.)
  • I have a set of geo-distributed locations where I have a common set of files (think about something like scene information for rendering a game) that needs to be kept in sync. (Distributed file replication).

I have run into each of those problems (and others that fall into similar categories) several times in recent months. Enough to convince me that:

  • There is a need here that people would be willing to pay for.
  • It is something that we can provide a solution for.
  • There is a host of other considerations related to this set of problems that we can also address. An obvious example is simple backup procedures.

The actual implementation will probably vary, but this is the initial design for the problem.

A RavenFS node is going to run as an HTTP web server. That removes a lot of complexity from our lives, since we can utilize a lot of pre-existing protocols and behaviors. HTTP already supports the notion of partial downloads and parallel uploads (Range, If-Range, Content-Range), so we can reuse a lot of that.
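
Since this is plain HTTP, a stock client can already pull a byte range. A minimal sketch, assuming a node listening at http://localhost:9090 (an address I made up) and a hypothetical file path:

```python
import urllib.request

# Ask for just the first 4 MB page of a (hypothetical) file.
req = urllib.request.Request(
    "http://localhost:9090/static/videos/intro.avi",
    headers={"Range": "bytes=0-4194303"},
)
with urllib.request.urlopen(req) as resp:
    # 206 Partial Content means the server honored the range request.
    print(resp.status, resp.headers.get("Content-Range"))
    first_page = resp.read()
```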

From an external perspective, a RavenFS node exposes the following endpoints (a usage sketch follows the list):

  • GET /static/path/to/file <- get the file contents, optionally just a range
  • PUT /static/path/to/file <- put file contents, optionally just a range
  • DELETE /static/path/to/file <- delete the file
  • GET /metadata/path/to/file <- get the metadata about a file
  • GET /browse/path/to/directory <- browse the content of a directory
  • GET /stats <- number of files, current replication efforts, statistics on replication, etc
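
To make that concrete, here is a hedged sketch of exercising those endpoints from Python; the node address, the file path, and the exact Content-Range semantics for a ranged PUT are my assumptions, not part of the spec:

```python
import urllib.request

BASE = "http://localhost:9090"  # hypothetical node address

# Upload a single 4 MB slice of a 750 MB file; parallel uploads would
# issue several of these PUTs with different Content-Range values.
chunk = b"\x00" * (4 * 1024 * 1024)
req = urllib.request.Request(
    BASE + "/static/videos/intro.avi",
    data=chunk,
    method="PUT",
    headers={"Content-Range": "bytes 0-4194303/786432000"},
)
urllib.request.urlopen(req)

# Read back the file's metadata (its key/value properties).
with urllib.request.urlopen(BASE + "/metadata/videos/intro.avi") as resp:
    print(resp.read().decode())
```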

A file in RavenFS consists of the following (sketched in code right after the list):

  • The file name
  • The file length
  • A sequence of bytes that makes up the file contents
  • A set of key/value properties that contains file metadata
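
In code, that model might look something like this; the type and field names are mine, not part of the design:

```python
# A sketch of the file model above; names are my own, not RavenFS's.
from dataclasses import dataclass, field

@dataclass
class RavenFile:
    name: str                     # the file name
    length: int                   # the file length in bytes
    metadata: dict[str, str] = field(default_factory=dict)  # key/value properties
    pages: list[str] = field(default_factory=list)  # the contents, as page signatures (see below)
```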

Internally, files are stored in a transactional store. Each file is composed of pages; each page is at most 4 MB in size and is identified by its signature. (Actually, a pair of hash signatures, probably SHA256 & RIPEMD160, to avoid any potential for collision.) The contents of a file are really just the ordered list of pages that it is composed of.
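
Here is a minimal sketch of that chunking and signing scheme, under the assumptions stated above (4 MB pages, a SHA256/RIPEMD160 signature pair):

```python
import hashlib

PAGE_SIZE = 4 * 1024 * 1024  # 4 MB maximum per page

def page_signature(data: bytes) -> tuple[str, str]:
    sha = hashlib.sha256(data).hexdigest()
    # RIPEMD160 support depends on the local OpenSSL build.
    ripemd = hashlib.new("ripemd160", data).hexdigest()
    return sha, ripemd

def chunk_into_pages(stream):
    """Yield (signature, bytes) for each successive page of a stream."""
    while True:
        page = stream.read(PAGE_SIZE)
        if not page:
            break
        yield page_signature(page), page
```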

The notion of pages is pretty important for several reasons:

  • It provides us with a standard way to identify pieces of the files.
  • Each page may be part of multiple files.
  • Pages are immutable; once written to storage they cannot be modified (but a page can be removed once no file references it).
  • It makes it easier to chunk data to send while replicating.
  • It drastically reduces the storage taken by files that share much of the same data, as the sketch below illustrates.
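
To illustrate the sharing and cleanup, here is an in-memory stand-in for the transactional store; a sketch of the idea, not the actual implementation:

```python
class PageStore:
    def __init__(self):
        self.pages = {}      # signature -> page bytes (each page stored once)
        self.refcount = {}   # signature -> number of references to the page
        self.files = {}      # file name -> ordered list of page signatures

    def put_file(self, name, page_iter):
        sigs = []
        for sig, data in page_iter:
            if sig not in self.pages:          # shared pages are stored once
                self.pages[sig] = data
            self.refcount[sig] = self.refcount.get(sig, 0) + 1
            sigs.append(sig)
        self.files[name] = sigs

    def delete_file(self, name):
        for sig in self.files.pop(name):
            self.refcount[sig] -= 1
            if self.refcount[sig] == 0:        # no file references this page
                del self.pages[sig], self.refcount[sig]
```

Feeding it the output of chunk_into_pages() from the earlier sketch gives you content-addressed, deduplicated storage: a page that appears in ten files is stored once, with a reference count of ten.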

Let us try to analyze this further. Say we have a 750 MB video that we put inside RavenFS. Internally, that file is chunked into 188 pages, each about 4 MB in size. Since we have set up replication to the RavenFS node in India, we start replicating each of those pages as soon as we are done saving it to the local RavenFS node. In other words, even while we are uploading the file to the local RavenFS node, it is being replicated to the remote RavenFS nodes, saving us the need to wait until the full file is uploaded before replication can begin. Once the entire file has been replicated to the remote node, the team in India can start editing it.
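
A sketch of that pipelining, reusing chunk_into_pages() and PageStore from the earlier sketches; replicate_page() stands in for the actual HTTP push to the remote node:

```python
from concurrent.futures import ThreadPoolExecutor

def store_and_replicate(stream, store, name, replicate_page):
    futures = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        def pages():
            for sig, data in chunk_into_pages(stream):
                # queue the page for the remote node while the upload continues
                futures.append(pool.submit(replicate_page, sig, data))
                yield sig, data
        store.put_file(name, pages())
    for f in futures:
        f.result()  # surface any replication failures
```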

They make changes in three different places, then save the file to RavenFS again. In total they have modified 24 MB of content, touching 30 pages. That means that to replicate back to the local RavenFS node, we only need to send 120 MB (30 pages × 4 MB), instead of 750 MB.

This reduces both the time and the bandwidth required for replication. The same holds, by the way, for a set of files that share common parts: we will not store the shared information twice. For that matter, the RavenFS client will be able to ask the RavenFS node which pages are already stored, and so won't even need to upload pages that are already on the server.
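
No such negotiation endpoint appears in the list above, so this part is purely hypothetical, but the client-side logic is just a set difference:

```python
def pages_to_upload(local_pages, server_known_sigs):
    # local_pages: iterable of (signature, data) pairs for the file.
    # server_known_sigs: set of signatures the node already has, obtained
    # via a hypothetical "which of these pages do you know?" query.
    return [(sig, data) for sig, data in local_pages
            if sig not in server_known_sigs]
```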

Another important factor in the decision to use pages is that when replicating across an unreliable medium, sending large files around in a single chunk is a bad idea. It is pretty common for the connection to drop, and if you need a perfect connection for the entire duration of a 1.5 GB transfer, you are going to be in a pretty bad place very soon.
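
With pages, a dropped connection costs you at most one page, not the whole transfer. A sketch of per-page retries with backoff; send_page() is again a placeholder for the HTTP PUT:

```python
import time

def send_with_retries(pages, send_page, max_attempts=5):
    for sig, data in pages:
        for attempt in range(max_attempts):
            try:
                send_page(sig, data)
                break  # this page made it; move on to the next one
            except OSError:
                time.sleep(2 ** attempt)  # back off, then retry just this page
        else:
            raise RuntimeError(f"page {sig} failed after {max_attempts} attempts")
```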

Thoughts?