Backend Graph DB for Custom File System

This post is based on what I learned implementing Neo4Jfs, a customized Java file system built with a graph database (Neo4J) backend. In this post, I’ll identify the challenges in creating a custom file system, in particular file tree management, and propose an alternative. If intrigued but unsure what creating a Java file system actually means, you may find Bootstrapping A Java File System helpful.

Overview

Hands up: how many of you have received a similar feature request from your product team?

A Glorious Feature

Users provide supporting information for projects by uploading and associating files with one or more customer projects. The files’ type or format is based on the project’s or customer’s requirements, such as Word documents, PDFs, spreadsheets, multimedia (images, audio, video), structured text (JSON), and others. Files are accessed through the web application or downloaded via API calls and recipient-specific hyperlinks.

Files are organized into folders created for each project. All projects are initialized with a My Project Folder to which files are uploaded or additional folders created. Customers usually create subfolders based on project requirements and then begin uploading files, but may also reorganize on an ongoing basis. The number of files and depth of subfolders is technically unlimited, though the expected depth is no more than seven (7) levels for the majority of projects. For example, a folder structure created for financial planning:

  • My Project Folder
    • Customer A
      • Tax Documents
      • Previous Plans
      • Questionnaires
    • Customer B ….

Uploaded files are stored externally in our cloud provider’s storage resource. File security is managed via the application’s authentication/authorization mechanism.

Additional folder/file operations should be supported by V2 – copy, rename, modify, delete, move, etc. – and managing folders/files must be intuitive based on users’ normal computer experience.

Forgive me for skipping the flowery, vacuous, unnecessary prose included by “professional” project managers, but reading between the lines greatly simplifies things: Implement a file management subsystem in which users manage files within the application without leveraging local storage, providing a user experience which mimics the daily file management tasks of a laptop or desktop.

Got it! It’s a custom file system. That I grok.

Implement A Custom File System? Why?

Storage requirements usually drive the need for implementing custom file management, such as:

  • Externally-persisted files – i.e., AWS S3, Azure Blob Storage, Google Cloud Storage, etc., most importantly separate from the running application or service – provide improved space and cost management beyond the capabilities of local disk storage;
  • Automatic encryption, enhanced authentication/authorization and other techniques provide additional file protections;
  • Automatic compression and copy-on-write techniques reduce overall space requirements and storage costs;
  • Optimized access times by caching, copying, or moving files closer to users who require access;
  • Adaptive storage approaches based on customer product, file type managed, redundancy required, or other choices deemed important.

That said, modern tools have greatly simplified custom storage implementations, making them much more straight-forward and, dare I say, even somewhat boring: create file and write bytes, open existing file and read bytes, copy file, delete file, blah blah blah. As aspiring software engineers, we first wrote Hello, World in our programming language of choice, immediately followed by a program with simple input/output. Been there, done that. Meh.

The real challenge is designing a functional, useful, efficient file tree. It may sound simple, but I’ve had more fails – some epic – than unequivocal successes. In essence you’re recreating fundamental operating system concepts – e.g., *nix inodes or NTFS‘s master file table in Windows – but for user land. Definitely easier said than done.

Keeping-It-Simple Design

Most – but not all – custom file systems persist their file tree to a relational database. The data model is easy to design, understand, and work with.

Phase I

The directory table contains directories created via the application. Aside from the top-level root directory, each directory knows its parent via the parent_dir_id column, a self-referencing foreign key back to the directory table.

The file table contains files created and uploaded via the application, its containing directory identified by the dir_id column, a foreign key to the directory table. The storage_id column holds some identifier that allows the externally-stored file to be accessed, retrieved, modified, deleted, whatever.

An understandable place to begin as you internalize the requirements, and great for small, test-like file trees with a limited number of files and directories. However, problems begin to emerge as the directory depth increases, mostly with data access and performance. To retrieve /a/b/c/d/e/f/g/h/i/j/k/l.pdf, the SQL statement may be as follows:

SELECT
    f.id,
    f.storage_id
FROM
    directory d1
      JOIN directory d2 ON (d2.parent_dir_id = d1.dir_id AND d2.name = 'a')
      JOIN directory d3 ON (d3.parent_dir_id = d2.dir_id AND d3.name = 'b')
      JOIN directory d4 ON (d4.parent_dir_id = d3.dir_id AND d4.name = 'c')
      JOIN directory d5 ON (d5.parent_dir_id = d4.dir_id AND d5.name = 'd')
      JOIN directory d6 ON (d6.parent_dir_id = d5.dir_id AND d6.name = 'e')
      JOIN directory d7 ON (d7.parent_dir_id = d6.dir_id AND d7.name = 'f')
      JOIN directory d8 ON (d8.parent_dir_id = d7.dir_id AND d8.name = 'g')
      JOIN directory d9 ON (d9.parent_dir_id = d8.dir_id AND d9.name = 'h')
      JOIN directory d10 ON (d10.parent_dir_id = d9.dir_id AND d10.name = 'i')
      JOIN directory d11 ON (d11.parent_dir_id = d10.dir_id AND d11.name = 'j')
      JOIN directory d12 ON (d12.parent_dir_id = d11.dir_id AND d12.name = 'k')
      JOIN file f ON (f.dir_id = d12.dir_id AND f.name = 'l.pdf')
WHERE
    d1.name = '/'

Hmmmm. Twelve joins in total, 11 involving directory. Most of us have experience with excessive JOINs negatively impacting performance – relational databases do not model arbitrary-sized trees well – so are there alternatives? Multiple database calls, one per directory retrieval, also perform poorly. [For NoSQL document-based databases, such as Mongo, individual calls are more likely as emulating relational JOINs is not particularly straight-forward, at least in my opinion.]
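To make the cost of the one-call-per-directory alternative concrete, here is a minimal Java sketch (all names hypothetical) that models the directory table as an in-memory map keyed by (parent id, name) and resolves the twelve-deep path one segment at a time – each lookup standing in for one database round trip:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the directory table: one entry per row,
// keyed by (parent id, name) exactly as each per-directory query would filter.
public class PerLevelLookup {
    record DirKey(long parentId, String name) {}

    // Resolve a path one segment at a time, counting lookups -- each lookup
    // models one database round trip in the multiple-calls approach.
    static int lookupsToResolve(String... segments) {
        Map<DirKey, Long> directoryTable = new HashMap<>();
        long parent = 0L;            // id 0 is the root directory '/'
        long nextId = 1L;
        for (String name : segments) {
            directoryTable.put(new DirKey(parent, name), nextId);
            parent = nextId++;
        }

        long current = 0L;           // start the resolution back at root
        int roundTrips = 0;
        for (String name : segments) {
            current = directoryTable.get(new DirKey(current, name));
            roundTrips++;
        }
        return roundTrips;
    }

    public static void main(String[] args) {
        // Eleven directory lookups before the file row is even touched.
        System.out.println(lookupsToResolve(
            "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"));
    }
}
```

The lookup count – and therefore the number of round trips – grows linearly with directory depth, which is exactly why this approach degrades as trees get deeper.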

Perhaps the data model can be extended to reduce the SQL complexity.

Phase II

The directory’s or file’s absolute path is calculated and stored at creation time, enabling direct retrieval without any joins.

SELECT
    f.id,
    f.storage_id
FROM
    file f
WHERE
    f.full_path = '/a/b/c/d/e/f/g/h/i/j/k/l.pdf'

Performance is greatly improved using a single-table, indexed query. Wonderful, our problem is solved, the world is good… maybe?

A Different Problem Emerges

File tree access performance is improved by the Phase II model changes as derived data – in this case, the fully-qualified path – is persisted with the directory/file itself, no longer requiring the multiple joins above. The trade-off cost is recalculating the derived data for structural changes to the file tree, such as renaming or moving directories or changing inherited permissions.

Example

The parent directory /a is renamed to /directory_formerly_known_as_a. The row in directory is updated to reflect the directory’s new name. Easy, peasy.

Next step is concerning: iterate through all child directories and files and recalculate the fully-qualified path.

func updateDirectory(parentDir, currentDir) {
    currentDir.full_path = parentDir.full_path + "/" + currentDir.name
    persist currentDir to database

    var fileList = currentDir.getFiles()
    updateFiles(currentDir, fileList)

    var directoryList = currentDir.getDirectories()
    for each childDir in directoryList {
        updateDirectory(currentDir, childDir)
    }
}

func updateFiles(currentDir, fileList) {
    for each file in fileList {
        file.full_path = currentDir.full_path + "/" + file.name
        persist file to database
    }
}

Analysis

Minimal impact for small, trivial file systems, no more than a blip. Conversely, when hundreds or thousands of directories and files are affected, the recalculations are time-consuming and costly, with error scenarios to handle and windows during which users may view inconsistent data.
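To put rough numbers on that claim, here is a small Java simulation (illustrative only, not any production schema) that builds a uniform directory tree and counts how many rows must be rewritten when a single directory is renamed:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative cost model: each directory stores its full path, so renaming
// one directory forces a rewrite of every descendant's full_path column.
public class RenameCost {
    static class Dir {
        String name;
        String fullPath;
        List<Dir> children = new ArrayList<>();
        Dir(String name) { this.name = name; }
    }

    // Build a uniform tree: each directory has fanOut children, depth levels.
    static Dir build(String name, int fanOut, int depth) {
        Dir dir = new Dir(name);
        if (depth > 0) {
            for (int i = 0; i < fanOut; i++) {
                dir.children.add(build(name + i, fanOut, depth - 1));
            }
        }
        return dir;
    }

    // Recalculate full_path top-down, returning the number of rows "persisted".
    static int recalc(Dir dir, String parentPath) {
        dir.fullPath = parentPath + "/" + dir.name;   // one UPDATE per row
        int updates = 1;
        for (Dir child : dir.children) {
            updates += recalc(child, dir.fullPath);
        }
        return updates;
    }

    public static void main(String[] args) {
        Dir renamed = build("dir", 10, 3);  // 1 + 10 + 100 + 1000 directories
        renamed.name = "renamed";
        System.out.println(recalc(renamed, "") + " rows updated by one rename");
    }
}
```

With a fan-out of 10 and a depth of only 3, renaming one directory already forces over a thousand row updates, every one of which must commit before users see a consistent tree.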

This is not contrived: I have worked on multiple app-specific file management solutions, allowing users to upload as few or as many files as needed, often later reorganizing as their needs evolve. Those solutions stored the file tree in either relational or document (NoSQL) databases. The never-ending challenge was balancing code complexity, performance, and user experience while resisting a rewrite. And beware a user who views the file tree with changes in-progress because their gut-check reaction is to redo the work, creating additional, unnecessary work. Uggh!

File Systems as Graphs

Ignoring links for now, a file system’s top-down structure lends itself to modeling as a tree. With a little additional abstraction, the file tree is represented as a simple graph (also known as a strict graph) where each file or directory is a vertex with one and only one parent (aside from the root directory). Simple graphs contain no loops, have at most a single edge between any two vertices (two directories or a directory and file), and are finite in size.

[Though symbolic and hard links often are unimplemented, their presence is a minor inconvenience, changing the abstraction to some form of directed graph. Ultimately, it’s still a graph.]
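The single-parent property is what makes the graph view so useful: a vertex’s absolute path is recoverable at any time simply by walking parent edges up to the root, no stored full_path required. A minimal Java sketch, with hypothetical names:

```java
// Sketch of the single-parent property: every vertex except the root has
// exactly one parent, so the absolute path is recoverable by walking parent
// edges upward -- no stored full_path column required.
public class TreeAsGraph {
    record Node(String name, Node parent) {}

    // Recurse up the parent edges, appending one segment per level on the
    // way back down; the root contributes nothing to the path.
    static String pathOf(Node node) {
        if (node.parent() == null) return "";
        return pathOf(node.parent()) + "/" + node.name();
    }

    public static void main(String[] args) {
        Node root = new Node("/", null);
        Node a = new Node("a", root);
        Node b = new Node("b", a);
        Node file = new Node("l.pdf", b);
        System.out.println(pathOf(file));   // /a/b/l.pdf
    }
}
```

Renaming a directory under this model touches exactly one vertex; every descendant’s path is correct the next time it is computed.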

We previously discussed limitations inherent in navigating a file tree (graphs) when stored in a relational or document database. Can we pivot to a database type that better supports the file-system-as-a-graph paradigm? Indeed we can: it’s time to understand graph databases!

Graph Databases Primer

Product-speak touts how graph databases were developed to derive insights from huge quantities of interconnected data or created for use cases such as social networking, recommendation engines, and fraud detection when used to create relationships between data and quickly query these relationships. In more concrete terms, graph databases store, manage, query, and traverse graph data. Unsurprisingly, graph databases excel in managing and navigating graphs, which greatly assists with our file-tree-as-a-graph paradigm.

Both nodes (vertices) and relationships (edges) are top-level elements in graph databases, in fact the only data containers available. Each created node or relationship is typed, with properties set to provide more information for the node or relationship. The optimized data storage provides for fast query retrieval (almost) regardless of how many nodes and relationships are traversed and returned – vastly different from the databases you grew up with!

One possible query to retrieve /a/b/c/d/e/f/g/h/i/j/k/l.pdf from Neo4J using Cypher, Neo4J’s native query language, is:

MATCH path=(r {name: '/'})-[d:PARENT_OF*11]->(p)-[c:CONTAINS]->(f {name: 'l.pdf'}) RETURN path

Explanation:

Start at the root directory /, find any file named l.pdf that is 11 subdirectories removed from root and return all nodes and relationships traversed.

The returned path – or paths – must be checked for correctness: multiple files named l.pdf may exist 11 subdirectories deep but in a different path than desired. The query is substantially simpler than the relational equivalent and yet completes remarkably quickly.
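Assuming the application extracts the node names along a returned path in traversal order, that correctness check is a simple segment-by-segment comparison. A sketch in Java (all names illustrative):

```java
import java.util.List;

// The Cypher match is depth-based, so a hit must still be verified against
// the exact path wanted: same depth, same file name, different route must
// be rejected.
public class PathCheck {
    static boolean matchesWantedPath(List<String> returnedNames, String wantedPath) {
        // Split "/a/b/l.pdf" into ["a", "b", "l.pdf"]; root has no name segment.
        String[] wanted = wantedPath.substring(1).split("/");
        if (returnedNames.size() != wanted.length) return false;
        for (int i = 0; i < wanted.length; i++) {
            if (!returnedNames.get(i).equals(wanted[i])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesWantedPath(List.of("a", "b", "l.pdf"), "/a/b/l.pdf"));
        System.out.println(matchesWantedPath(List.of("a", "x", "l.pdf"), "/a/b/l.pdf"));
    }
}
```

Even with this post-filter, the traversal plus verification remains far cheaper to write and maintain than the twelve-join SQL above.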

Introducing Neo4Jfs!

I created Neo4Jfs as a Java FileSystem which uses a graph database – Neo4J – for managing directories and files in the file tree, attempting to overcome limitations I’ve experienced elsewhere. Some highlights:

  • Fully-Functioning File System: The java.nio.file.Files methods used when managing your laptop’s file system may also be used for managing a Neo4Jfs file system. No custom APIs to learn. Standard POSIX security. Input/output streams, SeekableByteChannels for managing raw file contents. Standard exceptions. Use Files.copy() to copy a file from local disk into Neo4Jfs. It’s that simple.
  • Run-time Data: The speed of Neo4J in traversing a file tree means derived data – such as fully-qualified pathnames and inherited permissions – is lazily calculated at run-time. No need to calculate in advance and update. Moving a directory containing 1000s of files? A single database update to change the parent directory, completing in milliseconds.
  • Partitioned File Systems: Do customers require individual file systems into which files are uploaded? How about divisions within your organization? Prod vs. Non-Prod? Perhaps a reporting or billing requirement? File systems are keyed by a URI’s host and are automatically created as needed.
  • Configurable Storage: The default LocalStorageManager uses local disk: great for development and testing, not so much for prod. Implement your own StorageManager which stores files wherever you desire: AWS S3, Azure Blob Service, Google Cloud Storage, all of the above, none of the above. It’s up to you.
  • Extendable: Lots of room for improvements: enhance the authentication/authorization model to integrate with your IDP; create plug-ins for your workflows; extend file states to support file publishing or time-to-live. Your unique use cases drive what needs to be changed.
  • Open Source: Public code repository, allowing you to deep-dive and figure out how it works. Contribute PRs to fix bugs or clean up the implementation. Would love to have others help!
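Because the java.nio.file.Files methods dispatch through each Path’s FileSystem provider, code written against the default (local disk) file system needs no changes to target a custom FileSystem. The sketch below exercises the standard API against a temp directory on the default file system; the same calls apply to a Path obtained from any provider:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Provider-agnostic file handling: nothing here names a concrete file
// system -- the Path argument decides which FileSystem does the work.
public class NioPortability {
    static String copyAndRead(Path dir) throws IOException {
        Path source = dir.resolve("source.txt");
        Files.write(source, "hello".getBytes(StandardCharsets.UTF_8));

        Path target = dir.resolve("copy.txt");
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);

        return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tempDir = Files.createTempDirectory("nio-demo");
        System.out.println(copyAndRead(tempDir));   // hello
    }
}
```

This is the payoff of implementing FileSystem rather than a custom API: existing NIO-based code, tools, and tests carry over unchanged.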

For additional information, read the project’s README.md.

Final Comments

If it isn’t obvious, I’m fairly excited to share Neo4Jfs with everyone: I considered doing this for years but only recently found the time to try. In all honesty, Java NIO was a complete unknown to me, so I’m really surprised and happy at how much fun it was to implement. My previous Neo4J work was restricted to small proofs-of-concept and personal projects. I’m really impressed with how well it works here (even though it should have been obvious).

I know this is rather dense reading, but hopefully you enjoyed it. I look forward to your feedback and questions!

Image Credits

  • “Mount an S3QL File System” by xmodulo is licensed under CC BY 2.0. 
  • “CSIRO ScienceImage 3216 Bioinformatics super computer rack” by Carl Davies, CSIRO is licensed under CC BY 3.0. 
  • “Epicurious HTML graph” by Noah Sussman is licensed under CC BY 2.0. 
  • Data models © 2026 by Scott C Sosna