XFS or ZFS for 120TB drive with many millions of small files
Posted by rayholtz@reddit | linuxadmin | View on Reddit | 85 comments
Hi all, I need to build a new server in the next couple months, probably Ubuntu 24.04. It will have ~120TB of usable space on a raid5 LVM partition, shared out as SMB shares. (That will be separate from the OS drive on a RAID1 LVM.) It will be used to store many millions of small (<400kb) files, mostly manufacturing process images (jpg or something).
I'm trying to figure out whether I should use XFS or ZFS for the filesystem. Does a larger partition need a bigger block size? Windows NTFS killed me on this previously.
Can anyone point me in the direction of a good resource to read on this? Or advise me on one FS or the other?
AltruisticCabinet9@reddit
As long as it's not all in the same directory, XFS has performed well for me in the past on ~100TB volumes with an average file size of 1MB (SVG files).
Not sure about ZFS, but Btrfs has been great in RAID 10, Zstd-compressed, on 220TB setups. But that's been for Docker builds, registry backing, and other non-random-IO loads.
hornetmadness79@reddit
Partition size doesn't matter so much as inode exhaustion when you're dealing with that many small files.
undeleted_username@reddit
Why ZFS on top of LVM on top of RAID... instead of ZFS's RAIDZ?
rayholtz@reddit (OP)
So are you saying that I can just have 10+ raw drives in a server, and ZFS will make them into a RAID itself? Is that expandable later on?
I guess I was only aware of hardware RAID levels, and then using LVM in case I needed to expand it down the road.
almostdvs@reddit
That is the preferred way to use ZFS. Why are you considering ZFS? It seems you are unfamiliar with several of its core features and guidelines
rayholtz@reddit (OP)
I am unfamiliar, that's why I am asking for help to figure out which one to use. I heard these two are more resilient than ext4 so I'm looking at ZFS and XFS.
stormcloud-9@reddit
I wouldn't say they're more resilient. ext4 is plenty fine at not getting corrupted, even through things like power losses. The difference is in the functionality offered.
Also as mentioned, with ZFS you don't really need LVM. Also with LVM you don't need mdraid.
Zomunieo@reddit
ZFS is more robust against corruption than ext4 because of scrubbing to detect and prevent bitrot. ext4 will never find out if long term infrequently accessed storage is failing.
If you use a SLOG, ZFS is more resilient against power failure.
For most users ext4 is fine, but ZFS will be better and the differences will matter at 10s of TBs.
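For reference, a scrub on an existing pool is a one-liner; a minimal sketch, assuming a pool named `tank` (the pool name is hypothetical):
```
# Kick off a scrub of every block in the pool and check on its progress.
zpool scrub tank
zpool status -v tank

# Many distros ship a periodic scrub job; a monthly cron entry is typical:
# 0 2 1 * * /usr/sbin/zpool scrub tank
```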
kai_ekael@reddit
Now do performance tests. ZFS has overhead.
Zomunieo@reddit
Data loss has overhead, and it’s exponentially more costly than a performance hit.
kai_ekael@reddit
Oh, well then, everyone and their mother should use ZFS, regardless of anything else!
No, that's not the correct method to determine a suitable solution, fan boy.
bambinone@reddit
You get a ZFS Intent Log (ZIL) to persist and protect sync writes in adverse conditions whether you have an SLOG or not. It's just usually a lot slower to acknowledge sync writes without a SLOG. If you add a SLOG it should ideally be a mirrored pair or better of drives with power loss protection and low-latency writes at QD=1.
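If you do go the SLOG route, attaching a mirrored log vdev is a single command; a sketch with hypothetical NVMe device names and a pool called `tank`:
```
# Add a mirrored SLOG (separate intent log) to an existing pool.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Verify the log vdev shows up alongside the data vdevs.
zpool status tank
```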
sumsabumba@reddit
zfs wants RAW harddrives.
zfs raid is not expandable.
It's great technology and I can recommend it, but read up on it.
symcbean@reddit
^^ THIS !!!!
So many people obsess about the product yet fail to do ANY research before posting questions which mostly garner opinions about what other people like.
OP: Go read up on the process of replacing a failed drive in each architecture on your short list. That will give you a flavour of what to expect.
Klosterbruder@reddit
While technically (currently) true, I think this may sound a bit misleading for someone without ZFS experience.
One VDEV - which is more or less the ZFS equivalent of a traditional RAID array - has so far not been expandable when it is RAID-Z1, -Z2 or -Z3. But multiple VDEVs can be combined into a ZFS pool, and you can also add more VDEVs later on.
There's a better and more detailed explanation over here: https://www.reddit.com/r/zfs/comments/fn5ugg/zfs_topology_faq_whats_a_zpool_whats_a_vdev/
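To make "add more VDEVs later" concrete, a minimal sketch, assuming an existing pool named `tank` and hypothetical disk names:
```
# Grow an existing pool by adding a second 6-disk RAID-Z2 vdev.
# Note: this adds a whole new vdev; it does not widen the existing one.
zpool add tank raidz2 sdg sdh sdi sdj sdk sdl

# The pool's capacity grows immediately; new writes spread across both vdevs.
zpool status tank
```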
sumsabumba@reddit
I could have worded that better.
For most people, expanding means adding more drives to the array. OP wants 10+ drives, so they should probably use more than one VDEV anyway.
Mr_Engineering@reddit
ZFS is more than just a filesystem, it's an entire storage infrastructure which includes software raid. Some people prefer to allow ZFS to handle striping, mirroring, and parity rather than hardware raid controllers.
ZFS RAID expansion is a new feature. However, you can have multiple RAID vdevs within a pool, and you can always migrate to a larger RAID by adding a new, larger vdev to the pool and then offlining the smaller one if desired. As long as the pool's capacity doesn't fall below the amount of data stored, ZFS will move data around as needed.
bambinone@reddit
It's not a question of preference, or at least it shouldn't be. It is strongly discouraged to put ZFS on top of any other hardware or software RAID, storage virtualization, or logical volume manager.
justin-8@reddit
Yes. But note that it is not expandable.
Viruses_Are_Alive@reddit
Were you going to run raid5 on a 10 drive array?
llewellyn709@reddit
I started with a single drive, defined as RAID1, and added the second drive later, which was automatically included in the RAID...
kai_ekael@reddit
I refuse to use XFS for one simple reason: you can NEVER shrink the damn thing. Sure, make a new filesystem and copy to it, but if you're talking TBs of data... no thanks.
TasksRandom@reddit
At ~120TB, I hope that raid 5 is a typo...
Also, avoid SMB unless you're in a Windows-centric environment. Not a ZFS expert, but I seem to recall you can share ZFS via NFS rather painlessly -- not that installing the nfs kernel server and exporting xfs is painful.
chaos_theo@reddit
ZFS uses the OS's NFS server, so without one installed ZFS cannot export any NFS share.
For any Linux filesystem (ZFS included), editing /etc/exports and running "exportfs -a" works:
/local-path/dir client/IP|network(rw,sync,no_subtree_check) # maybe plus no_root_squash
or just on the fly until reboot (or until unexported):
exportfs -o rw,sync,no_subtree_check client/IP|network:/local-path/dir
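For ZFS datasets specifically, the sharenfs property drives that same system NFS server, so you don't have to hand-edit /etc/exports; a sketch with a hypothetical dataset name:
```
# Share a dataset over NFS with default options via the system's NFS server.
zfs set sharenfs=on tank/images

# Check what is currently shared/exported.
zfs get sharenfs tank/images
exportfs -v
```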
wired-one@reddit
The person recommending the NAS device is right on this.
A Synology is a better choice for this, especially for sharing out your files to Windows hosts. The management is easier, the integration into something like AD is easier, backups are easier. It's gonna be easier for whoever comes after you to manage it.
Now that I've said that and you are going to ignore my advice, let's talk about why RAID 5 in LVM is a bad idea. While it's possible to do, and flexible in design, its limitations in management will beat you up.
What's far better is to use ZFS or BTRFS to manage the disks.
With 20TB disks in RAIDZ2 you would need 12 drives total, in 2 RAID groups of 6 drives, to get 185 TB of storage.
This will handle just about anything, and has double parity in each group.
No need for LVM before doing the ZFS pool, it just kinda works.
You will however need room for 12 drives and you'll need ECC memory in the server.
You could also do this with BTRFS, which the synology uses under the hood, but if you aren't that experienced, I wouldn't necessarily recommend it.
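For the two-by-six RAID-Z2 layout described above, a minimal sketch with hypothetical device names (ashift=12 assumes 4K-sector drives):
```
# 12 x 20TB drives as two 6-wide RAID-Z2 vdevs in a single pool.
zpool create -o ashift=12 tank \
    raidz2 sda sdb sdc sdd sde sdf \
    raidz2 sdg sdh sdi sdj sdk sdl
```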
chaos_theo@reddit
Haha, a Synology... the slowest NAS systems ever available... have fun, take a coffee with a cupcake, after that a tea, and come back tomorrow to see that the data transfer still isn't done :-)
sarkyscouser@reddit
Not XFS, this was originally designed for small numbers of large files, not large numbers of small files
chaos_theo@reddit
What sense does "originally designed for" make when XFS has been in production for 30 years now and has had lots of redesigns and optimization "updates" over the epochs?!
sarkyscouser@reddit
Then have a look at the more recent Phoronix tests that include things like deleting or renaming large numbers of files. XFS performs much more poorly than ext4 in those tests, but better when dealing with large files, e.g. video editing, 3D modelling etc.
chaos_theo@reddit
That's the XFS default, which is fine for single-disk use. If you run it on a RAID you always have to tune stripe size and so on from the beginning. Even the Phoronix tests did NOT test an HDD RAID6, and they don't separate XFS metadata onto an SSD RAID1; ext4 cannot compare with XFS there, as it has to wait on the HDDs while XFS does that work on SSD. Think about that.
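As a concrete example of that stripe tuning, a minimal sketch assuming a hypothetical md RAID6 with a 256KiB chunk and 8 data disks:
```
# Tell XFS the RAID geometry: stripe unit = chunk size, stripe width = number of data disks.
mkfs.xfs -d su=256k,sw=8 /dev/md0

# mkfs.xfs often detects this automatically on md devices, but verify after mounting.
xfs_info /srv/images
```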
Ok_Size1748@reddit
Do not trust raid 5 for volumes so big. Consider at least raid 6 + spare.
PE1NUT@reddit
ZFS excels at keeping your files safe. If you add sufficient parity drives, a hot spare, and maybe keep an empty slot in case a drive needs replacement, it works incredibly well. We've been using it for over a decade, with over 800 drives in production, and have lost literally only one file due to a double hard disk failure on a long weekend. One of the things that makes ZFS stand out is the on-disk checksumming, and periodic scrubs of all your data, verifying that your data is indeed still all well, and that no disks are quietly succumbing to bitrot.
However, it has quite a few knobs to tweak, and if you're new to ZFS, you should probably find someone to help. The disk layout, use of SSD caches for metadata, compression etc. should all be considered for your application.
It's not the best in terms of write speed, due to the extra work and checksums being written.
There are no real recovery tools for ZFS, so if things do go terribly wrong, you must have backups.
XFS write speed will probably be better. But you'd have to rely on the underlying RAID to tell you when disks go bad (and they will!) and it won't guard against bitrot.
What is most important for your application? Write speed? Retention of the data? Costs?
chocopudding17@reddit
This is the best, most understandable, most concise answer to OP in this whole thread. OP, what are priorities of your system? What is your backup strategy?
andyniemi@reddit
ext4
chaos_theo@reddit
ext4 is unable to separate metadata from data the way XFS (or ZFS) can, so ext4 is definitely much slower at handling millions and millions of small files.
assid2@reddit
For basic usage, XFS would do what you need. However, I personally would recommend ZFS. It is a self-healing filesystem which offers a lot more features than discussed here: on-the-fly compression with almost non-existent overhead, snapshots, and replication, which is amazing for small files since it's block-level instead of file-level (it's based on snapshots), plus a lot of other features.
Replication will take care of your backups efficiently. ALWAYS BACKUP !!
I think you should consider separating the application layer from the storage layer and extending the storage to the application via NFS. You can use something like TrueNAS to manage the storage layer. You can always use fibre to connect the devices.
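Compression is a per-dataset property; a sketch with a hypothetical dataset name (already-compressed JPEGs won't shrink much, but lz4 bails out cheaply on incompressible data):
```
# Enable lz4 compression and check how much it is actually saving.
zfs set compression=lz4 tank/images
zfs get compression,compressratio tank/images
```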
shulemaker@reddit
He said SMB. It sounds like the cameras used by this app are connected to Windows machines and store JPEGs, so there is already compression. For the use case, ZFS is overkill, and if you don't stay on top of the 80%-full rule, with millions of tiny files you start to get slowdown due to metaslab fragmentation. Then you have to worry about the ZIL and all the other ZFS things, so it's not "set it and forget it". Backups would be simpler.
I have run large ZFS (Solaris), XFS, ext4, and NetApp in prod, and as much as it pains me to say, ZFS was the most finicky (I had the old Thumper boxes). NetApp was the best, and ext4 was more reliable than XFS.
ravigehlot@reddit
Is NetApp a fair comparison here? Are you referring to ONTAP or NetApp as a whole? Comparing NetApp with filesystems like XFS or ZFS is somewhat of an apples-to-oranges situation.
shulemaker@reddit
The gold standard is a relevant reference point.
DigitalDefenestrator@reddit
At 120TB, I assume that's spinning drives? If you're writing lots of small files synchronously, you may see a huge benefit from a ZFS SLOG on nvme.
rayholtz@reddit (OP)
It's really writing no more than 7-10 files at once; one per manufacturing line. Then the files just sit there and stay FOREVER (ok, more like 3 years), until someone someday decides to look at the image of the widget IF we get a complaint about it.
whalesalad@reddit
I’d drop them into cold storage in AWS S3 and let this be someone else’s problem.
admalledd@reddit
If this is the case, then I would probably more recommend:
placated@reddit
Sounds like what you actually need is a cloud based object store. Pennies on the gig. Then you can work on higher level more useful stuff than worrying about filesystems.
iksdeecz@reddit
Neither. Use Ceph for object storage.
XFS if you're looking for a single-node storage server.
chaos_theo@reddit
Hundreds of millions of files are very easy to handle with XFS if metadata is on SSD/NVMe while the data stays on rust. Don't do LVM, as it's an extra layer; just one XFS directly on a combined HW/SW RAID1 + HW RAID6, without labels or partitions. If you need some kind of limiting, use XFS quotas on your SMB shares.
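One way to approximate "metadata on SSD, data on rust" with plain XFS is an external journal plus a realtime data section. This is a rough sketch only, with hypothetical device names; the realtime subvolume is a niche feature and the rtinherit flag letter is worth double-checking in xfs_io(8), so test before trusting it:
```
# Journal on NVMe; regular file data steered to the HDD array via the realtime
# section, while inodes, directories and other metadata stay on the SSD device.
mkfs.xfs -l logdev=/dev/nvme0n1p1 -r rtdev=/dev/md_hdd /dev/md_ssd
mount -o logdev=/dev/nvme0n1p1,rtdev=/dev/md_hdd /dev/md_ssd /srv/images

# Mark the top directory so new files inherit the realtime flag (land on the HDDs).
xfs_io -c 'chattr +t' /srv/images
```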
markus_b@reddit
I've been working on supercomputers with filesystems holding hundreds of millions of files. This is a real pain, especially for metadata operations. It may make sense to choose a filesystem where you can store metadata on SSD and file data on spinning disk.
Your main problem will be around backups and traversing the entire file tree. We had situations where the daily incremental backups took more than 24h just to determine which files to back up. The filesystem used was GPFS (Spectrum Scale), and it has a special API to scan inodes in parallel, speeding things up by a factor of 100 or more.
whatisnuclear@reddit
Millions of small files... make sure to disable the `atime` parameter if you go with ZFS so you don't waste zillions of iops writing timestamps on each file every time you ls it.
`sudo zfs set atime=off pool`
bambinone@reddit
Set xattr=sa as well, and evaluate whether it should be spinning data disks plus a special vdev, or all-flash in the first place.
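A sketch of both suggestions, with hypothetical pool/dataset/device names; special_small_blocks is optional and must stay below the recordsize:
```
# Store extended attributes in the inode instead of hidden xattr directories.
zfs set xattr=sa tank

# Add a mirrored special vdev so metadata (and optionally tiny blocks) live on flash.
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally send blocks of 32K and smaller to the special vdev too.
zfs set special_small_blocks=32K tank/images
```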
zqpmx@reddit
Consider TrueNas and ZFS
How much writing and reading are you planning to do?
How intensive, how much redundancy, how fast do you need access to be?
Do you have already the hardware?
ZFS has nice features like transparent compression and integrity check of everything written.
welsh1lad@reddit
Worked in the media industry for 6 years, with storage servers up to petabytes, and used XFS as the raw FS. Fast I/O, good with small media files (thumbs, clips, FX). Worked well over RAID; I think I used RAID 10 at the time. This was back 9 years ago though.
rayholtz@reddit (OP)
Thanks for the info on XFS.
welsh1lad@reddit
This is an old video I did of petabyte raiding; don't judge, it was many years ago. But we had big clients like Sony, CNN etc using XFS for edits https://youtu.be/30nQfxXwAsQ?si=vmAPqaaZAja-rjVQ
Seven-Prime@reddit
fun vid. reminds me of when you could anticipate disk failures by that one drive whose light would stay on just a little too long when streaming. Racks of 146GB fibre channel drives for a single film.
Winter-Jello2775@reddit
yep that is shown in the following vid - https://youtu.be/oI8ch7Ah3fs?si=pZGYvMkju92SiSLe
aamfk@reddit
When you say 'many millions of small files' you're talking about SECTOR SIZES, not partition types.
I think you should specify 'how small' and 'how many millions'.
I was unzipping a BUNCH of torrents yesterday, and my sector size is the default 4096. I wasted about 1/4 of my TOTAL fucking storage because the avg file size was 3000 bytes, I think.
I'm NOT gonna re-do a hard disk in order to conserve 150GB of disk space. But I SHOULD!
I'm just not gonna do it right now
craigleary@reddit
ZFS properly set up and using zfs send/recv, like with syncoid, would make backups a breeze and incremental. Millions of files on XFS, I can imagine, would make for a very slow backup. On that alone I'd choose ZFS.
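A sketch of what that looks like in practice, with hypothetical dataset, snapshot and host names:
```
# Manual incremental replication: snapshot, then send only the delta over ssh.
zfs snapshot tank/images@2024-06-01
zfs send -i tank/images@2024-05-01 tank/images@2024-06-01 | \
    ssh backuphost zfs recv backup/images

# Or let syncoid handle snapshots and incrementals for you.
syncoid tank/images root@backuphost:backup/images
```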
flapjack74@reddit
For reference, primary sources like the OpenZFS Documentation, ZFS on Linux Wiki, XFS Wiki, and Red Hat XFS Guide offer excellent starting points for exploring these filesystems. It's a complex subject, and achieving full mastery depends largely on your specific needs. If your goal is simply long-term data storage where the data doesn’t change often, it might not be worth the deep dive.
Regarding ZFS, while much has already been said about it being a complete storage solution and not just a filesystem, one aspect often overlooked is its memory consumption for file caching. It has some really cool features, though.
As for XFS, it’s a good, stable filesystem. If you're dealing with a large number of files, you should consider mounting it with inode64.
This topic is broader than it seems at first glance. If the data is critical and the system can't afford 24-48 hours of downtime, you should think beyond just redundant disks and consider having a standby system (though not necessarily active-active) with redundant data; also test backup and recovery times.
mumblerit@reddit
You don't want this to be your first dive into zfs, just use xfs with raid and some backup
enigmatic407@reddit
This is the way 💯
Fluid_Ask2636@reddit
Don't go RAID 5. It's only a matter of time till it fails on you. Use at least RAID 6.
flaticircle@reddit
Nothing wrong with ext4. Format with -T small so you don't run out of inodes. IMO more recoverable than XFS if things go seriously south.
mkfs.ext4 -T small /dev/mapper/vg_foo-lv_bar
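And to keep an eye on inode headroom once it's in use (mount point hypothetical):
```
# Show inode usage/availability rather than block usage.
df -i /srv/images
```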
gmuslera@reddit
Besides the filesystem type, consider the directory structure too: keep it layered so you don't have too many files in any individual directory.
rayholtz@reddit (OP)
Out of our control. It is manufacturing software that writes to it, but it is subdivided to a point. It's not 20 million in one folder. Probably 10 main folders, then a few million subfolders in each, then maybe 20-ish files in each subfolder, ~500kb or less for each file.
Seven-Prime@reddit
You will have problems with that many files in a folder. You can 'crash' your system by performing an `ls`. Doing a directory listing will default to sorting the entries; over NFS it's extra fun. Lots of fun problems when you get to that high a number of files, folders, and path lengths. Or being unable to create new files because you are out of extents. If only you knew about increasing that at filesystem creation time. Stuff like this: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/ch07s03s02s02#idm140718720709840
reingauf@reddit
yes, ls and bash's * sort the entries, which makes it particularly inefficient.
programmatically, however, this sorting is optional. find prints results instantly because it does not do the sorting step.
so it's bad, but not quite as terrible as ls makes it seem. it's a problem for the admin, not for the program using this structure.
gmuslera@reddit
It depends on what the related programs will do and how they access the files. I don't know how many disk blocks one of those directories with millions of subentries will fill, but every player that must operate on it (application programs, samba, mount and whatever) must load all those blocks for most operations. And then, even without a sort step, linearly going through millions of entries to get the inode of a particular file, or displaying the folder on a desktop computer, will take resources.
It is not something to easily dismiss, especially not knowing what other actors do with those files and folders.
Maybe with ZFS the directory structure is more optimized for searches, but samba may not be as much.
reingauf@reddit
the VFS cache is good at handling this
but yes, a good program would introduce more sub levels to allow more efficient lookups (similar to how hash tables work)
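A minimal sketch of that kind of hashed fan-out in shell, purely illustrative (path layout and names are made up):
```
# Derive a two-level directory prefix from a hash of the filename,
# e.g. widget_12345.jpg -> a3/9f/widget_12345.jpg
f="widget_12345.jpg"
h=$(printf '%s' "$f" | md5sum | cut -c1-4)
dest="/srv/images/${h:0:2}/${h:2:2}"
mkdir -p "$dest" && mv "$f" "$dest/"
```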
Seven-Prime@reddit
Indeed. `ls` also has a `-1` option to not sort the results. Which is great if you know about it. Otherwise it's a few weeks of figuring it out.
miksu103@reddit
Another aspect to consider is using a purpose built storage appliance operating system instead of configuring everything from scratch on your own.
From your messages I can see that you are currently learning a lot regarding the topic. Configuring the whole thing on a blank OS like Ubuntu server means there are a lot of configurations to do and many possibilities for mistakes. Consider using a purpose built OS like Truenas instead. It has a lot of optimizations already built in, and good default configurations. You will also gain an easy to understand GUI to see the status of your storage, and convenience factors like built in monitoring and updates from the GUI.
kolorcuk@reddit
I recommend ZFS. ZFS is great, battle-tested for years. Ditch mdadm + LVM + XFS and just go full ZFS.
ZFS does, however, require some RAM for its extra features. You could also do an L2ARC cache on SSD with ZFS; it should scale nicely with small files, but it requires even more memory.
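For reference, adding an L2ARC device is one command; a sketch with a hypothetical device and pool name (remember the L2ARC's own index also consumes RAM):
```
# Attach an NVMe drive as a read cache (L2ARC) to the pool.
zpool add tank cache /dev/nvme0n1
```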
TheIncarnated@reddit
I mean this with all respect.
Stop what you are doing and go buy two 12-bay Synology units. One as your main, the second as a backup repository or HA.
Let that run everything you need. Make sure to also buy the Synology units with redundant power supplies.
shulemaker@reddit
For millions of tiny files, ZFS is not the answer.
Between the two, XFS, but at one place I had a large EMC SAN-backed XFS volume like this, used to export NFS with millions of tiny files, and the B-tree got corrupted and the data wasn't fully recoverable. This was on RHEL 6. Red Hat recommended ext4, so I got a new LUN and made the replacement filesystem ext4.
reingauf@reddit
back in the Gentoo days when I used ccache (tons and tons of small files...), it was better to make a dedicated partition/filesystem just for that so it would not screw with anything else. and deleting all those files (when the cache became useless because you changed CFLAGS) was a simple matter of running mkfs on that filesystem, because letting rm or find traverse those files was unbearably slow, while mkfs was instant.
for some applications this is still a very good method. it does not always have to be one large filesystem for everything. divide and conquer; smaller filesystems are cool.
but it depends how you work with these files.
dealing with millions of files, simple traversal with find will also cause millions of stat() calls and whatnot. that overhead is not going away unless you package the files. some databases are also good with file contents.
autogyrophilia@reddit
As a ZFS evangelist: if you don't have a few weeks to fiddle around with ZFS and see its many advantages and peculiarities, I would just use mdadm with XFS.
mdadm is essentially hardware RAID done in software. It hurts a little when used for parity, as it can't rely on a BBU to smooth things out with the caches that make parity RAID so vulnerable. ZFS parity RAID is special and sidesteps this issue entirely.
Consider deploying RAID6 (or RAID-Z2). Only one parity level is insufficient for all but the least critical workloads.
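If you go that route, a minimal sketch with hypothetical device names:
```
# 10-disk software RAID6 (two-disk redundancy), then XFS on top.
mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]
mkfs.xfs /dev/md0
mount /dev/md0 /srv/images
```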
silence036@reddit
Is this something where OP could set up dedupe on the ZFS pool? Lots of the images will probably be very similar to one another.
autogyrophilia@reddit
With binary data, being similar is not enough. Not only does it need to be the same information, it also needs to be aligned over the same block boundaries (the reason the --rsyncable parameter exists for a few compressors).
There are a few ways to dedupe with rolling algorithms, but the only one I've seen that actually delivers without much troubleshooting is Windows Server dedupe.
derprondo@reddit
Millions of small files? In the old days we would have told you to use ReiserFS, now known as MurderFS (since the creator killed his wife which became a 20/20 Dateline murder mystery). Just use XFS, you'll be fine.
Linux4ever_Leo@reddit
I've always found XFS to be rock solid and it's great on systems with a lot of files.
Bill_Guarnere@reddit
I had experience with something like that, with millions and millions of small files (it was a service for a police force to let users access speeding images from street cameras, plus documents to pay fines and documentation).
The customer gave us a huge NFS export on a huge Dell enterprise NAS. I don't remember if the filesystem was ext4 or xfs (on the NAS side obviously), but I remember they had huge problems with backups: there were so many filesystem objects that every attempt to back them up required an enormous amount of time.
If I had another project like that today, I would absolutely suggest ZFS: raw SAS drives with a ton of RAM for deduplication, and ZFS snapshots as backups, transferred via ssh to another host as a local backup and to a remote host as a remote backup copy.
Much simpler and more efficient.
welsh1lad@reddit
What's the support for ZFS these days? I don't think the poster stated the OS. It used to be only Sun Microsystems; I did hear Canonical put a ZFS binary blob in the kernel, but I'm most probably wrong on that. I'm out of the filesystem game now.
ultrahkr@reddit
ZFS has far more features than XFS. The most important being data checksum and built-in RAID.
robvas@reddit
XFS will work
Remember to split them into levels of directories
Some sort of other storage (a database, for example) would probably perform better than just millions and millions of small files. Much easier to work with, and faster.
rayholtz@reddit (OP)
This is not our application, but one we have to buy. Poorly written, but it is what we need to do what we need to do. We are a manufacturer, not a software shop, unfortunately.