With the influx of data coming from “data makers” such as Sequencers, CryoEM, Light Sheet Microscopy and other lab instruments, it has been a challenge for the folks in Research Computing and R&D IT to keep pace. One common theme is that all my customers are looking for some sort of “silver bullet” to make heads or tails of the sea of data they are now responsible for curating.
Before you take a shopping trip to Big Data Lake-Mart, it’s important to focus on one key element that can tie all this data together, which is, of course, metadata! Do you know every piece of metadata you can extract from a BAM, CRAM, FASTQ, DICOM, FCS, HDF5, etc. file? (Hint: in some cases, it’s a lot.) This is essentially machine data, but it can be combined with user-defined metadata collected along the way. While it may not be a “silver bullet”, having a metadata strategy is a pathway to better data curation.
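To make that concrete, here is a minimal sketch of pulling machine metadata out of a FASTQ file and merging it with user-defined tags. It assumes the Illumina-style header layout (instrument, run number, flowcell, lane, tile, x, y, then read information); other instruments and older pipelines write different headers. The function names, the sample read and the project tags are all made up for illustration.

```python
import io

def parse_fastq_header(header: str) -> dict:
    """Split an Illumina-style FASTQ header line into named fields."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The header has two space-separated halves, each colon-delimited.
    left, _, right = header[1:].strip().partition(" ")
    keys = ["instrument", "run_number", "flowcell_id", "lane", "tile", "x_pos", "y_pos"]
    fields = dict(zip(keys, left.split(":")))
    if right:
        read_keys = ["read", "is_filtered", "control_number", "index"]
        fields.update(zip(read_keys, right.split(":")))
    return fields

def describe_fastq(handle, user_metadata: dict) -> dict:
    """Combine machine metadata from the first read with user-defined tags."""
    record = parse_fastq_header(handle.readline())
    # User tags are applied last, so they win if a key collides.
    record.update(user_metadata)
    return record

# A made-up single-read file held in memory; a real file would be opened from disk.
sample = io.StringIO(
    "@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCACG\n"
    "TCGCACTCAACGCCCTGCATATGACAAGACAGAATC\n"
    "+\n"
    "<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=\n"
)
record = describe_fastq(sample, {"project": "demo-project", "pi": "example-lab"})
print(record)
```

Even this one header line yields the instrument, flowcell, lane and index without reading a single base call, which is the kind of machine data that is easy to overlook.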
Advanced metadata is critical. Metadata is far smaller than the data it describes, so it can be stored and transported more cheaply. We can use it to maximize storage assets and eliminate data that is not valuable. We can build in automation to put the data in the right place at the right time for the scientists (and their incessant need for MORE data ;-). To that end, we can also use metadata to securely share data with collaborators. We can use it for indexing and for fast search methods.
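As a rough illustration of the indexing and search point, the sketch below keeps key/value metadata in a small SQLite table and looks files up by tag without ever touching the files themselves. The table layout, the paths and the tag names are all hypothetical; a production catalog would need far more than this.

```python
import sqlite3

# An in-memory catalog: one row per (file, key, value) triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (path TEXT, key TEXT, value TEXT)")
# The index is what keeps lookups fast as the catalog grows.
conn.execute("CREATE INDEX idx_key_value ON catalog (key, value)")

def tag(path, **metadata):
    """Attach key/value metadata to a file path."""
    conn.executemany(
        "INSERT INTO catalog (path, key, value) VALUES (?, ?, ?)",
        [(path, k, str(v)) for k, v in metadata.items()],
    )

def find(key, value):
    """Return the paths whose metadata contains key == value, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM catalog WHERE key = ? AND value = ? ORDER BY path",
        (key, str(value)),
    )
    return [r[0] for r in rows]

# Hypothetical files and tags.
tag("/data/run42/sample_A.fastq", instrument="sequencer-01", project="demo-project")
tag("/data/run42/sample_B.fastq", instrument="sequencer-01", project="other-project")
tag("/data/cryo/grid_7.h5", instrument="cryoem-02", project="demo-project")

matches = find("project", "demo-project")
print(matches)
```

The search runs entirely against the catalog, so it costs the same whether the underlying files are a few kilobytes or many terabytes.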
So, like most things in IT, there are solutions that can be built or bought. If you want to build out a custom solution for your organization, you may want to consider iRODS. iRODS stands for Integrated Rule-Oriented Data System and, as it sounds, it lets you create rules for data workflows backed by a metadata catalog. Dell EMC has contributed to the community with our front-end UI for iRODS, called Metalnx.
For more information on Metalnx, visit our Code Dell EMC page: http://codedellemc.com/ . A word of caution: as powerful as it is, iRODS is not for the faint of heart, and just as George Harrison said, you will need to invest a whole lot of “money, patience and time…to do it right”.
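To give a feel for what “rule-oriented” means, here is a toy sketch in plain Python. It is not the iRODS rule language or any iRODS API; it only shows the idea of rules that read a metadata catalog and decide what should happen to each file. The rule names, thresholds, action names and paths are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # decides from metadata alone
    action: str                        # label for what a real system would do

# Hypothetical policies: archive cold data, publish data flagged as shareable.
RULES = [
    Rule("archive-old-runs", lambda m: m.get("days_since_access", 0) > 365, "move_to_archive"),
    Rule("share-with-collaborators", lambda m: m.get("shareable") is True, "publish_to_share"),
]

def plan_actions(catalog: dict) -> list:
    """Return (path, action) pairs for every rule whose condition matches."""
    plan = []
    for path, metadata in sorted(catalog.items()):
        for rule in RULES:
            if rule.condition(metadata):
                plan.append((path, rule.action))
    return plan

# A made-up catalog keyed by file path.
catalog = {
    "/data/run01/a.bam": {"days_since_access": 400, "shareable": False},
    "/data/run42/b.bam": {"days_since_access": 3, "shareable": True},
    "/data/run07/c.cram": {"days_since_access": 10},
}
plan = plan_actions(catalog)
print(plan)
```

The sketch only produces a plan. Actually moving, replicating or sharing the data safely is where the money, patience and time come in.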
This brings us to those looking to buy a commercial solution. There are a lot of vendors out there. Please see the chart below; by no means is this an exhaustive list, just what we have come across in our travels.
This metadata topic is nothing new. I think it comes up at almost every conference or symposium I attend, but I have yet to see the topic tackled head on. So this is exactly what we are going to do for the next Life Science Technology User Group coming up in the Fall. Stay tuned for that on lstug.com and on Twitter @LSTUG.
So now I will put it to you, the community: what have you seen? What works? What doesn’t? And what do you want to see the community work on to build better solutions in this space?
Until next time...
References: Stephen Worth & Sasha Paegle, Dell EMC