Launch HN: Aquarium (YC S20) – Improve Your ML Dataset Quality

pgao | 167 points

I'm an Aquarium user. There are two ways Aquarium provides value to my company. First, we improved our model performance. Second, I spent less time and fewer clicks curating my dataset.

Regarding model performance, I used Aquarium to improve the AUC for my model by 18 percentage points (i.e., comparing the AUC for the first model trained on my new dataset to the AUC for my production model).

Regarding dataset curation efficiency, I spent much less time curating my dataset using Aquarium than I would have spent using our own in-house tooling. For example, the embedding-based point cloud allowed me to identify lots of images with an issue at once, rather than image by image, click by click.
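The "identify lots of images with an issue at once" workflow comes down to finding neighborhoods in embedding space. As a minimal sketch (not Aquarium's actual implementation — the function, threshold, and toy vectors here are illustrative assumptions), given precomputed embeddings you can pull every item similar to one known-bad example in a single query:

```python
import numpy as np

def similar_items(embeddings, query_idx, threshold=0.9):
    """Return indices of items whose embedding has cosine similarity
    to the query item above the threshold (query itself excluded)."""
    # Normalize rows so a dot product equals cosine similarity.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e[query_idx]
    hits = np.where(sims > threshold)[0]
    return [i for i in hits if i != query_idx]

# Toy example: items 0-2 are near-duplicates, item 3 is unrelated.
emb = np.array([[1.00, 0.00],
                [0.99, 0.05],
                [0.98, 0.10],
                [0.00, 1.00]])
print(similar_items(emb, 0))  # [1, 2]
```

Finding one problematic image and sweeping in its neighbors replaces the image-by-image, click-by-click review described above.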

This thread has been mostly focused on improving model performance (i.e., my first point), but Aquarium is also valuable for improving dataset curation labor efficiency (i.e., my second point). For the business owner, dataset curation labor efficiency means less money wasted on having some of your most expensive employees, ML data scientists, clicking around and writing ad-hoc scripts. For the ML practitioner, it means fewer clicks and less wear on your carpal tunnels.

The founders, Peter and Quinn, didn't ask me to write this. I chose to write it because it's a great product that I think can help a lot of businesses and people.

tmshapland | 4 years ago

Hey! DVC maintainer and co-founder here. First of all, congrats, and let me know if we can help you or if you have some collaboration in mind! A few questions: what does the workflow look like — do you expect users to upload all their data to your service? How can the data then be consumed from the platform?

ishcheklein | 4 years ago

Thanks for all the hard work and congrats on your launch!

I will definitely check this service out for a side project I'm working on that combines basketball and AI (https://www.myshotcount.com/)

stev3 | 4 years ago

I think this is a great idea because, as you mentioned, dataset quality can determine whether your model works at all. However, it doesn't address the elephant in the room: no matter how much you curate or clean the data, you are limited to the dataset you have. The bigger question is how to get more and better datasets. Tooling is super important, but the big differentiator will be how to collect/generate/capture reliable, defensible datasets moving forward. I think your idea is complementary to this other project: https://delegate.dev

masio12 | 4 years ago

Dear @pgao, thank you for the long intro with references and explanations. I went to your website and noticed that "Getting Started" is a contact form. Curious — are you building a product to do this, or is it more consulting/advisory? I'm currently creating some fun datasets for public use and I'd love to be a test rat for your software.

TuringNYC | 4 years ago

Thanks for sharing @pgao! This tool looks really valuable.

> Since embeddings can be extracted from most neural networks, this makes our platform very general. We have successfully analyzed dataset + models operating on images, 3D point clouds from depth sensors, and audio.
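The quoted claim rests on a simple mechanism: an embedding is just the activations of a late layer, read out before the task-specific head. A minimal NumPy sketch — the two-layer toy network and random weights below are illustrative assumptions, not Aquarium's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trained network: input -> hidden "embedding" -> class scores.
# W1/W2 are random stand-ins for learned weights.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(4, 3))

def forward(x):
    hidden = np.maximum(x @ W1, 0.0)  # penultimate-layer activations
    scores = hidden @ W2              # task-specific head
    return scores, hidden

x = rng.normal(size=(2, 8))           # two example inputs
scores, embeddings = forward(x)
print(embeddings.shape)               # (2, 4): one 4-d embedding per input
```

Because almost any architecture — image CNN, point-cloud network, audio model — has such a penultimate layer, the same readout works across modalities, which is what makes the approach general.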

Are there any types of datasets/models that this tool would not work well with that you're aware of?

hughpeters | 4 years ago

If I understand correctly, it sounds like your platform is primarily intended for improving awareness and understanding of the data a team has, so they know which features to focus on and emphasize.

Do you think you'll get into synthetic data generation as well? In other words, improving dataset quality additively, not just curatively.

fractionalhare | 4 years ago

I have tested the tool a little for audio, and I see potential here. It is especially useful for anyone who has a relatively large amount of unlabeled data and wants to be efficient about which samples to spend labeling resources on.
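One common way to prioritize which unlabeled samples to label, sketched here under my own assumptions (uncertainty sampling on model probabilities — not necessarily how Aquarium ranks samples), is to label the items the model is least confident about first:

```python
import numpy as np

def least_confident(probs, k):
    """Return indices of the k samples whose top predicted-class
    probability is lowest, i.e. where the model is least confident."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k].tolist()

# Hypothetical class probabilities for 4 unlabeled audio clips.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.70, 0.20, 0.10],
                  [0.34, 0.33, 0.33]])
print(least_confident(probs, 2))  # [3, 1]: the two most ambiguous clips
```

Labeling the ambiguous clips first tends to improve the model faster per labeling dollar than labeling at random.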

jononor | 4 years ago