Command-line Tools can be 235x Faster than a Hadoop Cluster (2014)

hd4 | 438 points

I have rewritten incredibly overarchitected stuff (Cassandra, Hadoop, Kafka, Node, Mongo, etc., with a plethora of ‘the latest cool programming languages’, running on big clusters on Amazon and Google) into simple, but not sexy, C# and MySQL or PostgreSQL. Despite people commenting on the inefficiency of ORMs and the unscalable nature of the solution I picked, it easily outperforms in every way for the real-world cases these systems were used for. Meaning: far simpler architecture, easier to read, and far better performance in both latency and throughput, for workloads that will probably never happen. Also: one language, fewer engineers needed, less maintenance, and easily swappable databases.

I understand that all that other tech is in fact ‘learning new stuff’ for RDD, but it was costing these companies a lot of money with very little benefit. If I needed something for very high traffic and huge data, I still do not know if I would opt for Cassandra or Hadoop; even with a proper setup, sure they scale, but at what cost? I had far better results with kdb+, which requires very little setup and very minimal overhead if you do it correctly. Then again, we will never have to mine petabytes, so maybe the use case works there: I would love to hear from people who have tried different solutions objectively.

tluyben2 | 6 years ago

I once converted a simulation into cython from plain old python.

Because it fit in the CPU cache, the speedup was around 10000x on a single machine (numerical simulations, amirite?).

Because it was so much faster, all the code required to split it up between a bunch of servers in a map-reduce job could be deleted, since it only needed a couple of cores on a single machine for a millisecond or three.

Because it wasn't a map-reduce job, I could take it out of the worker queue and just handle it on the fly during the web request.

Sometimes it's worth it to just step back and experiment a bit.

3pt14159 | 6 years ago

"You can have a second computer once you've demonstrated you know how to use one".

mpweiher | 6 years ago

I was recently sorting a 10-million-line CSV by the second field, which was numerical. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.

One awk command later, my file was flipped. I ran the same exact sort command on it, but without specifying a field. It completed in 12 seconds.
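
The trick looks roughly like this (a sketch only; the comma separator and the idea that only the first two columns need swapping are assumptions about the file):

    # swap the numeric second field to the front (assuming a comma-separated file)
    awk -F',' -v OFS=',' '{ tmp = $1; $1 = $2; $2 = tmp; print }' input.csv > flipped.csv

    # a plain numeric sort on the leading field is now all that's needed
    sort -n flipped.csv > sorted.csv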

Morals:

1. Small changes can have a 3+ orders of magnitude effect on performance

2. Use the Google; it's easier than understanding every tool on a deep enough level to figure this out yourself ;)

ikeboy | 6 years ago

See also: Scalability! But at what COST? [1] [2]

> The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.

[1] http://www.frankmcsherry.org/assets/COST.pdf

[2] https://news.ycombinator.com/item?id=11855594

0xcde4c3db | 6 years ago

For my performance book, I looked at some sample code for converting public transport data in CSV format to an embedded SQLite DB for use on mobile. A little bit of data optimization took the time from 22 minutes to under a second, or ~1000x, for well over 100MB of source data.

The target data went from almost 200MB of SQLite to 7MB of binary that could just be mapped into memory. Oh, and lookup on the device also became 1000x faster.

There is a LOT of that sort of stuff out there; our “standard” approaches are often highly inappropriate for a wide variety of problems.
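
For reference, the kind of stock CSV-to-SQLite import that typically serves as the baseline here looks roughly like this (file and table names are made up for illustration; this is not the book's actual code):

    # load a transit CSV straight into an SQLite table (hypothetical names)
    printf '.mode csv\n.import stops.csv stops\n' | sqlite3 transit.db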

mpweiher | 6 years ago

In other words, "too big for Excel is not big data": https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

makapuf | 6 years ago

A lot of people are saying how they've worked on single-machine systems that performed far better than distributed alternatives. Yawn. So have I. So have thousands of others. It should almost be a prerequisite for working on those distributed systems, so that they can understand the real point of those systems. Sometimes it's about performance, and even then there's no "one size fits all" answer. Just as often it's about capacity. Seen any exabyte single machines on the market lately? Even more often than that, it's about redundancy and reliability. What happens when your single-machine wonder has a single hardware failure?

Sure, a lot of tyros are working on distributed systems because it's cool or because it enhances their resumes, but there are also a lot of professionals working on distributed systems because they're the only way to meet requirements. Cherry-picking examples to favor your own limited skill set doesn't seem like engineering to me.

notacoward | 6 years ago

See also the wonderful COST paper: https://www.usenix.org/conference/hotos15/workshop-program/p...

But the article is kind of wrong. It depends on your data size and problem: you can even use command-line tools with Hadoop MapReduce via the Streaming API, and Hadoop is still useful if you have a few terabytes of data that you can tackle with map and reduce algorithms; in that case multiple machines do help quite a lot.

Anything that fits on your local SSD/HDD probably does not need Hadoop. However, you can run the same Unix commands from the article just fine on a 20TB dataset with Hadoop.

Hadoop MapReduce/HDFS is a tool for a specific purpose, not magic fairy dust. Google did build it for indexing and storing the web, and probably not to calculate some big Excel sheets...
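
To illustrate the Streaming point, the article's pipeline can be pushed through Hadoop Streaming more or less unchanged. A rough sketch only; the jar location and HDFS paths are assumptions and vary by distribution, and in practice grep's non-zero exit on splits with no matches needs handling:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /data/pgn \
        -output /data/pgn-results \
        -mapper 'grep Result' \
        -reducer 'uniq -c'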

nisa | 6 years ago

For many, the incentive from the gig is:

a) resume enrichment by way of buzzword addition

b) huge budget grants and allocation, purportedly for lofty goals, while management is really unaware of real technology needs/options

Much has been said about this already; sharing these again!

[1] https://news.ycombinator.com/item?id=14401399

[2] https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

[3] http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...

[4] https://www.reddit.com/r/programming/comments/3z9rab/taco_be...

[5] https://www.mikecr.it/ramblings/taco-bell-programming/

raghava | 6 years ago

Most people who think they have big data don't.

jandrese | 6 years ago

    cat *.pgn | grep "Result" | sort | uniq -c

This pipeline has a useless use of cat. Over time I've found cat to be kind of slow compared to passing a filename to a command directly when I can. If you rewrite it as:

    grep -h "Result" *.pgn | ...

it would be much faster. I found this when I was fiddling with my current log processor to analyze stats on my blog.
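
If you want to check the difference on your own data, timing both variants is straightforward (output discarded so only the pipeline cost is measured):

    time (cat *.pgn | grep "Result" | sort | uniq -c > /dev/null)
    time (grep -h "Result" *.pgn | sort | uniq -c > /dev/null)
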
Something1234 | 6 years ago

In my case it was 2015. I was struggling with a 28GB CSV file that I needed to cut, grabbing only 5 columns.

I tried Spark on my laptop: a waste of time. After 4 hours I killed all the processes because it hadn't read 25% of the file yet.

Same for Hadoop, Python and pandas, and a shiny new tool from Google whose name I forgot a long time ago.

Finally I installed Cygwin on my laptop, and 20 minutes later 'cut' gave me the results file I needed.
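
The whole job is essentially one line of cut, something like the following (the delimiter and column numbers are made up for illustration):

    # keep only five columns from a comma-separated 28GB file
    cut -d',' -f2,5,7,11,13 big.csv > five_columns.csv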

eb0la | 6 years ago

So there are 2197188 games in that file.

Extracting only the lines with "[Result]" in them into a new file (using grep) takes about 3 seconds.

Importing that into a local Postgres database on my laptop takes about 1.5 seconds.

Then running a simple:

    select result, count(*)
    from results
    group by result;

takes about 0.5 seconds.

So the total process took only 5 seconds (and now I can run many more aggregation queries on that).
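
A rough sketch of that whole pipeline, assuming psql connects to your default local database and a single-text-column table (table and file names are made up):

    grep -h '\[Result' *.pgn > results.txt
    psql -c 'CREATE TABLE results (result text);'
    psql -c "\copy results FROM 'results.txt'"
    psql -c 'SELECT result, count(*) FROM results GROUP BY result;'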

hans_castorp | 6 years ago

I feel that we have spent 30 years replicating Bash across multiple computer systems.

The further my clients move to the cloud, the more shell scripts they write, to the exclusion of other languages. And just like this, I have clients who have ripped out expensive enterprise data streaming tools and replaced them with bash.

The future of enterprise software is going to be a bloodbath.

exelius | 6 years ago

Anecdata: we used fast CSV tools and custom node.js scripts to wrangle the import of ~500GB of geo polygon data into a large single-volume PostgreSQL+PostGIS host.

We generate SVG maps in pseudo-realtime from this data set: 2MB maps render sub-second over the web, which feels 'responsive'.

I only mention this because many marketing people will call 50 million rows or 1TB "Big Data" and therefore suggest big/expensive/complex solutions. Recent SSD hosts have pushed up the "Big Data" watermark, and offer superb performance for many data applications.

[Yes, I know you can't beat magnetic disks for storing large videos, but that's a less common use case.]

jgord | 6 years ago

Relevant article from 2013: 'Don't use Hadoop - your data isn't that big' [0] and the most recent HN discussion [1].

[0]: https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

[1]: https://news.ycombinator.com/item?id=14401399

icc97 | 6 years ago
[deleted]
| 6 years ago

Hi all, author here!

Many thanks for the feedback and comments. If you have any questions, I'm also happy to try to answer them.

I'm also working on a project, https://applybyapi.com which may be of interest to anyone here hiring developers and drowning in resumes.

adamdrake | 6 years ago

You can get very good improvements over Spark too. I've been using GNU Parallel + redis + Cython workers to calculate distance pairs for a disambiguation problem. But then again, if it fits into a few X1 instances, it's not big data!
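
A minimal sketch of that kind of fan-out, with a hypothetical worker.py that reads candidate pairs on stdin and writes distances to stdout (the Redis coordination is left out):

    # split the pair list into chunks and run one Cython-backed worker per core
    parallel --pipe --block 10M python worker.py < pairs.txt > distances.txt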

jakosz | 6 years ago

If you need to do something more complicated where SQL would be really handy, and like here you're not going to update your data, give MonetDB a try. Quite often when I'm about to break down and implore the Hadoop gods, I remember MonetDB, and most of the time it's sufficient to solve my problem.

MonetDB: a columnar DB for a single machine with a psql-like interface: https://www.monetdb.org/
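
If anyone wants to kick the tires, spinning up a local MonetDB is roughly the following (recalled from the MonetDB docs, so treat the exact commands as an assumption; the dbfarm path and database name are arbitrary):

    monetdbd create ~/dbfarm && monetdbd start ~/dbfarm
    monetdb create mydb && monetdb release mydb
    mclient -d mydb    # psql-like interactive SQL shell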

dorfsmay | 6 years ago

Well, yes, in the edge case where you don't really have big data, of course it will.

Where MapReduce, Hadoop, etcetera shine is when a single dataset (one of many you need to process) is bigger than the biggest single disk available - and this changes with time.

Back when I did M/R for BT the data set sizes were smaller - still, having all of the UK's largest PR1ME superminis running your code was damn cool, even though I used a 110 baud dial-up print terminal to control it.

walshemj | 6 years ago

I think much of this issue can be attributed to the two most underrated things:

1. Cache line misses.

2. The so-called definition of Big Data (if the data can easily fit into memory, then it's not Big, period!).

Many times I have seen simple awk/grep commands outperform Hadoop jobs. I personally feel it's a lot better to spin up larger instances, compute your jobs, and shut them down than to bear the operational overhead of managing a Hadoop cluster.

ram_rar | 6 years ago

Anchoring the search would allow the regex to terminate faster for non-matching lines:

    ... | mawk '/^\[Result' ...
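
Combined with the grep -h suggestion elsewhere in the thread, the whole pipeline can be collapsed into mawk directly; a sketch:

    mawk '/^\[Result/' *.pgn | sort | uniq -c
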
verytrivial | 6 years ago

Esp. if you use multipipe-enhanced coreutils, like dgsh. https://www2.dmst.aueb.gr/dds/sw/dgsh/

rurban | 6 years ago

It would benefit people to actually understand the algorithmic complexity of what they are doing before they go on these voyages to parallelize everything. It also helps to know what parallelizes well and what doesn't.

noobermin | 6 years ago

Something, something, Joyent Manta: https://apidocs.joyent.com/manta/job-patterns.html

insaneirish | 6 years ago

Shouldn't this be marked 2014? The article's date is January 18, 2014.

ash_gti | 6 years ago

Partially because of the bloated architecture of Hadoop, Kafka, etc. And of course Java. Implementing modern and lighter alternatives to those in C++, Go or Rust would be a step forward.

xvilka | 6 years ago

People are suckers for complex, over-engineered things. They associate mastering complexity with intelligence. Simple things that just work are boring / not sexy.

I’ll be in my corner asking “do we really need this?” / “have you tested this?”

mirceal | 6 years ago

I've heard this so often, especially from my boss, whoever that happens to be at the time. I've come to the conclusion that this is just a lame excuse, because people feel out of control and overwhelmed when using this kind of software. It is heavily Java-based, and to those who have never touched javac, the stack traces must look both intimidating and ridiculous.

On the other hand, processing data over several steps with a homegrown solution needs a lot of programming discipline and reasoning; otherwise your software turns into an unmaintainable and unreliable mess. In fact, this is the case where I work right now...

blablabla123 | 6 years ago

This reminds me of the early EJBs which were really complicated and most people who used them didn't really need them.

forinti | 6 years ago

It can be colder at night than outside.

koomi | 6 years ago

I closed at "Since the data volume was only about 1.75GB". (And probably would have until 500GB+)

erazor42 | 6 years ago

Ah yes this is so true!

hajderr | 6 years ago

The Hadoop solution may have taken more computer time, but computer time is far less valuable than a person's.

The command-line solution probably took that person a couple of hours or more to perfect.

angel_j | 6 years ago

If I do one thing this year, I need to learn more about these tried and tested command-line unix shell utilities. It's becoming increasingly obvious that so many complicated and overwrought things are being unleashed on the world because (statistically) nobody knows this stuff anymore.

megaman22 | 6 years ago

Once, I saw a presentation of Unicage, a big data solution that had (perhaps still has) a free version: https://youtu.be/h_C5GBblkH8?t=2566 It seems to have evolved into a company now: http://www.unicage.com/products.html

Has anyone tried the Unicage solution?

libx | 6 years ago

What is this “grep” people keep mentioning? I’ve tried installing it from npm to no avail :/

tzahola | 6 years ago