I once converted a simulation into cython from plain old python.
Because it fit in the CPU cache, the speedup was around 10000x on a single machine (numerical simulations, amirite?).
Because it was so much faster, all the code required to split it up between a bunch of servers in a map-reduce job could be deleted, since it only needed a couple of cores on a single machine for a millisecond or three.
Because it was no longer a map-reduce job, I could take it out of the worker queue and just handle it on the fly during the web request.
Sometimes it's worth it to just step back and experiment a bit.
"You can have a second computer once you've demonstrated you know how to use one".
Recently I was sorting a 10-million-line CSV by the second field, which was numeric. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.
One awk command later, my file was flipped. I ran the same exact sort command on it, but without specifying a field. It completed in 12 seconds.
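A sketch of that two-step approach, assuming a comma-separated file with the numeric key in the second field (file names and sample data are made up for illustration):

```shell
# Sample data standing in for the 10-million-line file: letter, numeric key
printf 'a,3\nb,1\nc,2\n' > data.csv

# Flip: put the sort key at the front of each line
awk -F, '{ print $2 "," $0 }' data.csv > flipped.csv

# No -k field spec needed now; sort compares from the start of each line
sort -n flipped.csv > sorted.csv
```

The win comes from sort not having to locate and extract the key field for every comparison; setting `LC_ALL=C` can speed things up further by avoiding locale-aware collation.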
Morals:
1. Small changes can have a 3+ orders of magnitude effect on performance
2. Use the Google, easier than understanding every tool on a deep enough level to figure this out yourself ;)
See also: Scalability! But at what COST? [1] [2]
> The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.
For my performance book, I looked at some sample code for converting public transport data in CSV format to an embedded SQLite DB for use on mobile. A little bit of data optimization took the time from 22 minutes to under a second, or ~1000x, for well over 100MB of source data.
The target data went from almost 200MB of SQLite to 7MB of binary that could just be mapped into memory. Oh, and lookup on the device also became 1000x faster.
There is a LOT of that sort of stuff out there, our “standard” approaches are often highly inappropriate for a wide variety of problems.
Previous posts:
2016: https://news.ycombinator.com/item?id=12472905
in other words, "Too big for excel is not big data" https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
A lot of people are saying how they've worked on single-machine systems that performed far better than distributed alternatives. Yawn. So have I. So have thousands of others. It should almost be a prerequisite for working on those distributed systems, so that they can understand the real point of those systems. Sometimes it's about performance, and even then there's no "one size fits all" answer. Just as often it's about capacity. Seen any exabyte single machines on the market lately? Even more often than that, it's about redundancy and reliability. What happens when your single-machine wonder has a single hardware failure?
Sure, a lot of tyros are working on distributed systems because it's cool or because it enhances their resumes, but there are also a lot of professionals working on distributed systems because they're the only way to meet requirements. Cherry-picking examples to favor your own limited skill set doesn't seem like engineering to me.
See also the wonderful COST paper: https://www.usenix.org/conference/hotos15/workshop-program/p...
But the article is kind of wrong. It depends on your data size and problem: you can even use command-line tools with Hadoop MapReduce via the Streaming API, and Hadoop is still useful if you have a few terabytes of data that you can tackle with map and reduce algorithms; in that case multiple machines do help quite a lot.
Anything that fits on your local SSD/HDD probably does not need Hadoop... however, you can run the same Unix commands from the article just fine on a 20TB dataset with Hadoop.
Hadoop MapReduce/HDFS is a tool for a specific purpose, not magic fairy dust. Google built it for indexing and storing the web, and probably not to calculate some big Excel sheets...
For many, the incentive from the gig is:
a) resume enrichment by way of buzzword addition;
b) huge budget grants and allocations, purportedly for lofty goals, while management is really unaware of real technology needs/options.
Much has been said about this already; sharing the links again!
[1] https://news.ycombinator.com/item?id=14401399
[2] https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
[3] http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
[4] https://www.reddit.com/r/programming/comments/3z9rab/taco_be...
Most people who think they have big data don't.
cat *.pgn | grep "Result" | sort | uniq -c
This pipeline has a useless use of cat. Over time I've found cat to be kind of slow compared to actually passing a filename to a command when I can. If you rewrite it as: grep -h "Result" *.pgn | ...
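Putting that together, the article's full pipeline without the cat looks like this (a sketch; the PGN fragments here are made up to show the counting):

```shell
# Fake PGN fragments standing in for real game files
printf '[Result "1-0"]\n[Result "0-1"]\n' > a.pgn
printf '[Result "1-0"]\n' > b.pgn

# -h suppresses the per-file name prefix grep adds when given multiple files,
# so the lines stay identical and uniq -c can count them
grep -h "Result" *.pgn | sort | uniq -c
```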
It would be much faster. I found this when I was fiddling with my current log processor to analyze stats on my blog.
In my case it was 2015. I was struggling with a 28GB CSV file that I needed to cut, grabbing only 5 columns.
I tried Spark on my laptop: a waste of time. After 4 hours I killed all the processes because it hadn't read even 25% of the file yet.
Same for Hadoop, Python and pandas, and a shiny new tool from Google whose name I forgot a long time ago.
Finally I installed Cygwin on my laptop, and 20 minutes later 'cut' gave me the results file I needed.
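The cut invocation would have looked something like this (the column numbers are hypothetical; the comment doesn't say which five were kept):

```shell
# One sample row standing in for the 28GB file
printf 'a,b,c,d,e,f,g\n' > big.csv

# -d sets the field delimiter, -f picks the columns to keep
cut -d, -f1,2,4,6,7 big.csv
```

cut streams the file once with constant memory, which is why it shrugs off inputs that make heavier tools fall over.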
So there are 2197188 games in that file.
Extracting only the lines with "[Result]" in them into a new file (using grep) takes about 3 seconds.
Importing that into a local Postgres database on my laptop takes about 1.5 seconds.
Then running a simple:
select result, count(*)
from results
group by result;
Takes about 0.5 seconds. So the total process took only 5 seconds (and now I can run many more aggregation queries on that).
I feel that we have spent 30 years replicating Bash across multiple computer systems.
The further my clients move to the cloud, the more shell scripts they write to the exclusion of other languages. And just like this, I have clients who have ripped out expensive enterprise data-streaming tools and replaced them with bash.
The future of enterprise software is going to be a bloodbath.
Anecdata: we used fast CSV tools and custom node.js scripts to wrangle the import of ~500GB of geo polygon data into a large single-volume PostgreSQL+PostGIS host.
We generate SVG maps in pseudo-realtime from this dataset: 2MB maps render sub-second over the web, which feels 'responsive'.
I only mention this as many marketing people will call 50 million rows or 1TB "Big Data" and therefore suggest big / expensive / complex solutions. Recent SSD hosts have pushed up the "Big Data" watermark and offer superb performance for many data applications.
[ yes, I know you can't beat magnetic disks for storing large videos, but that's a less common use-case ]
Relevant article from 2013: 'Don't use Hadoop - your data isn't that big' [0] and the most recent HN discussion [1].
[0]: https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
Hi all, author here!
Many thanks for the feedback and comments. If you have any questions, I'm also happy to try to answer them.
I'm also working on a project, https://applybyapi.com which may be of interest to anyone here hiring developers and drowning in resumes.
You can get very good improvements over Spark too. I've been using GNU Parallel + redis + Cython workers to calculate distance pairs for a disambiguation problem. But then again, if it fits into a few X1 instances, it's not big data!
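The fan-out part of that pattern can be sketched with plain xargs -P (GNU Parallel offers the same idea with richer features like job logging and retries); the chunk names and the echo standing in for a real worker are hypothetical:

```shell
# Run up to 4 "workers" at once, one input chunk each;
# in the real setup each invocation would be a Cython worker pulling from redis
printf '%s\n' chunk1 chunk2 chunk3 | xargs -n1 -P4 -I{} echo "processed {}"
```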
If you need to do something more complicated where SQL would be really handy, and, like here, you're not going to update your data, give MonetDB a try. Quite often, when I'm about to break down and implore the Hadoop gods, I remember MonetDB, and most of the time it's sufficient to solve my problem.
MonetDB: a columnar DB for a single machine with a psql-like interface: https://www.monetdb.org/
Well yes, in the edge case where you don't really have big data, of course it will.
Where MapReduce, Hadoop, et cetera shine is when a single dataset (one of many you need to process) is bigger than the biggest single disk available - and that changes with time.
Back when I did M/R for BT, the data set sizes were smaller - still, having all of the UK's largest PR1ME superminis running your code was damn cool - even though I used a 110-baud dial-up print terminal to control it.
I think much of this issue can be attributed to the 2 most underrated things:
1. Cache line misses.
2. The so-called definition of Big Data (if the data easily fits into memory, then it's not Big, period!).
Many times I have seen simple awk/grep commands outperform Hadoop jobs. I personally feel it's a lot better to spin up larger instances, compute your jobs, and shut them down than to bear the operational overhead of managing a Hadoop cluster.
Anchoring the search would allow the regex to terminate faster for non-matching lines:
... | mawk '/^\[Result' ...
Esp. if you use multipipe-enhanced coreutils, like dgsh. https://www2.dmst.aueb.gr/dds/sw/dgsh/
It would benefit people to actually understand algorithmic complexity of what they are doing before they go on these voyages to parallelize everything. It also helps to know what helps to parallelize and what doesn't.
Something, something, Joyent Manta: https://apidocs.joyent.com/manta/job-patterns.html
Shouldn't this be marked 2014? The article's date is January 18, 2014.
Partially because of the bloated architecture of Hadoop, Kafka, etc. And of course Java. Implementing modern and lighter alternatives to those in C++, Go or Rust would be a step forward.
People are suckers for complex, over-engineered things. They associate mastering complexity with intelligence. Simple things that just work are boring / not sexy.
I’ll be in my corner asking “do we really need this?” / “have you tested this?”
'Nuff said: https://github.com/erikfrey/bashreduce
I've heard this so often, especially from my boss. I've come to the conclusion that this is just a lame excuse, because people feel out of control and overwhelmed when using this kind of software. It is heavily Java-based, and to those who have never touched javac, the stack traces must look both intimidating and ridiculous.
On the other hand, processing data over several steps with a homegrown solution needs a lot of programming discipline and reasoning; otherwise your software turns into an unmaintainable and unreliable mess. In fact, this is the case where I work right now...
This reminds me of the early EJBs which were really complicated and most people who used them didn't really need them.
It can be colder at night than outside.
I closed the tab at "Since the data volume was only about 1.75GB". (And probably would have up to 500GB+.)
Ah yes this is so true!
The Hadoop solution may have taken more computer time, but that is far less valuable than a person's.
The command-line solution probably took that person a couple of hours or more to perfect.
If I do one thing this year, I need to learn more about these tried and tested command-line unix shell utilities. It's becoming increasingly obvious that so many complicated and overwrought things are being unleashed on the world because (statistically) nobody knows this stuff anymore.
Once, I saw a presentation of Unicage, a big data solution that had (perhaps still has) a free version https://youtu.be/h_C5GBblkH8?t=2566 It seems that it has evolved to a company now: http://www.unicage.com/products.html
Did anyone try the unicage solution?
What is this “grep” people keep mentioning? I’ve tried installing it from npm to no avail :/
I have rewritten incredibly overarchitected stuff (Cassandra, Hadoop, Kafka, Node, Mongo, etc., with a plethora of 'the latest cool programming languages', running on big clusters on Amazon and Google) into simple, but not sexy, C# with MySQL or pgsql. Despite people commenting on the inefficiency of ORMs and the unscalable nature of the solution I picked, it easily outperforms them in every way for the real-world cases these systems were used for. Meaning: far easier architecture, easier-to-read code, and far better performance in both latency and throughput, for workloads that will probably never happen. Also: one language, fewer engineers needed, less maintenance, and easily swappable databases.
I understand that all that other tech is in fact 'learning new stuff' for RDD, but it was costing these companies a lot of money with very little benefit. If I need something for very high traffic and huge data, I still do not know if I would opt for Cassandra or Hadoop; even with a proper setup, sure they scale, but at what cost? I had far better results with kdb+, which requires very little setup and very minimal overhead if you do it correctly. Then again, we will never have to mine petabytes, so maybe the use case works there: I would love to hear from people who have tried different solutions objectively.