On Counting (2017)
It would have been cool to see a discussion about algorithmic solutions, rather than solutions based on tools like MySQL.
There are three basic approaches: sort and remove duplicates (the original bash script); insert all items into a set (e.g. a hash table) that keeps only unique copies, and count its size; or probabilistic solutions like Count-Min Sketch or HyperLogLog. The problem with the latter is that they are approximate, which doesn't sound ideal when billing customers.
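For a sense of what the probabilistic option looks like, here is a toy HyperLogLog in Python (an illustrative sketch, not a production implementation; a real system would use a tuned library):

```python
import hashlib
import math

def _hash64(item):
    # stable 64-bit hash of a string
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

class HyperLogLog:
    """Toy HyperLogLog: estimates a distinct count in O(2**p) memory."""

    def __init__(self, p=12):
        self.p = p                  # 2**p registers; std error ~ 1.04 / 2**(p/2)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = _hash64(item)
        idx = x >> (64 - self.p)                      # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero rank
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

The whole sketch is a few kilobytes regardless of how many records you feed it, and adding the same record twice changes nothing, which is exactly why it's attractive at scale; the trade-off is the few-percent error that makes it awkward for billing.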
The problem with both of the first two approaches is that they require all items to be stored in memory at the same time. As long as that's true, either the sort or hashtable approach will work fine. But once you run out of RAM on a single machine, it's going to slow way way down as it swaps to and from disk constantly.
To me, the natural solution is to just split the dataset alphabetically into, say, 10 or 100 equal-size jobs, and run these either sequentially or in parallel on 10 or 100 machines. So for example if the unique IDs are random strings of digits, then everything starting with 00 is in the first job, everything starting with 01 is in the second, up to 99. For each job, apply either the sort or the set approach; shouldn't matter much.
(edit) For example, here's sequential pseudocode; the second step is embarrassingly parallel.
  # split the records into 100 shard files by two-digit prefix
  out = {}
  for record in records_list:
      prefix = record[:2]
      if prefix not in out:
          out[prefix] = open("records" + prefix, "w")
      out[prefix].write(record + "\n")
  for f in out.values():
      f.close()

  # count: each shard now fits in memory, so dedupe with a set
  total = 0
  for prefix in out:
      with open("records" + prefix) as f:
          total += len({line.strip() for line in f})
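Since the counting step is embarrassingly parallel, it can also be spread across worker processes on one machine. This is a sketch that assumes the split step has already produced shard files named records00 through records99 in the current directory:

```python
# parallel version of the counting step: one worker per shard file,
# assuming the split step already wrote files named records00..records99
import glob
from multiprocessing import Pool

def count_unique(path):
    """Count distinct lines in one shard file."""
    with open(path) as f:
        return len({line.strip() for line in f})

if __name__ == "__main__":
    shards = glob.glob("records[0-9][0-9]")
    with Pool() as pool:
        total = sum(pool.map(count_unique, shards))
    print(total)
```

Spreading the shards across 10 or 100 machines instead of processes is the same idea, just with a job scheduler in place of the Pool.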
(second edit) I'm a theorist, not a practitioner, so I'm ignoring many practical issues about where to store and back up the data, etc.

> The oddly named wc is a command used to count things.

It's named after "word count".
Is there a fast way to detect duplicates when you first generate the records? If so, you could just keep a continuously updated counter for each client, incrementing it every time you add a record and decrementing on duplicates to avoid double counting.
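One way to sketch that ingest-time idea (the UsageCounter name is hypothetical; a real system would likely keep the per-client sets in a shared store such as Redis rather than in process memory):

```python
# hypothetical sketch of counting at ingest time: one per-client set of
# seen record IDs, so the monthly total is always up to date
from collections import defaultdict

class UsageCounter:
    def __init__(self):
        self.seen = defaultdict(set)   # client -> set of record IDs

    def add_record(self, client, record_id):
        self.seen[client].add(record_id)   # duplicate IDs are no-ops

    def count(self, client):
        return len(self.seen[client])
```

Using a set makes the increment/decrement bookkeeping unnecessary: a duplicate simply doesn't grow the set, so the count never double-counts.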
This person writes quite well. Few articles draw me in and keep me there until the end. His use of language and storytelling really flows, and it made me reminisce about when I used to be passionate about programming myself. Great writing.
If you're into fast cardinality estimation (HyperLogLog) and item counting (Count-min sketch, Bloom filters, hash tables), check out Bounter:
https://github.com/RaRe-Technologies/bounter/
(Pythonic interface on top of highly optimized algorithms, faster than dict but using limited memory, MIT license)
Reminds me quite a bit of https://adamdrake.com/command-line-tools-can-be-235x-faster-...
… in which, though, the author at least parallelises the counting.
AWS Athena is pretty good at tackling this problem, or PrestoDB running on EMR.
As long as your S3 data is reasonably partitioned, and you don't have millions of small files, it does a reasonable job, even on count(distinct).
It even supports approx_distinct for HyperLogLog estimation.
Rather than first building up a huge pile of logs and then counting them, move the count to the location that generates the logs, or to the location that stores them. Easily done: no need for clever algorithms or special-case solutions, just a simple set and a count. Added bonus: the “task” is axiomatically done when the month ends, with no difficulties or special considerations needed there.
It's worth mentioning that counting machines are way older than computers, and the Census used to be done with "dumb" punched-card munchers.
Why not count daily and incrementally? There's no reason to wait for the last day and then have to meet a deadline: do most of the processing in advance, so only a small amount of work is needed each day to update the running total.
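A minimal sketch of that incremental idea, assuming the month's set of seen IDs is small enough to persist and reload each day (the file path and helper name are hypothetical):

```python
# hypothetical sketch of incremental daily counting: fold each day's
# records into a persisted set of IDs seen so far this month
import os
import pickle

def update_monthly_count(state_path, todays_records):
    seen = set()
    if os.path.exists(state_path):
        with open(state_path, "rb") as f:
            seen = pickle.load(f)
    seen.update(todays_records)
    with open(state_path, "wb") as f:
        pickle.dump(seen, f)
    return len(seen)   # running count of unique records this month
```

At month's end the final bill is just the last value returned; no big batch job, and each daily run only touches that day's records plus the state file.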
AWS S3 Select might be a good fit.
https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-s...
Sharding a simple bash script across multiple nodes is one of the original (and still one of the most common) uses of Pachyderm.
We actually have a tutorial based on it: https://github.com/pachyderm/pachyderm/blob/master/doc/examp...
(disclosure: I work at Pachyderm. https://pachyderm.io/)