On Counting (2017)

cdoxsey | 98 points

Sharding a simple bash script across multiple nodes is one of the original (and still one of the most common) uses of Pachyderm.

We actually have a tutorial based on it: https://github.com/pachyderm/pachyderm/blob/master/doc/examp...

(disclosure: I work at Pachyderm. https://pachyderm.io/)

msteffen | 6 years ago

It would have been cool to see a discussion about algorithmic solutions, rather than solutions based on tools like MySQL.

There are three basic approaches: sort and remove duplicates (the original bash script); insert all items into a set (e.g. a hash table) that keeps only unique copies, and count its size; or probabilistic solutions like count-min sketch or HyperLogLog. The problem with the probabilistic ones, though, is that they're approximate, which doesn't sound ideal when billing customers.

The problem with both of the first two approaches is that they require all items to be stored in memory at the same time. As long as that's true, either the sort or hashtable approach will work fine. But once you run out of RAM on a single machine, it's going to slow way way down as it swaps to and from disk constantly.

To me, the natural solution is to just split the dataset by ID prefix into, say, 10 or 100 roughly equal-size jobs, and run these either sequentially or in parallel on 10 or 100 machines. So for example, if the unique IDs are random strings of digits, then everything starting with 00 goes in the first job, everything starting with 01 in the second, and so on up to 99. For each job, apply either the sort or the set approach; it shouldn't matter much which.

(edit) For example, here's a sequential sketch in Python; the second step is embarrassingly parallel.

    # split the records into buckets by two-character prefix
    prefix_files = {}
    for record in records_list:
        prefix = record[0:2]
        if prefix not in prefix_files:
            prefix_files[prefix] = open("records" + prefix, "w")
        prefix_files[prefix].write(record + "\n")
    for f in prefix_files.values():
        f.close()

    # count: each bucket fits in memory and can be processed independently
    total = 0
    for prefix in prefix_files:
        unique = set()
        with open("records" + prefix) as f:
            for record in f:
                unique.add(record)  # the set keeps only one copy of each record
        total += len(unique)

(second edit) I'm a theorist, not a practitioner, so I'm ignoring many practical issues about where to store and back up the data, etc.

bo1024 | 6 years ago

> The oddly named wc is a command used to count things.

It’s named after word count.

saagarjha | 6 years ago

Is there a fast way to detect duplicates when you first generate the records? If so, you could just keep a continuously updated counter for each client, incrementing it every time you add a record and decrementing on duplicates to avoid double counting.
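
A minimal sketch of that idea, assuming the "fast duplicate check" is a per-client set of already-seen record IDs (all names here are illustrative):

    from collections import defaultdict

    seen = defaultdict(set)          # client -> record IDs seen so far
    unique_count = defaultdict(int)  # client -> running count of unique records

    def add_record(client, record_id):
        unique_count[client] += 1        # count the record as it arrives...
        if record_id in seen[client]:
            unique_count[client] -= 1    # ...and back it out if it's a duplicate
        else:
            seen[client].add(record_id)

(Equivalently, only increment when the ID is new; the decrement just mirrors the wording above.)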

monochromatic | 6 years ago

This person writes quite well. Few articles draw me in and keep me there until the end. His use of language and storytelling really flows, and it also made me reminisce about when I used to be passionate about programming myself. Great writing.

jdironman | 6 years ago

If you're into fast cardinality estimation (HyperLogLog) and item counting (Count-min sketch, Bloom filters, hash tables), check out Bounter:

https://github.com/RaRe-Technologies/bounter/

(Pythonic interface on top of highly optimized algorithms, faster than dict but using limited memory, MIT license)
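
Usage looks roughly like this (from memory; check the repo's README for the exact constructor options):

    from bounter import bounter

    counts = bounter(size_mb=256)   # fixed memory budget, regardless of data size
    counts.update(['red', 'blue', 'red', 'red'])

    counts['red']          # approximate count for a single item
    counts.cardinality()   # approximate number of distinct items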

Radim | 6 years ago

Reminds me quite a bit of https://adamdrake.com/command-line-tools-can-be-235x-faster-...

… in which, though, the author at least parallelises the counting.

mfontani | 6 years ago

AWS Athena is pretty good at tackling this problem, or PrestoDB running on EMR.

As long as your S3 data is reasonably partitioned and you don't have millions of small files, it does a reasonable job, even on count(distinct).

It even supports approx_distinct for HyperLogLog-based estimation.

djhworld | 6 years ago

Rather than first building up a huge pile of logs and then counting them, move the count to the location that generates the logs or the location that stores them. Easily done: no need for algorithms or special-case solutions, just a simple set and a count. Added bonus: the "task" is by definition done when the month ends, with no difficulties or special considerations needed there.
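
A minimal sketch of that, assuming the log producer can keep one set of IDs per client per month (names are illustrative):

    from collections import defaultdict

    # one set of record IDs per (client, month), maintained where the logs
    # are generated or stored rather than in a separate counting job
    seen = defaultdict(set)

    def log_record(client, month, record_id):
        seen[(client, month)].add(record_id)   # duplicates collapse automatically

    def monthly_unique_count(client, month):
        return len(seen[(client, month)])      # ready the moment the month ends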

tomtimtall | 6 years ago

It's worth mentioning that counting machines are way older than computers, and the Census used to be done with "dumb" punched-card munchers.

https://en.wikipedia.org/wiki/Tabulating_machine

thanatropism | 6 years ago

Why not count daily and incrementally? There's no reason to wait for the last day and then have to meet a deadline; do most of the processing in advance so that only a small amount of work is needed each day to update the count.
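
For example (a sketch, not anyone's actual pipeline): deduplicate each day's logs into a per-day set as it arrives, and the month-end step reduces to a union of the daily sets.

    daily_unique = {}   # day -> set of record IDs seen that day

    def process_day(day, day_records):
        daily_unique[day] = set(day_records)   # the expensive scan happens daily

    def month_end_count(days_in_month):
        # cheap final step: union the already-deduplicated daily sets
        return len(set().union(*(daily_unique[d] for d in days_in_month)))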

ikeboy | 6 years ago