Cleaning Up Speech Recognition with GPT

BeetleB | 28 points

"I’ll probably write an elisp function to send the output to GPT and insert the results into the buffer"

I suggest taking a look at my LLM command-line tool. It's great for cobbling together these kinds of things, because you can use whatever shell integration your environment has to pipe things to it.

    cat bad-dictation.txt | \
    llm -m gpt-4-turbo --system '
    You are going to correct for text that
    has been produced by voice recognition
    software. Rewrite any text provided to
    you. The text will not have punctuation,
    so please add where needed. If the text
    has the word period, then insert a
    period, and do not insert another one
    just before or after. If the text has
    “new line” or “newline”, insert a newline
    character and start a new paragraph. If
    the text says “comma”, insert a comma. If
    the text has a number spelled out,
    replace it with the actual number. So
    “five thousand” becomes “5000”. Also, if
    something seems off in the text, it was
    probably due to a misrecognized word.
    Please correct for it.' > out.txt

Should be easy to call that from elisp, and you can then install plugins to have it talk to other models like Claude or Llama 3: https://llm.datasette.io/en/stable/plugins/directory.html
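
If you'd rather drive the same pipeline from a script than from a shell, here is a minimal Python sketch. It assumes the `llm` CLI is installed and has an API key configured. It only uses the `-m` and `--system` options from the command above and feeds the dictation on stdin. The function names are mine, not part of the tool.

```python
import subprocess


def build_llm_command(system_prompt, model="gpt-4-turbo"):
    """Build the argv list for the same `llm` invocation as the shell pipeline."""
    return ["llm", "-m", model, "--system", system_prompt]


def clean_dictation(raw_text, system_prompt, model="gpt-4-turbo"):
    """Pipe raw dictation text to `llm` on stdin and return its stdout.

    Assumes `llm` is on PATH; raises CalledProcessError if the command fails.
    """
    result = subprocess.run(
        build_llm_command(system_prompt, model),
        input=raw_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `clean_dictation(open("bad-dictation.txt").read(), prompt)` with the system prompt from above should be roughly equivalent to the `cat ... | llm ...` pipeline.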
simonw | 12 days ago

If you prefer an existing web app over an elisp function, try https://huggingface.co/spaces/ndurner/oai_chat. Choose Whisper as the model, upload your 25 MB chunks, and hit Send. Then choose GPT-4 Turbo, ask it to clean up the transcript, and hit Send again. Finally, hit the Download button (hidden away at the very bottom).

Helpful fact: Whisper works on 16 kHz audio behind the scenes, so you can make your recording smaller by downsampling to 22 kHz mono. AAC is supported, and commenters around the web say Whisper is pretty robust, so the audio doesn’t have to be hi-fi. If you still can’t fit it into one 25 MB chunk, you could split at the beginning of the Q&A session, perhaps.
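
As a rough sketch of the downsampling step, assuming ffmpeg is what you use for the conversion: the helper below builds an ffmpeg command for mono AAC at a reduced sample rate, and estimates how many minutes of audio fit under the 25 MB limit at a given bitrate. The function names, the 32 kbit/s default, and reading "25 MB" as 25 * 1024 * 1024 bytes are my assumptions. The estimate treats the bitrate as constant and ignores container overhead.

```python
UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB


def ffmpeg_downsample_args(src, dst, sample_rate_hz=22_050, bitrate_kbps=32):
    """Build an ffmpeg argv list: mono, resampled, AAC-encoded.

    22_050 Hz follows the 22 kHz suggestion above; 16_000 would match
    the rate Whisper is said to use internally.
    """
    return [
        "ffmpeg", "-i", src,
        "-ac", "1",                    # one audio channel (mono)
        "-ar", str(sample_rate_hz),    # output sample rate in Hz
        "-c:a", "aac",                 # AAC audio codec
        "-b:a", f"{bitrate_kbps}k",    # target audio bitrate
        dst,
    ]


def max_minutes_per_chunk(bitrate_kbps=32, limit_bytes=UPLOAD_LIMIT_BYTES):
    """Approximate minutes of audio that fit in one upload at this bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second / 60
```

At 32 kbit/s this works out to a bit over 109 minutes per chunk, so many recordings would fit without splitting. The argv list can be passed to `subprocess.run` as-is.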

ndr_ | 12 days ago

I've tried using LLMs to restore and clean up raw unpunctuated transcripts, but they tend to hallucinate new words. And chunking is an issue since long transcripts need to be split according to the LLM input context size. But where do we split if we don't have the punctuation yet?

In Scribe I chose instead to restore punctuation marks using a token classifier (here a DistilBERT model running in your browser): https://www.appblit.com/scribe
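
For anyone who does go the LLM route, one common workaround for the "where do we split?" question is fixed-size word windows with some overlap, so a sentence cut at one boundary appears whole in the neighbouring chunk. This is a minimal sketch of that idea, not what Scribe does. The window sizes are arbitrary assumptions, and word counts are only a rough proxy for the model's token limit. Merging the overlapping regions after cleanup is left out.

```python
def chunk_words(text, max_words=500, overlap=50):
    """Split unpunctuated text into overlapping chunks of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk after the first begins with the last `overlap` words of the previous one.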

Laurent

ldenoue | 10 days ago

I take a lot of long voice notes thinking out loud during my walks. I use this prompt when I give the Whisper output to GPT-4. I've found it to be pretty reliable:

> Punctuate the following transcript of a voice note I took. Insert periods, commas, and paragraph breaks where appropriate. Remove filler words such as 'right?' 'you know', and 'uh'. But do retain my original wording! Do not paraphrase my sentences beyond recognition: this is not a rewriting task! The transcript now follows:
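
If you want to check the "retain my original wording" part rather than trust it, a crude sanity check is to list the words that appear in the cleaned text but not in the raw transcript. This is a sketch under my own assumptions: the function names are hypothetical, and it compares lowercase word counts only. It would also flag legitimate changes such as spelled-out numbers converted to digits.

```python
import re
from collections import Counter


def _words(text):
    """Lowercase the text and return its word tokens, ignoring punctuation."""
    normalized = text.lower().replace("’", "'")
    return re.findall(r"[a-z0-9']+", normalized)


def added_words(original, cleaned):
    """Return the sorted words that occur more often in cleaned than in original."""
    extra = Counter(_words(cleaned)) - Counter(_words(original))
    return sorted(extra)
```

An empty result means the model only removed or re-punctuated words. A non-empty one is a list of places to look for paraphrasing or hallucinated words.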

aragonite | 12 days ago

Use Whisper (or WhisperX).

psadri | 12 days ago