Another free program is 'Wavesurfer', downloadable from:
http://www.speech.kth.se/software
It has the recording function built in, and lots of options, including a spectrogram, but won't do waterfall plots.
As I understand it, the sample size establishes the amount of information you have to work with, and this translates into restrictions on the upper frequency limit and resolution of the FFT. In the 'old days', twenty years or so ago, all papers that discussed results of FFT analysis spent a lot of time talking about these tradeoffs, since processor speeds limited the amount of data they could use. Now that's not a problem, but it's nice to have some idea of how it works.
Nyquist's theorem says that you need at least two data points per wave period to define a sine wave, so the sample rate sets the upper frequency limit for the analysis. That is, if you record at 16,000 samples/second, the highest frequency the FFT can give you will be 8000 Hz. You have to keep in mind that the mechanics of the tap itself will also impose an upper limit: as long as the 'hammer' is in contact with the 'anvil' the thing can't vibrate freely. A soft tapper, like your fingertip or a super ball, might stay in contact with the top for 1/1000 of a second or longer. The sound will start to fade out at 500 Hz, and there will be very little energy above 1000 Hz. In this case a high sample rate, like 96,000/sec, would be pointless.
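For anyone who wants to play with the numbers, here is a rough Python sketch of that arithmetic. The function names are my own, made up for illustration. The tap roll-off rule (fade starting near 1/(2 x contact time), little energy above 1/contact time) is only my reading of the 1/1000 second example above, not an established formula, so treat it as an assumption.

```python
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency the FFT can report: half the sample rate."""
    return sample_rate_hz / 2.0


def tap_rolloff_hz(contact_time_s):
    """Rough (fade_start, little_energy_above) frequencies for a tap.

    ASSUMPTION: fade starts near 1 / (2 * contact time) and little energy
    remains above 1 / contact time, matching the 1 ms example in the text.
    """
    fade_start = 1.0 / (2.0 * contact_time_s)
    little_above = 1.0 / contact_time_s
    return fade_start, little_above


if __name__ == "__main__":
    print(nyquist_limit_hz(16000))   # 8000.0
    print(tap_rolloff_hz(0.001))     # roughly 500 Hz and 1000 Hz
```

With a 1 ms contact time the tap runs out of energy far below the 48,000 Hz limit that a 96,000/sec recording would allow, which is the point of the paragraph above.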
The length of the sample imposes a limit on the frequency resolution. The lowest frequency the FFT can break out will have one full wave period in the sample, so a 2 second sample will contain valid information down to 1/2 Hz, and a 1 second sample down to 1 Hz. The FFT sorts the information into 'bins' that are multiples of that lowest resolvable frequency: for a 2 second sample the bins will be 1/2 Hz wide, and things that are closer together than that won't be resolved.
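The bin arithmetic can be sketched in a few lines of Python. Again the function names are just mine for illustration; the only relation used is that the bin width is one over the sample length in seconds.

```python
def bin_width_hz(sample_rate_hz, n_samples):
    """Bin width = 1 / (length of the sample in seconds)."""
    duration_s = n_samples / sample_rate_hz
    return 1.0 / duration_s


def bin_frequencies_hz(sample_rate_hz, n_samples):
    """Bin frequencies from 0 Hz up to the Nyquist limit."""
    width = bin_width_hz(sample_rate_hz, n_samples)
    return [k * width for k in range(n_samples // 2 + 1)]


if __name__ == "__main__":
    # A 2 second sample recorded at 8000 samples/sec
    print(bin_width_hz(8000, 16000))   # 0.5
    # A 1 second sample at the same rate
    print(bin_width_hz(8000, 8000))    # 1.0
```

Note that the sample rate drops out of the bin width: only the length in seconds matters. The rate just decides how far up the list of bins goes.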
We can get by with short samples in looking at tap tones because we're not really interested in either high resolution or high frequency limits. There are so many possible vibration modes in a top above 1000 Hz that they're closer together than their bandwidths; you're in a 'resonance continuum'. At that point all you can do is get a rough idea of how many modes there are for every 1/3 octave band, say, and find the average damping by looking at the ratio of the peak heights to the dips in the spectrum. Mode spacing down in the controllable area, below, say, 500 Hz, is usually something like one every fifty Hz or so, so 0.1 Hz resolution won't do you much good. All in all, I'd suggest a low sample rate; for my sound card the lowest is 6000 samples/sec, but, for various reasons, I often use 8000 or 16,000. Multiple taps can stretch out the sample length, and give higher resolution, if you take some care to make sure they're all comparable.
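If you want to count modes per 1/3 octave band, something like the following Python sketch would do it. This is only one way to go about it: the function names, the 1000 Hz reference, and the base-2 band centres (reference x 2^(k/3)) are my assumptions, and the peak list in the example is made up purely to show the call.

```python
import math
from collections import Counter


def third_octave_band(freq_hz, ref_hz=1000.0):
    """Index k of the 1/3 octave band centred on ref_hz * 2**(k/3)."""
    return round(3.0 * math.log2(freq_hz / ref_hz))


def modes_per_band(peak_freqs_hz, ref_hz=1000.0):
    """Map each band's centre frequency (rounded to 1 Hz) to its peak count."""
    counts = Counter(third_octave_band(f, ref_hz) for f in peak_freqs_hz)
    return {round(ref_hz * 2.0 ** (k / 3.0)): n
            for k, n in sorted(counts.items())}


if __name__ == "__main__":
    # HYPOTHETICAL peak frequencies, not measured data
    peaks = [1000, 1050, 1260, 1300, 1600]
    print(modes_per_band(peaks))
```

You would still have to pick the peaks off the spectrum yourself; this just does the bookkeeping.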
I often use an ancient (in PC terms; it was written for 286 processors) freeware FFT program called FFT4WAV3. One of its many quirks is that it can only use a sample of 34768 data points: one less and it crashes. To use a short sample I 'zero pad' the data, adding a period of silence before and after the actual tap. This artificially inflates the resolution, of course. So far as I can find out this is not a problem, as long as you recognize that you're doing it. With FFT4WAV3 you can export the 'real' and 'imaginary' parts of the FFT to a spreadsheet as comma delimited text, and this allows you to average over the levels in the spectrum plot to reduce the resolution to its proper limit if you want.
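Here is a minimal Python sketch of the same idea outside a spreadsheet: pad the tap with silence, turn the exported real and imaginary parts into levels, and average adjacent bins. None of this is FFT4WAV3's own code; the function names and the `group` parameter are hypothetical, and averaging magnitudes (rather than power) is simply the choice I made for the sketch.

```python
import math


def zero_pad(samples, target_len):
    """Centre the tap in target_len points, with silence before and after."""
    extra = target_len - len(samples)
    if extra < 0:
        raise ValueError("recording is longer than the target length")
    before = extra // 2
    after = extra - before
    return [0.0] * before + list(samples) + [0.0] * after


def magnitudes(real_parts, imag_parts):
    """Level of each bin from its real and imaginary parts."""
    return [math.hypot(re, im) for re, im in zip(real_parts, imag_parts)]


def average_bins(levels, group):
    """Average each run of `group` adjacent bins (last run may be shorter)."""
    out = []
    for i in range(0, len(levels), group):
        chunk = levels[i:i + group]
        out.append(sum(chunk) / len(chunk))
    return out


if __name__ == "__main__":
    print(zero_pad([1, 2, 3], 8))
    print(average_bins(magnitudes([3.0, 0.0], [4.0, 1.0]), 2))
```

A reasonable choice for `group` is the padded length divided by the length of the real data, since that is the factor by which the padding inflated the resolution.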
You will sometimes see references to 'windowing' the data. Ideally the data sample should start and end with vibration amplitudes near zero. There are various ways to accomplish this; zero padding is one, but others such as the 'Hamming' and 'Hanning' windows use a mathematical process to, essentially, change the gain so that the signal starts off low, builds up, and then fades out. This is fine if the signal spectrum itself doesn't change during the sample time: for a tap or other 'noisy' signal it's better to use a 'rectangular' window (no change in gain) and zero padding if needed, or so I understand.
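To make the 'change the gain' idea concrete, here is a short Python sketch of the three windows, using the usual textbook cosine formulas. The function names are mine, and a real FFT program may scale or define its windows slightly differently.

```python
import math


def rectangular(n):
    """No change in gain: every point is multiplied by 1."""
    return [1.0] * n


def hann(n):
    """'Hanning' window: gain rises from 0 to 1 and falls back to 0 (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def hamming(n):
    """Hamming window: same shape, but the ends sit near 0.08, not 0 (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(samples, window):
    """Multiply each data point by the window gain at that point."""
    return [s * w for s, w in zip(samples, window)]


if __name__ == "__main__":
    print([round(w, 3) for w in hann(5)])
    print([round(w, 3) for w in hamming(5)])
```

You can see why a tap is awkward for these: the loud part of a tap is right at the start, which is exactly where the Hanning and Hamming gains are lowest.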