thoughtsfoki.blogg.se - Piano transcriber

Time-frequency representations in the form of spectrograms still seem to have a distinct advantage over the raw audio input, as mentioned in. The exact parameterization of spectrograms is not entirely clear, however, so we try to address this question in a systematic way. We investigate the suitability of different types of spectrograms and constant-Q transforms as input representations for neural networks and compare four types of input representations: spectrograms with linearly spaced bins (S), spectrograms with logarithmically spaced bins (LS), spectrograms with logarithmically spaced bins and logarithmically scaled magnitude (LM), and the constant-Q transform (CQT). The filterbank for LS and LM has a linear response (and lower resolution) for the lower frequencies, and a logarithmic response for the higher frequencies. Furthermore, we re-scale the magnitudes of the spectrogram bins to be in the range. Table 1 specifies which parameters are varied for which input type. For the computation of spectrograms we used Madmom, and for the constant-Q transform we used the Yaafe library.

Table 1: For each spectrogram type, these are the parameters that were varied. See text for a description of the value ranges.

In Figure 1, we can see the mean performance attainable with different types of spectrograms for both model classes. The error bars indicate the standard deviation of the spread in performance caused by the rest of the varied parameters. Surprisingly, the spectrogram with logarithmically spaced bins and logarithmically scaled magnitude, LM, enables the shallow net to perform best, even though it is a clear mismatch for logistic regression. The lower performance of the constant-Q transform was quite unexpected in both cases and warrants further investigation.

Figure 1: (a) Mean logistic regression performance dependent on type of spectrogram (b) Mean shallow net performance dependent on type of spectrogram

Table 2: The three most important parameters determining input representation for different model classes

4.2 Greater context

We computed a logarithmically filtered spectrogram with logarithmic magnitude from audio with a sample rate of 44.1 kHz, using a filterbank with 48 bins per octave, normed area filters, no circular shift and no zero padding. The choices for circular shift and zero padding ranked very low on the importance scale, so we simply left them switched off. This resulted in only 229 bins, which are logarithmically spaced in the higher frequency regions and almost linearly spaced in the lower frequency regions, as mentioned in Section 2.1. The dense network was presented one frame at a time, whereas the convolutional network was given a context in time of two frames to either side of the current frame, for a total of 5 frames. The results for framewise prediction on the MAPS dataset can be found in Table 5.
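The input pipeline described in this section (a filterbank with 48 bins per octave, logarithmic magnitude, and a five-frame context for the convolutional network) can be illustrated with a minimal NumPy sketch. It is not the Madmom implementation used for the experiments. The frame size, the frequency limits `fmin` and `fmax`, the `log10(1 + x)` compression, and zero padding at the sequence edges are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def log_center_bins(sample_rate=44100, frame_size=4096,
                    bands_per_octave=48, fmin=30.0, fmax=8000.0):
    """Map log-spaced center frequencies onto FFT bins, dropping duplicates.

    frame_size, fmin and fmax are illustrative assumptions, not values
    taken from the text.
    """
    n_octaves = np.log2(fmax / fmin)
    n_freqs = int(np.floor(n_octaves * bands_per_octave)) + 1
    freqs = fmin * 2.0 ** (np.arange(n_freqs) / bands_per_octave)
    bins = np.round(freqs * frame_size / sample_rate).astype(int)
    # At low frequencies many log-spaced centers fall on the same FFT bin.
    # Removing duplicates leaves roughly linear spacing there.
    return np.unique(bins)


def log_magnitude(spec):
    """Compress magnitudes logarithmically (assumed form: log10(1 + x))."""
    return np.log10(1.0 + spec)


def add_context(frames, context=2):
    """Stack each frame with `context` neighbours on either side.

    frames: array of shape (num_frames, num_bins)
    returns: array of shape (num_frames, 2 * context + 1, num_bins)
    Edges are zero padded, which is an assumption of this sketch.
    """
    padded = np.pad(frames, ((context, context), (0, 0)), mode="constant")
    n = len(frames)
    windows = [padded[i:i + n] for i in range(2 * context + 1)]
    return np.stack(windows, axis=1)


if __name__ == "__main__":
    bins = log_center_bins()
    steps = np.diff(bins)
    print("first bin steps:", steps[:5])   # small steps at low frequencies
    print("last bin steps:", steps[-5:])   # larger steps at high frequencies

    frames = log_magnitude(np.random.rand(10, len(bins)))
    print("convolutional input shape:", add_context(frames).shape)
```

With `context=2`, the middle slice of each window is the current frame, so a dense network that sees one frame at a time corresponds to `context=0`.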