Vocoders: a Conceptual Model for ITC EVP?

October 05, 2014

Ever since I came into contact with Electronic Voice Phenomena (EVP [0]), a few points regarding paranormal audio have caught my attention.

First, there’s too little conceptual information about how the voices are actually generated (not how they are recorded, but how they are really produced).

Second, there’s little consensus on the terminology that independent ITC researchers employ.

Third, overall recorded audio quality is far from ideal (even with high-resolution digital hardware), because different researchers settle on very different approaches (e.g., there are people out there still using old techniques, developed by the pioneers of the field, long before the digital computer revolution became a reality).

Finally - and foremost - all these issues are aggravated by the fact that no conceptual model is offered to the general public. There’s the religious tendency to believe that EVP comes from other dimensions, produced by deceased people. And that’s basically it.


First of all, let’s review how trans audio is recorded.

Paranormal researchers play background sound in the environment and start a recording session (usually asking questions). In some setups, the background sound has a specific structure (we’ll talk more about this later); in other scenarios, background noise (such as white noise) is chosen. The expression “background noise” refers to both cases.

In either case, recorded answers are the expected outcome - or at least intelligible/rational speech that did not come from the recording environment or nearby (radio sounds, next-door conversations, TV, neighborhood street noise, and so on).

One theory is that the phenomenon is made possible by technology from other dimension(s).

Even when dealing with trans voices from the deceased, we’ll assume that the process is based on (vastly) superior technology (it’s common to find references to stations in the specialized literature).

This is a strong assumption, similar to the ones made in A Challenge-Response Protocol for ITC Authentication.


One common aspect of paranormal voices has always struck me as being very special: unnatural prosody. It’s common for some voices [1] to feel artificial, almost robotic. So I started to play with the idea that some sort of speech synthesis generates them.

This is something difficult to prove (in a strict mathematical sense). Even for regular audio, asserting that some recorded speech is not natural is a hard task. From A Challenge-Response Protocol for ITC Authentication:

advanced methods, like hidden Markov models, are known to deliver extremely good results for speech synthesis and recognition, to the point that they can be used to bypass security systems; researchers had to develop countermeasures to avoid false positives in biometric authentication

ITC researchers normally talk about sound [de]modulation, to explain the fact that background noise is somehow manipulated to become paranormal voices.

I’m not concerned with the physics (like quantum mechanics) that allows transportation of trans audio from one dimension to another (I could barely explore this subject; I have no knowledge about it, and believe this is something for NASA, or maybe Google).

I’m concerned with the corresponding paranormal data stream, and interested in what is generated (i.e., the resulting signal). And here lies what I consider the most intrinsic aspect of the whole idea of paranormal synthesized speech: data compression.

As strange as it may sound, data compression allows us to build a conceptual model about EVP. And this framework provides interesting theoretical elements for the whole field of ITC audio investigation.


Imagine that we are capable of sending information back and forth between two very different dimensions. Maybe some kind of modulation is employed by this unbelievable technology to carry communication signals (demodulated at the receiving endpoint).

Wouldn’t the engineers behind this spectacular apparatus care about resource optimization?

One way to accomplish this could be to improve bandwidth/throughput while reducing energy consumption.

Exceptional intellects are supposed to know information theory. It would be easy for them to encode signals using some form of data compression. Lossy or lossless, even our own telecommunication systems know how to take advantage of this.


The general problem of “paranormal data stream” bit rate reduction can be solved by redundancy elimination. But there’s one more issue to work around: does a deceased person have a vocal tract (figure 1)? This may sound silly, but something has to “talk” to our equipment.

Figure 1 - sagittal section of the human vocal tract

So far, we identified the general problem of data channel optimization (solvable by some advanced compression form), and the specific problem of voice encoding. Could data compression also be used to handle the second issue?

In fact, this is possible. It’s so practical that even our own technology leverages it every day (e.g., in mobile telephony), through speech coding.

Parametric models can represent the spectral envelope of digital speech signals in very compact form, using predictive coding (an effective mechanism to deliver good, intelligible voices at low bit rates, without full waveform transfers).
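To make “predictive coding” concrete, here is a minimal sketch (illustrative only, not any specific codec): each sample is predicted as a weighted sum of past samples, and only what the predictor cannot explain (the residual) would need to be encoded. For a highly predictable signal such as a pure tone, a two-coefficient predictor leaves almost nothing behind:

```python
import math

def lpc_residual(signal, coeffs):
    """Predict each sample as a weighted sum of the previous samples
    (coeffs[0] weights the most recent one) and return the residual."""
    order = len(coeffs)
    residual = []
    for n in range(len(signal)):
        past = signal[max(0, n - order):n]
        pred = sum(a * s for a, s in zip(coeffs, reversed(past)))
        residual.append(signal[n] - pred)
    return residual

# A pure sine obeys x[n] = 2*cos(w)*x[n-1] - x[n-2] exactly, so a
# 2-tap predictor cancels it almost perfectly.
w = 2 * math.pi * 440 / 8000                   # 440 Hz tone at 8 kHz
tone = [math.sin(w * n) for n in range(200)]
res = lpc_residual(tone, [2 * math.cos(w), -1.0])

signal_energy = sum(s * s for s in tone)
residual_energy = sum(r * r for r in res[2:])  # skip warm-up samples
print(residual_energy < 1e-6 * signal_energy)  # True: little left to encode
```

In real speech codecs the coefficients are re-estimated every frame and the residual itself is modeled (or replaced by a synthetic excitation), which is where the large savings come from.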


Vocoders are systems used for the analysis and synthesis of human speech, originally developed for telecommunication applications as a way to compress speech to lower bit rates.

They generate only the parameters of the vocal model, instead of a sample by sample waveform description. This allows significant reduction in the bandwidth required to transmit speech over communication links.
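The magnitude of that reduction is easy to put in numbers. A back-of-the-envelope comparison, using the figures of the classic LPC-10 vocoder (the old US federal standard: 54-bit parameter frames every 22.5 ms) against raw narrowband telephone PCM:

```python
# Raw narrowband telephone speech: 8000 samples/s at 8 bits each (mu-law PCM).
pcm_bps = 8000 * 8                         # 64,000 bit/s

# A classic LPC vocoder (LPC-10) sends only frame parameters - pitch, gain,
# the V/UV decision, and the predictor coefficients - packed into 54 bits
# per 22.5 ms frame.
frame_bits = 54
frames_per_second = 1000 / 22.5            # ~44.4 frames/s
lpc_bps = frame_bits * frames_per_second   # 2400 bit/s

print(round(pcm_bps / lpc_bps, 1))         # 26.7x less bandwidth
```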

Modern vocoders - implemented in many different ways - are used in multiple scenarios: game development, linguistics, telephony, security systems, and even in medical research (just to name a few).

For the sake of simplicity (but without any lack of generality), we’re going to examine traditional linear predictive coding (LPC). It’s a powerful tool to expand on the idea of vocoders as good conceptual models for EVP.


I’m not going to dissect linear predictive vocoding. I refer the reader to [2] and [3]. My goal is to present just enough of the model in this section, and move on. Later, the basics will show their relevance.

LPC abstracts the human vocal tract mathematically, embracing the source-filter model of speech production. The simplified view:

Figure 2 - human vocal tract source-filter model simplification

The left side of the arrow depicts figure 1. The right side shows the source-filter model (an approximation of the reality of speech production).

For LPC, a signal is broken into short chunks that are assumed stationary. These frames are then classified as voiced (V) or unvoiced (UV).

V segments are described by their energy level (gain, G), pitch, and linear predictive parameters (a set of coefficients, on the order of 10 to 12+ numbers). UV segments are synthesized with white noise.
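For the curious, extracting those coefficients is classic DSP. A minimal Levinson-Durbin recursion - the standard way of solving the LPC normal equations from a frame’s autocorrelation - is sketched below (illustrative only, not Praat’s or any codec’s exact implementation; the synthetic “frame” is an assumption for the demo):

```python
import math
import random

def autocorr(frame, lag):
    """Biased autocorrelation at the given lag."""
    return sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))

def lpc_coefficients(frame, order=10):
    """Levinson-Durbin recursion: fit an all-pole model to one frame,
    returning the predictor coefficients and the final residual energy."""
    r = [autocorr(frame, k) for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient, |k| < 1
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # prediction error shrinks each step
    return a[1:], err

# One 20 ms "frame" (160 samples at 8 kHz) of a synthetic voiced-like sound:
# two harmonics plus a little noise, so the model is not degenerate.
rng = random.Random(0)
fs = 8000
frame = [math.sin(2 * math.pi * 200 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 400 * n / fs)
         + 0.01 * rng.gauss(0.0, 1.0) for n in range(160)]
coeffs, err = lpc_coefficients(frame, order=10)
print(len(coeffs), err < autocorr(frame, 0))   # 10 True
```

Each V frame would then travel as these ~10 numbers plus pitch and gain, instead of 160 waveform samples.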

A speech signal is decoded by the same parametric representation that was used in the encoding/analysis [4]. The streamlined process:

Figure 3 - LPC speech synthesis

An analogy to music helps convey its value for speech transmission: if we picture voice as music, no song waveform is sent to the receiver; a score is dispatched, and the original composition is [re]played. For an outside listener, depending on the quality of the notation/player, little to no difference can be perceived (LPC would be the notation; a synthesizer, the player).


Sonia Rinaldi, a recognized ITC researcher from IPATI, developed computer based trans audio recording techniques, capable of registering paranormal voices of extraordinary quality.

One of her methods is based on a specially crafted background noise. Syllables of real human voices are broken up, and then concatenated, to produce a sequence of unrelated sounds.

At the same time, this avoids false positives when transcribing the recordings, and gets “modulated” into excellent trans voices.

Another technique of hers has similar results. But background noise is crafted in a slightly different manner: two (or more) foreign language tracks are mixed together (sometimes, with some shift among pauses/voices), producing, again, an unintelligible sequence of very rhythmic human sounds (regarding phonemes). There’s a partial and old description of this in English.

UPDATE NOTE, Apr/2017: according to EVP and New Dimensions (by Alexander MacRae), researcher George Gilbert Bonner employed a very similar multi-voicing method, based on three foreign language recordings played simultaneously as the noise source.

One important question may be raised about the aforementioned (digital) methods: why did they succeed - outperforming classic procedures in terms of quality - where others have failed?

I believe vocoders may provide a reasonable explanation - as free of doubt as possible in this kind of discipline (and taking into account that a supposition is at the core of this work).


All LPC schemes are based on the same model: excitation signals and filter. In the basic variant presented, excitation was an impulse train (for V frames), or white noise (for UV frames). Complex encoders may use more advanced designs.
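The two basic excitation types, and how they drive the same synthesis filter, can be sketched as follows (the 2-pole “vocal tract” filter here is hypothetical - just a stable resonance chosen for illustration):

```python
import random

def synthesize(excitation, coeffs, gain=1.0):
    """All-pole LPC synthesis: y[n] = gain*e[n] + sum_j coeffs[j]*y[n-1-j]."""
    out = []
    for n, e in enumerate(excitation):
        y = gain * e + sum(coeffs[j] * out[n - 1 - j]
                           for j in range(min(len(coeffs), n)))
        out.append(y)
    return out

def impulse_train(length, pitch_period):
    """Voiced excitation: a unit pulse at the start of each pitch period."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]

def white_noise(length, seed=0):
    """Unvoiced excitation: zero-mean random samples."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]

# Same filter, two excitations: periodic pulses give a buzzy, pitched sound;
# noise through the identical filter gives a whispery quality instead.
coeffs = [1.3, -0.9]                  # poles inside the unit circle (stable)
voiced = synthesize(impulse_train(160, 40), coeffs)
whisper = synthesize(white_noise(160), coeffs)
print(len(voiced), len(whisper))      # 160 160
```

Swapping only the excitation while keeping the filter fixed is exactly the degree of freedom explored in the listening experiments later in this post.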

From figure 3, it’s possible to see that a voiced frame’s pulse train consists of unit amplitude pulses, produced at the beginning of each pitch period. This emulates the vocal tract excitation (generated by the vocal cords). With periodic pulses, speech sounds mechanical. According to Atal, this is far from ideal:

it is now well recognized that such a rigid idealization of the vocal excitation is often responsible for the unnatural quality associated with synthesized speech

He presents an alternative multi-pulse model for the excitation of LPC synthesizers (emphasis mine):

we find that this model is flexible enough to provide high quality speech even at low bit rates; (…) of course, if the number of pulses is increased to arbitrarily large value so that there is a pulse at every sampling instant, it should be possible to duplicate the original speech waveform (at the expense of a high bit rate)

Maybe, this is one connection between vocoders and the syllable EVP recording method: when small time windows are adopted for broken sounds, the likelihood of some phonemes being used as good frame excitation signals is increased (i.e., there are more opportunities for ideal pulse amplitude and location matches).

I like to think of the work of background noise crafting as randomized algorithms that happen to produce good excitation for a vocoder of some type - a paranormal one.

Readers may raise an important question: wouldn’t random noise be more likely to provide even better matches? Assuming a vocoder conceptual model, the whispered trans voices of the past clearly show that quite the opposite is true.

When we use white noise as excitation, the resulting speech is somewhat noticeable, albeit very hard to clean up. A whispered effect also appears (in fact, this is common with other excitation forms, as we’ll see).

That kind of noise doesn’t really match what the vocal cords do. In natural speech, the vocal cords provide excitation, vibrating at a frequency that depends on the speaker, gender, age, and intended inflection.

Another common way of producing background noise is to use sinusoidal waveforms instead of random noise of any type. The result is also poor: there is just not enough spectral “richness” in that kind of signal [5].


One way to assess the validity of the vocoder model applied to paranormal research, in practice, would be to get a reasonable number of trans audio files and use LPC (or another vocoder model) to analyze them (not an absolute proof, though, considering inductive reasoning generalizations). Unfortunately, the underlying math makes this hard.

A best fit is always possible (in the mean-squared error sense). And negligible-residue thresholding poses fundamental challenges:

  • which vocoder implementation to adopt;
  • negligible residuals could be a side effect of recording and/or editing (the process could introduce digital artifacts not present in the original “paranormally vocoded” signal);
  • the choice of background noise influences vocoder decodings;
  • how much negligibility is enough?
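Still, a crude version of such an analysis is easy to state: compute the LPC prediction gain (frame energy over residual energy) and ask how high it must be before calling a frame “vocoded”. The sketch below - illustrative signals and predictor, nothing paranormal - shows why the answer is fuzzy: any predictor yields some gain, and the gain depends heavily on the material:

```python
import math
import random

def prediction_gain(frame, coeffs):
    """Frame energy divided by LPC residual energy: a crude measure of
    how well an all-pole model explains the frame."""
    order = len(coeffs)
    residual = [frame[n] - sum(coeffs[j] * frame[n - 1 - j]
                               for j in range(order))
                for n in range(order, len(frame))]
    e_sig = sum(s * s for s in frame[order:])
    e_res = sum(r * r for r in residual)
    return e_sig / e_res

w = 2 * math.pi * 150 / 8000                           # 150 Hz tone at 8 kHz
tonal = [math.sin(w * n) for n in range(240)]          # voiced-like frame
rng = random.Random(1)
noise = [rng.uniform(-1.0, 1.0) for _ in range(240)]   # noise-like frame

coeffs = [2 * math.cos(w), -1.0]   # ideal 2-tap predictor for this tone
print(prediction_gain(tonal, coeffs) > prediction_gain(noise, coeffs))  # True
```

The same predictor gives an enormous gain on the tone and a poor one on the noise; where to draw the “negligible residue” line between those extremes is exactly the thresholding problem listed above.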

Thus, a small reversed experiment is proposed in this section, because it’s not the intent of the current work to prove anything.

To exhibit the potential of the vocoder model, I’ll degenerate a non-paranormal speech file through rounds of LPC with different excitation signals. Especially for readers familiar with trans audio, the degenerations are similar to what is found in day-to-day ITC research.

I started with a 16 KB snippet of a file from pdsounds.org. Just to reinforce: this is not trans audio (the WAV contains a regular human voice).

A girl can be heard asking: “Can you hear me now?” (note: HTML5 audio tags must be supported by your browser; tested on Windows and Linux, with IE, Chrome, and Firefox):

As noted earlier, more complex implementations are outside the scope of this post. But just to exemplify the value of a complex vocoder, I compressed the original file with the SoX GSM codec (it’s LPC-based). The quality is outstanding, at only 1.7 KB (an interesting curiosity to learn what can be achieved without resorting to pulse-code modulation):

Traditional LPC analysis and synthesis (Markel/Gray, 1976 - see Sound: To LPC, autocorrelation) were done with Praat. This implementation is more than enough to evaluate different excitation signals.

Here, the corresponding speech was [re]synthesized using the original pulse trains, derived from Praat’s pitch detection. Speech is understandable, but it begins to feel unnatural (robotic prosody starts to show):

If we feed white noise as excitation to the original file’s LPC representation (for both V and UV frames), this is what comes out:

Note the typical whispered result. It’s radically different from the mix of the corresponding white noise and the original signal, presented below (the previous voice could be mixed with white noise, simulating more realistically a noise-based EVP recording environment):

The following speeches were synthesized with harmonic pure tones as excitation (top), and with Meek’s mix as input (bottom). The results are particularly awful:

As noted in [6] (a rare book, let me tell you), background noise is at the heart of paranormal audio recording. That is, the resulting quality is directly related to the noise played, as anything can be “borrowed” as a modulating source for trans audio (e.g., intentional and accidental use of mic feedback, coughing, creaking doors, water sounds, music, etc.; as expected, results are open to very subjective interpretations).

To assess one of these weird background noises, a small classical music excitation signal was tested (again, mixing the musical snippet with the original track would produce a very different output). The outcome is pretty interesting:

For the phoneme technique described earlier, a specially crafted background noise was built in an analogous way, with a female voice.

The pace was made faster, trying to match the time frame of the vocoder adopted. This way, good excitation signals are fed more regularly than would happen with normal syllables. That’s why a “burst” effect can be noticed “behind” the speech.

When the syllable method is applied, the burst is hardly noticeable. Still, whoever has tried the technique will note the peculiar resemblance:


No matter how advanced a technology is, it must operate within bounds, subject to constraints. Sending trans voices to our dimension would be no different.

We started with an assumption: paranormal voices are the product of high-level technology. As such, information theory is probably leveraged to optimize some resources.

Data compression was presented as a good solution for trans audio transmissions. Both as an optimization measure, and as a way to solve the problem of speech encoding.

Most importantly, we discussed the possibility of adopting vocoders as a conceptual model for EVP research. A small reversed experiment was conducted, suggesting, in practice, the level of affinity between speech synthesis and trans voices.

Task specialization is common when we talk about communication. Even our computer networks are abstracted in layers, by models such as OSI and INET. It would be only natural for this alleged paranormal apparatus to work in a similar way, decoupling speech encoding, modulation, transmission, demodulation, etc.

Of course, there are many more elements to balance in any ITC discipline. Take high-frequency background noise as an example: whenever used, and before filtering, does it have to be brought back into the human audible range for proper “borrowing” as an excitation signal to the theoretical paranormal vocoder?

Unfortunately, this is all too new. Right now, literature has more open questions (and theories) than concrete answers.

[0] - I’ll keep referring to EVP, (trans/paranormal/ITC) audio, speech, and/or voices interchangeably, in a more modern sense (i.e., not the local phenomenon of the past, usually registered by analog magnetic devices);
[1] - In particular, the ones captured digitally, from stations’ broadcasting;
[2] - http://www.data-compression.com/speech.html;
[3] - work of Atal; notably “Speech Analysis and Synthesis by Linear Prediction of the Speech Wave”, and “A new model of LPC excitation for producing natural-sounding speech at low bit rates”;
[4] - for exemplification purposes, residual/errors/innovations are left out of the discussion;
[5] - I tried to replicate some experiments of the past, and used a small mix of pure tones, described in “SPIRICOM - An Electromagnetic-Etheric Systems Approach to Communications with other Levels of Human Consciousness” (George Meek‘s Metascience Foundation booklet); the resulting “whisperings” almost drove me nuts to extract, filter, and clean up in my DAW of choice, Audacity;
[6] - “Gravando Vozes do Alem” (2005, pages 19-28); in English, that would be something like “Recording Voices from Beyond”;