Audio Editing

Great narrative potential.

As in conventional film, audio editing plays a decisive role in 360° film. Nevertheless, sound in 360° film is often neglected and used merely as accompaniment. This is surprising, because music and sound offer extensive storytelling possibilities in 360° space and are often much easier and cheaper to realise than elaborate visual stagings and effects. The focus on the visual over the audio may be attributed to the fact that many 360° film directors come from a tech or filmmaking background, which brings the visual possibilities to the foreground.

360° film sound post-production

Although the conception, realisation, and production of 360° film sound on set can be achieved with methods similar to those of conventional film, post-production requires fundamentally different technical means. 360° film is a child of the digital age: know-how and the willingness to constantly deal with new standards, numerous pieces of equipment, and technical developments are prerequisites for successful, high-quality 360° post-production.

Planning Communicative Aspects
It is important to stay in constant communication with those responsible for the images throughout post-production. Much can be planned and defined conceptually, but practice often deviates from the plan. A key aspect is the delivery of the audio data: mistakes frequently surface when the audio files reach the editor. For example, the four Ambisonic tracks may be inaccurate (the channel order changed, or the start position slightly shifted). In such cases, it takes both time and experience to restore everything. Recommendation: Whenever possible, record multi-channel files (e.g. 4 channels in one WAV file) on set. An experienced editor knows how to import and lock the files so that nothing unwanted happens. During export, multi-channel files are often split into individual mono tracks (AAF, OMF). This is not a problem as long as the splitting happens after editing; the files can easily be re-coupled into multi-channel files in the DAW (a minimal sketch of this re-coupling step follows below).

It is also important that the video files are delivered to the recording studio at the correct image resolution for the editing software. It is tempting to use compressed videos for fast performance, and when these are displayed in the opened 360° view mode, one can work with them well. However, when the results are continuously checked in a VR headset, placement errors are difficult to notice at poor video resolution. Moreover, viewing poorly compressed images is tiring (even the maximum resolution in VR is not very high). Recommendation: Even if the video is large, the audio should, whenever possible, be created and checked at the target image resolution. A high-performance workstation is a prerequisite for this.
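As an illustration of the re-coupling step mentioned above, a minimal Python sketch, assuming the numpy and soundfile libraries; the file names are placeholders, and the ambiX channel order W, Y, Z, X is assumed:

    # Re-couple four delivered mono tracks into one 4-channel Ambisonics WAV.
    import numpy as np
    import soundfile as sf

    # Order matters: for ambiX the expected channel sequence is W, Y, Z, X.
    mono_files = ["take01_W.wav", "take01_Y.wav", "take01_Z.wav", "take01_X.wav"]

    channels = []
    samplerate = None
    for path in mono_files:
        data, sr = sf.read(path)
        if samplerate is None:
            samplerate = sr
        assert sr == samplerate, "all tracks must share one sample rate"
        channels.append(data)

    # Tracks with shifted start positions betray themselves by differing lengths.
    assert len({len(c) for c in channels}) == 1, "track lengths differ - check the delivery!"

    # Interleave into a (frames x 4) array and write one multi-channel WAV.
    sf.write("take01_ambix.wav", np.stack(channels, axis=-1), samplerate, subtype="PCM_24")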
How the director/editor wants to review and approve the created audio must also be agreed upon. 360° audio is monitored binaurally, but the delivery is uncoded (e.g. as a 9-channel Ambisonic file plus a separate head-locked (HL) stereo file). Delivered this way, the audio must first be encoded and mixed with the film. Recommendation: Encode the audio together with the video directly in the studio (e.g. with the FB Encoder) and upload it to a suitable platform (e.g. Facebook or Youtube) for approval (not publicly, of course). This also ensures that the correct film version is used.

The basic technical infrastructure

The prerequisite is a powerful digital audio workstation (DAW) with plenty of SSD storage (or fast network storage). 360° audio files are two to four times as large as stereo files. In addition, projects are usually rendered in several formats and mixes, and the movie files to be dubbed are massively larger than conventional video formats. This requires a computer with at least twice as much RAM as a conventional audio computer, and a very fast CPU.

Spatial simulation algorithms calculate a lot of data simultaneously and must play back the audio with as little latency as possible. The spatialisation of sounds and the ever higher resolution of Ambisonic projects sometimes require up to 16 channels per track (3rd-order Ambisonics). If 90 tracks are to be played back simultaneously, a "workhorse", i.e. a very powerful computer, is necessary. Because of the multi-channel technology, high-quality plug-ins, which are DSP-hungry, need even more power and offloading options (look-ahead functions and low-latency processing).
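The 16-channel figure follows from the rule that an Ambisonics mix of order N requires (N + 1)² channels. A quick Python sanity check for planning track counts:

    # Channels needed per Ambisonics order: (order + 1) squared.
    def ambisonic_channels(order: int) -> int:
        return (order + 1) ** 2

    for order in range(4):
        print(f"order {order}: {ambisonic_channels(order)} channels")
    # order 0: 1 / order 1: 4 / order 2: 9 / order 3: 16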

Maximum GPU performance is also important. For clear and ergonomic work, several monitors are advisable. Likewise, VR glasses operated directly from the DAW are indispensable for mixing and monitoring under real-time conditions.

High-quality multi-channel plug-ins are also needed. Reverberation algorithms must perform in real 360°, not just reverberate to the front and back. Sound-shaping tasks such as equalising, compressing, denoising, and limiting must work extremely accurately because of the required channel and phase coherence, and should be able to process up to 16 channels simultaneously without deviations. Listening on headphones reveals many things that cannot be perceived with conventional loudspeakers.

High-quality headphones are essential. Closed-back reference headphones with high impedance allow maximum control. It is also recommended to test with average closed headphones of medium or low impedance in order to simulate consumer use.

A fast internet connection is required throughout the process. Today, film sound is constantly uploaded and downloaded during the approval process. With massively larger data, a poor connection can jeopardise not only project deadlines but also communication with production partners.

Which formats should be delivered?
Recommendation: This decision must be made by the recording studio based on the requirements of the target platforms and media. Video houses are often overwhelmed by all the audio possibilities (with or without HL stereo, the 6-channel Youtube standard or the 8-channel Facebook format, 9- or 8-channel audio for glasses, etc.).

Formats

A-Format
A-format is the unprocessed raw format output directly by the Ambisonic microphone (e.g. Ambeo). The 4 channels correspond to the orientation of the capsules, which record signals with figure-of-eight characteristics:

  • Channel 1: FLU (Front-Left-Up), pointing front left up (opposite direction: BRD)

  • Channel 2: FRD (Front-Right-Down), pointing front right down (opposite direction: BLU)

  • Channel 3: BLD (Back-Left-Down), pointing back left down (opposite direction: FRU)

  • Channel 4: BRU (Back-Right-Up), pointing back right up (opposite direction: FLD)

The 4 microphone capsules with figure-of-eight characteristics record all directions, creating a seamless audio sphere with which the environment can be recorded as 360° audio.

B-Format
B-format is the A-format converted into the coordinate-system channels X (front/back), Y (left/right), and Z (top/bottom).

The 4th channel, the W channel, is the actual centre: it contains the signals from all sides, like a microphone with an omnidirectional characteristic. By combining all 4 channels, audio objects can be placed at any position in the sphere. A distinction is made between the lesser-known FuMa format and the widely used ambiX format described below.
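For illustration, a minimal Python sketch of the classic A-format to B-format conversion matrix. Real converters additionally apply capsule-matching filters; this shows only the basic sum/difference relationships, up to normalisation:

    import numpy as np

    def a_to_b(flu, frd, bld, bru):
        """flu, frd, bld, bru: numpy arrays holding the four capsule signals."""
        w = flu + frd + bld + bru   # omnidirectional centre channel
        x = flu + frd - bld - bru   # front/back axis
        y = flu - frd + bld - bru   # left/right axis
        z = flu - frd - bld + bru   # top/bottom axis
        return w, x, y, z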

ambiX
Due to its relatively simple handling, the ambiX format (Ambisonics exchangeable) is widely popular. As the name suggests, it is well suited for exchange between different applications. Most plug-ins expect ambiX files so that the directional mapping remains correct after processing.
Although most Ambisonic microphones produce the so-called A-format, many recorders (e.g. the Zoom F8) can convert it into an ambiX file. There are also plug-ins (free to download) that convert the A-format to ambiX in the user's DAW.
The ambiX file has 4 channels converted from the A-format to the corresponding axis assignment. In contrast to the FuMa format, ambiX does not order the channels alphabetically, but as follows:

  • Channel 1: W

  • Channel 2: Y

  • Channel 3: Z

  • Channel 4: X

If this order is not followed, the directional perception becomes wrong and unclear. The dynamic relationships between the individual channels must always be maintained in the mix: if compressors, EQs, or limiters are used incorrectly, or if the routing within the DAW is faulty, the 360° image is affected.
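For first-order material, the difference between the two conventions boils down to the channel order and the scaling of W (FuMa attenuates W by 3 dB; the SN3D normalisation used by ambiX does not). A hedged Python sketch of a FuMa-to-ambiX conversion:

    import numpy as np

    def fuma_to_ambix(fuma: np.ndarray) -> np.ndarray:
        """fuma: array of shape (frames, 4), channels in FuMa order W, X, Y, Z."""
        w, x, y, z = fuma.T
        # Reorder to W, Y, Z, X and undo the -3 dB on W.
        return np.stack([w * np.sqrt(2.0), y, z, x], axis=-1)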

FOA vs. HOA
FOA means "first order ambisonics": the conversion of the sphere into the aforementioned 4 channels.

Representation of the mentioned 4 channels according to the FOA principle («first order ambisonics»).

This technique can be refined further to obtain more exact placement points for audio objects. If, for example, 8 microphone capsules were packed into an Ambisonic microphone instead of the existing 4, each capsule would have a smaller area to cover. A more differentiated image of the acoustic environment becomes possible, because there is less overlap from capsule to capsule.
HOA means «higher order ambisonics». As the diagram below shows, the sphere can be divided into increasingly detailed areas.

Representation of the increasingly finely divided channels according to the HOA principle («higher order ambisonics»).

The first 4 channels (0-3) remain FOA, so if you listen to just the first 4 channels of an HOA mix (e.g. 16 channels), you simply get an FOA mix. Today there are plug-ins that "upmix" FOA into a higher order (e.g. Harpex). Their main application is the spatialisation of object-related audio sources. Although the localisation becomes more precise, this also considerably increases the CPU and memory demands on the computer.
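Because the orders are nested in this way, an FOA version can be pulled out of a higher-order file simply by keeping the first four channels. A small Python sketch with a hypothetical 16-channel (3rd-order) file name:

    import soundfile as sf

    hoa, sr = sf.read("mix_3rd_order.wav")   # shape: (frames, 16)
    foa = hoa[:, :4]                         # channels 0-3 are the FOA subset
    sf.write("mix_foa.wav", foa, sr, subtype="PCM_24")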

Youtube, Facebook, HMDs

As mentioned, different platform categories require different formats. Currently, the most common are the «web formats»:

  • Youtube Standard Video: 4-channel FOA ambiX

  • Youtube with HL stereo: 4-channel FOA ambiX + 2-channel stereo file

  • Facebook Standard: 8-channel Facebook file

  • Facebook with HL-Stereo: 8-channel Facebook + 2-channel stereo file

There are separate formats for HMDs (e.g. Oculus Video), but all HMDs can play the web formats. For integration into VR or game applications, there are further special formats. The supported orders will continue to increase, so that Youtube, for example, can process and play back 2nd-order Ambisonics.

Practical Limits of sound design

With fast computers, high-quality plug-ins, and VR glasses, the possibilities of sound design seem limitless. Technically, this is almost the case. In practice, however, there are limitations that must be considered.

Coupling of eye and ear
Everything that happens in the «real» world we can grasp well with our senses and check against our perceptual experience. If, for example, moving objects are recorded by an Ambisonic microphone, they are naturally coupled to the optical system. The interaction between ear and eye works very well within the familiar field of vision (approx. 214° horizontally, approx. 70° up and down). Everything outside this field of vision is also outside our focus. Unknown acoustic events outside this zone force us to turn our head or even our whole body, because primal instincts require us to assess whether there is a threat and whether we must flee or fight. We should therefore be careful not to trigger these "reaction patterns" too often or at an awkward moment; they quickly lead to confusion and stress.
An absolute "don't" is the unclean placement of audio objects coupled to visual objects. Because of the habitual connection between ear and eye, a voice that is constantly shifted slightly to the right of the person speaking, for example, can cause irritation and even physical reactions such as headaches.

Rapid changes of direction, ear distance, and HRTF
We are quickly overwhelmed in 360° space when, for example, a sound suddenly whizzes by below us and comes to a stop above and in the middle. We are physically awkward along the X-axis anyway: if something flies directly towards us and underneath us, we can track it only until we are looking at our own body, and we would have to make a 180° turn in a fraction of a second to see it fly away from us seamlessly. Beyond that, changes of direction in the 360° film world feel much more direct, and thus more stressful, because headphones bypass the natural separation of our ears by approx. 17 cm (head width) and thus the natural delays of sound waves bending around our head.

In-ear Ambisonic microphones try to take this effect into account and achieve a "natural" ear distance. For self-created audio objects that are flown around, the HRTF model used is crucial. An HRTF model (head-related transfer functions) essentially simulates the ear distance and is used for binaural decoding (i.e. the conversion of Ambisonic files into a 360° stereo file playable on headphones). It is based on the assumption of an "average head", which is a very big compromise. One must be aware that a 360° production therefore sounds different for each person, and the directional perceptions are not 100% identical.
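To illustrate what the binaural decoding step does in principle, here is a deliberately crude, non-HRTF stereo decode in Python: two virtual cardioid microphones pointing at +/-90°. It conveys left/right placement but none of the elevation or front/back cues that an HRTF-based decoder adds. The ambiX channel order W, Y, Z, X and the file name are assumptions:

    import numpy as np
    import soundfile as sf

    ambix, sr = sf.read("scene_ambix.wav")   # 4-channel FOA file
    w, y, z, x = ambix.T

    left = 0.5 * (w + y)    # virtual cardioid aimed at the left ear
    right = 0.5 * (w - y)   # virtual cardioid aimed at the right ear

    sf.write("crude_stereo.wav", np.stack([left, right], axis=-1), sr)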

Automation of movements/time dynamic sound design

For mono tracks in conventional sound, movements are automated with one parameter: the position on the Y-axis. You decide where a sound sits between the left and right channels. A mono audio object in 360° film, on the other hand, needs 4 parameters:

  • Position on the Z-axis 

  • Position on the X-axis 

  • Position on the Y-axis 

  • Distance to POV 

Even if there is no movement on an axis, the system must "know" where the object is (its starting point). Especially if you record mouse movements as automation, you will notice that each of the 4 parameters is affected. If you work with a stereo object, there are 8 parameters just for the placement of the two channels. On top of that come parameters from plug-ins, such as changing EQ curves to increase the sense of distance, effect sends, volume gradients, and dynamic corrections. A single audio track in a 360° project can therefore easily have 12-16 parameters that need to be changed and adjusted over the film's timeline. If you consider that there are editing changes on the image side, or that you may want to couple different audio objects to execute the same movements, the complexity rises quickly. Although technically everything is feasible, at the beginning of a project you should be clear about what you are promising as a sound designer and what this means for the implementation, i.e. the effort and costs.
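How the four placement parameters map onto a first-order signal can be sketched in a few lines of Python. This is an illustrative encoder (ambiX channel order, SN3D normalisation) with a naive 1/distance gain as an assumption; in practice, a spatialiser plug-in automates these values over the timeline:

    import numpy as np

    def encode_mono_foa(signal, azimuth_deg, elevation_deg, distance):
        """signal: mono numpy array; angles in degrees; distance relative to the POV."""
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        s = signal / max(distance, 1.0)          # naive distance attenuation
        w = s                                    # omnidirectional component
        y = s * np.sin(az) * np.cos(el)          # left/right
        z = s * np.sin(el)                       # up/down
        x = s * np.cos(az) * np.cos(el)          # front/back
        return np.stack([w, y, z, x], axis=-1)   # ambiX order: W, Y, Z, X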

Compensating for miking problems

On a 360° set, nothing should be in the picture that is not part of the story. In 80% of all productions, the microphone, and often the recorder as well (which also sits somewhere under the tripod), is not positioned in a way that allows it to be simply retouched out in post-production. The resulting mistakes can only be partially corrected in post-production.

The main problems are: 

  • Microphone too close to the ground. The closer a microphone is placed to the ground, the more sound reflections are picked up from below. The sound image starts to distort: if the microphone sits well below the usual listening height of approx. 170 cm, the scene sounds the way a child would hear it. This "error" is acoustically so complex that it cannot be restored with the means of a DAW.

  • Microphone too close to the stand. Depending on the thickness of the stand, this produces a sound shadow: a kind of wedge in the sound field where the sound is bent or does not reach the correct capsule at all. The resulting localisation deficits can only be corrected to a limited extent.

  • Noise between microphone and camera. Particular care must be taken here. A noise that occurs between the microphone and the camera appears, for example, on the right side of the picture but on the left side of the sound field. This can be corrected with the rotation options (see the sketch after this list), but only if other, correctly placed sounds are not affected by the rotation.

  • Drone flight. Because of the loud whirring, 360° shots from the sky always have to be audio-dubbed. Noises need to be diligently collected in order to create as authentic a sound image as possible. During the sound design it then becomes clear how much of what you see below gives off sound. Certain audio objects run parallel to the drone flight, others run diametrically or turn around the POV, because the drone may be flying in place and rotating. A distant city comes closer and goes from mono to stereo, a propeller plane roars by in the visible distance, and so on. Here the result stands and falls with the conception of the shot in advance. It has proven useful to use a crane, set up at different positions of the drone flight, to produce sound-only captures. The "background noise" is thus optimally captured, the audio files can be cross-faded, and the rotation can be combined with the drone movement. Important sounds are then selectively dubbed and treated with the usual methods for achieving distance and spatiality.
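A Python sketch of the rotation option mentioned above: rotating a first-order sound field around the vertical axis only mixes the X and Y channels, while W and Z stay untouched (ambiX channel order assumed). This also shows why a rotation applied to fix one misplaced noise inevitably turns every other sound in the scene with it:

    import numpy as np

    def rotate_yaw(ambix: np.ndarray, angle_deg: float) -> np.ndarray:
        """ambix: array of shape (frames, 4); a positive angle turns the whole scene."""
        a = np.radians(angle_deg)
        w, y, z, x = ambix.T
        x_rot = x * np.cos(a) - y * np.sin(a)
        y_rot = x * np.sin(a) + y * np.cos(a)
        return np.stack([w, y_rot, z, x_rot], axis=-1)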

Outlook

Sound design in 360° film is subject to more advanced and complex production techniques and acoustic laws than conventional film. Consistent attention to the following aspects can lead to a smooth sound production without expensive or frustrating surprises at the end of the project:

  • Conception, sound production, and supervision across the whole project are necessary (the set sound team must know what post-production can do; post-production must know how the sound was recorded on set. Recommendation: photo documentation).

  • Due to the high complexity of 360° sound, budgets should allow for longer production times and higher costs.

  • Comprehensive knowledge of microphone technique, sound engineering, acoustics, and psychoacoustics is necessary at all stages of the sound design process.

  • Production and design methods are heavily subject to changing standards and to platform and technology developments. Engaging with these intensively is essential.

  • The reception situation determines how the sound is applied and mixed; different mixes have to be made for different situations.

  • Ongoing testing of the sound with VR glasses is important. Soundscapes that sound good on headphones and were created on the screen are not always best for the VR viewing experience.

  • Because everything is heard through headphones, special attention must be paid to clean sound work.

  • 360° sound design requires high-performance infrastructure and the know-how for system maintenance and scaling.

  • Not everything that software tools and plug-ins promise is purposeful and necessary. One must be aware that these technologies are also subject to the laws of the market and that needs are deliberately created.

  • The wish for perfect 360° sound will probably never be completely fulfilled; it is therefore about getting the maximum out of the existing means. What counts in the end is a great, convincing, and captivating overall experience, regardless of how many "sound rules" one has disregarded or followed!

A look into the future of sound design for 360° film suggests that:

  • Cameras and high-quality microphones will move even closer together as a technical unit (sharing exactly the same POV).

  • Remote monitoring functions for the sound equipment will become simple and stable.

  • Platform providers should make an effort to standardise the formats, for the benefit of all involved.

  • At least 2nd-order Ambisonics or higher will be used.

  • Personal HRTF profiles will be loadable, so that the binaural decoding can be matched to the physical characteristics of the viewer.

Most importantly: the glasses required for enjoying 360° productions will become smaller, lighter, more ergonomic and, above all, will provide better picture quality. This will be decisive in determining whether the format is given a chance.

Further Links