I²S is, err, fun.
What is I²S
Well, first off, the name works like I²C, which is an acronym with two Is in it that people then treat like a mathematical expression and so write as I²C. I²C is IIC. I²S follows the same annoying logic: it is in fact IIS, which is Inter-Integrated Circuit Sound (often shortened to Inter-IC Sound), which Wikipedia says is pronounced "eye-squared-ess"[citation needed], and yes, I know nobody that says squared! It is eye-two-ess in my book.
But what is it exactly?
Well, it is a standard (and I use the term loosely) for audio over digital signalling.
What this means in practice is digital microphones and digital speaker driver chips. And to be honest I am amazed. This is clever tech, and pretty much a result of mobile phones.
The options
One challenge is that there are many options. It seems that, at least, you may need:
- MCLK, which seems to be a master clock at a higher frequency. Now, whilst the code and hardware I use on the ESP32 understand this, it seems it is usually not needed.
- A WS or LR clock (WS is officially "Word Select", though "Which Side" would fit just as well, and LR is "Left/Right").
- A BCLK - a bit clock.
- A data line for the clocked data in or out.
This is not too bad, especially if MCLK is not needed.
Basically the WS clock is slower and the BCLK is faster, so each side gets a number of BCLK cycles in which to clock its bits of data. Simple enough.
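The relationship between the two clocks can be put as simple arithmetic. This is a minimal sketch, assuming two slots (left and right) per WS cycle; the function name is made up for illustration.

```python
def bclk_hz(sample_rate_hz, bits_per_slot, slots=2):
    """Bit clock rate needed to clock every bit of every slot once per sample."""
    return sample_rate_hz * bits_per_slot * slots

# WS runs at the sample rate; BCLK runs bits_per_slot * slots times faster.
print(bclk_hz(48000, 32))  # 3072000
print(bclk_hz(44100, 16))  # 1411200
```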
PDM mode
The first thing that threw me is that there are I2S microphones that use PDM. I really don't quite grasp the logic here, sorry. PDM looks simple enough (mono), as it looks like you have on/off periods that relate to the level of audio. But I am uncertain how that works as clocked left and right.
The PDM microphone I tried allows up to 4.8MHz clocking, which is way above audio rates, so clearly the idea is to clock far more bits than you need samples and then process that stream to decode the PDM.
Seriously I don't (yet) understand, but it works, and the ESP32 can handle it and get 16 bits per sample for each side at various rates.
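For what it is worth, the general idea of PDM can be sketched in a few lines: the density of 1 bits in the stream tracks the audio level, and the decoder averages (low-pass filters) many bits to get one sample back. The sketch below uses a first-order sigma-delta encoder and a crude block average as the decoder. Both function names are hypothetical, and real hardware almost certainly uses better filters than a plain average.

```python
def pdm_encode(samples):
    """First-order sigma-delta: turn levels in [-1, 1] into a stream of 0/1 bits."""
    bits, error = [], 0.0
    for x in samples:
        error += x                 # accumulate the wanted level
        if error >= 0:
            bits.append(1)         # emit a 1 and subtract its value (+1)
            error -= 1.0
        else:
            bits.append(0)         # emit a 0 and subtract its value (-1)
            error += 1.0
    return bits

def pdm_decode(bits, decimation):
    """Crude decoder: average each block of `decimation` bits back to a level."""
    out = []
    for i in range(0, len(bits) - decimation + 1, decimation):
        ones = sum(bits[i:i + decimation])
        out.append(2.0 * ones / decimation - 1.0)
    return out

# A constant level of 0.5, oversampled 64 times, comes back as 0.5.
print(pdm_decode(pdm_encode([0.5] * 64), 64))  # [0.5]
```

With a decimation of 64, a 3.072MHz PDM clock would give 48000 samples per second, which is why the clock is so far above audio rates.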
Standard mode
This makes more sense - you have a BCLK that, say, clocks 32 bits, and then every 32 bits the WS clock toggles. So the LR clock runs at the sample rate, with one side clocked on BCLK while WS is low, and the other while WS is high. Data is clocked MSB first, signed.
What is actually cunning here is how the number of bits per sample is handled. A microphone could supply 8, 16, 24, 32, whatever bits, MSB first, starting on the change of WS and clocked on each BCLK. Then it stops, and whatever is read in any extra bits is meaningless. So if you clock something that only does 24 bits at 32 BCLK per side, you get 32 bit data where the top 24 bits are meaningful.
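Pulling the meaningful bits out of a wider slot is a shift and a sign fix. This is a minimal sketch, assuming a 32-bit slot read as an unsigned integer with the sample in its top bits; the function name is made up for illustration.

```python
def slot_to_sample(word32, valid_bits=24):
    """Return the signed sample held in the top `valid_bits` of a 32-bit slot."""
    word32 &= 0xFFFFFFFF
    value = word32 >> (32 - valid_bits)    # drop the meaningless low bits
    if value & (1 << (valid_bits - 1)):    # sign bit set: two's complement negative
        value -= 1 << valid_bits
    return value

print(slot_to_sample(0x7FFFFF00))  # 8388607, the largest positive 24-bit value
print(slot_to_sample(0xFFFFFFAB))  # -1, the low byte 0xAB is ignored
print(slot_to_sample(0x800000FF))  # -8388608
```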
Philips mode?
There is, of course, a catch! There is a Philips mode, which means the data is delayed by one BCLK after the WS clock changes. But standard mode has no such delay! Oddly, it seems Philips mode is more standard.
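The one clock delay is easier to see as a bit pattern. The sketch below builds one stereo frame as (WS, data) pairs, one per BCLK, with and without the delay. It assumes WS low means left, 8 bits per side to keep the output short, and a frame that simply repeats, so the bit that slides into the first clock is the last bit of the previous right slot. The function name is hypothetical.

```python
def frame_bits(left, right, bits=8, philips=True):
    """Return one stereo frame as a list of (ws, data) pairs, one pair per BCLK."""
    def msb_first(value):
        return [(value >> i) & 1 for i in range(bits - 1, -1, -1)]

    ws = [0] * bits + [1] * bits          # assumption: WS low = left, high = right
    data = msb_first(left) + msb_first(right)
    if philips:
        # Data lags WS by one BCLK: rotate the data line right by one bit.
        data = data[-1:] + data[:-1]
    return list(zip(ws, data))

def data_line(frame):
    return "".join(str(d) for _, d in frame)

print(data_line(frame_bits(0x81, 0x00, philips=False)))  # 1000000100000000
print(data_line(frame_bits(0x81, 0x00, philips=True)))   # 0100000010000000
```

Get this setting wrong and every sample is shifted by one bit, which tends to show up as audio at roughly half or double the expected level with a corrupted sign.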
Stereo or mono
The underlying format is always stereo (well, ish), but the hardware on the ESP32 is not daft. I can say I only want the left or right channel as mono from the stream. Output is always stereo, and there is an option to say mono, which I assume (hope) sends the same data on each side.
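In terms of sample buffers, the two mono options amount to the following. This is a sketch of the idea only, assuming samples are interleaved left first (L, R, L, R, ...) and that mono output really does duplicate each sample, which is the assumption made above rather than something confirmed. Both function names are hypothetical.

```python
def pick_channel(interleaved, right=False):
    """Mono input: keep only one side of an interleaved L, R, L, R... buffer."""
    return interleaved[1::2] if right else interleaved[0::2]

def mono_to_stereo(mono):
    """Mono output: send the same sample on both sides."""
    out = []
    for sample in mono:
        out += [sample, sample]
    return out

stereo = [10, -10, 20, -20, 30, -30]
print(pick_channel(stereo))              # [10, 20, 30]
print(pick_channel(stereo, right=True))  # [-10, -20, -30]
print(mono_to_stereo([1, 2]))            # [1, 1, 2, 2]
```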
TDM mode
There is, it seems, another mode, where WS is a short pulse at the start of each frame, and BCLK allows lots of channels (well, 8 channels I think) to be clocked one after another. Is this harking back to 8 track tape?
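On the receiving side, TDM data is just the channels taking turns, so splitting it back out is a stride over the buffer. This is a minimal sketch, assuming a flat list of slot values in channel order and the 8 channels guessed at above; the function name is made up for illustration.

```python
def split_tdm(slots, channels=8):
    """Split a flat list of TDM slot values into one list per channel."""
    return [slots[ch::channels] for ch in range(channels)]

# Two frames of 8 slots each: values 0..7, then 8..15.
per_channel = split_tdm(list(range(16)))
print(per_channel[0])  # [0, 8]
print(per_channel[7])  # [7, 15]
```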
Hardware
The ESP32 handles all these modes, yay!
For stereo input you wire two microphones on the same bus, one set as left and one as right.
Stereo output works the same way, with one speaker driver wired as left and one as right. A driver can even be wired as left+right.
What is amazing is the chips that now exist.
Microphone
TDK do tiny microphones, a PDM one, and a 24 bit per sample I2S one. They are unbelievably good.
Speaker
This was even more impressive - Maxim do a really tiny speaker driver, a 1.3mm square BGA that does it all. Maybe add a capacitor on the power rail, but it takes BCLK, LRCLK, DATA, and drives a 4 ohm speaker, and that is it! Use two and you have stereo.
Why is this so good?
The simple answer is that the hardware in chips like the ESP32 means that audio in, or out, is a behind-the-scenes DMA process, allowing blocks of data to be sent or received reliably by the hardware.
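A rough way to see why working in blocks helps is to work out how long one block lasts, since that is how long the software has before the next one needs attention. This is a sketch with made-up function names and example figures, not anything taken from the ESP32 documentation.

```python
def block_duration_ms(frames_per_block, sample_rate_hz):
    """How long one block of audio lasts, in milliseconds."""
    return 1000.0 * frames_per_block / sample_rate_hz

def block_bytes(frames_per_block, bits_per_slot, channels=2):
    """Memory needed for one block of interleaved samples."""
    return frames_per_block * channels * bits_per_slot // 8

# 480 stereo frames of 32-bit slots at 48kHz.
print(block_duration_ms(480, 48000))  # 10.0
print(block_bytes(480, 32))           # 3840
```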
Before this you would need a good quality ADC sampled at a consistent high speed rate. You would need a good quality DAC updated at a consistent high speed rate. This was hard work. A chip for each was complicated.
Now the microphone is one chip, and the speaker driver is one very tiny chip. And that is it!