The Thread of Life

For a long time I have been afraid to tell anyone this thought of mine: apart from devoting one's life to science, I consider every other way of living to be meaningless. Whatever other way of life or livelihood one chooses, it only keeps an individual or a small group self-sufficient, and that, to me, is pointless. I have spent long stretches of time thinking about existence and meaning, and every time the conclusion is a kind of pity for how small humanity is; but if I had to find a meaningful way to live in the world as it exists, scientific research is about all that is left.

Among all fields of scientific research, I consider astronomy, aerospace, and quantum physics to sit at the top of the pyramid; every other discipline exists to provide the support these fields need and to keep the structure of human society from falling apart while the research goes on. There are three reasons these fields cannot receive the concentrated effort of every outside force: 1. the limits of human intelligence and imagination; 2. the enormous cost of keeping the base of the pyramid, human society itself, running, which consumes too many resources; 3. disagreement within human society.

A few months ago I became certain that my life can go in only two directions: 1. after gaining some experience in my own field, move into management or business, learn how to acquire capital, and use it to reshape the world according to my own vision, an approach that happens to coincide with Elon Musk's; 2. do research in the scientific field that interests me most until I die, or win a Nobel Prize and then die. The discipline I have had to settle for at the moment is communications engineering; in the future it will most likely be physics.

How QR Codes Are Generated: Details and Principles

A QR code (Quick Response Code) is a kind of two-dimensional barcode invented in 1994 by the Japanese company DENSO WAVE. "QR" stands for Quick Response: the inventors wanted the content to be decodable quickly. QR codes use four standardized encoding modes (numeric, alphanumeric, byte/binary, and kanji) to store data. They are most common in Japan, where they are currently the most popular two-dimensional barcode. Compared with ordinary barcodes, a QR code can store more data and does not need to be lined up with the scanner along a straight line when scanned. Its applications have therefore expanded to product tracking, item identification, document management, marketing, and so on.

Basics

First of all, QR codes come in 40 sizes, officially called Versions. Version 1 is a 21 x 21 matrix, Version 2 is 25 x 25, Version 3 is 29 x 29; each version adds 4 modules per side, so the formula is (V - 1) * 4 + 21, where V is the version number. The highest version is 40: (40 - 1) * 4 + 21 = 177, so the largest QR code is a 177 x 177 square.
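As a quick sanity check, here is a one-line helper (my own sketch, not part of the original article) that computes the side length for a given version:

def qr_side_length(version: int) -> int:
    """Number of modules per side for QR versions 1..40."""
    if not 1 <= version <= 40:
        raise ValueError("QR versions range from 1 to 40")
    return (version - 1) * 4 + 21

print(qr_side_length(1), qr_side_length(40))  # 21 177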

Let's look at a sample QR code:

Position patterns
  • The Position Detection Patterns mark the corners and the extent of the QR code. The white borders around these three patterns are called Separators for Position Detection Patterns. Three of them (rather than four) are enough because three corners already determine a rectangle.
  • The Timing Patterns are also used for positioning. Since QR codes come in 40 sizes, the larger symbols need reference lines so the scanner does not drift while reading.
  • Alignment Patterns are only required for Version 2 and above (Version 2 included), and are likewise used for positioning.
Functional data
  • Format Information exists in every size and stores formatting data.
  • Version Information: for Version 7 and above, two 3 x 6 areas are reserved to store version information.
Data codewords and error correction codewords
  • Everything that is left over stores the Data Code (data codewords) and the Error Correction Code (error correction codewords).

Data Encoding

Let's start with data encoding. QR codes support the following modes:

Numeric mode: digits 0 to 9. The digits are grouped in threes, and each group of 3 digits is encoded in 10 bits; if the number of digits is not a multiple of 3, the final 1 or 2 leftover digits are encoded in 4 or 7 bits respectively. The character count indicator that precedes the data is 10, 12 or 14 bits long depending on the symbol size (see Table 3 below).

Alphanumeric mode: includes 0-9, uppercase A to Z (no lowercase), the symbols $ % * + - . / : and the space. These characters are mapped to a character index table, shown below (SP is the space, Char is the character, Value is its index). Encoding works by taking the characters two at a time, computing first_value * 45 + second_value, and writing the result as an 11-bit binary number; a final unpaired character is written as a 6-bit binary number. The character count indicator is 9, 11 or 13 bits long depending on the Version (see Table 3 below).

Byte mode: byte encoding, ISO-8859-1 characters with values 0-255. Some QR scanners can automatically detect UTF-8 encoded content.

Kanji mode: a double-byte encoding for Japanese, which can also be used for Chinese. A kanji/hanzi codepoint is first shifted down: characters in the range 0x8140 to 0x9FFC have 0x8140 subtracted, and characters in the range 0xE040 to 0xEBBF have 0xC140 subtracted. The most significant byte of the result is then multiplied by 0xC0, the least significant byte is added, and the sum is encoded in 13 bits. An example is shown in the figure below:

Extended Channel Interpretation (ECI) mode: mainly used for special character sets. Not all scanners support it.

Structured Append mode: used to split a message across several QR symbols, so that one piece of content can be spread over multiple codes.

FNC1 mode: intended for special industrial applications, such as GS1 barcodes.

For simplicity, the last three modes are not discussed in this article.

In the two tables below:

  • Table 2 lists the mode indicator (the "number") of each encoding mode, which is written at the head of the encoded data stream (you can see it prepended in the examples below). Note: the indicator for Chinese (Hanzi) is 1101.
  • Table 3 shows, for the different versions (sizes), how many bits the character count indicator takes in numeric, alphanumeric, byte and Kanji modes. (The QR specification contains many such encoding tables; more of them are referenced later.)

Let's walk through a few examples.

Example 1: Numeric mode

Encode 01234567 at Version 1 with error correction level H.

1. Split the digits into groups of three: 012 345 67

2. Convert each group to binary: 012 becomes 0000001100; 345 becomes 0101011001; 67 becomes 1000011.

3. Concatenate the three binary strings: 0000001100 0101011001 1000011

4. Convert the number of digits to binary (10 bits for Version 1-H): 8 digits becomes 0000001000

5. Prepend the numeric-mode indicator 0001 and the count from step 4: 0001 0000001000 0000001100 0101011001 1000011
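A minimal Python sketch of this procedure (my own, assuming a 10-bit character count as in Version 1; not from the original article):

def encode_numeric(digits: str, count_bits: int = 10) -> str:
    """Sketch of QR numeric mode: mode indicator + character count + digit groups."""
    bits = "0001"                                   # numeric mode indicator
    bits += format(len(digits), f"0{count_bits}b")  # character count indicator
    for i in range(0, len(digits), 3):
        group = digits[i:i + 3]
        width = {3: 10, 2: 7, 1: 4}[len(group)]     # 3 digits -> 10 bits, 2 -> 7, 1 -> 4
        bits += format(int(group), f"0{width}b")
    return bits

print(encode_numeric("01234567"))
# 0001 0000001000 0000001100 0101011001 1000011 (printed without the spaces)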

Example 2: Alphanumeric mode

Encode AC-42 at Version 1 with error correction level H.

1. Look up the indices of the five characters A, C, -, 4, 2 in the character index table: (10, 12, 41, 4, 2)

2. Group them in pairs: (10,12) (41,4) (2)

3. Convert each group to an 11-bit binary number:

(10,12): 10 * 45 + 12 = 462, which becomes 00111001110
(41,4): 41 * 45 + 4 = 1849, which becomes 11100111001
(2): 2 becomes 000010 (a lone trailing character is encoded in 6 bits)

4. Concatenate the binary strings: 00111001110 11100111001 000010

5. Convert the character count to binary (9 bits for Version 1-H): 5 characters becomes 000000101

6. Prepend the alphanumeric-mode indicator 0010 and the count from step 5: 0010 000000101 00111001110 11100111001 000010
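The same steps as a Python sketch (my own; ALPHANUM is the standard 45-character table mentioned above, and the 9-bit count assumes Version 1):

ALPHANUM = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"

def encode_alphanumeric(text: str, count_bits: int = 9) -> str:
    bits = "0010"                                 # alphanumeric mode indicator
    bits += format(len(text), f"0{count_bits}b")  # character count indicator
    values = [ALPHANUM.index(c) for c in text]
    for i in range(0, len(values), 2):
        pair = values[i:i + 2]
        if len(pair) == 2:
            bits += format(pair[0] * 45 + pair[1], "011b")  # two chars -> 11 bits
        else:
            bits += format(pair[0], "06b")                  # lone char -> 6 bits
    return bits

print(encode_alphanumeric("AC-42"))
# 0010 000000101 00111001110 11100111001 000010 (printed without the spaces)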

Terminator and Padding

Suppose we want to encode the string HELLO WORLD. Following Example 2, we get the encoding below:

Mode | Character count | Encoding of HELLO WORLD
0010 | 000001011 | 01100001011 01111000110 10001011100 10110111000 10011010100 001101

We still need to append the terminator:

Mode | Character count | Encoding of HELLO WORLD | Terminator
0010 | 000001011 | 01100001011 01111000110 10001011100 10110111000 10011010100 001101 | 0000

Regrouping into 8-bit codewords

If the total number of bits is not a multiple of 8, we pad with enough zeros at the end. The bits above add up to 78, so we append two 0s and then split everything into groups of 8 bits:

00100000   01011011   00001011   01111000   11010001   01110010   11011100   01001101   01000011   01000000

Padding Bytes

Finally, if we still have not reached the maximum number of bits for the symbol, we add padding bytes, which simply repeat the two bytes 11101100 00010001 (236 and 17 in decimal; I don't know why these particular values, only that the spec says so). The maximum number of bits for every version and error correction level is given in Table 7 on pages 28-32 of the QR Code specification.

Suppose we are encoding for Version 1 at error correction level Q. The maximum is 104 data bits, and we have only 80 bits so far, so we need 24 more bits, i.e. 3 padding bytes. Adding them gives:

00100000 01011011 00001011 01111000 11010001 01110010 11011100 01001101 01000011 01000000 11101100 00010001 11101100

This is our data, called the Data Codewords; each group of 8 bits is one codeword. Next we need to add error correction information to these data codewords.
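Putting the terminator, zero padding and padding bytes together in one sketch (my own, reusing the encode_alphanumeric helper from Example 2 above and a known capacity of 104 data bits for Version 1-Q):

PAD_BYTES = ["11101100", "00010001"]   # 0xEC and 0x11, alternated

def finalize_data(bits: str, capacity_bits: int) -> str:
    bits += "0" * min(4, capacity_bits - len(bits))       # terminator (up to 4 zeros)
    bits += "0" * (-len(bits) % 8)                        # pad to a byte boundary
    i = 0
    while len(bits) < capacity_bits:                      # alternate padding bytes
        bits += PAD_BYTES[i % 2]
        i += 1
    return bits

data = finalize_data(encode_alphanumeric("HELLO WORLD"), 104)
print([data[i:i + 8] for i in range(0, len(data), 8)])    # the 13 codewords listed above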

Error Correction Codes

We mentioned error correction levels above. QR codes support four levels of error correction; this is why a QR code can still be scanned when part of it is damaged, and why people can place a logo in the middle of a QR code.

Error correction capacity
Level L: about 7% of codewords can be restored
Level M: about 15% of codewords can be restored
Level Q: about 25% of codewords can be restored
Level H: about 30% of codewords can be restored

So how does QR add error correction codes to the data codewords? First the data codewords are split into blocks, and then error correction coding is applied to each block. How to split them is defined in Tables 13-22 on pages 33-44 of the QR Code specification. Pay attention to the last two columns:

  • Number of Error Correction Blocks: how many blocks the data is divided into.
  • Error Correction Code Per Block: how many codewords of error correction each block gets; a codeword is an 8-bit byte.

For example, Version 5 with error correction level Q needs 4 blocks (two groups of 2 blocks each). Each block in the first group holds 15 data codewords plus 18 error correction codewords (in the tables a codeword is one 8-bit byte). Also note that the (c, k, r) values in the last column satisfy c = k + 2 * r, where r is the error correction capacity of the block; as the footnote explains, the capacity is about half the number of error correction codewords.

The table below shows a 5-Q example (binary would make the table too wide, so the numbers are in decimal; you can see that each block has 18 error correction codewords, i.e. 18 8-bit bytes):

Group | Block | Data codewords | Error correction codewords for the block
1 | 1 | 67 85 70 134 87 38 85 194 119 50 6 18 6 103 38 | 213 199 11 45 115 247 241 223 229 248 154 117 154 111 86 161 111 39
1 | 2 | 246 246 66 7 118 134 242 7 38 86 22 198 199 146 6 | 87 204 96 60 202 182 124 157 200 134 27 129 209 17 163 163 120 133
2 | 1 | 182 230 247 119 50 7 118 134 87 38 82 6 134 151 50 7 | 148 116 177 212 76 133 75 242 238 76 195 230 189 10 108 240 192 141
2 | 2 | 70 247 118 86 194 6 151 50 16 236 17 236 17 236 17 236 | 235 159 5 173 24 147 59 33 106 40 255 172 82 2 131 32 178 236

Note: QR error correction codes are generated with Reed-Solomon error correction. To me this algorithm is quite complex: it involves polynomial division, mapping the numbers 1-255 onto powers of 2 in a Galois field, and error correction formulas built on top of those foundations. My math background is weak and I haven't fully understood it yet; I'm still studying it, so I won't go into it here. My apologies, and if anyone understands it well, please teach me.

Final Encoding

Interleaving

If you think we can start drawing the picture now, you're wrong. The QR scrambling isn't over yet: the codewords of the data blocks and of the error correction blocks still have to be interleaved. The rule is as follows:

For the data codewords: take the first codeword of each block in order, then the second codeword of each block, and so on. The data codewords of the example above are:

Block 1: 67 85 70 134 87 38 85 194 119 50 6 18 6 103 38
Block 2: 246 246 66 7 118 134 242 7 38 86 22 198 199 146 6
Block 3: 182 230 247 119 50 7 118 134 87 38 82 6 134 151 50 7
Block 4: 70 247 118 86 194 6 151 50 16 236 17 236 17 236 17 236

First take the first column: 67, 246, 182, 70

Then append the second column: 67, 246, 182, 70, 85, 246, 230, 247

And so on: 67, 246, 182, 70, 85, 246, 230, 247, ... , 38, 6, 50, 17, 7, 236

The error correction codewords are treated the same way:

Block 1: 213 199 11 45 115 247 241 223 229 248 154 117 154 111 86 161 111 39
Block 2: 87 204 96 60 202 182 124 157 200 134 27 129 209 17 163 163 120 133
Block 3: 148 116 177 212 76 133 75 242 238 76 195 230 189 10 108 240 192 141
Block 4: 235 159 5 173 24 147 59 33 106 40 255 172 82 2 131 32 178 236

Reading them off the same way gives: 213, 87, 148, 235, 199, 204, 116, 159, ... , 39, 133, 141, 236

Then the two sequences are concatenated (error correction codewords after the data codewords), giving:

67, 246, 182, 70, 85, 246, 230, 247, 70, 66, 247, 118, 134, 7, 119, 86, 87, 118, 50, 194, 38, 134, 7, 6, 85, 242, 118, 151, 194, 7, 134, 50, 119, 38, 87, 16, 50, 86, 38, 236, 6, 22, 82, 17, 18, 198, 6, 236, 6, 199, 134, 17, 103, 146, 151, 236, 38, 6, 50, 17, 7, 236, 213, 87, 148, 235, 199, 204, 116, 159, 11, 96, 177, 5, 45, 60, 212, 173, 115, 202, 76, 24, 247, 182, 133, 147, 241, 124, 75, 59, 223, 157, 242, 33, 229, 200, 238, 106, 248, 134, 76, 40, 154, 27, 195, 255, 117, 129, 230, 172, 154, 209, 189, 82, 111, 17, 10, 2, 86, 163, 108, 131, 161, 163, 240, 32, 111, 120, 192, 178, 39, 133, 141, 236

This is our data region.
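The column-by-column interleaving is easy to sketch in Python (my own sketch; blocks 3 and 4 are one codeword longer, so the leftover column is handled explicitly):

data_blocks = [
    [67, 85, 70, 134, 87, 38, 85, 194, 119, 50, 6, 18, 6, 103, 38],
    [246, 246, 66, 7, 118, 134, 242, 7, 38, 86, 22, 198, 199, 146, 6],
    [182, 230, 247, 119, 50, 7, 118, 134, 87, 38, 82, 6, 134, 151, 50, 7],
    [70, 247, 118, 86, 194, 6, 151, 50, 16, 236, 17, 236, 17, 236, 17, 236],
]

def interleave(blocks):
    longest = max(len(b) for b in blocks)
    return [b[i] for i in range(longest) for b in blocks if i < len(b)]

print(interleave(data_blocks)[:8])   # [67, 246, 182, 70, 85, 246, 230, 247]
print(interleave(data_blocks)[-6:])  # [38, 6, 50, 17, 7, 236]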

Remainder Bits

Finally we add the Remainder Bits. For some versions the bit stream above is still not long enough and a few remainder bits must be appended; for our 5-Q example, 7 more bits are needed, and remainder bits are simply zeros. How many remainder bits each version needs is listed in Table 1 on page 15 of the QR Code specification.

Drawing the QR Code

Position Detection Pattern

First, draw the Position Detection Patterns in three corners. (This pattern has the same fixed size regardless of the version.)

Alignment Pattern

Then draw the Alignment Patterns (again, the same fixed size regardless of the version).

The positions of the Alignment Patterns are defined in Table E.1 on page 81 of the QR Code specification (the table below is incomplete).

The figure below shows the Version 8 example from that table (coordinates 6, 24, 42).

Timing Pattern

Next come the Timing Pattern lines (no further explanation needed).

Format Information

Next is the Format Information, the blue area in the figure below.

The Format Information is a 15-bit string; the position of each bit is shown in the figure below (note the Dark Module, which is always present).

These 15 bits consist of:

  • 5 data bits: 2 bits for the error correction level and 3 bits for the mask pattern used
  • 10 error correction bits, computed with a BCH code

The 15 bits are then XORed with 101010000010010. This guarantees that choosing error correction level 00 and mask 000 cannot produce an all-white area, which would make the format information harder for scanners to recognize.
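A sketch of this computation (my own; the (15,5) BCH generator polynomial 10100110111 and the fixed XOR mask come from the QR spec, and the EC-level bit values are assumptions based on the usual encoding L=01, M=00, Q=11, H=10):

def format_information(ec_bits: int, mask_id: int) -> str:
    data5 = (ec_bits << 3) | mask_id          # 2 EC-level bits + 3 mask bits
    rem = data5 << 10                         # append 10 zero bits
    g = 0b10100110111                         # BCH generator polynomial
    for i in range(14, 9, -1):                # polynomial division over GF(2)
        if rem & (1 << i):
            rem ^= g << (i - 10)
    bits = (data5 << 10) | rem                # 5 data bits + 10 BCH bits
    return format(bits ^ 0b101010000010010, "015b")

print(format_information(0b01, 0b100))        # EC level L, mask 4 -> 110011000101111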

Here is an example:

The error correction levels are encoded as shown in the table below:

The mask patterns are shown later in Table 23.

Version Information

Next is the Version Information (needed for Version 7 and above), the blue area in the figure below.

The Version Information is 18 bits in total: 6 bits of version number plus 12 bits of error correction. Here is an example:

Its placement is as follows:

Data and Error Correction Codewords

Then we fill in the final bit stream. The filling order is shown below: starting from the lower-right corner, follow the red line and place the bits one by one, 1 as black and 0 as white. When a non-data area is reached, go around it or skip over it.

Mask Patterns

With that, the picture is filled in. But the modules may not be well balanced: large areas of white or black would make scanning and recognition harder. So we also have to apply masking (yes, it gets even more involved). The QR spec defines 8 masks to choose from, shown below with the formula of each mask under its picture. A mask is simply XORed with the picture generated above; it is applied only to the data region and never touches the function patterns. (Note: there is also an algorithm for choosing the most suitable mask.)

The mask identifiers are shown below (i and j correspond to x and y in the figure above):
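The figure with the eight mask conditions is not reproduced here; as a reference sketch (my own, using the conditions commonly listed in the spec, with i as the row and j as the column), masking the data region looks like this:

MASKS = [
    lambda i, j: (i + j) % 2 == 0,
    lambda i, j: i % 2 == 0,
    lambda i, j: j % 3 == 0,
    lambda i, j: (i + j) % 3 == 0,
    lambda i, j: (i // 2 + j // 3) % 2 == 0,
    lambda i, j: (i * j) % 2 + (i * j) % 3 == 0,
    lambda i, j: ((i * j) % 2 + (i * j) % 3) % 2 == 0,
    lambda i, j: ((i + j) % 2 + (i * j) % 3) % 2 == 0,
]

def apply_mask(matrix, mask_id, is_data):
    """XOR the chosen mask onto the data modules only (function patterns untouched)."""
    return [[bit ^ (is_data(i, j) and MASKS[mask_id](i, j))
             for j, bit in enumerate(row)] for i, row in enumerate(matrix)]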

Below are some masked results; you can see that after being XORed with certain masks the data becomes much more scattered.

After masking, the QR code is the final image.

Reposted from CoolShell (coolshell.cn).

Thoughts Prompted by Albinoni's Adagio in G Minor

I heard Albinoni's Adagio in G minor in an episode of Liu Xuefeng's program Xuefeng Concert.

Classical music, architecture, history, painting, psychology, literature: these humanities do for a person what physics and mathematics do for spaceflight and computing. Their role is foundational.

It is an old cliché, but I think one important reason the classics have survived for centuries is the sincerity of their feeling; in them, one has a chance to find resonance. I am no longer rebellious. I have started to learn to understand things, including the "old clichés" I used to tune out.

 

 

 

Let Me Slow Down

by Hai Sang

All things are in motion; let me slow down

The world is full of cares; I would rather be quiet

Look at the riot of colors in the autumn wind

I am the unremarkable black and white

Whoever sees me, I am theirs

Please do not fill up my days

I do not care for entertainment, or even for intimacy

I want to live what looks like a monotonous life

To keep an empty place in my heart

Plant nothing in it, let it lie fallow

And if something grows there on its own, I will be glad of it

I want to live with you

in some small town,

sharing endless dusk

and the endless ringing of bells.

In the inn of that small town,

the faint chime

of an old clock

like time softly dripping.

And sometimes, at dusk, from some attic room,

a flute,

the flutist leaning at the window,

and large tulips in the window.

And if you did not love me now, I would not mind.

On tonight's Good Will Hunting

   Experience without learning is better than learning without experience.  ——Bertrand Russell
"You're just a kid. You don't have the faintest idea what you're talking about.
So if I ask you about art, you'll probably give me the skinny from every art book ever written. Michelangelo: you know a lot about him, his political ambitions, his friendship with the Pope, his sexual appetites, you know all about him, right? But you can't tell me what the Sistine Chapel smells like. You've never stood there and looked up at that famous ceiling. Never, right?
If I ask you about women, you could rattle them off like a list of treasures. You may have even been to bed a few times. But you can't tell me what it feels like to wake up next to a woman and feel truly happy.
You're a tough, strong kid. If I ask you about war, you'd probably throw Shakespeare at me: 'Once more unto the breach, dear friends.' But you've never been near one. You've never held your best friend's head in your lap and watched him draw his last breath, looking to you for help.
If I ask you what love is, you might only recite poetry at me. But you've never given yourself over to someone completely, never looked into her eyes and known you understood each other's hearts, as if God had sent an angel down just for you, to pull you out of the depths of hell. You've never known what it is to care for her in every way, to vow to stay with her for life, not flinching even when you learn she has a terminal illness. You've never known what it is to lose someone you love that much..."

What moments have pierced your heart and never left you? (reposted from Zhihu)

The boy with the best grades in my middle-school class has now gambled away everything his family had.
The high-school beauty who seemed untouched by the mud now sells things to her WeChat friend circle.
The fearless troublemaker from the class next door now runs a fruit stall and has learned to bow and call people "brother".
Quick-witted Hulu made big money and bought a sports car; even back then I knew that kid would make it.
The desk-mate who said she would marry me when we were little is now the mother of two; I still don't have a girlfriend.
I heard the mad beggar who used to pick through the trash bins downstairs starved to death recently.
The owner of the century-old shop on the north side of town had his signboard smashed by passers-by for adulterating his ingredients.
Little Li in the back row, who never studied, did end up inheriting his family's snack shop, only it no longer sells Little Raccoon instant noodles.
Xiao Xu, who couldn't carry a tune, now works as a piano teacher.
Dadong, who worshipped "shamate" hair culture, has kept a buzz cut for years now.
Lao Tu, whose dream was to become a scientist, finally got into the physics PhD program at USTC.
The girl next door who loved TV dramas has played a supporting role in a cinema release.
The stray dog the gatekeeper of our compound took in has had nine puppies.
I used to really dislike the magazine Fairy Tale World; now I find myself wanting a ten-year subscription.
The Ultraman shows of my childhood are still running, only the costumes and plots keep getting stranger.
Sister Han, who always insisted on staying single, just posted her wedding photos.
The street cleaner who swept the road with a broom ten years ago now lives in a little Western-style house.
Jay Chou, who spun nunchucks on TV, is now married with a child.
Jinguizi's kid is already a teenager.
The Naruto manga we passed around until it fell apart ended a few years ago.
I copied the Japanese dramas and wrote my name on an eraser, but I never got the chance to break it in half and give one piece to the girl I liked.
Nobody plays chess in Binhe Park on summer evenings anymore, and the seniors' choir is gone too.
The remade Slam Dunk is still just as good, but the friends I used to watch Slam Dunk with have long since scattered.
The complete set of Water Margin cards I collected as a kid got lost in the moves of the past few years.
The scary scenes in the old TV shows, watched again now, only get a dismissive "pfft, it's all fake".
Beating Contra on three lives is actually not as hard as I thought when I was little.
I kept thinking of myself as a kid, but I've already been working for a year, and I really like this job.
The green-skinned trains that hold so many of my memories quietly disappeared.
A friend I met online and I promised to duel in the world of The Legend of Mir twenty years later; now all that's left online are pirate servers.
I've started worrying about what to have for breakfast, for lunch, for dinner, whether to eat out or cook.
I've grown used to doing everything alone, telling myself that loneliness and solitude are freedom.
I've learned obedience and compromise, even though only a few years ago I was still rebelling.
Mom's hair has gone so much whiter.
My handsome, healthy dad fell seriously ill and left us; before he went he smiled and told me to be strong.
The leaves of the parasol tree in the old family courtyard cover the ground: one leaf, two leaves, three, four, five, six.

I was scatterbrained as a child, and I still love zoning out.
But it feels as if I only blinked a few times, and the world I knew so well has become unrecognizable.
This strange, familiar world. I hope it has been well.

We will part from everyone eventually, whether we like them or loathe them, so always say goodbye properly.

In reply to a comment below:

Music Recognition Case Study: How does Shazam work?

Have you ever wondered how Shazam works? I asked myself this question a few years ago and I read a research article written by Avery Li-Chun Wang, the co-founder of Shazam, to understand the magic behind it. The quick answer is audio fingerprinting, which leads to another question: what is audio fingerprinting?

 

shazam logo

 

When I was a student, I never took a course in signal processing. To really understand Shazam (and not just have a vague idea of it) I had to start with the basics. This article is a summary of the research I did to understand Shazam.

I’ll start with the basics of music theory, present some signal processing stuff and end with the mechanisms behind Shazam. You don’t need any knowledge to read this article but since it involves computer science and mathematics it’s better to have a good scientific background (especially for the last parts). If you already know what the words “octaves”, “frequencies”, “sampling” and “spectral leakage” mean you can skip the first parts.

Since it’s a long and technical article (11k words) feel free to read each part at different times.

Music and physics

A sound is a vibration that propagates through air (or water) and can be decoded by our ears. For example, when you listen to your mp3 player, the earphones produce vibrations that propagate through the air until they reach your ears. Light is also a vibration, but you can't hear it because your ears can't decode it (your eyes can).

A vibration can be modeled by sinusoidal waveforms. In this chapter, we’ll see how music can be physically/technically described.

 

Pure tones vs real sounds

A pure tone is a tone with a sinusoidal waveform. A sine wave is characterized by:

  • Its frequency: the number of cycles per second. Its unit is the Hertz (Hz), for example 100Hz = 100 cycles per second.
  • Its amplitude (related to loudness for sounds): the size of each cycle.

Those characteristics are decoded by the human ear to form a sound. Humans can hear pure tones from 20 Hz to 20 000 Hz (for the best ears) and this range decreases with age. By comparison, the light you see is composed of sine waves from 4*10^14 Hz to 7.9*10^14 Hz.

You can check the range of your ears with youtube videos like this one that displays all the pure tones from 20 Hz to 20k Hz, in my case I can’t hear anything above 15 kHz.

 

The human perception of loudness depends on the frequency of the pure tone. For instance, a pure tone of amplitude 10 at 30 Hz will sound quieter than a pure tone of amplitude 10 at 1000 Hz. Human ears follow a psychoacoustic model; you can check this article on Wikipedia for more information.

Note: This fun fact will have consequences at the end of the article.

 

pure sinewave at 20 Hz


In this figure, you can see the representation of a pure sine wave of frequency 20hz and amplitude 1.

Pure tones don't exist naturally, but every sound in the world is the sum of multiple pure tones at different amplitudes.

 

composition of sinewaves


 

In this figure, you can see the representation of a more realistic sound which is the composition of multiple sinewaves:

  • a pure sinewave of frequency 20hz and amplitude 1
  • a pure sinewave of frequency 40hz and amplitude 2
  • a pure sinewave of frequency 80hz and amplitude 1.5
  • a pure sinewave of frequency 160hz and amplitude 1

A real sound can be composed of thousands of pure tones.

 

Musical Notes

simple_gifts_partition_min

A musical score is a set of notes played at certain moments. Those notes also have a duration and a loudness.

The notes are divided into octaves. In most Western countries, an octave is a set of 8 notes (A, B, C, D, E, F, G in most English-speaking countries and Do, Re, Mi, Fa, Sol, La, Si in most Latin countries) with the following property:

  • The frequency of a note in an octave doubles in the next octave. For example, the frequency of A4 (A in the 4th octave) at 440Hz equals 2 times the frequency of A3 (A in the 3rd octave) at 220Hz and 4 times the frequency of A2 (A in the 2nd octave) at 110Hz.

Many instruments provide more than 8 notes per octave; the intermediate notes are called semitones or half steps.

octave

For the 4th octave (or 3rd octave in Latin occidental countries), the notes have the following frequency:

  • C4 (or Do3) = 261.63Hz
  • D4 (or Re3) = 293.67Hz
  • E4 (or Mi3) = 329.63Hz
  • F4 (or Fa3) = 349.23Hz
  • G4 (or Sol3) = 392Hz
  • A4 (or La3) = 440Hz
  • B4 (or Si3) = 493.88Hz

Though it might be odd, the frequency sensitivity of ears is logarithmic. It means that:

  • between 32.70 Hz and 61.74Hz (the 1st octave)
  • or between 261.63Hz and 466.16Hz (4th octave)
  • or between 2 093 Hz and 3 951.07Hz (7th octave)

Human ears will be able to detect the same number of notes.

FYI, the A4/La3 at 440Hz is a standard reference for the calibration of acoustic equipment and musical instruments.

 

Timbre

The same note doesn’t sound exactly the same if it’s played by a guitar, a piano, a violin or a human singer. The reason is that each instrument has its own timbre for a given note.

For each instrument, the sound produced is a multitude of frequencies that sounds like a given note (the scientific term for a musical note is pitch). This sound has a fundamental frequency (the lowest frequency) and multiple overtones (any frequency higher than the fundamental).

Most instruments produce (close to) harmonic sounds. For those instruments, the overtones are multiples of the fundamental frequency called harmonics. For example the composition of pure tones A2 (fundamental), A4 and A6 is harmonic whereas the composition of pure tones A2, B3, F5 is inharmonic.

Many percussion instruments (like cymbals or drums) create inharmonic sounds.

Note: The pitch (the musical note perceived) might not even be present in the sound played by an instrument. For example, if an instrument plays a sound with pure tones A4, A6 and A8, the human brain will interpret the resulting sound as an A2 note, even though the lowest frequency in the sound is A4 (this is called the missing fundamental).

 

Spectrogram

A song is played by multiple instruments and singers. All those instruments produce combinations of sine waves at multiple frequencies, and the overall result is an even bigger combination of sine waves.

It is possible to see music with a spectrogram. Most of the time, a spectrogram is a 3 dimensions graph where:

  • on the horizontal (X) axis, you have the time,
  • on the vertical (Y) axis you have the frequency of the pure tone
  • the third dimension is described by a color and it represents the amplitude of a frequency at a certain time.

For example, here is a sound of a piano playing of C4 note (whose fundamental frequency is 261.63Hz)


And here is the associated spectrogram:

spectrogram of a C4 piano note

The color represents the amplitude in dB (we’ll see in a next chapter what it means).

As I told you in the previous chapter, though the note played is a C4, there are other frequencies than 261 Hz in this recording: the overtones. What's interesting is that the other frequencies are multiples of the first one: the piano is an example of a harmonic instrument.

Another interesting fact is that the intensity of the frequencies changes through time; this is another particularity that makes an instrument unique. If you keep the same artist but replace the piano, the frequencies won't evolve in the same way and the resulting sound will be slightly different, because each artist/instrument has its own style. Technically speaking, these evolutions of frequencies modify the envelope of the sound signal (which is a part of the timbre).

To give you a first idea of the Shazam music fingerprinting algorithm, you can see in this spectrogram that some frequencies (the lowest ones) are more important than others. What if we kept just the strongest ones?

 

Digitalization

Unless you're a vinyl lover, when you listen to music you're using a digital file (mp3, Apple Lossless, ogg, audio CD, whatever). But when artists produce music, it is analog (not represented by bits). The music is digitized so it can be stored and played by electronic devices (computers, phones, mp3 players, CD players...). In this part we'll see how to pass from an analog sound to a digital one. Knowing how digital music is made will help us analyse and manipulate it in the next parts.

 

Sampling

Analog signals are continuous signals: if you take one second of an analog signal, you can divide this second into [put the greatest number you can think of, and I hope it's a big one!] parts that each last a fraction of a second. In the digital world, you can't afford to store an infinite amount of information. You need a minimum unit of time, for example 1 millisecond. During this unit of time the sound cannot change, so the unit needs to be short enough for the digital song to sound like the analog one, yet long enough to limit the space needed to store the music.

For example, think about your favorite song. Now think about it with the sound allowed to change only every 2 seconds: it sounds like nothing. Technically speaking, the sound is aliased. To be sure your song sounds great, you could choose a very small unit like a nanosecond (10^-9 s). This time your music sounds great, but you don't have enough disk space to store it. Too bad.

This problem is called sampling.

The standard rate in digital music is 44 100 units (or samples) per second. But where does this 44.1 kHz come from? Well, some dude thought it would be good to put 44 100 units per second and that's all... I'm kidding, of course.

In the first chapter I told you that humans can hear sounds from 20 Hz to 20 kHz. A theorem from Nyquist and Shannon states that if you want to digitize a signal from 0 Hz to 20 kHz, you need at least 40 000 samples per second. The main idea is that a sine wave at frequency F needs at least 2 points per cycle to be identified. If your sampling frequency is at least twice the frequency of your signal, you'll end up with at least 2 points per cycle of the original signal.

Let’s try to understand with a picture, look at this example of a good sampling:

sampling of a signal

In this figure, a sound at 20Hz is digitalized using a 40Hz sampling rate:

  • the blue curve represents the sound at 20 Hz,
  • the red crosses represent the sampled sound, which means I marked the blue curve with a red cross every 1/40 second,
  • the green line is an interpolation of the sampled sound.

Though it has neither the same shape nor the same amplitude, the frequency of the sampled signal remains the same.

And here is an example of bad sampling :

example of undersampling

In this figure, a sound at 20 Hz is digitized with a 30 Hz sampling rate. This time the frequency of the sampled signal is not the same as the original signal: it's only 10 Hz. If you look carefully, you can see that one cycle in the sampled signal corresponds to two cycles in the original signal. This is a case of undersampling.

This case also shows something else: if you want to digitize a signal between 0 Hz and 20 kHz, you need to remove its frequencies above 20 kHz before sampling. Otherwise those frequencies will be folded back into frequencies between 0 Hz and 20 kHz and will add unwanted sounds (this is called aliasing).
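A quick back-of-the-envelope sketch (my own, not from the article): the apparent frequency of an undersampled pure tone is its distance to the nearest multiple of the sampling rate, which reproduces the 20 Hz sampled at 30 Hz example above.

def aliased_frequency(f_signal: float, f_sampling: float) -> float:
    """Apparent frequency of a pure tone after sampling (folding around multiples of fs)."""
    return abs(f_signal - f_sampling * round(f_signal / f_sampling))

print(aliased_frequency(20, 40))  # 20.0 -> correctly captured
print(aliased_frequency(20, 30))  # 10.0 -> the undersampling case above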

 

To sum up, if you want a good conversion from analog to digital, you have to sample the analog music at least 40 000 times per second. HiFi corporations (like Sony) chose 44.1 kHz during the 80s because it was above 40 000 Hz and compatible with the NTSC and PAL video norms. Other audio standards exist, like 48 kHz (Blu-ray), 96 kHz or 192 kHz, but if you're neither a professional nor an audiophile you're likely to be listening to 44.1 kHz music.

Note1: The theorem of Nyquist-Shannon is broader than what I said, you can check on Wikipedia if you want to know more about it.

Note2: The sampling rate needs to be strictly greater than 2 times the frequency of the signal to digitize because, in the worst-case scenario, you could end up with a constant digitized signal.

 

Quantization

We saw how to digitize the frequencies of analog music, but what about its loudness? Loudness is a relative measure: for the same level inside the signal, if you turn up your speakers the sound will be louder. Loudness measures the variation between the lowest and the highest level of sound inside a song.

The same problem appears for loudness: how do you pass from a continuous world (with an infinite range of volumes) to a discrete one?

Imagine your favorite music with only 4 levels of loudness: no sound, low sound, high sound and full power. Even the best song in the world becomes unbearable. What you've just imagined is a 4-level quantization.

Here is an example of a low quantization of an audio signal:

8_level_quantization-min

This figure presents an 8-level quantization. As you can see, the resulting sound (in red) is very altered. The difference between the real sound and the quantized one is called quantization error or quantization noise. This 8-level quantization is also called a 3-bit quantization because you only need 3 bits to encode the 8 different levels (8 = 2^3).

Here is the same signal with a 64 levels quantization (or 6 bits quantization)

64_levels_quantization-min

Though the resulting sound is still altered, it looks (and sounds) more like the original sound.

 

Thankfully, humans don't have extra-sensitive ears. The standard quantization is coded on 16 bits, which means 65 536 levels. With 16-bit quantization, the quantization noise is low enough for human ears.

Note: In studios, professionals use 24-bit quantization, which means there are 2^24 (16 million) possible loudness values between the lowest point of the track and the highest.

Note2: I made some approximations in my examples concerning the number of quantization levels.
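As a small illustration (my own sketch, not from the article), uniform quantization of a sample in [-1, 1] to a given bit depth can look like this:

def quantize(sample: float, bits: int = 16) -> int:
    """Uniformly quantize a sample in [-1.0, 1.0] to a signed integer of the given bit depth."""
    levels = 2 ** (bits - 1)                      # e.g. 32768 for 16 bits
    value = round(sample * (levels - 1))          # scale and round to the nearest level
    return max(-levels, min(levels - 1, value))   # clamp to the representable range

print(quantize(0.5, 3))    # 2 -> only 8 coarse levels available
print(quantize(0.5, 16))   # 16384 -> much finer resolution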

 

Pulse Code Modulation

PCM, or Pulse Code Modulation, is a standard for representing digital signals. It is used by compact discs and most electronic devices. For example, when you listen to an mp3 file on your computer/phone/tablet, the mp3 is automatically transformed into a PCM signal and then sent to your headphones.

A PCM stream is a stream of organized bits. It can be composed of multiple channels. For example, a stereo music has 2 channels.

In a stream, the amplitude of the signal is divided into samples. The number of samples per second corresponds to the sampling rate of the music. For instance, music sampled at 44.1 kHz has 44 100 samples per second. Each sample gives the (quantized) amplitude of the sound for the corresponding fraction of a second.

There are multiple PCM formats but the most used one in audio is the (linear) PCM 44,1kHz, 16-bit depth stereo format. This format has 44 100 samples for each second of music. Each sample takes 4 bytes:

  • 2 bytes (16 bits) for the intensity (from -32,768 to 32,767) of the left speaker
  • 2 bytes (16 bits) for the intensity (from -32,768 to 32,767) of the right speaker

 

example of a pulse code modulation stereo sample

In the PCM 44.1 kHz, 16-bit depth stereo format, you have 44 100 samples like this one for every second of music.
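A sketch (my own, not from the article) of reading one stereo frame from a raw 16-bit little-endian PCM stream: 2 bytes for the left channel, then 2 bytes for the right channel.

import struct

def read_frame(pcm_bytes: bytes, frame_index: int):
    offset = frame_index * 4                       # 4 bytes per stereo frame
    left, right = struct.unpack_from("<hh", pcm_bytes, offset)
    return left, right

raw = struct.pack("<hh", 1200, -350)               # one fake frame for the example
print(read_frame(raw, 0))                          # (1200, -350)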

 

From digital sound to frequencies

You now know how to pass from an analog sound to a digital one. But how can you get the frequencies inside a digital signal? This part is very important since the Shazam fingerprinting algorithm works only with frequencies.

For analog (and therefore continuous) signals, there is a transformation called the Continuous Fourier Transform. This function transforms a function of time into a function of frequencies. In other words, if you apply the Fourier Transform to a sound, it will give you the frequencies (and their intensities) inside this sound.

But there are 2 problems:

  • We are dealing with digital sounds and therefore finite (non-continuous) sounds.
  • To get a better knowledge of the frequencies inside a piece of music, we need to apply the Fourier Transform on small parts of the full-length audio signal, like 0.1-second parts, so that we know the frequencies of each 0.1-second part of the track.

Thankfully, there is another mathematical function, the Discrete Fourier Transform (DFT), that works with some limitations.

Note: The Fourier Transform must be applied on only one channel, which means that if you have a stereo song you need to transform it into a mono song.

 

Discrete Fourier Transform

The DFT (Discrete Fourier Transform) applies to discrete signals and gives a discrete spectrum (the frequencies inside the signal).

 

Here is the magic formula to transform a digital signal into frequencies (don’t run away, I’ll explain it):

X(n) = sum over k = 0 .. N-1 of x(k) * exp(-2 * i * pi * k * n / N)

In this formula:

  • N is the size of the window: the number of samples that compose the signal (we'll talk a lot about windows in the next part).
  • X(n) represents the nth bin of frequencies.
  • x(k) is the kth sample of the audio signal.
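To make the notation concrete, here is a naive, direct Python translation of this formula (my own sketch; the FFT presented later computes exactly the same thing, only much faster):

from cmath import exp, pi

def dft(x):
    """Naive Discrete Fourier Transform: O(N^2), a direct translation of the formula."""
    N = len(x)
    return [sum(x[k] * exp(-2j * pi * k * n / N) for k in range(N)) for n in range(N)]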

 

For example, for an audio signal with a 4096-sample window this formula must be applied 4096 times:

  • 1 time for n = 0 to compute the 0th bin of frequencies
  • 1 time for n = 1 to compute the 1st bin of frequencies
  • 1 time for n = 2 to compute the 2nd bin of frequencies
  • ... and so on up to n = 4095

 

As you might have noticed, I spoke about bin of frequencies and not frequency. The reason is that the DFT gives a discrete spectrum. A bin of frequencies is the smallest unit of frequency the DFT can compute. The size of the bin (called spectral/spectrum resolution or frequency resolution) equals the sampling rate of the signal divided by the size of the window (N). In our example, with a 4096-sample window and a standard audio sampling rate at 44.1kHz, the frequency resolution is 10.77 Hz (except the 0th bin that is special):

  • the 0th bin represents the frequencies between 0Hz to 5.38Hz
  • the 1st bin represents the frequencies between 5.38Hz to 16.15Hz
  • the 2nd bin represents the frequencies between 16.15Hz to 26.92Hz
  • the 3rd bin represents the frequencies between 26.92Hz to 37.68Hz

 

That means that the DFT can't dissociate 2 frequencies that are closer than 10.77 Hz. For example, notes at 27 Hz, 32 Hz and 37 Hz end up in the same bin. If the note at 37 Hz is very powerful, you'll just know that the 3rd bin is powerful. This is problematic for dissociating notes in the lowest octaves. For example:

  • an A1 (or La -1) is at 55 Hz whereas a B1 (or Si -1) is at 61.74 Hz and a G1 (or Sol -1) is at 49 Hz.
  • the first note of a standard 88-key piano is an A0 at 27.5 Hz, followed by an A#0 at 29.14 Hz.

 

You can improve the frequency resolution by increasing the window size but that means losing fast frequency/note changes inside the music:

  • An audio signal has a sampling rate of 44,1 kHz
  • Increasing the window means taking more samples and therefore increasing the time taken by the window.
  • With 4096 samples, the window duration is 0.1 sec and the frequency resolution is 10.7 Hz: you can detect a change every 0.1 sec.
  • With 16384 samples, the window duration is 0.37 sec and the frequency resolution is 2.7 Hz: you can detect a change every 0.37 sec.

 

Another particularity of an audio signal is that we only need half the bins computed by the DFT. In the previous example, the bin size is 10.7 Hz, which means that the 2047th bin represents the frequencies from 21902.9 Hz to 21913.6 Hz. But:

  • The 2048th bin will give the same information as the 0th bin
  • The 2049th bin will give the same information as the 1st bin
  • The (X+2048)th bin will give the same information as the Xth bin
  • ...

If you want to know why the bin resolution equals "the sampling rate" divided by "the size of the window", or why this formula is so bizarre, you can read a 5-part article on the Fourier Transform on this very good website (especially parts 4 and 5); it's the best article for beginners that I've read (and I read a lot of articles on the matter).

 

Window functions

If you want to get the frequency of a one-second sound for each 0.1-second parts, you have to apply the Fourier Transform for the first 0.1-second part, apply it for the second 0.1-second part, apply it for the third 0.1-second part …

The problem

By doing so, you are implicitly applying a (rectangular) window function:

  • For the first 0.1 second you are applying the Fourier transform on the full one-second signal multiplied by a function that equals 1 between 0 and 0.1second, and 0 for the rest
  • For the second 0.1 second you are applying the Fourier transform on the full one-second signal multiplied by a function that equals 1 between 0.1 and 0.2 second, and 0 for the rest
  • For the third 0.1 second you are applying the Fourier transform on the full one-second signal multiplied by a function that equals 1 between 0.2 and 0.3 second, and 0 for the rest

Here is a visual example of the window function to apply to a digital (sampled) audio signal to get the first 0.01-second part:

rectangular window with a signal

In this figure, to get the frequencies for the first 0.01-second part, you need to multiply the sampled audio signal (in blue) with the window function (in green).

rectangular window with a signal

In this figure, to get the frequencies for the second 0.01-second part, you need to multiply the sampled audio signal (in blue) with the window function (in green).

By "windowing" the audio signal, you multiply your signal audio(t) by a window function window(t). This window function produces spectral leakage: the appearance of new frequencies that don't exist inside the audio signal. The power of the real frequencies is leaked to other frequencies.

Here is a non-formal (and very light) mathematical explanation. Let's assume you want a part of the full audio signal. You multiply the audio signal by a window function that lets the sound through only for the part you want:

part_of_audio(t) = full_audio(t) . window (t)

When you try to get the frequencies of the part of audio, you apply the Fourier transform on the signal

Fourier(part_of_audio(t)) = Fourier(full_audio(t) . window (t))

According to the convolution theorem (* represents the convolution operator and . the multiplication operator)

Fourier(full_audio(t) . window (t)) = Fourier(full_audio(t))  *  Fourier(window (t))

—>Fourier(part_of_audio(t)) = Fourier(full_audio(t))  *  Fourier(window (t))

—>The frequencies of the part_of_audio(t) depend on the window() function used.

I won’t go deeper because it requires advanced mathematics. If you want to know more, look at this link on page 29, the chapter “the truncate effects” presents the mathematical effect of applying a rectangular window on a signal. What you need to keep in mind is that cutting an audio signal into small parts to analyze the frequencies of each part produces spectral leakage.

 

different types of windows

You can't avoid spectral leakage, but you can control how the leakage behaves by choosing the right window function: instead of a rectangular window, you can choose a triangular window, a Parzen window, a Blackman window, a Hamming window...
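For instance, here is a sketch (my own, not from the article) of the classic Hamming window; windowing a frame just means multiplying each sample by the corresponding coefficient:

from math import cos, pi

def hamming(N):
    """Hamming window coefficients: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * cos(2 * pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    w = hamming(len(frame))
    return [s * w[n] for n, s in enumerate(frame)]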

The rectangular window is the easiest to use (you just have to "cut" the audio signal into small parts), but for analyzing the most important frequencies in a signal it might not be the best type of window. Let's have a look at 3 types of windows: rectangular, Hamming and Blackman. In order to analyse the effect of the 3 windows, we will use the following audio signal, composed of:

  • A frequency 40 Hz with an amplitude of 2
  • A frequency 160 Hz with an amplitude of 0.5
  • A frequency 320 Hz with an amplitude of 8
  • A frequency 640Hz with an amplitude of 1
  • A frequency 1000 Hz with an amplitude of 1
  • A frequency 1225 Hz with an amplitude of 0.25
  • A frequency 1400 Hz with an amplitude of 0.125
  • A frequency 2000 Hz with an amplitude of 0.125
  • A frequency 2500Hz with an amplitude of 1.5

In a perfect world, the Fourier transform of this signal should give us the following spectrum:

example of a spectrum of multiple sinewave

This figure shows a spectrum with only 9 vertical lines (at 40 Hz, 160 Hz, 320 Hz, 640 Hz, 1000 Hz, 1225 Hz, 1400 Hz, 2000 Hz and 2500 Hz). The y axis gives the amplitude in decibels (dB), which means the scale is logarithmic. With this scale, a sound at 60 dB is 100 times more powerful than a sound at 40 dB and 10 000 times more powerful than a sound at 20 dB. To give you an idea, when you speak in a quiet room, the sound you produce is 20-30 dB higher (at 1 m from you) than the sound of the room.

In order to plot this “perfect” spectrum, I applied the Fourier Transform with a very long window: a 10-second window. Using a very long window reduces the spectrum leakage but 10 seconds is too long because in a real song the sound changes much faster.  To give you an idea of how fast the music changes:

  • here is a video with 1 change ( or beat) per second, it sounds slow but it’s a common rhythm for classical music.
  • here is a video with 2.7 changes per second, it sounds much faster but this rhythm is common for electro music
  • here is a video with 8.3 changes per second, it’s a very (very) fast rhythm but possible for small parts of songs.

In order to capture those fast changes, you need to “cut” the sound into very small parts using window functions. Imagine you want to analyze the frequencies of a sound every 1/3 second.

example of rectangular, blackman and hamming windows

In this figure, you can multiply the audio signal with one of the 3 window types to get the part of the signal between 0.333sec and 0.666 sec. As I said, using a rectangular window is like cutting the signal between 0.333sec and 0.666sec whereas with the Hamming or the Blackman windows you need to multiply the signal with the window signal.

Now, here is the spectrum of the previous audio signal with a 4096-sample window:

spectrogram of rectangular, blackman and hamming windows

The signal is sampled at 44100Hz so a 4096-sample window represents a 93-millisecond part (4096/44100) and a frequency resolution of 10.7 Hz.

This figure shows that all windows modify the real spectrum of the sound. We clearly see that a part of the power of the real frequencies is spread to their neighbours. The spectrum from the rectangular window is the worst since the spectrum leakage is much higher than the 2 others. It’s especially true between 40 and 160 Hz. The Blackman window gives the closest spectrum from the real spectrum.

Here is the same example with a Fourier Transform of a 1024 window:

spectrogram of rectangular, blackman and hamming windows

The signal is sampled at 44100Hz so a 1024-sample window represents a 23-millisecond part (1024/44100) and a frequency resolution of 43 Hz.

This time the rectangular window gives the best spectrum. With the 3 windows the 160 Hz frequency is hidden by the spectrum leakage produced by the 40 Hz and 320 Hz frequencies. The Blackman window gives the worst result with a 1225 Hz frequency close to invisible.

Comparing both figures shows that the spectral leakage increases (for all the window functions) as the frequency resolution gets coarser. The fingerprinting algorithm used by Shazam looks for the loudest frequencies inside an audio track. Because of spectral leakage, we can't just take the X highest frequencies. In the last example, the 3 loudest frequencies are approximately 320 Hz, 277 Hz (320-43) and 363 Hz (320+43), whereas only the 320 Hz frequency actually exists.

 

Which window is the best?

There are no “best” or “worst” windows. Each window has its specificities and depending on the problem you might want to use a certain type.

A rectangular window has excellent resolution characteristics for sinusoids of comparable strength, but it is a poor choice for sinusoids of disparate amplitudes (which is the case inside a song because the musical notes don’t have the same loudness).

Windows like Blackman are better at preventing the spectral leakage of strong frequencies from hiding weak frequencies. But these windows deal badly with noise, since noise will hide more frequencies than with a rectangular window. This is problematic for an algorithm like Shazam that needs to handle noise (for instance, when you Shazam a song in a bar or outdoors there is a lot of noise).

A Hamming window is between these two extremes and is (in my opinion) a better choice for an algorithm like shazam.

 

Here are some useful links to go deeper on window functions and spectrum leakage:

http://en.wikipedia.org/wiki/Spectral_leakage

http://en.wikipedia.org/wiki/Window_function

http://web.mit.edu/xiphmont/Public/windows.pdf

 

Fast Fourier Transform and time complexity

 

the problem

Discrete Fourier Transform formula

If you look again at the DFT formula (don’t worry, it’s the last time you see it), you can see that to compute one bin you need to do N additions and N multiplications (where N is the size of the window). Getting the N bins requires 2 *N^2 operations which is a lot.

For example, let's assume you have a three-minute song at 44.1 kHz and you compute the spectrogram of the song with a 4096-sample window. You'll have to compute 10.7 (44100/4096) DFTs per second, so 1938 DFTs for the full song. Each DFT needs 3.35*10^7 operations (2 * 4096^2). To get the spectrogram of the song you need to do 6.5*10^10 operations.

Let's assume you have a music collection of 1000 three-minute songs: you'll need 6.5*10^13 operations to get the spectrograms of your songs. Even with a good processor, it would take days/months to get the result.

Thankfully, there are faster implementations of the DFT called FFTs (Fast Fourier Transforms). Some implementations require just 1.5*N*log(N) operations. For the same music collection, using the FFT instead of the DFT requires roughly 340 times fewer operations (1.43*10^11) and it would take minutes/hours to get the result.

This example shows another tradeoff: though increasing the size of the window improves the frequency resolution, it also increases the computation time. For the same music collection, if you compute the spectrogram using a 512-sample window (frequency resolution of 86 Hz), you get the result with the FFT in 1.07*10^11 operations, approximately 25% fewer than with a 4096-sample window (frequency resolution of 10.77 Hz).

This time complexity is important since when you shazam a sound, your phone needs to compute the spectrogram of the recorded audio and a mobile processor is less powerful than a desktop processor.

 

downsampling

Thankfully, there is a trick to keep the frequency resolution and reduce the window size at the same time, it’s called downsampling. Let’s take a standard song at 44100 Hz, if you resample it at 11025 Hz (44100/4) you will get the same frequency resolution whether you do a FFT on the 44.1kHz song with a 4096 window or you do a FFT on the 11kHz resampled song with a 1024 window. The only difference is that the resampled song will only have frequencies from 0 to 5 kHz. But the most important part of a song is between 0 and 5kHz. In fact most of you won’t hear a big difference between a music at 11kHz and a music at 44.1kHz. So, the most important frequencies are still in the resampled song which is what matters for an algorithm like Shazam.

 

example of signal downsampling

Downsampling a 44.1 kHz song to 11.025 kHz is not very difficult: a simple way to do it is to take the samples in groups of 4 and turn each group into a single sample by taking the average of the 4 samples. The only tricky part is that before downsampling a signal, you need to filter out the higher frequencies in the sound to avoid aliasing (remember the Nyquist-Shannon theorem). This can be done with a digital low-pass filter.
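The group-of-4 averaging itself is a one-liner (my own sketch; in a real pipeline a low-pass filter would be applied before this step, as explained above):

def downsample_by_4(samples):
    return [sum(samples[i:i + 4]) / 4 for i in range(0, len(samples) - 3, 4)]

print(downsample_by_4([0, 4, 8, 4, 0, -4, -8, -4]))  # [4.0, -4.0]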

 

FFT

But let's go back to the FFT. The simplest implementation of the FFT is the radix-2 Cooley-Tukey algorithm, which is a divide-and-conquer algorithm. The idea is that instead of directly computing the Fourier Transform on the N-sample window, the algorithm:

  • divides the N-sample window into 2 N/2-sample windows
  • computes (recursively) the FFT for the 2 N/2-sample windows
  • computes efficiently the FFT for the N-sample windows from the 2 previous FFT

The last part only costs N operations using a mathematical trick on the roots of unity (the exponential terms).

Here is a readable version of the FFT (written in python) that I found on Wikipedia

 

from cmath import exp, pi

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return x
    # Recursively transform the even-indexed and odd-indexed samples.
    even = fft([x[k] for k in range(0, N, 2)])
    odd = fft([x[k] for k in range(1, N, 2)])
    M = N // 2
    # Combine the two half-size transforms using the twiddle factors (roots of unity).
    l = [even[k] + exp(-2j * pi * k / N) * odd[k] for k in range(M)]
    r = [even[k] - exp(-2j * pi * k / N) * odd[k] for k in range(M)]
    return l + r

For more information on the FFT, you can check this article on Wikipedia.

 

Shazam

We’ve seen a lot of stuff during the previous parts. Now, we’ll put everything together to explain how Shazam quickly identifies songs (at last!).  I’ll first give you a global overview of Shazam, then I’ll focus on the generation of the fingerprints and I’ll finish with the efficient audio search mechanism.

Note: From now on, I assume that you read the parts on musical notes, FFT and window functions.  I’ll sometimes use the words “frequency”, “bin”, ”note”  or the full expression “bin of frequencies” but it’s the same concept since we’re dealing with digital audio signals.

 

Global overview

An audio fingerprint is a digital summary that can be used to identify an audio sample or quickly locate similar items in an audio database. For example, when you’re humming a song to someone, you’re creating a fingerprint because you’re extracting from the music what you think is essential (and if you’re a good singer, the person will recognize the song).

Before going deeper, here is a simplified architecture of what Shazam might be. I don’t work at Shazam so it’s only a guess (from the 2003 paper of the co-founder of Shazam):

shazam overview

On the server side:

  • Shazam precomputes fingerprints from a very big database of music tracks.
  • All those fingerprints are put in a fingerprint database which is updated whenever a new song is added in the song database

On the client side:

  • when a user uses the Shazam app, the app first records the current music with the phone's microphone
  • the phone applies the same fingerprinting algorithm as Shazam on the recording
  • the phone sends the fingerprint to Shazam
  • Shazam checks if this fingerprint matches one of its fingerprints
    • If not, it informs the user that the music can't be found
    • If yes, it looks up the metadata associated with the fingerprints (name of the song, iTunes URL, Amazon URL...) and gives it back to the user.

The key points of Shazam are:

  • being noise/fault tolerant:
    • because the music recorded by a phone in a bar/outdoors has bad quality,
    • because of the artifacts due to window functions,
    • because the cheap microphone inside a phone produces noise/distortion,
    • because of many physical phenomena I'm not aware of
  • fingerprints need to be time invariant: the fingerprint of a full song must be able to match just a 10-second recording of the song
  • fingerprint matching needs to be fast: who wants to wait minutes/hours to get an answer from Shazam?
  • having few false positives: who wants to get an answer that doesn't correspond to the right song?

 

Spectrogram filtering

Audio fingerprints differ from standard computer fingerprints like SSHA or MD5 because two different files (in terms of bits) that contain the same music must have the same audio fingerprint. For example, a song in a 256-kbit AAC format (iTunes) must give the same fingerprint as the same song in a 256-kbit MP3 format (Amazon) or in a 128-kbit WMA format (Microsoft). To solve this problem, audio fingerprinting algorithms use the spectrogram of the audio signal to extract the fingerprint.

Getting our spectrogram

I told you before that to get the spectrogram of a digital sound you need to apply a FFT. For a fingerprinting algorithm we need a good frequency resolution (like 10.7Hz) to reduce spectrum leakage and have a good idea of the most important notes played inside the song. At the same time, we need to reduce the computation time as far as possible and therefore use the lowest possible window size. In the research paper from Shazam, they don’t explain how they get the spectrogram but here is a possible solution:

getting a spectrogram of a signal

On the server side (Shazam), the 44.1khz sampled sound (from CD, MP3 or whatever sound format) needs to pass from stereo to mono. We can do that by taking the average of the left speaker and the right one. Before downsampling, we need to filter the frequencies above 5kHz to avoid aliasing. Then, the sound can be downsampled at 11.025kHz.

On the client side (phone), the sampling rate of the microphone that records the sound needs to be at 11.025 kHz.

Then, in both cases we need to apply a window function to the signal (like a hamming 1024-sample window, read the chapter on window function to see why) and apply the FFT for every 1024 samples. By doing so, each FFT analyses 0.1 second of music. This gives us a spectrogram:

  • from 0 Hz to 5000Hz
  • with a bin size of 10.7Hz,
  • 512 possible frequencies
  • and a unit of time of 0.1 second.

 

Filtering

At this stage we have the spectrogram of the song. Since Shazam needs to be noise tolerant, only the loudest notes are kept. But you can't just keep the X most powerful frequencies every 0.1 second. Here are some reasons:

  • In the beginning of the article I spoke about psychoacoustic models. Human ears have more difficulty hearing a low sound (<500 Hz) than a mid sound (500 Hz-2000 Hz) or a high sound (>2000 Hz). As a result, the low sounds of many "raw" songs are artificially boosted before release. If you only took the most powerful frequencies you'd end up with only the low ones, and if 2 songs had the same drum part, they might have very similar filtered spectrograms even though there are flutes in the first song and guitars in the second.
  • We saw in the chapter on window functions that if you have a very powerful frequency, other powerful frequencies close to it will appear in the spectrum even though they don't exist (because of spectral leakage). You must be able to keep only the real one.

 

Here is a simple way to keep only strong frequencies while reducing the previous problems:

step1 – For each FFT result, you put the 512 bins into 6 logarithmic bands:

  • the very low sound band (from bin 0 to 10)
  • the low sound band (from bin 10 to 20)
  • the low-mid sound band (from bin 20 to 40)
  • the mid sound band (from bin 40 to 80)
  • the mid-high sound band (from bin 80 to 160)
  • the high sound band (from bin 160 to 511)

step2 – For each band you keep the strongest bin of frequencies.

step3 – You then compute the average value of these 6 powerful bins.

step4 – You keep the bins (from the 6 ones) that are above this mean (multiplied by a coefficient).

Step 4 is very important, because you might have:

  • an a cappella music involving soprano singers with only mid or mid-high frequencies
  • a jazz/rap music with only low and low-mid frequencies
  • ..

And you don’t want to keep a weak frequency in a band just because this frequency is the strongest of its band.

But this algorithm has a limitation. In most songs some parts are very quiet (like the beginning or the end of a song). If you analyze these parts you'll end up with false strong frequencies because the mean value (computed at step 3) of these parts is very low. To avoid that, instead of taking the mean of the 6 powerful bins of the current FFT (which represents only 0.1 sec of the song), you could take the mean of the most powerful bins of the full song.
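Here is a small sketch of this filtering for a single FFT result (my own, under the assumptions above: 512 magnitude bins per FFT, the 6 band boundaries from step 1, and an arbitrary tuning coefficient):

BANDS = [(0, 10), (10, 20), (20, 40), (40, 80), (80, 160), (160, 511)]
COEFF = 1.0   # hypothetical tuning knob for step 4

def keep_strong_bins(magnitudes):
    """magnitudes: 512 FFT magnitudes for one 0.1 s slice; returns the kept (bin, value) peaks."""
    strongest = []
    for lo, hi in BANDS:
        band_bin = max(range(lo, hi), key=lambda b: magnitudes[b])   # step 2
        strongest.append((band_bin, magnitudes[band_bin]))
    mean = sum(v for _, v in strongest) / len(strongest)             # step 3
    return [(b, v) for b, v in strongest if v > mean * COEFF]        # step 4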

To summarize, by applying this algorithm we’re filtering the spectrogram of the song to keep the peaks of energy in the spectrum that represent the loudest notes. To give you a visual idea of what this filtering is, here is a real spectrogram of a 14-second song.

shazam_full_spectrogram_min

This figure is from the Shazam research article. In this spectrogram, you can see that some frequencies are more powerful than others. If you apply the previous algorithm on the spectrogram here is what you’ll get:

shazam_filtered_spectrogram-min

This figure (still from the Shazam research article) is a filtered spectrogram. Only the strongest frequencies from the previous figure are kept.  Some parts of the song have no frequency (for example between 4 and 4.5 seconds).

The number of frequencies in the filtered spectrogram depends on the coefficient used with the mean during step4. It also depends on the number of bands you use (we used 6 bands but we could have used another number).

 

At this stage, the intensity of the frequencies is useless. Therefore, this spectrogram can be modeled as a 2-column table where

  • the first column is the frequency inside the spectrogram (the Y axis)
  • the second column is the time when the frequency occurred in the song (the X axis)

This filtered spectrogram is not the final fingerprint but it’s a huge part of it. Read the next chapter to know more.

 

Note: I gave you a simple algorithm to filter the spectrogram. A better approach could be to use a logarithmic sliding window and to keep only the most powerful frequencies above the mean + the standard deviation (multiplied by a coefficient) of a moving part of the song. I used this approach when I did my own Shazam prototype  but it’s more difficult to explain (and I’m not even sure that what I did was correct …).

 

Storing Fingerprints

We've just ended up with a filtered spectrogram of a song. How can we store and use it efficiently? This is where the power of Shazam lies. To understand the problem, I'll first present a simple approach where I search for a song using the filtered spectrograms directly.

 

Simple search approach

Pre-step: I precompute a database of filtered spectrograms for all the songs in my computer

Step 1: I record a 10-second part of a song from TV in my computer

Step 2: I compute the filtered spectrogram of this record

Step 3: I compare this "small" spectrogram with the "full" spectrogram of each song. How can I compare a 10-second spectrogram with the spectrogram of a 180-second song? Instead of losing myself in a bad explanation, here is a visual explanation of what I need to do.

spectrogram-min

Visually speaking, I need to superpose the small spectrogram everywhere inside the spectrogram of the full song to check if the small spectrogram matches with a part of the full one.

spectrogram2-min

And I need to do this for each song until I find a perfect match.

spectrogram3-min

In this example, there is a perfect match between the recording and the end of the song. If it's not the case, I have to compare the recording with another song, and so on until I find a perfect match. If I don't find a perfect match, I can choose the closest match found (across all the songs) if the matching rate is above a threshold. For instance, if the best match gives me 90% similarity between the recording and a part of a song, I can assume it's the right song, because the 10% mismatch is most likely due to external noise.

 

Though it works well, this simple approach requires a lot of computation. It needs to try all the possible alignments between the 10-second recording and each song in the collection. Let's assume that on average music contains 3 peak frequencies per 0.1 second. The filtered spectrogram of the 10-second recording therefore has 300 time-frequency points. In the worst-case scenario, you'll need 300 * 300 * 30 * S operations to find the right song, where S is the number of seconds of music in your collection. If, like me, you have 30k songs (millions of seconds of music) it might take a long time, and it's much harder for Shazam with its 40-million-song collection (a guess; I couldn't find the current size of Shazam's catalog).

So, how does Shazam do it efficiently?

 

Target zones

Instead of comparing each point one by one, the idea is to look for multiple points at the same time. In the Shazam paper, this group of points is called a target zone. The paper doesn't explain how to generate these target zones, but here is a possibility. For the sake of comprehension, I'll fix the size of a target zone at 5 frequency-time points.

In order to be sure that both the record and the full song will generate the same target zones, you need an order relation between the time-frequency points in a filtered spectrogram. Here is one:

  • If two time-frequency points have the same time, the one with the lower frequency comes first.
  • If a time-frequency point has a lower time than another one, it comes first.

Here is what you get if you apply this order on the simplified spectrogram we saw before:

ordered_spectrogram1-min

In this figure I labeled all the time-frequency points using this order relation. For example:

  • The point 0 is before any other points in the spectrogram.
  • The point 2 is after point 0 and 1 but before all the others.

Now that the spectrograms can be internally ordered, we can create the same target zones on different spectrograms with the following rule: "to generate the target zones of a spectrogram, for each time-frequency point you create a group composed of this point and the 4 points after it". We end up with approximately as many target zones as there are points. This generation is the same for the songs and for the recording.

[Figure: ordered_spectrogram2-min — the target zones generated on the simplified spectrogram]

In this simplified spectrogram, you can see the different target zones generated by the previous algorithm. Since the target size is 5, most of the points belong to 5 target zones (except the points at the beginning and the end of the spectrogram).
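Here is a minimal sketch of this order relation and of the target-zone rule, assuming peaks are (time, frequency) tuples; the names ordered_peaks and target_zones are mine.

# Sketch of the order relation and of the target-zone generation described above.
# A peak is a (time, frequency) tuple; zones of 5 points follow the example in the text.

ZONE_SIZE = 5

def ordered_peaks(peaks):
    """Order relation: lower time first, and for equal times, lower frequency first."""
    return sorted(peaks, key=lambda p: (p[0], p[1]))

def target_zones(peaks, zone_size=ZONE_SIZE):
    """For each point, the zone is that point plus the `zone_size - 1` points after it."""
    ordered = ordered_peaks(peaks)
    return [ordered[i:i + zone_size]
            for i in range(len(ordered) - zone_size + 1)]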

Note: At first, I didn’t understand why we needed to compute that many target zones for the record. We could generate target zones with a rule like “for each point whose label is a multiple of 5, create a group composed of this point and the 4 points after it”. With this rule, the number of target zones would be divided by 5, and so would the search time (explained in the next part). The only reason I found is that computing all the possible zones on both the record and the song greatly increases the noise robustness.

 

Address generation

We now have multiple target zones; what do we do next? We create an address for each point, based on those target zones. In order to create those addresses, we also need an anchor point per target zone. Again, the paper doesn’t explain how to do it; I propose that this anchor point be the 3rd point before the target zone. The anchor can be anywhere, as long as the way it is generated is reproducible (which it is, thanks to our order relation).

[Figure: ordered_spectrogram4-min — two target zones (purple and brown) with their anchor points]

 

In this picture I plotted 2 target zones with their anchor points. Let’s focus on the purple target zone. The address formula proposed by Shazam is the following:

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”].

For the purple target zone:

  • the address of point 6 is [“frequency of point 3”; “frequency of point 6”; “delta_time between point 3 & point 6”], so concretely [10;30;1],
  • the address of point 7 is [10;20;2].

Both points also appear in the brown target zone; their addresses within that target zone are [10;30;2] for point 6 and [10;20;3] for point 7.
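As a sketch of this address generation, using the “3rd point before the zone” anchor rule I proposed above (the function name addresses_for_zone is mine):

# Sketch of the address generation: the anchor is the 3rd point before the zone,
# as proposed above. `ordered` is the list of (time, frequency) peaks sorted by the order relation.

def addresses_for_zone(ordered, zone_start, zone_size=5, anchor_offset=3):
    """Yield one (anchor_freq, point_freq, delta_time) address per point of the zone."""
    anchor_index = zone_start - anchor_offset
    if anchor_index < 0:
        return  # zones at the very beginning have no anchor with this rule
    anchor_time, anchor_freq = ordered[anchor_index]
    for point_time, point_freq in ordered[zone_start:zone_start + zone_size]:
        yield (anchor_freq, point_freq, point_time - anchor_time)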

I spoke about addresses, right? That means those addresses are linked to something. In the case of the full songs (so only on the server side), each address is linked to the following couple: [“absolute time of the anchor in the song”; “Id of the song”]. In our simple example with the 2 previous points, we get the following result:

[10;30;1] –> [2;1]

[10;20;2] –> [2;1]

[10;30;2] –> [1;1]

[10;20;3] –> [1;1]

If you apply the same logic for all the points of all the target zones of all the song spectrograms, you’ll end up with a very big table with 2 columns:

  • the addresses
  • the couples (“time of anchor” ; “song Id”).

This table is the fingerprint database of Shazam. If on average a song contains 30 peak frequencies per second and the size of the target zone is 5,  the size of this table is 5 * 30 *S where S is the number of seconds of the music collection.

If you remember, we used an FFT with 1024 samples which means that there are only 512 possible frequency values. Those frequencies can be coded in 9 bits (2^9 = 512). Assuming that the delta time is in milliseconds,  it will never be over 16 seconds because it would imply a song with a 16-second part without music (or very low sound).  So, the delta time can be coded in 14 bits (2^14 = 16384). The address can be coded in a 32-bit integer:

  • 9 bits for the “frequency of the  anchor”
  • 9 bits for the ” frequency of the  point”
  • 14 bits for the ”delta time between the anchor and the point”

Using the same logic, the couple (“time of anchor” ; “song Id”) can be coded in a 64-bit integer (32 bit for each part).
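A minimal sketch of this bit packing (the bit layout follows the sizes above; the helper names are mine):

# Packing the address (9 + 9 + 14 bits) into a 32-bit integer and the
# ("time of anchor"; "song Id") couple into a 64-bit integer, as described above.

def pack_address(anchor_freq, point_freq, delta_time_ms):
    """9 bits anchor frequency | 9 bits point frequency | 14 bits delta time (in ms)."""
    assert 0 <= anchor_freq < 512 and 0 <= point_freq < 512 and 0 <= delta_time_ms < 16384
    return (anchor_freq << 23) | (point_freq << 14) | delta_time_ms

def pack_couple(anchor_time, song_id):
    """32 bits for the absolute time of the anchor, 32 bits for the song Id."""
    return (anchor_time << 32) | song_id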

The fingerprint table can be implemented as a simple array of lists of 64-bit integers where:

  • the index of the array is the 32-bit integer address
  • the list of 64-bit integers holds all the couples for this address.

In other words, we transformed the fingerprint table into an inverted look-up table that allows search operations in O(1) (i.e. a very efficient search time).
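Here is a sketch of that inverted look-up, using a dict in place of the raw array (access by address is still O(1)); build_fingerprint_db and its entry format are my own naming, not Shazam’s:

# Sketch of the fingerprint database: an inverted look-up from the packed 32-bit
# address to the list of packed 64-bit couples ("time of anchor"; "song Id").

from collections import defaultdict

def build_fingerprint_db(songs):
    """songs: iterable of (song_id, entries), where entries is a list of
    (packed_address, anchor_time) pairs built from the song's target zones."""
    db = defaultdict(list)
    for song_id, entries in songs:
        for packed_address, anchor_time in entries:
            db[packed_address].append((anchor_time << 32) | song_id)  # 64-bit couple
    return db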

Note: You may have noticed that I didn’t choose the anchor point inside the target zone (I could have chosen the first point of the target zone, for example). If I had, it would have generated a lot of addresses like [frequency of anchor; frequency of anchor; 0], and therefore too many couples (“time of anchor”; “song Id”) would have an address like [Y;Y;0], where Y is a frequency (between 0 and 511). In other words, the look-up would have been skewed.

 

Searching And Scoring the fingerprints

We now have a great data structure on the server side, how can we use it? It’s my last question, I promise!

 

Search

To perform a search, the fingerprinting step is performed on the recorded sound file to generate an address/value structure slightly different on the value side:

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] -> [“absolute time of the anchor in the record”].

 

This data is then sent to the server side (Shazam). Let’s take the same assumptions as before (300 time-frequency points in the filtered spectrogram of the 10-second record and a target zone size of 5 points): approximately 1500 address/value pairs are sent to Shazam.

Each address from the record is used to search the fingerprint database for the associated couples [“absolute time of the anchor in the song”; “Id of the song”]. In terms of time complexity, assuming that the fingerprint database is in memory, the cost of the search is proportional to the number of addresses sent to Shazam (1500 in our case). This search returns a big amount of couples; let’s say, for the rest of the article, that it returns M couples.

 

Though M is huge, it’s way lower than the number of notes (time-frequency points) of all the songs. The real power of this search is that instead of checking whether one note exists in a song, we check whether 2 notes separated by delta_time seconds exist in the song. At the end of this part we’ll talk more about time complexity.
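A sketch of this look-up step, reusing the packed formats from the previous sketches (the function name search_fingerprints is mine):

# Sketch of the search step: every address computed from the record is looked up
# in the fingerprint database; each hit keeps the anchor time in the record so that
# the time coherency step can use it later.

def search_fingerprints(record_entries, fingerprint_db):
    """record_entries: list of (packed_address, anchor_time_in_record) from the record."""
    results = []  # the M couples
    for packed_address, record_anchor_time in record_entries:
        for packed_couple in fingerprint_db.get(packed_address, ()):
            song_id = packed_couple & 0xFFFFFFFF       # low 32 bits
            song_anchor_time = packed_couple >> 32     # high 32 bits
            results.append((song_id, song_anchor_time, record_anchor_time))
    return results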

 

Result filtering

Though it is not mentioned in the Shazam paper, I think the next thing to do is to filter the M results of the search by keeping only the couples of the songs that have a minimum number of target zones in common with the record.

 

For example, let’s suppose our search has returned:

  • 100 couples from song 1 which has 0 target zone in common with the record
  • 10 couples from song 2 which has 0 target zone in common with the record
  • 50 couples from song 5 which has 0 target zone in common with the record
  • 70 couples from song 8 which has 0 target zone in common with the record
  • 83 couples from song 10 which has 30 target zones in common with the record
  • 210 couples from song 17 which has 100 target zones in common with the record
  • 4400 couples from song 13 which has 280 target zones in common with the record
  • 3500 couples from song 25 which has 400 target zones in common with the record

Our 10-second record has (approximately) 300 target zones. In the best-case scenario:

  • song 1 and the record will have a 0% matching ratio
  • song 2 and the record will have a 0% matching ratio
  • song 5 and the record will have a 0% matching ratio
  • song 8 and the record will have a 0% matching ratio
  • song 10 and the record will have a 10% matching ratio
  • song 17 and the record will have a 33% matching ratio
  • song 13 and the record will have a 93.3% matching ratio
  • song 25 and the record will have a 100% matching ratio

We’ll only keep the couples of songs 13 and 25 in the result. Although songs 1, 2, 5 and 8 have multiple couples in common with the record, none of those couples forms at least one full target zone (of 5 points) in common with the record. This step can filter a lot of false results, because the fingerprint database of Shazam has a lot of couples for the same address and you can easily end up with couples at the same address that don’t belong to the same target zone. If you don’t understand why, look at the last picture of the previous part: the [10;30;2] address is used by 2 time-frequency points that don’t belong to the same target zone. If the record also has a [10;30;2] address, (at least) one of the 2 couples in the result will be filtered out in this step.

 

This step can be done in O(M) with the help of a hash table whose key is the couple (song Id; absolute time of the anchor in the song) and whose value is the number of times it appears in the result (a code sketch follows the note below):

  • We iterate through the M results and count (in the hash table) the number of times each couple is present
  • We remove all the couples (i.e. the keys of the hash table) that appear less than 4 times (in other words, we remove all the points that don’t form a target zone)*
  • We count the number X of times each song Id is part of a key in the hash table (i.e. we count the number of complete target zones in the song; since the couples come from the search, those target zones are also in the record)
  • We only keep the songs whose count X is above 300 * coeff (300 is the number of target zones of the record, and we reduce this number with a coefficient because of the noise)
  • We put the remaining results in a new hash table whose index is the song Id (this hash map will be useful for the next step)

 

* The idea is to look for the target zones created by an anchor point in a song. An anchor point can be identified by the Id of the song it belongs to and the absolute time at which it occurs. I made an approximation here, because a song can contain multiple anchor points at the same time. Since we’re dealing with a filtered spectrogram, you won’t have many anchor points at the same time, but the key [song Id; absolute time of the anchor in the song] will gather all the target zones created by those anchor points.

Note: I used 2 hash tables in this algorithm. If you don’t know how they work, just see them as a very efficient way to store and retrieve data. If you want to know more, you can read my article on the HashMap in Java, which is just an efficient hash table.
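Here is a sketch of this filtering step under the assumptions above. It consumes the (song_id, song_anchor_time, record_anchor_time) tuples from the search sketch; the value of coeff is my own placeholder, not a value from the paper.

# Sketch of the filtering step with the 2 hash tables described above.
# The ">= 4 couples per anchor" rule and the 300 * coeff threshold follow the text;
# the exact value of coeff is an assumption.

from collections import Counter, defaultdict

def filter_results(results, record_zone_count=300, coeff=0.5, zone_size=5):
    # 1) count how many times each (song_id, song_anchor_time) couple appears
    anchor_counts = Counter((song_id, song_anchor_time)
                            for song_id, song_anchor_time, _ in results)
    # 2) keep only the anchors that appear at least 4 times (they form a target zone)
    full_zones = {key for key, count in anchor_counts.items() if count >= zone_size - 1}
    # 3) count the complete target zones per song
    zones_per_song = Counter(song_id for song_id, _ in full_zones)
    # 4) keep the songs with enough target zones in common with the record
    kept_songs = {song_id for song_id, zones in zones_per_song.items()
                  if zones >= record_zone_count * coeff}
    # 5) regroup the surviving couples by song Id for the time coherency step
    by_song = defaultdict(list)
    for song_id, song_anchor_time, record_anchor_time in results:
        if song_id in kept_songs:
            by_song[song_id].append((song_anchor_time, record_anchor_time))
    return by_song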

 

Time coherency

At this stage we only have songs that are really close to the record. But we still need to verify the time coherency between the notes of the record and these songs. Let’s see why:

[Figure: search-min — two target zones from two different songs containing the same notes played in a different order]

 

In this figure, we have 2 target zones that belong to 2 different songs. If we didn’t check for time coherency, those target zones would increase the matching score between the 2 songs, even though they don’t sound alike, since the notes in those target zones are not played in the same order.

This last step  is about time ordering. The idea is:

  • to compute for each remaining song the notes and their absolute time position in the song.
  • to do the same for the record, which gives us the notes and their absolute time position in the record.
  • if the notes in the song and those in the record are time coherent, we should find a relation like this one: “absolute time of the note in the song = absolute time of the note in the record + delta”, where delta is the starting time of the part of the song that matches the record.
  • for each song, we need to find the delta  that maximizes the number of notes that respect this time relation
  • Then we choose the song that has the maximum number of time coherent notes with the record

 

Now that you get the idea, let’s see how to do it technically. At this stage, we have for the record a list of address/value pairs:

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] -> [“absolute time of the anchor in the record”].

 

And we have for each song a list of address/value pairs (stored in the hash table of the previous step):

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] -> [“absolute time of the anchor in the song”; “Id of the song”].

 

The following process needs to be done for all the remaining songs:

  • For each address in the record, we get the associated value(s) of the song, we compute delta = “absolute time of the anchor in the song” – “absolute time of the anchor in the record”, and we put this delta in a “list of deltas”.
  • It is possible that an address in the record is associated with multiple values in the song (i.e. multiple points in different target zones of the song); in that case we compute the delta for each associated value and put all of them in the “list of deltas”.
  • For each distinct value of delta in the “list of deltas”, we count its number of occurrences (in other words, we count for each delta the number of notes that respect the rule “absolute time of the note in the song = absolute time of the note in the record + delta”).
  • We keep the greatest count (which gives us the maximum number of notes that are time coherent between the record and the song).

From all the songs, we keep the song with the maximum number of time-coherent notes. If this coherency is above “the number of notes in the record” * “a coefficient”, then this song is the right one (a sketch of this scoring follows below).
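A minimal sketch of this delta-histogram scoring, working on the per-song output of the filtering sketch above (the coefficient value is again my own placeholder):

# Sketch of the time coherency scoring: for each surviving song, we histogram the
# deltas "anchor time in song - anchor time in record" and keep the biggest bin.

from collections import Counter

def time_coherency_score(anchor_pairs):
    """anchor_pairs: list of (song_anchor_time, record_anchor_time) for one song."""
    deltas = Counter(song_time - record_time
                     for song_time, record_time in anchor_pairs)
    return max(deltas.values()) if deltas else 0

def best_song(by_song, record_note_count, coeff=0.5):
    """by_song: output of filter_results. coeff is an assumed threshold coefficient."""
    song_id, score = max(((sid, time_coherency_score(pairs))
                          for sid, pairs in by_song.items()),
                         key=lambda item: item[1], default=(None, 0))
    return song_id if score >= record_note_count * coeff else None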

We then just have to look up the metadata of the song (“artist name”, “song name”, “iTunes URL”, “Amazon URL”, …) with the song Id and give the result back to the user.

 

Let’s talk about complexity!

This search is really more complicated than the simple one we first saw, so let’s see if it is worth it. The enhanced search is a step-by-step approach that reduces the complexity at each step.

For the sake of comprehension, I’ll recall all the assumptions (or choices) I made and make new ones to simplify the problem:

  • We have 512 possible frequencies
  • on average a song contains 30 peak frequencies per second
  • Therefore the 10-sec record contains 300 time-frequency points
  • S is the number of seconds of music of all the songs
  • The size of the target zone is 5 notes
  • (new) I assume that the delta time between a point and its anchor is either 0 or 10 msec
  • (new) I assume the generation of addresses is uniformly distributed, which means there is the same number of couples for any address [X,Y,T], where X and Y are any of the 512 frequencies and T is either 0 or 10 msec

The first step, the search, only requires 5 * 300 unitary searches.

The size of the result M is the sum of the results of the 5 * 300 unitary searches:

M = (5 * 300) * (S * 30 * 5 * 300) / (512 * 512 * 2)

The second step, the result filtering, can be done in M operations. At the end of this step, there are N notes distributed across Z songs. Without a statistical analysis of the music collection, it’s impossible to get the values of N and Z. I feel that N is really lower than M and that Z represents only a few songs, even for a 40-million-song database like Shazam’s.

The last step is the analysis of the time coherency of the Z songs. We’ll assume that each song has approximately the same number of notes: N/Z. In the worst-case scenario (a record that comes from a song that contains only one note played continuously), the complexity of one analysis is (5 * 300) * (N/Z).

The cost of the Z songs is 5 * 300 * N.

Since N << M, the real cost of this search is M = (300 * 300 * 30 * S) * (5 * 5) / (512 * 512 * 2)

If you remember, the cost of the simple search was: 300 * 300 * 30* S.

This new search is 20 000 times faster.
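As a quick back-of-the-envelope check of that factor (using only the numbers above; S cancels out of the ratio):

# Back-of-the-envelope check of the speedup, using the numbers from the text.
S = 7_000_000                                    # 30k songs, ~7 * 10^6 seconds of music
simple_cost = 300 * 300 * 30 * S                 # cost of the naive search
enhanced_cost = (300 * 300 * 30 * S) * (5 * 5) / (512 * 512 * 2)
print(f"speedup: {simple_cost / enhanced_cost:,.0f}x")   # ~20,971x, i.e. roughly 20 000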

Note: The real complexity depends on the distribution of frequencies inside the songs of the collection, but this simple calculation gives us a good idea of the real one.

 

Improvements

The Shazam paper is from 2003, which means the associated research is even older. In 2003, 64-bit processors were just being released to the mainstream market. Instead of using one anchor point per target zone as the paper proposes (because of the limited size of a 32-bit integer), you could use 3 anchor points (for example, the 3 points just before the target zone) and store the address of a point in the target zone in a 64-bit integer. This would dramatically improve the search time. Indeed, the search would then look for 4 notes in a song separated by delta_time1, delta_time2 and delta_time3 seconds, which means the number of results M would be very (very) much lower than the one we just computed.

A great advantage of this fingerprint search is its high scalability:

  • Instead of having 1 fingerprint database, you can have D databases, each of them containing 1/D of the full song collection
  • You can search for the closest song to the record in the D databases at the same time
  • Then you choose the closest song among the D candidates
  • The whole process is D times faster (a minimal sketch of this scatter-gather follows).
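A minimal sketch of that scatter-gather idea, assuming a search_one_shard callback that runs the whole search/filter/score pipeline against one shard (both names are mine, not a real API):

# Sketch of the scatter-gather idea: query the D shard databases in parallel and keep
# the best candidate returned by any of them.

from concurrent.futures import ThreadPoolExecutor

def search_all_shards(record_entries, shards, search_one_shard):
    """shards: list of D fingerprint databases, each holding 1/D of the collection.
    search_one_shard(record_entries, db) returns (song_id, score) or None."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        candidates = list(pool.map(lambda db: search_one_shard(record_entries, db), shards))
    scored = [c for c in candidates if c is not None]
    return max(scored, key=lambda c: c[1], default=None)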

 

Tradeoffs

Another good discussion would be the noise robustness of this algorithm. I could easily add 2k words just on this subject, but after 11k words I think it’s better not to dive into it … or only to say a few words.

If you read carefully, you noticed that I used a lot of thresholds, coefficients and fixed values (like the sampling rate, the duration of a record, …). I also chose or designed many algorithms (to filter a spectrogram, to generate a spectrogram, …). They all have an impact on the noise resistance and on the time complexity. The real challenge is to find the right values and algorithms that optimize:

  • The noise robustness
  • The time complexity
  • The precision (reducing the number of false positive results)

 

Conclusion

I hope you now understand how Shazam works. It took me a lot of time to understand the different subjects of this article and I still don’t master them. This article won’t make you an expert but I hope you have a very good picture of the processes behind Shazam. Keep in mind that Shazam is just one possible audio fingerprinting implementation.

 

You should be able to code your own Shazam. You can look at this very good article, which focuses more on how to code a simplified Shazam in Java than on the concepts behind it. The same author gave a presentation at a Java conference, and the slides are available here. You can also check this link for a MatLab/Octave implementation of Shazam. And of course, you can read the paper from Shazam co-founder Avery Li-Chun Wang yourself by clicking right here.

 

The world of music computing is a very interesting field, with tricky algorithms that you use every day without knowing it. Though Shazam is not easy to understand, it’s easier than:

  • query by humming: for example SoundHound, a competitor of Shazam, allows you to hum or sing the song you’re looking for
  • speech recognition and speech synthesis: implemented by Skype, Apple’s “Siri” and Android’s “Ok Google”
  • music similarity: the ability to find that 2 songs are similar. It’s used by Echonest, a start-up recently acquired by Spotify

If you’re interested, there is an annual contest between researchers on those topics and the algorithms of each participant are available. Here is the link to the MIREX contest.