Tidzam is an ambient sound analysis system for outdoor environments. It is a component of the Tidmarsh project, which monitors the environmental evolution of an industrial cranberry farm during its ecological restoration to wetland. Tidzam analyzes the audio streams generated by the microphones deployed in the wild in order to detect the sonic events happening on the site, such as bird calls, insects, frogs, rain, storms, car noise, human voices and others. This system is used to cross-validate other sensors for weather monitoring, and to identify, geolocate and track the wildlife and bird specimens present over time. It also controls the audio mixers in order to mute noisy microphones or change their gain.
This system uses deep learning to learn its classification tasks. A Human Computer Interface API provides tools to build a training database from the targeted sonic environment. A new classification task can be bootstrapped from external audio recordings in order to create weak classifiers, which are then refined by adding audio samples that the system automatically extracts from the environment. The system therefore improves its accuracy over several generations of this iterative learning process.
Tidzam is composed of several independent processes which can require significant resources in terms of CPU, GPU and memory, depending on the complexity of the classification tasks and the number of processed audio streams. These processes are multi-threaded and can be deployed on a cluster-based architecture.
Several external components are required by Tidzam and must be installed. Currently, the system has been tested on a cluster-based Ubuntu 16.04 setup with Titan X GPUs.
tidzam install
Tidzam is implemented in Python 3.x and uses the following external components:
- JACK server is a low-latency audio mixer server which routes the audio streams between the different components of the system. It is managed by the TidzamStreamManager, which loads the incoming audio sources, configures them and monitors the JACK server. Manual client configurations can be performed without conflicting with the TidzamStreamManager.
- Icecast server pushes the audio streams processed by the system to a Web interface in order to let clients listen to each stream independently. This functionality is required when the incoming audio sources contain many channels (as in OPUS encoding) which cannot be played with a classical audio player. Each channel of the audio input sources is split into an independent mono channel.
- MPV and FFMPEG are used to load the incoming audio sources into the JACK server and to push them to the Icecast server.
- Tensorflow is the deep learning framework on which Tidzam is implemented. It is recommended to install its GPU-enabled version for real-time processing and to save CPU load and memory.
- Python Package Dependencies:
python3
python-scipy
python-matplotlib
python-socketio
python-engineio
socketIO_client
sounddevice
json
jack-mixer
The tidzam script can be used to start, stop and restart all the processes on a single-server architecture with one system command. The check option verifies that the system is running properly and restarts it if not.
tidzam [start | stop | restart | check]
TidzamStreamManager feeds the JACK server with audio streams, denoted sources, which are routed to the Icecast server and the TidzamAnalyzer. All audio formats supported by MPV can currently be used in Tidzam, since MPV is responsible for loading the sources into the JACK server. A source can be a local file on the server, a Web URL or a LiveStream. A LiveStream is a PCM 16-bit stream pushed through socket.io to the TidzamStreamManager; it can be, for example, a recording from the microphone of a mobile device.
Usage: TidzamStreamManager.py [options]
Options:
-h, --help show this help message and exit
--buffer-size=BUFFER_SIZE
Set the Jack ring buffer size in seconds (default: 100
seconds).
--samplerate=SAMPLERATE
Set the sample rate (default: 44100).
--port-available=LIVE_PORT
Number of available ports for live streams.
(default: 10).
--port=PORT Socket.IO Web port (default: 8080).
--tidzam-socketio=TIDZAM_ADDRESS
Socket.IO address of the tidzam server (default:
localhost:8001).
--sources=SOURCES JSON file containing the list of the initial audio
source streams (default: None).
--debug=DEBUG Set debug level (Default: 0).
An initial JSON config file can be provided to the manager (--sources option) in order to automatically load some sources at startup. In that case, they are automatically defined as permanent sources, which means that if the streaming is disconnected, the system will periodically try to reload them.
{
"sources":[
{
"name":"impoundment",
"url":"http://doppler.media.mit.edu:8000/impoundment.opus",
"path_database":"/mnt/tidmarsh-audio/impoundment-mc",
"nb_channels":30
},
{
"name":"herring",
"url":"http://doppler.media.mit.edu:8000/herring.ogg",
"path_database":"/mnt/tidmarsh-audio/herring",
"nb_channels":1
}
]
}
The JACK server can handle many different clients, which can drastically increase the CPU load until the system resources are saturated. Several parameters should be adapted according to the number of streams that Tidzam is supposed to handle and the available resources of the server.
- --port-max defines the maximum number of clients that the JACK server will authorize. It is the safety lock in Tidzam that protects the system load: if the maximum number of clients is reached, the TidzamStreamManager will not be able to load new sources. In the worst case, an audio source in Tidzam consumes (3 x #NumberOfSources x #LoadedChannels) clients.
- -r deactivates the real-time mode, if the OS does not support it or if the system load cannot handle it.
- -t defines the timeout before a JACK client is kicked out. When a client is disconnected, the TidzamStreamManager will automatically try to reload it. If the client was disconnected because it was too slow to respond to the JACK server due to the system load, then its restart will produce the disconnection of another one and thus trigger a cascade of failures. In order to avoid such a situation, the timeout should be large enough to avoid disconnections due to system load, or the number of clients MUST be reduced with the --port-max option.
- -ddummy defines a dummy audio driver for the JACK server in order not to lock the hardware sound device.
- -r44100 defines the sample rate of the JACK server (must be the same as the one used by Tidzam).
- -pXXXX defines the size XXXX of the JACK buffer used by the clients and thus their latency. A higher value increases the system latency but decreases the CPU load. It must be a power of two.
jackd --port-max 2048 -v -r -t50000 -ddummy -r44100 -p8192
The Icecast server receives the audio streams from different FFMPEG clients which are started and managed by the TidzamStreamManager. Any format supported by FFMPEG can be used as an output stream on the Icecast server. The special copy mode of FFMPEG can also be used in order to copy the input source directly to the Icecast output without an encoding-decoding process (see Source Management - Source Loading). This feature is used when the input source contains more channels (as in an OPUS stream) than FFMPEG can decode.
TidzamStreamManager has a socket.io interface in order to manage the sources at runtime.
A source can be loaded from a remote URL, from a local file or from a LiveStream. A permanent source can also have a local database composed of its previous recordings, with their filenames formatted as database_name-YYYY-MM-DD-HH-MM-SS.{ogg | opus | wav} (see JSON config file). The source loading request can therefore select the proper file to load through the field date. If the date is in the future, the URL field will be loaded as the online audio stream. By default all channels are loaded, but a list of channels can also be provided through the field channels. If the field is_permanent is set to True (default is False), the source is considered permanent and will be restarted in case of termination. If the field format (default is ogg) is set to copy, the TidzamStreamManager will use the FFMPEG copy option to push the stream to Icecast without any encoding-decoding processing.
Request on "sys" event:
{
'sys':{
'loadsource':{
'name':'mynewsource',
'url':'https://',
'database':'database_name',
'date':'YYYY-MM-DD-HH-MM-SS',
'channels':"channel4,test_chan5,...",
'is_permanent':True | False,
'format':'ogg | copy'
}
}
}
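For reference, a loadsource request following the schema above can be assembled programmatically before being emitted on the "sys" event; the sketch below is illustrative (the helper name and the localhost:8080 address are assumptions, not part of Tidzam's API):

```python
def loadsource_request(name, url, database=None, date=None,
                       channels=None, is_permanent=False, fmt='ogg'):
    """Build the 'loadsource' payload following the schema above.
    Optional fields are omitted when not provided."""
    req = {'name': name, 'url': url,
           'is_permanent': is_permanent, 'format': fmt}
    if database is not None:
        req['database'] = database
    if date is not None:
        req['date'] = date
    if channels is not None:
        # The schema expects a comma-separated string of channel names.
        req['channels'] = ','.join(channels)
    return {'sys': {'loadsource': req}}

# With the socketIO_client package listed in the dependencies, the
# request could then be emitted as (not executed here):
#   from socketIO_client import SocketIO
#   with SocketIO('localhost', 8080) as sio:
#       sio.emit('sys', loadsource_request(
#           'herring', 'http://doppler.media.mit.edu:8000/herring.ogg'))
```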
Request on "sys" event:
{
'sys':{
'unloadsource':{
'name':'mysourcename'
}
}
}
Request on "sys" event:
{'sys':{'database':''}}
Response on "sys" event:
{
'sys':{
'database':{
'database_name':{
'nb_channels':int,
'database':[
[start_time, end_time],
[...]
]
},
'database_name2':{
'nb_channels':int,
'database':[
[start_time, end_time],
[...]
]
}, ...
}
}
}
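On the client side, the database response above can be flattened into a simple list of recording intervals; a minimal sketch (the function name is illustrative, the field names follow the response schema):

```python
def list_recordings(response):
    """Flatten a 'database' response into
    (source_name, nb_channels, start_time, end_time) tuples."""
    rows = []
    for name, info in response['sys']['database'].items():
        for start_time, end_time in info['database']:
            rows.append((name, info['nb_channels'], start_time, end_time))
    return rows
```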
A LiveStream is automatically created and connected to the Icecast server and TidzamAnalyzer when its data is received on the socket.io event "audio" from a client. The audio stream MUST be in PCM 16-bit format. The system generates a unique portname identifier based on the socket.io SID of the client, which can be requested as follows:
Request on event 'sys'
{
'sys':{
'add_livestream':''
}
}
Response on event 'sys'
{
'sys':{
'portname':'name'
}
}
Request on event 'sys'
{
'sys':{
'del_livestream':''
}
}
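Since the "audio" event expects PCM 16-bit data, a client capturing floating-point samples must convert them before emitting; a minimal sketch, assuming little-endian signed 16-bit samples (the byte order is an assumption, not stated above):

```python
import struct

def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM
    bytes, suitable for the 'audio' socket.io event (assumed layout)."""
    # Clamp to the signed 16-bit range to avoid wrap-around on overdriven input.
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack('<%dh' % len(ints), *ints)
```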
TidzamAnalyzer runs the different loaded classifiers on its input streams, which can be a regular audio file (--stream argument) or source channels from the JACK server (--jack argument). TidzamStreamManager does not connect the sources to the TidzamAnalyzer; they must be indicated in the --jack argument as a list of portname pattern matchings (for example, impoundment- will connect all portnames starting with this prefix).
The classifiers that must be loaded are specified by the --nn argument. They can be cascaded if there is a pattern matching between their class names. The primary classifier must be named selector. If a classifier contains a class that matches the name of another classifier, then the output class of the first classifier weights all classes of the second one. (For example, the class birds of the selector classifier weights the classes of the bird specimen classifier named birds.)
TidzamAnalyzer has an optional module for automatic sample extraction; the samples are stored in the folder specified by the --out argument. The sample extraction can be restricted to one port and depends on a pattern matching on the class names (for example, birds- will extract all detected samples with this prefix, like birds-american_crow or birds-canada_goose). If the argument --extract-dd is specified, the decision to extract a sample depends on a probability distribution representing the number of samples for each class stored in --out. These rules can be defined for each channel at runtime through the socket.io interface (see Extraction rules).
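The cascading rule described above can be illustrated with a small sketch. The semantics are assumed from the text (the selector's class probability scales every class of the matching secondary classifier), and the class names follow the birds example:

```python
def cascade(selector_probs, secondary_classifiers):
    """selector_probs: {class_name: probability} from the selector.
    secondary_classifiers: {classifier_name: {class_name: probability}}.
    Each secondary class is weighted by the selector class of the same
    name (assumed weighting scheme, shown for illustration)."""
    out = dict(selector_probs)
    for name, probs in secondary_classifiers.items():
        weight = selector_probs.get(name, 0.0)
        for cls, p in probs.items():
            # e.g. selector 'birds' weights specimen 'american_crow'
            out['%s-%s' % (name, cls)] = weight * p
    return out
```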
Usage: TidzamAnalyzer.py --nn=build/test [--stream=stream.wav | --jack=jack-output] [OPTIONS]
Options:
-h, --help show this help message and exit
-s STREAM, --stream=STREAM
Input audio stream to analyze.
-c CHANNEL, --channel=CHANNEL
Select a particular channel (only with stream option).
-j JACK, --jack=JACK List of Jack audio mixer ports to process.
-n NN, --nn=NN Folder containing the classifier to load.
-o OUT, --out=OUT Output folder for audio sound extraction.
--extract=EXTRACT List of classes to extract (--extract=unknown,birds).
--extract-dd Activate the extraction according to a Dynamic
Distribution of extracted sample (Default: True).
--extract-channels=EXTRACT_CHANNELS
Specify an id list of particular channels for the
sample extraction (Default: ).
--show Play the audio samples and show their spectrogram.
--overlap=OVERLAP Overlap value (default:0).
--chainAPI=CHAINAPI Provide URL for chainAPI username:password@url
(default: None).
--port=PORT Socket.IO Web port (default: 8080).
--debug=DEBUG Set debug level (Default: 0).
Subscription on event 'sys'
[
{
'chan':channel_id,
'analysis':{
'time':"YYYY-MM-DD-HH-MM-SS.MS",
'result':[ classe2 ],
'predicitions':{
'classe1': 0.001,
'classe2': 0.91
}
}
}
]
Request on event 'sys'
{
'sys':{
'classifier':{
'list':''
}
}
}
Request on event 'SampleExtractionRules'
{'get':'rules'}
Extraction rules define when a sample must be extracted. The extraction is determined by the parameter rate, which defines the extraction probability when an element of classes is detected. If rate is set to auto, the extraction probability depends on the sample distribution in the database. The length parameter defines the audio file length in seconds (default 0.5 second); the detected sample will be located in the middle of the audio file. The object_filter parameter applies a filter which does not extract samples in which the spectrogram energy is located on the sample border; it tries to extract centered sound objects.
Request on event 'SampleExtractionRules'
{
'set':'rules',
'rules':{
"source1":{
"classes":"classe1,classe2",
"rate":"auto | float",
"length":10,
"object_filter":[True | False]
},
"source2":{
"classes":"birds",
"rate":"auto | float"
}, [...]
}
}
Request on event 'SampleExtractionRules'
{'get':'extracted_count'}
Response on event 'SampleExtractionRules'
{
'extracted_count':{
"source1":#nb_samples,
"source2":18, [...]
}
}
Request on event 'SampleExtractionRules'
{'get':'database_info'}
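As an illustration of the rate='auto' behaviour described under Extraction rules, the extraction probability could favour under-represented classes; the exact policy implemented in Tidzam may differ, so this is only a sketch of the idea:

```python
def auto_extraction_probability(class_counts, class_name):
    """Return a probability in [0, 1] that decreases as the class becomes
    over-represented in the extracted-sample database (illustrative
    policy, not Tidzam's actual formula)."""
    total = sum(class_counts.values())
    if total == 0:
        return 1.0  # empty database: always extract
    return 1.0 - class_counts.get(class_name, 0) / total
```

Under such a policy, a rare class like birds-canada_goose below keeps a high extraction probability while common classes are sampled less often, which balances the database over time.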
The TidzamTrain process has a cluster-based implementation of Asynchronous Between-graph Replication, which allows the training to be executed in parallel on several GPUs distributed over several machines. A Parameter Server (ps) is responsible for aggregating and sharing the weights between the different distributed workers. If TidzamTrain is executed without an explicit cluster configuration (see --workers, --ps, --task-index and --job-type), only local GPUs will be used. The training and testing datasets can be provided in two ways:
- On the fly: a set of independent workers generates the batches live, directly from the audio file folders specified in --dataset-train. Based on an online file indexing, some files are used for training and the others for validation (hardcoded rate of 80%). A master process manages the sample batch queue in order to deliver the batches to the different workers of the training process.
- Compiled Dataset: this approach uses datasets which have been processed offline with the Database Editor or Database Tool. A dataset is composed of several archives containing the audio FFT samples with their labels. The dataset MUST be manually randomized and split into training and validation datasets, which must be provided to the trainer with the arguments --dataset-train and --dataset-test.
Usage: TidzamTrain.py --dataset-train=mydataset --dnn=models/model.py --out=save/ [OPTIONS]
Options:
-h, --help show this help message and exit
-d DATASET_TRAIN, --dataset-train=DATASET_TRAIN
Define the dataset to train.
-t DATASET_TEST, --dataset-test=DATASET_TEST
Define the dataset for evaluation.
-o OUT, --out=OUT Define output folder to store the neural network and
trains.
--dnn=DNN DNN model to train (Default: ).
--training-iterations=TRAINING_ITERS
Number of training iterations (Default: 20000
iterations).
--testing-step=TESTING_ITERATIONS
Number of training iterations between each testing
step (Default: 10).
--batchsize=BATCH_SIZE
Size of the training batch (Default:64).
--learning-rate=LEARNING_RATE
Learning rate (default: 0.001).
--stats-step=STATS_STEP
Step period to compute statistics, embeddings and
feature maps (Default: 10).
--nb-embeddings=NB_EMBEDDINGS
Number of embeddings to compute (default: 50).
--job-type=JOB_TYPE Select the process job: ps or worker
(default:worker).
--task-index=TASK_INDEX
Provide the task index to execute (default:0).
--workers=WORKERS List of workers
(worker1.mynet:2222,worker2.mynet:2222, etc).
--ps=PS List of parameter servers
(ps1.mynet:2222,ps2.mynet:2222, etc).
To run a process (such as a parameter server) on CPU only, hide the GPUs from Tensorflow before starting it:
export CUDA_VISIBLE_DEVICES=''
Database Tool is a command-line interface for building compiled databases.
Usage: python src/TidzamDatabase.py
Options:
-h, --help show this help message and exit
--dataset=DATASET Open an existing dataset.
--rename=RENAME Rename the dataset.
--classe=CLASSE Create an empty classe.
--audio-folder=AUDIO_FOLDER
Load the audio file folder in the dataset (as a single
classe if --classe is specified).
--merge=MERGE Merge the dataset with another one (as a single classe
if --classe is specified).
--split=SPLIT Extraction proportion of a sub dataset for testing
--split in [0...1]
--split-name=SPLIT_NAME
Name for the generated dataset.
--balance Automatically balance the classes in the dataset (by
duplicating samples in small classes).
--randomize Randomize the dataset.
--file-count Return the number of files which compose the dataset.
--metadata Generate metadata information and store them on file
0.
--info Return some dataset information.
Database Editor is a Text-Based User Interface with a menu to create and manipulate compiled databases.
Usage: python src/TidzamDatabaseEditor.py
Options:
-h, --help show this help message and exit
--dataset=OPEN Open an existing dataset
--stream=STREAM Sample extraction from an audio stream [WAV/OGG/MP3].
--play Play the dataset content.
--play-id=PLAYID Play the dataset content of a particular classe.
-s, --show Select a specific classe ID for --play option.
TidzamTrain periodically generates (see --stats-step) summaries for Tensorboard:
- Accuracy, cost, recall, precision and confusion matrix
- GraphDef with memory usage and computation distribution over the devices
- Weight histograms, distributions and feature maps
- Embeddings for 3D visualization of output class distances.
tensorboard --logdir=checkpoints