“The most exciting phrase to hear in science,
the one that heralds the most discoveries,
is not 'Eureka!' but 'That's funny…'”
--Isaac Asimov
“To Make Machines Perceive Signals as Humans Do”
Signal quality evaluation plays a central role in shaping almost all signal processing algorithms and systems, as well as their implementation, optimization and testing. Since the human (with the senses of vision, hearing, touch, smell and taste) is the ultimate receiver of the majority of signals so far, whether naturally captured or computer generated, incorporating proper characteristics of human perception after acquisition, processing and transmission not only makes the built systems user-oriented but also enables resource savings (i.e., turning the imperfection of human perception into an advantage in design).
The resulting metrics are intended to replace the existing mathematical measures (e.g., MSE, SNR, PSNR, or one of their relatives) for defining and gauging the distortion of processed signals, since MSE, SNR and PSNR do not reflect human perception well. Perceptual metrics are expected to fill a gap in most existing signal-processing products and services: a non-perception-based criterion is used in engineering design, while the resultant devices/services are meant for human consumption.
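As a minimal illustration of why such pixel-wise measures fall short, the NumPy sketch below (a toy example of our own, not from any cited work) constructs two distortions with identical MSE and hence identical PSNR, even though a uniform luminance shift is typically far less visible to a human than random per-pixel noise:

```python
import numpy as np

def mse(ref, dist):
    """Mean squared error between reference and distorted signals."""
    return float(np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2))

def psnr(ref, dist, max_val=255.0):
    """Peak signal-to-noise ratio in dB (higher is 'better' by this metric)."""
    e = mse(ref, dist)
    return float("inf") if e == 0 else 10.0 * np.log10(max_val ** 2 / e)

rng = np.random.default_rng(0)
ref = rng.integers(50, 200, size=(64, 64)).astype(np.float64)

# Two distortions with the same per-pixel magnitude (hence the same MSE):
# a constant +5 offset (barely noticeable) vs. +/-5 random noise (often visible).
shifted = ref + 5.0
noisy = ref + rng.choice([-5.0, 5.0], size=ref.shape)

print(mse(ref, shifted), mse(ref, noisy))    # both 25.0
print(psnr(ref, shifted), psnr(ref, noisy))  # identical PSNR values
```

By MSE/PSNR the two distorted images are indistinguishable, which is exactly the gap a perceptual metric is meant to close.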
This is an exciting, interdisciplinary research area, since it enables user-oriented design and further improvement of system performance. We need to incorporate the latest relevant findings in neuroscience, brain theory, psychophysics, aesthetics, statistics, and user and cultural studies into computational models, and to verify such models against subjective viewing/hearing/sensing results; i.e., to make perception science truly quantitative.
Some examples of research work under this theme:
As can be seen in the Publications part, a comprehensive theoretical formulation of JND (Just-Noticeable Distortion) has been introduced in the spatial [2003], transform [2005], spatiotemporal [2006], boundary-texture separation [2010], and pattern-masking (with brain theory) [2013] aspects, with extensions to screen content [2016] and top-down mechanisms [2022]; we have also investigated JND-based protection of privacy/copyright, countering adversarial attacks and deepfakes [2021-2023], and explored beyond the visual into audio, haptics, olfaction and taste [2022]. For visual attention modeling, the first quantitative solutions have been proposed for the modulatory effect [2005], for compressed bitstreams directly [2011] (a leap, since all visual signals are compressed), for spatiotemporal uncertainty [2013], for stereoscopic views [2014], and with the use of deep learning [2019, 2020, 2022, 2023]. Besides several technological breakthroughs in signal-driven IQA (Image Quality Assessment) for natural and partially artificial images/videos [2003-2022], our effort has led to the emergence of machine learning as a new IQA category [2009-2013], extended by big foundation models [2022, 2023] and to generative AI [2023]; to the first attempt at fine-grained IQA [2019, 2022, 2023] (an overlooked but important field); and to the demonstration that foundation models possess IQA capability [2024]. For perceptual video coding, the built perceptual models enable effective resource allocation and signal optimization/reconstruction [2003, 2005, 2013-2017], with exploration into 3D visual signals [2022, 2023].
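To give a flavor of what a spatial JND model computes, here is a toy luminance-adaptation visibility threshold in Python. The piecewise shape (higher thresholds in dark regions, slowly rising thresholds in bright ones) follows the spirit of classic spatial-domain JND models, but the exact constants used here are illustrative assumptions, not values taken from the publications above:

```python
import numpy as np

def luminance_jnd(bg):
    """Toy luminance-adaptation JND threshold for 8-bit background luminance bg.

    Piecewise curve: thresholds are larger in dark regions (Weber-like
    behaviour breaks down at low luminance) and grow slowly in bright
    regions. Constants are illustrative, not from a specific paper.
    """
    bg = np.asarray(bg, dtype=np.float64)
    dark = 17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0   # branch for bg <= 127
    bright = (3.0 / 128.0) * (bg - 127.0) + 3.0       # branch for bg > 127
    return np.where(bg <= 127.0, dark, bright)

# A distortion of amplitude below the threshold is (approximately) invisible,
# so a coder may spend fewer bits wherever the threshold is high.
for b in (0, 64, 127, 200, 255):
    print(b, round(float(luminance_jnd(b)), 2))
```

In a real JND profile this luminance term would be combined with texture/contrast masking and, for video, temporal masking; the point here is only the shape of the curve.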
“To Manage Megabits in Visual Signals”
Great achievements have been made in the area of video coding during the past three-plus decades through the joint effort of academia and industry (resulting in the current ITU/MPEG/JPEG standards and all the related products and services launched). Due to the huge data volume of visual signals, images and video still consume the dominant portion of storage/bandwidth in a typical multimedia system today, even after compression with state-of-the-art technology, in comparison with other media types (speech, audio, text and others). Therefore, there is a need for further, substantial enhancement of video coding technology, especially with the coming of the big-data era. However, after so many years of intensive effort in the field, the room for further improvement in coding efficiency within the current technical framework is diminishing, and new coding schemes are unlikely to arise merely from further rounds of reassembly and optimization of the existing techniques (as in the cases of H.264/HEVC/...). A more revolutionary approach is badly needed.
In our past research, we went back to the basics (e.g., how to define a data frame for coding, how to find or even synthesize a reference frame, and which transform to use in which situation) to perform a systematic investigation of video coders and to devise new video coding methods that address the major difficulties in further improving the compression ratio and meet the requirements of ubiquity and affordability for multimedia signal representation and communication. Pre-processing, post-processing, adaptation and reconstruction are also important for enhancing the received coding quality. It is also noted that videos increasingly contain computer-generated animations and scenes, whose characteristics differ from those of naturally captured content.
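On the question of which transform to use: the sketch below shows the classic motivation for the DCT in block-based coders, namely energy compaction on smooth content. It is a self-contained NumPy implementation of the orthonormal 2-D DCT-II; the smooth gradient block is a made-up example, not data from our work:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (the transform family used in JPEG/H.26x)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= 1.0 / np.sqrt(2.0)      # DC row scaling for orthonormality
    return m * np.sqrt(2.0 / n)

D = dct_matrix(8)
block = np.outer(np.arange(8), np.ones(8)) * 16.0  # smooth vertical gradient block
coeffs = D @ block @ D.T                           # separable 2-D DCT

# Energy compaction: for smooth blocks, a handful of low-frequency
# coefficients carry nearly all of the signal energy.
energy = coeffs ** 2
top = np.sort(energy.ravel())[::-1]
print(top[:3].sum() / energy.sum())  # fraction of energy in the top 3 coefficients
```

For highly textured or screen-content blocks this compaction breaks down, which is exactly why the choice of transform should depend on the situation.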
Another interesting dimension of work is to build a perceptual model (in conjunction with Theme 1) that accounts for the major relevant psychophysical findings, since the human is still the ultimate appreciator and consumer of coded signals in many situations. The phenomena to be modeled may include the spatiotemporal contrast sensitivity function, luminance adaptation, visual attention, object motion, eye movement, color/contrast/activity masking, emotional/aesthetic features, task-related clues, interaction with other media (such as audio), and friendliness toward a coder. The resultant model is to be used as the integrated control for the new video coding scheme. Compression of 3D point clouds is an emerging area of interest and importance.
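For the contrast sensitivity function mentioned above, one widely used analytic approximation is the Mannos-Sakrison model. The small sketch below evaluates it to show the band-pass nature of spatial-frequency sensitivity (the formula is from the literature; the peak-search code is our own illustration):

```python
import math

def csf_mannos_sakrison(f):
    """Mannos-Sakrison CSF approximation; f is spatial frequency in cycles/degree."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

# Band-pass behaviour: sensitivity peaks at mid spatial frequencies
# (around 8 cycles/degree) and drops for very low and very high frequencies,
# so distortions at the extremes of the band can be coded more coarsely.
peak_f = max(range(1, 61), key=csf_mannos_sakrison)
print(peak_f)
```

A coder-friendly perceptual model would use such a curve to weight quantization noise by frequency band rather than treating all coefficients equally.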
We have also been exploring the paradigm shift from whole-signal coding to feature coding, to enable collaborative intelligence in the AI era. The proposed new paradigm is designed to distribute the computing load flexibly among edge, cloud and clients, to transmit data for various analysis tasks, to extract more accurate features, and to preserve privacy. Research will also be extended to true multimedia signals, i.e., to include auditory, tactile, olfactory and gustatory ones.
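A minimal sketch of the edge-side half of such a split, with hypothetical names throughout (a fixed random projection stands in for a learned feature encoder), to show how transmitting compact features instead of pixels saves bandwidth and avoids shipping the raw signal off-device:

```python
import numpy as np

def edge_extract(image, dim=64, seed=42):
    """Hypothetical edge-side encoder: random projection + 8-bit quantization.

    A stand-in for a learned feature extractor in a collaborative-intelligence
    pipeline; the cloud side would run its analysis tasks on these features,
    never on the raw pixels (which also helps preserve privacy).
    """
    rng = np.random.default_rng(seed)               # projection shared with the cloud
    proj = rng.standard_normal((dim, image.size))
    feat = proj @ image.ravel().astype(np.float64)
    scale = float(feat.std()) or 1.0                # avoid division by zero
    return np.clip(np.round(feat / scale), -127, 127).astype(np.int8)

image = np.random.default_rng(0).integers(0, 256, size=(224, 224)).astype(np.uint8)
feat = edge_extract(image)
print(image.nbytes, feat.nbytes)  # bytes on the wire: 50176 pixels vs. 64 features
```

In a real system the encoder is trained jointly with the downstream tasks and the feature bitstream itself is entropy-coded; the sketch only illustrates the bandwidth argument.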
As machines increasingly become the ultimate users of visual signals, our research effort has also been extended to machine-task-oriented visual signal processing and evaluation.
Some work that we are proud of under this theme:
Please refer to the Publications part for more and updated development.