Abstract
Accurate indoor floorplans are foundational for emerging smart-home applications. Yet acquiring this geometry typically relies on intrusive dedicated hardware or active crowdsourced mobile scanning, making widespread adoption impractical. In this paper, we present Loom, the first neural floorplan inference system that recovers room geometry using in-situ, commodity smart speakers without any active user intervention. Translating sparse, stationary acoustic signals into geometric boundaries, however, is a highly ambiguous, ill-posed inverse problem. Loom overcomes this physical barrier through three core innovations. First, we formulate layout reconstruction as a physics-guided conditional generation task; at its core, we design a proxy network that models acoustic propagation and constrains the structural search space. Second, we opportunistically reuse ambient echoes from daily user-device interactions as dynamic sound sources, unlocking multi-view spatial parallax without extra user burden. Third, we employ a self-evolving mechanism that adapts out of the box to unlabeled, heterogeneous room semantics. Extensive evaluations show that Loom achieves an SSIM of 0.83 in furnished rooms. We believe Loom paves the way toward ubiquitous spatial intelligence.






