Peng Li's website

I am currently a Software Design Engineer at Microsoft's Windows Core OS Division, Redmond, WA. Our team contributes architectural solutions and features to improve reliability of Windows operating systems. November 8, 2006 marked another important milestone in the company's long history of delivering commercial operating systems to hundreds of millions of customers worldwide. We shipped! The new Windows Vista operating system is here.

Operating system reliability or software reliability in general, seems to be a well studied area in computer science. There are hundreds of papers on this or related areas, such as fault tolerance, software reliability modeling and measurement, formal verification, etc. It is, in my opinion, a misconception. There are several unique challenges nowadays: (1) modern software is extremely complex (i.e., thousands of moving parts need to work together), (2) massive customer base, e.g., hundreds of millions PCs running Microsoft Windows around the world; and (3) huge variety of applications (either in user mode or kernel mode) for Windows platforms.

Before joining Microsoft, I completed my Ph.D. in Computer Engineering at Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA. Prof. Binoy Ravindran and Dr. E. Douglas Jensen (The MITRE Corporation) were my advisors. They have led me through my short research career in real-time computing. Some of my projects and papers are available at Prof. Ravindran's Real-Time Systems Lab website. My Ph.D. thesis is here.

The specific topic on which I did my Ph.D. is real-time scheduling and resource management using Dr. Jensen's Time/Utility Functions (TUFs). The concept of TUF is very intuitive: utility of a real-time task depends on when it completes. A TUF has time as an independent variable and utility as other dependent variable. Regardless of this conceptual simplicity, maximizing total utility accrued by tasks in a system is hard. Dr. Jensen's website has by far the best collection of literatures on TUFs.

One thing I want to make abundantly clear is that real-time systems are not restricted to some specialized systems such as weapon control, process control, or embedded communication systems. A system is a real-time system because it has some timing constraints. An air bag needs to be deployed within seconds of collision; otherwise people die. A UI needs to respond to mouse click within seconds; otherwise the application is perceived as in a hung state. A search engine needs to return search results within milliseconds; otherwise, nobody will use it. A Bussiness Intelligence (BI) system needs to analyze customer data within some time constraint; otherwise, its conclusion will become irrelevant. A simulation program possibly runs on a virtual timeline. However, it still needs to complete the simulation and to get results by certain time constraints. Even the Sun will burn out within a finite time. So anything that completes after the Sun dies has no meaning to us. Conversely, it is not always the faster the better though it may sound counter intuitive. Take a simple media player as an example. Playing movies in a super fast speed won't do any good; nobody can really enjoy the movie. As I show in the above examples, real-time is not real fast or real slow. It's just a set of timing constraints which could vary from nano seconds to centuries. Anything useful needs to interact with the physical world one way or the other. We, users of the systems, are probably the most important part of the physical world so to speak. And in the physical world, time never stops or goes back. Therefore, every system is a real-time system.

My research has produced scheduling and resource management algorithms for dynamic real-time systems. The heart of these algorithms is quite straightforward: since the system is highly dynamic and we don’t know or can’t predict the future, the algorithms try to accrue the most utility with the least resource consumption at each scheduling point. Similar ideas have been around for a while. My main contribution is to develop algorithms for generalized models such as applications with shared resources and almost arbitrarily shaped time/utility functions. The dilemma with dynamic real-time systems is that they are not very predictable yet require some sort of guarantee or assurance. So my other contribution is to develop resource access/synchronization protocols that provide statistical timing assurance for applications with some known patterns.

Since we were in the business of doing system research, I also implemented these algorithms in several prototype real-time distributed systems. One particularly interesting result is a generic, user-mode scheduling framework for POSIX pthreads. I call it “meta scheduler”, meaning it is a scheduler on top of priority schedulers. The universal priority scheduler is more of a dispatcher than a scheduler. Scheduling policies, e.g., which thread gets what priority, are outside the priority scheduler. Many entities in the system can and will impact scheduling. Take Windows as an example. Threads priorities are constantly boosted or dropped to prevent starvation and to improve responsiveness. Unfortunately, the world of scheduling is far larger than priority scheduling. Many interesting scheduling algorithms have no notion of priority. For example, the well-known Earliest Deadline First (EDF) scheduler always schedules the earliest deadline thread and has nothing to do with priorities. My meta scheduler provides a framework to translate decisions of external scheduling algorithms to thread priority changes. Therefore it bridges the gap between almost arbitrary scheduling algorithms and the underlying priority scheduler.

On the distributed resource management side, I mainly considered application replication as a way of improving timeliness and fault tolerance. The assumption is that these applications are mostly stateless, process continuous streams of data, and occasional data loss is acceptable. For example, a distributed real-time monitoring application reads video data from cameras/sensors at some fixed rate (e.g., 15 FPS), and detects abnormal objects in the pictures. An object will likely appear in multiple, continuous frames so that occasional loss of frames is acceptable. The goal is to complete the tasks by their timing constraints. An intuitive technique is to replicate applications on multiple nodes and distribute workload among the nodes. However, the questions are (1) how many replicas are enough, (2) what types of replicas are they, e.g., replicas to process the same data for fault-tolerance or replicas to process a different subset of data for real-time, and (3) where to put the replicas. Too many replicas can be counter-productive because of the overhead involved. In addition, poor selection of replication nodes can cause heavy resource contentions on the same node and therefore counter-productive too. Fault-tolerance is another constraint if some applications have certain availability requirements. My approach is primarily response time analysis/prediction by constructing possible schedules across multiple nodes. It is a proactive resource allocation by nature. Scheduling analysis on distributed systems can be expensive. Even if response times on all schedules can be easily analyzed, finding the optimal or sub-optimal one is hard. Therefore, my algorithms are heuristics in the sense that they only consider replication schemes that most likely yield good results.

Speaking of research, I'm interested in several areas:

Operating systems in general
Software/OS reliability
Software engineering
Embedded systems
Middleware and open system architectures
Real-time systems
Distributed systems
Application of formal methods in safety/mission critical systems
Quality of service
Network protocols

Before I came to US for my Ph.D., I completed B.S. (Automatic Control) and M.E. (Communication and Information Systems) in China. Both degrees were obtained from University of Electronic Science and Technology of China (UESTC). Staring my senior year design project, I have been involved in network operating systems and protocols. Routing protocols were my major focus back then.

Refereed Journal Publications

1. P. Li, H. Wu, B. Ravindran, and E. D. Jensen, "A Utility Accrual Scheduling Algorithm for Real-Time Activities with Mutual Exclusion Resource Constraints," IEEE Transactions on Computers, Vol. 55, Issue 4, April 2006, pp. 454 - 469.

2. P. Li and B. Ravindran, “Fast Real-Time Scheduling Algorithms,” IEEE Transactions on Computers, Vol. 53, No. 8, September 2004, pp. 1159-1175.

3. P. Li, B. Ravindran, S. Suhaib, and S. Feizabadi, “A Formally Verified Application-Level Framework for Real-Time Scheduling on POSIX Real-Time Operating Systems,” IEEE Transactions on Software Engineering, Vol. 30, Issue 9, September 2004, pp. 613-629.

4. B. Ravindran and P. Li, “DPR, LPR: Proactive Resource Allocation Algorithms for Asynchronous Real-Time Distributed Systems,” IEEE Transactions on Computers, Vol. 53, No. 2, February 2004, pp. 201-216.

5. H. Wu, B. Ravindran, E. D. Jensen, and P. Li “Time/Utility Function Decomposition Techniques for Utility Accrual Scheduling Algorithms in Real-Time Distributed Systems,” IEEE Transactions on Computers, Vol. 54, No. 9, September 2005, pp. 1138-1153.

6. H. Wu, B. Ravindran, E. D. Jensen, and P. Li, "Energy-Efficient, Utility Accrual Scheduling under Resource Constraints for Mobile Embedded Systems," ACM Transactions on Embedded Computing Systems, Vol. 5, No. 3, August 2006, pp. 513-542.

7. B. Ravindran, P. Li, and T. Hegazy, “Proactive Resource Allocation for Asynchronous Real-Time Distributed Systems in the Presence of Failures,” Journal of Parallel and Distributed Computing (JPDC), Vol. 63, Issue 12, December 2003, pp. 1219-1242.

8. P. Li and B. Ravindran, "Proactive QoS Negotiation in Asynchronous Real-Time Distributed Systems," Journal of Systems and Software, Vol. 73, Issue 1, September 2004, pp. 75-88.

9. P. Li and B. Ravindran, “Efficiently Tolerating Failures in Asynchronous Real-Time Distributed Systems,” Journal of Systems Architecture, Vol. 50, Issue 10, October 2004, pp. 607-621.

Referred Conference Publications

1. P. Li, B. Ravindran, and E. D. Jensen, “Utility Accrual Real-Time Resource Access Protocols with Assured Individual Activity Timeliness Behavior”, Proc. 14th International Conference on Real-Time and Network Systems (RTNS06) , May 2006.

2. P. Li, H. Cho, B. Ravindran, and E. D. Jensen, “Stochastic, Utility Accrual Real-Time Scheduling with Task-Level and System-Level Timeliness Assurance”, Proc. IEEE Intl. Symposium on Object-Oriented Real-Time Distributed Computing (ISORC), May 2005.

3. B. Ravindran, E. D. Jensen, and P. Li, “On Recent Advances in Time/Utility Function Scheduling and Resource Management”, Proc. IEEE Intl. Symposium on Object-Oriented Real-Time Distributed Computing (ISORC), May 2005.

4. P. Li, B. Ravindran, and E. D. Jensen, “Utility Accrual Real-Time Scheduling with Probabilistically-Assured Timeliness Performance”, Proc. International Workshop on Probabilistic Analysis Techniques for Real-Time and Embedded Systems, ACM International Conference on Embedded Software, September 2004.

5. H. Wu, B. Ravindran, E. D. Jensen, and P. Li, “Energy-Efficient, Utility Accrual Scheduling under Resource Constraints for Mobile Embedded Systems,” Proc. Fourth ACM International Conference on Embedded Software (EMSOFT2004), September 2004.

6. H. Wu, B. Ravindran, E. D. Jensen, and P. Li, “CPU scheduling for statistically-assured real-time performance and improved energy efficiency,” Proc. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS 2004), September 2004.

7. P. Li, B. Ravindran, H. Cho, and E. D. Jensen, "Scheduling Distributable Real-Time Threads in Tempus Middleware," Proc. IEEE Intl. Conference on Parallel and Distributed Systems (ICPADS), July 2004.

8. P. Li, B. Ravindran, and E. D. Jensen, "Utility Accrual Scheduling of Distributable Threads---The Tempus Approach," Proc. Workshop on High Performance Embedded Computing (HPEC'04), MIT Lincoln Laboratory, June 2004.

9. P. Li, B. Ravindran, and E. D. Jensen, "Adaptive Time-Critical Resource Management Using Time/Utility Functions: Past, Present and Future," Proc. IEEE International Computer Software and Applications Conference (COMSAC'04), June 2004 (fast abstract).

10. B. Ravindran, E. D. Jensen, P. Li, H. Wu, and S. Feizabadi, "The Case for TUFs and UA Scheduling in RT UML Profile: A Real-Time Scheduling/Operating System Perspective," Workshop on the Usage of the UML profile for Scheduling, Performance and Time (SIVOES-SPT 2004), April 2004.

11. S. Feizabadi, W. Beebee, Jr., B. Ravindran, P. Li, and M. Rinard, "Utility Accrual Scheduling With Real-Time Java," Proc. Workshop on Java Technologies for Real-Time and Embedded Systems, Lecture Notes in Computer Science, Springer-Verlag Heidelberg, Volume 2889/2003.

12. P. Li, B. Ravindran, J. Wang, and G. Konowicz, "Choir: A Middleware Architecture Supporting Benefit-Based Proactive Resource Allocation," Proc. IEEE Intl. Symposium on Object-Oriented Real-Time Distributed Computing (ISORC), March 2003, pp. 292-299.

13. B. Ravindran, G. Le Lann, J. Wang, and P. Li, "A Systems Engineering Approach for Constructing Certifiable Real-Time Distributed Systems," Proc. IEEE Intl. Symposium on Object-Oriented Real-Time Distributed Computing (ISORC), March 2003, pp. 105-112.

14. B. Ravindran, G. Le Lann, and P. Li, "Constructing High Assurance Asynchronous Real-Time Distributed Systems: A Proof-Based System Engineering Approach," Proc. IEEE/IEICE Intl. Symposium on High Assurance Systems Engineering (HASE), October 2002, pp. 89-90.

15. P. Li and B. Ravindran, "Efficiently Tolerating Failures in Asynchronous Real-Time Distributed Systems," Proc. IEEE/IEICE International Symposium on High Assurance Systems Engineering (HASE), October 2002, pp. 19-26.

16. P. Li and B. Ravindran, "Proactive QoS Negotiation in Asynchronous Real-Time Distributed Systems," Proc. ISCA Intl. Conference on Parallel and Distributed Computing Systems (PDCS), September 2002, pp.237-242.

17. P. Li, B. Ravindran, and T. Hegazy, "Implementation and Evaluation of A Best-Effort Scheduling Algorithm in an Embedded Real-Time System", Proc. IEEE Intl. Symposium on Performance Analysis of Systems and Software, November 2001, pp. 21-29.

Patents and Inventions

M. R. Garzia, M. Ni, T. Jivakov, and P. Li, “Stability Index User Interface”, Microsoft Corporation, pending US patent, 2006.
A. Dayen, H. A. Crittenden, M. R. Garzia, P. Li, and M. Khambatti, “A Method and Apparatus of Analyzing Computer System Interruptions”, Microsoft Corporation, pending US patent, 2005.
P. Li, B. Ravindran, and H. Wu, “Generic Benefit Scheduling Algorithm (GBS),” VTIP Intellectual Property Disclosure No. 03-095, Filed July 2003.
P. Li and B. Ravindran, “A Formally Verified Application-Level Framework for Utility Accrual Real-Time Scheduling On POSIX Real-Time Operating Systems,” VTIP Intellectual Property Disclosure No. 03-128, Filed October 2003.
H. Wu, B. Ravindran, and P. Li, “Resource-Constrained Energy-Efficient Utility Accrual Algorithm (ReUA),” VTIP Intellectual Property Disclosure No. 04-074, Filed August 2004.

Professional Services

Reviewer for IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems (TPDS), ACM Transactions in Embedded Computing Systems (TECS), IEEE International Conference on Computer Communications and Networks (ICCCN’04), The Journal of Systems and Software (JSS), Elsevier Science Inc., and Parallel and Distributed Computing Practices (PDCP), NOVA Science Publishers

Last Updated: December 2006