|
I am currently a
Software Design Engineer at Microsoft's Windows Core OS Division, Redmond,
WA. Our team contributes architectural solutions and features to improve reliability
of Windows operating systems. November
8, 2006 marked another important milestone in the company's long history
of delivering commercial operating systems to hundreds of millions of
customers worldwide. We shipped! The new Windows Vista
operating system is here. Operating system
reliability or software reliability in general, seems to be a well studied area
in computer science. There are hundreds of papers on this or related areas,
such as fault tolerance, software reliability modeling and measurement,
formal verification, etc. It is, in my opinion, a misconception. There
are several unique challenges nowadays: (1) modern software is extremely
complex (i.e., thousands of moving parts need to work together), (2) massive
customer base, e.g., hundreds of millions PCs running Microsoft Windows
around the world; and (3) huge variety of applications (either in user mode
or kernel mode) for Windows platforms. Before joining
Microsoft, I completed my Ph.D. in Computer Engineering at Department of Electrical and
Computer Engineering, Virginia
Tech, Blacksburg, VA. Prof. Binoy Ravindran
and Dr. E. Douglas Jensen (The MITRE Corporation) were my advisors. They
have led me through my short research career in real-time computing. Some of
my projects and papers are available at Prof. Ravindran's
Real-Time Systems Lab website.
My Ph.D. thesis is here. The specific topic on
which I did my Ph.D. is real-time scheduling and resource management using
Dr. Jensen's Time/Utility Functions (TUFs). The concept of TUF is very
intuitive: utility of a real-time task depends on when it completes. A
TUF has time as an independent variable and utility as other dependent
variable. Regardless of this conceptual simplicity, maximizing total utility
accrued by tasks in a system is hard. Dr. Jensen's website has by far the best collection of
literatures on TUFs. One thing I want to make abundantly clear is that real-time systems are not restricted to some specialized systems such as weapon control, process control, or embedded communication systems. A system is a real-time system because it has some timing constraints. An air bag needs to be deployed within seconds of collision; otherwise people die. A UI needs to respond to mouse click within seconds; otherwise the application is perceived as in a hung state. A search engine needs to return search results within milliseconds; otherwise, nobody will use it. A Bussiness Intelligence (BI) system needs to analyze customer data within some time constraint; otherwise, its conclusion will become irrelevant. A simulation program possibly runs on a virtual timeline. However, it still needs to complete the simulation and to get results by certain time constraints. Even the Sun will burn out within a finite time. So anything that completes after the Sun dies has no meaning to us. Conversely, it is not always the faster the better though it may sound counter intuitive. Take a simple media player as an example. Playing movies in a super fast speed won't do any good; nobody can really enjoy the movie. As I show in the above examples, real-time is not real fast or real slow. It's just a set of timing constraints which could vary from nano seconds to centuries. Anything useful needs to interact with the physical world one way or the other. We, users of the systems, are probably the most important part of the physical world so to speak. And in the physical world, time never stops or goes back. Therefore, every system is a real-time system. My research has
produced scheduling and resource management algorithms for dynamic real-time
systems. The heart of these algorithms is quite straightforward: since the
system is highly dynamic and we don’t know or can’t predict the future, the
algorithms try to accrue the most utility with the least resource consumption
at each scheduling point. Similar ideas have been around for a while. My main
contribution is to develop algorithms for generalized models such as
applications with shared resources and almost arbitrarily shaped time/utility
functions. The dilemma with dynamic real-time systems is that they are not
very predictable yet require some sort of guarantee or assurance. So my other
contribution is to develop resource access/synchronization protocols that
provide statistical timing assurance for applications with some known
patterns. Since we were in the
business of doing system research, I also implemented these algorithms in
several prototype real-time distributed systems. One particularly interesting
result is a generic, user-mode scheduling framework for POSIX pthreads. I call it “meta scheduler”, meaning it is a
scheduler on top of priority schedulers. The universal priority scheduler is more
of a dispatcher than a scheduler. Scheduling policies, e.g., which thread
gets what priority, are outside the priority scheduler. Many entities in the
system can and will impact scheduling. Take Windows as an example. Threads
priorities are constantly boosted or dropped to prevent starvation and to
improve responsiveness. Unfortunately, the world of scheduling is far larger
than priority scheduling. Many interesting scheduling algorithms have no
notion of priority. For example, the well-known Earliest Deadline First (EDF)
scheduler always schedules the earliest deadline thread and has nothing to do
with priorities. My meta scheduler provides a framework to translate
decisions of external scheduling algorithms to thread priority changes.
Therefore it bridges the gap between almost arbitrary scheduling algorithms
and the underlying priority scheduler. On
the distributed resource management side, I mainly considered application replication
as a way of improving timeliness and fault tolerance. The assumption is that these
applications are mostly stateless, process continuous streams of data, and occasional
data loss is acceptable. For example, a distributed real-time monitoring application
reads video data from cameras/sensors at some fixed rate (e.g., 15 FPS), and detects
abnormal objects in the pictures. An
object will likely appear in multiple, continuous frames so that occasional
loss of frames is acceptable. The goal is to complete the tasks by their
timing constraints. An intuitive technique is to replicate applications on multiple
nodes and distribute workload among the nodes. However, the questions are (1)
how many replicas are enough, (2) what types of replicas are they, e.g., replicas to process the same data for fault-tolerance or replicas to process a different subset of data for real-time, and (3) where to put the replicas. Too many
replicas can be counter-productive because of the overhead involved. In
addition, poor selection of replication nodes can cause heavy resource
contentions on the same node and therefore counter-productive too.
Fault-tolerance is another constraint if some applications have certain
availability requirements. My approach is primarily response time
analysis/prediction by constructing possible schedules across multiple nodes.
It is a proactive resource allocation by nature. Scheduling analysis on
distributed systems can be expensive. Even if response times on all schedules
can be easily analyzed, finding the optimal or sub-optimal one is hard. Therefore,
my algorithms are heuristics in the sense that they only consider replication schemes that most likely yield good results. Speaking of research,
I'm interested in several areas:
Before I came to US
for my Ph.D., I completed B.S. (Automatic Control) and M.E. (Communication
and Information Systems) in China. Both degrees were obtained from University
of Electronic Science and Technology of China (UESTC). Staring my senior year design
project, I have been involved in network operating systems and
protocols. Routing protocols were my major focus back then. Refereed Journal Publications1.
P.
Li, H. Wu, B. Ravindran, and E. D. Jensen, "A Utility Accrual Scheduling
Algorithm for Real-Time Activities with Mutual Exclusion Resource
Constraints," IEEE Transactions on Computers, Vol. 55, Issue 4,
April 2006, pp. 454 - 469. 2.
P.
Li and B. Ravindran, “Fast Real-Time Scheduling Algorithms,” IEEE
Transactions on Computers, Vol. 53, No. 8, September 2004,
pp. 1159-1175. 3.
P.
Li, B. Ravindran, S. Suhaib, and S. Feizabadi, “A Formally Verified Application-Level
Framework for Real-Time Scheduling on POSIX Real-Time Operating Systems,” IEEE
Transactions on Software Engineering, Vol. 30, Issue 9, September
2004, pp. 613-629. 4.
B.
Ravindran and P. Li, “DPR, LPR: Proactive Resource Allocation Algorithms for
Asynchronous Real-Time Distributed Systems,” IEEE Transactions on
Computers, Vol. 53, No. 2, February 2004, pp. 201-216. 5.
H.
Wu, B. Ravindran, E. D. Jensen, and P. Li “Time/Utility Function Decomposition
Techniques for Utility Accrual Scheduling Algorithms in Real-Time Distributed
Systems,” IEEE Transactions on Computers, Vol. 54, No. 9, September
2005, pp. 1138-1153. 6.
H.
Wu, B. Ravindran, E. D. Jensen, and P. Li, "Energy-Efficient, Utility
Accrual Scheduling under Resource Constraints for Mobile Embedded
Systems," ACM Transactions on Embedded Computing Systems, Vol. 5, No. 3, August 2006, pp. 513-542. 7.
B.
Ravindran, P. Li, and T. Hegazy, “Proactive
Resource Allocation for Asynchronous Real-Time Distributed Systems in the
Presence of Failures,” Journal of Parallel and Distributed Computing
(JPDC), Vol. 63, Issue 12, December 2003, pp. 1219-1242. 8.
P.
Li and B. Ravindran, "Proactive QoS
Negotiation in Asynchronous Real-Time Distributed Systems," Journal
of Systems and Software, Vol. 73, Issue 1, September 2004, pp. 75-88. 9.
P.
Li and B. Ravindran, “Efficiently Tolerating Failures in Asynchronous
Real-Time Distributed Systems,” Journal of Systems Architecture, Vol.
50, Issue 10, October 2004, pp. 607-621. Referred Conference Publications1.
P.
Li, B. Ravindran, and E. D. Jensen, “Utility Accrual Real-Time Resource
Access Protocols with Assured Individual Activity Timeliness Behavior”,
Proc. 14th International Conference on Real-Time and Network Systems
(RTNS06) , May 2006. 2.
P.
Li, H. Cho, B. Ravindran, and E. D. Jensen, “Stochastic, Utility Accrual
Real-Time Scheduling with Task-Level and System-Level Timeliness Assurance”,
Proc. IEEE Intl. Symposium on Object-Oriented Real-Time Distributed
Computing (ISORC), May 2005. 3.
B.
Ravindran, E. D. Jensen, and P. Li, “On Recent Advances in Time/Utility
Function Scheduling and Resource Management”, Proc. IEEE Intl. Symposium
on Object-Oriented Real-Time Distributed Computing (ISORC), May 2005. 4.
P.
Li, B. Ravindran, and E. D. Jensen, “Utility Accrual Real-Time Scheduling
with Probabilistically-Assured Timeliness Performance”, Proc.
International Workshop on Probabilistic Analysis Techniques for Real-Time and
Embedded Systems, ACM International
Conference on Embedded Software, September
2004. 5.
H.
Wu, B. Ravindran, E. D. Jensen, and P. Li, “Energy-Efficient, Utility Accrual
Scheduling under Resource Constraints for Mobile Embedded Systems,” Proc. Fourth ACM International Conference
on Embedded Software (EMSOFT2004), September 2004. 6.
H.
Wu, B. Ravindran, E. D. Jensen, and P. Li, “CPU scheduling for
statistically-assured real-time performance and improved energy efficiency,” Proc.
IEEE/ACM/IFIP
International Conference on Hardware/Software Codesign
and System Synthesis (CODES/ISSS 2004), September 2004. 7.
P.
Li, B. Ravindran, H. Cho, and E. D. Jensen, "Scheduling Distributable
Real-Time Threads in Tempus Middleware," Proc. IEEE Intl. Conference
on Parallel and Distributed Systems (ICPADS), July 2004. 8.
P.
Li, B. Ravindran, and E. D. Jensen, "Utility Accrual Scheduling of
Distributable Threads---The Tempus Approach," Proc. Workshop
on High Performance Embedded Computing (HPEC'04), MIT Lincoln
Laboratory, June 2004. 9.
P.
Li, B. Ravindran, and E. D. Jensen, "Adaptive Time-Critical Resource
Management Using Time/Utility Functions: Past, Present and Future," Proc.
IEEE International Computer Software and Applications Conference (COMSAC'04),
June 2004 (fast abstract). 10.
B.
Ravindran, E. D. Jensen, P. Li, H. Wu, and S. Feizabadi,
"The Case for TUFs and UA Scheduling in RT UML Profile: A Real-Time
Scheduling/Operating System Perspective," Workshop on the Usage of the UML profile for
Scheduling, Performance and Time (SIVOES-SPT 2004), April 2004. 11.
S.
Feizabadi, W. Beebee,
Jr., B. Ravindran, P. Li, and M. Rinard,
"Utility Accrual Scheduling With Real-Time Java," Proc. Workshop
on Java Technologies for Real-Time and Embedded Systems, Lecture Notes in
Computer Science, Springer-Verlag Heidelberg,
Volume 2889/2003. 12.
P.
Li, B. Ravindran, J. Wang, and G. Konowicz, "Choir:
A Middleware Architecture Supporting Benefit-Based Proactive Resource
Allocation," Proc. IEEE Intl. Symposium on Object-Oriented Real-Time
Distributed Computing (ISORC), March 2003, pp. 292-299. 13.
B.
Ravindran, G. Le Lann, J. Wang, and P. Li, "A
Systems Engineering Approach for Constructing Certifiable Real-Time
Distributed Systems," Proc. IEEE Intl. Symposium on Object-Oriented
Real-Time Distributed Computing (ISORC), March 2003, pp. 105-112. 14.
B.
Ravindran, G. Le Lann, and P. Li,
"Constructing High Assurance Asynchronous Real-Time Distributed Systems:
A Proof-Based System Engineering Approach," Proc. IEEE/IEICE Intl.
Symposium on High Assurance Systems Engineering (HASE), October 2002, pp.
89-90. 15.
P.
Li and B. Ravindran, "Efficiently Tolerating Failures in Asynchronous
Real-Time Distributed Systems," Proc. IEEE/IEICE International
Symposium on High Assurance Systems Engineering (HASE), October 2002, pp.
19-26. 16.
P.
Li and B. Ravindran, "Proactive QoS
Negotiation in Asynchronous Real-Time Distributed Systems," Proc.
ISCA Intl. Conference on Parallel and Distributed Computing Systems (PDCS),
September 2002, pp.237-242. 17.
P.
Li, B. Ravindran, and T. Hegazy,
"Implementation and Evaluation of A Best-Effort Scheduling Algorithm in
an Embedded Real-Time System", Proc. IEEE Intl. Symposium on
Performance Analysis of Systems and Software, November 2001, pp. 21-29. Patents and Inventions
Professional ServicesReviewer for IEEE
Transactions on Computers, IEEE Transactions on Parallel and
Distributed Systems (TPDS), ACM Transactions in Embedded Computing
Systems (TECS), IEEE International Conference on Computer
Communications and Networks (ICCCN’04), The Journal of Systems and
Software (JSS), Elsevier Science Inc., and Parallel and Distributed
Computing Practices (PDCP), NOVA Science Publishers |
|
Last Updated: December 2006 Copyright (c) Peng Li 2006. All rights
reserved. pli_cn AT yahoo com |