The Java virtual machine defines eight primitive data types: five integer types, two floating-point types, and one boolean type. The types are byte, short, int, long, float, double, char, and boolean. This chapter explores how these different primitive types are stored in memory and used in calculations. You’ll learn how one can be converted to another and what can go wrong in this conversion. You’ll also learn how to use the bit-level operators to reach down to the lowest level of the virtual machine and to change what you find there.
All data in Java (or any digital computer) must be represented as a particular sequence of bits in the computer’s memory. A bit is an abstract quantity that can have exactly two values. These two values are commonly called 0 and 1. However, as you’ll see shortly, these are not the same as the numbers zero and one.
At the very low level of electronic circuits, a transistor that is charged to a particular value — generally 5.0 or 3.3 volts relative to ground — is said to be on and to have the value “one.” A transistor that is uncharged — at the value of 0.0 volts relative to ground — is said to be off and have the value “zero.” However, when you consider matters at this low a level, the real world is analog, not digital. It is possible for transistors to have voltages of 2.5 volts, 1.2 volts, -3.4 volts, or just about any other value you can imagine. Most digital electronic circuits have some tolerance so that a transistor that’s on at 3.3 volts will still be on at 3.2 volts. Past that tolerance, however, the transistor is said to be three-stating. This is a problem for the electrical engineers that design integrated circuits, but it shouldn’t be a problem for a software engineer. If your computer starts three-stating when it isn’t supposed to, send it back to the shop to be replaced.
Modern computers, including the Java virtual machine, organize bits into groups of eight called bytes. A group of eight bits is also sometimes referred to as an octet. The single byte is normally the lowest level at which you can interact with a computer’s memory. You always work with at least eight bits at a time. Bits are like hot dog buns. You can’t go to a grocery store and buy one hot dog bun or 13 hot dog buns. Because hot dog buns come in packs of 8, you can get 8, 16, 24, or any other multiple of 8, but not any number of buns that isn’t a multiple of 8. There is no keyword or operator in Java that enables you to read from or write to one bit of memory at a time. You have to work with at least seven more bits adjacent to the bit you’re interested in at the same time, even if you aren’t doing anything to those bits.
Note: This wasn’t always the case. Some early computers used 12-bit words. However, these computers have long since become extinct.
Although you can buy as few as eight hot dog buns at a time, it’s sometimes cheaper to buy them by the case. The case size often depends on where you buy them. At the corner convenience mart, 32 hot dog buns probably cost you four times as much as eight hot dog buns. However, at Benny’s Super Discount Warehouse Store, buns may be cheaper by the gross. Similarly, different computers pack different numbers of bytes into a word. Computers based on the Intel 8088 chip use 8-bit, 1-byte words. Computers based on the 286 architecture, however, use 16-bit words and can therefore move data around at (very roughly) twice the speed of an 8088 computer at the same clock rate. Most modern CPUs use 32-bit words. The 32-bit processors include the 80386, 80486, Pentium, Pentium Pro, Sparc, PowerPC 601, PowerPC 603, and PowerPC 604 CPUs. Some 64-bit processors are just starting to appear, including Digital’s Alpha line, Sun’s UltraSparc chip, and the forthcoming HP/Intel Merced. All of these chips can still run old 8-bit or 16-bit software, but they run faster and more efficiently with software that moves data around in words that match the native size of the processor.
So which is Java? 8-bit? 16-bit? 32-bit? In fact, it’s really none of the above. Because Java uses only a virtual machine, it needs to be able to run on any and all of the mentioned architectures without being tied to a particular word size. In one sense, you can argue that the Java virtual machine is an 8-bit machine because each instruction is exactly one byte long. However, the native integer data type for Java is 32-bit, so in that respect, Java is a 32-bit computer. The interpreter or JIT will likely convert the Java instructions and data into whichever format is appropriate for the machine on which it’s running.
Variables, values, and identifiers are closely related to each other. In common use, the three words are used interchangeably. However, each word does have a slightly different meaning, and when you discuss computers at the CPU or virtual machine level, these differences become important.
Consider this Java statement:
int j = 2;
The letter “j” is an identifier. It identifies a variable in Java source code. The identifier, however, does not appear in the compiled byte code. It is a mnemonic device to make programmers’ lives easier. The number 2 is the value of the variable. To be more precise, the bit pattern 00000000000000000000000000000010 is the value of the variable. The four bytes of memory where this pattern is stored are the variable.
A variable is a particular group of bytes in the computer’s memory. The value of a variable is the bit pattern stored in those bytes that make up the variable. How the bit pattern is interpreted depends on the type of the variable. The rest of this chapter discusses the interpretation of the bit patterns that make up different primitive data types.
You can change the value of a variable by adjusting the bits that live in those bytes. This does not make it a new variable. Conversely, two different variables can have the same value.
An identifier is a name for a particular group of bytes in memory. Some programming languages allow a single variable to have more than one name. However, Java does not. In a Java program, an identifier always points to a particular area of memory. Once an identifier has been created, there is no way to change where it points.
Note: This may sound a little strange to experienced Java programmers. In particular, you may think that this is true for primitive data types like int but not for object types like String. In fact, this is true for all Java data types. You’ll have to wait till the next chapter to see why.
Place-Value Number Systems
The bits in memory aren’t just random voltages. They have meanings, and the meanings depend on the context. In one context, the bit sequence 0000000001000001 means the letter “A.” In another context, it means the number 65. Let’s explore how you get the number 65 out of the bits 0000000001000001.
When you write a number like 1406 in decimal notation, what you really mean is one thousand, four hundreds, no tens, and six ones. This may seem trivially obvious to you. After all, you’ve had this system drilled into you since early childhood. However, the place-value number system in which there are exactly ten digits and numbers larger than nine are represented by moving the digits further to the left is far from obvious. It took humanity most of its existence on this planet to develop this form of counting, and it didn’t become widespread, even in Eurasia, until well into the second millennium. It’s even less obvious that the digits on the left represent bigger numbers than the digits on the right. You could just as easily write the number as 6041 with the understanding that the first place is the ones place, the second place the tens place, the third place the hundreds, and so on.
Note: Classical Hebrew writes numbers from right to left. However, it doesn’t use a place-value system.
Binary notation
The number 0000000001000001 that you saw in the preceding section is written in a place-value system based on powers of two called binary notation. Each place is a power of two, not of ten, and there are only two digits, 0 and 1. Moving from right to left, therefore, we have one one, zero twos, zero fours, zero eights, zero sixteens, zero thirty-twos, and one sixty-four. Therefore, 0000000001000001 is equal to 64 + 1, or 65, in decimal notation.
There are extra zeroes on the left side because Java uses bits only in groups of eight at a time, although the individual bits do have meaning. Furthermore, as you’ll see below, characters like A are always 16 bits wide. You could use 01000001 to represent the value 65, but unlike 0000000001000001, it would not also mean the letter A.
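As a minimal sketch of how one bit pattern can be read either as a number or as a character, the following class sums place values by hand and then casts the result. The class name BitsInContext and the helper placeValue are hypothetical names chosen for illustration, not part of any library.

```java
class BitsInContext {
    // Sum the place values of a string of '0' and '1' characters,
    // starting from the rightmost (ones) place.
    static int placeValue(String bits) {
        int total = 0;
        int place = 1; // 1, 2, 4, 8, ...
        for (int k = bits.length() - 1; k >= 0; k--) {
            if (bits.charAt(k) == '1') {
                total += place;
            }
            place *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        int n = placeValue("0000000001000001");
        System.out.println(n);        // the pattern read as a number
        System.out.println((char) n); // the same pattern read as a character
    }
}
```

The cast to char does not change any bits; it only changes how Java interprets them.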
Java has several methods to convert between binary and decimal notation. The Integer and Long classes each have a static toBinaryString() method which converts ints and longs respectively to binary strings of ones and zeroes. For example, to print the int value 65 as a binary string, you could write
System.out.println(Integer.toBinaryString(65));
Longs are converted similarly:
System.out.println(Long.toBinaryString(5000000000L));
Secret: The Byte and Short classes do not have toBinaryString() methods, but bytes and shorts can be converted using the Integer.toBinaryString() method.
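One wrinkle is worth seeing in code: when a negative byte is passed to Integer.toBinaryString(), it is first widened to a 32-bit int, so the result has 32 digits rather than 8. The sketch below shows one common way around this, masking with 0xFF. The class ByteBits and its toBits helper are hypothetical, and the masking idiom is an assumption of this example rather than something the text prescribes.

```java
class ByteBits {
    // Render a byte as exactly eight binary digits.
    static String toBits(byte b) {
        // b & 0xFF keeps only the low eight bits after the byte is widened to int
        String s = Integer.toBinaryString(b & 0xFF);
        // Integer.toBinaryString drops leading zeroes, so pad back to eight digits
        StringBuilder padded = new StringBuilder();
        for (int k = s.length(); k < 8; k++) {
            padded.append('0');
        }
        return padded.append(s).toString();
    }

    public static void main(String[] args) {
        byte positive = 65;
        byte negative = -65;
        System.out.println(Integer.toBinaryString(positive)); // leading zeroes dropped
        System.out.println(Integer.toBinaryString(negative)); // 32 digits after widening
        System.out.println(toBits(positive));
        System.out.println(toBits(negative));
    }
}
```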
Given a binary string of ones and zeroes, the Byte, Short, Integer, and Long classes each have static valueOf() and parse methods that convert binary strings into integers of the specified width.
The Byte.parseByte(String s), Short.parseShort(String s), Integer.parseInt(String s), and Long.parseLong(String s) methods convert a string like “28” into a byte, short, int, or long value respectively. These methods presume that the string is written in base 10. However, you can change the base that’s used to make the conversion by passing an additional int containing the base to the method, like this:
int m = Integer.parseInt("1000001", 2);
To convert the binary string 0000000001000001 into byte, short, int, and long values of 65, you would write
byte b = Byte.parseByte("01000001", 2);
short s = Short.parseShort("0000000001000001", 2);
int i = Integer.parseInt("0000000001000001", 2);
long l = Long.parseLong("0000000001000001", 2);
If the string does not have the form appropriate for the base you specify in the second argument (for example, if you try to convert the string “97” in base 2), then a NumberFormatException will be thrown.
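The following sketch shows one way to guard against a NumberFormatException when the input may not be valid binary. The class ParseBinary and the method parseOrDefault are hypothetical names used only for this illustration; returning a fallback value is one possible design, not the only reasonable one.

```java
class ParseBinary {
    // Parse a base-2 string, returning the fallback if the string is not valid binary.
    static int parseOrDefault(String bits, int fallback) {
        try {
            return Integer.parseInt(bits, 2);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("1000001", -1));
        System.out.println(parseOrDefault("97", -1)); // '9' and '7' are not binary digits
        try {
            // Eight one bits mean 255 here, which does not fit in a signed byte
            Byte.parseByte("11111111", 2);
        } catch (NumberFormatException e) {
            System.out.println("out of range for byte");
        }
    }
}
```

Note that the parse methods read the string as an unsigned magnitude with an optional leading minus sign, so a string can consist entirely of valid digits and still be rejected because its value is out of range for the target type.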
The static valueOf() methods in the Byte, Short, Integer, and Long classes are very similar except that they return objects of the type-wrapper classes rather than primitive data types. For example:
Byte B = Byte.valueOf("01000001", 2);
Short S = Short.valueOf("0000000001000001", 2);
Integer I = Integer.valueOf("0000000001000001", 2);
Long L = Long.valueOf("0000000001000001", 2);
Java also allows the use of a base-eight notation with eight digits called octal notation. An octal digit can be represented in three bits. For example, 011 is octal 3. Table 2-2 lists all the octal digits and their equivalent binary patterns. Notice that this is the same as the first eight rows of Table 2-1 with the initial zero removed from each bit pattern.
| 3-bit binary pattern | Octal digit |
|---|---|
| 000 | 0 |
| 001 | 1 |
| 010 | 2 |
| 011 | 3 |
| 100 | 4 |
| 101 | 5 |
| 110 | 6 |
| 111 | 7 |
Although the words octal and base eight sound like they should be closely related to the eight bits in a byte, in reality they’re not. You cannot write a byte value as a certain number of octal digits because the three bits in an octal digit do not evenly divide the eight bits in a byte. Therefore, octal numbers aren’t nearly as useful in practice as hexadecimal numbers. Their presence in Java is a holdover from their presence in C. Octal numbers were included in C because they are quite useful on machines with 12-bit words. The three bits in an octal number divide evenly into 12 bits, and computers with 12-bit words were still being used when C was created.
To use an octal literal in Java code, just prefix it with a leading 0. For example, to set n to decimal 227, you could write
int n = 0343;
Note: I can think of no reason why you might want to do this. If you do this, please write and tell me why.
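A short sketch of the octal literal in action, along with the pitfall it creates: a leading zero that was meant only as padding silently changes the value. The class name OctalLiterals and its methods are hypothetical. Integer.decode() is a standard library method that honors the leading 0, unlike the one-argument Integer.parseInt().

```java
class OctalLiterals {
    static int fromLiteral() {
        return 0343; // octal literal: 3*64 + 4*8 + 3
    }

    static int fromString() {
        return Integer.parseInt("343", 8);
    }

    public static void main(String[] args) {
        System.out.println(fromLiteral());
        System.out.println(fromString());
        System.out.println(010);                     // octal: eight, not ten
        System.out.println(Integer.parseInt("010")); // parseInt assumes base 10
        System.out.println(Integer.decode("010"));   // decode treats the leading 0 as octal
    }
}
```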
Java has several methods to convert between decimal and octal notation. The Integer and Long classes each have a static toOctalString() method which converts ints and longs respectively to octal strings. For example, to print the int value 1024 as an octal string, you could write
System.out.println(Integer.toOctalString(1024));
Longs are converted similarly:
System.out.println(Long.toOctalString(5000000000L));
Note: The Byte and Short classes do not have toOctalString() methods, but bytes and shorts can be converted using the Integer.toOctalString() method.
You can convert an octal string to a numeric value using the parse and valueOf() methods described in the last section. Just pass 8 as the base argument instead. For example:
byte b = Byte.parseByte("30", 8);
short s = Short.parseShort("7002", 8);
int i = Integer.parseInt("30047132", 8);
long l = Long.parseLong("0107755260027112", 8);
Byte B = Byte.valueOf("30", 8);
Short S = Short.valueOf("7002", 8);
Integer I = Integer.valueOf("30047132", 8);
Long L = Long.valueOf("0107755260027112", 8);
An integer is a mathematical concept that describes a whole number. One, two, zero, 72, -1,324, and 768,542,188,963,243,888 are all examples of integers. There’s no limit to the size of an integer. An integer can be as large or as small as it needs to be, although it must always be a whole number like seven and never a fraction like seven and a half.
Java’s integer data types map pretty closely to the mathematical ideal, with the single exception that they’re all of finite magnitude. The four integer types — byte, short, int, and long — differ in the size of the numbers they can hold, but they all hold only a finite number of different integers. Most of the time this is enough.
In Java, an int is composed of four bytes of memory — that is, 32 bits. Written in binary notation, an integer looks like
01001101000000011100101010001101
In hexadecimal notation, this same number is
4D01CA8D
Each of the rightmost 31 places is a place value. The rightmost place is the one’s place, the second from the right is the two’s place, the third from the right is the four’s, the fourth from the right is the eight’s, and so on, up to the 31st place from the right, which is the 1,073,741,824’s place.
The largest possible int in Java has all bits set to one except the leftmost bit. In other words, it is 01111111111111111111111111111111, or, in decimal, 2,147,483,647.
You’re probably thinking that we could set the leftmost bit to one, and then have 11111111111111111111111111111111 as the largest number, but the leftmost bit in an int isn’t used for place value. It’s used to indicate the sign of the number and is called the sign bit. If the leftmost bit is one, then the int is a negative number. Therefore, 11111111111111111111111111111111 is not 4,294,967,295 but rather -1.
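The sketch below prints the two bit patterns just discussed. The class IntLimits and its bits32 helper are hypothetical names for this example. The last line shows one way to read the all-ones pattern as an unsigned quantity by widening it into a long first.

```java
class IntLimits {
    // Return the 32-digit binary form of an int, padded with leading zeroes.
    static String bits32(int n) {
        String s = Integer.toBinaryString(n);
        StringBuilder sb = new StringBuilder();
        for (int k = s.length(); k < 32; k++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(bits32(Integer.MAX_VALUE)); // sign bit clear, all others set
        System.out.println(bits32(-1));                // every bit set
        // Masking into a long reads the same 32 bits without a sign bit
        System.out.println(-1 & 0xFFFFFFFFL);
    }
}
```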
Java, like most modern computers, uses two’s complement binary numbers. In a two’s complement scheme, to reverse the sign of a number, you first take its complement (that is, convert all the ones to zeroes and all the zeroes to ones) and then add one. For example, to convert the byte value 01000001 (decimal 65) to -65, you would follow these steps:
65:             01000001
65 complement:  10111110
Add 1:         +00000001
-65:            10111111
Here I’ve worked with 8-bit numbers instead of the full 32-bit ints used by Java. The principle is the same regardless of the number of bits in the number.
To change a negative number into a positive number, do exactly the same thing. For example:
-65:            10111111
-65 complement: 01000000
Add 1:         +00000001
65:             01000001
One of the advantages of two’s complement numbers is that the procedure reverses itself. You don’t need separate circuits to convert a negative number to a positive one.
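The complement-and-add-one procedure can be sketched with Java’s bitwise complement operator ~. The class TwosComplement and its negate method are hypothetical names for illustration. In real code you would simply write -n, which produces the same bits.

```java
class TwosComplement {
    // Reverse the sign by complementing every bit and then adding one.
    static int negate(int n) {
        return ~n + 1;
    }

    public static void main(String[] args) {
        byte b = 65;
        byte negated = (byte) negate(b);
        System.out.println(negated);
        System.out.println(Integer.toBinaryString(negated & 0xFF)); // low eight bits only
        System.out.println(negate(negate(65))); // applying it twice restores the value
    }
}
```

One edge case: the most negative int has no positive counterpart, so negate(Integer.MIN_VALUE) returns Integer.MIN_VALUE again.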
Computer integers differ from the mathematical ideal in that they have maximum and minimum sizes. The largest positive integer has a zero bit on the left side and all remaining bits set to one, that is, 01111111111111111111111111111111, or 2,147,483,647 in decimal. If you try to add one to this number as shown here, the one carries all the way over into the leftmost digit. In other words, you get 10000000000000000000000000000000, which is the smallest negative int in Java, decimal -2,147,483,648.
01111111111111111111111111111111
+ 00000000000000000000000000000001
10000000000000000000000000000000
Further addition will make the negative number count back up to zero and then into the positive numbers. In other words, if you count high enough, eventually you wrap around to very small numbers. The next int after 2,147,483,647 isn’t 2,147,483,648. It’s -2,147,483,648. If you need to count higher than 2,147,483,647 or lower than -2,147,483,648, then you need to use a long or a floating-point number, as I discuss in the next sections. These numbers have maximums and minimums of their own; they’re just larger ones.
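A minimal sketch of the wraparound follows. The class Wraparound and its next method are hypothetical. Math.addExact() is a standard method, but it exists only in Java 8 and later, which is an assumption about your environment; it throws instead of wrapping.

```java
class Wraparound {
    // Plain int addition wraps silently at Integer.MAX_VALUE.
    static int next(int n) {
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(next(Integer.MAX_VALUE));
        System.out.println((long) Integer.MAX_VALUE + 1); // widen first to count higher
        try {
            Math.addExact(Integer.MAX_VALUE, 1); // Java 8+: reports overflow
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```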
So far we’ve worked with 32-bit ints. Java provides three other integer data types: byte, short, and long. These have different bit-widths, and they’re not as easy to use as literals in Java source code, but their analysis is exactly the same as that of ints.
Sidebar: One’s Complement. Some early computers used one’s complement arithmetic instead. In one’s complement, you invert all the bits to change the sign of a number, as you do in two’s complement, but you don’t add 1. Thus, since 65 is 01000001, -65 is 10111110. This seems simpler. However, you encounter a problem with zero. Zero itself is 00000000. Negative zero is 11111111. But negative zero is supposed to be the same as positive zero. Adding one to 11111111, as you do in two’s complement, flips all the bits back to 0 as the one carries across to the left and disappears. In two’s complement notation, therefore, 0 and -0 have the same bit pattern. This advantage has led to the triumph of two’s complement computers in the marketplace. One’s complement computers died off even before 12-bit word machines did.
A byte is eight bits wide. The largest byte is 01111111, or 127 in decimal. The smallest byte is 10000000, or -128 in decimal. Bytes are the lowest common denominator for data interchange between different computers, and Java uses them extensively in input and output. However, it does not use byte values in arithmetic calculations or as literals. The Java compiler won’t even let you write code like the following:
byte b3 = b1 + b2;
If you try this, where b1 and b2 are byte variables, you’ll get an error message that says Error: Incompatible type for =. Explicit cast needed to convert int to byte. This is because the Java compiler converts bytes to ints before doing the calculation. It does not add b1 and b2 as bytes, but rather as ints. The result it produces and tries to assign to b3 is also an int.
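The explicit cast that the compiler asks for can be sketched as follows. The class ByteArithmetic and its add method are hypothetical names. The second call shows what the cast costs you: if the int result does not fit in eight bits, the high bits are simply discarded.

```java
class ByteArithmetic {
    // The sum is computed as an int; the cast narrows it back to eight bits.
    static byte add(byte b1, byte b2) {
        return (byte) (b1 + b2);
    }

    public static void main(String[] args) {
        byte b1 = 20;
        byte b2 = 30;
        byte b3 = add(b1, b2);
        System.out.println(b3);
        // 100 + 100 = 200 as an int, which is outside the byte range
        System.out.println(add((byte) 100, (byte) 100));
    }
}
```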
Shorts are 16 bits wide. The largest short is 0111111111111111, or 32,767 in decimal. The smallest short is 1000000000000000, or -32,768 in decimal. There is no way to use a short as a literal or in arithmetic. As with bytes, if you write code like
short s3 = s1 + s2;
where s1 and s2 are short variables, you’ll get an error message that says: Error: Incompatible type for =. Explicit cast needed to convert int to short. The Java compiler converts all shorts to ints before doing the calculation. (An assignment such as short s3 = 454 + -732; is a special case. The right side is a constant expression that the compiler evaluates itself, and the assignment is accepted because the result fits in a short.) The only time shorts are actually used in Java is when you’re reading or writing data that is interchanged with programs written in other languages on platforms that use 16-bit integers. For example, some old 680X0 Macintosh C compilers use 16-bit integers as the native int format. Shorts are also used when very many of them need to be stored and space is at a premium (either in memory or on disk).
The final Java integer data type is the long. A long is 64 bits wide and can represent integers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807. Unlike shorts and bytes, longs are directly used in Java literals and arithmetic. To indicate that a number is a long, just suffix it with the letter L, for example, 2147483856L or -76L. Like other integers, longs can be written as hexadecimal and octal literals, for example, 0xCAFEBABEL or 0714L.
Note: You can use either a small l or a capital L to indicate a long literal. However, a capital L is strongly preferred because the lowercase l is easily confused with the numeral 1 in most typefaces.
Integers aren’t the only kind of number you need. Java also provides support for rational numbers — numbers with a decimal point like 106.53 or -78.0987. For reasons you’ll learn shortly, these are called floating-point numbers, and Java has two primitive data types for them: the float and the double.
Floating-point literals can be made quite large or quite small by writing them in exponential notation, for example, 1.0E89 or -0.7E-32. The first is 1.0 × 10^89, in other words 1 followed by 89 zeroes. The second is -0.7 × 10^-32, in other words a minus sign and a decimal point followed by 32 zeroes and then a 7.
A floating-point number can be split into three parts: the sign, the mantissa, and the exponent. The sign tells you whether the number is positive or negative. The mantissa tells you how precise the number is. Generally, the more digits a number has, the more precise it is. Finally the exponent tells you how large or small the number is. In the number -0.7E-32, the sign is -, the mantissa is 7, and the exponent is -32. In 1.0E89, the sign is +, the mantissa is 1, and the exponent is 89.
Although Java does not put any particular limits on the number of digits a float or double literal can have before the decimal point, it is customary to place exactly one non-zero digit before the decimal point and all the rest after it and adjust the exponent to compensate. Thus, instead of writing 15.4 × 10^89, you would write 1.54 × 10^90. This is called scientific notation. An alternative custom called exponential notation places the first non-zero digit immediately following the decimal point. In exponential notation, 15.4 × 10^89 becomes 0.154 × 10^91.
The advantage to such a custom is that you no longer actually have to write the decimal point. If you know that the decimal point is always going to be immediately after the first non-zero digit, as it is in scientific notation, then why bother writing it down? Of course, not writing it makes it harder for human beings to read and understand the number, so the decimal point is required in Java source code. Computers can do quite well without an explicit decimal point as long as the byte code sticks to a form of scientific notation.
Once we’ve agreed that floating-point numbers will always be written in scientific notation, the mantissa, exponent, and sign of a floating-point number can all be written as integers. Just like the sign bit in integer data types, 0 represents a positive number and 1 represents a negative number. For example, 15.4 has sign 0, mantissa 154, and exponent 1. The number -0.7 × 10^-32, which is -7 × 10^-33 in scientific notation, has sign 1, mantissa 7, and exponent -33.
To represent a floating-point number in a computer, you must convert each of these values into bits and binary notation. Converting a number with a decimal point into binary notation is only slightly harder than converting a number without a decimal point. When you write the number 10.5, you mean one ten, no ones, and five tenths. In binary notation you use a binary point rather than a decimal point (though they look exactly the same on the printed page.) Thus, a real number in binary notation looks like 1010.1. This means a number with one eight, no fours, one two, no ones, and one half. In other words, this is 8 + 2 + 0.5 = 10.5 in decimal notation.
Binary floating-point numbers in Java are written in normalized form. This means that the leftmost one is shifted to the immediate right of the binary point. An exponent is then added as a power of two. Thus 1010.1 becomes 0.10101 × 10^100 in binary (where binary 10^100 is 2^4 in decimal). The sign is positive, the mantissa is 10101, and the exponent is 100.
But wait! It gets better. When you’re using binary notation, the only non-zero digit is 1. The first non-zero digit after the binary point must be 1 because it can’t be anything else. Therefore, you don’t need to write it down either. You get an extra bit of precision, essentially for free. To store the mantissa 10101, you only need to write the bits 0101.
The next step is to determine how these numbers will be stuffed into bytes. Java allots four bytes for each float and eight bytes for each double. The first bit of each float is used for the sign bit. A 1 bit is negative and a 0 bit is positive, exactly as with integers.
The next eight bits are used for the exponent. These eight bits are treated as an unsigned integer between 0 and 255. The numbers 0 and 255 have special meanings that I discuss shortly. Otherwise, the exponent is biased by subtracting 127 from it. Therefore, float exponents have values between -126 (1 - 127) and +127 (254 - 127).
The final 23 bits are used for the mantissa. The mantissa is given as a fractional number between 1 and 2. As discussed earlier in this chapter, the first bit is assumed to be one, so the mantissa effectively has 24 bits of precision. Extra zeroes are appended if necessary. This doesn’t change the number, though, because 1.0101000000000000000000 is exactly the same as 1.0101. In other words, you can always add extra zeroes at the end of the mantissa to fill space. Figure 2-1 shows the bits in a float.
Figure 2-1 The bits in a float.
Note: The description that I’ve adopted here is the one used by the IEEE 754 specification. In this description, the mantissa is a normalized, binary, rational number; that is, its value is a fraction between 1 and 2. The Java Language Specification uses an alternate but equivalent description in which the mantissa is interpreted as an integer between 2^23 and 2^24 - 1. In this description, the bias used on the exponent is 150, that is, 127 + 23. A little thought should convince you that these descriptions are equivalent.
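To see the three fields in an actual float, you can pull them out of the raw bits with Float.floatToIntBits() and a few shifts and masks. This is a sketch using the IEEE 754 description, in which the mantissa lies between 1 and 2, so 10.5 is treated as 1.0101 × 2^3 in binary. The class FloatFields and its three accessor methods are hypothetical names.

```java
class FloatFields {
    // The leftmost bit is the sign.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // The next eight bits are the stored exponent, with the bias still included.
    static int exponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // The final 23 bits are the stored mantissa, without the implied leading one.
    static int mantissa(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float f = 10.5f;
        System.out.println(Integer.toHexString(Float.floatToIntBits(f)));
        System.out.println(sign(f));
        System.out.println(exponent(f) - 127); // subtract the bias
        System.out.println(Integer.toBinaryString(mantissa(f))); // leading zeroes dropped
    }
}
```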
Finite precision
It’s important to understand that not all floating-point numbers can be exactly represented in a finite number of bits. For example, whereas one half is exactly 0.1 (binary) or 0.5 (decimal), one third in binary is 0.0101010101 . . . where the pattern repeats indefinitely. One third also repeats in decimal notation where it’s 0.33333333 . . . . Whether or not a number repeats or terminates depends on the base of the number system. One fifth is exactly 0.2 in decimal, but is 0.0011001100110011 . . . in binary. Some numbers, most famously π, neither terminate nor repeat. Because computer arithmetic must truncate these infinite mantissas to just 24 bits, computer arithmetic on floats is often imprecise. The best Java can do with a number like π is approximate with an accuracy of 24 bits.
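One way to observe this loss of precision is to store a value in a float and then widen it to a double, which exposes the digits the float actually holds. This is only a sketch; the class FinitePrecision and the method exactAsFloat are hypothetical names.

```java
class FinitePrecision {
    // True when the value survives a round trip through float unchanged.
    static boolean exactAsFloat(double d) {
        return (double) (float) d == d;
    }

    public static void main(String[] args) {
        System.out.println(exactAsFloat(0.5)); // one half terminates in binary
        System.out.println(exactAsFloat(0.2)); // one fifth repeats in binary
        System.out.println((double) 0.2f);     // the value actually stored in the float
        System.out.println((float) Math.PI);   // pi cut down to 24 bits of mantissa
    }
}
```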
Doubles
If a float is not precise enough or large enough, you can use a double instead. A double has eight bytes, of which 1 bit is used for the sign, 11 bits for the exponent, and 53 bits for the mantissa. If you’re sharp, you’ll notice that this adds up to 65 bits. Don’t forget that the first bit of the mantissa is always 1, so you don’t need to store that bit. The exponent is biased by subtracting 1023.
Java’s floating-point numbers aren’t limited to the rational numbers you learned in high school. There are several special numbers that, while not true numbers in the traditional sense of the word, are produced by some calculations. If the stored, non-biased exponent is all ones (255 for a float), then the number takes on one of several special meanings.
Inf
Java has two special floating-point values to represent positive and negative infinity. There’s no literal for these infinities, but the public final static float values java.lang.Float.POSITIVE_INFINITY and java.lang.Float.NEGATIVE_INFINITY allow you to use them in source code.
More commonly you’ll bump across these values unexpectedly when a calculation goes in a direction that you didn’t anticipate. Positive infinity is produced when a positive float or a double is divided by zero. Dividing a negative float or double by zero gives negative infinity. For example:
double x = 1.0/0.0;
There’s little reason to deliberately create a float or double that’s infinite. However, it is a rather common thing to create one accidentally in more complicated programs where all possible divisors aren’t determined until runtime. The Inf value lets your programs continue without crashing or throwing an exception.
You can get the value Inf only in a floating-point calculation. If you try to divide an integer by integer zero, an ArithmeticException is thrown instead. For example:
int i = 1/0;
In a comparison test with <, <=, >, or >=, -Inf is smaller than any other number and Inf is larger than any other number. Each is equal only to itself.
The bit patterns for positive infinity and negative infinity are formed by the appropriate sign bit (1 for negative, 0 for positive), an unbiased exponent of 255 (11111111), and a mantissa of zero. Thus, positive infinity is 01111111100000000000000000000000, or in hexadecimal, 7F800000. Negative infinity is 11111111100000000000000000000000, or in hexadecimal, FF800000.
Double positive and negative infinity are formed in the same way. Choose the appropriate sign, fill the exponent with one bits, and set the mantissa to zero. Thus, positive double infinity is 7FF0000000000000 and negative double infinity is FFF0000000000000.
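The behavior and bit patterns described in this section can be checked with a short sketch. The class Infinities and its divide method are hypothetical names; the constants and conversion methods are the standard ones in java.lang.Float and java.lang.Double.

```java
class Infinities {
    static double divide(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println(divide(1.0, 0.0));
        System.out.println(divide(-1.0, 0.0));
        System.out.println(Double.isInfinite(divide(1.0, 0.0)));
        // Raw bit patterns, printed in hexadecimal
        System.out.println(Integer.toHexString(Float.floatToIntBits(Float.POSITIVE_INFINITY)));
        System.out.println(Long.toHexString(Double.doubleToLongBits(Double.NEGATIVE_INFINITY)));
        try {
            int zero = 0;
            System.out.println(1 / zero); // integer division by zero throws
        } catch (ArithmeticException e) {
            System.out.println("ArithmeticException: " + e.getMessage());
        }
    }
}
```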
NaN
NaN is an acronym for “Not a Number.” A floating-point calculation returns NaN if it divides zero by zero. For example:
double z = 0.0/0.0;
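You can also get NaN values in certain other undefined arithmetic operations, such as taking the square root of a negative number, multiplying zero by infinity, or subtracting infinity from itself. (Raising zero to the zeroth power is not one of them in Java: Math.pow(0.0, 0.0) returns 1.0.)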
There is no literal that lets you type NaN into Java source code, but you can get the same effect with the public, final, static float constant java.lang.Float.NaN.
More commonly, NaN will pop up unexpectedly. For example, the following code fragment divides 0.0 by 0.0 when x is equal to 5.0:
double y = 10.0;
for (double x = 0.0; x <= y; x += 1.0, y -= 1.0) {
    double z = x - 5.0;
    double result = (x - y) / z;
    System.out.println(x + " " + y + " " + z + " " + result);
}
NaN is unordered, so the result will always be false if you compare it to other numbers with <, <=, >, >=, or ==. The only comparison that can return true is !=, which always returns true if one or both of the operands is NaN. In other words, NaN is never equal to any number (including itself), never greater than any number, and never less than any number.
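Because == is always false for NaN, the usual way to test for it is Double.isNaN() or Float.isNaN(). The sketch below exercises the comparisons described above; the class NaNComparisons and the method zeroOverZero are hypothetical names.

```java
class NaNComparisons {
    static double zeroOverZero() {
        return 0.0 / 0.0;
    }

    public static void main(String[] args) {
        double nan = zeroOverZero();
        System.out.println(nan == nan);          // NaN is not equal even to itself
        System.out.println(nan != nan);
        System.out.println(nan < 1.0);
        System.out.println(nan > 1.0);
        System.out.println(Double.isNaN(nan));   // the reliable test
        System.out.println(Double.isNaN(Math.sqrt(-1.0)));
    }
}
```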
Although division by zero does not crash your program like it does in some programming languages, the unexpected appearance of NaNs or Infs in program output generally indicates a bug that needs to be stomped. Real world quantities shouldn’t be infinite or “Not a Number.” If you see NaNs or Infs, it may be an indication that a small factor you left out of your analysis, friction for example, is becoming important in a special case because everything else is canceling out.
NaN is represented by any float or double bit pattern in which the exponent is all ones and the mantissa is non-zero. (If the mantissa is zero, then the number is either positive or negative infinity.) The sign bit is ignored because NaN is not signed. Thus, all floats from 7F800001 to 7FFFFFFF and from FF800001 to FFFFFFFF correspond to NaN. All doubles from 7FF0000000000001 to 7FFFFFFFFFFFFFFF and from FFF0000000000001 to FFFFFFFFFFFFFFFF also correspond to NaN.
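You can build a float from a raw bit pattern with Float.intBitsToFloat() and confirm which patterns are NaN. This is a sketch; the class NaNBits and the method isNaNPattern are hypothetical names. Note that Float.floatToIntBits() goes the other way but collapses every NaN to a single canonical pattern, so it cannot be used to tell different NaN patterns apart.

```java
class NaNBits {
    // Build a float directly from a raw 32-bit pattern and report whether it is NaN.
    static boolean isNaNPattern(int bits) {
        return Float.isNaN(Float.intBitsToFloat(bits));
    }

    public static void main(String[] args) {
        System.out.println(isNaNPattern(0x7F800001));
        System.out.println(isNaNPattern(0xFFFFFFFF));
        System.out.println(isNaNPattern(0x7F800000)); // zero mantissa: infinity, not NaN
        // The canonical NaN pattern reported by floatToIntBits
        System.out.println(Integer.toHexString(Float.floatToIntBits(Float.NaN)));
    }
}
```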
Finite precision
It’s important to understand that not all floating-point numbers can be exactly represented in a finite number of bits. For example, whereas one half is exactly 0.1 (binary) or 0.5 (decimal), one third in binary is 0.0101010101 . . . where the pattern repeats indefinitely. One third also repeats in decimal notation where it’s 0.33333333 . . . . Whether or not a number repeats or terminates depends on the base of the number system. One fifth is exactly 0.2 in decimal, but is 0.0011001100110011 . . . in binary. Some numbers, most famously [Pi], neither terminate nor repeat. Because computer arithmetic must truncate these infinite mantissas to just 24 bits, computer arithmetic on floats is often imprecise. The best Java can do with a number like [Pi] is approximate with an accuracy of 24 bits.
Doubles
If a float is not precise enough or large enough, you can use a double instead. A double has eight bytes, of which 1 bit is used for the sign, 11 bits for the exponent, and 53 bits for the mantissa. If you’re sharp, you’ll notice that this adds up to 65 bits. Don’t forget that the first bit of the mantissa is always 1, so you don’t need to store that bit. The exponent is biased by subtracting 1023.
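The three fields of a double can be pulled apart with shifts and masks. This is a minimal sketch; the class and method names (DoubleFields, sign, storedExponent, storedMantissa) are made up for the example, and only Double.doubleToLongBits() comes from the standard library.

```java
class DoubleFields {
    /** The sign bit: 0 for positive, 1 for negative. */
    static int sign(double d) {
        return (int) (Double.doubleToLongBits(d) >>> 63);
    }

    /** The 11-bit exponent field exactly as stored, before 1023 is subtracted. */
    static int storedExponent(double d) {
        return (int) ((Double.doubleToLongBits(d) >>> 52) & 0x7FF);
    }

    /** The 52 mantissa bits that are actually stored (the leading 1 is implied). */
    static long storedMantissa(double d) {
        return Double.doubleToLongBits(d) & 0x000FFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        double[] samples = {1.0, -2.5};
        for (double d : samples) {
            System.out.println(d + ": sign=" + sign(d)
                + " exponent=" + (storedExponent(d) - 1023)
                + " mantissa=" + Long.toHexString(storedMantissa(d)));
        }
    }
}
```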
Java’s floating-point numbers aren’t limited to the rational numbers you learned in high school. There are several special values that, while not true numbers in the traditional sense of the word, are produced by some calculations. If the stored exponent field is all ones (255 for a float, 2047 for a double), then the value takes on one of several special meanings.
Inf
Java has two special floating-point values to represent positive and negative infinity. There’s no literal for these infinities, but the public final static float values java.lang.Float.POSITIVE_INFINITY and java.lang.Float.NEGATIVE_INFINITY allow you to use them in source code. The java.lang.Double class has double constants with the same names.
More commonly you’ll bump across these values unexpectedly when a calculation goes in a direction that you didn’t anticipate. Positive infinity is produced when a positive float or a double is divided by zero. Dividing a negative float or double by zero gives negative infinity. For example:
double x = 1.0/0.0;
There’s little reason to deliberately create a float or double that’s infinite. However, it is a rather common thing to create one accidentally in more complicated programs where all possible divisors aren’t determined until runtime. The Inf value lets your programs continue without crashing or throwing an exception.
You can get the value Inf only in a floating-point calculation. If you try to divide an integer by integer zero, an ArithmeticException is thrown instead. For example:
int i = 1/0;
In a comparison test with <, <=, >, or >=, -Inf is smaller than any other number and Inf is larger than any other number. Each is equal only to itself.
The bit patterns for float positive infinity and negative infinity are formed by the appropriate sign bit (1 for negative, 0 for positive), an exponent field of all ones (11111111, or 255), and a mantissa of zero. Thus, positive infinity is 01111111100000000000000000000000, or in hexadecimal, 7F800000. Negative infinity is 11111111100000000000000000000000, or in hexadecimal, FF800000.
Double positive and negative infinity are formed in the same way. Choose the appropriate sign, fill the exponent with one bits, and set the mantissa to zero. Thus, positive double infinity is 7FF0000000000000 and negative double infinity is FFF0000000000000.
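These bit patterns are easy to confirm. The following sketch uses two invented helpers, floatHex and doubleHex, which simply wrap the standard Float.floatToIntBits() and Double.doubleToLongBits() methods.

```java
class InfinityBits {
    /** The 32 bits of a float as lowercase hexadecimal. */
    static String floatHex(float f) {
        return Integer.toHexString(Float.floatToIntBits(f));
    }

    /** The 64 bits of a double as lowercase hexadecimal. */
    static String doubleHex(double d) {
        return Long.toHexString(Double.doubleToLongBits(d));
    }

    public static void main(String[] args) {
        System.out.println(floatHex(1.0f / 0.0f));    // 7f800000
        System.out.println(floatHex(-1.0f / 0.0f));   // ff800000
        System.out.println(doubleHex(1.0 / 0.0));     // 7ff0000000000000
        System.out.println(doubleHex(-1.0 / 0.0));    // fff0000000000000
        // Infinity compares larger than every finite value.
        System.out.println(Double.POSITIVE_INFINITY > Double.MAX_VALUE);  // true
    }
}
```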
NaN
NaN is an acronym for “Not a Number.” A floating-point calculation returns NaN if it divides zero by zero. For example:
double z = 0.0/0.0;
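You can also get NaN values in certain other undefined arithmetic operations, such as taking the square root of a negative number, subtracting infinity from infinity, or multiplying zero by infinity. (Raising zero to the zeroth power is not one of them: Math.pow(0.0, 0.0) returns 1.0.)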
There is no literal that lets you type NaN into Java source code, but you can get the same effect with the public, final, static float constant java.lang.Float.NaN.
More commonly, NaN will pop up unexpectedly. For example, the following code fragment divides 0.0 by 0.0 when x is equal to 5.0:
double y = 10.0;
for (double x = 0.0; x <= y; x += 1.0, y -= 1.0) {
    double z = x - 5.0;
    double result = (x - y)/z;
    System.out.println(x + " " + y + " " + z + " " + result);
}
Positive and negative zero
The smallest positive value that you can represent in Java is java.lang.Double.MIN_VALUE, 4.94065645841246544e-324. Numbers with absolute values smaller than this are set to zero. However, the sign of the result is retained, so a tiny negative result becomes negative zero. The normal 0.0 you type in source code is positive zero. You get negative zero when you multiply a negative number by zero. For example:
double x = -1.0 * 0.0;
In direct comparisons, negative zero and positive zero appear to be equal. However, some other operations will produce different results depending on whether positive zero or negative zero is used. For example, 1.0 divided by positive zero is positive infinity, but 1.0 divided by negative zero is negative infinity.
The zero literal you type into source code with 0.0 or 0.0F is always positive zero. You can get negative zero only if it shows up in a calculation.
Positive zero is, as you would expect, the float or double value whose bits are all zero. In other words, float positive zero is 00000000000000000000000000000000, or 00000000 in hexadecimal. Negative zero is the same, except that the sign bit is one. Thus, float negative zero is 10000000000000000000000000000000, or 80000000 in hexadecimal. Double positive zero is 0000000000000000 in hexadecimal, and double negative zero is 8000000000000000.
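A short experiment shows both the equality and the differences. This is a sketch only; SignedZero and isNegativeZero are names invented for the example.

```java
class SignedZero {
    /** True only for negative zero: it compares equal to 0.0 but has its sign bit set. */
    static boolean isNegativeZero(double d) {
        return d == 0.0 && Double.doubleToLongBits(d) == 0x8000000000000000L;
    }

    public static void main(String[] args) {
        double negZero = -1.0 * 0.0;
        System.out.println(negZero == 0.0);    // true
        System.out.println(1.0 / 0.0);         // Infinity
        System.out.println(1.0 / negZero);     // -Infinity
        System.out.println(Long.toHexString(Double.doubleToLongBits(negZero)));  // 8000000000000000
        System.out.println(isNegativeZero(negZero));  // true
    }
}
```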
Numbers whose stored exponent field is zero but whose mantissa is not zero are denormalized. Denormalized numbers do not have an implied first bit with value one. All of the bits that a denormalized number has are present in the mantissa. For a float, the mantissa is presumed to be multiplied by 2^-127 when its first stored bit is read as the ones place (equivalently, by 2^-126 when all 23 stored bits are read as a fraction). In other words, it acts like it has an exponent of -127 after the bias is subtracted, that is, a stored exponent field of zero. In fact, this is exactly what it does have, so the only real difference between normalized and denormalized floating-point numbers is the implied first bit.
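Decoding a denormalized float by hand takes only a line or two. The sketch below is an assumption-laden illustration: the class Denormals and its methods are invented, and it reads the 23 stored bits as an integer scaled by 2^-149, which is arithmetically the same as reading them as a fraction scaled by 2^-126.

```java
class Denormals {
    /** True if the stored exponent field is zero and the mantissa is not. */
    static boolean isDenormal(int bits) {
        return ((bits >>> 23) & 0xFF) == 0 && (bits & 0x007FFFFF) != 0;
    }

    /** Decodes a denormalized float bit pattern without any implied leading one bit. */
    static double decodeDenormal(int bits) {
        int mantissa = bits & 0x007FFFFF;                   // the 23 stored bits
        double magnitude = Math.scalb((double) mantissa, -149);  // mantissa * 2^-149, exact
        return (bits < 0) ? -magnitude : magnitude;         // sign bit set means negative
    }

    public static void main(String[] args) {
        int[] samples = {0x00000001, 0x007FFFFF, 0x80000001};
        for (int bits : samples) {
            System.out.println(isDenormal(bits) + " "
                + decodeDenormal(bits) + " "
                + (double) Float.intBitsToFloat(bits));
        }
    }
}
```

The second and third columns should agree, because Float.intBitsToFloat() performs the same decoding inside the virtual machine.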
Unlike Inf, NaN, and positive and negative zero, all of which can appear in one form or another in Java source code or output, denormalized numbers don’t look any different from regular floating point numbers. However, being able to recognize and decode them will become important when you learn how to disassemble Java byte code in Chapters 4 and 5.
The char data type in Java is considered to be a number, but it’s a funny one. Most obviously, when you try to print a char, you don’t get a number. Rather you get a character like “a” or “#”. Secondly, char literals don’t look like numbers in source code. You normally enter a char like this:
char c = 'r';
You can, however, use integer literals to assign values to char variables. The following statement does exactly the same thing as the previous one:
char c = 114;
You don’t often see Java source code that initializes chars with integer literals, because most programmers don’t walk around with the entire ASCII chart in their head. The meaning of the first statement is much more obvious than the meaning of the second, but they produce identical byte code.
Chars are two bytes wide; they take up the same space as a short. However, chars are not shorts. Shorts are signed and chars are unsigned. The first bit in a char is the 32,768 place, not a sign bit. Thus, while 1000000000000001 interpreted as a short is -32,767, 1000000000000001 interpreted as a char is 32,769. Chars range from 0 to 65,535.
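You can watch the same 16 bits take on two different values by casting. In this sketch, CharVersusShort, asShort, and asChar are hypothetical names chosen for the example.

```java
class CharVersusShort {
    /** Interprets the low 16 bits as a signed short. */
    static int asShort(int bits16) {
        return (short) bits16;
    }

    /** Interprets the low 16 bits as an unsigned char. */
    static int asChar(int bits16) {
        return (char) bits16;
    }

    public static void main(String[] args) {
        int bits = Integer.parseInt("1000000000000001", 2);
        System.out.println(asShort(bits));  // -32767
        System.out.println(asChar(bits));   // 32769
    }
}
```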
The Java compiler has to work a little magic to handle this. The line
char c = 114;
compiles without problem. So does the line
char d = 45000;
Both 114 and 45000 are within the range of a char. However, the following two lines produce compile-time error messages, telling you an explicit cast is needed to convert an int to a char:
char e = -123;
char f = 65536;
Java characters are understood to be part of the Unicode character set. The Unicode character set has, at the time of this writing, 38,885 characters, each two bytes wide. Unicode scripts include alphabets used in Europe, Africa, the Middle East, India, and many other parts of Asia, as well as the unified Han set of East Asian ideographs and the complete set of Korean Hangul syllables. Some scripts are not yet supported or are only partially supported, primarily because these scripts are not yet well understood.
Unsupported scripts include Braille, Cherokee, Cree, Ethiopic, Khmer (a.k.a. Cambodian), Maldivian (a.k.a. Dihevi), Mongolian, Moso (a.k.a. Naxi), Pahawh Hmong, Rong (a.k.a. Lepcha), Sinhalese, Tagalog, Tai Lu, Tai Mau, Tifinagh, Yi (a.k.a. Lolo), and Yoruba. Cherokee, Ethiopic, Braille, and possibly Khmer are likely to be added in the near future. Some of these languages can be written with other scripts that Unicode does support. For example, Mongolian is commonly written using the Cyrillic alphabet, and Hmong can be written in ASCII.
Furthermore, Unicode does not support many archaic alphabets, including Ahom, Akkadian Cuneiform, Aramaic, Babylonian Cuneiform, Balinese, Balti, Batak, Brahmi, Buginese, Chola, Cypro-Minoan, Egyptian hieroglyphics, Etruscan, Glagolitic, Hittite, Javanese (a particularly galling omission), Kaithi, Kawi, Khamti, Kharoshthi, Kirat (Limbu), Lahnda, Linear B, Mandaic, Mangyan, Manipuri (Meithei), Meroitic (Kush), Modi, Numidian, Ogham, Pahlavi (Avestan), Phags-pa, Pyu, Old Persian Cuneiform, Phoenician, Northern Runic, Satavahana, Siddham, South Arabian, Sumerian Cuneiform, Syriac, Tagbanuwa, Tircul, and Ugaritic Cuneiform. Runic and Ogham are likely to be added in the near future. Some of the rest of these languages, such as Linear B, are still areas of active research among linguists. Of the remainder, few (if any) are likely to be added to Unicode in the foreseeable future, even those that are fairly well understood.
Theoretically, Unicode can be expanded to cover up to 65,536 different characters. This is not quite enough to handle every character from all the world’s alphabets, primarily because of the large number of characters in the pictographic alphabets used for Chinese, Japanese, Korean, and historical Vietnamese. The Chinese alphabet alone has more than 80,000 different characters. However, by combining similar characters in these four alphabets so that some chars represent different words in different languages, all of the alphabets and the most commonly used pictographs can be squeezed into two bytes.
Unicode is based on two character sets that predate it: ASCII and ISO Latin-1. ASCII is a 7-bit character set with 128 different characters. ASCII was designed for communication in United States English. It therefore contains the lowercase letters a-z, the capital letters A-Z, the digits 0-9, various punctuation marks, and a number of non-printing control characters, many of which are closely related to the types of terminals and printers that were in use when ASCII was invented. The characters in ASCII are numbered from 0 to 127. Character 0 is the non-printing null character. Character 127 is the delete character. Characters 48 through 57 are the digits 0 through 9. Characters 65 through 90 are the capital letters A through Z. Characters 97 through 122 are the lowercase letters a through z. The remaining ASCII characters are various punctuation marks and non-printing characters. Table 2-3 is a complete list.
| Code | Character | Code | Character | Code | Character | Code | Character |
|---|---|---|---|---|---|---|---|
| 0 | null | 32 | space | 64 | @ | 96 | ` |
| 1 | soh | 33 | ! | 65 | A | 97 | a |
| 2 | stx | 34 | " | 66 | B | 98 | b |
| 3 | etx | 35 | # | 67 | C | 99 | c |
| 4 | eot | 36 | $ | 68 | D | 100 | d |
| 5 | enq | 37 | % | 69 | E | 101 | e |
| 6 | ack | 38 | & | 70 | F | 102 | f |
| 7 | bell | 39 | ' | 71 | G | 103 | g |
| 8 | backspace | 40 | ( | 72 | H | 104 | h |
| 9 | tab (\t) | 41 | ) | 73 | I | 105 | i |
| 10 | linefeed (\n) | 42 | * | 74 | J | 106 | j |
| 11 | vertical tab | 43 | + | 75 | K | 107 | k |
| 12 | formfeed (\f) | 44 | , | 76 | L | 108 | l |
| 13 | carriage return (\r) | 45 | - | 77 | M | 109 | m |
| 14 | so | 46 | . | 78 | N | 110 | n |
| 15 | si | 47 | / | 79 | O | 111 | o |
| 16 | dle | 48 | 0 | 80 | P | 112 | p |
| 17 | dc1 | 49 | 1 | 81 | Q | 113 | q |
| 18 | dc2 | 50 | 2 | 82 | R | 114 | r |
| 19 | dc3 | 51 | 3 | 83 | S | 115 | s |
| 20 | dc4 | 52 | 4 | 84 | T | 116 | t |
| 21 | nak | 53 | 5 | 85 | U | 117 | u |
| 22 | syn | 54 | 6 | 86 | V | 118 | v |
| 23 | etb | 55 | 7 | 87 | W | 119 | w |
| 24 | can | 56 | 8 | 88 | X | 120 | x |
| 25 | em | 57 | 9 | 89 | Y | 121 | y |
| 26 | sub | 58 | : | 90 | Z | 122 | z |
| 27 | escape | 59 | ; | 91 | [ | 123 | { |
| 28 | is4 | 60 | < | 92 | \ | 124 | \| |
| 29 | is3 | 61 | = | 93 | ] | 125 | } |
| 30 | is2 | 62 | > | 94 | ^ | 126 | ~ |
| 31 | is1 | 63 | ? | 95 | _ | 127 | delete |
As I said, ASCII is designed to handle U.S. English. It can do a reasonable approximation of other dialects of English, but it begins to have problems with many other European languages, like French and German. There are no cedillas, umlauts, or any of the other characters not used in English, but present in these languages.
The first bit of each ASCII character is 0. You can define another 128 characters by using the bytes whose first bit is one. Indeed, this is the scheme used in most modern computers. The characters with numeric values between 128 and 255 are used to encode the additional characters needed by most languages that are written in some approximation of the Latin alphabet. There are at least two common ways ASCII is extended into the upper 128 characters. The one around which Unicode and Java are built is the ISO 8859-1 Latin-1 character set, often just referred to as ISO Latin-1. Table 2-4 lists the upper 128 characters of the ISO Latin-1 character set. The lower 128 characters are exactly the same as they are for ASCII.
| Code | Character | Code | Character | Code | Character | Code | Character |
|---|---|---|---|---|---|---|---|
| 128 | | 160 | non-breaking space | 192 | À | 224 | à |
| 129 | | 161 | ¡ | 193 | Á | 225 | á |
| 130 | bph | 162 | ¢ | 194 | Â | 226 | â |
| 131 | nbh | 163 | £ | 195 | Ã | 227 | ã |
| 132 | | 164 | ¤ | 196 | Ä | 228 | ä |
| 133 | nel | 165 | ¥ | 197 | Å | 229 | å |
| 134 | ssa | 166 | ¦ | 198 | Æ | 230 | æ |
| 135 | esa | 167 | § | 199 | Ç | 231 | ç |
| 136 | hts | 168 | ¨ | 200 | È | 232 | è |
| 137 | htj | 169 | © | 201 | É | 233 | é |
| 138 | vts | 170 | ª | 202 | Ê | 234 | ê |
| 139 | pld | 171 | « | 203 | Ë | 235 | ë |
| 140 | plu | 172 | ¬ | 204 | Ì | 236 | ì |
| 141 | ri | 173 | shy | 205 | Í | 237 | í |
| 142 | ss2 | 174 | ® | 206 | Î | 238 | î |
| 143 | ss3 | 175 | ¯ | 207 | Ï | 239 | ï |
| 144 | dcs | 176 | ° | 208 | Ð | 240 | ð |
| 145 | pu1 | 177 | ± | 209 | Ñ | 241 | ñ |
| 146 | pu2 | 178 | ² | 210 | Ò | 242 | ò |
| 147 | sts | 179 | ³ | 211 | Ó | 243 | ó |
| 148 | cch | 180 | ´ | 212 | Ô | 244 | ô |
| 149 | mw | 181 | µ | 213 | Õ | 245 | õ |
| 150 | spa | 182 | ¶ | 214 | Ö | 246 | ö |
| 151 | epa | 183 | · | 215 | × | 247 | ÷ |
| 152 | sos | 184 | ¸ | 216 | Ø | 248 | ø |
| 153 | | 185 | ¹ | 217 | Ù | 249 | ù |
| 154 | sci | 186 | º | 218 | Ú | 250 | ú |
| 155 | csi | 187 | » | 219 | Û | 251 | û |
| 156 | st | 188 | ¼ | 220 | Ü | 252 | ü |
| 157 | osc | 189 | ½ | 221 | Ý | 253 | ý |
| 158 | pm | 190 | ¾ | 222 | Þ (capital thorn) | 254 | þ (little thorn) |
| 159 | apc | 191 | ¿ | 223 | ß | 255 | ÿ |
Programs that don’t support ISO Latin-1 characters often operate by ignoring the most significant bit of each character; that is, they presume that each byte begins with a zero bit. For example, the u with an umlaut (ü), ISO Latin-1 character 252, would be reduced to ASCII character 252-128, which is character 124, the vertical bar, |. This can be a reasonable approximation if most of the text is ASCII.
Just as ISO Latin-1 extends ASCII by adding an extra high-order bit, so too does Unicode extend ISO Latin-1 by adding an extra high-order byte. If the high-order byte is zero (00000000), then the Unicode character is identical to the ISO Latin-1 character in the low-order byte. You can do an approximate conversion from Unicode to ISO Latin-1 by chopping off all the high-order bytes. This works as long as all the text is composed only of ISO Latin-1 characters. Most of the time, especially when you’re working in English, this is a reasonable assumption. Many of Java’s classes that output text make this assumption, most notably PrintStream, which includes System.out.
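Both kinds of truncation come down to a single mask or cast. The sketch below is illustrative only; the class HighBitStripping and its two methods are invented names, not part of any Java library.

```java
class HighBitStripping {
    /** Drops the high-order bit of a Latin-1 character, as a 7-bit-only program would. */
    static char stripToAscii(char latin1) {
        return (char) (latin1 & 0x7F);
    }

    /** Keeps only the low-order byte of a 16-bit Unicode char. */
    static byte toLatin1Byte(char unicode) {
        return (byte) unicode;
    }

    public static void main(String[] args) {
        char uUmlaut = (char) 252;
        System.out.println((int) stripToAscii(uUmlaut));   // 124
        System.out.println(stripToAscii(uUmlaut));         // |
        // A character in the Latin-1 range survives the chop unchanged.
        System.out.println(toLatin1Byte(uUmlaut) & 0xFF);  // 252
        // The Greek letter pi (hex 03C0) does not: only C0 hex, 192, is left.
        System.out.println(toLatin1Byte((char) 0x03C0) & 0xFF);  // 192
    }
}
```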
Note: I’d love to show you a table of all the extra characters in Unicode, but it would be so lengthy that this book would be mostly that table and not much else. If you need to know more about the specific encodings of the different characters in Unicode, you should check out The Unicode Standard, Second edition, ISBN 0-201-48345-9, from Addison-Wesley. This 950-page book includes the complete Unicode 2.0 specification. Errata for this volume are on the Web at http://www.unicode.org/.
Mac Roman

Remember that I said there were two ways to encode these extra characters in the upper 128 bytes? The Macintosh uses a completely different character-encoding scheme called Mac Roman. It has most of the same glyphs as the ISO Latin-1 character set, but different glyphs are mapped to different numbers. If Java programs try to print the upper 128 characters on a Macintosh, they come out in the Mac Roman character set, not the ISO Latin-1 character set like they are supposed to. This is a royal pain for more than just Java programs because it makes file translation between platforms excessively difficult. In fact, Java 1.1 provides one of the few class libraries that can translate between the Mac Roman and ISO Latin-1 character sets. This is especially painful to authors trying to write about ISO Latin-1 on a Macintosh.

When the Macintosh was created in the early 1980s, it was one of the very few computers that could handle non-ASCII text. ISO Latin-1 was not yet established. Therefore, Apple had to invent their own scheme for encoding the extra characters. Regrettably, backward-compatibility means that Macs will never get in sync with the rest of the world. That’s one of the disadvantages of pioneering new technology.

To make matters worse, it’s happening again. Apple developed their 2-byte WorldScript technology before Unicode was ready. Everyone who came after Apple standardized on Unicode. This means that we’re probably stuck with ASCII as the lowest common denominator for text data for the foreseeable future.
Because very few text editors are available that allow you to write in Unicode, Java source code files are written in ISO Latin-1. Furthermore, the Java compiler expects to see source code written in ISO Latin-1. If you actually have a text editor that works in Unicode and try to write Java files with it, the compiler will get hopelessly confused when it tries to compile your files.
In fact, Java can be written perfectly well with only ASCII. All Java keywords, operators, and literals, as well as all method, class, and field names in the java packages, can be written in pure ASCII. Because ISO Latin-1 makes your source code difficult to move between Macs and other platforms, you should probably restrict yourself to ASCII in your programs.
You can use Unicode characters in Java string and char literals as well as in identifiers. To embed a non-ASCII character in a string, prefix the four-digit hexadecimal number for the character with \u. For example, the division sign is Unicode character 247, which is F7 in hexadecimal. Therefore, you can make it part of the string by writing \u00F7. The Greek letter π is Unicode character 960, or 03C0 in hexadecimal, so it is written \u03C0. Thus, the following line declares a variable named π:
double \u03C0 = 3.141592;
All Unicode characters can be encoded in this fashion, even those you could type literally. For example, the small letter t can also be written as \u0074. The backslash itself can be written as \u005C. Writing code this way is a very bad idea unless you’re deliberately trying to make it obscure.
When a Java compiler reads Java source code, it first converts all such \u escapes to the actual characters, taking into account double backslash escapes as well. This pre-processing happens before anything else. For example, consider this statement:
System.out.println("This is not a \\u0074");
The double backslash is interpreted as a literal backslash, not as the start of an escape sequence. Thus you get “This is not a \u0074” instead of “This is not a \t.” To get the second effect, you would have to write
System.out.println("This is not a \\\u0074");
or better yet, just
System.out.println("This is not a \\t");
Unicode escape translation is not cumulative. “\u005Cu0074” is translated to the six characters “\u0074” rather than the single character “t.”
As if Unicode input to Java weren’t complex enough, Unicode output is equally troublesome. You already know that PrintStreams like System.out just chop off the high byte of a Unicode character. Although it varies from platform to platform, different output classes in the java package either chop off the high byte like PrintStream or output \u escapes.
To summarize what you have learned so far, characters in Java source code are 8-bit ISO Latin-1 characters. Internally, Java translates these characters and any embedded \u escapes into 16-bit Unicode characters.
Using 16-bit characters is relatively inefficient, however, when almost all the text you’re working with is likely to be regular 7-bit ASCII. Therefore, Java byte code embeds string literals in an intermediate format called “Universal Character Set Transformation Format 8-bit form.” Since that’s way more than a mouthful, this is almost always written as the acronym UTF8.
UTF8 encodes the most common characters (the ASCII character set) in a single byte for each character. However, less-common characters use two bytes, including the upper 128 ISO Latin-1 characters (which normally only take one byte apiece). The least common characters of all, the 63,488 Unicode characters from \u0800 to \uFFFF, are encoded in three bytes.
The details are as follows. Characters between 1 and 127 (\u0001 to \u007F), that is, ASCII characters except null, are encoded as their low-order byte. The high byte (which is just zeroes anyway) is discarded. If the Unicode character is between 128 and 2,047 (\u0080 to \u07FF), that is, if its top five bits are zero, then it has 11 bits of data. These 11 bits are encoded as a pair of bytes like this:
1 1 0 x x x x x 1 0 x x x x x x
bits 6-10 bits 0-5
The null character is also encoded in two bytes as 1100000010000000.
Characters in the range \u0800 to \uFFFF have a full 16 bits of data. These are encoded in three bytes, like this:
1 1 1 0 x x x x 1 0 x x x x x x 1 0 x x x x x x
bits 12-15 bits 6-11 bits 0-5
Note: This is not exactly the official UTF8 encoding. Java differs from the formal standard in that it uses two bytes to encode the null character (\u0000) rather than one. Furthermore, the real UTF8 standard has several more formats to handle four byte characters as well. By using a 4-byte character set, it’s no longer necessary to unify the Chinese, Japanese, and Vietnamese scripts.
This encoding scheme is designed to be easy and quick to parse. Any byte that begins with a 0 bit is a 1-byte ASCII character. Any byte that begins with 110 starts a 2-byte character. Any byte that starts with 1110 is a 3-byte character. Finally, any byte that starts with 10 is the second or third byte of a multi-byte character.
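The three byte layouts translate almost directly into shifts and masks. The following is a minimal sketch of an encoder for a single char, assuming the Java variant in which the null character takes two bytes; the class JavaUtf8 and its methods encode and toHex are invented for this illustration and are not the virtual machine’s own code.

```java
class JavaUtf8 {
    /** Encodes one 16-bit char using the 1-, 2-, or 3-byte form. */
    static byte[] encode(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            // ASCII except null: just the low-order byte.
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // Null and 0x0080 to 0x07FF: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // 0x0800 to 0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    /** Formats bytes as uppercase hexadecimal, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] samples = { 'A', (char) 0, (char) 0x00FC, (char) 0x03C0, (char) 0x20AC };
        for (char c : samples) {
            System.out.println(Integer.toHexString(c) + " -> " + toHex(encode(c)));
        }
    }
}
```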
The more ASCII characters in a text string, the more space that can be saved by UTF8. Pure ASCII text is only half as large in UTF8 as it is in true Unicode. In the worst case, where all characters occupy three bytes, a UTF8 string is only 50 percent larger than the equivalent Unicode string. However, the worst case is rarely seen in practice.
The DataInputStream and DataOutputStream classes have readUTF() and writeUTF() methods, respectively, to handle UTF8 data. readUTF() first reads two bytes from the underlying stream. These are interpreted as an unsigned short specifying the number of bytes to read from the stream (not the number of characters to read from the stream). These bytes are then read and translated from UTF8 into Unicode, and a String containing the translated data is returned. We use this method in Chapter 4 to read the UTF8 strings stored in the constant pool of a byte code file.
The DataOutputStream writeUTF(String s) method writes a Unicode string onto the underlying output stream after translating the string to UTF8 format. The string is preceded by an unsigned short that gives the number of bytes that will be written.
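A round trip through an in-memory buffer shows the length prefix at work. This is a sketch under simple assumptions: the class UtfRoundTrip and its write and read helpers are invented for the example, and they wrap IOException in an unchecked exception only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfRoundTrip {
    /** Writes the string with writeUTF() and returns every byte that was produced. */
    static byte[] write(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one string back with readUTF(). */
    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "pi: " + (char) 0x03C0;   // four ASCII chars plus one 2-byte char
        byte[] data = write(s);
        int prefix = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        System.out.println("chars: " + s.length());        // 5
        System.out.println("bytes in prefix: " + prefix);  // 6
        System.out.println("round trip ok: " + read(data).equals(s));  // true
    }
}
```

Note that the prefix counts encoded bytes, so it is larger than the number of characters whenever the string contains anything outside ASCII.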
Because very few text editors are available that allow you to write in Unicode, Java source code files are written in ISO Latin-1. Furthermore, the Java compiler expects to see source code written in ISO Latin-1. If you actually have a text editor that works in Unicode and try to write Java files with it, the compiler will get hopelessly confused when it tries to compile your files.
In fact, Java can be written perfectly well with only ASCII. All Java keywords, operators, and literals, as well as all method, class, and field names in the java packages, can be written in pure ASCII. Because ISO Latin-1 makes your source code difficult to move between Macs and other platforms, you should probably restrict yourself to ASCII in your programs.
You can use Unicode characters in Java string and char literals as well as in identifiers. To embed a non-ASCII character in a string, prefix the hexadecimal number for the character with \u. For example, the division sign is Unicode character 247. Therefore, you can make it part of the string by writing \u00F7. The Greek letter [pi] is Unicode character 12,480 or hexadecimal \u03C0. Thus,
double \u03C0 = 3.141592;
All Unicode characters can be encoded in this fashion, even those you could type literally. For example, the small letter t can also be written as \u0074. The backslash itself can be written as \u005C. Writing code this way is a very bad idea unless you’re deliberately trying to make it obscure.
When a Java compiler reads Java source code, it first converts all such \u escapes to the actual characters, taking into account double backslash escapes as well. This pre-processing happens before anything else. For example, consider this statement:
System.out.println("This is not a \\u0074");
The double backslash is interpreted as a literal backslash, not as the start of an escape sequence. Thus you get “This is not a \u0074” instead of “This is not a \t.” To get the second effect, you would have to write
System.out.println("This is not a \\\u0074");
or better yet, just
System.out.println("This is not a \\t");
Unicode escape translation is not cumulative. “\u005Cu0074” is translated to the six characters “\u0074” rather than the single character “t.”
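The three println() examples above can be checked with a small program. This is only a sketch; the class name EscapeDemo and its field names are made up for illustration. The comments deliberately avoid spelling out any escape sequences, because the compiler translates Unicode escapes inside comments too.

```java
final class EscapeDemo {
    // Two backslashes, then u0074: no Unicode escape is recognized here,
    // so the string ends with one backslash followed by the text u0074.
    static final String LITERAL_ESCAPE = "This is not a \\u0074";

    // Three backslashes, then u0074: the third backslash starts a Unicode
    // escape for the letter t, leaving an escaped backslash followed by t.
    static final String BACKSLASH_T = "This is not a \\\u0074";

    // The readable way to get a backslash followed by the letter t.
    static final String PLAIN = "This is not a \\t";

    public static void main(String[] args) {
        System.out.println(LITERAL_ESCAPE);
        System.out.println(BACKSLASH_T);
        System.out.println(BACKSLASH_T.equals(PLAIN));
    }
}
```

The second and third strings compare equal, which is a good argument for always writing the third form.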
As if Unicode input to Java weren’t complex enough, Unicode output is equally troublesome. You already know that PrintStreams like System.out just chop off the high byte of a Unicode character. Although it varies from platform to platform, different output classes in the java package either chop off the high byte like PrintStream or output \u escapes.
To summarize what you have learned so far, characters in Java source code are 8-bit ISO Latin-1 characters. Internally, Java translates these characters and any embedded \u escapes into 16-bit Unicode characters.
Using 16-bit characters is relatively inefficient, however, when almost all the text you’re working with is likely to be regular 7-bit ASCII. Therefore, Java byte code embeds string literals in an intermediate format called “Universal Character Set Transformation Format 8-bit form.” Since that’s way more than a mouthful, this is almost always written as the acronym UTF8.
UTF8 encodes the most common characters (the ASCII character set) in a single byte for each character. However, less-common characters use two bytes, including the upper 128 ISO Latin-1 characters (which normally only take one byte apiece). The least common characters of all, the 63,488 characters from \u0800 to \uFFFF, are encoded in three bytes.
The details are as follows. Characters between 1 and 127 (\u0001 to \u007F), that is, ASCII characters except null, are encoded as their low-order byte. The high byte (which is just zeroes anyway) is discarded. If the Unicode character is between 128 and 2,047 (\u0080 to \u07FF), that is, if its top five bits are zero, then it has 11 bits of data. These 11 bits are encoded as a pair of bytes like this:
1 1 0 x x x x x    1 0 x x x x x x
bits 6-10          bits 0-5
The null character is also encoded in two bytes as 1100000010000000.
Characters in the range \u0800 to \uFFFF have a full 16 bits of data. These are encoded in three bytes, like this:
1 1 1 0 x x x x    1 0 x x x x x x    1 0 x x x x x x
bits 12-15         bits 6-11          bits 0-5
Note: This is not exactly the official UTF8 encoding. Java differs from the formal standard in that it uses two bytes to encode the null character (\u0000) rather than one. Furthermore, the real UTF8 standard has several more formats to handle four byte characters as well. By using a 4-byte character set, it’s no longer necessary to unify the Chinese, Japanese, and Vietnamese scripts.
This encoding scheme is designed to be easy and quick to parse. Any byte that begins with a 0 bit is a 1-byte ASCII character. Any byte that begins with 110 starts a 2-byte character. Any byte that starts with 1110 is a 3-byte character. Finally, any byte that starts with 10 is the second or third byte of a multi-byte character.
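The encoding rules can be sketched in a few lines of Java. This is a minimal illustration of the three byte patterns described above, not the code the virtual machine itself uses; the class name ModifiedUtf8Sketch and its methods are hypothetical.

```java
final class ModifiedUtf8Sketch {
    // Encodes one 16-bit char using the 1-, 2-, or 3-byte form.
    static byte[] encode(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            // ASCII except null: just the low-order byte.
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // Null and U+0080 through U+07FF: 110xxxxx 10xxxxxx.
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // U+0800 through U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx.
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Formats the encoded bytes as uppercase hexadecimal digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode('t')));            // one byte
        System.out.println(toHex(encode((char) 0x00F7)));  // division sign, two bytes
        System.out.println(toHex(encode((char) 0x0000)));  // null, two bytes
        System.out.println(toHex(encode((char) 0x20AC)));  // three bytes
    }
}
```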
The more ASCII characters in a text string, the more space that can be saved by UTF8. Pure ASCII text is only half as large in UTF8 as it is in true Unicode. In the worst case, where all characters occupy three bytes, a UTF8 string is only 50 percent larger than the equivalent Unicode string. However, the worst case is rarely seen in practice.
The DataInputStream and DataOutputStream classes have writeUTF() and readUTF() methods to handle UTF8 data. readUTF() first reads two bytes from the underlying stream. These are interpreted as an unsigned short specifying the number of bytes to read from the stream (not the number of characters to read from the stream). These bytes are then read and translated from UTF8 into Unicode, and a String containing the translated data is returned. We use this method in Chapter 4 to read the UTF8 strings stored in the constant pool of a byte code file.
The DataOutputStream writeUTF(String s) method writes a Unicode string onto the underlying output stream after translating the string to UTF8 format. The string is preceded by an unsigned short that gives the number of bytes that will be written.
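Here is a small round trip through writeUTF() and readUTF() using in-memory streams, so you can see the two-byte length prefix for yourself. The class and method names are invented for this sketch, and the helper methods wrap IOException in an unchecked exception only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class UtfRoundTrip {
    // Writes the string with writeUTF() and returns the raw bytes.
    static byte[] write(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a string back with readUTF().
    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Interprets the first two bytes as the unsigned short length prefix.
    static int lengthPrefix(byte[] data) {
        return ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
    }

    public static void main(String[] args) {
        String s = "6 " + (char) 0x00F7 + " 2";   // five characters, one of them non-ASCII
        byte[] data = write(s);
        System.out.println(s.length() + " characters");
        System.out.println(lengthPrefix(data) + " encoded bytes after the prefix");
        System.out.println(read(data).equals(s));
    }
}
```

The prefix counts bytes, not characters: the five-character string takes six bytes because the division sign needs two.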
The final primitive data type is the only one that cannot be interpreted as a number. This is the boolean. A boolean has two possible values: true and false. In Java source code, these are boolean literals. They are not the same as 1 and 0. They are not the same as the strings "true" and "false." They are simply true and false. That’s all.
At the level of the virtual machine, things are a little different. The virtual machine does not have instructions that operate on boolean data. Instead, expressions that involve booleans are compiled using integer instructions. The integer constant 1 is used to represent true, and the integer constant 0 is used to represent false. Don’t try to take advantage of this when writing Java source code, though. It won’t work.
However, for the purposes of efficiency, Java does allow arrays of booleans to be stored more compactly than arrays of ints. Sun’s virtual machines make arrays of booleans out of arrays of bytes. In these arrays, true is 01 and false is 00. Other implementations are free to use even more compact representations for boolean arrays, perhaps as little as one bit per value.
The preceding section described how primitive data types are represented in Java. This matches fairly closely how numbers are represented on Sparc-Solaris systems. This shouldn’t be surprising, given that Java was created by Sun Microsystems programmers who were accustomed to Sparc-Solaris systems.
However, not all systems represent data in the same way. Most annoyingly, roughly half of computer architectures are Little-Endian rather than Big-Endian. (Little-Endian and Big-Endian architectures are discussed shortly). Furthermore, some programming languages allow the use of unsigned numeric quantities. And although Java’s native integer format is 32 bits, many other systems prefer 16-bit or 64-bit ints. Although Java is supposed to be above such concerns, when you have to deal with legacy data from programs written in other languages, you need to be aware of these differences.
Which two mighty powers have, as I was going to tell you, been engaged in a most obstinate war for six and thirty moons past. It began upon the following occasion. It is allowed on all hands, that the primitive way of breaking eggs, before we eat them, was upon the larger end: but his present Majesty’s grandfather, while he was a boy, going to eat an egg, and breaking it according to the ancient practice, happened to cut one of his fingers. Whereupon the Emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs. The people so highly resented this law, that our histories tell us there have been six rebellions raised on that account; wherein one Emperor lost his life, and another his crown. These civil commotions were constantly fomented by the monarchs of Blefuscu; and when they were quelled, the exiles always fled for refuge to that empire. It is computed that eleven thousand persons have, at several times, suffered death, rather than submit to break their eggs at the smaller end. Many hundred large volumes have been published upon this controversy: but the books of the Big-Endians have been long forbidden, and the whole party rendered incapable by law of holding employment. During the course of these troubles, the Emperors of Blefuscu did frequently expostulate by their ambassadors, accusing us of making a schism in religion, by offending against a fundamental doctrine of our great prophet Lustrog, in the fifty-fourth chapter of the Blundecral (which is their Alcoran). This, however, is thought to be a mere strain upon the text: for the words are these: That all true believers shall break their eggs at the convenient end: and which is the convenient end, seems, in my humble opinion, to be left to every man’s conscience, or at least in the power of the chief magistrate to determine. 
Now the Big-Endian exiles have found so much credit in the Emperor of Blefuscu’s court, and so much private assistance and encouragement from their party here at home, that a bloody war has been carried on between the two empires for six and thirty moons with various success; during which time we have lost forty capital ships, and a much greater number of smaller vessels, together with thirty thousand of our best seamen and soldiers; and the damage received by the enemy is reckoned to be somewhat greater than ours.
Jonathan Swift, Gulliver’s Travels, Chapter IV
I made an implicit assumption in the preceding section: that the leftmost byte of a multi-byte number is the most significant one. Of course, spatial concepts like left and right really don’t apply to computer memories. In this context, left means lower in memory, and right means higher. Of course, lower and higher are also spatial terms. By lower, I mean “has a smaller address,” and by higher, I mean “has a bigger address.” Thus, if the bytes in a computer memory with n bytes of memory are organized from byte 0 to byte n-1, then byte 0 is the lowest, or leftmost, byte and byte n-1 is the highest, or rightmost, byte.
We associate left with lower addresses in memory because computer programs start executing the instruction at a lower address and then proceed through the instructions to a higher address. In other words, first the instruction in byte 0 is executed, and then the instruction at byte 1, and then the instruction at byte 2, and so on.
Note: This is a little over-simplified. Not all bytes contain instructions; not all instructions are one byte long (in Java every opcode is one byte, but many instructions carry operand bytes as well); and some instructions jump backward or forward in memory. However, none of this changes the point I’m making here about associating lower addresses in memory with left and higher addresses with right.
When people who speak English write sequences of numbers they automatically put 0 on the left as shown here:
0 1 2 3 4 5 6 7 8 9 10 11
Because English is a left-to-right language and most of the people who developed the first computers spoke English, the spatial concept of left came to be implicitly associated with lower addresses in memory. If the first digital computers had been invented in Arabic- or Hebrew-speaking cultures, which use right-to-left scripts, we’d probably speak of byte 0 as the rightmost byte.
Consider the number 6401. This is shorthand for six thousands, four hundreds, zero tens, and one one. The leftmost digit, 6, is the most important. It tells you to within a thousand how big the number is. Subsequent digits improve on the precision, but don’t change the big picture. In jargon, it’s said that 6 is the most significant digit. Similarly, the rightmost digit, 1, is the least significant digit.
The most significant digits are read first. Therefore, this is a Big-Endian number system. The big end of the number (the thousands) comes before the little end (the ones) of the number. This assumption seems to be perfectly reasonable unless and until you encounter a script in which numbers are stored differently.
A number system in which 6401 means 6 ones, 4 tens, 0 hundreds, and 1 thousand is called Little-Endian because the least significant digits come first. There’s no reason why numbers couldn’t be written that way. It’s just not the way European scripts count. There’s no mathematical reason to prefer Big-Endian numbers. It’s purely a convention enforced by centuries of common practice. It’s no more right or wrong than the grammatical convention that adjectives tend to come before the nouns they modify. In English and many other languages, adjectives come first. In Latin and many other languages, the nouns come first. Neither is right or wrong. They’re just different.
Bringing this discussion back to the level of computers, recall that a Java int is made of four bytes, each of which can be written as two hexadecimal digits. For example, decimal 6401 is 0x00001901, or the four bytes 00 00 19 01. Java follows a Big-Endian scheme. The most significant byte comes first, followed by the second most significant byte, followed by the third most significant byte, followed by the least significant byte.
Macs and most UNIX machines, including Sun’s, also use a Big-Endian architecture, where the byte with the highest place value in a number is the leftmost (lowest addressed) byte in the number. However, computer architectures based around the Intel X86 and VAX architectures do things exactly the opposite way. Those machines are Little-Endian; the least significant byte in a number comes first. On an X86 system, the same number is laid out in memory as the bytes 01 19 00 00.
Now let’s suppose we have to store the 4-byte integer 1,870,475,384 in this memory. All computer architectures would use four contiguous bytes. First, the integer is converted into its hexadecimal form, 6F7D3078; each 2-digit pair is exactly one byte. Working from the bottom up, as is customary in a stack, the first byte can go to address A, the second byte to address A+1, the third to A+2, and the fourth to A+3. Figure 2-2 shows this arrangement.
Figure 2-2 The number 0x6F7D3078 stored at address A in memory in Big-Endian order.
This is a classic Big-Endian ordering of bytes. However, not all architectures do it like this. In particular, X86 and VAX architectures use a Little-Endian ordering. They put the most significant byte at address A+3, the second most significant byte at address A+2, the third most significant byte at address A+1, and the least significant byte at address A. Figure 2-3 shows this arrangement.
Figure 2-3 The number 0x6F7D3078 stored at address A in Little-Endian order.
As long as you’re on only one computer system, you don’t need to worry about this. All the routines are designed to work with the native data format. However, as soon as you start trying to transfer data between systems, you need to worry about converting between byte orders. Otherwise, the integer you write to a file in Big-Endian format on your Sun as 6F7D3078 (1,870,475,384) will be read in Little-Endian format as 78307D6F (2,016,443,759) on your PC — not the same thing at all.
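You can reproduce this mix-up without leaving your own machine. The sketch below assumes a Java version whose Integer class provides reverseBytes(); the class and method names are otherwise made up for the example.

```java
final class ByteOrderMixup {
    // Returns the value a reader using the opposite byte order would see.
    static int asSeenWithOppositeOrder(int value) {
        return Integer.reverseBytes(value);
    }

    public static void main(String[] args) {
        int written = 0x6F7D3078;
        int misread = asSeenWithOppositeOrder(written);
        System.out.println(written + " was written");
        System.out.println(misread + " was read back");
        System.out.println("0x" + Integer.toHexString(misread).toUpperCase());
    }
}
```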
Note: Some older computer systems used neither Big-Endian nor Little-Endian byte orders. DEC’s PDP-11 wrote 4-byte integers in this order: second-most-significant byte, most-significant byte, least-significant byte, and second-least-significant byte. Other computers did even stranger things. Fortunately, these architectures have all died out, and we’re now left to deal with only the confusion between Little-Endian and Big-Endian.
Java was first designed by Big-Endian engineers at Sun Microsystems. It was also designed for the Internet, where almost all protocols specify Big-Endian byte orders. Therefore, it should come as no surprise that Java’s virtual machine uses Big-Endian format for all data types. Little-Endian systems, like the X86, have to translate the Big-Endian data in Java byte code into their native Little-Endian format before executing it.
Secret: You need to worry about byte order only when you’re reading data from, or writing data for, a Little-Endian system. The readByte(), readShort(), readInt(), readLong(), readFloat(), and readDouble() methods of java.io.DataInputStream all assume the data is Big-Endian. Similarly the writeByte(), writeShort(), writeInt(), writeLong(), writeFloat(), and writeDouble() methods of java.io.DataOutputStream write Big-Endian data. To read Little-Endian data in Java, you have to read each byte separately and then reconstruct the int or long from the bytes that make it up. To write Little-Endian data, you have to break the ints or longs apart into bytes and then write the bytes separately. There are several ways to accomplish this, but the most efficient use the bit-level operators discussed later in this chapter. I revisit this topic there.
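As a preview, here is one way the byte-by-byte approach might look for a 4-byte int. Treat it as a sketch rather than the definitive technique; the class LittleEndianInts and all of its method names are hypothetical, and the toBytes() and fromBytes() helpers exist only to make the example easy to try.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class LittleEndianInts {
    // Reads four bytes, least significant first, and reassembles the int.
    static int readLittleEndianInt(DataInputStream in) throws IOException {
        int b0 = in.readUnsignedByte();   // least significant byte
        int b1 = in.readUnsignedByte();
        int b2 = in.readUnsignedByte();
        int b3 = in.readUnsignedByte();   // most significant byte
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Breaks the int apart and writes the least significant byte first.
    static void writeLittleEndianInt(DataOutputStream out, int value) throws IOException {
        out.writeByte(value);          // writeByte() keeps only the low eight bits
        out.writeByte(value >>> 8);
        out.writeByte(value >>> 16);
        out.writeByte(value >>> 24);
    }

    static byte[] toBytes(int value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeLittleEndianInt(new DataOutputStream(buffer), value);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static int fromBytes(byte[] bytes) {
        try {
            return readLittleEndianInt(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(0x6F7D3078);
        System.out.println(Integer.toHexString(bytes[0] & 0xFF));      // lowest address
        System.out.println(Integer.toHexString(fromBytes(bytes)));     // original value again
    }
}
```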
Unsigned integers
Many traditional programming languages, notably C, allow the use of unsigned quantities. An unsigned number uses its high-order bit for data so it can count twice as high as a number that has to reserve one bit for the sign. However, it can only count positive numbers, not negative numbers. Recall that the largest signed byte is 01111111, which is 127 in decimal. As a signed byte, 11111111 is not 255 but rather -1. However, by reading 11111111 as an unsigned quantity, the first 1 bit is interpreted as 128, not as the sign. Thus, as an unsigned quantity, 11111111 is indeed 255. On the other hand, there’s no way to express negative numbers as unsigned numbers.
All Java numeric data types except char use signed integers exclusively. However it’s not unlikely that you’ll run across data from programs written in other languages that do have unsigned integers. java.io.DataInputStream has two methods that read unsigned quantities. readUnsignedByte() reads a single byte off the stream and returns an int between 0 and 255. An int is returned instead of a byte or a short because a byte can go only as high as 127, whereas an unsigned byte can go as high as 255. Similarly readUnsignedShort() reads two bytes from the input stream and returns an int between 0 and 65,535.
There is no similar readUnsignedInt() method. If you want to, it’s easy enough to write one yourself. You’ll need to read four bytes and return a long between 0 and 4,294,967,295. Again, the most efficient way to do this uses bit-level operators, so we’ll defer the details until the end of this chapter.
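In the meantime, here is a rough sketch of what such a method might look like. It assumes the four bytes arrive in Big-Endian order, and it leans on readInt() plus a mask rather than reading the bytes one at a time. The class name UnsignedInts and the fromBytes() helper are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class UnsignedInts {
    // Reads four Big-Endian bytes and returns them as a value from 0 to 4,294,967,295.
    static long readUnsignedInt(DataInputStream in) throws IOException {
        // The mask is a long, so the int is widened first and the
        // sign-extended upper 32 bits are then cleared.
        return in.readInt() & 0xFFFFFFFFL;
    }

    static long fromBytes(byte[] fourBytes) {
        try {
            return readUnsignedInt(new DataInputStream(new ByteArrayInputStream(fourBytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] allOnes = { (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF };
        System.out.println(fromBytes(allOnes));   // readInt() alone would give -1
    }
}
```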
An unsigned long — that is, an 8-byte unsigned integer — is relatively uncommon in practice. No primitive Java data type is large enough to handle unsigned longs. You can, however, use the java.math.BigInteger class instead.
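For example, BigInteger has a constructor that takes a sign and an array of Big-Endian magnitude bytes, which is a convenient fit for this job. The sketch below assumes you have already read the eight bytes into an array; the class and method names are hypothetical.

```java
import java.math.BigInteger;

final class UnsignedLongs {
    // Interprets eight Big-Endian bytes as a non-negative 64-bit quantity.
    static BigInteger fromBigEndianBytes(byte[] eightBytes) {
        // A signum of 1 tells BigInteger to treat the bytes as a positive magnitude.
        return new BigInteger(1, eightBytes);
    }

    public static void main(String[] args) {
        byte[] allOnes = new byte[8];
        java.util.Arrays.fill(allOnes, (byte) 0xFF);
        // Read as a signed long these bytes would be -1.
        System.out.println(fromBigEndianBytes(allOnes));
    }
}
```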
Integer widths
You’ve probably heard a lot of hype about 32-bit computing and 32-bit clean code. You’ll be hearing more about 64-bit platforms in the near future, if you haven’t already. What’s being referred to is, very roughly, the preferred size of an integer on a given computer architecture and the number of bits that can be transferred from main memory to the CPU in one clock cycle. Generally, the higher the number of bits, the faster the computer will run. However, you need to rewrite (or at least recompile) the software to accommodate the proper bit width before you can see the performance gain.
Much legacy code is written in languages like C that do not guarantee the width of an integer. The same C program may use 32-bit ints on a Sparc, 16-bit ints on a Mac, and 64-bit ints on a DEC Alpha. Although these all have Java equivalents, you have to know which one you’re dealing with before you write the code to handle it! Trying to read 16-bit ints with Java’s readInt() method is a sure path to failure.
There’s no guaranteed way to look at a file in the absence of outside information and tell solely from the contents of the file whether it was written using 16-bit integers or 32-bit integers. Similarly, you can’t tell whether or not it uses Big-Endian or Little-Endian data. In an ideal world, you’d have access to a specification that describes the data format used. If you don’t, perhaps you have access to the source code that was used to write the file. If not, you’ll have to do some testing. Try to read the file as 16-bit ints. Do the results make sense? What if you read it as 32-bit ints? Do those results make sense? If you seem to have an excessive number of zeroes appearing in your data, especially if they tend to alternate with non-zero values, that may indicate that you are reading the data using too short an integer. For example, if the data file is full of numbers mostly between 10 and 1000, then if it’s written with 32-bit ints, the high two bytes of each int will be zero.
Conversions and Casting
With seven different numeric types that may be freely intermixed in expressions, it’s important to understand the rules by which this intermixing takes place. Java converts between primitive data types in expressions, in assignment statements, as a result of explicit casts, and during method invocations. You need to understand when conversion can occur and what happens when it does.
Using a cast
Java enables you to explicitly change the type of a value using a cast. A cast is just the name of the type to which you wish to change the value, enclosed in parentheses. For example, suppose you’ve read a byte into the byte variable b, perhaps using DataInputStream’s readByte() method. Then you can cast that variable to the int type like this:
int n = (int) b;
This doesn’t permanently change the type of b. It just makes a temporary copy of the value of b and puts it in an int. This int is then assigned to the int variable n.
The second place in which conversion of primitive types takes place is in arithmetic expressions. Expressions range from simple ones, like a + b, to considerably more complex ones such as 1.65 * (32 / -9.8 - c++)/0.65. Each operation in an expression is evaluated using the wider type of its two operands, where doubles are wider than floats, which are wider than longs, which are wider than ints. Thus, if either operand is a double, the other is promoted to a double. If neither operand is a double but one is a float, the other is promoted to a float. If neither is a float or a double but one is a long, the other is promoted to a long. Finally, if neither operand is a float, double, or long, then both are promoted to ints. Because promotion is applied operation by operation, a subexpression such as 1 / 2 is still computed with int arithmetic even when the rest of the expression uses doubles. All arithmetic in Java uses at least ints. Shorts, bytes, and chars are never used directly in arithmetic expressions.
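One way to watch promotion happen is to let overloaded methods report the type the compiler assigned to an expression. This is just an illustrative sketch; PromotionDemo and describe() are invented names. Note the last line of main(), which shows that promotion is decided one operation at a time.

```java
final class PromotionDemo {
    // The compiler picks the overload that matches the expression's type.
    static String describe(int x)    { return "int"; }
    static String describe(long x)   { return "long"; }
    static String describe(float x)  { return "float"; }
    static String describe(double x) { return "double"; }

    public static void main(String[] args) {
        byte a = 10;
        byte b = 20;
        short s = 3;
        char c = 'x';

        System.out.println(describe(a + b));      // two bytes are added as ints
        System.out.println(describe(s * c));      // short and char are multiplied as ints
        System.out.println(describe(a + 5L));     // a long operand makes the sum a long
        System.out.println(describe(a * 1.5f));   // a float operand makes the product a float
        System.out.println(describe(5L + 2.0));   // a double operand makes the sum a double

        // 1 / 2 is int division and yields 0 before the double is added.
        System.out.println(1 / 2 + 0.5);
    }
}
```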
The third place in which conversions take place is in assignment statements; that is statements like
long a = 3 + 4;
In this example 3 is an int, 4 is an int, and the result of their addition is the int value 7. This must be promoted to a long before being assigned to a. Conversions in the other direction may lose information. Not all longs have equivalent int values. For example, 5294967295L is a valid long, but it’s more than two times larger than the largest int:
int n = 5294967295L;
If you try to assign 5294967295L to an int variable, you get the compile-time error Error: Incompatible type for declaration. Explicit cast needed to convert long to int. The compiler sees that you may lose information and warns you about it. However, the compiler isn’t that smart. The following assignment, which does not lose information, also causes a compiler error:
int m = 3L;
In both of these cases, you can tell the compiler that you’re aware of the problem, that you accept that your assignment may lose information, and that you want it to go ahead anyway. You do this with an explicit cast to the type on the left side. For example:
int m = (int) 3L;
int n = (int) 5294967295L;
This tells the compiler that you know what you’re doing, that you’ve given thought to whether this cast will lose data. Java tries to prevent you from performing operations that may lose data, but it does allow you to do so if you use a cast to tell it that you know what you’re doing.
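It’s worth seeing what those two casts actually produce. In this sketch (NarrowingLongs and narrow() are made-up names), the first cast keeps its value and the second silently loses the upper 32 bits.

```java
final class NarrowingLongs {
    // Keeps only the low 32 bits of the long.
    static int narrow(long value) {
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(narrow(3L));            // fits, so nothing is lost
        System.out.println(narrow(5294967295L));   // 5294967295 minus 4294967296
    }
}
```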
The final place where conversions take place is in method calls. Suppose you try to call MethodA(24). The compiler first tries to find a perfect match, a version of MethodA that takes as an argument a single int. However, if it fails in this effort, it will next look for a MethodA that takes a long as an argument. If it finds one, it promotes 24 to 24L and calls MethodA(long). Failing to find a MethodA that takes a long, Java next looks for one that takes a float. Failing to find that, it looks for one that takes a double. Only if it can’t find any of these will Java produce a compile-time error.
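A quick experiment shows this search order in action. In the sketch below, neither method has a version that takes an int, so the compiler has to promote the argument. The class and method names are hypothetical, and I’ve used lowercase method names in keeping with the usual Java convention.

```java
final class OverloadDemo {
    // No methodA(int) exists, so an int argument must be promoted.
    static String methodA(long x)   { return "long"; }
    static String methodA(double x) { return "double"; }

    // No methodB(int) or methodB(long) exists either.
    static String methodB(float x)  { return "float"; }
    static String methodB(double x) { return "double"; }

    public static void main(String[] args) {
        System.out.println(methodA(24));   // the long version is chosen
        System.out.println(methodB(24));   // the float version is chosen
    }
}
```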
The mechanics of conversion
Now that we’ve seen when conversions may take place, let’s investigate how. Some conversions, such as an int to a long, are easy and never lose information. Others, such as a long to an int, are trickier because not all longs have int equivalents. For example, suppose that a byte variable b holds the value 92. In binary notation, this is 01011100. Because an int needs 32 bits, three extra zero bytes are added to the front of b, making it 00000000000000000000000001011100.
Now suppose instead that the value of b is -92. Using two’s complement arithmetic, we see that the binary expansion of -92 is 10100011 + 00000001 = 10100100. Now if you just attach three bytes of zeroes on the left side of this number, you get 00000000000000000000000010100100, which is not -92 (since the sign bit is zero, the number must be positive) but rather 164, not the same thing at all. In fact, it’s not even off by a sign. If that were the problem, it would be simple enough to change the leftmost bit to 1. However, in two’s complement that gives you -2,147,483,484, which isn’t -92 any more than 164 is.
On the other hand, look what happens if you extend -92 with three bytes full of ones. You get 11111111111111111111111110100100. This is obviously a negative number since the leftmost bit is one. Using two’s complement arithmetic to find out which number it is, you invert the number and add one:
  00000000000000000000000001011011
+ 00000000000000000000000000000001
  00000000000000000000000001011100
which, lo and behold, is 92! Thus, the proper way to convert an integer type to a wider format is sign extension. That is, take whatever bit is in the sign bit and add as many extra bytes as you need filled with that bit. This works for other widening casts between integer types as well. For example, to change a positive int to the equivalent long, just add four bytes of zeroes to the front. To change a negative int to the equivalent long just attach four more bytes of ones to the front of the int. Performed in this fashion, widening integer casts (that is, casts that go from a smaller type to a larger type) never lose information.
The same cannot be said for narrowing casts. A narrowing cast moves from a wider type, like int, to a narrower type, like byte. To do this, the extra bytes are just cut off the front of the wider type. Thus, to move from the int 92 to the byte 92, remove the first three bytes from 00000000000000000000000001011100, leaving 01011100. This cast doesn’t lose information, but other casts can. For example, the int 192 is 00000000000000000000000011000000. If you cast this to a byte by removing the first three bytes, you get 11000000. Notice the sign bit. This is a negative number, specifically -64. There is no easy way around this problem. The numbers you get in a narrowing cast are not guaranteed to make sense. The simple fact is that you cannot fit 192 into a signed byte.
The two basic rules for conversion between integer data types are as follows:
1. If the type to be converted to is wider than the type you’re converting from, sign extend the narrower type.
2. If the type to be converted to is narrower than the type you’re converting from, truncate the most significant bytes of the integer you’re converting.
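Both rules are easy to confirm with the numbers used above. The following is a small sketch with made-up names (IntegerCasts, widen(), narrow()); it also shows the common trick of masking with 0xFF when you want the zero-extended value instead.

```java
final class IntegerCasts {
    // Widening: the byte is sign extended to 32 bits.
    static int widen(byte b) {
        return b;
    }

    // Narrowing: the three most significant bytes are cut off.
    static byte narrow(int n) {
        return (byte) n;
    }

    public static void main(String[] args) {
        byte b = -92;
        int n = widen(b);
        System.out.println(n);                           // still -92
        System.out.println(Integer.toBinaryString(n));   // 24 one bits, then 10100100
        System.out.println(b & 0xFF);                    // zero extension instead: 164

        System.out.println(narrow(92));                  // fits in a byte
        System.out.println(narrow(192));                 // does not fit
    }
}
```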
Conversions to and from the char type behave similarly, once you take account of the fact that char is unsigned. To convert a char to a byte, the high-order byte is truncated. To convert a char to a short, the char is left as is, but is now interpreted as a signed 2-byte integer. This may produce a negative number where there wasn’t one before if the char value is greater than 32,767, that is, if its high-order bit is one. To convert to an int or a long, the char is extended with two or six bytes of zeroes respectively. Because char is unsigned, there is no sign bit to extend, and the result is never negative.
To convert a byte to a char, the byte is sign extended by one byte and the result is interpreted as an unsigned 2-byte integer. To convert a short to a char, the short is merely reinterpreted as an unsigned 2-byte integer. Finally, to convert an int or long to a char, all but the least-significant 16 bits are truncated. Although converting a char to a short may play funny games with the sign, converting the short, int, or long back will return the original char.
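If you’d like to see what Java actually does with a char whose high-order bit is one, try something like the following sketch. The class and method names are invented for the example. Notice that the int result is not negative, while the short result is.

```java
final class CharCasts {
    static int toInt(char c)       { return c; }           // widening, no cast needed
    static short toShort(char c)   { return (short) c; }   // same 16 bits, now signed
    static char fromShort(short s) { return (char) s; }    // same 16 bits, now unsigned
    static char fromByte(byte b)   { return (char) b; }    // sign extended to 16 bits

    public static void main(String[] args) {
        char c = (char) 0xC000;   // 49,152, so the high-order bit is one

        System.out.println(toInt(c));
        System.out.println(toShort(c));
        System.out.println(fromShort(toShort(c)) == c);   // round trip restores the char
        System.out.println((int) fromByte((byte) -1));    // all 16 bits are ones
    }
}
```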
The rules for conversions to and from floating-point numbers are more complex. A float can be cast to a double with no loss of precision whatsoever. Double to float conversion presents some problems, though. Some doubles can be exactly represented as floats, but some are too large, some are too small, and some have more precision (that is, a longer mantissa) than a float allows. A double with too much precision is rounded to the nearest float. If the absolute value of the double is larger than can fit in a float, the float becomes infinity, positive or negative depending on the sign of the double. If the absolute value of the double is smaller than can fit in a float (that is, closer to zero), the float becomes zero, positive or negative depending on the sign of the double.
Floats and doubles that are small enough to be represented as ints must fall between two ints; that is, there is an int value larger than the float and an int value smaller than the float. The float is rounded to the int in the pair between which it falls that is closest to zero. Thus, 7.5 is rounded to 7; 7.6 is also rounded to 7, but -7.5 is rounded to -7, not to -8. If the float or double is too large to be represented as an int, for example 6.73E14, then it is rounded to the largest possible int, 2,147,483,647. Similarly, if the float is too small and negative, for example -6.73E14, then it is rounded to the smallest possible int, -2,147,483,648. NaN is rounded to zero. Rounds to longs behave similarly except that the largest and smallest values are quite a bit larger.
Conversions of floats and doubles to shorts and bytes involve a two-step procedure. First the float or double is converted to a double, as described earlier in this chapter. Then the int is converted to a byte or short in the normal way, by truncating the excess bytes in the int. Thus, casting the float 7.5 to a byte results in the value 7. However, casting 175.5 to a byte results in the value -47. This occurs by first rounding 175.5 to 175, 0x000000AF, and then by truncating this to AF, 10101111 in binary. Of course, a byte is signed, so this is equal to -47.
I can think of little reason to want to convert a float or a double to a char, but you can if you need to. The conversion takes place much as with conversions to shorts: the float or double is first converted to an int, which is then converted to a char.
The mechanics of conversion
Now that we’ve seen when conversions may take place, let’s investigate how. Some conversions, such as an int to a long, are easy and never lose information. Others, such as a long to an int, are trickier because not all longs have int equivalents. For example, suppose that a byte variable b holds the value 92. In binary notation, this is 01011100. Because an int needs 32 bits, converting b to an int adds three extra zero bytes to the front of b, making it 00000000000000000000000001011100.
Now suppose instead that the value of b is -92. Using two’s complement arithmetic, we see that the binary expansion of -92 is 10100011 + 00000001 = 10100100. Now if you just attach three bytes of zeroes on the left side of this number, you get 00000000000000000000000010100100, which is not -92 (since the sign bit is zero, the number must be positive) but rather 164, not the same thing at all. In fact, it’s not even off by a sign. If that were the problem, it would be simple enough to change the leftmost bit to 1. However, in two’s complement that gives you -2,147,483,484, which isn’t -92 any more than 164 is.
On the other hand, look what happens if you extend -92 with three bytes full of ones. You get 11111111111111111111111110100100. This is obviously a negative number since the leftmost bit is one. Using two’s complement arithmetic to find out which number it is, you invert the number and add one:
  00000000000000000000000001011011
+ 00000000000000000000000000000001
  --------------------------------
  00000000000000000000000001011100
which, lo and behold, is 92! Thus, the proper way to convert an integer type to a wider format is sign extension. That is, take whatever bit is in the sign bit and add as many extra bytes as you need filled with that bit. This works for other widening casts between integer types as well. For example, to change a positive int to the equivalent long, just add four bytes of zeroes to the front. To change a negative int to the equivalent long, just attach four more bytes of ones to the front of the int. Performed in this fashion, widening integer casts (that is, casts that go from a smaller type to a larger type) never lose information.
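Here is a minimal sketch of sign extension in action. The class and method names (WideningDemo, signExtend, zeroExtend) are made up for this illustration; the second method shows, for contrast, what padding with zeroes would give you instead.

```java
final class WideningDemo {
    /** Widens a byte to an int the way Java does it: by sign extension. */
    static int signExtend(byte b) {
        return b; // implicit widening conversion
    }

    /** Widens a byte by padding with zero bits instead (treats the byte as unsigned). */
    static int zeroExtend(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = -92;
        System.out.println(signExtend(b)); // -92
        System.out.println(zeroExtend(b)); // 164
        // 24 one bits followed by 10100100
        System.out.println(Integer.toBinaryString(signExtend(b)));
    }
}
```

The mask in zeroExtend is the usual idiom when you want to read a byte as an unsigned value between 0 and 255.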
The same cannot be said for narrowing casts. A narrowing cast moves from a wider type, like int, to a narrower type, like byte. To do this, the extra bytes are just cut off the front of the wider type. Thus, to move from the int 92 to the byte 92, remove the first three bytes from 00000000000000000000000001011100, leaving 01011100. This cast doesn’t lose information, but other casts can. For example, the int 192 is 00000000000000000000000011000000. If you cast this to a byte by removing the first three bytes, you get 11000000. Notice the sign bit. This is a negative number, specifically -64. There is no easy way around this problem. The numbers you get in a narrowing cast are not guaranteed to make sense. The simple fact is that you cannot fit 192 into a signed byte.
The two basic rules for conversion between integer data types are as follows:
1. If the type to be converted to is wider than the type you’re converting from, sign extend the narrower type.
2. If the type to be converted to is narrower than the type you’re converting from, truncate the most significant bytes of the integer you’re converting.
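The truncation rule is easy to check for yourself. This is only a sketch; NarrowingDemo and its helper methods are names invented for the example.

```java
final class NarrowingDemo {
    /** Narrows an int to a byte by keeping only the low-order eight bits. */
    static byte toByte(int i) {
        return (byte) i;
    }

    /** Narrows a long to an int by keeping only the low-order 32 bits. */
    static int toInt(long l) {
        return (int) l;
    }

    public static void main(String[] args) {
        System.out.println(toByte(92));         // 92: fits, nothing lost
        System.out.println(toByte(192));        // -64: bit 7 becomes the sign bit
        System.out.println(toByte(256));        // 0: the only set bit was cut off
        System.out.println(toInt(4294967297L)); // 1: 2^32 + 1 loses its high bytes
    }
}
```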
Conversions to and from the char type behave similarly, once you take account of the fact that char is unsigned. To convert a char to a byte, the high-order byte is truncated. To convert a char to a short, the char is left as is, but is now interpreted as a signed 2-byte integer. This may produce a negative number where there wasn’t one before if the char value is greater than 32,767 (that is, if its high-order bit is one). To convert to an int or a long, the char is zero extended by two or six bytes respectively; because char is unsigned, the added bytes are always zeroes and the result is never negative.
To convert a byte to a char, the byte is sign extended one byte. To convert a short to a char, the short is merely reinterpreted as an unsigned, 2-byte integer. Finally, to convert an int or long to a char, all but the least-significant 16 bits are truncated. Although converting a char to a short may play funny games with the sign, converting the result back will return the original char.
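The following sketch exercises each of the char conversions on the largest char value, \uFFFF. The class name CharConversionDemo and its one-line helpers are hypothetical; they exist only so each cast has a label.

```java
final class CharConversionDemo {
    static int toInt(char c) { return c; }              // widening: pads with zero bits
    static short toShort(char c) { return (short) c; }  // same 16 bits, now read as signed
    static byte toByte(char c) { return (byte) c; }     // keeps only the low-order byte
    static char fromByte(byte b) { return (char) b; }   // sign extends, then reads as unsigned
    static char fromInt(int i) { return (char) i; }     // keeps the low-order 16 bits

    public static void main(String[] args) {
        char c = '\uFFFF';
        System.out.println(toInt(c));                   // 65535
        System.out.println(toShort(c));                 // -1
        System.out.println(toByte(c));                  // -1
        System.out.println((int) fromByte((byte) -1));  // 65535
        System.out.println(fromInt(65601));             // A (65601 is 65536 + 65)
        System.out.println((char) toShort(c) == c);     // true: the round trip is safe
    }
}
```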
The rules for conversions to and from floating-point numbers are more complex. A float can be cast to a double with no loss of precision whatsoever. Double to float conversion presents some problems, though. Some doubles can be exactly represented as floats, but some are too large, some are too small, and some have more precision (that is, a longer mantissa) than a float allows. If the absolute value of the double is larger than can fit in a float, the float becomes infinity — positive or negative depending on the sign of the double. If the absolute value of the double is smaller than can fit in a float (that is, closer to zero), the float becomes zero — positive or negative depending on the sign of the double.
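A short sketch of the three ways a double can fail to fit in a float. The sample values (1e40, 1e-50, and 0.1) are my own choices, picked because they fall safely outside or between the values a float can hold; DoubleToFloatDemo is an invented name.

```java
final class DoubleToFloatDemo {
    static float narrow(double d) {
        return (float) d;
    }

    public static void main(String[] args) {
        System.out.println(narrow(1e40));    // Infinity: too large for a float
        System.out.println(narrow(-1e40));   // -Infinity
        System.out.println(narrow(1e-50));   // 0.0: too close to zero
        System.out.println(narrow(-1e-50));  // -0.0: the sign survives
        // 0.1 has more mantissa bits as a double than a float can keep,
        // so widening the float back does not return the original double.
        System.out.println((double) narrow(0.1) == 0.1);  // false
        // Going the other way is always exact.
        System.out.println((double) 0.5f == 0.5);         // true
    }
}
```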
A float or double within the range of an int either is a whole number already or falls between two adjacent ints. In the second case it is rounded to whichever of the two is closer to zero; that is, the fractional part is simply discarded. Thus, 7.5 is rounded to 7; 7.6 is also rounded to 7, and -7.5 is rounded to -7, not to -8. If the float or double is too large to be represented as an int, for example 6.73E14, then it is converted to the largest possible int, 2,147,483,647. Similarly, if it is too large in magnitude and negative, for example -6.73E14, then it is converted to the smallest possible int, -2,147,483,648. NaN is converted to zero. Casts to longs behave similarly except that the largest and smallest values are quite a bit larger.
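These rules can be confirmed with a few casts. The sketch below reuses the sample values from the paragraph above; the class name FloatToIntDemo is an assumption of mine.

```java
final class FloatToIntDemo {
    static int toInt(double d) {
        return (int) d;
    }

    static long toLong(double d) {
        return (long) d;
    }

    public static void main(String[] args) {
        System.out.println(toInt(7.5));        // 7
        System.out.println(toInt(7.6));        // 7
        System.out.println(toInt(-7.5));       // -7
        System.out.println(toInt(6.73E14));    // 2147483647
        System.out.println(toInt(-6.73E14));   // -2147483648
        System.out.println(toInt(Double.NaN)); // 0
        System.out.println(toLong(6.73E14));   // 673000000000000: a long has room
    }
}
```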
Conversions of floats and doubles to shorts and bytes involve a two-step procedure. First the float or double is converted to an int, as described earlier in this chapter. Then the int is converted to a byte or short in the normal way, by truncating the excess bytes in the int. Thus, casting the float 7.5 to a byte results in the value 7. However, casting 175.5 to a byte results in the value -81. This occurs by first rounding 175.5 to 175, 0x000000AF, and then by truncating this to AF, 10101111 in binary. Of course, a byte is signed, so this is equal to 175 - 256 = -81.
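Running the two steps by hand is a quick check on the arithmetic. In this sketch, direct() lets Java do the whole cast while twoStep() spells out the intermediate int; both names, like FloatToByteDemo itself, are invented for the example.

```java
final class FloatToByteDemo {
    /** Lets the cast operator do both steps at once. */
    static byte direct(float f) {
        return (byte) f;
    }

    /** Spells out the intermediate int explicitly. */
    static byte twoStep(float f) {
        int i = (int) f;  // step 1: discard the fraction
        return (byte) i;  // step 2: truncate to the low-order byte
    }

    /** The same route applies to char, with an unsigned result. */
    static char toChar(double d) {
        return (char) d;
    }

    public static void main(String[] args) {
        System.out.println(direct(7.5f));    // 7
        System.out.println(direct(175.5f));  // -81
        System.out.println(twoStep(175.5f)); // -81
        System.out.println(toChar(65.7));    // A
    }
}
```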
I can think of little reason to want to convert a float or a double to a char, but you can if you need to. The conversion takes place much as with conversions to shorts: the float or double is first converted to an int, which is then converted to a char.
Bit-Level Operators
The 13 bit-level operators are among the more obscure in Java. They nonetheless have their uses. The bitwise operators operate on a number or boolean at the bit level, generally by comparing the bits in two quantities and returning a result that depends on the bits in each. The single exception is ~, the NOT, or complement, operator. It takes a single argument and inverts all its bits. The bitshift operators take two operands: the number to be shifted and the number of places to shift it. Except for ~, these operators have “operate and assign” equivalents as well. Table 2-5 lists all the bit-level operators in Java.
Table 2-5 The bitwise operators
Operator    Meaning
&           AND
|           OR
^           Exclusive OR
~           NOT (complement)
<<          Shift bits left
>>          Shift bits right
>>>         Shift bits right without sign extension
&=          AND and assign
|=          OR and assign
^=          Exclusive OR and assign
<<=         Shift bits left and assign
>>=         Shift bits right and assign
>>>=        Shift bits right without sign extension and assign
Some terminology
We’ll need some shorthand to discuss these operators. First, given a value with n bits, the rightmost, least-significant bit is bit 0. The second-rightmost bit is bit 1, and so on, up to the leftmost and most significant bit, which is bit n-1. For example, the byte value 37, 00100101 in binary, would have bits shown in Figure 2-4.
Figure 2-4 Bit positions in a byte.
Next, when I write that a bit is “set,” or “on,” that means the bit is 1. When I write that a bit is “not set,” “unset,” or “off,” that means the bit is 0. You’ll also hear these states referred to as “true” and “false” in other books, but I avoid that terminology here to avoid confusion with the boolean literals.
Finally, note that a lot of the examples in this book will be with bytes, simply because it’s easier to follow what’s going on when you only have to keep track of eight bits. However, just as Java performs arithmetic only on int and larger data types, and promotes the operands as necessary, so too will it promote the operands of a bitwise operator and return an int or larger result. For example, even if b1 and b2 are bytes, b1 & b2 is an int; both b1 and b2 are promoted to ints before the bitwise and is performed.
Bitwise operators
The bitwise operators — &, |, and ^ — combine two numbers according to their bit patterns. The bitwise not operator ~ inverts a single number’s bit pattern.
The & operator
The & operator is the bitwise AND operator. It takes two numeric arguments, compares their bits, and sets the bits in the result that are set in both of the arguments. For example, let b1 be a byte with value 78 and b2 be a byte with the value -23. In binary, 78 is 01001110 and -23 is 11101001. Lay these values on top of each other as shown in Figure 2-5. The result, shown in the bottom row, is 01001000, that is, 72.
Figure 2-5 78 & -23.
The bits that are equal to one in both 78 and -23 are equal to one in the result. All other bits are zero.
As mentioned earlier, Java actually performs this calculation using 32-bit ints. Because the high-order three bytes of a positive int such as 78 are just full of zeroes, the real result of 78 & -23 must be 00000000000000000000000001001000. If either argument of & has a zero bit in a particular position, that bit must be 0 in the result, regardless of the value of the bit in the second argument. Therefore, 0 & anything is always 0.
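A small sketch makes both points concrete: the AND itself, and the promotion to int that forces a cast if you want a byte back. The toBinary8 helper is something I made up to print the low-order eight bits; it is not part of any Java library.

```java
final class BitwiseAndDemo {
    /** Formats the low-order eight bits of a value as a zero-padded binary string. */
    static String toBinary8(int value) {
        String bits = Integer.toBinaryString(value & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    /** b1 & b2 has type int, so storing it in a byte requires a cast. */
    static byte and(byte b1, byte b2) {
        return (byte) (b1 & b2);
    }

    public static void main(String[] args) {
        byte b1 = 78;
        byte b2 = -23;
        System.out.println(toBinary8(b1));           // 01001110
        System.out.println(toBinary8(b2));           // 11101001
        System.out.println(toBinary8(and(b1, b2)));  // 01001000
        System.out.println(and(b1, b2));             // 72
    }
}
```

Without the cast in and(), the compiler rejects the return statement because an int may not fit in a byte.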
The & operator can also be used with two booleans: true & true is true, true & false is false, and false & false is false. At the level of the virtual machine, the boolean value true is the int 00000001 and false is the int 00000000. Thus, true & true is the same as 00000001 & 00000001 equals 00000001 or true. Conversely, false & false is 00000000 & 00000000 equals 00000000 or false. And finally, true & false is 00000001 & 00000000 equals 00000000 or false.
This is often used to avoid short-circuiting expression evaluation. Suppose isConditionOne() and isConditionTwo() are methods that return booleans and have some side effect such as printing output on System.out. Now suppose you write this statement:
if ( isConditionOne() && isConditionTwo() ) doSomething();
If isConditionOne() returns false, then isConditionTwo() is never called. Because isConditionOne() is known to be false, Java knows the result will be false, regardless of the value of isConditionTwo(). This can be a problem when isConditionTwo() has side effects, and you need it to be called regardless of condition one. To force isConditionTwo() to be called, use the bitwise & instead. That is,
if ( isConditionOne() & isConditionTwo() ) doSomething();
The truth value of ( isConditionOne() & isConditionTwo() ) is the same as the truth value of ( isConditionOne() && isConditionTwo() ), but now both methods will be called.
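To see the difference, you can count how often the second method runs. In this sketch the side effect is a simple counter rather than output on System.out, and the two callsWith... methods are hypothetical test harnesses of my own.

```java
final class ShortCircuitDemo {
    static int calls = 0;

    static boolean isConditionOne() {
        return false;
    }

    /** The side effect here is incrementing a counter. */
    static boolean isConditionTwo() {
        calls++;
        return true;
    }

    /** Returns how many times isConditionTwo() ran under &&. */
    static int callsWithLogicalAnd() {
        calls = 0;
        boolean result = isConditionOne() && isConditionTwo();
        return calls;
    }

    /** Returns how many times isConditionTwo() ran under &. */
    static int callsWithBitwiseAnd() {
        calls = 0;
        boolean result = isConditionOne() & isConditionTwo();
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(callsWithLogicalAnd()); // 0: short-circuited
        System.out.println(callsWithBitwiseAnd()); // 1: both operands evaluated
    }
}
```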
The | operator
The | operator is the bitwise OR operator. It takes two numeric arguments, compares their bits, and sets the bits in the result that are set in either or both of the arguments. For example, let b1 be a byte with value 78 and b2 be a byte with the value -23. In binary, 78 is 01001110 and -23 is 11101001. Lay these values on top of each other as shown in Figure 2-6. The result, shown in the bottom row, is 11101111, that is -17.
Figure 2-6 78 | -23.
The bits that are equal to one in either 78 or -23 or both are equal to one in the result. All other bits are zero.
Of course, Java actually performs this calculation using 32-bit ints. Because the high-order three bytes of the negative int -23 are full of ones, the real result of 78 | -23 is 11111111111111111111111111101111. If either argument of | has a one bit in a particular position, that bit must be 1 in the result, regardless of the value of the bit in the second argument.
The | operator can also be used with two booleans: true | true is true, true | false is true, and false | false is false.
The AWT sometimes uses this to set a series of flags. If you have an item that has up to 32 boolean characteristics, then you can stuff all the values of those characteristics into an int.
For example, consider the java.awt.Font class. To create a new font, you use this constructor:
public Font(String name, int style, int size)
The name is the name of the typeface, like Times or Arial. The size is the size of the font in points, such as 12 or 24. The style, however, is one of a special set of mnemonic constants. These constants are
Font.PLAIN = 0
Font.BOLD = 1
Font.ITALIC = 2
You can pass one of these constants in the style argument of the Font constructor to get that style. However, what if you want a Font that is both bold and italic? Then, you pass Font.BOLD | Font.ITALIC. This means that the bold bit and the italic bit are both set in the style argument. Notice that Font.BOLD is 00000001 whereas Font.ITALIC is 00000010. Each bit in the number is a binary flag indicating the value of the binary characteristic; for example, is this or is this not bold? Other classes that use this scheme can have many more such constants, all of which are powers of two: 4, 8, 16, 32, 64, and so on. Each power of two is a 32-bit int with exactly one bit set and the rest unset.
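The flag scheme can be sketched without touching the AWT at all. The constants below are stand-ins that merely copy the values listed above, and isSet() is a helper of my own that uses & to test a single flag.

```java
final class StyleFlagsDemo {
    // Hypothetical constants that mirror the values of the Font style constants.
    static final int PLAIN = 0;   // 00000000
    static final int BOLD = 1;    // 00000001
    static final int ITALIC = 2;  // 00000010

    /** Combines two flags into one style value. */
    static int combine(int first, int second) {
        return first | second;
    }

    /** Tests whether a particular flag bit is set in a style value. */
    static boolean isSet(int style, int flag) {
        return (style & flag) != 0;
    }

    public static void main(String[] args) {
        int style = combine(BOLD, ITALIC);
        System.out.println(style);                // 3, that is 00000011
        System.out.println(isSet(style, BOLD));   // true
        System.out.println(isSet(style, ITALIC)); // true
        System.out.println(isSet(PLAIN, BOLD));   // false
    }
}
```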
As with &, | can also prevent the short-circuiting of expression evaluation. Consider the statement
if ( isConditionOne() || isConditionTwo() ) doSomething();
If isConditionOne() returns true, then isConditionTwo() will not be called because Java knows the result will be true, regardless of the value of isConditionTwo(). To force isConditionTwo() to be called, use the bitwise | instead. That is,
if ( isConditionOne() | isConditionTwo() ) doSomething();
The truth value of ( isConditionOne() | isConditionTwo() ) is the same as the truth value of ( isConditionOne() || isConditionTwo() ), but now both methods are called.
The ^ operator
The ^ operator is the bitwise EXCLUSIVE-OR operator. The | operator does not behave the way many people expect from the English word “or.” Many people think that “A or B” is true if A is true and B is not true, or vice versa, but that “A or B” is not true if both A and B are true. ^ is the bitwise equivalent of this idea. The ^ operator takes two numeric arguments, compares their bits, and sets the bits in the result that are set in exactly one of the arguments.
Returning to the example where b1 is a byte with value 78 and b2 is a byte with the value -23, lay these values on top of each other as shown in Figure 2-7. The result, shown in the bottom row, is 10100111, that is, -89.
Figure 2-7 78 ^ -23.
The ^ operator can also be used with two booleans: true ^ true is false, true ^ false is true, and false ^ false is false.
The ~ operator
The ~ operator is the bitwise NOT or complement operator. It is unary; that is, it acts on a single integer value, and it flips all the bits in that value. (It cannot be applied to a boolean; the logical ! operator does that job.) As a result, all ones turn to zeroes and zeroes turn to ones. Figure 2-8 shows 78 and ~78.
Figure 2-8 78 and ~78.
By the nature of two’s complement arithmetic, if b is an int or a long, then ~b equals -b - 1.
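The exclusive-or and complement results are easy to verify. This sketch, with the made-up class name XorNotDemo, also shows a handy property of ^: applying the same value twice returns the original number.

```java
final class XorNotDemo {
    static int xor(int a, int b) {
        return a ^ b;
    }

    static int not(int a) {
        return ~a;
    }

    public static void main(String[] args) {
        System.out.println(xor(78, -23));            // -89
        System.out.println(xor(xor(78, -23), -23));  // 78: XOR with the same value undoes itself
        System.out.println(not(78));                 // -79
        System.out.println(not(78) == -78 - 1);      // true: ~b equals -b - 1
        System.out.println(true ^ true);             // false
    }
}
```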
Assignment operators
The &=, |=, and ^= operators behave like their arithmetic cousins, *=, +=, -=, %= and /=. In other words, they combine the value on the left side of the operator with the value on the right side, and then assign it to the left side. For example:
int a = 78;
a &= -23;
This makes a equal to 72. |= and ^= behave similarly except they use bitwise OR and bitwise XOR respectively.
Bit shift operators
The bit shift operators shift the bits in an integer type by a specified number of places to the right or left. Bit shift operators cannot be used on floats, doubles, or booleans. For example, << is the left shift operator. The integer 78 is 00000000000000000000000001001110 in binary. Table 2-6 shows the result of shifting it progressively leftward. Notice that at each step the pattern of ones and zeroes appears to move one bit further left.
Table 2-6 Left-shifting 78
Value                  Bit Pattern
78                     00000000000000000000000001001110
78 << 1 = 156          00000000000000000000000010011100
78 << 2 = 312          00000000000000000000000100111000
78 << 3 = 624          00000000000000000000001001110000
78 << 4 = 1,248        00000000000000000000010011100000
78 << 5 = 2,496        00000000000000000000100111000000
78 << 6 = 4,992        00000000000000000001001110000000
78 << 7 = 9,984        00000000000000000010011100000000
78 << 8 = 19,968       00000000000000000100111000000000
78 << 9 = 39,936       00000000000000001001110000000000
78 << 10 = 79,872      00000000000000010011100000000000
78 << 11 = 159,744     00000000000000100111000000000000
78 << 12 = 319,488     00000000000001001110000000000000
78 << 13 = 638,976     00000000000010011100000000000000
Also notice that at each step, the value of the number is doubled. A 1-bit shift left is exactly equivalent to multiplication by two. Depending on the compiler, the virtual machine, and the CPU, it may be mildly quicker to shift an int to the left by the appropriate number of bits rather than to multiply by two. Similarly, shifting an int to the right can replace dividing by two or a power of two. However, this optimization may well not be worth the decrease in the legibility of your code, even on platforms where it makes a difference in performance.
What happens when the pattern of ones reaches the left side? Does it wrap around? No. The ones just march off to the left as the right side fills with zeroes. Note that once you hit 25 left shifts, you lose the multiplication by two property and drop over into negative numbers. If you had started with a larger number, this might have happened sooner. From that point on, the results bear little numerical relation to the original 78. Table 2-7 demonstrates.
Table 2-7 Left-shifting 78 by 22 to 31 places
Value                          Bit Pattern
78 << 22 = 327,155,712         00010011100000000000000000000000
78 << 23 = 654,311,424         00100111000000000000000000000000
78 << 24 = 1,308,622,848       01001110000000000000000000000000
78 << 25 = -1,677,721,600      10011100000000000000000000000000
78 << 26 = 939,524,096         00111000000000000000000000000000
78 << 27 = 1,879,048,192       01110000000000000000000000000000
78 << 28 = -536,870,912        11100000000000000000000000000000
78 << 29 = -1,073,741,824      11000000000000000000000000000000
78 << 30 = -2,147,483,648      10000000000000000000000000000000
78 << 31 = 0                   00000000000000000000000000000000
However, if you keep going, something interesting happens. The next shift, by 32, appears to bring the number back, as Table 2-8 demonstrates.
Table 2-8 should look familiar. Except for the number of bits by which 78 is shifted, it’s an exact copy of Table 2-6. Did it just take a little extra time to wrap around? Not exactly. Java limits the right side of the shift operator to five bits (six bits if the left side is a long). Extra bits are truncated. This means that you can only really shift an int (or a byte, or a short) between 0 and 31 bits. Longs can be shifted between 0 and 63 bits. If you try to shift by more than that, Java throws away the higher-order bits. Thus, in the last line of Table 2-8, 78 is really being shifted by 45 - 32 = 13 bits, not by 45 bits.
Table 2-8 Left-shifting 78 by values greater than 31
Value                  Bit Pattern
78 << 32 = 78          00000000000000000000000001001110
78 << 33 = 156         00000000000000000000000010011100
78 << 34 = 312         00000000000000000000000100111000
78 << 35 = 624         00000000000000000000001001110000
78 << 36 = 1,248       00000000000000000000010011100000
78 << 37 = 2,496       00000000000000000000100111000000
78 << 38 = 4,992       00000000000000000001001110000000
78 << 39 = 9,984       00000000000000000010011100000000
78 << 40 = 19,968      00000000000000000100111000000000
78 << 41 = 39,936      00000000000000001001110000000000
78 << 42 = 79,872      00000000000000010011100000000000
78 << 43 = 159,744     00000000000000100111000000000000
78 << 44 = 319,488     00000000000001001110000000000000
78 << 45 = 638,976     00000000000010011100000000000000
In this example, we used an int. You can also shift bytes, shorts, chars, and longs. Bytes, chars, and shorts are promoted to ints before being shifted. Floats, doubles, and booleans cannot be shifted.
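The truncation of the shift distance amounts to masking it with 31 for an int or 63 for a long. The sketch below makes that mask explicit so you can compare it with what the << operator does on its own; the class and method names are invented for the purpose.

```java
final class ShiftDistanceDemo {
    /** The distance an int is really shifted by: only the low five bits count. */
    static int effectiveIntDistance(int n) {
        return n & 31;
    }

    /** The distance a long is really shifted by: only the low six bits count. */
    static int effectiveLongDistance(int n) {
        return n & 63;
    }

    static int shiftLeft(int value, int n) {
        return value << n;
    }

    static long shiftLeft(long value, int n) {
        return value << n;
    }

    public static void main(String[] args) {
        System.out.println(effectiveIntDistance(45));  // 13
        System.out.println(shiftLeft(78, 45));         // 638976, same as 78 << 13
        System.out.println(shiftLeft(78, 32));         // 78
        System.out.println(shiftLeft(78L, 32));        // 335007449088: a long really moves 32 places
        System.out.println(shiftLeft(78L, 64));        // 78: 64 & 63 is 0
        byte b = 78;
        System.out.println(b << 25);                   // -1677721600: b is promoted to int first
    }
}
```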
Making Floats from Bits
If you really need to create a float from a series of bits, there are a couple of workarounds. You can shift the bits around in an int and use the static java.lang.Float.intBitsToFloat() method to convert the int into a float. For example, suppose data is a byte array with four components that correspond to the four bytes in a float. (You’ll see exactly this in Chapter 4.) You can read the float out of the byte array by first shifting the bytes into an int called bits and then calling the java.lang.Float.intBitsToFloat(int bits) method. Each byte is masked with 0xFF first because bytes are sign extended when they are promoted to ints, and the extra one bits of a negative byte would otherwise overwrite the higher-order bytes:
int bits = (data[0] & 0xFF) << 24 | (data[1] & 0xFF) << 16 | (data[2] & 0xFF) << 8 | (data[3] & 0xFF);
float f = Float.intBitsToFloat(bits);
You can make doubles from longs in a similar fashion with the java.lang.Double.longBitsToDouble(long bits) method.
long bits = (data[0] & 0xFFL) << 56 | (data[1] & 0xFFL) << 48 | (data[2] & 0xFFL) << 40 | (data[3] & 0xFFL) << 32
          | (data[4] & 0xFFL) << 24 | (data[5] & 0xFFL) << 16 | (data[6] & 0xFFL) << 8 | (data[7] & 0xFFL);
double d = Double.longBitsToDouble(bits);
Alternately, you can construct a ByteArrayInputStream from the byte array, chain the ByteArrayInputStream to a DataInputStream, and then call the readFloat() or readDouble() method of the DataInputStream. For example,
ByteArrayInputStream bis = new ByteArrayInputStream(data);
DataInputStream dis = new DataInputStream(bis);
float f = dis.readFloat();
In most Java implementations, this is less efficient than the first alternative. However, it produces slightly more intelligible code.
The >> operator shifts numbers to the right with sign extension. This means that vacated bits on the left are filled with the sign bit: 0 for a positive number or 1 for a negative number. Otherwise, right shifts carry the same caveats as left shifts: The left side must be an integral type and will be promoted to an int if necessary before shifting. The right side must be between 0 and 31 (0 to 63 if the left-hand side is a long) and will be truncated to that range if necessary. For example, the int -23 is, in binary notation, 11111111111111111111111111101001. Table 2-9 shows what you get when this is right shifted by various numbers of bits. Note that the vacated spots are filled with sign bits and that right shifting by one bit is equivalent to division by two, rounded down toward negative infinity rather than toward zero (so -23 >> 1 is -12, not -11).
Table 2-9 Right-shifting -23
Value              Bit Pattern
-23                11111111111111111111111111101001
-23 >> 1 = -12     11111111111111111111111111110100
-23 >> 2 = -6      11111111111111111111111111111010
-23 >> 3 = -3      11111111111111111111111111111101
-23 >> 4 = -2      11111111111111111111111111111110
-23 >> 5 = -1      11111111111111111111111111111111
-23 >> 6 = -1      11111111111111111111111111111111
Sometimes you don’t want to fill with the sign bits, but rather with 0. The >>> operator does an unsigned shift right. In other words, it fills the vacated spaces with zeroes regardless of the sign bit. Table 2-10 demonstrates this.
Table 2-10 Unsigned right-shifting of -23
Value                          Bit Pattern
-23                            11111111111111111111111111101001
-23 >>> 1 = 2,147,483,636      01111111111111111111111111110100
-23 >>> 2 = 1,073,741,818      00111111111111111111111111111010
-23 >>> 3 = 536,870,909        00011111111111111111111111111101
-23 >>> 4 = 268,435,454        00001111111111111111111111111110
-23 >>> 5 = 134,217,727        00000111111111111111111111111111
-23 >>> 6 = 67,108,863         00000011111111111111111111111111
The >>=, <<=, and >>>= operators behave as you might expect, shifting the left argument by the number of bits specified in the right argument and in the direction specified by the operator, and then assigning the result to the left side.
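Here is a sketch that puts the two right shifts and their assignment forms side by side. The last method illustrates a consequence of promotion that is easy to miss: applied to a byte, >>>= shifts the promoted int and then narrows the result back, so the zero that was shifted in is cut off again. All names here are my own.

```java
final class RightShiftDemo {
    static int signedShift(int value, int n) {
        return value >> n;
    }

    static int unsignedShift(int value, int n) {
        return value >>> n;
    }

    /** Shifts a byte with >>>=; the result is narrowed back to a byte. */
    static byte unsignedShiftAssign(byte b, int n) {
        b >>>= n;  // same as b = (byte) (b >>> n)
        return b;
    }

    public static void main(String[] args) {
        System.out.println(signedShift(-23, 1));    // -12
        System.out.println(-23 / 2);                // -11: division rounds toward zero instead
        System.out.println(unsignedShift(-23, 1));  // 2147483636
        int y = -23;
        y >>= 2;
        System.out.println(y);                      // -6
        System.out.println(unsignedShiftAssign((byte) -23, 1));  // -12, not a positive number
    }
}
```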
Little-Endian data
To read Little-Endian data, you first read the necessary number of bytes into an array. Then you use the << bit shift operator and the | operator to put the parts of the Little-Endian number back together in the right order.
public static int readLittleEndianInt(InputStream is) throws IOException {
    byte[] buffer = new byte[4];
    int check = is.read(buffer);
    if (check != 4) throw new IOException("Unexpected End of Stream");
    // mask each byte with 0xFF to undo sign extension before combining
    return ((buffer[3] & 0xFF) << 24) | ((buffer[2] & 0xFF) << 16)
         | ((buffer[1] & 0xFF) << 8) | (buffer[0] & 0xFF);
}
Longs are just the same except you have to use an 8-byte buffer and put eight pieces back together.
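For completeness, here is one way the long version might look. It is a sketch rather than the only approach: I have split the bit twiddling into a separate assemble() method (a name of my own) and used a loop instead of writing out all eight terms.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class LittleEndianLongDemo {
    /** Combines eight little-endian bytes into a long. */
    static long assemble(byte[] buffer) {
        long result = 0L;
        // Start with the most significant byte, which is stored last.
        for (int i = 7; i >= 0; i--) {
            // 0xFFL both removes sign extension and promotes the byte to a long
            result = (result << 8) | (buffer[i] & 0xFFL);
        }
        return result;
    }

    static long readLittleEndianLong(InputStream is) throws IOException {
        byte[] buffer = new byte[8];
        int check = is.read(buffer);
        if (check != 8) throw new IOException("Unexpected End of Stream");
        return assemble(buffer);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xCD, (byte) 0xAB, (byte) 0x89,
                       0x67, 0x45, 0x23, 0x01};
        long value = readLittleEndianLong(new ByteArrayInputStream(data));
        System.out.println(Long.toHexString(value)); // 123456789abcdef
    }
}
```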
To write Little-Endian data, you create a buffer for the bytes in an int. The bytes are extracted from the int by a simple cast. Recall that casting an int to a byte truncates the int to its least significant byte. Before the cast is done, the right shift operator >> moves the needed byte into position in the least significant byte.
public static void writeLittleEndianInt(int i, OutputStream os) throws IOException {
    byte[] buffer = new byte[4];
    buffer[0] = (byte) i;
    buffer[1] = (byte) (i >> 8);
    buffer[2] = (byte) (i >> 16);
    buffer[3] = (byte) (i >> 24);
    os.write(buffer);
}
Unsigned integers
java.io.DataInputStream has methods to read unsigned bytes and unsigned shorts, but nothing to read an unsigned int. To do that, you must read four bytes and use them to construct the lower four bytes of a long. The upper four bytes of the long will be zero. For example:
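public static long readUnsignedInt(InputStream is) throws IOException {
    byte[] buffer = new byte[4];
    int check = is.read(buffer);
    if (check != 4) throw new IOException("Unexpected End of Stream");
    // move the bytes into position, masking off sign-extension bits
    long result = ((buffer[0] & 0xFF) << 24) | ((buffer[1] & 0xFF) << 16)
                | ((buffer[2] & 0xFF) << 8) | (buffer[3] & 0xFF);
    // zero out the upper four bytes; the L suffix makes the mask a long
    result &= 0xFFFFFFFFL;
    return result;
}
It’s necessary to mask each byte with 0xFF because a negative byte is sign extended when it is promoted to an int. It’s also necessary to combine the result with the long constant 0xFFFFFFFFL using a bitwise and, because the assembled int is sign extended again when it is widened to a long. Note the L suffix: the int constant 0xFFFFFFFF is -1, which would itself be sign extended to 64 one bits and would mask nothing.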
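The role of the mask is easiest to see in isolation. In the sketch below, toUnsigned() uses a long mask and brokenMask() uses an int mask; both names are invented. Integer.toUnsignedLong(), which appeared in Java 8, is shown only as a cross-check.

```java
final class UnsignedIntDemo {
    /** Reinterprets the 32 bits of an int as an unsigned value in a long. */
    static long toUnsigned(int i) {
        return i & 0xFFFFFFFFL;
    }

    /** An int mask of all ones changes nothing, so the sign extension survives. */
    static long brokenMask(int i) {
        return i & 0xFFFFFFFF;
    }

    public static void main(String[] args) {
        System.out.println(toUnsigned(-1));               // 4294967295
        System.out.println(brokenMask(-1));               // -1
        System.out.println(toUnsigned(-23));              // 4294967273
        System.out.println(Integer.toUnsignedLong(-23));  // 4294967273
    }
}
```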
Image manipulation with bit shift operators
Bit shift operators are fairly obscure. One of the few areas of Java where they’re useful is working with images and image filters.
Java images are built with a 32-bit color model. Each color has four channels: alpha, red, green, and blue. The alpha channel represents transparency. The other three channels are the primary colors for an additive color system. Each of the four channels has a value from 0 to 255 (in other words, one unsigned byte). For the color channels, the higher the value, the brighter the color. For the alpha channel, the higher the value, the more opaque the image is.
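Going the other way, a packed color can be split back into its channels with a right shift and a mask. The sketch below assumes the layout of Java’s default RGB color model, with alpha in the most significant byte followed by red, green, and blue. The class name ArgbChannels and its method names are hypothetical.

```java
// Hypothetical helper class; the method names are illustrative only.
class ArgbChannels {

    // Each method moves the wanted byte into the low eight bits,
    // then masks off everything above it.
    static int alpha(int argb) { return (argb >>> 24) & 0xFF; }
    static int red(int argb)   { return (argb >> 16) & 0xFF; }
    static int green(int argb) { return (argb >> 8) & 0xFF; }
    static int blue(int argb)  { return argb & 0xFF; }

    public static void main(String[] args) {
        int color = 0xFF336699;
        System.out.println(alpha(color)); // prints 255
        System.out.println(red(color));   // prints 51
        System.out.println(green(color)); // prints 102
        System.out.println(blue(color));  // prints 153
    }
}
```

The mask with 0xFF is what makes each result an unsigned value from 0 to 255, regardless of whether the shift dragged sign bits in from the left.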
Note: Java 1.0’s support for transparency is mainly theoretical. A value of 255 is fully opaque. Anything less is 100 percent transparent (invisible).
Figure 2-9 shows a color that is 50 percent gray. The alpha channel is 255 (11111111), which is fully opaque, while each of the red, green, and blue channels is set to 127 (01111111). This means the color is equal to the binary integer 11111111011111110111111101111111. The integer value has little meaning here, though; it’s the individual bytes that matter.
Figure 2-9 The layout of a 32-bit color

When the red channel, green channel, and blue channel have the same value, the resulting color varies from black (all three 00000000) to white (all three 11111111). It passes through various shades of gray in between. By varying the colors disproportionately, you can produce the different colors of the visible spectrum. For example, pure blue is 11111111000000000000000011111111.
So how do you create these colors? It’s simple, really. Just initialize ints to the values you want for each of the four channels, shift them into place, and combine them with the bitwise OR operator, |. For example, to create a pure blue, do the following:
    int alpha = 255 << 24;
    int red   = 0 << 16;
    int green = 0 << 8;
    int blue  = 255;
    int pureblue = alpha | red | green | blue;

If you prefer, you can combine these on one line. For example, to create the 50 percent gray of Figure 2-9, use this command:
    int halfgray = (255 << 24) | (127 << 16) | (127 << 8) | 127;

Summary
In this chapter you learned how a computer stores numbers. You learned what a place-value number system is, and about the binary and hexadecimal place-value number systems computers use. You learned how the primitive Java data types like int and float are laid out in memory and how this affects operations with those types.
You also learned how Java stores characters and the different character sets used for this purpose, particularly ASCII, ISO Latin-1, Unicode, and UTF-8. You learned when and for what purposes these different but related character sets are used and how to convert from one to another.
Finally, you learned how to use the bit-level operators to operate on numbers at a very low level. The bitwise operators combine values in memory, while the bit shift operators move the bits in data back and forth.