Diagnosing "Broken Pipe" errors
===============================

Every so often these days (circa win98, win2k), we see error messages
like the following showing up in the logs:

    [2000/09/08 18:06:24, 0] /usr/ports/net/samba/work/samba-2.0.6/source/lib/util_sock.c:write_socket_data(537)
      write_socket_data: write failure. Error = Broken pipe

or

    [2000/10/05 11:08:16, 0] lib/util_sock.c:read_socket_data(477)
      read_socket_data: recv failure for 1252. Error = Connection reset by peer

These are caused by either
 - unexpected client disconnections, as seen by Samba, or
 - DNS errors (we'll discuss them too).

Client Disconnections:

If you happen to be using security = server, the Samba messages are
an artifact of the implementation, and are harmless. They can be
eliminated by setting keepalive = 0.

If you are using any other security setting, use keepalive = 30 to
tell Samba to clean up more often (this won't reduce the messages,
though!)

The messages usually indicate an error on the client, which caused it to
 1) blue screen
 2) reboot
 3) silently disconnect and reconnect.
The last is the annoying one...

The usual cause is a networking problem. What the team has seen a lot
lately is bad drivers for ethernet cards (especially on Windows ME)
or mismatches between ethernet cards and hubs.

Both ends of each connection between a hub and a machine must be
running at the same speed, either 10 or 100 Mbit/s, and at the same
duplex setting (half- or full-duplex, sometimes called simplex and
duplex).

A mismatch is usually detectable using ftp: if copying in one
direction is an order of magnitude faster than the other, you have a
problem. Note that being a little slower one way than the other is
normal: my machine at work looks like this:

    ftp> get gnuplot-3.7-sol8-sparc-local foo
    200 PORT command successful.
    150 ASCII data connection for gnuplot-3.7-sol8-sparc-local (129.155.8.39,39390) (1159384 bytes).
    226 ASCII Transfer complete.
    local: foo remote: gnuplot-3.7-sol8-sparc-local
    1163024 bytes received in 1.2 seconds (955.88 Kbytes/s)
    ftp> put foo foo
    200 PORT command successful.
    150 ASCII data connection for foo (129.155.8.39,39394).
    226 Transfer complete.
    local: foo remote: foo
    1163024 bytes sent in 0.45 seconds (2520.48 Kbytes/s)

So the "get" is about 38% of the "put" speed (the put is writing to a
slower disk than the get was).

Looking at ethernet stats can help, too: netstat -i output should
look something like:

    Name  Mtu   Net/Dest  Address    Ipkts    Ierrs  Opkts    Oerrs  Collis  Queue
    lo0   8232  loopback  localhost  1904006  0      1904006  0      0       0
    hme0  1500  elsbeth   elsbeth    8278338  1      2280982  0      579602  0

The number of errors should be **very** low: I have about 1 in
12,100,000. The number of collisions should be below 3% of output
packets UNLESS you have a cut-through (first generation) ethernet hub.
I have 25.4% because I have a cut-through hub on this machine.

The failing client can be found by using smbstatus repeatedly.
Look at the "pid" column:

    Samba version 2.0.7
    Service      uid      gid      pid     machine
    ----------------------------------------------
    temp         davecb   staff    18310   elsbeth  (129.155.8.39) Tue Sep 12 07:49:40 2000

If elsbeth had been failing and reconnecting, a previous smbstatus
would have looked the same, but the pid would be different, as Samba
starts a new process on each (re-)connection.

Interestingly, client disconnects can also cause the client to report
"The network is busy" while trying to (re-)map the file.

DNS failures:

One other cause (somewhat like the "security = server" case) was
detected by Giovanni Biscuolo, who found that every time he got a
disconnect, he couldn't ping the client from the server. So he used
tcpdump to watch the packets being sent by the server, and found his
problem was a failing DNS reverse lookup. Several other readers found
they had the same problem.

--dave
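
Appendix sketch: checking the netstat -i numbers. The error and collision thresholds discussed under "Client Disconnections" can be checked mechanically. The following is only a minimal sketch, assuming the ten-column Solaris-style layout shown in this note (other systems print different columns) and taking the collision rate as Collis divided by Opkts, which reproduces the 25.4% figure for hme0. The function name interface_rates and the SAMPLE constant are made up for illustration.

```python
def interface_rates(netstat_output):
    """Return {interface: (error_rate, collision_rate)} from `netstat -i` text.

    Assumes the ten-column layout shown in this note:
    Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
    """
    rates = {}
    for line in netstat_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 10-column data row.
        if len(fields) != 10 or fields[0] == "Name":
            continue
        name = fields[0]
        ipkts, ierrs, opkts, oerrs, collis = (int(f) for f in fields[4:9])
        total = ipkts + opkts
        # Errors as a fraction of all packets, collisions as a fraction
        # of output packets (an assumption, see the lead-in).
        error_rate = (ierrs + oerrs) / total if total else 0.0
        collision_rate = collis / opkts if opkts else 0.0
        rates[name] = (error_rate, collision_rate)
    return rates


# The sample table from this note, with whitespace collapsed.
SAMPLE = """\
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 1904006 0 1904006 0 0 0
hme0 1500 elsbeth elsbeth 8278338 1 2280982 0 579602 0
"""

if __name__ == "__main__":
    for name, (err, coll) in interface_rates(SAMPLE).items():
        flag = "  <-- above 3%" if coll > 0.03 else ""
        print(f"{name}: errors {err:.2e}, collisions {coll:.1%}{flag}")
```

A collision rate above 3% is only a warning sign on a network without a cut-through hub, so treat the flag as a prompt to look closer rather than as a verdict.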
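
Appendix sketch: spotting a reconnecting client. Running smbstatus repeatedly and comparing pids by eye gets tedious, so the comparison can be scripted. This is a rough sketch, assuming the Samba 2.0.x row layout shown in this note (service, uid, gid, pid, machine, then the address in parentheses); later smbstatus versions print different columns and would need a different pattern. The names ROW, connections and reconnected are made up for illustration, and the second pid in the usage note is invented.

```python
import re

# Matches connection rows such as:
#   temp  davecb  staff  18310  elsbeth  (129.155.8.39) Tue Sep 12 07:49:40 2000
# Header and separator lines do not match and are ignored.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\S+)\s+\(([\d.]+)\)")


def connections(smbstatus_output):
    """Map (service, machine) -> pid for each connection row."""
    table = {}
    for line in smbstatus_output.splitlines():
        m = ROW.match(line.strip())
        if m:
            service, _uid, _gid, pid, machine, _addr = m.groups()
            table[(service, machine)] = int(pid)
    return table


def reconnected(before, after):
    """Return the machines whose pid for the same service has changed."""
    old, new = connections(before), connections(after)
    return sorted({
        machine
        for (service, machine), pid in new.items()
        if (service, machine) in old and old[(service, machine)] != pid
    })
```

Given two saved smbstatus outputs in which elsbeth's "temp" connection has pid 18310 in the first and, say, 18422 in the second, reconnected(first, second) would return ['elsbeth'].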
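
Appendix sketch: testing the reverse lookup. The DNS failure described above can be checked from the server without tcpdump by asking the resolver for the client's name directly. A minimal sketch, using Python's standard socket.gethostbyaddr; the function name reverse_lookup and the injectable resolver argument are illustrative additions, not part of Samba.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (True, hostname) if `ip` reverse-resolves, else (False, reason).

    `resolver` defaults to socket.gethostbyaddr and is a parameter only
    so the function can be exercised without a live DNS server.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return True, hostname
    except OSError as exc:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False, str(exc)


if __name__ == "__main__":
    # 129.155.8.39 is the client address from the examples in this note.
    ok, detail = reverse_lookup("129.155.8.39")
    print("reverse lookup", "succeeded:" if ok else "FAILED:", detail)
```

A lookup that fails, or that takes many seconds before succeeding, on the address of a client that keeps disconnecting points at the DNS problem rather than at the network hardware.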