Hi,

I’ve been using this very simple script for a while to do test
builds of the kernel:

#!/bin/bash

for i in $(seq 1 100); do
    nice make distclean
    while true; do
        nice make randconfig
        grep -q "CONFIG_EXPERIMENTAL=y" .config
        if [ $? -eq 1 ]; then
            break
        fi
    done
    cp .config config.${i}
    nice make -j3 > build.log.${i} 2>&1
done

Which has worked great in the past, but with recent kernels it has
been a sure way to cause a complete lockup within 1 hour :(

The last kernel where I know for sure that it ran without problems is
2.6.17.13 .
The first kernel where I know for sure it caused lockups is
2.6.18-git15 . I’ve also tested 2.6.18-git16, 2.6.18-git21 and
2.6.19-rc1-git2 and those 3 also lock up solid.

The lockup usually happens within 30 minutes. Sometimes the box
survives longer, but I’ve never seen it survive for more than 60
minutes.
It doesn’t seem to matter if I leave it alone just building kernels or
if I use it for other purposes while building in the background – if
anything, it seems to survive longer when I do other work while it
builds.

When the lockup happens the box just freezes and doesn’t respond to
anything at all. Sometimes I can reboot with alt+sysrq+b but sometimes
not even that works.

Here’s exactly what I do, so you can try to reproduce:

1) boot my distro (Slackware 11.0) into runlevel 4 (multi-user with
X), using kernel 2.6.19-rc1-git2 (or one of the other “known-bad”
kernels).

2) Log in via kdm, and once I’m at my KDE desktop I start ‘konsole’.

3) cd into a dir holding a fresh copy of the 2.6.19-rc1-git2 source
and run the above script from a file named build-random.sh that I have
placed in the root of the source dir and made executable.

4) wait for 0-60 minutes.

After a reboot I find nothing in the logs, so I can’t give you many
hints on what goes wrong, unfortunately.

Attached you can find the config I’m using for my current
2.6.19-rc1-git2 kernel that very consistently exhibits the problem,
and below are some details about my hardware and software environment.

I’ve run memtest86+ for ~12hrs without problems, just to rule out bad
RAM, and I’ve seen nothing at all in my logs to indicate that this
should be a hardware problem. Also, the fact that if I boot into
2.6.17.13 I can run the above script for hours and hours without
problems indicates to me that this is not a hardware issue.

# uname -a
Linux dragon 2.6.19-rc1-git2 #1 SMP PREEMPT Sat Oct 7 00:30:45 CEST
2006 i686 athlon-4 i386 GNU/Linux

# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 35
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 2
cpu MHz : 2200.149
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
lm 3dnowext 3dnow pni lahf_lm cmp_legacy ts fid vid ttp
bogomips : 4402.75

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 35
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 2
cpu MHz : 2200.149
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
lm 3dnowext 3dnow pni lahf_lm cmp_legacy ts fid vid ttp
bogomips : 4399.53

# cat /proc/meminfo
MemTotal: 2071360 kB
MemFree: 1683228 kB
Buffers: 29092 kB
Cached: 193184 kB
SwapCached: 0 kB
Active: 165528 kB
Inactive: 141904 kB
HighTotal: 1179328 kB
HighFree: 895532 kB
LowTotal: 892032 kB
LowFree: 787696 kB
SwapTotal: 763076 kB
SwapFree: 763076 kB
Dirty: 184 kB
Writeback: 0 kB
AnonPages: 85096 kB
Mapped: 48360 kB
Slab: 66968 kB
SReclaimable: 33216 kB
SUnreclaim: 33752 kB
PageTables: 1256 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 1798756 kB
Committed_AS: 285864 kB
VmallocTotal: 114680 kB
VmallocUsed: 6344 kB
VmallocChunk: 107532 kB

# lspci -vvx
00:00.0 Host bridge: ALi Corporation M1695 K8 Northbridge [PCI Express
and HyperTransport]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Capabilities: [40] HyperTransport: Slave or Primary Interface
Command: BaseUnitID=0 UnitCnt=3 MastHost- DefDir- DUL-
Link Control 0: CFlE- CST- CFE- <LkFail- Init+ EOC-
TXO- <CRCErr=0 IsocEn- LSEn- ExtCTL- 64b-
Link Config 0: MLWI=16bit DwFcIn- MLWO=16bit DwFcOut-
LWI=16bit DwFcInEn- LWO=16bit DwFcOutEn-
Link Control 1: CFlE- CST- CFE- <LkFail- Init+ EOC-
TXO- <CRCErr=0 IsocEn- LSEn- ExtCTL- 64b-
Link Config 1: MLWI=16bit DwFcIn- MLWO=16bit DwFcOut-
LWI=8bit DwFcInEn- LWO=16bit DwFcOutEn-
Revision ID: 1.05
Link Frequency 0: 800MHz
Link Error 0: <Prot- <Ovfl- <EOC- CTLTm-
Link Frequency Capability 0: 200MHz+ 300MHz- 400MHz+
500MHz- 600MHz+ 800MHz+ 1.0GHz+ 1.2GHz+ 1.4GHz- 1.6GHz- Vend-
Feature Capability: IsocFC- LDTSTOP+ CRCTM- ECTLT- 64bA- UIDRD-
Link Frequency 1: 800MHz
Link Error 1: <Prot- <Ovfl- <EOC- CTLTm-
Link Frequency Capability 1: 200MHz+ 300MHz- 400MHz+
500MHz- 600MHz+ 800MHz+ 1.0GHz+ 1.2GHz+ 1.4GHz- 1.6GHz- Vend-
Error Handling: PFlE- OFlE- PFE- OFE- EOCFE- RFE-
CRCFE- SERRFE- CF- RE- PNFE- ONFE- EOCNFE- RNFE- CRCNFE- SERRNFE-
Prefetchable memory behind bridge Upper: 00-00
Bus Number: 00
Capabilities: [5c] HyperTransport: MSI Mapping
Capabilities: [68] HyperTransport: UnitID Clumping
Capabilities: [74] HyperTransport: Interrupt Discovery and Configuration
Capabilities: [7c] Message Signalled Interrupts: 64bit+
Queue=0/1 Enable-
Address: 00000000fee00000 Data: 0000
00: b9 10 95 16 07 00 10 00 00 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00

00:01.0 PCI bridge: ALi Corporation PCI Express Root Port (prog-if 00
[Normal decode])
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Memory behind bridge: ff200000-ff2fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Message Signalled Interrupts: 64bit+
Queue=0/1 Enable-
Address: 00000000fee00000 Data: 0000
Capabilities: [58] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+
Device: Latency L0s <64ns, L1 <1us
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x16, ASPM L0s L1, Port 0
Link: Latency L0s <2us, L1 <32us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed unknown, Width x1
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug- Surpise-
Slot: Number 0, PowerLimit 0.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Off, PwrInd Off, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [7c] HyperTransport: MSI Mapping
Capabilities: [88] HyperTransport: Revision ID: 1.05
00: b9 10 4b 52 06 01 10 00 00 00 04 06 10 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 f0 00 00 00
20: 20 ff 20 ff f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0a 01 03 00

00:02.0 PCI bridge: ALi Corporation PCI Express Root Port (prog-if 00
[Normal decode])
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Memory behind bridge: ff300000-ff3fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Message Signalled Interrupts: 64bit+
Queue=0/1 Enable-
Address: 00000000fee00000 Data: 0000
Capabilities: [58] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+
Device: Latency L0s <64ns, L1 <1us
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x2, ASPM L0s L1, Port 0
Link: Latency L0s <2us, L1 <32us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed unknown, Width x1
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug- Surpise-
Slot: Number 0, PowerLimit 0.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Off, PwrInd Off, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [7c] HyperTransport: MSI Mapping
Capabilities: [88] HyperTransport: Revision ID: 1.05
00: b9 10 4c 52 06 01 10 00 00 00 04 06 10 00 01 00
10: 00 00 00 00 00 00 00 00 00 02 02 00 f0 00 00 00
20: 30 ff 30 ff f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 03 00

00:04.0 Host bridge: ALi Corporation M1689 K8 Northbridge [Super K8 Single Chip]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Region 0: Memory at dc000000 (32-bit, prefetchable) [size=64M]
Capabilities: [40] HyperTransport: Slave or Primary Interface
Command: BaseUnitID=4 UnitCnt=1 MastHost- DefDir- DUL-
Link Control 0: CFlE- CST- CFE- <LkFail- Init+ EOC-
TXO- <CRCErr=0 IsocEn- LSEn- ExtCTL- 64b-
Link Config 0: MLWI=16bit DwFcIn- MLWO=8bit DwFcOut-
LWI=16bit DwFcInEn- LWO=8bit DwFcOutEn-
Link Control 1: CFlE- CST- CFE- <LkFail+ Init- EOC+
TXO+ <CRCErr=0 IsocEn- LSEn- ExtCTL- 64b-
Link Config 1: MLWI=8bit DwFcIn- MLWO=8bit DwFcOut-
LWI=8bit DwFcInEn- LWO=8bit DwFcOutEn-
Revision ID: 1.04
Link Frequency 0: 800MHz
Link Error 0: <Prot- <Ovfl- <EOC- CTLTm-
Link Frequency Capability 0: 200MHz+ 300MHz- 400MHz+
500MHz- 600MHz+ 800MHz+ 1.0GHz- 1.2GHz- 1.4GHz- 1.6GHz- Vend-
Feature Capability: IsocFC- LDTSTOP+ CRCTM- ECTLT- 64bA- UIDRD-
Link Frequency 1: 200MHz
Link Error 1: <Prot- <Ovfl- <EOC- CTLTm-
Link Frequency Capability 1: 200MHz- 300MHz- 400MHz-
500MHz- 600MHz- 800MHz- 1.0GHz- 1.2GHz- 1.4GHz- 1.6GHz- Vend-
Error Handling: PFlE- OFlE- PFE- OFE- EOCFE- RFE-
CRCFE- SERRFE- CF- RE- PNFE- ONFE- EOCNFE- RNFE- CRCNFE- SERRNFE-
Prefetchable memory behind bridge Upper: 00-00
Bus Number: 00
Capabilities: [60] HyperTransport: Interrupt Discovery and Configuration
Capabilities: [80] AGP version 3.0
Status: RQ=28 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64-
HTrans- 64bit- FW- AGP3- Rate=x1,x2,x4
Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit-
FW- Rate=<none>
00: b9 10 89 16 06 01 10 00 00 00 00 06 00 00 00 00
10: 08 00 00 dc 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00

00:05.0 PCI bridge: ALi Corporation AGP8X Controller (prog-if 00
[Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Bus: primary=00, secondary=03, subordinate=03, sec-latency=64
Memory behind bridge: ff400000-ff4fffff
Prefetchable memory behind bridge: c7f00000-d7efffff
Secondary status: 66MHz+ FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort+ <SERR- <PERR-

BridgeCtl: Parity+ SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
00: b9 10 46 52 07 01 20 00 00 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 03 03 40 f0 00 20 22
20: 40 ff 40 ff f0 c7 e0 d7 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0b 00

00:06.0 PCI bridge: ALi Corporation M5249 HTT to PCI Bridge (prog-if
01 [Subtractive decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Bus: primary=00, secondary=04, subordinate=04, sec-latency=32
I/O behind bridge: 0000d000-0000dfff
Memory behind bridge: ff500000-ff5fffff
Prefetchable memory behind bridge: 88000000-880fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort+ <SERR- <PERR-

BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
00: b9 10 49 52 07 01 00 00 00 01 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 04 04 20 d0 d0 00 22
20: 50 ff 50 ff 00 88 00 88 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 00

00:07.0 ISA bridge: ALi Corporation M1563 HyperTransport South Bridge (rev 70)
Subsystem: ASRock Incorporation Unknown device 1563
Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 0 (250ns min, 6000ns max)
00: b9 10 63 15 0f 00 00 02 70 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 63 15
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 18

00:07.1 Bridge: ALi Corporation M7101 Power Management Controller [PMU]
Subsystem: ASRock Incorporation Unknown device 7101
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

00: b9 10 01 71 00 00 00 02 00 00 80 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 01 71
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:11.0 Ethernet controller: ALi Corporation ULi 1689,1573 integrated
ethernet. (rev 40)
Subsystem: ASRock Incorporation Unknown device 5263
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (5000ns min, 10000ns max), Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 10
Region 0: I/O ports at e800 [size=256]
Region 1: Memory at ff6ffc00 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: b9 10 63 52 07 01 10 02 40 00 00 02 08 20 00 00
10: 01 e8 00 00 00 fc 6f ff 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 63 52
30: 00 00 00 00 50 00 00 00 00 00 00 00 0a 01 14 28

00:12.0 IDE interface: ALi Corporation M5229 IDE (rev c7) (prog-if 8a
[Master SecP PriP])
Subsystem: ASRock Incorporation Unknown device 5229
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32
Interrupt: pin A routed to IRQ 0
Region 0: I/O ports at <ignored>
Region 1: I/O ports at <ignored>
Region 2: I/O ports at <ignored>
Region 3: I/O ports at <ignored>
Region 4: I/O ports at ff00 [size=16]
00: b9 10 29 52 05 00 a0 02 c7 8a 01 01 00 20 00 00
10: f1 01 00 00 f5 03 00 00 71 01 00 00 75 03 00 00
20: 01 ff 00 00 00 00 00 00 00 00 00 00 49 18 29 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00

00:13.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
(prog-if 10 [OHCI])
Subsystem: ASRock Incorporation Unknown device 5237
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (20000ns max), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 11
Region 0: Memory at ff6fe000 (32-bit, non-prefetchable) [size=4K]
00: b9 10 37 52 17 01 a8 02 03 10 03 0c 10 20 80 00
10: 00 e0 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 37 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 01 00 50

00:13.1 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
(prog-if 10 [OHCI])
Subsystem: ASRock Incorporation Unknown device 5237
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (20000ns max), Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 3
Region 0: Memory at ff6fd000 (32-bit, non-prefetchable) [size=4K]
00: b9 10 37 52 17 01 a8 02 03 10 03 0c 10 20 80 00
10: 00 d0 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 37 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 03 02 00 50

00:13.2 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
(prog-if 10 [OHCI])
Subsystem: ASRock Incorporation Unknown device 5237
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (20000ns max), Cache Line Size: 64 bytes
Interrupt: pin C routed to IRQ 11
Region 0: Memory at ff6fc000 (32-bit, non-prefetchable) [size=4K]
00: b9 10 37 52 17 01 a8 02 03 10 03 0c 10 20 80 00
10: 00 c0 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 37 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 03 00 50

00:13.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
(prog-if 20 [EHCI])
Subsystem: ASRock Incorporation Unknown device 5239
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (4000ns min, 8000ns max), Cache Line Size: 64 bytes
Interrupt: pin D routed to IRQ 5
Region 0: Memory at ff6ff800 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Debug port
00: b9 10 39 52 16 01 b0 02 01 20 03 0c 10 20 80 00
10: 00 f8 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 39 52
30: 00 00 00 00 50 00 00 00 00 00 00 00 05 04 10 20

00:18.0 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] HyperTransport Technology Configuration
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Capabilities: [80] HyperTransport: Host or Secondary Interface
!!! Possibly incomplete decoding
Command: WarmRst+ DblEnd-
Link Control: CFlE- CST- CFE- <LkFail- Init+ EOC- TXO- <CRCErr=0
Link Config: MLWI=16bit MLWO=16bit LWI=16bit LWO=16bit
Revision ID: 1.02
00: 22 10 00 11 00 00 10 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 00 00 00 00

00:18.1 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Address Map
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
00: 22 10 01 11 00 00 00 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:18.2 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] DRAM Controller
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
00: 22 10 02 11 00 00 00 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:18.3 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Miscellaneous Control
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
00: 22 10 03 11 00 00 00 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

03:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA Parhelia
AGP (rev 03) (prog-if 00 [VGA])
Subsystem: Matrox Graphics, Inc. Parhelia 128Mb
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (4000ns min, 8000ns max), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 5
Region 0: Memory at c8000000 (32-bit, prefetchable) [size=128M]
Region 1: Memory at ff4fe000 (32-bit, non-prefetchable) [size=8K]
Expansion ROM at ff4c0000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [f0] AGP version 2.0
Status: RQ=32 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64-
HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit-
FW- Rate=<none>
00: 2b 10 27 05 07 00 b0 02 03 00 00 03 10 20 00 00
10: 08 00 00 c8 00 e0 4f ff 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 2b 10 40 08
30: 00 00 4c ff dc 00 00 00 00 00 00 00 05 01 10 20

04:05.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 0a)
Subsystem: Creative Labs SBLive! 5.1 eMicro 28028
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (500ns min, 5000ns max)
Interrupt: pin A routed to IRQ 20
Region 0: I/O ports at d880 [size=32]
Capabilities: [dc] Power Management version 1
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 02 11 02 00 05 01 90 02 0a 00 01 04 00 20 80 00
10: 81 d8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 02 11 67 80
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0b 01 02 14

04:05.1 Input device controller: Creative Labs SB Live! Game Port (rev 0a)
Subsystem: Creative Labs Gameport Joystick
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32
Region 0: I/O ports at dc00 [size=8]
Capabilities: [dc] Power Management version 1
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 02 11 02 70 05 01 90 02 0a 00 80 09 00 20 80 00
10: 01 dc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 02 11 20 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 00 00 00 00

04:06.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)
Subsystem: Adaptec 29160N Ultra160 SCSI Controller
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (10000ns min, 6250ns max), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 19
BIST result: 00
Region 0: I/O ports at d400 [disabled] [size=256]
Region 1: Memory at ff5ff000 (64-bit, non-prefetchable) [size=4K]
Expansion ROM at 88000000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 05 90 80 00 16 01 b0 02 02 00 00 01 10 20 00 80
10: 01 d4 00 00 04 f0 5f ff 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 05 90 a0 62
30: 00 00 5c ff dc 00 00 00 00 00 00 00 03 01 28 19

04:07.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
Subsystem: D-Link System Inc DFE-530TX rev B
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-

Latency: 32 (750ns min, 2000ns max), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 18
Region 0: I/O ports at d000 [size=256]
Region 1: Memory at ff5fec00 (32-bit, non-prefetchable) [size=256]
Expansion ROM at 88020000 [disabled] [size=64K]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 06 11 65 30 17 01 10 02 42 00 00 02 10 20 00 00
10: 01 d0 00 00 00 ec 5f ff 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 11 01 14
30: 00 00 ff ff 40 00 00 00 00 00 00 00 0b 01 03 08

root@dragon:/home/juhl/download/kernel/linux-2.6.19-rc1-git2# scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux dragon 2.6.19-rc1-git2 #1 SMP PREEMPT Sat Oct 7 00:30:45 CEST
2006 i686 athlon-4 i386 GNU/Linux

Gnu C 3.4.6
Gnu make 3.81
binutils 2.15.92.0.2
util-linux 2.12r
mount 2.12r
module-init-tools 3.2.2
e2fsprogs 1.39
reiserfsprogs 3.6.19
quota-tools 3.13.
PPP 2.4.4b1
Linux C Library 2.3.6
Dynamic linker (ldd) 2.3.6
Linux C++ Library 6.0.3
Procps 3.2.7
Net-tools 1.60
Kbd 1.12
Sh-utils 5.97
udev 097
Modules Loaded snd_seq_oss snd_seq_midi_event snd_seq
snd_pcm_oss snd_mixer_oss agpgart snd_emu10k1 snd_rawmidi
snd_ac97_codec snd_ac97_bus snd_pcm snd_seq_device snd_timer
snd_page_alloc snd_util_mem snd_hwdep evdev snd

On Sat, 7 Oct 2006 01:36:24 +0200
“Jesper Juhl” <jesper.juhl@gmail.com> wrote:

> Hi,
>
> I’ve been using this very simple script for a while to do test
> builds of the kernel:
>
>
> #!/bin/bash
>
> for i in $(seq 1 100); do
>     nice make distclean
>     while true; do
>         nice make randconfig
>         grep -q "CONFIG_EXPERIMENTAL=y" .config
>         if [ $? -eq 1 ]; then
>             break
>         fi
>     done
>     cp .config config.${i}
>     nice make -j3 > build.log.${i} 2>&1
> done
>
>
> Which has worked great in the past, but with recent kernels it has
> been a sure way to cause a complete lockup within 1 hour :(
>
This is probably one of those nobody-but-you-can-reproduce-it things.

>
> The last kernel where I know for sure that it ran without problems is
> 2.6.17.13 .
> The first kernel where I know for sure it caused lockups is
> 2.6.18-git15 . I’ve also tested 2.6.18-git16, 2.6.18-git21 and
> 2.6.19-rc1-git2 and those 3 also lock up solid.
>
> The lockup usually happens within 30 minutes. Sometimes the box
> survives longer, but I’ve never seen it survive for more than 60
> minutes.
> It doesn’t seem to matter if I leave it alone just building kernels or
> if I use it for other purposes while building in the background – if
> anything, it seems to survive longer when I do other work while it
> builds.
>
> When the lockup happens the box just freezes and doesn’t respond to
> anything at all. Sometimes I can reboot with alt+sysrq+b but sometimes
> not even that works.
If you can do sysrq-b then you can do sysrq-t, too?
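
Whether the keyboard SysRq-T task dump is permitted depends on the value in /proc/sys/kernel/sysrq, so it may be worth checking that before the next test run. What follows is a minimal sketch, assuming a kernel that interprets the value as the bitmask described in the kernel's SysRq documentation (0 disables everything, 1 enables everything, bit 0x8 allows process dumps). The function name sysrq_dump_allowed is made up for illustration.

```shell
# sysrq_dump_allowed VALUE
# Succeeds if a /proc/sys/kernel/sysrq VALUE permits debugging dumps such
# as SysRq-T. Assumes the bitmask semantics from the kernel's SysRq
# documentation: 0 = disabled, 1 = all functions, 0x8 = process dumps.
sysrq_dump_allowed() {
    local value
    value=$1
    [ "$value" -eq 1 ] && return 0
    [ $((value & 8)) -ne 0 ]
}

# Usage on a live system (reading the setting needs no privileges):
#   sysrq_dump_allowed "$(cat /proc/sys/kernel/sysrq)" && echo "SysRq-T allowed"
# As root, the same dump can be requested without the keyboard:
#   echo t > /proc/sysrq-trigger
```

If the check fails, writing 1 to /proc/sys/kernel/sysrq as root enables all SysRq functions until the next reboot.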

Please ensure that you have all the CONFIG_DEBUG_* things set, apart from
PAGEALLOC.

> Here’s exactly what I do, so you can try to reproduce:
>
> 1) boot my distro (Slackware 11.0) into runlevel 4 (multi-user with
> X), using kernel 2.6.19-rc1-git2 (or one of the other “known-bad”
> kernels).
>
> 2) Log in via kdm, and once I’m at my KDE desktop I start ‘konsole’.
>
> 3) cd into a dir holding a fresh copy of the 2.6.19-rc1-git2 source
> and run the above script from a file named build-random.sh that I have
> placed in the root of the source dir and made executable.
>
> 4) wait for 0-60 minutes.
>
>
> After a reboot I find nothing in the logs, so I can’t give you many
> hints on what goes wrong, unfortunately.
>

Once you’ve got the test set up and running, you can do the alt-ctl-F1
thing to take you out of X and into the vga console. I suggest you leave
it running that way, see if anything pops up when it hangs.

On 07/10/06, Andrew Morton <akpm@osdl.org> wrote:
> On Sat, 7 Oct 2006 01:36:24 +0200
> “Jesper Juhl” <jesper.juhl@gmail.com> wrote:
>
>
> This is probably one of those nobody-but-you-can-reproduce-it things.
>

I hope not. But that’s actually why I posted the script, to try and get
more people to reproduce…
>
> If you can do sysrq-b then you can do sysrq-t, too?
>

I don’t know, haven’t tried – but I’ll try the next few times it locks up.
> Please ensure that you have all the CONFIG_DEBUG_* things set, apart from
> PAGEALLOC.
>

$ zgrep CONFIG_DEBUG_ /proc/config.gz
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_HIGHMEM=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_RODATA=y

That good enough?
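
The request was for every CONFIG_DEBUG_* option except PAGEALLOC, and a listing like the one above can be compared against that mechanically. The sketch below is only an illustration: the function name check_debug_config is invented here, and it assumes an uncompressed config file in the usual "CONFIG_FOO=y" / "# CONFIG_FOO is not set" format.

```shell
# check_debug_config FILE
# Lists CONFIG_DEBUG_* options that are "not set" (PAGEALLOC excepted) and
# reports if CONFIG_DEBUG_PAGEALLOC is enabled.
# FILE is an uncompressed kernel config such as .config.
check_debug_config() {
    local cfg
    cfg=$1
    # Disabled options appear as comments: "# CONFIG_FOO is not set".
    sed -n 's/^# \(CONFIG_DEBUG_[A-Za-z0-9_]*\) is not set$/unset: \1/p' "$cfg" |
        grep -v '^unset: CONFIG_DEBUG_PAGEALLOC$'
    if grep -q '^CONFIG_DEBUG_PAGEALLOC=y$' "$cfg"; then
        echo "enabled: CONFIG_DEBUG_PAGEALLOC"
    fi
    return 0
}

# Usage against the running kernel, if it exposes /proc/config.gz:
#   zcat /proc/config.gz > /tmp/running.config
#   check_debug_config /tmp/running.config
```

Run against the listing above, this would report CONFIG_DEBUG_DRIVER and CONFIG_DEBUG_KOBJECT as unset and CONFIG_DEBUG_PAGEALLOC as enabled.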


>
> Once you’ve got the test set up and running, you can do the alt-ctl-F1
> thing to take you out of X and into the vga console. I suggest you leave
> it running that way, see if anything pops up when it hangs.
>

I’ve done that on a few occasions already without seeing anything, but
I’ll try a few more times.

On 07/10/06, Jesper Juhl wrote:
> On 07/10/06, Andrew Morton <akpm@osdl.org> wrote:
> I’ve done that on a few occasions already without seeing anything, but
> I’ll try a few more times.
>
Hmm, trying to do this (with 2.6.19-rc1-git2) seems to have revealed
yet another problem.
If I try to switch to tty1 just after boot, everything is fine. It’s
still fine after using the box for a few minutes doing random stuff
like reading email, surfing the web etc, but once my build script has
been running for a few minutes (tested 2 times after ~5 min. runs) I
just get a completely white screen when switching to tty1, and when
switching back to X I also just get a white screen :(
Something is definitely broken here…

On Sat, 7 Oct 2006, Jesper Juhl wrote:
>
> Which has worked great in the past, but with recent kernels it has
> been a sure way to cause a complete lockup within 1 hour :(
Reliable lock-ups (and “within 1 hour” is quite quick too) are actually
great.

> 2.6.17.13 .
> The first kernel where I know for sure it caused lockups is
> 2.6.18-git15 . I’ve also tested 2.6.18-git16, 2.6.18-git21 and
> 2.6.19-rc1-git2 and those 3 also lock up solid.

Can I bother you to just bisect it?

Even if you decide that it’s too painful to bisect to the very end, “git
bisect” will give great results after just as few reboots as four or five,
and hopefully narrow down the thing a _lot_.

So, for example, while my git tree doesn’t contain the stable release
numbers, you can trivially just get my tree, and then point “git fetch” at
the stable git tree and get v2.6.17.13 that way.

Then you can do just

git bisect start
git bisect good v2.6.17.13
git bisect bad $(cat patch-2.6.18-git15.id)

and off you go – it will pick a half-way point for you to test, and then
if that one was good, you just say “git bisect good”, and it will pick the
next one..

(that “patch-2.6.18-git15.id” thing is from kernel.org – it’s how you can
get the exact git state of any particular snapshot, even if it’s not
tagged in any real tree – that particular one seems to have SHA1 ID
1bdfd554be94def718323659173517c5d4a69d25
..)

“git bisect” really does kick XXX. Don’t worry if it says “10374 commits
to test after this” – because it does a binary search, it basically
cuts the commits to test in half each time, and so if you do just five
bisections, you’ll have cut down the 10,000 commits to just a few hundred.
At that point, maybe we even have a clue, or we might ask you to test a
few more times to narrow things down even more.

Linus
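
As a rough check on the halving arithmetic above, here is a minimal shell sketch. It assumes a perfectly even split at every step, which real history (a DAG with merges) only approximates. The function names remaining_after and halvings_needed are made up for this illustration and are not part of git.

```shell
#!/bin/bash
# Illustrative arithmetic only: git counts commits in a DAG, not a flat list.

# remaining_after N K: candidates left after K even halvings of N commits
remaining_after() {
    local n=$1 k=$2
    while [ "$k" -gt 0 ] && [ "$n" -gt 1 ]; do
        n=$((n / 2))
        k=$((k - 1))
    done
    echo "$n"
}

# halvings_needed N: number of halvings until a single commit is left
halvings_needed() {
    local n=$1 steps=0
    while [ "$n" -gt 1 ]; do
        n=$((n / 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

remaining_after 10374 5    # prints 324
halvings_needed 10374      # prints 13
```

So five build-and-boot cycles take 10374 commits down to a few hundred, and about 13 cycles would pin down a single commit.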

On Sat, 7 Oct 2006 01:36:24 +0200, “Jesper Juhl” <jesper.juhl@gmail.com> wrote:

>Hi,
>
>I’ve been using the this very simple script for a while to do test
>builds of the kernel :
>
>
>#!/bin/bash
>
>for i in $(seq 1 100); do
> nice make distclean
> while true; do
> nice make randconfig
> grep -q "CONFIG_EXPERIMENTAL=y" .config
> if [ $? -eq 1 ]; then
> break
> fi
> done
> cp .config config.${i}
> nice make -j3 > build.log.${i} 2>&1
>done
>
>
>Which has worked great in the past, but with recent kernels it has
>been a sure way to cause a complete lockup within 1 hour :-(

There are some no-nos Adrian Bunk pointed out back when I was doing this;
here’s what I used last year. It recently ran a hundred compiles, but
I forgot or lost the script that interpreted the results.

grant@sempro:~$ cat /usr/local/bin/zrandom-build
#!/bin/bash
#
# 2.6 kernel random .config compiler driver
#
# Copyright (C) 2005 Grant Coady gcoady.lk@gmail.com
#
# GPL v2 per linux/COPYING by reference
#
# Thanks to:
# comp.unix.shell people:
# Chris F.A. Johnson <http://cfaj.freeshell.org> for CLI number test
# Ed Morton <morton@lsupcaemnt.com> for ‘awk’ solution in resuming
# for answers to query 2005-07-27 for improvements to this script.
#
# linux-kernel people:
# Adrian Bunk Don’t bother with useless CONFIG_BROKEN= .config
# CONFIG_STANDALONE=
# Jesper Juhl Feedback
#
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# What?
# ``````
# A script to build random kernel .configs to discover kbuild errors.
# The .config, compiler output and time are recorded into the destination
# directory. Run several in parallel with outputs to different directories.
#
# The .config and compiler result are linked by a three digit number at
# start of filename.
#
# Files
# ``````
# 000-about record settings for a particular run
# ???-config the .config
# ???-result build (compiler) output
# ???-time time to build in seconds and mm:ss (curiosity)
#
# Post processing of results lists each error (or warning) and the first
# .config file triggering the error/warning. Another script.
#
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# globals
clean=""        # -c  set "Y" to do 'make clean' prior to compiles
store="../"     # -d  destination directory
jobnr=""        # -jn make job control
limit=100       # -n  number of .config builds to make
build="Y"       # -t  clear to not build .config for testing
patch=""        #     set "Y" to skip retry CONFIG_BROKEN=y .configs
count=0         # build counter
retry=0         # retry counter for useless .config filter
#
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# Setup the trial series, command line interface

function show_usage
{
    echo "random-build
random .config compiler driver for 2.6 series kernel
usage: random-build [-d destination_directory] [-n nnn] [-t] [-u]

-c do 'make clean' prior to each compile, default off
-d dir destination for results, default ../
-jn make job control, n = 0..9
-n nnn number of compile runs, default 100
-t testing config driver, no build .config

cd into linux top-level directory, specify an output directory
outside the kernel directory, example command:

random-build -c -n 333 -d ../trial-2.6.13-rc3-mm2-1

would make clean prior to each build and place the results of
333 random .config to directory ../trial-2.6.13-rc3-mm2-1

Useless .config generated are skipped, read script source to see
current setting; CONFIG_BROKEN=y is definitely useless
"
    exit 1
}

function check_config_limit # limit
{
    case $1 in
        *[!0-9]*) limit=0;;
        *       ) limit=$1;;
    esac
    if [ $limit -lt 1 -o $limit -gt 999 ]; then
        limit=100
    fi
}

function check_create_dest # destination
{
    local crap="n"
    if [ ! -d "$1" ]; then
        echo -e \
            "Non-existent destination $1 specified, create it? (y/N) \c"
        read crap
        echo
        if [ "$crap" == "y" -o "$crap" == "Y" ]; then
            mkdir "$1"
        else
            echo "bad dest"; show_usage
        fi
    fi
    store=$1
}

# parse command line
while [ $1 ]; do
    case $1 in
        -c     ) clean="Y";;    # do 'make clean'
        -d     ) check_create_dest $2; shift;;
        -j[0-9]) jobnr=$1;;
        -n     ) check_config_limit $2; shift;;
        -t     ) build="";;     # disable build
        *      ) echo "bad CLI"; show_usage;;
    esac
    shift
done
echo "
#==>>
#==>> Grant's random kernel configs $(date)
#==>> $0 from $PWD
#==>> host: linux-$(uname -r) on $HOSTNAME
#==>> store=$store
#==>> limit=$limit
#==>> clean=$clean
#==>> build=$build
#==>> job control=$jobnr
#==>>
" 2>&1 | tee "$store/000-about"
#
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# Run the trial series
#
# check if destination contains results, if so assume restart test run and
# thus overwrite the last partial result, remove leading zeroes so number
# is seen as decimal, not octal! Queried comp.unix.shell – 2005-07-27…

function perhaps_resume_trial
{
    count=$(ls $store/*-config 2>/dev/null \
        | awk -F/ '{f=$NF}END{print f+0}')

    if [ $count -gt 0 -a $count -lt 1000 ]; then
        if [ $count -gt 1 ]; then
            echo -e "\n#==>> Resuming: $count\n"
        fi
        ((count--))
    else
        count=0
    fi
}

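function check_config
{
    # Succeed (return 0) only when none of the unwanted options is set.
    # As posted, the function always returned 0: the egrep output was
    # discarded, so 'return' saw an empty argument, and the spaces around
    # '|' in the pattern kept it from matching any .config line.
    ! egrep -q \
        'CONFIG_BROKEN=|CONFIG_STANDALONE=|CONFIG_DEBUG_INFO=' \
        .config
}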

function create_random_config
{
    if [ -n "$patch" ]; then
        make randconfig > /dev/null
    else
        while true; do
            make randconfig > /dev/null
            check_config && break
            echo -e "\tRetry ($((++retry))): skipped useless .config"
        done
    fi
    cp .config "$store/$trial-config"
}

function build_random_config
{
    if [ -n "$build" ]; then
        [ -n "$clean" ] && make clean
        make $jobnr 2> "$store/$trial-result"
    fi
}

stamp=$SECONDS
function write_timestamp_file
{
    local t=0 m=0 s=0
    t=$((SECONDS - stamp))
    m=$(printf "%2d" $((t / 60)))
    s=$(printf "%02d" $((t % 60)))
    echo -e "$t\t$m:$s" > "$store/$trial-time"
    stamp=$SECONDS
}

perhaps_resume_trial
while [ $((++count)) -le $limit ]; do

    trial=$(printf %003d $count)
    echo "#==>> $0, run $count: make randconfig"
    create_random_config
    build_random_config
    write_timestamp_file
done

echo "skipped $retry useless .config :o)"

# end

On 07/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Sat, 7 Oct 2006, Jesper Juhl wrote:
>
> Reliable lock-ups (and “within 1 hour” is quite quick too) are actually
> great.
>
>
> Can I bother you to just bisect it?
>

Sure, but it will take a little while since building + booting +
starting the test + waiting for the lockup takes a fair bit of time
for each kernel and also due to the fact that my git skills are pretty
limited, but I’ll figure it out (need to improve those git skills
anyway) :-)

I’ll be back with more info.

On Sat, 7 Oct 2006, Jesper Juhl wrote:
>
>
> Sure, but it will take a little while since building + booting +
> starting the test + waiting for the lockup takes a fair bit of time
> for each kernel
Sure. That said, we’ve tried to narrow down things that took hours or days
(under real loads, not some nice test-script) to reproduce, and while it
doesn’t always work, the real problem tends to be if the problem case
isn’t really reproducible. It sounds like yours is pretty clear-cut, and
that will make things much easier.

> and also due to the fact that my git skills are pretty
> limited, but I’ll figure it out (need to improve those git skills
> anyway) :-)

“git bisect” in particular isn’t that hard to use, and it will really do
a lot of heavy lifting for you.

Although since it will just select a random commit (well, it’s not
“random”: it’s strictly as half-way as it can possibly be, but it’s
automated without any regard for anything else), you can sometimes hit a
situation where git will ask you to test a kernel that simply doesn’t work
at all, and you can’t even test whether it reproduces your particular bug
or not.

For example, “git bisect” might pick a kernel that just doesn’t compile,
because of some stupid bug that was fixed almost immediately afterwards.
In those cases, the total automation of “git bisect” ends up being
something that has to be helped along by hand, and then it definitely
helps to know more about how git works.

Anyway, the quick tutorial about “git bisect” is that once you’ve given it
the required first “good” and “bad” points, it will create a new branch in
the repository (called “bisect”, in case you care), and after that point
it will do a search in the commit DAG (aka “history tree” – it’s not a
tree, it’s a DAG, since merges will join branches together) for the next
commit that will neatly “split” the DAG into two equal pieces. It will
keep splitting the commit history until you get fed up, or until it has
pinpointed the single commit that caused the problem.

The nicest tool to use during bisection is to just do a

git bisect visualize

that simply starts up “gitk” (the default git history visualizer) to show
what the current state of bisection is. Now, if there are thousands and
thousands of commits, you’ll have a really hard time getting a visual clue
about what is going on, but especially once you get to a smaller set of
commits, it’s very useful indeed.

And it’s _especially_ useful if you hit one of the problem spots where you
can’t test the resulting tree for some unrelated reason. When that
happens, you should _not_ mark the problematic commit as being “bad”,
because you really don’t know – the “badness” of that commit is probably
not related to the “badness” that you’re actually searching for.

Instead, you should say “ok, I refuse to test this commit at all, because
it’s got other problems, and I will select another commit instead”. The
bisection algorithm doesn’t care which commit you pick, as long as it’s
within the set of “unknown” commits that you’ll see with the visualization
tool.

Of course, for efficiency reasons, the _closer_ you get to the half-way
mark, the better. So it’s useful to try to pick a commit that is close to
the one that “git bisect” originally chose for you, but that’s not a
correctness issue, that’s just an issue of “if we have a thousand
potential commits, we’re better off bisecting it 400/600 rather than
1/999, even if the exact half-way point isn’t testable”.

So if you need to decide to pick another point than the one “git bisect”
chose for you automatically, just select that commit in the visualizer
(which will cut the SHA1 name of it), and then do

git reset --hard <paste-sha1-here>

to reset the “bisect” branch to that point instead. And then compile and
test that kernel instead (and then if that’s good or bad, you can do the
“git bisect good” or “git bisect bad” thing to mark it so, and git will
continue to bisect the set of commits).

It can be a bit boring, but damn, it’s effective. I’ve used “git bisect”
several times when I’ve been too lazy to try to really think about what is
going on – I’ll happily brute-force bug-finding even if it might take a
little longer, if it’s guaranteed to find it (and if the bug is
reproducible, git bisect definitely guarantees to find what made it
appear, even if that may not necessarily be the deeper _cause_ of the bug)

Linus
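
The rule above (never mark an untestable commit "bad") can be sketched as a small decision helper. This is an illustration, not something from the thread: the function name verdict is invented, and it assumes a newer git than the one discussed here, one that has the "git bisect skip" subcommand as an alternative to picking a neighbouring commit by hand with "git reset --hard".

```shell
#!/bin/bash
# verdict BUILD_STATUS HUNG
#   BUILD_STATUS: exit status of the kernel build (0 = built fine)
#   HUNG:         "yes" if the box locked up under the test load
# Prints the git bisect subcommand to use: skip, bad or good.
verdict() {
    local build_status=$1 hung=$2
    if [ "$build_status" -ne 0 ]; then
        # Broken for an unrelated reason: we learned nothing about our bug.
        echo skip
    elif [ "$hung" = yes ]; then
        echo bad
    else
        echo good
    fi
}

# Hypothetical usage inside the tree being bisected:
#   make -j3; build_status=$?
#   ... boot the kernel, run the load, note whether it hung ...
#   git bisect "$(verdict "$build_status" no)"
verdict 2 no     # prints: skip
verdict 0 yes    # prints: bad
verdict 0 no     # prints: good
```

Keeping "did not build" separate from "hung" is the point: only the second says anything about the bug being hunted.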

Ok, some preliminary results on this before I go get some sleep + a
working day tomorrow…

On 07/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Sat, 7 Oct 2006, Jesper Juhl wrote:
>
> Sure. That said, we’ve tried to narrow down things that took hours or days
> (under real loads, not some nice test-script) to reproduce, and while it
> doesn’t always work, the real problem tends to be if the problem case
> isn’t really reproducible. It sounds like yours is pretty clear-cut, and
> that will make things much easier.
>
Yeah, it seems pretty clear-cut, but I’m a bit nervous that it may
sometimes take longer than my observed 60min to reproduce, rendering
my git-bisection less than perfect (more on that below).


>
> “git bisect” in particular isn’t that hard to use, and it will really do
> a lot of heavy lifting for you.
>

(…)
Thanks a lot for the tutorial, that really helped.

For some reason I couldn’t get git to accept 2.6.17.13 as a “good”
starting point, so I used 2.6.17 instead, and the sha1 you gave me for
2.6.18-git15 as the “bad” starting point.

Here’s where I am right now (a log of what I’ve done) :

[bisection start]

Bisecting: 5188 revisions left to test after this
[92164c5dd1ade33f4e90b72e407910de6694de49] USB: OHCI hub code unaligned access

[git bisect good]

Bisecting: 2567 revisions left to test after this
[e41542f5167d6b506607f8dd111fa0a3e468ccb8] [DCCP]: Introduce dccp_probe

[git bisect good]

Bisecting: 1351 revisions left to test after this
[b98adfccdf5f8dd34ae56a2d5adbe2c030bd4674] Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6

[git bisect good]

Bisecting: 635 revisions left to test after this
[538d9d532b0e0320c9dd326a560b5a72d73f910d] irq: remove a extra line

[git bisect good]

Bisecting: 292 revisions left to test after this
[db1a19b38f3a85f475b4ad716c71be133d8ca48e] Merge branch ‘intelfb-patches’ of master.kernel.org:/pub/scm/linux/kernel/git/airlied/intelfb-2.6

[git bisect bad]

Bisecting: 146 revisions left to test after this
[1db27c11e9a0c6d659040ac0b7c64a339e248fa1] istallion: Remove private baud rate decoding, which is also broken in this case on some platforms

[git bisect bad]

Bisecting: 73 revisions left to test after this
[3171a0305d62e6627a24bff35af4f997e4988a80] simplify update_times (avoid jiffies/jiffies_64 aliasing problem)

[git bisect good]

Bisecting: 37 revisions left to test after this
[29b884921634e1e01cbd276e1c9b8fc07a7e4a90] set EXIT_DEAD state in do_exit(), not in schedule()

[currently testing this kernel]

Looking at “git bisect visualize” the current status is this :

bisect/good: 3171a0305d62e6627a24bff35af4f997e4988a80

bisect/bad: 1db27c11e9a0c6d659040ac0b7c64a339e248fa1

Current bisect marker at: 29b884921634e1e01cbd276e1c9b8fc07a7e4a90

I’m a little worried though that my results may not be completely reliable.

There’s no doubt that you can trust the kernels that I told git were
“bad” since those resulted in a hang and there’s just no getting
around that. So we know for a fact that the bad commit is somewhere
between my last found bad kernel and 2.6.17, what we don’t know with
the same amount of certainty is if the bad commit is between my last
found good kernel and the last found bad one.

What I’m worried about is the kernels I’ve marked as “good”. Before
starting this run I had never experienced a hang if the kernel
survived past the one hour mark, so I concluded that testing each
kernel for 80min would be enough to prove it good or bad. This now
seems to be not completely reliable since my second bad kernel
happened to hang after ~2hrs. This happened since I forgot to check my
computer after 80min and only came back to it some 3hrs later (I know
the time it hung since I had an xterm running “while true; do sleep
10; uptime; done”, so I could check).

This all means that my testing and concluding kernels were “good”
after 80min of test runtime may not be 100% reliable.
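
A minimal sketch of the uptime-in-an-xterm trick, changed so the last sign of life survives the reboot: it appends to a file and syncs after every line. The function name heartbeat and the log path in the usage comment are placeholders, not anything used in this thread.

```shell
#!/bin/bash
# heartbeat FILE INTERVAL [COUNT]
#   Append "timestamp uptime" to FILE every INTERVAL seconds.
#   COUNT = 0 (the default) means run until interrupted.
heartbeat() {
    local file=$1 interval=$2 count=${3:-0} n=0
    while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(uptime)" >> "$file"
        sync    # push the line to disk so it is still there after a lockup
        n=$((n + 1))
        sleep "$interval"
    done
}

# Hypothetical usage, started just before the build script:
#   heartbeat "$HOME/heartbeat.log" 10 &
```

After a hang, the last line of the file gives the time of the lockup to within one interval, which shows how long a "good" kernel really needs to be tested.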

Is it useful for me to continue bisecting from the point I’m at, or
should I reset from good==2.6.17 and bad==the_last_bad_commit_I_found
? Or do you have a likely culprit I should try revoking?

Whatever your answer it’ll have to wait until tomorrow evening since
I’m going to go get some sleep now, but please let me know what you’d
like me to do …

Ok, finally got to the end of the bisection (see below; quoting all of
my previous email since my concerns from that one are still valid).

On 09/10/06, wrote:
> Ok, some preliminary results on this before I go get some sleep + a
> working day tomorrow…
>
>
> On 07/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> Yeah, it seems pretty clear-cut, but I’m a bit nervous that it may
> sometimes take longer than my observed 60min to reproduce, rendering
> my git-bisection less than perfect (more on that below).
>
>
> (…)
> Thanks a lot for the tutorial, that really helped.
>
> For some reason I couldn’t get git to accept 2.6.17.13 as a “good”
> starting point, so I used 2.6.17 instead, and the sha1 you gave me for
> 2.6.18-git15 as the “bad” starting point.
>
> Here’s where I am right now (a log of what I’ve done) :
>
> [bisection start]
>
> Bisecting: 5188 revisions left to test after this
> [92164c5dd1ade33f4e90b72e407910de6694de49] USB: OHCI hub code unaligned access
>
> [git bisect good]
>
> Bisecting: 2567 revisions left to test after this
> [e41542f5167d6b506607f8dd111fa0a3e468ccb8] [DCCP]: Introduce dccp_probe
>
> [git bisect good]
>
> Bisecting: 1351 revisions left to test after this
> [b98adfccdf5f8dd34ae56a2d5adbe2c030bd4674] Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6
>
> [git bisect good]
>
> Bisecting: 635 revisions left to test after this
> [538d9d532b0e0320c9dd326a560b5a72d73f910d] irq: remove a extra line
>
> [git bisect good]
>
> Bisecting: 292 revisions left to test after this
> [db1a19b38f3a85f475b4ad716c71be133d8ca48e] Merge branch ‘intelfb-patches’ of master.kernel.org:/pub/scm/linux/kernel/git/airlied/intelfb-2.6
>
> [git bisect bad]
>
> Bisecting: 146 revisions left to test after this
> [1db27c11e9a0c6d659040ac0b7c64a339e248fa1] istallion: Remove private baud rate decoding, which is also broken in this case on some platforms
>
> [git bisect bad]
>
> Bisecting: 73 revisions left to test after this
> [3171a0305d62e6627a24bff35af4f997e4988a80] simplify update_times (avoid jiffies/jiffies_64 aliasing problem)
>
> [git bisect good]
>
> Bisecting: 37 revisions left to test after this
> [29b884921634e1e01cbd276e1c9b8fc07a7e4a90] set EXIT_DEAD state in do_exit(), not in schedule()
>
> [currently testing this kernel]
>
>
> Looking at “git bisect visualize” the current status is this :
>
> bisect/good: 3171a0305d62e6627a24bff35af4f997e4988a80
> bisect/bad: 1db27c11e9a0c6d659040ac0b7c64a339e248fa1
> Current bisect marker at: 29b884921634e1e01cbd276e1c9b8fc07a7e4a90
>
>
> I’m a little worried though that my results may not be completely reliable.
>
> There’s no doubt that you can trust the kernels that I told git were
> “bad” since those resulted in a hang and there’s just no getting
> around that. So we know for a fact that the bad commit is somewhere
> between my last found bad kernel and 2.6.17, what we don’t know with
> the same amount of certainty is if the bad commit is between my last
> found good kernel and the last found bad one.
>
> What I’m worried about is the kernels I’ve marked as “good”. Before
> starting this run I had never experienced a hang if the kernel
> survived past the one hour mark, so I concluded that testing each
> kernel for 80min would be enough to prove it good or bad. This now
> seems to be not completely reliable since my second bad kernel
> happened to hang after ~2hrs. This happened since I forgot to check my
> computer after 80min and only came back to it some 3hrs later (I know
> the time it hung since I had an xterm running “while true; do sleep
> 10; uptime; done”, so I could check).
>
> This all means that my testing and concluding kernels were “good”
> after 80min of test runtime may not be 100% reliable.
>
> Is it useful for me to continue bisecting from the point I’m at, or
> should I reset from good==2.6.17 and bad==the_last_bad_commit_I_found
> ? Or do you have a likely culprit I should try revoking?
>
> Whatever your answer it’ll have to wait until tomorrow evening since
> I’m going to go get some sleep now, but please let me know what you’d
> like me to do …
>

In the end, this is what git told me :

1db27c11e9a0c6d659040ac0b7c64a339e248fa1 is first bad commit
commit 1db27c11e9a0c6d659040ac0b7c64a339e248fa1
Author: Alan Cox <alan@lxorguk.ukuu.org.uk>
Date: Fri Sep 29 02:01:38 2006 -0700

    [PATCH] istallion: Remove private baud rate decoding, which is
    also broken in this case on some platforms

    Signed-off-by: Alan Cox <alan@redhat.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

:040000 040000 0fc700de5e78b39acc130d529cf59437e9242b68 884b27574b6c38a5fa952d09ca945b167e36db84 M drivers

But, that doesn’t make much sense, so I very strongly suspect that my
test case was not as reliable as I thought.
We can trust the commits I marked as ‘bad’ though since there’s no
getting around a complete lockup of the box. So we know for sure now
that things broke between 2.6.17 and the commit above. But since that
commit makes no sense as the cause of the breakage it must be a case
of me having marked a kernel as ‘good’ that would eventually have
turned out bad if I’d run it longer :-(

Where do I go from here? The problem is still there… I’ll test
2.6.19-rc2 tomorrow, but apart from that I don’t know how to proceed
apart from trying to capture a sysrq+t dump when the box locks up…
any ideas?

On Tue, 17 Oct 2006, Jesper Juhl wrote:
>
> Ok, finally got to the end of the bisection (see below; quoting all of
> my previous email since my concerns from that one are still valid).
Ok. It does smell like you marked something good that wasn’t. That commit
1db27c11 was the last one you claimed was bad, of course, so it’s the one
git will claim caused it, when you’ve marked its parent good.

> Where do I go from here? The problem is still there… I’ll test
> 2.6.19-rc2 tomorrow, but apart from that I don’t know how to proceed
> apart from trying to capture a sysrq+t dump when the box locks up…
> any ideas?

Yeah, trying to do sysrq when it locks is probably worth it. As is
enabling debugging things (netconsole, page-alloc, slab alloc, lockdep
etc).

But if nothing seems to really give any clues, you might just try
to restart bisection with

git bisect reset
git bisect start
git bisect good v2.6.17
git bisect bad 1db27c11

and just run the resulting kernel version for a day or two. If an hour
wasn’t really good enough, it’s not as repeatable as we’d have wished, but
even if it takes a few days to narrow it down by just two bisections or
so, it will cut things down from ten thousand commits to “just” 2500..

Linus
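
One way to restart while keeping only the verdicts that can be trusted (every "bad", plus the original known-good point) is to filter a saved bisect log and replay it. The sketch below is an assumption-laden illustration: it presumes a git that has "git bisect log" and "git bisect replay", and a log whose verdict lines look like "git bisect good <sha>", as current git writes them. The function name drop_untrusted_good and the file names are made up.

```shell
#!/bin/bash
# drop_untrusted_good TRUSTED_SHA
#   Copy a bisect log from stdin to stdout, dropping every
#   "git bisect good <sha>" line whose sha does not start with TRUSTED_SHA.
#   All other lines (start, bad verdicts, comments) pass through unchanged.
drop_untrusted_good() {
    local trusted=$1
    awk -v keep="$trusted" '
        $1 == "git" && $2 == "bisect" && $3 == "good" && index($4, keep) != 1 { next }
        { print }
    '
}

# Hypothetical usage (SHA_OF_V2_6_17 stands for the commit of the trusted tag):
#   git bisect log > bisect.log
#   git bisect reset
#   drop_untrusted_good "$SHA_OF_V2_6_17" < bisect.log > trusted.log
#   git bisect replay trusted.log
```

This is equivalent to the four commands above when there is a single trusted good point. It mainly saves retyping once several "bad" marks have accumulated.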

On 17/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Tue, 17 Oct 2006, Jesper Juhl wrote:
>
> Ok. It does smell like you marked something good that wasn’t. That commit
> 1db27c11 was the last one you claimed was bad, of course, so it’s the one
> git will claim caused it, when you’ve marked its parent good.
>
>
> Yeah, trying to do sysrq when it locks is probably worth it. As is
> enabling debugging things (netconsole, page-alloc, slab alloc, lockdep
> etc).
>

I’ve got all those debug options (and more) enabled already in all the
bisection builds. I run with those options enabled most of the time
and I didn’t change my config for any of the kernels I tested (except
for running ‘make oldconfig’).
netconsole is not much use to me as I don’t have a second box at the
moment to capture output on :-( So the best I can do there is to let
the box run in a plain console with the test script and then press
sysrq+t when it locks and take a photo of the output (or what’s left of
it on the screen) if any.

> But if nothing seems to really give any clues, you might just try
> to restart bisection with
>
> git bisect reset
> git bisect start
> git bisect good v2.6.17
> git bisect bad 1db27c11
>
> and just run the resulting kernel version for a day or two. If an hour
> wasn’t really good enough, it’s not as repeatable as we’d have wished, but
> even if it takes a few days to narrow it down by just two bisections or
> so, it will cut things down from ten thousand commits to “just” 2500..
>

Ok, sure. I’ll do a days run of 2.6.19-rc2 first, just to see if it’s
been fixed in the mean time. If it’s still there I’ll try to get a
sysrq+t and post that, then I’ll restart bisection and give each
kernel a full 24hrs of testing before concluding it is good.

I’ll report back as soon as I have some results.

On 17/10/06, wrote:
> On 17/10/06, Linus Torvalds <torvalds@osdl.org> wrote:

[…]
> Ok, sure. I’ll do a days run of 2.6.19-rc2 first, just to see if it’s
> been fixed in the mean time. If it’s still there I’ll try to get a
> sysrq+t and post that, then I’ll restart bisection and give each
> kernel a full 24hrs of testing before concluding it is good.
>
> I’ll report back as soon as I have some results.
>

Ok, I’ve been unable to do any testing for a few days, but today I had
some spare time and set my box to run my test script while doing some
other work. It was running latest git at the time of 2.6.19-rc2 + a
day or two and it locked up after ~20min.
So we are not so lucky that the problem has been fixed by some of the
patches that have gone in recently :-(

Since there was nothing in the system logs and the box was completely
frozen (not even sysrq worked) I guess I’ll have to try and restart
the bisection.

Just wanted to report the little data I had. I’ll be back with more
(hopefully soon).

On 23/10/06, wrote:
> On 17/10/06, wrote:
> […]
> Ok, I’ve been unable to do any testing for a few days, but today I had
> some spare time and set my box to run my test script while doing some
> other work. It was running latest git at the time of 2.6.19-rc2 + a
> day or two and it locked up after ~20min.
> So we are not so lucky that the problem has been fixed by some of the
> patches that have gone in recently :-(
>
> Since there was nothing in the system logs and the box was completely
> frozen (not even sysrq worked) I guess I’ll have to try and restart
> the bisection.
>
> Just wanted to report the little data I had. I’ll be back with more
> (hopefully soon).
>

A little more data :

I’m still able to reproduce the lockups with 2.6.19-rc6 and 2.6.19 git
HEAD as of yesterday.

I’ve still not been able to get a sysrq-t dump or anything in my logs yet.

One thing I have found though is that I don’t have to use my test
script to reproduce. Usually building an allyesconfig kernel (or two)
is enough.
The lockups seem to happen when my box runs low on memory. What
happens is that I can see all my memory being used up and the kernel
starts dipping into swap. Interactive behaviour in X then gets
significantly worse – changing between windows starts lagging and
eventually even moving the mouse gets jerky, it makes large jumps with
several seconds delay – that’s a sure sign a lockup is coming very
soon.
The box has 2GB of RAM and 768MB swap. When it starts getting
unresponsive before a hang there’s usually plenty of swap (a few
hundred MB) left and also a bit of RAM free.

So it *seems* to be somehow related to running low on RAM and swap
starting to be used.
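
To put numbers on "low on RAM, swap starting to be used", the free memory and free swap can be sampled right up to the hang. A minimal sketch, assuming the usual "Name: value kB" layout of /proc/meminfo. The helper name meminfo_kb and the log file name are invented for this example.

```shell
#!/bin/bash
# meminfo_kb FIELD [FILE]
#   Print the value (in kB) of FIELD, e.g. MemFree or SwapFree,
#   from a meminfo-style file. FILE defaults to /proc/meminfo.
meminfo_kb() {
    local field=$1 file=${2:-/proc/meminfo}
    awk -v f="$field:" '$1 == f { print $2 }' "$file"
}

# Hypothetical usage: sample every 5 seconds, syncing so the last
# values are still on disk after a lockup.
#   while true; do
#       echo "$(date +%T) MemFree=$(meminfo_kb MemFree) SwapFree=$(meminfo_kb SwapFree)" >> mem.log
#       sync
#       sleep 5
#   done
```

The last few lines of such a log show how much RAM and swap were actually free when the box stopped responding.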

One other thing that I’ve noticed, that may or may not be related, is
that when I shutdown my machine after a session where a significant
amount of RAM has been in use at some point (especially bad if some
swap has also been in use), then unmounting my filesystems takes ages.
Normally it just takes a few seconds to unmount the filesystems upon a
shutdown, or at most 10 seconds, but if I’m at the point where the
machine has dipped into swap (or has been very close to), then
unmounting the filesystems often takes 10-15 *minutes* or more
(sometimes I just give up and power off the box after 30min or
thereabouts).

Hope that helps in some way… I still want to redo/complete a new
bisection, but haven’t found the time yet.
More details when I have some.

On Wed, 22 Nov 2006, Jesper Juhl wrote:
>
> So it *seems* to be somehow related to running low on RAM and swap
> starting to be used.
Does it happen if you just do some simple “use all memory” script, eg run
a few copies of

#define SIZE (100<<20)

char *buf = malloc(SIZE);
memset(buf, SIZE, 0);
sleep(100);

on your box?

> The box has 2GB of RAM and 768MB swap.

I wonder.. It _used_ to be true that we were pretty good at making swap be
“extra” memory. But maybe we’ve lost some of that, and we have trouble
with having more physical memory. We could end up in a situation where we
allocate it all very quickly (because we don’t actually page it out, we
just allocate backing store for the pages), and we screw something up.

But stupid bugs there should still leave us trivially able to do the SysRQ
things, so..

Is it highmem-related? Some bounce-buffering problem while having to swap?
What block device driver do you use for the swap device?

I don’t think we use any irq-disable locking in the VM itself, but I could
imagine some nasty situation with the block device layer getting into a
deadlock with interrupts disabled when it runs out of queue entries and
cannot allocate more memory..

Linus

On Tue, Nov 21, 2006 at 06:36:39PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 22 Nov 2006, Jesper Juhl wrote:
>
> Does it happen if you just do some simple “use all memory” script, eg run
> a few copies of
>
> #define SIZE (100<<20)
>
> char *buf = malloc(SIZE);
> memset(buf, SIZE, 0);
> sleep(100);
>
> on your box?
ITYM…

memset(buf, 0, SIZE);

Dave


http://www.codemonkey.org.uk

On Tue, 21 Nov 2006, Dave Jones wrote:
>
> ITYM…
>
> memset(buf, 0, SIZE);
I’m just checking that you’re paying attention.

There’s a reason sparse warns about the third parameter of a memset()
being zero ;-)

Linus

On Tue, Nov 21, 2006 at 07:44:45PM -0800, Linus Torvalds wrote:

>
> I’m just checking that you’re paying attention.
>
> There’s a reason sparse warns about the third parameter of a memset()
> being zero ;-)
Heh, it’s amazing how commonplace that mistake is.
Come back bzero, all is forgiven..

Dave


http://www.codemonkey.org.uk

On Tue, Nov 21 2006, Linus Torvalds wrote:
> I don’t think we use any irq-disable locking in the VM itself, but I could
> imagine some nasty situation with the block device layer getting into a
> deadlock with interrupts disabled when it runs out of queue entries and
> cannot allocate more memory..
Not likely. Request allocation is done with GFP_NOIO and backed by a
memory pool, so as long the vm doesn’t go totally nuts because
__GFP_WAIT is set, we should be safe there. If it did go crazy, I
suspect a sysrq-t would still work.

If bouncing is involved for swap, we do have a potential deadlock issue
that isn’t fixed yet. I just whipped up this completely untested patch,
it should shed some light on that issue.

diff --git a/mm/bounce.c b/mm/bounce.c
index e4b62d2..f75eb37 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -20,6 +20,7 @@ #define POOL_SIZE 64
 #define ISA_POOL_SIZE 16

 static mempool_t *page_pool, *isa_page_pool;
+static struct bio_set *bounce_bio_set;

 #ifdef CONFIG_HIGHMEM
 static __init int init_emergency_pool(void)
@@ -31,6 +32,9 @@ static __init int init_emergency_pool(vo
 	if (!i.totalhigh)
 		return 0;

+	bounce_bio_set = bioset_create(1, 1, 0);
+	BUG_ON(!bounce_bio_set);
+
 	page_pool = mempool_create_page_pool(POOL_SIZE, 0);
 	BUG_ON(!page_pool);
 	printk("highmem bounce pool size: %d pages\n", POOL_SIZE);
@@ -190,6 +194,11 @@ static int bounce_end_io_read_isa(struct
 	return 0;
 }

+static void bounce_bio_destructor(struct bio *bio)
+{
+	bio_free(bio, bounce_bio_set);
+}
+
 static void __blk_queue_bounce(request_queue_t *q, struct bio **bio_orig,
 			mempool_t *pool)
 {
@@ -210,8 +219,10 @@ static void __blk_queue_bounce(request_q
 		/*
 		 * irk, bounce it
 		 */
-		if (!bio)
-			bio = bio_alloc(GFP_NOIO, (*bio_orig)->bi_vcnt);
+		if (!bio) {
+			bio = bio_alloc_bioset(GFP_NOIO, (*bio_orig)->bi_vcnt, bounce_bio_set);
+			bio->bi_destructor = bounce_bio_destructor;
+		}

 		to = bio->bi_io_vec + i;

Dave Jones wrote:
> On Tue, Nov 21, 2006 at 07:44:45PM -0800, Linus Torvalds wrote:
>
>
> Heh, it’s amazing how commonplace that mistake is.
> Come back bzero, all is forgiven..
It’s interesting to do the following on google codesearch

lang:^(c|c\+\+)$ memset\ *\(.*,\ *0\ *\);
http://tinyurl.com/y47qu4

lang:^(c|c\+\+)$ \sif\([^)]*\);
http://tinyurl.com/y4mdbl

It would be interesting to build
up a suite of these regular expressions.

Pádraig.
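
As a rough illustration of the first search above, the same idea can be run locally with POSIX regular expressions. This is a loose translation, not the exact codesearch query: the function name `has_zero_count_memset` is hypothetical, and `[[:space:]]` is used where the query has an escaped space. It checks one line of source text at a time.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the thread): returns 1 if `line` contains
 * a memset() call whose last argument is a literal 0, 0 if it does not,
 * and -1 if the pattern fails to compile.
 */
int has_zero_count_memset(const char *line)
{
	/* Extended-regex rendering of: memset\ *\(.*,\ *0\ *\); */
	static const char pattern[] =
		"memset[[:space:]]*\\(.*,[[:space:]]*0[[:space:]]*\\)[[:space:]]*;";
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;

	rc = regexec(&re, line, 0, NULL, 0);	/* 0 means a match was found */
	regfree(&re);
	return rc == 0;
}
```

Like the original query, this is a crude textual check: it knows nothing about C syntax, so it can be fooled by comments, strings, or several calls on one line.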

On 22/11/06, <> wrote:
> On Tue, Nov 21 2006, Linus Torvalds wrote:
>
> Not likely. Request allocation is done with GFP_NOIO and backed by a
> memory pool, so as long as the vm doesn’t go totally nuts because
> __GFP_WAIT is set, we should be safe there. If it did go crazy, I
> suspect a sysrq-t would still work.
>
> If bouncing is involved for swap, we do have a potential deadlock issue
> that isn’t fixed yet. I just whipped up this completely untested patch,
> it should shed some light on that issue.
>

Thanks Jens, I’ll apply that later tonight and force a few lockups and
see if I get any extra details with that patch.

On Wed, Nov 22 2006, Jesper Juhl wrote:
> On 22/11/06, <> wrote:
> Thanks Jens, I’ll apply that later tonight and force a few lockups and
> see if I get any extra details with that patch.
Can you post a full dmesg too, as well as clarify which device holds the
swap space?

On 22/11/06, <> wrote:
> On Wed, Nov 22 2006, Jesper Juhl wrote:
>
> Can you post a full dmesg too, as well as clarify which device holds the
> swap space?
>

Sure. I’ll post a full dmesg as soon as I get home.

The swap partition is on an IBM Ultrastar U160 10K RPM SCSI disk,
hooked up to an Adaptec 29160N controller, using the aic7xxx driver.
That disk holds all my filesystems as well and the controller also has
a SCSI DVD drive and a SCSI CD writer attached to it. No SATA/PATA
devices in the box, in case that matters.

On Wed, Nov 22 2006, Jesper Juhl wrote:
> On 22/11/06, <> wrote:
> Sure. I’ll post a full dmesg as soon as I get home.
>
> The swap partition is on a IBM Ultrastar U160 10K RPM SCSI disk,
> hooked up to an Adaptec 29160N controller, using the aic7xxx driver.
> That disk holds all my filesystems as well and the controller also has
> a SCSI DVD drive and a SCSI CD writer attached to it. No SATA/PATA
> devices in the box, in case that matters.
Does the box survive io intensive workloads? Have you tried using net or
serial console to see if it spits out any info before it crashes? I
would not be too surprised if it’s the aic7xxx driver taking a dive, I’d
be a lot more surprised if it’s actually the bouncing (I don’t think you
do any, can you post cat /proc/meminfo | grep -i bounce on that box?) or
a generic vm/block bug causing you problems.

On 22/11/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Wed, 22 Nov 2006, Jesper Juhl wrote:
>
> Does it happen if you just do some simple “use all memory” script, eg run
> a few copies of
>
> #define SIZE (100<<20)
>
> char *buf = malloc(SIZE);
> memset(buf, SIZE, 0);
> sleep(100);
>
> on your box?
>
I’ll try, when I get home from work. I’ll let you know later.
>
> I wonder.. It _used_ to be true that we were pretty good at making swap be
> “extra” memory. But maybe we’ve lost some of that, and we have trouble
> with having more physical memory. We could end up in a situation where we
> allocate it all very quickly (because we don’t actually page it out, we
> just allocate backing store for the pages), and we screw something up.
>
> But stupid bugs there should still leave us trivially able to do the SysRQ
> things, so..
>

Well, it’s a fact that sysrq works just fine before the lockup but
does not work at all after a lockup, so…


> Is it highmem-related? Some bounce-buffering problem while having to swap?

I can try building a kernel without highmem support and see if I can
still cause it to lock up. Would be an interesting data point.

I’ll also try reproducing the lockup without any swap active to see if
that makes a difference.


> What block device driver do you use for the swap device?
>

It’s a swap partition on an IBM Ultra160 10K RPM SCSI disk. The
controller is an Adaptec 29160N. Using the SCSI_AIC7XXX driver.


> I don’t think we use any irq-disable locking in the VM itself, but I could
> imagine some nasty situation with the block device layer getting into a
> deadlock with interrupts disabled when it runs out of queue entries and
> cannot allocate more memory..
>

Just let me know what you would like me to try/test to prove/disprove that.

On Wed, Nov 22, 2006 at 10:32:36AM +0000, Pádraig Brady wrote:

>
> It’s interesting to do the following on google codesearch
>
> lang:^(c|c\+\+)$ memset\ *\(.*,\ *0\ *\);
> http://tinyurl.com/y47qu4
>
> lang:^(c|c\+\+)$ \sif\([^)]*\);
> http://tinyurl.com/y4mdbl
>
> It would be interesting to build
> up a suite of these regular expressions.
A bunch of people already started gathering these a day
or so after codesearch launched..

http://asert.arbornetworks.com/2006…le-code-search/
is a good start.
http://www.cipher.org.uk/index.php?…s/bugle.project
is also somewhat interesting (but from a security bug standpoint only)

I’ve got some crufty shell scripts that I grew that I use
from time to time that just grep a bunch of patterns, I’ve had
“put them all together and make one decent one” on my todo
for a while. I’ll see if I can get to it this week.
I’ve used these occasionally not just to find bugs in the kernel
but across a completely unpacked distro source tree.
Amazing what turns up sometimes.

Dave


http://www.codemonkey.org.uk

On 22/11/06, wrote:
> On 22/11/06, <> wrote:

….
> Sure. I’ll post a full dmesg as soon as I get home.
>
I didn’t have time to look at this last night, so I have to keep you
guys waiting for a little while longer.

On Wednesday 22 November 2006 11:57, wrote:
> On Wed, Nov 22 2006, Jesper Juhl wrote:
>
> Can you post a full dmesg too, as well as clarify which device holds the
> swap space?
>
Here’s a complete dmesg from a fresh boot:

Linux version 2.6.19-rc6-g66c669ba (juhl@dragon) (gcc version 3.4.6) #2 SMP PREEMPT Fri Nov 24 00:37:24 CET 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ffb0000 (usable)
BIOS-e820: 000000007ffb0000 - 000000007ffc0000 (ACPI data)
BIOS-e820: 000000007ffc0000 - 000000007fff0000 (ACPI NVS)
BIOS-e820: 000000007fff0000 - 0000000080000000 (reserved)
BIOS-e820: 00000000ff7c0000 - 0000000100000000 (reserved)
1151MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000ff780
Entering add_active_range(0, 0, 524208) 0 entries of 256 used
Zone PFN ranges:
DMA 0 -> 4096
Normal 4096 -> 229376
HighMem 229376 -> 524208
early_node_map[1] active PFN ranges
0: 0 -> 524208
On node 0 totalpages: 524208
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 4064 pages, LIFO batch:0
Normal zone: 1760 pages used for memmap
Normal zone: 223520 pages, LIFO batch:31
HighMem zone: 2303 pages used for memmap
HighMem zone: 292529 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 ACPIAM ) @ 0x000f9bb0
ACPI: RSDT (v001 A M I OEMRSDT 0x12000506 MSFT 0x00000097) @ 0x7ffb0000
ACPI: FADT (v002 A M I OEMFACP 0x12000506 MSFT 0x00000097) @ 0x7ffb0200
ACPI: MADT (v001 A M I OEMAPIC 0x12000506 MSFT 0x00000097) @ 0x7ffb0390
ACPI: OEMB (v001 A M I AMI_OEM 0x12000506 MSFT 0x00000097) @ 0x7ffc0040
ACPI: DSDT (v001 939M2 939M2150 0x00000150 INTL 0x02002026) @ 0x00000000
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 16
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfec10000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 17, address 0xfec10000, GSI 24-39
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode: Flat. Using 2 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:7f7c0000)
Detected 2200.199 MHz processor.
Built 1 zonelists. Total pages: 520113
Kernel command line: BOOT_IMAGE=g66c669ba ro root=801
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
mapped IOAPIC to ffffb000 (fec10000)
Enabling fast FPU save and restore… done.
Enabling unmasked SIMD FPU exception support… done.
Initializing CPU#0
CPU 0 irqstacks, hard=c04d5000 soft=c04d3000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Console: colour dummy device 80x25
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
…. MAX_LOCKDEP_SUBCLASSES: 8
…. MAX_LOCK_DEPTH: 30
…. MAX_LOCKDEP_KEYS: 2048
…. CLASSHASH_SIZE: 1024
…. MAX_LOCKDEP_ENTRIES: 8192
…. MAX_LOCKDEP_CHAINS: 8192
…. CHAINHASH_SIZE: 4096
memory used by lock dependency info: 904 kB
per task-struct memory footprint: 1200 bytes
————————
| Locking API testsuite:
—————————————————————————-
| spin |wlock |rlock |mutex | wsem | rsem |
————————————————————————–
A-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok |
double unlock: ok | ok | ok | ok | ok | ok |
initialize held: ok | ok | ok | ok | ok | ok |
bad unlock order: ok | ok | ok | ok | ok | ok |
————————————————————————–
recursive read-lock: | ok | | ok |
recursive read-lock #2: | ok | | ok |
mixed read-write-lock: | ok | | ok |
mixed write-read-lock: | ok | | ok |
————————————————————————–
hard-irqs-on + irq-safe-A/12: ok | ok | ok |
soft-irqs-on + irq-safe-A/12: ok | ok | ok |
hard-irqs-on + irq-safe-A/21: ok | ok | ok |
soft-irqs-on + irq-safe-A/21: ok | ok | ok |
sirq-safe-A => hirqs-on/12: ok | ok | ok |
sirq-safe-A => hirqs-on/21: ok | ok | ok |
hard-safe-A + irqs-on/12: ok | ok | ok |
soft-safe-A + irqs-on/12: ok | ok | ok |
hard-safe-A + irqs-on/21: ok | ok | ok |
soft-safe-A + irqs-on/21: ok | ok | ok |
hard-safe-A + unsafe-B #1/123: ok | ok | ok |
soft-safe-A + unsafe-B #1/123: ok | ok | ok |
hard-safe-A + unsafe-B #1/132: ok | ok | ok |
soft-safe-A + unsafe-B #1/132: ok | ok | ok |
hard-safe-A + unsafe-B #1/213: ok | ok | ok |
soft-safe-A + unsafe-B #1/213: ok | ok | ok |
hard-safe-A + unsafe-B #1/231: ok | ok | ok |
soft-safe-A + unsafe-B #1/231: ok | ok | ok |
hard-safe-A + unsafe-B #1/312: ok | ok | ok |
soft-safe-A + unsafe-B #1/312: ok | ok | ok |
hard-safe-A + unsafe-B #1/321: ok | ok | ok |
soft-safe-A + unsafe-B #1/321: ok | ok | ok |
hard-safe-A + unsafe-B #2/123: ok | ok | ok |
soft-safe-A + unsafe-B #2/123: ok | ok | ok |
hard-safe-A + unsafe-B #2/132: ok | ok | ok |
soft-safe-A + unsafe-B #2/132: ok | ok | ok |
hard-safe-A + unsafe-B #2/213: ok | ok | ok |
soft-safe-A + unsafe-B #2/213: ok | ok | ok |
hard-safe-A + unsafe-B #2/231: ok | ok | ok |
soft-safe-A + unsafe-B #2/231: ok | ok | ok |
hard-safe-A + unsafe-B #2/312: ok | ok | ok |
soft-safe-A + unsafe-B #2/312: ok | ok | ok |
hard-safe-A + unsafe-B #2/321: ok | ok | ok |
soft-safe-A + unsafe-B #2/321: ok | ok | ok |
hard-irq lock-inversion/123: ok | ok | ok |
soft-irq lock-inversion/123: ok | ok | ok |
hard-irq lock-inversion/132: ok | ok | ok |
soft-irq lock-inversion/132: ok | ok | ok |
hard-irq lock-inversion/213: ok | ok | ok |
soft-irq lock-inversion/213: ok | ok | ok |
hard-irq lock-inversion/231: ok | ok | ok |
soft-irq lock-inversion/231: ok | ok | ok |
hard-irq lock-inversion/312: ok | ok | ok |
soft-irq lock-inversion/312: ok | ok | ok |
hard-irq lock-inversion/321: ok | ok | ok |
soft-irq lock-inversion/321: ok | ok | ok |
hard-irq read-recursion/123: ok |
soft-irq read-recursion/123: ok |
hard-irq read-recursion/132: ok |
soft-irq read-recursion/132: ok |
hard-irq read-recursion/213: ok |
soft-irq read-recursion/213: ok |
hard-irq read-recursion/231: ok |
soft-irq read-recursion/231: ok |
hard-irq read-recursion/312: ok |
soft-irq read-recursion/312: ok |
hard-irq read-recursion/321: ok |
soft-irq read-recursion/321: ok |
——————————————————-
Good, all 218 testcases passed! |
———————————
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 2070236k/2096832k available (2333k kernel code, 25428k reserved, 957k data, 224k init, 1179328k highmem)
virtual kernel memory layout:
fixmap : 0xfff81000 - 0xfffff000 ( 504 kB)
pkmap : 0xff800000 - 0xffc00000 (4096 kB)
vmalloc : 0xf8800000 - 0xff7fe000 ( 111 MB)
lowmem : 0xc0000000 - 0xf8000000 ( 896 MB)
.init : 0xc0496000 - 0xc04ce000 ( 224 kB)
.data : 0xc03477c1 - 0xc0436f54 ( 957 kB)
.text : 0xc0100000 - 0xc03477c1 (2333 kB)
Checking if this processor honours the WP bit even in supervisor mode… Ok.
Calibrating delay using timer specific routine.. 4402.72 BogoMIPS (lpj=2201360)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 178bfbff e3d3fbff 00000000 00000000 00000001 00000000 00000003
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0(2) -> Core 0
CPU: After all inits, caps: 178bfbf7 e3d3fbff 00000000 00000410 00000001 00000000 00000003
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking ‘hlt’ instruction… OK.
Freeing SMP alternatives: 12k freed
ACPI: Core revision 20060707
CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ stepping 02
lockdep: not fixing up alternatives.
Booting processor 1/1 eip 2000
CPU 1 irqstacks, hard=c04d6000 soft=c04d4000
Initializing CPU#1
Calibrating delay using timer specific routine.. 4399.52 BogoMIPS (lpj=2199764)
CPU: After generic identify, caps: 178bfbff e3d3fbff 00000000 00000000 00000001 00000000 00000003
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 1(2) -> Core 1
CPU: After all inits, caps: 178bfbf7 e3d3fbff 00000000 00000410 00000001 00000000 00000003
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ stepping 02
Total of 2 processors activated (8802.24 BogoMIPS).
ENABLING IO-APIC IRQs
…TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
checking TSC synchronization across 2 CPUs:
CPU#0 had -31 usecs TSC skew, fixed it up.
CPU#1 had 31 usecs TSC skew, fixed it up.
Brought up 2 CPUs
migration_cost=387
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 3.00 entry at 0xf0031, last bus=4
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
PCI quirk: region 0800-083f claimed by ali7101 ACPI
Boot video device is 0000:03:00.0
PCI: Transparent bridge - 0000:00:06.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P4._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HTT_._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEB1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEB2._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 *10 11 12 14 15), disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *10 11 12 14 15), disabled.
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs *3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 10 11 12 14 15) *9
ACPI: PCI Interrupt Link [LNKP] (IRQs 3 4 *5 6 7 10 11 12 14 15)
SCSI subsystem initialized
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
PCI: Bridge: 0000:00:01.0
IO window: disabled.
MEM window: ff200000-ff2fffff
PREFETCH window: disabled.
PCI: Bridge: 0000:00:02.0
IO window: disabled.
MEM window: ff300000-ff3fffff
PREFETCH window: disabled.
PCI: Bridge: 0000:00:05.0
IO window: disabled.
MEM window: ff400000-ff4fffff
PREFETCH window: c7f00000-d7efffff
PCI: Bridge: 0000:00:06.0
IO window: d000-dfff
MEM window: ff500000-ff5fffff
PREFETCH window: 88000000-880fffff
ACPI: PCI Interrupt 0000:00:01.0[A] -> GSI 29 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:01.0 to 64
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 34 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:02.0 to 64
PCI: Setting latency timer of device 0000:00:05.0 to 64
PCI: Setting latency timer of device 0000:00:06.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 65536 (order: 9, 2359296 bytes)
TCP bind hash table entries: 32768 (order: 8, 1179648 bytes)
TCP: Hash tables configured (established 65536 bind 32768)
TCP reno registered
Machine check exception polling timer started.
Initializing RT-Tester: OK
audit: initializing netlink socket (disabled)
audit(1164325505.879:1): initialized
highmem bounce pool size: 64 pages
io scheduler noop registered
io scheduler cfq registered (default)
vesafb: framebuffer at 0xc8000000, mapped to 0xf8880000, using 3072k, total 16384k
vesafb: mode is 1024x768x16, linelength=2048, pages=9
vesafb: protected mode interface info at c000:7880
vesafb: pmi: set display start = c00c79d3, set palette = c00c7ab3
vesafb: pmi: ports =
vesafb: scrolling: redraw
vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0
Console: switching to colour frame buffer device 128x48
fb0: VESA VGA frame buffer device
Real Time Clock Driver v1.12ac
Serial: 8250/16550 driver $Revision: 1.90 $ 2 ports, IRQ sharing disabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
via-rhine.c:v1.10-LK1.4.2 Sept-11-2006 Written by Donald Becker
ACPI: PCI Interrupt 0000:04:07.0[A] -> GSI 22 (level, low) -> IRQ 18
eth0: VIA Rhine II at 0xff5fec00, 00:50:ba:f2:a3:1d, IRQ 18.
eth0: MII PHY found at address 8, status 0x7829 advertising 01e1 Link 45e1.
ACPI: PCI Interrupt 0000:04:06.0[A] -> GSI 21 (level, low) -> IRQ 19
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
<Adaptec 29160N Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi 0:0:4:0: CD-ROM PIONEER DVD-ROM DVD-305 1.03 PQ: 0 ANSI: 2
target0:0:4: Beginning Domain Validation
target0:0:4: FAST-20 SCSI 20.0 MB/s ST (50 ns, offset 16)
target0:0:4: Domain Validation skipping write tests
target0:0:4: Ending Domain Validation
scsi 0:0:5:0: CD-ROM PLEXTOR CD-R PX-W1210S 1.01 PQ: 0 ANSI: 2
target0:0:5: Beginning Domain Validation
target0:0:5: FAST-20 SCSI 20.0 MB/s ST (50 ns, offset 16)
target0:0:5: Domain Validation skipping write tests
target0:0:5: Ending Domain Validation
scsi 0:0:6:0: Direct-Access IBM DDYS-T36950N S96H PQ: 0 ANSI: 3
scsi0:A:6:0: Tagged Queuing enabled. Depth 200
target0:0:6: Beginning Domain Validation
target0:0:6: wide asynchronous
target0:0:6: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 63)
target0:0:6: Ending Domain Validation
SCSI device sda: 71687340 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
sda: Mode Sense: cb 00 00 08
SCSI device sda: drive cache: write back
SCSI device sda: 71687340 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
sda: Mode Sense: cb 00 00 08
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3 sda4
sd 0:0:6:0: Attached scsi disk sda
sr0: scsi3-mmc drive: 16x/40x cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 0:0:4:0: Attached scsi CD-ROM sr0
sr1: scsi3-mmc drive: 32x/32x writer cd/rw xa/form2 cdda tray
sr 0:0:5:0: Attached scsi CD-ROM sr1
sr 0:0:4:0: Attached scsi generic sg0 type 5
sr 0:0:5:0: Attached scsi generic sg1 type 5
sd 0:0:6:0: Attached scsi generic sg2 type 0
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
EDAC MC: Ver: 2.0.1 Nov 24 2006
TCP cubic registered
input: AT Translated Set 2 keyboard as /class/input/input0
Initializing XFRM netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Starting balanced_irq
Using IPI Shortcut mode
Time: acpi_pm clocksource has been installed.
input: ImExPS/2 Generic Explorer Mouse as /class/input/input1
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 224k freed
Write protecting the kernel read-only data: 384k
Adding 763076k swap on /dev/sda3. Priority:-1 extents:1 across:763076k
EXT3 FS on sda1, internal journal
ACPI: PCI Interrupt 0000:04:05.0[A] -> GSI 20 (level, low) -> IRQ 20
Linux agpgart interface v0.101 (c) Dave Jones
ReiserFS: sda2: found reiserfs format “3.6” with standard journal
ReiserFS: sda2: using ordered data mode
ReiserFS: sda2: journal params: device sda2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
ReiserFS: sda2: checking transaction log (sda2)
ReiserFS: sda2: Using r5 hash to sort names
ReiserFS: sda4: found reiserfs format “3.6” with standard journal
ReiserFS: sda4: using ordered data mode
ReiserFS: sda4: journal params: device sda4, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
ReiserFS: sda4: checking transaction log (sda4)
ReiserFS: sda4: Using r5 hash to sort names
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
(scsi0:A:4:0): No or incomplete CDB sent to device.
scsi0: Issued Channel A Bus Reset. 7 SCBs aborted
target0:0:6: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 63)
target0:0:4: FAST-20 SCSI 20.0 MB/s ST (50 ns, offset 16)

On Wednesday 22 November 2006 12:07, wrote:
> On Wed, Nov 22 2006, Jesper Juhl wrote:
>
> Does the box survive io intensive workloads?
It seems to. It does get sluggish as hell when there is lots of disk I/O but
it seems to be able to survive.
I’ll try some more, with some IO benchmarks + various other stuff to see
if I can get it to die that way.

> Have you tried using net or
> serial console to see if it spits out any info before it crashes?
Lacking a second box at the moment, so that’s not an option currently


> I
> would not be too surprised if it’s the aic7xxx driver taking a dive, I’d
> be a lot more surprised if it’s actually the bouncing (I don’t think you
> do any, can you post cat /proc/meminfo | grep -i bounce on that box?) or
> a generic vm/block bug causing you problems.
>

$ cat /proc/meminfo | grep -i bounce
Bounce: 0 kB
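
The same check can be done from a program. The sketch below is only an illustration: the function names `bounce_kb_from_text` and `bounce_kb` are hypothetical, and it assumes the `Bounce:` line has the format shown above (a label followed by a number in kB).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan meminfo-style text line by line and return the
 * value on the "Bounce:" line in kB, or -1 if no such line is found.
 */
long bounce_kb_from_text(const char *text)
{
	const char *line = text;

	while (line != NULL && *line != '\0') {
		long kb;

		if (strncmp(line, "Bounce:", 7) == 0 &&
		    sscanf(line + 7, "%ld", &kb) == 1)
			return kb;

		line = strchr(line, '\n');
		if (line != NULL)
			line++;		/* step past the newline to the next line */
	}
	return -1;
}

/* Read /proc/meminfo (Linux only) and report the Bounce value, or -1. */
long bounce_kb(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/meminfo", "r");

	if (f == NULL)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return bounce_kb_from_text(buf);
}
```

The parsing is split from the file reading so that it can be tried on a pasted copy of the output as well as on a live system.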

On 22/11/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Wed, 22 Nov 2006, Jesper Juhl wrote:
>
> Does it happen if you just do some simple “use all memory” script, eg run
> a few copies of
>
> #define SIZE (100<<20)
>
> char *buf = malloc(SIZE);
> memset(buf, SIZE, 0);
> sleep(100);
>
> on your box?
>
No. That doesn’t kill the box. It very effectively turns it into a
slug (bigtime) but it doesn’t kill it.

Running just a single copy is no problem. Neither is running 4 or 5 in
parallel.

Doing

    for i in $(seq 1 30); do ./a.out & done

turns the box into a slug for 5 minutes or so, but then when all the
processes have terminated and another few minutes have passed it is
back to normal.

Running

    for i in $(seq 1 100); do ./a.out & done

is a different story though. Starting the first ~40 processes happens
relatively fast, then starting the next 10-20 or so happens very
slowly (5-10 sec intervals between each one), then it starts taking
something like 20-30 seconds for each new process to start and when we
get somewhere around 75-85 processes started the box appears to be
hung, except that sysrq still works and I can still switch ttys with
ctrl+alt+F?. After a few minutes in this almost-hung state the OOM
killer kicks in and kills a few of the processes and after some
additional minutes all 100 processes eventually get started and
sometimes a few have even started to die off as well.

Once all 100 processes have been started it takes somewhere around
5-10 minutes for them all to terminate (most terminate normally, some
die with “segmentation fault” and they die off roughly in the order
they got started). The biggest problem after all processes have
terminated is then that the box remains a slug. I left it alone for
~10 minutes at this point and when I came back it was still not back
to normal (and trying to do a normal reboot took so long that I
eventually lost my patience and used sysrq+b to boot it).

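
For anyone wanting to repeat this kind of run without a shell loop, here is a minimal sketch that starts the memory hogs from one parent with fork(). The function name `run_hogs` and its parameters are invented for illustration; the thread itself only uses `./a.out` in a `for` loop. Unlike the snippet in the thread, the child checks malloc() for failure, so an allocation failure shows up as a nonzero exit rather than a segmentation fault.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start up to `n` child processes that each allocate
 * and zero `size` bytes, hold them for `hold_seconds`, then exit.
 * Returns how many children exited with status 0.
 */
int run_hogs(int n, size_t size, unsigned hold_seconds)
{
	int started = 0, ok = 0;

	for (int i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;			/* fork failed: stop starting more */
		if (pid == 0) {
			char *buf = malloc(size);

			if (buf == NULL)
				_exit(1);	/* report allocation failure */
			/* Write to the whole buffer, as in the thread's snippet
			 * (build with -O0 so the compiler keeps the write). */
			memset(buf, 0, size);
			sleep(hold_seconds);
			_exit(0);
		}
		started++;
	}

	/* Reap every child that was started and count clean exits. */
	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			ok++;
	}
	return ok;
}
```

Something like `run_hogs(100, 100 << 20, 100)` would correspond to the 100-process run described above, with the same warning that it can make a box unusable for a long time.
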
On Fri, Nov 24 2006, Jesper Juhl wrote:
>
> It seems to. It does get sluggish as hell when there is lots of disk I/O but
> it seems to be able to survive.
> I’ll try some more, with some IO benchmarks + various other stuff to see
> if I can get it to die that way.
Just wondering if you have a marginal power supply, perhaps.

>
> Lacking a second box at the moment, so that’s not an option currently
It’s likely a requirement to get any further with this issue, I’m
afraid. Nobody can debug this thing blindfolded.

On 24/11/06, <> wrote:
> On Fri, Nov 24 2006, Jesper Juhl wrote:
>
> Just wondering if you have a marginal powersupply, perhaps.
>

It is a possibility, but I doubt it, since if I use a 2.6.17.x kernel
then things are rock solid and I can’t cause a lockup even if I leave
my box building kernels in the background for days.

>
> It’s likely a requirement to get any further with this issue, I’m
> afraid. Nobody can debug this thing blind folded.
>

I’ll see if I can get my hands on a second box.

On Fri, Nov 24 2006, Jesper Juhl wrote:
> On 24/11/06, <> wrote:
> It is a possibility, but I doubt it, since if I use a 2.6.17.x kernel
> then things are rock solid and I can’t cause a lockup even if I leave
> my box building kernels in the background for days.
Since it triggers fairly quickly, any chance that you could try and
narrow it down to a specific version that breaks?

On 24/11/06, <> wrote:
> On Fri, Nov 24 2006, Jesper Juhl wrote:
>
> Since it triggers fairly quickly, any chance that you could try and
> narrow it down to a specific version that breaks?
>

I already tried doing a git bisect, but I somehow messed it up
(probably by concluding that a bad kernel was good).
The problem is that it *usually* triggers fairly quickly (within 1hr),
but sometimes it takes much longer to trigger, so it’s hard to be 100%
sure that a kernel is actually good – except if I leave it running for
something like 24hrs for each step in the bisect. That is actually
something I plan to do, but finding the time for that is not easy.