Web Server forum
Back To The Forum Home!Search!Private Messaging System

This is Interesting: Free IT Magazines Now Free shipping to   
Web Server Talk Web Server Talk > Unix and Linux reviews > Linux support forum > Linux Kernel > [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage




Pages (4): [1] 2 3 4 »   Last Thread   Next Thread Next
  Show Printable Version Email this Page Subscribe to this Thread      Post New Thread    Post A Reply      

    [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Marcelo Tosatti


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-05-04 10:46 PM

Hi,

As you know the OOM is very problematic in 2.6 right now - so I went
to investigate it.

Currently the oom killer is invoked from the task reclaim
code (try_to_free_pages), which IMO is fundamentally broken,
because its non deterministic - the chance the OOM killer
will be triggered increases as the number of tasks inside
reclaiming increases. And kswapd is freeing pages in parallel,
which is completly ignored by this approach.

In my opinion the correct approach is to trigger the OOM killer
when kswapd is unable to free pages. Once that is done, the number
of tasks inside page reclaim is irrelevant.

So the following patch moves the out_of_memory() call to
balance_pgdat(), and makes it dependant on success reclaiming pages
when work has actually been done, and on priority reaching zero.

It also removes the "If it's been a long time since last failure dont
OOM kill" logic which for me just tries to paper over a bigger issue.

Relying on this information (kswapd failure after DEF_PRIORITY passes)
to trigger the OOM killer seems to be very reliable - it needs some
more testing though.

With this in place I can't see spurious OOM kills - just need to guarantee
that it reliably OOM kills when we are really out of memory.

While doing this, I noticed that kswapd will happily go to sleep
if all zones have all_unreclaimable set. I bet this is the reason
for the page allocation failures we are seeing. So the patch
also makes balance_pgdat() NOT return and go to "loop_again"
instead in case of page shortage - even if all_unreclaimable is set.

Basically the "loop_again" logic IS NOT WORKING!

Comments?

My wife is almost killing me, its Friday night and I've been telling her
"just another minute" for hours. Have to run.

diff -Nur --show-c-function --exclude='*.orig' linux-2.6.10-rc1-mm2.orig/mm/
oom_kill.c linux-2.6.10-rc1-mm2/mm/oom_kill.c
--- linux-2.6.10-rc1-mm2.orig/mm/oom_kill.c	2004-11-04 22:50:50.000000000 -0
200
+++ linux-2.6.10-rc1-mm2/mm/oom_kill.c	2004-11-05 18:33:29.918459072 -0200
@@ -240,23 +240,23 @@ void out_of_memory(int gfp_mask)
* If it's been a long time since last failure,
* we're not oom.
*/
-	if (since > 5*HZ)
-		goto reset;
+	//if (since > 5*HZ)
+	//	goto reset;

/*
* If we haven't tried for at least one second,
* we're not really oom.
*/
-	since = now - first;
-	if (since < HZ)
-		goto out_unlock;
+	//since = now - first;
+	//if (since < HZ)
+	//	goto out_unlock;

/*
* If we have gotten only a few failures,
* we're not really oom.
*/
-	if (++count < 10)
-		goto out_unlock;
+//	if (++count < 10)
+//		goto out_unlock;

/*
* If we just killed a process, wait a while
diff -Nur --show-c-function --exclude='*.orig' linux-2.6.10-rc1-mm2.orig/mm/
vmscan.c linux-2.6.10-rc1-mm2/mm/vmscan.c
--- linux-2.6.10-rc1-mm2.orig/mm/vmscan.c	2004-11-04 22:50:50.000000000 -020
0
+++ linux-2.6.10-rc1-mm2/mm/vmscan.c	2004-11-05 19:53:35.915836416 -0200
@@ -952,8 +952,6 @@ int try_to_free_pages(struct zone **zone
if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
blk_congestion_wait(WRITE, HZ/10);
}
-	if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
-		out_of_memory(gfp_mask);
out:
for (i = 0; zones[i] != 0; i++) {
struct zone *zone = zones[i];
@@ -997,13 +995,15 @@ static int balance_pgdat(pg_data_t *pgda
int all_zones_ok;
int priority;
int i;
-	int total_scanned, total_reclaimed;
+	int total_scanned, total_reclaimed, worked;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;

+
loop_again:
total_scanned = 0;
total_reclaimed = 0;
+	worked = 0;
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = 0;
sc.nr_mapped = read_page_state(nr_mapped);
@@ -1033,6 +1033,10 @@ loop_again:
if (zone->present_pages == 0)
continue;

+				if (!zone_watermark_ok(zone, order,
+						zone->pages_high, 0, 0, 0))
+					all_zones_ok = 0;
+
if (zone->all_unreclaimable &&
priority != DEF_PRIORITY)
continue;
@@ -1072,6 +1076,9 @@ scan:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

+			if (priority == 0)
+				worked = 1;
+
if (nr_pages == 0) {	/* Not software suspend */
if (!zone_watermark_ok(zone, order,
zone->pages_high, end_zone, 0, 0))
@@ -1129,6 +1136,9 @@ out:
zone->prev_priority = zone->temp_priority;
}
if (!all_zones_ok) {
+		if (priority == 0 && !total_reclaimed && worked)
+			out_of_memory(GFP_KERNEL);
+
cond_resched();
goto loop_again;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Jesse Barnes


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-05-04 10:46 PM

On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> Hi,
>
> As you know the OOM is very problematic in 2.6 right now - so I went
> to investigate it.
>
> Currently the oom killer is invoked from the task reclaim
> code (try_to_free_pages), which IMO is fundamentally broken,
> because its non deterministic - the chance the OOM killer
> will be triggered increases as the number of tasks inside
> reclaiming increases. And kswapd is freeing pages in parallel,
> which is completly ignored by this approach.
>
> In my opinion the correct approach is to trigger the OOM killer
> when kswapd is unable to free pages. Once that is done, the number
> of tasks inside page reclaim is irrelevant.

That makes sense.

> With this in place I can't see spurious OOM kills - just need to guarantee
> that it reliably OOM kills when we are really out of memory.

That's good.  I can test it on a large machine (hopefully next week).

> Comments?

Sounds good, though we may want to do a couple of more things, we shouldn't
kill root tasks quite as easily and we should avoid zombies since they may b
e
large apps in the process of exiting, and killing them would be bad (iirc
it'll cause a panic).

Thanks,
Jesse






[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Andrea Arcangeli


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 01:45 AM

On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote:
> On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote: 
>
> That makes sense.

I don't like it, kswapd may fail balancing because there's a GFP_DMA
allocation that eat the last dma page, but we should not kill tasks if
we fail to balance in kswapd, we should kill tasks only when no fail
path exists (i.e. only during page faults, everything else in the kernel
has a fail path and it should never trigger oom).

If you move it in kswapd there's no way to prevent oom-killing from a
syscall allocation (I guess even right now it would go wrong in this
sense, but at least right now it's more fixable). I want to move the oom
kill outside the alloc_page paths. The oom killing is all about the page
faults not having a fail path, and in turn the oom killing should be
moved in the page fault code, not in the allocator. Everything else
should keep returning -ENOMEM to the caller.

So to me moving the oom killer into kswapd looks a regression.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Jesse Barnes


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 01:45 AM

On Friday, November 05, 2004 5:26 pm, Nick Piggin wrote: 
>
> Probably a good idea. OTOH, some kernel allocations might really
> need to be performed and have no failure path. For example __GFP_REPEAT.

Ah, I see what you're saying, yes, that makes even more sense 

> I think maybe __GFP_REPEAT allocations at least should be able to
> cause an OOM. Not sure though.
> 
>
> Also, I think it would do the wrong thing on NUMA machines because
> that has a per-node kswapd.

Yep, Andrea's explaination is clear, I just had to read it a few times.
Anyway, the fixes I posted are still necessary I think.

Thanks,
Jesse
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Andrea Arcangeli


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 01:45 AM

On Sat, Nov 06, 2004 at 12:26:57PM +1100, Nick Piggin wrote:
> need to be performed and have no failure path. For example __GFP_REPEAT.

all allocations should have a failure path to avoid deadlocks. But in
the meantime __GFP_REPEAT is at least localizing the problematic places ;)

> I think maybe __GFP_REPEAT allocations at least should be able to
> cause an OOM. Not sure though.

probably it should because this is also a case where no fail path exists.

My point was only that when a fail path exists, it's more reliable not
to invoke the oom killer and let userspace handle the failure.

> Also, I think it would do the wrong thing on NUMA machines because
> that has a per-node kswapd.

yep.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Nikita Danilov


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 12:45 PM

Andrea Arcangeli writes:
> On Sat, Nov 06, 2004 at 12:26:57PM +1100, Nick Piggin wrote: 
>
> all allocations should have a failure path to avoid deadlocks. But in

This is not currently possible for a complex operation that allocates
multiple pages and has always complete as a whole.

We need page-reservation API of some sort. There were several attempts
to introduce this, but none get into mainline.

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Marcelo Tosatti


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 12:45 PM

On Sat, Nov 06, 2004 at 02:20:18AM +0100, Andrea Arcangeli wrote:
> On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote: 

Hi Andrea,
[vbcol=seagreen]
> I don't like it, kswapd may fail balancing because there's a GFP_DMA
> allocation that eat the last dma page, but we should not kill tasks if
> we fail to balance in kswapd, we should kill tasks only when no fail
> path exists (i.e. only during page faults, everything else in the kernel
> has a fail path and it should never trigger oom).

The OOM killer is only going to get triggered if kswapd is not able
to make _any_ progress in all zones.  So it wont "fail balancing because the
re's
a GFP_DMA allocation that eat the last dma page".

As long as frees _one_ page during all passes from DEF_PRIORITY till priorit
y=0,
it wont kill any task. See?

I dont get your point.

> If you move it in kswapd there's no way to prevent oom-killing from a
> syscall allocation (I guess even right now it would go wrong in this
> sense, but at least right now it's more fixable).

I dont understand what you mean. "prevent oom-killing from a syscall allocat
ion" ?

>  I want to move the oom
> kill outside the alloc_page paths. The oom killing is all about the page
> faults not having a fail path, and in turn the oom killing should be
> moved in the page fault code, not in the allocator. Everything else
> should keep returning -ENOMEM to the caller.

Isnt OOM killing all about the reclaiming efforts not being able to make pro
gress?

> So to me moving the oom killer into kswapd looks a regression.

To me having tasks trigger the OOM kill is fundamentally broken
because it doesnt take into account kswapd page freeing
efforts which are in-progress at the very moment.

That makes senses a lot of sense to me - would love to be proved
wrong.

See, its completly screwed right now. The code inside out_of_memory()
which only triggers OOM if it has happened several times during the
past few seconds is horrible and shows how bad it is.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Marcelo Tosatti


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 12:45 PM

On Sat, Nov 06, 2004 at 12:26:57PM +1100, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
> 
>
> Probably a good idea. OTOH, some kernel allocations might really
> need to be performed and have no failure path. For example __GFP_REPEAT.
>
> I think maybe __GFP_REPEAT allocations at least should be able to
> cause an OOM. Not sure though.

As I said in my answer to Andrea, OOM killing is about allocations not being
able to succeed (aka VM not being able to free pages).

kswapd is freeing pages, he is the one who knows about progress.
 
>
> Also, I think it would do the wrong thing on NUMA machines because
> that has a per-node kswapd.

Right, we need to handle the NUMA case correctly (we need to which does "don
t kill if other nodes
have available memory").

But still, it looks the right thing to do to me.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Marcelo Tosatti


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 12:45 PM

On Sat, Nov 06, 2004 at 02:55:50AM +0100, Thomas Gleixner wrote:
> On Sat, 2004-11-06 at 02:20 +0100, Andrea Arcangeli wrote: 
>
> My point is not where oom-killer is triggered. My point is the decision
> criteria of oom-killer, when it is finally invoked, which process to
> kill. That's kind of independend of your patch. Your patch corrects the
> context in which oom-killer is called. My concern is that the decision
> critrion which process should be killed is not sufficient. In my case it
> kills sshd instead of a process which forks a bunch of child processes.
> Thats just wrong, because it takes away the chance to log into the
> machine remotely and fix the problem.

Hi Thomas,

Yes your patches are correct and needed independantly of where OOM killer
is triggered from.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage  
Andrea Arcangeli


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-06-04 10:45 PM

On Sat, Nov 06, 2004 at 09:47:56AM +0000, Hugh Dickins wrote:
> Problematic, yes: don't overlook that GFP_REPEAT and GFP_NOFAIL _can_
> fail, returning NULL: when the process is being OOM-killed (PF_MEMDIE).

that looks weird, why that? The oom killer must be robust against a task
not going anyway regardless of this (task can be stuck in nfs or
similar). If a fail path ever existed, __GFP_NOFAIL should not have been
used in the first place. I don't see many valid excuses to use
__GFP_NOFAIL if we can return NULL without the caller running into an
infinite loop.

btw, PF_MEMDIE has always been racy in the way it's being set, so it can
corrupt the p->flags, but the race window is very small to trigger it
(and even if it triggers, it probably wouldn't be fatal). That's why I
don't use PF_MEMDIE in 2.4-aa.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





[ Post a follow-up to this message ]



    Sponsored Links  




 





   All times are GMT. The time now is 12:02 AM.      Post New Thread    Post A Reply      
Pages (4): [1] 2 3 4 »   Last Thread   Next Thread Next


Most Popular forums 

Forum Jump:
Rate This Thread:

Forum Rules:
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is OFF
vB code is ON
Smilies are ON
[IMG] code is OFF
 

Back To The Top
Home | Usercp | Faq | Register