Reading code of a function?
| Michael B Allen 2007-01-17, 1:33 am |
| Is there any way to reliably read the actual code of a function?
If the function is statically linked you can simply access the symbol. For
example, I can 'hexdump' the hash_str code like:
hexdump(stdout, hash_str, 128, 16);
output:
00000: 55 89 e5 83 ec 0c c7 45 fc 05 15 00 00 8b 45 08 |U......E......E.|
00010: 89 45 f8 83 7d 0c 00 74 09 8b 45 08 03 45 0c 89 |.E..}..t..E..E..|
....
Looking at objdump -d I can confirm the above is indeed correct.
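The hexdump() helper called above is not shown in the thread. A minimal sketch of what such a helper might look like follows. The name hexdump_line and the exact row layout are assumptions modelled on the output above, not Mike's actual code. Note also that passing a function where a `const void *` is expected relies on POSIX behaviour; ISO C does not guarantee the conversion.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Format one row into `out`: 5-digit hex offset, `width` hex columns
 * (blank-padded when n < width), then the printable-ASCII view.
 * Returns the row length, or -1 if `out` is too small. */
int hexdump_line(char *out, size_t outsz, const unsigned char *p,
                 size_t off, size_t n, size_t width)
{
    size_t i, pos;
    int k = snprintf(out, outsz, "%05zx: ", off);

    if (k < 0 || (size_t)k >= outsz)
        return -1;
    pos = (size_t)k;
    for (i = 0; i < width; i++) {
        if (outsz - pos < 4)
            return -1;
        if (i < n)
            snprintf(out + pos, outsz - pos, "%02x ", (unsigned)p[i]);
        else
            memcpy(out + pos, "   ", 4);    /* three spaces plus NUL */
        pos += 3;
    }
    if (outsz - pos < n + 3)                /* '|' + n chars + '|' + NUL */
        return -1;
    out[pos++] = '|';
    for (i = 0; i < n; i++)
        out[pos++] = isprint(p[i]) ? (char)p[i] : '.';
    out[pos++] = '|';
    out[pos] = '\0';
    return (int)pos;
}

/* Dump `len` bytes starting at `src`, `width` bytes per row. */
int hexdump(FILE *fp, const void *src, size_t len, size_t width)
{
    const unsigned char *p = src;
    char row[512];
    size_t off;

    assert(width > 0);
    for (off = 0; off < len; off += width) {
        size_t n = len - off < width ? len - off : width;
        if (hexdump_line(row, sizeof row, p + off, off, n, width) < 0)
            return -1;
        fprintf(fp, "%s\n", row);
    }
    return 0;
}
```

Splitting the row formatter from the output loop keeps the formatting testable without touching a FILE stream.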
However, if the symbol is *dynamically* linked, the hexdump output yields
repetitive garbage:
00000: ff 25 84 c5 04 08 68 18 01 00 00 e9 b0 fd ff ff |.%....h.........|
00010: ff 25 88 c5 04 08 68 20 01 00 00 e9 a0 fd ff ff |.%....h ........|
....
Any suggestions?
Thanks,
Mike
| |
| Paul Pluzhnikov 2007-01-17, 1:33 am |
| Michael B Allen <mba2000@ioplex.com> writes:
> Is there any way to reliably read the actual code of a function?
Sure: if the processor can read it, so can you.
> Any suggestions?
Your question is not very clear (at least not to me).
In particular, it is difficult to understand what you are talking
about here:
> However, if the symbol is *dynamically* linked, the hexdump output yields
> repetitive garbage:
>
> 00000: ff 25 84 c5 04 08 68 18 01 00 00 e9 b0 fd ff ff |.%....h.........|
Surely you are not expecting to find any executable code at offset
0 in the object file? If the offset 0 was "just an example", what
offset did you *actually* use (and how did you arrive at it)?
Perhaps you should ask your question again, after reading this:
http://catb.org/~esr/faqs/smart-questions.html
Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
| |
| Alan Curry 2007-01-17, 1:33 am |
| In article <pan.2007.01.17.02.06.37.432404@ioplex.com>,
Michael B Allen <mba2000@ioplex.com> wrote:
>
>However, if the symbol is *dynamically* linked, the hexdump output yields
>repetitive garbage:
>
>00000: ff 25 84 c5 04 08 68 18 01 00 00 e9 b0 fd ff ff |.%....h.........|
>00010: ff 25 88 c5 04 08 68 20 01 00 00 e9 a0 fd ff ff |.%....h ........|
>
Looks like you got a dump of the PLT entry. It's not garbage, it's executable
code and it's part of the dynamic linking process. It's repetitive because
there's an entry for every dynamically linked function, and each entry is
very short (16 bytes in your example).
>Any suggestions?
Disassemble those first few bytes, and look at what they do. The first time
that location is executed, it calls the dynamic linker to find the function
of the proper name and jump into it. It also stores the result so that the
next time the PLT entry is executed, it jumps directly to the function
without going through the lookup again.
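To make the disassembly suggestion concrete, here is a small sketch that decodes one 16-byte entry by hand. It assumes the conventional i386 layout visible in the dump (`ff 25` = indirect jmp through an absolute address, `68` = push imm32, `e9` = jmp rel32); the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of one 16-byte i386 PLT entry. */
struct plt_entry {
    uint32_t got_slot;      /* address the first jmp reads its target from */
    uint32_t reloc_offset;  /* value pushed for the dynamic linker */
    int32_t  jmp_rel;       /* displacement of the final jmp */
};

/* Read a 32-bit little-endian value. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode `ff 25 <abs32>  68 <imm32>  e9 <rel32>`; returns 0 on success,
 * -1 if the bytes do not match that pattern. */
int decode_i386_plt(const unsigned char e[16], struct plt_entry *out)
{
    uint32_t rel;

    assert(e != NULL && out != NULL);
    if (e[0] != 0xff || e[1] != 0x25 || e[6] != 0x68 || e[11] != 0xe9)
        return -1;
    out->got_slot = le32(e + 2);
    out->reloc_offset = le32(e + 7);
    rel = le32(e + 12);
    /* Convert two's complement to a signed value without relying on
     * an implementation-defined cast. */
    out->jmp_rel = (rel & 0x80000000u) ? -(int32_t)(~rel) - 1 : (int32_t)rel;
    return 0;
}
```

Fed the first row of the dump, this yields a slot address of 0x0804c584, a pushed value of 0x118 and a displacement of -0x250. The second row gives 0x0804c588, 0x120 and -0x260: the entry sits 16 bytes later and jumps 16 bytes further back, so both final jumps land on the same address.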
If you want to make a program that can inspect itself as easily as with gdb,
you're in for a lot of work.
--
Alan Curry
pacman@world.std.com
| |
| Michael B Allen 2007-01-17, 1:17 pm |
| On Wed, 17 Jan 2007 07:18:10 +0000, Alan Curry wrote:
> In article <pan.2007.01.17.02.06.37.432404@ioplex.com>,
> Michael B Allen <mba2000@ioplex.com> wrote:
>
> Looks like you got a dump of the PLT entry. It's not garbage, it's executable
> code and it's part of the dynamic linking process. It's repetitive because
> there's an entry for every dynamically linked function, and each entry is
> very short (16 bytes in your example).
Well, I need the actual .text of the function such that I can copy it
into a buffer, cast it into a function pointer and be able to call it.
>
> Disassemble those first few bytes, and look at what they do. The first time
> that location is executed, it calls the dynamic linker to find the function
> of the proper name and jump into it. It also stores the result so that the
> next time the PLT entry is executed, it jumps directly to the function
> without going through the lookup again.
>
> If you want to make a program that can inspect itself as easily as with gdb,
> you're in for a lot of work.
Not quite what I need. Perhaps I should explain a little further.
I have a data structure in shared memory being accessed by multiple
processes. This structure represents an ADT that uses a hash function
supplied by the user when the ADT is initialized. However, because a
pointer in one process does not necessarily have the same value within
another I cannot simply store a pointer to the hash function within the
structure. Instead, I copy the hash function's .text into shared memory
and store its offset relative to the beginning of the shared mem.
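The offset idea itself is sound for plain data, as this minimal sketch shows. The helper names are hypothetical, and the sketch covers only data living inside the segment; it says nothing about whether copied machine code will run.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a pointer inside a mapping to an offset from the mapping's base. */
size_t shm_to_offset(const void *base, const void *p)
{
    assert((const char *)p >= (const char *)base);
    return (size_t)((const char *)p - (const char *)base);
}

/* Convert a stored offset back to a pointer valid in *this* process,
 * whatever address the segment happens to be mapped at here. */
void *shm_from_offset(void *base, size_t off)
{
    return (char *)base + off;
}
```

Each process calls shm_from_offset() with its own base address, so the value stored in the segment never depends on where any one process mapped it.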
Yeah, I could pass a pointer to the hash function as a parameter every
time a process calls one of the ADT functions but that would be
pretty ugly.
Anyway, I have found a method that seems to work - dlsym returns the
.text of the function. For now I suppose I'm satisfied with that but
clearly I'll have to tweak things when porting to different platforms.
Mike
| |
| Alan Curry 2007-01-17, 7:28 pm |
| In article <pan.2007.01.17.15.59.04.967179@ioplex.com>,
Michael B Allen <mba2000@ioplex.com> wrote:
>On Wed, 17 Jan 2007 07:18:10 +0000, Alan Curry wrote:
>
>I have a data structure in shared memory being accessed by multiple
>processes. This structure represents an ADT that uses a hash function
>supplied by the user when the ADT is initialized. However, because a
>pointer in one process does not necessarily have the same value within
>another I cannot simply store a pointer to the hash function within the
>structure. Instead, I copy the hash function's .text into shared memory
>and store its offset relative to the beginning of the shared mem.
What if the hash function calls a helper function that isn't part of your
clever scheme? Aren't you back where you started, with different addresses in
different processes? The PLT is just a particular quirky case of this, a
small wrapper function that locates and calls another function.
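A worked example of why a copied call goes astray, assuming x86 and the common 5-byte `call rel32` encoding (opcode e8). The addresses are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a 5-byte x86 `call rel32` located at insn_addr: the
 * displacement is relative to the address of the *next* instruction. */
uint64_t call_rel32_target(uint64_t insn_addr, int32_t rel)
{
    return insn_addr + 5 + (uint64_t)(int64_t)rel;
}

/* Self-check: identical bytes call a different address once moved. */
void demo_copy_breaks_call(void)
{
    assert(call_rel32_target(0x1000, 0x20) == 0x1025); /* original location */
    assert(call_rel32_target(0x5000, 0x20) == 0x5025); /* after the copy */
}
```

The helper still lives at 0x1025, but the copied instruction now calls 0x5025, which is whatever happens to sit there in the destination buffer.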
>Anyway, I have found a method that seems to work - dlsym returns the
>.text of the function. For now I suppose I'm satisfied with that but
>clearly I'll have to tweak things when porting to different platforms.
dlsym sounds like the right answer for the immediate problem, but the whole
exercise still sounds ugly to me. How do you decide how many bytes to copy? I
hope you don't think that a compiled function necessarily ends at the first
ret instruction.
--
Alan Curry
pacman@world.std.com
| |
| Michael B Allen 2007-01-17, 7:28 pm |
| On Wed, 17 Jan 2007 21:19:10 +0000, Alan Curry wrote:
> In article <pan.2007.01.17.15.59.04.967179@ioplex.com>,
> Michael B Allen <mba2000@ioplex.com> wrote:
>
> What if the hash function calls a helper function that isn't part of your
> clever scheme? Aren't you back where you started, with different addresses in
> different processes? The PLT is just a particular quirky case of this, a
> small wrapper function that locates and calls another function.
Right. The hash function cannot and does not call other functions.
>
> dlsym sounds like the right answer for the immediate problem, but the whole
> exercise still sounds ugly to me. How do you decide how many bytes to copy? I
> hope you don't think that a compiled function necessarily ends at the first
> ret instruction.
Uh, right. Yeah, uh, I knew that. That's because uh, mmm ... err *why*
can't you get the size from objdump?
Mike
| |
| Paul Pluzhnikov 2007-01-18, 1:32 am |
| Michael B Allen <mba2000@ioplex.com> writes:
The function is also not allowed to access any global data, because
in PIC code such access is indirected, and the "global" pointer
will not be properly set up when function code is copied elsewhere.
> Right. The hash function cannot and does not call other functions.
On platforms that maintain separate '%gp' register this function
will not be able to access any non-immediate data at all. I think
PowerPC, PA-RISC, MIPS, and ia64 will all present a problem.
Yes, we told Michael that about 14 months ago:
http://groups.google.com/group/comp...4a35a238c9202fc
>
> err *why* can't you get the size from objdump?
Objdump doesn't understand some file formats at all, and often
gives you incorrect info on others.
You can get the size by examining disassembly, but you'll have
to repeat the exercise for each user-supplied function, for each
platform, each compiler, and each set of compilation flags.
And hope that the user didn't turn on compiler and linker optimizations
which could split a single function into several "chunks" and
scatter them all over the DSO (this actually happens a lot in
x64 DLLs compiled with VS 2005, but I haven't seen this on any
UNIX yet).
Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
| |
| Michael B Allen 2007-01-18, 1:32 am |
| Ok, I think I may have solved this problem. This solution may even
satisfy Paul :-)
The ADT (a hashmap) initialization routine could use dladdr to get the
name of the "shared function" (e.g. "hash_str") and place *it* in shared
memory. When the function needs to be called it uses dlsym on the name
to get the function. However, calling dlsym each time the function needs
to be resolved would be prohibitively slow so the function pointer would
have to be cached in a global table containing the function name and its
address. Because the global is not in shared memory each process will
have its own table with the correct address for that process as supplied
by dlsym. Searching the table will introduce a slight performance impact
but the table would only have a few entries (the number of unique shared
functions used throughout the program which for my current application
would be two).
I think that would yield acceptable performance and it would allow the
shared functions to call other functions, use globals, etc. I wouldn't
need to know the size of the .text or store anything architecture
specific.
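A rough sketch of the per-process cache described above. Everything here is hypothetical: the hash_fn signature is a guess, the table size is arbitrary, and the lookup takes a resolver callback so the example stays self-contained. In real use the resolver would wrap dlsym() and convert its result to a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long (*hash_fn)(const void *key, void *context);
typedef hash_fn (*resolver_fn)(const char *name);  /* e.g. a dlsym() wrapper */

#define FN_CACHE_MAX 8

struct fn_cache_entry {
    const char *name;   /* points at the name kept in the shared segment */
    hash_fn fn;         /* address valid in this process only */
};

/* One table per process: ordinary globals are not in shared memory. */
static struct fn_cache_entry fn_cache[FN_CACHE_MAX];
static size_t fn_cache_len;

hash_fn fn_cache_lookup(const char *name, resolver_fn resolve)
{
    size_t i;
    hash_fn fn;

    assert(name != NULL && resolve != NULL);
    for (i = 0; i < fn_cache_len; i++)
        if (strcmp(fn_cache[i].name, name) == 0)
            return fn_cache[i].fn;          /* hit: resolver not called */

    fn = resolve(name);
    if (fn != NULL && fn_cache_len < FN_CACHE_MAX) {
        fn_cache[fn_cache_len].name = name;
        fn_cache[fn_cache_len].fn = fn;
        fn_cache_len++;
    }
    return fn;
}

/* Stand-ins for demonstration only. */
int demo_resolve_calls;

unsigned long demo_hash(const void *key, void *context)
{
    (void)context;
    return (unsigned long)strlen(key);
}

hash_fn demo_resolve(const char *name)
{
    demo_resolve_calls++;
    return strcmp(name, "demo_hash") == 0 ? demo_hash : NULL;
}
```

With only a handful of entries, the linear strcmp() scan is cheap; the expensive symbol lookup happens once per name per process.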
Sound like a plan?
Mike
| |
| Logan Shaw 2007-01-18, 1:32 am |
| Michael B Allen wrote:
> Ok, I think I may have solved this problem. This solution may even
> satisfy Paul :-)
>
> The ADT (a hashmap) initialization routine could use dladdr to get the
> name of the "shared function" (e.g. "hash_str") and place *it* in shared
> memory. When the function needs to be called it uses dlsym on the name
> to get the function. However, calling dlsym each time the function needs
> to be resolved would be prohibitively slow so the function pointer would
> have to be cached in a global table containing the function name and its
> address. Because the global is not in shared memory each process will
> have its own table with the correct address for that process as supplied
> by dlsym. Searching the table will introduce a slight performance impact
> but the table would only have a few entries (the number of unique shared
> functions used throughout the program which for my current application
> would be two).
>
> I think that would yield acceptable performance and it would allow the
> shared functions to call other functions, use globals, etc. I wouldn't
> need to know the size of the .text or store anything architecture
> specific.
>
> Sound like a plan?
That's the basic direction I think I'd go, but with two changes:
(1) I would pass the name of the dynamic library to load with dlsym()
instead of the name of the function. The name of the function
would be a fixed part of the library's interface.
(2) I would store the list of library names in an array in the shared
memory area. If you do that, the abstract data type can use the
array index as the piece of information that identifies which
hash function to use. And, the various processes that share this
abstract data type can use the same index to find the pointer to the
function (the result of dlsym()) in their own array of pointers,
thus making lookup really fast.
- Logan
| |
| Paul Pluzhnikov 2007-01-18, 1:32 am |
| Logan Shaw <lshaw-usenet@austin.rr.com> writes:
> Michael B Allen wrote:
There are several gotchas with this solution ...
The first gotcha is that now every "client" process has to provide
exported function "hash_str".
The second gotcha is that these hash_str()s had better be identical or
at least compatible. If they are not, you'll have difficult-to-debug
ADT corruption.
Presumably you have some routine that all clients call to attach
to the ADT in shared memory. That is a good time to perform dlsym()
and store resulting pointer:
    typedef int (*HASHFN)(const char*);

    typedef struct {
        ...
        void *addr;
        HASHFN hashfunc;
    } ADT;

    ADT *attach()
    {
        ADT *p = malloc(sizeof(*p));
        p->addr = ...;  // attaches shmem
        p->hashfunc = (HASHFN)dlsym(...);
        return p;
    }
After that there is no need for any searching -- the function
pointer is "right there".
> That's the basic direction I think I'd go, but with two changes:
>
> (1) I would pass the name of the dynamic library to load with dlsym()
> instead of the name of the function. The name of the function
> would be a fixed part of the library's interface.
This addresses the hash_str() mismatch.
I would use *absolute* pathname to the library (which avoids
possibility that two clients load two different dynamic libraries
which are both named "hash.so", e.g. because they have different
LD_LIBRARY_PATH, and end up with incompatible hash_str() implementations).
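One way to get such an absolute pathname is POSIX realpath(). A minimal sketch, assuming the path is stored in a fixed-size field of the shared segment; the function name and parameters are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* for realpath() */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Resolve `path` to an absolute, symlink-free pathname and copy it into
 * `dst` (for example a field in the shared segment).
 * Returns 0 on success, -1 if the path cannot be resolved or does not fit. */
int store_library_path(char *dst, size_t dstsz, const char *path)
{
    char *abs;
    size_t len;

    assert(dst != NULL && path != NULL);
    abs = realpath(path, NULL);     /* buffer is malloc()ed by realpath */
    if (abs == NULL)
        return -1;
    len = strlen(abs);
    if (len >= dstsz) {
        free(abs);
        return -1;
    }
    memcpy(dst, abs, len + 1);
    free(abs);
    return 0;
}
```

Every client can then hand the stored string to dlopen() and be sure of opening the same file, whatever its own LD_LIBRARY_PATH says.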
> (2) I would store the list of library names in an array in the shared
> memory area.
I don't see the need for a list of shared libs, but perhaps I missed
something ...
Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
| |
| Michael B Allen 2007-01-18, 7:28 am |
| On Wed, 17 Jan 2007 23:04:06 -0600, Logan Shaw wrote:
> Michael B Allen wrote:
>
> That's the basic direction I think I'd go, but with two changes:
>
> (1) I would pass the name of the dynamic library to load with dlsym()
> instead of the name of the function. The name of the function
> would be a fixed part of the library's interface.
Well if the functions were part of the ADT then I wouldn't need to dlopen
or dlsym anything because I could just use integer constants to identify
them. The whole point was to allow users to supply arbitrary functions.
However, you have me thinking that given the hoops one must go through
to produce a function pointer that can be used by multiple processes I
suppose I should just use a predefined set of functions suitable for
use in shared mem.
For example the ADT initialization routine looks like:
    int
    hashmap_init(struct hashmap *h,
                 unsigned int load_factor,
                 hash_fn hash,
                 cmp_fn cmp,
                 void *context,
                 struct allocator *al)
    {
        ...
        h->hash = hash;
        h->cmp = cmp;
So now I have specific constants used to identify each function. I
think it is safe to assume that no other function will have an address
of 0x00000001 or 0x00000002:
#define HASHMAP_HASH_STR ((hash_fn)1)
#define HASHMAP_CMP_STR ((cmp_fn)2)
So process A calls the ADT initialization routine with these constants
instead of the function symbols:
    ret = hashmap_init(&h,
                       0,
                       HASHMAP_HASH_STR,
                       HASHMAP_CMP_STR,
                       NULL,
                       NULL);
Now, when process B calls an ADT function (e.g. hashmap_put) it simply
does a logical comparison to determine if the function is one of the
predefined ones:
    hash_fn hash = h->hash;
    if (hash == HASHMAP_HASH_STR)
        hash = hash_str; // proper hash_str for this proc
    val = hash(entry->key, h->context);
Mike
| |
| Logan Shaw 2007-01-19, 1:33 am |
| Paul Pluzhnikov wrote:
> Logan Shaw <lshaw-usenet@austin.rr.com> writes:
>
> I don't see the need for a list of shared libs, but perhaps I missed
> something ...
As I understood it, there was a need (or desire) to load new hash functions
at runtime and bind a given instance of the ADT to its own hash function.
Michael was considering having each process load its own copy of the
hash executable code with dlsym(), then registering the function by name
in each process and having the ADT refer to hash functions by name. As
far as I can see, this means that when you want to call the hash function
for a given ADT instance, you have to look up its name in some kind of
string dictionary.
I was suggesting that, instead, each function should get a slot in an array.
In the shared memory, there would be an array of function names, and the
ADT could refer to an index in that array. Each process would use the same
set of array indices to store the pointers to the copy of the hash function
it has loaded. Then when you want to get the pointer for the hash function
for a given ADT, you look in the ADT's struct in shared memory, get the
index, and the function lookup is a direct array lookup rather than a
lookup in a string dictionary.
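A sketch of that layout, under several assumptions: the struct and function names are invented, the sizes are arbitrary, the resolver callback stands in for a dlopen()/dlsym() wrapper, and locking around the shared table is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FNS   4
#define NAME_LEN 32

typedef unsigned long (*hash_fn)(const void *key, void *context);

/* Lives in the shared segment: names only, never addresses. */
struct shared_fn_names {
    size_t count;
    char names[MAX_FNS][NAME_LEN];
};

/* Lives in ordinary process memory: one pointer per shared slot. */
struct local_fn_table {
    hash_fn fns[MAX_FNS];
};

/* Register a name in the shared table; returns its slot index or -1. */
int register_fn_name(struct shared_fn_names *sh, const char *name)
{
    if (sh->count == MAX_FNS || strlen(name) >= NAME_LEN)
        return -1;
    strcpy(sh->names[sh->count], name);
    return (int)sh->count++;
}

/* Fetch slot `idx`, resolving the name on first use in this process.
 * After that the lookup is a plain array access. */
hash_fn get_fn(const struct shared_fn_names *sh, struct local_fn_table *lt,
               int idx, hash_fn (*resolve)(const char *name))
{
    assert(idx >= 0 && (size_t)idx < sh->count);
    if (lt->fns[idx] == NULL)
        lt->fns[idx] = resolve(sh->names[idx]);
    return lt->fns[idx];
}

/* Stand-ins for demonstration only. */
unsigned long demo_len_hash(const void *key, void *context)
{
    (void)context;
    return (unsigned long)strlen(key);
}

hash_fn demo_resolve_by_name(const char *name)
{
    return strcmp(name, "demo_len_hash") == 0 ? demo_len_hash : NULL;
}
```

The ADT instance stores only the integer index, which means the same thing in every process; each process fills in its own local_fn_table lazily.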
- Logan
| |
| Michael B Allen 2007-01-19, 1:18 pm |
| On Fri, 19 Jan 2007 00:21:42 -0600, Logan Shaw wrote:
> I was suggesting that, instead, each function should get a slot in an array.
> In the shared memory, there would be an array of function names, and the
> ADT could refer to an index in that array. Each process would use the same
> set of array indices to store the pointers to the copy of the hash function
> it has loaded. Then when you want to get the pointer for the hash function
> for a given ADT, you look in the ADT's struct in shared memory, get the
> index, and the function lookup is a direct array lookup rather than a
> lookup in a string dictionary.
This isn't what I did but the index lookup part helped me realize the
solution.
I created a global table of function pointers (not in shared memory):
    hash_fn _hashmap_hash_fns[] = {
        NULL,
        hash_str,
        hash_wcs
    };
and constants to refer to the functions by index:
#define HASHMAP_HASH_STR ((hash_fn)1)
#define HASHMAP_HASH_WCS ((hash_fn)2)
and modified the ADT functions to intercept these:
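    /* uintptr_t (from <stdint.h>) keeps a real pointer from being
       truncated or turned negative by the cast */
    hash_fn hash = (uintptr_t)h->hash < _HASHMAP_NUM_HASH_FNS ?
            _hashmap_hash_fns[(uintptr_t)h->hash] : h->hash;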
Works like a champ. No copying .text in shared memory. No
dlopen/dlsym. Subroutines and globals are ok in the hash
functions. Problem thoroughly solved.
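For readers who want to try the approach, here is a self-contained sketch of it. It is not Mike's code: the hash_str body is a stand-in (his source is not shown), the hash_fn signature is assumed, the identifiers drop the leading underscore, and the comparison goes through uintptr_t. Converting between small integers and function pointers is implementation-defined, though it works on mainstream platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned long (*hash_fn)(const void *key, void *context);

/* Example string hash; stands in for the thread's hash_str. */
unsigned long hash_str(const void *key, void *context)
{
    const unsigned char *s = key;
    unsigned long h = 5381;

    (void)context;
    while (*s)
        h = h * 33 + *s++;
    return h;
}

/* Process-local table; slot 0 is unused so that NULL never matches. */
static const hash_fn hashmap_hash_fns[] = { NULL, hash_str };
#define HASHMAP_NUM_HASH_FNS \
    (sizeof hashmap_hash_fns / sizeof hashmap_hash_fns[0])

/* Small integer disguised as a function pointer; its value is the same
 * in every process, so it is safe to store in shared memory. */
#define HASHMAP_HASH_STR ((hash_fn)(uintptr_t)1)

/* Map a stored value to a callable pointer for this process. */
hash_fn hashmap_resolve_hash(hash_fn stored)
{
    uintptr_t v = (uintptr_t)stored;

    if (v != 0 && v < HASHMAP_NUM_HASH_FNS) {
        assert(hashmap_hash_fns[v] != NULL);
        return hashmap_hash_fns[v];
    }
    return stored;      /* NULL or a genuine, process-local pointer */
}
```

A process that created a private (non-shared) map can still pass a real function pointer; only the predefined constants are translated.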
Thanks for your help,
Mike