nm, awk, sort
I was working on exposing the posix_ related C functions in the Rust nix crate for use with a job control shell when I had a sudden thought: “What is the longest glibc function name?”
I do! But regardless, the idea has arrived and It Must Be Done. At first, I wondered if I could retrieve a list of symbols from the excellent Elixir Cross Referencer project, but a cursory browse through the site and source code revealed nothing trivial. Then I remembered the nm utility, and I have a copy of glibc handy on my machine:
$ fd "libc.so" /usr/lib
/usr/lib/libc.so
/usr/lib/libc.so.6
Let’s check it out!
$ nm "/usr/lib/libc.so.6"
nm: /usr/lib/libc.so.6: no symbols
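Apparently, according to this Stack Overflow post, nm needs a different flag here. The command below uses -D, which lists the dynamic symbols.
Let’s check it out! Since I want to eventually measure function names by length, I’ll also pass --without-symbol-versions so that version suffixes stay out of the names.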
$ nm -D --without-symbol-versions "/usr/lib/libc.so.6"
000000000003a020 T a64l
0000000000022466 T abort
...
0000000000108c00 T __xstat
0000000000108c00 T __xstat64
$ nm -D --without-symbol-versions "/usr/lib/libc.so.6" | wc -l
3038
Holy moly! Let’s clean this up a bit. Scrolling through the output, I see certain symbols I would like to not consider, like internal functions (e.g. _) and glibc-related constants (e.g. GLIBC_).
$ nm -SNIP- | awk '$NF ~ /^[^_]/ && $(NF - 1) ~ /^[^A]/'
000000000003a020 T a64l
0000000000022466 T abort
...
00000000001453d0 T xprt_register
0000000000145510 T xprt_unregister
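Here, I am retrieving lines where the last column (the symbol name) does not start with an underscore and where the second-to-last column (the symbol type) does not start with the letter A, which is the type nm reports for absolute symbols such as the GLIBC_ version markers.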
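For readers less used to awk, the same filter can be sketched in Python. This is only an illustration of the rule described above: the helper name keep_symbol is made up, and the GLIBC_2.2.5 sample line is an invented example of what an absolute symbol looks like, not copied from the output.

```python
def keep_symbol(line: str) -> bool:
    """Mirror the awk filter: keep a line only if the name (last column)
    does not start with '_' and the type (second-to-last column) does not
    start with 'A'."""
    fields = line.split()
    if len(fields) < 2:
        # Not a "type name" pair at all, so there is nothing to keep.
        return False
    symbol_type, name = fields[-2], fields[-1]
    return not name.startswith("_") and not symbol_type.startswith("A")


# Illustrative sample lines in the "address type name" shape nm prints.
sample = """\
000000000003a020 T a64l
0000000000108c00 T __xstat
0000000000000000 A GLIBC_2.2.5
00000000001453d0 T xprt_register
"""

kept = [line.split()[-1] for line in sample.splitlines() if keep_symbol(line)]
print(kept)
```

Splitting on whitespace and indexing from the end plays the role of awk's $NF and $(NF - 1), so lines with a missing address column are handled the same way.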
Since I’m already using awk, might as well use it to limit output to the last column as well as its length:
$ nm -SNIP- | awk '$NF ~ /^[^_]/ && $(NF - 1) ~ /^[^A]/ { print $NF, length($NF) }'
a64l 4
abort 5
...
xprt_register 13
xprt_unregister 15
All that’s left is to sort by the second column, which we can conveniently do with… sort!
Let’s also filter the output to remove duplicates, which are in the output of nm for some reason.
$ nm -SNIP- | awk -SNIP- | sort --key=2 --general-numeric-sort | uniq
abs 3
brk 3
...
posix_spawn_file_actions_addclosefrom_np 40
posix_spawn_file_actions_addtcsetpgrp_np 40
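The whole nm, awk, sort, uniq pipeline can also be sketched in Python, which is handy if you want the names in a list for further poking. Treat this as a rough equivalent rather than the method used above: the function names are invented for the example, the address on the abs sample line is made up, and ties are broken alphabetically here, which may not match the order sort picks in your locale.

```python
import subprocess


def names_by_length(nm_output: str) -> list[tuple[str, int]]:
    """Filter nm output like the awk step, drop duplicates like uniq,
    and order by name length like sort --key=2."""
    names = set()  # a set removes the duplicate names
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        symbol_type, name = fields[-2], fields[-1]
        if name.startswith("_") or symbol_type.startswith("A"):
            continue
        names.add(name)
    # Sort by length first, then alphabetically to get a stable order.
    return sorted(((n, len(n)) for n in names), key=lambda pair: (pair[1], pair[0]))


def read_libc_symbols(path: str = "/usr/lib/libc.so.6") -> str:
    """Run the same nm command as above; the default path is the one on my machine."""
    result = subprocess.run(
        ["nm", "-D", "--without-symbol-versions", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Small illustrative sample, including a duplicate and an internal symbol.
sample = """\
000000000003a020 T a64l
0000000000022466 T abort
0000000000022466 T abort
0000000000108c00 T __xstat
00000000000a0000 T abs
"""

for name, length in names_by_length(sample):
    print(name, length)
```

On a real system, names_by_length(read_libc_symbols()) would give the full list, and taking the first element of each pair is one way to build a plain list of function names.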
There we have it! The longest function name in glibc is posix_. What functions do we have with length greater than or equal to 25?
argp_program_version_hook 25
posix_spawnattr_getpgroup 25
posix_spawnattr_setpgroup 25
pthread_attr_getguardsize 25
pthread_attr_getstackaddr 25
pthread_attr_getstacksize 25
pthread_attr_setguardsize 25
pthread_attr_setstackaddr 25
pthread_attr_setstacksize 25
pthread_condattr_getclock 25
pthread_condattr_setclock 25
pthread_mutexattr_destroy 25
pthread_mutexattr_gettype 25
pthread_mutexattr_settype 25
register_printf_specifier 25
posix_spawnattr_getsigmask 26
posix_spawnattr_setsigmask 26
pthread_attr_getschedparam 26
pthread_attr_getsigmask_np 26
pthread_attr_setschedparam 26
pthread_attr_setsigmask_np 26
pthread_getattr_default_np 26
pthread_rwlockattr_destroy 26
pthread_rwlock_clockrdlock 26
pthread_rwlock_clockwrlock 26
pthread_rwlock_timedrdlock 26
pthread_rwlock_timedwrlock 26
pthread_setattr_default_np 26
pthread_attr_getaffinity_np 27
pthread_attr_getdetachstate 27
pthread_attr_getschedpolicy 27
pthread_attr_setaffinity_np 27
pthread_attr_setdetachstate 27
pthread_attr_setschedpolicy 27
pthread_barrierattr_destroy 27
pthread_condattr_getpshared 27
pthread_condattr_setpshared 27
pthread_mutexattr_getrobust 27
pthread_mutexattr_setrobust 27
pthread_mutex_consistent_np 27
obstack_alloc_failed_handler 28
pthread_attr_getinheritsched 28
pthread_attr_setinheritsched 28
pthread_mutexattr_getkind_np 28
pthread_mutexattr_getpshared 28
pthread_mutexattr_setkind_np 28
pthread_mutexattr_setpshared 28
pthread_mutex_getprioceiling 28
pthread_mutex_setprioceiling 28
posix_spawnattr_getschedparam 29
posix_spawnattr_getsigdefault 29
posix_spawnattr_setschedparam 29
posix_spawnattr_setsigdefault 29
posix_spawn_file_actions_init 29
program_invocation_short_name 29
pthread_kill_other_threads_np 29
pthread_mutexattr_getprotocol 29
pthread_mutexattr_setprotocol 29
pthread_rwlockattr_getkind_np 29
pthread_rwlockattr_getpshared 29
pthread_rwlockattr_setkind_np 29
pthread_rwlockattr_setpshared 29
posix_spawnattr_getschedpolicy 30
posix_spawnattr_setschedpolicy 30
pthread_barrierattr_getpshared 30
pthread_barrierattr_setpshared 30
pthread_mutexattr_getrobust_np 30
pthread_mutexattr_setrobust_np 30
posix_spawn_file_actions_adddup2 32
posix_spawn_file_actions_addopen 32
posix_spawn_file_actions_destroy 32
pthread_mutexattr_getprioceiling 32
pthread_mutexattr_setprioceiling 32
posix_spawn_file_actions_addclose 33
posix_spawn_file_actions_addchdir_np 36
posix_spawn_file_actions_addfchdir_np 37
posix_spawn_file_actions_addclosefrom_np 40
posix_spawn_file_actions_addtcsetpgrp_np 40
Hmm, posix_ / pthread dominance.
Not really, unfortunately. For people looking to reproduce the results, I have version 2.37 of glibc. Interestingly enough, the glibc functions with the longest names are non-portable GNU extensions to posix_. For people running older machines, the longest function name is likely to be posix_, which is part of the POSIX standard.
We can also look at the shortest functions:
tee 3
ftw 3
ffs 3
err 3
dup 3
div 3
brk 3
abs 3
Ping me if you manage to come up with a sentence using as many of these function names as you can.
data, data
Let’s look at some interesting (at least to me) data! Here’s a plot of function name length against count:
- Mean: 10.47
- Median: 9
- Mode: 7
- Variance: 30.21
- Standard Deviation: 5.50
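Numbers like these can be reproduced with Python's statistics module. The sketch below runs on a tiny made-up sample rather than the real list, and it assumes the population variance and standard deviation; I can't say from the figures alone whether the sample versions were used instead, and with a few thousand names the difference is small either way. The helper name length_summary is invented for the example.

```python
import statistics
from collections import Counter


def length_summary(names):
    """Summarise name lengths: a length -> count table plus basic statistics."""
    lengths = [len(name) for name in names]
    return {
        # How many names have each length, i.e. the data behind a length/count plot.
        "counts": dict(sorted(Counter(lengths).items())),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        # Population variants; swap in variance()/stdev() for the sample versions.
        "variance": statistics.pvariance(lengths),
        "stdev": statistics.pstdev(lengths),
    }


# Tiny illustrative sample, not the full glibc list.
sample = ["abs", "brk", "a64l", "abort", "xprt_register"]
summary = length_summary(sample)
for key, value in summary.items():
    print(key, value)
```

Feeding it the full list of function names instead of the sample should give figures comparable to the ones above.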
Speaking of Python, let’s do further analysis on the list of function names!
>>> import collections
>>> # words is the list of function names gathered earlier
>>> unique_letters = collections.defaultdict(list)
>>> for word in words:
...     letters = "".join(sorted(set(word)))
...     pair = (word, letters)
...     unique_letters[len(letters)].append(pair)
...
>>> max_unique_letters = unique_letters[max(unique_letters)]
>>> len(max_unique_letters[0][1])
18
>>> print("\n".join(str(t) for t in max_unique_letters))
('pthread_mutexattr_setprioceiling', '_acdeghilmnoprstux')
('posix_spawnattr_getschedpolicy', '_acdeghilnoprstwxy')
('pthread_mutex_setprioceiling', '_acdeghilmnoprstux')
With 18 unique characters (17 letters plus the underscore), these three functions share the prize for Function Name With The Most Unique Letters. Which does make sense! pthread, mutex, ceiling, policy all cover many different letters.
Wonder no more!
>>> import textwrap
>>> with open("/usr/share/dict/words") as f:
...     english_words = f.read().splitlines()
...
>>> valid_words = sorted(set(words).intersection(set(english_words)))
>>> print(textwrap.fill(" ".join(valid_words)))
abort abs accept access acct advance alarm atoll bind clock clone
close connect daemon daylight div err error exit finite flock fork
free gets glob index ioctl kill labs link listen login logout mount
nice open pause personality pipe poll puts raise rand random read
reboot remove rename revoke rewind select send shutdown signal sleep
socket splice stat step swab sync system tee time times timezone
truncate wait warn write
>>> len(valid_words)
70
That’s a lot of English words!
I… don’t think so? Going down the rabbit hole: I got the list of words from the Arch Linux words package, which links the Aspell wordlist as upstream, which led me to Spell Checker Oriented Word Lists (SCOWL).
They also have a simple web frontend to look up words in the list. Searching for ioctl shows that it comes from the nerd “hacker” list. Explains a lot!
Conclusion
Not quite sure what to make of my brief excursion into glibc symbol names. Go on your own trip, maybe?