Let’s say you’ve got a list of globs: '*', 'a*',
'*a', '*a*'. Now sort them. Sort them based on how
specific they are. More specific globs match fewer things. Less specific
globs match more things.
If Dr. Seuss had had anything to say about globs, what would he have said?
Glob on String
Blob on Thing
Glob on String for Blog on Thing
Blog on Glob for Thing on String
When Globs on Strings match Blogs on Things
Then… Blogs on Globs match Strings on Things
And When Blogs match Globs and Strings match Things
Then… Blogs on Globs match Blogs on Things on Globs on Strings!
So how do you sort globs? And why would you want to sort them? I’m
working on a project where you use globs to pick out the error
conditions that you are interested in. These error conditions have
names. So you can match entire classes of error by using globs. For
example 'foo.*' matches the errors 'foo.bar' and
'foo.baz'.
But what if I also add '*.*' to the mix. This also matches
'foo.bar' and 'foo.baz'. But I want 'foo.*' to match
first. It’s more specific. For some definition of specific.
Again, 'a*' is more specific than '*a*'. 'a*' can
match fewer things. Let’s look at this more closely. Say we have the
set of words: [aa,ab,ba,bb]. Then
* matches aa,ab,ba,bb
a* matches aa, ab
*a matches aa, ba
*a* matches aa, ab, ba
So we can order the globs like so: 'a*', '*a',
'*a*', '*', in order of increasing matches. The ones at
the front are more specific. They match fewer things.
What about 'a*' and '*a'. How do we decide which comes
first. Rather arbitrarily, lets say that the more specific prefix of
'a*' makes it the winner. Prefixes are more specific than
suffixes. Hmm. That’s getting way too philosophical. Take Java
packages: 'com.ricebridge.csvman.*' is more specific than
'*.csvman.test'. The latter can occur in any top level
package, but the former in only one.
Yeah it’s pretty weak, but hey!
Now here’s the thing: what is the rule for sorting globs? I want a rule which:
1. is human computable 'just by looking' — that is, the rule is obvious
2. creates a full ordering of any set of globs — no arbitrary ordering of non-identical globs
3. has reasonable efficiency
The first idea: Use the number of stars. More stars means more
specific. When the number of stars is the same, sort
alphabetically. Neat and easy. And wrong. Sure it puts 'a*b*c'
before 'a*c', but it also puts '*a*' before 'a*',
which is incorrect.
Let’s look at globs. There are stars, and then there are normal
characters. They always form an alternating pattern: 'a*b*c'. Or
'*a*b*'. So this looks like an essential (and not accidental)
feature of globs. The number of stars and the number of
'normals'. Yes, in general, the more stars you have, the more
constrained you are, but there’s more going on.
After thinking about this for a bit, I realised that the outside stars
are special. Very special. A glob with a star at the start or end, or
both, can match a lot of things. Way more things that a glob with
stars inside other characters. What if we just define an ordering for
the outside stars. And we have one already. Form the preceding
discussion: 'a', 'a*', '*a', '*a*', '*'.
Now that 'a' can stand for anything inside the outside
stars. Other globs in fact. But those globs can’t have outside stars!
The glob '**' is the same as the glob '*'. Adjacent stars
merge into one. So '*a*', with a = '*b*' becomes
'**b**' becomes '*b*'. On the other hand with a = 'b*c', '*a*'
becomes '*b*c*'. A different beast entirely.
Now at this point you might be saying, “hold on a sec, you’re
trying to order regular expressions! That's insane! You need to
evaluate them to do that!” Because globs are just regular
expressions: 'a*' is really /a.*/ in regex land. True. But I
think that for the special case of globs built from stars only (let’s
drop the use of '?'), we can in fact define an ordering without
requiring evaluation. The problem is sufficiently complex to require
some thought, but it's not a 'hard' problem.
I would think that due to back-tracking, solving the sorting problem
for standard regular expressions is probably going to involve solving
the halting problem, since you have to evaluate each regex. But I
could be wrong. I often am.
Back to the globs. We can sort based on the outside stars. Fine. We
have five subsets to sort now: 'a', 'a*',
'*a', '*a*', '*'. '*' is a special case —
it’s always last. It's the least specific glob of all, matching
anything. The glob with no stars, 'a', is also easy. Just sort
alphabetically.
For the other three we note that a will always take the forms
'b', 'b*b', 'b*b*b', …, and so on. That is, the essential thing is
the number of inside stars. In this case, the more stars you have, the
less you can match. For example 'a*a' will match
'azbza', 'azcza'. But 'a*b*a' will only match
'azbza'. More stars are more specific. Problem solved!
Nope. Not quite. What about globs having the same number of inside
stars? What about 'a*a' versus 'a*ab'? Which is more
specific? I think 'a*ab' is. It has more information, more
constraints on what it can match. For any given finite set of words,
there are more ways to end in 'a' alone than 'ab'. So the
next criteria is: more normal characters, more specific.
You can see the next catch, can't you? What if we have the same number
of normal characters? Alphabetic sort? No, doesn’t work – the
characters may be spread out between in the stars in different ways:
'a*ab' versus 'ab*a'. Which is more specific? Hmm. Let’s
invoke our prefix rule again. This makes 'ab*a' more specific
than 'a*ab', because it has a longer prefix. The criteria is:
whoever’s got the mostest on the leftest.
And finally, if the character spread is identical, then we go
alphabetic. So 'aa*a' precedes 'ab*a'. Let's see what we have…
The human rules are:
1. Prefixes are more specific than suffixes
2. Outside stars first: 'a', 'a*', '*a','*a*'
3. Inside stars: more stars means more specific
4. Normal characters: more characters, more to the left, is more specific
5. Otherwise go alphabetic
Here’s a sample list of globs, starting with
the most specific. See if this list matches your idea of how they
should be ordered.
com.ricebridge.csvman.CsvManager
com.*.*Manager
com.ricebridge.*.CsvManager
com.ricebridge.csvman.*
*.ricebridge.csvman
*.ricebridge.*
*
This is almost right. But really I think com.ricebridge.*.CsvManager
is more specific than com.*.*Manager. How can we arrange this? Do more
stars really mean more specific? Or do normal characters provide far
more specificity?
The more normal characters you use, the more specific you are
being. Stars provide very little information. But strings of normal
characters really narrow things down. But in which set of words? What is the
total set of words that we are matching in the com.ricebridge
scenario? Infinite? Is there some structure? If these are the names of
Java classes (my target domain), then com.ricebridge.*.CsvManager is
definitely more specific (matches fewer items) than
com.*.*Manager. Seems like prefixes really do rule. I wonder is this
somehow connected to Zipf's Law…
Anyway, let’s drop rules 3 and 4, and replace them with:
Inside characters: longest prefix wins
So even if you have loads of stars, if you have a short prefix, you lose. This gives:
com.ricebridge.csvman.CsvManager
com.ricebridge.*.CsvManager
com.*.*Manager
com.ricebridge.csvman.*
*.ricebridge.csvman
*.ricebridge.*
*
That’s a lot better. Let’s restate the human rules:
1. Outside stars first: 'a', 'a*', '*a', '*a*'
2. Inside characters: longest prefix wins
4. Prefixes same length? more stars wins
3. Otherwise go alphabetic
That new rule number 4 means that 'a*b*c' is more specific than 'a*b'. So you only compare the minimum number of prefixes. If they're are equal, take the longest glob as more specific.
A much better ruleset! And it pretty much conforms to the criteria for the
sorting algorithm: human computable, full ordering, and reasonable
performance.
Time to go code it up. Post a comment if you want to save me from myself!