naïve algorithms will let you down

Over the last few years I have been in a position where I have been party to discussions about software with those who know little to nothing about the art of computer science.  Many of these people were in charge of very large governmental budgets for some fairly high-profile public-facing software systems.  These systems typically involved a large amount of deep algorithmic work, which on the face of it might seem simple, but was actually very complicated.  Given this situation you would have thought people would listen when advice was being given out.  Oh how wrong you would be!

For some context to this I will introduce a problem, which is not strictly related to the situation described above (to protect the guilty), but has a similar flavour.

The problem I’m going to describe is that of allocating a set of school children to their (or I suppose their parents) most desirable school.  Each parent is asked to rank in strict order of preference 5 schools.  A centralised system is then used to assign each child to a school.

The naïve (serial-dictatorship) solution to this problem would be to simply assign each child to the first school on their list that is not already full.  So what is the problem with such a solution?  Well, firstly we can end up with an assignment of children to schools that does not necessarily have maximum cardinality, i.e. there exists another allocation in which a larger number of children are matched to schools.  Also, we may end up with an allocation in which a child c1 ends up at his fifth placed school while another child c2 obtains his first-choice, but if c2 moved to his second-choice then c1 would be able to get his third-choice.  If we allow this switch to take place we clearly end up with a “better” allocation.

Despite the poor characteristics obtained from the naïve algorithm, I suspect that many many organisations, both governmental and private, use such an algorithm when assigning people to resources when preference information is available.   The problem is that people in charge of building such systems often only see the need to allocate these people to resources in whatever way possible and in the shortest time.  So what happens?  Well they get a system that looks flashy and kind of works.  However a couple of months in when they have a group of people complaining and operating outside the system, the reality finally dawns on them.  I’m sure in the UK alone we could count quite a few computer systems that have fallen under the weight of this pressure.

So we have shown that the naïve algorithm does not always work.  Can we safely assume that the heart of a software product lies in those complicated algorithms at its core?  I think we can say so with Google, but who knows in general.  In the example I described I think this is definitely the case.  So what would be the “best” way to allocate these children to schools?

There are different meanings of “best” but all would ensure that the allocation has maximum cardinality, without this we are at a lost cause straight away. Following this we could attempt to maximise the number of children assigned to their first-choice school, then subject to this we maximise the number of children assigned to their second choice school, etc.  Therefore if we receive an angry parent complaining that a child got their third-choice school you can at least tell them that there was no other possible assignment in which some other child with their first or second place does not get worse off.  Hence there is no incentive for the administrator to allow any changes to take place.  This is known as a greedy maximum matching (allocation).  Lastly we need to make sure that these algorithms are efficient, most algorithms are no use to anybody if they are exponential.

So there we go, just remember that naïve algorithms are not always (and often not) the best.

practically np

Often I find that people have the opinion that the class of NP-complete problems, as discussed in my previous post, in general, don’t show up in practice. And if they do show up, then they normally represent a restriction on the problem that we can solve easily. This line of thought though is unfortunately untrue 🙁 .

An example of a problem that has significant practical consequences, yet is NP-hard, is that of finding a maximum size stable matching in the context of what is known as the Hospital/Residents problem with ties.

In the Hospital/Residents problem (HR), we seek to assign a set of student doctors (residents) to hospital posts (a hospital may offer more than one post) – this task is carried out in many countries in a student’s first year after graduating from medical school. Each student is asked to rank a set of hospitals in strict preference order, with the hospitals ranking the set of students also by preference. However, in many schemes, hospital’s preference lists are allowed to contain ties. That is, a hospital is allowed to be indifferent between a set of students – this generally produces a more realistic set of preference lists, as otherwise the hospital may typically have to rank several thousand students individually. We note that this description of the problem is not complete but I’m describing a known practical situation here, not the general case of HR with ties, where a student’s preference list may also contain ties.

So what is a stable matching? A matching is stable in the context above if there is no student s that strictly prefers a hospital h to his current assignee h' (or s is unmatched), and h is either under-subscribed or prefers s to at least one assignee. I think an example will better illustrate:

students       hospitals
s: h h'         h: s (t r)
t: h h'         h': r s t
r: h h'

The above states that student s prefers h to h', similarly for students t and r. For the hospitals, each is offering 1 post, and hospital h is indifferent between t and r and prefers s to both of them, with h' preferring r to s to t respectively. Now the matching {(t,h),(r,h')} is not stable. Why? Because s is unmatched and would rather be assigned to h than be unmatched, and h prefers s to t, so both s and h would get better off if h reject t and took on s. In this example a stable matching is {(s,h),(r,h')}.

Mmmmm maybe the last bit was a little complicated but I just wanted to give a feel for why we desire a matching to be stable.

As I mentioned earlier the problem of finding a maximum stable matching in this context is NP-hard. So why is this a problem? Well obviously we want to match as many students as possible, if not these people will not get a job, so this is very important to them. Also, clearly we want a matching to be stable for otherwise you end up with chaos as students and hospitals realise that they would be better off if they just switched. This is exactly what happened in the US prior to establishing a system (NRMP) that was based on stable matchings way back in the 50s.

However, despite its importance as a problem we are stuck! If we want to deal with ties (and also if we have no ties but “couples” who want to submit joint preference lists) we are in the realms of using heuristics. What this means is that we are highly unlikely to find a solution that maximises the number of students matched. This may not sound bad to you, but to those involved in the matching scheme, it could affect the rest of their life. That said, even a stable solution less than optimal size (it’s easy to find such a solution in polynomial-time) is more acceptable than a solution that is not stable.

Anyway, there you have it, a practical NP-hard problem. Also, there are many others. If this was of interest then I can maybe write up something about the Kidney Exchange Problem – this concerns the organisation of kidney transplants from live donors and it is known that the most important practical variants of this problem are NP-hard. This is in every sense of the word a life or death problem.

the same old mistakes

Well, two days two entries, not bad eh.  This one will be short though, because I’m still feeling pretty tired.  Which is one of the points I want to make actually.  After a year of, how can I say it, managing my own time?  I’m now back in the unreal world of real work.  I hate getting up in the morning though.  I don’t mind so much when the sun is shining but when it’s pissing it down there just feels little incentive.  More so in the heart of winter when the sun doesn’t come up until 9am.  The pleasures of living in Scotland.  I have no issues with getting up at 10am but 9am or before is just a day breaker.

So I hear you say, what is this work I hear you talk about?   Well for the next six months I will be working for Glasgow University creating/maintaining/living matching software (maybe I will write a post about what I will be doing – when I finally figure that out). I have actually been working for around a week now, bizarrely enough though I still have not really been in. That is a long story I won’t go into. None-the-less I have been doing a fair amount of work, which leads me to my point.

After a few years of (self-imposed) exile from using Java, the tempestuous love affair has been rekindled in the last week. The majority of the stuff within the department I’m working in is implemented in C# but I was struggling to find a constraint programming toolkit that could be used with C#. However, I must make the same mistakes again and again when I go back to Java. That is, without fail, I get “java.lang.NoClassDefFoundError: AppName” when running the application, despite setting the correct classpath on the command line. And without fail I sit pissing around with the command trying every possible permutation to get it to work. Then, the bell rings, and I remember this happens to me every time, and the solution presents itself (usually about 2 hours wasted).

So the problem: well I like to use cygwin as my terminal in windows, simply because it looks and feels nicer than a dirty old windows command prompt. However, the java command that I pick up on my path in the cygwin installation is the windows version, and as a consequence doesn’t understand the path separator in the classpath parameter, e.g. it understands neither java -classpath .:./lib/log4j.jar AppName nor java -classpath .;./lib/log4j.jar. With the first command the java application just thinks .:./lib/log4j.jar is the full path and on the second it treats the ; as the end of the first command. The solution? Well for me it was just to use windows command prompt, where everything worked fine. For others it will be to use the cygwin version of java (which I assume is available but have not checked). The latter means managing two version of java on my system which seemed enough work to warrant not doing it. Instead I prefer to waste half my day typing ls into the command prompt.