Word filtering program

As part of an interview, I was asked to produce a C/C++ program doing the following:

Given a file containing two different lists of lines, say A and B, produce a list

	A11 B32
	A12 B33
	A12 B34
	...
	
where each line means: all the words in line 11 of list A appear in line 32 of list B, etc.

Obviously, you would not do that program in C out of the blue (unless you want to hardcode linked lists, probably hashes and whatnot). So, even though my mastery of C++ was quite limited (so, no mastery at all), I tried to do it in C++. You guessed I failed at it, but I was piqued to do it afterwards, and here it is.

In the tarball there is a Makefile (which will produce a 'filter' executable), a 'sentences.txt' file I built and the source filter.cpp, which you can browse.

The problem is specific, and so is the solution. The program is written for that quite specific problem, not to be used as the basis for, say, a 'web filter', which would be a real-life related problem. The main issue is that this problem is (at first sight and thought) best solved like this;

  1. Parse first the second list, which contains larger lines and quite likely less words, producing a hash (a map in STL terms) from strings to vectors of integers, so that each word is mapped to a list of the lines it appears in.
  2. Once the above hash (Bwords) is created, rewind the file and start parsing the first list, line by line. For each line, create a list (vector, which is as simple) of the words in that line, and, if (w1, w2, ... wk) is that list, compute the intersection (using the set_intersection algorithm of the STL) of Bwords[w1], ..., Bwords[wk]. This is stored as a list of integers inter.
  3. For each element e in inter, output
    Acount Be
    	    
    (count being the line counter for the first list).
There are probably faster ways of doing it, but I guess not too faster for this problem. It also depends on the distribution of words in the first and second lists.

The problem stated that 'words' should be considered all caps, as regards comparison, and that accents were to be removed (thus, á becomes A, etc...). This is the purpose of the to_up function in the code.

Comments are always welcome.

Back to Pedro's page