Scan For Matches

scan_for_matches is a utility written in C for locating patterns in DNA or protein FASTA files.

You can download the source and installation instructions at http://www.theseed.org/servers/downloads/scan_for_matches.tgz.

The utility was written by Ross Overbeek; David Joerg and Morgan Price wrote sections of an earlier version. It is worth noting that it was strongly influenced by the elegant tools developed and distributed by David Searls.

Installation

To get started, you will need to compile the program.

        gcc -O -o scan_for_matches  ggpunit.c scan_for_matches.c

If you do not use GNU C, use

        cc -O -DCC -o scan_for_matches  ggpunit.c scan_for_matches.c

Once you have compiled scan_for_matches, you can verify that it works with

        clone% run_tests tmp
        clone% diff tmp test_output

You may get a few strange lines of the sort

        clone% run_tests tmp
        rm: tmp: No such file or directory
        clone% diff tmp test_output

These should cause no concern. However, if the diff shows that tmp and test_output are different, contact us.

You should now be able to use scan_for_matches by following the instructions given below (which is all the normal user should have to understand, once things are installed properly).

How to run scan_for_matches

To run the program, you type need to create two files

  1. the first file contains the pattern you wish to scan for; we'll call this file pat_file in what follows (but any name is ok)
  2. the second file contains a set of sequences to scan. These should be in FASTA format.

Once these files have been created, you just use

        scan_for_matches pat_file < input_file

to scan all of the input sequences for the given pattern. As an example, suppose that pat_file contains a single line of the form

                p1=4...7 3...8 ~p1

Then,

                scan_for_matches pat_file < test_dna_input

(using the test_dna_input found in the install package) should produce two "hits".

        clone% scan_for_matches pat_file < test_dna_input
        >tst1:[6,27]
        cguaacc ggttaacc gguuacg 
        >tst2:[6,27]
        CGUAACC GGTTAACC GGUUACG 
        clone% 

Simple Patterns Built by Matching Ranges and Reverse Complements

Let us first explain this simple pattern:

                p1=4...7 3...8 ~p1

The pattern consists of three pattern units separated by spaces.

The first pattern unit is

                p1=4...7

which means match 4 to 7 characters and call them p1. The second pattern unit is

                3...8

which means then match 3 to 8 characters. The last pattern unit is

                ~p1

which means match the reverse complement of p1. The first reported hit is shown as

        >tst1:[6,27]
        cguaacc ggttaacc gguuacg 

which states that characters 6 through 27 of sequence tst1 were matched. cguaac matched the first pattern unit, ggttaacc the second, and gguuacg the third. This is an example of a common type of pattern used to search for sections of DNA or RNA that would fold into a hairpin loop.

Searching Both Strands

Now for a short aside: scan_for_matches only searched the sequences in the input file; it did not search the opposite strand. With a pattern of the sort we just used, there is no need to search the opposite strand; however, it is normally the case that you will wish to search both the sequence and the opposite strand (i.e., the reverse complement of the sequence). To do that, you would just use the -c command line switch. For example,

        scan_for_matches -c pat_file < test_dna_input

Hits on the opposite strand will show a beginning location greater than the end location of the match.

Defining Pairing Rules and Allowing Mismatches, Insertions, and Deletions

Let us stop now and ask What additional features would one need to really find the kinds of loop structures that characterize tRNAs, rRNAs, and so forth? I can immediately think of two:

  1. you will need to be able to allow non-standard pairings (those other than G-C and A-U), and
  2. you will need to be able to tolerate some number of mismatches and bulges.

Let us first show how to handle non-standard rules for pairing in reverse complements. Consider the following pattern, which we show as two line (you may use as many lines as you like in forming a pattern, although you can only break a pattern at points where space would be legal):

            r1={au,ua,gc,cg,gu,ug,ga,ag} 
            p1=2...3 0...4 p2=2...5 1...5 r1~p2 0...4 ~p1        

The first pattern unit does not actually match anything; rather, it defines a "pairing rule" in which standard pairings are allowed, as well as G-A and A-G (in case you wondered, Us and Ts and upper and lower case can be used interchangeably; for example r1={AT,UA,gc,cg} could be used to define the standard rule for pairings). The second line consists of six pattern units which may be interpreted as follows:

 

p1=2...3match 2 or 3 characters (call it p1)
0...4match 0 to 4 characters
p2=2...5match 2 to 5 characters (call it p2)
1...5match 1 to 5 characters
r1~p2match the reverse complement of p2, allowing G-A and A-G pairs
0...4match 0 to 4 characters
~p1match the reverse complement of p1 allowing only G-C, C-G, A-T, and T-A pairs

 

Thus, r1~p2 means match the reverse complement of p2 using rule r1.

Now let us consider the issue of tolerating mismatches and bulges. You may add a qualifier to the pattern unit that gives the tolerable number of mismatches, deletions, and insertions. Thus,

                p1=10...10 3...8 ~p1[1,2,1]

means that the third pattern unit must match 10 characters, allowing one mismatch (a pairing other than G-C, C-G, A-T, or T-A), two deletions (a deletion is a character that occurs in p1, but has been deleted from the string matched by ~p1), and one insertion (an insertion is a character that occurs in the string matched by ~p1, but not for which no corresponding character occurs in p1). In this case, the pattern would match

              ACGTACGTAC GGGGGGGG GCGTTACCT

which is, you must admit, a fairly weak loop. It is common to allow mismatches, but you will find yourself using insertions and deletions much more rarely. In any event, you should note that allowing mismatches, insertions, and deletions does force the program to try many additional possible pairings, so it does slow things down a bit.

How Patterns Are Matched

Now is as good a time as any to discuss the basic flow of control when matching patterns. Recall that a pattern is a sequence of pattern units. Suppose that the pattern units were

        u1 u2 u3 u4 ... un

The scan of a sequence S begins by setting the current position to 1. Then, an attempt is made to match u1 starting at the current position. Each attempt to match a pattern unit can succeed or fail. If it succeeds, then an attempt is made to match the next unit. If it fails, then an attempt is made to find an alternative match for the immediately preceding pattern unit. If this succeeds, then we proceed forward again to the next unit. If it fails we go back to the preceding unit. This process is called backtracking. If there are no previous units, then the current position is incremented by one, and everything starts again. This proceeds until either the current position goes past the end of the sequence or all of the pattern units succeed. On success, scan_for_matches reports the hit, the current position is set just past the hit, and an attempt is made to find another hit.

If you wish to limit the scan to simply finding a maximum of, say, 10 hits, you can use the -n option (-n 10 would set the limit to 10 reported hits). For example,

        scan_for_matches -c -n 1 pat_file < test_dna_input

would search for just the first hit (and would stop searching the current sequences or any that follow in the input file).

Anchors

The pattern units ^ and $ now work as in normal regular expressions. That is

			TTF $

matches only TTF at the end of the string and

		        ^ TTF 

matches only an initial TTF.

The pattern unit

			<p1

matches the reverse of the string named p1. That is, if p1 matched GCAT, then <p1 would match TACG. Thus,

		   p1=6...6 

matches a real palindrome (not the biologically common meaning of reverse complement)

Searching for repeats

In the last section, we discussed almost all of the details required to allow you to look for repeats. Consider the following set of patterns:

 :

p1=6...6 3...8 p1find exact 6 character repeat separated by to 8 characters
p1=6...6 3..8 p1[1,0,0]allow one mismatch
p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]match 12 characters that are the remains of a 3-character sequence occurring 4 times
p1=4...8 0...3 p2=6...8 p1 0...3 p2This would match things like ATCT G TCTTT ATCT TG TCTTT

 

Searching for particular sequences

Occasionally, one wishes to match a specific, known sequence. In such a case, you can just give the sequence (along with an optional statement of the allowable mismatches, insertions, and deletions). Thus,

 :

p1=6...8 GAGA ~p1match a hairpin with GAGA as the loop
RRRRYYYYmatch 4 purines followed by 4 pyrimidines
TATAA[1,0,0]match TATAA, allowing 1 mismatch

 

Matches against a weight matrix

We will conclude our examples of the types of pattern units available for matching against nucleotide sequences by discussing a crude implementation of matching using a weight matrix.

Suppose that you wanted to match a sequence of eight characters. The consensus of these eight characters is GRCACCGS, but the actual frequencies of occurrence are given in the matrix below. Thus, the first character is an A 16% the time and a G 84% of the time. The second is an A 57% of the time, a C 10% of the time, a G 29% of the time, and a T 4% of the time.

 

C1C2C3C4C5C6C7C8
A165709501800
C01080010060050
G84290002010050
T042050200

 

One could use the following pattern unit to search for inexact matches related to such a weight matrix:

        {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
         (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450

This pattern unit will attempt to match exactly eight characters. For each character in the sequence, the entry in the corresponding tuple is added to an accumulated sum. If the sum is greater than 450, the match succeeds; else it fails.

This feature also allows ranges. Thus,

  600 >  {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
         (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450

will work, as well.

Allowing Alternatives

Very occasionally, you may wish to allow alternative pattern units (i.e., match either A or B). You can do this using something like

                ( GAGA | GCGCA)

which says match either GAGA or GCGCA. You may take alternatives of a list of pattern units, for example

        (p1=3...3 3...8 ~p1 | p1=5...5 4...4 ~p1 GGG)

would match one of two sequences of pattern units. There is one clumsy aspect of the syntax: to match a list of alternatives, you need to fully parenthesize the request. Thus,

        (GAGA | (GCGCA | TTCGA))

would be needed to try the three alternatives.

One Minor Extension

Sometimes a pattern will contain a sequence of distinct ranges, and you might wish to limit the sum of the lengths of the matched subsequences. For example, suppose that you basically wanted to match something like

    ARRYYTT p1=0...5 GCA[1,0,0] p2=1...6 ~p1 4...8 ~p2 p3=4...10 CCT

but that the sum of the lengths of p1, p2, and p3 must not exceed eight characters. To do this, you could add

        length(p1+p2+p3) < 9

as the last pattern unit. It will just succeed or fail (but does not actually match any characters in the sequence).

Matching Protein Sequences

Suppose that the input file contains protein sequences. In this case, you must invoke scan_for_matches with the -p option. When matching proteins, you cannot use aspects of the language that relate directly to nucleotide sequences (e.g., the -c command line option or pattern constructs referring to the reverse complement of a previously matched unit).

You also have two additional constructs that allow you to match either one of a set of amino acids or any amino acid other than those a given set. For example,

        p1=0...4 any(HQD) 1...3 notany(HK) p1

would successfully match a string like

           YWV D AA C YWV

Using the show_hits Utility

When viewing a large set of complex matches, you might find it convenient to post-process the scan_for_matches output to get a more readable version. We provide a simple post-processor called show_hits. To see its effect, just pipe the output of a scan_for_matches into show_hits.

Normal Output:

        clone% scan_for_matches -c pat_file < tmp
        >tst1:[1,28]
        gtacguaacc  ggttaac cgguuacgtac 
        >tst1:[28,1]
        gtacgtaacc  ggttaac cggttacgtac 
        >tst2:[2,31]
        CGTACGUAAC C GGTTAACC GGUUACGTACG 
        >tst2:[31,2]
        CGTACGTAAC C GGTTAACC GGTTACGTACG 
        >tst3:[3,32]
        gtacguaacc g gttaactt cgguuacgtac 
        >tst3:[32,3]
        gtacgtaacc g aagttaac cggttacgtac 

Piped Through show_hits:

        clone% scan_for_matches -c pat_file < tmp | show_hits
        tst1:[1,28]:  gtacguaacc   ggttaac  cgguuacgtac
        tst1:[28,1]:  gtacgtaacc   ggttaac  cggttacgtac
        tst2:[2,31]:  CGTACGUAAC C GGTTAACC GGUUACGTACG
        tst2:[31,2]:  CGTACGTAAC C GGTTAACC GGTTACGTACG
        tst3:[3,32]:  gtacguaacc g gttaactt cgguuacgtac
        tst3:[32,3]:  gtacgtaacc g aagttaac cggttacgtac
        clone% 

Optionally, you can specify which of the fields in the matches you wish to sort on, and show_hits will sort them. The field numbers start with 0. So, you might get something like

        clone% scan_for_matches -c pat_file < tmp | show_hits 2 1
        tst2:[2,31]:  CGTACGUAAC C GGTTAACC GGUUACGTACG
        tst2:[31,2]:  CGTACGTAAC C GGTTAACC GGTTACGTACG
        tst3:[32,3]:  gtacgtaacc g aagttaac cggttacgtac
        tst1:[1,28]:  gtacguaacc   ggttaac  cgguuacgtac
        tst1:[28,1]:  gtacgtaacc   ggttaac  cggttacgtac
        tst3:[3,32]:  gtacguaacc g gttaactt cgguuacgtac
        clone% 

In this case, the hits have been sorted on fields 2 and 1 (that is, the third and second matched subfields).

show_hits is just one possible little post-processor, and you might well wish to write a customized one for yourself.

Reducing the Cost of a Search

The scan_for_matches utility uses a fairly simple search, and may consume large amounts of CPU time for complex patterns. When you have a complex pattern that includes a number of varying ranges, imprecise matches, and so forth, it is useful to pipeline matches. That is, form a simpler pattern that can be used to scan through a large database extracting sections that might be matched by the more complex pattern. Let us illustrate with a short example. Suppose that you really wished to match the pattern

    p1=3...5 0...8 ~p1[1,1,0] p2=6...7 3...6 AGC 3...5 RYGC ~p2[1,0,0]

In this case, the pattern units AGC 3...5 RYGC can be used to rapidly constrain the overall search. You can preprocess the overall database using the pattern:

          31...31 AGC 3...5 RYGC 7...7

Put the complex pattern in pat_file1 and the simpler pattern in pat_file2. Then use,

	scan_for_matches -c pat_file2 < nucleotide_database |
        scan_for_matches pat_file1

The output will show things like

    >seqid:[232,280][2,47]
    matches pieces

Then, the actual section of the sequence that was matched can be easily computed as [233,278] (remember, the positions start from 1, not 0).

Let me finally add, you should do a few short experiments to see whether or not such pipelining actually improves performance--it is not always obvious where the time is going, and I have sometimes found that the added complexity of pipelining actually slowed things up. It gets its best improvements when there are exact matches of more than just a few characters that can be rapidly used to eliminate large sections of the database.