uniqmerge - emit those lines which appear in a set number of the input files.
Synopsis
Description
Options
Examples
See Also
License
Copyright
Acknowledgements
Authors
uniqmerge [ -h -? -help --help --? ]
uniqmerge [options...] outputfile inputfile1 ... inputfileN
uniqmerge reads text from several input files. Each file is sorted and its duplicate lines are removed, then the unique lines from all files are merged. Every unique line carries a repeat count: the number of input files in which it appears. A line that occurs several times within one file still adds only 1 to its count. When all files have been processed, the merged lines are emitted subject to the criterion that the repeat count is >=N (see -ge) or <=N (see -le).
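The counting rule described above can be illustrated with a minimal Python sketch. This is not the uniqmerge source; the function name `uniq_merge` and its `mode` parameter are hypothetical, and the input files are modelled as in-memory lists of lines.

```python
from collections import Counter

def uniq_merge(files, n=1, mode="le"):
    """Return sorted (line, file_count) pairs that satisfy the threshold.

    files: a list of files, each given as a list of lines (illustrative input).
    mode:  "le" keeps lines seen in n or fewer files, "ge" in n or more.
    """
    counts = Counter()
    for lines in files:
        # Duplicates inside one file count once, so reduce each file to a set.
        counts.update(set(lines))
    if mode == "le":
        keep = [(line, c) for line, c in counts.items() if c <= n]
    elif mode == "ge":
        keep = [(line, c) for line, c in counts.items() if c >= n]
    else:
        raise ValueError("mode must be 'le' or 'ge'")
    return sorted(keep)

in1 = ["apple", "pear", "apple"]
in2 = ["pear", "plum"]
in3 = ["pear", "fig"]

print(uniq_merge([in1, in2, in3]))                  # lines found in only one file
print(uniq_merge([in1, in2, in3], n=3, mode="ge"))  # lines found in all three files
```

Here "apple" occurs twice in the first list but still has a count of 1, which is the behaviour the description attributes to the repeat count.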
Input files may also contain tagged blocks. If so, they must all contain the same number of tags. A tagged block begins with a unique tag string starting in the first character on the line. It is followed by zero or more text lines. Equivalent tagged blocks are processed from all input files, then the next equivalent tagged block, and so forth.
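A rough Python sketch of tagged-block processing follows. It assumes that a tag line is any line starting with the tag string, that text before the first tag is ignored, and that the default <=1 criterion applies; the helper names `split_blocks` and `merge_tagged` are hypothetical and the real program may differ in these details.

```python
def split_blocks(lines, tag):
    """Split lines into blocks, each opened by a line that starts with tag."""
    blocks = []
    for line in lines:
        if line.startswith(tag):
            blocks.append([])          # a tag line opens a new, possibly empty, block
        elif blocks:
            blocks[-1].append(line)    # a text line belongs to the current block
        # Assumption: lines seen before the first tag are ignored.
    return blocks

def merge_tagged(files, tag, n=1):
    """For each block position, return the lines found in n or fewer files."""
    per_file = [split_blocks(lines, tag) for lines in files]
    if len({len(blocks) for blocks in per_file}) != 1:
        raise ValueError("all input files must contain the same number of tags")
    results = []
    for group in zip(*per_file):       # the Nth block of every file, together
        counts = {}
        for block in group:
            for line in set(block):    # duplicates within one file count once
                counts[line] = counts.get(line, 0) + 1
        results.append(sorted(l for l, c in counts.items() if c <= n))
    return results

in1 = [">tag 1", "a", "b", ">tag 2", "x"]
in2 = [">tag 1", "b", "c", ">tag 2"]
print(merge_tagged([in1, in2], ">tag"))
```

The second block of `in2` contains zero text lines, which the description allows.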
Numeric data may also be processed. In that case .01, 1.0e-02, and 000.01 are all equivalent.
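The equivalence of these spellings can be checked with a short Python sketch, assuming that numeric mode compares parsed floating-point values rather than the original text. How uniqmerge itself parses numbers is not stated here; `numeric_key` is a hypothetical helper.

```python
def numeric_key(field):
    """Parse a text field as a number; float() raises ValueError if it is nonnumeric."""
    return float(field)

values = [".01", "1.0e-02", "000.01"]
keys = {numeric_key(v) for v in values}
print(keys)  # {0.01}
```

All three strings collapse to a single key, so in numeric mode they would count as the same line.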
uniqmerge may be obtained from:
ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/
Use of uniqmerge is subject to the License terms.
outputfile
    The name of the output file. If it is - output goes to stdout. (Mandatory parameter.)
inputfiles
    A list of the names of the input files. If one is - then input from that file comes from stdin. (Mandatory parameter.)
-tag TAGSTRING
    Look for tagged text blocks in the input files and process the Nth block for each input file together. A single line beginning with TAGSTRING marks the start of each tagged block. (Default is not to look for tagged text blocks.)
-count
    Prefix the output with the number of input files the unique line appeared in. Note that this is NOT necessarily the number of times the line appeared in total, since all identical lines in an input file count as 1. May be combined with -source and -tcount. (Default is not to emit counts.)
-source
    Prefix the output with the number of the input file where a unique line appeared. 1 is the first input file, 2 the second, and so forth. This is only useful for <=1, the default mode, where each output line was present in only one file. For modes where the unique lines appeared in more than one input file, the first input file number will be emitted. May be combined with -count and -tcount. (Default is not to emit source file numbers.)
-tcount
    Prefix the tagged lines in the output with the number of the tag. That is, the fifth tag encountered is preceded by 5. May be combined with -source and -count. (Default is not to emit tag counts.)
-nocase
    String comparisons are carried out ignoring case. (Default is that text comparisons are case sensitive.)
-sc start_column
    The first character column to select. Columns are numbered from 1. If start_column is beyond the end of any line other than a tag line, a fatal error will result. (Default start_column=1.)
-ec end_column
    The last character column to select. It is permissible to set an end_column which is beyond the end of a line, in which case the effective end_column becomes the last column in the line. (Defaults to the last column in the input line.)
-nc number_of_columns
    Number of columns to select. Do not specify both -nc and -ec.
-ge N
    Emit those unique lines which appeared in N or more input files. (For the default mode, see -le.)
-le N
    Emit those unique lines which appeared in N or fewer input files. (Default mode with N=1: emit strings that appeared in only one input file.)
-sort SORTMODE
    Sort modes are s+, s-, n+, n- where s/n are string/numeric, and +/- are ascending/descending. (Default sort mode: s+ = sort strings into ascending order.)
-ssize STACKSIZE
    STACKSIZE is the maximum number of strings in equivalent tagged blocks or untagged whole files which may be processed. (Default: 32000.)
-h -help --help -? --?
    Print the help message. (Default: do not print help message.)
-i
    Emit version, copyright, license and contact information. (Default: do not emit information.)
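The column options -sc, -ec and -nc can be pictured with the following Python sketch. It follows the rules given for -sc and -ec; clamping an -nc range that runs past the end of the line is an assumption, since that case is not described. The function `select_columns` is hypothetical, not part of uniqmerge.

```python
def select_columns(line, sc=1, ec=None, nc=None):
    """Return the 1-based, inclusive column range [sc, ec] of line."""
    if ec is not None and nc is not None:
        raise ValueError("do not specify both ec and nc")
    if sc < 1:
        raise ValueError("columns are numbered from 1")
    if sc > len(line):
        # The option text calls this case a fatal error.
        raise ValueError("start_column is beyond the end of the line")
    if nc is not None:
        ec = sc + nc - 1               # nc columns starting at sc
    if ec is None or ec > len(line):
        ec = len(line)                 # clamp to the last column (assumed for nc)
    return line[sc - 1:ec]

print(select_columns("abcdefghijkl", sc=5, nc=5))  # efghi
print(select_columns("abcdefg", sc=5, ec=20))      # efg
```

With sc=5 and nc=5 the selected range is columns 5 through 9.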
% uniqmerge
% uniqmerge -h
List the command line options.

% uniqmerge - in1 in2 in3
Emit those lines which occur in only one of the three input files to stdout, sorted as strings in ascending order.

% uniqmerge -tag tag - in1 in2 in3
As before, but process in tagged blocks.

% uniqmerge -tag tag -n- - in1 in2 in3
As before, but process as numbers instead of strings and emit in numerical descending order. (If any field is nonnumeric a fatal error will result.)

% uniqmerge -tag tag -sc 5 -nc 5 -n- - in1 in2 in3
As before, but extract the numeric value from character columns [5-9].

% uniqmerge -ge 3 -tag tag -sc 5 -nc 5 -n- - in1 in2 in3
As before, but emit numbers which occur in all three input files.
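The difference between string and numeric ordering in these examples can be seen in a small Python sketch of the four sort modes (s+, s-, n+, n-). The helper `sort_lines` is hypothetical and only mimics the ordering, assuming numeric mode compares parsed floating-point values.

```python
def sort_lines(lines, mode="s+"):
    """Sort lines as strings (s) or numbers (n), ascending (+) or descending (-)."""
    if mode not in ("s+", "s-", "n+", "n-"):
        raise ValueError("sort mode must be one of s+, s-, n+, n-")
    key = float if mode[0] == "n" else None   # float() raises on nonnumeric fields
    return sorted(lines, key=key, reverse=(mode[1] == "-"))

print(sort_lines(["10", "9", "100"], "s+"))  # ['10', '100', '9']
print(sort_lines(["10", "9", "100"], "n-"))  # ['100', '10', '9']
```

As strings, "9" sorts after "100" because comparison proceeds character by character; as numbers the order follows the values.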
In many cases it is convenient to use extract or awk to reformat data before running uniqmerge.
You may run this program on any platform. You may redistribute the source code of this program subject to the condition that you do not first modify it in any way. You may distribute binary versions of this program so long as they were compiled from unmodified source code. There is no charge for using this software. You may not charge others for the use of this software.
Copyright (C) 2002, 2003 David Mathog and Caltech.
This program was inspired by uniq from Richard Stallman and David MacKenzie.
David Mathog, Biology Division, Caltech <mathog@caltech.edu>
| uniqmerge (1) | 03 May 2003 |