extract - extract character ranges or tokens from text files.
Synopsis
Description
Options
Operator Order
String syntax
Examples
See Also
License
Copyright
Acknowledgements
Authors
extract [ -h -? -help --help --? ]
extract [options...] <inputfile >outputfile
extract reads a text file from stdin, extracts a range of rows and columns (character positions), and sends them to stdout. It can also process tokens instead of character columns, and it can remove the selected range instead of emitting it.
extract is much simpler to use than awk or perl and is sufficient for most column/row extraction tasks.
extract may be obtained from:
ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/
This reference page describes version 1.0.28
Use of extract is subject to the License terms.
-all Emit unprocessed the text rows outside of the range specified with -sr, -er, -nr. (Default is not to emit these rows.)
-bol bolstring When set, the prefix bolstring is emitted before any output for each input row. Specifically, there will be one prefix string emitted for each input row even if the rest of the output row is empty. The default is not to prefix with anything. bolstring may be an empty string. Note that the prefix precedes any line numbers triggered by -n.
-bs Add backslashes (unix escape characters) before any character other than alphabet, numeric, underscore, period, or slash. Note that this only applies within a field, so that, for instance, if the program is running in token mode a token range [1,3] would apply the backslashes between characters within each token but not between tokens. To work around that limitation use [dv\\:1,3]. (Default is not to add backslashes.)
-cols format Specify in great detail the format of the output line. Using other command line options one column is singled out and those options are applied to it (subject to the logical changes indicated by -rm or -ins). When -cols is used the other command line options specify the default values for all column fields, and multiple column fields (indicated by [] brackets within format) may be specified. Between column fields static strings may be introduced. These static strings may contain any symbol, including escaped characters. See String Syntax for examples. [ and ] must be escaped in a string or they will be interpreted as the limits of a column field. Within a column field a colon (:) separated set of options is allowed. Within a column field [ and ] are not allowed but all other characters are, and escapes may be used to include colons. Arbitrary combinations of static strings and column fields may be employed, freely mixing token and character mode columns, and emitting columns in any order, including emitting a single column multiple times.
Typically format must be quoted or escaped on the command line so that the shell does not mangle it before passing it into the program. The options for a column field are:
+ set_as = match command line specifications
p default = match program defaults (overrides -pd,-lj,-uc,etc.)
- disable = disable options
If employed as a single character it applies to all settings and must be the first option within a column field. As a suffix these may be applied singly to each of the -cols options.
mt/mc/m-/mp/m+ token mode/character mode/disable/default/set_as. Also sets the delimit state in some instances to match the command line, but this may be overridden again by a subsequent :d*: clause in the same column field. (overrides -mt/-mc)
jl/jr/jc/j-/jp/j+ justify left/right/center/disable/default/set_as (overrides -j*)
trl/trr/trb/trc/tr-/trp/tr+ trim left/right/both/compress/disable/default/set_as (overrides -tr*)
cu/cl/cf/c-/cp/c+ case upper/lower/first/disable/default/set_as (overrides -c*)
bs/c-/cp/c+ backslashes apply(as needed)/disable/default/set_as (overrides -c*)
dt/dvN/d-/dp/d+ emit delimit from token/with char N/disable/default/set_as. Restriction: the delimit character N must be escaped if it is a colon or a backslash, i.e. \: and \\. (overrides -d*)
pd###/pd-/pdp/pd+ pad with ### spaces/disable/default/set_as (overrides -pd or -fw)
fw###/fw-/fwp/fw+ field width to ### spaces/disable/default/set_as (overrides -pd or -fw)
fp###/fp-/fpp/fp+ floating point precision to ### digits/disable/default/set_as (overrides -fp)
fff/ffe/ff-/ffp/ff+ floating point format to float/exponent/disable/default/set_as (overrides -ffe/-fff)
rsSTR/rs-/rsp/rs+ replacement string is STR/disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rs)
rcdsSTR/rcdcSTR/rcd-/rcdp/rcd+ rcds string is STR/case insensitive STR/disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rcds)
rcssSTR/rcs-/rcsp/rcs+ rcss string is STR/disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rcss)
rtdsSTR/rtdcSTR/rtd-/rtdp/rtd+ rtds string is STR/case insensitive STR/disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rtds)
rtssSTR/rts-/rtsp/rts+ rtss string is STR/disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rtss)
[c] [s,e] [s,] [,e] [s,,i] [,e,i] [,,i] Range values for a single column, a column range (start,end,increment), open ended ranges (start or end omitted), and tail (offset, count) ranges. The single range for each column field is employed instead of that specified by -sc. The range values must be the final option in a column field. Both the s and e values may be positive or negative. If positive, they are column/token positions measured from the front of the line, where 1 is the first column. If negative, they are column/token positions measured from the end of the line, where -1 is the last column. Mixing modes like [10,-10] is possible but can generate fatal errors if the line is too short or has too few tokens to satisfy the range. The increment value defaults to 1 and may be anything other than zero. Increments only function in token mode; in character mode the value may be set but it is ignored. The expression [,,-1] emits all tokens in a line in reverse order.
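The range rules above can be sketched in Python. This is only an illustration of the described behaviour, not extract's own code: the names resolve and select_range are hypothetical, and two points are assumptions rather than documented facts, namely that a negative increment walks from the end position back toward the start, and that a start position beyond the end position is treated as an error.

```python
def resolve(pos, count):
    """Map a 1-based (or negative, from-the-end) position onto 1..count."""
    if pos == 0:
        raise ValueError("positions are numbered from 1 or from -1, never 0")
    resolved = pos if pos > 0 else count + 1 + pos
    if not 1 <= resolved <= count:
        raise ValueError(f"position {pos} does not exist in a line of {count} items")
    return resolved


def select_range(items, start=1, end=-1, step=1):
    """Return the items selected by a [start,end,step] style range."""
    if step == 0:
        raise ValueError("the increment may be anything other than zero")
    first = resolve(start, len(items))
    last = resolve(end, len(items))
    if first > last:
        # Assumption: an inverted range is an error, like a too-short line.
        raise ValueError("start position lies after end position")
    if step > 0:
        positions = range(first, last + 1, step)
    else:
        # Assumption: a negative increment walks backward from the end.
        positions = range(last, first - 1, step)
    return [items[p - 1] for p in positions]


tokens = "alpha beta gamma delta".split()
print(select_range(tokens, 2, 3))        # like [2,3]
print(select_range(tokens, -2))          # like [-2,]
print(select_range(tokens, step=-1))     # like [,,-1]
```

With mixed positions such as [10,-10], resolve raises as soon as either end falls outside the line, which mirrors the fatal error the text warns about.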
-crok Retain a Carriage Return character which appears before the End of Line character. (Default is to delete it.)
-cu -cl -cf In selected characters/tokens change case to upper, lower, or first letter upper and rest lower. (Default is to leave case unmodified.)
-dbg Emit state and parsing information as each input line is processed. (Default is to not emit this information.)
-dl delimiter_string Change the delimiters used to define tokens. Typically delimiter_string must be quoted or escaped on the command line so that the shell does not interpret it. (Default string is space,colon,tab.)
-dt When tokens are emitted followed by delimiters, use the delimiter that ended the current token. (Default.) See also -d- and -dv.
-dq -dqs While parsing tokens ignore delimiters within double quotes. -dq returns the token with the surrounding double quotes, -dqs returns the token without the quotes. (Default is to recognize delimiters no matter where they occur.)
-dv delimit_character When tokens are emitted followed by delimiters, use delimit_character as that delimiter. (Default is -dt.)
-d- Do not emit a delimiter following a token. This is most often used in combination with the -s, -pd/-fw, and -j* switches. (Default is -dt, see also -dv.)
-ec end_column The last character column to select. (Defaults to -1, the last column.)
-eol eolstring When set, the output from each input row is terminated with eolstring instead of \n. Specifically, there will be one eolstring emitted for each input row even if the rest of the output row is empty. eolstring may be an empty string. This may be used to compress multiple input lines into a single output line. Typically \n would be injected into the output through -if and/or -cols and comma, space or colon would be used for eolstring.
-eqlen When reading from multiple input files require that they all have exactly the same number of lines. (Default is to read as many lines as are present in each.)
-er end_row The last text row to process. (Defaults to the final row in the file.)
-ffe -fff Format all fields as floating point numbers in exponential or floating formats. Example exponential: 1.234e+05. Example floating: 1.234. Precision, set by -fp, is the number of digits after the decimal point. If the field being formatted cannot be converted into a valid number a fatal error will result. The field width is set by -fw or -pd; however, if the resulting formatted number will not fit into the designated width the output will be expanded to fit, so be sure to leave enough space for the largest possible number. (Default is to format fields as text.)
-filebol STRING Writes STRING before the stream of data to the output file. See also -fileeol. (Default is not to write a string before the data stream.)
-fileeol STRING Writes STRING after the stream of data to the output file. See also -filebol. (Default is not to send a string after the data stream.)
-fp precision The precision for floating point formats. See -fff and -ffe. (Default precision is 6.)
-fw number_of_characters Specifies in number_of_characters the field width. The input field is either padded or truncated as required. When fields are processed they are padded, then justified, then the character cases adjusted. See also -pd. (Default is 0 - no change to field sizes.)
-h -help --help -? --?? Print the help message. (Default - do not print help message.)
-hcols Print detailed hcols help. (Default - do not print help message.)
-hexamples Print examples. (Default - do not print examples.)
-hnd If embedded null characters are encountered in the input they are deleted. See also -hnr, -hns, -hnsubs. (Default is -hnr.)
-hnr If embedded null characters are encountered in the input they are retained. However, the appearance of such a null character is a fatal event since a string containing them cannot be further processed. See also -hnd, -hns, -hnsubs. (Default.)
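The interaction of -ffe/-fff, -fp, and the field width can be pictured with Python's own number formatting, which happens to widen a field that is too narrow in the same way the text describes. This is a sketch under assumptions, not extract's implementation: format_field is a hypothetical name, and right alignment inside the field is Python's default rather than documented extract behaviour.

```python
def format_field(text, style="f", precision=6, width=0):
    """Format text as a float in 'f' (floating) or 'e' (exponential) style."""
    if style not in ("f", "e"):
        raise ValueError("style must be 'f' or 'e'")
    value = float(text)  # raises ValueError for non-numeric text (the fatal case)
    spec = f".{precision}{style}"
    if width > 0:
        spec = f"{width}{spec}"  # a minimum width; longer results are not cut
    return format(value, spec)


print(format_field("123400", style="e", precision=3))
print(format_field("1.5", style="f", precision=2, width=8))
print(format_field("123456.789", style="f", precision=2, width=4))
```

The last call asks for a width of 4 but the formatted number needs 9 characters, so the result simply comes out wider, which is why the text advises leaving room for the largest possible number.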
-hns If embedded null characters are encountered in the input they are substituted with \255. See also -hnr, -hnd, -hnsubs. (Default is -hnr.)
-hnsubs CHAR If embedded null characters are encountered in the input they are substituted with CHAR. See also -hnr, -hnd, -hns. (Default is -hnr.)
-i Emit version, copyright, license and contact information. (Default - do not emit information.)
-if tag Conditionally operate on an input line. The syntax for tag is [!][^]string[$], where string is any text which may contain tab and numeric escapes as in -dl, and the optional special characters mean:
^ string is located at the front of a line
$ string is located at the end of a line
! invert logic - operate when string is not found
If neither ^ nor $ is present string may appear anywhere in a line. These special characters must be escaped when they are part of the string: \^, \$, \!, and \\. When the tag is found in a line, that line is processed; otherwise the line is just echoed to the output. Use ^$ to match an empty string and !^$ to match all nonempty strings. (An empty string is one containing no characters.) Command line interpreters may interfere with some of the special characters. If that occurs use decimal representations: \33 for !, \94 for ^, and/or \36 for $. See also -ifonly.
(Default is to process all lines within the specified row range.)
-ifbol When set, those rows in an if block are emitted without the BOL string prefixed. This is used primarily to mark all rows other than those in the if block with a prefix tag.
-ifeol When set, those rows in an if block are emitted without an EOL character. This may be used to compress multiple input lines within an if block into a single output line.
-ifn N Extends the condition set by -if for N more lines. May not be combined with -ifterm.
-ifonly When set, only those rows satisfying -if and -ifn are emitted.
-ifterm endtag Extends the condition set by -if through the first line containing the endtag. The rules for processing the endtag are the same as for the -if tag. May not be combined with -ifn. When the tags are chosen so that the beginning -if and terminating -ifterm are not the same line, use -iftermeol STRING to finish off the end of the if block. When these tags are the same, the endtag really indicates the input line following the preceding if block. In this case use -iftermbol STRING to write a string between the two if blocks and do not use -iftermeol.
-iftermbol STRING Writes STRING before the first character in an -if block following termination by N lines, an -ifterm endtag, or the end of the file. Primarily this is useful when -ifterm endtag and -if tag are the same and a separator needs to be written between consecutive if blocks. Only one STRING is written for each if block terminator no matter how many input lines the block contains.
-iftermeol STRING Writes STRING after the last character in an -if block following termination by N lines, an -ifterm endtag, or the end of the file. Only one STRING is written for each if block terminator no matter how many input lines the block contains.
-in file1[,file2,file3,..fileN] Read input from one or more specified files in a comma delimited list.
When reading from more than one file the lines from each are concatenated into a single input line in the order shown. Use -indl to delimit the substrings. The special file name - corresponds to stdin. Only a single input file may be read from stdin. See also -eqlen. (Default is to read from stdin.)
-indl StreamDelimit When reading from more than one input file the string StreamDelimit is placed between each substring in the resultant final input string. (Default is an empty string - input strings are directly concatenated.)
-is In situ: modify the indicated character or token range and emit it along with the unmodified surrounding region. This option may not be used with -rm or -cols. (Default is to emit only the selected character/token range.)
-jl -jc -jr Justify field left, center, or right. (Default is to not change justification.)
-mc Process lines as character columns. See also -mt. (Default.)
-merge N Examine the N first characters in consecutive rows. If they are the same, emit the N character prefix once and the remainder of each matching row in sequence as one new row. Use -mdl to place delimiters between these fragments. The comparison is case sensitive. Prefix based merging follows merging from multiple input files and precedes any if contingent operations. See also -unmerge. (Default is to not merge based on common prefix.)
-mdl MergeDelimit When -merge is set and consecutive rows are being concatenated, introduce the string MergeDelimit between the fragments from each row. (Default is an empty string - input strings are directly concatenated.)
-mt Process lines as tokens. In this mode -sc, -ec, and -nc values refer to token numbers. (Default is character columns = -mc.)
If a single token is emitted then no delimiter is emitted with it. However, two or more tokens are emitted as:
token1 delim1 token2 delim2 token3 etc. tokenN
where delim1 is the first delimiter following token1. When -s is also used delim1 will be the only delimiter after token1, but if -s is not specified there may be other delimiters after delim1 and these will not be emitted. The last token emitted is not followed by a delimiter.
-n Prefix each line of output with: "line_number:". The line number is from the input file.
-nc number_of_columns Number of columns to select. Do not specify both -nc and -ec.
-nr number_of_rows Number of text rows to process starting from sr. Do not specify both -nr and -er.
-out output_file Write output to the specified file. (Default is to write to stdout.)
-pd number_of_characters Specifies the number_of_characters (spaces) to be added to the right side of the field. When fields are processed they are padded, then justified, then the character cases adjusted. See also -fw. (Default is 0 - no padding.)
-rcdc RCDS_STRING Case insensitive form of -rcds.
-rcds RCDS_STRING Remove from the output any characters found in the string RCDS_STRING. If that string begins with ! only those characters which match will be retained. This option may be combined with -rcss to induce substitution instead of deletion. rcds is derived from "Replace Character Delete String". (Default is to emit all characters without filtering.)
-rcss RCSS_STRING When a character matches in RCDS_STRING it is substituted from the same position in RCSS_STRING. These two strings must be the same length. When substituting, a ! in RCDS_STRING has no special meaning. rcss is derived from "Replace Character Substitute String". (Default is to emit all characters without filtering.)
-rm Remove the selected character columns/tokens instead of emitting them. This option may not be used with -is or -cols. (Default is to emit only the selected character/token range.)
-rs replacement_string replacement_string substitutes for empty fields.
Typically employed to insert NA or 0 in a tab delimited file which left unspecified values as empty fields. (Default is to leave empty fields empty.)
-rtdc RTDC_STRING Case insensitive form of -rtds.
-rtds RTDS_STRING Remove from the output string the text contained in RTDS_STRING. Multiple instances, if present, will be removed. This option may be combined with -rtss to induce substitution instead of deletion. rtds is derived from "Replace Text Delete String". (Default is to emit all text without replacement.)
-rtss RTSS_STRING When part of a line of text matches RTDS_STRING it is substituted with RTSS_STRING. These two strings need not be the same length. rtss is derived from "Replace Text Substitute String". (Default is to emit all text without replacement.)
-s Emit a token for each delimiter encountered. When -s is specified tokens may consist of empty strings. This mode is for use with delimited data as from a spreadsheet. (Default is to emit one token for each run of delimiters.)
-sc start_column The first character column to select. Columns are numbered from 1. Negative values are allowed and represent columns measured from the end of the line, where -1 is the last column. (Default start_column=1.)
-sr start_row The first text row (line of text) to process. Rows are numbered from 1. (Default start_row=1.)
-template N Template match two files. This is used to fill in the holes in a column of a table if all of the rows are known. Use -in template,file to specify which is the template (the first) and which is the file to compare to it (the second). The contents of the two files must be in the same order (for instance, sorted, but any order is ok) and the file may contain a subset of the rows present in the template. It may not contain any rows not present in the template. Compare the first N characters in a case sensitive manner and if they are the same pass the row from the file into the program. If they are different this indicates a "hole" in the file.
Instead, pass the first N characters from the template followed by the string specified by -indl. Normally this would be set to something like "NA", to indicate the presence of the hole. -template is incompatible with -merge. It may be used with -eqlen to verify that all expected rows are present. It is strongly suggested that the data in the first N columns of both files be justified and padded with spaces - otherwise "AB" will not match "AB data" for N=4. When a template is compared to a file the first blank line in each will act as an end of file. (Default no template processing.)
-trl -trr -trb -trc Trim out whitespace (spaces and tabs) in the field on the left, right, or both sides. Internal whitespace is not affected. -trc eliminates white space on both ends and compresses runs of internal whitespace to a single space. (Default is to leave the whitespace as is.)
-unmerge N Take a line consisting of multiple tokens and treat it as several lines, each beginning with the same first token, and containing sequential groups of N tokens, until all are consumed. Token delimiters are from -mdl, or if that isn't specified, -dl. Character column data must be converted to token (delimited) data before it may be processed with -unmerge. Each line number emitted by -n when -unmerge is active derives from the original input line. If an input line is unmerged into four lines each will have the same line number. See also -merge. (Default is to not unmerge.)
-wl widest_line Widest input line in characters. (Default widest_line=16000.)
-xc maXimum_Columns Maximum number of column fields ([] in -cols) and/or tokens that may be referenced. (Default maXimum_Columns=8192.)
Many options can interact with each other and the manner in which they do so is largely determined by the order of operations. The order in which operations are executed and the options that affect them are:
Literal strings which appear in the -cols, -rs, -dl, or -dv options are subject to the following substitutions:
\\ -> \
\n -> LF character
\r -> CR character
\t -> tab
[[ -> [
\[ -> [
]] -> ]
\] -> ]
\12 -> character whose value is 12 (values 1-255 only)
\1200 -> \1200 (because number was not in the allowed range)
\anything_else -> \anything_else
When \ is the last character on a line it does not escape the line terminator and it is emitted. So -cols "[1] \" will emit lines ending with \.
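The substitution table above can be sketched as a single left-to-right pass in Python. The function name substitute is hypothetical and this is not extract's own parser. One assumption is made explicit: a numeric escape consumes the whole run of digits after the backslash, which is what the \1200 entry suggests.

```python
import re

# Fixed two-character substitutions taken from the table above.
SIMPLE = {
    "\\\\": "\\",
    "\\n": "\n",
    "\\r": "\r",
    "\\t": "\t",
    "[[": "[",
    "\\[": "[",
    "]]": "]",
    "\\]": "]",
}

# Escaped backslash, single-letter and bracket escapes, doubled brackets,
# or a backslash followed by a run of decimal digits.
TOKEN = re.compile(r"\\\\|\\[nrt\[\]]|\[\[|\]\]|\\(\d+)")


def substitute(text):
    """Apply the literal-string substitutions in one left-to-right pass."""
    def replace(match):
        token = match.group(0)
        if token in SIMPLE:
            return SIMPLE[token]
        value = int(match.group(1))
        # Out-of-range numbers are left exactly as written.
        return chr(value) if 1 <= value <= 255 else token
    return TOKEN.sub(replace, text)


print(repr(substitute("[[WOW!]]\\n")))
print(repr(substitute("\\65\\1200\\q")))
```

Anything the pattern does not recognise, including a trailing backslash, passes through untouched, matching the last two rules in the table.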
% extract
% extract -h
List the command line options.
% extract -sc 50 <infile.txt >outfile.txt
Extract characters 50 to end of row for every line in infile.txt and write them to outfile.txt.
% extract -sr 4 -sc 5 -ec 10 <infile.txt >outfile.txt
Extract characters 5-10 from rows 4 to end of infile.txt and write them to outfile.txt.
% extract -sc 5 -nc 10 <infile.txt >outfile.txt
Extract characters 5-14 from all rows in infile.txt and write them to outfile.txt.
% extract -sc 2 -ec 3 -mt -dl ':,;' <infile.txt >outfile.txt
Extract the 2nd and 3rd tokens delimited by one or more :,; characters from each row in infile.txt and write them to outfile.txt.
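The difference between the default token parsing used in the last example (a run of delimiters counts once) and the -s mode (every delimiter ends a token, so empty tokens can occur) can be pictured with a short Python sketch. split_tokens is a hypothetical helper, not part of extract, and dropping empty tokens at the ends of the line in the default mode is an assumption.

```python
import re


def split_tokens(line, delimiters=" :\t", every_delimiter=False):
    """Split a line into tokens on any of the given delimiter characters."""
    if not delimiters:
        raise ValueError("at least one delimiter character is required")
    char_class = "[" + re.escape(delimiters) + "]"
    if every_delimiter:
        # Like -s: each delimiter ends a token, so tokens may be empty.
        return re.split(char_class, line)
    # Default: a run of delimiters separates two tokens only once.
    return [token for token in re.split(char_class + "+", line) if token]


line = "id;;name:value"
print(split_tokens(line, ":,;"))
print(split_tokens(line, ":,;", every_delimiter=True))
```

For spreadsheet-style data the second form is usually the one wanted, since an empty cell must still occupy a token position.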
% extract -sr 4 -er 40 -sc 2 -ec 3 -mt -dl ':,;' -s -all -rm <infile.txt >outfile.txt
Process infile.txt as follows:
1. Emit verbatim rows 1 through 3.
2. For rows 4 through 40 emit the 1st, and 4th through Nth tokens delimited by a single :,; character.
3. Emit verbatim rows 41 to the final row in the file.
% ( cd / ; du -k ) | extract -cols '[jr:fw14:1] [2]' -mt
Lists the size of all directories on a Unix system with the size field right justified so that the columns all line up.
% ls -al | extract -cols '[mc:1,32][fw14:jr:5] [6] [fw2:7] [jr:fw5:8] [9]' -mt -dl ' '
Straighten the columns in a directory listing on a Unix system with large files.
% extract -cols '[,-2]' <infile.txt
Converts a Windows CRLF text file to a Unix LF text file when run on a Unix system.
% extract -cols 'foo[cu:jl:fw20:3,5]blah[-:mc:10,30]er[1]' -mt -fw 30 <infile.txt
Process each line of infile.txt as follows:
1. Emit "foo".
2. Emit tokens 3, 4, and 5 upper cased in a 20 character field, left justified.
3. Emit "blah".
4. Emit characters 10 through 30.
5. Emit "er".
6. Emit token 1 in a field of width 30.
% extract <infile.txt >outfile.txt -if '^>' -cols '>SPECIAL [1,]'
Lines beginning with '>' are emitted with the modification shown. All other lines are echoed unchanged.
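The -if tag used in the last example follows the [!][^]string[$] syntax from the Options section. A minimal Python sketch of that matching logic is given here for illustration; parse_tag and tag_matches are hypothetical names, and the sketch handles only the \^, \$, \! and \\ escapes, not the tab and numeric escapes that extract also accepts.

```python
def parse_tag(tag):
    """Split a tag into (invert, at_start, at_end, text)."""
    invert = tag.startswith("!")
    if invert:
        tag = tag[1:]
    at_start = tag.startswith("^")
    if at_start:
        tag = tag[1:]
    text = []
    at_end = False
    i = 0
    while i < len(tag):
        ch = tag[i]
        if ch == "\\" and i + 1 < len(tag) and tag[i + 1] in "^$!\\":
            text.append(tag[i + 1])  # escaped special character, taken literally
            i += 2
        elif ch == "$" and i == len(tag) - 1:
            at_end = True            # an unescaped final $ anchors at end of line
            i += 1
        else:
            text.append(ch)
            i += 1
    return invert, at_start, at_end, "".join(text)


def tag_matches(line, tag):
    """Return True when the line should be processed under this tag."""
    invert, at_start, at_end, text = parse_tag(tag)
    if at_start and at_end:
        found = line == text
    elif at_start:
        found = line.startswith(text)
    elif at_end:
        found = line.endswith(text)
    else:
        found = text in line
    return found != invert


for line in [">seq1 description", "ACGT", ""]:
    print(repr(line), tag_matches(line, "^>"), tag_matches(line, "!^$"))
```

Lines for which tag_matches returns False correspond to the lines extract echoes unchanged (or drops when -ifonly is set).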
% extract -mt -dv '\t' -cols '[1,5][[WOW!]]\n[6]' <infile.txt
Emit the first five tokens separated by tabs and then on the next line emit [WOW!] followed immediately by the sixth token. WARNING: the EOL string is operating system specific. The example above is correct for Unix.
% extract -eol ',' -if Teacher -cols '\n[1,]' <infile.txt
If the infile consists of "Teacher name" lines each followed by many lines of student names, the output will consist of one blank line (assuming the first input line has "Teacher" in it) followed by lines like: "Teacher name, student1,student2,...studentn". WARNING: the EOL string is operating system specific. The example above is correct for Unix.
% extract -mt -if Teacher -ifterm Teacher -iftermbol '\n' -ifeol -cols '[2],' <infile.txt
If the infile consists of many instances of a "Teacher: Name" line followed by N "Student: Name" lines, the output will consist of several lines like: "Teacher name, student1,student2,...studentn".
% extract -indl ',' -in file1,file2,-
Merge the contents of file1, file2, and stdin, placing a comma between the part of the line from each file.
% extract -mdl ',' -merge 4 -in file1,file2,-
As above but also merge consecutive rows which begin with the same 4 character prefix. If three such rows were "foo 1", "foo 2", and "foo 3" the single output row would be "foo 1, 2, 3".
% extract -rcds '\r\12' -in file1
Remove carriage returns and linefeeds from the file and emit to stdout.
% extract -rcds 'Tt' -rcss 'Uu' -in file1
Substitute characters T->U and t->u and emit to stdout.
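The character filtering in the two -rcds examples maps closely onto Python's str.translate, so a short sketch may help picture the three behaviours (delete, retain with a leading !, and substitute when -rcss is given). filter_chars is a hypothetical helper written for this page, not extract's implementation.

```python
def filter_chars(text, rcds, rcss=None):
    """Delete, retain, or substitute single characters in text."""
    if rcss is not None:
        # Substitution: position i in rcds maps to position i in rcss,
        # and a leading ! has no special meaning in this mode.
        if len(rcds) != len(rcss):
            raise ValueError("the two strings must be the same length")
        return text.translate(str.maketrans(rcds, rcss))
    if rcds.startswith("!"):
        # Inverted deletion: keep only the listed characters.
        keep = set(rcds[1:])
        return "".join(ch for ch in text if ch in keep)
    # Plain deletion of every listed character.
    return text.translate(str.maketrans("", "", rcds))


print(filter_chars("ATTGCt", "Tt", "Uu"))
print(filter_chars("phone: 555-0100", "!0123456789"))
print(repr(filter_chars("line one\r\n", "\r\n")))
```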
% extract -rtds 'Thomas' -rtss 'Tom' -in file1
Substitute string Tom for Thomas and emit to stdout.
% extract -merge 5 -mdl ',' -in file1
If file1 contained the lines "abcd_ 1", "abcde 2", "abcde 3", "abcdf 4" the output would be "abcd_ 1", "abcde 2,3", "abcdf 4".
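Prefix based merging of consecutive rows, as in the -merge example above, can be sketched with itertools.groupby. merge_rows is a hypothetical name. The sketch assumes that the remainder of a row is everything after its first N characters and joins those remainders verbatim, so any whitespace that follows the prefix stays in each fragment; extract's own treatment of that whitespace may differ.

```python
from itertools import groupby


def merge_rows(rows, n, delimiter=""):
    """Merge consecutive rows that share the same first n characters."""
    merged = []
    # groupby only groups adjacent rows, matching "consecutive rows".
    for prefix, group in groupby(rows, key=lambda row: row[:n]):
        remainders = [row[n:] for row in group]
        merged.append(prefix + delimiter.join(remainders))
    return merged


rows = ["ab1 x", "ab1 y", "ab2 z", "ab1 w"]
for row in merge_rows(rows, 4, ","):
    print(row)
```

The final "ab1 w" row is not folded into the first group because it is not adjacent to it, which is why the input normally needs to be sorted on the prefix first.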
% extract -unmerge 2 -in file1
If file1 contained the line: "blah a b c d e" the output would be: "blah ab", "blah cd", "blah e".
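The grouping performed by -unmerge can be sketched on a list of tokens. unmerge_tokens is a hypothetical helper that returns token lists rather than finished output lines, which sidesteps the question of which delimiter extract writes between the tokens. Passing a line with only one token through unchanged is an assumption.

```python
def unmerge_tokens(tokens, n):
    """Split one tokenized line into rows of the first token plus n more."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not tokens:
        return []
    head, rest = tokens[0], tokens[1:]
    if not rest:
        return [[head]]  # assumption: a lone token passes through unchanged
    return [[head] + rest[i:i + n] for i in range(0, len(rest), n)]


for row in unmerge_tokens("blah a b c d e".split(), 2):
    print(row)
```

The last row holds fewer than N tokens when the count does not divide evenly, as with the trailing "e" in the example.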
% extract -in template,file -indl ' MISS' -template 3 -out fout
If template contains "120", "121", "122" and file contains "120 fred", "122 mary", write "120 fred", "121 MISS", "122 mary" to fout.
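The template matching in this example amounts to walking both inputs in step and emitting a filler row wherever the file lacks a key. A minimal Python sketch follows; fill_from_template is a hypothetical name, and the sketch does not model the rule that the first blank line acts as an end of file.

```python
def fill_from_template(template_rows, file_rows, n, filler):
    """Emit one row per template row, filling holes in file_rows."""
    output = []
    pending = iter(file_rows)
    current = next(pending, None)
    for template_row in template_rows:
        key = template_row[:n]
        if current is not None and current[:n] == key:
            output.append(current)       # the file has this row, pass it on
            current = next(pending, None)
        else:
            output.append(key + filler)  # a hole: template key plus filler
    if current is not None:
        # The file may only contain rows that are present in the template.
        raise ValueError(f"row not found in template: {current!r}")
    return output


template = ["120", "121", "122"]
data = ["120 fred", "122 mary"]
for row in fill_from_template(template, data, 3, " MISS"):
    print(row)
```

Because the comparison is on exactly the first N characters, keys of different lengths need to be padded to the same width in both inputs, as the -template description recommends.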
% find . | extract -cols 'extract -in [1,] -out foo.tmp -rtds "/usr/bin/perl" -rtss "/usr/bin/perl5" ; mv foo.tmp [1,]' | execinput
Use extract recursively as a stream editor. For each input file found by find the first extract prepares a command line where a second instance of extract converts each instance of "/usr/bin/perl" to "/usr/bin/perl5". The final execinput executes these command lines one at a time. (Note that the output first goes to a temporary file and is then moved back over the original input file.)
% extract -nr 1 -sc 3 -all -in unicode.txt -hnd
Delete embedded null characters from 16 bit unicode text. If -hnd were omitted there would be a fatal error when the first null character was encountered during the reading of this file. Also deletes the first two characters of the first line only, which comprise the unicode Byte Order Mark.
none
You may run this program on any platform. You may redistribute the source code of this program subject to the condition that you do not first modify it in any way. You may distribute binary versions of this program so long as they were compiled from unmodified source code. There is no charge for using this software. You may not charge others for the use of this software.
Copyright (C) 2002, 2003, 2004, 2005 David Mathog and Caltech.
This program was inspired by Pat Rankin's EXTRACT utility for VMS.
David Mathog, Biology Division, Caltech <mathog@caltech.edu>
| extract (1) | 30 Mar 2005 |