Manual Reference Pages  - extract (1)

NAME

extract - extracts and formats character ranges or tokens from text files.

CONTENTS

Synopsis
Description
Operator Order
Options
String Syntax
Examples
See Also
License
Copyright
Acknowledgements
Authors

SYNOPSIS

extract [options...]

DESCRIPTION

extract reads a text file (or files) and extracts a range of rows and columns (character positions), optionally reformats this data, and then outputs it. extract can process tokens instead of or in addition to character columns. For many simple text processing tasks extract is much simpler to use than sed, awk, or perl.

extract may be obtained as part of the drm_tools package from: ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/

OPERATOR ORDER

There are many extract command line options, but only those whose default values are not appropriate for a particular text modification need to be specified, with the caveat that at least one command line option must always be given. The order in which operations are executed and the command line options that affect those operations are:

OPTIONS

-all Emit, unprocessed, the text rows outside of the range specified with -sr, -er, or -nr. (Default is not to emit these rows.)

-bol <bolstring>
  When set the prefix <bolstring> is emitted before any output for each input row. Specifically, there will be one prefix string emitted for each input row even if the rest of the output row is empty. <bolstring> may be an empty string. Note that the prefix precedes any line numbers triggered by -n. (Default is an empty string.)

-bs -ba -b2
  Add backslashes (unix escape characters) before any character (other than alphabet, numeric, underscore, period, or slash), before all characters, or before all but the first character. If -ecc is also used the specified character is used instead of backslash. Note that this only applies within a field, so that, for instance, if the program is running in token mode a token range [1,3] would apply the backslashes between characters within each token but not between tokens. To work around that limitation use [dv\:1,3]. (Default is not to add backslashes.)
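As an illustration only, the three modes can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not the program's source: the name add_escapes is hypothetical, and it assumes that "alphabet, numeric" means ASCII letters and digits.

```python
import string

# Characters that -bs leaves unescaped (assumed here to be ASCII only).
SAFE = set(string.ascii_letters + string.digits + "_./")

def add_escapes(field, mode, ecc="\\"):
    """Model -bs / -ba / -b2 on one field; ecc plays the role of -ecc."""
    if mode == "bs":   # escape only characters outside the safe set
        return "".join(c if c in SAFE else ecc + c for c in field)
    if mode == "ba":   # escape every character
        return "".join(ecc + c for c in field)
    if mode == "b2":   # escape every character except the first
        return field[:1] + "".join(ecc + c for c in field[1:])
    raise ValueError("mode must be 'bs', 'ba', or 'b2'")

print(add_escapes("my file.txt", "bs"))     # my\ file.txt
print(add_escapes("abc", "b2", ecc=","))    # a,b,c
```

The second call mirrors the -b2 -ecc ',' combination, which places the chosen character between the characters of a field.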

-cols <format>
  Specify in great detail the format of the output line including the selection of multiple columns from each input line. (Default is to select a single column, which may be the entire input line.)

When -cols is specified the other command line options specify the default values for all column fields. Multiple column fields (indicated by [] brackets within <format>) may be specified. Text strings containing any symbol, including escaped characters, may be introduced between column fields. See String Syntax for examples. [ and ] must be escaped in a string or they will be interpreted as the limits of a column field. Column fields contain zero or more options delimited by colons ( : ) followed by a mandatory range value. Characters [ and ] are not allowed within a column field but all other characters are, and escapes may be used to include colons. Arbitrary combinations of text strings and column fields may be employed, freely mixing token and character mode columns, and emitting columns in any order, including emitting a single column multiple times. Typically <format> must be quoted or escaped on the command line so that the shell does not mangle it before passing it into the program.

The options for a column field are listed below. Most of them accept three generic forms: + (set_as) matches the command line specifications; p (default) matches the program defaults (overriding -pd, -jl, -uc, etc.); - (disable) disables the option. If one of these is employed as a single character it applies to all settings and must be the first option within a column field. As a suffix these may be applied singly to each of the -cols options.

mt/mc/m-/mp/m+ token mode/character mode/disable/default/set_as. Also sets the delimit state in some instances to match the command line, but this may be overridden again by a subsequent :d*: clause in the same column field. (overrides -mt , -mc )

jl/jr/jc/j-/jp/j+ justify left/right/center/disable/default/set_as (overrides -j* )

trl/trr/trb/trc/tr-/trp/tr+ trim left/right/both/compress/disable/default/set_as (overrides -tr* )

cu/cl/cf/c-/cp/c+ case upper/lower/first/disable/default/set_as (overrides -c* )

bs/ba/b2/b-/bp/b+ backslashes apply(as needed)/all/all but first/disable/default/set_as (overrides -bs )

eccCHAR/eccp/ecc+ escape character is CHAR /default/set_as (overrides -ecc )

dt/dvN/d-/dp/d+ emit actual token delimiter / char N / disable / default / set_as. Restriction: the delimit character N must be escaped if it is a colon or a backslash, ie \: and \\. (overrides -d* )

pd###/pd-/pdp/pd+ pad with ### spaces/disable/default/set_as (overrides -pd and -fw )

fw###/fw-/fwp/fw+ field width ### spaces/disable/default/set_as (overrides -pd and -fw )

fp###/fp-/fpp/fp+ floating point precision ### spaces/disable/default/set_as (overrides -fp )

fff/ffe/ff-/ffp/ff+ floating point format to float/exponent/default/set_as (overrides -ffe and -fff )

rsSTR/rs-/rsp/rs+ replacement string is STR /disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rs )

rcdsSTR/rcdcSTR/rcd-/rcdp/rcd+ rcds string is STR /case insensitive STR /disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rcds )

rcssSTR/rcs-/rcsp/rcs+ rcss string is STR /disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rcss )

rtdsSTR/rtdcSTR/rtd-/rtdp/rtd+ rtds string is STR /case insensitive STR /disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rtds )

rtssSTR/rts-/rtsp/rts+ rtss string is STR /disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rtss )

[c] [s,e] [s,] [,e] [s,,i] [,e,i] [,,i] range values for a single Column or column range (Start, End, Increment). Defaults for omitted column range values are: Start = the first column; End = the last column; and Increment = 1. The range values must be the final option in a column field, optionally preceded by a colon delimited list composed of the options listed above. Both the Start and End values may be positive or negative. If positive, they are column/token positions measured from the front of the line, where 1 is the first column. If negative, they are column/token positions measured from the end of the line, where -1 is the last column. Mixing modes like [10,-10] is possible but can generate fatal errors if the line is too short or has too few tokens to satisfy the range. The increment value defaults to 1 and may be anything other than zero. Increments only function in token mode; in character mode the value may be set but it is ignored. The expression [,,-1] emits all tokens in a line in reverse order. (overrides -sc, -ec, -nc.)
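The range rules can be made concrete with a short Python sketch. It is a model of the description above rather than the program's own logic: resolve_range is a hypothetical name, the ValueError stands in for the fatal error, and the treatment of a negative increment (walk the same positions backwards) is an assumption that is only known to agree with the documented [,,-1] case.

```python
def resolve_range(count, start=None, end=None, inc=1):
    """Return the 1-based token positions selected by [start,end,inc].

    count is the number of tokens (or characters) in the line.
    """
    if inc == 0:
        raise ValueError("increment may not be zero")

    def absolute(pos, default):
        if pos is None:
            return default
        # Negative positions count from the end: -1 is the last column.
        return pos if pos > 0 else count + 1 + pos

    first = absolute(start, 1)
    last = absolute(end, count)
    if not (1 <= first <= last <= count):
        # Stands in for the fatal error raised on lines that are too short.
        raise ValueError("range does not fit this line")
    positions = list(range(first, last + 1, abs(inc)))
    # Assumption: a negative increment emits the same positions in reverse.
    return positions[::-1] if inc < 0 else positions

print(resolve_range(5, 2, 3))        # [2, 3]
print(resolve_range(5, -2))          # [4, 5]
print(resolve_range(5, inc=-1))      # [5, 4, 3, 2, 1]
print(resolve_range(30, 10, -10))    # positions 10 through 21
```

With count=12 the last call raises instead, because [10,-10] then asks for positions 10 through 3.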

-crok Retain a Carriage Return character which appears before the End of Line character. (Default is to delete it.)

-cu -cl -cf
  In selected characters/tokens change case to upper, lower, or first letter upper and rest lower. (Default is to leave case unmodified.)

-dbg Emit state and parsing information as each input line is processed. Only a developer modifying the program’s code is likely to find this useful. (Default is not to emit this information.)

-dl <delimiter_string>
  Change the delimiters used to define tokens. Typically <delimiter_string> must be quoted or escaped on the command line so that the shell does not interpret it. (Default string contains the characters space, colon, and tab )

-dt When tokens are emitted followed by delimiters, use the delimiter that ended the current token in the input. (Default.) See also -d- and -dv.

-dq -dqs
  While parsing tokens ignore delimiters within double quotes. -dq returns the token with the surrounding double quotes, -dqs returns the token without the quotes. (Default is to recognize delimiters no matter where they occur.)

-dv <delimit_character>
  When tokens are emitted followed by delimiters, use <delimit_character> as the delimiter. (Default is -dt.)

-d- Do not emit a delimiter following a token. This is most often used in combination with the -s, -pd, -fw, and -j* switches. (Default is -dt , see also -dv ).

-ec <end_column>
  The last character column to select. (Default is -1, the last column.)

-ecc <escape_character>
  When set, the escape character for the -bs, -ba, and -b2 options becomes <escape_character>. This may be used to separate character based columns with delimiters so that the result can be read into a spreadsheet easily. (The default escape character is a backslash.)

-eol <eolstring>
  When set the output from each input row is terminated with <eolstring>. Specifically, there will be one <eolstring> emitted for each input row even if the rest of the output row is empty. <eolstring> may be an empty string. This may be used to compress multiple input lines into a single output line. Typically \n would be injected into the output through -if and/or -cols and a comma, space or colon would be used for <eolstring>. (Default value of <eolstring> is \n )

-eqlen When reading from multiple input files require that they all have exactly the same number of lines. (Default is to read as many lines as are present in each.)

-er <end_row>
  The last text row to process. (Default is the last row in the file.)

-ffe -fff
  Format all fields as floating point numbers in exponential or floating formats. Example exponential: 1.234e+05. Example floating: 1.234. The precision is set by -fp and is the number of digits after the decimal point. If the field being formatted cannot be converted into a valid number a fatal error will result. The field width is set by -fw or -pd. If the resulting formatted number will not fit into the designated width the output will be expanded to fit, so be sure to leave enough space for the largest possible number. (Default is to format fields as text.)
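The effect of the precision setting can be previewed with Python's own number formatting, which follows the same printf-style conventions. This is only a sketch: format_field is a hypothetical helper, field width handling is left out, and Python's float() may accept a few spellings that the program rejects.

```python
def format_field(field, style, precision=6):
    """Model -fff (style 'f') and -ffe (style 'e') with the -fp precision."""
    value = float(field)   # raises ValueError if the text is not a number
    return format(value, f".{precision}{style}")

print(format_field("123400", "e", 3))   # 1.234e+05
print(format_field("1.2341", "f", 3))   # 1.234
```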

-filebol <STRING>
  Writes <STRING> before the stream of data to the output file. See also -fileeol. (Default is not to write a string before the data stream.)

-fileeol <STRING>
  Writes <STRING> after the stream of data to the output file. See also -filebol. (Default is not to write a string after the data stream.)

-fp <precision>
  The precision for floating point formats. See -fff and -ffe. (Default precision is 6.)

-fw <number_of_characters>
  <number_of_characters> specifies the field width. The input field is either padded or truncated as required. See also -pd. (Default is 0 - no change to field sizes.)

-h -help --help -? --??
  Print the help message. (Default is not to print help message.)

-hcols Print detailed -cols help. (Default is not to print the -cols help message.)

-hexamples
  Print examples. (Default is not to print examples.)

-hnd If embedded null characters are encountered in the input they are deleted. hnd is an acronym for "Handle Nulls Delete". See also -hnr, -hns, -hnsubs. (Default is -hnr.)

-hnr If embedded null characters are encountered in the input they are retained. However, the appearance of such a null character is a fatal event since a string containing them cannot be further processed. hnr is an acronym for "Handle Nulls Retain". See also -hnd,-hns,-hnsubs. (Default)

-hns If embedded null characters are encountered in the input they are substituted with \255. hns is an acronym for "Handle Nulls Substitute". See also -hnr,-hnd,-hnsubs. (Default is -hnr. )

-hnsubs <CHAR>
  If embedded null characters are encountered in the input they are substituted with <CHAR>. hnsubs is an acronym for "Handle Nulls Substitute". See also -hnr,-hnd,-hns. (Default is -hnr. )

-i Emit version, copyright, license and contact information. (Default is not to emit information.)

-if <tag>
  Conditionally operate on an input line. The syntax for <tag> is [!][^]string[$] , where: string is any text which may contain tab and numeric escapes as for -dl ; ^ string is located at the front of a line ; $ string is located at the end of a line ; ! invert logic - operate when string is not found. If neither ^ nor $ is present string may appear anywhere in a line. These special characters must be escaped when they are part of the string part of the expression: ^ , $ , ! , and \. Lines containing the <tag> are processed, other lines are just echoed to the output. Use ^$ to match an empty string and !^$ to match all nonempty strings. (An empty string is one containing no characters.) Command line interpreters may interfere with some of the special characters. If that occurs use decimal representations: \33 for ! , \94 for ^ , \36 for $. See also -ifonly. (Default is to process all lines within the specified row range.)
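The <tag> syntax can be illustrated with a small Python matcher. It is a sketch under stated assumptions, not the program's parser: tag_matches is a hypothetical name, numeric and tab escapes are not handled, and a tag that uses both ^ and $ is assumed to require the whole line to equal the string.

```python
import re

def _unescaped_dollar_at_end(text):
    """True if text ends with a $ that is not escaped by a backslash."""
    if not text.endswith("$"):
        return False
    body = text[:-1]
    backslashes = len(body) - len(body.rstrip("\\"))
    return backslashes % 2 == 0

def tag_matches(tag, line):
    """Return True if line satisfies a tag of the form [!][^]string[$]."""
    invert = tag.startswith("!")
    if invert:
        tag = tag[1:]
    at_front = tag.startswith("^")
    if at_front:
        tag = tag[1:]
    at_end = _unescaped_dollar_at_end(tag)
    if at_end:
        tag = tag[:-1]
    # Turn the escaped specials \^ \$ \! \\ back into literal characters.
    text = re.sub(r"\\([\^$!\\])", r"\1", tag)
    if at_front and at_end:
        found = line == text
    elif at_front:
        found = line.startswith(text)
    elif at_end:
        found = line.endswith(text)
    else:
        found = text in line
    return found != invert

print(tag_matches("^>", ">seq1"))   # True
print(tag_matches("^$", ""))        # True
print(tag_matches("!^$", ""))       # False
```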

-ifbol When set those rows in an if block are emitted without the BOL string prefixed. This is used primarily to mark all rows other than those in the if block with a prefix tag. (Default is to emit the BOL string.)

-ifeol When set those rows in an if block are emitted without an EOL character. This may be used to compress multiple input lines within an if block into a single output line. (Default is to emit the EOL string.)

-ifn <N> Extends the condition set by -if for <N> more lines. May not be combined with -ifterm. (Default is not to extend the conditional processing.)

-ifonly
  When set only those rows satisfying -if and -ifn are emitted. (Default is to emit other lines unchanged.)

-ifnorestart
  Normally within an -if block each line is tested to see if it matches the -if <tag> and if it does the block is extended. This happens when either -ifn <N> or -ifterm <endtag> is also specified. If -ifnorestart is specified under these conditions lines within an existing -if block are not tested and so the block will not be "restarted". (Default is to restart.)

-ifterm <endtag>
  Extends the condition set by -if through the first line containing the <endtag>. The rules for processing the <endtag> are the same as for the -if <tag>. May not be combined with -ifn. When the tags are chosen so that the beginning -if and terminating -ifterm are not the same line use -iftermeol <STRING> to finish off the end of the if block. When these tags are the same the <endtag> really indicates the input line following the preceding if block. In this case use -iftermbol <STRING> to write a string between the two if blocks and do not use -iftermeol. (Default is not to extend conditional processing.)

-iftermbol <STRING>
  Writes <STRING> before the first character in the last line of an -if block. That line is determined by either -ifn <N> or -ifterm <endtag> or the end of the file. Primarily this is useful when -ifterm <endtag> and -if <tag> are the same and a separator needs to be written between consecutive if blocks. Only one <STRING> is written for each if block terminator no matter how many input lines the block contains. (Default value is an empty string.)

-iftermeol <STRING>
  Writes <STRING> after the last character in an -if block. The end of the block is determined from -ifn <N> , or -ifterm <endtag> , or if neither of these are specified, the first line not matching -if <tag> , or the end of the file. Only one <STRING> is written for each if block terminator no matter how many input lines the block contains. (Default value is an empty string.)

-in file1[,file2,file3,..fileN]
  Read input from one or more specified files in a comma delimited list. When reading from more than one file the lines from each are concatenated into a single input line in the order shown. Use -indl to delimit the substrings. The special file name - corresponds to stdin. Only a single input file may be read from stdin. See also -eqlen. The -h option displays the maximum number of input files. (Default is to read from stdin.)

-indl <StreamDelimit>
  When reading from more than one input file the string <StreamDelimit> is placed between each substring in the resultant final input string. (Default is an empty string - input strings are directly concatenated.)

-is Modify the indicated character or token range "in situ" and emit them and the unmodified surrounding region. This option may not be used with -rm or -cols. (Default is to emit only the selected character/token range.)

-jl -jc -jr
  Justify field left, center, or right. (Default is not to change justification.)

-ll Prefix each line of output with "line_length:". The line length is the number of characters in the final input line after reading a line from all input files and inserting delimiters. If -n is also specified the line number value is emitted, then the line length value is emitted, and finally the rest of the output line is emitted. (Default is not to emit line lengths.)

-mc Process lines as character columns. See also -mt. (Default.)

-merge <N>
  Examine the <N> first characters in consecutive rows. If they are the same emit the <N> character prefix once and the remainder of each matching row in sequence as one new row. Use -mdl to place delimiters between these fragments. The comparison is case sensitive. Prefix based merging follows merging from multiple input files and precedes any if contingent operations. See also -unmerge. (Default is not to merge based on common prefix.)

-mdl <MergeDelimit> When -merge is set and consecutive rows are being concatenated introduce the string <MergeDelimit> between the fragments from each row. (Default is an empty string - input strings are directly concatenated.)

-mt Process lines as tokens. In this mode -sc , -ec , and -nc values refer to token numbers. If a single token is emitted then no delimiter is emitted with it. However, two or more tokens are emitted as:
token1 delim1 token2 delim2 token3 ... tokenN
Where: delim1 is the first delimiter following token1. Note that no terminal delimiter is added after the last token. This mode is appropriate when delimiters are white space. Add -s when every delimiter indicates a token and empty tokens are allowed, for instance when reading spreadsheet data. See also -dl. (Default is -mc.)

-n Prefix each line of output with: "line_number:". The line number is that line’s position in the input file. (Default is not to number input lines.)

-nc <number_of_columns>
  Number of columns to process starting from sc. Do not specify both -nc and -ec. (Default is to process all columns.)

-nr <number_of_rows>
  Number of text rows to process starting from sr. Do not specify both -nr and -er. (Default is to process all rows.)

-out <output_file>
  Write output to the specified file. (Default is to write to stdout.)

-pd <number_of_characters>
  Specifies the <number_of_characters> (spaces) to be added to the right side of the field. When fields are processed they are padded, then justified, then the character cases adjusted. See also -fw. (Default is 0 - no padding.)

-rcdc <RCDS_STRING>
  Case insensitive form of -rcds

-rcds <RCDS_STRING>
  Remove from the output any characters found in the string <RCDS_STRING>. If that string begins with ! only those characters which match will be retained. This option may be combined with -rcss to induce substitution instead of deletion. rcds is an acronym for "Replace Character Delete String". (Default is to emit all characters without filtering.)

-rcss <RCSS_STRING>
  When a character matches in <RCDS_STRING> it is replaced by the character at the same position in <RCSS_STRING>. These two strings must be the same length. When substituting, a ! in <RCDS_STRING> has no special meaning. rcss is an acronym for "Replace Character Substitute String". (Default is to emit all characters without filtering.)
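The interplay of -rcds and -rcss can be sketched in Python as follows. The function filter_chars is a hypothetical model of the text above (the case insensitive -rcdc form is not covered), not code taken from the program.

```python
def filter_chars(text, rcds, rcss=None):
    """Model -rcds alone (delete or retain) and -rcds with -rcss (substitute)."""
    if rcss is not None:
        if len(rcds) != len(rcss):
            raise ValueError("the two strings must be the same length")
        # With a substitute string, "!" is an ordinary character.
        return text.translate(str.maketrans(rcds, rcss))
    if rcds.startswith("!"):
        keep = set(rcds[1:])          # retain only the listed characters
        return "".join(c for c in text if c in keep)
    drop = set(rcds)                  # delete the listed characters
    return "".join(c for c in text if c not in drop)

print(filter_chars("Tattoo", "Tt", "Uu"))       # Uauuoo
print(filter_chars("a-b-c", "-"))               # abc
print(filter_chars("a1b2", "!0123456789"))      # 12
```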

-rm Remove the selected character columns/tokens instead of emitting them. This option may not be used with -is or -cols. (Default is to emit only the selected character/token range.)

-rs <replacement_string>
  <replacement_string> substitutes for empty fields. Typically employed to insert NA or 0 in a tab delimited file which left unspecified values as empty fields. Note, a colon ( : ) is used to delimit fields filled with <replacement_string>. Use -dv to change this. (Default is to leave empty fields empty.)

-rtdc <RTDC_STRING>
  Case insensitive form of -rtds

-rtds <RTDS_STRING>
  Remove from the input string the text contained in <RTDS_STRING>. Multiple instances, if present, will be removed. This option may be combined with -rtss to induce substitution instead of deletion. rtds is an acronym for "Replace Text Delete String". (Default is to emit all text without replacement.)

-rtss <RTSS_STRING>
  When a part of a line of text matches <RTDS_STRING> it is substituted with <RTSS_STRING>. These two strings need not be the same length. rtss is an acronym for "Replace Text Substitute String". (Default is to emit all text without replacement.)

-s Emit a token for each delimiter encountered. When -s is specified tokens may consist of empty strings. This mode is for use with delimited data as from a spreadsheet. (Default is to emit one token for each run of delimiters.)
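The difference between the default tokenizing and -s is easiest to see side by side. The Python sketch below is a model only: tokens is a hypothetical name, and discarding empty pieces at the ends of a line in the default mode is an assumption.

```python
import re

DEFAULT_DELIMITERS = " :\t"   # the documented default for -dl

def tokens(line, delimiters=DEFAULT_DELIMITERS, single=False):
    """Split a line into tokens; single=True models -s."""
    one = "[" + re.escape(delimiters) + "]"
    if single:
        # Every delimiter ends a token, so empty tokens can appear.
        return re.split(one, line)
    # Default: a run of delimiters separates two tokens. Dropping empty
    # pieces at either end of the line is an assumption of this sketch.
    return [t for t in re.split(one + "+", line) if t]

print(tokens("a::b"))                          # ['a', 'b']
print(tokens("a::b", single=True))             # ['a', '', 'b']
print(tokens("1\t\t3", "\t", single=True))     # ['1', '', '3']
```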

-sc <start_column>
  The first character column to select. Columns are numbered from 1. Negative values are allowed and represent columns measured from the end of the line, where -1 is the last column. (Default is 1, the first column.)

-sr <start_row>
  The first text row (line of text) to process. Rows are numbered from 1. (Default is 1, the first row.)

-template <N>
  Template match two files. This is used to fill in the holes in a column of a table if all of the rows are known. Use -in <template,file> to specify which is the <template> (the first) and which is the <file> to compare to it (the second). The contents of the two files must be in the same order (for instance, sorted, but any order is ok). The <file> may contain a subset of the rows present in the <template>. It may not contain any rows not present in the <template>. Compare the first <N> characters in a case sensitive manner and if they are the same pass the row from the <file> into the program. If they are different this indicates a "hole" in the file. Instead, pass the first <N> characters from the <template> followed by the string specified by -indl. Normally this would be set to something like "NA", to indicate the presence of the hole. -template is incompatible with -merge. It may be used with -eqlen to verify that all expected rows are present. It is strongly suggested that the data in the first <N> columns of both files be justified and padded with spaces - otherwise "AB" will not match "AB data" for <N> = 4. When a template is compared to a file the first blank line in each will act as an end of file. (Default is no template processing.)
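The hole-filling procedure can be written out as a short Python sketch. It models the description above and is not the program's implementation; template_fill is a hypothetical name and the inputs are plain lists of lines.

```python
def template_fill(template, rows, n, indl):
    """Walk template and rows in step, filling holes with template[:n] + indl."""
    out = []
    pending = iter(rows)
    current = next(pending, None)
    for expected in template:
        if expected == "":
            break                    # a blank template line acts as end of file
        if current == "":
            current = None           # a blank file line acts as end of file
        if current is not None and current[:n] == expected[:n]:
            out.append(current)      # row is present: pass it through
            current = next(pending, None)
        else:
            out.append(expected[:n] + indl)   # hole: emit key plus marker
    return out

print(template_fill(["120", "121", "122"], ["120 fred", "122 mary"], 3, " MISS"))
# ['120 fred', '121 MISS', '122 mary']
```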

-trl -trr -trb -trc
  Trim out whitespace (spaces and tabs) in the field on the left, right, or both sides. Internal whitespace is not affected. -trc eliminates white space on both ends and compresses runs of internal whitespace to a single space. (Default is to leave the whitespace as is.)
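For reference, the four trim modes correspond to the following Python operations. The function name trim is hypothetical and the snippet only mirrors the description above.

```python
import re

WHITESPACE = " \t"   # the man page defines whitespace as spaces and tabs

def trim(field, mode):
    """Model -trl, -trr, -trb and -trc on a single field."""
    if mode == "trl":
        return field.lstrip(WHITESPACE)
    if mode == "trr":
        return field.rstrip(WHITESPACE)
    if mode == "trb":
        return field.strip(WHITESPACE)
    if mode == "trc":
        # Trim both ends, then squeeze internal runs to one space.
        return re.sub(r"[ \t]+", " ", field.strip(WHITESPACE))
    raise ValueError("mode must be 'trl', 'trr', 'trb', or 'trc'")

print(repr(trim("  a   b  ", "trb")))   # 'a   b'
print(repr(trim("  a   b  ", "trc")))   # 'a b'
```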

-unmerge <N>
  Take a line consisting of multiple tokens and treat it as several lines, each beginning with the same first token, and containing sequential groups of <N> tokens, until all are consumed. Token delimiters are from -mdl , or if that isn’t specified, -dl. Character column data must be converted to token (delimited) data before it may be processed with -unmerge. Each line number emitted by -n when -unmerge is active derives from the original input line. If an input line is unmerged into four lines each will have the same line number. See also -merge. (Default is not to unmerge.)

-wl <widest_line>
  Widest input line in characters. (Default is 16000 characters.)

-xc <maXimum_Columns>
  Maximum number of column fields ([] in -cols) and/or tokens that may be referenced. (Default is 8192 fields.)

STRING SYNTAX

Text strings which appear in the -cols, -rs, -dl, or -dv options are subject to the following substitutions:

\\ -> \
\n -> LF character
\r -> CR character
\t -> tab
[[ -> [
\[ -> [
]] -> ]
\] -> ]
\12 -> character whose value is 12 (values 1-255 only)
\1200 -> \1200 (because number was not in the allowed range)
\anything_else -> \anything_else

When \ is the last character on a line it does not escape the line terminator and it is emitted. So -cols '[1] \' will emit lines ending with \.
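The substitution table can be expressed as a small Python function, which is handy for checking what a string will become before putting it on a command line. This is a sketch: unescape is a hypothetical name, numeric escapes are read as decimal values as the table states, and a run of digits is assumed to be consumed as one number.

```python
import re

SIMPLE = {
    "\\\\": "\\", "\\n": "\n", "\\r": "\r", "\\t": "\t",
    "[[": "[", "\\[": "[", "]]": "]", "\\]": "]",
}
# Order matters: an escaped backslash is tried before the other escapes.
PATTERN = re.compile(r"\\\\|\\[nrt\[\]]|\\\d+|\[\[|\]\]")

def unescape(text):
    """Apply the substitutions from the String Syntax table."""
    def replace(match):
        piece = match.group(0)
        if piece in SIMPLE:
            return SIMPLE[piece]
        value = int(piece[1:])             # decimal character code
        return chr(value) if 1 <= value <= 255 else piece
    return PATTERN.sub(replace, text)

print(repr(unescape(r"a\tb")))       # 'a\tb'
print(repr(unescape("[[WOW!]]")))    # '[WOW!]'
print(repr(unescape(r"\1200")))      # '\\1200'
```

Anything not listed in the table, such as \q, passes through unchanged.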

EXAMPLES

% extract -h
  List the command line options.

% cat file | extract -sr 1
  Echo all text from stdin to stdout. (Specifying any one command line option with its default value will do the same.)

% extract -sc 50 <infile.txt >outfile.txt
  Extract characters 50 to end of row for every line in infile.txt and write them to outfile.txt.

% extract -sr 4 -sc 5 -ec 10 <infile.txt >outfile.txt
  Extract characters 5-10 from rows 4 to end of infile.txt and write them to outfile.txt.

% extract -sc 5 -nc 10 <infile.txt >outfile.txt
  Extract characters 5-14 from all rows in infile.txt and write them to outfile.txt.

% extract -sc 2 -ec 3 -mt -dl ':,;' <infile.txt >outfile.txt
  Extract the 2nd and 3rd tokens delimited by one or more :,; characters from each row in infile.txt and write them to outfile.txt.

% extract -sr 4 -er 40 -sc 2 -ec 3 -mt -dl ':,;' -s -all -rm <infile.txt >outfile.txt
  Process infile.txt as follows:
1. Emit verbatim rows 1 through 3.
2. For rows 4 through 40 emit the 1st, and 4th through Nth tokens delimited by a single :,; character.
3. Emit verbatim rows 41 to the final row in the file.

% ( cd / ; du -k ) | extract -cols '[jr:fw14:1] [2]' -mt
  Lists the size of all directories on a Unix system with the size field right justified so that the columns all line up.

% ls -al | extract -cols '[mc:1,32][fw14:jr:5] [6] [fw2:7] [jr:fw5:8] [9]' -mt -dl ' '
  Straighten the columns in a directory listing on a Unix system with large files.

% extract -cols '[,-2]' <infile.txt
  Converts a Windows CRLF text file to a Unix LF text file when run on a Unix system.

% extract -cols 'foo[cu:jl:fw20:3,5]blah[-:mc:10,30]er[1]' -mt -fw 30 <infile.txt
  Process each line of infile.txt as follows:
1. Emit "foo".
2. Emit tokens 3, 4, and 5 upper cased in a 20 character field, left justified.
3. Emit "blah".
4. Emit characters 10 through 30.
5. Emit "er".
6. Emit column 1 in a field of width 30.

% extract <infile.txt >outfile.txt -if '^>' -cols '>SPECIAL [1,]'
  Lines beginning with > are emitted with the modification shown. All other lines are echoed unchanged.

% extract -mt -dv '\t' -cols '[1,5]\n[[WOW!]][6]' <infile.txt
  Emit the first five tokens separated by tabs and then on the next line emit [WOW!] followed immediately by the sixth token.

% extract -eol ',' -if Teacher -cols '\n[1,]' -fileeol '\n' <infile.txt
  If the infile consists of "Teacher name" lines each followed by many lines of student names, the output will consist of one blank line (assuming the first input line has "Teacher" in it) followed by lines like: "Teacher name, student1,student2,...studentn,".

% extract -mt -if Teacher -ifterm Teacher -iftermbol '\n' -ifeol -cols '[2],' <infile.txt
  If the infile consists of many instances of a "Teacher: Name" line followed by N "Student: Name" lines the output will consist of several lines like: "Teacher name, student1,student2,...studentn,".

% extract -indl ',' -in file1,file2,-
  Merge the contents of file1, file2, and stdin, placing a comma between the part of the line from each file.

% extract -mdl ',' -merge 4 -in file1,file2,-
  As above but also merge consecutive rows which begin with the same 4 character prefix. If three such rows were "foo 1", "foo 2", and "foo 3" the single output row would be "foo 1, 2, 3".

% extract -rcds '\r\12' -in file1
  Remove carriage returns and linefeeds from the file and emit to stdout.

% extract -rcds 'Tt' -rcss 'Uu' -in file1
  Substitute characters T->U and t->u and emit to stdout.

% extract -rtds 'Thomas' -rtss 'Tom' -in file1
  Substitute string Tom for Thomas and emit to stdout.

% extract -merge 5 -mdl ',' -in file1
  If file1 contained the lines "abcd_ 1", "abcde 2", "abcde 3", "abcdf 4" the output would be "abcd_ 1", "abcde 2,3", "abcdf 4".

% extract -unmerge 2 -in file1
  If file1 contained the line: "blah a b c d e" the output would be: "blah ab", "blah cd", "blah e".

% extract -in template,file -indl ' MISS' -template 3 -out fout
  If template contains "120","121","122" and file contains "120 fred","122 mary" write "120 fred","121 MISS","122 mary" to fout.

% find . | extract -cols 'extract -in [1,] -out foo.tmp -rtds /usr/bin/perl -rtss /usr/bin/perl5 ; mv foo.tmp [1,]' | execinput
  Use extract recursively as a stream editor. For each input file found by find the first extract prepares a command line where a second instance of extract converts each instance of /usr/bin/perl to /usr/bin/perl5. The final execinput executes these command lines one at a time. (Note that the output first goes to a temporary file and is then copied back over the original input file.)

% extract -nr 1 -sc 3 -all -in unicode.txt -hnd
  Delete embedded null characters from 16 bit unicode text. If the -hnd was omitted there would be a fatal error when the first null character was encountered during the reading of this file. Also deletes the first two characters of the first line only, which comprise the unicode Byte Order Mark.

% extract -b2 -ecc ',' -in data.txt -out comma_delimited.txt
  Place a comma between every character in data.txt. The result may be read into a spreadsheet with one character per cell.

SEE ALSO

execinput(1)

LICENSE

GNU General Public License 2

COPYRIGHT

Copyright (C) 2007 David Mathog and Caltech.

ACKNOWLEDGEMENTS

This program was inspired by Pat Rankin’s EXTRACT utility for VMS.

AUTHORS

David Mathog, Biology Division, Caltech <mathog@caltech.edu>


drm_tools extract (1) 1.0.35 JUL 03 2007