Problem: There are four main kinds of dash-like characters: hyphens, en-dashes, em-dashes, and minus signs. Wikipedia discusses the differences between them. In LaTeX, a hyphen is typed as -, an en-dash as --, an em-dash as ---, and a minus sign as $-$ (a hyphen in math mode). Ranges of numbers (e.g., pages 96 to 100) should be set with an en-dash. However, all the page ranges in my BibTeX database were written with a single hyphen.
Thus, I wanted to replace "-" with "--" but only for page numbers.
The initial text file looked a little like this but with 1000 more references:
@ARTICLE{Reder1987CP,
author = {Reder, L. M.},
title = {Strategy selection in question answering},
journal = {Cognitive Psychology},
year = {1987},
volume = {19},
pages = {90-138},
endnotereftype = {Journal Article},
shorttitle = {Strategy selection in question answering}
}
@ARTICLE{Reder1982PR,
author = {Reder, L. M.},
title = {Plausability judgments versus fact retrieval: Strategies for sentence
verification},
journal = {Psychological Review},
year = {1982},
volume = {89(3)},
pages = {248-278},
endnotereftype = {Journal Article},
shorttitle = {Plausability judgments versus fact retrieval: Strategies for sentence
verification}
}
And I wanted something like this (note the "pages = {...}"):
@ARTICLE{Reder1987CP,
author = {Reder, L. M.},
title = {Strategy selection in question answering},
journal = {Cognitive Psychology},
year = {1987},
volume = {19},
pages = {90--138},
endnotereftype = {Journal Article},
shorttitle = {Strategy selection in question answering}
}
@ARTICLE{Reder1982PR,
author = {Reder, L. M.},
title = {Plausability judgments versus fact retrieval: Strategies for sentence
verification},
journal = {Psychological Review},
year = {1982},
volume = {89(3)},
pages = {248--278},
endnotereftype = {Journal Article},
shorttitle = {Plausability judgments versus fact retrieval: Strategies for sentence
verification}
}
Solution: The natural choice was to use regular expressions. Many programming languages (and some text editors) support regular expressions. Because I'm most familiar with R, I tend to use R to process regular expressions. It's probably not the most obvious choice, but it does allow me to get feedback about how the patterns are matched and replaced. And it means that I can leverage my skills in R to use regular expressions. It also means that when I need to use string manipulation for data analysis, I am familiar with the tools.
Overview of regular expressions: For readers unfamiliar with them, regular expressions are an extremely powerful tool for finding and replacing text. Information about support for regular expressions in R can be found by typing ?regex. Additional information about the actual search and replace functions can be found in the help for one of the string manipulation functions, such as ?gsub. Data Manipulation with R has a chapter on string manipulation in R that I found helpful. Regular-Expressions.info also has a tutorial.
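As a minimal sketch of the functions mentioned above, the following applies grepl(), sub(), and gsub() to a couple of toy strings. The strings and variable names are made up for illustration and are not taken from the real database.

```r
# Toy strings for illustration only (not from the real database)
toyLines <- c("pages = {90-138},", "title = {Self-report},")

# grepl() reports which elements contain a match for the pattern
hasRange <- grepl("[[:digit:]]+-[[:digit:]]+", toyLines)

# sub() replaces only the first match in each element;
# gsub() replaces every match
firstOnly <- sub("-", "--", "a-b-c")
allMatches <- gsub("-", "--", "a-b-c")

print(hasRange)
print(firstOnly)
print(allMatches)
```

The difference between sub() and gsub() matters here: a page range contains one hyphen to replace, so sub() is enough.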
Copy of the R Code
# Read the BibTeX database from the clipboard
# (this could be a file instead).
# The result is a character vector where each line is an element.
x <- readLines("clipboard-128")
# The initial filter reads:
# "^" start of the string (here, each line)
# " pages = " literal text
# "[{]" a literal open brace; "{" is a special character,
# so it is placed in square brackets to match it literally
# "[[:digit:]]" any digit from 0 to 9
# "+" one or more of the preceding item
# (i.e., one or more digits)
# "-" literal hyphen
# "[[:digit:]]" any digit from 0 to 9
# "+" one or more of the preceding item
# (i.e., one or more digits)
initialFilter <- "^ pages = [{][[:digit:]]+-[[:digit:]]+"
myPattern <- "-"
myReplacement <- "--"
xOutput <- x
# Apply initial filter
xSubset <- grep(initialFilter, x)
# Replace matches within filter
xOutput[xSubset] <- sub(pattern = myPattern,
replacement = myReplacement, x = x[xSubset])
# Basic check that it worked:
# column 1 shows the original lines, column 2 the replaced lines
cbind(x[x != xOutput], xOutput[x != xOutput])
# Inspect the full output
xOutput
# Write the replaced text to a file
writeLines(xOutput, "xOutput.txt")
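The filter-then-replace steps above can also be collapsed into a single call by using capture groups. The following is a sketch of that alternative, not the approach used in this post; the function name fixPageRange, its argument bibLines, and the toy lines are all hypothetical. It assumes each pages field sits on its own line and holds a simple numeric range.

```r
# Hypothetical helper: replace the single hyphen in a numeric page range
# with a double hyphen, leaving every other line untouched.
fixPageRange <- function(bibLines) {
  # Group 1: optional leading space, "pages = {", and the first page number
  # Group 2: the last page number
  pattern <- "^([[:space:]]*pages[[:space:]]*=[[:space:]]*[{][[:digit:]]+)-([[:digit:]]+)"
  # "\\1" and "\\2" put the captured groups back around the "--"
  sub(pattern, "\\1--\\2", bibLines)
}

# Toy lines for illustration only
toyBib <- c("  pages = {90-138},",
            "  title = {Self-report measures},",
            "  pages = {248--278},")
toyFixed <- fixPageRange(toyBib)
print(toyFixed)
```

Because the pattern requires a digit immediately after the hyphen, a range that already uses "--" does not match, so running the function twice should not add extra hyphens.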
Copy of the R Output from the Check:
The following shows the first few lines of the check. The first column shows the original text and the second column shows the replaced text:
> cbind(x[x != xOutput], xOutput[x != xOutput])
     [,1]                  [,2]
[1,] " pages = {598-614}," " pages = {598--614},"
[2,] " pages = {883-901}," " pages = {883--901},"
[3,] " pages = {360-364}," " pages = {360--364},"
[4,] " pages = {288-318}," " pages = {288--318},"
[5,] " pages = {3-27},"    " pages = {3--27},"
[6,] " pages = {567-589}," " pages = {567--589},"
[7,] " pages = {259-290}," " pages = {259--290},"
[8,] " pages = {270-304}," " pages = {270--304},"
Main points that I take away from this:
- R has powerful string manipulation tools; they're worth learning if you use R.
- R has a habit of introducing users to powerful tools hidden from the typical Windows setup.
- R, LaTeX, BibTeX, Sweave, and regular expressions are all text-driven systems, in contrast to largely menu-driven systems such as SPSS, MS Word, and EndNote. Their textual nature facilitates their mutual co-operation.
- Running checks on replacement operations in regular expressions is important.
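As one way of making the last point concrete, the check could be wrapped in a small function. This is only a sketch: the name checkReplacement, its arguments, and the toy vectors are hypothetical, and it assumes the replacement never adds or removes lines.

```r
# Hypothetical helper: summarise what a replacement step changed.
checkReplacement <- function(original, replaced) {
  # The comparison below only makes sense line by line
  stopifnot(length(original) == length(replaced))
  changed <- original != replaced
  list(
    # How many lines were altered
    nChanged = sum(changed),
    # Were all altered lines "pages = ..." lines?
    onlyPagesChanged = all(grepl("^[[:space:]]*pages[[:space:]]*=",
                                 original[changed])),
    # Does undoing the first "--" give back the original lines?
    restoresOriginal = identical(
      sub("--", "-", replaced[changed], fixed = TRUE),
      original[changed])
  )
}

# Toy vectors for illustration only
toyOriginal <- c("  pages = {90-138},", "  year = {1987},")
toyReplaced <- c("  pages = {90--138},", "  year = {1987},")
toyCheck <- checkReplacement(toyOriginal, toyReplaced)
print(toyCheck)
```

A summary like this complements the side-by-side cbind() view: the table is good for eyeballing a few rows, while the summary scales to a database with many references.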