Great recent song by The Fratellis on their 2006 album Costello Music, "Ole Black 'N' Blue Eyes." The video seems not to have any connection to any plausible interpretation of the lyrics for me; I'd suggest ignoring it and just listening: (click here)
The first thing that jumped out at me was how closely the progression of the song mirrored the Pachelbel Canon in D's ground bass (though transposed to B-flat):
(Not sure if the right hand part matches the actual guitar parts used in the piece, but they're close enough for illustrative purposes).
But better than seeing similarities, I thought a little remix of the opening minute or so would show just how compatible the two pieces were. Enjoy.
29 December 2008
Pachelbel lives on.
Labels:
analysis,
music,
popular music
05 October 2008
Overwrite IE Favorites with Firefox Bookmarks
I love Firefox, but on Windows IE has one big advantage over it: it's made by the same people who made the OS. This lets IE access its favorites from all sorts of places, not just from the browser. I'm a task-oriented person rather than application-oriented: I tend to think "Gotta pay the bills now" not "Gotta open up the browser and then do things that require a browser, such as bill pay." IE 4.0+ (and Windows98) was oriented towards people like me, by putting the Favorites option in the start menu, every folder, etc. It never really caught on, and in fact the Favorites option was quietly removed from the start menu in either XP or Vista, though it can be re-enabled from the taskbar options.
There are a couple of ways to get your Firefox bookmarks in the Favorites. There are lots of bookmark synchronization programs, though they tend to need to be run manually--which is messy for me. There's also the PlainOlFavorites plug-in for Firefox which lets you just use IE favorites instead of Bookmarks. I loved this plug in for a few years, but I like FF's native Bookmark system enough that I wanted to make my own synchronization system.
Here's a little Python script I came up with. It copies YESTERDAY's Firefox backup over your Favorites (copying the current version would just have been too messy). It requires Python 2.6 or you'll need to install the "simplejson" module and replace all references to "json" with "simplejson" in the code. Do not run this if you use IE favorites: it will DELETE THEM ALL and overwrite them with Firefox favorites. You can comment out the "nukeFavorites()" line if you just want to add to your existing IE favorites. I plop this script into my Startup folder and it works great!
# bookmarks_json.py -- Michael Scott Cuthbert
import json
import copy
import sys
import os
import shutil
class BookmarkMaker(object):
    startdir = os.environ["USERPROFILE"] + r'\Favorites'
    internetShortcut = "[InternetShortcut]\nURL="
    folderpath = os.environ["APPDATA"] + "\\Mozilla\\Firefox\\Profiles\\"
    backuppath = r'\bookmarkbackups'
    def nukeFavorites(self):
        '''deletes everything in the Favorites dir -- do not use this if you update Favorites!'''
        nukeem = os.listdir(self.startdir)
        for thisfile in nukeem:
            if thisfile != "desktop.ini":
                thisFullFile = self.startdir + os.sep + thisfile
                try:
                    os.remove(thisFullFile)
                except OSError:
                    # not a plain file, so treat it as a folder
                    shutil.rmtree(thisFullFile)
    def run(self):
        stringFile = self.openRightFile()
        self.parse(stringFile)
    def openRightFile(self):
        '''returns the contents of the newest bookmarks-*.json backup'''
        backupPath = self.backupPath()
        theseFiles = os.listdir(backupPath)
        theseFiles.sort(reverse=True)
        for thisFile in theseFiles:
            if thisFile.startswith('bookmarks-') and thisFile.endswith('.json'):
                return self.openJson(backupPath + os.sep + thisFile)
    def backupPath(self):
        profile = self.mostRecentProfile(self.folderpath)
        return profile + self.backuppath
    def mostRecentProfile(self, profilesPath):
        '''returns the most recently modified Profile in the profilesPath'''
        newestName = ""
        newestTime = 0
        allDirs = os.listdir(profilesPath)
        for thisDir in allDirs:
            thisFullPath = profilesPath + os.sep + thisDir
            try:
                thisModTime = os.path.getmtime(thisFullPath + os.sep + "places.sqlite")
                if thisModTime > newestTime:
                    newestTime = thisModTime
                    newestName = thisFullPath
            except OSError:
                # no places.sqlite here, so not a usable profile
                pass
        return newestName
    def openJson(self, jsonFile):
        try:
            myfile = open(jsonFile)
        except IOError:
            raise Exception("Could not open file " + jsonFile)
        try:
            return myfile.read()
        finally:
            myfile.close()
    def parse(self, stringData):
        self.jsonStuff = json.loads(stringData)
        self.realBase = self.findRealBase(self.jsonStuff)
        self.parseReally(self.realBase)
    def findRealBase(self, searchIn):
        return searchIn["children"][0]["children"]
    def parseReally(self, startingPoint, currentDir = []):
        for thing in startingPoint:
            if "children" in thing:  ## Bookmark folder
                tempdir = copy.copy(currentDir)
                tempdir.append(thing["title"])
                self.parseReally(thing["children"], tempdir)
            elif "uri" in thing:  ## Bookmark
                self.createFavorite(thing["uri"], os.sep.join(currentDir) + os.sep + thing["title"] + ".url")
    def createFavorite(self, uri, filename):
        shortcutOut = self.internetShortcut + uri
        if not(filename.startswith(os.sep)):
            filename = os.sep + filename
        filename = self.startdir + filename
        try:
            self.recursiveMakeDir(filename)
            with open(filename, "w") as filehandle:
                filehandle.write(shortcutOut)
        except Exception:
            # skip any bookmark that can't be written (bad characters in the title, etc.)
            pass
    def recursiveMakeDir(self, filename):
        if filename.count(os.sep) > 12:
            raise Exception("Whoa, you sure you want to go so deep and make " + filename + "?")
        parentDir = os.path.dirname(filename)
        if os.path.exists(parentDir):
            return True
        else:
            self.recursiveMakeDir(parentDir)
            os.mkdir(parentDir)
if __name__ == "__main__":
    bmm = BookmarkMaker()
    bmm.nukeFavorites()
    bmm.run()
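If you'd like to see what the recursive walk produces before letting it loose on your real Favorites, here's a minimal sketch (Python 3) that runs the same idea against a hand-made dictionary. The sample data and the name flatten_bookmarks are invented for illustration; the only thing borrowed from the script is the assumption that folders carry a "children" list and bookmarks carry "uri" and "title" keys.

```python
import os

def flatten_bookmarks(nodes, current_dir=()):
    """Yield (relative_path, uri) pairs for every bookmark under nodes."""
    for node in nodes:
        if "children" in node:
            # Bookmark folder: recurse with the folder title added to the path.
            yield from flatten_bookmarks(node["children"], current_dir + (node["title"],))
        elif "uri" in node:
            # Bookmark: report the .url file it would become.
            yield os.path.join(*current_dir, node["title"] + ".url"), node["uri"]

# Hypothetical sample data, shaped the way the script assumes.
SAMPLE = [
    {"title": "Bills", "children": [
        {"title": "Electric", "uri": "https://example.com/electric"},
    ]},
    {"title": "News", "uri": "https://example.com/news"},
]

if __name__ == "__main__":
    for path, uri in flatten_bookmarks(SAMPLE):
        print(path, "->", uri)
```

Nothing here touches the disk, so it's a safe way to check the folder layout first.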
20 September 2008
New mail script
I used to use this little program called "nfrm" (or frm -n) to tell me on log in to a unix system if I had new mail. Unfortunately what worked in the past no longer was keeping accurate lists in the world where I was also checking my mail via IMAP and the iPhone, where there's a lot of mail I've been informed (in some way) about, but have never actually read.
Here's a quick Perl script that I wrote which checks three folders (my inbox, a work-related box, and mailing lists) for new messages. It requires Mail::MboxParser (install via "cpan" then "install Mail::MboxParser" or have your sysadmin do it for you):
#!/usr/bin/perl
use Mail::MboxParser;
use Term::ReadKey;
use strict;
my ($wchar, $hchar, $wpixels, $hpixels) = GetTerminalSize();
my $available_space = $wchar - 13;
my $from_length = int($available_space * .33);
my $subject_length = int($available_space * .67);
# mailboxes named on the command line are labeled with their own path
my @mailboxes = map { [$_, $_] } @ARGV;
if (scalar(@mailboxes) == 0) {
    @mailboxes = (["**** ", $ENV{'MAIL'}],
                  ["NBER ", "/homes/nber/cuthbert/mail/nber"],
                  ["Lists", "/homes/nber/cuthbert/mail/lists"]);
}
my $parseropts = {
    enable_cache => 1,
    enable_grep => 1,
    cache_file_name => '/tmp/msc-mbox-cache-file',
};
foreach my $box (@mailboxes) {
    next unless -e $box->[1];
    my $mb = Mail::MboxParser->new($box->[1],
                                   decode => 'ALL',
                                   parseropts => $parseropts);
    while (my $msg = $mb->next_message) {
        # anything without an R in its Status header has not been read
        if ($msg->header->{status} !~ /R/) {
            my $name = $msg->from->{name};
            if (!$name) { $name = $msg->from->{email} }
            $name =~ s/\"//g;
            $name =~ s/^\s+//;
            $name = substr($name, 0, $from_length);
            my $subject = $msg->header->{subject};
            $subject = substr($subject, 0, $subject_length);
            printf("%-8s %-${from_length}s %-${subject_length}s\n", $box->[0], $name, $subject);
        }
    }
}
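For anyone without Mail::MboxParser handy, the same check can be roughed out in Python using only the standard library. This is a sketch under assumptions, not a port: the function name unread_summaries is made up, it skips the terminal-width formatting, and it treats any message without an "R" in its Status header as new, which is the same test the Perl script makes.

```python
import mailbox
from email.utils import parseaddr

def unread_summaries(mbox_path, label=""):
    """Return (label, sender, subject) tuples for messages not marked as read."""
    rows = []
    # create=False: raise an error rather than silently create a missing mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # An "R" in the Status header means the message has been read.
            if "R" in msg.get("Status", ""):
                continue
            name, address = parseaddr(msg.get("From", ""))
            # Fall back to the bare address when there is no display name.
            rows.append((label, name or address, msg.get("Subject", "")))
    finally:
        box.close()
    return rows
```

Calling unread_summaries once per mailbox path and printing the rows gives roughly the same listing as the Perl version.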
14 September 2008
Sexist ol' folks! (maybe not so much)
I was reading Matt Bai's most recent post on the presidential campaign; not as good as his usual missives, but still basically entertaining (and informative: I didn't realize he was white; I guess I assumed Bai was an Asian name). But I focused mainly on a graphic that the New York Times placed in a sidebar showing the percentage of Americans who believe that a woman will be elected president during their lifetimes. Another statistical graphic without any exegesis; when will the Times learn? Based on the main gist of the article, I suppose we are to conclude that younger people are more open to the idea of a female president, and, therefore, more likely to conclude that a woman will be elected. Of the under-45 voters who expressed an opinion (95%), 17% believed that a woman would not be elected in their lifetimes. On the other hand, of the over-65 voters with an opinion (83%), 47% believed that no woman would be elected while they lived.
But the young have something more important than open-mindedness that accounts for the statistical difference: life expectancy. Simply put, the under 45 voting crowd will witness many more elections, and, therefore many more chances of electing a female president. Let's plot this out. Given a probability X of a woman being elected in any election, the probability that a woman would NOT be elected after N elections (or 4N years) is:
(1-X)^N
The following chart gives the probability that a woman would not be elected given various values for X (left column) and N (top row):
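A table of this kind takes only a few lines of Python to generate. Here's a minimal sketch; the particular values of X and N in the grid are my own illustrative choices, and the function names are made up for the example.

```python
def prob_no_woman(x, n):
    """Chance that no woman wins any of n elections, each won with chance x."""
    return (1 - x) ** n

def build_table(xs, ns):
    """Map each per-election chance x to its row of (1 - x)^n values."""
    return {x: [prob_no_woman(x, n) for n in ns] for x in xs}

if __name__ == "__main__":
    # Assumed grid: X down the left, N (number of elections) across the top.
    xs = [0.05, 0.10, 0.15, 0.20, 0.25]
    ns = [1, 2, 3.5, 5, 10]
    table = build_table(xs, ns)
    print("X \\ N " + "".join("{:>7}".format(n) for n in ns))
    for x in xs:
        print("{:>5.0%} ".format(x) + "".join("{:>7.2f}".format(p) for p in table[x]))
```

Fractional values of N (like 3.5 elections) work fine, since the formula is just an exponent.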
Let's say that an "Under 45 voter" is 35 (a mean of 18-45 weighted toward the fact that younger people vote in lower percentages) and let's call an "Over 65 voter" 70 years old. It's obvious that a woman will not be elected president this year, so we'll say that the 70 year olds have, on average, 3.5 more elections in their lifetime where a woman could be elected (I'm giving them more longevity than you'd expect since there is some probability that McCain will be elected and die in his first term, giving a Mrs. President Palin. Though this poll was taken in June, after Hillary had dropped out but before McCain's choice of a female VP seemed likely). The 35-year-old voter has, on average, 10 more elections (I'm giving them a few fewer years on average to live because a 35 year old hasn't proved they won't die before 70). And we'll define "likely to become president" the standard way: over 50% chance. That is, if the voter thinking that it was likely had to bet with even odds, he or she would bet for a woman president being elected.
What can we extrapolate from the chart and formula? If a (rational) 35-year-old voter said that he thought it likely for a woman to be elected, he would need to believe that a woman has over a 7% chance of being elected in any given election, while a 70-year-old voter needs to believe that there's an 18% chance of a woman being elected.
A quick-and-dirty way of extrapolating from the results of the poll to the table is to say if 25% of people believe X is likely, we should look at the point in the table where X is 25% likely to happen. (There are better ways depending on what distribution of beliefs you think voters have, but this should work well enough for now). So since 47% of over-65 voters don't believe a woman will be elected in the next 3.5 elections, we can suppose that they believe women have a 20% chance of being elected any given election.
What about the younger crowds? 17% of the under-45 crowd believe that after the next 10 elections we still won't have a woman president. That works out to belief in a 16% chance of a woman being elected president any given election. So if anything, the young are less optimistic of women's chances of breaking that ultimate glass ceiling than older voters are.
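The per-election figures in the last few paragraphs come from running the formula backwards: if (1-X)^N = p, then X = 1 - p^(1/N). A quick sketch for checking the arithmetic (the function name implied_chance is mine):

```python
def implied_chance(p_none, n):
    """Per-election chance x that solves (1 - x) ** n == p_none."""
    return 1 - p_none ** (1.0 / n)

if __name__ == "__main__":
    cases = [
        ("35-year-old, break-even belief", 0.50, 10),
        ("70-year-old, break-even belief", 0.50, 3.5),
        ("over-65 poll answer", 0.47, 3.5),
        ("under-45 poll answer", 0.17, 10),
    ]
    for label, p_none, n in cases:
        print("{:<32} {:.1%}".format(label, implied_chance(p_none, n)))
```

The results land close to the rounded numbers quoted above: a bit under 7% and about 18% for the break-even beliefs, and roughly 19% and 16% for the two poll answers.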
What CBS (the poll takers) should have done is remove a variable by asking, "Do you think it is likely that a woman will be elected president in the next 30 years?" so that all their respondents were thinking along the same time frame. The women vs. men answer (an apparent +5% optimism gap toward women) is similarly skewed by age and probably underestimates women's optimism about the topic. Since women live longer, they probably disproportionately make up the (more optimistic) over-65 audience. In fact, we may be able to explain the entire age difference in this response by the gender differences and women's propensity to live longer.
(I realize now that I have used the word "optimism" throughout this article, assuming, dear readers, that like me you'd view the possibility of a female president as a good thing. If for some reason you disagree, please comment with your reasons. Oh and the graphic's title should have been Madam President. No one said that the female president needed to be married.)
Labels:
graphics,
politics,
statistics
06 August 2008
Adjusted ERA+ and leaders
The most important stat for determining the quality of a pitcher (at least over the course of a career) is ERA. There's really no argument that can be made against it. Wins are far too dependent on how the rest of the team (i.e., batters) do; strikeouts certainly show domination, and career strikeout numbers show longevity as well. But in terms of doing his job, preventing runs, career ERA is the single number that shows how well this was accomplished. (Okay, perhaps ERA over a 5-10 year peak period might be better, to not diminish the careers of those who choose to keep playing into their "worse than average but far better than replacement" years). The main problems with ERA are that it does not adjust for ballparks, leagues, mound heights, steroid usage of batters, or other factors that changed over the years.
Fortunately, the concept of adjusted ERA takes care of all this. And adjusted ERA+ makes reading adjusted ERA even easier. An aERA+ of 125 roughly means an ERA 25% lower than the league, adjusting for park effects etc. (okay, actually it's 20% lower; since an aERA+ of 200 means an ERA 50% of that of the league). An aERA+ of 125 also means you're likely to need to prepare a speech for Cooperstown. The vast majority of pitchers above 125 and almost all eligible starters above 130 are in the Hall.
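To make the "25% vs. 20%" point concrete, here's a small sketch of the conversion. It assumes the usual definition, aERA+ = 100 x (league ERA / pitcher's ERA) after park adjustment, which is what the parenthetical above implies; the function name is made up.

```python
def percent_below_league(era_plus):
    """How far below league ERA (in percent) a given aERA+ puts a pitcher."""
    # aERA+ = 100 * league_era / era, so era / league_era = 100 / aERA+.
    return (1 - 100.0 / era_plus) * 100

if __name__ == "__main__":
    for era_plus in (100, 125, 200):
        print("aERA+ {:>3}: ERA {:.0f}% below league".format(
            era_plus, percent_below_league(era_plus)))
```

So 125 works out to 20% below league, and 200 to 50% below.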
For a long time, there was a big gap between aERA+ #1 and #2, with Pedro Martinez's 160 towering over Lefty Grove's 148. This disappointing season has lowered Pedro to a mere 157, but it would take more than a few years of (relative) mediocrity to drop him below Grove or Walter Johnson. (Next on the list, for those curious, is Dan Quisenberry which is probably the best Hall of Fame case for Quisenberry that could ever be made).
But why bring all this up now? Because despite Martinez's struggles, there's now (as of last week) a much bigger gap between #1 and #2:
1. Mariano Rivera 197
2. Pedro Martinez 157
What happened? Rivera pitched his 1000th inning last week, making him eligible for career ERA awards. Perhaps the IP requirement should be raised so that the list isn't dominated by closers. But maybe that's not necessary. With the development of the minor-league (or even high school) relief specialist, fewer and fewer will reach this plateau with each passing year. Bochy and Bud Black's cautious use of Trevor Hoffman since he returned from surgery six years ago has kept him from breaking the 1000 IP mark. As of this evening, he's 22 IP short. Since he's been averaging .35 IP per game over the past four seasons, he's likely to end the season (and possibly his career? the fan in me hopes not) about five innings short. His aERA+ of 145 would place him 9th or 10th all time. Prior to this disastrous season, he would have been third.
(For the complete list, see Baseball-Reference.com)
The only other reliever on the ERA title horizon is Billy Wagner, whose aERA+ of 180 would place him comfortably in second, though still at a great distance below Rivera. However with only 819 IP, he will likely need three more seasons to qualify. And anything could happen by then. Troy Percival, with an aERA+ of 151 but only 687 IP, averaging fewer than 50IP/year at age 38, seems highly unlikely to qualify.
Of the other active high saves players, Roberto Hernandez already qualifies at 131; a Hall of Fame number for a starter, but unlikely to get a notice for a reliever with "only" 326 saves. Jose Mesa is what we always thought him to be, exactly an average pitcher (aERA+ of 100), and thus probably a below average reliever. Todd Jones is a little better, but nothing to write to Cooperstown about. Ditto Jason Isringhausen, 100IP short; though kudos to him for having such dreadful recent performances that his manager is bringing back the (great) idea of a closer by committee.
And as for the brilliant young closers of today: yes, very exciting. Yes, I know Papelbon has an aERA+ of 269(!); K-Rod at 186; Nathan at 152; Lidge at 147. But for an article on a stat that requires 1000 IP, please ring again when there's just a few hundred innings left to go. Why so pessimistic? Robb Nen, John Wetteland, Randy Myers, Tom Henke, Jeff Montgomery, Rod Beck, Ugueth Urbina, and (ah heck, throw in the newest member of the fraternity) Eric Gagne. If these names mean something to you, you'll understand.
Labels:
baseball,
Mariano Rivera,
save,
statistics,
Trevor Hoffman
04 August 2008
Hate the stat, love the statter
I spent part of the last week of July reading the various, mostly extremely short, obits of Jerome Holtzman, sportswriter extraordinaire (and the Chicago Cubs follower who did not incite the city of Sandy Eggo to riot by calling the selling of sushi and the installation of baby-changing stations in men's rooms at S.D.'s ballpark "the beginning of the decline of America." q.v., one M. Royko). Throughout all the obits, the discussion centered on Holtzman's invention of the "save" as a baseball statistic.
(N.Y. Times obit.)
What puzzled me the most is, with the notable exception of The Hardball Times, how often commentators were blasting Holtzman for what a bad statistic the save is. To be sure, it is a bad statistic: it rewards (in the end, financially) players for participating in one facet of the game that has never been shown to be more significant than several others. It gives the same benefit to pitchers who bail their team out of the toughest of all situations as to closers who record a few outs with a pretty sizable lead. But consider again what the save replaced: a world where relief pitching had no worth on paper (and in a world where the "Win", another terrible stat, was even more important). Also consider what it meant in the early 60s to introduce a new stat into the world. Computing saves retroactively for every baseball player was not as simple as a few lines of Perl and a download from retrosheet. And, the tinkering that Holtzman and others applied to the save suggests that he did not believe he had discovered the perfect, never to be supplanted statistic. Admire what he did for the time in which he lived.
No, marshal your scorn for those who with the hindsight of time and better access to information still defend the save. Those who cling to a bad idea are far worse than those who throw the idea out to the marketplace in the first place.
P.S. I wish there were a stat with more sophistication than Ari Kaplan's "Fan Save Value", but with fewer lookup charts than his more accurate "Save Value." The former is not much of an improvement over the Save (and couldn't easily translate to a generalized "relief value") while the latter has too many "hidden" constants relating to expected runs (that aren't really constant, but actually change from year to year) that I can't see it actually being adopted.
Here's a simple(r) formula for calculating expected runs, that you can carry around with you:
ER = (5 + total_runners + 3 * (total bases occupied)) * outs_left / 30
total bases occupied simply means to sum up the base numbers with runners. So runners on second and third is 2+3 = 5. Outs left is 3 - outs. So no outs = 3.
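If you'd rather let a computer carry the formula around for you, here it is as a few lines of Python. The function and argument names are mine; each base argument is 1 if a runner is there and 0 if not.

```python
def expected_runs(outs, first, second, third):
    """Approximate expected runs for the rest of the inning.

    outs is 0, 1 or 2; first/second/third are 1 if occupied, else 0.
    """
    total_runners = first + second + third
    # "Total bases occupied": sum of the base numbers that have runners.
    total_bases = 1 * first + 2 * second + 3 * third
    outs_left = 3 - outs
    return (5 + total_runners + 3 * total_bases) * outs_left / 30.0

if __name__ == "__main__":
    # Print every outs/bases combination, in the order outs, 1B, 2B, 3B.
    for outs in (0, 1, 2):
        for first in (0, 1):
            for second in (0, 1):
                for third in (0, 1):
                    print(outs, first, second, third,
                          "{:.2f}".format(expected_runs(outs, first, second, third)))
```

For example, no outs with nobody on gives 0.50, and no outs with the bases loaded gives 2.60.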
Here's how this version of expected runs compares to the standard expected runs chart:
Outs  1B  2B  3B  My ERs  Table  Difference
0     0   0   0     0.50   0.54       -0.04
0     0   0   1     1.50   1.46        0.04
0     0   1   0     1.20   1.17        0.03
0     0   1   1     2.20   2.14        0.06
0     1   0   0     0.90   0.93       -0.03
0     1   0   1     1.90   1.86        0.04
0     1   1   0     1.60   1.49        0.11
0     1   1   1     2.60   2.27        0.33
1     0   0   0     0.33   0.29        0.04
1     0   0   1     1.00   0.98        0.02
1     0   1   0     0.80   0.71        0.09
1     0   1   1     1.47   1.47        0.00
1     1   0   0     0.60   0.55        0.05
1     1   0   1     1.27   1.24        0.03
1     1   1   0     1.07   0.97        0.10
1     1   1   1     1.73   1.60        0.13
2     0   0   0     0.17   0.11        0.06
2     0   0   1     0.50   0.38        0.12
2     0   1   0     0.40   0.34        0.06
2     0   1   1     0.73   0.63        0.10
2     1   0   0     0.30   0.25        0.05
2     1   0   1     0.63   0.54        0.09
2     1   1   0     0.53   0.46        0.07
2     1   1   1     0.87   0.82        0.05
As you can see, the formula slightly over-predicts expected runs (though with the important exception of the most-common occurrence, no outs, no one on, which almost balances out the rest of the error). The only case where it's over 13 hundredths of a run off is the rare case of no outs, bases loaded, which it over-estimates by 1/3 of a run. If a formula is to overestimate any situation, I'm happy with it being this rare situation: an "oh shit!" moment for any incoming reliever. In any case, it's an easier formula to remember than 24 "random" numbers.
Labels:
baseball,
save,
statistics
30 May 2008
Are bullpens underused?
This post was mostly written during Spring Training, so 2007 figures are used throughout. Life prevented posting until now.
Back in the "good old days" of baseball, bullpens were nearly non-existent. Two-man rotations were common, 600 inning-pitched seasons were possible, and one man pitched every inning of an entire major league season (Wondering who? See below). Since then, we've created four-man rotations, dedicated bullpens, closers, five-man rotations, long relievers, setup men, and, coming soon to a ballpark near you, the seventh-inning specialist. And throughout all this, grumpy old men--along with grumpy young men, grumpy old women, grumpy young women, and grumpy transsexuals of all ages--have decried the changes as a weakening of the quality of starting pitching.
But is it possible that everyone is wrong? Could the problems with pitching be traced to an underuse of those 6 to 8 "guns" not considered durable enough to start a game? Let's consider some baseball axioms before we look further. I like math, so I'll use some weird symbols, but try to explain them.
Axiom 1: ∂(ERA)/∂(IP) > 0
That is just to say that we expect ERA to rise for every additional inning pitched. I'll admit that this axiom might not hold for a former AA-pitcher getting his first few innings in at the major league level, or for someone coming back just off the disabled list. But for the most part, it's hard to disagree with, whether we're dealing with innings pitched within a game or innings pitched over a season.
Axiom 2: ERA(closer) < ERA(ace) ; ERA(setup man) < ERA(#2 starter) ; etc...
The ERA of your relief squad tends to be lower than that of the starting pitchers. Or to put it another way, if you only had to put your best pitcher in for one inning of work, he would almost certainly come from the bullpen. Compare the best reliever to the best starter on almost every club and you'll see that the best reliever comes out ahead. Usually even the best two or three relievers have better ERAs than aces--this holds true for great teams and miserable ones. In 2007, the Detroit Tigers had three relievers log more than 40 innings with a lower ERA than their ace, Justin Verlander, and two more with a lower ERA than their second best pitcher to log at least 15 starts. The Padres (the team I know best): Even a triple-crown winning Cy Young pitcher (Peavy) had an ERA bested by a setup man, Heath Bell (with 93 2/3 IP, not a small sample); and four more relief pitchers (Hoffman, Brocail, K. Cameron, and Justin Hampson) bested Chris Young, their second best pitcher. We see the same patterns even for lousy teams: four Royals relievers outpitched Gil Meche's 3.67 ERA.
Compare the ERAs of bullpens as of mid-2007 to this analysis of starters at the end of the season. The average bullpen is about as good as the average no. 2 starter! In other words, it does not really matter if by the sixth inning your starter is still "feeling good." Unless he's your ace, or throwing a shutout or no hitter, or your bullpen is absolutely drained from a recent 18-inning game, it's time to call in some new arms. Do so and your expected chance of winning just went up.
All else being equal, every inning that you have someone on the mound with a higher ERA than someone else you could put out there is an inning of poor managing. All else being equal, bullpens should be used more until their ERAs rise to meet that of the starters.
Now here's the argument I'm ready to hear: all else is not equal. Not every inning is as important as every other. That's definitely true! The best relievers pitch in the most important situations: with the game close, and a win on the line if only he can avoid allowing any runs. So it makes sense that you want some of your lowest ERA-men pitching then. But when do starters begin pitching? They begin with the game tied, where any run allowed or not makes a huge difference in the probability of winning or losing the game -- nearly as important a situation as what setup men and closers pitch in. But there are several members of the bullpen who tend to pitch in less important innings; that is, blowouts in either direction. So if anything, the ERAs of bullpens should be substantially higher than those of starters. That they are not shows that the bullpens are being over-rested and underused.
Trivia answer: Jim Devlin of the 1877 Louisville Grays pitched every inning in a 61 game season, compiling 559 innings pitched and allowing a total of 4 HRs. Though his ERA was a very good 2.25 (146 ERA+), it says something about the way official scorers have changed over the years: though he allowed only 140 earned runs, he allowed a total of 288 runs, or 8 more unearned runs than earned.
(Before I'm accused of copying this trivia information from Wikipedia, be sure to check who added it there in the first place).
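For anyone who wants to check the Devlin numbers, the arithmetic is short enough to sketch in Python. This uses the standard nine-inning ERA formula; the function names are mine.

```python
def era(earned_runs, innings_pitched):
    """Earned run average: earned runs allowed per nine innings."""
    return 9.0 * earned_runs / innings_pitched

def unearned_runs(total_runs, earned_runs):
    """Runs allowed that were not charged as earned."""
    return total_runs - earned_runs

if __name__ == "__main__":
    earned, total, innings = 140, 288, 559
    unearned = unearned_runs(total, earned)
    print("ERA: {:.2f}".format(era(earned, innings)))
    print("Unearned runs: {} ({} more than earned)".format(unearned, unearned - earned))
```

That gives an ERA of 2.25 and 148 unearned runs, 8 more than the 140 earned.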
Back in the "good old days" of baseball, bullpens were nearly non-existent. Two-man rotations were common, 600-inning seasons were possible, and one man pitched every inning of an entire major league season (Wondering who? See below). Since then, we've created four-man rotations, dedicated bullpens, closers, five-man rotations, long relievers, setup men, and, coming soon to a ballpark near you, the seventh-inning specialist. And throughout all this, grumpy old men--along with grumpy young men, grumpy old women, grumpy young women, and grumpy transsexuals of all ages--have decried the changes as a weakening of the quality of starting pitching.
But is it possible that everyone is wrong? Could the problems with pitching be traced to an underuse of those 6 to 8 "guns" not considered durable enough to start a game? Let's consider some baseball axioms before we look further. I like math, so I'll use some weird symbols, but I'll try to explain them.
Axiom 1: ∂(ERA)/∂(IP) > 0
That is just to say that we expect ERA to rise for every additional inning pitched. I'll admit that this axiom might not hold for a former AA-pitcher getting his first few innings in at the major league level, or for someone coming back just off the disabled list. But for the most part, it's hard to disagree with, whether we're dealing with innings pitched within a game or innings pitched over a season.
Axiom 2: ERA(closer) < ERA(ace) ; ERA(setup man) < ERA(#2 starter) ; etc...
The ERA of your relief squad tends to be lower than that of your starting pitchers. Or to put it another way, if you only had to put your best pitcher in for one inning of work, he would almost certainly come from the bullpen. Compare the best reliever to the best starter on almost every club and you'll see that the best reliever comes out ahead. Usually even the best two or three relievers have better ERAs than aces--this holds true for great teams and miserable ones. In 2007, the Detroit Tigers had three relievers log more than 40 innings with a lower ERA than their ace, Justin Verlander, and two more with a lower ERA than their second best pitcher to log at least 15 starts. Take the Padres (the team I know best): even a triple-crown-winning Cy Young pitcher (Peavy) had his ERA bested by a setup man, Heath Bell (with 93 2/3 IP, not a small sample), and four more relief pitchers (Hoffman, Brocail, K. Cameron, and Justin Hampson) bested Chris Young, their second best pitcher. We see the same pattern even for lousy teams: four Royals relievers outpitched Gil Meche's 3.67 ERA.
Compare the ERAs of bullpens as of mid-2007 to this analysis of starters at the end of the season. The average bullpen is about as good as the average no. 2 starter! In other words, it does not really matter if by the sixth inning your starter is still "feeling good." Unless he's your ace, or throwing a shutout or no-hitter, or your bullpen is absolutely drained from a recent 18-inning game, it's time to call in some new arms. Do so and your expected chance of winning just went up.
All else being equal, every inning that you have someone on the mound with a higher ERA than someone else you could put out there is an inning of poor managing. All else being equal, bullpens should be used more until their ERAs rise to meet that of the starters.
Now here's the argument I'm ready to hear: all else is not equal. Not every inning is as important as every other. That's definitely true! The best relievers pitch in the most important situations: with the game close, and a win on the line if only they can keep the other team from scoring. So it makes sense that you want some of your lowest-ERA men pitching then. But when do starters begin pitching? They begin with the game tied, where any run allowed or prevented makes a huge difference in the probability of winning or losing the game -- nearly as important a situation as the ones setup men and closers pitch in. Meanwhile, several members of the bullpen tend to pitch in less important innings; that is, in blowouts in either direction. So if anything, the ERAs of bullpens should be substantially higher than those of starters. That they are not shows that bullpens are being over-rested and underused.
Trivia answer: Jim Devlin of the 1877 Louisville Grays pitched every inning in a 61-game season, compiling 559 innings pitched and allowing a total of 4 HRs. His ERA was a very good 2.25 (146 ERA+), but his line says something about the way official scoring has changed over the years: though he allowed only 140 earned runs, he allowed a total of 288 runs, or 8 more unearned runs than earned.
(Before I'm accused of copying this trivia information from Wikipedia, be sure to check who added it there in the first place).
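For anyone who wants to check the trivia arithmetic, here is a minimal sketch. It assumes only the standard ERA formula (9 times earned runs, divided by innings pitched) and the three totals quoted above; the variable names are mine.

```python
# Devlin's 1877 totals as quoted in the trivia answer above.
earned_runs = 140
total_runs = 288
innings_pitched = 559

# Standard ERA formula: earned runs allowed per nine innings.
era = 9 * earned_runs / innings_pitched

# Everything that scored but was not charged as earned.
unearned_runs = total_runs - earned_runs

print("ERA:", round(era, 2))
print("Unearned runs:", unearned_runs)
print("Unearned minus earned:", unearned_runs - earned_runs)
```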
Labels:
baseball,
bullpens,
pitching,
statistics
Context and baseball statistics
From this week's MLB Power Rankings on ESPN.com:
In May, Hideki Matsui has more multi-hit games (nine) than he has games in which he hasn't gotten a hit (five)
We're supposed to take this to mean that Hideki Matsui is doing great this month. But what does it actually mean? How many hitters would we expect to have more multi-hit games than no-hit games? 1%? 10%? Stats like this out of context drive me crazy. It's pretty easy to figure it out, at least for simple cases.
The probability of not getting a hit in any at bat is (1 - Batting Average), so the probability of going 4 at bats without a hit is (1-BA) to the 4th power. Here's a little program (in Python) for figuring this out:
def noHit(BA):
    # Chance of going hitless in 4 at bats, treating each at bat as independent
    return (1-BA)**4
>>> noHit(.250): 0.32
>>> noHit(.300): 0.24
>>> noHit(.400): 0.13
So we can see that even for an average player, a no-hit game happens only about 1 game in 3, and for a good batter, or a good batter on a real roll, these things happen rather seldom. What about Multi-Hit Games? If you have four at bats per game, there are 11 different ways of getting two or more hits (one way of getting four hits, 4 of 3, and 6 of 2). In the chart below, x = hit, and o = not hit:
xxxx = four hits
oxxx = three hits
xoxx
xxox
xxxo
xxoo = two hits
xoxo
xoox
oxxo
oxox
ooxx
xooo = one hit
oxoo
ooxo
ooox
oooo = no hits
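If you'd rather not trust my hand-counting, the same chart can be generated by brute force. This is just a sketch using the standard library's itertools.product to list all 16 possible four-at-bat games; the variable names (games, ways, multi_hit_ways) are my own.

```python
from collections import Counter
from itertools import product

# Every possible 4-at-bat game: 'x' = hit, 'o' = not hit.
games = ["".join(g) for g in product("xo", repeat=4)]

# Tally how many distinct games produce each hit total.
ways = Counter(g.count("x") for g in games)

for hits in range(4, -1, -1):
    print(hits, "hits:", ways[hits], "ways")

# Games with two or more hits.
multi_hit_ways = sum(n for hits, n in ways.items() if hits >= 2)
print("multi-hit ways:", multi_hit_ways)
```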
(For those interested, the number of ways of getting 4, 3, 2, 1, 0 hits, that is, 1, 4, 6, 4, 1, is the fourth row of Pascal's triangle). So one way of calculating the probability of a multi-hit game is to find the probability of a single-hit game (4*BA*(1-BA)^3), add to it the probability of a no-hit game, and subtract the sum from 1:
def multiHit(BA):
    # 1 minus the chance of a no-hit or one-hit game (uses noHit from above)
    return 1-(noHit(BA) + 4*(BA*(1-BA)**3))
>>> multiHit(.250): 0.26
>>> multiHit(.300): 0.35
>>> multiHit(.400): 0.52
So as long as you're getting 4 ABs per game, you don't really need to be a great hitter to expect to get more multi-hit games than no-hit games. In fact, a BA of just .267 will do it for you. Returning to the original post, we see that Matsui is doing better than 1:1 in May, getting a 9:5 Multi:No ratio. How good do you have to be to get that? A .320 BA will suffice. Matsui's BA has actually been slightly better than that in May, .337, but he's also been getting just under 4 ABs a game (3.83).
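Here is roughly how those break-even averages can be found, in case you want to try other ratios. It's a sketch, not necessarily how I did it at the time: it re-declares the two functions so it runs on its own, and the helper breakEvenBA (a name I'm making up here) uses plain bisection, which works because the multi-hit to no-hit ratio only goes up as BA goes up.

```python
def noHit(BA):
    # Chance of going hitless in 4 at bats.
    return (1 - BA) ** 4

def multiHit(BA):
    # Chance of 2 or more hits in 4 at bats.
    return 1 - (noHit(BA) + 4 * BA * (1 - BA) ** 3)

def breakEvenBA(targetRatio, lo=0.001, hi=0.999, tol=1e-6):
    # Bisection: narrow [lo, hi] until multiHit/noHit hits the target ratio.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if multiHit(mid) / noHit(mid) < targetRatio:
            lo = mid   # ratio too small, need a better hitter
        else:
            hi = mid   # ratio already reached, try a lower BA
    return (lo + hi) / 2

print("1:1 ratio needs a BA of about", round(breakEvenBA(1.0), 3))
print("9:5 ratio needs a BA of about", round(breakEvenBA(9 / 5), 3))
```

The first answer lands just under .267 and the second just above .320, which is where the figures in the paragraph above come from.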
4 ABs a game (or even 3.8) is really only manageable if you're not having many plate appearances that don't count as at bats -- in other words, if you're not walking much. In April, Matsui was averaging only 3.1 ABs per game. This makes it much harder to have so many multi-hit games: if you have 3 ABs per game, you need to be batting above .348 to get more multi-hit games than no-hit games, and a whopping .411 to match Matsui's ratio.
Looking closely at the numbers, there's much less to cheer about for fans of Godzilla: the rise in multi-hit games came almost entirely from a drop in walks (from 12 to 7), resulting in an OBP 40 points lower in May than in April. He did not make up the difference in SLG either, dropping 50 points there. So upon closer inspection, the numbers tell an entirely different story: in everything but BA, Matsui had a May that was worse than his April and below his career levels.
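The 3-AB figures come from the same reasoning with a different row of Pascal's triangle. A sketch that works for any whole number of at bats, under the same simplifying assumption as before (every at bat is an independent trial with hit probability BA). The function names are hypothetical, and math.comb needs Python 3.8 or later.

```python
from math import comb

def hitProb(BA, AB, hits):
    # Binomial probability of exactly `hits` hits in `AB` at bats.
    return comb(AB, hits) * BA ** hits * (1 - BA) ** (AB - hits)

def noHitN(BA, AB):
    # Chance of a hitless game with AB at bats.
    return hitProb(BA, AB, 0)

def multiHitN(BA, AB):
    # Chance of two or more hits in a game with AB at bats.
    return sum(hitProb(BA, AB, k) for k in range(2, AB + 1))

# Multi-hit : no-hit ratio at the two 3-AB averages quoted above.
for BA in (0.348, 0.411):
    ratio = multiHitN(BA, 3) / noHitN(BA, 3)
    print("BA", BA, "with 3 ABs gives a ratio of", round(ratio, 2))
```

With AB set to 4 these functions give the same numbers as noHit and multiHit.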
Labels:
baseball,
matsui,
python,
statistics
01 April 2008
The sort of mixed metaphors I love
"You talk about just inside! That was about a hemidemisemiquaver inside."
-- Jerry Coleman (Hall of Fame Broadcaster, 83 years old)
16 March 2008
Und warum? An Ecoist rationale
Of all the "clocks" that we keep track of—national debt, population of the world, casualties in [favorite unjust war of the week]—the one that spirals out of control the most senselessly must be the number of blogs on the Internet. The "check availability" of domain names for Blogspot.com is probably one of the most clicked links on the web, given that nearly every witty, literary, or common name has now been taken. So why another?
I can only offer weak arguments. The first from necessity: my web page, so well-suited to undergraduate life and adapted adequately for the world of grad school, is a little out-of-date for a professional academic. I'm looking for a place to put ideas that aren't yet ready for full publication, and Wikipedia just didn't seem the right place. I certainly have the HTML skills to create something similar to a blog without the stigma the term entails. But at a certain point, one needs to acknowledge exactly what one is doing and embrace it. That doesn't mean I won't try to obfuscate this blog via perl sometime later...
The second reason comes from a challenge that Umberto Eco (a name sure to appear often in this blog) threw down to American academics over twenty years ago. He wrote:
"I believe that an intellectual should use newspapers the way private diaries and personal letters were once used. At white heat, in the rush of an emotion, stimulated by an event, you write your reflections, hoping that someone will read them and then forget them." (Travels in Hyperreality, p. x)
Perhaps some time in the near future I will be emboldened to take up his gauntlet directly, to actually send my thoughts off to newspapers (thereby destroying rain forests, reaching octogenarians, etc.), but for now, this blog seems an adequate post-modern solution.
And the title: mode and time may govern our life, but prolation gives order to the minima of our daily activities.
Labels:
Eco,
introduction