Graph Databases and Baseball

In my opinion, stating that baseball remains America’s Pastime in 2023 is a tough argument to make, as it often is as the third-most popular sport in the United States. However, it is very obvious that baseball has been and always will be statistics-driven: baseball-reference.com has data all the way back to the 1876 National League season. Two franchises from that season still exists – the Chicago Cubs and Atlanta Braves – but interestingly both used a name subsequently used by their later-arriving American League brethren.

And the data available continues to grow: Retrosheet researchers continue to search for box scores and play-by-plays for older seasons, their latest release including box scores and play-by-play for the 1919 and 1920 seasons. Occasionally this research changes individuals’ stats. The data is available for download, if interested.

Transactions

As a fan of the Oakland (soon-to-be Las Vegas) Athletic, I harass my wife constantly by identifying the former Oakland players on other teams – a bone-crushingly simple exercise in recent decades as stars are traded for prospects instead of retaining them at higher salaries. Very frustrating.

Similarly, there’s the ex-Cubs Factor theory, tying teams’ playoff success (or lack thereof) to the number of ex-Cubs whom played for each team in a playoff series or World Series; fortunately, it did not prevent Oakland in 1973 from winning the World Series (though mitigating factors are claimed).

In recent years the Minnesota Twins seem hesitant to trade promising-but-underperforming young players because of David Ortiz’s success in Boston, keeping Miguel Sano until his struggles were too obvious – and painful to ignore, who is currently out of baseball after being released following the 2022 season.

With this background, I decided to explore player movements between teams over time. Through the 1975 season, baseball’s Reserve Clause made it basically impossible for players to have any say where they played or what they were paid; that said, players could change teams if owners decided to trade or sell players. Free agency was introduced before the 1976 season, dramatically changing the game.

Fortunately – and perhaps unsurprisingly – data about all player transactions is available for analysis and I thought that I could use Neo4J to interrogate the data.

Prerequisites

If you want to do your own analysis, you’ll need:

  • Java 17 or later is required, whatever your favorite OpenJDK flavor may be.
  • Maven 3 to build.
  • A Neo4J database into which the data is loaded preferably unshared as the database is purged before each run. You have been warned!
  • The GitHub repository scsosna99/bbtxn, locally cloned. The repo already has the raw data included.

Raw Data

As mentioned above, Retrosheet has a abundance of raw data available, of which the player and transaction data appeared relevant. Future enhancements may incorporate additional data, such as batting average, earned run average, wins above replacement, etc.

Players

Retrosheet’s biographical database contains information about players, managers, coaches, and umpires including: name, birth and death information, first and last appearance date, bat, throws, height, weight, death and cemetery information. The values are comma-separated and double-quoted only when required.

The biographic zip file is available here.

Transactions

Retrosheet’s transaction database contains the details of all transactions: e.g., transaction date, the player involved, the team(s) involved, the transaction type. The values are comma-separated and every string is double-quoted, whether required or not.

Baseball has approximately sixty transaction types that describe how players join or leave teams, from the mundane – player drafted, traded, sold – to the interesting – jumping teams, refusing to report, purchase voided.

When the team or teams involved are from the three baseball major leagues – e.g. American, Federal, National – the well-known team abbreviation is used: LAD for Los Angeles Dodgers, BRO for Brooklyn Dodgers, etc. For all other leagues, the team and league names are explicitly spelled out. e.g. Providence of the International League.

The transactions zip file is available here.

Note #1: the home page provides links for year-by-year transactions; however the page has not been updated since 2020. The zip file contains all transactions since 2022.

Note #2: some Negro League data is available but is separated from the other major leagues. I have not yet looked into what is available and how it may be used.

Note #3: I only loaded post-1900 transaction data as pre-1900 transaction data is less complete and consistent, and required additional exception handling. And let’s be honest: how many of you even known that the 1899 Cleveland Spiders who lost 134 games, much less who was on their roster?

Teams

Retrosheet does not provide a definitive list of teams, so I created my own.

The file teams.txt is a list of all current major league teams or, for the Federal Leagues, the teams that existed when the league folded. No American or National league team has folded since the American League was founded in 1901.

The file teammap.txt allows us to aggregate transactions for a franchise, e.g. BRO maps to LAD, combining Dodger transactions for both Brooklyn and Los Angeles into the current Los Angeles location.

Neo4J Model

I considered different ways to represent the transactions in a graph database, and that modeling the transactions as relationships between teams and players seemed fairly intuitive.

Therefore, the players and teams are nodes linked together by the transactional relationships.

Transactions Types

Though the many transaction types uniquely describe how players moved between teams, the transactions types can be grouped into three basic types.

To Transaction

A player joins a team after being unsigned by any team, e.g., drafted out of high school or college, or signed as an international player.

This shows that John Samuel Wilson was drafted by the Red Sox; the transaction date (October 1926) is a property on the relationship.

From Transaction

A player leaves or is dissociated with their team after their signed contract is invalidated, e.g., the contract expires, the player is released by the team, or the player retires.

Since a player must be signed to a team before leaving a team, this graph shows that John Cornelius Ray was traded to the Angels before the 1988 season before being released.

FromTo Transaction

A player disassociates from one team and joins another, most often through a trade or sale of the player.

Roger Maris, best known for breaking Babe Ruth’s season home run mark in 1961 was traded twice, though from the graph it’s impossible to determine the specifics. [In fact, he was traded by Cleveland to the Athletics in 1958 and then to the Yankees in 1960.]

Extreme Player Movements

Octavio Dotel played for 13 teams, none more than once.

This is only the second-most, as Edwin Jackson played for 14 teams.

Examples

Team Frequency

For a pair of teams, how many players have been exchanged, either through trade or sale. Most are trades, but scrolling through the results you do see player sales as well.

Cypher Query

MATCH (t1:Team)-[tx1]->(p:Player)-[tx2]->(t2:Team)
WHERE tx1.retrosheetId = tx2.retrosheetId AND t1.retrosheetId <> t2.retrosheetId 
RETURN t1.retrosheetId, t2.retrosheetId, tx1.transactionType, COUNT(p)
ORDER BY COUNT(p) DESC, tx1.transactionType, t1.retrosheetId, t2.retrosheetId

Results

Additional Notes

This query would not be difficult in a relational database or even in Excel, but demonstrates a Neo4J Cypher query using the data model. An easy place to start.

Teammates during a career

With whom did a player play, by team, and how (transaction type) did they join the team.

This query uses retired pitcher Jon Lester as the base player.

Cypher Query

MATCH (bp:Player {firstName: "Jonathan Tyler", lastName: "Lester"})-[tx1]-(bt:Team)
WITH bt.retrosheetId AS teamId,
     bp.playerDebut AS debut,
     MAX(tx1.transactionDate) AS lastTxn
MATCH (ot:Team)<-[tx2]-(op:Player)
WHERE ot.retrosheetId = teamId AND 
      tx2.transactionDate >= debut AND
      tx2.transactionDate <= lastTxn
RETURN teamId,
       op.name,
       tx2.transactionDate,
       tx2.transactionType
ORDER BY teamId,
         op.name,
         tx2.transactionDate

Results

Players Returning To Same Team

Identify players whom left a team and later returned, including details about team and the transaction.

This query only considers players who debuted in the 1960s, which is a property on the Player node.

Cypher Query

MATCH (t1:Team)-[tx1]->(p:Player)-[tx2]->(t2:Team)
WHERE DATETIME(p.playerDebut) > DATETIME('1960-01-01T00:00:00')
  AND DATETIME(p.playerDebut) > DATETIME('1970-01-01T00:00:00') 
  AND t1 = t2 
  AND DATETIME(tx1.transactionDate) < DATETIME(tx2.transactionDate)
RETURN p.name, 
       t1.name, tx1.transactionType, tx1.transactionDate,
       tx2.transactionType, tx2.transactionDate

Results

How To Read

  • Stan Bahnsen became a free agent in October 1980 and resigned with the team in December 1980.
  • Rich Baney was sold by the Orioles in June 1971 but returned to the Orioles in September 1971 for unknown reasons.

Players Traded Back To Original Team by Number of Days

Identify players who were traded between two teams, ordered by the number of days between the transactions

Cypher Query

MATCH (t1:Team)-[tx1:TRADED_FROM]->(p:Player)-[tx2:TRADED_TO]->(t2:Team)-[tx3:TRADED_FROM]->(p:Player)-[tx4:TRADED_TO]->(t1:Team)
WHERE tx1.retrosheetId = tx2.retrosheetId 
  AND tx3.retrosheetId = tx4.retrosheetId 
  AND t1.retrosheetId IS NOT NULL
  AND t2.retrosheetId IS NOT NULL 
  AND tx1.transactionDate < tx3.transactionDate 
RETURN p.name, 
       t1.name AS OriginalTeam, 
       t2.name AS TradedTeam, 
       tx1.transactionDate AS Trade1Date, 
       tx3.transactionDate AS Trade2Date, 
       Duration.inDays(DATETIME(tx1.transactionDate), DATETIME(tx3.transactionDate)).days as DaysBetween 
ORDER BY DaysBetween, 
         p.name, 
         t1.name, 
         t2.name

Results

How To Read

Colin David Rea was traded from the Padres to the Marlins and back to the Padres within 3 days.

Closing Thoughts

These example that scratches the surface of how the transactions can be analyzed.

With additional player information, e.g., their lifetime statistics, more detailed analysis could be achieved, such as use WAR to provide a value to a transaction.

The most awkward aspect of this data model is ordering transactions, as relationships have no concept of sequencing. Sometimes the transaction date is easy to use, but not in all situations.