How does Major League Baseball teams’ spending on salary correlate with wins throughout the season? Teams can’t exactly pay their way to a World Series victory, but it can’t hurt, right? Or can they?  What teams spend more money but still lose? Which teams pay dearly for their wins?

We made a scatterplot graph to see if these questions have a clear cut answer.  Let’s show you how we did it.

 

Find and cleaning the data

Before we can start answering any of these questions, we need to assemble our data. To find the MLB standings for 2016, go to http://mlb.com/mlb/standings/ and choose ‘2016’ in the dropdown menu. Then copy and paste the data into an excel spreadsheet and label it MLB_standings.xlsx.

We have a lot more data in this sheet than we need, so now we are going to clean it.

Open the MLB_standings.xlsx spreadsheet. You only need the first two columns, so you can delete everything else. You also don’t need any of the headers that appear in the middle of the data, so you can delete all that as well. Label the columns ‘Teams’ and ‘Wins’ The last step in cleaning this data is to take out the letters that appear before the team names in the team name column, we will not need that for our analysis.

Your data should start out looking like Table A and end up like Table B.

A.       B.

Next go to http://www.spotrac.com/mlb/payroll/2016/ and copy and paste the data into a second excel spreadsheet labeled MLB_payroll.xlsx. Now we will clean the second set of data. Open the MLB_payroll.xlsx spreadsheet. The first thing you can do is delete row 15 labeled ‘average’, we won’t need this for our analysis. You will only need columns B and I, so you can delete all the others. Label these columns ‘Teams’ and ‘Total Payroll.’

Your data should start out looking like Table A and end up like Table B.

A.    B.

 

Combining the data

Next you’ll need to combine the data into one spreadsheet. Create a new spreadsheet and name it MLB_combined.xlsx. Then go back to your original spreadsheets and select all the data in MLB_standings.xlsx spreadsheet and sort the data alphabetically using the sort tool, which is under the ‘Data’ tab in excel. Do the same for the MLB_payroll.xlsx spreadsheet. With all your data in the same order, you can now move on to the next step. Copy and paste the data from the standings spreadsheet into columns A and B of  MLB_combined.xlsx spreadsheet, then copy just the ‘Wins’ column of the payroll spreadsheet into column C.

Save your work and you are ready to start visualizing the data.

 

Visualize the data

As Stray notes in The Curious Journalist’s Guide to Data, scatter plot graphics are a popular way to visualize correlation, so we decide to go that route and used Plot.ly. Plot.ly is a free online analytics and visualization tool. Anybody looking to explore data can go to plot.ly and sign on to an account.

Once in the plot.ly  select the scatter plot tool. Next you need to upload your data. The interface allows for easy Excel file upload in the upper right corner.

As soon as the data comes in you are then able to manipulate the axes in the toolbar on the left side of the screen. The last item to edit is the graph title and naming the axes. Once you’ve completed the database automatically stores it. You are then go to the repository where you can access the data and graphs and chose to export in a number of formats.

In the graphic below we decided to put the total salary payroll on the Y axis and the total wins on the X axis. With any data you can play around with the story you want to convey, however we believe that putting wins on the X axis is best as a reader’s eye goes from left to right and makes for the most compelling visual story.

Click on this for interactive mapping. (https://plot.ly/~hikatebrown/42.embed)

The above map includes the added line to highlight the correlation of money spent to games won. The line was added through photoshop and allows the reader to draw conclusions about correlation. The available link below the static graph allows the user to go to the embedded plot.ly server and interact with the map. Once there it’s easy to explore the numbers behind the data points.

 

What does the data mean?

Paying the most money doesn’t follow an exact linear path to winning the most games. However, as one might expect, the four teams that spent over $200 million all won more games than the three teams that spent under $100 million. 10 teams are really close to following the straight line sketched out in the chart above.

Having money to spend is important, but it’s how smart teams are with their resources that will make or break a season.  If money spent directly lead to wins, the Dodgers would probably be the most dominant team in the history of baseball with $280 million spent.  However, they ended up falling short in the playoffs.  No one wants to be the Yankees, who spent $227 million – 2nd highest in baseball – and fail to make the playoffs.  However the Indians, sitting in the bottom 10 with $114 million on their payroll, nearly won the World Series because they used their money smartly.  Not a bad position to be in.