With SQL we can filter coloumns with SELECT and rows with WHERE. To show this I will use the Lahman package on RStudio, which is a baseball database. First, we need to need to install the two packages: Lahman and sqldf.
library(Lahman)
library(sqldf)
Suppose we want to see the total number of homeruns for the Yankees in 1927. We could write the following:
query<-"SELECT playerID,yearID,teamID,HR From Batting
Where teamID='NYA' and yearID=1927"
sqldf(query)
## playerID yearID teamID HR
## 1 beallwa01 1927 NYA 0
## 2 bengobe01 1927 NYA 0
## 3 collipa01 1927 NYA 7
## 4 combsea01 1927 NYA 6
## 5 duganjo01 1927 NYA 2
## 6 durstce01 1927 NYA 0
## 7 gazelmi01 1927 NYA 0
## 8 gehrilo01 1927 NYA 47
## 9 giardjo01 1927 NYA 0
## 10 grabojo01 1927 NYA 0
## 11 hoytwa01 1927 NYA 0
## 12 koenima01 1927 NYA 3
## 13 lazzeto01 1927 NYA 18
## 14 meusebo01 1927 NYA 8
## 15 moorewi01 1927 NYA 1
## 16 morehra01 1927 NYA 1
## 17 paschbe01 1927 NYA 2
## 18 pennohe01 1927 NYA 0
## 19 pipgrge01 1927 NYA 1
## 20 ruethdu01 1927 NYA 1
## 21 ruthba01 1927 NYA 60
## 22 shawkbo01 1927 NYA 0
## 23 shockur01 1927 NYA 0
## 24 thomamy01 1927 NYA 0
## 25 weraju01 1927 NYA 1
Batting colnames(Batting)
Here we will learn how to create a query for New York Yankee players that have more than 40 homeruns.
query<-"SELECT playerID,yearID,teamID,HR From Batting
Where teamID='NYA' and HR>=40"
sqldf(query)
## playerID yearID teamID HR
## 1 ruthba01 1920 NYA 54
## 2 ruthba01 1921 NYA 59
## 3 ruthba01 1923 NYA 41
## 4 ruthba01 1924 NYA 46
## 5 ruthba01 1926 NYA 47
## 6 gehrilo01 1927 NYA 47
## 7 ruthba01 1927 NYA 60
## 8 ruthba01 1928 NYA 54
## 9 ruthba01 1929 NYA 46
## 10 gehrilo01 1930 NYA 41
## 11 ruthba01 1930 NYA 49
## 12 gehrilo01 1931 NYA 46
## 13 ruthba01 1931 NYA 46
## 14 ruthba01 1932 NYA 41
## 15 gehrilo01 1934 NYA 49
## 16 gehrilo01 1936 NYA 49
## 17 dimagjo01 1937 NYA 46
## 18 mantlmi01 1956 NYA 52
## 19 mantlmi01 1958 NYA 42
## 20 mantlmi01 1960 NYA 40
## 21 mantlmi01 1961 NYA 54
## 22 marisro01 1961 NYA 61
## 23 jacksre01 1980 NYA 41
## 24 martiti02 1997 NYA 44
## 25 giambja01 2002 NYA 41
## 26 giambja01 2003 NYA 41
## 27 rodrial01 2005 NYA 48
## 28 rodrial01 2007 NYA 54
## 29 grandcu01 2011 NYA 41
## 30 grandcu01 2012 NYA 43
We will narrow the output more with a query example of players having more than 40 homeruns (HR) but less than 60 strikeouts(SO)
query1<-"SELECT playerID,yearID,teamID,SO,HR From Batting
Where SO<60 and HR>40"
sqldf(query1)
## playerID yearID teamID SO HR
## 1 hornsro01 1922 SLN 50 42
## 2 willicy01 1923 PHI 57 41
## 3 ottme01 1929 NY1 38 42
## 4 gehrilo01 1931 NYA 56 46
## 5 ruthba01 1931 NYA 51 46
## 6 gehrilo01 1934 NYA 31 49
## 7 gehrilo01 1936 NYA 46 49
## 8 troskha01 1936 CLE 58 42
## 9 dimagjo01 1937 NYA 37 46
## 10 mizejo01 1940 SLN 49 43
## 11 mizejo01 1947 NY1 42 51
## 12 willite01 1949 BOS 48 43
## 13 kinerra01 1951 PIT 57 42
## 14 camparo01 1953 BRO 58 41
## 15 rosenal01 1953 CLE 48 43
## 16 kluszte01 1954 CIN 35 49
## 17 mayswi01 1954 NY1 57 41
## 18 kluszte01 1955 CIN 40 47
## 19 aaronha01 1957 ML1 58 44
## 20 sievero01 1957 WS1 55 42
## 21 aaronha01 1969 ATL 47 44
## 22 aaronha01 1971 ATL 58 47
## 23 thomafr04 1993 CHA 54 41
## 24 bondsba01 2002 SFN 47 46
## 25 bondsba01 2003 SFN 58 45
## 26 bondsba01 2004 SFN 41 45
## 27 pujolal01 2004 SLN 52 46
## 28 pujolal01 2006 SLN 50 49
We are producing the instances of Phillies in the 1970’s hitting more than 30 homeruns
query2<-"SELECT playerID,yearID,teamID,HR From Batting
Where yearID>1970 and yearID<1979 and HR>=30 and teamID='PHI'"
sqldf(query2)
## playerID yearID teamID HR
## 1 johnsde01 1971 PHI 34
## 2 montawi01 1971 PHI 30
## 3 schmimi01 1974 PHI 36
## 4 luzingr01 1975 PHI 34
## 5 schmimi01 1975 PHI 38
## 6 schmimi01 1976 PHI 38
## 7 luzingr01 1977 PHI 39
## 8 schmimi01 1977 PHI 38
## 9 luzingr01 1978 PHI 35
For this example we will present Player hits with more than 50 homeruns but lets have players with the most homeruns at the top
query3<-"SELECT playerID,yearID,teamID,HR From Batting
Where HR>=50
Order By HR DESC"
sqldf(query3)
## playerID yearID teamID HR
## 1 bondsba01 2001 SFN 73
## 2 mcgwima01 1998 SLN 70
## 3 sosasa01 1998 CHN 66
## 4 mcgwima01 1999 SLN 65
## 5 sosasa01 2001 CHN 64
## 6 sosasa01 1999 CHN 63
## 7 marisro01 1961 NYA 61
## 8 ruthba01 1927 NYA 60
## 9 ruthba01 1921 NYA 59
## 10 foxxji01 1932 PHA 58
## 11 greenha01 1938 DET 58
## 12 howarry01 2006 PHI 58
## 13 gonzalu01 2001 ARI 57
## 14 rodrial01 2002 TEX 57
## 15 wilsoha01 1930 CHN 56
## 16 griffke02 1997 SEA 56
## 17 griffke02 1998 SEA 56
## 18 ruthba01 1920 NYA 54
## 19 ruthba01 1928 NYA 54
## 20 kinerra01 1949 PIT 54
## 21 mantlmi01 1961 NYA 54
## 22 ortizda01 2006 BOS 54
## 23 rodrial01 2007 NYA 54
## 24 bautijo02 2010 TOR 54
## 25 davisch02 2013 BAL 53
## 26 mantlmi01 1956 NYA 52
## 27 mayswi01 1965 SFN 52
## 28 fostege01 1977 CIN 52
## 29 mcgwima01 1996 OAK 52
## 30 rodrial01 2001 TEX 52
## 31 thomeji01 2002 CLE 52
## 32 kinerra01 1947 PIT 51
## 33 mizejo01 1947 NY1 51
## 34 mayswi01 1955 NY1 51
## 35 fieldce01 1990 DET 51
## 36 jonesan01 2005 ATL 51
## 37 foxxji01 1938 BOS 50
## 38 belleal01 1995 CLE 50
## 39 anderbr01 1996 BAL 50
## 40 vaughgr01 1998 SDN 50
## 41 sosasa01 2000 CHN 50
## 42 fieldpr01 2007 MIL 50
Find ll instances ofa player striking out less than 10 times. At least 400 at-bat(AB). Players with least SO at top
query4<-"SELECT playerID,yearID,teamID,SO, AB From Batting
Where AB>=400 and SO<10
Order by SO"
sqldf(query4)
## playerID yearID teamID SO AB
## 1 doyleja01 1894 NY1 3 422
## 2 seweljo01 1932 NYA 3 503
## 3 seweljo01 1925 CLE 4 608
## 4 seweljo01 1929 CLE 4 578
## 5 seweljo01 1933 NYA 4 524
## 6 wardjo01 1893 NY1 5 588
## 7 holloch01 1922 CHN 5 592
## 8 mcinnst01 1922 CLE 5 537
## 9 wanerll01 1936 PIT 5 414
## 10 wrighge01 1875 BS1 6 408
## 11 broutda01 1889 BSN 6 485
## 12 keelewi01 1894 BLN 6 590
## 13 wardjo01 1894 NY1 6 540
## 14 quinnjo02 1895 SLN 6 543
## 15 mcinnst01 1924 BSN 6 581
## 16 seweljo01 1926 CLE 6 578
## 17 wardjo01 1889 NY1 7 479
## 18 crossla01 1893 PHI 7 415
## 19 quinnjo02 1893 SLN 7 547
## 20 crossla01 1894 PHI 7 529
## 21 vaughfa01 1896 CIN 7 433
## 22 cochrmi01 1927 PHA 7 432
## 23 seweljo01 1927 CLE 7 569
## 24 traynpi01 1929 PIT 7 540
## 25 muelldo01 1956 NY1 7 453
## 26 connoro01 1885 NY1 8 455
## 27 glassja01 1887 IN3 8 483
## 28 glassja01 1890 NY1 8 512
## 29 donovpa01 1893 PIT 8 499
## 30 dungasa01 1893 CHN 8 465
## 31 pinknge01 1893 LS3 8 446
## 32 brodist01 1894 BLN 8 573
## 33 quinnjo02 1894 SLN 8 405
## 34 bierblo01 1895 PIT 8 466
## 35 crossla01 1895 PHI 8 535
## 36 hoydu01 1895 CIN 8 429
## 37 roushed01 1921 CIN 8 418
## 38 collied01 1923 CHA 8 505
## 39 collied01 1925 CHA 8 425
## 40 speaktr01 1927 WS1 8 523
## 41 cochrmi01 1929 PHA 8 514
## 42 seweljo01 1931 NYA 8 484
## 43 wanerll01 1933 PIT 8 500
## 44 verbaem01 1947 PHI 8 540
## 45 ansonca01 1883 CHN 9 413
## 46 broutda01 1887 DTN 9 500
## 47 hoydu01 1893 WAS 9 564
## 48 bierblo01 1894 PIT 9 525
## 49 broutda01 1894 BLN 9 525
## 50 milledo01 1894 SLN 9 481
## 51 keelewi01 1896 BLN 9 544
## 52 mckeaed01 1896 CL4 9 571
## 53 speaktr01 1918 CLE 9 471
## 54 dealch01 1921 CHN 9 422
## 55 mcinnst01 1921 BOS 9 584
## 56 severha01 1921 SLA 9 472
## 57 highan01 1926 BSN 9 476
## 58 summaho01 1926 CLE 9 581
## 59 seweljo01 1928 CLE 9 588
## 60 ricesa01 1929 WS1 9 616
## 61 leachfr01 1931 NY1 9 515
## 62 busched01 1945 PHA 9 416
## 63 holmeto01 1945 BSN 9 636
## 64 boudrlo01 1948 CLE 9 560
## 65 mitchda01 1952 CLE 9 511
Aggreation is the clustering of pieces to output one final result. This can be achieved my using grouping and Sum and average. Continuing with our use of the Lahman database we will:
Find Babe Ruth’s career homerun total and squash it into one no comma after homerun
query5<-"select playerID, sum(HR) From Batting
WHERE playerID='ruthba01'
Group by playerID"
sqldf(query5)
## playerID sum(HR)
## 1 ruthba01 714
Find the homerun totals of all players but limit the display to only those that hit 600 or more. The players with the highest total at the top where is filtering before group and having is grouping after
query6<-"select playerID, sum(HR) From Batting
Group by playerID
Having sum(HR)>600
Order by sum(HR) Desc"
sqldf(query6)
## playerID sum(HR)
## 1 bondsba01 762
## 2 aaronha01 755
## 3 ruthba01 714
## 4 rodrial01 696
## 5 mayswi01 660
## 6 griffke02 630
## 7 thomeji01 612
## 8 sosasa01 609
Here we will find the players with the highest homerun average over their career. We will filter out those who have an average of more than 30 HR showing players with the best average at the top avg() #max() #min()
query7<-"select playerID, avg(HR) From Batting
Group by playerID
Having avg(HR)>30
Order by avg(HR) Desc"
sqldf(query7)
## playerID avg(HR)
## 1 pujolal01 36.93750
## 2 bondsba01 34.63636
## 3 mcgwima01 34.29412
## 4 kinerra01 33.54545
## 5 aaronha01 32.82609
## 6 bryankr01 32.50000
## 7 ruthba01 32.45455
## 8 sosasa01 32.05263
## 9 cabremi01 31.85714
## 10 belleal01 31.75000
## 11 rodrial01 31.63636
## 12 schmimi01 30.44444
## 13 abreujo02 30.33333
This is the joining of two tables using the common variable in both tables using Inner Join and “On Table1.variable=Table2.variable.” When mentioning the variable I have to specify which table it is coming from as seen in the Where table.variable.
query8<-"select nameFirst, nameLast, yearID, teamID, HR
From Batting Inner Join Master
On Batting.playerID=Master.playerID
Where Batting.playerID='ruthba01'"
sqldf(query8)
## nameFirst nameLast yearID teamID HR
## 1 Babe Ruth 1914 BOS 0
## 2 Babe Ruth 1915 BOS 4
## 3 Babe Ruth 1916 BOS 3
## 4 Babe Ruth 1917 BOS 2
## 5 Babe Ruth 1918 BOS 11
## 6 Babe Ruth 1919 BOS 29
## 7 Babe Ruth 1920 NYA 54
## 8 Babe Ruth 1921 NYA 59
## 9 Babe Ruth 1922 NYA 35
## 10 Babe Ruth 1923 NYA 41
## 11 Babe Ruth 1924 NYA 46
## 12 Babe Ruth 1925 NYA 25
## 13 Babe Ruth 1926 NYA 47
## 14 Babe Ruth 1927 NYA 60
## 15 Babe Ruth 1928 NYA 54
## 16 Babe Ruth 1929 NYA 46
## 17 Babe Ruth 1930 NYA 49
## 18 Babe Ruth 1931 NYA 46
## 19 Babe Ruth 1932 NYA 41
## 20 Babe Ruth 1933 NYA 34
## 21 Babe Ruth 1934 NYA 22
## 22 Babe Ruth 1935 BSN 6
Find all instances where players hitting more than 50 homeruns give their full name
query9<-"Select nameFirst, nameLast, yearID, teamID, HR
From Batting Inner Join Master
On Batting.playerID=Master.playerID
Where Hr>50
order by HR Desc"
sqldf(query9)
## nameFirst nameLast yearID teamID HR
## 1 Barry Bonds 2001 SFN 73
## 2 Mark McGwire 1998 SLN 70
## 3 Sammy Sosa 1998 CHN 66
## 4 Mark McGwire 1999 SLN 65
## 5 Sammy Sosa 2001 CHN 64
## 6 Sammy Sosa 1999 CHN 63
## 7 Roger Maris 1961 NYA 61
## 8 Babe Ruth 1927 NYA 60
## 9 Babe Ruth 1921 NYA 59
## 10 Jimmie Foxx 1932 PHA 58
## 11 Hank Greenberg 1938 DET 58
## 12 Ryan Howard 2006 PHI 58
## 13 Luis Gonzalez 2001 ARI 57
## 14 Alex Rodriguez 2002 TEX 57
## 15 Hack Wilson 1930 CHN 56
## 16 Ken Griffey 1997 SEA 56
## 17 Ken Griffey 1998 SEA 56
## 18 Babe Ruth 1920 NYA 54
## 19 Babe Ruth 1928 NYA 54
## 20 Ralph Kiner 1949 PIT 54
## 21 Mickey Mantle 1961 NYA 54
## 22 David Ortiz 2006 BOS 54
## 23 Alex Rodriguez 2007 NYA 54
## 24 Jose Bautista 2010 TOR 54
## 25 Chris Davis 2013 BAL 53
## 26 Mickey Mantle 1956 NYA 52
## 27 Willie Mays 1965 SFN 52
## 28 George Foster 1977 CIN 52
## 29 Mark McGwire 1996 OAK 52
## 30 Alex Rodriguez 2001 TEX 52
## 31 Jim Thome 2002 CLE 52
## 32 Ralph Kiner 1947 PIT 51
## 33 Johnny Mize 1947 NY1 51
## 34 Willie Mays 1955 NY1 51
## 35 Cecil Fielder 1990 DET 51
## 36 Andruw Jones 2005 ATL 51
Our next example is showing Babe Ruth with his team name using the Teams database
query10<-"Select playerID, Batting.yearID, name, Batting.HR
From Batting Inner Join Teams
On Batting.teamID=Teams.teamID AND Batting.yearID=Teams.yearID
Where playerID='ruthba01'
order by Batting.HR Desc"
sqldf(query10)
## playerID yearID name HR
## 1 ruthba01 1927 New York Yankees 60
## 2 ruthba01 1921 New York Yankees 59
## 3 ruthba01 1920 New York Yankees 54
## 4 ruthba01 1928 New York Yankees 54
## 5 ruthba01 1930 New York Yankees 49
## 6 ruthba01 1926 New York Yankees 47
## 7 ruthba01 1924 New York Yankees 46
## 8 ruthba01 1929 New York Yankees 46
## 9 ruthba01 1931 New York Yankees 46
## 10 ruthba01 1923 New York Yankees 41
## 11 ruthba01 1932 New York Yankees 41
## 12 ruthba01 1922 New York Yankees 35
## 13 ruthba01 1933 New York Yankees 34
## 14 ruthba01 1919 Boston Red Sox 29
## 15 ruthba01 1925 New York Yankees 25
## 16 ruthba01 1934 New York Yankees 22
## 17 ruthba01 1918 Boston Red Sox 11
## 18 ruthba01 1935 Boston Braves 6
## 19 ruthba01 1915 Boston Red Sox 4
## 20 ruthba01 1916 Boston Red Sox 3
## 21 ruthba01 1917 Boston Red Sox 2
## 22 ruthba01 1914 Boston Red Sox 0
For our next example we will list all the players named Bob who averaged more than one million dollars per year in salary for their career, including Last Name. We will use colnames: Master and Salaries
query11<- "Select nameFirst,nameLast,nameLast,avg(salary)
From Master Inner Join Salaries
On Master.playerID=Salaries.playerID
Where nameFirst='Bob'
Group by Master.playerID
Having avg(salary)>1000000"
sqldf(query11)
## nameFirst nameLast nameLast avg(salary)
## 1 Bob Boone Boone 1047580
## 2 Bob Horner Horner 1416667
## 3 Bob Howry Howry 1804583
## 4 Bob Ojeda Ojeda 1035926
## 5 Bob Stanley Stanley 1067500
## 6 Bob Tewksbury Tewksbury 1288182
## 7 Bob Welch Welch 1930417
## 8 Bob Wickman Wickman 2635813