Filtering Rows and Comlums

With SQL we can filter coloumns with SELECT and rows with WHERE. To show this I will use the Lahman package on RStudio, which is a baseball database. First, we need to need to install the two packages: Lahman and sqldf.

library(Lahman)
library(sqldf)

Suppose we want to see the total number of homeruns for the Yankees in 1927. We could write the following:

query<-"SELECT playerID,yearID,teamID,HR From Batting
Where teamID='NYA' and yearID=1927"
sqldf(query)
##     playerID yearID teamID HR
## 1  beallwa01   1927    NYA  0
## 2  bengobe01   1927    NYA  0
## 3  collipa01   1927    NYA  7
## 4  combsea01   1927    NYA  6
## 5  duganjo01   1927    NYA  2
## 6  durstce01   1927    NYA  0
## 7  gazelmi01   1927    NYA  0
## 8  gehrilo01   1927    NYA 47
## 9  giardjo01   1927    NYA  0
## 10 grabojo01   1927    NYA  0
## 11  hoytwa01   1927    NYA  0
## 12 koenima01   1927    NYA  3
## 13 lazzeto01   1927    NYA 18
## 14 meusebo01   1927    NYA  8
## 15 moorewi01   1927    NYA  1
## 16 morehra01   1927    NYA  1
## 17 paschbe01   1927    NYA  2
## 18 pennohe01   1927    NYA  0
## 19 pipgrge01   1927    NYA  1
## 20 ruethdu01   1927    NYA  1
## 21  ruthba01   1927    NYA 60
## 22 shawkbo01   1927    NYA  0
## 23 shockur01   1927    NYA  0
## 24 thomamy01   1927    NYA  0
## 25  weraju01   1927    NYA  1

Batting colnames(Batting)

Here we will learn how to create a query for New York Yankee players that have more than 40 homeruns.

query<-"SELECT playerID,yearID,teamID,HR From Batting
Where teamID='NYA' and HR>=40"
sqldf(query)
##     playerID yearID teamID HR
## 1   ruthba01   1920    NYA 54
## 2   ruthba01   1921    NYA 59
## 3   ruthba01   1923    NYA 41
## 4   ruthba01   1924    NYA 46
## 5   ruthba01   1926    NYA 47
## 6  gehrilo01   1927    NYA 47
## 7   ruthba01   1927    NYA 60
## 8   ruthba01   1928    NYA 54
## 9   ruthba01   1929    NYA 46
## 10 gehrilo01   1930    NYA 41
## 11  ruthba01   1930    NYA 49
## 12 gehrilo01   1931    NYA 46
## 13  ruthba01   1931    NYA 46
## 14  ruthba01   1932    NYA 41
## 15 gehrilo01   1934    NYA 49
## 16 gehrilo01   1936    NYA 49
## 17 dimagjo01   1937    NYA 46
## 18 mantlmi01   1956    NYA 52
## 19 mantlmi01   1958    NYA 42
## 20 mantlmi01   1960    NYA 40
## 21 mantlmi01   1961    NYA 54
## 22 marisro01   1961    NYA 61
## 23 jacksre01   1980    NYA 41
## 24 martiti02   1997    NYA 44
## 25 giambja01   2002    NYA 41
## 26 giambja01   2003    NYA 41
## 27 rodrial01   2005    NYA 48
## 28 rodrial01   2007    NYA 54
## 29 grandcu01   2011    NYA 41
## 30 grandcu01   2012    NYA 43

We will narrow the output more with a query example of players having more than 40 homeruns (HR) but less than 60 strikeouts(SO)

query1<-"SELECT playerID,yearID,teamID,SO,HR From Batting
Where SO<60 and HR>40"
sqldf(query1)
##     playerID yearID teamID SO HR
## 1  hornsro01   1922    SLN 50 42
## 2  willicy01   1923    PHI 57 41
## 3    ottme01   1929    NY1 38 42
## 4  gehrilo01   1931    NYA 56 46
## 5   ruthba01   1931    NYA 51 46
## 6  gehrilo01   1934    NYA 31 49
## 7  gehrilo01   1936    NYA 46 49
## 8  troskha01   1936    CLE 58 42
## 9  dimagjo01   1937    NYA 37 46
## 10  mizejo01   1940    SLN 49 43
## 11  mizejo01   1947    NY1 42 51
## 12 willite01   1949    BOS 48 43
## 13 kinerra01   1951    PIT 57 42
## 14 camparo01   1953    BRO 58 41
## 15 rosenal01   1953    CLE 48 43
## 16 kluszte01   1954    CIN 35 49
## 17  mayswi01   1954    NY1 57 41
## 18 kluszte01   1955    CIN 40 47
## 19 aaronha01   1957    ML1 58 44
## 20 sievero01   1957    WS1 55 42
## 21 aaronha01   1969    ATL 47 44
## 22 aaronha01   1971    ATL 58 47
## 23 thomafr04   1993    CHA 54 41
## 24 bondsba01   2002    SFN 47 46
## 25 bondsba01   2003    SFN 58 45
## 26 bondsba01   2004    SFN 41 45
## 27 pujolal01   2004    SLN 52 46
## 28 pujolal01   2006    SLN 50 49

We are producing the instances of Phillies in the 1970’s hitting more than 30 homeruns

query2<-"SELECT playerID,yearID,teamID,HR From Batting
Where yearID>1970 and yearID<1979 and HR>=30 and teamID='PHI'"
sqldf(query2)
##    playerID yearID teamID HR
## 1 johnsde01   1971    PHI 34
## 2 montawi01   1971    PHI 30
## 3 schmimi01   1974    PHI 36
## 4 luzingr01   1975    PHI 34
## 5 schmimi01   1975    PHI 38
## 6 schmimi01   1976    PHI 38
## 7 luzingr01   1977    PHI 39
## 8 schmimi01   1977    PHI 38
## 9 luzingr01   1978    PHI 35

Ranking Data

For this example we will present Player hits with more than 50 homeruns but lets have players with the most homeruns at the top

query3<-"SELECT playerID,yearID,teamID,HR From Batting
Where HR>=50
Order By HR DESC"
sqldf(query3)
##     playerID yearID teamID HR
## 1  bondsba01   2001    SFN 73
## 2  mcgwima01   1998    SLN 70
## 3   sosasa01   1998    CHN 66
## 4  mcgwima01   1999    SLN 65
## 5   sosasa01   2001    CHN 64
## 6   sosasa01   1999    CHN 63
## 7  marisro01   1961    NYA 61
## 8   ruthba01   1927    NYA 60
## 9   ruthba01   1921    NYA 59
## 10  foxxji01   1932    PHA 58
## 11 greenha01   1938    DET 58
## 12 howarry01   2006    PHI 58
## 13 gonzalu01   2001    ARI 57
## 14 rodrial01   2002    TEX 57
## 15 wilsoha01   1930    CHN 56
## 16 griffke02   1997    SEA 56
## 17 griffke02   1998    SEA 56
## 18  ruthba01   1920    NYA 54
## 19  ruthba01   1928    NYA 54
## 20 kinerra01   1949    PIT 54
## 21 mantlmi01   1961    NYA 54
## 22 ortizda01   2006    BOS 54
## 23 rodrial01   2007    NYA 54
## 24 bautijo02   2010    TOR 54
## 25 davisch02   2013    BAL 53
## 26 mantlmi01   1956    NYA 52
## 27  mayswi01   1965    SFN 52
## 28 fostege01   1977    CIN 52
## 29 mcgwima01   1996    OAK 52
## 30 rodrial01   2001    TEX 52
## 31 thomeji01   2002    CLE 52
## 32 kinerra01   1947    PIT 51
## 33  mizejo01   1947    NY1 51
## 34  mayswi01   1955    NY1 51
## 35 fieldce01   1990    DET 51
## 36 jonesan01   2005    ATL 51
## 37  foxxji01   1938    BOS 50
## 38 belleal01   1995    CLE 50
## 39 anderbr01   1996    BAL 50
## 40 vaughgr01   1998    SDN 50
## 41  sosasa01   2000    CHN 50
## 42 fieldpr01   2007    MIL 50

Find ll instances ofa player striking out less than 10 times. At least 400 at-bat(AB). Players with least SO at top

query4<-"SELECT playerID,yearID,teamID,SO, AB From Batting
Where AB>=400 and SO<10
Order by SO"
sqldf(query4)
##     playerID yearID teamID SO  AB
## 1  doyleja01   1894    NY1  3 422
## 2  seweljo01   1932    NYA  3 503
## 3  seweljo01   1925    CLE  4 608
## 4  seweljo01   1929    CLE  4 578
## 5  seweljo01   1933    NYA  4 524
## 6   wardjo01   1893    NY1  5 588
## 7  holloch01   1922    CHN  5 592
## 8  mcinnst01   1922    CLE  5 537
## 9  wanerll01   1936    PIT  5 414
## 10 wrighge01   1875    BS1  6 408
## 11 broutda01   1889    BSN  6 485
## 12 keelewi01   1894    BLN  6 590
## 13  wardjo01   1894    NY1  6 540
## 14 quinnjo02   1895    SLN  6 543
## 15 mcinnst01   1924    BSN  6 581
## 16 seweljo01   1926    CLE  6 578
## 17  wardjo01   1889    NY1  7 479
## 18 crossla01   1893    PHI  7 415
## 19 quinnjo02   1893    SLN  7 547
## 20 crossla01   1894    PHI  7 529
## 21 vaughfa01   1896    CIN  7 433
## 22 cochrmi01   1927    PHA  7 432
## 23 seweljo01   1927    CLE  7 569
## 24 traynpi01   1929    PIT  7 540
## 25 muelldo01   1956    NY1  7 453
## 26 connoro01   1885    NY1  8 455
## 27 glassja01   1887    IN3  8 483
## 28 glassja01   1890    NY1  8 512
## 29 donovpa01   1893    PIT  8 499
## 30 dungasa01   1893    CHN  8 465
## 31 pinknge01   1893    LS3  8 446
## 32 brodist01   1894    BLN  8 573
## 33 quinnjo02   1894    SLN  8 405
## 34 bierblo01   1895    PIT  8 466
## 35 crossla01   1895    PHI  8 535
## 36   hoydu01   1895    CIN  8 429
## 37 roushed01   1921    CIN  8 418
## 38 collied01   1923    CHA  8 505
## 39 collied01   1925    CHA  8 425
## 40 speaktr01   1927    WS1  8 523
## 41 cochrmi01   1929    PHA  8 514
## 42 seweljo01   1931    NYA  8 484
## 43 wanerll01   1933    PIT  8 500
## 44 verbaem01   1947    PHI  8 540
## 45 ansonca01   1883    CHN  9 413
## 46 broutda01   1887    DTN  9 500
## 47   hoydu01   1893    WAS  9 564
## 48 bierblo01   1894    PIT  9 525
## 49 broutda01   1894    BLN  9 525
## 50 milledo01   1894    SLN  9 481
## 51 keelewi01   1896    BLN  9 544
## 52 mckeaed01   1896    CL4  9 571
## 53 speaktr01   1918    CLE  9 471
## 54  dealch01   1921    CHN  9 422
## 55 mcinnst01   1921    BOS  9 584
## 56 severha01   1921    SLA  9 472
## 57  highan01   1926    BSN  9 476
## 58 summaho01   1926    CLE  9 581
## 59 seweljo01   1928    CLE  9 588
## 60  ricesa01   1929    WS1  9 616
## 61 leachfr01   1931    NY1  9 515
## 62 busched01   1945    PHA  9 416
## 63 holmeto01   1945    BSN  9 636
## 64 boudrlo01   1948    CLE  9 560
## 65 mitchda01   1952    CLE  9 511

Aggregation

Aggreation is the clustering of pieces to output one final result. This can be achieved my using grouping and Sum and average. Continuing with our use of the Lahman database we will:

Find Babe Ruth’s career homerun total and squash it into one no comma after homerun

query5<-"select playerID, sum(HR) From Batting
WHERE playerID='ruthba01'
Group by playerID"

sqldf(query5)
##   playerID sum(HR)
## 1 ruthba01     714

Find the homerun totals of all players but limit the display to only those that hit 600 or more. The players with the highest total at the top where is filtering before group and having is grouping after

query6<-"select playerID, sum(HR) From Batting
Group by playerID
Having sum(HR)>600
Order by sum(HR) Desc"

sqldf(query6)
##    playerID sum(HR)
## 1 bondsba01     762
## 2 aaronha01     755
## 3  ruthba01     714
## 4 rodrial01     696
## 5  mayswi01     660
## 6 griffke02     630
## 7 thomeji01     612
## 8  sosasa01     609

Ordering

Here we will find the players with the highest homerun average over their career. We will filter out those who have an average of more than 30 HR showing players with the best average at the top avg() #max() #min()

query7<-"select playerID, avg(HR) From Batting
Group by playerID
Having avg(HR)>30
Order by avg(HR) Desc"

sqldf(query7)
##     playerID  avg(HR)
## 1  pujolal01 36.93750
## 2  bondsba01 34.63636
## 3  mcgwima01 34.29412
## 4  kinerra01 33.54545
## 5  aaronha01 32.82609
## 6  bryankr01 32.50000
## 7   ruthba01 32.45455
## 8   sosasa01 32.05263
## 9  cabremi01 31.85714
## 10 belleal01 31.75000
## 11 rodrial01 31.63636
## 12 schmimi01 30.44444
## 13 abreujo02 30.33333

Joins

This is the joining of two tables using the common variable in both tables using Inner Join and “On Table1.variable=Table2.variable.” When mentioning the variable I have to specify which table it is coming from as seen in the Where table.variable.

query8<-"select nameFirst, nameLast, yearID, teamID, HR 
From Batting Inner Join Master
On Batting.playerID=Master.playerID
Where Batting.playerID='ruthba01'"
sqldf(query8)
##    nameFirst nameLast yearID teamID HR
## 1       Babe     Ruth   1914    BOS  0
## 2       Babe     Ruth   1915    BOS  4
## 3       Babe     Ruth   1916    BOS  3
## 4       Babe     Ruth   1917    BOS  2
## 5       Babe     Ruth   1918    BOS 11
## 6       Babe     Ruth   1919    BOS 29
## 7       Babe     Ruth   1920    NYA 54
## 8       Babe     Ruth   1921    NYA 59
## 9       Babe     Ruth   1922    NYA 35
## 10      Babe     Ruth   1923    NYA 41
## 11      Babe     Ruth   1924    NYA 46
## 12      Babe     Ruth   1925    NYA 25
## 13      Babe     Ruth   1926    NYA 47
## 14      Babe     Ruth   1927    NYA 60
## 15      Babe     Ruth   1928    NYA 54
## 16      Babe     Ruth   1929    NYA 46
## 17      Babe     Ruth   1930    NYA 49
## 18      Babe     Ruth   1931    NYA 46
## 19      Babe     Ruth   1932    NYA 41
## 20      Babe     Ruth   1933    NYA 34
## 21      Babe     Ruth   1934    NYA 22
## 22      Babe     Ruth   1935    BSN  6

Find all instances where players hitting more than 50 homeruns give their full name

query9<-"Select nameFirst, nameLast, yearID, teamID, HR 
From Batting Inner Join Master
On Batting.playerID=Master.playerID
Where Hr>50
order by HR Desc"
sqldf(query9)
##    nameFirst  nameLast yearID teamID HR
## 1      Barry     Bonds   2001    SFN 73
## 2       Mark   McGwire   1998    SLN 70
## 3      Sammy      Sosa   1998    CHN 66
## 4       Mark   McGwire   1999    SLN 65
## 5      Sammy      Sosa   2001    CHN 64
## 6      Sammy      Sosa   1999    CHN 63
## 7      Roger     Maris   1961    NYA 61
## 8       Babe      Ruth   1927    NYA 60
## 9       Babe      Ruth   1921    NYA 59
## 10    Jimmie      Foxx   1932    PHA 58
## 11      Hank Greenberg   1938    DET 58
## 12      Ryan    Howard   2006    PHI 58
## 13      Luis  Gonzalez   2001    ARI 57
## 14      Alex Rodriguez   2002    TEX 57
## 15      Hack    Wilson   1930    CHN 56
## 16       Ken   Griffey   1997    SEA 56
## 17       Ken   Griffey   1998    SEA 56
## 18      Babe      Ruth   1920    NYA 54
## 19      Babe      Ruth   1928    NYA 54
## 20     Ralph     Kiner   1949    PIT 54
## 21    Mickey    Mantle   1961    NYA 54
## 22     David     Ortiz   2006    BOS 54
## 23      Alex Rodriguez   2007    NYA 54
## 24      Jose  Bautista   2010    TOR 54
## 25     Chris     Davis   2013    BAL 53
## 26    Mickey    Mantle   1956    NYA 52
## 27    Willie      Mays   1965    SFN 52
## 28    George    Foster   1977    CIN 52
## 29      Mark   McGwire   1996    OAK 52
## 30      Alex Rodriguez   2001    TEX 52
## 31       Jim     Thome   2002    CLE 52
## 32     Ralph     Kiner   1947    PIT 51
## 33    Johnny      Mize   1947    NY1 51
## 34    Willie      Mays   1955    NY1 51
## 35     Cecil   Fielder   1990    DET 51
## 36    Andruw     Jones   2005    ATL 51

Our next example is showing Babe Ruth with his team name using the Teams database

query10<-"Select playerID, Batting.yearID, name, Batting.HR 
From Batting Inner Join Teams 
On Batting.teamID=Teams.teamID AND Batting.yearID=Teams.yearID
Where playerID='ruthba01'
order by Batting.HR Desc"
sqldf(query10)
##    playerID yearID             name HR
## 1  ruthba01   1927 New York Yankees 60
## 2  ruthba01   1921 New York Yankees 59
## 3  ruthba01   1920 New York Yankees 54
## 4  ruthba01   1928 New York Yankees 54
## 5  ruthba01   1930 New York Yankees 49
## 6  ruthba01   1926 New York Yankees 47
## 7  ruthba01   1924 New York Yankees 46
## 8  ruthba01   1929 New York Yankees 46
## 9  ruthba01   1931 New York Yankees 46
## 10 ruthba01   1923 New York Yankees 41
## 11 ruthba01   1932 New York Yankees 41
## 12 ruthba01   1922 New York Yankees 35
## 13 ruthba01   1933 New York Yankees 34
## 14 ruthba01   1919   Boston Red Sox 29
## 15 ruthba01   1925 New York Yankees 25
## 16 ruthba01   1934 New York Yankees 22
## 17 ruthba01   1918   Boston Red Sox 11
## 18 ruthba01   1935    Boston Braves  6
## 19 ruthba01   1915   Boston Red Sox  4
## 20 ruthba01   1916   Boston Red Sox  3
## 21 ruthba01   1917   Boston Red Sox  2
## 22 ruthba01   1914   Boston Red Sox  0

For our next example we will list all the players named Bob who averaged more than one million dollars per year in salary for their career, including Last Name. We will use colnames: Master and Salaries

query11<- "Select nameFirst,nameLast,nameLast,avg(salary) 
From Master Inner Join Salaries
On Master.playerID=Salaries.playerID
Where nameFirst='Bob'
Group by Master.playerID
Having avg(salary)>1000000"
sqldf(query11)
##   nameFirst  nameLast  nameLast avg(salary)
## 1       Bob     Boone     Boone     1047580
## 2       Bob    Horner    Horner     1416667
## 3       Bob     Howry     Howry     1804583
## 4       Bob     Ojeda     Ojeda     1035926
## 5       Bob   Stanley   Stanley     1067500
## 6       Bob Tewksbury Tewksbury     1288182
## 7       Bob     Welch     Welch     1930417
## 8       Bob   Wickman   Wickman     2635813