import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
Apache Hadoop
Hadoop provides storage and ways to easily process big data sets. Storage is managed by the Hadoop Distributed File System (HDFS), and the data is processed using MapReduce.
- HDFS divides up data from multiple sources and distributes it across different servers to be processed. The computing environment is redundant, allowing the application to keep running if a server fails.
- MapReduce distributes data across multiple machines and then brings the processed data back together so it’s coherent.
Hadoop has its limits, however. It cannot process data in real time; it can only collect data for a certain period of time and then process it all at once. This is called batch processing.
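The MapReduce idea can be sketched in plain Python on a single machine. This is only an illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the function names and the sample `chunks` are hypothetical, and each string stands in for a block of data stored on a different server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: each string plays the role of data held on one server.
chunks = ["big data big", "data sets", "big sets"]

# Each chunk is mapped independently, then the results are brought together.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In a real cluster the map calls would run on different machines at the same time; here they run one after another, which is enough to show how the pieces fit together.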
Apache Spark
Spark is built for processing large amounts of data, as well as data analysis, machine learning, data visualization, and streaming real-time data.
Spark starts with the driver node, which communicates with the cluster manager. The cluster manager distributes tasks to different worker nodes. Worker nodes execute the tasks they were given, communicate with each other if needed, and send the results back to the driver node.
Here are some other attributes of Spark:
1. In-Memory Processing
- Loads data into memory once and performs all operations in-memory
2. Data Reuse
- Data is cached so that it can be reused
3. Faster Execution
- Allows for real-time processing
Hadoop and Spark can be used together to store big data sets and quickly process data.
PySpark
PySpark is an API that allows the use of Spark in Python.
- PySpark can incorporate Pandas DataFrames and SQL tables.
PySpark has methods that make data transformation easy to complete, similar to Pandas. A Spark DataFrame has a few key differences:
1. Data is distributed among different machines
2. The same operations are executed on each node
3. Can process more data than one machine can handle
4. Transformations are not computed until called to action (lazy evaluation)
5. High fault tolerance; can function if a node is disabled and recovers lost data
6. Built for extremely large amounts of data
PySpark in Google Colab
PySpark is very similar to Pandas. It is very convenient to transform data just like we learned before, just with a slightly different syntax.
Here are the basics of coding with PySpark:
Loading Data
Every time you use PySpark, you must establish a SparkSession entry point. This allows you to transform DataFrames and SQL tables.
There are two ways to approach reading a CSV file. First, if the file is in your local directory, follow this syntax that is similar to Pandas:
path = '/content/drive/MyDrive/lecture-data/cces.csv'
df = spark.read.csv(path,
                    header=True,
                    inferSchema=True)
df.show()
- Note: to display a DataFrame at any point while you’re working with it, you must call .show().
Second, if the file is from a URL, you have to create a Pandas DataFrame first. From there, you can convert the Pandas DataFrame into a Spark DataFrame.
import pandas as pd
df_pd = pd.read_csv('https://bcdanl.github.io/data/nba.csv')
df = spark.createDataFrame(df_pd)
df.show()
+---------------+--------------------+--------+--------+--------+
| Name| Team|Position|Birthday| Salary|
+---------------+--------------------+--------+--------+--------+
| Shake Milton| Philadelphia 76ers| SG| 9/26/96| 1445697|
| Christian Wood| Detroit Pistons| PF| 9/27/95| 1645357|
| PJ Washington| Charlotte Hornets| PF| 8/23/98| 3831840|
| Derrick Rose| Detroit Pistons| PG| 10/4/88| 7317074|
| Marial Shayok| Philadelphia 76ers| G| 7/26/95| 79568|
| Draymond Green|Golden State Warr...| PF| 3/4/90|18539130|
| Kendrick Nunn| Miami Heat| SG| 8/3/95| 1416852|
| Cedi Osman| Cleveland Cavaliers| SF| 4/8/95| 2907143|
| Brook Lopez| Milwaukee Bucks| C| 4/1/88|12093024|
| Torrey Craig| Denver Nuggets| SF|12/19/90| 2000000|
|Jordan Clarkson| Cleveland Cavaliers| PG| 6/7/92|13437500|
| Alex Caruso| Los Angeles Lakers| PG| 2/28/94| 2750000|
| Norvel Pelle| Philadelphia 76ers| FC| 2/3/93| 79568|
| Tyler Johnson| Phoenix Suns| PG| 5/7/92|19245370|
| Alec Burks|Golden State Warr...| SG| 7/20/91| 2320044|
| JaMychal Green|Los Angeles Clippers| PF| 6/21/90| 4767000|
| Dwight Howard| Los Angeles Lakers| C| 12/8/85| 5603850|
| Nikola Jokic| Denver Nuggets| C| 2/19/95|27504630|
| Chris Boucher| Toronto Raptors| PF| 1/11/93| 1588231|
| Marcus Morris| New York Knicks| PF| 9/2/89|15000000|
+---------------+--------------------+--------+--------+--------+
only showing top 20 rows
Summarizing Data
- df.printSchema() prints column names and data types
  - nullable = true in the output means that the column is allowed to contain null values
- df.columns returns a list of column names
- df.dtypes returns a list of tuples containing the column name and data type
- df.count() returns the total number of rows
- df.describe() computes summary statistics for each column; call .show() on the result to print them
Displaying Data
df.show(): by default, shows the first 20 rows
- arguments:
  - n = : number of rows to display
  - truncate = : either a boolean value or a number specifying how many characters to keep
  - vertical = : boolean value; if True, each row is displayed vertically
Selecting Columns
Selecting one column:
df.select("Name").show(5)
+--------------+
| Name|
+--------------+
| Shake Milton|
|Christian Wood|
| PJ Washington|
| Derrick Rose|
| Marial Shayok|
+--------------+
only showing top 5 rows
Selecting multiple columns:
df.select("Name", "Team", "Salary").show(5)
+--------------+------------------+-------+
| Name| Team| Salary|
+--------------+------------------+-------+
| Shake Milton|Philadelphia 76ers|1445697|
|Christian Wood| Detroit Pistons|1645357|
| PJ Washington| Charlotte Hornets|3831840|
| Derrick Rose| Detroit Pistons|7317074|
| Marial Shayok|Philadelphia 76ers| 79568|
+--------------+------------------+-------+
only showing top 5 rows
Counting Methods
As mentioned previously, you can use df.count() for a count of the entire DataFrame. You can also count values within specific columns. Here are two ways to do this:
from pyspark.sql.functions import countDistinct
num_teams = df.select(countDistinct("Team")).collect()[0][0]
num_teams
30
This code returns the number of unique values in the Team column.
df.groupBy("Team").count().show(5)
+--------------------+-----+
| Team|count|
+--------------------+-----+
| Phoenix Suns| 15|
| Boston Celtics| 16|
| Dallas Mavericks| 13|
|New Orleans Pelicans| 16|
| Brooklyn Nets| 17|
+--------------------+-----+
only showing top 5 rows
This code shows how many times each unique value in Team occurs.
Sorting
df.orderBy() sorts rows by the given column. It can be given ascending/descending instructions. To sort by multiple columns, you can pass them as a list.
df.orderBy("Name").show(5)
+-----------------+--------------------+--------+--------+--------+
| Name| Team|Position|Birthday| Salary|
+-----------------+--------------------+--------+--------+--------+
| Aaron Gordon| Orlando Magic| PF| 9/16/95|19863636|
| Aaron Holiday| Indiana Pacers| PG| 9/30/96| 2239200|
| Abdel Nader|Oklahoma City Thu...| SF| 9/25/93| 1618520|
| Adam Mokoka| Chicago Bulls| G| 7/18/98| 79568|
|Admiral Schofield| Washington Wizards| SF| 3/30/97| 1000000|
+-----------------+--------------------+--------+--------+--------+
only showing top 5 rows
The default sorting is ascending.
from pyspark.sql.functions import desc
df.orderBy(desc("Salary")).show(5)
+-----------------+--------------------+--------+--------+--------+
| Name| Team|Position|Birthday| Salary|
+-----------------+--------------------+--------+--------+--------+
| Stephen Curry|Golden State Warr...| PG| 3/14/88|40231758|
|Russell Westbrook| Houston Rockets| PG|11/12/88|38506482|
| Chris Paul|Oklahoma City Thu...| PG| 5/6/85|38506482|
| John Wall| Washington Wizards| PG| 9/6/90|38199000|
| James Harden| Houston Rockets| PG| 8/26/89|38199000|
+-----------------+--------------------+--------+--------+--------+
only showing top 5 rows
df.orderBy(["Team", desc("Salary")]).show(5)
+----------------+-------------+--------+--------+--------+
| Name| Team|Position|Birthday| Salary|
+----------------+-------------+--------+--------+--------+
|Chandler Parsons|Atlanta Hawks| SF|10/25/88|25102512|
| Evan Turner|Atlanta Hawks| PG|10/27/88|18606556|
| Allen Crabbe|Atlanta Hawks| SG| 4/9/92|18500000|
| De'Andre Hunter|Atlanta Hawks| SF| 12/2/97| 7068360|
| Jabari Parker|Atlanta Hawks| PF| 3/15/95| 6500000|
+----------------+-------------+--------+--------+--------+
only showing top 5 rows
nsmallest and nlargest are not functions in PySpark, but there is an equivalent way to do it:
# nsmallest example:
df.orderBy("Salary").limit(5).show()
# nlargest example:
df.orderBy(desc("Salary")).limit(5).show()
Row-Based Access
PySpark does not use row indexing, so you have to use other ways to access rows:
1. df.limit() takes an integer and returns a new DataFrame with that number of rows; df.take() takes an integer and returns that number of rows as a list
2. df.collect() returns all the records as a list of rows
Here is an example:
df.filter("Team == 'New York Knicks'").show()
df.limit(5).show()
df.take(5)
df.collect()
+-----------------+---------------+--------+--------+--------+
| Name| Team|Position|Birthday| Salary|
+-----------------+---------------+--------+--------+--------+
| Marcus Morris|New York Knicks| PF| 9/2/89|15000000|
| Damyean Dotson|New York Knicks| SG| 5/6/94| 1618520|
| Ignas Brazdeikis|New York Knicks| SF| 1/8/99| 898310|
| Ivan Rabb|New York Knicks| PF| 2/4/97| 79568|
| Kevin Knox|New York Knicks| PF| 8/11/99| 4380120|
| Julius Randle|New York Knicks| C|11/29/94|18000000|
|Mitchell Robinson|New York Knicks| C| 4/1/98| 1559712|
| Wayne Ellington|New York Knicks| SG|11/29/87| 8000000|
| RJ Barrett|New York Knicks| SG| 6/14/00| 7839960|
| Elfrid Payton|New York Knicks| PG| 2/22/94| 8000000|
| Allonzo Trier|New York Knicks| PG| 1/17/96| 3551100|
| Reggie Bullock|New York Knicks| SF| 3/16/91| 4000000|
| Bobby Portis|New York Knicks| C| 2/10/95|15000000|
| Taj Gibson|New York Knicks| C| 6/24/85| 9000000|
| Frank Ntilikina|New York Knicks| PG| 7/28/98| 4855800|
| Kadeem Allen|New York Knicks| PG| 1/15/93| 79568|
+-----------------+---------------+--------+--------+--------+
+--------------+------------------+--------+--------+-------+
| Name| Team|Position|Birthday| Salary|
+--------------+------------------+--------+--------+-------+
| Shake Milton|Philadelphia 76ers| SG| 9/26/96|1445697|
|Christian Wood| Detroit Pistons| PF| 9/27/95|1645357|
| PJ Washington| Charlotte Hornets| PF| 8/23/98|3831840|
| Derrick Rose| Detroit Pistons| PG| 10/4/88|7317074|
| Marial Shayok|Philadelphia 76ers| G| 7/26/95| 79568|
+--------------+------------------+--------+--------+-------+
[Row(Name='Shake Milton', Team='Philadelphia 76ers', Position='SG', Birthday='9/26/96', Salary=1445697),
Row(Name='Christian Wood', Team='Detroit Pistons', Position='PF', Birthday='9/27/95', Salary=1645357),
Row(Name='PJ Washington', Team='Charlotte Hornets', Position='PF', Birthday='8/23/98', Salary=3831840),
Row(Name='Derrick Rose', Team='Detroit Pistons', Position='PG', Birthday='10/4/88', Salary=7317074),
Row(Name='Marial Shayok', Team='Philadelphia 76ers', Position='G', Birthday='7/26/95', Salary=79568),
Row(Name='Draymond Green', Team='Golden State Warriors', Position='PF', Birthday='3/4/90', Salary=18539130),
Row(Name='Kendrick Nunn', Team='Miami Heat', Position='SG', Birthday='8/3/95', Salary=1416852),
Row(Name='Cedi Osman', Team='Cleveland Cavaliers', Position='SF', Birthday='4/8/95', Salary=2907143),
Row(Name='Brook Lopez', Team='Milwaukee Bucks', Position='C', Birthday='4/1/88', Salary=12093024),
Row(Name='Torrey Craig', Team='Denver Nuggets', Position='SF', Birthday='12/19/90', Salary=2000000),
Row(Name='Jordan Clarkson', Team='Cleveland Cavaliers', Position='PG', Birthday='6/7/92', Salary=13437500),
Row(Name='Alex Caruso', Team='Los Angeles Lakers', Position='PG', Birthday='2/28/94', Salary=2750000),
Row(Name='Norvel Pelle', Team='Philadelphia 76ers', Position='FC', Birthday='2/3/93', Salary=79568),
Row(Name='Tyler Johnson', Team='Phoenix Suns', Position='PG', Birthday='5/7/92', Salary=19245370),
Row(Name='Alec Burks', Team='Golden State Warriors', Position='SG', Birthday='7/20/91', Salary=2320044),
Row(Name='JaMychal Green', Team='Los Angeles Clippers', Position='PF', Birthday='6/21/90', Salary=4767000),
Row(Name='Dwight Howard', Team='Los Angeles Lakers', Position='C', Birthday='12/8/85', Salary=5603850),
Row(Name='Nikola Jokic', Team='Denver Nuggets', Position='C', Birthday='2/19/95', Salary=27504630),
Row(Name='Chris Boucher', Team='Toronto Raptors', Position='PF', Birthday='1/11/93', Salary=1588231),
Row(Name='Marcus Morris', Team='New York Knicks', Position='PF', Birthday='9/2/89', Salary=15000000),
Row(Name='Kevin Huerter', Team='Atlanta Hawks', Position='SG', Birthday='8/27/98', Salary=2636280),
Row(Name='Rui Hachimura', Team='Washington Wizards', Position='PF', Birthday='2/8/98', Salary=4469160),
Row(Name='George Hill', Team='Milwaukee Bucks', Position='PG', Birthday='5/4/86', Salary=10133907),
Row(Name='Nickeil Alexander-Walker', Team='New Orleans Pelicans', Position='SG', Birthday='9/2/98', Salary=2964840),
Row(Name='Jaylen Hoard', Team='Portland Trail Blazers', Position='SF', Birthday='3/30/99', Salary=79568),
Row(Name='Tyler Cook', Team='Cleveland Cavaliers', Position='PF', Birthday='9/23/97', Salary=79568),
Row(Name='Otto Porter', Team='Chicago Bulls', Position='SF', Birthday='6/3/93', Salary=27250576),
Row(Name='Langston Galloway', Team='Detroit Pistons', Position='PG', Birthday='12/9/91', Salary=7333333),
Row(Name='Evan Turner', Team='Atlanta Hawks', Position='PG', Birthday='10/27/88', Salary=18606556),
Row(Name='Norman Powell', Team='Toronto Raptors', Position='SG', Birthday='5/25/93', Salary=10116576),
Row(Name='Nicolas Claxton', Team='Brooklyn Nets', Position='PF', Birthday='4/17/99', Salary=898310),
Row(Name='Michael Frazier', Team='Houston Rockets', Position='G', Birthday='3/8/94', Salary=79568),
Row(Name='Paul Millsap', Team='Denver Nuggets', Position='PF', Birthday='2/10/85', Salary=30000000),
Row(Name='Furkan Korkmaz', Team='Philadelphia 76ers', Position='SG', Birthday='7/24/97', Salary=1620564),
Row(Name='Trey Burke', Team='Philadelphia 76ers', Position='PG', Birthday='11/12/92', Salary=2028594),
Row(Name='Bradley Beal', Team='Washington Wizards', Position='SG', Birthday='6/28/93', Salary=27093018),
Row(Name='Thomas Bryant', Team='Washington Wizards', Position='C', Birthday='7/31/97', Salary=8000000),
Row(Name='Dean Wade', Team='Cleveland Cavaliers', Position='PF', Birthday='11/20/96', Salary=79568),
Row(Name='Chris Paul', Team='Oklahoma City Thunder', Position='PG', Birthday='5/6/85', Salary=38506482),
Row(Name='Josh Hart', Team='New Orleans Pelicans', Position='SF', Birthday='3/6/95', Salary=1934160),
Row(Name='LaMarcus Aldridge', Team='San Antonio Spurs', Position='C', Birthday='7/19/85', Salary=26000000),
Row(Name='DaQuan Jeffries', Team='Sacramento Kings', Position='SG', Birthday='8/30/97', Salary=898310),
Row(Name='Hamidou Diallo', Team='Oklahoma City Thunder', Position='SF', Birthday='7/31/98', Salary=1416852),
Row(Name='Jamal Murray', Team='Denver Nuggets', Position='PG', Birthday='2/23/97', Salary=4444746),
Row(Name='Darius Bazley', Team='Oklahoma City Thunder', Position='PF', Birthday='6/12/00', Salary=2284800),
Row(Name='Robert Franks', Team='Charlotte Hornets', Position='F', Birthday='12/18/96', Salary=79568),
Row(Name='Gerald Green', Team='Houston Rockets', Position='SF', Birthday='1/26/86', Salary=2564753),
Row(Name='Thaddeus Young', Team='Chicago Bulls', Position='PF', Birthday='6/21/88', Salary=12900000),
Row(Name='Sviatoslav Mykhailiuk', Team='Detroit Pistons', Position='SF', Birthday='6/10/97', Salary=1416852),
Row(Name='Ian Mahinmi', Team='Washington Wizards', Position='C', Birthday='11/5/86', Salary=15450051),
Row(Name='Deonte Burton', Team='Oklahoma City Thunder', Position='SG', Birthday='1/31/94', Salary=1416852),
Row(Name='Markelle Fultz', Team='Orlando Magic', Position='PG', Birthday='5/29/98', Salary=9745200),
Row(Name='Aaron Gordon', Team='Orlando Magic', Position='PF', Birthday='9/16/95', Salary=19863636),
Row(Name='Dzanan Musa', Team='Brooklyn Nets', Position='SF', Birthday='5/8/99', Salary=1911600),
Row(Name='Patrick McCaw', Team='Toronto Raptors', Position='SF', Birthday='10/25/95', Salary=4000000),
Row(Name='Bismack Biyombo', Team='Charlotte Hornets', Position='C', Birthday='8/28/92', Salary=17000000),
Row(Name='JaVale McGee', Team='Los Angeles Lakers', Position='C', Birthday='1/19/88', Salary=4000000),
Row(Name='Juwan Morgan', Team='Utah Jazz', Position='F', Birthday='4/17/97', Salary=796806),
Row(Name='Marc Gasol', Team='Toronto Raptors', Position='C', Birthday='1/29/85', Salary=25595700),
Row(Name='Marcus Smart', Team='Boston Celtics', Position='PG', Birthday='3/6/94', Salary=12553571),
Row(Name='Rudy Gobert', Team='Utah Jazz', Position='C', Birthday='6/26/92', Salary=24258427),
Row(Name='Wesley Iwundu', Team='Orlando Magic', Position='SF', Birthday='12/20/94', Salary=1618520),
Row(Name='Dwight Powell', Team='Dallas Mavericks', Position='C', Birthday='7/20/91', Salary=10259375),
Row(Name='Goran Dragic', Team='Miami Heat', Position='PG', Birthday='5/6/86', Salary=19217900),
Row(Name='Theo Pinson', Team='Brooklyn Nets', Position='SG', Birthday='11/5/95', Salary=1445697),
Row(Name='Danilo Gallinari', Team='Oklahoma City Thunder', Position='PF', Birthday='8/8/88', Salary=22615559),
Row(Name='Joe Ingles', Team='Utah Jazz', Position='PF', Birthday='10/2/87', Salary=11454546),
Row(Name='Jarrett Culver', Team='Minnesota Timberwolves', Position='SG', Birthday='2/20/99', Salary=5813640),
Row(Name='Robert Covington', Team='Minnesota Timberwolves', Position='PF', Birthday='12/14/90', Salary=11301219),
Row(Name='Damyean Dotson', Team='New York Knicks', Position='SG', Birthday='5/6/94', Salary=1618520),
Row(Name='Patrick Beverley', Team='Los Angeles Clippers', Position='PG', Birthday='7/12/88', Salary=12345680),
Row(Name='Kevin Love', Team='Cleveland Cavaliers', Position='C', Birthday='9/7/88', Salary=28942830),
Row(Name='Quinn Cook', Team='Los Angeles Lakers', Position='PG', Birthday='3/23/93', Salary=3000000),
Row(Name='Justin Wright-Foreman', Team='Utah Jazz', Position='G', Birthday='10/27/97', Salary=79568),
Row(Name='Noah Vonleh', Team='Minnesota Timberwolves', Position='C', Birthday='8/24/95', Salary=2000000),
Row(Name='Tyus Jones', Team='Memphis Grizzlies', Position='PG', Birthday='5/10/96', Salary=9258000),
Row(Name='Dewayne Dedmon', Team='Sacramento Kings', Position='C', Birthday='8/12/89', Salary=13333334),
Row(Name='Malcolm Brogdon', Team='Indiana Pacers', Position='PG', Birthday='12/11/92', Salary=20000000),
Row(Name='Ben McLemore', Team='Houston Rockets', Position='SG', Birthday='2/11/93', Salary=2028594),
Row(Name='Wilson Chandler', Team='Brooklyn Nets', Position='PF', Birthday='5/10/87', Salary=2564753),
Row(Name='Isaac Bonga', Team='Washington Wizards', Position='PG', Birthday='11/8/99', Salary=1416852),
Row(Name='Adam Mokoka', Team='Chicago Bulls', Position='G', Birthday='7/18/98', Salary=79568),
Row(Name='Lonzo Ball', Team='New Orleans Pelicans', Position='PG', Birthday='10/27/97', Salary=8719320),
Row(Name='Jalen Brunson', Team='Dallas Mavericks', Position='PG', Birthday='8/31/96', Salary=1416852),
Row(Name='John Collins', Team='Atlanta Hawks', Position='PF', Birthday='9/23/97', Salary=2686560),
Row(Name='Marvin Williams', Team='Charlotte Hornets', Position='PF', Birthday='6/19/86', Salary=15006250),
Row(Name='Brad Wanamaker', Team='Boston Celtics', Position='PG', Birthday='7/25/89', Salary=1445697),
Row(Name='Donte DiVincenzo', Team='Milwaukee Bucks', Position='SG', Birthday='1/31/97', Salary=2905800),
Row(Name='Omari Spellman', Team='Golden State Warriors', Position='PF', Birthday='7/21/97', Salary=1897800),
Row(Name='Joe Harris', Team='Brooklyn Nets', Position='SF', Birthday='9/6/91', Salary=7666667),
Row(Name="Royce O'Neale", Team='Utah Jazz', Position='PF', Birthday='6/5/93', Salary=1618520),
Row(Name='Deandre Ayton', Team='Phoenix Suns', Position='C', Birthday='7/23/98', Salary=9562920),
Row(Name='Cory Joseph', Team='Sacramento Kings', Position='PG', Birthday='8/20/91', Salary=12000000),
Row(Name='Malcolm Miller', Team='Toronto Raptors', Position='SF', Birthday='3/6/93', Salary=1588231),
Row(Name='Justise Winslow', Team='Miami Heat', Position='PF', Birthday='3/26/96', Salary=13000000),
Row(Name='Kevin Durant', Team='Brooklyn Nets', Position='PF', Birthday='9/29/88', Salary=37199000),
Row(Name='Evan Fournier', Team='Orlando Magic', Position='SF', Birthday='10/29/92', Salary=17000000),
Row(Name='Chris Silva', Team='Miami Heat', Position='PF', Birthday='9/19/96', Salary=79568),
Row(Name='Vince Carter', Team='Atlanta Hawks', Position='PF', Birthday='1/26/77', Salary=2564753),
Row(Name='Cody Zeller', Team='Charlotte Hornets', Position='C', Birthday='10/5/92', Salary=14471910),
Row(Name='Brian Bowen', Team='Indiana Pacers', Position='SG', Birthday='10/2/98', Salary=79568),
Row(Name='Aaron Holiday', Team='Indiana Pacers', Position='PG', Birthday='9/30/96', Salary=2239200),
Row(Name='Troy Daniels', Team='Los Angeles Lakers', Position='SG', Birthday='7/15/91', Salary=2028594),
Row(Name='Buddy Hield', Team='Sacramento Kings', Position='SG', Birthday='12/17/92', Salary=4861207),
Row(Name='Terance Mann', Team='Los Angeles Clippers', Position='SG', Birthday='10/18/96', Salary=1000000),
Row(Name='John Konchar', Team='Memphis Grizzlies', Position='SG', Birthday='3/22/96', Salary=79568),
Row(Name='KZ Okpala', Team='Miami Heat', Position='SF', Birthday='4/28/99', Salary=898310),
Row(Name='Denzel Valentine', Team='Chicago Bulls', Position='SF', Birthday='11/16/93', Salary=3377568),
Row(Name='Marquese Chriss', Team='Golden State Warriors', Position='PF', Birthday='7/2/97', Salary=1678854),
Row(Name='Anthony Davis', Team='Los Angeles Lakers', Position='C', Birthday='3/11/93', Salary=27093019),
Row(Name='Nemanja Bjelica', Team='Sacramento Kings', Position='PF', Birthday='5/9/88', Salary=6825000),
Row(Name='Chandler Parsons', Team='Atlanta Hawks', Position='SF', Birthday='10/25/88', Salary=25102512),
Row(Name='Courtney Lee', Team='Dallas Mavericks', Position='SG', Birthday='10/3/85', Salary=12759670),
Row(Name='Myles Turner', Team='Indiana Pacers', Position='C', Birthday='3/24/96', Salary=18000000),
Row(Name="Kyle O'Quinn", Team='Philadelphia 76ers', Position='C', Birthday='3/26/90', Salary=2174318),
Row(Name='Bryn Forbes', Team='San Antonio Spurs', Position='SG', Birthday='7/23/93', Salary=2875000),
Row(Name='Duncan Robinson', Team='Miami Heat', Position='PF', Birthday='4/22/94', Salary=1416852),
Row(Name='Devin Booker', Team='Phoenix Suns', Position='SG', Birthday='10/30/96', Salary=27285000),
Row(Name='Grant Williams', Team='Boston Celtics', Position='PF', Birthday='11/30/98', Salary=2379840),
Row(Name='DeMarcus Cousins', Team='Los Angeles Lakers', Position='C', Birthday='8/13/90', Salary=3500000),
Row(Name='DeMar DeRozan', Team='San Antonio Spurs', Position='SF', Birthday='8/7/89', Salary=27739975),
Row(Name='Kristaps Porzingis', Team='Dallas Mavericks', Position='PF', Birthday='8/2/95', Salary=27285000),
Row(Name='Brandon Knight', Team='Cleveland Cavaliers', Position='PG', Birthday='12/2/91', Salary=15643750),
Row(Name='Thabo Sefolosha', Team='Houston Rockets', Position='PF', Birthday='5/2/84', Salary=2564753),
Row(Name='David Nwaba', Team='Brooklyn Nets', Position='SF', Birthday='1/14/93', Salary=1678854),
Row(Name='Quinndary Weatherspoon', Team='San Antonio Spurs', Position='G', Birthday='9/10/96', Salary=79568),
Row(Name='Dewan Hernandez', Team='Toronto Raptors', Position='C', Birthday='12/9/96', Salary=898310),
Row(Name='Isaiah Thomas', Team='Washington Wizards', Position='PG', Birthday='2/7/89', Salary=2320044),
Row(Name='Bruce Brown', Team='Detroit Pistons', Position='SG', Birthday='8/15/96', Salary=1416852),
Row(Name='Keldon Johnson', Team='San Antonio Spurs', Position='SF', Birthday='10/11/99', Salary=1950600),
Row(Name='Damian Jones', Team='Atlanta Hawks', Position='C', Birthday='6/30/95', Salary=2305057),
Row(Name='Luguentz Dort', Team='Oklahoma City Thunder', Position='G', Birthday='4/19/99', Salary=79568),
Row(Name='Terence Davis', Team='Toronto Raptors', Position='SG', Birthday='5/16/97', Salary=898310),
Row(Name='Chandler Hutchison', Team='Chicago Bulls', Position='SF', Birthday='4/26/96', Salary=2332320),
Row(Name='Steven Adams', Team='Oklahoma City Thunder', Position='C', Birthday='7/20/93', Salary=25842697),
Row(Name='Jordan Poole', Team='Golden State Warriors', Position='SG', Birthday='6/19/99', Salary=1964760),
Row(Name='Sekou Doumbouya', Team='Detroit Pistons', Position='SF', Birthday='12/23/00', Salary=3285120),
Row(Name='Zion Williamson', Team='New Orleans Pelicans', Position='F', Birthday='7/6/00', Salary=9757440),
Row(Name='Mike Muscala', Team='Oklahoma City Thunder', Position='C', Birthday='7/1/91', Salary=2028594),
Row(Name='Skal Labissiere', Team='Portland Trail Blazers', Position='C', Birthday='3/18/96', Salary=2338846),
Row(Name='Meyers Leonard', Team='Miami Heat', Position='C', Birthday='2/27/92', Salary=11286515),
Row(Name='Reggie Jackson', Team='Detroit Pistons', Position='PG', Birthday='4/16/90', Salary=18086956),
Row(Name='Alfonzo McKinnie', Team='Cleveland Cavaliers', Position='SF', Birthday='9/17/92', Salary=1588231),
Row(Name='Yuta Watanabe', Team='Memphis Grizzlies', Position='SF', Birthday='10/13/94', Salary=79568),
Row(Name='Kentavious Caldwell-Pope', Team='Los Angeles Lakers', Position='SG', Birthday='2/18/93', Salary=8089282),
Row(Name='Kelan Martin', Team='Minnesota Timberwolves', Position='SF', Birthday='8/3/95', Salary=79568),
Row(Name='OG Anunoby', Team='Toronto Raptors', Position='SF', Birthday='7/17/97', Salary=2281800),
Row(Name='Tyler Herro', Team='Miami Heat', Position='SG', Birthday='1/20/00', Salary=3640200),
Row(Name='Richaun Holmes', Team='Sacramento Kings', Position='C', Birthday='10/15/93', Salary=4767000),
Row(Name='Tyson Chandler', Team='Houston Rockets', Position='C', Birthday='10/2/82', Salary=2564753),
Row(Name='Solomon Hill', Team='Memphis Grizzlies', Position='SF', Birthday='3/18/91', Salary=13290395),
Row(Name='Keita Bates-Diop', Team='Minnesota Timberwolves', Position='SF', Birthday='1/23/96', Salary=1416852),
Row(Name='Kelly Olynyk', Team='Miami Heat', Position='C', Birthday='4/19/91', Salary=12667885),
Row(Name='Jaxson Hayes', Team='New Orleans Pelicans', Position='C', Birthday='5/23/00', Salary=4862040),
Row(Name='CJ McCollum', Team='Portland Trail Blazers', Position='SG', Birthday='9/19/91', Salary=27556959),
Row(Name='Darius Miller', Team='New Orleans Pelicans', Position='SF', Birthday='3/21/90', Salary=7250000),
Row(Name='Luka Doncic', Team='Dallas Mavericks', Position='PG', Birthday='2/28/99', Salary=7683360),
Row(Name='DeMarre Carroll', Team='San Antonio Spurs', Position='PF', Birthday='7/27/86', Salary=7000000),
Row(Name='Cristiano Felicio', Team='Chicago Bulls', Position='C', Birthday='7/7/92', Salary=8156500),
Row(Name='Zach LaVine', Team='Chicago Bulls', Position='PG', Birthday='3/10/95', Salary=19500000),
Row(Name='Tremont Waters', Team='Boston Celtics', Position='PG', Birthday='1/10/98', Salary=79568),
Row(Name='Dejounte Murray', Team='San Antonio Spurs', Position='PG', Birthday='9/19/96', Salary=2321735),
Row(Name='Jerome Robinson', Team='Los Angeles Clippers', Position='SG', Birthday='2/22/97', Salary=3567720),
Row(Name='Rudy Gay', Team='San Antonio Spurs', Position='PF', Birthday='8/17/86', Salary=14500000),
Row(Name='Ryan Broekhoff', Team='Dallas Mavericks', Position='SG', Birthday='8/23/90', Salary=1416852),
Row(Name='Jake Layman', Team='Minnesota Timberwolves', Position='PF', Birthday='3/7/94', Salary=3581986),
Row(Name='Cameron Johnson', Team='Phoenix Suns', Position='PF', Birthday='3/3/96', Salary=4033440),
Row(Name='Allen Crabbe', Team='Atlanta Hawks', Position='SG', Birthday='4/9/92', Salary=18500000),
Row(Name='Justin James', Team='Sacramento Kings', Position='SG', Birthday='1/24/97', Salary=898310),
Row(Name='Emmanuel Mudiay', Team='Utah Jazz', Position='PG', Birthday='3/5/96', Salary=1737145),
Row(Name='Avery Bradley', Team='Los Angeles Lakers', Position='PG', Birthday='11/26/90', Salary=6767000),
Row(Name='Victor Oladipo', Team='Indiana Pacers', Position='PG', Birthday='5/4/92', Salary=21000000),
Row(Name='Caleb Martin', Team='Charlotte Hornets', Position='SF', Birthday='9/28/95', Salary=898310),
Row(Name='Coby White', Team='Chicago Bulls', Position='SG', Birthday='2/16/00', Salary=5307120),
Row(Name='Isaiah Hartenstein', Team='Houston Rockets', Position='C', Birthday='5/5/98', Salary=1416852),
Row(Name='Will Barton', Team='Denver Nuggets', Position='SF', Birthday='1/6/91', Salary=12776786),
Row(Name='Dwayne Bacon', Team='Charlotte Hornets', Position='SG', Birthday='8/30/95', Salary=1618520),
Row(Name='Harrison Barnes', Team='Sacramento Kings', Position='PF', Birthday='5/30/92', Salary=24147727),
Row(Name='Tim Frazier', Team='Detroit Pistons', Position='PG', Birthday='11/1/90', Salary=1620564),
Row(Name='Jimmy Butler', Team='Miami Heat', Position='SF', Birthday='9/14/89', Salary=32742000),
Row(Name='Gary Harris', Team='Denver Nuggets', Position='SG', Birthday='9/14/94', Salary=17839286),
Row(Name='Thon Maker', Team='Detroit Pistons', Position='C', Birthday='2/25/97', Salary=3569643),
Row(Name='Shai Gilgeous-Alexander', Team='Oklahoma City Thunder', Position='PG', Birthday='7/12/98', Salary=3952920),
Row(Name='Hassan Whiteside', Team='Portland Trail Blazers', Position='C', Birthday='6/13/89', Salary=27093018),
Row(Name='Karl-Anthony Towns', Team='Minnesota Timberwolves', Position='C', Birthday='11/15/95', Salary=27285000),
Row(Name='Ky Bowman', Team='Golden State Warriors', Position='PG', Birthday='6/16/97', Salary=79568),
Row(Name='Ben Simmons', Team='Philadelphia 76ers', Position='PG', Birthday='7/20/96', Salary=8113929),
Row(Name='Terrence Ross', Team='Orlando Magic', Position='SF', Birthday='2/5/91', Salary=12500000),
Row(Name='Jordan McLaughlin', Team='Minnesota Timberwolves', Position='PG', Birthday='4/9/96', Salary=79568),
Row(Name='Daniel Theis', Team='Boston Celtics', Position='C', Birthday='4/4/92', Salary=5000000),
Row(Name='Jonathan Isaac', Team='Orlando Magic', Position='PF', Birthday='10/3/97', Salary=5806440),
Row(Name='Cheick Diallo', Team='Phoenix Suns', Position='C', Birthday='9/13/96', Salary=1678854),
Row(Name='Serge Ibaka', Team='Toronto Raptors', Position='C', Birthday='9/18/89', Salary=23271604),
Row(Name='Amile Jefferson', Team='Orlando Magic', Position='PF', Birthday='5/7/93', Salary=1339515),
Row(Name='Cam Reddish', Team='Atlanta Hawks', Position='SF', Birthday='9/1/99', Salary=4245720),
Row(Name="De'Anthony Melton", Team='Memphis Grizzlies', Position='PG', Birthday='5/28/98', Salary=1416852),
Row(Name='Udonis Haslem', Team='Miami Heat', Position='C', Birthday='6/9/80', Salary=2564753),
Row(Name='Charlie Brown', Team='Atlanta Hawks', Position='SG', Birthday='2/2/97', Salary=79568),
Row(Name='Elie Okobo', Team='Phoenix Suns', Position='PG', Birthday='10/23/97', Salary=1416852),
Row(Name='Gordon Hayward', Team='Boston Celtics', Position='PF', Birthday='3/23/90', Salary=32700690),
Row(Name='Marco Belinelli', Team='San Antonio Spurs', Position='SF', Birthday='3/25/86', Salary=5846154),
Row(Name='Javonte Green', Team='Boston Celtics', Position='SF', Birthday='7/23/93', Salary=898310),
Row(Name='Rondae Hollis-Jefferson', Team='Toronto Raptors', Position='SF', Birthday='1/3/95', Salary=2500000),
Row(Name='Carmelo Anthony', Team='Portland Trail Blazers', Position='PF', Birthday='5/29/84', Salary=2159029),
Row(Name='Danny Green', Team='Los Angeles Lakers', Position='SG', Birthday='6/22/87', Salary=14634147),
Row(Name='Stephen Curry', Team='Golden State Warriors', Position='PG', Birthday='3/14/88', Salary=40231758),
Row(Name='Eric Paschall', Team='Golden State Warriors', Position='PF', Birthday='11/4/96', Salary=898310),
Row(Name='Daniel Gafford', Team='Chicago Bulls', Position='C', Birthday='10/1/98', Salary=898310),
Row(Name='Anfernee Simons', Team='Portland Trail Blazers', Position='SG', Birthday='6/8/99', Salary=2149560),
Row(Name='Frank Kaminsky', Team='Phoenix Suns', Position='C', Birthday='4/4/93', Salary=4767000),
Row(Name='Luke Kennard', Team='Detroit Pistons', Position='SG', Birthday='6/24/96', Salary=3827160),
Row(Name='Josh Okogie', Team='Minnesota Timberwolves', Position='SG', Birthday='9/1/98', Salary=2530680),
Row(Name='Rodney Hood', Team='Portland Trail Blazers', Position='SF', Birthday='10/20/92', Salary=5718000),
Row(Name="De'Andre Hunter", Team='Atlanta Hawks', Position='SF', Birthday='12/2/97', Salary=7068360),
Row(Name='Klay Thompson', Team='Golden State Warriors', Position='SG', Birthday='2/8/90', Salary=32742000),
Row(Name='Jrue Holiday', Team='New Orleans Pelicans', Position='PG', Birthday='6/12/90', Salary=26131111),
Row(Name='PJ Dozier', Team='Denver Nuggets', Position='PG', Birthday='10/25/96', Salary=79568),
Row(Name='Andre Drummond', Team='Detroit Pistons', Position='C', Birthday='8/10/93', Salary=27093018),
Row(Name='Jared Harper', Team='Phoenix Suns', Position='PG', Birthday='9/14/97', Salary=79568),
Row(Name='Russell Westbrook', Team='Houston Rockets', Position='PG', Birthday='11/12/88', Salary=38506482),
Row(Name='Tony Bradley', Team='Utah Jazz', Position='C', Birthday='1/8/98', Salary=1962360),
Row(Name='Oshae Brissett', Team='Toronto Raptors', Position='SF', Birthday='6/20/98', Salary=79568),
Row(Name='Gary Clark', Team='Houston Rockets', Position='PF', Birthday='11/16/94', Salary=1416852),
Row(Name='Pascal Siakam', Team='Toronto Raptors', Position='PF', Birthday='4/2/94', Salary=2351838),
Row(Name='Eric Bledsoe', Team='Milwaukee Bucks', Position='PG', Birthday='12/9/89', Salary=15625000),
Row(Name='Tomas Satoransky', Team='Chicago Bulls', Position='PG', Birthday='10/30/91', Salary=10000000),
Row(Name='Davis Bertans', Team='Washington Wizards', Position='PF', Birthday='11/12/92', Salary=7000000),
Row(Name='Amir Coffey', Team='Los Angeles Clippers', Position='G', Birthday='6/17/97', Salary=79568),
Row(Name='Ignas Brazdeikis', Team='New York Knicks', Position='SF', Birthday='1/8/99', Salary=898310),
Row(Name='Ivan Rabb', Team='New York Knicks', Position='PF', Birthday='2/4/97', Salary=79568),
Row(Name='Khris Middleton', Team='Milwaukee Bucks', Position='SF', Birthday='8/12/91', Salary=30603448),
Row(Name='Kevin Knox', Team='New York Knicks', Position='PF', Birthday='8/11/99', Salary=4380120),
Row(Name='Jeff Green', Team='Utah Jazz', Position='PF', Birthday='8/28/86', Salary=2564753),
Row(Name='Ersan Ilyasova', Team='Milwaukee Bucks', Position='PF', Birthday='5/15/87', Salary=7000000),
Row(Name='Caleb Swanigan', Team='Sacramento Kings', Position='PF', Birthday='4/18/97', Salary=2033160),
Row(Name='Al Horford', Team='Philadelphia 76ers', Position='C', Birthday='6/3/86', Salary=28000000),
Row(Name='Clint Capela', Team='Houston Rockets', Position='C', Birthday='5/18/94', Salary=16896552),
Row(Name='Georges Niang', Team='Utah Jazz', Position='PF', Birthday='6/17/93', Salary=1645357),
Row(Name='Wesley Matthews', Team='Milwaukee Bucks', Position='SF', Birthday='10/14/86', Salary=2564753),
Row(Name='Rajon Rondo', Team='Los Angeles Lakers', Position='PG', Birthday='2/22/86', Salary=2564753),
Row(Name='Delon Wright', Team='Dallas Mavericks', Position='PG', Birthday='4/26/92', Salary=9473684),
Row(Name='Ja Morant', Team='Memphis Grizzlies', Position='PG', Birthday='8/10/99', Salary=8730240),
Row(Name='Fred VanVleet', Team='Toronto Raptors', Position='PG', Birthday='2/25/94', Salary=9346153),
Row(Name='Brandon Clarke', Team='Memphis Grizzlies', Position='PF', Birthday='9/19/96', Salary=2478840),
Row(Name='Miye Oni', Team='Utah Jazz', Position='SG', Birthday='8/4/97', Salary=898310),
Row(Name='Julius Randle', Team='New York Knicks', Position='C', Birthday='11/29/94', Salary=18000000),
Row(Name='Glenn Robinson III', Team='Golden State Warriors', Position='SF', Birthday='1/8/94', Salary=1882867),
Row(Name='Dillon Brooks', Team='Memphis Grizzlies', Position='SF', Birthday='1/22/96', Salary=1618520),
Row(Name='Zylan Cheatham', Team='New Orleans Pelicans', Position='SF', Birthday='11/17/95', Salary=79568),
Row(Name='Markieff Morris', Team='Detroit Pistons', Position='PF', Birthday='9/2/89', Salary=3200000),
Row(Name='Malik Beasley', Team='Denver Nuggets', Position='SG', Birthday='11/26/96', Salary=2731713),
Row(Name='John Wall', Team='Washington Wizards', Position='PG', Birthday='9/6/90', Salary=38199000),
Row(Name='Vlatko Cancar', Team='Denver Nuggets', Position='SF', Birthday='4/10/97', Salary=898310),
Row(Name='Alize Johnson', Team='Indiana Pacers', Position='PF', Birthday='4/22/96', Salary=1416852),
Row(Name='Andrew Wiggins', Team='Minnesota Timberwolves', Position='SF', Birthday='2/23/95', Salary=27504630),
Row(Name='Khyri Thomas', Team='Detroit Pistons', Position='SG', Birthday='5/8/96', Salary=1416852),
Row(Name='Mitchell Robinson', Team='New York Knicks', Position='C', Birthday='4/1/98', Salary=1559712),
Row(Name='Damian Lillard', Team='Portland Trail Blazers', Position='PG', Birthday='7/15/90', Salary=29802321),
Row(Name='Nassir Little', Team='Portland Trail Blazers', Position='PF', Birthday='2/11/00', Salary=2105520),
Row(Name='Mikal Bridges', Team='Phoenix Suns', Position='SF', Birthday='8/30/96', Salary=4161000),
Row(Name='Kyle Anderson', Team='Memphis Grizzlies', Position='PF', Birthday='9/20/93', Salary=9073050),
Row(Name='Garrett Temple', Team='Brooklyn Nets', Position='PG', Birthday='5/8/86', Salary=4767000),
Row(Name='Kyle Korver', Team='Milwaukee Bucks', Position='PF', Birthday='3/17/81', Salary=6004753),
Row(Name='Al-Farouq Aminu', Team='Orlando Magic', Position='PF', Birthday='9/21/90', Salary=9258000),
Row(Name='James Harden', Team='Houston Rockets', Position='PG', Birthday='8/26/89', Salary=38199000),
Row(Name='Derrick White', Team='San Antonio Spurs', Position='PG', Birthday='7/2/94', Salary=1948080),
Row(Name='JaKarr Sampson', Team='Indiana Pacers', Position='SF', Birthday='3/20/93', Salary=1737145),
Row(Name='Dario Saric', Team='Phoenix Suns', Position='PF', Birthday='4/8/94', Salary=3481985),
Row(Name='Ivica Zubac', Team='Los Angeles Clippers', Position='C', Birthday='3/18/97', Salary=6481482),
Row(Name='Juan Hernangomez', Team='Denver Nuggets', Position='PF', Birthday='9/28/95', Salary=3321029),
Row(Name='Jarrell Brantley', Team='Utah Jazz', Position='PF', Birthday='6/7/96', Salary=79568),
Row(Name='Eric Gordon', Team='Houston Rockets', Position='PG', Birthday='12/25/88', Salary=14057730),
Row(Name='Naz Reid', Team='Minnesota Timberwolves', Position='F', Birthday='8/26/99', Salary=898310),
Row(Name='Justin Robinson', Team='Washington Wizards', Position='PG', Birthday='10/12/97', Salary=898310),
Row(Name='Grayson Allen', Team='Memphis Grizzlies', Position='SG', Birthday='10/8/95', Salary=2429400),
Row(Name='Trevor Ariza', Team='Sacramento Kings', Position='SF', Birthday='6/30/85', Salary=12200000),
Row(Name='Brandon Goodwin', Team='Atlanta Hawks', Position='PG', Birthday='10/2/95', Salary=79568),
Row(Name="E'Twaun Moore", Team='New Orleans Pelicans', Position='PG', Birthday='2/25/89', Salary=8664928),
Row(Name='Mario Hezonja', Team='Portland Trail Blazers', Position='PF', Birthday='2/25/95', Salary=1737145),
Row(Name='Henry Ellenson', Team='Brooklyn Nets', Position='PF', Birthday='1/13/97', Salary=79568),
Row(Name='Johnathan Motley', Team='Los Angeles Clippers', Position='PF', Birthday='5/4/95', Salary=79568),
Row(Name='James Ennis', Team='Philadelphia 76ers', Position='SF', Birthday='7/1/90', Salary=1882867),
Row(Name='Andre Roberson', Team='Oklahoma City Thunder', Position='SF', Birthday='12/4/91', Salary=10740740),
Row(Name='Garrison Mathews', Team='Washington Wizards', Position='SG', Birthday='10/24/96', Salary=79568),
Row(Name='Jahlil Okafor', Team='New Orleans Pelicans', Position='C', Birthday='12/15/95', Salary=1702486),
Row(Name='Mfiondu Kabengele', Team='Los Angeles Clippers', Position='C', Birthday='8/14/97', Salary=1977000),
Row(Name='Treveon Graham', Team='Minnesota Timberwolves', Position='SG', Birthday='10/28/93', Salary=1645357),
Row(Name='Seth Curry', Team='Dallas Mavericks', Position='PG', Birthday='8/23/90', Salary=7461380),
Row(Name="D'Angelo Russell", Team='Golden State Warriors', Position='PG', Birthday='2/23/96', Salary=27285000),
Row(Name='Justin Holiday', Team='Indiana Pacers', Position='SG', Birthday='4/5/89', Salary=4767000),
Row(Name='Tyrone Wallace', Team='Atlanta Hawks', Position='PG', Birthday='6/10/94', Salary=1620564),
Row(Name='Miles Bridges', Team='Charlotte Hornets', Position='SF', Birthday='3/21/98', Salary=3755400),
Row(Name='Bogdan Bogdanovic', Team='Sacramento Kings', Position='SG', Birthday='8/18/92', Salary=8529386),
Row(Name='Matt Thomas', Team='Toronto Raptors', Position='SG', Birthday='8/4/94', Salary=898310),
Row(Name='Jordan Bell', Team='Minnesota Timberwolves', Position='C', Birthday='1/7/95', Salary=1620564),
Row(Name='Wenyen Gabriel', Team='Sacramento Kings', Position='PF', Birthday='3/26/97', Salary=79568),
Row(Name='Tony Snell', Team='Detroit Pistons', Position='SF', Birthday='11/10/91', Salary=11392857),
Row(Name='Shaquille Harrison', Team='Chicago Bulls', Position='PG', Birthday='10/6/93', Salary=1620564),
Row(Name='Yogi Ferrell', Team='Sacramento Kings', Position='PG', Birthday='5/9/93', Salary=3150000),
Row(Name='Mike Scott', Team='Philadelphia 76ers', Position='PF', Birthday='7/16/88', Salary=4767000),
Row(Name='Jarred Vanderbilt', Team='Denver Nuggets', Position='PF', Birthday='4/3/99', Salary=1416852),
Row(Name='Jeff Teague', Team='Minnesota Timberwolves', Position='PG', Birthday='6/10/88', Salary=19000000),
Row(Name='Zach Norvell', Team='Los Angeles Lakers', Position='SG', Birthday='12/9/97', Salary=79568),
Row(Name='Maxi Kleber', Team='Dallas Mavericks', Position='C', Birthday='1/29/92', Salary=8000000),
Row(Name='Matisse Thybulle', Team='Philadelphia 76ers', Position='SG', Birthday='3/4/97', Salary=2582160),
Row(Name='Ryan Arcidiacono', Team='Chicago Bulls', Position='PG', Birthday='3/26/94', Salary=3000000),
Row(Name='Wayne Ellington', Team='New York Knicks', Position='SG', Birthday='11/29/87', Salary=8000000),
Row(Name='Kawhi Leonard', Team='Los Angeles Clippers', Position='SF', Birthday='6/29/91', Salary=32742000),
Row(Name='Montrezl Harrell', Team='Los Angeles Clippers', Position='C', Birthday='1/26/94', Salary=6000000),
Row(Name='Jusuf Nurkic', Team='Portland Trail Blazers', Position='C', Birthday='8/23/94', Salary=12000000),
Row(Name='Matthew Dellavedova', Team='Cleveland Cavaliers', Position='PG', Birthday='9/8/90', Salary=9607500),
Row(Name='Cody Martin', Team='Charlotte Hornets', Position='SF', Birthday='9/28/95', Salary=1173310),
Row(Name='Zhaire Smith', Team='Philadelphia 76ers', Position='SG', Birthday='6/4/99', Salary=3058800),
Row(Name='RJ Barrett', Team='New York Knicks', Position='SG', Birthday='6/14/00', Salary=7839960),
Row(Name='Lonnie Walker', Team='San Antonio Spurs', Position='SG', Birthday='12/14/98', Salary=2764200),
Row(Name='Taurean Prince', Team='Brooklyn Nets', Position='SF', Birthday='3/22/94', Salary=3481985),
Row(Name='Elfrid Payton', Team='New York Knicks', Position='PG', Birthday='2/22/94', Salary=8000000),
Row(Name='Blake Griffin', Team='Detroit Pistons', Position='PF', Birthday='3/16/89', Salary=34449964),
Row(Name='Marko Guduric', Team='Memphis Grizzlies', Position='SG', Birthday='3/8/95', Salary=2625000),
Row(Name='Zach Collins', Team='Portland Trail Blazers', Position='C', Birthday='11/19/97', Salary=4240200),
Row(Name='Stanley Johnson', Team='Toronto Raptors', Position='PF', Birthday='5/29/96', Salary=3623000),
Row(Name='Boban Marjanovic', Team='Dallas Mavericks', Position='C', Birthday='8/15/88', Salary=3500000),
Row(Name='Josh Magette', Team='Orlando Magic', Position='PG', Birthday='11/28/89', Salary=79568),
Row(Name='Kyle Lowry', Team='Toronto Raptors', Position='PG', Birthday='3/25/86', Salary=33296296),
Row(Name='Darius Garland', Team='Cleveland Cavaliers', Position='PG', Birthday='1/26/00', Salary=6400920),
Row(Name='Frank Jackson', Team='New Orleans Pelicans', Position='PG', Birthday='5/4/98', Salary=1618520),
Row(Name='Dragan Bender', Team='Milwaukee Bucks', Position='PF', Birthday='11/17/97', Salary=1678854),
Row(Name='Kenrich Williams', Team='New Orleans Pelicans', Position='PF', Birthday='12/2/94', Salary=1416852),
Row(Name='Jerami Grant', Team='Denver Nuggets', Position='PF', Birthday='3/12/94', Salary=9346153),
Row(Name='Allonzo Trier', Team='New York Knicks', Position='PG', Birthday='1/17/96', Salary=3551100),
Row(Name='Pat Connaughton', Team='Milwaukee Bucks', Position='SG', Birthday='1/6/93', Salary=1723050),
Row(Name='Domantas Sabonis', Team='Indiana Pacers', Position='C', Birthday='5/3/96', Salary=3529554),
Row(Name='Dylan Windler', Team='Cleveland Cavaliers', Position='GF', Birthday='9/22/96', Salary=2035800),
Row(Name='Antonius Cleveland', Team='Dallas Mavericks', Position='SG', Birthday='2/2/94', Salary=79568),
Row(Name='Damion Lee', Team='Golden State Warriors', Position='SG', Birthday='10/21/92', Salary=79568),
Row(Name='Khem Birch', Team='Orlando Magic', Position='C', Birthday='9/28/92', Salary=3000000),
Row(Name='Aron Baynes', Team='Phoenix Suns', Position='C', Birthday='12/9/86', Salary=5453280),
Row(Name='Kemba Walker', Team='Boston Celtics', Position='PG', Birthday='5/8/90', Salary=32742000),
Row(Name='Nerlens Noel', Team='Oklahoma City Thunder', Position='C', Birthday='4/10/94', Salary=1882867),
Row(Name='Jabari Parker', Team='Atlanta Hawks', Position='PF', Birthday='3/15/95', Salary=6500000),
Row(Name='Carsen Edwards', Team='Boston Celtics', Position='SG', Birthday='3/12/98', Salary=1228026),
Row(Name='Anthony Tolliver', Team='Portland Trail Blazers', Position='PF', Birthday='6/1/85', Salary=2564753),
Row(Name='Lauri Markkanen', Team='Chicago Bulls', Position='PF', Birthday='5/22/97', Salary=5300400),
Row(Name='Kris Dunn', Team='Chicago Bulls', Position='PG', Birthday='3/18/94', Salary=5348007),
Row(Name='Reggie Bullock', Team='New York Knicks', Position='SF', Birthday='3/16/91', Salary=4000000),
Row(Name='Mike Conley', Team='Utah Jazz', Position='PG', Birthday='10/11/87', Salary=32511623),
Row(Name='Jaylen Nowell', Team='Minnesota Timberwolves', Position='SG', Birthday='7/9/99', Salary=1400000),
Row(Name='Gorgui Dieng', Team='Minnesota Timberwolves', Position='C', Birthday='1/18/90', Salary=16229213),
Row(Name='Patrick Patterson', Team='Los Angeles Clippers', Position='PF', Birthday='3/14/89', Salary=3068660),
Row(Name='Jarrett Allen', Team='Brooklyn Nets', Position='C', Birthday='4/21/98', Salary=2376840),
Row(Name='Bobby Portis', Team='New York Knicks', Position='C', Birthday='2/10/95', Salary=15000000),
Row(Name='Joel Embiid', Team='Philadelphia 76ers', Position='C', Birthday='3/16/94', Salary=27504630),
Row(Name='Jonas Valanciunas', Team='Memphis Grizzlies', Position='C', Birthday='5/6/92', Salary=16000000),
Row(Name='Chris Chiozza', Team='Washington Wizards', Position='PG', Birthday='11/21/95', Salary=79568),
Row(Name='Kent Bazemore', Team='Portland Trail Blazers', Position='SF', Birthday='7/1/89', Salary=19269663),
Row(Name='Tristan Thompson', Team='Cleveland Cavaliers', Position='C', Birthday='3/13/91', Salary=18539130),
Row(Name='Mason Plumlee', Team='Denver Nuggets', Position='C', Birthday='3/5/90', Salary=14041096),
Row(Name='Shabazz Napier', Team='Minnesota Timberwolves', Position='PG', Birthday='7/14/91', Salary=1845301),
Row(Name='Edmond Sumner', Team='Indiana Pacers', Position='PG', Birthday='12/31/95', Salary=2000000),
Row(Name='Alex Len', Team='Atlanta Hawks', Position='C', Birthday='6/16/93', Salary=4160000),
Row(Name='Josh Richardson', Team='Philadelphia 76ers', Position='SF', Birthday='9/15/93', Salary=10116576),
Row(Name='Bojan Bogdanovic', Team='Utah Jazz', Position='SF', Birthday='4/18/89', Salary=17000000),
Row(Name='Iman Shumpert', Team='Brooklyn Nets', Position='PG', Birthday='6/26/90', Salary=2031676),
Row(Name='Daryl Macon', Team='Miami Heat', Position='SG', Birthday='11/29/95', Salary=79568),
Row(Name='Rodney McGruder', Team='Los Angeles Clippers', Position='SG', Birthday='7/29/91', Salary=4807693),
Row(Name='Bam Adebayo', Team='Miami Heat', Position='C', Birthday='7/18/97', Salary=3454080),
Row(Name='Jacob Evans', Team='Golden State Warriors', Position='SG', Birthday='6/18/97', Salary=1928280),
Row(Name='Nigel Williams-Goss', Team='Utah Jazz', Position='PG', Birthday='9/16/94', Salary=1500000),
Row(Name='Terrance Ferguson', Team='Oklahoma City Thunder', Position='SF', Birthday='5/17/98', Salary=2475840),
Row(Name='Michael Carter-Williams', Team='Orlando Magic', Position='PG', Birthday='10/10/91', Salary=2028594),
Row(Name='Bol Bol', Team='Denver Nuggets', Position='C', Birthday='11/16/99', Salary=79568),
Row(Name='Willie Cauley-Stein', Team='Golden State Warriors', Position='C', Birthday='8/18/93', Salary=2177483),
Row(Name='Nikola Vucevic', Team='Orlando Magic', Position='C', Birthday='10/24/90', Salary=28000000),
Row(Name='Nicolas Batum', Team='Charlotte Hornets', Position='SF', Birthday='12/14/88', Salary=25565217),
Row(Name='Kyrie Irving', Team='Brooklyn Nets', Position='PG', Birthday='3/23/92', Salary=31742000),
Row(Name='Jeremy Lamb', Team='Indiana Pacers', Position='SF', Birthday='5/30/92', Salary=10500000),
Row(Name='Donovan Mitchell', Team='Utah Jazz', Position='SG', Birthday='9/7/96', Salary=3635760),
Row(Name='Thanasis Antetokounmpo', Team='Milwaukee Bucks', Position='SF', Birthday='7/18/92', Salary=1445697),
Row(Name='James Johnson', Team='Miami Heat', Position='PF', Birthday='2/20/87', Salary=15349400),
Row(Name='Monte Morris', Team='Denver Nuggets', Position='PG', Birthday='6/27/95', Salary=1588231),
Row(Name='Terry Rozier', Team='Charlotte Hornets', Position='PG', Birthday='3/17/94', Salary=19894737),
Row(Name='DeAndre Jordan', Team='Brooklyn Nets', Position='C', Birthday='7/21/88', Salary=9881598),
Row(Name='Jae Crowder', Team='Memphis Grizzlies', Position='SF', Birthday='7/6/90', Salary=7815533),
Row(Name='Josh Gray', Team='New Orleans Pelicans', Position='PG', Birthday='9/9/93', Salary=79568),
Row(Name='Goga Bitadze', Team='Indiana Pacers', Position='C', Birthday='7/20/99', Salary=2816760),
Row(Name='Kobi Simmons', Team='Charlotte Hornets', Position='PG', Birthday='7/4/97', Salary=79568),
Row(Name='Derrick Favors', Team='New Orleans Pelicans', Position='C', Birthday='7/15/91', Salary=17650000),
Row(Name='Landry Shamet', Team='Los Angeles Clippers', Position='SG', Birthday='3/13/97', Salary=1995120),
Row(Name='Jalen McDaniels', Team='Charlotte Hornets', Position='PF', Birthday='1/31/98', Salary=898310),
Row(Name='Bruno Caboclo', Team='Memphis Grizzlies', Position='SF', Birthday='9/21/95', Salary=1845301),
Row(Name='Drew Eubanks', Team='San Antonio Spurs', Position='PF', Birthday='2/1/97', Salary=79568),
Row(Name='Raul Neto', Team='Philadelphia 76ers', Position='PG', Birthday='5/19/92', Salary=1737145),
Row(Name='Jalen Lecque', Team='Phoenix Suns', Position='G', Birthday='6/13/00', Salary=898310),
Row(Name='Giannis Antetokounmpo', Team='Milwaukee Bucks', Position='PF', Birthday='12/6/94', Salary=25842697),
Row(Name='Malik Monk', Team='Charlotte Hornets', Position='SG', Birthday='2/4/98', Salary=4028400),
Row(Name='Tacko Fall', Team='Boston Celtics', Position='C', Birthday='12/10/95', Salary=79568),
Row(Name='Justin Jackson', Team='Dallas Mavericks', Position='PF', Birthday='3/28/95', Salary=3280920),
Row(Name='Paul George', Team='Los Angeles Clippers', Position='SF', Birthday='5/2/90', Salary=33005556),
Row(Name='Jayson Tatum', Team='Boston Celtics', Position='PF', Birthday='3/3/98', Salary=7830000),
Row(Name='Admiral Schofield', Team='Washington Wizards', Position='SF', Birthday='3/30/97', Salary=1000000),
Row(Name='Louis King', Team='Detroit Pistons', Position='F', Birthday='4/6/99', Salary=79568),
Row(Name='Kostas Antetokounmpo', Team='Los Angeles Lakers', Position='PF', Birthday='11/20/97', Salary=79568),
Row(Name='Rodions Kurucs', Team='Brooklyn Nets', Position='PF', Birthday='2/5/98', Salary=1699236),
Row(Name='Spencer Dinwiddie', Team='Brooklyn Nets', Position='PG', Birthday='4/6/93', Salary=10605600),
Row(Name='Doug McDermott', Team='Indiana Pacers', Position='PF', Birthday='1/3/92', Salary=7333333),
Row(Name='Romeo Langford', Team='Boston Celtics', Position='SG', Birthday='10/25/99', Salary=3458400),
Row(Name='Caris LeVert', Team='Brooklyn Nets', Position='SF', Birthday='8/25/94', Salary=2625717),
Row(Name='Michael Kidd-Gilchrist', Team='Charlotte Hornets', Position='PF', Birthday='9/26/93', Salary=13000000),
Row(Name='LeBron James', Team='Los Angeles Lakers', Position='PF', Birthday='12/30/84', Salary=37436858),
Row(Name='Taj Gibson', Team='New York Knicks', Position='C', Birthday='6/24/85', Salary=9000000),
Row(Name='Ty Jerome', Team='Phoenix Suns', Position='G', Birthday='7/8/97', Salary=2193480),
Row(Name='Chris Clemons', Team='Houston Rockets', Position='SG', Birthday='7/23/97', Salary=79568),
Row(Name='Luke Kornet', Team='Chicago Bulls', Position='C', Birthday='7/15/95', Salary=2250000),
Row(Name='Trey Lyles', Team='San Antonio Spurs', Position='PF', Birthday='11/5/95', Salary=5500000),
Row(Name='Sterling Brown', Team='Milwaukee Bucks', Position='SF', Birthday='2/10/95', Salary=1618520),
Row(Name='Andre Iguodala', Team='Memphis Grizzlies', Position='SF', Birthday='1/28/84', Salary=17185185),
Row(Name='Vincent Poirier', Team='Boston Celtics', Position='C', Birthday='10/17/93', Salary=2505793),
Row(Name='Frank Ntilikina', Team='New York Knicks', Position='PG', Birthday='7/28/98', Salary=4855800),
Row(Name='Jordan McRae', Team='Washington Wizards', Position='PG', Birthday='3/28/91', Salary=1645357),
Row(Name='Enes Kanter', Team='Boston Celtics', Position='C', Birthday='5/20/92', Salary=4767000),
Row(Name='John Henson', Team='Cleveland Cavaliers', Position='C', Birthday='12/28/90', Salary=9732396),
Row(Name='Jaylen Brown', Team='Boston Celtics', Position='SF', Birthday='10/24/96', Salary=6534829),
Row(Name='Jonah Bolden', Team='Philadelphia 76ers', Position='PF', Birthday='1/2/96', Salary=1698450),
Row(Name='Chimezie Metu', Team='San Antonio Spurs', Position='PF', Birthday='3/22/97', Salary=1416852),
Row(Name='Tobias Harris', Team='Philadelphia 76ers', Position='PF', Birthday='7/15/92', Salary=32742000),
Row(Name='Semi Ojeleye', Team='Boston Celtics', Position='PF', Birthday='12/5/94', Salary=1618520),
Row(Name='Jevon Carter', Team='Phoenix Suns', Position='PG', Birthday='9/14/95', Salary=1416852),
Row(Name='Brandon Ingram', Team='New Orleans Pelicans', Position='PF', Birthday='9/2/97', Salary=7265485),
Row(Name='Moritz Wagner', Team='Washington Wizards', Position='C', Birthday='4/26/97', Salary=2063520),
Row(Name='Dorian Finney-Smith', Team='Dallas Mavericks', Position='PF', Birthday='5/4/93', Salary=4000000),
Row(Name='Danuel House', Team='Houston Rockets', Position='SF', Birthday='6/7/93', Salary=3540000),
Row(Name='Nicolo Melli', Team='New Orleans Pelicans', Position='C', Birthday='1/26/91', Salary=4102564),
Row(Name='Talen Horton-Tucker', Team='Los Angeles Lakers', Position='GF', Birthday='11/25/00', Salary=898310),
Row(Name='Ed Davis', Team='Utah Jazz', Position='C', Birthday='6/5/89', Salary=4767000),
Row(Name='Kyle Guy', Team='Sacramento Kings', Position='G', Birthday='8/11/97', Salary=79568),
Row(Name='Kadeem Allen', Team='New York Knicks', Position='PG', Birthday='1/15/93', Salary=79568),
Row(Name='Dante Exum', Team='Utah Jazz', Position='PG', Birthday='7/13/95', Salary=9600000),
Row(Name='Abdel Nader', Team='Oklahoma City Thunder', Position='SF', Birthday='9/25/93', Salary=1618520),
Row(Name='Bruno Fernando', Team='Atlanta Hawks', Position='C', Birthday='8/15/98', Salary=1400000),
Row(Name='Dion Waiters', Team='Miami Heat', Position='SG', Birthday='12/10/91', Salary=12100000),
Row(Name='Jared Dudley', Team='Los Angeles Lakers', Position='PF', Birthday='7/10/85', Salary=2564753),
Row(Name='Max Strus', Team='Chicago Bulls', Position='SG', Birthday='3/28/96', Salary=79568),
Row(Name='Kevon Looney', Team='Golden State Warriors', Position='C', Birthday='2/6/96', Salary=4464286),
Row(Name='Willy Hernangomez', Team='Charlotte Hornets', Position='C', Birthday='5/27/94', Salary=1557250),
Row(Name='Melvin Frazier', Team='Orlando Magic', Position='SG', Birthday='8/30/96', Salary=1416852),
Row(Name='Austin Rivers', Team='Houston Rockets', Position='PG', Birthday='8/1/92', Salary=2174310),
Row(Name='Harry Giles', Team='Sacramento Kings', Position='PF', Birthday='4/22/98', Salary=2578800),
Row(Name='Robin Lopez', Team='Milwaukee Bucks', Position='C', Birthday='4/1/88', Salary=4767000),
Row(Name='Collin Sexton', Team='Cleveland Cavaliers', Position='PG', Birthday='1/4/99', Salary=4764960),
Row(Name='Ricky Rubio', Team='Phoenix Suns', Position='PG', Birthday='10/21/90', Salary=16200000)]
Changing Variables
- Adding Columns
df.withColumn() takes the name of the new column, along with an expression that defines how to create it. When the expression uses an already existing column, you must refer to it as a column by using col('ColumnName').
- Note: you need to import col() from pyspark.sql.functions in order for it to be recognized.
Example: Dividing the NBA salaries by 1000
from pyspark.sql.functions import col
df = df.withColumn("SalaryK", col("Salary")/1000)  # add the new column
df.show(5)
+--------------+------------------+--------+--------+-------+--------+
| Name| Team|Position|Birthday| Salary| SalaryK|
+--------------+------------------+--------+--------+-------+--------+
| Shake Milton|Philadelphia 76ers| SG| 9/26/96|1445697|1445.697|
|Christian Wood| Detroit Pistons| PF| 9/27/95|1645357|1645.357|
| PJ Washington| Charlotte Hornets| PF| 8/23/98|3831840| 3831.84|
| Derrick Rose| Detroit Pistons| PG| 10/4/88|7317074|7317.074|
| Marial Shayok|Philadelphia 76ers| G| 7/26/95| 79568| 79.568|
+--------------+------------------+--------+--------+-------+--------+
only showing top 5 rows
- Removing Columns
df.drop() takes one or multiple column names.
Example: Removing ‘SalaryK’
df = df.drop("SalaryK")  # keep the DataFrame; show() itself returns None
df.show(5)
+--------------+------------------+--------+--------+-------+
| Name| Team|Position|Birthday| Salary|
+--------------+------------------+--------+--------+-------+
| Shake Milton|Philadelphia 76ers| SG| 9/26/96|1445697|
|Christian Wood| Detroit Pistons| PF| 9/27/95|1645357|
| PJ Washington| Charlotte Hornets| PF| 8/23/98|3831840|
| Derrick Rose| Detroit Pistons| PG| 10/4/88|7317074|
| Marial Shayok|Philadelphia 76ers| G| 7/26/95| 79568|
+--------------+------------------+--------+--------+-------+
only showing top 5 rows
- Renaming Columns
df.withColumnRenamed() takes the current column name, followed by the new name.
Example: Changing ‘Birthday’ to ‘DateOfBirth’
df.withColumnRenamed("Birthday", "DateOfBirth").show(5)
+--------------+------------------+--------+-----------+-------+
| Name| Team|Position|DateOfBirth| Salary|
+--------------+------------------+--------+-----------+-------+
| Shake Milton|Philadelphia 76ers| SG| 9/26/96|1445697|
|Christian Wood| Detroit Pistons| PF| 9/27/95|1645357|
| PJ Washington| Charlotte Hornets| PF| 8/23/98|3831840|
| Derrick Rose| Detroit Pistons| PG| 10/4/88|7317074|
| Marial Shayok|Philadelphia 76ers| G| 7/26/95| 79568|
+--------------+------------------+--------+-----------+-------+
only showing top 5 rows
- Rearranging Columns
Use select() to order the columns in the way that you would like. Columns that are not listed are left out of the result.
Example:
df.select("Name", "Team", "Position", "Salary").show(5)
+--------------+------------------+--------+-------+
| Name| Team|Position| Salary|
+--------------+------------------+--------+-------+
| Shake Milton|Philadelphia 76ers| SG|1445697|
|Christian Wood| Detroit Pistons| PF|1645357|
| PJ Washington| Charlotte Hornets| PF|3831840|
| Derrick Rose| Detroit Pistons| PG|7317074|
| Marial Shayok|Philadelphia 76ers| G| 79568|
+--------------+------------------+--------+-------+
only showing top 5 rows
Mathematical & Vectorized Operations
- Aggregate Functions:
mean()
min()
max()
stddev_pop()
median()
These functions are used within selectExpr(). They take the name of the variable you’d like to aggregate, followed by as "new_variable_name" to name the result.
Here are the aggregation functions in action:
df.selectExpr("mean(Salary) as mean_salary",
"min(Salary) as min_salary",
"max(Salary) as max_salary",
"stddev_pop(Salary) as std_salary"
).show()
+-----------------+----------+----------+-----------------+
| mean_salary|min_salary|max_salary| std_salary|
+-----------------+----------+----------+-----------------+
|7653583.764444444| 79568| 40231758|9278483.657952718|
+-----------------+----------+----------+-----------------+
- Using the functions package
from pyspark.sql import functions as F
This package will allow you to create new columns or transform current ones.
Examples:
- F.avg()
- F.concat()
- F.lit()
- F.col()
Here are these functions in action:
salary_mean = df.select(F.avg("Salary").alias("mean_salary")).collect()[0]["mean_salary"]
df2 = (
    df
    .withColumn("Salary_2x", F.col("Salary") * 2)  # Add Salary_2x
    .withColumn("Name_w_Position",  # Concatenate Name and Position
                F.concat(F.col("Name"), F.lit(" ("), F.col("Position"), F.lit(")")))
    .withColumn("Salary_minus_Mean",  # Subtract mean salary
                F.col("Salary") - F.lit(salary_mean))
)
df2.show(5)
+--------------+------------------+--------+--------+-------+---------+-------------------+-------------------+
| Name| Team|Position|Birthday| Salary|Salary_2x| Name_w_Position| Salary_minus_Mean|
+--------------+------------------+--------+--------+-------+---------+-------------------+-------------------+
| Shake Milton|Philadelphia 76ers| SG| 9/26/96|1445697| 2891394| Shake Milton (SG)| -6207886.764444444|
|Christian Wood| Detroit Pistons| PF| 9/27/95|1645357| 3290714|Christian Wood (PF)| -6008226.764444444|
| PJ Washington| Charlotte Hornets| PF| 8/23/98|3831840| 7663680| PJ Washington (PF)|-3821743.7644444443|
| Derrick Rose| Detroit Pistons| PG| 10/4/88|7317074| 14634148| Derrick Rose (PG)| -336509.7644444443|
| Marial Shayok|Philadelphia 76ers| G| 7/26/95| 79568| 159136| Marial Shayok (G)| -7574015.764444444|
+--------------+------------------+--------+--------+-------+---------+-------------------+-------------------+
only showing top 5 rows
Converting Data Types
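.cast() is called on a column and takes the target data type as a string.
to_date() converts a string column to a date. It takes the variable to be converted and the format pattern that the existing strings follow. Note that in Spark 3 and later the two-digit year pattern yy is resolved to the years 2000 to 2099, which is why 9/26/96 appears as 2096-09-26 in the output below.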
from pyspark.sql.functions import to_date
df.withColumn('DateOfBirth_ts', to_date('Birthday','M/d/yy')).show(5)
+--------------+------------------+--------+--------+-------+--------------+
| Name| Team|Position|Birthday| Salary|DateOfBirth_ts|
+--------------+------------------+--------+--------+-------+--------------+
| Shake Milton|Philadelphia 76ers| SG| 9/26/96|1445697| 2096-09-26|
|Christian Wood| Detroit Pistons| PF| 9/27/95|1645357| 2095-09-27|
| PJ Washington| Charlotte Hornets| PF| 8/23/98|3831840| 2098-08-23|
| Derrick Rose| Detroit Pistons| PG| 10/4/88|7317074| 2088-10-04|
| Marial Shayok|Philadelphia 76ers| G| 7/26/95| 79568| 2095-07-26|
+--------------+------------------+--------+--------+-------+--------------+
only showing top 5 rows
Main Data Types:
- int
- float
- string
- boolean
- date
- timestamp
Filtering by a Condition
df.filter() takes one or multiple conditions and keeps only the rows that meet them. Put each condition in its own parentheses and combine them with & (and) or | (or).
Here are some examples with a new DataFrame:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
df_pd = pd.read_csv("https://bcdanl.github.io/data/employment.csv")
df_pd = df_pd.where(pd.notnull(df_pd), None) # Convert NaN to None
df = spark.createDataFrame(df_pd)
df.filter(col("Salary") > 100000).show(5)
+----------+------+----------+--------+-----+---------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+--------+-----+---------+
| Douglas| Male| 8/6/93| NaN| true|Marketing|
| Maria|Female| NULL|130590.0|false| Finance|
| Jerry| NULL| 3/4/05|138705.0| true| Finance|
| Larry| Male| 1/24/98|101004.0| true| IT|
| Dennis| Male| 4/18/87|115163.0|false| Legal|
+----------+------+----------+--------+-----+---------+
only showing top 5 rows
# or
df.filter(
    (col("Team") == "Finance") &
    (col("Salary") >= 100000)
).show(5)
+----------+------+----------+--------+-----+-------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+--------+-----+-------+
| Maria|Female| NULL|130590.0|false|Finance|
| Jerry| NULL| 3/4/05|138705.0| true|Finance|
| Bruce| Male| 11/28/09|114796.0|false|Finance|
| Carl| Male| 5/3/06|130276.0| true|Finance|
| Irene| NULL| 7/14/15|100863.0| true|Finance|
+----------+------+----------+--------+-----+-------+
only showing top 5 rows
# or
df.filter(
    (col("Team") == "Finance") |
    (col("Team") == "Legal") |
    (col("Team") == "Sales")
).show(5)
+----------+------+----------+--------+-----+-------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+--------+-----+-------+
| Maria|Female| NULL|130590.0|false|Finance|
| Jerry| NULL| 3/4/05|138705.0| true|Finance|
| Dennis| Male| 4/18/87|115163.0|false| Legal|
| NULL|Female| 7/20/15| 45906.0| NULL|Finance|
| Julie|Female| 10/26/97|102508.0| true| Legal|
+----------+------+----------+--------+-----+-------+
only showing top 5 rows
isin(), used within filter(), takes a list of values and keeps only the rows whose value in that variable matches one of them.
df.filter(col('Team').isin('Finance','Legal','Sales')).show(5)
+----------+------+----------+--------+-----+-------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+--------+-----+-------+
| Maria|Female| NULL|130590.0|false|Finance|
| Jerry| NULL| 3/4/05|138705.0| true|Finance|
| Dennis| Male| 4/18/87|115163.0|false| Legal|
| NULL|Female| 7/20/15| 45906.0| NULL|Finance|
| Julie|Female| 10/26/97|102508.0| true| Legal|
+----------+------+----------+--------+-----+-------+
only showing top 5 rows
between() is also used within filter(). It takes a lower and an upper bound and returns True if a value falls within the range, with both bounds included.
df_between = df.filter(col('Salary').between(90000,100000))
df_between.show(5)
+----------+------+----------+-------+-----+-----------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+-------+-----+-----------+
| Angela|Female| 11/22/05|95570.0| true|Engineering|
| Jeremy| Male| 9/21/10|90370.0|false| HR|
| Joshua| NULL| 3/8/12|90816.0| true| IT|
| John| Male| 7/1/92|97950.0|false| IT|
| Jerry| Male| 1/10/04|95734.0|false| IT|
+----------+------+----------+-------+-----+-----------+
only showing top 5 rows
Missing Values
Find how many missing values are in a column with isNull():
df.filter(col('Team').isNull()).count()
44
You can count the non-null values by using the same code and replacing isNull() with isNotNull().
Drop rows with missing values with na.drop():
df_drop = df.na.drop()  # keep the DataFrame; show() returns None
df_drop.show(10)
+----------+------+----------+--------+-----+------------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+--------+-----+------------+
| Larry| Male| 1/24/98|101004.0| true| IT|
| Dennis| Male| 4/18/87|115163.0|false| Legal|
| Ruby|Female| 8/17/87| 65476.0| true| Product|
| Angela|Female| 11/22/05| 95570.0| true| Engineering|
| Frances|Female| 8/8/02|139852.0| true|Business Dev|
| Julie|Female| 10/26/97|102508.0| true| Legal|
| Brandon| Male| 12/1/80|112807.0| true| HR|
| Gary| Male| 1/27/08|109831.0|false| Sales|
| Kimberly|Female| 1/14/99| 41426.0| true| Finance|
| Lillian|Female| 6/5/16| 59414.0|false| Product|
+----------+------+----------+--------+-----+------------+
only showing top 10 rows
- na.drop() takes the argument how = 'all', which removes only the observations in which all values are missing
- use the argument subset = to drop rows with missing values in the given variables
df_drop_subset = df.na.drop(subset=["Gender", "Team"])
df_drop_subset.show(10)
+----------+------+----------+--------+-----+------------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+--------+-----+------------+
| Douglas| Male| 8/6/93| NaN| true| Marketing|
| Maria|Female| NULL|130590.0|false| Finance|
| Larry| Male| 1/24/98|101004.0| true| IT|
| Dennis| Male| 4/18/87|115163.0|false| Legal|
| Ruby|Female| 8/17/87| 65476.0| true| Product|
| NULL|Female| 7/20/15| 45906.0| NULL| Finance|
| Angela|Female| 11/22/05| 95570.0| true| Engineering|
| Frances|Female| 8/8/02|139852.0| true|Business Dev|
| Julie|Female| 10/26/97|102508.0| true| Legal|
| Brandon| Male| 12/1/80|112807.0| true| HR|
+----------+------+----------+--------+-----+------------+
only showing top 10 rows
na.fill() fills in null values with a specified value.
df_fill = df.na.fill(value = 0, subset = ["Salary"])
df_fill.show(10)
+----------+------+----------+--------+-----+------------+
|First Name|Gender|Start Date| Salary| Mgmt| Team|
+----------+------+----------+--------+-----+------------+
| Douglas| Male| 8/6/93| 0.0| true| Marketing|
| Thomas| Male| 3/31/96| 61933.0| true| NULL|
| Maria|Female| NULL|130590.0|false| Finance|
| Jerry| NULL| 3/4/05|138705.0| true| Finance|
| Larry| Male| 1/24/98|101004.0| true| IT|
| Dennis| Male| 4/18/87|115163.0|false| Legal|
| Ruby|Female| 8/17/87| 65476.0| true| Product|
| NULL|Female| 7/20/15| 45906.0| NULL| Finance|
| Angela|Female| 11/22/05| 95570.0| true| Engineering|
| Frances|Female| 8/8/02|139852.0| true|Business Dev|
+----------+------+----------+--------+-----+------------+
only showing top 10 rows
- you can fill multiple variables at a time by passing a dictionary that maps each variable to its fill value, instead of value = , subset =
Dealing with Duplicates
dropDuplicates() drops all rows that are exact duplicates
df_no_dups = df.dropDuplicates()
- add ['Variable_Name'] in the function to treat rows as duplicates when they match on only those variables
df_no_dups_subset = df.dropDuplicates(["Team"])
Now you’re all caught up on the PySpark basics!