class: center, middle, title-slide

.title[
# Introduction
]
.subtitle[
## JSC 370: Data Science II
]

---

# Instructors

- Meredith Franklin: meredith.franklin@utoronto.ca
- Michael Moon: michael.moon@mail.utoronto.ca

---

# My Background

- Moved last year from Los Angeles, where I was an Assistant/Associate Professor of Biostatistics at the University of Southern California
- Originally from Canada: McGill math for BSc, Ottawa/Carleton Institute of Math for MSc, Harvard for PhD, UChicago for postdoc
- At U of T I'm an Associate Professor with tenure in the Department of Statistical Science (51%) and the School of the Environment (49%)

---

# My Teaching

- Founded a Master's of Health Data Science program at USC that launched in 2020
- Co-taught the introductory data science course
- Taught graduate-level spatial statistics, inference, linear models
- Taught undergraduate intro to stat (once!)
- This semester I am also teaching STA255

---

# My Research

- Spatial statistical methods for environmental data
- Data science techniques for remote sensing data/imagery
- Focus on pollution (air, noise) and climate (GHG, land cover change)
- Machine learning is becoming a big part of environmental research

.center[
![](img/research_fig.png)
]

---

# Course Goals

Through this course, you will hone the techniques used in Data Science. You will learn:

- Programming in R, and tools such as Markdown and Git
- Exploratory data analysis: generating hypotheses and building intuition
- Data visualization: showing data through interpretable summaries
- Data collection: data scraping, wrangling, cleaning
- Statistical (machine learning) algorithms
- Building a github.io website

---

# Quercus + Git + Piazza

Course website - lecture slides, labs, data

https://jsc370.github.io/jsc370-2023/

Quercus - announcements, homework solutions, lab solutions, guest speaker reflections, grading

https://q.utoronto.ca/courses/298698

Piazza - questions and discussion

---

# What is data science?
- Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.

--

.center[
![](img/data-science.png)
]

---

.center[
![](img/data-science-drew-conway.jpg)
]

---

# Data science can be really cool

<figure align="center">
<img src="https://imgs.xkcd.com/comics/regular_expressions.png" style="width:450px">
<figcaption>Source: https://xkcd.com/208/</figcaption>
</figure>

---

# With great power comes great responsibility

<figure align="center">
<img src="https://imgs.xkcd.com/comics/extrapolating.png" style="width:500px">
<figcaption>Source: https://xkcd.com/605/</figcaption>
</figure>

---

![](img/demand.png){height=20%}

---

# Data Scientists in Demand

Also see [here](https://www.amstat.org/asa/News/New-Report-Highlights-Growing-Demand-for-Data-Science-Analytics-Talent.aspx), [here](https://www.ibm.com/downloads/cas/3RL3VXGA), [here](https://www.forbes.com/sites/gilpress/2021/06/27/salaries-and-job-opportunities-for-data-scientists-continue-to-rise/), and [here](https://www.glassdoor.com/research/job-market-report/).

A good [data science subreddit](https://www.reddit.com/r/datascience/) to follow: it provides insights on jobs and academic programs, and hosts AMAs from industry leaders.

Another good resource is [Towards Data Science](https://towardsdatascience.com/).

---

# What is this course?

This course is an introduction to the world of data science, following on from where JSC 270 left off.

--

The course will teach language-agnostic skills that are easily transferable, with examples done in R.

--

You can use any language/tool you prefer, but I can only guarantee help if you are using R and RStudio.

---

# What is R?

<img src="https://www.r-project.org/logo/Rlogo.svg" width="150px" alt="R logo">

> R is a language and environment for statistical computing and graphics.
-- https://r-project.org

Created by statisticians for statisticians.
Over 16,000 packages added to CRAN

---

.center[
![](img/datascience_code.png)
]

---

# History of R

Originates from S, which was developed at Bell Labs in the 1970s

First versions of R were developed by Robert Gentleman and Ross Ihaka of the University of Auckland in the mid-1990s

R is intended for statisticians but used by many (>2M users!)

R is open source and has nice graphics and visualizations

A lot of help is available online (Stack Overflow, R package vignettes, Journal of Statistical Software)

---

# R Data Science Resources

1) R Programming for Data Science, 2019. Roger Peng. https://bookdown.org/rdpeng/rprogdatascience/

Supplementary References

2) R for Data Science, 2017. Garrett Grolemund and Hadley Wickham. http://r4ds.had.co.nz/

3) Exploratory Data Analysis with R, 2020. Roger Peng. https://bookdown.org/rdpeng/exdata/

4) Mastering Software Development in R, 2017. Roger Peng, Sean Kross, Brooke Anderson. https://bookdown.org/rdpeng/RProgDA/

---

# R in the terminal

<figure align="center">
<img src="R_terminal.png" height="500px">
</figure>

---

# What is RStudio?

<img src="https://rstudio.com/wp-content/uploads/2018/10/RStudio-Logo.svg" width="400px" alt="RStudio logo">

> RStudio is an integrated development environment (IDE) for R.
-- https://rstudio.org/products/rstudio

---

.center[
![](img/moderndive-r-vs-rstudio.png)
]

---

# R + RStudio

<figure align="center">
<img src="rstudio-now.png" height="500px">
</figure>

---

## GitHub

--

- Version control is essential to the practice of data science and is used in both industry and academia

--

- Building up a solid GitHub profile will put you in a good position for job hunting

--

- You will build a github.io website as part of this course

<figure align="center">
<img src="img/git1.png" height="100px">
</figure>

<figure align="center">
<img src="img/git2.png" height="100px">
</figure>

---

# First Week

The lab exercises can be found in the schedule on the course website: https://jsc370.github.io/jsc370-2023/

Download the Rmd files

Submit your individually completed lab by end of day Wednesday

---

# Next Week

Guest speaker 1-2pm Monday January 16

Upload a half-page summary of the guest speaker's seminar to Quercus by end of day Wednesday

Lecture 2-3pm Monday January 16 (Version control)

Lab 1-3pm Wednesday January 18 (Version control)