-
Notifications
You must be signed in to change notification settings - Fork 1
/
main.Rmd
198 lines (99 loc) · 6.95 KB
/
main.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
---
title: "main"
author: "Udayan Sawant"
date: "5/1/2020"
output:
html_document: default
word_document: default
---
```{r}
library(readr)
library(readxl)
library(dplyr)
```
#We’re always staying on the pulse of the software development industry so that we can better prepare our students for rapidly – changing technology job market. Many of the students and developers in general are interested in working at top startups and tech companies. How can we tell what programming languages and technologies are used by the most people? How about what languages are growing, and which are shrinking so that we can tell which are most worth investing time in?
#One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine the relative popularity of languages like R, Python, Java and JavaScript have changed over time.
#Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas.
#The dataset used, contains one observation for each tag in a year. The dataset also includes both the number of questions asked in that tag in the year and the total number of questions asked in that year
```{r}
by_tag <- read_csv("C:/Users/UDAYAN/Desktop/Projects/Rise and Fall of Programming Languages/by_tag.csv")
by_tag
```
##The above data has one observation for each of the pair, of a tag and year. Representing number of questions asked in that tag in that year and the total number od questions asked in the year.
##Rather than just the counts of the tags, the concern here is the percentage change: the fraction of questions that year have that tag
```{r}
by_tag <- by_tag %>% mutate( change = number / year_total)
by_tag
```
##So far we've been learning and using the R programming language. Wouldn't we like to be sure it's a good investment for the future? Has it been keeping pace with other languages, or have people been switching out of it?
##Let's look at whether the fraction of Stack Overflow questions that are about R has been increasing or decreasing over time.
```{r}
r_comp <- by_tag %>% filter( tag == 'r')
r_comp
```
##Rather than looking at the results in a table, we often want to create a visualization. Change over time is usually visualized with a line plot.
```{r}
library(ggplot2)
ggplot(r_comp, aes(x = year, y = change)) + geom_line()
```
#Similarly for python
```{r}
python <- by_tag %>% filter( tag == 'python')
python
```
```{r}
ggplot(python, aes(x = year, y = change, units= "in")) + geom_line()
```
#What are the most asked - about tags. It's sure been fun to visualize and compare tags over time. The dplyr and ggplot2 tags may not have as many questions as R, but we can tell they're both growing quickly as well.
#We might like to know which tags have the most questions overall, not just within a particular year. Right now, we have several rows for every tag, but we'll be combining them into one. That means we want group_by() and summarize().
```{r}
tags_sorted <- by_tag %>% group_by(tag) %>% summarise( tag_total = sum(year_total)) %>% arrange(desc(tag_total))
tags_sorted
```
# How have large programming languages changed over time ?
#We've looked at selected tags like R, ggplot2, and dplyr, and seen that they're each growing. What tags might be shrinking? A good place to start is to plot the tags that we just saw that were the most-asked about of all time, including JavaScript, Java and C#.
```{r}
tags_highest <- head(tags_sorted$tag)
by_tag_subset <- by_tag %>% filter( tag %in% tags_highest)
ggplot(by_tag_subset, aes(year, change, color = tag)) + geom_line()
```
#Top 10 Programming Languages
#The TIOBE Programming Community index is an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers world-wide, courses and third party vendors. Popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings. It is important to note that the TIOBE index is not about the best programming language or the language in which most lines of code have been written. (https://www.tiobe.com/tiobe-index/)
```{r}
prog_lang<- c('java', 'c', 'c++', 'python', 'visual basic.net', 'c#', 'javascript', 'sql', 'php', 'assemblt language')
prog_lang_tags <- by_tag %>% filter(tag %in% prog_lang)
ggplot(prog_lang_tags, aes( x = year, y = change, color = tag)) + geom_line()
```
```{r}
require(tidyr)
require(tidyverse)
require(tidyselect)
library(lattice)
require(dplyr)
```
#Comparing R and Python
#There is a lot of heated discussion over the topic, but there are some great, thoughtful articles as well. Some suggest Python is preferable as a general-purpose programming language, while others suggest data science is better served by a dedicated language and toolchain. The origins and development arcs of the two languages are compared and contrasted, often to support differing conclusions.
#For individual data scientists, some common points to consider:
#Python is a great general programming language, with many libraries dedicated to data science.
#Many (if not most) general introductory programming courses start teaching with Python now.
#Python is the go-to language for many ETL and Machine Learning workflows.
#Many (if not most) introductory courses to statistics and data science teach R now.
#R has become the world’s largest repository of statistical knowledge with reference implementations for thousands, if not tens of thousands, of algorithms that have been vetted by experts. The documentation for many R packages includes links to the primary literature on the subject.
#R has a very low barrier to entry for doing exploratory analysis, and converting that work into a great report, dashboard, or API.
#R with RStudio is often considered the best place to do exploratory data analysis. (https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/)
```{r}
r_and_python <- full_join(r_comp, python, by = c("year","tag", "number","year_total","change")) %>% arrange(desc(year, tag))
r_and_python
```
```{r}
ggplot(r_and_python, aes(year, change, color = tag)) + geom_line()
```
```{r}
xyplot(number ~ change + year_total, data = r_and_python)
```
```{r}
summary(lm(number ~ change + year_total, data = r_and_python))
```
```{r}
ggplot(r_and_python, aes(x = number, y = change + year_total)) + geom_point() + geom_smooth(method = "lm", se = TRUE)
```