
1
00:00:00,000 --> 00:00:04,959
I just wanted to make a short note about
the data set that I use in the regular

2
00:00:04,959 --> 00:00:08,126
expression lecture about grep, regexpr,
sub, and gsub.

3
00:00:08,126 --> 00:00:13,026
So in that lecture, I use a specific data
set, which is the Baltimore City Homicide

4
00:00:13,205 --> 00:00:16,910
Event database.
You can look at this data in map form by

5
00:00:16,910 --> 00:00:22,347
going to this website up here and I'll put
the link in Coursera so that you can find

6
00:00:22,347 --> 00:00:25,933
it later.
But basically, the website is hosted

7
00:00:25,933 --> 00:00:31,072
by the Baltimore Sun, the city's newspaper,
and they've kept track of every homicide

8
00:00:31,072 --> 00:00:37,560
in the city that's occurred since 2007 up
to the present day.

9
00:00:37,560 --> 00:00:43,598
So, for example, you can see here that
there have been 171 homicides in 2012 in

10
00:00:43,598 --> 00:00:46,780
Baltimore City.
So the map is interactive.

11
00:00:46,780 --> 00:00:52,431
You can do things like select,
let's say, the past

12
00:00:52,431 --> 00:00:55,322
six months.
You can select a district of Baltimore if

13
00:00:55,322 --> 00:00:57,955
you want or you can just look at all the
districts.

14
00:00:57,955 --> 00:01:00,950
You can also choose a zip code, the age and gender
of the victim,

15
00:01:01,105 --> 00:01:04,564
and then the race of the victim,
as well as the cause of death here.

16
00:01:04,564 --> 00:01:09,090
There's a few different causes of death.
So, you can show the results and I'm just

17
00:01:09,090 --> 00:01:13,121
picking all the homicides that have
occurred in the last six months. And then,

18
00:01:13,121 --> 00:01:17,570
you'll get a map, a little Google map like
this that'll show you where each victim

19
00:01:17,570 --> 00:01:21,444
was found in the city and they're color
coded by the cause of death.

20
00:01:21,444 --> 00:01:25,632
So you can see that the vast
majority of the homicides here in the last

21
00:01:25,632 --> 00:01:28,406
six months were red and that indicates a
shooting.

22
00:01:28,563 --> 00:01:32,804
And then, if you zoom in a little bit,
let's say, if you zoom in on this portion

23
00:01:32,804 --> 00:01:36,323
right here.
You can click on one of

24
00:01:36,323 --> 00:01:41,668
the markers and you'll see the name of the
person, the street address of

25
00:01:41,668 --> 00:01:46,953
where the person was found, a little bit
of information, the race, gender, age, and

26
00:01:46,953 --> 00:01:52,239
then where the person died, and then the
cause of death which was determined in

27
00:01:52,239 --> 00:01:56,503
this case to be a shooting.
So you can click on each marker like

28
00:01:56,503 --> 00:01:59,266
that.
You'll get information about each victim

29
00:01:59,446 --> 00:02:05,221
and you can kind of browse the map like
this. And then if you scroll down a little

30
00:02:05,221 --> 00:02:10,967
bit more, you'll see that there's a little
bit of statistical analysis here, so

31
00:02:10,967 --> 00:02:16,372
basically just a bar chart of the number
of homicides that occurred each month.

32
00:02:16,577 --> 00:02:20,340
And they're color coded by the
cause of death.

33
00:02:20,340 --> 00:02:25,881
And the little line graph here is the
number of homicides per month in the

34
00:02:25,881 --> 00:02:30,999
previous year. So you can compare
this year in the bar graph and last

35
00:02:30,999 --> 00:02:34,873
year's in the line graph.
If you want to look at previous years'

36
00:02:34,873 --> 00:02:38,872
data in this format, you can click on the
little arrow over here.

37
00:02:38,872 --> 00:02:44,558
And now, you can look at monthly homicides
in 2011, monthly homicides in 2010, etc.,

38
00:02:44,558 --> 00:02:47,933
like that.
So feel free to take a look at this data.

39
00:02:48,120 --> 00:02:53,570
Unfortunately, the data set is not easily
downloadable in a batch form so

40
00:02:53,570 --> 00:02:57,397
that we can look at it, for example, and
load it into R.

41
00:02:57,397 --> 00:03:03,209
So, what I have done is essentially cut
and paste the data from the HTML source,

42
00:03:03,422 --> 00:03:09,162
and created a text file, which has
a lot of the HTML source in there,

43
00:03:09,162 --> 00:03:15,399
but essentially boils down to having one
homicide record per line in the text file.

44
00:03:15,612 --> 00:03:21,140
So, if you want, you can open up
the text file, which I'll do right here.

45
00:03:21,140 --> 00:03:26,467
I'll, first of all, unzip the file.
And if you take a look at it, just in your

46
00:03:26,467 --> 00:03:31,277
text editor or whatever application
you want to use, you'll notice that there

47
00:03:31,277 --> 00:03:34,771
is a lot of text here.
But basically, the first line of data

48
00:03:34,771 --> 00:03:36,988
looks like this.
So, this is the first line.

49
00:03:36,988 --> 00:03:40,631
It does wrap around a little bit because
it's too long for my screen.

50
00:03:40,789 --> 00:03:45,118
But you can see there's information
about the latitude and longitude of

51
00:03:45,118 --> 00:03:49,289
where the person was found, the person's
name, street address where they were

52
00:03:49,289 --> 00:03:52,562
found, and some information about their
race, gender, and age.

53
00:03:52,720 --> 00:03:55,571
And the date that they were found and
where they died.

54
00:03:55,571 --> 00:03:59,531
And in this case, this person
died at Shock Trauma, which is the

55
00:03:59,531 --> 00:04:03,859
University of Maryland's emergency room.
So, you can see there's one line per

56
00:04:03,859 --> 00:04:06,878
victim.
There are a little over 1200 lines in

57
00:04:06,878 --> 00:04:09,871
this file,
so 1200 homicides over the last five

58
00:04:09,871 --> 00:04:12,927
years.
And so, you can take a look at this data

59
00:04:12,927 --> 00:04:15,796
set.
You can read it into R, or you can view it

60
00:04:15,983 --> 00:04:18,540
on the map provided by the Baltimore Sun.

