WEBVTT

00:00:01.040 --> 00:00:02.660
Protecting data.

00:00:03.740 --> 00:00:06.350
Data protection starts with policy.

00:00:07.040 --> 00:00:10.480
It is policy that mandates people's behavior,

00:00:10.520 --> 00:00:14.870
and, of course, policy will also be enforced through the contracts

00:00:14.870 --> 00:00:17.700
we have in place with a cloud service provider.

00:00:17.700 --> 00:00:24.590
Policy also indicates who is accountable, in other words, who is the owner.

00:00:25.540 --> 00:00:31.150
It will mandate that we must follow certain handling requirements and procedures.

00:00:31.690 --> 00:00:36.590
The idea of policy is that it provides the authority for the function.

00:00:37.040 --> 00:00:42.590
It establishes the fact that the organization has recognized the

00:00:42.590 --> 00:00:46.980
importance of something and, of course, is therefore compliant

00:00:46.980 --> 00:00:49.620
with things like laws and regulations.

00:00:50.340 --> 00:00:54.360
It requires compliance with the procedures as well

00:00:54.840 --> 00:00:58.630
and will quite often indicate, when it comes to data,

00:00:58.630 --> 00:01:02.640
what is the retention period based, as we saw earlier,

00:01:02.650 --> 00:01:06.470
on business requirements and legal requirements.

00:01:06.470 --> 00:01:11.430
As we looked at before, an important part of all this is that

00:01:11.430 --> 00:01:15.590
data quite often goes between systems, departments,

00:01:15.590 --> 00:01:19.520
and we can say it can show up in a system here and a report

00:01:19.520 --> 00:01:22.200
over there or can go out to an external party,

00:01:22.200 --> 00:01:27.950
and so it's important that we know where we are actually using our data,

00:01:28.330 --> 00:01:30.930
the location, and what are the uses,

00:01:30.930 --> 00:01:36.280
why do we even have that data? We map out the data flows.

00:01:36.400 --> 00:01:37.680
Maybe, for example,

00:01:37.680 --> 00:01:43.160
we even put in a matrix that shows all of our data elements and where

00:01:43.160 --> 00:01:49.950
they actually get used. It's important also that we ensure data is in

00:01:49.950 --> 00:01:54.690
a consistent format to prevent mistakes and errors that come from

00:01:54.690 --> 00:01:59.280
date fields being different, for example. Do we do day/month/year,

00:01:59.280 --> 00:02:00.390
month/day/year,

00:02:00.390 --> 00:02:03.980
year/month/day? It's very difficult sometimes when we see

00:02:03.980 --> 00:02:07.660
something like 12/3 to know whether or not that's the 12th

00:02:07.660 --> 00:02:10.919
of March or the 3rd of December, for example.
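
NOTE
A minimal sketch of the date ambiguity described above, assuming each source
system declares which layout it uses. The system names and the SOURCE_FORMATS
table are hypothetical, not part of the lecture.
```python
from datetime import datetime
# Hypothetical table: the layout each source system is assumed to use.
SOURCE_FORMATS = {"eu_system": "%d/%m/%Y", "us_system": "%m/%d/%Y"}
def to_iso(raw: str, source: str) -> str:
    """Parse a date with the source system's declared layout; return ISO 8601."""
    parsed = datetime.strptime(raw, SOURCE_FORMATS[source]).date()
    return parsed.isoformat()
# The same string means two different days depending on where it came from.
print(to_iso("12/3/2024", "eu_system"))  # 2024-03-12
print(to_iso("12/3/2024", "us_system"))  # 2024-12-03
```
Storing the year/month/day result instead of the raw string removes the guesswork downstream.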

00:02:11.440 --> 00:02:17.910
So normalization is often used to try to keep data in a standard format,

00:02:17.920 --> 00:02:22.710
but also to prevent things like duplication of data when we

00:02:22.710 --> 00:02:25.990
look at something like a database as well.

00:02:25.990 --> 00:02:29.260
When we look at standard formats,

00:02:29.460 --> 00:02:34.720
it's a mistake that even I made years ago working on a system where I saw

00:02:34.720 --> 00:02:37.680
that customer account numbers were always seven digits,

00:02:37.680 --> 00:02:42.050
and so I built my application for a seven‑digit account number.

00:02:42.540 --> 00:02:46.650
But I didn't realize that the standard for the organization was nine,

00:02:47.240 --> 00:02:51.690
and that is where it would have been better if we had a data dictionary

00:02:51.690 --> 00:02:57.140
that said this is the correct format for that type of field so we don't

00:02:57.150 --> 00:03:01.600
end up with a problem sometime in the future where on one system it's

00:03:01.600 --> 00:03:03.720
defined differently than on another.
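
NOTE
One way a data dictionary could catch the seven-versus-nine digit mistake is
sketched here. The DATA_DICTIONARY structure and the field name are assumptions
made for illustration only.
```python
import re
# Hypothetical data dictionary: one shared definition per field.
DATA_DICTIONARY = {
    "customer_account_number": {"pattern": r"[0-9]{9}", "description": "nine digits"},
}
def is_valid(field: str, value: str) -> bool:
    """Check a value against the organization-wide definition of the field."""
    return re.fullmatch(DATA_DICTIONARY[field]["pattern"], value) is not None
print(is_valid("customer_account_number", "1234567"))    # False: seven digits
print(is_valid("customer_account_number", "001234567"))  # True: nine digits
```
Every application that validates against the same entry ends up agreeing on the format.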

00:03:05.840 --> 00:03:11.150
When we look at data, we, of course, have both structured and unstructured data.

00:03:11.740 --> 00:03:15.050
Structured data is something with a defined structure,

00:03:15.050 --> 00:03:20.340
a database schema, for example, that says this is the address field,

00:03:20.340 --> 00:03:22.170
it's 30 characters long,

00:03:22.170 --> 00:03:26.880
it's alphanumeric, this is a date field, and this is the format of

00:03:26.880 --> 00:03:30.790
that, very defined structures and easy to organize,

00:03:30.790 --> 00:03:33.330
for example, in a relational database.

00:03:33.840 --> 00:03:38.160
We can also set up an index based on defined keys,

00:03:38.160 --> 00:03:43.740
the primary key, the foreign keys, so it's easy to be able to look up,

00:03:43.740 --> 00:03:44.000
say,

00:03:44.000 --> 00:03:48.280
a certain customer record based on the defined field

00:03:48.280 --> 00:03:49.860
of the customer account number.
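
NOTE
A small sketch of structured data using Python's built-in sqlite3 module with
an in-memory database. The table and column names are hypothetical; the
30-character address limit follows the example in the talk.
```python
import sqlite3
conn = sqlite3.connect(":memory:")
# The schema defines each field, its type, and the keys that link the tables.
conn.executescript("""
    CREATE TABLE customer (
        account_number TEXT PRIMARY KEY,
        address        TEXT CHECK (length(address) <= 30)
    );
    CREATE TABLE invoice (
        invoice_id     INTEGER PRIMARY KEY,
        account_number TEXT REFERENCES customer(account_number),
        issued_on      TEXT
    );
""")
conn.execute("INSERT INTO customer VALUES (?, ?)", ("001234567", "1 Example Street"))
conn.execute("INSERT INTO invoice VALUES (?, ?, ?)", (1, "001234567", "2024-03-12"))
def find_customer(account_number: str):
    """Look up one customer row by its primary key."""
    query = "SELECT account_number, address FROM customer WHERE account_number = ?"
    return conn.execute(query, (account_number,)).fetchone()
print(find_customer("001234567"))
```
Because the account number is the primary key, the lookup uses the index that the database maintains for it.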

00:03:50.640 --> 00:03:56.300
We, though, also combine many different data sources together into,

00:03:56.300 --> 00:03:58.520
for example, a data warehouse.

00:03:58.620 --> 00:04:03.590
A data warehouse takes input from many different data sources

00:04:03.810 --> 00:04:06.960
and combines or aggregates it together.

00:04:07.340 --> 00:04:11.840
And this is where we'll often do normalization as well to make sure that the

00:04:11.840 --> 00:04:17.440
data that's coming in is all in the same, correct format so that we

00:04:17.440 --> 00:04:21.060
don't end up corrupting the data in our data warehouse.

00:04:21.060 --> 00:04:26.980
Structured data is easy to do analysis of and to be able to look for trends,

00:04:26.980 --> 00:04:30.040
patterns, and deviations, for example.

00:04:30.040 --> 00:04:33.760
But the problem is a lot of our life is unstructured,

00:04:33.760 --> 00:04:36.960
and we have unstructured data, which is undefined.

00:04:37.540 --> 00:04:40.060
Take an email, for example;

00:04:40.440 --> 00:04:44.000
an email is a good example of semi‑structured data

00:04:44.000 --> 00:04:47.410
because you have the metadata: who it's to,

00:04:47.410 --> 00:04:51.950
from, the date, and so on, but then you have the body of the email,

00:04:51.950 --> 00:04:54.850
which is just a jumbled collection of words.

00:04:54.850 --> 00:05:00.360
And those are unstructured, so the body of the email is unstructured data.
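
NOTE
The split between email metadata and body can be sketched with Python's
standard email package. The message text and the addresses are made up for
illustration.
```python
from email import message_from_string
RAW = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Tue, 12 Mar 2024 09:00:00 +0000\n"
    "Subject: Lunch\n"
    "\n"
    "Are you free for lunch on Thursday? I found a great new place.\n"
)
msg = message_from_string(RAW)
# Structured part: named header fields that can be queried directly.
headers = {name: msg[name] for name in ("From", "To", "Date", "Subject")}
# Unstructured part: free text with no schema, here just split into words.
body = msg.get_payload()
print(headers)
print(body.split())
```
The headers map cleanly onto fields, while the body needs text analysis before it yields anything comparable.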

00:05:00.840 --> 00:05:06.200
And some of our good analysis and analytics today is based

00:05:06.200 --> 00:05:08.900
on doing analysis of individual words,

00:05:08.900 --> 00:05:13.550
but also individual words within their context and within

00:05:13.550 --> 00:05:16.060
the content of where they're being used.

00:05:16.540 --> 00:05:20.760
This works a lot towards things like advertising, for example.

00:05:21.440 --> 00:05:23.080
We also have, of course,

00:05:23.080 --> 00:05:27.570
in today's world a type of data warehouse we know as big data,

00:05:27.570 --> 00:05:31.600
the collection of massive amounts of unstructured data

00:05:31.680 --> 00:05:35.240
rather than just relational databases, and so on,

00:05:35.240 --> 00:05:38.150
as we had with the former data warehouses.

00:05:39.840 --> 00:05:42.280
When we look at semi‑structured data,

00:05:42.410 --> 00:05:45.540
it's data that does not have a rigid schema that says

00:05:45.550 --> 00:05:48.760
this is a name field, this is a numeric field,

00:05:48.760 --> 00:05:53.490
for example. And for that reason, it doesn't fit well into that

00:05:53.490 --> 00:05:57.690
structured format necessary for a relational database.

00:05:58.240 --> 00:06:02.730
Examples of this can be something like XML, where with Extensible

00:06:02.730 --> 00:06:08.590
Markup Language, we have a field and then define what

00:06:08.590 --> 00:06:13.620
that field represents so that it can then be moved from an

00:06:13.620 --> 00:06:16.560
unstructured form, hopefully, into a structured one.

00:06:16.940 --> 00:06:21.190
Other examples of this are comma‑delimited fields,

00:06:21.190 --> 00:06:25.440
CSV files, and emails as we looked at before,

00:06:25.440 --> 00:06:29.660
a combination of both structured and unstructured data.
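
NOTE
A rough sketch of reading the same record from XML and from CSV with the
Python standard library. The tag names, column names, and values are
assumptions made for illustration.
```python
import csv
import io
import xml.etree.ElementTree as ET
XML_TEXT = "<customer><name>Ada Example</name><account>001234567</account></customer>"
CSV_TEXT = "name,account\nAda Example,001234567\n"
# XML: each tag says what its value represents.
from_xml = {child.tag: child.text for child in ET.fromstring(XML_TEXT)}
# CSV: the header row names the columns; every value arrives as plain text.
from_csv = next(csv.DictReader(io.StringIO(CSV_TEXT)))
print(from_xml)
print(from_csv)
```
Neither format enforces types or lengths, so any rules about the fields still have to be applied when the data is loaded into a structured store.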

00:06:31.440 --> 00:06:34.700
We need to know when we have intellectual property to

00:06:34.700 --> 00:06:39.030
ensure that those intellectual property pieces of data are

00:06:39.030 --> 00:06:41.160
also appropriately protected.

00:06:41.740 --> 00:06:46.860
Things like patents, copyrights, trademarks, and trade secrets.

00:06:47.340 --> 00:06:48.790
So when we're looking at data,

00:06:48.790 --> 00:06:52.400
it's not just about privacy of personal information,

00:06:52.460 --> 00:06:55.210
but often it's about things like secrecy of maybe

00:06:55.210 --> 00:06:57.390
some research we're dealing with,

00:06:57.400 --> 00:07:01.840
protection of our trademarks and our copyrights to make sure

00:07:01.840 --> 00:07:05.160
that nobody else is just stealing our ideas,

00:07:05.540 --> 00:07:08.900
but this requires us to identify the things that are

00:07:08.900 --> 00:07:11.220
important that need to be protected,

00:07:11.220 --> 00:07:15.190
and, of course, then to provide them with the appropriate level of protection.

00:07:16.890 --> 00:07:18.520
The key points review.

00:07:18.520 --> 00:07:23.020
We don't want to spend a lot of money to protect something of little value,

00:07:23.020 --> 00:07:27.950
so the protection we give to our data is based on its value.

00:07:28.540 --> 00:07:33.490
But this requires that we clearly know who makes that determination of value,

00:07:33.810 --> 00:07:35.230
the owner,

00:07:35.230 --> 00:07:41.470
what are the levels of classification and handling for that data, and to

00:07:41.470 --> 00:07:45.950
make sure we also protect intellectual property as well.
