Data glossary terms
A
ABORTED TRANSACTION
A transaction in progress that terminates abnormally.
ABSTRACT CLASS
A class that has no direct instances but whose descendants may have direct instances.
ABSTRACT OPERATION
An operation whose form or protocol is defined but whose implementation is not defined.
ACCESS CONTROL
The means and mechanisms of managing access to and use of resources by users. There are three primary forms of access control: DAC, MAC, and RBAC.
ACCESS CONTROL (DAC)
A DAC (Discretionary Access Control) manages access through on-object ACLs (Access Control Lists), which indicate which users have been granted, or denied, specific privileges or permissions on that object.
ACCESS CONTROL (MAC)
A MAC (Mandatory Access Control) restricts access by assigning each subject and object a classification level label; resource use is then controlled by limiting access to those subjects with equal or superior labels to that of the object.
ACCESS CONTROL (RBAC)
An RBAC (Role-Based Access Control) controls access through the use of job labels (roles), which have been assigned the permissions and privileges needed to accomplish the related job tasks.
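As a minimal sketch of the role-based idea, the check below grants a permission only through a role, never directly to a user. The role names, user names, and permission strings are all hypothetical and chosen only for illustration.

```python
# Hypothetical roles and permissions, invented for illustration only.
ROLE_PERMISSIONS = {
    "analyst": {"read_report"},
    "engineer": {"read_report", "run_pipeline"},
    "admin": {"read_report", "run_pipeline", "manage_users"},
}

# Users are assigned roles (job labels), never individual permissions.
USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"engineer", "admin"},
}


def is_allowed(user: str, permission: str) -> bool:
    """Return True if any role held by the user grants the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed("alice", "read_report"))   # True
print(is_allowed("alice", "manage_users"))  # False
print(is_allowed("bob", "manage_users"))    # True
```

Unknown users hold no roles in this sketch, so they are denied everything by default.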
ACCESSOR METHOD
A method that provides other objects with access to the state of an object.
AFTERIMAGE
A copy of a record, or page of memory, after it has been modified.
AGGREGATE
Statistical summaries of data, meaning that the data have been analyzed in some way. The Data-Planet repository is an excellent resource for obtaining aggregated data.
AGGREGATION
A part-of relationship between a component object and an aggregate object. Also, the process of transforming data from a detailed level to a summary level.
AGILE SOFTWARE DEVELOPMENT
An approach to database and software development that emphasizes 'individuals and interactions over processes and tools, working software over comprehensive documentation, and customer collaboration over contract negotiation'.
ALGORITHM
A procedure or formula for solving a problem based on conducting a sequence of specified actions. In the context of big data, an algorithm refers to a mathematical formula embedded in the software to perform an analysis on a set of data.
ALIAS
An alternative name used for an attribute.
ANOMALY
An error or inconsistency that may result when a user attempts to update a table that contains redundant data. The three types of anomalies are insertion, deletion, and modification anomalies.
ANTI-VIRUS (ANTI-MALWARE) SOFTWARE
A security program designed to monitor a system for malicious software. Once malware is detected, the AV program will attempt to remove the item from the system or may simply quarantine the file for further analysis by an administrator.
APPLICATION PARTITIONING
The process of assigning portions of application code to client or server partitions after it is written, to achieve better performance and interoperability (the ability of a component to function on different platforms).
APPLICATION PROGRAM INTERFACE (API)
Sets of routines that an application program uses to direct the performance of procedures by the computer’s operating system.
APT (ADVANCED PERSISTENT THREAT)
A security breach that enables an attacker to gain access to or control over a system for an extended period of time, usually without the owner of the system being aware of the violation. An APT often takes advantage of numerous unknown vulnerabilities.
ARTIFICIAL INTELLIGENCE (AI)
The ability of a machine to mimic the capabilities of the human mind, such as learning from examples and experience, recognizing objects, understanding and responding to language, making decisions, and solving problems.
AUGMENTED INTELLIGENCE
A human-centred partnership that brings people and AI together to enhance cognitive performance, including learning, decision making, and new experiences.
AUTHENTICATION
The process of proving that an individual is who they claim to be. Authentication is the first element of the AAA services concept, which includes Authentication, Authorization, and Accounting. Authentication occurs after the initial step of identification.
AUTHORIZATION
The security mechanism determining and enforcing what authenticated users are authorized to do within a computer system. The dominant forms of authorization are DAC, MAC and RBAC.
B
BACK END
The back end is all of the code and technology that works behind the scenes to populate the front end with useful information. This includes databases, servers, authentication procedures, and much more.
BACKING UP
Creating a duplicate copy of data onto a separate physical storage device or online/cloud storage solution. A backup is the only insurance against data loss. With a backup, damaged or lost data files can be restored.
BCP (BUSINESS CONTINUITY PLANNING/MANAGEMENT)
A business management plan used to resolve issues that threaten core business tasks. The goal of BCP is to prevent the failure of mission-critical processes when they have been harmed by a breach or accident.
BEHAVIOR MONITORING
Recording the events and activities of a system and its users. The recorded events are compared against security policy and behavioural baselines to evaluate compliance and/or discover violations.
BIG DATA
A term that describes so much data that you need to take careful steps to avoid week-long script runtimes. Strategies and tools that help computers do complex analysis of very large data sets.
BLACKLIST
A security mechanism prohibiting the execution of programs that appear on a list of known malicious or undesired software. The blacklist is a list of specific files known to be malicious or otherwise unwanted.
BLOCK CIPHER
A type of symmetric encryption algorithm that divides data into fixed-length sections and then performs the encryption or decryption operation on each block. The action of dividing a data set enables the algorithm to encrypt data of any size.
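The block-division step alone can be sketched as follows. This shows no real encryption: it only pads the input and splits it into fixed-length blocks, which is why input of any size can be handled. The 16-byte block size and the PKCS#7-style padding are assumptions made for the sketch, not details given in this glossary.

```python
BLOCK_SIZE = 16  # bytes; an assumed block length for this sketch


def pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append PKCS#7-style padding so the length is a multiple of block_size."""
    pad_len = block_size - (len(data) % block_size)
    return data + bytes([pad_len]) * pad_len


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Pad the data, then divide it into fixed-length blocks."""
    padded = pad(data, block_size)
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]


blocks = split_into_blocks(b"a message of arbitrary length")
# Every block has the same length, whatever the input size was.
print(len(blocks), [len(block) for block in blocks])
```

A real block cipher would then apply its keyed encryption operation to each block in turn.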
BOT/BOTNET
A software application or script that performs tasks on command, allowing an attacker to take control remotely of a computer. A collection of these infected computers is known as a 'botnet' and is controlled by the hacker or 'bot-herder'.
BOTNET
A collection of computers that have been compromised by malicious code to run a remote control agent granting an attacker the ability to remotely take advantage of the system's resources to perform illicit or criminal actions.
BREACH
The moment a hacker successfully exploits a vulnerability in a computer or device and gains access to its files and network.
BUG
An error or mistake in software coding or hardware design or construction. A bug represents a flaw or vulnerability in a system discoverable by attackers and used as a point of compromise.
BUSINESS INTELLIGENCE (BI)
The discipline of analyzing and transforming data to extract valuable business insights to enable decision-making. Today, BI is typically used to refer to descriptive analysis and reporting. It’s descriptive, rather than predictive.
BYOD (BRING YOUR OWN DEVICE)
Refers to a company security policy that allows for employees’ personal devices to be used in business. A BYOD policy sets limitations and restrictions on whether or not a personal phone or laptop can be connected over the corporate network.
C
CATEGORICAL VARIABLE
A variable that distinguishes among subjects by putting them in categories. Also called discrete or nominal variables.
CIPHERTEXT
The unintelligible and seemingly random form of data that is produced by the cryptographic function of encryption. Ciphertext is produced by a symmetric algorithm when a data set is transformed by the encryption process using a selected key.
CLASSIFICATION
A supervised machine learning problem. It deals with categorizing a data point based on its similarity to other data points. You take a set of data where every item already has a category and look at common traits between each item.
CLICKJACKING
A malicious technique by which a victim is tricked into clicking on a URL, button or another screen object other than that intended by or perceived by the user. Clickjacking can be performed in many ways.
CLOUD
A technology that allows us to access our files and/or services through the internet from anywhere in the world. Technically speaking, it’s a collection of computers with large storage capabilities that remotely serve requests.
CLOUD COMPUTING
Anything that involves delivering hosted services over the internet. For big data practitioners, cloud computing is important because their roles involve accessing and interfacing with software and/or data hosted and running on remote servers.
CLUSTERING (TECHNIQUES)
An attempt to collect and categorize sets of points into groups that are 'sufficiently similar', or 'close' to one another. Complexity increases as more features are added to a problem space.
CND (COMPUTER NETWORK DEFENSE)
The establishment of a security perimeter and internal security requirements to defend a network against cyberattacks, intrusions and other violations. A CND is defined by a security policy and can be stress tested.
COLLECTIVE INTELLIGENCE
A group’s combined capability to perform various tasks and solve diverse problems. Businesses can enable this by collaboration, collective efforts, and competition of many individuals in consensus decision-making.
CORRELATION
Correlation is the measure of how much one set of values depends on another. If values increase together, they are positively correlated. If the values in one set increase as the values in the other decrease, they are negatively correlated.
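One common way to put a number on this is the Pearson correlation coefficient, which ranges from -1 (perfectly negative) to +1 (perfectly positive). The sketch below is a plain-Python illustration; the sample lists are made up, and Pearson's coefficient is only one of several correlation measures.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unscaled because the scale cancels out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # increases together with hours
falling = [10, 8, 6, 4, 2]  # decreases as hours increases

print(pearson(hours, rising))   # close to +1: positively correlated
print(pearson(hours, falling))  # close to -1: negatively correlated
```

The function raises a division error if either list is constant, since a constant list has no variation to correlate.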
CRACKER
The proper term for an unauthorized attacker of computers, networks and technology, as opposed to the misused term 'hacker'. However, this term is not as widely used in the media; thus, the term hacker has become more prominent despite its misuse.
CRITICAL INFRASTRUCTURE
The physical or virtual systems and assets that are vital to an organization or country. If an organization's mission-critical processes are interrupted, this could result in the organization ceasing to exist.
CRYPTOGRAPHY
The application of mathematical processes on data-at-rest and data-in-transit to provide the security benefits of confidentiality, authentication, integrity and non-repudiation.
CRYPTOGRAPHY (SYMMETRIC ENCRYPTION)
Symmetric encryption is used to provide confidentiality.
CRYPTOGRAPHY (ASYMMETRIC ENCRYPTION)
Asymmetric encryption is used to provide secure symmetric key generation, secure symmetric key exchange, verification of source, verification/control of recipient, digital signatures and digital certificates.
CRYPTOGRAPHY (HASHING)
Hashing is the cryptographic operation that produces a representational value from an input data set. A before and after hash can be compared to detect protection of or violation of integrity.
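The before-and-after comparison can be sketched with Python's standard `hashlib` module. SHA-256 is used here as an assumed choice of hash function, and the sample data is invented.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


original = b"quarterly totals: 1200, 1350, 1425"
before = sha256_hex(original)

# Later, hash the data again and compare it with the stored value.
unchanged = sha256_hex(b"quarterly totals: 1200, 1350, 1425")
tampered = sha256_hex(b"quarterly totals: 1200, 1350, 1426")

print(before == unchanged)  # True: integrity preserved
print(before == tampered)   # False: the data was modified
```

Even a one-character change produces a different hash, which is what makes the comparison useful for detecting modification.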
CVE (COMMON VULNERABILITIES AND EXPOSURES)
An online database of attacks, exploits and compromises operated by the MITRE organization for the benefit of the public. It includes any attacks and abuses known for any type of computer system or software product.
CYBER ECOSYSTEM
The collection of computers, networks, communication pathways, software, data and users that comprise either a local private network or the worldwide Internet. It is the digital environment within which software operates and data is managed.
CYBER TEAMS
Groups of professional or amateur penetration testing specialists who are tasked with evaluating and potentially improving the security stance of an organization. Common cyber teams include the red, blue and purple/white teams.
CYBERATTACK
Any attempt to violate the security perimeter of a logical environment. An attack can focus on gathering information, damaging business processes, exploiting flaws, monitoring targets, interrupting business tasks, extracting value, and more.
CYBERESPIONAGE
The unethical act of violating the privacy and security of an organization to leak data or disclose internal/private/confidential information. It is performed for the direct purpose of causing harm to the violated entity in order to benefit others.
CYBERSECURITY
The efforts to design, implement, and maintain security for a network, which is connected to the Internet. It is a combination of logical/technical, physical and personnel focused countermeasures, safeguards and security controls.
D
DATA
Fundamentally, data=information. We typically use the term to refer to numeric files that are created and organized for analysis. There are two types of data: aggregate and microdata.
DATA AGGREGATION
A collection of data points and datasets.
DATA ANALYSIS (ANALYTICS)
This discipline is the little brother of data science. Data analysis is focused more on answering questions about the present and the past. It uses less complex statistics and generally tries to identify patterns that can improve an organization.
DATA BREACH
The occurrence of disclosure of confidential information, access to confidential information, destruction of data assets or abusive use of a private IT environment. A data breach results in internal data being made accessible to external entities.
DATA CONSUMPTION
The presentation of insights in a form that aids understanding and action. It is often achieved by adopting analytics techniques to identify insights and data visualization techniques to present the insights.
DATA CULTURE
The values, behaviour, and norms shared by most individuals within an organization regarding data-related issues. Broadly, it refers to the ability of an organization to use data for informed decision-making.
DATA ENGINEERING
Data engineering is all about the back end. These are the people that build systems to make it easy for data scientists to do their analysis. In smaller teams, a data scientist may also be a data engineer.
DATA EXPLORATION
The part of the data science process where a scientist will ask basic questions that help her understand the context of a data set. What you learn during the exploration phase will guide more in-depth analysis later.
DATA FABRIC
An architecture and set of data services that provide consistent capabilities, integrating data management across the cloud and on-premises to accelerate digital transformation.
DATA GOVERNANCE
A framework and a set of practices to help all stakeholders across an organization identify and meet their information needs.
DATA INTEGRITY
A security benefit that verifies data is unmodified and therefore original, complete and intact. Integrity is verified through the use of cryptographic hashing.
DATA JOURNALISM
This discipline is all about telling interesting and important stories with a data-focused approach. It has come about naturally with more information becoming available as data. A story may be about the data or informed by data.
DATA LAKE
A storage repository that holds a vast amount of raw data in its native format until it's required. Every data element within a data lake is assigned a unique identifier and set of extended metadata tags.
DATA MINING
A process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. This term was coined around 1990 and quickly became a buzzword.
DATA PIPELINES
A collection of scripts or functions that pass data along in a series. The output of the first method becomes the input of the second. This continues until the data is appropriately cleaned and transformed for whatever task a team is working on.
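A pipeline of this kind can be sketched as a list of functions applied in order, each receiving the previous function's output. The step names and sample rows below are hypothetical.

```python
from functools import reduce
from typing import Callable

Step = Callable[[list[str]], list[str]]


def strip_whitespace(rows: list[str]) -> list[str]:
    return [row.strip() for row in rows]


def drop_empty(rows: list[str]) -> list[str]:
    return [row for row in rows if row]


def to_lowercase(rows: list[str]) -> list[str]:
    return [row.lower() for row in rows]


def run_pipeline(data: list[str], steps: list[Step]) -> list[str]:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, data)


raw = ["  Alice ", "", "BOB", "   "]
cleaned = run_pipeline(raw, [strip_whitespace, drop_empty, to_lowercase])
print(cleaned)  # ['alice', 'bob']
```

Because each step has the same input and output type, steps can be reordered, removed, or added without changing `run_pipeline`.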
DATA POINT (DATUM)
Singular data. Refers to a single point of data.
DATA SCIENCE
The discipline of applying advanced analytics techniques to extract valuable information from data for business decision-making and strategic planning.
DATA SET
A collection of related, discrete items of data that may be accessed individually or collectively, or managed as a single, holistic entity. Data sets are generally organized into some formal structure, often in a tabular format.
DATA STORYTELLING
The practice of building a narrative around data and its accompanying visualizations to help convey context and the meaning of data in a compelling fashion.
DATA THEFT
The act of intentionally stealing data. Data theft can occur via data loss or data leakage events. Data loss occurs when a storage device is lost or stolen. Data leakage occurs when copies of data are possessed by unauthorized entities.
DATA VISUALIZATION
The discipline of information design. It refers to the graphical representation of information using visual elements such as charts, graphs, and maps. The intent is to enable decision-making with the appropriate representation of insights.
DATA WAREHOUSE
A data warehouse is a system used to do a quick analysis of business trends using data from many sources. Data warehouses are designed to make it easy for people to answer important statistical questions without a PhD in database architecture.
DATA WRANGLING (MUNGING)
The process of taking data in its original form and 'taming' it until it works better in a broader workflow or project. Taming means making values consistent with a larger data set, replacing or removing values that might affect analysis later.
DATABASE
A collection of data organized for research and retrieval. Example: American Community Survey.
DATABASE MANAGEMENT SYSTEM (DBMS)
System software that serves as an interface between databases and end-users or application programs, ensuring that data is consistently organized and remains easily accessible.
DDOS (DISTRIBUTED DENIAL OF SERVICE) ATTACK
A form of cyber-attack that attempts to block access to and use of a resource. It is a violation of availability. DDOS (or DDoS) is a variation of the DoS attack (see DOS) and can include flooding attacks, connection exhaustion, and resource demand.
DECISION INTELLIGENCE
The discipline of turning information into organizational decisions at scale. This is achieved by applying data science within the context of a business problem, bringing together managerial science and social science disciplines.
DECISION SUPPORT SYSTEMS (DSS)
An information system that supports organizational decision-making activities. This field saw a lot of research in the 1970s, and it saw rapid growth over the next few decades.
DECISION TREES
This machine learning method uses a line of branching questions or observations about a given data set to predict a target value. They tend to over-fit models as data sets grow large.
DECRYPT
The act of transforming ciphertext back into its original plaintext or cleartext form. Ciphertext is produced by a symmetric encryption algorithm when a data set is transformed by the encryption process using a selected key.
DEEP LEARNING
A part of the machine learning discipline that is based on artificial neural networks, which are inspired by the structure of the human brain. It learns from vast amounts of data and is particularly good at finding patterns.
DEEPFAKE
An audio or video clip that has been edited and manipulated to seem real or believable. Deepfakes can easily convince people to believe a certain story or theory, which may lead to behaviour with significant political or financial impact.
DESCRIPTIVE ANALYTICS
The examination of data or content to answer the question 'What happened?' It is typically characterized by traditional business intelligence (BI) and data visualization.
DESIGN AND VISUALIZATION
An important part of data stories. They help users to identify insights quickly.
DIAGNOSTIC ANALYTICS
A form of advanced analytics that examines data to answer the question 'Why did it happen?' You can achieve it with the help of techniques such as data mining, statistics, and machine learning.
DIGITAL CERTIFICATE
A means by which to prove identity or provide authentication, commonly using a trusted third-party entity known as a certificate authority. A digital certificate is based on the X.509 v3 standard.
DIGITAL FORENSICS
The means of gathering digital information to be used as evidence in a legal procedure. Digital forensics focuses on gathering, preserving and analyzing fragile and volatile data from a computer system and/or network.
DLP (DATA LOSS PREVENTION)
A collection of security mechanisms that aim at preventing the occurrence of data loss and/or data leakage. Data loss occurs when a storage device is lost or stolen while data leakage occurs when data is possessed by unauthorized entities.
DMZ (DEMILITARIZED ZONE)
A segment or subnet of a private network where resources are hosted and accessed by the general public from the Internet. The DMZ is isolated from the private network using a firewall.
DOMAIN
A group of computers, printers and devices that are interconnected and governed as a whole. For example, your computer is usually part of a domain at your workplace.
DOS (DENIAL OF SERVICE)
An attack that attempts to block access to and use of a resource. It is a violation of availability. DOS (or DoS) attacks include flooding attacks, connection exhaustion and resource demand.
DRIVE-BY DOWNLOAD
A type of web-based attack that automatically occurs based on the simple act of visiting a malicious or compromised/poisoned Web site. It takes advantage of the default nature of a Web browser to execute mobile code, most often JavaScript.
E
EAVESDROPPING
The act of listening in on a transaction, communication, data transfer or conversation. Eavesdropping can be used to refer to both data packet capture on a network link and audio recording using a microphone.
ENCODE
The act of transforming plaintext or cleartext into ciphertext. Ciphertext is produced by a symmetric encryption algorithm when a data set is transformed by the encryption process using a selected key.
ENCRYPTION
The process of encoding data to prevent theft by ensuring the data can only be accessed with a key.
ENCRYPTION KEY
The secret number value used by a symmetric encryption algorithm to control the encryption and decryption process. A key is a number defined by its length in binary digits. Generally, the longer the key length, the more security it provides.
ETL (EXTRACT, TRANSFORM, LOAD)
This process is key to data warehouses. It describes the three stages of bringing data from numerous places in a raw form to a screen, ready for analysis. ETL systems are generally gifted to us by data engineers and run behind the scenes.
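The three stages can be sketched end to end with Python's standard library. The CSV content, column names, and table name are invented for illustration, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data.
RAW_CSV = """city,temp_f
Boston,50
Austin,86
"""


def extract(text: str) -> list[dict]:
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: convert Fahrenheit to Celsius and fix the types."""
    return [(r["city"], round((float(r["temp_f"]) - 32) * 5 / 9, 1)) for r in rows]


def load(records: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS temps (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO temps VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT city, temp_c FROM temps ORDER BY city").fetchall()
print(rows)  # [('Austin', 30.0), ('Boston', 10.0)]
```

Once loaded, the data can be queried for analysis without touching the raw source again.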
EXPLOIT
A malicious application or script that can be used to take advantage of a computer’s vulnerability.
F
FEATURE ENGINEERING
The process of taking knowledge and translating it into a quantitative value that a computer can understand. We can translate our visual understanding of the image of a mug into a representation of pixel intensities.
FEATURE SELECTION
The process of identifying what traits of a data set are going to be the most valuable when building a model. Helpful with large data sets, as using fewer features will decrease the amount of time and complexity involved in testing a model.
FIELDS OF FOCUS
As businesses become more data-focused, new opportunities open up for people of various skill sets to become part of the data community. These are some of the areas of specialization that exist within the data science realm.
FIREWALL
A security tool, which may be a hardware or software solution, that is used to filter network traffic. A firewall is based on an implicit deny stance where all traffic is blocked by default.
FRONT END
The front end is everything a client or user gets to see and interact with directly. This includes data dashboards, web pages, and forms.
FUZZY ALGORITHMS
Algorithms that use fuzzy logic to decrease the runtime of a script. Fuzzy algorithms tend to be less precise than those that use Boolean logic. They also tend to be faster, and computational speed sometimes outweighs the loss in precision.
FUZZY LOGIC
An abstraction of Boolean logic that substitutes a range of values between 0 and 1 for the usual True and False. That is, fuzzy logic allows statements like 'a little true' or 'mostly false'.
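A minimal sketch of fuzzy truth values, assuming the widely used min/max operators for AND and OR (other operator definitions exist). The truth values assigned to `warm` and `humid` are invented.

```python
def fuzzy_not(a: float) -> float:
    """Degree to which a statement is false."""
    return 1.0 - a


def fuzzy_and(a: float, b: float) -> float:
    """Both statements hold only as strongly as the weaker one."""
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    """At least one statement holds as strongly as the stronger one."""
    return max(a, b)


warm = 0.75   # 'mostly true'
humid = 0.25  # 'a little true'

print(fuzzy_and(warm, humid))  # 0.25
print(fuzzy_or(warm, humid))   # 0.75
print(fuzzy_not(warm))         # 0.25
```

With inputs restricted to exactly 0 and 1, these operators give the same results as Boolean AND, OR, and NOT.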
G
GREEDY ALGORITHMS
A greedy algorithm will break a problem down into a series of steps. It will then look for the best possible solution at each step, aiming to find the best overall solution available.
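Coin change is a classic illustration: at each step, take the largest coin that still fits. The denominations below are assumptions chosen for the example. Note that a greedy choice at each step does not always produce the best overall solution, as the second call shows.

```python
def greedy_change(amount: int, coins: list[int]) -> list[int]:
    """Make change by always taking the largest coin that still fits."""
    result = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            result.append(coin)
    if amount != 0:
        raise ValueError("amount cannot be made with these coins")
    return result


# With these denominations the greedy result is also the optimal one.
print(greedy_change(63, [25, 10, 5, 1]))  # [25, 25, 10, 1, 1, 1]

# Here greedy uses three coins, although 3 + 3 would need only two.
print(greedy_change(6, [4, 3, 1]))  # [4, 1, 1]
```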
H
HACKER
A person who has knowledge and skill in analyzing program code or a computer system, modifying its functions or operations and altering its abilities and capabilities. A hacker may be ethical and authorized or may be malicious and unauthorized.
HACKTIVISM
Attackers who hack for a cause or belief rather than some form of personal gain. Hacktivism is often viewed by attackers as a form of protest or fighting for their perceived 'right' or 'justice'.
HADOOP
An open-source distributed processing framework that manages data processing and storage for big data applications. It provides a reliable means for managing pools of big data and supporting related analytics applications.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
The primary data storage system used by Hadoop. HDFS employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
HAT (BLACK)
Hackers who break into a network without consent to steal information that will be used to harm the owner or the users. It’s entirely illegal.
HAT (WHITE HAT / BLACK HAT)
When speaking in cyber security terms, the differences in hacker 'hats' refer to the intention of the hacker.
HAT (WHITE)
A hacker who breaches the network to gain sensitive information with the owner’s consent, making it completely legal. This method is usually employed to test infrastructure vulnerabilities.
HONEYPOT
A trap or decoy for attackers. A honeypot is used to distract attackers to prevent them from attacking actual production systems. It is a false system that is configured to look and function as a production system.
I
IAAS (INFRASTRUCTURE-AS-A-SERVICE)
A type of cloud computing service where the provider offers the customer the ability to craft virtual networks within their computing environment.
IDENTITY CLONING
A form of identity theft in which the attacker takes on the identity of a victim and then attempts to live and act as the stolen identity. Often performed to hide a criminal record of the attacker to obtain a job, credit or financial instrument.
IDENTITY FRAUD
A form of identity theft in which a transaction, typically financial, is performed using the stolen identity of another individual. The fraud is due to the attacker impersonating someone else.
IDS (INTRUSION DETECTION SYSTEM)
A security tool that attempts to detect the presence of intruders or the occurrence of security violations to notify administrators, enable more detailed or focused logging or even trigger a response such as disconnecting a session.
INDICATOR
Typically used as a synonym for statistics that describe something about the socio-economic environment of a society, e.g., per capita income, unemployment rate, median years of education.
INFORMATION DESIGN
The practice of presenting information in a way that fosters an efficient and effective understanding of the information.
INFORMATION SECURITY POLICY
A written account of the security strategy and goals of an organization. A security policy is usually comprised of standards, procedures (or SOPs – Standard Operating Procedures) and guidelines.
INSIDER THREAT
The likelihood or potential that an employee or another form of internal personnel may pose a risk to the stability or security of an organization. An insider has both physical access and logical access.
IP ADDRESS
An internet version of a home address for your computer, which is identified when it communicates over a network; for example, when connecting to the internet (a network of networks).
IPS (INTRUSION PREVENTION SYSTEM)
A security tool that attempts to detect the attempt to compromise the security of a target and then prevent that attack from becoming successful. Considered a more active security tool as it attempts to proactively respond to potential threats.
ISP (INTERNET SERVICE PROVIDER)
The organization that provides connectivity to the Internet for individuals or companies. Some ISPs offer additional services above that of just connectivity such as e-mail, web hosting and domain registration.
J
JBOH (JAVASCRIPT-BINDING-OVER-HTTP)
A form of Android-focused mobile device attack that enables an attacker to initiate the execution of arbitrary code on a compromised device. A JBOH attack often takes place or is facilitated through compromised or malicious apps.
K
KEYLOGGER
Any means by which the keystrokes of a victim are recorded as they are typed into the physical keyboard. A keylogger can be a software solution or a hardware device used to capture user data.
L
LAN (LOCAL AREA NETWORK)
An interconnection of devices, a network, that is contained within a limited geographic area, typically a single building. In a LAN, all of the network cables are controlled by the organization, unlike a WAN where the cables are owned by a third party.
LINK JACKING
A potentially unethical practice of redirecting a link to a middle-man or aggregator site or location rather than the original site the link seemed to indicate it was directed towards.
M
MACHINE LEARNING
A subset of the artificial intelligence (AI) discipline that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed.
MACHINE LEARNING TECHNIQUES
The field of machine learning has grown so large that there are now positions for Machine Learning Engineers. The terms below offer a broad overview of some common techniques used in machine learning.
MALWARE (MALICIOUS SOFTWARE)
Any code written for the specific purpose of causing harm, disclosing information or otherwise violating the security or stability of a system. Includes virus, worm, Trojan horse, logic bomb, backdoor, Remote Access Trojan (RAT), rootkit, ransomware and spyware/adware.
MANAGEMENT SCIENCE
The broad interdisciplinary study of problem-solving and decision-making in human organizations. It has strong linkages to fields such as management, economics, business, and management consulting.
MAPREDUCE
Specific tools that support distributed computing on large data sets. These form core components of the Apache Hadoop software framework.
MEAN (AVERAGE, EXPECTED VALUE)
A calculation that gives us a sense of a 'typical' value for a group of numbers. The mean is the sum of a list of values divided by the number of values in that list.
MEDIAN
In a set of values listed in order, the median is whatever value is in the middle. We often use the median along with the mean to judge if there are values that are unusually high or low in the set. This is an early hint to explore outliers.
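The mean and median can be compared directly with Python's standard `statistics` module. The sample values are invented and include one unusually high value to show how the two measures diverge.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; the last value is unusually high.
salaries = [40, 42, 45, 47, 50, 316]

# Mean: the sum of the values divided by how many there are.
manual_mean = sum(salaries) / len(salaries)
print(manual_mean)       # 90.0
print(mean(salaries))    # same value, from the standard library

# Median: the middle of the sorted values (the average of the two
# middle values when the count is even).
print(median(salaries))  # 46.0
```

The mean (90) sits far above the median (46), which is the early hint that the set contains an outlier worth exploring.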
MICRODATA
Individual response data obtained in surveys and censuses - these are data points directly observed or collected from a specific unit of observation. Also known as raw data. ICPSR is an excellent resource for obtaining microdata files.
MIS REPORTING (MANAGEMENT INFORMATION SYSTEMS)
The process of providing essential information to run the day-to-day business activities and monitor an organization’s progress. This usually refers to descriptive and operational reporting.
N
NATURAL LANGUAGE PROCESSING (NLP)
A computer program's ability to understand both written and spoken human language. A component of artificial intelligence, NLP has existed for over five decades and has roots in the field of linguistics.
NEURAL NETWORKS
A machine learning method that’s very loosely based on neural connections in the brain. Neural networks are a system of connected nodes that are segmented into layers: input, hidden, and output layers.
NORMALIZE
A set of data is said to be normalized when all of the values have been adjusted to fall within a common range. We normalize data sets to make comparisons easier and more meaningful.
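One common way to do this is min-max scaling, which maps the smallest value to 0 and the largest to 1. This is a sketch of that one technique, not the only way to normalize, and the handling of the all-equal case is an assumption.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values linearly so they fall in the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them all to 0.0 (an assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```

After scaling, two data sets measured in different units can be compared on the same 0-to-1 range.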
NOSQL
An approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, 'not only SQL', is an alternative to traditional relational databases.
NUMERICAL VARIABLE
Usually referring to a variable whose possible values are numbers. An example would be ‘Bank Prime Loan Rate’.
O
OBJECT-BASED IMAGE ANALYSIS
The analysis of digital images using groups of related pixels, called image objects, rather than individual pixels. It combines spectral, textural and contextual information to identify thematic classes in an image.
OUTLIER
An outlier is a data point that lies extremely far from the other points. Outliers are generally the result of exceptional cases or errors in measurement, and should always be investigated early in a data analysis workflow.
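How far is "extremely far" depends on the rule used. One common convention (an assumption here, not something this glossary prescribes) flags values that fall more than 1.5 times the interquartile range below the first quartile or above the third quartile. A minimal sketch, with a hypothetical function name and made-up data:

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the values that fall outside the two fences.
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 95]  # made-up measurements
print(find_outliers(readings))
```

Flagged values are candidates for investigation, not automatic deletions.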
OUTSIDER THREAT
The likelihood or potential that an outside entity, such as an ex-employee, competitor or even an unhappy customer, may pose a risk to the stability or security of an organization.
OUTSOURCING
The action of obtaining services from an external entity. Rather than performing certain tasks and functions internally, an organization that outsources takes advantage of external entities that can provide services for a fee.
OVERFITTING
Overfitting happens when a model considers too much information. It’s like asking a person to read a sentence while looking at a page through a microscope. The patterns that enable understanding get lost in the noise.
OWASP (OPEN WEB APPLICATION SECURITY PROJECT)
An Internet community focused on understanding web technologies and exploitations. Their goal is to help anyone with a website improve the security of their site through defensive programming, design and configuration.
P
PAAS (PLATFORM-AS-A-SERVICE)
A type of cloud computing service where the provider offers the customer the ability to operate custom code or applications. A PaaS operator determines which operating systems or execution environments are offered.
PACKET SNIFFING
The act of collecting frames or packets off of a data network communication. This activity allows the evaluation of the header contents as well as the payload of network communications.
PARTS OF A WORKFLOW
While every workflow is different, these are some of the general processes that data professionals use to derive insights from data.
PATCH
An update or change to an operating system or application. A patch is often used to repair flaws or bugs in deployed code as well as introduce new features and capabilities.
PATCH MANAGEMENT
The management activity related to researching, testing, approving and installing updates and patches to computer systems, which includes firmware, operating systems and applications.
PATTERN RECOGNITION
The ability to detect arrangements of characteristics or data that provide information about a given system or data set. Patterns could manifest as recurring sequences of data that can be used to predict trends.
PAYMENT CARD SKIMMERS
A malicious device used to read the contents of an ATM, debit or credit card when inserted into a POS (Point of Sale) payment system. A skimmer may be an internal component or external addition.
PEN-TESTING (PENETRATION)
A means of security evaluation where automated tools and manual exploitations are performed by security and attack experts. An advanced form of security assessment that should only be used by environments with mature security infrastructure.
PHISHING OR SPEAR PHISHING
A social engineering attack used by hackers to obtain sensitive information. Phishing attacks can take place over e-mail, text messages, through social networks or via smartphone apps.
PKI (PUBLIC KEY INFRASTRUCTURE)
A security framework (i.e. a recipe) for using cryptographic concepts in support of secure communications, storage and job tasks. A PKI solution is a combination of symmetric encryption, asymmetric encryption, hashing and digital certificate-based authentication.
POS (POINT OF SALE) INTRUSIONS
An attack that gains access to the POS (Point of Sale) devices at a retail outlet enabling an attacker to learn payment card information as well as other customer details. POS intrusions can affect brick-and-mortar or online retail locations.
PREDICTIVE ANALYTICS
A form of advanced analytics that examines data to answer the question 'What is likely to happen?' You can achieve it with the help of techniques such as machine learning and Artificial Intelligence (AI).
PROGRAMMING LANGUAGE
Used by developers and data scientists to perform big data manipulation and analysis. R, Python, and Scala are the three major languages for data science and data mining.
PYTHON
An interpreted, object-oriented programming language that has gained popularity among big data professionals due to its readability and clarity of syntax. Python is easy to learn and highly portable, as the same code can be interpreted on several operating systems.
Q
QUALITATIVE DATA/VARIABLES
Information that refers to the quality of something. Ethnographic research, participant observation, open-ended interviews, etc., may collect qualitative data.
QUANTITATIVE DATA/VARIABLES
Information that can be handled numerically. Example: spending by US consumers on personal care products and services.
QUANTITATIVE ANALYSIS
This field is highly focused on using algorithms to gain an edge in the financial sector. These algorithms either recommend or make trading decisions based on huge amounts of data, often within tiny fractions of a second.
R
R PROGRAMMING LANGUAGE
An open-source scripting programming language used for predictive analytics and data visualization. R includes functions that support both linear and nonlinear modelling, classical statistics, classifications and clustering.
RANSOMWARE
A form of malware that holds a victim's data hostage on their computer typically through robust encryption. This is followed by a demand for payment in the form of Bitcoin to release control of the captured data back to the user.
REGRESSION
Regression is a supervised machine learning problem. It focuses on how a target value changes as other values within a data set change. Regression problems generally deal with continuous variables.
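The simplest case is fitting a straight line to pairs of values. The sketch below uses ordinary least squares, which is one standard technique among many and is chosen here only as an illustration; the function name fit_line and the data are made up.

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single input variable:
    # slope = covariance(x, y) / variance(x); intercept makes the line pass
    # through the point (mean of x, mean of y).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The slope describes how much the target value changes when the input value increases by one unit.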
RELATIONAL DATABASES
A collection of information that organizes data points with defined relationships for easy access. Data structures (including data tables, indexes and views) remain separate from the physical storage.
RESIDUAL (ERROR)
The residual is a measure of how much a real value differs from some statistical value we calculated based on the set of data.
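In the most common usage, the residual is simply the observed value minus the value a model predicted for it. A minimal sketch with made-up numbers (both lists are assumptions for the example):

```python
observed = [3.0, 5.0, 8.0]   # real values (made up for the example)
predicted = [2.5, 5.5, 7.0]  # values some model calculated (also made up)

# Residual = observed value minus predicted value, point by point.
residuals = [o - p for o, p in zip(observed, predicted)]

print(residuals)
```

A positive residual means the model predicted too low for that point; a negative one means it predicted too high.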
RESTORE
The process of returning a system to a state of normalcy. A restore or restoration process may involve formatting the main storage device before reinstalling the OS and applications as well as copying data from backups.
RISK ASSESSMENT
The process of evaluating the state of risk of an organization. Risk assessment is often initiated through taking an inventory of all assets, assigning each asset a value, and then considering any potential threats against each asset.
RISK MANAGEMENT
The process of performing a risk assessment and evaluating the responses to risk to mitigate or otherwise handle the identified risks. Countermeasures, safeguards or security controls are to be selected that may eliminate or reduce risk.
ROOTKIT
A kind of malware that allows cybercriminals to remotely control your computer. Rootkits are especially damaging because they are hard to detect, making it likely that this type of malware could live on your computer for a long time.
S
SAAS (SOFTWARE-AS-A-SERVICE)
A type of cloud computing service where the provider offers the customer the ability to use a provided application. Examples of SaaS include online e-mail services or online document editing systems.
SAMPLE
The sample is the collection of data points we have access to. We use the sample to make inferences about a larger population.
SANDBOXING
A means of isolating applications, code or entire operating systems to perform testing or evaluation. The sandbox limits the actions and resources available to the constrained item.
SCADA (SUPERVISORY CONTROL AND DATA ACQUISITION)
A complex mechanism used to gather data and physical-world metrics, as well as to perform measurement or management actions on the monitored systems, in order to automate large, complex real-world processes.
SCALA (SCALABLE LANGUAGE)
A software programming language that blends object-oriented methods with functional programming capabilities. This allows it to support a more concise programming style which reduces the amount of code that developers need to write.
SECURITY CONTROL
Anything used as part of a security response strategy that addresses a threat to reduce risk. (Also known as countermeasure or safeguard.)
SECURITY PERIMETER
The boundary of a network or private environment where specific security policies and rules are enforced. The systems and users within the security boundary are forced into compliance with local security rules.
SIEM (SECURITY INFORMATION AND EVENT MANAGEMENT)
A formal process by which the security of an organization is monitored and evaluated constantly. A SIEM automatically identifies systems that are out of compliance with the security policy and notifies the IRT (Incident Response Team) of violations.
SOCIAL ENGINEERING
An attack focusing on people rather than technology. This type of attack is psychological and aims to gain access either to information or to a logical or physical environment.
SOCIAL SCIENCE
The branch of science devoted to the study of societies and the relationships among individuals within those societies. This field is gaining relevance in the data space since it helps gain insights into people’s behavioural aspects.
SOFT SKILLS
Today's most successful big data professionals are those who can harmonize their academic qualifications, innate intellectual abilities and real-world experience with a diverse range of other softer skills.
SOFTWARE
A set of programs that tell a computer to perform a task. These instructions are compiled into a package that users can install and use. For example, Microsoft Office is application software.
SPAM
A form of unwanted or unsolicited messages or communications typically received via e-mail but also occurring through text messaging, social networks or VoIP. Most SPAM is advertising, but some may include malicious code, hyperlinks or attachments.
SPOOF (SPOOFING)
The act of falsifying the identity of the source of communication or interaction. It is possible to spoof IP addresses, MAC addresses and email addresses.
SPYWARE
A form of malware that monitors user activities and reports them to an external party. Spyware can be legitimate in that it is operated by an advertising and marketing agency to gather customer demographics.
STANDARD DEVIATION
The standard deviation of a set of values helps us understand how spread out those values are. This statistic is more useful than the variance because it’s expressed in the same units as the values themselves.
STATISTIC
A number that describes some characteristic, or status, of a variable, e.g., a count or a percentage. Example: total nonfarm job starts in August 2014.
STATISTICAL COMPUTING
The collection and interpretation of data aimed at uncovering patterns and trends. It may be used in scenarios such as gathering research interpretations, statistical modelling, designing surveys and studies, and advanced business intelligence.
STATISTICAL SIGNIFICANCE
A result is statistically significant when we judge that it probably didn’t happen due to chance. The concept is widely used in surveys and statistical studies, though it is not always an indication of practical value.
STATISTICAL TOOLS
There are several statistics data professionals use to reason and communicate information about their data. These are some of the most basic and vital statistical tools to help you get started.
STATISTICS
Numerical summaries of data that have been analyzed in some way.
STRUCTURED DATA
Structured data is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis.
SUMMARY STATISTICS
Summary statistics are the measures we use to communicate insights about our data in a simple way.
SUPERVISED MACHINE LEARNING
With supervised learning techniques, the data scientist gives the computer a well-defined set of data. All of the columns are labelled and the computer knows exactly what it’s looking for.
SUPPLY CHAIN
The path of linked organizations involved in the process of transforming original or raw materials into a finished product that is delivered to a customer.
T
THREAT ASSESSMENT
The process of evaluating the actions, events and behaviours that can cause harm to an asset or organization. Threat assessment is an element of risk assessment and management.
TIME SERIES
A set of measures of a single variable recorded over a period of time.
TIME SERIES DATA
Any data arranged in chronological order.
TRAINING AND TESTING
This is part of the machine learning workflow. With the predictive model, you first offer a set of training data so it can build understanding. Then you pass the model a test set, where it applies its understanding and tries to predict a target value.
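Before any model is involved, the data has to be divided into the two sets. The sketch below shows one simple way to do that with the standard library; the function name train_test_split, the 25% test fraction and the fixed seed are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    # Shuffle a copy so the original order is untouched and the split is
    # not biased by how the rows happened to be stored.
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test rows become the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(20)))
print(len(train), len(test))
```

Keeping the test rows out of training is what makes the later prediction check meaningful.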
TROJAN HORSE (TROJAN)
A form of malware where a malicious payload is embedded inside of a benign host file. The victim is tricked into believing that the only file being retrieved is the viewable benign host.
TWO-FACTOR AUTHENTICATION
A means of proving identity using two authentication factors, which is usually considered stronger than any single-factor authentication. A form of multi-factor authentication.
U
UNAUTHORIZED ACCESS
Any access or use of a computer system, network or resource which is in violation of the company security policy or when the person or user was not explicitly granted authorization to access or use the resource or system.
UNDERFITTING
Underfitting happens when you don’t offer a model enough information. An example of underfitting would be asking someone to graph the change in temperature over a day and only giving them the high and low.
UNSTRUCTURED DATA
Data that cannot be organized in the manner of structured data. Unstructured data includes emails and social media posts, blogs, and messages, transcripts of audio recordings of people's speech, images and video files, and machine data.
UNSUPERVISED MACHINE LEARNING
In unsupervised learning techniques, the computer builds its own understanding of a set of unlabeled data. Unsupervised ML techniques look for patterns within data and often deal with classifying items based on shared traits.
V
VARIABLE
Any finding that can change or vary. Examples include anything that can be measured, such as the number of logging operations in Alabama.
VARIANCE
The variance of a set of values measures how spread out those values are. Mathematically, it is the average of the squared differences between individual values and the mean for the set of values.
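A short worked example may help. The sketch below computes the population variance (dividing by the number of values; a sample variance would divide by one less) and then takes its square root to get the standard deviation. The function names and the data are made up for the example.

```python
from math import sqrt

def population_variance(values):
    # Average of the squared differences between each value and the mean.
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def population_std_dev(values):
    # The standard deviation is the square root of the variance, which
    # brings the result back into the same units as the values.
    return sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with a mean of 5
print(population_variance(data), population_std_dev(data))
```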
VIRTUAL PRIVATE NETWORK (VPN)
A tool that allows the user to remain anonymous while using the internet by masking the location and encrypting traffic.
VIRUS
A form of malware that often attaches itself to a host file or the MBR (Master Boot Record) as a parasite. When the host file or MBR is accessed, it activates the virus enabling it to infect other objects.
VISHING
A form of phishing attack that takes place over VoIP. In this attack, the attacker uses VoIP systems to be able to call any phone number with no toll-charge expense. The attacker falsifies their caller-ID to trick the victim.
VPN (VIRTUAL PRIVATE NETWORK)
A communication link between systems or networks that is typically encrypted to provide a secured, private, isolated pathway of communications.
VULNERABILITY
Any weakness in an asset or security protection that would allow a threat to cause harm. It may be a flaw in coding, a mistake in configuration, a limitation of scope or capability, an error in architecture, or clever abuse of systems.
W
WEB SCRAPING
Web scraping is the process of pulling data from a website’s source code. It generally involves writing a script that will identify the information a user wants and pull it into a new file for later analysis.
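As a rough sketch of the "identify the information" step, the example below pulls every link out of a piece of HTML using Python's standard html.parser module. The class and function names are hypothetical, and the HTML is an inline made-up string so that nothing is fetched from a real website.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Made-up page source for illustration.
page = '<p><a href="/a">A</a> <a href="https://example.com/b">B</a></p>'
print(extract_links(page))
```

A real script would first download the page and would need to respect the site's terms of use.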
WHITELIST
A security mechanism prohibiting the execution of any program that is not on a pre-approved list of software. The whitelist is often a list of the file name, path, file size and hash value of the approved software.
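The hash-value part of such a list can be sketched as follows. This is only an illustration of the idea, assuming SHA-256 as the hash and comparing file contents alone; the function names are hypothetical and a real product would also check names, paths and sizes as described above.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    # Hash the program's bytes; any change to the bytes changes the hash.
    return hashlib.sha256(data).hexdigest()

def is_allowed(data: bytes, approved_hashes: set) -> bool:
    # A program may run only if its hash is on the pre-approved list.
    return sha256_hex(data) in approved_hashes

# Made-up stand-ins for the contents of program files.
approved = {sha256_hex(b"trusted program")}
print(is_allowed(b"trusted program", approved))
print(is_allowed(b"tampered program", approved))
```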
WI-FI
A means to support network communication using radio waves rather than cables. Current Wi-Fi technologies are based on the IEEE 802.11 standard and its numerous amendments, which address speed, frequency, authentication and encryption.
WORM
A form of malware that focuses on replication and distribution. A worm is a self-contained malicious program that attempts to duplicate itself and spread to other systems.
Z
ZOMBIE
A term related to the malicious concept of a botnet. The term zombie can be used to refer to the system that is host to the malware agent of the botnet or to the malware agent itself.
Thank you for taking the time to read this article. Hopefully, this has provided you with insight to assist you with your business.