Data Science And Big Data Analytics Pdf

  • Wednesday, January 27, 2021 1:41:06 AM

File Name: data science and big data analytics .zip
Size: 2049Kb
Published: 27.01.2021

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections or of the United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Rosewood Drive, Danvers, MA , , fax No warranty may be created or extended by sales or promotional materials.

Data Science and Big Data Analytics


The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought.

Neither the publisher nor the author shall be liable for damages arising herefrom. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read. For general information on our other products and services please contact our Customer Care Department within the United States at , outside the United States at or fax Wiley publishes in a variety of print and electronic formats and by print-on-demand.

Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. For more information about Wiley products, visit www. David has been an advisor to several universities looking to develop academic programs related to data analytics, and has been a frequent speaker at conferences and industry events. He has also been a guest lecturer at universities in the Boston area.

Additionally, David collaborated with the U.S. Federal Reserve in developing predictive models for monitoring mortgage portfolios.

Barry is a course developer and curriculum advisor in the emerging technology areas of Big Data and data science. Prior to his current role, Barry was a. Prior to joining EMC, Barry held managerial and analytical roles in reliability engineering functions at medical diagnostic and technology companies.

Underscoring the importance of strong executive stakeholder engagement, many of his successes have resulted from focusing not only on the technical details of an analysis but also on the decisions that will result from it. Barry earned a B. Beibei has seven years of experience in the IT industry. Prior to EMC she worked as a software engineer, systems manager, and network manager for a Fortune company where she introduced.

It was a challenging journey at the time as not many understood what it would take to be a true data scientist. Many sincere thanks to many key contributors and subject matter experts David Dietrich, Barry Heller, and Beibei Yang for their work developing content and graphics for the chapters. A special thanks to subject matter experts John Cardente and Ganesh Rajaratnam for their active involvement reviewing multiple book chapters and providing valuable feedback throughout the project.

We are also grateful to the experts from EMC and Pivotal who supported us in reviewing and improving the content in this book. We also thank Ira Schild and Shane Goodrich for coordinating this project, Mallesh Gurram for the cover design, Chris Conroy and Rob Bradley for graphics, and the publisher, John Wiley and Sons, for timely support in bringing this book to the industry.

There is enormous value potential in Big Data: innovative insights, improved understanding of problems, and countless opportunities to predict—and even to shape—the future.

Data Science is the principal means to discover and tap that potential. Not everyone has studied statistical analysis at a deep level. People with advanced degrees in applied mathematics are not a commodity. Relatively few organizations have committed resources to large collections of data gathered primarily for the purpose of exploratory analysis.

How does an organization operationalize quickly to take advantage of this trend? EMC Education Services has been listening to the industry and organizations, observing the multi-faceted transformation of the technology landscape, and doing direct research in order to create curriculum and content to help individuals and organizations transform themselves.

For the domain of Data Science and Big Data Analytics, our educational strategy balances three things: people (especially in the context of data science teams), processes (such as the analytic lifecycle approach presented in this book), and tools and technologies (in this case with the emphasis on proven analytic tools).

In many cases, Big Data analytics integrate structured and unstructured data with real-time feeds and queries, opening new paths to innovation and insight. Knowledge of these methods will help people become active contributors to Big Data analytics projects.

The content is structured in twelve chapters. The second chapter presents an analytic project lifecycle designed for the particular characteristics and challenges of hypothesis-driven analysis with Big Data.

Chapter 3 examines fundamental statistical techniques in the context of the open source R analytic software environment. This chapter also highlights the importance of exploratory data analysis via visualizations and reviews the key notions of hypothesis development and testing. Chapter 12 provides guidance on operationalizing Big Data analytics projects.

Much has been written about Big Data and the need for advanced analytics within industry, academia, and government. Availability of new data sources and the rise of more complex analytical opportunities have created a need to rethink existing data architectures to enable analytics that take advantage of Big Data.

Data is created constantly, and at an ever-increasing rate. Mobile phones, social media, and imaging technologies used to determine a medical diagnosis all create new data that must be stored somewhere for some purpose. Devices and sensors automatically generate diagnostic information that needs to be stored and processed in real time.

These challenges of the data deluge present the opportunity to transform business, government, science, and everyday life. The valuations of these companies are heavily derived from the data they gather and host, which contains more and more intrinsic value as the data grows. Figure highlights several sources of the Big Data deluge.

Social media and genetic sequencing are among the fastest-growing sources of Big Data and examples of untraditional sources of data being used for analysis. For example, in Facebook users posted status updates per second worldwide, which can be leveraged to deduce latent interests or political views of users and show relevant ads. Another example comes from genomics. Genetic sequencing and human genome mapping provide a detailed understanding of genetic makeup and lineage.

The health care industry is looking toward these advances to help predict which illnesses a person is likely to get in his lifetime and take steps to avoid these maladies or reduce their impact through the use of personalized medicine and treatment.

While data has grown, the cost to perform this work has fallen dramatically. Although genotyping analyzes only a fraction of a genome and does not provide as much granularity as genetic sequencing, it does point to the fact that data and complex analysis are becoming more prevalent and less expensive to deploy.

The RDBMS may store characteristics of the support calls as typical structured data, with attributes such as time stamps, machine type, problem type, and operating system. Many insights could be extracted from the unstructured, quasi-structured, or semi-structured data in the call center data. See Figure .

Quasi-structured data is a common phenomenon that bears closer scrutiny. Consider the following example. These three URLs are:. Together, this comprises a clickstream that can be parsed and mined by data scientists to discover usage patterns and uncover relationships among clicks and areas of interest on a website or group of sites.
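As a rough illustration of how a clickstream might be parsed, the sketch below splits a short sequence of visited URLs into fields and counts page-to-page transitions. The URLs are hypothetical placeholders (the three URLs from the original example are not reproduced here), and the `parse_click` helper is an assumption made for illustration, not a method from the book.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream: placeholder URLs, not the ones from the original example.
clicks = [
    "https://www.example.com/",
    "https://www.example.com/search?q=data+science",
    "https://www.example.com/courses/analytics?ref=search",
]

def parse_click(url):
    """Split one URL into host, path, and query parameters."""
    parts = urlparse(url)
    return {
        "host": parts.netloc,
        "path": parts.path or "/",
        # parse_qs returns a list per key; keep the first value for simplicity.
        "params": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }

records = [parse_click(url) for url in clicks]

# Count consecutive page-to-page moves, a simple view of usage patterns.
paths = [record["path"] for record in records]
transitions = Counter(zip(paths, paths[1:]))

for (src, dst), count in transitions.items():
    print(f"{src} -> {dst}: {count}")
```

With real web logs, the same idea would usually be applied per user session, so that transitions are not counted across unrelated visitors.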

The four data types described in this chapter are sometimes generalized into two groups: structured and unstructured data. Big Data describes new kinds of data with which most organizations may not be used to working.

With this in mind, the next section discusses common technology architectures from the standpoint of someone wanting to analyze Big Data. The introduction of spreadsheets enabled business users to create simple logic on data structured in rows and columns and create their own analyses of business problems.

Database administrator training is not required to create spreadsheets: They can be set up to do many things quickly and independently of information technology (IT) groups. Spreadsheets are easy to share, and end users have control over the logic involved.

This is an ongoing challenge because spreadsheet programs such as Microsoft Excel still run on many computers worldwide. With the proliferation of data islands (or spreadmarts), the need to centralize the data is more pressing than ever. More advanced features enabled performance of in-depth analytical techniques such as regressions and neural networks.

Enterprise Data Warehouses (EDWs) are critical for reporting and BI tasks and solve many of the problems that proliferating spreadsheets introduce, such as which of multiple versions of a spreadsheet is correct.

EDWs—and a good BI strategy—provide direct data feeds from sources that are centrally managed, backed up, and secured. This imposes longer lead times for analysts to get data; most of the time is spent waiting for approvals rather than starting meaningful work.

Additionally, EDW rules often restrict analysts from building datasets. Consequently, it is common for additional systems to emerge containing critical data for constructing analytic datasets, managed locally by power users. IT groups generally dislike the existence of data sources outside of their control because, unlike an EDW, these datasets are not managed, secured, or backed up. In this model, the IT group may still manage the analytic sandboxes, but they will be purposefully designed to enable robust analytics, while being centrally managed and secured.

Many times, analytic sandboxes enable high-performance computing using in-database processing: the analytics occur within the database itself. The idea is that performance of the analysis will be better if the analytics are run in the database itself, rather than bringing the data to an analytical tool that resides somewhere else.
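The contrast between in-database processing and moving data to an external tool can be sketched with a small, self-contained example. It uses Python's built-in `sqlite3` module as a stand-in for an analytic sandbox database. The `support_calls` table, its columns, and its values are invented for illustration and do not come from the book.

```python
import sqlite3

# An in-memory SQLite database stands in for the analytic sandbox (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_calls (machine_type TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO support_calls VALUES (?, ?)",
    [("laptop", 12.0), ("laptop", 18.0), ("server", 45.0), ("server", 35.0)],
)

# In-database processing: the engine aggregates and returns only summary rows.
in_db = dict(
    conn.execute(
        "SELECT machine_type, AVG(minutes) FROM support_calls GROUP BY machine_type"
    ).fetchall()
)

# Client-side processing: every row is pulled out, then aggregated in the tool.
rows = conn.execute("SELECT machine_type, minutes FROM support_calls").fetchall()
totals = {}
for machine_type, minutes in rows:
    total, count = totals.get(machine_type, (0.0, 0))
    totals[machine_type] = (total + minutes, count + 1)
client_side = {key: total / count for key, (total, count) in totals.items()}

print(in_db)
print(client_side)
```

Both approaches give the same averages. The difference is how much data crosses the boundary between database and tool: two summary rows in the first case, every stored row in the second. That gap grows with the size of the table.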

In addition, rather than the typical structured data in the EDW, analytic sandboxes can house a greater variety of data, such as raw data, textual data, and other kinds of unstructured data, without interfering with critical production databases.

Table summarizes the characteristics of the data repositories mentioned in this section. Due to the characteristics of Big Data, these projects lend themselves to decision support for high-value, strategic decision making with high processing complexity. Performing rapid and complex analysis requires high-throughput network connections and a consideration for the acceptable amount of latency.

For instance, developing a real-time product recommender for a website imposes greater system demands than developing a near-real-time recommender, which may still provide acceptable performance, have slightly greater latency, and may be cheaper to deploy. Current business problems provide many opportunities for organizations to become more analytical and data driven, as shown in Table Table outlines four categories of common business problems that organizations contend with where they have an opportunity to leverage advanced analytics to create competitive advantage.

Rather than only performing standard reporting on these areas, organizations can apply advanced analytical techniques to optimize processes and derive more value from these common tasks. Organizations have been trying to reduce customer churn, increase sales, and cross-sell customers for many years. What is new is the opportunity to fuse advanced analytical techniques with Big Data to produce more impactful analyses for these traditional problems.

The last example portrays emerging regulatory requirements. Many compliance and regulatory laws have been in existence for decades, but additional requirements are added every year, which represent additional complexity and data requirements for organizations.

Laws related to anti-money laundering (AML) and fraud prevention require advanced analytical techniques to comply with and manage properly. The four business drivers shown in Table require a variety of analytical techniques to address them properly. Although much is written generally about analytics, it is important to distinguish between BI and Data Science.

As shown in Figure , there are several ways to compare these groups of analytical techniques. One way to evaluate the type of analysis being performed is to examine the time horizon and the kind of analytical approaches being used.

Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data

Scholars have been increasingly calling for innovative research in the organizational sciences in general, and the information systems (IS) field in particular, research that breaks from the dominance of gap-spotting and specific methodical confinements. Hence, pushing the boundaries of information systems is needed, and one way to do so is by relying more on data and less on a priori theory. Data, being considered one of the most important resources in research, and society at large, requires the application of scientific methods to extract valuable knowledge towards theoretical development. However, the nature of knowledge varies from one scientific discipline to another, and the views on data science (DS) studies are substantially diverse. These views vary from being seen as a new scientific fourth paradigm, to an extension of existing paradigms with new tools and methods, to a phenomenon or object of study. In this paper, we review these perspectives and expand on the view of data science as a methodology for scientific inquiry.

100+ Free Data Science Books

This book presents conjectural advances in big data analysis, machine learning and computational intelligence, as well as their potential applications in scientific computing. It discusses major issues pertaining to big data analysis using computational intelligence techniques, and the conjectural elements are supported by simulation and modelling applications to help address real-world problems. An extensive bibliography is provided at the end of each chapter. Further, the main content is supplemented by a wealth of figures, graphs, and tables, offering a valuable guide for researchers in the field of big data analytics and computational intelligence.

Data science: developing theoretical contributions in information systems via text analytics


The book covers the breadth of activities and methods and tools that Data Scientists use.



Comments

  1. Monica C. 31.01.2021 at 08:43

    Note that while every book here is provided for free, consider purchasing the hard copy if you find any particularly helpful.