A Data Audit is required to ensure data can be trusted for decision making.
We aim for the data in our organization to be:
Representative: we measure what we say we measure.
Accurate and Complete: we consistently measure and represent all data.
Timely: data is available and processed promptly, enabling data-driven decisions as early as possible.
Aligned with Business: what we measure and how we calculate it is aligned with the business and consistent throughout the organization.
In a Data Audit we'd look at the following:
Data Quality:
The user identifier captured and used to generate KPIs is consistent and makes it possible to follow a user's behavior continuously. Make sure that the userid does not change over the course of the user journey in the app (for example, a paying user must not get a new userid once they pay).
What Good Looks Like (WGLL) KPI: userid changes affect less than 5% of all users, new users or revenue.
Data is captured consistently, meaning the same actions of users always lead to the same captured data.
All historical data is available.
The same timestamp source is used to determine the time of all events.
Using local timestamps, or any other alternative timestamp, does not distort any measured KPI by more than 5%.
Monitoring and alerts are set up for both positive and negative anomalies in data values.
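The timestamp check above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes each event carries both a server-side UTC timestamp and a client-local timestamp, and it uses the daily event count as the KPI. The function names, the UTC+2 offset and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def daily_counts(timestamps):
    """Count events per calendar date, using each timestamp's own time zone."""
    return Counter(ts.date() for ts in timestamps)


def max_relative_distortion(reference, candidate):
    """Largest per-day relative gap between a candidate KPI series and the reference.

    Days that appear only in the candidate are ignored in this sketch.
    """
    worst = 0.0
    for day, ref_value in reference.items():
        gap = abs(candidate.get(day, 0) - ref_value) / ref_value
        worst = max(worst, gap)
    return worst


# Hypothetical sample: 100 midday events on each of two days, plus 3 events
# late on day one that fall on day two in the user's local time (UTC+2).
utc = timezone.utc
local_tz = timezone(timedelta(hours=2))
server_times = (
    [datetime(2024, 1, 1, 12, 0, tzinfo=utc)] * 100
    + [datetime(2024, 1, 1, 23, 30, tzinfo=utc)] * 3
    + [datetime(2024, 1, 2, 12, 0, tzinfo=utc)] * 100
)
local_times = [ts.astimezone(local_tz) for ts in server_times]

distortion = max_relative_distortion(
    daily_counts(server_times), daily_counts(local_times)
)
within_wgll = distortion <= 0.05  # 5% threshold from the checklist above
print(f"max daily distortion: {distortion:.2%}, within WGLL: {within_wgll}")
```

In a real audit the same comparison would be run on the KPIs that matter to the business (for example DAU or daily revenue) rather than on raw event counts.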
Infrastructure and Monitoring:
Technical designs are available that address:
Scaling of storage and processing of data
Access control
Retention policies
Report generation - Can the people who need the information get it?
Data loss between the user action and the data being sent to the analytics system (internal or 3rd party) is minimal.
WGLL KPI: data loss under 5% of all users, new users or revenue.
Data loss between the data being sent to the analytics system (internal or 3rd party) and the data being saved in it is minimal.
WGLL KPI: data loss under 5% of all users, new users or revenue.
Data duplication is minimal both at the source (app) and in the DWH ETLs.
WGLL KPI: duplicates under 1%.
System metrics and their relationship to business data, for example: Can you correlate crashes to retention?
SLO/SLI definition and metrics.
Any tech debt?
Incident risk and past incidents, including how bugs are logged, fixed and prevented in future.
Monitoring and alerts are set up for data quality, availability, data loss and duplication.
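The data loss and duplication KPIs above could be measured along these lines. This is a rough sketch under stated assumptions: every record has a unique key that survives the whole pipeline, and the keys sent by the app can be compared with the keys stored in the analytics system. The checklist expresses loss as a share of users, new users or revenue; the sample below uses event ids for simplicity, but the same ratio works on user ids. All names and sample data are hypothetical.

```python
def loss_rate(sent_keys, stored_keys):
    """Share of distinct sent keys that never arrived in storage."""
    sent = set(sent_keys)
    if not sent:
        return 0.0
    return len(sent - set(stored_keys)) / len(sent)


def duplicate_rate(stored_keys):
    """Share of stored rows that repeat a key already stored."""
    rows = list(stored_keys)
    if not rows:
        return 0.0
    return (len(rows) - len(set(rows))) / len(rows)


# Hypothetical sample: 200 events sent, 6 lost in transit, 1 stored twice.
sent_ids = [f"evt-{i}" for i in range(200)]
stored_ids = sent_ids[:194] + ["evt-0"]

loss = loss_rate(sent_ids, stored_ids)
dupes = duplicate_rate(stored_ids)
print(f"loss {loss:.2%} (WGLL: under 5%), duplicates {dupes:.2%} (WGLL: under 1%)")
```

Running the same two functions at each hop (app to analytics system, analytics system to DWH) shows where in the pipeline records are lost or duplicated.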
Data Definitions:
Definition of the audiences and their needs, for example:
Marketing to drive UA spend
Live Ops to drive user monetization
Alignment of KPI definitions with the way they are calculated in the analytics system.
All departments (BI, marketing, product, data infra etc.) have KPI definitions, and those definitions are the same across departments.
Alignment of data model with business needs, allowing cohort and date views.
Data is tracked and aggregated the same way across all platforms and all products.
Cohort metrics are normalized correctly as applicable:
Day N data includes only cohorts that have had at least N complete days.
For the cohorts that qualify under the previous point, Day N data includes only data from up to and including Day N, not from Day N+1 or later.
The user identifier used to calculate LTV is aligned with the user identifier used to attribute UA spend.
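The two cohort normalization rules above can be illustrated with a small sketch. It assumes Day 0 is the install date, that `as_of` is the first date whose data is still incomplete, and that the metric is cumulative revenue per user; these conventions, the function name and the sample data are assumptions made for illustration only.

```python
from datetime import date, timedelta


def day_n_revenue_per_user(installs, purchases, n, as_of):
    """Cumulative Day N revenue per user over cohorts that have completed Day N.

    installs:  {user_id: install_date}
    purchases: iterable of (user_id, purchase_date, amount)
    as_of:     first date whose data is still incomplete
    Returns None when no user has completed Day N yet.
    """
    # Rule 1: keep only users whose Day N is a fully completed day.
    eligible = {
        user_id
        for user_id, installed in installs.items()
        if installed + timedelta(days=n) < as_of
    }
    if not eligible:
        return None

    # Rule 2: count only purchases made on Day 0 through Day N.
    total = 0.0
    for user_id, purchase_date, amount in purchases:
        if user_id not in eligible:
            continue
        day_index = (purchase_date - installs[user_id]).days
        if 0 <= day_index <= n:
            total += amount
    return total / len(eligible)


# Hypothetical sample data.
installs = {"a": date(2024, 1, 1), "b": date(2024, 1, 1), "c": date(2024, 1, 6)}
purchases = [
    ("a", date(2024, 1, 2), 4.0),   # Day 1: counted
    ("a", date(2024, 1, 6), 10.0),  # Day 5: excluded, after Day 3
    ("b", date(2024, 1, 4), 2.0),   # Day 3: counted
    ("c", date(2024, 1, 6), 50.0),  # cohort too young for Day 3: excluded
]
day3 = day_n_revenue_per_user(installs, purchases, n=3, as_of=date(2024, 1, 8))
print(day3)
```

Without the first rule, user "c" would inflate the denominator with a partially observed cohort; without the second, older cohorts would look better than younger ones simply because they have had more time to spend.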
Workflow and Operations:
Who is responsible for making sure the reports are available to the right audience?
How are audience needs defined and managed?
How are reports developed? How are the reports actionable? What change do they drive? Who can generate reports or dashboards?
How is data gathering implemented?
How is data quality verified?
How are releases of new data gathering and reporting managed?