PyData NYC 2022

CATs: Content-Addressable Transformers
11-10, 14:15–15:00 (America/New_York), Music Box (5th floor)

CATs (Content-Addressable Transformers) is an open-source, unified data product collaboration framework in Python. It deploys distributed data processing workloads on a peer-to-peer mesh network and uses IPFS Content Identifiers (CIDs) to content-address the means of processing (input, process, output, and infrastructure as code [IaC]), to transport data between services, to maintain the provenance of data processes as chains of processes, and to support data verification. A self-service platform of CATs reduces the operational overhead of adding new data sources to a data product. It does so by enabling an Agile, customer-centric implementation methodology that decentralizes ownership and distributes responsibility to those within bounded domains, supporting continuous change and scalability. Cross-functional, multi-disciplinary teams of Data Scientists, Data Analysts, Data Engineers, etc. can then collaborate on products across domains and between organizations.


CATs (Content-Addressable Transformers) is an open-source, unified data product collaboration framework in Python. It deploys distributed data processing workloads on a peer-to-peer mesh network and uses IPFS Content Identifiers (CIDs) to content-address the means of processing (input, process, output, and infrastructure as code [IaC]) and to transport data between services. Content addressing also lets CATs maintain the provenance of data processes as chains of evidence: the accuracy of the processing behind a product can be certified by retrieving each process and re-executing it.
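
The idea of a provenance chain that is certified by re-execution can be sketched in a few lines of Python. This is a minimal illustration, not the CATs implementation: a SHA-256 digest of canonical JSON stands in for a real IPFS CID, and the names `content_id`, `put`, `run`, `verify`, `STORE`, and `PROCESSES` are hypothetical. A real deployment would also content-address the process code and infrastructure, which this sketch replaces with a simple name lookup.

```python
import hashlib
import json


def content_id(obj) -> str:
    """Stand-in for an IPFS CID: SHA-256 hex digest of canonical JSON."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical in-memory content store: content ID -> content.
STORE = {}

# Hypothetical process registry; names stand in for content-addressed code.
PROCESSES = {
    "double": lambda rows: [2 * x for x in rows],
    "total": lambda rows: [sum(rows)],
}


def put(obj) -> str:
    """Store content under its own ID and return that ID."""
    cid = content_id(obj)
    STORE[cid] = obj
    return cid


def run(process_name, input_id, parent_id=None) -> str:
    """Execute a process and store a provenance record linking to its parent."""
    output = PROCESSES[process_name](STORE[input_id])
    record = {
        "input": input_id,
        "process": process_name,
        "output": put(output),
        "parent": parent_id,
    }
    return put(record)


def verify(record_id) -> bool:
    """Walk the chain backwards, re-executing each step and comparing IDs."""
    while record_id is not None:
        record = STORE[record_id]
        recomputed = PROCESSES[record["process"]](STORE[record["input"]])
        if content_id(recomputed) != record["output"]:
            return False
        record_id = record["parent"]
    return True


# Usage: a two-step chain, then a forged record claiming a wrong output.
raw = put([1, 2, 3])
step1 = run("double", raw)
step2 = run("total", STORE[step1]["output"], parent_id=step1)
forged = put({"input": raw, "process": "double",
              "output": put([0]), "parent": None})
```

Because every record names its input, process, and output by content ID, anyone holding the final record can replay the chain and detect a record whose claimed output does not match what the process actually produces.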

CATs will enable the creation of data products as decentralized services on mesh network peers, with data process verification, using existing centralized cloud service technologies (SaaS, PaaS, IaaS) on AWS, GCP, Azure, etc. A self-service platform of CATs reduces the operational overhead of adding new data sources to a Data Product. It does so by enabling an Agile, customer-centric implementation methodology that decentralizes ownership and distributes responsibility to those within bounded domains, supporting continuous change and scalability. This way, Data Scientists, Data Analysts, Data Engineers, etc. can collaborate on products across domains, in cross-functional, multi-disciplinary teams and between organizations.

CATs uses Kubernetes as its runtime environment to scale and distribute concurrent or parallel processing, and it is deployed on multi-cloud environments using Terraform. CATs currently leverages Apache Spark as its default distributed data processing framework and will be extended to include other processing options, such as Dask and a FaaS solution. Verifiable data processing is provided via an API that accepts a data provenance record as input and returns one as output.
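
One way to picture an API that takes a provenance record in and hands one back is sketched below. Everything here is an assumption for illustration: the `ProvenanceRecord` fields, the `execute` function, and the `BACKENDS` and `PROCESSES` registries are hypothetical and are not the CATs API. A plain in-process function stands in for Spark, Dask, or a FaaS backend, and a SHA-256 digest stands in for an IPFS CID.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record: what to run, on what, where, and what came out."""
    input_id: str
    process: str
    backend: str
    output_id: Optional[str] = None  # empty on the request, set on the response


def digest(data) -> str:
    """Stand-in for a CID: SHA-256 hex digest of canonical JSON."""
    return hashlib.sha256(
        json.dumps(data, sort_keys=True).encode("utf-8")
    ).hexdigest()


def local_backend(func, rows):
    """Placeholder for a distributed engine; runs the function in-process."""
    return func(rows)


# Hypothetical registries. Other engines would be added as further entries.
BACKENDS = {"local": local_backend}
PROCESSES = {"square": lambda rows: [x * x for x in rows]}


def execute(record: ProvenanceRecord, store: dict) -> ProvenanceRecord:
    """Run the requested process and return a completed provenance record."""
    runner = BACKENDS[record.backend]
    result = runner(PROCESSES[record.process], store[record.input_id])
    output_id = digest(result)
    store[output_id] = result
    return replace(record, output_id=output_id)


# Usage: submit a request record, receive a response record.
store = {}
data = [1, 2, 3]
store[digest(data)] = data
request = ProvenanceRecord(input_id=digest(data), process="square",
                           backend="local")
response = execute(request, store)
```

Keeping the backend as a field of the record means the same request shape works regardless of which engine runs the job, and the response is itself a record that a later step can take as its input.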

CATs will serve the Cyber-Physical Systems community that BlockScience is a part of, providing Decentralized Cloud & Science for the Economic System Design of Web3 technology. It is meant to accommodate the Scientific Computing and Big Data subgroups, whose users act as Distributed & Economic System Engineers, Data Scientists & Engineers, Software Engineers, etc.


Prior Knowledge Expected

No previous knowledge expected

Joshua has worked as a data and software engineer, implementing scalable, parallelized, concurrent, and distributed stochastic simulation software for digital twin implementations and for the sociotechnical system design of the decentralized web. He has also implemented machine-learning-enabled big data processing solutions for viewership forecasting in AdTech, built a cross-disciplinary data product for supply chain management, and conducted machine learning research on predicting student performance in online courses.