# Performance Workstream
Wednesday, March 11, 2020
# Performance Goals:
- Current HW system achieving stable 1k TPS, peak 5k TPS, and proven horizontal scalability
- More instances = more performance in an almost linear fashion
- Validate the minimum infrastructure needed to do 1K TPS (financial TPS)
- Determine the entry-level configuration and cost (AWS and on-premises)
# POCs:
Test the impact of a direct replacement of the MySQL DB with a shared-memory network service like Redis (using the Redlock algorithm if locks are required); a minimal lock sketch follows below
Test a different method of sharing state, using a light version of event-driven architecture with some CQRS
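As a starting point for the Redis POC, the sketch below shows the single-instance form of the locking idea: acquire a per-DFSP position lock with SET NX PX and release it only if we still hold it. This is a minimal sketch, assuming ioredis and a hypothetical `lock:position:{dfspId}` key; a full Redlock implementation would acquire the lock on a quorum of independent Redis nodes.

```typescript
// Minimal single-node sketch of the lock idea behind Redlock (hypothetical key names).
// A real Redlock implementation acquires the lock on a majority of independent Redis nodes.
import Redis from 'ioredis';
import { randomUUID } from 'crypto';

const redis = new Redis(); // assumes a local Redis instance

// Acquire a per-DFSP position lock; returns a token if we got it, null otherwise.
async function acquirePositionLock(dfspId: string, ttlMs = 2000): Promise<string | null> {
  const token = randomUUID();
  // SET key value PX ttl NX: only sets the key if it does not already exist
  const ok = await redis.set(`lock:position:${dfspId}`, token, 'PX', ttlMs, 'NX');
  return ok === 'OK' ? token : null;
}

// Release the lock only if we still hold it (compare-and-delete via Lua).
async function releasePositionLock(dfspId: string, token: string): Promise<void> {
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    end
    return 0`;
  await redis.eval(script, 1, `lock:position:${dfspId}`, token);
}
```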
# Resources:
- Slack channel: #perf-engineering
- Mid-PI performance presentation
- Setting up the monitoring components
# Action/Follow-up Items:
- What Kafka metrics (client & server side) should we be reviewing? (Confluent to assist)
- Explore locking and position settlement (Sybrin to assist)
  - Review Redlock: pessimistic locking vs automatic locking
  - Remove the shared DB in the middle (automatic locking on Redis)
- Combine the prepare/position handler with a distributed DB
- Review the Node.js client and how it impacts Kafka, plus the configuration of Node and ultimately the Kafka client (Nakul); an illustrative consumer config sketch follows this list
- Turn tracing back on to see how latency and the applications are behaving
- Ensure the call counts have been rationalized (at a deeper level)
- Validate the processing times on the handlers and that we are hitting the cache
- Async patterns in Node
  - We are missing someone who is excellent on MySQL and Percona
  - Are we leveraging this correctly?
- What cache layer are we using (in memory)?
- Review the event modeling implementation and identify the domain events
- Node.js/Kubernetes: focus on application issues, not so much architecture issues
- Review how we are doing async technology (Node.js, a larger issue); threaded models need to be optimized (Nakul)
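For the Kafka client review above, a concrete starting point is to make the consumer settings explicit so the client-side metrics (fetch sizes, lag, batch latency) can be compared against a known configuration. The sketch below is illustrative only, using kafkajs with hypothetical broker, topic, and group names; the actual Mojaloop handlers and their client library may differ.

```typescript
// Illustrative kafkajs consumer setup (hypothetical broker/topic/group names);
// the real handlers may use a different client and configuration.
import { Kafka, logLevel } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'perf-position-handler',
  brokers: ['localhost:9092'],
  logLevel: logLevel.INFO,
});

const consumer = kafka.consumer({
  groupId: 'position-handlers',
  maxBytesPerPartition: 1048576, // tune fetch size per partition
  sessionTimeout: 30000,
});

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'topic-transfer-position', fromBeginning: false });

  await consumer.run({
    // eachBatch exposes the batching behaviour, which is what we want to measure
    eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
      for (const message of batch.messages) {
        // ...process the position message here...
        resolveOffset(message.offset);
      }
      await heartbeat();
    },
  });
}

run().catch(console.error);
```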
# Meeting Notes/Details
# History
Technology has been put in place; the hope was that the design solves an enterprise problem
The community effort did not prioritize making the slices of the system enterprise-grade or cheap to run
OSS technology choices
# Goals
- Optimize current system
- Make it cheaper to run
- Make it scalable to 5K TPS
- Ensure value added services can effectively and securely access transaction data
# Testing Constraints
- Only the golden transfer (transfer leg) has been done
- Flow of transfer
- Simulators (legacy and advanced); using the legacy one for continuity
- Disabled the timeout handler
- 8 DFSPs (participant organizations); with more DFSPs we would be able to scale
# Process
JMeter initiates the payer request
Legacy simulator receives the fulfil notification callback
Legacy simulator handles payee processing and initiates the fulfilment callback
A record is written in the positions table for each DFSP
- a. Partial algorithm where locking is done to reserve the funds, do the calculations, and do the final commits (sketched below)
- b. Position handler is processing one record at a time
A future algorithm would do bulk processing
- One transfer is handled by one position handler
- Transfers are all pre-funded
- Reduced settlement costs
- Can control how fast DFSPs respond to the fulfil request (complete the committed transfers first before handling new requests)
The system needs to time out transfers that take longer than 30 seconds
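A minimal sketch of the per-transfer position reservation described in (a) and (b) above, assuming knex against MySQL with hypothetical table and column names; the actual central-ledger schema and handler logic differ in detail. The point is that each transfer locks one participant position row, checks liquidity, and commits, which is where the DB contention shows up.

```typescript
// Sketch of reserving a payer position for one transfer inside a DB transaction.
// Table and column names are hypothetical; the real schema differs.
import knex from 'knex';

const db = knex({
  client: 'mysql2',
  connection: { host: 'localhost', user: 'central', database: 'central_ledger' },
});

async function reservePosition(dfspId: string, amount: number): Promise<boolean> {
  return db.transaction(async (trx) => {
    // Pessimistic row lock on this DFSP's position (SELECT ... FOR UPDATE)
    const position = await trx('participantPosition')
      .where({ participantId: dfspId })
      .forUpdate()
      .first();

    if (!position || position.value + amount > position.netDebitCap) {
      return false; // would exceed the limit, reject the transfer
    }

    await trx('participantPosition')
      .where({ participantId: dfspId })
      .update({ value: position.value + amount });

    return true; // the transaction commits when the callback resolves
  });
}
```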
- Any redesign of the DBs
- Test Cases
Financial transaction
- End-to-end
- Prepare-only
- Fulfil only
Individual Mojaloop Characterization
- Services & Handlers
- Streaming Arch & Libraries
- Database
- What changed: 150 to 300 TPS?
How we process the messages
Position handler (run in mixed mode, random)
- Latency Measurement
- 5 sec for DB to process, X sec for Kafka to process
- How to measure this?
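One way to approach the "how to measure this?" question above without full tracing is to wrap the handler and the DB call in histogram timers, so DB time and overall handler time can be separated. A minimal sketch, assuming prom-client and a hypothetical handler/DB function; the existing Mojaloop metrics and tracing libraries may already provide equivalents.

```typescript
// Minimal prom-client sketch for separating DB time from overall handler time.
// Metric and function names are hypothetical.
import client from 'prom-client';

const handlerDuration = new client.Histogram({
  name: 'position_handler_duration_seconds',
  help: 'End-to-end position handler processing time',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

const dbDuration = new client.Histogram({
  name: 'position_db_duration_seconds',
  help: 'Time spent in the positions DB call',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

async function handlePositionMessage(message: unknown): Promise<void> {
  const endHandler = handlerDuration.startTimer();
  try {
    const endDb = dbDuration.startTimer();
    await updatePositionInDb(message); // hypothetical DB call
    endDb();
    // ...publish the notification, etc...
  } finally {
    endHandler();
  }
}

// Hypothetical stand-in for the real DB update.
async function updatePositionInDb(_message: unknown): Promise<void> {}
```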
# Targets
- High enough that the system has to function well
- Crank the system up to add scale (adding x DFSPs)
- Suspicious cases for investigations
- Observing contentions around the DB
- Shared DB, 600 ms without any errors
Contention is fully on the DB
Bottleneck is the DB (distribute the systems so they run independently)
16 databases run end to end
GSMA - 500 TPS
What is the optimal design?
# Contentions
System handler contention
- Where the system can be scaled
If there are architecture changes that we need to make, we can explore this
- Consistency for each DFSP
- Threading of info flows - open question
Skewed results from a single DB for all DFSPs
The challenge is where we get to with additional HW
- What are the limits of the application design
Financial transfers (in and out of the system)
- Audit systems
- Settlement activity
- Grouping into the DB solves some issues
- Confluent feedback
Shared DB issues, multiple DBs
Application design level issues
Seen situations where we ran a bunch of simulators/sandboxes
- Need to rely on tracers and scans once this gets into production
- Miguel states we should disable tracing for now
# Known Issues
- CPU load on boxes (Node is waiting around); re-optimize the code
- Processing times increase over time
# Optimization
- Distributed monolith - PRISM - getting rid of redundant reads
- Combine the handlers - Prepare+Position & Fulfil+Position
# What are we trying to fix?
- Can we scale the system?
- What does it cost to do this? (scale unit cost)
- Need to understand how to do this at small and large scale
- Optimize the resources
- 2.5 sprints
- Need to scale horizontally
- Add audit and repeatability
# Attendees:
- Don, Joran (newly hired perf expert) - Coil
- Sam, Miguel, Roman, Valentine, Warren, Bryan, Rajiv - ModusBox
- Pedro - Crosslake
- Rhys, Nakul Mishra - Confluent
- Miller - Gates Foundation
- In-person: Lewis (CL), Rob (MB), Roland (Sybrin), Greg (Sybrin), Megan (V), Simeon (V), Kim (CL)